Measurement and mitigation of bias in ML prediction systems. Covers impossibility theorems (Chouldechova, Kleinberg), facial recognition bias (Buolamwini & Gebru), and accuracy-fairness tradeoffs in criminal justice risk assessment.
Constructs
prediction_accuracy
Prediction Accuracy
Correctness of model predictions, measuring how well a classifier or regression model maps inputs to true outcomes across the full population or subgroups.
false_positive_rate
False Positive Rate
Rate of incorrect positive predictions, measuring how often a classifier incorrectly labels negative instances as positive, potentially causing harm through false accusations or unnecessary interventions.
demographic_parity
Demographic Parity
Fairness criterion requiring equal positive prediction rates across demographic groups, ensuring that the proportion of individuals receiving a positive classification is independent of group membership.
calibration
Calibration
Property that predicted probabilities match observed frequencies of the outcome, meaning that among all individuals assigned a predicted risk of X%, approximately X% actually experience the outcome.
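The four constructs above can be made concrete with a short sketch. This is an illustrative example, not from any of the cited papers: the toy labels, predictions, and scores below are invented, and `calibration_error` checks a single score bucket rather than full calibration.

```python
def accuracy(y_true, y_pred):
    """Prediction accuracy: fraction of predictions matching the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def false_positive_rate(y_true, y_pred):
    """FPR: among true negatives, the fraction incorrectly predicted positive."""
    negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    return sum(negatives) / len(negatives)

def positive_rate(y_pred):
    """Positive prediction rate; demographic parity compares this across groups."""
    return sum(y_pred) / len(y_pred)

def calibration_error(scores, y_true):
    """For one bucket of (near-)equal scores: |predicted risk - observed frequency|."""
    return abs(sum(scores) / len(scores) - sum(y_true) / len(y_true))

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(accuracy(y_true, y_pred))             # 0.75
print(false_positive_rate(y_true, y_pred))  # 0.25
print(positive_rate(y_pred))                # 0.5
# A bucket scored 0.75 in which 3 of 4 individuals have the outcome is calibrated:
print(calibration_error([0.75] * 4, [1, 1, 1, 0]))  # 0.0
```

Group-conditioned versions of these functions (filtering `y_true`/`y_pred` by a group attribute before computing) are what the findings below compare.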
Findings
The incompatibility between fairness criteria is not a limitation of particular algorithms but a mathematical property of the prediction problem itself. No algorithm, no matter how sophisticated, can satisfy calibration and balance simultaneously when base rates differ.
Direction: conditional
Confidence: strong
Effect: Universal impossibility: applies to all possible algorithms
Method: Mathematical proof
The only way to simultaneously satisfy calibration and balance is if either (a) the predictor is perfectly accurate (zero error rate) or (b) the base rates are identical across groups. Both conditions are essentially never met in real-world applications.
Direction: conditional
Confidence: strong
Effect: Perfect prediction or equal base rates are the only escape from impossibility
Method: Mathematical proof (boundary conditions of theorem)
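A small numeric illustration of the theorem (all counts invented for this sketch): two groups whose scores are perfectly calibrated within each bucket, but whose base rates differ (0.50 vs 0.35). Thresholding the same calibrated scores yields unequal false positive and false negative rates, exactly as the theorem requires.

```python
# Each bucket: (score, n_people, n_actually_positive).
# Within every bucket the positive fraction equals the score, so both
# groups are perfectly calibrated; only the base rates differ.
group_a = [(0.8, 10, 8), (0.2, 10, 2)]   # base rate 10/20 = 0.50
group_b = [(0.8, 10, 8), (0.2, 30, 6)]   # base rate 14/40 = 0.35

def error_rates(buckets, threshold=0.5):
    """Return (FPR, FNR) when predicting positive iff score >= threshold."""
    fp = sum(n - pos for s, n, pos in buckets if s >= threshold)
    tn = sum(n - pos for s, n, pos in buckets if s < threshold)
    tp = sum(pos for s, n, pos in buckets if s >= threshold)
    fn = sum(pos for s, n, pos in buckets if s < threshold)
    return fp / (fp + tn), fn / (fn + tp)

print(error_rates(group_a))  # (0.2, 0.2)
print(error_rates(group_b))  # (0.0769..., 0.4285...): balance fails
```

No choice of threshold repairs this while the bucket scores stay calibrated; only equal base rates or error-free buckets (scores of exactly 0 or 1) would.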
Commercial facial recognition systems from Microsoft, IBM, and Face++ exhibit dramatically higher error rates for dark-skinned women (up to 34.7% error) compared to light-skinned men (0.8% error), revealing severe intersectional bias in deployed AI systems.
Direction: negative
Confidence: strong
Effect: Error rate: 34.7% for dark-skinned females vs 0.8% for light-skinned males (worst system)
Method: Benchmark evaluation on Pilot Parliaments Benchmark (PPB) dataset, 1270 faces
All three commercial gender classification systems performed worst on darker-skinned females and best on lighter-skinned males, with error rates consistently ordered: dark female > dark male > light female > light male, demonstrating that intersectional subgroups (combining race and gender) reveal disparities hidden by single-axis analysis.
Direction: negative
Confidence: strong
Effect: Consistent ordering across all 3 systems: dark female worst, light male best
Method: Intersectional benchmark evaluation across 4 subgroups
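The core methodological point, that single-axis aggregation masks intersectional disparities, can be sketched as follows. The per-subgroup counts below are made up for illustration; the actual rates are those reported in the findings above.

```python
from collections import defaultdict

# (subgroup, n_examples, n_errors) -- hypothetical audit counts
results = [
    ("dark_female", 100, 30),
    ("dark_male", 100, 12),
    ("light_female", 100, 7),
    ("light_male", 100, 1),
]

# Intersectional view: one error rate per subgroup.
for name, n, err in results:
    print(name, err / n)   # dark_female 0.30 ... light_male 0.01

# Single-axis view: aggregating by gender alone averages away the
# worst-off subgroup.
by_gender = defaultdict(lambda: [0, 0])
for name, n, err in results:
    gender = name.split("_")[1]
    by_gender[gender][0] += n
    by_gender[gender][1] += err
for g, (n, err) in by_gender.items():
    print(g, err / n)   # female 0.185, male 0.065 -- the 0.30 subgroup is hidden
```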
Existing benchmark datasets (e.g., IJB-A, Adience) are disproportionately composed of lighter-skinned subjects, creating a feedback loop where systems trained and evaluated on non-representative data systematically underperform on underrepresented groups without detection.
Direction: negative
Confidence: moderate
Effect: Significant demographic skew in standard benchmarks; PPB designed to be balanced
Method: Audit of benchmark dataset composition
Gradient boosting machines achieve highest AUC (0.79) across 8 benchmark credit scoring datasets, outperforming logistic regression by approximately 0.02 in AUC
Direction: positive
Confidence: strong
Effect: AUC=0.79 for GBM vs ~0.77 for logistic regression
Method: benchmark comparison across 8 datasets
Ensemble methods consistently outperform single classifiers in credit scoring across all benchmark datasets tested
Direction: positive
Confidence: strong
Method: systematic benchmark comparison
Credit history is the strongest single predictor of default with feature importance approximately 0.35 across models
Direction: positive
Confidence: strong
Effect: feature importance ~0.35
Method: feature importance analysis
Simpler models such as logistic regression achieve 95% of the best AUC while offering substantially better interpretability, suggesting a favorable accuracy-interpretability tradeoff
Direction: conditional
Confidence: moderate
Effect: logistic achieves 95% of best AUC
Method: benchmark comparison
It is mathematically impossible to simultaneously satisfy calibration AND equal false positive rates AND equal false negative rates across groups when the base rates of the outcome differ between groups. This is an impossibility theorem, not an empirical limitation.
Direction: conditional
Confidence: strong
Effect: Impossibility result: calibration + equal FPR + equal FNR mutually exclusive when base rates differ
Method: Mathematical proof
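The proof rests on an identity (standard notation, not taken verbatim from this summary) linking the false positive rate, false negative rate, positive predictive value (PPV), and base rate $p$ within a group:

```latex
\mathrm{FPR} \;=\; \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot\bigl(1-\mathrm{FNR}\bigr)
```

If the instrument is equally well calibrated across two groups (equal PPV) while their base rates $p$ differ, then equal FNR forces unequal FPR, and vice versa; the only escapes are $\mathrm{PPV}=1$ (a perfect predictor) or identical base rates.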
The COMPAS recidivism prediction instrument exhibits disparate false positive and false negative rates across racial groups: Black defendants have higher false positive rates (incorrectly labeled high risk) while white defendants have higher false negative rates (incorrectly labeled low risk), despite the instrument being calibrated within each group.
Direction: conditional
Confidence: strong
Effect: Black FPR ~45% vs White FPR ~23%; Black FNR ~28% vs White FNR ~48%
Method: Empirical analysis of COMPAS scores in Broward County, FL
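The audit computation itself is a per-group confusion-matrix calculation. A sketch with hypothetical counts chosen only so the resulting rates match those reported in the finding above; these are not the actual Broward County tallies.

```python
def group_rates(tp, fp, tn, fn):
    """FPR and FNR from one group's confusion-matrix counts.
    Label 1 = reoffended; prediction 1 = scored high risk."""
    return {"FPR": fp / (fp + tn), "FNR": fn / (fn + tp)}

# Hypothetical counts per racial group (illustrative only).
black = group_rates(tp=720, fp=450, tn=550, fn=280)
white = group_rates(tp=260, fp=230, tn=770, fn=240)
print(black)  # {'FPR': 0.45, 'FNR': 0.28}
print(white)  # {'FPR': 0.23, 'FNR': 0.48}
```

Calibration is checked separately, by binning each group's risk scores and comparing predicted to observed recidivism rates within each bin; the finding notes COMPAS passes that check within each group even as FPR/FNR diverge.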
Any attempt to equalize error rates across groups with different base rates must necessarily sacrifice calibration, meaning the predicted risk scores would no longer accurately reflect true recidivism probabilities within at least one group.
Direction: negative
Confidence: strong
Effect: Direct consequence of impossibility theorem
Method: Mathematical proof with empirical illustration
There is an inherent trade-off between calibration and balance (equal false positive and false negative rates across groups): except in degenerate cases where the predictor is perfect or the base rates are equal, these conditions cannot simultaneously hold.
Direction: conditional
Confidence: strong
Effect: Impossibility result: calibration and balance incompatible except in trivial cases
Method: Mathematical proof (formal theorem)