Measurement and mitigation of bias in ML prediction systems. Covers impossibility theorems (Chouldechova, Kleinberg), facial recognition bias (Buolamwini & Gebru), and accuracy-fairness tradeoffs in criminal justice risk assessment.
Constructs
prediction_accuracy
Prediction Accuracy
Correctness of model predictions, measuring how well a classifier or regression model maps inputs to true outcomes across the full population or subgroups.
false_positive_rate
False Positive Rate
Rate of incorrect positive predictions, measuring how often a classifier incorrectly labels negative instances as positive, potentially causing harm through false accusations or unnecessary interventions.
demographic_parity
Demographic Parity
Fairness criterion requiring equal positive prediction rates across demographic groups, ensuring that the proportion of individuals receiving a positive classification is independent of group membership.
calibration
Calibration
Property that predicted probabilities match observed frequencies of the outcome, meaning that among all individuals assigned a predicted risk of X%, approximately X% actually experience the outcome.
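The four constructs above can be made concrete with a short sketch. This is an illustrative example, not from any of the cited papers: the toy labels, predictions, and scores below are invented, and `calibration_error` checks a single score bucket rather than full calibration.

```python
def accuracy(y_true, y_pred):
    """Prediction accuracy: fraction of predictions matching the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def false_positive_rate(y_true, y_pred):
    """FPR: among true negatives, the fraction incorrectly predicted positive."""
    negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    return sum(negatives) / len(negatives)

def positive_rate(y_pred):
    """Positive prediction rate; demographic parity compares this across groups."""
    return sum(y_pred) / len(y_pred)

def calibration_error(scores, y_true):
    """For one bucket of (near-)equal scores: |predicted risk - observed frequency|."""
    return abs(sum(scores) / len(scores) - sum(y_true) / len(y_true))

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(accuracy(y_true, y_pred))             # 0.75
print(false_positive_rate(y_true, y_pred))  # 0.25
print(positive_rate(y_pred))                # 0.5
# A bucket scored 0.75 in which 3 of 4 individuals have the outcome is calibrated:
print(calibration_error([0.75] * 4, [1, 1, 1, 0]))  # 0.0
```

Group-conditioned versions of these functions (filtering `y_true`/`y_pred` by a group attribute before computing) are what the findings below compare.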
Findings
The incompatibility between fairness criteria is not a limitation of particular algorithms but a mathematical property of the prediction problem itself. No algorithm, no matter how sophisticated, can satisfy calibration and balance simultaneously when base rates differ.
Direction: conditional
Confidence: strong
Effect: Universal impossibility: applies to all possible algorithms
Method: Mathematical proof
The only way to simultaneously satisfy calibration and balance is if either (a) the predictor is perfectly accurate (zero error rate) or (b) the base rates are identical across groups. Both conditions are essentially never met in real-world applications.
Direction: conditional
Confidence: strong
Effect: Perfect prediction or equal base rates are the only escape from impossibility
Method: Mathematical proof (boundary conditions of theorem)
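A small numeric illustration of the theorem (all counts invented for this sketch): two groups whose scores are perfectly calibrated within each bucket, but whose base rates differ (0.50 vs 0.35). Thresholding the same calibrated scores yields unequal false positive and false negative rates, exactly as the theorem requires.

```python
# Each bucket: (score, n_people, n_actually_positive).
# Within every bucket the positive fraction equals the score, so both
# groups are perfectly calibrated; only the base rates differ.
group_a = [(0.8, 10, 8), (0.2, 10, 2)]   # base rate 10/20 = 0.50
group_b = [(0.8, 10, 8), (0.2, 30, 6)]   # base rate 14/40 = 0.35

def error_rates(buckets, threshold=0.5):
    """Return (FPR, FNR) when predicting positive iff score >= threshold."""
    fp = sum(n - pos for s, n, pos in buckets if s >= threshold)
    tn = sum(n - pos for s, n, pos in buckets if s < threshold)
    tp = sum(pos for s, n, pos in buckets if s >= threshold)
    fn = sum(pos for s, n, pos in buckets if s < threshold)
    return fp / (fp + tn), fn / (fn + tp)

print(error_rates(group_a))  # (0.2, 0.2)
print(error_rates(group_b))  # (0.0769..., 0.4285...): balance fails
```

No choice of threshold repairs this while the bucket scores stay calibrated; only equal base rates or error-free buckets (scores of exactly 0 or 1) would.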
Commercial facial recognition systems from Microsoft, IBM, and Face++ exhibit dramatically higher error rates for dark-skinned women (up to 34.7% error) compared to light-skinned men (0.8% error), revealing severe intersectional bias in deployed AI systems.
Direction: negative
Confidence: strong
Effect: Error rate: 34.7% for dark-skinned females vs 0.8% for light-skinned males (worst system)
Method: Benchmark evaluation on Pilot Parliaments Benchmark (PPB) dataset, 1270 faces
All three commercial gender classification systems performed worst on darker-skinned females and best on lighter-skinned males, with error rates consistently ordered: dark female > dark male > light female > light male, demonstrating that intersectional subgroups (combining race and gender) reveal disparities hidden by single-axis analysis.
Direction: negative
Confidence: strong
Effect: Consistent ordering across all 3 systems: dark female worst, light male best
Method: Intersectional benchmark evaluation across 4 subgroups
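The core methodological point, that single-axis aggregation masks intersectional disparities, can be sketched as follows. The per-subgroup counts below are made up for illustration; the actual rates are those reported in the findings above.

```python
from collections import defaultdict

# (subgroup, n_examples, n_errors) -- hypothetical audit counts
results = [
    ("dark_female", 100, 30),
    ("dark_male", 100, 12),
    ("light_female", 100, 7),
    ("light_male", 100, 1),
]

# Intersectional view: one error rate per subgroup.
for name, n, err in results:
    print(name, err / n)   # dark_female 0.30 ... light_male 0.01

# Single-axis view: aggregating by gender alone averages away the
# worst-off subgroup.
by_gender = defaultdict(lambda: [0, 0])
for name, n, err in results:
    gender = name.split("_")[1]
    by_gender[gender][0] += n
    by_gender[gender][1] += err
for g, (n, err) in by_gender.items():
    print(g, err / n)   # female 0.185, male 0.065 -- the 0.30 subgroup is hidden
```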
Existing benchmark datasets (e.g., IJB-A, Adience) are disproportionately composed of lighter-skinned subjects, creating a feedback loop where systems trained and evaluated on non-representative data systematically underperform on underrepresented groups without detection.
Direction: negative
Confidence: moderate
Effect: Significant demographic skew in standard benchmarks; PPB designed to be balanced
Method: Audit of benchmark dataset composition
Gradient boosting machines achieve highest AUC (0.79) across 8 benchmark credit scoring datasets, outperforming logistic regression by approximately 0.02 in AUC
Direction: positive
Confidence: strong
Effect: AUC=0.79 for GBM vs ~0.77 for logistic regression
Method: benchmark comparison across 8 datasets
Ensemble methods consistently outperform single classifiers in credit scoring across all benchmark datasets tested
Direction: positive
Confidence: strong
Method: systematic benchmark comparison
Credit history is the strongest single predictor of default with feature importance approximately 0.35 across models
Direction: positive
Confidence: strong
Effect: feature importance ~0.35
Method: feature importance analysis
Simpler models such as logistic regression achieve 95% of the best AUC while offering substantially better interpretability, suggesting a favorable accuracy-interpretability tradeoff
Direction: conditional
Confidence: moderate
Effect: logistic achieves 95% of best AUC
Method: benchmark comparison
It is mathematically impossible to simultaneously satisfy calibration AND equal false positive rates AND equal false negative rates across groups when the base rates of the outcome differ between groups. This is an impossibility theorem, not an empirical limitation.
Direction: conditional
Confidence: strong
Effect: Impossibility result: calibration + equal FPR + equal FNR mutually exclusive when base rates differ
Method: Mathematical proof
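The proof rests on an identity (standard notation, not taken verbatim from this summary) linking the false positive rate, false negative rate, positive predictive value (PPV), and base rate $p$ within a group:

```latex
\mathrm{FPR} \;=\; \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot\bigl(1-\mathrm{FNR}\bigr)
```

If the instrument is equally well calibrated across two groups (equal PPV) while their base rates $p$ differ, then equal FNR forces unequal FPR, and vice versa; the only escapes are $\mathrm{PPV}=1$ (a perfect predictor) or identical base rates.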
The COMPAS recidivism prediction instrument exhibits disparate false positive and false negative rates across racial groups: Black defendants have higher false positive rates (incorrectly labeled high risk) while white defendants have higher false negative rates (incorrectly labeled low risk), despite the instrument being calibrated within each group.
Direction: conditional
Confidence: strong
Effect: Black FPR ~45% vs White FPR ~23%; Black FNR ~28% vs White FNR ~48%
Method: Empirical analysis of COMPAS scores in Broward County, FL
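The audit computation itself is a per-group confusion-matrix calculation. A sketch with hypothetical counts chosen only so the resulting rates match those reported in the finding above; these are not the actual Broward County tallies.

```python
def group_rates(tp, fp, tn, fn):
    """FPR and FNR from one group's confusion-matrix counts.
    Label 1 = reoffended; prediction 1 = scored high risk."""
    return {"FPR": fp / (fp + tn), "FNR": fn / (fn + tp)}

# Hypothetical counts per racial group (illustrative only).
black = group_rates(tp=720, fp=450, tn=550, fn=280)
white = group_rates(tp=260, fp=230, tn=770, fn=240)
print(black)  # {'FPR': 0.45, 'FNR': 0.28}
print(white)  # {'FPR': 0.23, 'FNR': 0.48}
```

Calibration is checked separately, by binning each group's risk scores and comparing predicted to observed recidivism rates within each bin; the finding notes COMPAS passes that check within each group even as FPR/FNR diverge.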
Any attempt to equalize error rates across groups with different base rates must necessarily sacrifice calibration, meaning the predicted risk scores would no longer accurately reflect true recidivism probabilities within at least one group.
Direction: negative
Confidence: strong
Effect: Direct consequence of impossibility theorem
Method: Mathematical proof with empirical illustration
There is an inherent trade-off between calibration and balance (equal false positive and false negative rates across groups): except in degenerate cases where the predictor is perfect or the base rates are equal, these conditions cannot simultaneously hold.
Direction: conditional
Confidence: strong
Effect: Impossibility result: calibration and balance incompatible except in trivial cases
Method: Mathematical proof (formal theorem)