Fairness Diagnostics
Subgroup error rates across race, gender, and age for the best Y1 XGBoost model. Gaps depend heavily on the classification threshold chosen.
Threshold Dependence
Race gaps: The FPR and FNR gaps between Black and White subjects vary substantially across classification thresholds. At t = 0.50 the race FPR gap is near zero (0.002), but it widens at lower thresholds. Conversely, the FNR gap peaks at t = 0.50 and is smaller at the extremes.
Race FPR / FNR Gap vs Threshold (Y1)
Absolute gap between Black and White subgroups at each threshold
Key finding: The FPR gap peaks around t = 0.25–0.30 and nearly vanishes at t = 0.50. The FNR gap is highest at t = 0.50. No single threshold minimizes both gaps simultaneously.
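A gap curve like this can be produced by recomputing each subgroup's error rates at every candidate threshold. The following is a minimal sketch, not the pipeline behind the chart: the function names `error_rates` and `gap_curve`, the group labels "A" and "B", and the toy data are all hypothetical.

```python
def error_rates(y_true, scores, threshold):
    """Return (FPR, FNR) for binary labels and scores at one threshold."""
    fp = fn = pos = neg = 0
    for y, s in zip(y_true, scores):
        pred = s >= threshold  # flagged as high-risk
        if y == 1:
            pos += 1
            fn += not pred     # actual positive that was missed
        else:
            neg += 1
            fp += pred         # actual negative that was flagged
    fpr = fp / neg if neg else float("nan")
    fnr = fn / pos if pos else float("nan")
    return fpr, fnr


def gap_curve(y_true, scores, groups, group_a, group_b, thresholds):
    """Absolute FPR and FNR gaps between two groups at each threshold."""
    def subset(g):
        rows = [(y, s) for y, s, grp in zip(y_true, scores, groups) if grp == g]
        return [y for y, _ in rows], [s for _, s in rows]

    ya, sa = subset(group_a)
    yb, sb = subset(group_b)
    curve = []
    for t in thresholds:
        fpr_a, fnr_a = error_rates(ya, sa, t)
        fpr_b, fnr_b = error_rates(yb, sb, t)
        curve.append((t, abs(fpr_a - fpr_b), abs(fnr_a - fnr_b)))
    return curve


# Toy data (hypothetical, not drawn from the Y1 test set).
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.4, 0.3, 0.1, 0.6, 0.6, 0.2, 0.1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

curve = gap_curve(y_true, scores, groups, "A", "B", [0.25, 0.50])
for t, fpr_gap, fnr_gap in curve:
    print(f"t={t:.2f}  FPR gap={fpr_gap:.2f}  FNR gap={fnr_gap:.2f}")
```

In this toy example the FPR gap is zero at one threshold and the FNR gap is zero at the other, which mirrors the pattern described above: the threshold that closes one gap need not close the other.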
Subgroup Brier Scores (Y1)
Calibration: The Brier score measures the mean squared difference between predicted probabilities and actual outcomes; lower is better. Scores are broadly comparable across subgroups, with the largest spread between Female (0.149) and Black (0.195), a difference of 0.046.
Brier Score by Subgroup
Lower values indicate more accurate probability estimates
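As an illustration of the definition above, the sketch below computes the Brier score separately for each subgroup. The function name `brier_by_group` and the toy inputs are hypothetical, and the report's own computation may differ.

```python
from collections import defaultdict


def brier_by_group(y_true, probs, groups):
    """Mean squared error between predicted probability and outcome, per group."""
    sq_err = defaultdict(float)
    count = defaultdict(int)
    for y, p, g in zip(y_true, probs, groups):
        sq_err[g] += (p - y) ** 2
        count[g] += 1
    return {g: sq_err[g] / count[g] for g in count}


# Toy data (hypothetical): group A gets uninformative 0.5 predictions,
# group B gets perfect predictions.
y_true = [1, 0, 1, 0]
probs = [0.5, 0.5, 1.0, 0.0]
groups = ["A", "A", "B", "B"]

brier = brier_by_group(y_true, probs, groups)
print(brier)
```

One caveat when comparing subgroups: a constant prediction equal to a group's base rate p scores p(1 - p), so groups with different base rates can show different Brier scores even when the model is equally well calibrated for both.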
Error Rates at Threshold 0.5 (Y1)
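t = 0.50: At the standard 0.50 classification threshold, the model exhibits a small FPR gap (0.002) between Black and White subjects but a moderate FNR gap (0.063), with the higher FNR among White subjects (0.849 vs 0.786). Gender differences are also present (Male FPR 0.087 vs Female 0.040; Male FNR 0.798 vs Female 0.884), driven partly by differing base rates.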
| Group | FPR | FNR | PPV | Selection Rate |
|---|---|---|---|---|
| Black | 0.082 | 0.786 | 0.64 | 0.109 |
| White | 0.080 | 0.849 | 0.52 | 0.085 |
| Absolute gap (Black vs White) | 0.002 | 0.063 | n/a | n/a |
| Male | 0.087 | 0.798 | 0.61 | 0.107 |
| Female | 0.040 | 0.884 | 0.47 | 0.044 |
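For reference, the four columns in this table all derive from one confusion matrix per group. The sketch below shows those definitions under the assumption that a score at or above the threshold counts as a positive prediction; the function name `group_metrics` and the toy data are hypothetical.

```python
def group_metrics(y_true, scores, threshold=0.5):
    """FPR, FNR, PPV, and selection rate for one group at one threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        pred = s >= threshold
        if y == 1 and pred:
            tp += 1
        elif y == 1:
            fn += 1
        elif pred:
            fp += 1
        else:
            tn += 1

    def ratio(num, den):
        # Undefined ratios (empty denominators) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "FPR": ratio(fp, fp + tn),                        # flagged among actual negatives
        "FNR": ratio(fn, fn + tp),                        # missed among actual positives
        "PPV": ratio(tp, tp + fp),                        # correct among flagged
        "Selection Rate": ratio(tp + fp, tp + fp + tn + fn),  # flagged among all
    }


# Toy data for a single group (hypothetical, not from the Y1 test set).
y_true = [1, 1, 0, 0, 0]
scores = [0.8, 0.3, 0.6, 0.2, 0.1]

metrics = group_metrics(y_true, scores)
print(metrics)
```

Calling the function once per subgroup and subtracting the resulting values gives the gap for each metric.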
With vs Without Race Feature
Ablation: Removing the race indicator from the feature set does not eliminate performance disparities. Other features (geography, offense history, supervision patterns) correlate with race and carry much of the same signal.
| Variant | Brier | AUROC | Race FPR Gap (t=0.5) |
|---|---|---|---|
| With race | 0.188 | 0.702 | 0.002 |
| Without race | ~0.189 | ~0.700 | similar |
Key finding: Removing the race feature does not meaningfully change aggregate performance or fairness gaps. Other features (geography, prior record patterns) carry correlated signal.
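One simple way to probe for correlated signal is to measure how strongly each remaining feature tracks the removed attribute. The sketch below is a separate proxy check, not the ablation reported in the table. It uses Pearson correlation against a 0/1 group indicator, and the names `proxy_strength`, `group`, `region_code`, and `prior_count` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")  # a constant column has no defined correlation
    return cov / math.sqrt(vx * vy)


def proxy_strength(rows, protected_key, feature_keys):
    """Correlation of each feature with a 0/1 protected-attribute indicator."""
    indicator = [float(r[protected_key]) for r in rows]
    return {
        k: pearson(indicator, [float(r[k]) for r in rows])
        for k in feature_keys
    }


# Toy records (hypothetical): region_code mirrors group, prior_count does not.
rows = [
    {"group": 1, "region_code": 1, "prior_count": 1},
    {"group": 1, "region_code": 1, "prior_count": 2},
    {"group": 0, "region_code": 0, "prior_count": 1},
    {"group": 0, "region_code": 0, "prior_count": 2},
]

strengths = proxy_strength(rows, "group", ["region_code", "prior_count"])
print(strengths)
```

This only detects linear, single-feature association. A low value does not rule out the attribute being encoded jointly or nonlinearly across several features, which a tree model such as XGBoost can exploit.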
Operational Policy Comparison
Trade-offs: The choice of classification threshold directly affects how many individuals are flagged as high-risk and the resulting error profile. A lower cutoff (or a larger K in a top-K% selection rule) flags more individuals, which increases sensitivity but also increases false positives.
| Policy | FP (Black) | FN (Black) | FP (White) | FN (White) |
|---|---|---|---|---|
| t = 0.50 | ~370 | ~3,540 | ~170 | ~1,440 |
| Top 10% | ~680 | ~2,890 | ~350 | ~1,150 |
| Top 20% | ~1,250 | ~2,210 | ~640 | ~860 |
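Choosing a more aggressive selection policy (a lower cutoff or a larger top-K%) flags more people as high-risk, reducing misses but increasing false positives. For example, moving from t = 0.50 to Top 20% cuts false negatives among Black subjects from ~3,540 to ~2,210 while raising false positives from ~370 to ~1,250. Counts are approximate and based on the Y1 test set.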
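The two kinds of policy in the table can be expressed as flagging rules and then scored by per-group error counts. The sketch below assumes top-K% is applied to the pooled population rather than within each group, which the source does not state. All function names and the toy data are hypothetical.

```python
import math


def flag_by_threshold(scores, threshold):
    """Flag everyone whose score is at or above a fixed cutoff."""
    return [s >= threshold for s in scores]


def flag_top_fraction(scores, fraction):
    """Flag the highest-scoring fraction of the pooled population."""
    # Small tolerance guards against float error in fraction * n.
    k = math.ceil(fraction * len(scores) - 1e-9)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = [False] * len(scores)
    for i in ranked[:k]:  # ties are broken by original order
        flagged[i] = True
    return flagged


def error_counts(y_true, flagged, groups):
    """False positive and false negative counts per group."""
    counts = {}
    for y, f, g in zip(y_true, flagged, groups):
        c = counts.setdefault(g, {"FP": 0, "FN": 0})
        if f and y == 0:
            c["FP"] += 1
        elif not f and y == 1:
            c["FN"] += 1
    return counts


# Toy data (hypothetical, not from the Y1 test set).
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.8, 0.55, 0.45, 0.2, 0.1]
y_true = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]
groups = ["A"] * 5 + ["B"] * 5

strict = error_counts(y_true, flag_top_fraction(scores, 0.20), groups)
loose = error_counts(y_true, flag_by_threshold(scores, 0.50), groups)
print("Top 20%:", strict)
print("t = 0.50:", loose)
```

In this toy population the top-20% rule is the stricter of the two, so it produces no false positives but more false negatives than the 0.50 cutoff. Which policy is stricter on real data depends on the score distribution.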