Fairness Diagnostics

Subgroup error rates across race, gender, and age for the best Y1 XGBoost model. Gaps depend heavily on the classification threshold chosen.

Threshold Dependence

Race gaps

FPR and FNR gaps between Black and White subjects vary substantially across classification thresholds. The FPR gap is near zero at t = 0.50 (0.002) but widens at lower thresholds, peaking around t = 0.25–0.30. The FNR gap behaves the opposite way: it peaks at t = 0.50 (0.063) and shrinks toward the extremes.

Figure: Race FPR / FNR Gap vs Threshold (Y1). Absolute gap between Black and White subgroups at each threshold.

Key finding: The FPR gap peaks around t = 0.25–0.30 and nearly vanishes at t = 0.50. The FNR gap is highest at t = 0.50. No single threshold minimizes both gaps simultaneously.
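To make the threshold sweep concrete, the following is a minimal sketch of how absolute FPR and FNR gaps between two groups could be computed at each threshold. The function names (`error_rates`, `gap_curve`) and the toy data are illustrative assumptions, not the pipeline or data behind the figures above.

```python
def error_rates(y_true, scores, threshold):
    """Return (FPR, FNR) when scores >= threshold are flagged as positive."""
    fp = fn = pos = neg = 0
    for y, s in zip(y_true, scores):
        flagged = s >= threshold
        if y == 1:
            pos += 1
            fn += not flagged      # missed positive
        else:
            neg += 1
            fp += flagged          # flagged negative
    fpr = fp / neg if neg else float("nan")
    fnr = fn / pos if pos else float("nan")
    return fpr, fnr


def gap_curve(group_a, group_b, thresholds):
    """Absolute FPR and FNR gaps between two groups at each threshold.

    Each group is a (y_true, scores) pair.
    """
    rows = []
    for t in thresholds:
        fpr_a, fnr_a = error_rates(*group_a, t)
        fpr_b, fnr_b = error_rates(*group_b, t)
        rows.append((t, abs(fpr_a - fpr_b), abs(fnr_a - fnr_b)))
    return rows


# Toy, made-up data (NOT from the Y1 test set), chosen so the two gaps
# move in opposite directions as the threshold changes.
group_a = ([0, 0, 0, 1, 1], [0.10, 0.30, 0.45, 0.40, 0.80])
group_b = ([0, 0, 0, 1, 1], [0.10, 0.20, 0.30, 0.55, 0.90])
curve = gap_curve(group_a, group_b, [0.25, 0.50])
for t, fpr_gap, fnr_gap in curve:
    print(f"t={t:.2f}  FPR gap={fpr_gap:.3f}  FNR gap={fnr_gap:.3f}")
```

In this toy example the FPR gap is present at t = 0.25 and disappears at t = 0.50, while the FNR gap does the reverse, which is the same qualitative pattern described above.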

Subgroup Brier Scores (Y1)

Calibration

Brier score measures the mean squared difference between predicted probabilities and actual outcomes. Lower is better. Scores are broadly comparable across subgroups, ranging from 0.149 (Female) to 0.195 (Black), a spread of 0.046.

Figure: Brier Score by Subgroup. Lower values indicate more accurate probability estimates.
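As a sketch of the computation, a per-subgroup Brier score is the mean of (p - y)^2 taken within each group. The helper name `brier_by_group` and the toy inputs below are assumptions for illustration only.

```python
from collections import defaultdict


def brier_by_group(y_true, probs, groups):
    """Mean squared error between predicted probability and outcome, per group."""
    sq_err = defaultdict(float)
    count = defaultdict(int)
    for y, p, g in zip(y_true, probs, groups):
        sq_err[g] += (p - y) ** 2
        count[g] += 1
    return {g: sq_err[g] / count[g] for g in count}


# Toy, made-up data (not the Y1 test set).
y_true = [1, 0, 0, 1]
probs = [0.8, 0.2, 0.5, 0.5]
groups = ["A", "A", "B", "B"]
brier = brier_by_group(y_true, probs, groups)
print(brier)
```

Group B's predictions of 0.5 are uninformative and give the higher score. Because the Brier score also depends on a group's base rate, differences between subgroups need not reflect miscalibration alone.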

Error Rates at Threshold 0.5 (Y1)

t = 0.50

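At the standard 0.50 classification threshold, the Black–White FPR gap is small (0.002; 0.082 vs 0.080), but the FNR gap is moderate (0.063 in absolute terms), with White subjects missed more often (0.849 vs 0.786). Gender differences are larger: FPR is 0.087 for Male vs 0.040 for Female, and FNR is 0.798 vs 0.884, driven partly by differing base rates.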

Group	FPR	FNR	PPV	Selection Rate
Black	0.082	0.786	0.64	0.109
White	0.080	0.849	0.52	0.085
Abs. gap |B − W|	0.002	0.063	—	—
Male	0.087	0.798	0.61	0.107
Female	0.040	0.884	0.47	0.044
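For reference, the four columns above follow from a subgroup's confusion-matrix counts. The sketch below uses a hypothetical helper (`subgroup_rates`) and made-up counts, then recomputes the signed Black minus White gaps from the rates reported in the table.

```python
def subgroup_rates(tp, fp, fn, tn):
    """Derive the table's metrics from one subgroup's confusion counts."""
    negatives = fp + tn
    positives = tp + fn
    flagged = tp + fp
    total = positives + negatives
    return {
        "FPR": fp / negatives,              # flagged share of true negatives
        "FNR": fn / positives,              # missed share of true positives
        "PPV": tp / flagged,                # precision among those flagged
        "selection_rate": flagged / total,  # share of the group flagged
    }


# Made-up counts for illustration (not the Y1 test set).
example = subgroup_rates(tp=20, fp=10, fn=80, tn=90)
print(example)

# Signed gaps (Black minus White) using the rates reported in the table.
fpr_gap = round(0.082 - 0.080, 3)
fnr_gap = round(0.786 - 0.849, 3)
print(fpr_gap, fnr_gap)
```

The signed FNR difference is negative (Black FNR is lower than White FNR), so the 0.063 reported in the gap row is its magnitude.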

With vs Without Race Feature

Ablation

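Removing the race indicator from the feature set does not eliminate performance disparities: Brier moves from 0.188 to ~0.189, AUROC from 0.702 to ~0.700, and the race FPR gap at t = 0.5 stays similar. Other features (geography, offense history, supervision patterns) correlate with race and carry much of the same signal.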

Variant	Brier	AUROC	Race FPR Gap (t=0.5)
With race	0.188	0.702	0.002
Without race	~0.189	~0.700	similar

Key finding: Removing the race feature does not meaningfully change aggregate performance or fairness gaps. Other features (geography, offense history, supervision patterns) carry correlated signal.
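One simple way to probe for this kind of correlated signal is to measure how strongly a remaining feature tracks the held-out group indicator. The sketch below uses a plain Pearson correlation on made-up values. The variable names are hypothetical, and this is an assumed diagnostic rather than the analysis behind the table above.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up data: a binary group indicator removed from the model's inputs,
# plus two hypothetical features that remain in the feature set.
group_indicator = [1, 1, 1, 0, 0, 0]
proxy_feature = [3, 2, 3, 1, 0, 1]      # tracks the indicator closely
neutral_feature = [1, 2, 3, 1, 2, 3]    # same distribution in both groups

r_proxy = pearson(group_indicator, proxy_feature)
r_neutral = pearson(group_indicator, neutral_feature)
print(round(r_proxy, 3), round(r_neutral, 3))
```

A feature with a high correlation can reintroduce group information after the explicit indicator is dropped. A linear correlation is only a first check, since a model can also recover the signal from combinations of weakly correlated features.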

Operational Policy Comparison

Trade-offs

The choice of classification threshold directly affects how many individuals are flagged as high-risk and the resulting error profile. A lower cutoff (or a top-K% selection rule) increases sensitivity but also increases false positives.

Policy	FP (Black)	FN (Black)	FP (White)	FN (White)
t = 0.50	~370	~3,540	~170	~1,440
Top 10%	~680	~2,890	~350	~1,150
Top 20%	~1,250	~2,210	~640	~860

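For example, moving from t = 0.50 to the Top 20% rule cuts false negatives for Black subjects from ~3,540 to ~2,210 while raising false positives from ~370 to ~1,250. White subjects follow the same pattern (false negatives ~1,440 to ~860; false positives ~170 to ~640). Counts are approximate and based on the Y1 test set.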
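The two kinds of policy can be sketched as follows: a fixed cutoff flags everyone at or above a score, while a top-K% rule flags a fixed share of the highest scores regardless of their values. The function names and toy data are illustrative assumptions. Ties in the top-K% rule are broken by input order here, which a real policy would need to specify.

```python
import math
from collections import Counter


def flag_by_threshold(scores, threshold):
    """Flag everyone whose score is at or above the cutoff."""
    return [s >= threshold for s in scores]


def flag_top_fraction(scores, fraction):
    """Flag the ceil(fraction * n) highest-scoring individuals."""
    k = math.ceil(fraction * len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    return [i in chosen for i in range(len(scores))]


def errors_by_group(y_true, flagged, groups):
    """Count false positives and false negatives per group."""
    counts = Counter()
    for y, f, g in zip(y_true, flagged, groups):
        if f and y == 0:
            counts[(g, "FP")] += 1
        elif not f and y == 1:
            counts[(g, "FN")] += 1
    return counts


# Toy, made-up data (not the Y1 test set).
scores = [0.9, 0.7, 0.6, 0.45, 0.4, 0.35, 0.3, 0.2, 0.15, 0.1]
y_true = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
groups = ["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"]

at_cutoff = errors_by_group(y_true, flag_by_threshold(scores, 0.5), groups)
top_half = errors_by_group(y_true, flag_top_fraction(scores, 0.5), groups)
print(dict(at_cutoff))
print(dict(top_half))
```

In the toy data, the more aggressive rule flags five people instead of three, lowering total false negatives from 3 to 2 while raising false positives from 1 to 2.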