Model Performance

XGBoost with post-hoc probability calibration (Platt scaling or isotonic regression), evaluated on the NIJ Georgia parole cohort. Primary metric: Brier score, a proper scoring rule that rewards both discrimination and calibration.
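For reference, the Brier score is the mean squared difference between the predicted probability and the observed binary outcome. The snippet below is a minimal sketch of that definition, not the project's evaluation code; the function name `brier_score` and the toy inputs are illustrative assumptions.

```python
from typing import Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean of (predicted probability - observed outcome)^2 over all cases.

    y_true holds binary outcomes (0 or 1); y_prob holds predicted
    probabilities of the positive class. Lower is better.
    """
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one prediction is required")
    squared_errors = [(p - y) ** 2 for y, p in zip(y_true, y_prob)]
    return sum(squared_errors) / len(squared_errors)


# Toy data (hypothetical): 3 positives out of 10, every case scored at the
# base rate 0.3. A constant base-rate forecast scores p * (1 - p).
toy_outcomes = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
toy_probs = [0.3] * 10
toy_brier = brier_score(toy_outcomes, toy_probs)
```

A constant forecast at the base rate gives a useful floor for reading the leaderboard: a model is only informative to the extent that its Brier score falls below that value for the same horizon.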

XGBoost Leaderboard

Best per horizon

Dataset	Horizon	Calibration	Brier	AUROC	AUPRC	ECE
dynamic	Y1	platt	0.188	0.702	0.479	0.007
dynamic	Y2	isotonic	0.172	0.705	0.426	0.017
dynamic	Y3	isotonic	0.143	0.696	0.316	0.012
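The ECE column reports expected calibration error, but the binning scheme is not stated here. The sketch below assumes ten equal-width probability bins, which is one common convention; the function name and the bin count are assumptions and may differ from how the table was produced.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    y_true: Sequence[int], y_prob: Sequence[float], n_bins: int = 10
) -> float:
    """Weighted average gap between mean predicted probability and
    observed event rate, computed over equal-width probability bins."""
    if len(y_true) != len(y_prob) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(p < 0.0 or p > 1.0 for p in y_prob):
        raise ValueError("probabilities must lie in [0, 1]")

    bins: List[List[Tuple[int, float]]] = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        # p == 1.0 would index past the last bin, so clamp it.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((y, p))

    total = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        observed_rate = sum(y for y, _ in members) / len(members)
        mean_confidence = sum(p for _, p in members) / len(members)
        ece += (len(members) / total) * abs(observed_rate - mean_confidence)
    return ece
```

Because ECE depends on the number and placement of bins, values are only comparable between models when the same binning is used for all of them.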

Baseline Comparison

XGBoost vs Baselines (Brier Score)

Grouped by prediction horizon. Lower is better.

Insight: XGBoost's improvement over logistic regression is real but small, at 0.001–0.006 Brier points depending on the horizon. The largest gain is at Y2, where dynamic features add discriminative power.

Seed Stability

Brier Score Stability (Seeds 42/43/44)

Mean and standard deviation across three random seeds.

AUROC Stability (Seeds 42/43/44)

Horizon	AUROC Mean	AUROC Std
Y1	0.70193	± 0.00363
Y2	0.71549	± 0.01278
Y3	0.70392	± 0.00796

Insight: Aggregate metrics are stable across seeds (AUROC std between 0.004 and 0.013). Subgroup gaps are more volatile: the race FPR gap std ranges from 0.006 to 0.018.
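The mean and standard deviation in the seed-stability summaries can be computed per horizon from the per-seed metric values. The sketch below assumes the sample standard deviation (n - 1 denominator); the report does not say which estimator was used, and the per-seed numbers shown are hypothetical placeholders, not results from this evaluation.

```python
from statistics import mean, stdev
from typing import Dict, Tuple


def summarize_across_seeds(metric_by_seed: Dict[int, float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of one metric across seeds."""
    values = list(metric_by_seed.values())
    if len(values) < 2:
        raise ValueError("need at least two seeds to estimate a spread")
    return mean(values), stdev(values)


# Hypothetical AUROC values for seeds 42/43/44 (illustration only).
example_auroc = {42: 0.70, 43: 0.71, 44: 0.72}
auroc_mean, auroc_std = summarize_across_seeds(example_auroc)
summary_line = f"{auroc_mean:.5f} ± {auroc_std:.5f}"
```

With only three seeds the standard deviation is itself a noisy estimate, so small differences in spread between horizons or between subgroup metrics should be read with caution.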

COMPAS Benchmark

External reference

Model	Brier	AUROC	Note
COMPAS XGBoost	0.173	0.808	Broward County, FL

This benchmark comes from a different jurisdiction (Broward County, FL) and uses different features and a different outcome definition, so it is not directly comparable to the NIJ Georgia results.