Model Performance
XGBoost with post-hoc calibration (Platt or isotonic, whichever scores best per horizon), evaluated on the NIJ Georgia parole cohort. Primary metric: Brier score (a proper scoring rule that rewards both discrimination and calibration).
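For readers unfamiliar with the primary metric, the following is a minimal sketch of the Brier score for binary outcomes: the mean squared difference between the predicted probability and the 0/1 label. The function name and the toy data are illustrative assumptions, not taken from the evaluation code behind this report.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    if not y_true:
        raise ValueError("at least one observation is required")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy labels and probabilities (illustrative only, not from the NIJ cohort).
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
score = brier_score(y_true, y_prob)

# An uninformative model that always predicts 0.5 scores exactly 0.25.
coin_flip = brier_score(y_true, [0.5] * len(y_true))
```

Lower is better: 0 is a perfect forecast, and a constant prediction equal to the base rate p scores p(1 - p), which is a useful floor when reading the values in this section.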
XGBoost Leaderboard
Best per horizon
| Dataset | Horizon | Calibration | Brier | AUROC | AUPRC | ECE |
|---|---|---|---|---|---|---|
| dynamic | Y1 | platt | 0.188 | 0.702 | 0.479 | 0.007 |
| dynamic | Y2 | isotonic | 0.172 | 0.705 | 0.426 | 0.017 |
| dynamic | Y3 | isotonic | 0.143 | 0.696 | 0.316 | 0.012 |
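The ECE column reports expected calibration error. The report does not state the binning scheme, so the sketch below assumes 10 equal-width probability bins, which is a common default; the function name and toy data are likewise assumptions for illustration.

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean gap between observed rate and mean prediction per bin.

    Assumes equal-width bins on [0, 1]; the report's actual binning is unknown.
    """
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        # A probability of exactly 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((y, p))

    total = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_pred = sum(p for _, p in members) / len(members)
        observed = sum(y for y, _ in members) / len(members)
        ece += (len(members) / total) * abs(observed - mean_pred)
    return ece


# Toy data: predictions of 0.25 and 0.75 that are each off by 0.25.
ece_toy = expected_calibration_error([0, 0, 1, 1], [0.25, 0.25, 0.75, 0.75])

# Perfectly calibrated toy case: half of the 0.5 predictions are positive.
ece_perfect = expected_calibration_error([0, 1], [0.5, 0.5])
```

ECE depends on the number and placement of bins, so values computed with a different scheme are not directly comparable to the column above.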
Baseline Comparison
XGBoost vs Baselines (Brier Score)
Grouped by prediction horizon. Lower is better.
Insight: XGBoost's improvement over logistic regression is real but small, at 0.001–0.006 Brier points. The largest gain is at Y2, where dynamic features add discriminative power.
Seed Stability
Brier Score Stability (Seeds 42/43/44)
Mean and standard deviation across three random seeds.
AUROC Stability (Seeds 42/43/44)
| Horizon | AUROC Mean | AUROC Std |
|---|---|---|
| Y1 | 0.70193 | ± 0.00363 |
| Y2 | 0.71549 | ± 0.01278 |
| Y3 | 0.70392 | ± 0.00796 |
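A mean and standard deviation over three seeds can be reproduced with the standard library, as in the sketch below. The per-seed AUROC values are hypothetical placeholders, and the report does not say whether the sample (n - 1) or population (n) standard deviation was used, so both are shown.

```python
import statistics

# Hypothetical per-seed AUROC values for a single horizon (not real results).
auroc_by_seed = {42: 0.698, 43: 0.702, 44: 0.706}
values = list(auroc_by_seed.values())

mean_auroc = statistics.mean(values)

# Sample standard deviation (divides by n - 1).
std_sample = statistics.stdev(values)

# Population standard deviation (divides by n).
std_population = statistics.pstdev(values)

print(f"mean={mean_auroc:.5f} sample_std={std_sample:.5f} "
      f"population_std={std_population:.5f}")
```

With only three seeds, the sample estimate is about 22% larger than the population estimate (a factor of sqrt(3/2)), so the convention matters when comparing the Std column against other reports.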
Insight: Aggregate metrics are stable across seeds (AUROC std between 0.004 and 0.013). Subgroup gaps are more volatile: the race FPR gap std ranges from 0.006 to 0.018.
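The report does not define how the FPR gap is computed. One plausible reading, sketched below, is the difference between the highest and lowest group-level false positive rates at a fixed decision threshold. The function names, group labels, and toy data are all assumptions for illustration.

```python
def false_positive_rate(y_true, y_pred):
    """Share of true negatives (label 0) that were predicted positive (1)."""
    negatives = [pred for y, pred in zip(y_true, y_pred) if y == 0]
    if not negatives:
        raise ValueError("no negative examples in this group")
    return sum(negatives) / len(negatives)


def fpr_gap(y_true, y_pred, group):
    """Return (max FPR - min FPR across groups, per-group FPR dict)."""
    rates = {}
    for g in sorted(set(group)):
        idx = [i for i, gi in enumerate(group) if gi == g]
        rates[g] = false_positive_rate(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )
    return max(rates.values()) - min(rates.values()), rates


# Toy data with two hypothetical groups "A" and "B" (not real cohort data).
y_true = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0, 0, 0]
group = ["A"] * 5 + ["B"] * 5

gap, rates = fpr_gap(y_true, y_pred, group)
```

Because each group's FPR is estimated from only that group's negatives, the gap is computed from smaller samples than the aggregate metrics, which is one reason it can vary more from seed to seed.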
COMPAS Benchmark
External reference
| Model | Brier | AUROC | Note |
|---|---|---|---|
| COMPAS XGBoost | 0.173 | 0.808 | Broward County, FL |
This benchmark comes from a different jurisdiction (Broward County, FL) and uses different features and a different outcome definition, so it is not directly comparable to the NIJ Georgia results.