Model Performance

XGBoost with post-hoc probability calibration (Platt scaling at Y1, isotonic regression at Y2 and Y3), evaluated on the NIJ Georgia parole cohort. Primary metric: Brier score, a proper scoring rule that rewards both discrimination and calibration.
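For readers unfamiliar with the primary metric, the Brier score is the mean squared difference between the predicted probability and the observed 0/1 outcome. The following is a minimal sketch in plain Python; the function name and the toy inputs are illustrative assumptions and are not taken from the evaluation pipeline or the NIJ Georgia data.

```python
from typing import Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and 0/1 outcome."""
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must be non-empty")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy data for illustration only; not drawn from the NIJ Georgia cohort.
y_true = [1, 0, 0, 1]
y_prob = [0.8, 0.2, 0.4, 0.6]
score = brier_score(y_true, y_prob)
```

A model that always predicts 0.5 scores 0.25 regardless of the outcomes, which gives a rough reference point when reading the leaderboard values.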

XGBoost Leaderboard

Dataset  Horizon  Calibration  Brier  AUROC  AUPRC  ECE
dynamic  Y1       platt        0.188  0.702  0.479  0.007
dynamic  Y2       isotonic     0.172  0.705  0.426  0.017
dynamic  Y3       isotonic     0.143  0.696  0.316  0.012
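The ECE column reports expected calibration error. As a rough sketch of how such a value could be computed, the example below bins predictions into equal-width probability bins and takes the sample-weighted mean of the gap between the observed event rate and the mean predicted probability in each bin. The choice of 10 equal-width bins is an assumption; the report does not state its binning scheme, and the function name and toy data are illustrative.

```python
from typing import Sequence


def expected_calibration_error(
    y_true: Sequence[int], y_prob: Sequence[float], n_bins: int = 10
) -> float:
    """ECE with equal-width bins (assumed scheme, not confirmed by the report)."""
    if len(y_true) != len(y_prob) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    sum_prob = [0.0] * n_bins   # sum of predicted probabilities per bin
    sum_true = [0.0] * n_bins   # sum of observed outcomes per bin
    count = [0] * n_bins        # number of samples per bin

    for y, p in zip(y_true, y_prob):
        # A probability of exactly 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        sum_prob[idx] += p
        sum_true[idx] += y
        count[idx] += 1

    n = len(y_true)
    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        gap = abs(sum_true[b] / count[b] - sum_prob[b] / count[b])
        ece += (count[b] / n) * gap
    return ece


# Toy data for illustration only.
ece = expected_calibration_error([0, 0, 1, 1], [0.1, 0.1, 0.9, 0.9])
```

Other binning schemes (for example, equal-frequency bins) can give noticeably different values on the same predictions, so ECE figures are only comparable when the scheme matches.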

Baseline Comparison

Figure: XGBoost vs Baselines (Brier Score), grouped by prediction horizon. Lower is better.

Insight: XGBoost's improvement over logistic regression is real but small, at 0.001–0.006 Brier points. The largest gain is at Y2, where dynamic features add discriminative power.

Seed Stability

Figure: Brier Score Stability (Seeds 42/43/44). Mean and standard deviation across three random seeds.

AUROC Stability (Seeds 42/43/44)

Horizon  AUROC Mean  AUROC Std
Y1       0.70193     ± 0.00363
Y2       0.71549     ± 0.01278
Y3       0.70392     ± 0.00796
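A mean and standard deviation across seeds, as in the table above, can be summarized in a few lines. This sketch uses the sample standard deviation (n - 1 denominator); whether the report uses the sample or the population form is not stated, so that choice is an assumption. The per-seed values below are placeholders, not the actual per-seed results.

```python
from statistics import mean, stdev
from typing import Dict, Tuple


def summarize_across_seeds(metric_by_seed: Dict[int, float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of a metric across seeds."""
    values = list(metric_by_seed.values())
    if len(values) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    # stdev uses the n - 1 denominator; statistics.pstdev would use n instead.
    return mean(values), stdev(values)


# Placeholder AUROC values keyed by seed, for illustration only.
auroc_by_seed = {42: 0.70, 43: 0.71, 44: 0.72}
auroc_mean, auroc_std = summarize_across_seeds(auroc_by_seed)
```

With only three seeds the standard deviation is itself a noisy estimate, so small differences between horizons should be read cautiously.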

Insight: Aggregate metrics are stable across seeds (AUROC std of 0.004 to 0.013). Subgroup gaps are more volatile: the std of the race FPR gap ranges from 0.006 to 0.018.
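To make the subgroup quantity concrete, the sketch below computes a false positive rate (FPR) gap between groups. It assumes a decision threshold of 0.5 and defines the gap as the largest group FPR minus the smallest; the report does not state its threshold or gap definition, so both are assumptions, and the function names and toy data are illustrative.

```python
from typing import Dict, List, Sequence, Tuple


def false_positive_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of true negatives (y_true == 0) that were predicted positive."""
    preds_on_negatives = [yp for yt, yp in zip(y_true, y_pred) if yt == 0]
    if not preds_on_negatives:
        raise ValueError("no negative examples; FPR is undefined")
    return sum(preds_on_negatives) / len(preds_on_negatives)


def fpr_gap(
    y_true: Sequence[int],
    y_prob: Sequence[float],
    groups: Sequence[str],
    threshold: float = 0.5,  # assumed threshold, not taken from the report
) -> float:
    """Largest minus smallest per-group FPR at the given threshold."""
    by_group: Dict[str, Tuple[List[int], List[int]]] = {}
    for yt, p, g in zip(y_true, y_prob, groups):
        labels, preds = by_group.setdefault(g, ([], []))
        labels.append(yt)
        preds.append(1 if p >= threshold else 0)
    rates = [false_positive_rate(t, pr) for t, pr in by_group.values()]
    return max(rates) - min(rates)


# Toy data for illustration only; group labels "A" and "B" are hypothetical.
y_true = [0, 0, 1, 0, 0, 1]
y_prob = [0.6, 0.2, 0.9, 0.1, 0.3, 0.7]
groups = ["A", "A", "A", "B", "B", "B"]
gap = fpr_gap(y_true, y_prob, groups)
```

Because each group's FPR is estimated on fewer negatives than the full cohort, a gap computed this way would be expected to vary more across seeds than an aggregate metric does.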

COMPAS Benchmark

Model           Brier  AUROC  Note
COMPAS XGBoost  0.173  0.808  Broward County, FL

The COMPAS benchmark uses a different jurisdiction (Broward County), different features, and a different outcome definition, so it is not directly comparable to the NIJ Georgia results.