Methodology
Technical details on models, calibration, evaluation, and fairness diagnostics.
| Model | Type | Details |
|---|---|---|
| XGBoost | Gradient boosting | 500 trees, lr=0.05, depth=6, Optuna-tuned (32 trials, TPE sampler, minimizing Brier) |
| Logistic Regression | Linear | Custom batch gradient descent, L2=1e-3 |
| Lasso-Logistic | L1-regularized | L1=5e-4, proximal gradient descent |
| Base Rate | Naive | Predicts training-set positive rate for all |
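The two custom linear models in the table can be sketched with one NumPy routine: a batch gradient step on the logistic loss with an optional L2 term, followed by a soft-thresholding (proximal) step when an L1 penalty is set. This is a minimal sketch, not the project's implementation. The function names, the learning rate, the iteration count, and the unpenalized intercept are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function mapping scores to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))


def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1: shrink toward zero, clip at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)


def fit_logistic(X, y, l1=0.0, l2=0.0, lr=0.1, n_iter=2000):
    """Batch (proximal) gradient descent for penalized logistic regression.

    lr and n_iter are illustrative defaults, not values from the source.
    The intercept b is left unpenalized (an assumption).
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        # Gradient of the mean log loss plus the smooth L2 term.
        grad_w = X.T @ (p - y) / n + l2 * w
        grad_b = np.mean(p - y)
        w = w - lr * grad_w
        b = b - lr * grad_b
        if l1 > 0:
            # Proximal step handles the non-smooth L1 term exactly.
            w = soft_threshold(w, lr * l1)
    return w, b


def base_rate_predict(y_train, n_rows):
    """Naive baseline: the training-set positive rate for every row."""
    return np.full(n_rows, float(np.mean(y_train)))
```

Under these assumptions, `fit_logistic(X, y, l2=1e-3)` corresponds to the Logistic Regression row and `fit_logistic(X, y, l1=5e-4)` to the Lasso-Logistic row. The proximal step is what lets L1 weights land on exactly zero rather than merely near it.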
- Raw: no post-processing
- Platt scaling: sigmoid fit on validation predictions
- Isotonic regression: non-parametric monotone fit
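The two post-hoc calibrators listed above could be fit on held-out validation predictions roughly as follows. This is a sketch using scikit-learn, not the project's code. The helper names are hypothetical, fitting the sigmoid on the logit of the raw probability is one common choice that the source does not confirm, and `C=1e6` is used only to make scikit-learn's default L2 penalty negligible.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression


def _logit(p, eps=1e-6):
    """Map probabilities to log-odds, clipping to avoid infinities."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return np.log(p / (1 - p))


def fit_platt(p_val, y_val):
    """Platt scaling: a one-feature logistic regression on the raw score."""
    model = LogisticRegression(C=1e6)  # near-unregularized sigmoid fit
    model.fit(_logit(p_val).reshape(-1, 1), y_val)
    return lambda p: model.predict_proba(_logit(p).reshape(-1, 1))[:, 1]


def fit_isotonic(p_val, y_val):
    """Isotonic regression: non-parametric, monotone non-decreasing map."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(np.asarray(p_val, dtype=float), y_val)
    return iso.predict
```

Each helper returns a callable that maps raw probabilities to calibrated ones, so the raw, Platt, and isotonic variants can be scored with the same evaluation code. Isotonic regression is more flexible but can overfit small validation sets, which is one reason to report both.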
| Metric | Role | Notes |
|---|---|---|
| Brier score | Primary | Proper scoring rule (lower is better). Rewards calibration AND discrimination |
| AUROC | Ranking | How well model ranks individuals (higher is better) |
| AUPRC | Positive-class ranking | Important when base rate is moderate |
| ECE | Calibration check | Expected calibration error |
| FPR/FNR | Threshold-dependent | False positive/negative rates at chosen threshold |
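To make three of the metrics in the table concrete, the sketch below computes the Brier score, a binned ECE, and FPR/FNR at a threshold with NumPy. It is illustrative only. The source does not state the ECE binning scheme, so the 10 equal-width bins are an assumption, as is treating a score equal to the threshold as a positive prediction.

```python
import numpy as np


def brier_score(y, p):
    """Mean squared difference between predicted probability and outcome."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    return float(np.mean((p - y) ** 2))


def expected_calibration_error(y, p, n_bins=10):
    """Bin-weighted mean |avg prediction - observed rate| (equal-width bins)."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin indices 0..n_bins-1; p == 1.0 falls in the last bin.
    idx = np.digitize(p, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return float(ece)


def fpr_fnr(y, p, threshold=0.5):
    """False positive and false negative rates when flagging p >= threshold."""
    y = np.asarray(y).astype(bool)
    pred = np.asarray(p, dtype=float) >= threshold
    fp, tn = np.sum(pred & ~y), np.sum(~pred & ~y)
    fn, tp = np.sum(~pred & y), np.sum(pred & y)
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    return float(fpr), float(fnr)
```

For example, with outcomes `[0, 0, 1, 1]` and predictions `[0.1, 0.4, 0.6, 0.9]`, the Brier score is 0.085 and this ECE is 0.25, because each prediction sits alone in its bin.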
- Y1: full sample (N=18,028)
- Y2: only individuals NOT rearrested in Y1 (N=12,651)
- Y3: only individuals NOT rearrested in Y1 OR Y2 (N=9,398)
Conditioning each horizon on no rearrest in earlier years prevents label leakage across horizons.
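The conditional cohorts above can be expressed as boolean masks over the full sample. The sketch below assumes two hypothetical per-person indicator arrays, `y1` and `y2`, marking rearrest in year 1 and year 2. These names are illustrative, not taken from the dataset.

```python
import numpy as np


def horizon_masks(y1, y2):
    """Return the evaluation cohort for each horizon as a boolean mask.

    y1, y2: per-person rearrest indicators for years 1 and 2 (hypothetical names).
    """
    y1 = np.asarray(y1).astype(bool)
    y2 = np.asarray(y2).astype(bool)
    return {
        "Y1": np.ones_like(y1, dtype=bool),  # everyone is in the Y1 cohort
        "Y2": ~y1,                           # not rearrested in Y1
        "Y3": ~y1 & ~y2,                     # not rearrested in Y1 or Y2
    }
```

A model for a given horizon would then be trained and scored only on rows where the corresponding mask is true, so a person already rearrested never contributes a label to a later horizon.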
- Subgroups: Race (Black/White), Gender (M/F), Age (7 bands)
- Metrics computed at multiple thresholds: 0 to 1, step 0.05
- Bootstrap CIs: 2,000 iterations overall, 200 for subgroups
- With/without race feature comparison
- Operational policy evaluation: t=0.5, top 10%, top 20%
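Three pieces of the diagnostics above lend themselves to a short sketch: the threshold grid, a percentile bootstrap confidence interval for any metric, and a "top k%" flagging rule. This is one plausible reading, not the project's code. The source does not say which bootstrap interval is used, so the percentile method and the 95% level are assumptions, and the function names are hypothetical.

```python
import numpy as np

# Threshold grid from 0 to 1 in steps of 0.05 (21 values).
THRESHOLDS = np.round(np.linspace(0.0, 1.0, 21), 2)


def bootstrap_ci(y, p, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for metric(y, p), resampling rows with replacement."""
    y, p = np.asarray(y), np.asarray(p)
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats[i] = metric(y[idx], p[idx])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


def top_fraction_flags(p, fraction):
    """Flag the highest-scoring fraction of individuals (e.g. 0.10 or 0.20)."""
    p = np.asarray(p, dtype=float)
    k = int(round(fraction * len(p)))
    order = np.argsort(-p, kind="stable")  # highest scores first
    flags = np.zeros(len(p), dtype=bool)
    flags[order[:k]] = True
    return flags
```

For subgroup intervals, the same function would be called on the rows of one subgroup with `n_boot=200`. Metrics that are undefined on some resamples (for example, a rate with an empty denominator) would need extra handling that this sketch omits.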
- TreeExplainer on XGBoost
- Up to 2,000 samples
- Mean absolute SHAP value per feature
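The explanation steps above might look like the following sketch. The aggregation helper is plain NumPy. The sampling helper calls `shap.TreeExplainer(model).shap_values(...)` and imports `shap` lazily, so the aggregation part runs without it. Random subsampling, the seed, and the function names are assumptions, and the sketch assumes a binary classifier for which `shap_values` returns a single samples-by-features matrix.

```python
import numpy as np


def shap_matrix(model, X, max_samples=2000, seed=0):
    """SHAP values for up to max_samples randomly chosen rows of X."""
    import shap  # lazy import: only needed when this helper is called

    X = np.asarray(X)
    rng = np.random.default_rng(seed)
    n = min(max_samples, len(X))
    rows = rng.choice(len(X), size=n, replace=False)
    explainer = shap.TreeExplainer(model)
    return np.asarray(explainer.shap_values(X[rows]))


def mean_abs_shap(shap_values, feature_names):
    """Rank features by mean absolute SHAP value, largest first."""
    importance = np.abs(np.asarray(shap_values, dtype=float)).mean(axis=0)
    order = np.argsort(-importance, kind="stable")
    return [(feature_names[i], float(importance[i])) for i in order]
```

Mean absolute SHAP measures how much a feature moves predictions on average, not the direction of its effect, so it is best read as a global importance ranking.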
| Library | Version |
|---|---|
| scikit-learn | ≥1.4 |
| xgboost | ≥2.0 |
| optuna | ≥3.6 |
| shap | ≥0.44 |
| Python | 3.11+ |
- NIJ Recidivism Forecasting Challenge (Georgia parole cohort, ~18,000)
- COMPAS benchmark (Broward County, ~6,200), used for separate validation
- Labels are rearrest outcomes (observable justice system events, not comprehensive behavior)
- Raw data not redistributed; available from NIJ