Methodology

Technical details on models, calibration, evaluation, and fairness diagnostics.

1. Models

Model	Type	Details
XGBoost	Gradient boosting	500 trees, lr=0.05, depth=6, Optuna-tuned (32 trials, TPE sampler, minimizing Brier)
Logistic Regression	Linear	Custom batch gradient descent, L2=1e-3
Lasso-Logistic	L1-regularized	L1=5e-4, proximal gradient descent
Base Rate	Naive	Predicts the training-set positive rate for every individual
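
The two custom linear models differ only in how the penalty is applied, which a short sketch makes concrete. The code below is a minimal illustration, not the project's implementation: the function names, learning rate, iteration count, and the choice to leave the intercept unpenalized are all assumptions. Only the penalty types (L2 in the gradient, L1 through a proximal soft-thresholding step) come from the table above.

```python
import numpy as np

def sigmoid(z):
    # Clip the logit so exp() cannot overflow.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def soft_threshold(w, t):
    # Proximal operator of t * ||w||_1: shrink toward zero, zero out small weights.
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def fit_logistic(X, y, l2=0.0, l1=0.0, lr=0.1, n_iter=500):
    """Full-batch (proximal) gradient descent on mean log loss.

    l2 enters the gradient directly; l1 is applied as a soft-thresholding
    step after each gradient update. lr and n_iter are illustrative defaults.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        grad_w = X.T @ (p - y) / n + l2 * w   # smooth part of the objective
        grad_b = np.mean(p - y)               # intercept left unpenalized (assumption)
        w = w - lr * grad_w
        b = b - lr * grad_b
        if l1 > 0.0:
            w = soft_threshold(w, lr * l1)    # proximal step for the L1 term
    return w, b

def base_rate_predict(y_train, n):
    # Naive baseline: the same probability for everyone.
    return np.full(n, float(np.mean(y_train)))
```

Under these assumptions, `fit_logistic(X, y, l2=1e-3)` would correspond to the Logistic Regression row and `fit_logistic(X, y, l1=5e-4)` to the Lasso-Logistic row.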

2. Calibration

Raw: no post-processing
Platt scaling: sigmoid fit on validation predictions
Isotonic regression: non-parametric monotone fit
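
A hedged sketch of the two post-hoc calibrators, using scikit-learn. The helper names are hypothetical. Fitting Platt scaling on the logit of the raw probability (rather than on the probability itself) and using a large `C` to approximate an unregularized fit are assumptions of this sketch, not details confirmed by the project.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

def _logit(p, eps=1e-6):
    # Clip so that probabilities of exactly 0 or 1 stay finite.
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def fit_platt(p_val, y_val):
    """Platt scaling: a one-feature logistic regression on validation scores."""
    model = LogisticRegression(C=1e6)  # large C: effectively no regularization
    model.fit(_logit(p_val).reshape(-1, 1), y_val)
    return lambda p: model.predict_proba(_logit(p).reshape(-1, 1))[:, 1]

def fit_isotonic(p_val, y_val):
    """Isotonic regression: non-decreasing step function from score to rate."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(np.asarray(p_val, dtype=float), y_val)
    return iso.predict
```

Both helpers return a callable that maps raw probabilities to calibrated ones, so it can be fit on validation predictions and then applied to test predictions.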

3. Evaluation Metrics

Metric	Role	Notes
Brier score	Primary	Proper scoring rule (lower is better). Rewards calibration AND discrimination
AUROC	Ranking	How well model ranks individuals (higher is better)
AUPRC	Positive-class ranking	Important when base rate is moderate; chance level equals the base rate rather than 0.5
ECE	Calibration check	Expected calibration error (lower is better)
FPR/FNR	Threshold-dependent	False positive/negative rates at chosen threshold
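
For reference, the Brier score, ECE, and threshold-dependent error rates can be computed as in the sketch below. The ECE binning scheme (10 equal-width bins) is an assumption, since the bin count is not stated here. Function names are illustrative.

```python
import numpy as np

def brier_score(y, p):
    # Mean squared difference between predicted probability and outcome.
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    return float(np.mean((p - y) ** 2))

def expected_calibration_error(y, p, n_bins=10):
    # Weighted average gap between mean prediction and observed rate per bin.
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return float(ece)

def fpr_fnr(y, p, threshold=0.5):
    # False positive rate among true negatives, false negative rate among true positives.
    y, p = np.asarray(y), np.asarray(p, dtype=float)
    pred = p >= threshold
    pos = y == 1
    neg = ~pos
    fpr = float(np.mean(pred[neg])) if neg.any() else float("nan")
    fnr = float(np.mean(~pred[pos])) if pos.any() else float("nan")
    return fpr, fnr
```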

4. Horizon Conditioning

Y1: full sample (N=18,028)
Y2: only individuals NOT rearrested in Y1 (N=12,651)
Y3: only individuals rearrested in NEITHER Y1 NOR Y2 (N=9,398)

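Conditioning each horizon on no prior rearrest prevents label leakage across horizons: each model is trained and evaluated only on individuals still at risk at the start of that year.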
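
The cohort construction can be sketched as follows, assuming one boolean rearrest flag per individual per year. The argument names `y1`, `y2`, `y3` are hypothetical and do not refer to actual dataset columns.

```python
import numpy as np

def horizon_cohorts(y1, y2, y3):
    """Return {horizon: (at_risk_mask, labels)} with survival conditioning.

    An individual enters a horizon's cohort only if they were not
    rearrested in any earlier year.
    """
    y1 = np.asarray(y1, dtype=bool)
    y2 = np.asarray(y2, dtype=bool)
    y3 = np.asarray(y3, dtype=bool)
    at_risk_1 = np.ones_like(y1, dtype=bool)  # full sample
    at_risk_2 = ~y1                           # not rearrested in Y1
    at_risk_3 = ~y1 & ~y2                     # rearrested in neither Y1 nor Y2
    return {
        "Y1": (at_risk_1, y1[at_risk_1]),
        "Y2": (at_risk_2, y2[at_risk_2]),
        "Y3": (at_risk_3, y3[at_risk_3]),
    }
```

Individuals rearrested in an earlier year are dropped from later cohorts rather than kept as negatives, so a later-horizon model never sees rows whose outcome is already determined.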

5. Fairness Diagnostics

Subgroups: Race (Black/White), Gender (M/F), Age (7 bands)
Metrics computed at multiple thresholds: 0 to 1, step 0.05
Bootstrap CIs: 2,000 iterations overall, 200 for subgroups
Comparison of models with and without the race feature
Operational policy evaluation: fixed threshold t=0.5, and flagging the top 10% or top 20% by predicted risk
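
A minimal sketch of the diagnostics listed above: a per-subgroup threshold sweep, a percentile bootstrap interval, and the score cutoff implied by a top-k% policy. All function names are hypothetical. The percentile method, the 95% level, and the inclusive 0-to-1 threshold grid are assumptions. FPR is used as the example metric, and any other metric with the same signature could be passed instead.

```python
import numpy as np

def fpr_at(y, p, t):
    # False positive rate at threshold t (NaN if the sample has no negatives).
    neg = np.asarray(y) == 0
    return float(np.mean(np.asarray(p)[neg] >= t)) if neg.any() else float("nan")

def subgroup_fpr_sweep(y, p, group, thresholds=None):
    # FPR curve per subgroup over thresholds 0.00, 0.05, ..., 1.00.
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 21)
    y, p, group = np.asarray(y), np.asarray(p), np.asarray(group)
    curves = {}
    for g in np.unique(group).tolist():
        m = group == g
        curves[g] = np.array([fpr_at(y[m], p[m], t) for t in thresholds])
    return thresholds, curves

def bootstrap_ci(y, p, metric, n_boot=2000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample individuals with replacement.
    y, p = np.asarray(y), np.asarray(p)
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats[i] = metric(y[idx], p[idx])
    lo, hi = np.nanpercentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

def top_fraction_threshold(p, frac):
    # Score cutoff that flags roughly the top `frac` of individuals by risk.
    return float(np.quantile(np.asarray(p, dtype=float), 1.0 - frac))
```

For a subgroup interval, the same `bootstrap_ci` call can be made on the masked arrays with `n_boot=200`.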

6. SHAP

TreeExplainer on XGBoost
Up to 2,000 samples
Mean absolute SHAP value per feature
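
The aggregation step can be sketched in NumPy alone. Here `shap_values` is assumed to be an array of shape (n_samples, n_features), such as the output of `shap.TreeExplainer(model).shap_values(X)` for a binary XGBoost model. The helper names and the random subsampling scheme are assumptions of this sketch.

```python
import numpy as np

def subsample_rows(n_rows, max_samples=2000, seed=0):
    # Row indices to explain: all rows if few enough, else a random subset.
    if n_rows <= max_samples:
        return np.arange(n_rows)
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(n_rows, size=max_samples, replace=False))

def mean_abs_shap(shap_values, feature_names):
    # Global importance: average magnitude of each feature's contribution.
    importance = np.mean(np.abs(np.asarray(shap_values, dtype=float)), axis=0)
    order = np.argsort(-importance)  # most important first
    return [(feature_names[i], float(importance[i])) for i in order]
```

Mean absolute SHAP measures how much a feature moves predictions on average, not the direction of its effect.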

7. Libraries

Library	Version
scikit-learn	≥1.4
xgboost	≥2.0
optuna	≥3.6
shap	≥0.44
Python	3.11+

8. Data

NIJ Recidivism Forecasting Challenge (Georgia parole cohort, ~18,000)
COMPAS benchmark (Broward County, ~6,200) — separate validation
Labels are rearrest outcomes (observable justice system events, not comprehensive behavior)
Raw data not redistributed; available from NIJ