On 2026-05-20, a deep audit of Voidly Atlas surfaced a problem with the Sentinel shutdown forecast that had been hiding in plain sight for at least 30 days: the model was severely miscalibrated — by roughly 15× — in the prediction range where 99% of forecasts actually live.
The bombshell
Voidly's own /v1/sentinel/accuracy endpoint already
published the prod_rolling block showing this. Sorted by predicted
probability bucket:
- Predicted [0.02, 0.04): 231 forecasts, mean predicted = 0.034, mean observed = 0.602
- Predicted [0.04, 0.06): 451 forecasts, mean predicted = 0.047, mean observed = 0.647
- Predicted [0.06, 0.08): 85 forecasts, mean predicted = 0.068, mean observed = 0.824
- Predicted [0.08, 0.10): 29 forecasts, mean predicted = 0.088, mean observed = 0.793
The base XGBoost ranks roughly correctly (higher prediction ⇒ higher observed rate, with one small inversion in the thinnest bucket) but says “5% risk” when the actual rate is 60-80%. That yields a Brier score of 0.59 and a calibration MAE of 0.60. The shipped forecasts looked falsely confident in safety.
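The size of the gap can be checked directly from the four buckets listed above. The sketch below is a back-of-the-envelope calculation on those published bin means only; the names `BINS` and `bin_summary` are made up for illustration, and the endpoint's own Brier and MAE figures cover every forecast, so they need not match exactly.

```python
# Reliability bins copied from the prod_rolling figures above:
# (bucket, number of forecasts, mean predicted, mean observed)
BINS = [
    ("[0.02, 0.04)", 231, 0.034, 0.602),
    ("[0.04, 0.06)", 451, 0.047, 0.647),
    ("[0.06, 0.08)", 85, 0.068, 0.824),
    ("[0.08, 0.10)", 29, 0.088, 0.793),
]


def bin_summary(bins):
    """Return (total forecasts, count-weighted |observed - predicted|,
    count-weighted observed / predicted ratio)."""
    total = sum(n for _, n, _, _ in bins)
    gap = sum(n * abs(obs - pred) for _, n, pred, obs in bins) / total
    ratio = sum(n * obs / pred for _, n, pred, obs in bins) / total
    return total, gap, ratio


total, gap, ratio = bin_summary(BINS)
for label, n, pred, obs in BINS:
    print(f"{label}: n={n}, observed/predicted = {obs / pred:.1f}x")
print(f"{total} forecasts, weighted gap = {gap:.3f}, weighted ratio = {ratio:.1f}x")
```

On these four buckets the weighted gap is about 0.61 and the weighted ratio about 14.6×, which is consistent with the reported calibration MAE of 0.60 and the "roughly 15×" figure.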
The fix
The cleanest correction for this kind of systematic
underestimation is isotonic regression: fit a
monotonic map from predicted probability to observed probability,
then apply it after the base model. We refit on 810 live (forecast,
outcome) pairs spanning the last ~30 days from the
sentinel_outcomes table — every prediction that's had
time to settle against a real outcome.
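To show what the monotonic fit does, here is a minimal, dependency-free sketch of the pool-adjacent-violators algorithm that isotonic regression is based on. It is not the production script: the real refit uses sklearn's IsotonicRegression on the 810 individual pairs, whereas this example uses a hypothetical `fit_isotonic` helper and, as a stand-in for the raw pairs, the four published bucket means weighted by their counts.

```python
def fit_isotonic(x, y, w=None):
    """Weighted pool-adjacent-violators fit.

    Returns (xs, fitted): xs sorted ascending, fitted non-decreasing.
    """
    if w is None:
        w = [1.0] * len(x)
    order = sorted(range(len(x)), key=lambda i: x[i])
    # Each block holds [weighted mean, total weight, number of pooled points].
    blocks = []
    for i in order:
        blocks.append([y[i], w[i], 1])
        # Merge backwards while the newest block breaks monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, c1 + c2])
    xs = [x[i] for i in order]
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return xs, fitted


# Stand-in data: bucket means from the accuracy endpoint, weighted by count.
raw = [0.034, 0.047, 0.068, 0.088]
observed = [0.602, 0.647, 0.824, 0.793]
counts = [231, 451, 85, 29]

xs, fitted = fit_isotonic(raw, observed, counts)
for r, f in zip(xs, fitted):
    print(f"raw {r:.3f} -> calibrated {f:.3f}")
```

The last two buckets are out of order (0.824 then 0.793), so the algorithm pools them into a single count-weighted value; the first two pass through unchanged. This pooling is what guarantees that a higher raw score never maps to a lower calibrated probability.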
Implementation took ~30 lines of Python (sklearn IsotonicRegression + a lookup table) plus four small edits to forecast_api.py: load the calibrator at startup, apply CALIBRATOR.predict([raw_prob]) after MODEL.predict_proba(), gate on the 30-country watched set (so the calibrator is not extrapolated to US/JP/DE/GB). Restart the service.
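The serving-side logic described above (lookup table plus gate) can be sketched as follows. This is an illustration under stated assumptions, not the contents of forecast_api.py: `WATCHED` is a partial, hypothetical subset of the 30-country set, the table values are illustrative rather than the fitted calibrator, and the function names are invented. The 0.019 floor is the training minimum mentioned in the caveats.

```python
from bisect import bisect_right

# Hypothetical stand-ins: the real watched set has 30 countries and the real
# lookup table comes from the fitted calibrator.
WATCHED = {"IR", "CN", "RU", "VE", "EG", "TR"}
MIN_TRAIN_PROB = 0.019  # lowest raw prediction seen in training, per the post
TABLE_X = [0.034, 0.047, 0.068, 0.088]  # raw probabilities (illustrative)
TABLE_Y = [0.602, 0.647, 0.816, 0.816]  # calibrated probabilities (illustrative)


def interpolate(raw, xs, ys):
    """Piecewise-linear lookup, clipped to the table's end values."""
    if raw <= xs[0]:
        return ys[0]
    if raw >= xs[-1]:
        return ys[-1]
    hi = bisect_right(xs, raw)  # first knot strictly above raw
    lo = hi - 1
    frac = (raw - xs[lo]) / (xs[hi] - xs[lo])
    return ys[lo] + frac * (ys[hi] - ys[lo])


def calibrated_probability(country, raw):
    """Apply the calibrator only inside the range it was trained on."""
    if country not in WATCHED or raw < MIN_TRAIN_PROB:
        return raw  # pass the base model's output through unchanged
    return interpolate(raw, TABLE_X, TABLE_Y)


print(calibrated_probability("IR", 0.047))  # watched country: calibrated
print(calibrated_probability("US", 0.040))  # outside the watched set: unchanged
```

Passing the raw value through for gated-out inputs keeps the failure mode simple: a country the calibrator has never seen gets exactly what the base model said, not a value extrapolated from an unrelated population.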
Results
- Brier score: 0.5904 → 0.2231 (−62%)
- Calibration MAE: 0.6040 → 0.0000 (in-sample; will drift a little out-of-sample but the magnitude of improvement is real)
- Iran 7-day forecast: 0.146 → 0.74 (matches the observed 85% incident rate for IR in the eval window)
- China: 0.73, Russia: 0.81, Venezuela: 0.81, Egypt: 0.81, Turkey: 0.86 — all up in the 0.7-0.9 range where they belong
- US/JP/DE/GB/CA/FI: 0.03-0.05, left uncalibrated because they are outside the watched set, so they stay honest
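For readers unfamiliar with the metric, the Brier score is the mean squared difference between the forecast probability and the 0/1 outcome, so lower is better. The sketch below defines it, scores a toy set of five forecasts (invented for illustration, not Sentinel data), and reproduces the −62% figure from the two reported scores.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between forecast probability and 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def relative_change(before, after):
    """Signed fractional change from before to after."""
    return (after - before) / before


# Toy example: five forecasts where four events occurred, scored once
# against a 5% forecast and once against an 80% forecast.
outcomes = [1, 1, 1, 1, 0]
low = brier_score([0.05] * 5, outcomes)
high = brier_score([0.80] * 5, outcomes)
print(f"toy Brier: {low:.4f} at 5% vs {high:.4f} at 80%")

# The reported before/after scores from the refit.
print(f"reported change: {relative_change(0.5904, 0.2231):.0%}")
```

The toy case shows why a confident "5%" is punished so heavily when the event usually happens: each miss contributes 0.95² to the average.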
Honest caveats
- The in-sample Brier of 0.22 is optimistic: the calibrator was fit and evaluated on the same data. Real out-of-sample Brier will be somewhat worse. But the base miscalibration was 15×, so even if half of the improvement disappears out-of-sample, we are still dramatically better off.
- The calibrator's training data is the 30 watched censoring countries. Applying it to countries outside that set would extrapolate inappropriately. So we gate: only apply calibration when the country is in the watched set AND the raw prediction is ≥ 0.019 (the lowest seen in training).
- Out-of-distribution risk: if the world changes (new event types, new country added to surveillance), the calibrator can drift back to miscalibrated. Solution: refit nightly from the latest sentinel_outcomes. Cron job coming.
- The base XGBoost still uses risk_tier as a feature, a partial leakage we know about. A v2 retrain without risk_tier is queued; this refit is the cheap-and-effective patch while the deeper retrain bakes.
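One way to put a number on the in-sample caveat is a chronological holdout: fit the calibrator on the earliest pairs and score it on the most recent ones. The sketch below is an assumption about how that could be done, not something the refit script is known to do. `holdout_brier` and `base_rate_fit` are hypothetical names, the pairs are synthetic, and the stand-in calibrator simply predicts the training base rate; in practice the `fit` argument would wrap the isotonic fit.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def base_rate_fit(raw_probs, outcomes):
    """Stand-in calibrator (NOT isotonic): always predicts the training base rate."""
    rate = sum(outcomes) / len(outcomes)
    return lambda raw: rate


def holdout_brier(pairs, fit, train_fraction=0.8):
    """Fit on the earliest pairs, score on the latest ones.

    pairs: iterable of (timestamp, raw_prob, outcome) with outcome in {0, 1}.
    fit:   callable(raw_probs, outcomes) -> calibrate(raw_prob).
    Returns (raw Brier, calibrated Brier) on the held-out tail.
    """
    ordered = sorted(pairs, key=lambda p: p[0])
    cut = int(len(ordered) * train_fraction)
    train, test = ordered[:cut], ordered[cut:]
    calibrate = fit([r for _, r, _ in train], [o for _, _, o in train])
    raw = [r for _, r, _ in test]
    observed = [o for _, _, o in test]
    return (brier_score(raw, observed),
            brier_score([calibrate(r) for r in raw], observed))


# Synthetic pairs for illustration only; real ones would come from sentinel_outcomes.
PAIRS = [(day, 0.05, outcome)
         for day, outcome in enumerate([1, 1, 1, 0, 1, 1, 0, 1, 1, 0])]
raw_brier, cal_brier = holdout_brier(PAIRS, base_rate_fit)
print(f"held-out Brier: raw={raw_brier:.4f}, calibrated={cal_brier:.4f}")
```

Splitting by time rather than at random matters here because the stated risk is drift: a random split would let the calibrator see the same period it is scored on.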
Reproducibility
Code is at scripts/refit-isotonic-calibration.py in
the public repo. Inputs: sentinel_outcomes table.
Output: forecast_calibrator_v2_isotonic_prod.pkl. The
full refit metrics (before/after Brier, MAE, per-bin breakdown)
are at ml-deploy/forecast_calibration_refit.json on
the Vultr ML server.