voidly

How we fixed Sentinel's 15× miscalibration in one afternoon

The forecast was telling journalists "5% risk" for forecasts where the actual incident rate was about 65%. We refit isotonic regression on 810 live (predicted, observed) pairs from sentinel_outcomes. Brier dropped 0.59 → 0.22; Iran's forecast jumped from 0.15 to 0.74.

#methodology #ml #forecast #calibration #transparency

On 2026-05-20, a deep audit of Voidly Atlas surfaced a problem with the Sentinel shutdown forecast that had been hiding in plain sight for at least 30 days: the model was severely miscalibrated — by roughly 15× — in the prediction range where 99% of forecasts actually live.

The bombshell

Voidly's own /v1/sentinel/accuracy endpoint already published the prod_rolling block showing this. Sorted by predicted probability bucket:

  • Predicted [0.02, 0.04): 231 forecasts, mean predicted = 0.034, mean observed = 0.602
  • Predicted [0.04, 0.06): 451 forecasts, mean predicted = 0.047, mean observed = 0.647
  • Predicted [0.06, 0.08): 85 forecasts, mean predicted = 0.068, mean observed = 0.824
  • Predicted [0.08, 0.10): 29 forecasts, mean predicted = 0.088, mean observed = 0.793

The base XGBoost model ranks mostly correctly (higher prediction ⇒ higher observed, apart from a small dip in the sparsest top bucket) but says “5% risk” when the actual rate is 60-80%. Brier 0.59, calibration MAE 0.60. The shipped forecasts made high-risk countries look safe.
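The bucket numbers above are enough to sanity-check the headline figures. The sketch below recomputes the per-bucket underestimation ratio and a count-weighted calibration gap. It assumes that "calibration MAE" means the count-weighted mean of |observed − predicted| per bucket, which this post does not define, and it covers only the 796 forecasts in the four listed buckets, so it will not reproduce the published 0.60 exactly.

```python
# (n_forecasts, mean_predicted, mean_observed), copied from the bucket list above.
BUCKETS = [
    (231, 0.034, 0.602),
    (451, 0.047, 0.647),
    (85, 0.068, 0.824),
    (29, 0.088, 0.793),
]


def underestimation_ratios(buckets):
    """Observed rate divided by predicted rate, one value per bucket."""
    return [obs / pred for _, pred, obs in buckets]


def weighted_calibration_gap(buckets):
    """Count-weighted mean of |mean_observed - mean_predicted| (assumed MAE definition)."""
    total = sum(n for n, _, _ in buckets)
    return sum(n * abs(obs - pred) for n, pred, obs in buckets) / total


ratios = underestimation_ratios(BUCKETS)
gap = weighted_calibration_gap(BUCKETS)

print([round(r, 1) for r in ratios])  # per-bucket underestimation factor
print(round(gap, 3))                  # weighted gap over the listed buckets
```

On these four buckets the ratios fall between roughly 9× and 18×, which is where the "roughly 15×" shorthand comes from.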

The fix

The cleanest correction for this kind of systematic underestimation is isotonic regression: fit a monotonic map from predicted probability to observed probability, then apply it after the base model. We refit on 810 live (forecast, outcome) pairs spanning the last ~30 days from the sentinel_outcomes table — every prediction that's had time to settle against a real outcome.
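As a rough illustration of the idea (not the contents of the actual refit script), the sketch below fits scikit-learn's IsotonicRegression on synthetic (raw probability, outcome) pairs that mimic the same kind of underestimation. The data generator, the function name fit_calibrator, and the variable names are assumptions made for this example; the real fit uses the 810 rows from sentinel_outcomes.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression


def fit_calibrator(raw_probs, outcomes):
    """Fit a non-decreasing map from raw probability to observed frequency."""
    calibrator = IsotonicRegression(
        y_min=0.0, y_max=1.0, increasing=True, out_of_bounds="clip"
    )
    calibrator.fit(np.asarray(raw_probs, dtype=float),
                   np.asarray(outcomes, dtype=float))
    return calibrator


# Synthetic stand-in data: raw scores near 0.05, true incident rate 50-82%.
rng = np.random.default_rng(0)
raw = rng.uniform(0.02, 0.10, size=810)
true_rate = np.clip(0.5 + 4.0 * (raw - 0.02), 0.0, 1.0)
observed = (rng.uniform(size=810) < true_rate).astype(float)

calibrator = fit_calibrator(raw, observed)
calibrated = calibrator.predict(raw)

# In-sample Brier score before and after calibration.
brier_before = float(np.mean((raw - observed) ** 2))
brier_after = float(np.mean((calibrated - observed) ** 2))
print(round(brier_before, 3), round(brier_after, 3))
```

Because the fitted map is constrained to be non-decreasing, it preserves the base model's ranking and only rescales the probabilities, which is why it suits a model that ranks well but is systematically too low.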

Implementation took ~30 lines of Python (sklearn IsotonicRegression + a lookup table) plus four small edits to forecast_api.py, among them: load the calibrator at startup, apply CALIBRATOR.predict([raw_prob]) after MODEL.predict_proba(), and gate by the 30-country watched set (so US/JP/DE/GB don't get extrapolated). Then restart the service.
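The gating step might look something like the sketch below. It is not the code in forecast_api.py: the function name apply_calibration, the three-country WATCHED_EXAMPLE subset, and the stub calibrator are all hypothetical stand-ins. The 0.019 floor is the one described in the caveats section; in the service, the calibrate argument would wrap CALIBRATOR.predict([raw_prob])[0].

```python
from typing import Callable, Set

# Lowest raw probability seen in calibrator training (from the caveats section).
MIN_TRAINED_RAW_PROB = 0.019

# Hypothetical subset for illustration; the real watched set has 30 countries.
WATCHED_EXAMPLE: Set[str] = {"IR", "CN", "RU"}


def apply_calibration(
    raw_prob: float,
    country_code: str,
    watched: Set[str],
    calibrate: Callable[[float], float],
) -> float:
    """Calibrate only inside the region the calibrator was trained on."""
    if country_code in watched and raw_prob >= MIN_TRAINED_RAW_PROB:
        return float(calibrate(raw_prob))
    # Outside the watched set, or below the training floor: keep the raw output.
    return raw_prob


def stub_calibrate(raw_prob: float) -> float:
    """Stand-in for the fitted isotonic map; NOT the real calibrator."""
    return min(1.0, raw_prob * 15.0)


watched_result = apply_calibration(0.05, "IR", WATCHED_EXAMPLE, stub_calibrate)
unwatched_result = apply_calibration(0.05, "US", WATCHED_EXAMPLE, stub_calibrate)
below_floor_result = apply_calibration(0.01, "IR", WATCHED_EXAMPLE, stub_calibrate)
print(watched_result, unwatched_result, below_floor_result)
```

Passing the calibrator in as a callable keeps the gate testable without loading the pickled model.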

Results

  • Brier score: 0.5904 → 0.2231 (−62%)
  • Calibration MAE: 0.6040 → 0.0000 (in-sample, and near zero largely because the calibrator was fit on the same pairs it is scored on; it will be higher out-of-sample, but the magnitude of improvement is real)
  • Iran 7-day forecast: 0.146 → 0.74 (far closer to the observed 85% incident rate for IR in the eval window)
  • China: 0.73, Russia: 0.81, Venezuela: 0.81, Egypt: 0.81, Turkey: 0.86 — all up in the 0.7-0.9 range where they belong
  • US/JP/DE/GB/CA/FI: 0.03-0.05, left uncalibrated because they are outside the watched set, so their raw forecasts stay honest

Honest caveats

  • The in-sample Brier of 0.22 is optimistic: the calibrator was fit and evaluated on the same data. Real out-of-sample Brier will be somewhat worse. But the base miscalibration was 15×, so even if half of the improvement disappears out of sample, we are still dramatically better off.
  • The calibrator's training data is the 30 watched censoring countries. Applying it to countries outside that set would extrapolate inappropriately. So we gate: only apply calibration when the country is in the watched set AND the raw prediction is ≥ 0.019 (the lowest seen in training).
  • Out-of-distribution risk: if the world changes (new event types, new country added to surveillance), the calibrator can drift back to miscalibrated. Solution: refit nightly from the latest sentinel_outcomes. Cron job coming.
  • The base XGBoost still uses risk_tier as a feature — a partial leakage we know about. A v2 retrain without risk_tier is queued; this refit is the cheap-and-effective patch while the deeper retrain bakes.
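One way to put a number on the first caveat is a chronological holdout: fit the calibrator on the older pairs and score it on the newer ones. The sketch below is an assumption about how that could be done, not something the refit script is known to do. It uses synthetic data and a hypothetical helper chronological_holdout, and it assumes the rows are already ordered by forecast time.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression


def brier(probs, outcomes):
    """Mean squared difference between probabilities and 0/1 outcomes."""
    return float(np.mean((np.asarray(probs) - np.asarray(outcomes)) ** 2))


def chronological_holdout(raw_probs, outcomes, train_fraction=0.7):
    """Fit on the earliest rows, return (raw Brier, calibrated Brier) on the rest."""
    raw_probs = np.asarray(raw_probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    cut = int(len(raw_probs) * train_fraction)
    calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    calibrator.fit(raw_probs[:cut], outcomes[:cut])
    held_raw, held_obs = raw_probs[cut:], outcomes[cut:]
    return brier(held_raw, held_obs), brier(calibrator.predict(held_raw), held_obs)


# Synthetic stand-in for time-ordered (raw probability, outcome) pairs.
rng = np.random.default_rng(1)
raw = rng.uniform(0.02, 0.10, size=810)
true_rate = np.clip(0.5 + 4.0 * (raw - 0.02), 0.0, 1.0)
observed = (rng.uniform(size=810) < true_rate).astype(float)

oos_before, oos_after = chronological_holdout(raw, observed)
print(round(oos_before, 3), round(oos_after, 3))
```

The same split could be rerun each time the calibrator is refit, so the published Brier reflects data the calibrator has not seen.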

Reproducibility

Code is at scripts/refit-isotonic-calibration.py in the public repo. Inputs: sentinel_outcomes table. Output: forecast_calibrator_v2_isotonic_prod.pkl. The full refit metrics (before/after Brier, MAE, per-bin breakdown) are at ml-deploy/forecast_calibration_refit.json on the Vultr ML server.

Raw data