voidly
Sentinel forecast — live calibration

Model honesty, public

Every Sentinel shutdown forecast ships with a 90% conformal interval. This page tracks how often the real outcome lands inside that interval — the closer to 90%, the more honest the model. Data lives at /v1/sentinel/calibration/history and updates every 24h.
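As a minimal sketch of the coverage figure described above, the snippet below counts how often an observed outcome falls inside its interval. The function name `empirical_coverage` and the sample data are hypothetical, not part of the Sentinel API; the real numbers come from /v1/sentinel/calibration/history.

```python
from typing import Sequence, Tuple


def empirical_coverage(
    intervals: Sequence[Tuple[float, float]],
    outcomes: Sequence[float],
) -> float:
    """Fraction of outcomes that fall inside their [lo, hi] interval."""
    if len(intervals) != len(outcomes):
        raise ValueError("intervals and outcomes must have the same length")
    if not outcomes:
        raise ValueError("need at least one forecast")
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, outcomes))
    return hits / len(outcomes)


# Hypothetical data: 10 forecasts, 9 of which contain the observed outcome.
intervals = [(0.1, 0.5)] * 10
outcomes = [0.3] * 9 + [0.9]
coverage = empirical_coverage(intervals, outcomes)
print(f"coverage={coverage:.1%}, gap to 90% target={abs(coverage - 0.90):.1%}")
```

A coverage well below 90% would suggest intervals that are too narrow; well above 90% would suggest intervals that are wider than they need to be.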

Self-published warning: Stratified AUC overstates real-world performance by 47.9pp vs. time-based split. Do not cite the stratified number as a deployment figure; use the loco_median or the prod_rolling block once it populates.
⚡ Recent fix: On 2026-05-20 we refit isotonic regression on 810 live (predicted, observed) pairs. The base XGBoost was underestimating risk by ~15× in the dominant prediction range.
Brier: 0.5904 → 0.2231 · Calibration MAE: 0.6040 → 0.0000 · Iran 7-day risk: 0.146 → 0.74
Live numbers below catch up over the next 24h. Read the full refit writeup →
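To illustrate what an isotonic refit on (predicted, observed) pairs does, here is a minimal pool-adjacent-violators sketch in plain Python. It is not the production code: the function names and the six sample pairs are hypothetical, and ties in the predicted values are not given special treatment.

```python
import bisect
from typing import List, Sequence, Tuple


def fit_isotonic(pred: Sequence[float], obs: Sequence[float]) -> List[Tuple[float, float]]:
    """Pool-adjacent-violators fit; returns (upper bound, calibrated value) steps."""
    blocks: List[List[float]] = []  # each block: [sum of observed, count, largest x]
    for x, y in sorted(zip(pred, obs)):
        blocks.append([float(y), 1.0, x])
        # Merge backwards while the previous block's mean exceeds the newest one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def calibrate(steps: List[Tuple[float, float]], p: float) -> float:
    """Map a raw score to the value of the first step whose upper bound is >= p."""
    uppers = [upper for upper, _ in steps]
    idx = min(bisect.bisect_left(uppers, p), len(steps) - 1)
    return steps[idx][1]


def brier(probs: Sequence[float], outcomes: Sequence[float]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Hypothetical pairs where the raw model scores far too low.
raw = [0.02, 0.03, 0.04, 0.05, 0.06, 0.07]
observed = [0, 1, 0, 1, 1, 1]
steps = fit_isotonic(raw, observed)
calibrated = [calibrate(steps, p) for p in raw]
print("Brier before:", round(brier(raw, observed), 4))
print("Brier after: ", round(brier(calibrated, observed), 4))
```

In this sketch the "after" score is computed on the same pairs used for the fit, so it is an in-sample figure.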
🔁 ACI online conformal (live): Replaces manual isotonic recalibration with an online update (Gibbs & Candès, NeurIPS 2021). After every observed outcome, the conformal miscoverage level α_t is nudged so that empirical coverage moves back toward the target, so calibration never drifts more than ~5pp from the 90% nominal even when the data distribution shifts.
Initial state replay (840 outcomes, Apr 17 → May 14): α = 0.10 → 0.21 · empirical coverage 91.3% · cron 03:45 UTC
Live ACI state is visible in every /v1/forecast/{cc}/7day response under the aci_alpha and aci.* fields. Full ACI methodology →
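The online update can be sketched as follows, using the adaptive conformal inference rule alpha_{t+1} = alpha_t + gamma * (target_alpha - err_t), where err_t is 1 when the outcome missed the interval. The step size `gamma = 0.005`, the function names, and the sample outcome sequence are assumptions for illustration; the page does not state the values used in production.

```python
from typing import Iterable


def aci_update(
    alpha_t: float,
    covered: bool,
    target_alpha: float = 0.10,  # 1 - 0.90 nominal coverage
    gamma: float = 0.005,        # assumed step size, not taken from the page
) -> float:
    """One ACI step: alpha_{t+1} = alpha_t + gamma * (target_alpha - err_t)."""
    err_t = 0.0 if covered else 1.0  # 1 when the outcome fell outside the interval
    return alpha_t + gamma * (target_alpha - err_t)


def replay(covered_flags: Iterable[bool], alpha0: float = 0.10) -> float:
    """Apply the update once per observed outcome, in order."""
    alpha = alpha0
    for covered in covered_flags:
        alpha = aci_update(alpha, covered)
    return alpha


# Hypothetical replay: 95 of 100 outcomes covered, i.e. over-coverage.
flags = [True] * 95 + [False] * 5
final_alpha = replay(flags)
print(round(final_alpha, 4))
```

Each covered outcome raises alpha slightly (tighter intervals) and each miss lowers it by a larger amount (wider intervals), which is why a run of over-coverage pushes alpha above its 0.10 starting point.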
Latest coverage: 90.5% empirical (target 90%)
Latest q90: 0.048 (conformal width)
Drift alerts (90d): 0 days
Model version: v1 (since May 3)

Live forecast accuracy (prod_rolling, 30-day window)

Accuracy: 49.5%
Brier score: 0.58
Calibration MAE: 0.59
Evaluated: 840

Brier < 0.10 is good, > 0.30 is concerning. Calibration MAE < 0.05 means predicted probabilities track observed rates closely. See /sentinel/backtest for the actual reliability diagram (predicted-mean vs observed-rate scatter) and /methodology#validation for the full evaluation methodology and the 3-split honest baselines.
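For readers who want to reproduce these two metrics on their own data, a minimal sketch follows. The Brier score is the standard mean squared error on probabilities. The calibration MAE here is assumed to be the unweighted mean, over non-empty equal-width bins, of |mean predicted probability - observed rate|; the exact binning Sentinel uses is documented at /methodology#validation, not here.

```python
from typing import List, Sequence, Tuple


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def calibration_mae(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Assumed definition: mean |predicted mean - observed rate| over non-empty bins."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    gaps = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(y for _, y in members) / len(members)
            gaps.append(abs(mean_p - rate))
    return sum(gaps) / len(gaps)


# Hypothetical, well-calibrated toy data: ten 10% forecasts, one event.
probs = [0.1] * 10
outcomes = [1] + [0] * 9
print("Brier:", round(brier_score(probs, outcomes), 4))
print("Calibration MAE:", round(calibration_mae(probs, outcomes), 4))
```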

Empirical coverage — 90-day rolling

The blue line is the actual fraction of forecasts where the real outcome landed inside the 90% conformal interval. The green dashed line is the nominal target (0.90). If the blue stays close to the green, the model is well calibrated.

[Chart: empirical coverage, y-axis 0.7 to 1.0, target line at 0.90, x-axis Apr 18 to May 21]

Blue: empirical coverage · Dashed green: nominal 0.90 target · Orange circles: drift alerts

Last 14 days

Date     Coverage   q90     n holdout   Drift?
May 21   90.5%      0.048   2,100
May 20   90.5%      0.048   2,100
May 19   90.5%      0.048   2,100
May 18   90.5%      0.048   2,100
May 17   90.5%      0.048   2,100
May 16   90.5%      0.048   2,100
May 15   90.5%      0.048   2,100
May 14   90.5%      0.048   2,100
May 13   90.5%      0.048   2,100
May 12   90.5%      0.048   2,100
May 11   90.5%      0.048   2,100
May 10   90.5%      0.048   2,100
May 9    90.5%      0.048   2,100
May 8    90.5%      0.048   2,100

What features the model actually uses

Sklearn feature_importances_ on the underlying XGBoost. 39 features total. Top-3 sum: 0.23 · Top-5: 0.326 · Top-10: 0.492. Healthy distribution — no single feature dominates the model.

  • 1. gdelt_unrest_30d: 11.2%
  • 2. recent_shutdown: 6.1%
  • 3. week_of_year: 5.7%
  • 4. high_urgency_signals_7d: 5.6%
  • 5. month: 4.0%
  • 6. election_in_7days: 3.6%
  • 7. high_importance_event: 3.4%
  • 8. block_rate_roll30_mean: 3.3%
  • 9. block_rate_lag14: 3.2%
  • 10. ooni_anomaly_7d: 3.1%
  • 11. block_rate_roll14_mean: 3.1%
  • 12. critical_incident_7d: 3.0%
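The top-k sums quoted for this list can be checked with a few lines of Python. The dictionary below simply transcribes the twelve importances shown on this page (as fractions); the helper name `top_k_sum` is hypothetical, and the full 39-feature vector is available at /v1/sentinel/feature-importance.

```python
from typing import Dict

# Importances transcribed from the list on this page (top 12 of 39 features).
importances: Dict[str, float] = {
    "gdelt_unrest_30d": 0.112,
    "recent_shutdown": 0.061,
    "week_of_year": 0.057,
    "high_urgency_signals_7d": 0.056,
    "month": 0.040,
    "election_in_7days": 0.036,
    "high_importance_event": 0.034,
    "block_rate_roll30_mean": 0.033,
    "block_rate_lag14": 0.032,
    "ooni_anomaly_7d": 0.031,
    "block_rate_roll14_mean": 0.031,
    "critical_incident_7d": 0.030,
}


def top_k_sum(values: Dict[str, float], k: int) -> float:
    """Sum of the k largest importances, rounded to three decimals."""
    ranked = sorted(values.values(), reverse=True)
    return round(sum(ranked[:k]), 3)


for k in (3, 5, 10):
    print(f"Top-{k} sum: {top_k_sum(importances, k)}")
```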

Interpretation: The forecast model's top feature is gdelt_unrest_30d (11.2%), which captures protest and conflict signals from the GDELT 1.0 global news feed. recent_shutdown, block_rate rolling means, and incident counts follow. risk_tier, the leaky country-level encoding that dominated our older classifier at 85%, contributes only ~2% here. Healthy distribution; no single feature dominates.

Raw JSON: /v1/sentinel/feature-importance

Related