The honest calibration plot
When the Sentinel model says “5% risk,” does the real outcome actually happen ~5% of the time? Below is the answer: 1,140 live (predicted, observed) pairs from the last 30 days, binned into a standard reliability diagram.
Updated every 30 min · last refresh Jul 5, 2026 · CC BY 4.0 · Binned JSON · Raw outcomes
Reliability diagram
Each point is one prediction bin. X axis is the mean predicted probability inside the bin; Y axis is the fraction of those forecasts where the real outcome actually happened. Perfect calibration is the diagonal line — points above the line mean the model UNDER-estimates risk; points below mean it OVER-estimates.
Bubble area scales with bin count · Red = model under-estimated · Blue = model over-estimated
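The binning described above can be sketched in a few lines. This is a minimal illustration rather than the Sentinel pipeline itself: the function name `reliability_bins`, the ten equal-width bins, and the handling of a prediction of exactly 1.0 are assumptions, and the sample pairs are toy data.

```python
from typing import Dict, Iterable, List, Tuple


def reliability_bins(
    pairs: Iterable[Tuple[float, int]], n_bins: int = 10
) -> List[Dict[str, object]]:
    """Summarize (predicted probability, 0/1 outcome) pairs per equal-width bin."""
    # Per bin: [sum of predicted probabilities, count of positive outcomes, n]
    acc = [[0.0, 0, 0] for _ in range(n_bins)]
    for prob, outcome in pairs:
        # Assumption: a prediction of exactly 1.0 is folded into the last bin.
        idx = min(int(prob * n_bins), n_bins - 1)
        acc[idx][0] += prob
        acc[idx][1] += outcome
        acc[idx][2] += 1

    rows = []
    for idx, (prob_sum, positives, n) in enumerate(acc):
        if n == 0:
            continue  # an empty bin has no point on the diagram
        predicted_mean = prob_sum / n
        observed_rate = positives / n
        rows.append({
            "bin": (idx / n_bins, (idx + 1) / n_bins),
            "predicted_mean": predicted_mean,   # X axis
            "observed_rate": observed_rate,     # Y axis
            "delta": observed_rate - predicted_mean,
            "n": n,                             # drives bubble area
        })
    return rows


# Hypothetical toy data, not Sentinel output.
toy_pairs = [(0.05, 0), (0.05, 1), (0.95, 1), (1.0, 1)]
for row in reliability_bins(toy_pairs):
    print(row)
```

A positive `delta` puts the point above the diagonal (risk under-estimated); a negative `delta` puts it below (risk over-estimated).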
Per-bin breakdown
| Bin | Predicted mean | Observed rate | Δ | n |
|---|---|---|---|---|
| [0.0, 0.1) | 0.042 | 0.168 | +0.126 | 845 |
| [0.1, 0.2) | 0.159 | 0.339 | +0.181 | 56 |
| [0.2, 0.3) | 0.246 | 0.300 | +0.054 | 10 |
| [0.3, 0.4) | 0.328 | 0.333 | +0.006 | 3 |
| [0.4, 0.5) | 0.451 | 0.633 | +0.182 | 49 |
| [0.5, 0.6) | 0.570 | 0.263 | -0.307 | 19 |
| [0.6, 0.7) | 0.644 | 0.267 | -0.377 | 75 |
| [0.7, 0.8) | 0.757 | 0.207 | -0.550 | 29 |
| [0.8, 0.9) | 0.851 | 0.063 | -0.789 | 16 |
| [0.9, 1.0) | 0.949 | 0.210 | -0.738 | 38 |
Δ = observed − predicted, so a positive Δ means the model under-estimated risk in that bin. The lowest bin, [0.0, 0.1), holds 845 of the 1,140 forecasts (about 74%). This is where most of the action happens, and where the May 20 isotonic recalibration was aimed. See /sentinel/calibration for the time-series view of how this gap evolves day over day.
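To turn the per-bin gaps into a single number, the mean absolute Δ can be computed directly from the table rows. The sketch below copies the predicted mean, observed rate, and n from the per-bin table. Whether the site's calibration MAE weights bins by sample count is not stated, so both variants are shown, and the name `calibration_mae` is hypothetical.

```python
# (predicted mean, observed rate, n) copied from the per-bin table.
BINS = [
    (0.042, 0.168, 845),
    (0.159, 0.339, 56),
    (0.246, 0.300, 10),
    (0.328, 0.333, 3),
    (0.451, 0.633, 49),
    (0.570, 0.263, 19),
    (0.644, 0.267, 75),
    (0.757, 0.207, 29),
    (0.851, 0.063, 16),
    (0.949, 0.210, 38),
]


def calibration_mae(bins, weighted=False):
    """Mean absolute gap between observed rate and predicted mean across bins."""
    gaps = [abs(observed - predicted) for predicted, observed, _ in bins]
    if not weighted:
        # Every bin counts equally, regardless of how many forecasts it holds.
        return sum(gaps) / len(gaps)
    # Assumption: weight each bin's gap by its share of all forecasts.
    total = sum(n for _, _, n in bins)
    return sum(gap * n for gap, (_, _, n) in zip(gaps, bins)) / total


print(f"unweighted: {calibration_mae(BINS):.3f}")
print(f"weighted:   {calibration_mae(BINS, weighted=True):.3f}")
```

The two variants differ noticeably here because the [0.0, 0.1) bin dominates the sample count while the largest gaps sit in the sparsely populated high-probability bins.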
Per-country backtest (worst Brier first, n ≥ 5)
Countries where the forecast is currently performing worst — useful for targeting feature engineering or seeking expert review.
| Country | Brier | Accuracy | Precision | Recall | n | Pos rate |
|---|---|---|---|---|---|---|
| Bangladesh (BD) | 0.000 | 0% | 0.00 | 0.00 | 38 | 0% |
| Brazil (BR) | 0.000 | 0% | 0.00 | — | 38 | 0% |
| Belarus (BY) | 0.000 | 0% | 0.36 | 0.50 | 38 | 0% |
| China (CN) | 0.000 | 0% | 0.00 | — | 38 | 0% |
| Cuba (CU) | 0.000 | 0% | 0.15 | 0.22 | 38 | 0% |
| Egypt (EG) | 0.000 | 0% | 0.76 | 0.96 | 38 | 0% |
| ER (ER) | 0.000 | 0% | 0.00 | — | 38 | 0% |
| Ethiopia (ET) | 0.000 | 0% | 0.29 | 1.00 | 38 | 0% |
| Indonesia (ID) | 0.000 | 0% | 0.00 | — | 38 | 0% |
| India (IN) | 0.000 | 0% | 0.57 | 0.35 | 38 | 0% |
| Iran (IR) | 0.000 | 0% | 0.73 | 0.46 | 38 | 0% |
| North Korea (KP) | 0.000 | 0% | 0.00 | — | 38 | 0% |
| Kazakhstan (KZ) | 0.000 | 0% | 0.06 | 0.25 | 38 | 0% |
| Lebanon (LB) | 0.000 | 0% | 0.00 | 0.00 | 38 | 0% |
| Myanmar (MM) | 0.000 | 0% | 0.30 | 0.30 | 38 | 0% |
| Malaysia (MY) | 0.000 | 0% | 0.00 | — | 38 | 0% |
| Nigeria (NG) | 0.000 | 0% | 0.00 | — | 38 | 0% |
| Nicaragua (NI) | 0.000 | 0% | 0.00 | — | 38 | 0% |
| Philippines (PH) | 0.000 | 0% | 0.00 | — | 38 | 0% |
| Pakistan (PK) | 0.000 | 0% | 0.26 | 0.90 | 38 | 0% |
How to read these numbers
- Brier score: mean squared error between the predicted probability and the actual 0/1 outcome. Lower is better. Below 0.10 is excellent; 0.10-0.30 is OK; above 0.30 is concerning.
- Calibration MAE: mean absolute gap between predicted mean and observed rate across bins. 0.00 means every bin's predicted mean matches its observed rate exactly.
- Reliability diagram: the visual version of calibration MAE. Bubble area scales with bin sample count.
- F1, precision (P), and recall (R): binary classification metrics at the 0.5 threshold. Useful when downstream decisions are binary (alert / no-alert).
- The May 20, 2026 isotonic recalibration targeted the lowest bin, [0.0, 0.1), specifically. See the recalibration finding.
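For readers who want to reproduce the Brier score and the threshold metrics from the raw outcomes, a minimal sketch follows. The function names are hypothetical, the sample data is a toy example, and treating a probability of exactly 0.5 as an alert is an assumption (the page only says "at the 0.5 threshold").

```python
from typing import Optional, Sequence, Tuple


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def precision_recall_f1(
    probs: Sequence[float], outcomes: Sequence[int], threshold: float = 0.5
) -> Tuple[Optional[float], Optional[float], Optional[float]]:
    """Precision, recall, and F1 after turning probabilities into alerts."""
    tp = fp = fn = 0
    for p, y in zip(probs, outcomes):
        alert = p >= threshold  # assumption: ties count as an alert
        if alert and y == 1:
            tp += 1
        elif alert and y == 0:
            fp += 1
        elif not alert and y == 1:
            fn += 1
    # A metric is undefined (None) when its denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    if precision is None or recall is None or (precision + recall) == 0:
        f1 = None
    else:
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data, not Sentinel output.
probs = [0.9, 0.8, 0.2, 0.6]
outcomes = [1, 0, 0, 1]
print(brier_score(probs, outcomes))
print(precision_recall_f1(probs, outcomes))
```

Returning `None` for an undefined metric keeps "no alerts fired" or "no positives observed" distinct from a genuine score of 0.00.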
Related
- /sentinel/calibration — 90-day time series of empirical coverage vs the 90% conformal target
- /methodology#validation — the three honest accuracy splits (LOCO, stratified, time-based)
- /atlas/forecast/IR — per-country calibrated forecast detail with SHAP drivers