The honest calibration plot
When the Sentinel model says “5% risk,” does the real outcome actually happen ~5% of the time? Below is the answer: 570 live (predicted, observed) pairs from the last 30 days, binned into a standard reliability diagram.
Updated every 30 min · last refresh Jun 8, 2026 · CC BY 4.0 · Binned JSON · Raw outcomes
Reliability diagram
Each point is one prediction bin. X axis is the mean predicted probability inside the bin; Y axis is the fraction of those forecasts where the real outcome actually happened. Perfect calibration is the diagonal line — points above the line mean the model UNDER-estimates risk; points below mean it OVER-estimates.
Bubble area scales with bin count · Red = model under-estimated · Blue = model over-estimated
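The binning described above can be sketched in a few lines of Python. This is a minimal illustration, not Sentinel's actual pipeline: the function name `reliability_bins`, the ten equal-width bins, and the dictionary keys are all assumptions made for the example.

```python
from collections import defaultdict


def reliability_bins(pairs, n_bins=10):
    """Group (predicted, observed) pairs into equal-width probability bins.

    `pairs` is an iterable of (p, y) with p in [0, 1] and y in {0, 1}.
    Returns one dict per non-empty bin, ordered by bin. Hypothetical helper.
    """
    grouped = defaultdict(list)
    for p, y in pairs:
        # Clamp so that p == 1.0 lands in the last bin rather than a phantom one.
        idx = min(int(p * n_bins), n_bins - 1)
        grouped[idx].append((p, y))

    rows = []
    for idx in sorted(grouped):
        members = grouped[idx]
        n = len(members)
        predicted_mean = sum(p for p, _ in members) / n  # X coordinate
        observed_rate = sum(y for _, y in members) / n   # Y coordinate
        rows.append({
            "bin": (idx / n_bins, (idx + 1) / n_bins),
            "predicted_mean": predicted_mean,
            "observed_rate": observed_rate,
            # Positive delta: point above the diagonal (risk under-estimated).
            "delta": observed_rate - predicted_mean,
            "n": n,
        })
    return rows


if __name__ == "__main__":
    demo = [(0.05, 0), (0.05, 1), (0.95, 1), (1.0, 1)]
    for row in reliability_bins(demo):
        print(row)
```

Each returned row corresponds to one bubble: `predicted_mean` and `observed_rate` give its position, `n` its area, and the sign of `delta` its colour.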
Per-bin breakdown
| Bin | Predicted mean | Observed rate | Δ | n |
|---|---|---|---|---|
| [0.0, 0.1) | 0.051 | 0.344 | +0.294 | 520 |
| [0.1, 0.2) | 0.139 | 0.115 | -0.024 | 26 |
| [0.3, 0.4) | 0.328 | 0.833 | +0.505 | 6 |
| [0.6, 0.7) | 0.691 | 0.167 | -0.524 | 6 |
| [0.7, 0.8) | 0.703 | 0.000 | -0.703 | 1 |
| [0.8, 0.9) | 0.823 | 0.273 | -0.550 | 11 |
Δ = observed − predicted. The lowest bin, [0.0, 0.1), holds 520 of the 570 forecasts, so it dominates any count-weighted summary, and it is the bin the May 20 isotonic recalibration was aimed at. See /sentinel/calibration for the time-series view of how this gap evolves day over day.
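As a worked example, the per-bin gaps and a single summary gap can be recomputed from the table rows. The sketch below copies the published (rounded) figures; the helper name `calibration_mae` and the `weighted` switch are assumptions, since the page does not say whether its calibration MAE weights bins by count.

```python
# (predicted mean, observed rate, n) copied from the per-bin table.
BINS = [
    (0.051, 0.344, 520),
    (0.139, 0.115, 26),
    (0.328, 0.833, 6),
    (0.691, 0.167, 6),
    (0.703, 0.000, 1),
    (0.823, 0.273, 11),
]


def calibration_mae(bins, weighted=False):
    """Mean absolute gap between observed rate and predicted mean.

    weighted=False averages over bins; weighted=True weights each bin by n.
    Which variant the page reports is not stated, so both are offered.
    """
    gaps = [abs(obs - pred) for pred, obs, _ in bins]
    if not weighted:
        return sum(gaps) / len(gaps)
    total = sum(n for _, _, n in bins)
    return sum(gap * n for gap, (_, _, n) in zip(gaps, bins)) / total


if __name__ == "__main__":
    for pred, obs, n in BINS:
        print(f"pred={pred:.3f} obs={obs:.3f} delta={obs - pred:+.3f} n={n}")
    print("unweighted MAE:", round(calibration_mae(BINS), 3))
    print("count-weighted MAE:", round(calibration_mae(BINS, weighted=True), 3))
```

On these six rows the unweighted gap is about 0.43 and the count-weighted gap about 0.29; the difference comes almost entirely from the 520-forecast lowest bin. Deltas recomputed from the rounded columns can differ from the published Δ in the last digit (0.344 − 0.051 gives 0.293, against the table's +0.294).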
Per-country backtest (worst Brier first, n ≥ 5)
Countries where the forecast is currently performing worst — useful for targeting feature engineering or seeking expert review.
| Country | Brier | Accuracy | Precision | Recall | n | Pos rate |
|---|---|---|---|---|---|---|
| Bangladesh (BD) | 0.000 | 0% | 0.40 | 0.29 | 19 | 0% |
| Brazil (BR) | 0.000 | 0% | 0.00 | 0.00 | 19 | 0% |
| Belarus (BY) | 0.000 | 0% | 0.60 | 0.50 | 19 | 0% |
| China (CN) | 0.000 | 0% | 0.29 | 0.83 | 19 | 0% |
| Cuba (CU) | 0.000 | 0% | 0.25 | 0.17 | 19 | 0% |
| Egypt (EG) | 0.000 | 0% | 0.85 | 0.65 | 19 | 0% |
| ER (ER) | 0.000 | 0% | 0.00 | — | 19 | 0% |
| Ethiopia (ET) | 0.000 | 0% | 0.58 | 1.00 | 19 | 0% |
| Indonesia (ID) | 0.000 | 0% | 0.71 | 0.56 | 19 | 0% |
| India (IN) | 0.000 | 0% | 1.00 | 0.53 | 19 | 0% |
| Iran (IR) | 0.000 | 0% | 0.67 | 0.92 | 19 | 0% |
| North Korea (KP) | 0.000 | 0% | 0.00 | — | 19 | 0% |
| Kazakhstan (KZ) | 0.000 | 0% | 0.00 | — | 19 | 0% |
| Lebanon (LB) | 0.000 | 0% | 0.00 | — | 19 | 0% |
| Myanmar (MM) | 0.000 | 0% | 0.00 | — | 19 | 0% |
| Malaysia (MY) | 0.000 | 0% | 0.00 | — | 19 | 0% |
| Nigeria (NG) | 0.000 | 0% | 0.33 | 0.50 | 19 | 0% |
| Nicaragua (NI) | 0.000 | 0% | 0.36 | 0.67 | 19 | 0% |
| Philippines (PH) | 0.000 | 0% | 0.00 | 0.00 | 19 | 0% |
| Pakistan (PK) | 0.000 | 0% | 1.00 | 0.37 | 19 | 0% |
How to read these numbers
- Brier score: mean squared error between the predicted probability and the actual 0/1 outcome. Lower is better. Below 0.10 is excellent; 0.10 to 0.30 is OK; above 0.30 is concerning.
- Calibration MAE: mean absolute gap between predicted mean and observed rate across bins. 0.00 means every bin's observed rate matches its mean predicted probability.
- Reliability diagram: the visual version of calibration MAE. Bubble area scales with bin sample count.
- Precision (P), recall (R), and F1: binary classification metrics at the 0.5 threshold; F1 is the harmonic mean of P and R. Useful when downstream decisions are binary (alert / no-alert).
- The May 20, 2026 isotonic recalibration targeted the [0.0, 0.1) bin specifically; see the recalibration finding.
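The Brier score and the thresholded metrics defined above can be written out directly. The sketch below is illustrative only: the function names are hypothetical, and treating a probability exactly equal to the threshold as an alert (`>=`) is an assumption the page does not confirm.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def precision_recall_f1(probs, outcomes, threshold=0.5):
    """Precision, recall and F1 after turning probabilities into alerts.

    Assumption: p >= threshold counts as an alert. A metric whose
    denominator is zero is returned as None rather than 0.
    """
    tp = fp = fn = 0
    for p, y in zip(probs, outcomes):
        alert = p >= threshold
        if alert and y == 1:
            tp += 1          # alert raised, event happened
        elif alert and y == 0:
            fp += 1          # alert raised, nothing happened
        elif not alert and y == 1:
            fn += 1          # event happened without an alert

    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    if precision is None or recall is None or precision + recall == 0:
        f1 = None
    else:
        # Harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    probs = [0.9, 0.8, 0.2, 0.6]
    outcomes = [1, 0, 1, 1]
    print("Brier:", brier_score(probs, outcomes))
    print("P, R, F1:", precision_recall_f1(probs, outcomes))
```

Recall has no defined value when a sample contains no positive outcomes, and precision has none when no alert is raised; returning `None` keeps those cases distinguishable from a genuine score of 0.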
Related
- /sentinel/calibration — 90-day time series of empirical coverage vs the 90% conformal target
- /methodology#validation — the three honest accuracy splits (LOCO, stratified, time-based)
- /atlas/forecast/IR — per-country calibrated forecast detail with SHAP drivers