The honest calibration plot
When the Sentinel model says “5% risk,” does the real outcome actually happen ~5% of the time? Below is the answer: 840 live (predicted, observed) pairs from the last 30 days, binned into a standard reliability diagram.
Updated every 30 min · last refresh May 21, 2026 · CC BY 4.0 · Binned JSON · Raw outcomes
Reliability diagram
Each point is one prediction bin. X axis is the mean predicted probability inside the bin; Y axis is the fraction of those forecasts where the real outcome actually happened. Perfect calibration is the diagonal line — points above the line mean the model UNDER-estimates risk; points below mean it OVER-estimates.
Bubble area scales with bin count · Red = model under-estimated · Blue = model over-estimated
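The binning behind a diagram like this can be sketched in a few lines of Python. This is a minimal illustration, not Sentinel's actual code: it assumes ten equal-width bins (consistent with the [0.0, 0.1) and [0.1, 0.2) labels in the per-bin breakdown, though the page does not state the scheme), and the function name `reliability_bins` and the `(predicted, observed)` input layout are hypothetical.

```python
from collections import defaultdict


def reliability_bins(pairs, n_bins=10):
    """Group (predicted, observed) pairs into equal-width probability bins.

    pairs: iterable of (p, y) with p in [0, 1] and y in {0, 1}.
    Returns one dict per non-empty bin, ordered by bin edge.
    """
    sums = defaultdict(lambda: [0.0, 0, 0])  # bin index -> [sum of p, sum of y, count]
    for p, y in pairs:
        # p == 1.0 goes into the last bin so the top bin is closed on the right
        idx = min(int(p * n_bins), n_bins - 1)
        acc = sums[idx]
        acc[0] += p
        acc[1] += y
        acc[2] += 1

    rows = []
    for idx in sorted(sums):
        p_sum, y_sum, n = sums[idx]
        predicted_mean = p_sum / n   # X coordinate of the point
        observed_rate = y_sum / n    # Y coordinate of the point
        rows.append({
            "bin": (idx / n_bins, (idx + 1) / n_bins),
            "predicted_mean": predicted_mean,
            "observed_rate": observed_rate,
            "delta": observed_rate - predicted_mean,  # > 0: risk under-estimated
            "n": n,
        })
    return rows
```

A positive `delta` corresponds to a point above the diagonal (under-estimated risk), a negative one to a point below it.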
Per-bin breakdown
| Bin | Predicted mean | Observed rate | Δ | n |
|---|---|---|---|---|
| [0.0, 0.1) | 0.047 | 0.644 | +0.597 | 828 |
| [0.1, 0.2) | 0.155 | 0.083 | -0.071 | 12 |
Δ = observed − predicted. The [0.0, 0.1) bin holds 828 of the 840 forecasts, so it is where nearly all the predictions fall and where the May 20 isotonic recalibration was aimed. See /sentinel/calibration for the time-series view of how this gap evolves day over day.
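For readers unfamiliar with isotonic recalibration, the general technique fits a non-decreasing step function from raw score to observed rate, usually with the pool-adjacent-violators algorithm. The sketch below shows that generic algorithm in plain Python. It is not the procedure Sentinel ran on May 20, whose details this page does not give; the names `isotonic_fit` and `isotonic_predict` are hypothetical.

```python
def isotonic_fit(scores, outcomes):
    """Pool-adjacent-violators fit of a non-decreasing map from score to rate.

    Returns a list of (largest_score_in_block, calibrated_value) steps.
    """
    # Pool tied scores first so identical scores always share one value
    totals = {}
    for s, y in zip(scores, outcomes):
        acc = totals.setdefault(s, [0.0, 0])
        acc[0] += y
        acc[1] += 1

    blocks = []  # each block: [sum of outcomes, count, largest score in block]
    for s in sorted(totals):
        y_sum, n = totals[s]
        blocks.append([y_sum, n, s])
        # Merge backwards while the previous block's mean exceeds this one's
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            y_sum, n, hi = blocks.pop()
            blocks[-1][0] += y_sum
            blocks[-1][1] += n
            blocks[-1][2] = hi
    return [(hi, y_sum / n) for y_sum, n, hi in blocks]


def isotonic_predict(steps, score):
    """Return the value of the first block whose top score covers `score`."""
    for hi, value in steps:
        if score <= hi:
            return value
    return steps[-1][1]  # scores above the training range use the last step
```

Production implementations typically interpolate between steps rather than using a pure step lookup; the lookup here is kept simple on purpose.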
Per-country backtest (worst Brier first, n ≥ 5)
Countries where the forecast is currently performing worst — useful for targeting feature engineering or seeking expert review.
| Country | Brier | Accuracy | Precision (P) | Recall (R) | n | Pos rate |
|---|---|---|---|---|---|---|
| Bangladesh (BD) | 0.000 | 0% | 0.89 | 0.38 | 28 | 0% |
| Brazil (BR) | 0.000 | 0% | 0.67 | 0.21 | 28 | 0% |
| Belarus (BY) | 0.000 | 0% | 0.67 | 0.13 | 28 | 0% |
| China (CN) | 0.000 | 0% | 0.71 | 0.55 | 28 | 0% |
| Cuba (CU) | 0.000 | 0% | 0.50 | 0.17 | 28 | 0% |
| Egypt (EG) | 0.000 | 0% | 1.00 | 0.21 | 28 | 0% |
| Eritrea (ER) | 0.000 | 0% | 0.00 | — | 28 | 0% |
| Ethiopia (ET) | 0.000 | 0% | 0.95 | 0.77 | 28 | 0% |
| Indonesia (ID) | 0.000 | 0% | 0.50 | 0.05 | 28 | 0% |
| India (IN) | 0.000 | 0% | 1.00 | 0.32 | 28 | 0% |
| Iran (IR) | 0.000 | 0% | 0.84 | 0.88 | 28 | 0% |
| North Korea (KP) | 0.000 | 0% | 0.00 | — | 28 | 0% |
| Kazakhstan (KZ) | 0.000 | 0% | 0.50 | 0.13 | 28 | 0% |
| Lebanon (LB) | 0.000 | 0% | 0.54 | 1.00 | 28 | 0% |
| Myanmar (MM) | 0.000 | 0% | 0.44 | 0.25 | 28 | 0% |
| Malaysia (MY) | 0.000 | 0% | 0.00 | 0.00 | 28 | 0% |
| Nigeria (NG) | 0.000 | 0% | 0.92 | 0.50 | 28 | 0% |
| Nicaragua (NI) | 0.000 | 0% | 0.78 | 0.33 | 28 | 0% |
| Philippines (PH) | 0.000 | 0% | 0.67 | 0.11 | 28 | 0% |
| Pakistan (PK) | 0.000 | 0% | 1.00 | 0.14 | 28 | 0% |
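A ranking like the one in this table can be produced by grouping forecasts by country, dropping small groups, and sorting by Brier score in descending order. The Python sketch below assumes a hypothetical record layout of `(country_code, predicted_probability, outcome)`. The function name `country_backtest` is illustrative, and nothing here is taken from Sentinel's pipeline.

```python
from collections import defaultdict


def country_backtest(records, min_n=5):
    """Per-country Brier score, worst first, keeping countries with n >= min_n.

    records: iterable of (country_code, predicted_probability, outcome in {0, 1}).
    """
    grouped = defaultdict(list)
    for country, p, y in records:
        grouped[country].append((p, y))

    rows = []
    for country, pairs in grouped.items():
        n = len(pairs)
        if n < min_n:
            continue  # too few forecasts for a stable score
        brier = sum((p - y) ** 2 for p, y in pairs) / n
        pos_rate = sum(y for _, y in pairs) / n
        rows.append({"country": country, "brier": brier, "n": n, "pos_rate": pos_rate})

    # Higher Brier means worse forecasts, so sort descending
    rows.sort(key=lambda r: r["brier"], reverse=True)
    return rows
```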
How to read these numbers
- Brier score: mean squared error between the predicted probability and the actual 0/1 outcome. Lower is better. Below 0.10 is excellent; 0.10 to 0.30 is OK; above 0.30 is concerning.
- Calibration MAE: mean absolute gap between predicted mean and observed rate across bins. 0.00 means the predicted mean matches the observed rate in every bin.
- Reliability diagram: the visual version of calibration MAE. Bubble area scales with the bin's sample count.
- Precision (P), recall (R), and F1: binary classification metrics at the 0.5 threshold; F1 is the harmonic mean of P and R. Useful when downstream decisions are binary (alert / no-alert).
- The May 20, 2026 isotonic recalibration targeted the [0.0, 0.1) bin specifically; see the recalibration finding.
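The metrics in this list can be written out as a short Python sketch. Several details are assumptions rather than facts about Sentinel: whether calibration MAE is count-weighted is not stated on this page (both variants are shown), the alert rule is taken to be `p >= 0.5`, and an undefined precision or recall is returned as `None`. All function names are hypothetical.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def calibration_mae(bins, weighted=False):
    """Mean absolute gap across bins; bins is a list of (predicted_mean, observed_rate, n)."""
    gaps = [abs(obs - pred) for pred, obs, _ in bins]
    if not weighted:
        return sum(gaps) / len(gaps)
    total = sum(n for _, _, n in bins)
    return sum(gap * n for gap, (_, _, n) in zip(gaps, bins)) / total


def precision_recall_f1(probs, outcomes, threshold=0.5):
    """Binary metrics after thresholding; None marks an undefined ratio."""
    preds = [1 if p >= threshold else 0 for p in probs]  # assumed alert rule
    tp = sum(1 for yhat, y in zip(preds, outcomes) if yhat == 1 and y == 1)
    fp = sum(1 for yhat, y in zip(preds, outcomes) if yhat == 1 and y == 0)
    fn = sum(1 for yhat, y in zip(preds, outcomes) if yhat == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    if precision is None or recall is None or precision + recall == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# The two rows of the per-bin breakdown, using the rounded values shown there
table_bins = [(0.047, 0.644, 828), (0.155, 0.083, 12)]
print(round(calibration_mae(table_bins), 4))
print(round(calibration_mae(table_bins, weighted=True), 4))
```

With the rounded table values, the unweighted gap comes out near 0.33 and the count-weighted gap near 0.59, because the [0.0, 0.1) bin carries 828 of the 840 forecasts.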
Related
- /sentinel/calibration — 90-day time series of empirical coverage vs the 90% conformal target
- /methodology#validation — the three honest accuracy splits (LOCO, stratified, time-based)
- /atlas/forecast/IR — per-country calibrated forecast detail with SHAP drivers