voidly
Sentinel · 30-day backtest

The honest calibration plot

When the Sentinel model says “5% risk,” does the real outcome actually happen ~5% of the time? Below is the answer: 840 live (predicted, observed) pairs from the last 30 days, binned into a standard reliability diagram.

Updated every 30 min · last refresh May 21, 2026 · CC BY 4.0 · Binned JSON · Raw outcomes

Brier score: 0.577 (lower is better)
Calibration MAE: 0.589 (0 = perfect)
Accuracy: 49.5% (840 evaluated)
F1 (binary, threshold 0.5): 0.47 (P = 0.70, R = 0.36)

Reliability diagram

Each point is one prediction bin. X axis is the mean predicted probability inside the bin; Y axis is the fraction of those forecasts where the real outcome actually happened. Perfect calibration is the diagonal line — points above the line mean the model UNDER-estimates risk; points below mean it OVER-estimates.
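The binning described above can be sketched as follows. This is a minimal illustration, not Sentinel's actual pipeline: the function name `reliability_bins`, the choice of ten equal-width bins, and the input shape (a list of `(predicted, observed)` tuples with `observed` equal to 0 or 1) are all assumptions.

```python
from collections import defaultdict


def reliability_bins(pairs, n_bins=10):
    """Group (predicted, observed) pairs into equal-width probability bins.

    Hypothetical helper: returns one dict per non-empty bin, ordered by bin.
    """
    buckets = defaultdict(list)
    for p, y in pairs:
        # Clamp so that p == 1.0 falls in the last bin instead of a new one.
        idx = min(int(p * n_bins), n_bins - 1)
        buckets[idx].append((p, y))

    rows = []
    for idx in sorted(buckets):
        members = buckets[idx]
        n = len(members)
        predicted_mean = sum(p for p, _ in members) / n  # X coordinate
        observed_rate = sum(y for _, y in members) / n   # Y coordinate
        rows.append({
            "bin": (idx / n_bins, (idx + 1) / n_bins),
            "predicted_mean": predicted_mean,
            "observed_rate": observed_rate,
            # Positive delta: the model under-estimated risk in this bin.
            "delta": observed_rate - predicted_mean,
            "n": n,
        })
    return rows
```

Each returned row corresponds to one point on the diagram: `predicted_mean` is the X value, `observed_rate` is the Y value, and `n` would drive the bubble area.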

[Reliability diagram. X axis: predicted probability (bin mean), 0.00 to 1.00. Y axis: observed positive rate, 0.00 to 1.00. The diagonal is labeled "perfect". Two bubbles are plotted: n=828 and n=12.]

Bubble area scales with bin count · Red = model under-estimated · Blue = model over-estimated

Per-bin breakdown

Bin          Predicted mean   Observed rate   Δ        n
[0.0, 0.1)   0.047            0.644           +0.597   828
[0.1, 0.2)   0.155            0.083           -0.071   12

Δ = observed − predicted. The 0.1 bin, [0.0, 0.1), holds 828 of the 840 forecasts, so it dominates the aggregate metrics, and it is where the May 20 isotonic recalibration was aimed. See /sentinel/calibration for the time-series view of how this gap evolves day over day.
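As a worked check, the per-bin rows can be collapsed into a single calibration error. The sketch below is illustrative only: `calibration_mae` is a hypothetical helper, and the idea that the headline figure is weighted by bin count is an inference from the published numbers, not something the page states.

```python
# Per-bin values copied from the table above (already rounded to 3 decimals,
# so results can differ from the page in the last digit).
BINS = [
    {"predicted_mean": 0.047, "observed_rate": 0.644, "n": 828},
    {"predicted_mean": 0.155, "observed_rate": 0.083, "n": 12},
]


def calibration_mae(bins, weighted=True):
    """Mean absolute gap between observed rate and predicted mean across bins."""
    gaps = [abs(b["observed_rate"] - b["predicted_mean"]) for b in bins]
    if not weighted:
        # Every bin counts equally, regardless of how many forecasts it holds.
        return sum(gaps) / len(gaps)
    # Each bin contributes in proportion to its forecast count.
    total = sum(b["n"] for b in bins)
    return sum(g * b["n"] for g, b in zip(gaps, bins)) / total


weighted = calibration_mae(BINS)
unweighted = calibration_mae(BINS, weighted=False)
print(f"weighted={weighted:.4f} unweighted={unweighted:.4f}")
```

With these rounded inputs the count-weighted value comes out close to 0.59 and the unweighted value close to 0.33, so only the weighted form lines up with the headline Calibration MAE of 0.589.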

Per-country backtest (worst Brier first, n ≥ 5)

Countries where the forecast is currently performing worst — useful for targeting feature engineering or seeking expert review.

Country       Code  Brier  Accuracy  P / R        n   Pos rate
Bangladesh    BD    0.000  0%        0.89 / 0.38  28  0%
Brazil        BR    0.000  0%        0.67 / 0.21  28  0%
Belarus       BY    0.000  0%        0.67 / 0.13  28  0%
China         CN    0.000  0%        0.71 / 0.55  28  0%
Cuba          CU    0.000  0%        0.50 / 0.17  28  0%
Egypt         EG    0.000  0%        1.00 / 0.21  28  0%
Eritrea       ER    0.000  0%        0.00         28  0%
Ethiopia      ET    0.000  0%        0.95 / 0.77  28  0%
Indonesia     ID    0.000  0%        0.50 / 0.05  28  0%
India         IN    0.000  0%        1.00 / 0.32  28  0%
Iran          IR    0.000  0%        0.84 / 0.88  28  0%
North Korea   KP    0.000  0%        0.00         28  0%
Kazakhstan    KZ    0.000  0%        0.50 / 0.13  28  0%
Lebanon       LB    0.000  0%        0.54 / 1.00  28  0%
Myanmar       MM    0.000  0%        0.44 / 0.25  28  0%
Malaysia      MY    0.000  0%        0.00 / 0.00  28  0%
Nigeria       NG    0.000  0%        0.92 / 0.50  28  0%
Nicaragua     NI    0.000  0%        0.78 / 0.33  28  0%
Philippines   PH    0.000  0%        0.67 / 0.11  28  0%
Pakistan      PK    0.000  0%        1.00 / 0.14  28  0%
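A ranking like the one in this section (worst Brier first, n ≥ 5) could be produced along the following lines. This is a sketch under assumptions: the function name `per_country_brier` and the record shape `(country_code, predicted, observed)` are hypothetical, and the page does not describe how its own table is built.

```python
from collections import defaultdict


def per_country_brier(records, min_n=5):
    """Rank countries by Brier score, worst (highest) first.

    records: iterable of (country_code, predicted, observed) with observed 0 or 1.
    Countries with fewer than min_n evaluated forecasts are dropped.
    """
    squared_errors = defaultdict(list)
    for country, p, y in records:
        squared_errors[country].append((p - y) ** 2)

    rows = [
        {"country": c, "brier": sum(errs) / len(errs), "n": len(errs)}
        for c, errs in squared_errors.items()
        if len(errs) >= min_n
    ]
    # Higher Brier means worse forecasts, so sort in descending order.
    rows.sort(key=lambda r: r["brier"], reverse=True)
    return rows
```

The `min_n` filter keeps a country with only a handful of forecasts from topping the list on noise alone.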

How to read these numbers

  • Brier score: mean squared error between the predicted probability and the actual 0/1 outcome. Lower is better. Below 0.10 is excellent; 0.10 to 0.30 is OK; above 0.30 is concerning.
  • Calibration MAE: average absolute gap between predicted mean and observed rate across bins. The headline figure is consistent with weighting each bin by its sample count. 0.00 means the predicted mean matches the observed rate in every bin.
  • Reliability diagram: the visual version of calibration MAE. Bubble area scales with bin sample count.
  • F1, precision (P), and recall (R): binary classification metrics at the 0.5 threshold; F1 is the harmonic mean of P and R. Useful when downstream decisions are binary (alert / no-alert).
  • The May 20, 2026 isotonic recalibration targeted the 0.1 bin specifically; see the recalibration finding.
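The Brier score and the threshold metrics defined in the list above can be computed from raw (predicted, observed) pairs roughly as shown here. The function names are hypothetical, and two details are assumptions rather than documented Sentinel behavior: a forecast counts as a positive call when `predicted >= threshold`, and a metric with a zero denominator is returned as `None`.

```python
def brier_score(pairs):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)


def precision_recall_f1(pairs, threshold=0.5):
    """Binary metrics after thresholding the predicted probabilities."""
    tp = sum(1 for p, y in pairs if p >= threshold and y == 1)
    fp = sum(1 for p, y in pairs if p >= threshold and y == 0)
    fn = sum(1 for p, y in pairs if p < threshold and y == 1)

    # Assumption: undefined ratios are reported as None rather than 0.
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    if precision is None or recall is None or (precision + recall) == 0:
        f1 = None
    else:
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, `precision_recall_f1([(0.9, 1), (0.8, 0), (0.2, 1), (0.1, 0)])` yields one true positive, one false positive, and one false negative, so precision, recall, and F1 all equal 0.5.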

Related