voidly
Sentinel · 30-day backtest

The honest calibration plot

When the Sentinel model says “5% risk,” does the real outcome actually happen ~5% of the time? Below is the answer: 1,140 live (predicted, observed) pairs from the last 30 days, binned into a standard reliability diagram.

Updated every 30 min · last refresh Jul 5, 2026 · CC BY 4.0 · Binned JSON · Raw outcomes

  • Brier score: 0.214 (lower is better)
  • Calibration MAE: 0.190 (0 = perfect)
  • Accuracy: 60.5% (1,140 evaluated)
  • F1 (binary 0.5): 0.34 (P=0.26 R=0.50)

Reliability diagram

Each point is one prediction bin. X axis is the mean predicted probability inside the bin; Y axis is the fraction of those forecasts where the real outcome actually happened. Perfect calibration is the diagonal line — points above the line mean the model UNDER-estimates risk; points below mean it OVER-estimates.
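The binning described above can be sketched as follows. This is a minimal illustration, not the Sentinel pipeline: the `Bin` class and `reliability_bins` function are hypothetical names, and the choice to clamp a prediction of exactly 1.0 into the last bin is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Bin:
    lo: float              # inclusive lower edge
    hi: float              # exclusive upper edge
    n: int                 # number of forecasts in the bin
    predicted_mean: float  # X coordinate on the reliability diagram
    observed_rate: float   # Y coordinate on the reliability diagram


def reliability_bins(probs, outcomes, n_bins=10):
    """Group (predicted, observed) pairs into equal-width probability bins."""
    sum_p = [0.0] * n_bins
    sum_y = [0] * n_bins
    count = [0] * n_bins
    for p, y in zip(probs, outcomes):
        # Assumption: p == 1.0 is clamped into the top bin.
        idx = min(int(p * n_bins), n_bins - 1)
        sum_p[idx] += p
        sum_y[idx] += y
        count[idx] += 1
    bins = []
    for i in range(n_bins):
        if count[i] == 0:
            continue  # empty bins produce no point on the diagram
        bins.append(Bin(i / n_bins, (i + 1) / n_bins, count[i],
                        sum_p[i] / count[i], sum_y[i] / count[i]))
    return bins


# Toy data, not Sentinel outcomes.
toy_bins = reliability_bins([0.05, 0.08, 0.15, 0.95, 1.0], [0, 1, 1, 0, 1])
for b in toy_bins:
    side = "under-estimates" if b.observed_rate > b.predicted_mean else "over-estimates"
    print(f"[{b.lo:.1f}, {b.hi:.1f}) n={b.n} model {side} risk")
```

A point with `observed_rate` above `predicted_mean` sits above the diagonal, which is the under-estimation case.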

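[Figure: reliability diagram. X axis: predicted probability (bin mean), 0.00 to 1.00. Y axis: observed positive rate, 0.00 to 1.00. The diagonal is labeled "perfect". Bubble labels give the bin counts: n=845, 56, 10, 3, 49, 19, 75, 29, 16, 38.]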

Bubble area scales with bin count · Red = model under-estimated · Blue = model over-estimated

Per-bin breakdown

Bin          Predicted mean   Observed rate   Δ         n
[0.0, 0.1)   0.042            0.168           +0.126    845
[0.1, 0.2)   0.159            0.339           +0.181    56
[0.2, 0.3)   0.246            0.300           +0.054    10
[0.3, 0.4)   0.328            0.333           +0.006    3
[0.4, 0.5)   0.451            0.633           +0.182    49
[0.5, 0.6)   0.570            0.263           -0.307    19
[0.6, 0.7)   0.644            0.267           -0.377    75
[0.7, 0.8)   0.757            0.207           -0.550    29
[0.8, 0.9)   0.851            0.063           -0.789    16
[0.9, 1.0)   0.949            0.210           -0.738    38

Δ = observed − predicted. The [0.0, 0.1) bin holds 845 of the 1,140 forecasts (about 74%), so it carries most of the weight in any count-weighted metric, and it is the bin the May 20 isotonic recalibration was aimed at. See /sentinel/calibration for the time-series view of how this gap evolves day over day.
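As a cross-check, the per-bin figures can be aggregated by hand. The sketch below copies the rounded values from the per-bin breakdown and computes both an unweighted and a count-weighted mean of |Δ|. The function names are hypothetical. Because the inputs are rounded to three decimals, the last digit can differ slightly from what the raw data would give.

```python
# (bin label, predicted mean, observed rate, n), copied from the per-bin breakdown.
BINS = [
    ("[0.0, 0.1)", 0.042, 0.168, 845),
    ("[0.1, 0.2)", 0.159, 0.339, 56),
    ("[0.2, 0.3)", 0.246, 0.300, 10),
    ("[0.3, 0.4)", 0.328, 0.333, 3),
    ("[0.4, 0.5)", 0.451, 0.633, 49),
    ("[0.5, 0.6)", 0.570, 0.263, 19),
    ("[0.6, 0.7)", 0.644, 0.267, 75),
    ("[0.7, 0.8)", 0.757, 0.207, 29),
    ("[0.8, 0.9)", 0.851, 0.063, 16),
    ("[0.9, 1.0)", 0.949, 0.210, 38),
]


def unweighted_gap(bins):
    """Plain mean of |observed - predicted| over bins."""
    return sum(abs(obs - pred) for _, pred, obs, _ in bins) / len(bins)


def weighted_gap(bins):
    """Mean of |observed - predicted| weighted by bin count."""
    total = sum(n for *_, n in bins)
    return sum(abs(obs - pred) * n for _, pred, obs, n in bins) / total


total_n = sum(n for *_, n in BINS)
print(total_n)                          # total forecasts across bins
print(round(unweighted_gap(BINS), 3))   # every bin counts equally
print(round(weighted_gap(BINS), 3))     # the 845-forecast bin dominates
```

The count-weighted figure comes out at about 0.190, which agrees with the headline Calibration MAE, while the unweighted mean is about 0.331. This suggests, though the page does not state it, that the headline number is weighted by bin count.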

Per-country backtest (worst Brier first, n ≥ 5)

Countries where the forecast currently performs worst. Useful for targeting feature engineering or seeking expert review.

Country       Code   Brier   Accuracy   P / R         n    Pos rate
Bangladesh    BD     0.000   0%         0.00 / 0.00   38   0%
Brazil        BR     0.000   0%         0.00          38   0%
Belarus       BY     0.000   0%         0.36 / 0.50   38   0%
China         CN     0.000   0%         0.00          38   0%
Cuba          CU     0.000   0%         0.15 / 0.22   38   0%
Egypt         EG     0.000   0%         0.76 / 0.96   38   0%
ER            ER     0.000   0%         0.00          38   0%
Ethiopia      ET     0.000   0%         0.29 / 1.00   38   0%
Indonesia     ID     0.000   0%         0.00          38   0%
India         IN     0.000   0%         0.57 / 0.35   38   0%
Iran          IR     0.000   0%         0.73 / 0.46   38   0%
North Korea   KP     0.000   0%         0.00          38   0%
Kazakhstan    KZ     0.000   0%         0.06 / 0.25   38   0%
Lebanon       LB     0.000   0%         0.00 / 0.00   38   0%
Myanmar       MM     0.000   0%         0.30 / 0.30   38   0%
Malaysia      MY     0.000   0%         0.00          38   0%
Nigeria       NG     0.000   0%         0.00          38   0%
Nicaragua     NI     0.000   0%         0.00          38   0%
Philippines   PH     0.000   0%         0.00          38   0%
Pakistan      PK     0.000   0%         0.26 / 0.90   38   0%

How to read these numbers

  • Brier score: mean squared error between the predicted probability and the actual 0/1 outcome. Lower is better. Below 0.10 is excellent; 0.10-0.30 is OK; above 0.30 is concerning.
  • Calibration MAE: average gap between predicted mean and observed rate across bins, weighted by bin count (the headline 0.190 matches the count-weighted mean of |Δ| in the per-bin table). 0.00 means the model's probabilities are exactly right on average within every bin.
  • Reliability diagram: the visual version of calibration MAE. Bubble area scales with bin sample count.
  • F1 (with P and R): binary classification metrics at the 0.5 threshold, where P is precision, R is recall, and F1 is their harmonic mean. Useful when downstream decisions are binary (alert / no-alert).
  • The May 20, 2026 isotonic recalibration targeted the [0.0, 0.1) bin specifically; see the recalibration finding.
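The definitions above can be written out directly. The following is a minimal sketch on toy data, not Sentinel's evaluation code. The function names are hypothetical, and treating a prediction of exactly 0.5 as an alert (`>=`) is an assumption, since the page only says "at the 0.5 threshold".

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(probs, outcomes, threshold=0.5):
    """Binary metrics after thresholding. Assumption: p >= threshold means alert."""
    tp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, outcomes) if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from_pr(precision, recall)


# Toy data, not Sentinel outcomes.
probs = [0.9, 0.8, 0.2, 0.6, 0.1]
outcomes = [1, 0, 0, 1, 1]
print(round(brier_score(probs, outcomes), 3))
print(precision_recall_f1(probs, outcomes))

# Consistency check on the headline numbers: P=0.26 and R=0.50 imply F1 of about 0.34.
print(round(f1_from_pr(0.26, 0.50), 2))
```

For scale, a forecaster that always outputs 0.5 scores a Brier of exactly 0.25 regardless of the outcomes, which is a useful reference point when reading the headline 0.214.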

Related