Sentinel · 30-day backtest

The honest calibration plot

When the Sentinel model says “5% risk,” does the real outcome actually happen ~5% of the time? Below is the answer: 1,140 live (predicted, observed) pairs from the last 30 days, binned into a standard reliability diagram.

Updated every 30 min · last refresh Jul 5, 2026 · CC BY 4.0 · Binned JSON · Raw outcomes

Metric	Value	Note
Brier score	0.214	lower is better
Calibration MAE	0.190	0 = perfect
Accuracy	60.5%	1,140 evaluated
F1 (binary 0.5)	0.34	P=0.26 R=0.50

Reliability diagram

Each point is one prediction bin. X axis is the mean predicted probability inside the bin; Y axis is the fraction of those forecasts where the real outcome actually happened. Perfect calibration is the diagonal line — points above the line mean the model UNDER-estimates risk; points below mean it OVER-estimates.

Bubble area scales with bin count · Red = model under-estimated · Blue = model over-estimated
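The binning behind a diagram like this can be sketched as follows. This is a minimal illustration assuming ten equal-width bins, which matches the bin edges in the per-bin table; the function name `reliability_bins` and the toy pairs are hypothetical and are not Sentinel code or Sentinel data.

```python
from collections import defaultdict


def reliability_bins(pairs, n_bins=10):
    """Group (predicted, observed) pairs into equal-width probability bins.

    `pairs` is an iterable of (predicted probability, observed 0/1 outcome).
    Returns one row per non-empty bin, ordered by bin.
    """
    grouped = defaultdict(list)
    for predicted, observed in pairs:
        # Clamp so that a prediction of exactly 1.0 falls in the last bin.
        index = min(int(predicted * n_bins), n_bins - 1)
        grouped[index].append((predicted, observed))

    rows = []
    for index in sorted(grouped):
        members = grouped[index]
        n = len(members)
        predicted_mean = sum(p for p, _ in members) / n  # X coordinate
        observed_rate = sum(o for _, o in members) / n   # Y coordinate
        rows.append({
            "bin": (index / n_bins, (index + 1) / n_bins),
            "predicted_mean": predicted_mean,
            "observed_rate": observed_rate,
            # Positive delta: point above the diagonal (risk under-estimated).
            "delta": observed_rate - predicted_mean,
            "n": n,
        })
    return rows


# Toy data for illustration only.
toy_pairs = [(0.05, 0), (0.05, 1), (0.15, 0), (0.95, 1)]
toy_rows = reliability_bins(toy_pairs)
for row in toy_rows:
    print(row)
```

Each returned row corresponds to one bubble: `predicted_mean` is the X position, `observed_rate` the Y position, and `n` drives the bubble area.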

Per-bin breakdown

Bin	Predicted mean	Observed rate	Δ	n
[0.0, 0.1)	0.042	0.168	+0.126	845
[0.1, 0.2)	0.159	0.339	+0.181	56
[0.2, 0.3)	0.246	0.300	+0.054	10
[0.3, 0.4)	0.328	0.333	+0.006	3
[0.4, 0.5)	0.451	0.633	+0.182	49
[0.5, 0.6)	0.570	0.263	-0.307	19
[0.6, 0.7)	0.644	0.267	-0.377	75
[0.7, 0.8)	0.757	0.207	-0.550	29
[0.8, 0.9)	0.851	0.063	-0.789	16
[0.9, 1.0)	0.949	0.210	-0.738	38

Δ = observed − predicted. The [0.0, 0.1) bin holds 845 of the 1,140 forecasts (74%), so it carries most of the weight in any count-weighted metric; it is also the bin the May 20 isotonic recalibration was aimed at. See /sentinel/calibration for the time-series view of how this gap evolves day over day.
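As a worked check, the headline calibration MAE can be recomputed from the per-bin table. The sketch below assumes the headline figure is a count-weighted mean of |Δ|; that weighting is inferred from the numbers, since the page does not state it. The variable names are illustrative.

```python
# (predicted mean, observed rate, n) copied from the per-bin table.
bins = [
    (0.042, 0.168, 845),
    (0.159, 0.339, 56),
    (0.246, 0.300, 10),
    (0.328, 0.333, 3),
    (0.451, 0.633, 49),
    (0.570, 0.263, 19),
    (0.644, 0.267, 75),
    (0.757, 0.207, 29),
    (0.851, 0.063, 16),
    (0.949, 0.210, 38),
]

total = sum(n for _, _, n in bins)

# Count-weighted: each bin contributes in proportion to its forecasts.
weighted_mae = sum(n * abs(obs - pred) for pred, obs, n in bins) / total

# Unweighted: every bin counts equally, however few forecasts it holds.
unweighted_mae = sum(abs(obs - pred) for pred, obs, _ in bins) / len(bins)

print(f"forecasts:      {total}")
print(f"weighted MAE:   {weighted_mae:.3f}")
print(f"unweighted MAE: {unweighted_mae:.3f}")
```

The weighted figure comes out at about 0.190, in line with the headline value, while the unweighted figure is about 0.331. The difference comes from the sparsely populated high-probability bins, which have the largest gaps but few forecasts.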

Per-country backtest (worst Brier first, n ≥ 5)

Countries where the forecast is currently performing worst — useful for targeting feature engineering or seeking expert review.

Country	Code	Accuracy	Precision	Recall	n	Pos rate
Bangladesh	BD	0%	0.00	0.00	38	0%
Brazil	BR	0%	0.00	—	38	0%
Belarus	BY	0%	0.36	0.50	38	0%
China	CN	0%	0.00	—	38	0%
Cuba	CU	0%	0.15	0.22	38	0%
Egypt	EG	0%	0.76	0.96	38	0%
ER	ER	0%	0.00	—	38	0%
Ethiopia	ET	0%	0.29	1.00	38	0%
Indonesia	ID	0%	0.00	—	38	0%
India	IN	0%	0.57	0.35	38	0%
Iran	IR	0%	0.73	0.46	38	0%
North Korea	KP	0%	0.00	—	38	0%
Kazakhstan	KZ	0%	0.06	0.25	38	0%
Lebanon	LB	0%	0.00	0.00	38	0%
Myanmar	MM	0%	0.30	0.30	38	0%
Malaysia	MY	0%	0.00	—	38	0%
Nigeria	NG	0%	0.00	—	38	0%
Nicaragua	NI	0%	0.00	—	38	0%
Philippines	PH	0%	0.00	—	38	0%
Pakistan	PK	0%	0.26	0.90	38	0%
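A "worst Brier first, n ≥ 5" ordering can be produced along the following lines. This is a hedged sketch: the record layout `(country_code, predicted, observed)`, the function name, and the placeholder codes `AA`, `BB`, `CC` are assumptions for illustration, not the Sentinel schema or real results.

```python
from collections import defaultdict


def rank_countries_by_brier(records, min_n=5):
    """Return (country, brier, n) tuples, worst (highest) Brier first.

    Countries with fewer than `min_n` evaluated forecasts are dropped,
    because a Brier score over a handful of forecasts is mostly noise.
    """
    squared_errors = defaultdict(list)
    for country, predicted, observed in records:
        squared_errors[country].append((predicted - observed) ** 2)

    ranked = [
        (country, sum(errors) / len(errors), len(errors))
        for country, errors in squared_errors.items()
        if len(errors) >= min_n
    ]
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked


# Toy records with placeholder country codes.
toy_records = (
    [("AA", 0.9, 0)] * 5    # confident and wrong: high Brier
    + [("BB", 0.1, 0)] * 5  # low risk, nothing happened: low Brier
    + [("CC", 0.9, 0)] * 2  # too few forecasts, filtered out
)
ranking = rank_countries_by_brier(toy_records)
for country, brier, n in ranking:
    print(f"{country}\t{brier:.2f}\t{n}")
```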

How to read these numbers

Brier score: mean squared error between the predicted probability and the actual 0/1 outcome. Lower is better: below 0.10 is excellent, 0.10 to 0.30 is OK, above 0.30 is concerning.
Calibration MAE: mean absolute gap between predicted mean and observed rate across bins, weighted by bin count. 0.00 means every bin's observed rate matches its predicted mean.
Reliability diagram: the visual version of calibration MAE. Bubble area = bin sample count.
F1, P, R: F1 score, precision, and recall at the 0.5 threshold. F1 is the harmonic mean of P and R. Useful when downstream decisions are binary (alert / no-alert).
The May 20, 2026 isotonic recalibration targeted the [0.0, 0.1) bin specifically; see the recalibration finding.
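The Brier score and the thresholded metrics defined above can be computed from raw (predicted, observed) pairs as in this minimal sketch. The function names and toy pairs are illustrative, and treating a prediction of exactly 0.5 as an alert (`>=`) is an assumption; the page does not say how ties at the threshold are handled.

```python
def brier_score(pairs):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)


def binary_metrics(pairs, threshold=0.5):
    """Accuracy, precision, recall and F1 after thresholding predictions."""
    tp = fp = fn = tn = 0
    for predicted, observed in pairs:
        alert = predicted >= threshold  # assumption: ties count as alerts
        if alert and observed == 1:
            tp += 1
        elif alert:
            fp += 1
        elif observed == 1:
            fn += 1
        else:
            tn += 1

    # Precision or recall is undefined (None) when its denominator is zero.
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    if precision is None or recall is None or precision + recall == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy data for illustration only.
toy_pairs = [(0.9, 1), (0.8, 0), (0.2, 0), (0.1, 1)]
toy_brier = brier_score(toy_pairs)
toy_metrics = binary_metrics(toy_pairs)
print(toy_brier, toy_metrics)

# Consistency check on the headline numbers: harmonic mean of P and R.
published_f1 = 2 * 0.26 * 0.50 / (0.26 + 0.50)
print(round(published_f1, 2))
```

The last two lines confirm that the headline F1 of 0.34 is the harmonic mean of the reported P=0.26 and R=0.50.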

Related

/sentinel/calibration: 90-day time series of empirical coverage vs the 90% conformal target
/methodology#validation: the three honest accuracy splits (LOCO, stratified, time-based)
/atlas/forecast/IR: per-country calibrated forecast detail with SHAP drivers
