voidly
Sentinel · 30-day backtest

The honest calibration plot

When the Sentinel model says “5% risk,” does the real outcome actually happen ~5% of the time? Below is the answer: 570 live (predicted, observed) pairs from the last 30 days, binned into a standard reliability diagram.

Updated every 30 min · last refresh Jun 8, 2026 · CC BY 4.0 · Binned JSON · Raw outcomes

  • Brier score: 0.307 (lower is better)
  • Calibration MAE: 0.292 (0 = perfect)
  • Accuracy: 54.4% (570 evaluated)
  • F1 (binary, threshold 0.5): 0.44 (P = 0.37, R = 0.53)

Reliability diagram

Each point is one prediction bin. X axis is the mean predicted probability inside the bin; Y axis is the fraction of those forecasts where the real outcome actually happened. Perfect calibration is the diagonal line — points above the line mean the model UNDER-estimates risk; points below mean it OVER-estimates.

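[Figure: reliability diagram. X axis: predicted probability (bin mean), 0.00 to 1.00. Y axis: observed positive rate, 0.00 to 1.00. The diagonal marks perfect calibration. Bin counts shown: n=520, n=26, n=6, n=6, n=1, n=11.]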

Bubble area scales with bin count · Red = model under-estimated · Blue = model over-estimated
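
As a rough illustration of how a diagram like this can be built, the sketch below groups (predicted, observed) pairs into ten equal-width bins and reports each bin's mean prediction, observed positive rate, and count. The function name `reliability_bins`, the bin count, and the toy pairs are assumptions for illustration; Sentinel's actual binning code is not shown on this page.

```python
from collections import defaultdict


def reliability_bins(pairs, n_bins=10):
    """Group (predicted, observed) pairs into equal-width probability bins.

    `pairs` is an iterable of (p, y) with p in [0, 1] and y in {0, 1}.
    Empty bins are omitted from the result.
    """
    # bin index -> [sum of predictions, number of positives, count]
    acc = defaultdict(lambda: [0.0, 0, 0])
    for p, y in pairs:
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        acc[idx][0] += p
        acc[idx][1] += y
        acc[idx][2] += 1

    rows = []
    for idx in sorted(acc):
        total_p, positives, n = acc[idx]
        rows.append({
            "bin": (idx / n_bins, (idx + 1) / n_bins),
            "predicted_mean": total_p / n,    # X coordinate of the point
            "observed_rate": positives / n,   # Y coordinate of the point
            "n": n,                           # drives the bubble area
        })
    return rows


# Toy data (hypothetical, not Sentinel outcomes)
toy_pairs = [(0.05, 0), (0.07, 1), (0.15, 0), (0.95, 1)]
rows = reliability_bins(toy_pairs)
for row in rows:
    print(row["bin"], f'{row["predicted_mean"]:.3f}', f'{row["observed_rate"]:.3f}', row["n"])
```

A point lies above the diagonal when `observed_rate` exceeds `predicted_mean` (risk under-estimated) and below it in the opposite case.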

Per-bin breakdown

Bin          Predicted mean   Observed rate   Δ        n
[0.0, 0.1)   0.051            0.344           +0.294   520
[0.1, 0.2)   0.139            0.115           -0.024   26
[0.3, 0.4)   0.328            0.833           +0.505   6
[0.6, 0.7)   0.691            0.167           -0.524   6
[0.7, 0.8)   0.703            0.000           -0.703   1
[0.8, 0.9)   0.823            0.273           -0.550   11

Δ = observed − predicted. The [0.0, 0.1) bin (the "0.1 bin") holds 520 of the 570 forecasts, about 91%. This is where most of the action happens, and where the May 20 isotonic recalibration was aimed. See /sentinel/calibration for the time-series view of how this gap evolves day over day.
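
The per-bin figures can be cross-checked against the headline Calibration MAE. The sketch below assumes the headline value is a count-weighted mean of |Δ| over the bins; that is an inference from the published numbers rather than a documented formula, and the variable names are illustrative.

```python
# (bin label, delta = observed - predicted, n), copied from the per-bin breakdown
bins = [
    ("[0.0, 0.1)", +0.294, 520),
    ("[0.1, 0.2)", -0.024, 26),
    ("[0.3, 0.4)", +0.505, 6),
    ("[0.6, 0.7)", -0.524, 6),
    ("[0.7, 0.8)", -0.703, 1),
    ("[0.8, 0.9)", -0.550, 11),
]

total = sum(n for _, _, n in bins)

# Each bin's absolute gap, weighted by the number of forecasts in the bin
weighted_mae = sum(abs(delta) * n for _, delta, n in bins) / total

# Plain average over bins, for comparison
unweighted_mae = sum(abs(delta) for _, delta, _ in bins) / len(bins)

print(total, round(weighted_mae, 3), round(unweighted_mae, 3))
# prints: 570 0.292 0.433
```

The weighted figure matches the reported 0.292, and it is dominated by the first bin because that bin carries most of the forecasts.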

Per-country backtest (worst Brier first, n ≥ 5)

Countries where the forecast is currently performing worst — useful for targeting feature engineering or seeking expert review.

Country       Code  Brier  Accuracy  P / R        n   Pos rate
Bangladesh    BD    0.000  0%        0.40 / 0.29  19  0%
Brazil        BR    0.000  0%        0.00 / 0.00  19  0%
Belarus       BY    0.000  0%        0.60 / 0.50  19  0%
China         CN    0.000  0%        0.29 / 0.83  19  0%
Cuba          CU    0.000  0%        0.25 / 0.17  19  0%
Egypt         EG    0.000  0%        0.85 / 0.65  19  0%
ER            ER    0.000  0%        0.00         19  0%
Ethiopia      ET    0.000  0%        0.58 / 1.00  19  0%
Indonesia     ID    0.000  0%        0.71 / 0.56  19  0%
India         IN    0.000  0%        1.00 / 0.53  19  0%
Iran          IR    0.000  0%        0.67 / 0.92  19  0%
North Korea   KP    0.000  0%        0.00         19  0%
Kazakhstan    KZ    0.000  0%        0.00         19  0%
Lebanon       LB    0.000  0%        0.00         19  0%
Myanmar       MM    0.000  0%        0.00         19  0%
Malaysia      MY    0.000  0%        0.00         19  0%
Nigeria       NG    0.000  0%        0.33 / 0.50  19  0%
Nicaragua     NI    0.000  0%        0.36 / 0.67  19  0%
Philippines   PH    0.000  0%        0.00 / 0.00  19  0%
Pakistan      PK    0.000  0%        1.00 / 0.37  19  0%
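
A per-country ranking of this kind can be sketched as a group-by over the raw outcomes: compute a Brier score per country, drop countries with fewer than 5 forecasts, and sort worst first. The record layout `(country, predicted, observed)`, the function name, and the placeholder country codes below are assumptions for illustration, not the Raw outcomes schema.

```python
from collections import defaultdict


def per_country_brier(records, min_n=5):
    """Return (country, brier, n) tuples, worst (highest) Brier first.

    `records` is an iterable of (country, predicted, observed) with
    observed in {0, 1}. Countries with fewer than `min_n` forecasts
    are dropped, mirroring the n >= 5 filter in the table heading.
    """
    squared_errors = defaultdict(list)
    for country, p, y in records:
        squared_errors[country].append((p - y) ** 2)

    rows = [
        (country, sum(errs) / len(errs), len(errs))
        for country, errs in squared_errors.items()
        if len(errs) >= min_n
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Hypothetical records with placeholder codes (not real Sentinel data)
records = (
    [("AA", 0.9, 0)] * 5    # confidently wrong: high Brier
    + [("BB", 0.1, 0)] * 5  # confidently right: low Brier
    + [("CC", 0.5, 1)] * 2  # too few forecasts, filtered out
)
ranking = per_country_brier(records)
for country, brier, n in ranking:
    print(country, f"{brier:.3f}", n)
```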

How to read these numbers

  • Brier score: mean squared error between the predicted probability and the actual 0/1 outcome. Lower is better. Below 0.10 is excellent; 0.10 to 0.30 is OK; above 0.30 is concerning.
  • Calibration MAE: average absolute gap between predicted mean and observed rate across bins, weighted by bin count (the per-bin Δ and n values above reproduce 0.292 this way). 0.00 means the model's probabilities are exactly right on average.
  • Reliability diagram: the visual version of calibration MAE. Bubble size = bin sample count.
  • F1, with precision (P) and recall (R): binary classification metrics at the 0.5 threshold. Useful when downstream decisions are binary (alert / no-alert).
  • The May 20, 2026 isotonic recalibration targeted the 0.1 bin, [0.0, 0.1), specifically; see the recalibration finding.
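
The Brier score and the thresholded metrics described above can be written out in a few lines. This is a minimal sketch using the standard definitions; the function names, the choice of `>=` at the threshold, and the toy pairs are assumptions, not Sentinel's evaluation code.

```python
def brier_score(pairs):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)


def precision_recall_f1(pairs, threshold=0.5):
    """Binary metrics after thresholding; undefined ratios come back as None."""
    tp = sum(1 for p, y in pairs if p >= threshold and y == 1)
    fp = sum(1 for p, y in pairs if p >= threshold and y == 0)
    fn = sum(1 for p, y in pairs if p < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    # Equivalent to 2PR / (P + R) whenever both P and R are defined
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else None
    return precision, recall, f1


# Toy (predicted, observed) pairs, hypothetical
toy = [(0.9, 1), (0.8, 0), (0.2, 1), (0.1, 0)]
toy_brier = brier_score(toy)
toy_p, toy_r, toy_f1 = precision_recall_f1(toy)
print(round(toy_brier, 3), toy_p, toy_r, toy_f1)

# Consistency check on the headline card: F1 from P = 0.37 and R = 0.53
headline_f1 = 2 * 0.37 * 0.53 / (0.37 + 0.53)
print(round(headline_f1, 2))
# prints: 0.44
```

The last check confirms that the reported F1 of 0.44 agrees with the reported precision and recall.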
