Our shutdown-forecast model, Voidly Sentinel, was previously advertised at 99.8% F1 / 1.000 AUC. Those numbers come from a stratified random holdout — and they're inflated by a known statistical issue called within-country temporal leakage: events from the same country end up in both train and test, so the model effectively memorizes country-specific patterns instead of generalizing.
Our own /v1/sentinel/accuracy endpoint flagged this in
a self-published warning months ago: "Stratified AUC overstates
real-world performance by 47.9pp vs. time-based split. Do not cite
the stratified number as a deployment figure." But the website
kept citing 99.8%.
On May 20, 2026 we fixed the contradiction. The /methodology page now publishes all three splits, leads with the honest LOCO median (0.91 AUC, 0.55 F1), and keeps the stratified number visible with a "do not cite" tag.
The three splits, explained
- Stratified random 15% holdout: AUC 0.98, F1 0.79. The classical ML baseline. Inflated for this problem because it allows within-country temporal leakage. Don't cite.
- Time-based split (train pre-T, test post-T): AUC 0.50, F1 0.00. The honest deployment baseline. The model is random on novel events because it hasn't seen enough new-pattern data yet. This is the floor.
- Leave-country-out (LOCO) median: AUC 0.91, F1 0.55. For each country, train on the other 18, test on the held-out one. Take the median across 19 holdouts. This is the honest generalization figure — and what we cite for the model now. Individual countries range from F1 0.19 (Cuba) to 0.84 (Indonesia).
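To make the difference between the splits concrete, here is a minimal sketch of how a time-based split and a leave-country-out split partition the same events. It is an illustration, not our training pipeline: the event tuples, the function names (time_based_split, loco_splits) and the per-country scores are all made up for the example.

```python
from statistics import median

# Hypothetical event records: (country, day_index, label). Not real Sentinel data.
events = [
    ("A", 1, 0), ("A", 5, 1), ("A", 9, 1),
    ("B", 2, 0), ("B", 6, 0), ("B", 8, 1),
    ("C", 3, 1), ("C", 4, 0), ("C", 7, 0),
]

def time_based_split(rows, cutoff):
    """Train on events strictly before the cutoff day, test on the rest."""
    train = [r for r in rows if r[1] < cutoff]
    test = [r for r in rows if r[1] >= cutoff]
    return train, test

def loco_splits(rows):
    """Yield (held_out_country, train, test), one fold per country."""
    for held_out in sorted({r[0] for r in rows}):
        train = [r for r in rows if r[0] != held_out]
        test = [r for r in rows if r[0] == held_out]
        yield held_out, train, test

# Time-based split: every training event precedes every test event.
train, test = time_based_split(events, cutoff=6)
time_ordered = max(r[1] for r in train) < min(r[1] for r in test)

# LOCO: the held-out country never appears in its own training fold.
folds = list(loco_splits(events))
loco_disjoint = all(
    held_out not in {r[0] for r in tr} for held_out, tr, _ in folds
)

# The headline figure is the median of per-country scores.
# These scores are placeholders, not measured values.
per_country_f1 = {"A": 0.2, "B": 0.5, "C": 0.8}
loco_median_f1 = median(per_country_f1.values())
```

A stratified random holdout guarantees neither property: it can place two events from the same country, days apart, on opposite sides of the split. That is the leakage described above.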
Why this matters
Forecast models in this space get used to brief journalists, threat-intel teams, and on-call infrastructure operators. Citing an inflated F1 leads them to over-trust the model. Citing the honest LOCO number lets them set realistic expectations for how often a forecast will be wrong.
Live calibration metrics are now published nightly at /sentinel/calibration — a 90-day rolling time series of empirical conformal-interval coverage vs the nominal 90% target. As of the latest snapshot, prod_rolling shows accuracy 49%, Brier 0.59, calibration MAE 0.60. The model is over-confident on low-risk predictions (66% of "low risk" predictions had real incidents); a recalibration is queued.
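For readers who want to sanity-check numbers like these, here is a rough sketch of two of the quantities involved: the Brier score and the empirical coverage of prediction intervals. It uses the standard textbook definitions and toy inputs. The function names and data are assumptions for illustration, and this is not necessarily how the nightly job computes its figures.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def empirical_coverage(intervals, actuals):
    """Fraction of actual values that fall inside their (low, high) interval."""
    hits = sum(low <= a <= high for (low, high), a in zip(intervals, actuals))
    return hits / len(actuals)

# Toy inputs, not Sentinel output.
probs = [0.1, 0.9, 0.8, 0.3]
outcomes = [0, 1, 0, 1]
score = brier_score(probs, outcomes)

intervals = [(0, 1), (2, 3), (4, 5), (6, 7), (1, 2)]
actuals = [0.5, 2.5, 5.5, 6.0, 1.5]
coverage = empirical_coverage(intervals, actuals)

# Gap between observed coverage and the nominal 90% target.
nominal = 0.90
coverage_gap = coverage - nominal
```

A negative coverage_gap means the intervals contain the truth less often than the nominal target promises, which is the kind of shortfall the rolling series is meant to surface.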
How to cite our model
If you're citing Voidly Sentinel in a paper, news story, or threat-intel report, use one of these depending on the claim you are making:
- For headline academic claims → LOCO median (0.91 AUC, 0.55 F1)
- For deployment-tolerance claims → time-based (0.50 AUC, 0.00 F1) as the floor
- For "the model agrees with our existing labels" → stratified 0.98 with the inflation caveat
Code, labels, and the full LOCO per-country breakdown are CC BY 4.0 and
reproducible — the per-country LOCO JSON is at
/v1/sentinel/accuracy.