What the gate did, and why it lied
Every Sunday at 02:00 UTC our forecast retrain ran four steps:
ingest, build features, train XGBoost, and gate-check the new
model against a frozen 1,462-row holdout built April 17, 2026.
The gate compared F1 on that holdout — accept if
new_f1 ≥ old_f1 - 0.02,
otherwise reject and keep the previous model.
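As a minimal sketch, the original single-holdout rule amounts to one comparison. The function name `single_gate` is hypothetical and is not taken from scripts/compare-models.py; only the 0.02 tolerance and the F1 scores come from this post.

```python
def single_gate(new_f1: float, old_f1: float, tolerance: float = 0.02) -> bool:
    """Accept the new model unless it trails the old one by more than `tolerance`."""
    return new_f1 >= old_f1 - tolerance


# May 3 scores on the frozen holdout: 0.491 is far below 0.802 - 0.02.
print(single_gate(new_f1=0.491, old_f1=0.802))  # False
```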
On May 3 the new model scored F1 0.491 on the frozen holdout against the old model's F1 0.802. Reject. Same story May 10 and May 17. The April 17 model has been frozen in production for 33 days while the live world has changed underneath it.
Root cause: the holdout aged out of the training distribution
We dug into the per-month positive rate of the forecast training
data (the target_7day label —
“does this country-day fall within 7 days of a confirmed
incident?”):
- Oct 2024 – Dec 2025: ~2% positive rate (sparse positives)
- Jan 2026: 30% (large jump)
- Mar–Apr 2026: 60–79% (continued surge)
- May 2026: 36% (high baseline)
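A per-month breakdown like the one above can be computed with a small group-by. This is a sketch using only the standard library. The `(day, label)` row shape and the function name are assumptions for illustration, and the toy rows are not real pipeline data.

```python
from collections import defaultdict
from datetime import date


def monthly_positive_rate(rows):
    """rows: iterable of (day, label) pairs; returns {"YYYY-MM": positive rate}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for day, label in rows:
        key = f"{day.year:04d}-{day.month:02d}"
        totals[key] += 1
        positives[key] += int(label)
    return {k: positives[k] / totals[k] for k in sorted(totals)}


# Toy rows for illustration only.
rows = [
    (date(2025, 12, 1), 0), (date(2025, 12, 2), 0),
    (date(2026, 1, 5), 1), (date(2026, 1, 6), 0),
]
print(monthly_positive_rate(rows))  # {'2025-12': 0.0, '2026-01': 0.5}
```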
The causes are layered. Incident dedup in Q1 raised the number of labeled positives per country-day. Ingest expansion (OONI coverage from 50 to 80 countries, plus a CensoredPlanet expansion in February) added many more measurable positives. Election-cycle activity in the Americas and MENA contributed real-world signal. The result: the new model correctly learns that “positives are common” and predicts positives broadly. The frozen holdout still has a 6.9% positive rate, so many of those broad predictions land on negatives there and count as false positives, which drags down precision and therefore F1.
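The base-rate effect can be shown with a little arithmetic. The sketch below assumes the classifier's true-positive and false-positive rates stay fixed while only the positive rate of the evaluation set changes. The values 0.9 and 0.3 are made up for illustration and are not measurements of our model. Only the 6.9% and roughly 60% positive rates come from this post.

```python
def f1_at_prevalence(tpr: float, fpr: float, prevalence: float) -> float:
    """F1 for a classifier with fixed TPR/FPR on data with the given positive rate."""
    tp = tpr * prevalence                # true positives per row
    fp = fpr * (1.0 - prevalence)        # false positives per row
    fn = (1.0 - tpr) * prevalence        # false negatives per row
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical operating point: TPR 0.9, FPR 0.3.
print(round(f1_at_prevalence(0.9, 0.3, 0.60), 3))   # 0.857 on a recent-like set
print(round(f1_at_prevalence(0.9, 0.3, 0.069), 3))  # 0.303 on a 6.9%-positive set
```

The same predictor loses most of its F1 purely because negatives dominate the older holdout, which matches the pattern the gate saw.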
The fix: dual gate (don't regress on recent, don't catastrophize legacy)
New script scripts/build-forecast-holdout-temporal.py
builds a 1,260-row temporal holdout from the last 60 days of
the same training pipeline. Built AFTER training so it reflects
the current label distribution. The previous frozen holdout is
kept untouched.
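The windowing step can be sketched as a date filter. This is not the contents of scripts/build-forecast-holdout-temporal.py. The row shape, the function name, and the choice of an exclusive lower bound are all assumptions; only the 60-day window comes from this post.

```python
from datetime import date, timedelta


def select_temporal_holdout(rows, as_of, window_days=60):
    """Keep rows whose `day` falls within the last `window_days` days up to `as_of`."""
    cutoff = as_of - timedelta(days=window_days)
    return [r for r in rows if cutoff < r["day"] <= as_of]


# Toy rows for illustration only.
rows = [
    {"day": date(2026, 1, 10), "target_7day": 0},
    {"day": date(2026, 4, 1), "target_7day": 1},
    {"day": date(2026, 5, 16), "target_7day": 1},
]
recent = select_temporal_holdout(rows, as_of=date(2026, 5, 17))
print([r["day"].isoformat() for r in recent])  # ['2026-04-01', '2026-05-16']
```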
New gate logic in scripts/compare-models.py:
- Hard rule: new_f1_temporal ≥ old_f1_temporal - 0.02. The new model cannot lose ground on recent reality.
- Bounded legacy regression: new_f1_legacy ≥ old_f1_legacy - 0.10. Legacy is allowed to drift 10pp as the label distribution evolves, but anything worse than that is treated as catastrophic and rejected.
- Degenerate prediction guard preserved: reject if positive rate is 0 or 1.
- Backwards-compatible: invoking with only --holdout (the legacy form) still works for any other caller.
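The rules above can be sketched as a single decision function. This is an illustration, not the code in scripts/compare-models.py. The names `dual_gate` and `GateResult` are hypothetical, and treating "positive rate" as the new model's predicted positive rate is an assumption.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    accept: bool
    reason: str


def dual_gate(new_temporal, old_temporal, new_legacy, old_legacy,
              predicted_positive_rate, threshold=0.02, max_legacy_drop=0.10):
    """Apply the degenerate guard, then the temporal rule, then the legacy bound."""
    # Assumed meaning of the guard: the model predicts all-negative or all-positive.
    if predicted_positive_rate <= 0.0 or predicted_positive_rate >= 1.0:
        return GateResult(False, "degenerate predictions")
    if new_temporal < old_temporal - threshold:
        return GateResult(False, "temporal regression")
    if new_legacy < old_legacy - max_legacy_drop:
        return GateResult(False, "catastrophic legacy regression")
    return GateResult(True, "ok")


# Old scores are the ones reported for the live model; new scores are made up.
print(dual_gate(0.87, 0.864, 0.75, 0.840, predicted_positive_rate=0.4))
```

Checking the temporal rule before the legacy bound only affects which rejection reason is reported; a model must pass all three checks to be accepted.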
Current state on the gate
On the live model (frozen since April 17), the dual gate now reports
legacy_f1=0.840 and temporal_f1=0.864. Next Sunday's retrain will
produce a new model, which will be evaluated under the dual gate.
With the default tolerances, it will be promoted if it scores
temporal F1 ≥ 0.844 (0.864 - 0.02) and legacy F1 ≥ 0.740
(0.840 - 0.10).
What this fixes — and what it doesn't
This unblocks the retrain pipeline. It does not address the
underlying labeling-distribution shift in
target_7day. The per-month
positive-rate jump in Jan 2026 is plausibly correct (incident
dedup recovered real signal) but the magnitude in March–April
is suspicious. A separate investigation is warranted: does our
labeler over-count consecutive-day positives when an incident
spans >1 day? Is the 7-day window the right unit? That work
is out of scope here — the goal of this fix is “ship a
retrain when ground truth shifts,” not “perfect the
ground truth.”
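To make the over-counting question concrete, here is a toy model of a window labeler. It is not our labeler. It assumes a forward-looking rule (a day is positive if an incident day falls in the following 7 days) and assumes a multi-day incident is recorded as one incident day per calendar day. Both assumptions would need to be checked against the real code.

```python
from datetime import date, timedelta


def positive_days(incident_days, window=7):
    """Days d labeled positive under the assumed rule: an incident in (d, d + window]."""
    out = set()
    for inc in incident_days:
        for k in range(1, window + 1):
            out.add(inc - timedelta(days=k))
    return out


onset = date(2026, 3, 10)  # hypothetical incident start
five_day = [onset + timedelta(days=i) for i in range(5)]

print(len(positive_days([onset])))   # 7: incident recorded once, at onset
print(len(positive_days(five_day)))  # 11: same incident recorded on each of 5 days
```

Under these assumptions an incident spanning k days yields 7 + k - 1 positive country-days instead of 7, so long-running incidents alone could inflate the positive rate.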
Reproducibility
- scripts/build-forecast-holdout-temporal.py: produces the temporal holdout parquet + metadata.
- scripts/compare-models.py: dual-holdout gate. Accepts --holdout (legacy), --holdout-temporal (recent), --threshold (temporal drop tolerance, default 0.02), --max-legacy-drop (legacy drop tolerance, default 0.10).
- weekly-retrain.sh: patched to build the temporal holdout after training and pass both holdouts into the gate. Bash syntax check passes. The old single-holdout invocation still works for backwards compatibility; the temporal holdout is opt-in.
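For readers wiring up a similar gate, the flag surface described above could be declared with argparse roughly as follows. Only the four flag names and the two defaults come from this post. Making --holdout required, the help strings, and the file name in the example are assumptions, and the real script likely takes further arguments (for example, model paths) that are not shown here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="Dual-holdout model gate (illustrative sketch)")
    # Assumed required, since the legacy invocation passes only this flag.
    p.add_argument("--holdout", required=True, help="legacy frozen holdout")
    # Opt-in: when omitted, the gate would fall back to single-holdout behavior.
    p.add_argument("--holdout-temporal", default=None, help="recent temporal holdout")
    p.add_argument("--threshold", type=float, default=0.02,
                   help="temporal drop tolerance")
    p.add_argument("--max-legacy-drop", type=float, default=0.10,
                   help="legacy drop tolerance")
    return p


# Legacy-style invocation with a hypothetical file name.
args = build_parser().parse_args(["--holdout", "legacy.parquet"])
print(args.holdout_temporal is None, args.threshold, args.max_legacy_drop)
```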