Forecast retrain unblocked: dual-holdout gate (legacy + temporal)

What the gate did, and why it lied

Every Sunday at 02:00 UTC our forecast retrain ran four steps: ingest, build features, train XGBoost, and gate-check the new model against a frozen 1,462-row holdout built April 17, 2026. The gate compared F1 on that holdout — accept if new_f1 ≥ old_f1 - 0.02, otherwise reject and keep the previous model.

On May 3 the new model scored F1 0.491 on the frozen holdout against the old model's F1 0.802. Reject. Same story May 10 and May 17. The April 17 model has been frozen in production for 33 days while the live world has changed underneath it.

Root cause: the holdout aged out of the training distribution

We dug into the per-month positive rate of the forecast training data (the target_7day label — “does this country-day fall within 7 days of a confirmed incident?”):

Oct 2024 – Dec 2025: ~2% positive rate (sparse positives)
Jan 2026: 30% (large jump)
Mar–Apr 2026: 60–79% (continued surge)
May 2026: 36% (high baseline)

Causes are layered: incident dedup in Q1 raised the labeled positives per country-day; ingest expansion (50 → 80 OONI countries) and a CensoredPlanet expansion in February added many more measurable positives; election-cycle activity in the Americas + MENA contributed real-world signal. The result: the new model correctly learns “positives are common,” which produces broad predictions; the frozen holdout still has a 6.9% positive rate, so broad predictions become false positives there.

The fix: dual gate (don't regress on recent, don't catastrophize legacy)

New script scripts/build-forecast-holdout-temporal.py builds a 1,260-row temporal holdout from the last 60 days of the same training pipeline. Built AFTER training so it reflects the current label distribution. The previous frozen holdout is kept untouched.

New gate logic in scripts/compare-models.py:

Hard rule: new_f1_temporal ≥ old_f1_temporal - 0.02 — the new model cannot lose ground on recent reality.
Bounded legacy regression: new_f1_legacy ≥ old_f1_legacy - 0.10 — legacy is allowed to drift 10pp as label distribution evolves, but anything worse than that is treated as catastrophic and rejected.
Degenerate prediction guard preserved: reject if positive rate is 0 or 1.
Backwards-compatible: legacy --holdout only invocation still works for any other caller.

Current state on the gate

On the live model (April 17 frozen), the dual gate now reports legacy_f1=0.840 and temporal_f1=0.864. Next Sunday's retrain will produce a new model and be evaluated under the dual gate. If the new model recovers temporal F1 ≥ 0.86 while keeping legacy F1 ≥ 0.74, it will promote.

What this fixes — and what it doesn't

This unblocks the retrain pipeline. It does not address the underlying labeling-distribution shift in target_7day. The per-month positive-rate jump in Jan 2026 is plausibly correct (incident dedup recovered real signal) but the magnitude in March–April is suspicious. A separate investigation is warranted: does our labeler over-count consecutive-day positives when an incident spans >1 day? Is the 7-day window the right unit? That work is out of scope here — the goal of this fix is “ship a retrain when ground truth shifts,” not “perfect the ground truth.”

Reproducibility

scripts/build-forecast-holdout-temporal.py — produces the temporal holdout parquet + metadata.

scripts/compare-models.py — dual-holdout gate. Accepts --holdout (legacy), --holdout-temporal (recent), --threshold (temporal drop tolerance, default 0.02), --max-legacy-drop (legacy drop tolerance, default 0.10).

weekly-retrain.sh patched to build the temporal holdout after training and pass both holdouts into the gate. Bash syntax-check passes. The old single-holdout invocation still works for backwards compatibility — the temporal holdout is opt-in.