voidly

Forecast v2 contagion: huge aggregate wins, IR regresses 27pp. Held back.

Applied the classifier v3.3 regime-weighted-contagion playbook to the XGBoost forecast model. Stratified F1 rises 4.9pp and LOCO median F1 rises 17.8pp (more than the classifier gained), with 15 of 19 countries improving. But Iran, a flagship country, regresses 27.4pp F1 because its neighbors have essentially no positive correlation with it. Honest no-promote.

#methodology #ml #forecast #contagion #iran #honest-no-promote

Classifier v3.3 added regime-similarity-weighted neighbor features and lifted LOCO median F1 +5pp. The natural next move: try the same approach on the XGBoost forecast model.

Forecast v2 contagion added the same 3 features (neighbor_block_rate_7d, neighbor_incident_count_7d, neighbor_max_anomaly_7d), same UN M49 + hand-added adjacency, same regime weighting (Pearson r on daily block_rate). 5,948 of 14,620 training rows have non-zero contagion.
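As a rough illustration of what a regime-weighted neighbor feature can look like, here is a minimal sketch in plain Python. It is not the build script's code. The helper names, the toy series, the choice to clip negative correlations to zero, and the trailing window that includes the current day are all assumptions made for the example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def regime_weights(target, neighbors, block_rate):
    """Weight each neighbor by its correlation with the target's daily block_rate.

    Assumption: negative correlations are clipped to 0 so they contribute nothing.
    """
    return {
        n: max(pearson_r(block_rate[target], block_rate[n]), 0.0)
        for n in neighbors
    }

def neighbor_block_rate_7d(day, weights, block_rate):
    """Weighted mean of each neighbor's trailing 7-day mean block_rate.

    Assumption: the window ends at `day` inclusive. Returns 0.0 when no
    neighbor carries weight.
    """
    total_w = sum(weights.values())
    if total_w == 0:
        return 0.0
    lo = max(0, day - 6)
    acc = 0.0
    for n, w in weights.items():
        window = block_rate[n][lo:day + 1]
        acc += w * (sum(window) / len(window))
    return acc / total_w

# Toy daily block_rate series for three hypothetical countries.
block_rate = {
    "A": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    "B": [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40],  # moves with A
    "C": [0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1],          # moves against A
}
weights = regime_weights("A", ["B", "C"], block_rate)
feature = neighbor_block_rate_7d(7, weights, block_rate)
```

In this toy setup only B contributes to A's feature, because C's correlation is negative and is clipped to zero. A country whose neighbors are all uncorrelated or negatively correlated ends up with a feature of 0.0.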

Aggregate metrics — bigger wins than the classifier

Split         v1 AUC   v2 AUC   v1 F1    v2 F1    Δ F1
Stratified    0.9803   0.9824   0.7945   0.8430   +4.9pp
Time-based    0.5009   0.5022   0.0000   0.6797   +68pp*
LOCO median   0.9051   0.9083   0.5528   0.7305   +17.8pp

* v1 time-based F1=0 because thresholding broke on the contiguous post-T block; v2 found a usable threshold. Not a fair comparison.

Forecast benefits MORE from contagion than the classifier did (+17.8pp LOCO median F1 vs. the classifier's +5pp). Reason: forecast has ~731 rows per country (sparse signal). Cross-country borrowing helps more here than on the classifier's 32-sample-per-country distribution.
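For readers unfamiliar with the LOCO (leave-one-country-out) median figure, the following is a minimal sketch of how such a number can be computed. It uses a trivial threshold "model" in place of XGBoost. The `fit` and `predict` callables, the row layout, and the toy data are hypothetical and are not taken from the training script.

```python
from statistics import median

def f1_score(y_true, y_pred):
    """Binary F1; defined as 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def loco_median_f1(rows, fit, predict):
    """rows: list of (country, feature, label).

    For each country, train on every other country, score on the held-out
    one, then take the median of the per-country F1 values.
    """
    scores = {}
    for held_out in sorted({c for c, _, _ in rows}):
        train = [r for r in rows if r[0] != held_out]
        test = [r for r in rows if r[0] == held_out]
        model = fit(train)
        preds = [predict(model, x) for _, x, _ in test]
        scores[held_out] = f1_score([y for _, _, y in test], preds)
    return median(scores.values()), scores

# Stand-in model (hypothetical): a fixed 0.5 threshold on a single feature.
def fit(train_rows):
    return 0.5

def predict(threshold, x):
    return int(x >= threshold)

rows = [
    ("A", 0.9, 1), ("A", 0.1, 0),
    ("B", 0.4, 1), ("B", 0.2, 0),
    ("C", 0.7, 1), ("C", 0.6, 0),
]
loco_median, per_country = loco_median_f1(rows, fit, predict)
```

The median is robust to a single badly scoring country, which is why a per-country breakdown is still needed alongside it.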

15 of 19 countries improve

Including the v1 worst-performers:

  • Turkmenistan (TM): +10.3pp F1 (was lowest LOCO at 0.12)
  • Cuba (CU): +11.2pp
  • Belarus (BY): +19.9pp
  • Russia (RU): +17.5pp

The deal-breaker: Iran regresses 27.4pp

Promote gate: "no country regresses >5pp F1". IR fails by 22pp. IR's F1 collapsed from 0.51 to 0.235; recall went from 67% to 17%.
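The gate itself is simple enough to sketch in a few lines. The function name and dictionary layout are assumptions for illustration. The IR values are the rounded F1 figures quoted above, so the computed drop comes out at roughly 27.5pp rather than the reported 27.4pp. The TM v2 value is derived from its v1 F1 of 0.12 plus the reported +10.3pp.

```python
def failing_countries(f1_v1, f1_v2, max_regression_pp=5.0):
    """Return {country: delta_pp} for countries whose F1 drops by more than the limit."""
    failures = {}
    for country, old in f1_v1.items():
        delta_pp = (f1_v2[country] - old) * 100.0
        if delta_pp < -max_regression_pp:
            failures[country] = round(delta_pp, 1)
    return failures

# Rounded per-country LOCO F1 values taken or derived from this post.
f1_v1 = {"IR": 0.51, "TM": 0.12}
f1_v2 = {"IR": 0.235, "TM": 0.223}  # TM: 0.12 + 10.3pp (derived)

failures = failing_countries(f1_v1, f1_v2)
promote = not failures  # promote only if no country fails the gate
```

A gate like this deliberately ignores aggregate gains: a single failing country blocks promotion no matter how large the median improvement is.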

Why: within the 2-year evidence window, IR's neighbors have essentially NO positive correlation with IR's daily block_rate (best: PK r=+0.099; six others negative; most negative: SY r=−0.26). When the model trains on 19 other countries, most with strongly correlated neighbors, it learns to weight contagion features as a primary signal. In LOCO with IR held out, IR's test rows arrive with mostly-zero contagion, and the model under-predicts because it has been taught that no-contagion ≈ no-event.

This is the same failure mode classifier v3.3 surfaced for OM/UZ/TN/LY/YE (MENA + former Soviet), but here it hits one critical country (IR, our flagship case) instead of a long tail.

Decision: DO NOT PROMOTE

The aggregate story is genuinely impressive — +17.8pp LOCO median F1 is the biggest single move we've made. But IR regressing 27pp is unacceptable. Production stays on v1 forecast.

What we learned

  1. Forecast benefits more than classifier from regime-weighted contagion (+17.8 vs +5pp).
  2. The failure mode is consistent: countries whose censorship dynamics are uncorrelated with their geographic neighbors get hurt by global contagion-feature training.
  3. Iran is the cleanest such case — Persian/Shia regime surrounded by Arab/Sunni states with isolated information policy.

Future v3 forecast iteration

Two options to try:

  1. Per-country opt-out gate: train with a feature mask that sets the contagion features to 0 for any country whose maximum neighbor correlation is r < 0.10.
  2. Learned mixing weight — let the model learn a per-country contagion-weight rather than using global feature importance. Stronger countries can discount it.
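Option 1 can be sketched as a row-level mask. This is only an illustration of the idea, not a tested fix. The function name and the row and correlation dictionaries are hypothetical. The three feature names and the 0.10 cutoff come from the text above, as do the PK and SY correlations for IR. The feature values in the example row are made up.

```python
CONTAGION_FEATURES = (
    "neighbor_block_rate_7d",
    "neighbor_incident_count_7d",
    "neighbor_max_anomaly_7d",
)

def apply_opt_out(row, neighbor_r, min_r=0.10):
    """Return a copy of `row` with contagion features zeroed when the
    country's best neighbor correlation is below `min_r`."""
    masked = dict(row)
    if max(neighbor_r.values(), default=0.0) < min_r:
        for name in CONTAGION_FEATURES:
            masked[name] = 0.0
    return masked

# IR: the two neighbor correlations quoted in this post (best and most negative).
ir_neighbor_r = {"PK": 0.099, "SY": -0.26}
# Made-up feature values, for illustration only.
ir_row = {
    "neighbor_block_rate_7d": 0.12,
    "neighbor_incident_count_7d": 3.0,
    "neighbor_max_anomaly_7d": 0.8,
}
ir_masked = apply_opt_out(ir_row, ir_neighbor_r)
```

Applying the same mask at both training and inference time keeps the feature distribution consistent for opted-out countries.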

Both are deferred. Whichever is tried first must demonstrate that it doesn't regress IR.

Reproducibility

Build script: scripts/build-forecast-v2-features.py. Train script: scripts/train-forecast-v2.py. Artifacts at /opt/voidly-ai/ml-deploy/censorship_forecast_v2_contagion.pkl + sidecar. v1 model UNTOUCHED — voidly-forecast.service still serving v1 predictions.

Raw data