Classifier v3.3 added regime-similarity-weighted neighbor features and lifted leave-one-country-out (LOCO) median F1 by 5pp. The natural next move: try the same approach on the XGBoost forecast model.
Forecast v2 contagion added the same 3 features (neighbor_block_rate_7d, neighbor_incident_count_7d, neighbor_max_anomaly_7d), same UN M49 + hand-added adjacency, same regime weighting (Pearson r on daily block_rate). 5,948 of 14,620 training rows have non-zero contagion.
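As a rough illustration of the regime weighting described above, here is a minimal sketch of how one contagion feature could be computed. It is not the code in the build script. The function names, the dict-based data layout, the choice to clip negative correlations to zero, the averaging over all neighbors, and computing r over the full series are all assumptions made for the example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def neighbor_block_rate_7d(country, day, block_rate, adjacency, window=7):
    """Regime-weighted neighbor block rate over a trailing window ending at `day`.

    block_rate: dict of country code -> list of daily block rates (same length)
    adjacency:  dict of country code -> list of neighbor country codes
    Assumption: weight = max(r, 0), so uncorrelated or anti-correlated
    neighbors contribute nothing to the feature.
    """
    neighbors = adjacency.get(country, [])
    if not neighbors:
        return 0.0
    own = block_rate[country]
    start = max(0, day - window + 1)
    total = 0.0
    for nb in neighbors:
        weight = max(pearson_r(own, block_rate[nb]), 0.0)
        recent = block_rate[nb][start:day + 1]
        total += weight * (sum(recent) / len(recent))
    return total / len(neighbors)
```

Under these assumptions, a country whose neighbors all have near-zero or negative r ends up with a contagion feature close to 0 regardless of what the neighbors are doing.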
Aggregate metrics — bigger wins than the classifier
| Split | v1 AUC | v2 AUC | v1 F1 | v2 F1 | Δ F1 |
|---|---|---|---|---|---|
| Stratified | 0.9803 | 0.9824 | 0.7945 | 0.8430 | +4.9pp |
| Time-based | 0.5009 | 0.5022 | 0.0000 | 0.6797 | +68pp* |
| LOCO median | 0.9051 | 0.9083 | 0.5528 | 0.7305 | +17.8pp |
* v1 time-based F1=0 because thresholding broke on the contiguous post-T block; v2 found a usable threshold. Not a fair comparison.
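For readers unfamiliar with the LOCO median row, the following is a minimal sketch of the evaluation loop: hold out one country, train on the rest, score F1 on the held-out country, then take the median across countries. The `fit_predict` callable and the `(country, features, label)` row layout are hypothetical stand-ins for the real training script.

```python
from statistics import median


def f1_score(y_true, y_pred):
    """Binary F1; returns 0.0 when there are no true or predicted positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def loco_f1(rows, fit_predict):
    """Leave-one-country-out F1.

    rows:        list of (country, features, label) tuples
    fit_predict: hypothetical callable (train_rows, test_features) -> list of 0/1
    Returns (per-country F1 dict, median F1 across countries).
    """
    countries = sorted({country for country, _, _ in rows})
    scores = {}
    for held_out in countries:
        train = [r for r in rows if r[0] != held_out]
        test = [r for r in rows if r[0] == held_out]
        preds = fit_predict(train, [features for _, features, _ in test])
        scores[held_out] = f1_score([label for _, _, label in test], preds)
    return scores, median(scores.values())
```

Because the summary statistic is a median, a single country can collapse without moving the headline number, which is why per-country deltas have to be checked separately.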
The forecast model benefits MORE from contagion than the classifier did (+17.8pp LOCO median F1 vs. the classifier's +5pp). Reason: the forecast has ~731 rows per country (sparse signal), so cross-country borrowing helps more here than on the classifier's 32-samples-per-country distribution.
15 of 19 countries improve
Including the v1 worst-performers:
- Turkmenistan (TM): +10.3pp F1 (was lowest LOCO at 0.12)
- Cuba (CU): +11.2pp
- Belarus (BY): +19.9pp
- Russia (RU): +17.5pp
The deal-breaker: Iran regresses 27.4pp
Promote gate: "no country regresses >5pp F1". IR fails by 22pp. IR's F1 collapsed from 0.51 to 0.235; recall went from 67% to 17%.
Why: IR's neighbors in the 2-year evidence window have essentially NO positive correlation with IR's daily block_rate (best: PK at r=+0.099; six others negative; most negative: SY at r=−0.26). When the model trains on 19 other countries, most of which have strongly correlated neighbors, it learns to weight contagion features as a primary signal. In LOCO with IR held out, IR's test rows arrive with mostly-zero contagion, and the model under-predicts because it has been taught that no contagion ≈ no event.
This is the same failure mode classifier v3.3 surfaced for OM/UZ/TN/LY/YE (MENA + former Soviet), but here it hits one critical country (IR, our flagship case) instead of a long tail.
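The promote gate quoted above ("no country regresses >5pp F1") can be sketched as a simple per-country check. This is an illustrative helper, not the actual gate code; the function name and dict layout are assumptions. The IR values are the rounded F1 scores reported above, and the TM v2 value is derived from the reported 0.12 baseline plus 10.3pp.

```python
def check_promote_gate(f1_v1, f1_v2, max_regression_pp=5.0):
    """Return (passed, failures); failures maps country -> F1 regression in pp."""
    failures = {}
    for country, old_f1 in f1_v1.items():
        regression_pp = (old_f1 - f1_v2[country]) * 100.0
        if regression_pp > max_regression_pp:
            failures[country] = round(regression_pp, 1)
    return (not failures), failures


# Rounded per-country LOCO F1 taken or derived from the figures in this note.
f1_v1 = {"IR": 0.51, "TM": 0.12}
f1_v2 = {"IR": 0.235, "TM": 0.223}

passed, failures = check_promote_gate(f1_v1, f1_v2)
print("promote" if passed else f"do not promote: {failures}")
```

With the rounded inputs the IR regression comes out at roughly 27.5pp rather than the reported 27.4pp, presumably because the F1 values quoted here are themselves rounded. Either way the gate fails on IR alone.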
Decision: DO NOT PROMOTE
The aggregate story is genuinely impressive — +17.8pp LOCO median F1 is the biggest single move we've made. But IR regressing 27pp is unacceptable. Production stays on v1 forecast.
What we learned
- Forecast benefits more than classifier from regime-weighted contagion (+17.8 vs +5pp).
- The failure mode is consistent: countries whose censorship dynamics are uncorrelated with their geographic neighbors get hurt by global contagion-feature training.
- Iran is the cleanest such case — Persian/Shia regime surrounded by Arab/Sunni states with isolated information policy.
Future v3 forecast iteration
Two options to try:
- Per-country opt-out gate: apply a feature mask that sets the contagion features to 0 for any country whose maximum neighbor correlation is r < 0.10.
- Learned mixing weight: let the model learn a per-country contagion weight rather than relying on global feature importance, so countries whose dynamics do not track their neighbors can discount it.
Both deferred — first iteration must demonstrate it doesn't regress IR.
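A minimal sketch of the first option, the per-country opt-out gate, assuming feature rows are plain dicts and that per-neighbor correlations are already available. The function name and data layout are hypothetical; only the three feature names and the 0.10 threshold come from this note.

```python
CONTAGION_FEATURES = (
    "neighbor_block_rate_7d",
    "neighbor_incident_count_7d",
    "neighbor_max_anomaly_7d",
)


def apply_contagion_opt_out(row, neighbor_r, min_r=0.10):
    """Return a copy of `row` with contagion features zeroed for opted-out countries.

    row:        dict of feature name -> value for one country-day
    neighbor_r: dict of neighbor code -> Pearson r against this country
    A country opts out when its best neighbor correlation is below min_r
    (a country with no neighbors also opts out).
    """
    best_r = max(neighbor_r.values(), default=float("-inf"))
    masked = dict(row)
    if best_r >= min_r:
        return masked
    for name in CONTAGION_FEATURES:
        if name in masked:
            masked[name] = 0.0
    return masked


# IR's best neighbor is PK at r=+0.099, so IR would fall under the threshold.
ir_row = {"block_rate": 0.31, "neighbor_block_rate_7d": 0.02,
          "neighbor_incident_count_7d": 1.0, "neighbor_max_anomaly_7d": 0.4}
ir_masked = apply_contagion_opt_out(ir_row, {"PK": 0.099, "SY": -0.26})
```

The mask would presumably need to be applied at both training and inference time. Masking only at inference leaves the model with the same learned association between zero contagion and no event that hurt IR in the first place. The `block_rate` value in `ir_row` is made up for the example.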
Reproducibility
Build script: scripts/build-forecast-v2-features.py.
Train script: scripts/train-forecast-v2.py.
Artifacts at /opt/voidly-ai/ml-deploy/censorship_forecast_v2_contagion.pkl + sidecar.
v1 model UNTOUCHED — voidly-forecast.service still serving v1 predictions.