The forecast XGBoost model trains with our fixed default settings: n_estimators=200, max_depth=6, learning_rate=0.1. Standard ML wisdom says to tune the hyperparameters, and our earlier audit predicted a +2% to +4% AUC gain from a 27-cell grid search.
We ran it. The honest result: no real improvement.
The numbers
| Metric | Current | Grid-search best | Δ |
|---|---|---|---|
| 5-fold CV AUC (training-set i.i.d.) | — | 0.9753 | — |
| Stratified holdout AUC | 0.9789 | 0.9859 | +0.0070 |
| LOCO median AUC | 0.9014 | 0.8987 | −0.0027 |
Best params were learning_rate=0.05, max_depth=8, n_estimators=200.
A follow-up grid on min_child_weight × gamma recommended the existing defaults (min_child_weight=1 and gamma=0).
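For readers who want to picture the search space, here is a minimal sketch of how a 27-cell grid can be enumerated. The specific grid values are an assumption: the write-up only states the current settings and the winning cell, so `PARAM_GRID` is a plausible 3 × 3 × 3 grid containing both, not the grid the script actually used. `iter_grid` and `pick_best` are hypothetical helper names.

```python
from itertools import product

# Current settings and winning cell, as reported in this post.
CURRENT = {"n_estimators": 200, "max_depth": 6, "learning_rate": 0.1}
BEST = {"n_estimators": 200, "max_depth": 8, "learning_rate": 0.05}

# ASSUMPTION: illustrative grid values. Only the values appearing in CURRENT
# and BEST come from the post; the rest are placeholders.
PARAM_GRID = {
    "n_estimators": [100, 200, 400],
    "max_depth": [4, 6, 8],
    "learning_rate": [0.05, 0.1, 0.2],
}


def iter_grid(param_grid):
    """Yield one parameter dict per cell of the grid's Cartesian product."""
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))


def pick_best(cells, score_fn):
    """Return the cell with the highest score. score_fn is caller-supplied,
    for example a function that trains a model and returns its CV AUC."""
    return max(cells, key=score_fn)


cells = list(iter_grid(PARAM_GRID))
print(len(cells))  # 27
```

In a real run, `score_fn` would train one model per cell (for example by passing the dict to `xgboost.XGBClassifier(**cell)`) and return the validation metric. Which metric it returns matters as much as the grid itself.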
Why holdout said yes but LOCO said no
The stratified holdout is an i.i.d. random split of the full 14,620-sample training set. Here, i.i.d. means the test rows come from the same distribution as the training rows: same countries, same time periods.
LOCO (leave-one-country-out) is fundamentally different. We train on 18 countries and test on the held-out 19th. The 19th country's feature distribution may differ from that of the other 18, so LOCO measures performance under distribution shift.
A deeper tree (max_depth=8) fits the i.i.d. training distribution more aggressively. It wins on the stratified holdout and loses on LOCO, most likely because it overfits country-specific quirks. The default max_depth=6 trades some i.i.d. accuracy for better cross-country generalization.
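The difference between the two evaluation regimes comes down to how rows are split. The following is a minimal sketch of a leave-one-country-out splitter and a rank-based AUC, written with the standard library only. It is not the code in our script; `loco_splits` and `roc_auc` are hypothetical names, and the toy country list is made up for illustration.

```python
def loco_splits(countries):
    """Yield (held_out, train_idx, test_idx) once per distinct country.

    countries is a list with one country code per training row. Every row of
    the held-out country goes to the test side, so the model never sees it
    during training.
    """
    for held_out in sorted(set(countries)):
        train_idx = [i for i, c in enumerate(countries) if c != held_out]
        test_idx = [i for i, c in enumerate(countries) if c == held_out]
        yield held_out, train_idx, test_idx


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as half. Quadratic time: fine for a sketch."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example with three countries and two rows each.
toy_countries = ["IR", "IR", "PK", "PK", "TR", "TR"]
for held_out, train_idx, test_idx in loco_splits(toy_countries):
    print(held_out, train_idx, test_idx)

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A stratified random split, by contrast, would place rows from every country on both sides, which is why it cannot detect a model that has memorized country-specific patterns.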
Per-country breakdown (LOCO, top 10 most-active)
| Country | Current | Best | Δ |
|---|---|---|---|
| PK | 0.7785 | 0.7695 | −0.0091 |
| RU | 0.8098 | 0.8055 | −0.0043 |
| UZ | 0.9698 | 0.9641 | −0.0056 |
| EG | 0.9502 | 0.9439 | −0.0063 |
| BD | 0.9040 | 0.9011 | −0.0029 |
| TR | 0.8420 | 0.8308 | −0.0112 |
| MM | 0.9506 | 0.9539 | +0.0033 |
| IR | 0.7926 | 0.7960 | +0.0034 |
| KZ | 0.8988 | 0.8963 | −0.0025 |
| VN | 0.9478 | 0.9481 | +0.0004 |
7 of the 10 most-active countries regress, and the median LOCO AUC drops by 0.0027. We are not promoting the new parameters.
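The regression count can be re-derived from the table with a few lines of Python. This is a sketch using the four-decimal values printed above. The `should_promote` rule is an assumption that mirrors the decision described here (do not promote if the LOCO median regresses); it is not a gate taken from our codebase.

```python
# (current AUC, grid-search-best AUC) per country, copied from the table.
LOCO_AUC = {
    "PK": (0.7785, 0.7695),
    "RU": (0.8098, 0.8055),
    "UZ": (0.9698, 0.9641),
    "EG": (0.9502, 0.9439),
    "BD": (0.9040, 0.9011),
    "TR": (0.8420, 0.8308),
    "MM": (0.9506, 0.9539),
    "IR": (0.7926, 0.7960),
    "KZ": (0.8988, 0.8963),
    "VN": (0.9478, 0.9481),
}


def regressed_countries(results):
    """Return the sorted country codes whose AUC is lower with the new params."""
    return sorted(c for c, (cur, new) in results.items() if new < cur)


def should_promote(loco_median_current, loco_median_new):
    """ASSUMPTION: hypothetical gate that blocks promotion on any drop in
    the LOCO median AUC."""
    return loco_median_new >= loco_median_current


regressed = regressed_countries(LOCO_AUC)
print(len(regressed), regressed)
# 7 ['BD', 'EG', 'KZ', 'PK', 'RU', 'TR', 'UZ']

# Median LOCO AUC values from the summary table.
print(should_promote(0.9014, 0.8987))  # False
```

Deltas recomputed from these rounded values can differ from the table's Δ column in the last digit (for example PK, UZ, and VN), presumably because the table's Δ was computed before rounding.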
What this means for Voidly
The honest ceiling: future forecast gains will come from feature engineering, not hyperparameter tuning. The regime-weighted contagion experiment (forecast v2) hit the same cross-country wall: the aggregate improved by 17.8pp, but Iran regressed by 27pp.
Productive directions instead of hyperparameter tuning:
- More training data (mining historical OONI archive)
- Per-country opt-out gates for contagion features
- Multi-horizon labels (1d / 7d / 30d separate models)
- Per-ASN forecasts (finer-grained than country-level)
- Stealth-blackout detector (BGP-stable + throughput-collapse)
Cost
The grid search took ~72 seconds of CPU time on the 8GB Vultr box. Script: scripts/forecast-grid-search.py (219 lines). Artifacts are in /opt/voidly-ai/models/experimental/forecast_grid_best.json and forecast_grid_loco_compare.json. The script is left in place for re-runs after feature changes.