voidly

Forecast hyperparameter grid search: defaults already near-optimal

Ran a 27-cell GridSearchCV over XGBoost (n_estimators × max_depth × learning_rate) plus a follow-up min_child_weight/gamma sweep. Holdout AUC improved by 0.007, but LOCO median AUC dropped by 0.003, and the best params lose in 7 of the 10 most-active countries. The current defaults are at the practical ceiling for this feature set. Future gains require feature engineering, not hyperparameter tuning.

#methodology #ml #forecast #hyperparameters #honest-no-improvement #distribution-shift

The forecast XGBoost model trains with our existing defaults: n_estimators=200, max_depth=6, learning_rate=0.1. Standard ML wisdom says: tune the hyperparameters. Our earlier audit predicted +2-4% AUC from a 27-cell grid search.

We ran it. The honest result: no real improvement.

The numbers

5-fold CV AUC (training-set i.i.d.): 0.9753

| Metric | Current | Grid-search best | Δ |
|---|---|---|---|
| Stratified holdout AUC | 0.9789 | 0.9859 | +0.0070 |
| LOCO median AUC | 0.9014 | 0.8987 | −0.0027 |

Best params were learning_rate=0.05, max_depth=8, n_estimators=200. Follow-up grid on min_child_weight × gamma recommended the existing defaults (1 and 0).
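To make the "27-cell" figure concrete, here is a minimal sketch of how a 3 × 3 × 3 grid expands into cells. The value lists are assumptions: the post only confirms the defaults and the winning cell, not the full grid. The dict has the shape that scikit-learn's `GridSearchCV` accepts as `param_grid`.

```python
from itertools import product

# Hypothetical 3 x 3 x 3 grid. Only the defaults (200, 6, 0.1) and the
# winning cell (200, 8, 0.05) are confirmed; the other values are assumed.
param_grid = {
    "n_estimators": [100, 200, 400],
    "max_depth": [4, 6, 8],
    "learning_rate": [0.05, 0.1, 0.2],
}


def grid_cells(grid):
    """Expand a dict of value lists into one dict per parameter combination."""
    names = list(grid)
    combos = product(*(grid[name] for name in names))
    return [dict(zip(names, values)) for values in combos]


cells = grid_cells(param_grid)
print(len(cells))  # 27
```

Each cell is fitted once per CV fold, so a 5-fold search over this grid would mean 135 model fits before the final refit.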

Why holdout said yes but LOCO said no

The stratified holdout is an i.i.d. random split of the full 14,620-sample training set. i.i.d. means the test rows come from the same distribution as the train rows — same countries, same time periods.

LOCO (leave-one-country-out) is fundamentally different. We train on 18 countries and test on the held-out 19th. The held-out country's feature distribution may differ from that of the other 18. That's a distribution-shift regime.

A deeper tree (max_depth=8) fits the i.i.d. training distribution more aggressively: it wins on the stratified holdout but loses on LOCO because it overfits country-specific quirks. The default max_depth=6 trades some i.i.d. accuracy for better cross-country generalization.
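The difference between the two evaluation regimes comes down to how rows are assigned to train and test. Below is a minimal sketch of a leave-one-country-out splitter in plain Python. The function name and the toy country list are illustrative, not taken from scripts/forecast-grid-search.py.

```python
from collections import defaultdict


def loco_splits(countries):
    """Yield (held_out, train_idx, test_idx) once per distinct country.

    `countries` holds one country code per training row, in row order.
    """
    rows_by_country = defaultdict(list)
    for idx, code in enumerate(countries):
        rows_by_country[code].append(idx)

    for held_out in sorted(rows_by_country):
        test_idx = rows_by_country[held_out]
        # Every row from every other country goes into the training fold.
        train_idx = sorted(
            i
            for code, rows in rows_by_country.items()
            if code != held_out
            for i in rows
        )
        yield held_out, train_idx, test_idx


# Toy data: six rows from three countries (illustrative only).
toy = ["IR", "PK", "IR", "TR", "PK", "TR"]
folds = list(loco_splits(toy))
print(folds[0])  # ('IR', [1, 3, 4, 5], [0, 2])
```

No test row ever shares a country with a training row, which is exactly what a random stratified split cannot guarantee. scikit-learn's `LeaveOneGroupOut` provides the same splitting given a `groups` array. The post does not say which implementation the script uses.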

Per-country breakdown (LOCO, top 10 most-active)

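| Country | Current | Best | Δ |
|---|---|---|---|
| PK | 0.7785 | 0.7695 | −0.0091 |
| RU | 0.8098 | 0.8055 | −0.0043 |
| UZ | 0.9698 | 0.9641 | −0.0056 |
| EG | 0.9502 | 0.9439 | −0.0063 |
| BD | 0.9040 | 0.9011 | −0.0029 |
| TR | 0.8420 | 0.8308 | −0.0112 |
| MM | 0.9506 | 0.9539 | +0.0033 |
| IR | 0.7926 | 0.7960 | +0.0034 |
| KZ | 0.8988 | 0.8963 | −0.0025 |
| VN | 0.9478 | 0.9481 | +0.0004 |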

7 of the 10 most-active countries regress, and median LOCO AUC drops by 0.0027. Not promoting.
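The decision can be written down as an explicit gate. This is a sketch under assumptions: the `should_promote` function and its 50% regression threshold are hypothetical and are not the rule the script applies. The deltas and medians are the reported figures from the tables above.

```python
# Reported LOCO AUC deltas (best minus current), ten most-active countries.
loco_delta = {
    "PK": -0.0091, "RU": -0.0043, "UZ": -0.0056, "EG": -0.0063, "BD": -0.0029,
    "TR": -0.0112, "MM": 0.0033, "IR": 0.0034, "KZ": -0.0025, "VN": 0.0004,
}
# LOCO medians from the summary table.
median_current, median_best = 0.9014, 0.8987


def should_promote(deltas, median_before, median_after, max_regress_share=0.5):
    """Hypothetical gate: the median must not drop, and at most
    `max_regress_share` of the listed countries may get worse."""
    regressed = sorted(code for code, d in deltas.items() if d < 0)
    median_ok = median_after >= median_before
    share_ok = len(regressed) / len(deltas) <= max_regress_share
    return median_ok and share_ok, regressed


promote, regressed = should_promote(loco_delta, median_current, median_best)
print(promote, len(regressed))  # False 7
```

Both conditions fail here, so the candidate is rejected regardless of which one is treated as decisive.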

What this means for Voidly

The honest ceiling: future forecast gains come from feature engineering, not hyperparameter tuning. The regime-weighted contagion experiment (forecast v2) hit this wall too — aggregate +17.8pp but Iran regressed 27pp.

Productive directions instead of hyperparam tuning:

  1. More training data (mining historical OONI archive)
  2. Per-country opt-out gates for contagion features
  3. Multi-horizon labels (1d / 7d / 30d separate models)
  4. Per-ASN granular forecast (more granular than country-level)
  5. Stealth-blackout detector (BGP-stable + throughput-collapse)
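As an illustration of item 3, multi-horizon labels could be derived from a daily incident series roughly as follows. This is a sketch, not the planned implementation: the label definition ("any incident within the next h days") and the `horizon_labels` helper are assumptions.

```python
def horizon_labels(incident_days, horizons=(1, 7, 30)):
    """For each day t, labels[h][t] is 1 if any incident occurs in days
    t+1 .. t+h, else 0.

    `incident_days` is a list of 0/1 flags, one per day, for one country.
    Days whose look-ahead window runs past the end of the series get None,
    because the outcome is not yet observable.
    """
    n = len(incident_days)
    labels = {}
    for h in horizons:
        column = []
        for t in range(n):
            if t + h >= n:
                column.append(None)
            else:
                window = incident_days[t + 1 : t + h + 1]
                column.append(int(any(window)))
        labels[h] = column
    return labels


# Toy ten-day series with a single incident on day 3 (illustrative only).
series = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
labels = horizon_labels(series, horizons=(1, 3))
print(labels[3])  # [1, 1, 1, 0, 0, 0, 0, None, None, None]
```

Each horizon column would then train its own model. Rows labeled None are dropped for that horizon, so longer horizons lose more of the most recent data.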

Cost

~72 seconds CPU time on the 8GB Vultr box. Script: scripts/forecast-grid-search.py (219 lines). Artifacts in /opt/voidly-ai/models/experimental/forecast_grid_best.json and forecast_grid_loco_compare.json. Script left in place for re-runs after feature changes.

Raw data