v3.3 regressed on 16 countries relative to v3.1, concentrated in MENA and post-Soviet regimes. The hypothesis for v3.4: train small per-regime-cluster fine-tuning heads (depth=3, 50 trees) on each cluster's samples plus 200 random "other" samples for negative anchoring, then stack them on top of the v3.3 base model via a logistic-regression head. Each cluster head sees only its own regime's signal; the stack head learns when to trust a cluster head over the base.
What we built
- Base model: v3.3 unchanged (GradientBoosting, 16 features incl. regime-weighted contagion)
- Cluster heads (6 trained): MENA (n=1,268, 271 positive, 22 countries), Post-Soviet (n=529, 155 pos, 10 countries), East Asia tight (n=263, 51 pos, 3 countries), SE Asia mixed (n=440, 63 pos, 6 countries), LATAM tight (n=154, 75 pos, 3 countries), Sub-Saharan tight (n=112, 40 pos, 4 countries)
- Stack head: LogisticRegression on 7 features (base_proba + 6 gated cluster probas)
- Negative anchoring: 200 random non-cluster samples per head to keep the small models from over-fitting to their cluster's prior
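The data plumbing behind the heads and the stack can be sketched roughly as follows. This is an illustration under stated assumptions, not the training script: the sample keys (`cluster`, `label`), both function names, and the reading of "gated" as "a head's probability is zeroed for samples outside its cluster" are all hypothetical. Anchor rows keep whatever label they already have, since the write-up does not say otherwise.

```python
import random

N_ANCHORS = 200  # from the write-up: 200 random non-cluster samples per head


def build_head_training_set(samples, cluster, n_anchors=N_ANCHORS, seed=0):
    """Rows for one cluster head: its own cluster plus random outside anchors.

    `samples` is a list of dicts with hypothetical keys "cluster" and "label".
    """
    rng = random.Random(seed)
    in_cluster = [s for s in samples if s["cluster"] == cluster]
    others = [s for s in samples if s["cluster"] != cluster]
    # Draw anchors without replacement; cap at what is available.
    anchors = rng.sample(others, min(n_anchors, len(others)))
    return in_cluster + anchors


def stack_features(base_proba, sample_cluster, head_probas, clusters):
    """Stack-head input row: base proba followed by one gated proba per cluster.

    Gating (an assumption): a head contributes its probability only for
    samples that belong to its own cluster, and 0.0 for everything else.
    """
    gated = [head_probas[c] if c == sample_cluster else 0.0 for c in clusters]
    return [base_proba] + gated
```

With six clusters this yields the 7-feature row the stack head is fit on.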
The stack head told us the cluster heads are useless
Trained stack coefficients (full-data fit):
| Feature | Coefficient |
|---|---|
| base_proba (v3.3) | +9.80 |
| east_asia_tight | +0.64 |
| latam_tight | -0.66 |
| mena | -0.83 |
| post_soviet | +0.47 |
| se_asia_mixed | +0.25 |
| subsah_tight | +0.16 |
The base v3.3 coefficient is about 12x the largest cluster-head coefficient in magnitude (9.80 vs 0.83). The MENA and LATAM heads get NEGATIVE weight (the stack actively down-weights their predictions), and the others are too small to move probabilities meaningfully. Translation: the per-cluster heads are not learning useful corrections.
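To put the coefficients on a common scale, the snippet below converts each cluster coefficient into the change in base_proba that would shift the stack's logit by the same amount as that head swinging from 0 to 1. The coefficients are copied from the table; the intercept is not reported, so only relative logit contributions are compared, and `equivalent_base_shift` is a name made up for this illustration.

```python
# Stack coefficients as reported in the table above.
STACK_COEFS = {
    "base_proba": 9.80,
    "east_asia_tight": 0.64,
    "latam_tight": -0.66,
    "mena": -0.83,
    "post_soviet": 0.47,
    "se_asia_mixed": 0.25,
    "subsah_tight": 0.16,
}


def equivalent_base_shift(feature):
    """Change in base_proba with the same logit effect as `feature` going 0 -> 1."""
    return STACK_COEFS[feature] / STACK_COEFS["base_proba"]


for name in STACK_COEFS:
    if name != "base_proba":
        print(f"{name:16s} {equivalent_base_shift(name):+.3f}")
```

Even the largest head (MENA) moving across its full range has the same logit effect as a base_proba change of about 0.085, which is why the heads barely move the final probabilities.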
Honest LOCO numbers
| Metric | v3.3 (prod) | v3.4 (this experiment) | Delta |
|---|---|---|---|
| LOCO median F1 | 0.8696 | 0.8333 | -3.6pp |
| LOCO mean F1 | 0.7109 | 0.7056 | -0.5pp |
| Stratified F1 | 0.7289 | 0.7240 | -0.5pp |
| Stratified AUC | 0.8991 | 0.8980 | -0.1pp |
| Countries evaluated (LOCO) | 127 | 127 | - |
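For reference, per-country LOCO scores and their median and mean can be computed along these lines, assuming LOCO here means leave-one-country-out. The row keys (`country`, `label`) and the `fit` / `predict` callables are hypothetical placeholders, not the interface of the actual training script.

```python
from collections import defaultdict
from statistics import mean, median


def loco_folds(rows):
    """Yield (held_out_country, train_rows, test_rows), one fold per country."""
    by_country = defaultdict(list)
    for r in rows:
        by_country[r["country"]].append(r)
    for country, test_rows in by_country.items():
        train_rows = [r for r in rows if r["country"] != country]
        yield country, train_rows, test_rows


def f1_score(y_true, y_pred):
    """Binary F1; defined as 0.0 when there are no positives and no positive predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def loco_summary(rows, fit, predict):
    """Train on all other countries, score the held-out one, then aggregate."""
    scores = {}
    for country, train_rows, test_rows in loco_folds(rows):
        model = fit(train_rows)
        preds = [predict(model, r) for r in test_rows]
        scores[country] = f1_score([r["label"] for r in test_rows], preds)
    values = list(scores.values())
    return median(values), mean(values), scores
```

Because every country counts once regardless of sample size, a handful of tiny countries scoring near 0.00 pulls the mean well below the median, which matches the gap between the two LOCO rows in the table.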
The 16 regression countries — what actually happened
- Improved ≥3pp (1/16): UZ (+7pp, 0.16 -> 0.23)
- Stayed flat (11/16): KW, OM, TN, LY, YE, JO, MA, AM, KH, QA, TR, all within +/-2pp of the v3.3 baseline. MA was +2pp, just below the 3pp threshold.
- Regressed further (3/16): VN (-3pp), SY (-5pp), GE (-9pp)
- Within noise (1/16): CU (-1pp)
High-impact countries — also no real movement
- IR: 0.850 -> 0.821 (-3pp, slight regression)
- VE: 0.900 -> 0.909 (within noise)
- EG: 0.726 -> 0.726 (unchanged)
- CN: 0.294 -> 0.368 (+7pp, the only meaningful high-impact win)
- RU: 0.613 -> 0.615 (unchanged)
- DE / JP / IT / GB / US / BR: all unchanged at v3.3 levels
Promotion gates — both fail
- Overall LOCO median F1 ≥ 0.85: FAIL (got 0.833)
- ≥10 of 16 regression countries improve by ≥3pp: FAIL (got 1)
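Expressed as code, the two gates amount to a pair of threshold checks. The function and key names below are made up for illustration; the thresholds and the observed values are the ones reported above.

```python
def check_promotion(loco_median_f1, n_improved, median_gate=0.85, improved_gate=10):
    """Return (promote, per-gate results). Both gates must pass to promote."""
    gates = {
        "loco_median_f1": loco_median_f1 >= median_gate,
        "regression_countries_improved": n_improved >= improved_gate,
    }
    return all(gates.values()), gates


# v3.4 as measured: LOCO median F1 of 0.8333, 1 of 16 countries improved by >=3pp.
promote, gates = check_promotion(loco_median_f1=0.8333, n_improved=1)
```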
DECISION: v3.4 NOT promoted. v3.3 stays in production.
Why this didn't work
The fundamental problem isn't model capacity — the v3.4 stack head has the freedom to use cluster predictions and chose not to. The actual problem is signal: countries like OM, LY, YE, KW have so few positive samples in the training data (~5-15) that no model can distinguish censorship days from quiet ones. The cluster heads see the same sparse-positive subset and learn the same uncertain boundary as the base. Adding more capacity to ambiguous data doesn't manufacture signal that isn't there.
For KW (n=13, F1=0.00 in both v3.3 and v3.4) and the other countries with ≤55 samples, F1 is dominated by noise from 1-2 misclassified rows. Model architecture cannot fix this; only more labeled positives can.
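A quick calculation shows how coarse F1 becomes with very few positives. The scenario below is hypothetical (a country with 3 positive rows, not any specific country's real counts) and only illustrates the arithmetic.

```python
def f1_from_counts(tp, fp, fn):
    """Binary F1 from confusion counts; 0.0 when nothing is positive or predicted positive."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical country with 3 positive rows in its held-out fold.
scenarios = {
    "all correct": f1_from_counts(tp=3, fp=0, fn=0),
    "one positive missed": f1_from_counts(tp=2, fp=0, fn=1),
    "one missed, one false alarm": f1_from_counts(tp=2, fp=1, fn=1),
    "two positives missed": f1_from_counts(tp=1, fp=0, fn=2),
}

for name, f1 in scenarios.items():
    print(f"{name:28s} F1={f1:.3f}")
```

A single flipped row moves F1 by 20pp in this setup, and two flipped rows by 33 to 50pp, so differences of a few points between model versions cannot be read as signal at this sample size.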
What's worth trying next
- Targeted data labeling: hand-curate 20-30 additional positives for OM, LY, YE, KW, AM, TN from the existing evidence DB. More signal beats more architecture.
- Per-country calibration: don't change the base model, just shift the 0.5 threshold per country using a held-out validation slice. This is a recall/precision trade-off knob, not a learning problem.
- Drop the "regression" framing for n<30 countries: KW (n=13), AM (n=18), TN (n=20) have variance large enough that swings of +/-15pp F1 between models are statistically meaningless.
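The per-country calibration idea could look roughly like this: sweep candidate thresholds on a held-out slice and keep the one with the best F1, falling back to 0.5 when the slice has too few positives to trust. The function names, the candidate grid, and the `min_positives=5` guard are illustrative assumptions, not settings from the experiment.

```python
def f1_at_threshold(probas, labels, threshold):
    """F1 of the decisions obtained by thresholding `probas` at `threshold`."""
    preds = [1 if p >= threshold else 0 for p in probas]
    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def pick_threshold(probas, labels, min_positives=5, default=0.5):
    """Choose a per-country threshold from a held-out validation slice.

    Keeps the default when there are too few positives to calibrate on
    (hypothetical guard). Ties go to the candidate closest to the default.
    """
    if sum(labels) < min_positives:
        return default
    candidates = [i / 100 for i in range(5, 96, 5)]  # 0.05, 0.10, ..., 0.95
    return max(
        candidates,
        key=lambda t: (f1_at_threshold(probas, labels, t), -abs(t - default)),
    )
```

The base model is untouched in this scheme; only the decision rule applied to its probabilities changes per country.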
Reproducibility
Training script: scripts/train-classifier-v3.4-regime-finetune.py. Run on Vultr; metrics at /opt/voidly-ai/models/experimental/censorship_classifier_v3.4_metrics_20260521_041846.json. Model bundle (unpromoted) at .../censorship_classifier_v3.4_20260521_041846.pkl. Production endpoint /v1/classifier/info still reports v3.3. v3.3 backup at .pkl.bak.v3.1-2026-05-21 for rollback.
Negative results count. Publishing a no-promote alongside the promoted experiments keeps the methodology honest — the readership sees what worked AND what we tried and abandoned.