voidly

Classifier v3.4: regime-cluster fine-tuning didn't fix the tail. Held.

Tried per-regime-cluster fine-tuning heads (MENA, post-Soviet, East Asia, SE Asia, LATAM, Sub-Sahara) stacked on top of v3.3 to recover the 16 countries that regressed under v3.3. The stacking head learns to mostly ignore the cluster heads (base coef 9.8 vs cluster coefs in [-0.83, +0.64]). LOCO median F1 drops from 0.870 to 0.833, only 1 of 16 regression countries improves by ≥3pp (UZ +7pp), and 3 regress further (GE -9pp, SY -5pp, VN -3pp). Both promotion gates fail. v3.3 stays in production. Documented as a real negative result.

#methodology #ml #classifier #regime-cluster #fine-tuning #negative-result #honest-no-promote

v3.3 regressed on 16 countries relative to v3.1, concentrated in MENA + post-Soviet regimes. The hypothesis for v3.4: train small per-regime-cluster fine-tuning heads (depth=3, 50 trees) on each cluster's samples plus 200 random "other" samples for negative anchoring, then stack them on top of the v3.3 base model via a logistic-regression head. The cluster heads see only their own regime's signal (plus the anchors); the stack head learns when to trust them vs the base.

What we built

  • Base model: v3.3 unchanged (GradientBoosting, 16 features incl. regime-weighted contagion)
  • Cluster heads (6 trained): MENA (n=1,268, 271 positive, 22 countries), Post-Soviet (n=529, 155 pos, 10 countries), East Asia tight (n=263, 51 pos, 3 countries), SE Asia mixed (n=440, 63 pos, 6 countries), LATAM tight (n=154, 75 pos, 3 countries), Sub-Saharan tight (n=112, 40 pos, 4 countries)
  • Stack head: LogisticRegression on 7 features (base_proba + 6 gated cluster probas)
  • Negative anchoring: 200 random non-cluster samples per head to keep the small models from over-fitting to their cluster's prior
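The data plumbing behind those bullets can be sketched roughly as follows. This is a minimal illustration, not the training script: the dict-based sample schema and both function names are hypothetical, "gated" is assumed to mean that a cluster head's probability is zeroed for samples outside its cluster, and anchor samples are assumed to keep their original labels.

```python
import random

# Cluster names as they appear in the stack coefficient table.
CLUSTERS = ["east_asia_tight", "latam_tight", "mena",
            "post_soviet", "se_asia_mixed", "subsah_tight"]

def anchored_training_set(samples, cluster, n_anchor=200, seed=0):
    """Return the cluster's rows plus up to n_anchor random rows from outside it.

    Assumption: each sample is a dict with a "cluster" key, and anchor rows
    keep whatever label they already had.
    """
    inside = [s for s in samples if s["cluster"] == cluster]
    outside = [s for s in samples if s["cluster"] != cluster]
    rng = random.Random(seed)
    anchors = rng.sample(outside, min(n_anchor, len(outside)))
    return inside + anchors

def stack_features(base_proba, cluster_probas, sample_cluster):
    """Build the 7-feature stack row: base_proba followed by 6 gated cluster probas.

    Assumption: gating zeroes every head except the one matching the sample's
    own cluster; a sample in no cluster gets six zeros.
    """
    gated = [cluster_probas[c] if c == sample_cluster else 0.0 for c in CLUSTERS]
    return [base_proba] + gated

# Hypothetical MENA sample: only the base slot and the MENA slot are non-zero.
row = stack_features(0.72, {c: 0.5 for c in CLUSTERS}, "mena")
```

Under these assumptions the stack head only ever sees one non-zero cluster probability per row, which is what lets a single coefficient per head be read as "how much to trust that head".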

The stack head told us the cluster heads are useless

Trained stack coefficients (full-data fit):

Feature              Coefficient
base_proba (v3.3)    +9.80
east_asia_tight      +0.64
latam_tight          -0.66
mena                 -0.83
post_soviet          +0.47
se_asia_mixed        +0.25
subsah_tight         +0.16

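The base v3.3 coefficient is roughly 12x the largest cluster coefficient in magnitude (9.80 vs 0.83). Every stack input is a probability in [0, 1], so a head's coefficient is also the largest log-odds shift that head can ever contribute. The MENA and LATAM heads get NEGATIVE weight (the stack actively down-weights their predictions), and the others are too small to move probabilities meaningfully. Translation: the per-cluster heads are not learning useful corrections.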
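To put the coefficients on a common scale, the sketch below converts each one into the change in base_proba that would produce the same log-odds shift. It uses only the coefficients from the table; the fitted intercept is not reported in this post, so no absolute probabilities are computed. The helper name is made up for illustration.

```python
# Stack coefficients copied from the table above.
COEFS = {
    "base_proba": 9.80,
    "east_asia_tight": 0.64,
    "latam_tight": -0.66,
    "mena": -0.83,
    "post_soviet": 0.47,
    "se_asia_mixed": 0.25,
    "subsah_tight": 0.16,
}

def equivalent_base_shift(name):
    """Change in base_proba that moves the log-odds as much as this head going 0 -> 1."""
    return abs(COEFS[name]) / COEFS["base_proba"]

# Ratio of the base coefficient to the largest cluster coefficient (by magnitude).
largest_cluster = max(abs(v) for k, v in COEFS.items() if k != "base_proba")
dominance = COEFS["base_proba"] / largest_cluster  # about 11.8

# Even a fully confident MENA head is worth less than a 0.09 change in base_proba.
mena_equiv = equivalent_base_shift("mena")  # about 0.085
```

Read this way, the most influential head (MENA) swinging from 0 to 1 has about the same effect as the base model changing its mind by 8.5 percentage points, and the weakest (Sub-Saharan) by under 2.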

Honest LOCO numbers

Metric                        v3.3 (prod)   v3.4 (this experiment)   Delta
LOCO median F1                0.8696        0.8333                   -3.6pp
LOCO mean F1                  0.7109        0.7056                   -0.5pp
Stratified F1                 0.7289        0.7240                   -0.5pp
Stratified AUC                0.8991        0.8980                   -0.1pp
Countries evaluated (LOCO)    127           127                      -
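For readers unfamiliar with LOCO (leave-one-country-out), the loop looks roughly like the sketch below: hold out every row for one country, fit on the rest, score F1 on the held-out country, then summarize across countries. The row format, the `fit` callback, the toy threshold model, and the country codes AA/BB/CC are all hypothetical stand-ins, not the production pipeline.

```python
from statistics import mean, median

def f1_score(y_true, y_pred):
    """Binary F1; returns 0.0 when there are no true or predicted positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def loco_f1(rows, fit):
    """rows: list of (country, x, label). fit(train_rows) must return predict(x)."""
    scores = {}
    for held_out in sorted({country for country, _, _ in rows}):
        train = [r for r in rows if r[0] != held_out]
        test = [r for r in rows if r[0] == held_out]
        predict = fit(train)  # the model never sees the held-out country
        y_true = [label for _, _, label in test]
        y_pred = [predict(x) for _, x, _ in test]
        scores[held_out] = f1_score(y_true, y_pred)
    return scores

def fit_threshold(train_rows):
    """Toy stand-in model: positive above the midpoint of the two class means."""
    pos = [x for _, x, y in train_rows if y == 1]
    neg = [x for _, x, y in train_rows if y == 0]
    cut = (mean(pos) + mean(neg)) / 2
    return lambda x: 1 if x > cut else 0

# Made-up single-feature data for three fictional countries.
toy_rows = [
    ("AA", 0.9, 1), ("AA", 0.1, 0), ("AA", 0.8, 1),
    ("BB", 0.7, 1), ("BB", 0.2, 0), ("BB", 0.3, 1),
    ("CC", 0.6, 0), ("CC", 0.1, 0), ("CC", 0.2, 0),
]
scores = loco_f1(toy_rows, fit_threshold)
values = list(scores.values())
summary = {"median": median(values), "mean": mean(values)}
```

In the toy data one country scores 0.0 and drags the mean well below the median, which is the same shape as the gap between LOCO median and LOCO mean F1 in the table.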

The 16 regression countries — what actually happened

  • Improved ≥3pp (1/16): UZ (+7pp, 0.16 -> 0.23)
  • Stayed flat (11/16): KW, OM, TN, LY, YE, JO, MA, AM, KH, QA, TR all within +/-2pp of v3.3 baseline. MA was +2pp, just below the 3pp threshold.
  • Regressed further (3/16): VN (-3pp), SY (-5pp), GE (-9pp)
  • Within noise (1/16): CU (-1pp)

High-impact countries — also no real movement

  • IR: 0.850 -> 0.821 (-3pp, slight regression)
  • VE: 0.900 -> 0.909 (within noise)
  • EG: 0.726 -> 0.726 (unchanged)
  • CN: 0.294 -> 0.368 (+7pp, the only meaningful high-impact win)
  • RU: 0.613 -> 0.615 (unchanged)
  • DE / JP / IT / GB / US / BR: all unchanged at v3.3 levels

Promotion gates — both fail

  • Overall LOCO median F1 ≥ 0.85: FAIL (got 0.833)
  • ≥10 of 16 regression countries improve by ≥3pp: FAIL (got 1)
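Written as code, the two gates amount to a pair of threshold checks. The sketch below is an illustration with hypothetical names (`GateResult`, `check_promotion`, `count_improved`); the thresholds and observed values are the ones stated in this post, and the delta dict contains only the per-country deltas the post reports explicitly.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GateResult:
    name: str
    passed: bool
    observed: float

def count_improved(deltas_pp, threshold_pp=3.0):
    """Number of countries whose F1 delta (in pp) meets the improvement threshold."""
    return sum(1 for delta in deltas_pp.values() if delta >= threshold_pp)

def check_promotion(loco_median_f1, n_improved, median_floor=0.85, min_improved=10):
    """Evaluate both promotion gates; promotion requires every gate to pass."""
    return [
        GateResult("loco_median_f1", loco_median_f1 >= median_floor, loco_median_f1),
        GateResult("regression_countries_improved", n_improved >= min_improved, n_improved),
    ]

# Deltas stated explicitly above; the remaining regression countries were
# all within +/-2pp, so none of them can reach the 3pp threshold.
known_deltas_pp = {"UZ": 7, "MA": 2, "VN": -3, "SY": -5, "GE": -9, "CU": -1}

results = check_promotion(loco_median_f1=0.8333,
                          n_improved=count_improved(known_deltas_pp))
promote = all(r.passed for r in results)  # False: both gates fail
```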

DECISION: v3.4 NOT promoted. v3.3 stays in production.

Why this didn't work

The fundamental problem isn't model capacity — the v3.4 stack head has the freedom to use cluster predictions and chose not to. The actual problem is signal: countries like OM, LY, YE, KW have so few positive samples in the training data (~5-15) that no model can distinguish censorship days from quiet ones. The cluster heads see the same sparse-positive subset and learn the same uncertain boundary as the base. Adding more capacity to ambiguous data doesn't manufacture signal that isn't there.

For KW (n=13, F1=0.00 in both v3.3 and v3.4) and the other countries with ≤55 samples, F1 is dominated by noise from 1-2 misclassified rows. Model architecture cannot fix this; only more labeled positives can.
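A quick worked example shows how large that noise is. The confusion counts below are hypothetical (a country with 5 positive days), chosen only to show how far one or two flipped rows move F1 at this sample size.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical country with 5 positive days and 1 false alarm.
before = f1_from_counts(tp=3, fp=1, fn=2)           # 6/9 = 0.667
after_one_flip = f1_from_counts(tp=2, fp=1, fn=3)   # 4/8 = 0.500
after_two_flips = f1_from_counts(tp=1, fp=1, fn=4)  # 2/7 = 0.286

swing_one_row_pp = (before - after_one_flip) * 100   # about 16.7pp from one row
swing_two_rows_pp = (before - after_two_flips) * 100  # about 38.1pp from two rows

# With zero true positives F1 is 0.0 no matter what else the model does.
no_hits = f1_from_counts(tp=0, fp=0, fn=5)
```

A single missed positive costs almost 17pp here, which is larger than every per-country delta reported in this experiment.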

What's worth trying next

  1. Targeted data labeling: hand-curate 20-30 additional positives for OM, LY, YE, KW, AM, TN from the existing evidence DB. More signal beats more architecture.
  2. Per-country calibration: don't change the base model, just shift the 0.5 threshold per country using a held-out validation slice. This is a recall/precision trade-off knob, not a learning problem.
  3. Drop the "regression" framing for n<30 countries: KW (n=13), AM (n=18), TN (n=20) have variance large enough that swings of +/-15pp F1 between models are statistically meaningless.
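Option 2 could look something like the sketch below: sweep a threshold grid on a held-out validation slice for one country and keep the default 0.5 unless something is strictly better. This is an assumption-laden illustration, not a planned implementation. The function names, the grid, and the validation data are made up, and optimizing F1 is just one possible objective for the recall/precision trade-off.

```python
def f1_at(probas, labels, threshold):
    """F1 of the decisions obtained by thresholding probas at `threshold`."""
    preds = [1 if p >= threshold else 0 for p in probas]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(probas, labels, grid=None, default=0.5):
    """Pick the grid threshold with the highest validation F1.

    Ties keep the earlier choice, and the default is kept unless a grid
    value is strictly better, so the knob never moves without evidence.
    """
    grid = grid or [i / 100 for i in range(5, 96, 5)]
    best_t, best_f1 = default, f1_at(probas, labels, default)
    for t in grid:
        score = f1_at(probas, labels, t)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1

# Hypothetical held-out slice for one country: the base model ranks the two
# positives highest but never crosses 0.5, so the default threshold gives F1 = 0.
val_probas = [0.45, 0.40, 0.30, 0.10, 0.05]
val_labels = [1, 1, 0, 0, 0]
threshold, val_f1 = best_threshold(val_probas, val_labels)

# Per-country lookup with a fallback to the global default.
thresholds = {"XX": threshold}  # "XX" is a placeholder country code
cutoff_for_unknown = thresholds.get("YY", 0.5)
```

The same caveat from item 3 applies: on a validation slice with only a handful of positives, the chosen threshold is itself noisy, so a fallback to 0.5 for the smallest countries is probably the safer default.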

Reproducibility

Training script: scripts/train-classifier-v3.4-regime-finetune.py. Run on Vultr; metrics at /opt/voidly-ai/models/experimental/censorship_classifier_v3.4_metrics_20260521_041846.json. Model bundle (unpromoted) at .../censorship_classifier_v3.4_20260521_041846.pkl. Production endpoint /v1/classifier/info still reports v3.3. v3.3 backup at .pkl.bak.v3.1-2026-05-21 for rollback.

Negative results count. Publishing a no-promote alongside the promoted experiments keeps the methodology honest — the readership sees what worked AND what we tried and abandoned.
