v3.2 (geographic adjacency) showed the contagion signal was real but the cohort definition was naive. EG paired with IL/PS/LY/SD regressed because those are wildly different regimes; Western Europe got dragged by authoritarian neighbor noise.
v3.3 weights each neighbor by the historical correlation between its anomaly_rate and the target country's over the last 180 days. The adjacency map is the same as in v3.2, but each neighbor's signal is now multiplied by max(0, pearsonr). A neighbor with zero or negative correlation (a different regime) contributes nothing, and a weakly correlated neighbor is down-weighted in proportion.
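The weighting above can be sketched in a few lines. This is a minimal illustration, not the code from the build script: the function names, the date-keyed dictionary layout, and the choice to sum weighted neighbor rates are all assumptions made for the example, and the histories are assumed to be already restricted to the 180-day window. The 10-day overlap threshold is the one mentioned in the root-cause discussion further down.

```python
import math

MIN_OVERLAP_DAYS = 10  # overlap threshold cited later in this write-up


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def regime_weight(target, neighbor, min_overlap=MIN_OVERLAP_DAYS):
    """Weight for one directed edge: max(0, r) over days both countries have data.

    `target` and `neighbor` map a date key -> anomaly_rate (hypothetical layout).
    """
    shared = sorted(set(target) & set(neighbor))
    if len(shared) < min_overlap:
        return 0.0  # too few overlapping days: the edge is dropped
    r = pearson([target[d] for d in shared], [neighbor[d] for d in shared])
    return max(0.0, r)  # negative correlation is clipped to zero


def contagion_signal(target, neighbors, today_rates):
    """Sum of each neighbor's current anomaly_rate scaled by its regime weight.

    Summing (rather than averaging) is an assumption of this sketch.
    """
    return sum(
        regime_weight(target, history) * today_rates[cc]
        for cc, history in neighbors.items()
    )
```

Note that the clip only zeroes out non-positive correlations; a neighbor with a small positive correlation still contributes, just weakly.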
Aggregate metrics — v3.3 wins clearly
| Metric | v3.1 | v3.2 | v3.3 |
|---|---|---|---|
| Stratified F1 | 0.673 | 0.712 | 0.729 |
| Stratified AUC | 0.868 | 0.892 | 0.899 |
| LOCO mean F1 | 0.710 | 0.706 | 0.711 |
| LOCO median F1 | 0.818 | 0.800 | 0.870 |
Key censorship country wins
- Egypt (EG): 0.548 → 0.413 → 0.726 (recovered from v3.2 regression, +18pp vs v3.1)
- Venezuela (VE): 0.818 → 0.852 → 0.900 (best F1 of any model)
- Iran (IR): 0.795 → 0.846 → 0.850 (steady improvement)
- Pakistan (PK): 0.385 → 0.475 → 0.483 (held v3.2 win)
- Germany (DE): low in v3.2 → 1.000 (Western Europe recovered)
- Italy (IT): 0.71 in v3.2 → 0.909
- Japan (JP): low in v3.2 → 1.000
The honest trade-off — 16 countries regress vs v3.1
The biggest losses are concentrated in MENA + former Soviet states where neighbors don't have enough overlapping observations to compute a meaningful correlation:
- Oman (OM): −29pp F1
- Uzbekistan (UZ): −28pp
- Tunisia (TN): −24pp (small n=20, noise-sensitive)
- Libya (LY): −20pp
- Yemen (YE): −20pp
- Jordan (JO): −19pp
- Morocco (MA): −15pp
- Vietnam (VN): −12pp
- Syria (SY): −9pp
- Cuba (CU): −7pp
Root cause: the median directed-edge overlap between neighbor pairs is only 3 days, because most countries have ~32 samples each, often on non-overlapping dates. The ≥10-overlap filter drops 280 of 494 edges (57%), leaving those countries with effectively zero contagion signal. Because the model was trained alongside rich contagion signal globally, setting their feature to 0 hurts more than keeping the geo-naive v3.2 noise.
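A diagnostic along these lines makes the overlap problem easy to check on any snapshot of the data. It is a sketch under stated assumptions: `dates_by_country` (country code -> set of observation dates) and `adjacency` (country code -> list of neighbor codes) are hypothetical inputs, and the function names are illustrative rather than taken from the build script.

```python
from statistics import median

MIN_OVERLAP_DAYS = 10  # the ≥10-overlap filter described above


def edge_overlaps(dates_by_country, adjacency):
    """Number of shared observation days for every directed edge (target, neighbor)."""
    return {
        (tgt, nb): len(dates_by_country[tgt] & dates_by_country[nb])
        for tgt, nbs in adjacency.items()
        for nb in nbs
    }


def overlap_report(overlaps, min_overlap=MIN_OVERLAP_DAYS):
    """Summarize how many edges survive the filter and which targets lose all signal."""
    dropped = [edge for edge, n in overlaps.items() if n < min_overlap]
    kept_targets = {tgt for (tgt, _), n in overlaps.items() if n >= min_overlap}
    all_targets = {tgt for tgt, _ in overlaps}
    return {
        "median_overlap": median(overlaps.values()),
        "dropped_edges": len(dropped),
        "total_edges": len(overlaps),
        # targets whose every outgoing edge was dropped get a contagion feature of 0
        "targets_without_signal": sorted(all_targets - kept_targets),
    }
```

The `targets_without_signal` list is the interesting output: it names the countries whose contagion feature collapses to zero.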
Why we promoted anyway
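LOCO mean F1 is the population-weighted average across all countries. It went up marginally (0.7099 → 0.7109). LOCO median is up 5pp (0.818 → 0.870). Stratified F1 rose 5.6pp (0.673 → 0.729, about 8% relative). Every aggregate metric is better than v3.1, although the LOCO mean gain on its own is only 0.1pp.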
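To keep percentage points and relative percentages apart, the v3.1 → v3.3 deltas can be recomputed directly from the aggregate table. The snippet below only restates the table's numbers; the `deltas` helper is an illustrative name.

```python
# (v3.1, v3.3) pairs copied from the aggregate metrics table
metrics = {
    "Stratified F1": (0.673, 0.729),
    "Stratified AUC": (0.868, 0.899),
    "LOCO mean F1": (0.710, 0.711),
    "LOCO median F1": (0.818, 0.870),
}


def deltas(old, new):
    """Absolute change in percentage points and relative change in percent."""
    return {"pp": (new - old) * 100, "relative_pct": (new - old) / old * 100}


report = {name: deltas(old, new) for name, (old, new) in metrics.items()}

for name, d in report.items():
    print(f"{name}: {d['pp']:+.1f}pp, {d['relative_pct']:+.1f}% relative")
```

The "8%" figure for stratified F1 is the relative change; in absolute terms the gain is 5.6pp.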
The regressions are concentrated in small-sample tail cases. TN has n=20 — F1 is noisy by definition. The wins (IR, VE, EG, DE, JP, IT, ES) all have n=39-96 with strong signal.
Honest framing: v3.3 is the best model for the typical country, including the most-censored ones (IR, VE, CN, RU). It is not uniformly better. Users who specifically need accurate predictions for OM/UZ/TN/LY/YE/JO/MA should also consult /v1/classifier/score, which returns the raw model output + top features so the user can sanity-check.
Next iteration (v3.4)
Worth trying before the next promotion:
- Lower MIN_OVERLAP_DAYS to 5 with shrinkage toward subregion mean. Recovers the dropped edges without trusting them blindly.
- Top-K globally-correlated neighbors (not constrained to UN-subregion) — let KW borrow from IR/IQ if they correlate higher than KW's actual geographic neighbors.
- Impute correlations from subregion mean when pairwise overlap is too low. Halfway between geo-naive (v3.2) and pure regime (v3.3).
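The three ideas above could be combined roughly as follows. None of this exists yet: the shrinkage form (a sample-size-weighted blend), the `SHRINKAGE_K` constant, and every function name here are assumptions chosen for illustration, not a description of what v3.4 will ship.

```python
MIN_OVERLAP_DAYS = 5   # proposed lower threshold from the list above
SHRINKAGE_K = 10.0     # hypothetical prior strength, in pseudo-days of overlap


def shrunk_correlation(r_pair, n_overlap, r_subregion,
                       k=SHRINKAGE_K, min_overlap=MIN_OVERLAP_DAYS):
    """Blend a pairwise correlation with the subregion mean.

    Below `min_overlap` (or with no pairwise estimate) the subregion mean is
    imputed outright; above it, trust in the pairwise value grows with overlap.
    """
    if r_pair is None or n_overlap < min_overlap:
        return r_subregion
    lam = n_overlap / (n_overlap + k)
    return lam * r_pair + (1 - lam) * r_subregion


def edge_weight(r_pair, n_overlap, r_subregion):
    """Same max(0, r) clip as v3.3, applied to the shrunk correlation."""
    return max(0.0, shrunk_correlation(r_pair, n_overlap, r_subregion))


def top_k_neighbors(correlations, k=3):
    """Pick the k most positively correlated countries, ignoring geography.

    `correlations` maps candidate country code -> correlation with the target.
    """
    ranked = sorted(correlations.items(), key=lambda kv: kv[1], reverse=True)
    return [cc for cc, r in ranked[:k] if r > 0]
```

With this form, an edge with 10 overlapping days and `SHRINKAGE_K = 10` sits exactly halfway between its own correlation and the subregion mean, which matches the "halfway between geo-naive and pure regime" intent.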
Reproducibility
Build script: scripts/build-classifier-v3.3-regime-weighted.py. Train script: scripts/train-classifier-v3.3.py. Both are in the public repo. The correlation computation lives at the top of the build script, so it is easy to audit and fork.
Model promoted on Vultr at /opt/voidly-ai/models/censorship_classifier_v3_promoted.pkl. Backup at .pkl.bak.v3.1-2026-05-21 for instant rollback.
/v1/classifier/info reports v3.3 with the updated LOCO numbers.