voidly

Empirical-Bayes partial pooling didn't lift the classifier tail. Held.

Classifier v3.3 has a known split: LOCO median F1 is 0.870 but the mean is only 0.711, dragged down by ~16 MENA / former-Soviet countries (OM, UZ, TN, LY, YE, JO, MA and similar) whose LOCO F1 falls between 0.00 and 0.36. This finding is an honest attempt to lift that tail with empirical-Bayes partial pooling: a James-Stein shrinkage layer on top of v3.3 (no retrain) that pulls each country's probability toward its UN-region mean, where w = n_country/(n_country+k) is the weight kept on the country's own prediction and k is tuned by an inner LOCO sweep.

It failed the promote gate decisively. The k-sweep is strictly decreasing in k (LOCO mean F1 0.714 at k=4, collapsing to 0.461 at k=60). There is no interior optimum, so more shrinkage only hurts and the sweep just picks the smallest k offered. LOCO mean F1 moved from 0.7109 to 0.7136 (+0.27pp; gate needs >=0.75: FAIL). Only 1 of 16 tail countries improved by >=3pp (TN, where a single row flipped in a 4-positive country, which is noise; gate needs >=10: FAIL). LOCO median held at 0.875.

The root cause: the 16 named tail countries are not row-poor. They hold 53 to 85 labeled rows each, so at k=4, the value the sweep selects, the shrinkage weight w is already >=0.93 (53/57 is about 0.93) and the layer is structurally near-inert for exactly the countries the gate watches. What they actually lack is positive labels (2-22 confirmed-censorship days apiece), and their UN-region priors are themselves low (~0.19-0.22), so blending a ~0.18 prediction toward a ~0.20 prior cannot push anything across the 0.5 threshold.

Three reformulations were also tested and ruled out: positive-count weighting (w = n_pos/(n_pos+k)), a region-floor blend (shrinkage may only raise a prediction), and a per-country F1-optimal threshold decoupled from the 0.5 cut. Partial pooling was NOT promoted; classifier v3.3 stays in production unchanged, and no serving endpoint exposes the pooled model.
This is the third Atlas experiment (after v3.4 regime-cluster fine-tuning and the forecast-contagion port) to converge on the same lesson: no post-hoc architecture fixes a positive-label shortage. The honest path for the tail is targeted data labeling, not modeling.
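
The root-cause arithmetic can be checked with a short sketch. This is not the Atlas code: the function names `shrink_weight` and `pooled_probability` are hypothetical, and the inputs (a 0.18 country prediction, a 0.20 region prior, 53 and 85 rows, 2 and 22 positives) are illustrative values picked from the ranges quoted in this finding, not real rows. It assumes the pooled probability is the usual convex blend w * p_country + (1 - w) * p_region.

```python
def shrink_weight(n: int, k: float) -> float:
    """Weight kept on the country's own prediction: w = n / (n + k)."""
    return n / (n + k)


def pooled_probability(p_country: float, p_region: float, n: int, k: float) -> float:
    """Convex blend of the country prediction and its region prior (assumed form)."""
    w = shrink_weight(n, k)
    return w * p_country + (1.0 - w) * p_region


# Illustrative values taken from the ranges quoted in the text (assumptions).
P_COUNTRY, P_REGION, THRESHOLD = 0.18, 0.20, 0.5

# Row-count weighting: tail countries hold 53 to 85 labeled rows.
for n_rows in (53, 85):
    for k in (4, 60):
        w = shrink_weight(n_rows, k)
        p = pooled_probability(P_COUNTRY, P_REGION, n_rows, k)
        print(f"rows={n_rows} k={k} w={w:.3f} pooled={p:.4f} crosses={p >= THRESHOLD}")

# Positive-count weighting: w = n_pos / (n_pos + k), with 2 to 22 positives.
for n_pos in (2, 22):
    w = shrink_weight(n_pos, 4)
    p = pooled_probability(P_COUNTRY, P_REGION, n_pos, 4)
    print(f"n_pos={n_pos} k=4 w={w:.3f} pooled={p:.4f} crosses={p >= THRESHOLD}")
```

Because the blend is a convex combination, the pooled value can never exceed the larger of its two inputs. Whatever weight is chosen, a prediction near 0.18 blended with a prior near 0.20 stays near 0.20, far below the 0.5 cut.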

#methodology #ml #classifier #partial-pooling #empirical-bayes #james-stein #shrinkage #tail-countries #negative-result #honest-no-promote #atlas

Raw data