UZ CN JO EG PK BY MA
2026-05-22

A logistic stacker lifts the fused anomaly ensemble from 0.66 to 0.75 AUC

Voidly Atlas fuses four unsupervised anomaly detectors (a DBSCAN per-country shape detector, an STL seasonal-residual detector, a multi-country burst detector, and an HDBSCAN per-domain drift detector) into one composite anomaly score per country per day. The original fusion combined them with a hand-picked weighted average (0.35/0.25/0.20/0.20), then quietly tried three variants and kept whichever scored the highest AUC on the very labels it then reported against. That is in-sample model selection, so its published ~0.66 to 0.68 carried selection leakage. This finding is the rebuild.

Fusion v2 changes two things. First, evaluation is rolling-origin forward-temporal cross-validation: each of three folds fits on earlier label dates and scores on a strictly later, held-out block. It is never a shuffled random split, because a prior Atlas audit showed that shuffled splits leak time-autocorrelation between adjacent days and inflate AUC. The reported number is the mean across folds. Second, v2 evaluates five fusion methods, each fit only on the train fold: plain averaging, AUC-weighted averaging, rank-averaging, AUC-weighted rank-averaging, and a small logistic-regression stacker over the four detector scores plus four present/absent indicators.

The logistic stacker won decisively: mean held-out composite AUC 0.745 (per-fold 0.753 / 0.769 / 0.715), versus 0.655 for the best averaging variant and 0.584 for plain averaging, which barely clears chance (0.5). The stacker beats the best single detector by about ten AUC points, and its worst temporal fold (0.715) is the honest lower bound across the three folds.

So fusion v2 is both higher and more honest than the old number: 0.745 mean on a leak-free split versus 0.663 in-sample-selected. The live /v1/anomaly/fused/* endpoints are unchanged; they just load the updated artifact.

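
The rolling-origin split described above can be sketched in a few lines. This is a minimal illustration, not the Atlas implementation: the function name `rolling_origin_folds`, the expanding training window, and the equal-sized test blocks are all assumptions, since the post only states that each fold trains on earlier label dates and scores on a strictly later block.

```python
from datetime import date, timedelta


def rolling_origin_folds(label_dates, n_folds=3):
    """Yield (train_dates, test_dates) pairs in forward-temporal order.

    Assumption: an expanding training window. The sorted timeline is cut
    into n_folds + 1 contiguous blocks; the first block is train-only and
    each later block is tested once, using every earlier date for training.
    """
    unique = sorted(set(label_dates))
    block = len(unique) // (n_folds + 1)
    if block == 0:
        raise ValueError("not enough distinct dates for the requested folds")
    for k in range(1, n_folds + 1):
        cut = k * block
        # The last fold absorbs any leftover dates at the end of the timeline.
        end = len(unique) if k == n_folds else cut + block
        yield unique[:cut], unique[cut:end]


# Hypothetical label dates: 40 consecutive days (not Atlas data).
dates = [date(2026, 1, 1) + timedelta(days=i) for i in range(40)]
folds = list(rolling_origin_folds(dates, n_folds=3))

for train, test in folds:
    # Every test date is strictly later than every training date,
    # so no adjacent-day autocorrelation crosses the boundary backwards.
    assert max(train) < min(test)
    print(len(train), "train days ->", len(test), "test days")
```

Any fusion method would then be fit on `train` only and scored on `test`, with the per-fold AUCs averaged afterwards.
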
Honest caveats baked in. The published 0.745 is the held-out mean, while the live per-country composite re-fits the stacker on all labels (standard once a method is selected), so the live model is not the exact model that was scored. Two of the four detectors (bursts and HDBSCAN drift) are near-static current-snapshot signals that contribute weakly, so DBSCAN and STL carry most of the discriminative load. And "anomalous" is not "censored": the fused ensemble is a second-opinion signal, and the supervised v3.3 classifier remains the headline censorship predictor.
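
To make the stacker's input concrete, here is a hedged sketch of a logistic stacker over four detector scores plus four present/absent indicators. Everything specific in it is an assumption for illustration: the detector key names, the choice to fill an absent score with 0.0, the hand-rolled gradient-descent fit (standing in for whatever library Atlas actually uses), and the toy rows, which are invented and are not Atlas data.

```python
import math

# Hypothetical detector keys; the post does not name the fields.
DETECTORS = ("dbscan_shape", "stl_residual", "burst", "hdbscan_drift")


def stack_features(scores):
    """Return 8 features: four scores (0.0 if absent), then four presence flags."""
    values = [scores.get(name) for name in DETECTORS]
    filled = [0.0 if v is None else float(v) for v in values]
    present = [0.0 if v is None else 1.0 for v in values]
    return filled + present


def predict_one(w, b, x):
    """Logistic probability for one feature vector."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000, l2=1e-3):
    """Batch gradient descent on L2-regularised log-loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_one(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        for j in range(d):
            w[j] -= lr * (grad_w[j] / n + l2 * w[j])
        b -= lr * grad_b / n
    return w, b


# Invented toy rows: (detector scores, label). None means the detector was absent.
rows = [
    ({"dbscan_shape": 0.90, "stl_residual": 0.8, "burst": 0.1, "hdbscan_drift": None}, 1),
    ({"dbscan_shape": 0.80, "stl_residual": 0.9, "burst": None, "hdbscan_drift": 0.2}, 1),
    ({"dbscan_shape": 0.70, "stl_residual": 0.7, "burst": 0.2, "hdbscan_drift": 0.1}, 1),
    ({"dbscan_shape": 0.10, "stl_residual": 0.2, "burst": 0.1, "hdbscan_drift": 0.1}, 0),
    ({"dbscan_shape": 0.20, "stl_residual": 0.1, "burst": None, "hdbscan_drift": None}, 0),
    ({"dbscan_shape": 0.15, "stl_residual": 0.3, "burst": 0.2, "hdbscan_drift": 0.2}, 0),
]
X = [stack_features(s) for s, _ in rows]
y = [label for _, label in rows]
w, b = fit_logistic(X, y)

high = stack_features({"dbscan_shape": 0.85, "stl_residual": 0.85, "burst": 0.1})
low = stack_features({"dbscan_shape": 0.10, "stl_residual": 0.10, "burst": 0.1})
p_high, p_low = predict_one(w, b, high), predict_one(w, b, low)
print(round(p_high, 3), round(p_low, 3))
```

The presence flags let the model treat "detector absent" differently from "detector present with a low score", which a plain average cannot do. In a forward-temporal evaluation, `fit_logistic` would be called on the train fold only.
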

#methodology #ml #anomaly-detection #ensemble #stacking #temporal-cv #accountability #atlas #api
