BZ OM MA UZ CN 2026-05-22

Closing the active-learning loop — and the honest near-zero F1 lift

Voidly Atlas had an active-learning queue that ranked the unlabeled country-days the v3.3 classifier was least sure about, but the loop was open. Human-consensus labels dead-ended at active_learning_labels.json; nothing fed them back, because v3.3 trains from a separate corpus that the aggregator never touched.

This finding closes the loop end to end. promote-al-labels-to-training.py rebuilds the full 16-feature v3.3 row from live evidence for every consensus-labelled candidate, de-duplicates on (country, date), excludes IODA disruption incidents from positive labels, and appends to labeled_incidents_v3.3.json with active-learning provenance. Once 10 new labels accumulate, it queues a classifier retrain via the same retrain-queue.json that the weekly cron services; the request shares a 12h cooldown with the drift trigger and merges into the existing queue rather than overwriting it. GET /v1/sentinel/active-learning-loop-status reports every stage.

The honest part: simulate-al-loop-lift.py measured whether feeding the loop actually lifts LOCO F1. On the live queue only 13 to 19 of 50 candidates were reconstructable. Across two runs hours apart, adding them changed LOCO median F1 by between -0.0124 and -0.0362 and mean F1 by between +0.0002 and -0.0055. That is negligible, and the disagreement between the runs is itself evidence that the swing is sampling noise, not a stable effect. The per-country breakdown shows the augmented model improving on exactly the hard sparse-data countries the queue surfaces (Belize, Oman, Morocco, Uzbekistan and China all gained double-digit F1 on the larger run), but the aggregate nets to zero: a volume problem, not a signal problem. A dozen or so rows added to 4,237 is a rounding error. The loop is plumbed and verified but idle until reviewers label a few hundred candidates. We are not claiming a lift that is not there. Filed as an honest negative.
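The promotion and retrain-queue steps described above can be sketched roughly as follows. This is a minimal illustration, not the actual promote-al-labels-to-training.py. The function names, the row fields (country, date, label, ioda_disruption, provenance) and the queue keys (reasons, last_queued_at) are all hypothetical. The threshold of 10 labels and the 12h cooldown come from the text. Reading the IODA rule as "skip a positive label that is an IODA disruption incident" is also an assumption.

```python
from datetime import datetime, timedelta, timezone

RETRAIN_THRESHOLD = 10            # new labels needed before a retrain is queued (from the text)
COOLDOWN = timedelta(hours=12)    # cooldown shared with the drift trigger (from the text)


def promote(corpus, candidates):
    """Append consensus-labelled candidates to the corpus, skipping duplicates.

    Rows are de-duplicated on (country, date). Field names are assumed.
    Returns the number of rows actually appended.
    """
    seen = {(row["country"], row["date"]) for row in corpus}
    added = 0
    for cand in candidates:
        key = (cand["country"], cand["date"])
        if key in seen:
            continue
        # Assumed reading of the rule: an IODA disruption incident
        # is never promoted as a positive label.
        if cand["label"] == 1 and cand.get("ioda_disruption"):
            continue
        corpus.append({**cand, "provenance": "active-learning"})
        seen.add(key)
        added += 1
    return added


def maybe_queue_retrain(queue, pending, now):
    """Merge a retrain request into the queue once enough labels are pending.

    `queue` is a dict standing in for retrain-queue.json; its keys are assumed.
    Returns True if a request was queued.
    """
    if pending < RETRAIN_THRESHOLD:
        return False
    last = queue.get("last_queued_at")
    if last is not None and now - last < COOLDOWN:
        return False  # still inside the shared cooldown
    reasons = set(queue.get("reasons", []))
    reasons.add("active-learning")
    queue["reasons"] = sorted(reasons)  # merge with existing reasons, never overwrite
    queue["last_queued_at"] = now
    return True
```

In this sketch a drift-triggered request already sitting in the queue survives the merge, and a second active-learning request arriving inside the cooldown window is a no-op.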

#methodology #ml #active-learning #honest-negative #loco #retraining #accountability #atlas #api #shipped
