Voidly's supervised v3.3 classifier sits at F1 0.729 / AUC ≈ 0.99 on labeled incidents — by far our strongest signal. But labels are themselves curated, and the unsupervised view answers a different question: which (country, day) feature vectors look weird, regardless of whether anyone wrote them up as an incident?
CenDTect (Aceto & Pescape, 2025) proposed clustering OONI measurements with DBSCAN and treating noise points (cluster label -1) as candidate censorship. We adapted that to a per-country rolling window over the full Voidly evidence table.
The build
- Feature matrix: 12 columns per (country, day) — block_rate, log measurement count, ASN diversity, source diversity, plus the 8-bucket signal-type composition (DNS-block, TCP-reset, blockpage, TLS-reset, outage, interference, generic block, ok).
- Window: rolling per-country, last 45 days. Standardized via StandardScaler within the window.
- DBSCAN: eps = 75th percentile of the k-NN distances (k = 3), min_samples = 3. Continuous score = the test day's distance to the nearest core point.
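The build steps above can be sketched in dependency-free Python. This is an illustration under stated assumptions, not the production code: the function names are hypothetical, the test day is assumed to be the last row of the window, and the earlier rows are assumed to be the reference set. `standardize` mimics what StandardScaler computes.

```python
import math
import statistics


def standardize(rows):
    """Z-score every column with the window's own mean and population std."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    # A constant column has std 0; scale it by 1 instead (StandardScaler does the same).
    stds = [statistics.pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def quantile(values, q):
    """Linear-interpolation quantile, q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def score_test_day(window, k=3, eps_quantile=0.75, min_samples=3):
    """Score the last row of `window` (oldest first) against the earlier rows.

    Assumption: the earlier rows are the reference set, the last row is the test day.
    Returns (score, is_noise, eps).
    """
    z = standardize(window)
    history, test = z[:-1], z[-1]

    # eps: chosen quantile of each reference point's distance to its k-th nearest neighbour.
    kth = []
    for i, p in enumerate(history):
        dists = sorted(math.dist(p, q) for j, q in enumerate(history) if j != i)
        kth.append(dists[k - 1])
    eps = quantile(kth, eps_quantile)

    # Core points: at least min_samples reference points (itself included) within eps.
    cores = [p for p in history
             if sum(math.dist(p, q) <= eps for q in history) >= min_samples]

    # Continuous score: distance from the test day to the nearest core point.
    score = min((math.dist(test, c) for c in cores), default=math.inf)
    # Noise in the DBSCAN sense: not within eps of any core point.
    return score, score > eps, eps
```

In this sketch a window of 45 rows with 12 columns each would be passed as a list of lists. Whether the test day takes part in fitting the scaler and the core points is a design choice the text does not spell out; here it is standardized with the window but excluded from the reference set.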
Results
- AUC vs v3.3 labels: 0.6506 (n=3,922 scored, 1,023 positive)
- AUC-PR: 0.3639 — well above the 0.26 baseline (positive rate)
- Binary flag AUC: 0.6372 (using only the is_noise flag instead of the continuous score)
- Promote floor: 0.65, cleared by 0.0006 AUC (0.06 percentage points)
- Improvement over the v1 IsolationForest baseline (AUC 0.489) is +16 percentage points
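For readers who want to check the arithmetic, here is a minimal sketch. `roc_auc` is a hypothetical helper that computes AUC as the probability that a random positive outranks a random negative; it is not necessarily how the reported numbers were produced. The constants are the figures quoted in the list above.

```python
def roc_auc(labels, scores):
    """AUC as P(random positive scores higher than random negative); ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Figures quoted in the results list.
n_scored, n_positive = 3922, 1023
auc, promote_floor, v1_auc = 0.6506, 0.65, 0.489

# A ranker with no skill has an expected AUC-PR close to the positive rate.
pr_baseline = n_positive / n_scored   # about 0.261
margin = auc - promote_floor          # about 0.0006 AUC, i.e. 0.06 percentage points
gain_over_v1 = auc - v1_auc           # about 0.16 AUC, i.e. 16 percentage points

print(f"baseline={pr_baseline:.3f} margin={margin:.4f} gain={gain_over_v1:.3f}")
```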
Honest caveats
- Barely over the floor. 0.6506 against a promote floor of 0.65 is a margin of 0.0006 AUC (0.06 percentage points). Drift in the underlying evidence distribution could push it back below.
- Hyperparameters tuned on the same labels we evaluate on. Window length, eps quantile, and min_samples were picked by grid search over the v3.3 labeled set. There is no held-out validation split — that's the next step before any production use.
- Recall at the 90th-percentile score threshold is only 16.6% (precision 43.3%). The model is useful for prioritizing investigations, not for replacing the supervised classifier.
- Pooled (global) DBSCAN was worse (AUC ~0.58 across all tested window sizes). The per-country approach beats it cleanly — each country's baseline is genuinely different and global pooling washes out the signal.
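The precision and recall caveat can be made concrete with a small sketch. It assumes the threshold is the nearest-rank percentile of the scored set and that a day is flagged when its score is strictly above it; the function name and the toy data are illustrative only.

```python
import math


def precision_recall_at_percentile(labels, scores, pct=90):
    """Flag scores strictly above the pct-th percentile and compare the flags with labels."""
    ordered = sorted(scores)
    # Nearest-rank percentile: the value at position ceil(pct% of n), 1-indexed.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    threshold = ordered[rank - 1]

    flagged = [y for y, s in zip(labels, scores) if s > threshold]
    true_pos = sum(flagged)
    total_pos = sum(labels)
    precision = true_pos / len(flagged) if flagged else 0.0
    recall = true_pos / total_pos if total_pos else 0.0
    return precision, recall, threshold


# Toy data: 20 days with scores 1..20, four of them labeled as incidents.
toy_scores = list(range(1, 21))
toy_labels = [1 if s in (5, 10, 15, 20) else 0 for s in toy_scores]
toy_precision, toy_recall, toy_threshold = precision_recall_at_percentile(toy_labels, toy_scores)
```

On the toy data only the two highest-scoring days are flagged and one of them is a labeled incident, which shows how a high threshold can leave most positives unflagged.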
Why ship it
The supervised classifier is trained against the same labels its AUC is measured against, so it inherits the human-curated definition of “what counts as an incident”. DBSCAN doesn't see labels at all. When the two disagree, the disagreement is itself the signal: a (country, day) the classifier shrugs at but DBSCAN flags is exactly the kind of case worth a human look.
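One way to turn that disagreement into a review queue is sketched below. The `DayScore` record, its field names, and the 0.5 probability cutoff are all hypothetical; they only illustrate the idea of surfacing days that DBSCAN flags and the classifier does not.

```python
from dataclasses import dataclass


@dataclass
class DayScore:
    country: str
    day: str
    classifier_prob: float  # supervised classifier output (hypothetical field name)
    dbscan_noise: bool      # True when DBSCAN treats the day as a noise point


def needs_review(rows, prob_cutoff=0.5):
    """Return (country, day) pairs that DBSCAN flags but the classifier scores below the cutoff."""
    return [(r.country, r.day)
            for r in rows
            if r.dbscan_noise and r.classifier_prob < prob_cutoff]


# Made-up rows with placeholder country codes.
sample = [
    DayScore("AA", "2025-01-01", 0.92, True),   # both agree: already covered
    DayScore("BB", "2025-01-01", 0.08, True),   # disagreement: worth a human look
    DayScore("CC", "2025-01-01", 0.05, False),  # both quiet
]
queue = needs_review(sample)
```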
Live at
GET /v1/anomaly/dbscan/{cc} — score a country's most-recent day (with feature vector + interpretation)
GET /v1/anomaly/dbscan/leaderboard?limit=20 — most-anomalous countries right now
GET /v1/anomaly/dbscan/info — full sidecar metrics for transparency
Example: GET /v1/anomaly/dbscan/IR currently returns anomaly_score ≈ 4.89, is_anomaly=true, with 100% block_rate across 12 critical measurements concentrated on a single ASN, a textbook shape-anomalous day.
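A minimal client sketch for the endpoints above, using only the Python standard library. The host in `BASE_URL` is a placeholder, not the real API address, and the code assumes that `anomaly_score` and `is_anomaly` are top-level keys of the JSON body; the actual response shape may differ.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "https://api.example.invalid"  # placeholder: substitute the real API host


def country_url(cc):
    """URL for scoring one country's most recent day."""
    return f"{BASE_URL}/v1/anomaly/dbscan/{quote(cc.upper())}"


def leaderboard_url(limit=20):
    """URL for the most-anomalous-countries leaderboard."""
    return f"{BASE_URL}/v1/anomaly/dbscan/leaderboard?{urlencode({'limit': limit})}"


def summarize(payload):
    """One-line summary; assumes top-level anomaly_score and is_anomaly keys."""
    flag = "ANOMALOUS" if payload["is_anomaly"] else "normal"
    return f"{flag} (score {payload['anomaly_score']:.2f})"


def fetch_country(cc, timeout=10):
    """Perform the GET request and decode the JSON body."""
    with urlopen(country_url(cc), timeout=timeout) as resp:
        return json.load(resp)
```

With a real host configured, `summarize(fetch_country("IR"))` would print a one-line verdict for the example above.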