Voidly already runs CenDTect-style DBSCAN per (country, day) and surfaces it at /v1/anomaly/dbscan/{cc} (AUC 0.65 against labeled incidents, promoted as a second-opinion signal). That axis asks: which days look weird per country?
This finding ships the other axis: which DOMAINS are changing how they're blocked across the world? The implementation uses HDBSCAN (McInnes and Healy 2017, arXiv:1705.07321), which clusters at variable density. That suits a small-to-medium per-domain matrix better than DBSCAN's single-eps assumption, because measurement counts range from 10 (niche domains) to 1,800 (claude.ai).
The build
- Feature matrix per domain: 18 numeric columns over the last 28 days — log measurement count, mean and std of per-country block rate, distinct countries blocking, ASN diversity, source diversity, share of each blocking method (DNS poison, TCP reset, blockpage, TLS reset, ISP outage), and one-hot domain-category shares (NEWS, ANON, GRP, PORN, MSG, SRCH, OTHER).
- Filter: domains with >= 10 measurements in the window (signal threshold).
- Clustering: HDBSCAN with min_cluster_size=5, metric=euclidean, applied after StandardScaler. Re-fit independently for this-week and last-week matrices.
- Drift surface: per-domain L2 distance between this-week and last-week vectors in this-week's standardized space. Cluster-level: matched centroids (greedy nearest-neighbor with a 3-sigma cutoff) yield centroid-drift distances; unmatched THIS-week cluster IDs are flagged as new clusters, i.e. novel blocking patterns.
First-run results (week ending 2026-05-20)
- 27 domains clustered this week (same set last week). HDBSCAN forms 2 clusters this week, 0 last week — all of last week was noise. Both this-week clusters are therefore flagged as new clusters this week (label IDs 0 and 1) — novel groupings the domain space didn't support a week ago.
- Top-10 drift domains by L2 distance:
  - tiktok.com: drift 0.343, moved noise → cluster 1, blocked in 25 countries (avg block_rate 1.0, 77% DNS poison + 23% TCP reset)
  - openai.com: drift 0.336, blocked in 26 countries
  - gemini.google.com: drift 0.331, blocked in 27 countries
  - google.com: drift 0.235, avg block_rate 1.0 across 21 countries
  - twitter.com: drift 0.204, moved noise → cluster 0, blocked in 27 countries
  - youtube.com: drift 0.203, moved → cluster 0, 24 countries
  - whatsapp.com: drift 0.179, → cluster 0, 24 countries
  - facebook.com: drift 0.176, → cluster 0, 25 countries
  - huggingface.co: drift 0.157, 26 countries
  - theguardian.com: drift 0.156, moved → cluster 1, 23 countries
Verification
We re-queried the evidence table for each top-10 drift domain over the last 14 days and asked: are these actually being blocked, or is the drift noise? Result: 10 / 10 top-drift domains have critical or warning evidence in 18-27 countries over the last 14 days. 4 / 10 also appear in our curated incidents table by name (tiktok, openai, twitter, facebook). For the top 10 at least, the unsupervised drift signal is tracking real blocking activity.
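As an illustration of this kind of cross-check, the sketch below counts distinct countries with critical or warning evidence per top-drift domain. The (domain, country_code, severity) row shape, the function name evidence_coverage, and the toy rows with placeholder country codes are all assumptions made for the example; the real evidence table schema is not described here.

```python
from collections import defaultdict


def evidence_coverage(evidence_rows, top_domains, severities=("critical", "warning")):
    """Count distinct countries with qualifying evidence for each domain.

    evidence_rows: iterable of (domain, country_code, severity) tuples.
    This row shape is a hypothetical stand-in for the real evidence table.
    """
    wanted = set(top_domains)
    countries = defaultdict(set)
    for domain, country, severity in evidence_rows:
        if domain in wanted and severity in severities:
            countries[domain].add(country)
    return {d: len(countries[d]) for d in top_domains}


# Toy rows for illustration only; "AA" and "BB" are placeholder country codes.
rows = [
    ("tiktok.com", "AA", "critical"),
    ("tiktok.com", "BB", "warning"),
    ("tiktok.com", "BB", "critical"),  # same country, counted once
    ("openai.com", "AA", "info"),      # severity below threshold, ignored
]
coverage = evidence_coverage(rows, ["tiktok.com", "openai.com"])
confirmed = [d for d, n in coverage.items() if n > 0]
print(coverage, f"{len(confirmed)} / {len(coverage)} confirmed")
```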
Honest caveats
- Small corpus. Only 27 domains pass the >= 10-measurement filter in a 28-day window. HDBSCAN works with 27 points but the cluster geometry is fragile — week-over-week stability needs months of data to establish.
- No labeled holdout. Unlike the per-country DBSCAN (AUC 0.65 against labels), this surface has no ground truth — drift is a measurement, not a prediction. We verify by cross-referencing the evidence table, not by AUC.
- Cluster IDs are not stable across runs. HDBSCAN assigns IDs by cluster-discovery order. The cron run uses greedy centroid-distance matching to detect when a cluster is "the same shape as last week" — but the centroid-drift cutoff (3 sigma in standardized space) is a heuristic, not tuned.
- This is a SECOND-OPINION signal. The supervised classifier still wins at AUC ~0.99 on labeled incidents. Use this for novel-pattern detection (cluster changes), not for first-line incident triage.
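Because cluster IDs are not stable, week-over-week comparison depends on the centroid matching mentioned above. One way such a greedy match could look is sketched below. It is an assumption-laden illustration: it treats the 3-sigma cutoff as a Euclidean distance of 3.0 in standardized space, resolves conflicts by taking the globally closest pairs first, and uses the hypothetical name match_centroids.

```python
import math


def match_centroids(this_centroids, last_centroids, cutoff=3.0):
    """Greedily pair this-week and last-week cluster centroids by distance.

    Both arguments map cluster ID -> centroid vector in the same standardized space.
    Returns (matches, new_ids): matches maps this-week ID -> (last-week ID, distance),
    and new_ids lists this-week clusters with no partner within the cutoff.
    """
    # Every candidate pair, closest first.
    pairs = sorted(
        (math.dist(c_this, c_last), tid, lid)
        for tid, c_this in this_centroids.items()
        for lid, c_last in last_centroids.items()
    )
    matches, used_last = {}, set()
    for dist, tid, lid in pairs:
        if dist > cutoff:
            break  # all remaining pairs are farther apart
        if tid in matches or lid in used_last:
            continue  # each cluster is matched at most once
        matches[tid] = (lid, dist)
        used_last.add(lid)
    new_ids = sorted(t for t in this_centroids if t not in matches)
    return matches, new_ids
```

With an empty last-week mapping (the first-run situation, where all of last week was noise), every this-week cluster comes back in new_ids.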
Live at
GET /v1/anomaly/domain-drift/leaderboard?limit=20 — top-N drift domains, sorted by L2 distance with new/dropped domains at the top
GET /v1/anomaly/domain-drift/{domain} — single-domain detail (cluster this/last week, drift score, raw method-share, per-country block geometry)
GET /v1/anomaly/domain-drift/info — sidecar metadata (algorithm, paper, params, run stats)
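A small client sketch for these endpoints, using only the Python standard library. The host in BASE_URL is a placeholder, not the real API base, and the helper names are hypothetical. No response schema is assumed; fetch_json simply decodes whatever JSON the endpoint returns.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "https://api.example.org"  # hypothetical host; substitute the real API base


def leaderboard_url(limit=20, base=BASE_URL):
    """URL for the top-N drift leaderboard."""
    query = urlencode({"limit": limit})
    return f"{base}/v1/anomaly/domain-drift/leaderboard?{query}"


def domain_url(domain, base=BASE_URL):
    """URL for single-domain detail; the domain is percent-encoded as a path segment."""
    return f"{base}/v1/anomaly/domain-drift/{quote(domain, safe='')}"


def fetch_json(url, timeout=10):
    """GET a URL and decode the body as JSON."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

For example, fetch_json(domain_url("tiktok.com")) would request the single-domain detail once BASE_URL points at a live deployment.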
Refresh cadence: weekly, Sunday 04:00 UTC, after the 02:00 retrain and the ~02:30 temporal holdout. State JSON and sidecar are written by scripts/run-hdbscan-domain-drift.py, and the Flask endpoint hot-reloads on file-mtime change.
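Hot-reloading on file-mtime change can be done with a small cache that re-reads the state file only when its modification time differs from the last one seen. The sketch below shows the general pattern with the standard library; the class name MtimeCache is hypothetical and this is not the actual endpoint code.

```python
import json
import os


class MtimeCache:
    """Re-read a JSON file only when its modification time changes."""

    def __init__(self, path):
        self.path = path
        self._mtime = None
        self._data = None

    def get(self):
        mtime = os.stat(self.path).st_mtime_ns
        if mtime != self._mtime:
            # File is new or was rewritten since the last read: reload it.
            with open(self.path, encoding="utf-8") as fh:
                self._data = json.load(fh)
            self._mtime = mtime
        return self._data
```

A request handler would call get() on each request and pay only for a stat call when nothing changed. This pattern works best when the writer replaces the file atomically (write to a temporary file, then os.replace), so a reader never sees a half-written JSON document.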