Per-domain HDBSCAN drift surface: novel-blocking detection orthogonal to per-country DBSCAN

Voidly already runs CenDTect-style DBSCAN per (country, day) and surfaces it at /v1/anomaly/dbscan/{cc} — AUC 0.65 against labeled incidents, promoted as a second-opinion signal. That axis asks: which days look weird per country?

This finding ships the other axis: which DOMAINS are changing how they're blocked across the world? The implementation uses HDBSCAN (McInnes and Healy 2017, arXiv:1705.07321), which clusters at variable density. That suits a small-to-medium per-domain matrix, where measurement counts range from 10 (niche domains) to 1,800 (claude.ai), better than DBSCAN's single-eps assumption does.

The build

Feature matrix per domain: 18 numeric columns over the last 28 days — log measurement count, mean and std of per-country block rate, distinct countries blocking, ASN diversity, source diversity, share of each blocking method (DNS poison, TCP reset, blockpage, TLS reset, ISP outage), and one-hot domain-category shares (NEWS, ANON, GRP, PORN, MSG, SRCH, OTHER).
Filter: domains with >= 10 measurements in the window (signal threshold).
Clustering: HDBSCAN with min_cluster_size=5, metric=euclidean, applied after StandardScaler. Re-fit independently for this-week and last-week matrices.
Drift surface: per-domain L2 distance between the this-week and last-week vectors, both expressed in this week's standardized space. At cluster level, centroids are matched week to week (greedy nearest-neighbor with a 3-sigma cutoff) and each matched pair yields a centroid-drift distance. This-week clusters left unmatched are flagged as new clusters, i.e. candidate novel blocking patterns.

First-run results (week ending 2026-05-20)

27 domains enter clustering this week (the same set as last week). HDBSCAN forms 2 clusters this week and 0 last week, when every domain was labeled noise. Both this-week clusters (label IDs 0 and 1) are therefore flagged as new: groupings the domain space did not support a week ago.
Top-10 drift domains by L2 distance:
1. tiktok.com: drift 0.343, moved noise → cluster 1, blocked in 25 countries (avg block_rate 1.0; 77% DNS poison, 23% TCP reset)
2. openai.com: drift 0.336, blocked in 26 countries
3. gemini.google.com: drift 0.331, blocked in 27 countries
4. google.com: drift 0.235, avg block_rate 1.0 across 21 countries
5. twitter.com: drift 0.204, moved noise → cluster 0, blocked in 27 countries
6. youtube.com: drift 0.203, moved noise → cluster 0, blocked in 24 countries
7. whatsapp.com: drift 0.179, moved noise → cluster 0, blocked in 24 countries
8. facebook.com: drift 0.176, moved noise → cluster 0, blocked in 25 countries
9. huggingface.co: drift 0.157, blocked in 26 countries
10. theguardian.com: drift 0.156, moved noise → cluster 1, blocked in 23 countries

Verification

We re-queried the evidence table for each top-10 drift domain over the last 14 days and asked: are these actually being blocked, or is the drift noise? Result: 10 / 10 top-drift domains have critical or warning evidence in 18-27 countries over the last 14 days. 4 / 10 also appear in our curated incidents table by name (tiktok, openai, twitter, facebook). On this first run, the unsupervised drift signal is tracking real blocking activity rather than measurement noise.

Honest caveats

Small corpus. Only 27 domains pass the >= 10-measurement filter in a 28-day window. HDBSCAN works with 27 points but the cluster geometry is fragile — week-over-week stability needs months of data to establish.
No labeled holdout. Unlike the per-country DBSCAN (AUC 0.65 against labels), this surface has no ground truth — drift is a measurement, not a prediction. We verify by cross-referencing the evidence table, not by AUC.
Cluster IDs are not stable across runs. HDBSCAN assigns IDs by cluster-discovery order. The cron run uses greedy centroid-distance matching to detect when a cluster is "the same shape as last week" — but the centroid-drift cutoff (3 sigma in standardized space) is a heuristic, not tuned.
This is a SECOND-OPINION signal. The supervised classifier still wins at AUC ~0.99 on labeled incidents. Use this for novel-pattern detection (cluster changes), not for first-line incident triage.
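The greedy centroid matching mentioned in the cluster-ID caveat could look something like the sketch below. It is an illustration under stated assumptions, not the cron job's code: the helper names cluster_centroids and match_clusters are hypothetical, and the "3 sigma" cutoff is interpreted here as a Euclidean distance of 3.0 in standardized units.

```python
import numpy as np


def cluster_centroids(Z, labels):
    """Mean vector per cluster label, skipping noise (-1)."""
    return {
        int(k): Z[labels == k].mean(axis=0)
        for k in np.unique(labels)
        if k != -1
    }


def match_clusters(cent_this, cent_last, cutoff=3.0):
    """Greedily pair this-week and last-week centroids by ascending distance.

    Returns (matches, new_ids). matches maps a this-week cluster ID to
    (last-week cluster ID, centroid-drift distance). new_ids lists the
    this-week clusters with no partner within `cutoff`.
    """
    pairs = sorted(
        (float(np.linalg.norm(c_this - c_last)), i, j)
        for i, c_this in cent_this.items()
        for j, c_last in cent_last.items()
    )
    matches, used_last = {}, set()
    for dist, i, j in pairs:
        if dist > cutoff:
            break  # pairs are sorted, so every later pair is farther away
        if i in matches or j in used_last:
            continue  # each cluster is matched at most once
        matches[i] = (j, dist)
        used_last.add(j)
    new_ids = sorted(set(cent_this) - set(matches))
    return matches, new_ids


# Toy example: one this-week cluster sits near a last-week cluster, one does not.
cent_this = {0: np.array([0.0, 0.0]), 1: np.array([10.0, 10.0])}
cent_last = {5: np.array([0.5, 0.0])}
matches, new_ids = match_clusters(cent_this, cent_last)

# First-run situation: last week had no clusters, so everything is new.
_, first_run_new = match_clusters(cent_this, {})
```

Greedy matching is simple and order-dependent; with only a handful of clusters that is usually acceptable, but an optimal assignment (for example the Hungarian method) would be a natural alternative if cluster counts grow.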

Live at

GET /v1/anomaly/domain-drift/leaderboard?limit=20 — top-N drift domains, sorted by L2 distance with new/dropped domains at the top
GET /v1/anomaly/domain-drift/{domain} — single-domain detail (cluster this/last week, drift score, raw method-share, per-country block geometry)
GET /v1/anomaly/domain-drift/info — sidecar metadata (algorithm, paper, params, run stats)
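As a rough usage sketch, the three endpoints can be called with the Python standard library alone. The host below is a placeholder, not the real API origin, and the helper names are hypothetical. The response schema is not specified here, so the sketch assumes only that the body is JSON and returns it unparsed into any particular shape.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "https://api.example.invalid"  # placeholder host (assumption)
PREFIX = "/v1/anomaly/domain-drift"


def leaderboard_url(limit=20, base=BASE_URL):
    """URL for the top-N drift leaderboard."""
    return f"{base}{PREFIX}/leaderboard?{urlencode({'limit': limit})}"


def domain_url(domain, base=BASE_URL):
    """URL for single-domain detail; the domain is percent-encoded."""
    return f"{base}{PREFIX}/{quote(domain, safe='')}"


def info_url(base=BASE_URL):
    """URL for the sidecar metadata."""
    return f"{base}{PREFIX}/info"


def fetch_json(url, timeout=10):
    """GET a URL and decode the body as JSON (assumes a JSON response)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Example (needs a real host in BASE_URL):
# board = fetch_json(leaderboard_url(limit=10))
# detail = fetch_json(domain_url("tiktok.com"))
```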

Refresh cadence: weekly, Sundays at 04:00 UTC, after the 02:00 retrain and the ~02:30 temporal holdout. The state JSON and sidecar are written by scripts/run-hdbscan-domain-drift.py, and the Flask endpoint hot-reloads when the file mtime changes.
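One common way to get mtime-based hot reload is a small cache that re-reads the file only when its modification time changes. The sketch below shows that pattern with the standard library; the class name MtimeCache and the file name in the usage comment are hypothetical, and the actual endpoint may be wired differently.

```python
import json
import os


class MtimeCache:
    """Serve parsed JSON from a file, reloading only when its mtime changes."""

    def __init__(self, path):
        self.path = path
        self._mtime_ns = None
        self._data = None

    def get(self):
        # One stat call per request is cheap compared with re-parsing the file.
        mtime_ns = os.stat(self.path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            with open(self.path, encoding="utf-8") as fh:
                self._data = json.load(fh)
            self._mtime_ns = mtime_ns
        return self._data


# Usage inside a request handler (file name is illustrative):
# state = MtimeCache("domain-drift-state.json")
# payload = state.get()
```

If the writer replaces the file in place, a reader can catch it half-written; writing to a temporary file and renaming it over the target avoids that.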
