voidly

Per-domain HDBSCAN drift surface: novel-blocking detection orthogonal to per-country DBSCAN

Shipped a second unsupervised anomaly axis: per-DOMAIN HDBSCAN drift over the last-28-day feature vector for every domain with >= 10 measurements. Weekly cron compares this week vs last week — new clusters = novel blocking patterns, centroid drift = existing patterns intensifying, per-domain L2 distance = how much a domain's blocking profile changed. Orthogonal to /v1/anomaly/dbscan/{cc} (per-country DBSCAN). First run: 27 domains, 2 new clusters, all top-10 drift domains corroborated by critical/warning evidence in the last 14 days. Live at /v1/anomaly/domain-drift/leaderboard and /v1/anomaly/domain-drift/{domain}.

#methodology #ml #anomaly #unsupervised #hdbscan #drift #per-domain #novel-blocking #second-opinion #promoted

Voidly already runs CenDTect-style DBSCAN per (country, day) and surfaces it at /v1/anomaly/dbscan/{cc} — AUC 0.65 against labeled incidents, promoted as a second-opinion signal. That axis asks: which days look weird per country?

This finding ships the other axis: which DOMAINS are changing how they're blocked across the world? The implementation uses HDBSCAN (McInnes et al. 2017, arXiv:1705.07321), which finds clusters of varying density instead of relying on a single global eps as DBSCAN does. That fits a small-to-medium per-domain matrix where measurement counts range from 10 (niche domains) to 1,800 (claude.ai).

The build

  • Feature matrix per domain: 18 numeric columns over the last 28 days — log measurement count, mean and std of per-country block rate, distinct countries blocking, ASN diversity, source diversity, share of each blocking method (DNS poison, TCP reset, blockpage, TLS reset, ISP outage), and one-hot domain-category shares (NEWS, ANON, GRP, PORN, MSG, SRCH, OTHER).
  • Filter: domains with >= 10 measurements in the window (signal threshold).
  • Clustering: HDBSCAN with min_cluster_size=5, metric=euclidean, applied after StandardScaler. Re-fit independently for this-week and last-week matrices.
  • Drift surface: per-domain L2 distance between the this-week and last-week vectors, both expressed in this week's standardized space. At the cluster level, centroids are matched across weeks (greedy nearest-neighbor with a 3-sigma cutoff) and each matched pair yields a centroid-drift distance; this-week clusters left unmatched are flagged as new clusters, i.e. candidate novel blocking patterns.
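
The drift-surface step above can be sketched with the Python standard library alone. This is an illustrative sketch, not the production script: the function names are hypothetical, cluster labels are assumed to come from an earlier HDBSCAN fit (noise = -1), and the 3-sigma cutoff is interpreted here as a Euclidean distance of 3.0 in standardized units, which is an assumption.

```python
import math
from statistics import mean, pstdev

def fit_scaler(rows):
    """Column-wise mean and population std, the same statistics StandardScaler uses."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    # Constant columns get scale 1.0 so they map to 0 instead of dividing by zero.
    sds = [pstdev(c) or 1.0 for c in cols]
    return mus, sds

def transform(row, mus, sds):
    return [(x - m) / s for x, m, s in zip(row, mus, sds)]

def domain_drift(this_week, last_week):
    """Inputs: dict domain -> raw feature list. Output: dict domain -> L2 drift.

    Both weeks are projected into THIS week's standardized space.
    """
    mus, sds = fit_scaler(list(this_week.values()))
    return {
        d: math.dist(transform(this_week[d], mus, sds),
                     transform(last_week[d], mus, sds))
        for d in this_week
        if d in last_week
    }

def centroids(vectors, labels):
    """Mean vector per cluster label, skipping noise points (label -1)."""
    groups = {}
    for vec, lab in zip(vectors, labels):
        if lab != -1:
            groups.setdefault(lab, []).append(vec)
    return {lab: [mean(c) for c in zip(*vecs)] for lab, vecs in groups.items()}

def match_clusters(cent_this, cent_last, cutoff=3.0):
    """Greedy nearest-neighbor centroid matching.

    Returns (matches, new_ids): matches maps this-week id -> (last-week id,
    centroid drift); new_ids lists this-week clusters with no match in range.
    """
    pairs = sorted(
        (math.dist(ct, cl), t, l)
        for t, ct in cent_this.items()
        for l, cl in cent_last.items()
    )
    matches, used_this, used_last = {}, set(), set()
    for dist, t, l in pairs:
        if dist > cutoff:  # assumed reading of the 3-sigma cutoff
            break
        if t in used_this or l in used_last:
            continue
        matches[t] = (l, dist)
        used_this.add(t)
        used_last.add(l)
    return matches, sorted(set(cent_this) - used_this)
```

When last week has no clusters at all, `match_clusters` receives an empty `cent_last` and returns every this-week cluster ID as new.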

First-run results (week ending 2026-05-20)

  • 27 domains clustered this week (same set as last week). HDBSCAN forms 2 clusters this week and 0 last week, when every domain was labeled noise. Both this-week clusters (label IDs 0 and 1) are therefore flagged as new: groupings the domain space didn't support a week ago.
  • Top-10 drift domains by L2 distance:
    1. tiktok.com — drift 0.343, moved noise → cluster 1, blocked in 25 countries (avg block_rate 1.0, 77% DNS poison + 23% TCP reset)
    2. openai.com — drift 0.336, blocked in 26 countries
    3. gemini.google.com — drift 0.331, blocked in 27 countries
    4. google.com — drift 0.235, avg block_rate 1.0 across 21 countries
    5. twitter.com — drift 0.204, moved noise → cluster 0, blocked in 27 countries
    6. youtube.com — drift 0.203, moved noise → cluster 0, blocked in 24 countries
    7. whatsapp.com — drift 0.179, moved noise → cluster 0, blocked in 24 countries
    8. facebook.com — drift 0.176, moved noise → cluster 0, blocked in 25 countries
    9. huggingface.co — drift 0.157, 26 countries
    10. theguardian.com — drift 0.156, moved noise → cluster 1, blocked in 23 countries

Verification

We re-queried the evidence table for each top-10 drift domain over the last 14 days and asked: are these domains actually being blocked, or is the drift just noise? Result: 10 / 10 top-drift domains have critical or warning evidence in 18-27 countries over the last 14 days. 4 / 10 also appear by name in our curated incidents table (tiktok, openai, twitter, facebook). On this first run, the unsupervised drift signal lines up with independently recorded blocking activity.
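
The cross-check above can be sketched as a small function. This is a sketch under stated assumptions: the row fields (`domain`, `country`, `severity`, `observed_on`) are a hypothetical shape for the evidence table, not its actual schema.

```python
from datetime import date, timedelta

CORROBORATING = {"critical", "warning"}

def corroborate(top_domains, evidence_rows, as_of, window_days=14):
    """Count distinct countries with critical/warning evidence per domain.

    evidence_rows: iterable of dicts with hypothetical keys
    'domain', 'country', 'severity', 'observed_on' (a datetime.date).
    """
    since = as_of - timedelta(days=window_days)
    countries = {d: set() for d in top_domains}
    for row in evidence_rows:
        if (row["domain"] in countries
                and row["severity"] in CORROBORATING
                and since <= row["observed_on"] <= as_of):
            countries[row["domain"]].add(row["country"])
    return {d: len(cs) for d, cs in countries.items()}
```

A domain that comes back with a count of 0 is the case to worry about: drift with no corroborating evidence in the window.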

Honest caveats

  • Small corpus. Only 27 domains pass the >= 10-measurement filter in a 28-day window. HDBSCAN works with 27 points but the cluster geometry is fragile — week-over-week stability needs months of data to establish.
  • No labeled holdout. Unlike the per-country DBSCAN (AUC 0.65 against labels), this surface has no ground truth — drift is a measurement, not a prediction. We verify by cross-referencing the evidence table, not by AUC.
  • Cluster IDs are not stable across runs. HDBSCAN assigns IDs by cluster-discovery order. The cron run uses greedy centroid-distance matching to detect when a cluster is "the same shape as last week" — but the centroid-drift cutoff (3 sigma in standardized space) is a heuristic, not tuned.
  • This is a SECOND-OPINION signal. The supervised classifier still wins at AUC ~0.99 on labeled incidents. Use this for novel-pattern detection (cluster changes), not for first-line incident triage.

Live at

GET /v1/anomaly/domain-drift/leaderboard?limit=20 — top-N drift domains, sorted by L2 distance with new/dropped domains at the top
GET /v1/anomaly/domain-drift/{domain} — single-domain detail (cluster this/last week, drift score, raw method-share, per-country block geometry)
GET /v1/anomaly/domain-drift/info — sidecar metadata (algorithm, paper, params, run stats)
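
A minimal client sketch for the endpoints above. The base URL is a placeholder, not the real host, and the assumption that responses are JSON is ours; no response fields are assumed.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "https://api.example.invalid"  # placeholder; substitute the real API host

def leaderboard_url(base, limit=20):
    return f"{base}/v1/anomaly/domain-drift/leaderboard?{urlencode({'limit': limit})}"

def domain_url(base, domain):
    # Percent-encode the domain so it is safe as a single path segment.
    return f"{base}/v1/anomaly/domain-drift/{quote(domain, safe='')}"

def fetch_json(url, timeout=10):
    """GET a URL and decode the body as JSON (assumed response format)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

For example, `fetch_json(leaderboard_url(BASE_URL, limit=20))` would request the top 20 drift domains once `BASE_URL` points at a real deployment.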

Refresh cadence: weekly Sunday 04:00 UTC, after the 02:00 retrain and the ~02:30 temporal holdout. State JSON + sidecar are written by scripts/run-hdbscan-domain-drift.py and the Flask endpoint hot-reloads on file-mtime change.
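
Hot-reloading on file-mtime change can be sketched as a small cache that re-reads the state JSON only when the file's modification time differs from the last one seen. This is a framework-independent illustration; the class name is hypothetical and it is not the actual endpoint code.

```python
import json
import os

class MtimeCache:
    """Serve parsed JSON from disk, reloading only when the file's mtime changes."""

    def __init__(self, path):
        self.path = path
        self._mtime_ns = None
        self._data = None

    def get(self):
        mtime_ns = os.stat(self.path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            # File was (re)written since the last read: parse it again.
            with open(self.path, encoding="utf-8") as fh:
                self._data = json.load(fh)
            self._mtime_ns = mtime_ns
        return self._data
```

A request handler would call `cache.get()` on each request, so a weekly rewrite of the state file is picked up without restarting the server. Writing the file atomically (write to a temp file, then rename) avoids serving a half-written state; whether the production script does this is not stated.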
