CN RU IR ID IN AE VN BD KZ PK TR SA TH EG MY AZ MA UZ SG KH BY VE IQ
2026-05-22

The AS-topology GNN does beat chance — once you give it a real label and a real test

Voidly Atlas runs a 2-layer GraphSAGE GNN over the CAIDA AS-AS peering graph (7,060 nodes, 841K edges) to score per-ASN shutdown risk. It shipped on 2026-05-21 with passed_promote_floor=false and an honest caveat: leave-one-out CV ran across only 6 ASNs, so although it reported AUC 0.80, a permutation test gave p=0.32. With n=6, the null cannot be rejected. This finding is the better-powered re-evaluation, and it surfaced a second flaw that the n=6 caveat had hidden.

The old label, had_shutdown_next_7d, was defined as "did any next-7-day measurement hit block_rate >= 0.5". But every ASN-tagged evidence row in the database is a CensoredPlanet block, so block_rate is always 1.0 and the label collapsed to "does this ASN have >= 5 measurements on a post-cutoff day": a measurement-density flag, not a censorship signal. The evidence: the 40 old positives had a median of 24 post-cutoff rows, while the 18 old negatives had a median of 2. The GNN node features also included n_evidence_30d/180d, n_unique_dates and has_evidence, so the model could read its own label off its inputs. The AUC of 0.80 was partly density predicting density.

The fix builds a genuine censorship label. CensoredPlanet rows carry signal_value, a continuous block-intensity score (low = the ASN let the probe through, high = it blocked it). Every ASN with >= 20 measurements in the trailing 180 days is relabeled by the fraction of its measurements showing strong blocking (signal_value >= 0.5): >= 60% blocked is positive (censors, 62 ASNs), <= 25% is negative (clean, 35 ASNs), and the 25-60% middle band is dropped. That matches the directive definition exactly (positive = ASN with confirmed censorship evidence, negative = ASN with clean evidence), and it grows the labeled set from 58 to 97 ASNs across 30 countries.
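The relabeling rule above can be sketched in a few lines of Python. This is a minimal illustration, not the Atlas pipeline: the function name `label_asns`, the constant names, and the `(asn, signal_value)` row shape are assumptions made for the example, and the rows are assumed to be pre-filtered to the trailing 180 days. Only the thresholds (20 measurements, 0.5, 60%, 25%) come from the text.

```python
from collections import defaultdict

# Thresholds taken from the text; the constant names are hypothetical.
MIN_MEASUREMENTS = 20      # minimum rows per ASN in the trailing 180 days
STRONG_BLOCK = 0.5         # signal_value at or above this counts as strong blocking
POSITIVE_FRACTION = 0.60   # >= 60% strongly blocked -> positive (censors)
NEGATIVE_FRACTION = 0.25   # <= 25% strongly blocked -> negative (clean)


def label_asns(rows):
    """Label ASNs from (asn, signal_value) pairs.

    Assumes rows are already restricted to the trailing 180 days.
    Returns {asn: 1 or 0}. ASNs with too few measurements, or whose
    blocked fraction falls in the middle band, are left out entirely.
    """
    values = defaultdict(list)
    for asn, signal_value in rows:
        values[asn].append(signal_value)

    labels = {}
    for asn, vals in values.items():
        if len(vals) < MIN_MEASUREMENTS:
            continue  # not enough evidence to label this ASN
        blocked = sum(v >= STRONG_BLOCK for v in vals) / len(vals)
        if blocked >= POSITIVE_FRACTION:
            labels[asn] = 1
        elif blocked <= NEGATIVE_FRACTION:
            labels[asn] = 0
        # otherwise: middle band, dropped
    return labels
```

Dropping the middle band trades labeled-set size for label purity: ambiguous ASNs never enter training or evaluation, so neither class is diluted by borderline cases.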
Because signal_value and signal_level are tightly coupled, all seven signal-derived features (block_rate_30d/180d and five pct_* buckets) are dropped to kill leakage. The GNN is retrained on density and topology features only (evidence counts, distinct days, domains, CAIDA degree, country risk tier).

Evaluation is leave-AS-out only, never shuffled, under two protocols: leave-one-AS-out across all 97 ASNs, and leave-one-COUNTRY-out (30 country folds, with every ASN of a held-out country removed together). The country protocol is the leakage-safe gate, since same-country ASNs share country features and are topological neighbors. Skill is measured as pooled out-of-fold AUC plus a 5,000-permutation test that shuffles only the label vector.

The honest verdict is SIGNIFICANT. Leave-one-COUNTRY-out gives AUC 0.7751 with permutation p=0.0002; leave-one-AS-out gives AUC 0.7645 with p=0.0002. The two protocols agree, so the result is not same-country leakage. The permutation null averaged AUC 0.4998 (chance level, with a 95th percentile of 0.60); the observed 0.775 is far in the tail. Confusion at threshold 0.5 under country-out: precision 0.86, recall 0.68 (42 TP, 20 FN, 28 TN, 7 FP), with a mean score of 0.68 for censoring ASNs vs 0.39 for clean ones. Both the 0.65 AUC floor and the p<0.05 bar are cleared by a wide margin under the leakage-safe protocol, so passed_promote_floor is flipped to true.

The measured framing: AUC 0.775 is clearly better than chance, not operationally decisive. Recall of 0.68 still misses a third of censoring ASNs, the 97-ASN label set is small, and the genuine label says whether an ASN censors, not when. But the GraphSAGE-over-AS-topology approach is validated: AS topology plus measurement density carry real, statistically significant signal for whether an ASN censors. The old AUC 0.80 / n=6 / p=0.32 headline is superseded; the endpoint now ships a defensible significance claim instead of a thin one. Both outcomes were acceptable under the directive; this one happens to be a real positive.
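For readers who want to reproduce the shape of this evaluation, here is a minimal, standard-library-only sketch of the two pieces described above: grouped hold-out folds (one country per fold) and a label-shuffling permutation test on pooled out-of-fold scores. The function names, the rank-based AUC helper, and the add-one p-value convention are all assumptions made for illustration; the post does not say how Atlas implements any of them.

```python
import random
from collections import defaultdict


def leave_one_group_out(groups):
    """Yield (group, train_idx, test_idx), holding out one whole group per fold.

    With groups = country codes this mirrors leave-one-COUNTRY-out;
    with groups = ASN ids it reduces to leave-one-AS-out.
    """
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    for g, test_idx in by_group.items():
        held = set(test_idx)
        train_idx = [i for i in range(len(groups)) if i not in held]
        yield g, train_idx, test_idx


def auc(labels, scores):
    """Rank-based AUC: chance that a random positive outscores a random negative.

    Ties count as half a win. Requires at least one positive and one negative.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_p_value(labels, scores, n_permutations=5000, seed=0):
    """Permutation test that shuffles only the label vector; scores stay fixed.

    Returns (observed_auc, p_value). The add-one correction is an assumed
    convention: it keeps p strictly above zero.
    """
    rng = random.Random(seed)
    observed = auc(labels, scores)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if auc(shuffled, scores) >= observed:
            at_least_as_good += 1
    p_value = (at_least_as_good + 1) / (n_permutations + 1)
    return observed, p_value
```

In use, a model would be fit on `train_idx` for each fold and its predictions written into a single out-of-fold score vector at `test_idx`; `permutation_p_value(labels, oof_scores)` is then run once on the pooled vector. Under the add-one convention assumed here, zero exceedances in 5,000 permutations give 1/5001, about 0.0002, which is consistent with the reported p-value, though the source does not state which convention it used.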

#methodology #ml #gnn #graphsage #as-topology #honest-positive #data-leakage #permutation-test #leave-one-out-cv #accountability #atlas #api
