IR CN RU AE KZ 2026-05-21

Country censorship-behavior similarity graph: which countries block like Iran?

Censorship analysts, journalists and the forecast model all keep asking the same shape of question: "which countries behave like Iran?" The answer feeds transfer learning (borrowing a label-rich neighbor's signal for a label-poor country), journalist comparisons, and zero-shot priors. Until now Voidly Atlas had no single answer: the DTW shutdown cohorts grouped countries by the SHAPE of their daily signal series, but shape alone misses HOW a country blocks. This finding ships a country censorship-behavior similarity graph: every country is embedded into a 32-dimension vector built from its measured blocking behavior, and for each country the top-10 most behaviorally similar countries are surfaced by cosine similarity.

The feature vector, assembled over a trailing 365-day window, has seven families: (1) blocking-method distribution (tcp-reset / dns-poisoned / ip-blocked / http-redirect fractions over method-tagged rows); (2) signal-family distribution (the raw signal_type vocabulary collapsed into 8 stable families: dns-blocking, tcp-reset-http, interference, generic-blocking, tor-blocking, middlebox, header-manipulation, outage); (3) category distribution (the Citizen Lab NEWS/GRP/COMT/MMED/PORN/SRCH share of block-like category-tagged evidence, i.e. what kinds of sites a country blocks); (4) temporal pattern (monthly block-rate seasonality as std across 12 trailing months, burst frequency as the fraction of days exceeding the country's own mean+2sigma, active-day fraction); (5) severity (critical-signal fraction, mean signal_value); (6) DTW cohort membership (one-hot over shutdown_cohorts_v1 clusters, so two countries sharing a daily-signal shape get pulled together); and (7) regime indicators (normalized risk tier, continent one-hot).

All features are standardized to zero mean and unit variance so that a high-variance family like outage fraction does not drown out a low-variance one. Pairwise cosine similarity ranks neighbors, and a UMAP 2D projection (PCA fallback) gives coordinates for a behavioral map.
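
The standardize-then-rank step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code in scripts/build-country-similarity-graph.py: the function names, the placeholder country codes and the toy 4x3 matrix are all hypothetical, and the real vectors have 32 dimensions.

```python
import numpy as np

def standardize(X):
    """Z-score each feature column; constant columns become all zeros."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return (X - mean) / std

def top_k_neighbors(X, codes, target, k=10):
    """Rank the other countries by cosine similarity to `target`."""
    Z = standardize(X)
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows keep similarity 0 to everything
    U = Z / norms            # unit-length rows, so a dot product is the cosine
    i = codes.index(target)
    sims = U @ U[i]
    order = [j for j in np.argsort(-sims) if j != i][:k]
    return [(codes[j], float(sims[j])) for j in order]

# Toy data: hypothetical countries and features, NOT real Atlas values.
# The third column has a much larger raw scale than the first two.
codes = ["AA", "BB", "CC", "DD"]
X = [
    [0.9, 0.1, 10.0],
    [0.8, 0.2, 12.0],
    [0.1, 0.9, 300.0],
    [0.2, 0.8, 280.0],
]
neighbors = top_k_neighbors(X, codes, "AA", k=2)
```

Because every column is z-scored before the rows are normalized, the large-scale third column gets no more weight than the two fractions, which is the point of the standardization step.
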
The first run on 21 May 2026 embedded 135 countries (every country with at least 20 evidence rows in the trailing window) across 32 features. The neighbor lists are intuitive where they should be: Iran's nearest five (UAE, Thailand, Turkey, Saudi Arabia, Kazakhstan) are all interference-heavy mid-tier states with broad persistent blocking; China's nearest five (Iran, Russia, Vietnam, Pakistan, UAE) recover the expected heavy-censorship cluster without political input beyond a single risk-tier scalar; Russia sits closest to China, then Kazakhstan and Belarus. The graph also surfaces pairs worth a second look, such as Czechia and Albania at cosine 0.85 despite a two-tier risk gap: a reminder that it measures the FOOTPRINT, not the politics.

Honest caveats baked into every response: (1) Similarity is BEHAVIORAL, not political. Two countries are near each other because they BLOCK in similar ways, which does NOT imply comparable governments, laws or intent; a stable democracy with seasonal sports-piracy blocking can sit near an authoritarian state if their measured footprints rhyme. (2) Feature vectors are SPARSE for low-coverage countries. A country with few OONI probes inside it populates only the regime/continent/cohort one-hot dimensions, is flagged sparse=true (fewer than 3 of the 5 behavioral feature families populated), and its neighbors should be read with low confidence. (3) The 2D graph projection compresses 32 dimensions to 2 and loses information, so use the cosine neighbor list, not projected pixel distance, for quantitative claims.

Live at GET /v1/atlas/country-similarity/{cc} (top-10 nearest), /v1/atlas/country-similarity/graph (2D projection of all 135 countries) and /v1/atlas/country-similarity/info. Rebuilt weekly on Mondays at 05:10 UTC. Implementation: scripts/build-country-similarity-graph.py + patch-atlas-country-similarity-endpoint.py.
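
The sparse flag from caveat (2) can be sketched as a simple count over feature families. This is a hedged illustration only. The family names, the dict layout and the rule that a family is "populated" when any of its values is nonzero are assumptions made for the example. The choice of the five behavioral families is inferred from the text, which treats the cohort and regime/continent dimensions as the non-behavioral ones.

```python
# Hypothetical family keys; the five families the text treats as behavioral.
BEHAVIORAL_FAMILIES = ("method", "signal_family", "category", "temporal", "severity")

def is_sparse(features, min_populated=3):
    """Return True when fewer than `min_populated` behavioral families have data.

    `features` maps a family name to its list of raw values. A family counts
    as populated if any value is nonzero (an assumption for this sketch).
    """
    populated = sum(
        1
        for family in BEHAVIORAL_FAMILIES
        if any(value != 0 for value in features.get(family, []))
    )
    return populated < min_populated

# Invented example vectors, not real Atlas data.
well_covered = {
    "method": [0.5, 0.3, 0.1, 0.1],
    "signal_family": [0.4, 0.2, 0.2, 0.1, 0.0, 0.1, 0.0, 0.0],
    "category": [0.6, 0.1, 0.1, 0.1, 0.1, 0.0],
    "temporal": [0.05, 0.02, 0.8],
    "severity": [0.1, 0.4],
    "cohort": [0, 1, 0],
}
low_coverage = {
    "method": [0.0, 0.0, 0.0, 0.0],
    "severity": [0.0, 0.2],
    "cohort": [1, 0, 0],  # non-behavioral, so it does not count
}

flags = (is_sparse(well_covered), is_sparse(low_coverage))
```

A consumer of the neighbor list could use such a flag to down-weight or label results for low-coverage countries rather than dropping them.
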

#atlas #similarity #embedding #cosine #umap #transfer-learning #censorship #clustering #zero-shot #ml-honesty #transparency #api
