Voidly Research · Shipping log · 2026-05-21

Atlas Shipyard

16 ML models, 14 pages, 3 negative results, 24 hours.

On 2026-05-21, Voidly Research promoted 13 machine-learning models into production, retired 3 that failed our promote gates, and shipped 14 new Atlas surfaces backed by roughly 30 new public transparency endpoints. This page is the single citable artifact summarizing what happened and where to verify each claim.

By Voidly Research · Published · License CC BY 4.0

01 · Overview

The 24-hour push

13 models promoted
3 models retired (honest negatives)
14 Atlas pages shipped
~30 new public endpoints

Voidly Atlas is an open intelligence layer for global internet censorship — 84,464 verified evidence records spanning 214 countries, drawn from OONI, IODA, CensoredPlanet, and a network of 40 probe nodes. On top of that corpus we run a tower of machine-learning models: a country-day censorship classifier, a shutdown forecaster, an anomaly detector, a duration model, a causal-attribution engine, and several smaller experiments.

The 2026-05-21 push was a coordinated promote cycle. Every model in the cycle either improved a published headline metric, added a previously missing capability (uncertainty, attribution, survival), or failed its promote gate cleanly and was retired with a permalinked write-up. The three negatives are genuine: they did not promote, and we publish the losses with the same emphasis as the wins.

This page is built for journalists, researchers, AI labs, and anyone evaluating Voidly Atlas as a citation source. Every number below links to either a live transparency endpoint or a stored sidecar JSON produced by the training script.

02 · Models that lifted accuracy

Thirteen wins

Each entry below names the model, the headline lift, and the transparency surface where you can audit the claim live.

  1. 1 · classifier

    Country-day classifier v3.3 — regime-similarity-weighted contagion

    v3.2 had regressed Equatorial Guinea (GQ) from F1 1.0 to 0.55 by computing neighbor contagion uniformly across all countries. v3.3 reweights the contagion signal by regime similarity (Polity score, structural blocking rate), so the classifier can discount noise from dissimilar neighbors. Stratified F1 climbed from 0.674 to 0.729 (+5.5pp), with a 4pp median per-country lift and the largest gains in the tail (CG +35pp, OM +29pp, ZW +24pp).

    LOCO median F1 0.870 and LOCO mean F1 0.711 on 4,237 samples across 131 countries, using 16 features (13 base + 3 contagion). Honest caveat: 16 MENA and former-Soviet countries (including OM, UZ, TN, LY, YE, JO, MA) regress by 5-29pp because their neighbor-pair overlap is sparse. Live at /v1/classifier/info and /feature-importance.

  2. 2 · forecast

    Multi-horizon forecast — 1-day, 7-day, 30-day

    Previously the only public horizon was 7 days. The new multi-horizon model trains three XGBoost classifiers side by side, each with its own isotonic calibrator, and ships per-horizon top-5 SHAP features, per-horizon 90% conformal intervals, and a monotonicity-consistency check (a longer horizon should never be more certain than a shorter one).

    LOCO AUC 0.91 / 0.88 / 0.84 for the 1-day / 7-day / 30-day horizons on the 20 watched spotlight countries. Live at /atlas/multi-horizon and the per-country detail pages. API: GET /v1/forecast/{cc}/multi-horizon.

  3. 3 · calibration

    Adaptive Conformal Inference — online α update

    The static conformal interval had been drifting (a Brier score of 0.59 triggered a manual recalibration on 2026-05-19). ACI follows Gibbs & Candès (NeurIPS 2021): after each observation, the script updates the working level with α_{t+1} = α_t + γ · (α − err_t), where err_t = 1 if the observed outcome fell outside the interval and 0 otherwise, α = 0.10 is the target miscoverage, and γ = 0.01. It is a small change, but it ends the manual recalibration treadmill.

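    Current state: α_t = 0.21 (started at 0.10). Under the update rule above, α_t rises only while outcomes land inside the interval more often than the 90% target, so the upward drift means the interval has been tightening. Empirical coverage is 91.3% over n=840 observations. Live in every /v1/forecast/{cc}/7day response as the aci_alpha and aci.* fields; the update runs as a daily cron at 03:45 UTC.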

  4. 4 · anomaly

    DBSCAN unsupervised anomaly — second-opinion signal

    Inspired by the 2022 CenDTect work. For each country, the script takes a rolling 45-day window of 12 standardized OONI features and runs DBSCAN with min_samples = 3 and ε set to the 75th percentile of the k-nearest-neighbor distances. AUC 0.6506 against labeled incidents, just above our 0.65 promote floor.

    Promoted as a second-opinion signal, not a replacement. The supervised classifier still wins (AUC 0.99); DBSCAN surfaces shape-anomalous days the labels never saw. Live at /atlas/anomaly + /v1/anomaly/dbscan/{cc}.

  5. 5 · domain drift

    HDBSCAN per-domain drift — novel-mechanism detection

    Where DBSCAN looks at country-day shapes, HDBSCAN clusters per-domain weekly fingerprints to surface mechanism shifts — for example, a domain that flips from TCP-reset to DNS-poison blocking in one country. Week-over-week clusters are compared and divergent domains are flagged.

    Live at /atlas/domain-drift. Not a headline-AUC model; this is a research surface for analysts.

  6. 6 · per-measurement

    Per-measurement classifier — Niaki et al. KDD23-style

    XGBoost row-level censorship classifier trained on the full 84K evidence corpus with a stratified 80/20 split. Inspired by Niaki et al. (KDD 2023, “ICLab and the long tail of censorship”). AUC 1.0 on holdout, which we publish with a loud caveat: the model is recovering the labeling rule from signal_value and source patterns rather than discovering novel signal. Top feature is asn_7d_rate (81% gain).

    It still ships as a per-row interface to the same evidence the country-day v3.3 model uses, exposed at POST /v1/measurement/classify. Treat the AUC as “reproduces the labeler perfectly”, not as novel ground truth.

  7. 7 · graph nn

    GraphSAGE over CAIDA AS topology — ASN-level forecast

    Two-layer GraphSAGE, hidden dim 16, dropout 0.5, 60 epochs. Trained on the 7,060-node, 841K-edge CAIDA serial-2 May 2026 AS-AS peering graph with 58 labeled ASNs (40 positive). We evaluate with leave-one-out cross-validation across the 6 tier-1 ASNs: AUC 0.80, accuracy 5/6, permutation p = 0.32.

    Honest caveat: the permutation test is underpowered at n=6, so the model ships with passed_promote_floor=false and includes an inline honest_caveat field in every response. Live at /v1/forecast/asn-gnn/{asn}, which accepts either a bare ASN (47541) or the AS-prefixed form (AS47541).

  8. 8 · fusion

    Bayesian multi-source corroboration

    Combines OONI, IODA, CensoredPlanet, and Voidly probe signals into a posterior P(censorship | sources) per country-day. Each source has its own sensitivity / specificity prior fitted from historical agreement.

    AUC 0.916, expected calibration error 0.021 on the 30-day holdout. Live at /atlas/corroboration.

  9. 9 · causal

    Synthetic Difference-in-Differences attribution

    Builds a synthetic counterfactual from stable-democracy donor countries, measures the post-period gap, computes a permutation p-value, and surfaces nearby political events from the GDELT and Wikipedia event feeds. Adapted from Arkhangelsky et al. (arXiv:1812.09970) with NetLoss-style scoping (ISOC, ACM IMC 2024).

    Live at /sentinel/attribute?country=X&date=Y. Implementation in scripts/sdid_attribution.py.

  10. 10 · survival

    Random Survival Forest — shutdown duration

    This closes a long-standing gap: until today the Atlas could tell you whether a shutdown was likely, but not how long it would last. The RSF is trained on 343 confirmed historical shutdowns (c-index 0.728) and exposes a per-country expected-duration curve.

    Honest caveat — 343 events is small for survival modeling, and right-censoring is heavy for tail-risk countries. Live at /atlas/duration.

  11. 11 · trajectory

    Seq2seq 30-day trajectory forecast

    Where the multi-horizon model ships three isolated heads, the trajectory model is a single sequence-to-sequence encoder-decoder that emits a smooth 30-day P(shutdown) curve with a 90% conformal band for each day. It is useful for journalists who want the shape of the risk, not just three horizon snapshots.

    Median LOCO AUC 0.74 across spotlight countries. Live at /atlas/forecast-trajectory/{cc}.

  12. 12 · het. treatment effects

    Causal forest heterogeneous treatment effects

    A causal forest in the style of Athey & Wager (2019) estimates the per-country effect of an election on shutdown risk. The global average treatment effect is +9.6 percentage points. Vietnam stands out at +32pp; most stable democracies sit near zero.

    Live at /atlas/hte.

  13. 13 · cohorts

    Dynamic Time Warping cohort clustering

    Dynamic Time Warping (DTW) distances between per-country daily-signal curves, clustered with Ward hierarchical linkage, group regimes whose curves share a shape even when they are offset in time, something Pearson correlation cannot do. Silhouette score 0.47 at K=3.

    Live at /atlas/cohorts.
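Entry 1 describes reweighting neighbor contagion by regime similarity. The sketch below illustrates that idea under stated assumptions: a Gaussian kernel over standardized regime features (for example Polity score and structural blocking rate) and a hypothetical helper named weighted_contagion. The kernel and bandwidth are illustrative choices, not the v3.3 implementation.

```python
import numpy as np


def weighted_contagion(neighbor_rates, own_regime, neighbor_regimes, bandwidth=1.0):
    """Regime-similarity-weighted mean of neighbors' blocking rates (hypothetical helper).

    neighbor_rates:   shape (k,), each neighbor's blocking signal on the same day
    own_regime:       shape (d,), standardized regime features of the focal country
    neighbor_regimes: shape (k, d), the same features for each neighbor
    """
    rates = np.asarray(neighbor_rates, dtype=float)
    diffs = np.asarray(neighbor_regimes, dtype=float) - np.asarray(own_regime, dtype=float)
    # Gaussian kernel (an assumed choice): similar regimes weigh ~1, dissimilar ones ~0.
    weights = np.exp(-np.sum(diffs ** 2, axis=1) / (2.0 * bandwidth ** 2))
    total = weights.sum()
    if total == 0.0:
        return 0.0  # every neighbor is too dissimilar to contribute
    return float(np.dot(weights, rates) / total)


# One similar neighbor (rate 0.1) and one very dissimilar neighbor (rate 0.9).
uniform = float(np.mean([0.1, 0.9]))  # 0.5: an unweighted, v3.2-style signal
weighted = weighted_contagion([0.1, 0.9], [0.0, 0.0], [[0.0, 0.0], [3.0, 3.0]])  # ~0.10
```

The uniform average lets the dissimilar neighbor drag the signal to 0.5; the weighted version stays close to the similar neighbor's rate.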
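Entry 2 mentions a monotonicity-consistency check across horizons. One way to express it, assuming "certainty" is measured by the width of each horizon's 90% conformal interval (narrower means more certain), is to flag adjacent horizon pairs where the longer horizon has the narrower interval. The function name is hypothetical.

```python
def monotonicity_violations(interval_widths, tol=1e-9):
    """Return (shorter, longer) horizon pairs where the longer horizon is more certain.

    interval_widths: dict mapping horizon in days -> width of its 90% conformal interval.
    Certainty is proxied by interval width (an assumption): narrower means more certain.
    """
    horizons = sorted(interval_widths)
    # Checking adjacent pairs suffices: widths that never shrink step to step never shrink overall.
    return [
        (shorter, longer)
        for shorter, longer in zip(horizons, horizons[1:])
        if interval_widths[longer] + tol < interval_widths[shorter]
    ]


ok = monotonicity_violations({1: 0.10, 7: 0.18, 30: 0.25})   # []
bad = monotonicity_violations({1: 0.10, 7: 0.18, 30: 0.15})  # [(7, 30)]
```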
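Entry 3 gives the ACI update α_{t+1} = α_t + γ · (α − err_t). A minimal sketch of that recurrence with the stated α = 0.10 and γ = 0.01 follows; the function names are hypothetical, and the step that turns α_t into an interval (a quantile of conformity scores at level 1 − α_t) is omitted.

```python
def aci_update(alpha_t, covered, target_alpha=0.10, gamma=0.01):
    """One Adaptive Conformal Inference step: alpha_{t+1} = alpha_t + gamma * (alpha - err_t).

    alpha_t: working miscoverage level used to build the interval at step t
    covered: True if the realized outcome fell inside that interval (err_t = 0)
    """
    err_t = 0.0 if covered else 1.0
    return alpha_t + gamma * (target_alpha - err_t)


def run_aci(coverage_history, alpha_0=0.10, **kwargs):
    """Replay a sequence of covered/missed outcomes and return the final working level."""
    alpha_t = alpha_0
    for covered in coverage_history:
        alpha_t = aci_update(alpha_t, covered, **kwargs)
    return alpha_t


# A hit nudges alpha_t up by gamma * 0.10; a miss pulls it down by gamma * 0.90,
# so nine hits and one miss (exactly 90% coverage) leave alpha_t where it started.
steady = run_aci([True] * 9 + [False])  # ~0.10
tightening = run_aci([True] * 100)      # ~0.20: always covered, so the interval narrows
```

The asymmetric step sizes are what make the long-run miss rate converge to the 10% target without a manual recalibration.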
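Entry 4 sets DBSCAN's ε from the 75th-percentile k-nearest-neighbor distance with min_samples = 3. The sketch below reproduces that recipe with scikit-learn on one synthetic 45-day × 12-feature window; the function name and the synthetic data are assumptions, and treating DBSCAN noise points (label −1) as the anomalous days is one plausible reading of the method.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def dbscan_anomalies(window, min_samples=3, eps_percentile=75):
    """Flag anomalous days in one country's rolling window (hypothetical helper).

    window: array of shape (n_days, n_features), e.g. 45 days x 12 OONI features.
    Returns a boolean array that is True where DBSCAN labels the day as noise.
    """
    X = StandardScaler().fit_transform(window)
    # k-distance with k = min_samples; each point is its own first neighbor, matching
    # DBSCAN's convention that min_samples counts the point itself.
    dists, _ = NearestNeighbors(n_neighbors=min_samples).fit(X).kneighbors(X)
    eps = np.percentile(dists[:, -1], eps_percentile)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return labels == -1


rng = np.random.default_rng(7)
window = rng.normal(size=(45, 12))  # 45 synthetic days x 12 features
window[-1] += 20.0                  # one day far outside the usual shape
flags = dbscan_anomalies(window)    # flags[-1] is True
```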
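Entry 6 warns that a holdout AUC of 1.0 can mean the model is reconstructing the labeling rule. The toy below shows the effect: when labels are a deterministic threshold on one input, even a single-split decision tree (used here as a stand-in for XGBoost) scores a perfect holdout AUC on a stratified 80/20 split. The data, the threshold, and the function name are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def rule_reconstruction_auc(n=2000, seed=0):
    """Holdout AUC when labels come from a deterministic rule on one input feature."""
    rng = np.random.default_rng(seed)
    signal_value = rng.integers(0, 10, size=n)  # hypothetical discrete signal score
    unrelated = rng.normal(size=(n, 5))         # features with no link to the label
    X = np.column_stack([signal_value, unrelated])
    y = (signal_value >= 5).astype(int)         # the "labeler" is a threshold rule

    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    # A single split is enough to rediscover the rule; no novel signal is learned.
    tree = DecisionTreeClassifier(max_depth=1, random_state=seed).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])
    return auc, int(tree.tree_.feature[0])  # holdout AUC and the root split's feature index


auc, root_feature = rule_reconstruction_auc()  # auc == 1.0, root_feature == 0
```

A perfect score here says nothing about real-world censorship; it only confirms that the rule is learnable from its own inputs.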
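Entry 7 uses a two-layer GraphSAGE with hidden dimension 16. The NumPy sketch below shows only the inference-time forward pass of the mean-aggregator variant from Hamilton et al. (2017): each node combines its own features with the mean of its neighbors' features. Training, dropout, the CAIDA graph, and the weights are not shown; the toy graph, random weights, and function names are assumptions.

```python
import numpy as np


def sage_mean_layer(H, neighbors, W_self, W_neigh, relu=True):
    """One GraphSAGE layer with a mean aggregator.

    H:         (n_nodes, d_in) node features, e.g. per-ASN signals
    neighbors: list where neighbors[v] holds the indices of v's peers in the AS graph
    """
    agg = np.zeros((H.shape[0], H.shape[1]))
    for v, nbrs in enumerate(neighbors):
        if nbrs:
            agg[v] = H[nbrs].mean(axis=0)
    # Equivalent to W . concat(h_v, mean of neighbor h_u) with W split into two blocks.
    out = H @ W_self + agg @ W_neigh
    return np.maximum(out, 0.0) if relu else out


def two_layer_sage(H, neighbors, params):
    """Hidden layer (ReLU), then a linear output layer and a sigmoid; inference only."""
    h1 = sage_mean_layer(H, neighbors, params["W1_self"], params["W1_neigh"], relu=True)
    logits = sage_mean_layer(h1, neighbors, params["W2_self"], params["W2_neigh"], relu=False)
    return 1.0 / (1.0 + np.exp(-logits[:, 0]))  # per-node positive-class probability


rng = np.random.default_rng(0)
n_nodes, d_in, hidden = 5, 4, 16
H = rng.normal(size=(n_nodes, d_in))
neighbors = [[1, 2], [0], [0, 3], [2, 4], [3]]  # toy undirected peering graph
params = {
    "W1_self": rng.normal(size=(d_in, hidden)) * 0.1,
    "W1_neigh": rng.normal(size=(d_in, hidden)) * 0.1,
    "W2_self": rng.normal(size=(hidden, 1)) * 0.1,
    "W2_neigh": rng.normal(size=(hidden, 1)) * 0.1,
}
probs = two_layer_sage(H, neighbors, params)  # shape (5,), values in (0, 1)
```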
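Entry 8 combines per-source sensitivity and specificity priors into P(censorship | sources). A minimal sketch, assuming the sources are conditionally independent given the true state (a naive-Bayes simplification; the production model may differ) and using illustrative rather than fitted priors:

```python
def corroboration_posterior(prior, observations, sensitivity, specificity):
    """Posterior P(censorship | source signals) for one country-day.

    Assumes sources are conditionally independent given the true state (naive Bayes).
    prior:        base rate P(censorship)
    observations: dict source -> True if the source flagged the day, False if it did not;
                  sources with no data that day are simply left out
    sensitivity:  dict source -> P(flag | censored)
    specificity:  dict source -> P(no flag | not censored)
    """
    p_censored, p_clear = prior, 1.0 - prior
    for source, flagged in observations.items():
        if flagged:
            p_censored *= sensitivity[source]
            p_clear *= 1.0 - specificity[source]
        else:
            p_censored *= 1.0 - sensitivity[source]
            p_clear *= specificity[source]
    return p_censored / (p_censored + p_clear)


sens = {"ooni": 0.80, "ioda": 0.60}  # illustrative priors, not fitted values
spec = {"ooni": 0.90, "ioda": 0.95}
both = corroboration_posterior(0.20, {"ooni": True, "ioda": True}, sens, spec)        # 0.96
ooni_only = corroboration_posterior(0.20, {"ooni": True, "ioda": False}, sens, spec)  # ~0.46
```

Agreement between two imperfect sources lifts a 20% prior to 96%, while disagreement leaves the day close to a coin flip.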
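Entry 10 reports a c-index of 0.728 under heavy right-censoring. Harrell's concordance index, sketched below from scratch, is the standard way to score a duration model on censored data: among pairs where one shutdown's end was observed before the other's duration, it counts how often the model assigned the shorter shutdown the higher risk. The toy durations and scores are illustrative.

```python
def harrell_c_index(durations, events, predicted_risk):
    """Harrell's concordance index for right-censored durations.

    durations:      observed length of each shutdown (or time until censoring)
    events:         1 if the shutdown's end was observed, 0 if it is still ongoing (censored)
    predicted_risk: higher score = the model expects the shutdown to end sooner
    """
    concordant, comparable = 0.0, 0
    n = len(durations)
    for i in range(n):
        if not events[i]:
            continue  # an ongoing shutdown cannot anchor a pair: its true end is unknown
        for j in range(n):
            if durations[j] > durations[i]:
                # j is known to have lasted longer than i, so the pair is comparable.
                comparable += 1
                if predicted_risk[i] > predicted_risk[j]:
                    concordant += 1.0
                elif predicted_risk[i] == predicted_risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


durations = [2, 4, 6, 8]
events = [1, 1, 0, 1]  # the 6-day shutdown is still ongoing (right-censored)
perfect = harrell_c_index(durations, events, [0.9, 0.7, 0.5, 0.1])   # 1.0
one_swap = harrell_c_index(durations, events, [0.9, 0.4, 0.5, 0.1])  # 4 of 5 pairs -> 0.8
```

A score of 0.5 is random ordering, so 0.728 means the model ranks roughly three in four comparable pairs correctly.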
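Entry 13 pairs DTW distances with Ward hierarchical clustering. A small sketch of that pipeline, assuming an absolute-difference step cost and SciPy's linkage on the condensed DTW matrix (function names and toy curves are hypothetical), shows why DTW treats a time-shifted spike as the same shape:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def dtw_distance(a, b):
    """Classic dynamic time warping distance with an absolute-difference step cost."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: match both, stretch a, or stretch b.
            cost[i, j] = step + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    return float(cost[n, m])


def dtw_ward_clusters(curves, k=3):
    """Pairwise DTW matrix, Ward linkage, then a cut into k flat clusters."""
    n = len(curves)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = dtw_distance(curves[i], curves[j])
    # Ward formally assumes Euclidean distances; applied to DTW it is a heuristic.
    Z = linkage(squareform(dist), method="ward")
    return fcluster(Z, t=k, criterion="maxclust")


early_spike = [0, 0, 1, 3, 1, 0, 0, 0]
late_spike = [0, 0, 0, 0, 1, 3, 1, 0]  # same shape, two days later
shift_cost = dtw_distance(early_spike, late_spike)  # 0.0
curves = [early_spike, late_spike, [0] * 8, [0] * 8, [5] * 8, [5] * 7 + [6]]
labels = dtw_ward_clusters(curves, k=3)  # spikes, flats, and plateaus each share a label
```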

Two infrastructure fixes are not in the list above but deserve a callout. First, the dual-holdout retrain gate (legacy + temporal) now blocks any promote that shows a temporal regression or a catastrophic legacy regression (an F1 drop of 0.10 or more on the legacy holdout). This unblocks the weekly retrain, which had been stalled since the May recalibration drift.

Second, the IODA disruption label fix. Raw IODA outages had been flowing into target_7day as confirmed censorship, pushing April's monthly positive rate to 79% (1,011 of 1,074 incidents were disruption noise). The fix in scripts/build-forecast-features.py (WHERE incident_type != 'disruption') brought the rate back to 21%, and a regression test now fails any retrain that lets a 12-month positive rate exceed 40%.
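The two guards described above can be sketched as plain functions. This is a minimal illustration, assuming each holdout is summarized by a single F1 score and each incident is a dict with an incident_type key; the class, function names, and data shapes are hypothetical, not the contents of the actual retrain scripts.

```python
from dataclasses import dataclass


@dataclass
class HoldoutF1:
    temporal: float  # F1 on the most recent (temporal) holdout
    legacy: float    # F1 on the fixed legacy holdout


def promote_allowed(candidate, baseline, catastrophic_drop=0.10):
    """Block any temporal regression, and any legacy F1 drop of catastrophic_drop or more."""
    temporal_regression = candidate.temporal < baseline.temporal
    legacy_drop = baseline.legacy - candidate.legacy
    return not temporal_regression and legacy_drop < catastrophic_drop


def confirmed_censorship(incidents):
    """Python analogue of WHERE incident_type != 'disruption' (raw IODA outages excluded)."""
    return [inc for inc in incidents if inc["incident_type"] != "disruption"]


def check_positive_rate(labels, ceiling=0.40):
    """Fail the retrain if the positive rate over the label window exceeds the ceiling."""
    rate = sum(labels) / len(labels)
    if rate > ceiling:
        raise AssertionError(f"positive rate {rate:.0%} exceeds {ceiling:.0%}")
    return rate


baseline = HoldoutF1(temporal=0.79, legacy=0.75)
small_dip = promote_allowed(HoldoutF1(0.80, 0.70), baseline)      # True
legacy_crash = promote_allowed(HoldoutF1(0.80, 0.60), baseline)   # False: legacy drop of 0.15
temporal_slip = promote_allowed(HoldoutF1(0.78, 0.75), baseline)  # False: temporal regression
kept = confirmed_censorship([{"incident_type": "disruption"}, {"incident_type": "blocking"}])
```

Keeping the temporal check strict while tolerating small legacy dips lets the gate favor recent performance without letting an old-benchmark collapse slip through.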

03 · Honest negatives

Three experiments that did not promote

We publish our failures with the same SHA-pinned permalinks as our wins. That is the entire point of a public model changelog.

  1. N1 · classifier v3.4

    Regime-cluster fine-tuning — NOT promoted

    Hypothesis: train per-regime sub-models (Western liberal, MENA, post-Soviet, East Asian autocracy), then blend them by soft assignment. Result: −3.6pp LOCO F1 versus the v3.3 baseline. The blend introduced more variance than the per-cluster fits saved.

    Sidecar JSON archived at /opt/voidly-ai/ml-deploy/classifier_v3.4_REJECTED.json for reproducibility. Indexed in the model changelog with status “rejected”.

  2. N2 · classifier v3.5

    TabPFN-v2 prior-data fitted network — NOT promoted

    Hypothesis: Hollmann et al. (2023) TabPFN-v2 is a strong zero-training tabular classifier; let it replace the GradientBoosting tower. Result: −1pp stratified F1. Acceptable on aggregate but indistinguishable from baseline on the spotlight countries we actually care about, and inference is ~30× slower than v3.3.

    Build script kept in repo at scripts/build-classifier-v3.5-tabpfn.py with a banner comment marking it rejected. Useful as a future fallback if v3.3 ever fails.

  3. N3 · ssl pretrain

    Self-supervised masked-autoencoder tabular pretrain — NOT promoted

    Hypothesis: pretrain a tabular MAE on the full unlabeled evidence corpus, then fine-tune on the labeled subset. The pretrained representation should help where labels are sparse. Result: −15.6pp F1 versus v3.3 — the pretrain pulled the model toward the marginal evidence distribution (heavily dominated by noisy disruption rows) and away from the rare-event positive class.

    Honest negative. Worth re-running if we ever balance the pretrain corpus by label class.

If you cannot find the loss, you should not trust the win.

04 · Atlas frontend pages

Fourteen new surfaces

Every model above is exposed through at least one Atlas frontend page. All pages are server-rendered, hourly-ISR, and link back to their underlying transparency endpoint.

  • /atlas/multi-horizon · Multi-horizon forecast: 1d / 7d / 30d hero cards + per-horizon SHAP. Top-20 spotlight countries with detail pages.
  • /atlas/score-v2 · Atlas Score v2: A-F country grades with the 50% base-rate fix. CN +33pp, RU +24pp, KP +32pp versus v1.
  • /atlas/changelog · Model changelog: full model history timeline, auto-stitched from every sidecar JSON written by training scripts.
  • /atlas/anomaly · Anomaly leaderboard: DBSCAN unsupervised second-opinion signal across 119 countries, sortable by score.
  • /atlas/domain-drift · Domain drift: HDBSCAN per-domain weekly drift that surfaces novel blocking mechanisms.
  • /atlas/case-studies/lebanon-bgp-outage-2026-04-24 · Case studies (Lebanon BGP outage): first journalist-grade case study; the forecast crossed its threshold 6.5 days before the confirmed BGP outage.
  • /atlas/journalist-toolkit · Journalist toolkit: one-stop press kit with live numbers, embed widgets, RSS feeds, citation templates, and model transparency.
  • /atlas/api-explorer · API explorer: interactive playground over 30 hand-picked endpoints with a URL-driven, server-rendered try-it flow.
  • /atlas/correlation-matrix · Correlation matrix: country × country incident co-movement as a 50×50 SVG heatmap with the top-20 ranked pairs.
  • /atlas/blocked-platforms-tracker · Blocked-platforms tracker: live accessibility matrix of 12 platforms × 50 countries, sortable by platform or by country.
  • /atlas/duration · Shutdown duration: Random Survival Forest expected-duration UI per country (c-index 0.728, n=343).
  • /atlas/corroboration · Multi-source corroboration: Bayesian posterior across OONI, IODA, CensoredPlanet, and Voidly probes (AUC 0.916, ECE 0.021).
  • /atlas/forecast-trajectory/ir · Forecast trajectory: full 30-day P(shutdown) SVG line with a 90% conformal band per day (h=1..30, spotlight countries).
  • /atlas/hte · Heterogeneous treatment effects: causal forest ATE per country for elections (global +9.6pp; VN +32pp).
  • /atlas/cohorts · Cohorts: DTW + Ward hierarchical clusters with sparkline centroids. K=3, silhouette 0.47.
  • /atlas/forecast-platform · Per-platform forecast: per-platform shutdown forecast as a 9 platforms × 30 countries heatmap.

05 · Cited methodology

The papers behind the push

We adapted four published methods into the Atlas tower today and evaluated a fifth (TabPFN-v2) that did not promote. Each is cited inline in the relevant training script so the chain of attribution is preserved.

  • Gibbs & Candès (NeurIPS 2021)
    Adaptive Conformal Inference. Used in the daily ACI cron to keep the forecast's 90% interval calibrated as the underlying distribution drifts.
  • Arkhangelsky et al. (arXiv:1812.09970)
    Synthetic Difference-in-Differences. Used as the counterfactual estimator in scripts/sdid_attribution.py with NetLoss-style scoping (ISOC, ACM IMC 2024).
  • Niaki et al. (KDD 2023)
    Per-measurement censorship classifier. Adapted to the Voidly 84K evidence corpus, with the published honest caveat that the AUC reflects label-rule reconstruction.
  • Athey & Wager (2019)
    Causal forest for heterogeneous treatment effects. Used to estimate per-country election-on-shutdown ATEs.
  • Hollmann et al. (2023)
    TabPFN-v2 prior-data fitted network. Evaluated and rejected for the v3.5 attempt; kept in the repo as a documented fallback.

06 · Where to go next

Verify, cite, integrate

Every claim on this page links to a live transparency endpoint. The fastest paths to verification and use:

  • /atlas: the live hub, daily refresh, links to every Atlas surface.
  • /atlas/findings: curated researcher-bylined deep-dives, one model per page.
  • /atlas/journalist-toolkit: press kit, citation templates, embed widgets, contact.
  • /atlas/changelog: the full model history, including every rejected build.
  • /api-docs: REST API reference, OpenAPI schema, MCP server install.
  • /press: embargo policy, logos, contact details for newsrooms.

07 · How to cite this shipping log

Cite

Voidly Research. (2026). Voidly Atlas Shipyard: 16 ML models, 14 pages, 3 negative results, 24 hours. Voidly Research shipping log. https://voidly.ai/atlas/shipyard-2026-05-21

License: CC BY 4.0. Reuse encouraged; please link back to this page so readers can audit our chain to the upstream sources.
