voidly
Atlas · topic modeling

Incident topics

8 themes auto-discovered across 1,195 deduplicated incident descriptions (from 2,636 source rows; about 55% were removed as boilerplate duplicates). NPMI coherence is 0.726, well above the 0.30 promote floor.

Honest caveat: NMF is exploratory and topic labels are auto-generated heuristics over the top words, not editorial. Pick a label, scan the sample incidents, decide for yourself if the grouping holds. BERTopic with semantic embeddings would likely cluster differently — see methodology below.

Raw JSON · /topics/info · model: tfidf_nmf_v1 · trained May 21, 2026

Topic cards

Connectivity disruption (IODA)

765 docs · 64.0%
Top words
connectivity, disruption connectivity, drop alerts, connectivity disruption, connectivity drop, drop, recorded, alerts recorded, alerts, disruption

Connectivity disruption (IODA)

157 docs · 13.1%
Top words
network disruption, network, drop alerts, alerts, connectivity drop, recorded, disruption, drop, alerts recorded, connectivity

Social media platform blocks

51 docs · 4.3%
Top words
anomalous, blocking probes, probes anomalous, dns, dns blocking, probes, blocking, com, anomalous tiktok, tiktok com

Sustained activity / repeated alerts

53 docs · 4.4%
Top words
confirms activity, confirms, activity, recorded confirms, activity confirms, disruption connectivity, ganptmma
Sample incidents
  • JO-2026-0083(JO) Internet connectivity disruption in Jordan
  • AZ-2026-0132(AZ) Internet connectivity disruption in Azerbaijan
  • EG-2026-0167(EG) Internet connectivity disruption in Egypt
  • PK-2026-0157(PK) Internet connectivity disruption in Pakistan

HTTP/TLS probe timeouts

48 docs · 4.0%
Top words
blocked, probes blocked, probes, blocking timeout, timeout probes, timeout, blocking, https blocking, https, http

Connectivity disruption (IODA)

15 docs · 1.3%
Top words
united, drop united, united connectivity, disruption united, united alerts, connectivity, connectivity disruption, connectivity drop, alerts, disruption
Sample incidents
  • GB-2026-0039(GB) Internet connectivity disruption in United Kingdom
  • US-2026-0057(US) Internet connectivity disruption in United States
  • GB-2026-0037(GB) Internet connectivity disruption in United Kingdom
  • GB-2026-0033(GB) Internet connectivity disruption in United Kingdom

Social media platform blocks

40 docs · 3.3%
Top words
com, blocking, dns, dns blocking, blocking domains, censoredplanet, domains, instagram, com censoredplanet, facebook
Sample incidents
  • BY-2026-0004(BY) DNS blocking in BY: bbc.com, facebook.com, google.com, instagram.com, medium.com +13 more
  • EG-2026-0005(EG) DNS blocking in EG: bbc.com, facebook.com, google.com, instagram.com, medium.com +13 more
  • SA-2026-0004(SA) DNS blocking in SA: bbc.com, facebook.com, google.com, instagram.com, medium.com +13 more
  • TR-2026-0006(TR) DNS blocking in TR: bbc.com, facebook.com, google.com, instagram.com, medium.com +13 more

OONI anomaly burst

47 docs · 3.9%
Top words
ooni, anomaly rate, rate, anomaly, shutdown, sustained, network averaging, averaging, elevated network, averaging anomaly

Unlabeled (low signal)

19 docs · 1.6%

Country × topic heatmap

Top 40 countries by incident volume (min 4 incidents in the corpus). Cell shade = share of that country's incidents in that topic (deeper green = larger share). Helps a journalist see which themes dominate a specific country's history.

Columns, left to right:
  Country
  t0 · Connectivity disruption (IODA)
  t1 · Connectivity disruption (IODA)
  t2 · Social media platform blocks
  t3 · Sustained activity / repeated alerts
  t4 · HTTP/TLS probe timeouts
  t5 · Connectivity disruption (IODA)
  t6 · Social media platform blocks
  t7 · OONI anomaly burst
  n

Each row lists the country's non-empty cells in left-to-right order, then n. Empty cells are not shown, so a value's position does not identify its topic column.

Country   Non-empty cells (t0 to t7)   n
IR        18, 1, 2, 3, 8               32
VE        20, 1, 1, 1, 2               25
RU        6, 1, 4, 2, 9                23
PK        3, 3, 6, 5, 3                21
EG        9, 1, 2, 2, 3, 1             19
ID        18, 1                        19
IN        14, 1, 3                     19
NG        14, 2, 1, 1                  18
TT        17, 1                        18
MM        6, 1, 4, 3, 2                17
TZ        3, 1, 3, 9, 1                17
IQ        10, 1, 2, 2                  16
KZ        7, 1, 1, 2, 1, 3             15
MX        13, 1                        14
SY        8, 4, 2                      14
UZ        3, 6, 2, 2, 1                14
AZ        6, 3, 4                      13
BD        7, 1, 4, 1                   13
ET        9, 1, 1, 2                   13
SI        12, 1                        13
TR        7, 2, 3, 1                   13
CM        10, 2                        12
CN        5, 1, 4, 1                   11
CU        6, 2, 1, 1, 1                11
FR        10, 1                        11
GB        1, 8, 2                      11
PA        10, 1                        11
BA        9, 1                         10
BY        2, 1, 1, 4, 1, 1             10
MZ        9, 1                         10
NI        5, 1, 4                      10
SA        4, 2, 2, 1                   10
UA        9, 1                         10
VN        4, 1, 2, 1                   10
ZW        9, 1                         10
AO        8, 1                         9
BR        7, 1, 1                      9
CO        8, 1                         9
DZ        5, 1, 2, 1                   9
HN        8, 1                         9
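
The cell shading described for the heatmap (each country's share of incidents per topic) can be sketched with a pandas cross-tabulation. This is a minimal illustration, not the page's actual code: the column names `country` and `topic` and the function `country_topic_shares` are hypothetical, and the toy rows are invented for the example. Only the "min 4 incidents" and "top 40" cut-offs come from the heatmap description.

```python
import pandas as pd


def country_topic_shares(assignments: pd.DataFrame,
                         min_incidents: int = 4,
                         top_n: int = 40) -> pd.DataFrame:
    """Share of each country's incidents that fall in each topic.

    `assignments` is assumed to have one row per incident with
    hypothetical columns `country` and `topic`.
    """
    # Raw counts: rows are countries, columns are topic ids.
    counts = pd.crosstab(assignments["country"], assignments["topic"])
    totals = counts.sum(axis=1)
    # Keep countries above the volume floor, largest first, capped at top_n.
    keep = (totals[totals >= min_incidents]
            .sort_values(ascending=False)
            .head(top_n)
            .index)
    # Row-normalise so each kept country's shares sum to 1.
    return counts.loc[keep].div(totals.loc[keep], axis=0)


# Invented toy data: "XX" has a single incident and is filtered out.
toy = pd.DataFrame({
    "country": ["IR"] * 5 + ["VE"] * 4 + ["XX"],
    "topic": [0, 0, 0, 2, 7, 0, 0, 0, 4, 1],
})
shares = country_topic_shares(toy)
```

Row-normalising (rather than column-normalising) matches the stated reading of the heatmap: a deep cell means the topic dominates that country's history, regardless of how many incidents the country has overall.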

Methodology

  1. Corpus build: pull every incident (title + description) from the live DB, normalize whitespace, and dedupe by SHA-256 of the lowercased text. Hashing collapses rows whose normalized text is identical; it is exact matching, not fuzzy matching. 2,636 rows → 1,195 unique docs.
  2. Vectorize: tf-idf, unigrams + bigrams, min_df=2, max_df=0.95, vocab capped at 5,000, English stopwords + domain stopwords (boilerplate like "cenalert", "detected", country names) ⇒ 623 terms.
  3. NMF sweep: fit sklearn NMF with k ∈ [8, 24] step 2, init="nndsvda", beta_loss="frobenius", score by NPMI coherence on top-12 words per topic. Pick the k that maximizes coherence (k = 8 for this run).
  4. Assignment: each doc → argmax topic (hard assignment). Documents with W weight below 1e-6 fall into an "unlabeled" bucket.
  5. Labels: auto-generated by checking the top-N words against a hand-curated keyword map (election / VPN / DNS / etc). Falls back to "Topic N: word1 / word2" if no keywords match. These are heuristic — they exist to make scanning faster, not to be authoritative.
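
Steps 1 to 4 can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the production pipeline: the function names (`dedupe`, `fit_topics`, `assign`), the two-word `DOMAIN_STOPWORDS` set, `max_iter`, `random_state`, and the toy incident strings are all hypothetical. The vectorizer and NMF parameters mirror the values listed above.

```python
import hashlib
import re

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Hypothetical subset; the full domain stopword list is not shown in the source.
DOMAIN_STOPWORDS = {"cenalert", "detected"}


def dedupe(texts):
    """Step 1: drop rows whose normalized text hashes to an already-seen digest."""
    seen, unique = set(), []
    for text in texts:
        normalized = re.sub(r"\s+", " ", text).strip().lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(normalized)
    return unique


def fit_topics(docs, k, seed=0):
    """Steps 2 and 3 for a single k: tf-idf over unigrams + bigrams, then NMF."""
    vectorizer = TfidfVectorizer(
        ngram_range=(1, 2), min_df=2, max_df=0.95, max_features=5000,
        stop_words=list(ENGLISH_STOP_WORDS | DOMAIN_STOPWORDS),
    )
    X = vectorizer.fit_transform(docs)
    model = NMF(n_components=k, init="nndsvda", beta_loss="frobenius",
                max_iter=500, random_state=seed)
    W = model.fit_transform(X)  # document-topic weights
    return vectorizer, model, W


def assign(W, floor=1e-6):
    """Step 4: hard argmax; -1 marks docs whose largest weight is below the floor."""
    topics = W.argmax(axis=1)
    topics[W.max(axis=1) < floor] = -1
    return topics


# Invented toy rows; the second is a duplicate once whitespace and case are normalized.
raw = [
    "Internet connectivity  disruption in Jordan",
    "internet connectivity disruption in jordan",
    "Internet connectivity disruption in Egypt",
    "Connectivity drop alerts recorded",
    "Connectivity drop alerts recorded again",
    "DNS blocking of tiktok com",
    "DNS blocking of facebook com",
    "DNS blocking probes anomalous",
]
docs = dedupe(raw)
vectorizer, model, W = fit_topics(docs, k=2)
topics = assign(W)
```

A full sweep would call `fit_topics` once per k in `range(8, 26, 2)` and keep the fit with the highest coherence. Treating "W weight below 1e-6" as the document's maximum weight is an interpretation of step 4, not something the source states.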

k sweep

Coherence by k (NPMI on training corpus, top-12 words per topic):

k     coherence   recon err
8     0.7257      19.81
10    0.7071      19.35
12    0.6992      18.95
14    0.7149      18.61
16    0.7002      18.33
18    0.7014      18.02
20    0.6864      17.72
22    0.6943      17.43
24    0.6793      17.15
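
A coherence score of this kind could be computed along the following lines. This is a sketch of pairwise NPMI from document co-occurrence, assuming a boolean document-by-term `presence` matrix (for example, the tf-idf matrix tested for non-zero entries). The function names, the handling of the edge cases, and the toy matrix are assumptions rather than the page's actual scoring code.

```python
import math
from itertools import combinations

import numpy as np


def npmi(p_i, p_j, p_ij):
    """Normalized pointwise mutual information, in [-1, 1]."""
    if p_ij == 0.0:
        return -1.0  # the two terms never co-occur
    if p_ij == 1.0:
        return 0.0   # assumed convention: both terms in every doc carry no signal
    pmi = math.log(p_ij / (p_i * p_j))
    return pmi / -math.log(p_ij)


def topic_coherence(presence, term_ids):
    """Mean NPMI over all pairs of a topic's top terms."""
    scores = []
    for i, j in combinations(term_ids, 2):
        p_i = presence[:, i].mean()
        p_j = presence[:, j].mean()
        p_ij = (presence[:, i] & presence[:, j]).mean()
        scores.append(npmi(p_i, p_j, p_ij))
    return float(np.mean(scores))


def model_coherence(presence, H, top_n=12):
    """Mean topic coherence, using the top_n highest-weight terms per row of H."""
    tops = [np.argsort(row)[::-1][:top_n] for row in H]
    return float(np.mean([topic_coherence(presence, ids) for ids in tops]))


# Invented toy corpus: terms 0 and 1 always co-occur, term 2 never joins them.
presence = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
], dtype=bool)
H = np.array([[0.9, 0.8, 0.0],
              [0.0, 0.1, 0.7]])
pair_score = topic_coherence(presence, [0, 1])
overall = model_coherence(presence, H, top_n=2)
```

In the toy data the first topic's top pair co-occurs perfectly (NPMI of 1) and the second topic's top pair never co-occurs (NPMI of -1), so the model-level mean is 0. Because the probabilities are estimated on the training corpus itself, heavily templated text tends to push such scores upward.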

Honest caveats

  • NMF topics are exploratory. Labels are auto-generated heuristics over the top words, not editorial.
  • Many IODA disruption incidents share boilerplate ("CenAlert detected interference"). Dedupe removes rows that are identical after normalization (SHA-256 of the normalized text), but templated descriptions that differ in any detail survive, so a residual disruption topic is expected.
  • Topic assignment is hard (single argmax). An incident with mixed signals (e.g. election + DNS) only counts toward one topic.
  • Coherence is approximated via pairwise NPMI on the training corpus, not full c_v from Röder et al. Reasonable proxy but not directly comparable to BERTopic numbers in the literature.
  • Sentence-transformers and BERTopic were not installable in the production venv-ml, so the pipeline fell back to tf-idf + NMF per the project directive. BERTopic with all-MiniLM-L6-v2 would likely surface more semantic clusters; tf-idf captures only the lexical surface.
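
To see how much the hard-assignment caveat matters, one option is to row-normalize the NMF document-topic matrix W and flag documents whose second-largest share is substantial. This is only an illustrative sketch: `topic_shares`, `mixed_docs`, the 0.3 threshold, and the toy matrix are hypothetical and not part of the published pipeline.

```python
import numpy as np


def topic_shares(W, floor=1e-6):
    """Row-normalize W so each document's topic weights sum to 1.

    Rows whose total weight is at or below `floor` are left as all zeros.
    """
    W = np.asarray(W, dtype=float)
    totals = W.sum(axis=1, keepdims=True)
    return np.divide(W, totals, out=np.zeros_like(W), where=totals > floor)


def mixed_docs(W, second_min=0.3):
    """Indices of docs whose runner-up topic holds at least `second_min` share.

    The 0.3 default is an arbitrary illustrative threshold.
    """
    shares = topic_shares(W)
    second = np.sort(shares, axis=1)[:, -2]  # second-largest share per row
    return np.flatnonzero(second >= second_min)


# Invented weights: doc 0 is dominated by one topic, doc 1 is split evenly,
# doc 2 has no signal at all.
W_toy = np.array([
    [0.8, 0.2, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0],
])
flagged = mixed_docs(W_toy)
```

Documents flagged this way are the ones where a single argmax label hides a second theme, so they are reasonable candidates for manual review.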
