- Corpus build: pull every incident (title + description) from the live DB, normalize whitespace, then dedupe by SHA-256 of the lowercased text. The hash only removes copies that are identical after normalization, not fuzzy near-duplicates. 2,636 rows → 1,195 unique docs.
- Vectorize: tf-idf, unigrams + bigrams, min_df=2, max_df=0.95, vocab capped at 5,000, English stopwords + domain stopwords (boilerplate like "cenalert", "detected", country names) ⇒ 623 terms.
- NMF sweep: fit sklearn NMF for k ∈ {8, 10, …, 24} (step 2) with init="nndsvda" and beta_loss="frobenius", then score each fit by NPMI coherence on the top-12 words per topic. Pick the k that maximizes coherence.
- Assignment: each doc → argmax topic over its row of W (hard assignment). Documents whose largest W weight is below 1e-6 fall into an "unlabeled" bucket.
- Labels: auto-generated by checking the top-N words against a hand-curated keyword map (election / VPN / DNS / etc). Falls back to "Topic N: word1 / word2" if no keywords match. These are heuristic — they exist to make scanning faster, not to be authoritative.
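
The dedupe, assignment, and labeling steps above can be sketched in plain Python. This is a minimal illustration rather than the pipeline's actual code: the function names, the `KEYWORD_MAP` contents, and the choice to show two fallback words are assumptions made for the example, and topic weights are passed as a plain list rather than a row of the NMF `W` matrix.

```python
import hashlib
import re

UNLABELED = "unlabeled"


def normalize(text):
    """Collapse runs of whitespace to one space and strip the ends."""
    return re.sub(r"\s+", " ", text).strip()


def dedupe(texts):
    """Keep the first doc for each SHA-256 of the normalized, lowercased text."""
    seen, unique = set(), []
    for text in texts:
        clean = normalize(text)
        digest = hashlib.sha256(clean.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(clean)
    return unique


def assign_topic(weights, min_weight=1e-6):
    """Hard assignment: index of the largest weight, or UNLABELED if too small."""
    best = max(range(len(weights)), key=lambda i: weights[i])
    return best if weights[best] >= min_weight else UNLABELED


# Hypothetical keyword map; the real hand-curated map is not shown here.
KEYWORD_MAP = {
    "Elections": {"election", "vote"},
    "VPN": {"vpn"},
    "DNS": {"dns"},
}


def label_topic(topic_id, top_words, keyword_map=KEYWORD_MAP):
    """Return the first label whose keywords appear in top_words, else a fallback."""
    words = set(top_words)
    for label, keywords in keyword_map.items():
        if keywords & words:
            return label
    return f"Topic {topic_id}: " + " / ".join(top_words[:2])


docs = dedupe(["VPN  blocked", "vpn blocked ", "DNS tampering"])
print(docs)
print(assign_topic([0.0, 0.3, 0.1]), assign_topic([0.0, 1e-9]))
print(label_topic(4, ["throttling", "bandwidth"]))
```

Because the hash is taken after whitespace normalization and lowercasing, the first two inputs collapse into one document, while any real difference in wording would survive as a separate row.
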
k sweep
Coherence by k (NPMI on training corpus, top-12 words per topic):
| k | coherence | reconstruction error |
|---|---|---|
| 8 | 0.7257 | 19.81 |
| 10 | 0.7071 | 19.35 |
| 12 | 0.6992 | 18.95 |
| 14 | 0.7149 | 18.61 |
| 16 | 0.7002 | 18.33 |
| 18 | 0.7014 | 18.02 |
| 20 | 0.6864 | 17.72 |
| 22 | 0.6943 | 17.43 |
| 24 | 0.6793 | 17.15 |
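
For reference, the scoring and selection step can be sketched as follows. This is a hedged illustration, not the code that produced the table: it assumes document-level (boolean) co-occurrence and whitespace tokenization, neither of which is stated above, and the handling of the degenerate case where a word pair appears in every document is a convention chosen for the example. The `sweep` dictionary simply copies the coherence column from the table.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, docs):
    """Mean NPMI over all pairs of top_words, using document-level co-occurrence."""
    doc_sets = [set(doc.lower().split()) for doc in docs]
    n = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n

    scores = []
    for a, b in combinations(top_words, 2):
        p_ab = prob(a, b)
        if p_ab == 0.0:
            scores.append(-1.0)  # pair never co-occurs: minimum NPMI
        elif p_ab == 1.0:
            scores.append(0.0)   # assumption: treat the 0/0 case as 0
        else:
            pmi = math.log(p_ab / (prob(a) * prob(b)))
            scores.append(pmi / -math.log(p_ab))
    return sum(scores) / len(scores)


# Coherence column copied from the table above.
sweep = {8: 0.7257, 10: 0.7071, 12: 0.6992, 14: 0.7149, 16: 0.7002,
         18: 0.7014, 20: 0.6864, 22: 0.6943, 24: 0.6793}
best_k = max(sweep, key=sweep.get)
print(best_k)  # 8
```

Reconstruction error falls monotonically as k grows in this sweep, so it cannot pick k on its own; coherence is the selection criterion, and by that measure k = 8 scores highest.
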