2026-05-21

Multilingual semantic search: 50+ languages over the incident corpus

Voidly Atlas semantic search at /v1/atlas/search ran on all-MiniLM-L6-v2, an English-only sentence-transformer. Journalists querying in Arabic, Persian, Russian, Chinese, Hindi, and other languages got noise. Added a parallel multilingual index (paraphrase-multilingual-MiniLM-L12-v2, same 384-d shape, 50+ languages) over the same 2,696 incidents, stored in a separate table (incident_embeddings_multilingual) so the English-only endpoint stays byte-identical.

New endpoint GET|POST /v1/atlas/search/multilingual?q=... returns the same response shape plus `detected_language`, derived from a script-based heuristic over Unicode blocks (Arabic/Persian/Cyrillic/CJK/Hebrew/Devanagari/Thai/Burmese/Ethiopic).

Test queries: AR "إيران الإنترنت" → top-3 Iran incidents (sim 0.71-0.73), FA "اینترنت ایران" → top-3 Iran (0.69-0.70), RU "блокировка интернета Россия" → top-3 Russia (0.76-0.79), ZH "中国互联网封锁" → top-3 China (0.74), ES "censura internet" → Mauritius/South Sudan/Turkey censorship rows (0.74-0.77).

Latency overhead is +33-49ms vs English-only (40ms → 73-89ms). The L12 encoder is ~3x larger but still well under the 200ms p95 target.

Honest caveats inline in every response: (1) incident descriptions are mostly English, so cross-language matching relies on the encoder mapping queries and descriptions into a shared embedding space, (2) multilingual models score 5-10pp lower precision than per-language models, (3) script-based language detection cannot distinguish ES vs FR vs IT, since all three use Latin script.

Weekly cron `5 5 * * 0` (Sundays at 05:05) rebuilds the multilingual index 5 minutes after the English one.
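A minimal sketch of how a script-based `detected_language` heuristic of this kind could work, and why it cannot separate Latin-script languages. The block ranges, the label strings, the `detect_script` name, and the Persian-letter check are all assumptions made for illustration; the entry does not show the actual implementation.

```python
from collections import Counter

# Hypothetical Unicode block table (assumption, not Voidly's actual ranges).
# "cjk" covers only CJK Unified Ideographs here; kana and hangul are omitted.
SCRIPT_RANGES = [
    ("cyrillic", 0x0400, 0x04FF),
    ("hebrew", 0x0590, 0x05FF),
    ("arabic", 0x0600, 0x06FF),
    ("devanagari", 0x0900, 0x097F),
    ("thai", 0x0E00, 0x0E7F),
    ("burmese", 0x1000, 0x109F),
    ("ethiopic", 0x1200, 0x137F),
    ("cjk", 0x4E00, 0x9FFF),
]

# Letters common in Persian but absent from standard Arabic orthography:
# peh, tcheh, jeh, gaf, keheh, Farsi yeh. Used as an assumed tie-breaker.
PERSIAN_ONLY = {"\u067e", "\u0686", "\u0698", "\u06af", "\u06a9", "\u06cc"}


def detect_script(text: str) -> str:
    """Return the dominant script label for text, based on code point blocks."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        for name, lo, hi in SCRIPT_RANGES:
            if lo <= cp <= hi:
                counts[name] += 1
                break
    if not counts:
        # Latin-script text lands here: ES, FR, and IT are indistinguishable.
        return "latin_or_unknown"
    script = counts.most_common(1)[0][0]
    if script == "arabic" and any(ch in PERSIAN_ONLY for ch in text):
        return "persian"
    return script


if __name__ == "__main__":
    for query in ["блокировка интернета Россия", "中国互联网封锁", "censura internet"]:
        print(query, "->", detect_script(query))
```

Because the heuristic only inspects code points, a Spanish query such as "censura internet" falls through to the catch-all label, which is the limitation noted in caveat (3).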

#ml #search #multilingual #embeddings #sentence-transformers #i18n #accessibility #transparency
