Shutdown duration is not predictable — an honest audit of the survival model
Voidly Atlas ships a shutdown-duration model, a Random Survival Forest (RSF) that answers "once a shutdown starts, how long will it last?", served at POST /v1/forecast/duration. The model card reported a concordance index (c-index) of 0.728, above the 0.65 promote floor, and flagged passed_promote_floor: true. A platform-wide ML audit had caught several Voidly models inflating metrics via shuffled train/test splits; the duration RSF had not been individually audited. This finding is that audit.

How the 0.728 was computed: scripts/train-shutdown-duration-rsf.py evaluates the model with a single random 75/25 train_test_split (random_state=42), stratified only on the censoring flag, with no temporal ordering and no country grouping. Three problems sink that number.

(1) The point estimate is a lucky seed. Only 74 of 343 incidents have an observed end (78.4% censored), so a 25% test fold holds ~19 events and ~286 comparable pairs: tiny and noisy. Re-running the identical random split across 20 seeds gives 0.728 ± 0.055, range 0.62–0.82; the published number is just where seed 42 landed.

(2) The random split leaks calendar time. The top permutation feature is first_seen_year, and that is leakage, not signal: the "event observed" label is status=confirmed, and confirmed status is an artifact of incident AGE, not of the shutdown ending. 2019–2023 incidents are ~100% confirmed; 2024–2026 incidents are 9–56% confirmed. A shuffle scatters old and new years across both folds, so the RSF learns "old year ⇒ resolved row", a dataset-assembly pattern that cannot generalise to a live shutdown today.

(3) The honest split shows no skill. A forward-temporal split (train on shutdowns whose outcome was known before a cutoff, test on strictly later ones, sorted by end date) yields c-index 0.609 / 0.563 / 0.571 / 0.437 across cutoffs 2022–2025. The mean is ≈ 0.55, and the most-recent fold (the one matching real use) scores 0.44, below a coin flip.
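The evaluation described above, a c-index scored on a forward-temporal split, can be sketched in plain Python. This is a minimal illustration, not the code in scripts/train-shutdown-duration-rsf.py: the Incident record and its field names are hypothetical, the split date is assumed to be start day plus duration, and tied durations are simply skipped rather than handled as a full survival library would.

```python
from typing import List, NamedTuple, Optional, Sequence, Tuple


class Incident(NamedTuple):
    # Hypothetical record layout, chosen for illustration only.
    country: str
    start_day: int         # days since an arbitrary epoch
    duration_days: float   # observed duration, or time under observation if censored
    event: bool            # True if an end was observed, False if censored


def concordance_index(
    times: Sequence[float], events: Sequence[bool], risk: Sequence[float]
) -> Optional[float]:
    """Harrell-style c-index: higher risk should pair with shorter duration.

    A pair (i, j) is comparable when row i has an observed end and a
    strictly shorter time than row j. Returns None if no pair is comparable.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # censored rows cannot be the shorter member of a pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / comparable if comparable else None


def forward_temporal_split(
    incidents: Sequence[Incident], cutoff_day: float
) -> Tuple[List[Incident], List[Incident]]:
    """Train on rows whose end (or last observation) precedes the cutoff,
    test on everything at or after it. Both halves are sorted by end day."""
    def end_day(row: Incident) -> float:
        return row.start_day + row.duration_days

    ordered = sorted(incidents, key=end_day)
    train = [row for row in ordered if end_day(row) < cutoff_day]
    test = [row for row in ordered if end_day(row) >= cutoff_day]
    return train, test
```

Unlike a shuffled split, no test row ends before the cutoff, so a model cannot score well merely by learning which calendar years tend to carry a resolved label.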
The promote gate requires beating a naive baseline: predict this country's mean past observed duration. On the identical temporal folds the naive baseline scores 0.509; the RSF's honest 0.55 is within noise of it and loses outright on two of four folds.

Giving the model its best shot, with an enriched feature set (incident confidence, measurement count, anomaly rate, source/domain/service/ASN counts, severity grade, blocking mechanism) and the leaky year dropped, scored 0.495 on the forward-temporal split. More features did not help.

Verdict: honest negative. No duration model beats a naive country-mean baseline on a leak-free split; nothing is promoted. With Voidly's current data, how long a shutdown lasts is not predictable: too few resolved events (~one per country), an "end" label contaminated by curation lag rather than measurement, and durations quantized to 24-hour ingestion multiples.

The artifact and sidecar now carry the honest c-index, passed_promote_floor: false and an honest_caveats block; the endpoint surfaces the audit verdict. Publishing the honest negative, and not shipping a number we cannot stand behind, is the fix.
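For reference, the naive baseline named by the promote gate (predict the country's mean past observed duration) might look like the sketch below. The tuple layout and the fallback to a global mean for countries with no resolved incident in the training fold are assumptions made for illustration; the audit does not state how such countries are handled.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Hypothetical row layout: (country, duration_days, event_observed)
Row = Tuple[str, float, bool]


def country_mean_baseline(train: Sequence[Row], test: Sequence[Row]) -> List[float]:
    """Predict each test row's duration as its country's mean observed
    duration in the training fold. Censored training rows are ignored."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for country, duration, event in train:
        if event:
            sums[country] += duration
            counts[country] += 1

    n_events = sum(counts.values())
    # Assumption: unseen countries fall back to the global mean (0.0 if empty).
    global_mean = sum(sums.values()) / n_events if n_events else 0.0

    predictions = []
    for country, _, _ in test:
        if counts.get(country, 0) > 0:
            predictions.append(sums[country] / counts[country])
        else:
            predictions.append(global_mean)
    return predictions
```

To compare against a survival model with a c-index, the predicted durations would be negated so that a longer predicted duration maps to a lower risk score.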