voidly

Synthetic baseline benchmark: every Atlas ML model vs predict_yesterday, base-rate, and four other trivial baselines

It is easy to claim "F1 0.87" for a censorship-forecast model. It is much harder to answer honestly whether the model adds value over predict_yesterday. This finding ships the synthetic-baseline benchmark suite: every Voidly Atlas ML model is evaluated against six trivial baselines (always_zero, always_one, base_rate_constant, predict_yesterday, country_base_rate, random_with_base_rate), and the lift is surfaced inline as a single number per row. The suite is live at GET /v1/atlas/baseline-benchmark and is refreshed by a weekly cron job on Thursdays at 04:45 UTC.

The first run benchmarked 23 models. Seven were flagged barely_beats_baseline=true, meaning an F1/AUC lift below 5pp vs predict_yesterday. They include forecast_7day (F1 lift +2.29pp), forecast_1d/7d/30d (multi-horizon, all F1-negative vs persistence), classifier_v3.3 (F1 lift -21pp on rolling holdout; this is a different task framing, documented honestly), and trajectory_d7/d30 (AUC lift -21 to -23pp because persistence dominates on 30-day horizons).

The best model by AUC lift is per_domain_pornhub.com at +50.74pp, but every AUC=1.000 row gets an explicit "code smell" honest_caveat: the model is likely reconstructing the labeling rule from a leaked feature, not discovering signal. The worst is trajectory_d30 at -22.78pp AUC.

The Brier lift is sign-flipped (a lower Brier score is better) so that a positive value always means better. Results can be filtered by family, barely, and min_lift_pp. The six baselines are documented in /info. The point is accountability: a model that barely beats predict_yesterday is on a list the user can see, not buried under a nice-looking F1.
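The lift arithmetic described here can be sketched in a few lines of Python. This is a minimal illustration, not Voidly's implementation: the post names the six baselines but does not define them, so the definitions below are assumptions inferred from the names, and build_baselines, benchmark_row, and barely_threshold_pp are hypothetical names. Only F1 and Brier lift are shown; the AUC lift mentioned in the post would follow the same pattern.

```python
import random

BASELINES = (
    "always_zero", "always_one", "base_rate_constant",
    "predict_yesterday", "country_base_rate", "random_with_base_rate",
)

def f1_score(y_true, y_pred):
    """Binary F1; defined as 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def brier_score(y_true, probs):
    """Mean squared error between probabilities and 0/1 labels (lower is better)."""
    return sum((p - t) ** 2 for t, p in zip(y_true, probs)) / len(y_true)

def threshold(probs, cut=0.5):
    """Turn probabilities into hard 0/1 predictions."""
    return [1 if p >= cut else 0 for p in probs]

def build_baselines(series_by_country, seed=0):
    """Return (y_true, {baseline: probabilities}) for days 1..n of each series.

    Assumed input: {country: [0/1 label per day]} with at least two days each.
    Day 0 is skipped so that predict_yesterday has a value on every scored row.
    Base rates are taken from the scored rows for brevity; a real benchmark
    would estimate them on training data only.
    """
    rng = random.Random(seed)
    scored = [y for s in series_by_country.values() for y in s[1:]]
    base_rate = sum(scored) / len(scored)
    y_true, out = [], {name: [] for name in BASELINES}
    for series in series_by_country.values():
        country_rate = sum(series[1:]) / len(series[1:])
        for t in range(1, len(series)):
            y_true.append(series[t])
            out["always_zero"].append(0.0)
            out["always_one"].append(1.0)
            out["base_rate_constant"].append(base_rate)
            out["predict_yesterday"].append(float(series[t - 1]))
            out["country_base_rate"].append(country_rate)
            out["random_with_base_rate"].append(
                1.0 if rng.random() < base_rate else 0.0
            )
    return y_true, out

def benchmark_row(y_true, model_probs, baselines, barely_threshold_pp=5.0):
    """Lift of a model over predict_yesterday, in percentage points."""
    persistence = baselines["predict_yesterday"]
    f1_lift = 100.0 * (
        f1_score(y_true, threshold(model_probs))
        - f1_score(y_true, threshold(persistence))
    )
    # Sign-flipped: baseline minus model, so a positive value means better.
    brier_lift = 100.0 * (
        brier_score(y_true, persistence) - brier_score(y_true, model_probs)
    )
    return {
        "f1_lift_pp": f1_lift,
        "brier_lift_pp": brier_lift,
        "barely_beats_baseline": f1_lift < barely_threshold_pp,
    }
```

Calling benchmark_row with the persistence predictions themselves as the "model" gives a lift of zero and sets the flag, which is a useful sanity check before scoring real models.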

#baseline #benchmark #accountability #ml-honesty #transparency #predict-yesterday #lift #api
