Synthetic baseline benchmark: every Atlas ML model vs predict_yesterday, base-rate, and four other trivial baselines
It is easy to claim "F1 0.87" for a censorship-forecast model. It is much harder to answer "is the model adding value over predict_yesterday?" honestly. This finding ships the synthetic-baseline benchmark suite: every Voidly Atlas ML model evaluated against six trivial baselines (always_zero, always_one, base_rate_constant, predict_yesterday, country_base_rate, random_with_base_rate), with the lift surfaced inline as a single number per row. Live at GET /v1/atlas/baseline-benchmark; weekly cron Thursday 04:45 UTC. First run benchmarked 23 models. 7 flagged barely_beats_baseline=true (F1/AUC lift < 5pp vs predict_yesterday) — including forecast_7day (F1 lift +2.29pp), forecast_1d/7d/30d (multi-horizon, all F1-negative vs persistence), classifier_v3.3 (F1 lift -21pp on rolling holdout — different task framing, documented honestly), and trajectory_d7/d30 (AUC lift -21 to -23pp because persistence dominates on 30-day horizons). Best by AUC lift is per_domain_pornhub.com at +50.74pp, but every AUC=1.000 row gets an explicit "code smell" honest_caveat — the model is likely reconstructing the labeling rule from a leaked feature, not discovering signal. Worst is trajectory_d30 at -22.78pp AUC. Brier sign-flipped so positive=better. Filters: family, barely, min_lift_pp. Six baselines documented in /info. The point is accountability: a model that barely beats predict_yesterday is on a list the user can see, not buried under a nice-looking F1.
Raw data
- Live: full 23-model ranked table
- Live: methodology + baseline definitions + caveats
- Live: only the 7 barely_beats_baseline=true models
- Live: only the forecast family
- Live: only the trajectory family (where persistence wins)
- Live: only the per-domain family (every row AUC ~1.0, code-smell flagged)
- Live: models with F1 lift >= 30pp (genuinely strong)
- Related: forecast model accuracy live
- Related: classifier v3.3 transparency
- Related: prediction track record (live precision/recall)