2026-05-21

ML serving reliability dashboard: 30+ endpoint health in one curl

With 30+ ML endpoints (/v1/forecast/*, /v1/classifier/*, /v1/anomaly/*, /v1/measurement/*, /v1/sentinel/*) in production, a journalist or partner asking "is the model live and working?" used to need 30+ curls. The new daily dashboard at GET /v1/atlas/serving-reliability collapses that into one.

A cron job runs scripts/build-serving-reliability.py at 04:00 UTC and sends 10 probes to each endpoint. For every endpoint it records HTTP availability, p50 and p95 latency, a JSON schema fingerprint (a check that the response's top-level keys are a superset of the expected keys), model staleness (days since the underlying sidecar was last written, or since the declared trained_at), whether honest_caveats is present, and calibration drift (forecast and sentinel endpoints only). Results are written to the sidecar at /opt/voidly-ai/ml-deploy/serving_reliability.json.

Promotion criteria: at least 25 endpoints probed, and any 500s or timeouts reported honestly.

Honest caveats: a 10-probe sample is small, so p95 is noisy; probes hit the localhost upstream, so CF gateway issues are NOT detected; the schema check is a fingerprint, not full JSON-schema validation; MCP server endpoints are excluded.
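The per-endpoint probe described above can be sketched roughly as follows. This is a minimal illustration, not the contents of scripts/build-serving-reliability.py: the function names (percentile, schema_ok, probe_endpoint), the injected fetch callable, the output field names, and the nearest-rank percentile method are all assumptions made for the example.

```python
import json
import math
import time
from typing import Callable, Iterable, Sequence, Tuple


def percentile(samples: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100] (assumed method)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def schema_ok(body: dict, expected_keys: Iterable[str]) -> bool:
    """Fingerprint check: top-level keys must be a superset of expected_keys."""
    return set(expected_keys) <= set(body)


def probe_endpoint(
    fetch: Callable[[], Tuple[int, str]],  # hypothetical: returns (status, body text)
    expected_keys: Iterable[str],
    probes: int = 10,
) -> dict:
    expected = list(expected_keys)
    latencies_ms = []
    ok = 0
    schema_hits = 0
    for _ in range(probes):
        start = time.perf_counter()
        try:
            status, text = fetch()
        except Exception:
            # Timeouts and connection errors count as failed probes.
            status, text = None, ""
        latencies_ms.append((time.perf_counter() - start) * 1000)
        if status == 200:
            ok += 1
            try:
                body = json.loads(text)
            except ValueError:
                body = None
            if isinstance(body, dict) and schema_ok(body, expected):
                schema_hits += 1
    return {
        "availability": ok / probes,
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        # True only if every successful response passed the fingerprint.
        "schema_ok": ok > 0 and schema_hits == ok,
    }
```

With nearest-rank percentiles and only 10 samples, the p95 value is simply the slowest of the 10 probes, which is one concrete way to see why the p95 figure is noisy at this sample size.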

#monitoring #reliability #ml-honesty #transparency #observability
