Observability: where failures are and why

This subsystem exists because of a real incident. A JSON-parsing bug once failed forty times in a single Lab run; each failure was logged as a warning, the run reported success, and only 4 of 44 experiments actually executed. Nothing aggregated the failure rate, so nobody saw it until someone happened to grep the logs. The rule that came out of that incident: nothing in this pipeline is allowed to fail quietly.

pipeline_events (Postgres, self-hosted)

Every stage records ok/fail with the reason: compose, lab_compose, render, s3_upload, segmentation, validator, scene, scheduler, lab_cycle. Inserts are fire-and-forget — telemetry can never block or break the pipeline.
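The fire-and-forget behaviour described above can be sketched as follows. This is a minimal illustration, not the project's actual recorder: `record_event`, the `insert` callable, and the event field names (`stage`, `status`, `reason`) are all hypothetical, and a real version would write to the `pipeline_events` table in Postgres rather than call an arbitrary function.

```python
import logging
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional

logger = logging.getLogger("pipeline_events")

# A single background worker, so the pipeline thread never waits on the database.
_executor = ThreadPoolExecutor(max_workers=1)


def record_event(
    insert: Callable[[dict], None],  # hypothetical: whatever writes one row
    stage: str,
    ok: bool,
    reason: Optional[str] = None,
) -> Optional[Future]:
    """Queue one ok/fail event and return immediately; never raise into the caller."""
    event = {"stage": stage, "status": "ok" if ok else "fail", "reason": reason}

    def _safe_insert() -> None:
        try:
            insert(event)
        except Exception:
            # A telemetry failure is logged and dropped, never propagated.
            logger.exception("pipeline_events insert failed; event dropped")

    try:
        return _executor.submit(_safe_insert)
    except RuntimeError:
        # The executor was already shut down (e.g. during interpreter exit).
        return None
```

The caller can ignore the returned future entirely; it is returned only so that tests can wait for the write to finish. The key design point is that both the submit and the insert are wrapped, so neither a slow database nor a broken one can surface as a pipeline error.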

Surfaces

GET /api/admin/pipeline — per-stage 24h/7d ok/fail + latest failure reasons
GET /api/admin/pipeline/events?stage=&status= — filterable feed
GET /health/full — degrades when any stage logs ≥10 failures in 24h (DB-backed: spikes visible regardless of which process recorded them)
Strategist → Status — failures-by-stage panel, 30s refresh
logfire — optional extra sink when a token is configured; nothing depends on it
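The degradation rule behind /health/full (any stage with at least 10 failures in the last 24 hours) can be sketched in a few lines. This is an in-memory stand-in under stated assumptions: the real check is DB-backed, and the function name `health_full`, the event fields (`stage`, `status`, `created_at`), and the response shape are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

FAILURE_THRESHOLD = 10        # from the text: >= 10 failures trips the check
WINDOW = timedelta(hours=24)  # from the text: counted over the last 24h


def health_full(events: list, now: datetime) -> dict:
    """Return 'degraded' plus the spiking stages, or 'ok' if none spike."""
    cutoff = now - WINDOW
    # Count only failures that fall inside the window, grouped by stage.
    failures = Counter(
        e["stage"]
        for e in events
        if e["status"] == "fail" and e["created_at"] >= cutoff
    )
    spiking = sorted(s for s, n in failures.items() if n >= FAILURE_THRESHOLD)
    return {"status": "degraded" if spiking else "ok", "spiking": spiking}


# Usage: 12 recent compose failures are enough to degrade health.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
recent = [
    {"stage": "compose", "status": "fail", "created_at": now - timedelta(minutes=i)}
    for i in range(12)
]
print(health_full(recent, now))  # {'status': 'degraded', 'spiking': ['compose']}
```

Because the count is derived from stored events rather than from per-process counters, the same result is obtained no matter which process recorded the failures.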

Fire-drill (verified)

Injecting 12 synthetic compose failures degraded health with spiking: ['compose'], and the failure reasons were visible in the panel; after cleanup, health returned to ok. The next gemini-class failure turns the sidebar red within one poll, naming the stage, the model, and the raw output.