Observability: where failures are and why
This subsystem exists because of a real incident. A JSON-parsing bug once failed forty times in a single Lab run; each failure was logged as a warning, the run reported success, and only 4 of 44 experiments actually executed. Nothing aggregated the failure rate, so nobody saw it until someone happened to grep the logs. The rule that came out of that incident: nothing in this pipeline is allowed to fail quietly.
pipeline_events (Postgres, self-hosted)
Every stage records an ok/fail event together with its reason. The recorded
stages are compose, lab_compose, render, s3_upload, segmentation, validator,
scene, scheduler, and lab_cycle. Inserts are fire-and-forget: telemetry can
never block or break the pipeline.
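One way to get fire-and-forget behaviour is a bounded in-memory queue drained by a daemon thread. The sketch below is illustrative only and is not taken from the project's code: `PipelineEvent`, `EventRecorder`, the `write` callback (standing in for the actual Postgres insert), and the `dropped` counter are all hypothetical names. The idea is that `record()` never blocks and never raises, while lost events are still counted rather than vanishing silently.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class PipelineEvent:
    """Hypothetical event shape: one row per stage outcome."""
    stage: str
    status: str  # "ok" or "fail"
    reason: Optional[str] = None


class EventRecorder:
    """Hands events to a background writer so callers never wait on telemetry."""

    def __init__(self, write: Callable[[PipelineEvent], None], max_pending: int = 1000):
        # `write` is an assumed callback that performs the real insert.
        self._write = write
        self._queue: "queue.Queue[PipelineEvent]" = queue.Queue(maxsize=max_pending)
        self.dropped = 0  # events lost to a full queue or a failed write
        threading.Thread(target=self._drain, daemon=True).start()

    def record(self, stage: str, status: str, reason: Optional[str] = None) -> None:
        """Enqueue without blocking; a full queue drops the event and counts it."""
        try:
            self._queue.put_nowait(PipelineEvent(stage, status, reason))
        except queue.Full:
            self.dropped += 1

    def _drain(self) -> None:
        while True:
            event = self._queue.get()
            try:
                self._write(event)
            except Exception:
                # A telemetry failure must not propagate into the pipeline.
                self.dropped += 1
            finally:
                self._queue.task_done()

    def flush(self) -> None:
        """Block until every queued event has been attempted (useful in tests)."""
        self._queue.join()
```

A stage would call something like `recorder.record("compose", "fail", "invalid JSON from model")` and carry on. Exposing `dropped` somewhere visible keeps the telemetry path itself from failing quietly.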
Surfaces
- GET /api/admin/pipeline: per-stage 24h/7d ok/fail + latest failure reasons
- GET /api/admin/pipeline/events?stage=&status=: filterable feed
- GET /health/full: degrades when any stage logs ≥10 failures in 24h (DB-backed: spikes visible regardless of which process recorded them)
- Strategist → Status: failures-by-stage panel, 30s refresh
- logfire: optional extra sink when a token is configured; nothing depends on it
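The degradation rule behind /health/full (any stage with ≥10 failures in the last 24h) can be sketched as a pure function over event rows. This is a minimal illustration under assumptions: the `(stage, status, recorded_at)` tuple shape, the `evaluate_health` name, and the returned dictionary keys are hypothetical, and the real endpoint presumably runs an equivalent aggregate query against pipeline_events instead of filtering in Python.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional, Tuple

FAILURE_THRESHOLD = 10          # from the rule above: >=10 failures
WINDOW = timedelta(hours=24)    # from the rule above: within 24h

# Assumed row shape: (stage, status, recorded_at)
EventRow = Tuple[str, str, datetime]


def evaluate_health(events: Iterable[EventRow], now: Optional[datetime] = None) -> dict:
    """Return 'degraded' plus the spiking stages if any stage crosses the threshold."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - WINDOW

    # Count only failures that fall inside the 24h window, per stage.
    failures = Counter(
        stage
        for stage, status, recorded_at in events
        if status == "fail" and recorded_at >= cutoff
    )
    spiking = sorted(stage for stage, count in failures.items() if count >= FAILURE_THRESHOLD)
    return {"status": "degraded" if spiking else "ok", "spiking": spiking}
```

Because the count comes from stored rows and not from per-process counters, a spike is visible no matter which process recorded the failures, which is the point of the DB-backed design.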
Fire-drill (verified)
12 synthetic compose failures → health degraded, spiking: ['compose'],
reasons visible in the panel; cleanup → ok. The next gemini-class failure
turns the sidebar red within one poll, naming the stage, the model, and the
raw output.