ADR-002: Pipeline-health cron for self-healing automation

Status: accepted · Date: 2026-04-18 · Applies to: reli, word-coach-annie, filmduel, cosmic-match, un-reminder, beads


Context

The ecosystem relies on three crons to keep code flowing to prod:

These crons only handle happy-path inputs, though. In three failure modes the pipeline itself breaks and nothing steps in to repair it:

  1. Main CI red: a commit landed that broke main. Nothing picks this up, because PR-maintenance only looks at PRs, not main.
  2. Prod deploy failed or lagging: for example, a dropped Railway webhook or an errored Actions workflow. The merge happened; the deploy didn’t.
  3. Pipeline stalled: no commits, no archon runs, no activity. The cause could be a quota limit or a systemic problem. Either way, nothing escalates.
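
To make the three failure modes concrete, here is a minimal Python sketch that classifies a per-repo state snapshot. Everything in it is an assumption for illustration: the Snapshot fields, the Failure labels, and both thresholds (a 30-minute deploy grace period and a 24-hour stall window) are hypothetical and not taken from this ADR. How the snapshot is collected (CI status, deployed SHA, activity timestamps) is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Failure(Enum):
    MAIN_CI_RED = "main-ci-red"
    DEPLOY_LAGGING = "deploy-lagging"
    PIPELINE_STALLED = "pipeline-stalled"


@dataclass
class Snapshot:
    """Hypothetical per-repo state; how each field is gathered is out of scope."""
    main_ci_passing: bool
    main_sha: str
    deployed_sha: Optional[str]   # None if nothing has been deployed yet
    last_merge_at: datetime
    last_activity_at: datetime    # latest commit or archon run


# Illustrative thresholds, not values from the ADR.
DEPLOY_GRACE = timedelta(minutes=30)
STALL_AFTER = timedelta(hours=24)


def detect(s: Snapshot, now: datetime) -> list[Failure]:
    """Return every failure mode the snapshot exhibits, in list order 1-3."""
    found = []
    if not s.main_ci_passing:
        found.append(Failure.MAIN_CI_RED)
    # Prod is behind main and the merge is old enough that a deploy was due.
    if s.deployed_sha != s.main_sha and now - s.last_merge_at > DEPLOY_GRACE:
        found.append(Failure.DEPLOY_LAGGING)
    if now - s.last_activity_at > STALL_AFTER:
        found.append(Failure.PIPELINE_STALLED)
    return found
```

The grace period keeps a merge that landed a few minutes ago from being reported as a lagging deploy; the right value depends on how long a normal deploy takes.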

Decision

A fourth cron, pipeline-health-cron, runs every 30 minutes and detects these meta-failures. When it finds one, it either files an issue tagged for archon pickup (self-heal) or fires archon-assist directly (urgent bottleneck).
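
The two response paths could be sketched as a small dispatcher. This is a hedged illustration, not the cron's actual code: the Finding type, the handle function, and the file_issue and fire_assist callables are hypothetical names, and the rule that marks a finding as urgent is not specified here, so it is modeled as a plain flag set by the caller.

```python
from dataclasses import dataclass
from typing import Callable

# Standard five-field cron expression for "every 30 minutes".
SCHEDULE = "*/30 * * * *"


@dataclass(frozen=True)
class Finding:
    repo: str
    kind: str     # illustrative label, e.g. "main-ci-red"
    urgent: bool  # assumption: the caller decides what counts as urgent


def handle(
    finding: Finding,
    file_issue: Callable[[str, str], None],
    fire_assist: Callable[[str, str], None],
) -> str:
    """Dispatch one finding and return which path was taken."""
    if finding.urgent:
        # Urgent bottleneck: invoke archon-assist directly.
        fire_assist(finding.repo, finding.kind)
        return "archon-assist"
    # Self-heal: file an issue tagged for archon pickup.
    file_issue(finding.repo, finding.kind)
    return "issue"
```

Passing the two side effects in as callables keeps the routing decision testable without touching an issue tracker or archon itself.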

Specifically, each tick:

Consequences

Alternatives considered