ADR-002: Pipeline-health cron for self-healing automation
Status: accepted · Date: 2026-04-18 · Applies to: reli, word-coach-annie, filmduel, cosmic-match, un-reminder, beads
Context
The ecosystem relies on three crons to keep code flowing to prod:
issue-pickup-cron— picks open GH issues and launches archon to draft fixespr-maintenance-cron— merges clean PRs, rebases dirty ones, fixes failing CI
But these crons only work on happy-path inputs. Three failure modes exist where nothing comes back to fix the pipeline itself:
- Main CI red — something landed that broke
main. Nothing picks this up because PR-maintenance only looks at PRs, not main. - Prod deploy failed or lagging — Railway webhook dropped, Actions workflow errored. The merge happened, the deploy didn’t.
- Pipeline stalled — no commits, no archon runs, no activity. Could be a quota limit, could be a systemic problem. Either way, nothing escalates.
Decision
A fourth cron, pipeline-health-cron, runs every 30 minutes and detects these
meta-failures. When it finds one, it either files an issue tagged for archon
pickup (self-heal) or fires archon-assist directly (urgent bottleneck).
Specifically, each tick:
- Main CI red → file
archon:in-progress-tagged issue, fire archon immediately on that issue, dedup by commit SHA via marker file - Prod deploy failed/lagging → same pattern, using deploy SHA for dedup; reads signal from either the GH Actions deploy workflow or the GitHub deployments API (Railway posts here)
- Zombie archon DB runs (status=running, age >4h) → abandon
- Disk pressure >85% → ntfy
- No progress in last tick — if token-limit markers in logs, wait; else fire archon-assist diagnostic with 2h cooldown
Consequences
- Self-healing covers three previously-silent failure modes.
- Zero cost when the pipeline is healthy — nothing fires.
- AI cost is bounded by the 2h diagnostic cooldown and per-SHA dedup.
- Adds one more thing to forget to deploy if the machine is reimaged — the cron needs to be re-installed. Mitigation: crontab lives in the memory reference file, re-install is documented.
Alternatives considered
- Nagios / Uptime Kuma / PagerDuty: rejected. External monitoring tells you that something broke — the goal here is for the pipeline to fix itself without a human pager.
- Put the auto-fix inside each project’s CI: rejected. CI only runs on push, so it can’t detect “nothing happened for 30 minutes.” Needs to be out-of-band.