Four Failures in One Afternoon: Why Your Agent Swarm Needs a Watchdog

We spawned seven agents. All seven died silently. Four different bugs stacked on top of each other before we built something that could catch them.

Seven agents. Seven silent deaths. Zero alerts.

That was our Friday afternoon. We spawned a wave of Codex agents to build out AgentTeams, our cloud agent platform. Each got its own git worktree, its own tmux session, its own carefully written prompt. We kicked them off, moved on to other work, and checked back thirty minutes later expecting seven PRs.

We got seven empty worktrees and a lot of quiet.

The Stack of Failures

The first bug was a flag that didn't exist. We'd typed --auto-edit instead of --full-auto. Codex printed a usage error and exited. The tmux session stayed open, looking alive to anyone who glanced at it. No alert. No log we were checking.

We fixed the flag, respawned. All seven died again. This time the flag was -q, which also doesn't exist. Same silent pattern. Tmux alive, process dead, nobody watching.

Third attempt, we got the flags right. The agents started, ran for about eight minutes, then the watchdog we'd just built killed them. It had counted the previous two failures and decided these were double-failures. The fix created a new problem.

Fourth attempt. Correct flags, clean state. All seven launched. All seven died instantly. The error this time: o3 model is not supported when using Codex with a ChatGPT account. We'd switched from gpt-5.3-codex to o3 without checking whether our API key supported it.

Four different bugs. Four different root causes. One shared symptom: silence.

The Real Problem

None of these bugs were hard to fix. A wrong flag takes thirty seconds. A wrong model takes ten. The problem was that we had no mechanism to notice. Our monitoring was "check tmux, see if it looks busy." That's not monitoring. That's hope.

The gap between "agent spawned" and "agent produced useful output" was a black box. We knew when they started. We knew when we manually checked on them. Everything in between was invisible.

What We Built

A watchdog. Python script, 200 lines, running on a systemd timer every five minutes. It does three things.

First, it detects death. For every tmux session that starts with codex-, it checks whether there's actually a live process inside. Tmux sessions survive their children. A dead agent looks identical to a working one unless you check the process tree.

Second, it acts. On first failure, it auto-respawns the agent with the correct flags and resets the failure counter. On second failure for the same ticket, it stops trying. It marks the ticket as failed in our dispatcher, checks what downstream tickets are now blocked, and sends one Telegram message with the damage report. Not a notification chain. Not a prompt that might generate a message. A direct Bot API call.

Third, it chains. When an agent completes successfully, the watchdog calls our ticket dispatcher, marks the ticket done, gets back the next spawnable tickets, and launches them. Completion of one agent triggers the start of the next. No human in the loop for mechanical handoffs.

The Named Insight

Silence is the failure mode of autonomous systems.

Not crashes. Not errors. Not wrong outputs. Silence. The agent that exits without a trace. The process that dies while its container stays warm. The flag that's wrong in a way that produces no output at all.

Every reliability pattern we know from traditional software assumes something will complain. Exceptions get thrown. Logs get written. Health checks return 500. Agents don't follow these patterns. They launch into a context window, do their work (or don't), and exit. If nothing is watching the exit, nothing knows it happened.

What Changed

Before the watchdog, spawning a wave of agents meant spawning a wave of anxiety. You'd check back in twenty minutes hoping for the best. Now the worst case is a five-minute detection window, an automatic respawn, and a Telegram message if the respawn also fails.

The agents that completed Friday afternoon triggered the next wave automatically. The dispatcher chained seven completions into four new spawns without us touching anything. The failed ones got flagged with their downstream impact so we knew exactly what was blocked and why.

We went from "spawn and pray" to "spawn and the system handles the rest, unless it can't, in which case it tells you exactly what it needs."

The watchdog took an afternoon to build. The afternoon of silent failures that motivated it was more expensive.