OpenClaw + Codex Agent Swarm: The Full Setup Guide

The complete setup guide for running a multi-agent development team. Dispatcher, watchdog, phase gates, and the full automated pipeline.

This is our multi-project agent swarm for parallel development. Navi orchestrates up to 7 concurrent Codex agents, each running in its own tmux session and isolated git worktree. An integrated dispatcher handles ticket dependencies and phase gates. A self-healing watchdog auto-respawns failed agents, chains completions into the next spawns, and auto-spawns new work when the swarm goes idle.

Architecture

Mike (human)
  ↓ sprint plan / phase gate approval
Navi (orchestrator — OpenClaw main agent)
  ├── Business context (MEMORY.md, memd, sprint.json)
  ├── Dispatcher (tickets, deps, phase gates)
  ├── Watchdog (health, auto-respawn, auto-chain)
  ↓
┌────────────┬────────────┬────────────┬──── ··· ────┐
│ codex-at-  │ codex-at-  │ codex-at-  │  (max 7)    │
│ p3-02-tg   │ p4-02-cat  │ p7-01-adm  │             │
│ tmux       │ tmux       │ tmux       │             │
│ worktree   │ worktree   │ worktree   │             │
└─────┬──────┴─────┬──────┴─────┬──────┘             │
      └── PRs ────→ main ──→ GitHub Actions CI        │
                                                      │
Watchdog (systemd timer, every 5 min) ────────────────┘
  ├── Detects completions → dispatcher.done() → auto-spawn next
  ├── Detects failures → auto-respawn (1st) or flag (2nd)
  ├── Detects idle → auto-spawn from dispatcher queue
  └── Phase gates → alerts Mike for approval

Why Two Tiers

	Navi (orchestrator)	Codex agents
Context	Business: sprint, customers, decisions, memory	Code: worktree files, types, tests
Model	Claude Opus (via OpenClaw)	gpt-5.3-codex (via Codex CLI)
Scope	Strategy, scoping, prompt writing, review, steering	Implementation, testing, PR creation
Persistence	MEMORY.md, memd, daily logs	Ephemeral (worktree + tmux session)
Tools	Full OpenClaw tooling	Terminal only (exec, git, gh, npm)

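Context windows are zero-sum, so each tier holds one kind of context: Navi holds business context; agents hold code context. The two tiers run different models, but the specialisation comes from what each holds in context, not from the model choice.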

Design Principles

Five rules that keep the swarm stable:

Codex tmux only. One agent backend. No sub-agents for code tasks.
Specialisation through context, not models. Navi holds business context; agents hold code context.
Phase gates between phases, auto-spawn within. Human checkpoint at phase boundaries. Full autonomy inside a phase.
Double-failure halt. If an agent fails twice on the same ticket, flag the ticket and stop retrying. A visible halt is better than an agent that keeps failing silently.
RAM guard. Don't spawn if `free -m` shows less than 1GB available.
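
The RAM guard could look roughly like the sketch below. It is not the actual watchdog code: it assumes the guard reads the `available` column of the `Mem:` row and treats 1GB as 1024 MB, and the function names `available_mb` and `can_spawn` are made up for illustration.

```python
def available_mb(free_output: str) -> int:
    """Return the 'available' column of the Mem: row from `free -m` output."""
    lines = free_output.strip().splitlines()
    header = lines[0].split()  # column names: total, used, free, ...
    for line in lines[1:]:
        if line.startswith("Mem:"):
            values = line.split()[1:]  # drop the "Mem:" label
            return int(values[header.index("available")])
    raise ValueError("no Mem: row found in free output")


def can_spawn(free_output: str, min_mb: int = 1024) -> bool:
    """RAM guard: allow a spawn only if at least min_mb is available."""
    return available_mb(free_output) >= min_mb


# Typical use (assumes the `free` binary from procps is installed):
#   import subprocess
#   out = subprocess.run(["free", "-m"], capture_output=True, text=True).stdout
#   if can_spawn(out): ...
```

Parsing the text rather than calling `free` inside the function keeps the guard easy to test with canned output.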

Prerequisites

Tool	Version	Check
codex CLI	0.106.0+	`codex --version`
tmux	3.4+	`tmux -V`
gh CLI	2.x	`gh --version` (authenticated)
git	2.43+	`git --version`
Node.js	22.x	`node --version`

codex --version && tmux -V && gh auth status && git --version && node --version

Components

1. Dispatcher (`swarm/dispatch.py`)

The brain. Tracks tickets, dependencies, phases. Answers "what should run next?"

# Status dashboard
python3 swarm/dispatch.py -p agentteams status
python3 swarm/dispatch.py -p mc status

# What's spawnable right now?
python3 swarm/dispatch.py -p agentteams dispatch

# Mark ticket done (returns next spawnable)
python3 swarm/dispatch.py -p agentteams done P3-02 abc123f

# Mark ticket running
python3 swarm/dispatch.py -p agentteams running P3-02 codex-at-p3-02-telegram

The dispatcher emits signals:

Signal	Meaning	Action
(none)	Tickets spawnable in current phase	Auto-spawn
`PHASE_GATE`	Phase complete, next phase ready	Alert human, wait for "go"
`ALL_DONE`	Every ticket done	Project complete
`BLOCKED`	No spawnable tickets, deps incomplete	Investigate upstream
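
One way the signals could be derived from tracker data is sketched below. This is not the real `dispatch.py`: the function name `next_signal` and the `approved_phase` parameter (the highest phase the human has approved) are assumptions, and returning no signal with an empty list while agents are still running is a guess at the "wait" case the table does not cover.

```python
from typing import Dict, List, Optional, Tuple


def next_signal(tickets: Dict[str, dict], approved_phase: int) -> Tuple[Optional[str], List[str]]:
    """Return (signal, spawnable ticket ids) for one project."""
    if all(t["status"] == "done" for t in tickets.values()):
        return "ALL_DONE", []

    done = {tid for tid, t in tickets.items() if t["status"] == "done"}
    # Only tickets in phases the human has already approved may run.
    active = {tid: t for tid, t in tickets.items() if t["phase"] <= approved_phase}

    spawnable = sorted(
        tid for tid, t in active.items()
        if t["status"] == "todo" and all(dep in done for dep in t["depends"])
    )
    if spawnable:
        return None, spawnable  # no signal: auto-spawn these
    if all(t["status"] == "done" for t in active.values()):
        return "PHASE_GATE", []  # approved phases finished, later ones wait
    if any(t["status"] == "running" for t in active.values()):
        return None, []  # nothing to spawn yet, agents still working
    return "BLOCKED", []  # nothing running, deps incomplete
```

In this sketch the phase gate is enforced purely by `approved_phase`, so a "go" from the human amounts to raising that number by one.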

2. Watchdog (`swarm/watchdog.py`)

Self-healing health monitor. Runs every 5 minutes via systemd timer. This is what makes the swarm autonomous.

Event	Action
No agents running + spawnable tickets	Auto-spawn up to 3 tickets (RAM check first), alert human
Agent completes (has commits/changes)	`dispatcher.done()` → auto-spawn next ticket → update tracker
Agent fails (1st time)	Auto-respawn with correct flags, reset failure count
Agent fails (2nd time)	Mark failed, report downstream blocked tickets, alert human
Agent stuck (45min+, no changes)	Alert human
Phase gate reached	Alert human for approval
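
The two failure rows can be sketched as a small counter plus a dependency walk. This is an illustration, not the watchdog's code: it assumes failure counts persist between passes (the file map lists `watchdog-state.json` for that), and `handle_failure` and `downstream_blocked` are hypothetical names.

```python
from typing import Dict, List


def handle_failure(failure_counts: Dict[str, int], ticket: str, max_retries: int = 1) -> str:
    """Record a failure; return 'respawn' for the first one, 'flag' after that."""
    failure_counts[ticket] = failure_counts.get(ticket, 0) + 1
    return "respawn" if failure_counts[ticket] <= max_retries else "flag"


def downstream_blocked(tickets: Dict[str, dict], failed: str) -> List[str]:
    """List every ticket that depends, directly or transitively, on `failed`."""
    blocked = set()
    frontier = {failed}
    while frontier:
        # Tickets not yet collected that depend on anything in the frontier.
        frontier = {
            tid for tid, t in tickets.items()
            if tid not in blocked and frontier & set(t["depends"])
        }
        blocked |= frontier
    return sorted(blocked)
```

On a `flag` result, the list from `downstream_blocked` is what would go into the alert so the human can see what the failed ticket is holding up.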

The idle auto-spawn is key. When the watchdog fires and finds zero running tmux sessions, it queries the dispatcher for all projects. If spawnable tickets exist and RAM is above 1GB, it spawns up to 3 agents and sends a notification. This means the swarm is fully self-sustaining within a phase: completions chain into spawns, and idle gaps are automatically filled.
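
A minimal sketch of that idle check, under stated assumptions: the function name `plan_idle_spawns`, its parameters, and the project-by-project fill order are all hypothetical, and the real watchdog may pick tickets differently.

```python
from typing import Dict, List, Tuple


def plan_idle_spawns(
    running_sessions: List[str],
    spawnable_by_project: Dict[str, List[str]],
    available_mb: int,
    max_spawns: int = 3,
    min_mb: int = 1024,
) -> List[Tuple[str, str]]:
    """Pick up to max_spawns (project, ticket) pairs when the swarm is idle."""
    if running_sessions:
        return []  # not idle: completions will chain the next spawns
    if available_mb < min_mb:
        return []  # RAM guard
    plan: List[Tuple[str, str]] = []
    for project, tickets in spawnable_by_project.items():
        for ticket in tickets:
            if len(plan) == max_spawns:
                return plan
            plan.append((project, ticket))
    return plan
```

Returning a plan instead of spawning directly fits the `--dry-run` mode: the same plan can be logged without side effects or handed to the spawn scripts.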

Notification rules:

Auto-respawns on first failure are silent (logged only)
Idle auto-spawns send a notification (so you know work started)
Only these events ask the human for a decision: double-failure, phase gate, stuck agent

# Check timer status
systemctl --user status swarm-watchdog.timer

# Run manually
python3 swarm/watchdog.py          # live
python3 swarm/watchdog.py --dry-run # test without side effects

3. Prompts (`swarm/prompts/`)

One markdown file per ticket. This is the most important part. A good prompt produces a working PR. A bad prompt produces a silent failure.

Each prompt includes:

Context (what the project is, tech stack, relevant existing code)
Exact task scope (what to build, what NOT to build)
Files to create or modify
Verification steps (build, test, lint)
Git instructions (branch name, commit format)
Directory ownership (which paths this agent owns, to prevent conflicts with parallel agents)
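
To keep prompts consistent, the six parts above could be rendered from a small structure. The sketch below is hypothetical throughout: the class name `TicketPrompt`, its field names, and the markdown headings are illustrative choices, not the format the real prompt files use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TicketPrompt:
    ticket: str
    context: str
    build: List[str]          # what to build
    do_not_build: List[str]   # explicit out-of-scope items
    files: List[str]          # files to create or modify
    verify: List[str]         # build, test, lint commands
    branch: str
    commit_format: str
    owned_paths: List[str]    # paths this agent may touch

    def render(self) -> str:
        def bullets(items: List[str]) -> str:
            return "\n".join(f"- {item}" for item in items)

        sections = [
            (f"# {self.ticket}", ""),
            ("## Context", self.context),
            ("## Build", bullets(self.build)),
            ("## Do NOT build", bullets(self.do_not_build)),
            ("## Files", bullets(self.files)),
            ("## Verification", bullets(self.verify)),
            ("## Git", f"Branch: {self.branch}\nCommit format: {self.commit_format}"),
            ("## Directory ownership", bullets(self.owned_paths)),
        ]
        return "\n\n".join(f"{title}\n{body}".rstrip() for title, body in sections) + "\n"
```

A dataclass with no defaults makes a missing section a hard error at construction time, which is cheaper than discovering it as a silent agent failure.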

4. Project Configuration

swarm/projects.json registers all projects:

{
  "agentteams": {
    "description": "AgentTeams SaaS",
    "tracker": "agentteams-tracker.json",
    "repo": "/home/openclaw/projects/agentteams",
    "spawnScript": "spawn-at.sh"
  },
  "mc": {
    "description": "Mission Control",
    "tracker": "mc-tracker.json",
    "repo": "/home/openclaw/projects/openclaw-mission-control",
    "spawnScript": "spawn.sh"
  }
}

Tracker files hold every ticket with status, dependencies, and phase:

{
  "project": "agentteams",
  "maxConcurrent": 7,
  "tickets": {
    "P3-02": {
      "status": "done",
      "depends": ["P3-01"],
      "label": "codex-at-p3-02-telegram",
      "phase": 3,
      "sha": "0b2b47b"
    }
  }
}

Statuses: todo → running → done / failed
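
The status flow can be enforced with a small transition table. This is a sketch, not the dispatcher's code: `ALLOWED` and `set_status` are hypothetical names, and treating `done` and `failed` as terminal is an assumption (a manual retry path from `failed` may well exist).

```python
from typing import Optional

# Allowed moves, following: todo -> running -> done / failed
ALLOWED = {
    "todo": {"running"},
    "running": {"done", "failed"},
    "done": set(),
    "failed": set(),
}


def set_status(tracker: dict, ticket: str, new_status: str, sha: Optional[str] = None) -> None:
    """Update a ticket's status in a loaded tracker dict, rejecting invalid moves."""
    entry = tracker["tickets"][ticket]
    current = entry["status"]
    if new_status not in ALLOWED.get(current, set()):
        raise ValueError(f"{ticket}: cannot move from {current!r} to {new_status!r}")
    entry["status"] = new_status
    if new_status == "done" and sha is not None:
        entry["sha"] = sha  # record the commit that completed the ticket
```

The tracker dict here is what `json.load` would return for a tracker file in the shape shown above.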

The Automated Pipeline

This is how the full cycle works end-to-end:

1. Write prompts for spawnable tickets
2. Spawn agents (manual kick or watchdog auto-spawn)
3. Watchdog monitors every 5 min:
   - Completion → dispatcher.done(ticket, sha)
     → returns next spawnable tickets
     → watchdog auto-spawns them
     → tracker updated
   - Failure (1st) → auto-respawn
   - Failure (2nd) → mark failed, alert human
   - Idle (0 running) → query dispatcher, spawn if available
4. Phase complete → PHASE_GATE signal → alert human
5. Human says "go" → next phase auto-spawns on next watchdog pass
6. ALL_DONE → integration pass, testing, deploy

Within a phase, this runs with zero human intervention. Completions trigger the next spawn. Failures get one retry. The idle detector catches any gaps. The human only gets involved at phase boundaries or when something fails twice.

Codex CLI Reference

# Non-interactive (what agents use)
codex exec -m gpt-5.3-codex \
  --dangerously-bypass-approvals-and-sandbox \
  -C /path/to/worktree \
  "prompt text here"

Key flags:

-m gpt-5.3-codex — model selection (must match your API key tier)
--dangerously-bypass-approvals-and-sandbox — required for non-interactive
-C <dir> — working directory

Flags to avoid with `codex exec` (nonexistent or wrong for non-interactive runs; they cause silent failure):

--auto-edit — not a real flag
-q — not a real flag
--full-auto — interactive mode only

PATH requirement: Codex lives at /home/openclaw/.local/bin/codex. Systemd services and cron jobs must include this in PATH or use the full path.
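
Putting the flags and the PATH requirement together, a spawner could build its command as in the sketch below. This is not the content of `spawn.sh` or `spawn-at.sh`; `codex_argv` and `tmux_argv` are hypothetical helpers. The Codex flags are the ones listed above, and `new-session -d -s -c` are standard tmux options.

```python
import shlex
from typing import List

# Full path, so the command works under systemd and cron without PATH changes.
CODEX_BIN = "/home/openclaw/.local/bin/codex"


def codex_argv(worktree: str, prompt: str, model: str = "gpt-5.3-codex") -> List[str]:
    """Argument list for a non-interactive Codex run in the given worktree."""
    return [
        CODEX_BIN, "exec",
        "-m", model,
        "--dangerously-bypass-approvals-and-sandbox",
        "-C", worktree,
        prompt,
    ]


def tmux_argv(label: str, worktree: str, prompt: str) -> List[str]:
    """Argument list that starts the Codex run in a detached tmux session."""
    command = shlex.join(codex_argv(worktree, prompt))  # quote for tmux's shell
    return ["tmux", "new-session", "-d", "-s", label, "-c", worktree, command]


# To launch: subprocess.run(tmux_argv(label, worktree, prompt), check=True)
```

Building argument lists and quoting with `shlex.join` avoids the classic failure where a prompt containing quotes or newlines breaks the shell command.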

Monitoring

# All running agents
tmux ls | grep codex-

# Check specific agent output
tmux capture-pane -t codex-at-p3-02-telegram -p | tail -20

# Worktree changes
cd /home/openclaw/projects/agentteams-worktrees/at-p3-02-telegram
git status --porcelain && git log --oneline -5

# Watchdog log
tail -30 ~/.openclaw/workspace/logs/watchdog.log

# Dispatcher status
python3 swarm/dispatch.py -p agentteams status
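
For scripted checks, the `tmux ls | grep codex-` step can be done in Python. A minimal sketch, assuming tmux's default `name: N windows (...)` listing format; the function name `codex_sessions` is made up for illustration.

```python
from typing import List


def codex_sessions(tmux_ls_output: str, prefix: str = "codex-") -> List[str]:
    """Extract session names starting with `prefix` from `tmux ls` output."""
    names = []
    for line in tmux_ls_output.splitlines():
        name, sep, _ = line.partition(":")  # session name precedes the first colon
        if sep and name.startswith(prefix):
            names.append(name)
    return names


# Typical use:
#   import subprocess
#   out = subprocess.run(["tmux", "ls"], capture_output=True, text=True).stdout
#   running = codex_sessions(out)
```

An empty list covers both "no codex sessions" and "no tmux server running", since in the latter case tmux writes its message to stderr and stdout stays empty.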

Troubleshooting

Symptom	Cause	Fix
Agent exits immediately	Bad CLI flags or missing binary	Check `codex.log` in worktree
`codex: command not found`	PATH missing in tmux/systemd	Use full path `/home/openclaw/.local/bin/codex`
tmux session alive, no process	Codex crashed mid-run	Watchdog auto-respawns on next pass
0 files changed after 10+ min	Agent stuck or still thinking	Check tmux pane; watchdog alerts at 45min
Watchdog not spawning when idle	Watchdog only checked existing sessions	Fixed: now queries dispatcher when 0 agents running
`o3 not supported`	Wrong model for API key tier	Use `gpt-5.3-codex` for ChatGPT-tier keys
Agent completes but next not spawned	Missing prompt file for next ticket	Write prompt to `swarm/prompts/<ticket>.md`

File Map

swarm/
├── dispatch.py              # Ticket dispatcher (deps, phases, signals)
├── watchdog.py              # Health monitor + dispatcher integration
├── watchdog-state.json      # Failure counts (auto-managed)
├── projects.json            # Project registry
├── agentteams-tracker.json  # AT ticket tracker
├── mc-tracker.json          # MC ticket tracker
├── spawn.sh                 # MC agent spawner
├── spawn-at.sh              # AT agent spawner
├── prompts/                 # Prompt files per ticket
│   ├── p3-02-telegram.md
│   ├── kan-6a-ask-navi.md
│   └── ...
├── status.sh                # Quick tmux status
└── cleanup.sh               # Worktree cleanup

logs/
└── watchdog.log             # Watchdog event log

~/.config/systemd/user/
├── swarm-watchdog.service   # Watchdog oneshot service
└── swarm-watchdog.timer     # 5-minute timer