Our multi-project agent swarm for parallel development. Navi orchestrates up to 7 concurrent Codex agents running in isolated git worktrees via tmux. An integrated dispatcher handles ticket dependencies and phase gates. A self-healing watchdog auto-respawns failed agents, chains completions into next spawns, and auto-spawns new work when the swarm goes idle.
Architecture
Mike (human)
↓ sprint plan / phase gate approval
Navi (orchestrator — OpenClaw main agent)
├── Business context (MEMORY.md, memd, sprint.json)
├── Dispatcher (tickets, deps, phase gates)
├── Watchdog (health, auto-respawn, auto-chain)
↓
┌────────────┬────────────┬────────────┬──── ··· ────┐
│ codex-at-  │ codex-at-  │ codex-at-  │   (max 7)   │
│ p3-02-tg   │ p4-02-cat  │ p7-01-adm  │             │
│ tmux       │ tmux       │ tmux       │             │
│ worktree   │ worktree   │ worktree   │             │
└─────┬──────┴─────┬──────┴─────┬──────┘             │
      └── PRs ────→ main ──→ GitHub Actions CI       │
                                                     │
Watchdog (systemd timer, every 5 min) ───────────────┘
├── Detects completions → dispatcher.done() → auto-spawn next
├── Detects failures → auto-respawn (1st) or flag (2nd)
├── Detects idle → auto-spawn from dispatcher queue
└── Phase gates → alerts Mike for approval
Why Two Tiers
| | Navi (orchestrator) | Codex agents |
|---|---|---|
| Context | Business: sprint, customers, decisions, memory | Code: worktree files, types, tests |
| Model | Claude Opus (via OpenClaw) | gpt-5.3-codex (via Codex CLI) |
| Scope | Strategy, scoping, prompt writing, review, steering | Implementation, testing, PR creation |
| Persistence | MEMORY.md, memd, daily logs | Ephemeral (worktree + tmux session) |
| Tools | Full OpenClaw tooling | Terminal only (exec, git, gh, npm) |
Context windows are zero-sum. Navi holds business context; agents hold code context. Specialisation through context, not through different models.
Design Principles
Five rules that keep the swarm stable:
- Codex tmux only. One agent backend. No sub-agents for code tasks.
- Specialisation through context, not models. Navi holds business context; agents hold code context.
- Phase gates between phases, auto-spawn within. Human checkpoint at phase boundaries. Full autonomy inside a phase.
- Double-failure halt. Agent fails twice on same ticket: flag it, don't retry. Silence is worse than stopping.
- RAM guard. Don't spawn if free -m shows less than 1GB available.
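The RAM guard can be sketched as a small check on the output of free -m. This is an illustration only, not the watchdog's actual code: the function names are hypothetical, 1GB is taken as 1024 MB, and it assumes a procps version of free that prints an "available" column.

```python
import subprocess

MIN_AVAILABLE_MB = 1024  # assumption: "1GB" in the rule means 1024 MB


def available_mb(free_output: str) -> int:
    """Return the 'available' value of the Mem: row from `free -m` output."""
    lines = free_output.strip().splitlines()
    header = lines[0].split()
    for line in lines[1:]:
        if line.startswith("Mem:"):
            values = line.split()[1:]  # drop the "Mem:" label
            return int(values[header.index("available")])
    raise ValueError("no Mem: row found in free output")


def ram_ok(free_output: str, minimum_mb: int = MIN_AVAILABLE_MB) -> bool:
    """True when spawning is allowed under the RAM guard."""
    return available_mb(free_output) >= minimum_mb


def current_free_output() -> str:
    """Run `free -m` on the local machine and return its stdout."""
    result = subprocess.run(
        ["free", "-m"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

A caller would run ram_ok(current_free_output()) before each spawn and skip the spawn when it returns False.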
Prerequisites
| Tool | Version | Check |
|---|---|---|
| codex CLI | 0.106.0+ | codex --version |
| tmux | 3.4+ | tmux -V |
| gh CLI | 2.x | gh --version (authenticated) |
| git | 2.43+ | git --version |
| Node.js | 22.x | node --version |
codex --version && tmux -V && gh auth status && git --version && node --version
Components
1. Dispatcher (swarm/dispatch.py)
The brain. Tracks tickets, dependencies, phases. Answers "what should run next?"
# Status dashboard
python3 swarm/dispatch.py -p agentteams status
python3 swarm/dispatch.py -p mc status
# What's spawnable right now?
python3 swarm/dispatch.py -p agentteams dispatch
# Mark ticket done (returns next spawnable)
python3 swarm/dispatch.py -p agentteams done P3-02 abc123f
# Mark ticket running
python3 swarm/dispatch.py -p agentteams running P3-02 codex-at-p3-02-telegram
The dispatcher emits signals:
| Signal | Meaning | Action |
|---|---|---|
| (none) | Tickets spawnable in current phase | Auto-spawn |
| PHASE_GATE | Phase complete, next phase ready | Alert human, wait for "go" |
| ALL_DONE | Every ticket done | Project complete |
| BLOCKED | No spawnable tickets, deps incomplete | Investigate upstream |
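As a rough illustration of how these signals could be derived from ticket data, the sketch below works on a dictionary shaped like the tracker's tickets (status, depends, phase). The functions spawnable and signal are hypothetical and are not taken from dispatch.py. Returning BLOCKED whenever nothing is spawnable, even if agents are still running, is a simplifying assumption.

```python
from typing import Optional


def spawnable(tickets: dict, phase: int) -> list:
    """IDs of tickets in `phase` that are todo and whose deps are all done."""
    return sorted(
        tid
        for tid, t in tickets.items()
        if t["phase"] == phase
        and t["status"] == "todo"
        and all(tickets[dep]["status"] == "done" for dep in t["depends"])
    )


def signal(tickets: dict, current_phase: int) -> Optional[str]:
    """Map tracker state to one of the signals in the table above."""
    if all(t["status"] == "done" for t in tickets.values()):
        return "ALL_DONE"
    in_phase = [t for t in tickets.values() if t["phase"] == current_phase]
    if all(t["status"] == "done" for t in in_phase):
        return "PHASE_GATE"  # phase finished, later phases still have work
    if spawnable(tickets, current_phase):
        return None  # no signal: tickets can be auto-spawned
    return "BLOCKED"  # simplification: nothing spawnable in this phase
```

The order of the checks matters: ALL_DONE is tested before PHASE_GATE because a fully finished project also has a finished current phase.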
2. Watchdog (swarm/watchdog.py)
Self-healing health monitor. Runs every 5 minutes via systemd timer. This is what makes the swarm autonomous.
| Event | Action |
|---|---|
| No agents running + spawnable tickets | Auto-spawn up to 3 tickets (RAM check first), alert human |
| Agent completes (has commits/changes) | dispatcher.done() → auto-spawn next ticket → update tracker |
| Agent fails (1st time) | Auto-respawn with correct flags, reset failure count |
| Agent fails (2nd time) | Mark failed, report downstream blocked tickets, alert human |
| Agent stuck (45min+, no changes) | Alert human |
| Phase gate reached | Alert human for approval |
The idle auto-spawn is key. When the watchdog fires and finds zero running tmux sessions, it queries the dispatcher for all projects. If spawnable tickets exist and RAM is above 1GB, it spawns up to 3 agents and sends a notification. This means the swarm is fully self-sustaining within a phase: completions chain into spawns, and idle gaps are automatically filled.
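The idle path described above might look roughly like the following. This is a sketch under assumptions, not the code in watchdog.py: plan_idle_spawn and its parameters are hypothetical, and the real watchdog may order or prioritise tickets differently.

```python
MAX_IDLE_SPAWNS = 3      # "spawns up to 3 agents"
MIN_AVAILABLE_MB = 1024  # assumption: 1GB treated as 1024 MB


def plan_idle_spawn(
    running_sessions: list,
    spawnable_by_project: dict,
    available_mb: int,
) -> list:
    """Return the (project, ticket) pairs to spawn on an idle pass."""
    if running_sessions:
        return []  # not idle: completions will chain into spawns instead
    if available_mb < MIN_AVAILABLE_MB:
        return []  # RAM guard
    queue = [
        (project, ticket)
        for project, tickets in spawnable_by_project.items()
        for ticket in tickets
    ]
    return queue[:MAX_IDLE_SPAWNS]
```

An empty result means the pass does nothing. A non-empty result is what would be spawned and then reported in the notification.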
Notification rules:
- Auto-respawns on first failure are silent (logged only)
- Idle auto-spawns send a notification (so you know work started)
- Only decisions go to the human: double-failure, phase gate, stuck agent
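The failure side of these rules can be sketched as a per-ticket counter. The function name and the returned dictionary are hypothetical. The counter stands in for what watchdog-state.json stores, and the sketch does not try to reproduce when the real watchdog resets counts.

```python
def on_agent_failure(failure_counts: dict, ticket: str) -> dict:
    """Apply the double-failure rule to one failed ticket.

    `failure_counts` is mutated in place so the caller can persist it.
    """
    failure_counts[ticket] = failure_counts.get(ticket, 0) + 1
    if failure_counts[ticket] == 1:
        # First failure: retry quietly, only a log entry.
        return {"action": "respawn", "notify": False}
    # Second (or later) failure: stop retrying, escalate to the human.
    return {"action": "mark_failed", "notify": True}
```

Keeping the decision in one pure function makes the policy easy to exercise in a --dry-run style test without touching tmux.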
# Check timer status
systemctl --user status swarm-watchdog.timer
# Run manually
python3 swarm/watchdog.py # live
python3 swarm/watchdog.py --dry-run # test without side effects
3. Prompts (swarm/prompts/)
One markdown file per ticket. This is the most important part. A good prompt produces a working PR. A bad prompt produces a silent failure.
Each prompt includes:
- Context (what the project is, tech stack, relevant existing code)
- Exact task scope (what to build, what NOT to build)
- Files to create or modify
- Verification steps (build, test, lint)
- Git instructions (branch name, commit format)
- Directory ownership (which paths this agent owns, to prevent conflicts with parallel agents)
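One way to keep prompts complete is a small pre-spawn check that every required section is present. The sketch below assumes each item in the list above is a markdown heading in the prompt file. The heading names in REQUIRED_SECTIONS are hypothetical and would need to match whatever the real prompts use.

```python
# Hypothetical heading names, one per item in the checklist above.
REQUIRED_SECTIONS = [
    "Context",
    "Task scope",
    "Files",
    "Verification",
    "Git",
    "Directory ownership",
]


def missing_sections(prompt_markdown: str) -> list:
    """Return the required sections with no matching heading in the prompt."""
    headings = {
        line.lstrip("#").strip().lower()
        for line in prompt_markdown.splitlines()
        if line.startswith("#")
    }
    return [s for s in REQUIRED_SECTIONS if s.lower() not in headings]
```

A spawn script could refuse to start an agent while missing_sections returns a non-empty list.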
4. Project Configuration
swarm/projects.json registers all projects:
{
"agentteams": {
"description": "AgentTeams SaaS",
"tracker": "agentteams-tracker.json",
"repo": "/home/openclaw/projects/agentteams",
"spawnScript": "spawn-at.sh"
},
"mc": {
"description": "Mission Control",
"tracker": "mc-tracker.json",
"repo": "/home/openclaw/projects/openclaw-mission-control",
"spawnScript": "spawn.sh"
}
}
Tracker files hold every ticket with status, dependencies, and phase:
{
"project": "agentteams",
"maxConcurrent": 7,
"tickets": {
"P3-02": {
"status": "done",
"depends": ["P3-01"],
"label": "codex-at-p3-02-telegram",
"phase": 3,
"sha": "0b2b47b"
}
}
}
Statuses: todo → running → done / failed
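The status flow can be enforced with a small transition table. This is a sketch against the tracker shape shown above, with a hypothetical set_status helper. It treats done and failed as terminal, which is an assumption: the real tooling may allow a failed ticket to be reset by hand.

```python
ALLOWED_TRANSITIONS = {
    "todo": {"running"},
    "running": {"done", "failed"},
    "done": set(),    # assumption: terminal
    "failed": set(),  # assumption: terminal unless reset manually
}


def set_status(tracker: dict, ticket_id: str, new_status: str, sha=None) -> dict:
    """Move a ticket to `new_status`, rejecting transitions not in the flow."""
    ticket = tracker["tickets"][ticket_id]
    current = ticket["status"]
    if new_status not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"{ticket_id}: cannot go from {current} to {new_status}")
    ticket["status"] = new_status
    if new_status == "done" and sha is not None:
        ticket["sha"] = sha  # record the commit that completed the ticket
    return ticket
```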
The Automated Pipeline
This is how the full cycle works end-to-end:
1. Write prompts for spawnable tickets
2. Spawn agents (manual kick or watchdog auto-spawn)
3. Watchdog monitors every 5 min:
- Completion → dispatcher.done(ticket, sha)
→ returns next spawnable tickets
→ watchdog auto-spawns them
→ tracker updated
- Failure (1st) → auto-respawn
- Failure (2nd) → mark failed, alert human
- Idle (0 running) → query dispatcher, spawn if available
4. Phase complete → PHASE_GATE signal → alert human
5. Human says "go" → next phase auto-spawns on next watchdog pass
6. ALL_DONE → integration pass, testing, deploy
Within a phase, this runs with zero human intervention. Completions trigger the next spawn. Failures get one retry. The idle detector catches any gaps. The human only gets involved at phase boundaries or when something fails twice.
Codex CLI Reference
# Non-interactive (what agents use)
codex exec -m gpt-5.3-codex \
--dangerously-bypass-approvals-and-sandbox \
-C /path/to/worktree \
"prompt text here"
Key flags:
- -m gpt-5.3-codex: model selection (must match your API key tier)
- --dangerously-bypass-approvals-and-sandbox: required for non-interactive runs
- -C <dir>: working directory
Flags that do NOT exist (will cause silent failure):
- --auto-edit: not a real flag
- -q: not a real flag
- --full-auto: interactive mode only
PATH requirement: Codex lives at /home/openclaw/.local/bin/codex. Systemd services and cron jobs must include this in PATH or use the full path.
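To avoid both the flag mistakes and the PATH problem, the command can be built as an argument list with the full binary path. The sketch below uses only the flags listed above. The helper names are hypothetical, and wrapping the command in tmux new-session -d -s is an assumption about how a spawn script might launch it, not a description of spawn.sh.

```python
import shlex

CODEX_BIN = "/home/openclaw/.local/bin/codex"  # full path, so PATH is irrelevant
MODEL = "gpt-5.3-codex"


def codex_command(worktree: str, prompt: str) -> list:
    """Argument list for a non-interactive Codex run in `worktree`."""
    return [
        CODEX_BIN,
        "exec",
        "-m", MODEL,
        "--dangerously-bypass-approvals-and-sandbox",
        "-C", worktree,
        prompt,
    ]


def tmux_command(session: str, worktree: str, prompt: str) -> list:
    """Argument list that starts the Codex run in a detached tmux session."""
    # tmux takes the shell command as one string, so quote it safely.
    shell_command = shlex.join(codex_command(worktree, prompt))
    return ["tmux", "new-session", "-d", "-s", session, shell_command]
```

Either list can be passed to subprocess.run without shell=True, which keeps prompt text with quotes or newlines from being reinterpreted by a shell.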
Monitoring
# All running agents
tmux ls | grep codex-
# Check specific agent output
tmux capture-pane -t codex-at-p3-02-telegram -p | tail -20
# Worktree changes
cd /home/openclaw/projects/agentteams-worktrees/at-p3-02-telegram
git status --porcelain && git log --oneline -5
# Watchdog log
tail -30 ~/.openclaw/workspace/logs/watchdog.log
# Dispatcher status
python3 swarm/dispatch.py -p agentteams status
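The same session listing can be done from Python, for example inside a status script. This is a minimal sketch with hypothetical function names. It relies on the default tmux ls line format, where the session name is followed by a colon.

```python
import subprocess


def codex_sessions(tmux_ls_output: str) -> list:
    """Extract session names starting with 'codex-' from `tmux ls` output."""
    names = []
    for line in tmux_ls_output.splitlines():
        name, sep, _rest = line.partition(":")
        if sep and name.startswith("codex-"):
            names.append(name)
    return names


def running_codex_sessions() -> list:
    """Query tmux; an empty list means no server or no codex sessions."""
    result = subprocess.run(["tmux", "ls"], capture_output=True, text=True)
    if result.returncode != 0:
        return []  # tmux exits non-zero when no server is running
    return codex_sessions(result.stdout)
```

An empty list from running_codex_sessions() corresponds to the idle condition the watchdog looks for.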
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Agent exits immediately | Bad CLI flags or missing binary | Check codex.log in worktree |
| codex: command not found | PATH missing in tmux/systemd | Use full path /home/openclaw/.local/bin/codex |
| tmux session alive, no process | Codex crashed mid-run | Watchdog auto-respawns on next pass |
| 0 files changed after 10+ min | Agent stuck or still thinking | Check tmux pane; watchdog alerts at 45min |
| Watchdog not spawning when idle | Watchdog only checked existing sessions | Fixed: now queries dispatcher when 0 agents running |
| o3 not supported | Wrong model for API key tier | Use gpt-5.3-codex for ChatGPT-tier keys |
| Agent completes but next not spawned | Missing prompt file for next ticket | Write prompt to swarm/prompts/<ticket>.md |
File Map
swarm/
├── dispatch.py # Ticket dispatcher (deps, phases, signals)
├── watchdog.py # Health monitor + dispatcher integration
├── watchdog-state.json # Failure counts (auto-managed)
├── projects.json # Project registry
├── agentteams-tracker.json # AT ticket tracker
├── mc-tracker.json # MC ticket tracker
├── spawn.sh # MC agent spawner
├── spawn-at.sh # AT agent spawner
├── prompts/ # Prompt files per ticket
│ ├── p3-02-telegram.md
│ ├── kan-6a-ask-navi.md
│ └── ...
├── status.sh # Quick tmux status
└── cleanup.sh # Worktree cleanup
logs/
└── watchdog.log # Watchdog event log
~/.config/systemd/user/
├── swarm-watchdog.service # Watchdog oneshot service
└── swarm-watchdog.timer # 5-minute timer


