Why Your AI Agent Works in Dev and Breaks in Prod

The most predictable failure mode in AI engineering: the dev-to-prod gap.

Jun 07, 2026

Your agent nailed every test case. You shipped it. Within 48 hours, users report hallucinated outputs, silently dropped tool calls, and responses that bear zero resemblance to what worked on your machine. You reload the same prompt locally. It works perfectly. Welcome to the most predictable failure mode in AI engineering: the dev-to-prod gap.

This is Crucible C01. We dissect the five failure modes that kill agents in production and give you the tools to catch them before your users do.

The Idea (60 Seconds)

Developers test agents in idealized conditions: deterministic inputs, warm context windows, generous API latency budgets, and sequential tool calls. Production exposes the opposite environment: cold starts strip context, rate limits compress timing, and parallel calls introduce race conditions. The agent that performed flawlessly at temperature 0 on a 2k-token context window collapses at temperature 0.7 on an 8k-token window.

The five failure modes are temperature drift and context window overflow first; silent API errors and prompt drift follow; race conditions complete the set. Each one has a detection pattern and a fix, and this article delivers both plus the CLI tool to automate the detection.

Why This Matters

AI agent failures differ from traditional software failures in one critical way: they are stochastic. A web API either returns 200 or 500. An AI agent returns something that looks plausible 90% of the time and is catastrophically wrong 10% of the time. That 10% is invisible in manual testing and devastating in production.

The economics compound fast because every failed agent interaction wastes tokens, and wasted tokens cost money. At scale, a subtly broken agent burns budget faster than a working one because it retries, loops, and rephrases instead of succeeding. A single temperature drift bug can double your API spend.

Reliability is the differentiator. The market is flooding with AI wrappers. The ones that survive will be the ones that work consistently, under load, with real user inputs. Crucible exists to make your agent one of the survivors.

Walkthrough

Mode 1: Temperature Drift

Detection pattern: Run the same prompt at your dev temperature and your prod temperature. Hash the outputs. Hash divergence signals drift.

The fix: Pin temperature to 0 in both environments. If you need sampling variance for creativity, isolate it to a single generation step and wrap the rest of the pipeline in deterministic calls. Document the temperature in your agent config file. Treat it like a database connection string: an infrastructure parameter, always explicit, zero room for runtime guesses.

Mode 2: Context Window Overflow

Detection pattern: Instrument your agent to log cumulative token count per conversation. When it crosses 75% of your model’s context limit, flag the conversation. Watch for truncated outputs, repeated phrases, or instructions that the model appears to have forgotten.

The fix: Implement a context compaction strategy. Summarize older turns and replace them with a compressed summary token block. Set hard token budgets per turn and per conversation. When the budget is exhausted, either summarize or start a fresh context window with a recovery prompt that preserves the task state.

Mode 3: Silent API Errors

Detection pattern: Log every API call’s HTTP status code and response body. Count calls that return non-200 statuses. If your agent has retry logic, log whether the retry succeeded. A pattern of failed retries with continued execution signals swallowed errors.

The fix: Treat API errors as hard failures by default. Wrap every API call in a circuit breaker that halts the agent on persistent errors. Log the error, notify the orchestrator, and return a structured failure to the caller. Silent continuation on error state is the single most dangerous production behavior in any agent system.

Mode 4: Prompt Drift

Detection pattern: Version your system prompts. On every agent run, hash the active prompt and compare it to the canonical hash. When outputs diverge between runs, diff the prompt versions first.

The fix: Lock system prompts in version control. Deploy prompt changes through the same review pipeline as code changes. Run regression tests: execute a benchmark suite against the old prompt, then the new prompt, and diff the results. Any change that shifts more than 10% of benchmark outputs requires manual review.

Mode 5: Race Conditions in Parallel Tool Calls

Detection pattern: Enable request-order logging. When your agent dispatches parallel calls, log the dispatch order and the completion order. Any inversion signals a potential race condition.

The fix: Avoid parallel tool calls unless you can guarantee idempotent, order-independent results. When parallelism is necessary, implement a reconciliation step that sorts responses by a sequence token before the agent processes them. Better yet, use a deterministic execution model: serialize all tool calls, accept the latency cost, and gain correctness.

The Prompt Toolkit

1. Agent Failure Analyst Prompt

<role>
You are an Agent Failure Analyst for the Crucible diagnostic framework.
</role>

<input>
  <agent_architecture>
    {{AGENT_ARCHITECTURE_DESCRIPTION}}
  </agent_architecture>
  <failure_scenario>
    {{FAILURE_SCENARIO_DESCRIPTION}}
  </failure_scenario>
</input>

<task>
Analyze the agent architecture against the failure scenario. Identify which of the six failure modes are present or likely:

1. temperature_drift , Dev and prod temperature settings diverge.
2. context_overflow , Token count exceeds model context limit.
3. token_limit , Response truncation due to max_tokens ceiling.
4. prompt_drift , System prompt edits propagate uncontrolled cascading changes.
5. api_latency , Timeouts or rate limits cause silent failures.
6. race_condition , Parallel tool calls return out of order.

For each identified failure mode, provide:
- evidence: Specific architectural features or scenario details that indicate this mode.
- severity: critical, high, medium, low.
- reproduction_steps: Exact sequence to trigger the failure.
- fix_strategy: Concrete architectural change to eliminate the failure mode.
</task>

<output_format>
<analysis>
  <failure_mode name="..." present="true|false">
    <evidence>...</evidence>
    <severity>...</severity>
    <reproduction_steps>
      <step order="1">...</step>
      <step order="2">...</step>
    </reproduction_steps>
    <fix_strategy>...</fix_strategy>
  </failure_mode>
  <summary>...</summary>
  <priority_fixes>
    <fix order="1">...</fix>
    <fix order="2">...</fix>
  </priority_fixes>
</analysis>
</output_format>

2. agentprobe CLI

The agentprobe command-line tool scans your agent configuration for common failure modes, traces live runs with full instrumentation, diffs two runs to locate divergence points, and replays failed traces to test determinism.

Install and run:

cp agentprobe.py /usr/local/bin/agentprobe
chmod +x /usr/local/bin/agentprobe
agentprobe scan --config agent_config.json
agentprobe trace --config agent_config.json --prompt "Analyze the Q3 report"
agentprobe diff --run-a trace_001.json --run-b trace_002.json
agentprobe replay --trace trace_001.json

Download: agentprobe.py

Caveats

The five failure modes cover the most common production breakdowns, yet they remain an incomplete set. Model-specific quirks, provider-specific rate limit architectures, and custom orchestration logic introduce failure modes unique to your stack. Treat these five as your baseline scan, then extend the detection patterns to match your architecture.

The agentprobe tool instruments API calls and logs token counts, yet it relies on the provider reporting accurate token usage. Some providers approximate. Cross-check token counts against your own tokenizer when precision matters.

Determinism is a spectrum, all-or-zero. Temperature 0 reduces variance dramatically, yet even at temperature 0, some models exhibit minor non-determinism due to floating-point accumulation differences across hardware. Replay results that match 99% of tokens are as good as deterministic for practical purposes.

Philosophy

The Crucible stance: test in conditions that match production, or accept production failures as inevitable. Every shortcut in your testing pipeline compounds into a failure in your production pipeline. Agents are stochastic systems. Stochastic systems demand systematic testing, systematic observation, and systematic repair.

The dev-to-prod gap is avoidable. It requires treating your agent’s non-determinism as a first-class engineering concern, designing for the worst case from day one, and instrumenting everything. The tools in this article automate the detection. The fixes are architectural. The discipline is yours.

Crucible C01 is the first article in the Crucible Series by ArchonHQ. Each article dissects a specific AI agent failure mode and delivers the prompts and tools to eliminate it. Subscribe for full access to the series.

ArchonHQ

Discussion about this post

Ready for more?