ArchonHQ

Why Your AI Agent Works in Dev and Breaks in Prod

Michal Szalinski — Sun, 07 Jun 2026 21:15:10 GMT

Your agent nailed every test case. You shipped it. Within 48 hours, users report hallucinated outputs, silently dropped tool calls, and responses that bear zero resemblance to what worked on your machine. You reload the same prompt locally. It works perfectly. Welcome to the most predictable failure mode in AI engineering: the dev-to-prod gap.

This is Crucible C01. We dissect the five failure modes that kill agents in production and give you the tools to catch them before your users do.

The Idea (60 Seconds)

Developers test agents in idealized conditions: deterministic inputs, warm context windows, generous API latency budgets, and sequential tool calls. Production exposes the opposite environment: cold starts strip context, rate limits compress timing, and parallel calls introduce race conditions. The agent that performed flawlessly at temperature 0 on a 2k-token context window collapses at temperature 0.7 on an 8k-token window.

The five failure modes are temperature drift, context window overflow, silent API errors, prompt drift, and race conditions in parallel tool calls. Each one has a detection pattern and a fix, and this article delivers both, plus the CLI tool to automate the detection.

Why This Matters

AI agent failures differ from traditional software failures in one critical way: they are stochastic. A web API either returns 200 or 500. An AI agent returns something that looks plausible 90% of the time and is catastrophically wrong 10% of the time. That 10% is invisible in manual testing and devastating in production.

The economics compound fast because every failed agent interaction wastes tokens, and wasted tokens cost money. At scale, a subtly broken agent burns budget faster than a working one because it retries, loops, and rephrases instead of succeeding. A single temperature drift bug can double your API spend.

Reliability is the differentiator. The market is flooding with AI wrappers. The ones that survive will be the ones that work consistently, under load, with real user inputs. Crucible exists to make your agent one of the survivors.

Walkthrough

Mode 1: Temperature Drift

Detection pattern: Run the same prompt at your dev temperature and your prod temperature. Hash the outputs. Hash divergence signals drift.

The fix: Pin temperature to 0 in both environments. If you need sampling variance for creativity, isolate it to a single generation step and wrap the rest of the pipeline in deterministic calls. Document the temperature in your agent config file. Treat it like a database connection string: an infrastructure parameter, always explicit, zero room for runtime guesses.
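The hash comparison described for this mode can be sketched in a few lines. This is a minimal illustration, not the agentprobe implementation: `generate` is a placeholder for whatever function calls your model, and `fake_generate` is a hypothetical stand-in so the sketch runs without an API key.

```python
import hashlib
from typing import Callable


def output_hash(text: str) -> str:
    """Hash a whitespace-normalized output so formatting noise does not count as drift."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def detect_temperature_drift(
    generate: Callable[[str, float], str],
    prompt: str,
    dev_temperature: float,
    prod_temperature: float,
    samples: int = 5,
) -> dict:
    """Run the prompt at both temperatures and compare the sets of output hashes."""
    dev = {output_hash(generate(prompt, dev_temperature)) for _ in range(samples)}
    prod = {output_hash(generate(prompt, prod_temperature)) for _ in range(samples)}
    return {"dev_unique": len(dev), "prod_unique": len(prod), "drift": dev != prod}


# Hypothetical stand-in for a real model call (assumption, not a provider API)
def fake_generate(prompt: str, temperature: float) -> str:
    return "stable answer" if temperature == 0 else f"answer at {temperature}"


report = detect_temperature_drift(fake_generate, "Analyze the Q3 report", 0.0, 0.7)
print(report)
```

More than one unique hash at a single temperature is also worth logging, since it shows how much variance that setting produces on its own.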

Mode 2: Context Window Overflow

Detection pattern: Instrument your agent to log cumulative token count per conversation. When it crosses 75% of your model’s context limit, flag the conversation. Watch for truncated outputs, repeated phrases, or instructions that the model appears to have forgotten.

The fix: Implement a context compaction strategy. Summarize older turns and replace them with a compressed summary token block. Set hard token budgets per turn and per conversation. When the budget is exhausted, either summarize or start a fresh context window with a recovery prompt that preserves the task state.
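A token budget with a 75% flag and a compaction step might look like the sketch below. The names `ContextBudget` and `approx_tokens` are illustrative, and the four-characters-per-token estimate is an assumption; swap in your model's tokenizer for real counts.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def approx_tokens(text: str) -> int:
    """Rough estimate (about 4 characters per token); an assumption, not a tokenizer."""
    return max(1, len(text) // 4)


@dataclass
class ContextBudget:
    context_limit: int                       # model context window, in tokens
    flag_ratio: float = 0.75                 # flag threshold from the detection pattern
    count_tokens: Callable[[str], int] = approx_tokens
    turns: List[str] = field(default_factory=list)

    def total_tokens(self) -> int:
        return sum(self.count_tokens(turn) for turn in self.turns)

    def add_turn(self, text: str) -> bool:
        """Append a turn and return True when the conversation should be flagged."""
        self.turns.append(text)
        return self.total_tokens() >= self.flag_ratio * self.context_limit

    def compact(self, summarize: Callable[[List[str]], str], keep_last: int = 2) -> None:
        """Replace all but the most recent turns with a single summary turn."""
        if len(self.turns) <= keep_last:
            return
        older, recent = self.turns[:-keep_last], self.turns[-keep_last:]
        self.turns = [summarize(older)] + recent


budget = ContextBudget(context_limit=100)
flagged = [budget.add_turn("x" * 120) for _ in range(3)]   # 30 tokens per turn
budget.compact(lambda old: f"summary of {len(old)} turns", keep_last=1)
```

In a real agent the `summarize` callable would be a model call of its own, which is why the sketch takes it as a parameter instead of hard-coding one.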

Mode 3: Silent API Errors

Detection pattern: Log every API call’s HTTP status code and response body. Count calls that return non-200 statuses. If your agent has retry logic, log whether the retry succeeded. A pattern of failed retries with continued execution signals swallowed errors.

The fix: Treat API errors as hard failures by default. Wrap every API call in a circuit breaker that halts the agent on persistent errors. Log the error, notify the orchestrator, and return a structured failure to the caller. Silent continuation on error state is the single most dangerous production behavior in any agent system.
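One way to sketch the circuit breaker is shown below. The class and exception names are hypothetical, and the threshold of consecutive failures is an assumption you would tune; the point is that every call returns a structured record and the agent is halted once errors persist.

```python
from typing import Any, Callable


class AgentHalted(RuntimeError):
    """Raised when the breaker is open and the agent must stop calling the API."""


class CircuitBreaker:
    def __init__(self, max_consecutive_failures: int = 3):
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0
        self.error_log: list = []

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.max_consecutive_failures

    def call(self, api_call: Callable[[], Any]) -> dict:
        """Run one API call and return a structured success or failure record."""
        if self.is_open:
            raise AgentHalted("circuit open: persistent API errors")
        try:
            value = api_call()
        except Exception as exc:
            # Record the failure instead of swallowing it
            self.consecutive_failures += 1
            self.error_log.append(repr(exc))
            return {"ok": False, "error": repr(exc), "halted": self.is_open}
        self.consecutive_failures = 0
        return {"ok": True, "value": value, "halted": False}


def failing_call():
    # Hypothetical upstream failure used only for the demonstration
    raise TimeoutError("upstream timeout")


breaker = CircuitBreaker(max_consecutive_failures=2)
first = breaker.call(failing_call)
second = breaker.call(failing_call)
```

After the second failure the breaker is open, so a third `breaker.call(...)` raises `AgentHalted` instead of letting the agent continue on bad state.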

Mode 4: Prompt Drift

Detection pattern: Version your system prompts. On every agent run, hash the active prompt and compare it to the canonical hash. When outputs diverge between runs, diff the prompt versions first.

The fix: Lock system prompts in version control. Deploy prompt changes through the same review pipeline as code changes. Run regression tests: execute a benchmark suite against the old prompt, then the new prompt, and diff the results. Any change that shifts more than 10% of benchmark outputs requires manual review.
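The hash check and the 10% review gate can be sketched as follows. Function names are illustrative, and the benchmark outputs are assumed to be plain strings keyed by case name; a real suite would likely need a fuzzier comparison than exact string equality.

```python
import hashlib
from typing import Dict


def prompt_hash(prompt: str) -> str:
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def prompt_matches(active_prompt: str, canonical_hash: str) -> bool:
    """True when the prompt in use is byte-for-byte the reviewed version."""
    return prompt_hash(active_prompt) == canonical_hash


def changed_fraction(old_outputs: Dict[str, str], new_outputs: Dict[str, str]) -> float:
    """Fraction of benchmark cases whose output changed between prompt versions."""
    if not old_outputs:
        return 0.0
    changed = sum(1 for case, out in old_outputs.items() if new_outputs.get(case) != out)
    return changed / len(old_outputs)


def needs_manual_review(old_outputs: Dict[str, str], new_outputs: Dict[str, str],
                        threshold: float = 0.10) -> bool:
    """Apply the review gate: more than `threshold` of outputs shifted."""
    return changed_fraction(old_outputs, new_outputs) > threshold


canonical = prompt_hash("You are a careful analyst.")
old = {f"case{i}": "a" for i in range(10)}
new = dict(old, case0="b", case1="b")   # two of ten outputs changed
```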

Mode 5: Race Conditions in Parallel Tool Calls

Detection pattern: Enable request-order logging. When your agent dispatches parallel calls, log the dispatch order and the completion order. Any inversion signals a potential race condition.

The fix: Avoid parallel tool calls unless you can guarantee idempotent, order-independent results. When parallelism is necessary, implement a reconciliation step that sorts responses by a sequence token before the agent processes them. Better yet, use a deterministic execution model: serialize all tool calls, accept the latency cost, and gain correctness.
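Request-order logging and the reconciliation step can be sketched with asyncio. This is an illustration under simple assumptions: each tool is a zero-argument coroutine function, and the sequence token is just the dispatch index. The `slow` and `fast` tools are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


async def run_parallel_with_order_log(
    tools: List[Callable[[], Awaitable[str]]],
) -> Tuple[List[str], List[int], bool]:
    """Dispatch tool calls in parallel, log completion order, reconcile by sequence."""
    completed: List[Tuple[int, str]] = []    # filled in completion order

    async def tracked(seq: int, tool: Callable[[], Awaitable[str]]) -> None:
        result = await tool()
        completed.append((seq, result))       # seq is the dispatch index

    await asyncio.gather(*(tracked(i, tool) for i, tool in enumerate(tools)))
    completion_order = [seq for seq, _ in completed]
    inverted = completion_order != sorted(completion_order)
    # Reconciliation: sort by sequence token before the agent sees the results
    ordered = [result for _, result in sorted(completed)]
    return ordered, completion_order, inverted


async def slow() -> str:
    await asyncio.sleep(0.05)
    return "slow"


async def fast() -> str:
    await asyncio.sleep(0)
    return "fast"


results, completion_order, inverted = asyncio.run(run_parallel_with_order_log([slow, fast]))
print(results, completion_order, inverted)
```

Here the second call finishes first, so the log shows an inversion while the agent still receives results in dispatch order.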

The Prompt Toolkit

1. Agent Failure Analyst Prompt


You are an Agent Failure Analyst for the Crucible diagnostic framework.



  
    {{AGENT_ARCHITECTURE_DESCRIPTION}}
  
  
    {{FAILURE_SCENARIO_DESCRIPTION}}
  



Analyze the agent architecture against the failure scenario. Identify which of the six failure modes are present or likely:

1. temperature_drift: Dev and prod temperature settings diverge.
2. context_overflow: Token count exceeds model context limit.
3. token_limit: Response truncation due to max_tokens ceiling.
4. prompt_drift: System prompt edits propagate uncontrolled cascading changes.
5. api_latency: Timeouts or rate limits cause silent failures.
6. race_condition: Parallel tool calls return out of order.

For each identified failure mode, provide:
- evidence: Specific architectural features or scenario details that indicate this mode.
- severity: critical, high, medium, low.
- reproduction_steps: Exact sequence to trigger the failure.
- fix_strategy: Concrete architectural change to eliminate the failure mode.




  
    ...
    ...
    
      ...
      ...
    
    ...
  
  ...
  
    ...
    ...

2. agentprobe CLI

The agentprobe command-line tool scans your agent configuration for common failure modes, traces live runs with full instrumentation, diffs two runs to locate divergence points, and replays failed traces to test determinism.

Install and run:

cp agentprobe.py /usr/local/bin/agentprobe
chmod +x /usr/local/bin/agentprobe
agentprobe scan --config agent_config.json
agentprobe trace --config agent_config.json --prompt "Analyze the Q3 report"
agentprobe diff --run-a trace_001.json --run-b trace_002.json
agentprobe replay --trace trace_001.json

Download: agentprobe.py

Caveats

The five failure modes cover the most common production breakdowns, yet they remain an incomplete set. Model-specific quirks, provider-specific rate limit architectures, and custom orchestration logic introduce failure modes unique to your stack. Treat these five as your baseline scan, then extend the detection patterns to match your architecture.

The agentprobe tool instruments API calls and logs token counts, yet it relies on the provider reporting accurate token usage. Some providers approximate. Cross-check token counts against your own tokenizer when precision matters.

Determinism is a spectrum, not all-or-nothing. Temperature 0 reduces variance dramatically, yet even at temperature 0, some models exhibit minor non-determinism due to floating-point accumulation differences across hardware. Replay results that match 99% of tokens are as good as deterministic for practical purposes.
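A replay comparison against that 99% bar might be sketched like this. It uses `difflib.SequenceMatcher` from the standard library as the similarity measure, which is an assumption about how to score a match, not a description of how agentprobe compares traces.

```python
from difflib import SequenceMatcher
from typing import Sequence


def token_match_ratio(original: Sequence[str], replay: Sequence[str]) -> float:
    """Similarity in [0, 1] between two token sequences (1.0 means identical)."""
    return SequenceMatcher(a=original, b=replay, autojunk=False).ratio()


def is_effectively_deterministic(original: Sequence[str], replay: Sequence[str],
                                 threshold: float = 0.99) -> bool:
    return token_match_ratio(original, replay) >= threshold


tokens = [str(i) for i in range(200)]
one_changed = tokens[:100] + ["X"] + tokens[101:]                       # 1 of 200 differs
ten_changed = tokens[:100] + [f"X{i}" for i in range(10)] + tokens[110:]  # 10 of 200 differ
```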

Philosophy

The Crucible stance: test in conditions that match production, or accept production failures as inevitable. Every shortcut in your testing pipeline compounds into a failure in your production pipeline. Agents are stochastic systems. Stochastic systems demand systematic testing, systematic observation, and systematic repair.

The dev-to-prod gap is avoidable. It requires treating your agent’s non-determinism as a first-class engineering concern, designing for the worst case from day one, and instrumenting everything. The tools in this article automate the detection. The fixes are architectural. The discipline is yours.

Crucible C01 is the first article in the Crucible Series by ArchonHQ. Each article dissects a specific AI agent failure mode and delivers the prompts and tools to eliminate it. Subscribe for full access to the series.


Build Your Own MCP Server from Scratch

Michal Szalinski — Thu, 04 Jun 2026 21:15:53 GMT

Every AI agent ships with the same bottleneck: it can only reason over what it can reach. MCP servers dissolve that boundary. They expose tools, resources, and prompts to any compliant client over a JSON-RPC wire format so lean you can implement it in an afternoon. Yet most developers grab a framework, copy a template, and ship something they can barely debug. Forge starts differently. You will build an MCP server from the bare protocol up, understand every byte on the wire, and gain the mental model that makes every future server trivial.

The Idea (60 Seconds)

MCP is a JSON-RPC 2.0 protocol. A client sends a request. Your server returns a response. Three request types power the core loop:

initialize: handshake. Client and server exchange capabilities.
tools/list: discovery. Server returns every tool it offers, each with a JSON Schema describing its inputs.
tools/call: execution. Client names a tool and passes arguments. Server runs the handler and returns structured content.

Transport is either stdio (JSON-RPC over stdin/stdout) or HTTP (Streamable HTTP). Stdio is the simplest place to start: read a line from stdin, parse it, dispatch, write a line to stdout.

That is the entire architecture. Everything else is error handling, schema validation, and ergonomics.

Why This Matters

MCP servers are the new APIs. Where REST gave machines endpoints, MCP gives agents tools with typed inputs and structured outputs. Every integration layer from IDE assistants to autonomous workflows converges on this protocol. The standard is young. The primitives are stable. The surface area is small enough to hold in your head all at once.

Knowing the wire format gives you three advantages frameworks obscure:

Debugging: when a tool call fails, you can read the raw JSON-RPC message and pinpoint the fault in seconds.
Portability: any language, any runtime, any transport. Write a server in Bash if you want. The protocol is the contract.
Evolution: MCP will add capabilities. Understanding the base protocol lets you adopt new features by extension instead of full rewrites.

Forge articles build on this foundation. If you understand the three core requests and the JSON-RPC envelope, every subsequent pattern is just a new handler.

Walkthrough

The JSON-RPC Envelope

Every message shares the same shape:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "get_weather",
    "arguments": { "city": "Portland" }
  }
}

The response mirrors it:

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "content": [
      { "type": "text", "text": "72°F, clear skies" }
    ]
  }
}

Errors swap result for error:

{
  "jsonrpc": "2.0",
  "id": 1,
  "error": {
    "code": -32602,
    "message": "Invalid params: missing 'city'"
  }
}

Three fields matter: jsonrpc (always "2.0"), id (correlates request to response), and method (the dispatch key).

Tool Schema Design

Each tool advertises itself through a JSON Schema object. A well-designed schema is the difference between a tool agents use and one they fumble.

{
  "name": "get_weather",
  "description": "Retrieve current weather for a given city",
  "inputSchema": {
    "type": "object",
    "properties": {
      "city": {
        "type": "string",
        "description": "City name, e.g. Portland"
      }
    },
    "required": ["city"]
  }
}

Rules for effective schemas:

Mark every required parameter in required. Agents rely on this to construct valid calls.
Add description to each property. The agent reads descriptions to decide which tool to invoke and what values to pass.
Use enum for constrained values. This prevents hallucinated inputs.
Keep schemas flat. Nested objects are valid but harder for agents to populate correctly.

The Request Lifecycle

Your server runs a loop:

| Step | Action |
|------|--------|
| 1 | Read a JSON-RPC line from stdin |
| 2 | Parse the method |
| 3 | Dispatch to the matching handler |
| 4 | Handler returns a result or raises an error |
| 5 | Serialize the response to JSON |
| 6 | Write it to stdout |

In Python with asyncio:

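async def handle_message(message):
    """Dispatch one parsed JSON-RPC request and return the full response envelope."""
    # list_tools() and call_tool() are the server's own registry helpers, defined elsewhere
    method = message.get("method")
    response = {"jsonrpc": "2.0", "id": message.get("id")}
    if method == "initialize":
        response["result"] = {"capabilities": {"tools": {}}}
    elif method == "tools/list":
        response["result"] = {"tools": list_tools()}
    elif method == "tools/call":
        response["result"] = await call_tool(message["params"])
    else:
        # -32601 is the JSON-RPC "Method not found" code; errors replace result
        response["error"] = {"code": -32601, "message": f"Method not found: {method}"}
    return response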
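The six lifecycle steps can be wired around a handler with a short read loop. The sketch below is an assumption-laden illustration, not the generated server.py: `serve_stdio` and `demo_dispatch` are hypothetical names, the dispatch callable is synchronous to keep things short, and in-memory streams stand in for stdin and stdout so it can run anywhere.

```python
import io
import json
import sys
from typing import Callable, TextIO


def serve_stdio(dispatch: Callable[[dict], dict],
                stdin: TextIO = sys.stdin,
                stdout: TextIO = sys.stdout) -> None:
    """Read newline-delimited JSON-RPC requests and write one response per line."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        try:
            request = json.loads(line)
        except json.JSONDecodeError:
            # -32700 is the JSON-RPC "Parse error" code; id is null when unreadable
            response = {"jsonrpc": "2.0", "id": None,
                        "error": {"code": -32700, "message": "Parse error"}}
        else:
            response = dispatch(request)
            if "id" not in request:
                continue   # notifications carry no id and get no reply
        stdout.write(json.dumps(response) + "\n")
        stdout.flush()


def demo_dispatch(request: dict) -> dict:
    # Hypothetical handler: answers every method with an empty tool list
    return {"jsonrpc": "2.0", "id": request.get("id"), "result": {"tools": []}}


fake_in = io.StringIO('{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}\nnot json\n')
fake_out = io.StringIO()
serve_stdio(demo_dispatch, fake_in, fake_out)
responses = [json.loads(line) for line in fake_out.getvalue().splitlines()]
```

An async handler would be awaited inside an async version of this loop; the flush after every write matters because the client is waiting on the other end of the pipe.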

Building the Server

Start with the mcpbuild CLI. Run mcpbuild init my-server and you get a project scaffold:

my-server/
  pyproject.toml
  server.py
  tools/
    __init__.py

server.py contains the JSON-RPC read loop and dispatch table. Each tool is a function registered by name. The add-tool command generates a stub handler and appends the tool schema to the registry. The run command boots the server on stdio (default) or HTTP transport.

The full CLI ships with this article. Download it, make it executable, and build your first MCP server in minutes.

The Prompt Toolkit

MCP Server Architect Prompt

Feed this prompt a server concept. Get back a complete specification ready for implementation.


You are an MCP Server Architect. You produce complete MCP server specifications from a concept description.

{{SERVER_CONCEPT}}


Return a specification with these sections:

1. **Server Identity**: name, version, description
2. **Tools**: For each tool, provide:
   - name (snake_case)
   - description (one sentence, action verb)
   - inputSchema (valid JSON Schema, flat preferred)
   - output shape (content types returned)
   - error cases (expected failure modes)
3. **Transport**: stdio or HTTP with rationale
4. **Auth**: required or none; if required, specify mechanism (API key header, OAuth scope, etc.)
5. **Error Handling Strategy**: per-tool error codes, fallback behavior, logging approach


- Every tool must have a description an agent can use for tool selection.
- Every inputSchema must include property-level descriptions.
- Prefer enum constraints over free-text where values are bounded.
- Transport choice must include latency and deployment context rationale.
- Error codes must use JSON-RPC standard codes where applicable (-32600, -32601, -32602) and custom codes in the -32000 to -32099 range for server-specific errors.

mcpbuild CLI

The mcpbuild CLI scaffolds, runs, and validates MCP servers from the terminal. Five commands cover the full lifecycle:

| Command | Description |
|---------|-------------|
| init | Scaffold a new MCP server project with pyproject.toml, server.py, and tool stubs |
| add-tool | Interactive: enter tool name, description, and input schema JSON; generates a handler stub and registers the tool |
| run | Start the server (defaults to stdio transport; pass --transport http --port 8080 for HTTP) |
| validate | Check the server against the MCP spec: tool schemas are valid JSON Schema, error handlers exist for every tool, transport config is sound |
| test | Send test tools/list and tools/call messages to a running server and verify responses match the spec |

Download the full implementation below. Single file, zero dependencies beyond the Python standard library (asyncio included).

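# Download: open this link in a browser and save the file as mcpbuild.py
# (a Drive "view" link returns an HTML page, so curl -O would not fetch the script)
# https://drive.google.com/file/d/1b1WFnBv0ZYcgQEW8KIVOKQtDzEKjPotm/view?usp=drive_link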


chmod +x mcpbuild.py

# Scaffold a project
./mcpbuild.py init weather-server

# Add a tool interactively
cd weather-server
../mcpbuild.py add-tool

# Run on stdio
../mcpbuild.py run

# Validate
../mcpbuild.py validate

# Test against a running server
../mcpbuild.py test

The CLI is a single Python file. Read it, modify it, make it yours. It uses raw JSON-RPC over stdio so you see exactly what flows between client and server.

Caveats

MCP is evolving. The spec adds capabilities and the reference implementations shift. The wire format is stable, but higher-level features like sampling, elicitation, and structured logging may change. Build on the core three methods (initialize, tools/list, tools/call) and you stay safe.

Stdio transport pairs with process-based hosting (Claude Desktop, IDE extensions). HTTP transport pairs with remote hosting. Pick the one that matches your deployment. Mixing both in one server adds complexity best reserved for later.

Schema validation is your first line of defense. Validate every incoming tools/call against the tool’s inputSchema before the handler runs. Reject early with a -32602 Invalid params error. This prevents malformed data from reaching your business logic.
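For flat schemas, a validation gate can be sketched with the standard library alone. This is a deliberately small subset of JSON Schema (required keys, primitive types, enum), and the `units` property is a hypothetical addition to the get_weather example; a full validator library is the safer choice once schemas grow.

```python
from typing import Any, Dict, List

JSON_TYPES = {"string": str, "number": (int, float), "integer": int,
              "boolean": bool, "object": dict, "array": list}


def validate_arguments(arguments: Dict[str, Any], input_schema: Dict[str, Any]) -> List[str]:
    """Check a flat inputSchema: required keys, primitive types, enum membership."""
    problems = []
    for name in input_schema.get("required", []):
        if name not in arguments:
            problems.append(f"missing '{name}'")
    for name, value in arguments.items():
        spec = input_schema.get("properties", {}).get(name)
        if spec is None:
            continue   # unknown keys are tolerated in this sketch
        expected = JSON_TYPES.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it for numeric types
        bool_as_number = isinstance(value, bool) and spec.get("type") in ("number", "integer")
        if expected and (not isinstance(value, expected) or bool_as_number):
            problems.append(f"'{name}' must be of type {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"'{name}' must be one of {spec['enum']}")
    return problems


def invalid_params_error(request_id: Any, problems: List[str]) -> dict:
    """Build the -32602 response to send instead of running the handler."""
    return {"jsonrpc": "2.0", "id": request_id,
            "error": {"code": -32602, "message": "Invalid params: " + "; ".join(problems)}}


schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "units": {"type": "string", "enum": ["metric", "imperial"]},  # hypothetical property
    },
    "required": ["city"],
}
problems = validate_arguments({"units": "kelvin"}, schema)
```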

Philosophy

Forge believes in building on the protocol, not around it. Frameworks accelerate; protocols ground. When you understand the JSON-RPC message format, the dispatch table, and the schema contract, frameworks become an optional convenience rather than a required dependency. You debug faster, ship leaner, and adapt when the spec evolves.

The best MCP server is the one you can explain on a whiteboard. Tool schema in, content out. Everything else is detail.

This is F01 in the ArchonHQ Forge Series. The next article, F02, covers tool schema design patterns that make agents invoke your tools correctly on the first try. Subscribe to ArchonHQ to unlock every Forge article, CLI tool, and prompt kit.


Build Your Own Private RAG Knowledge Base

Michal Szalinski — Tue, 02 Jun 2026 21:11:54 GMT

Every query you send to a cloud RAG service leaves your perimeter. Your documents, your questions, your retrieved context: all of it traverses networks you do not control, is stored on servers you do not own, and is accessible to compliance teams you have yet to meet. The convenience is seductive and the cost is invisible until it becomes painfully visible. Bastion builds differently: your knowledge base stays on your hardware, your embeddings stay local, and your audit trail stays complete. This article gives you the architecture and the tooling to run Retrieval Augmented Generation entirely on your own terms.

The Idea (60 Seconds)

RAG augments a language model with retrieved context from your own documents. The typical pipeline ships your data to a cloud vector database and calls a remote embedding API for every query. A private RAG system replaces every cloud dependency with a local equivalent:

Storage and embeddings: Run models like all-MiniLM-L6-v2 or bge-large-en-v1.5 locally via sentence-transformers, and store the vectors in SQLite with a cosine similarity function. Zero external database server dependencies.
Chunking: Split documents with fixed-size overlap, semantic boundaries, or markdown-aware strategies.
Generation: Route prompts to a local LLM through Ollama, llama.cpp, or any local inference engine.
Audit trail: Log every query and every retrieved chunk to a SQLite table. Compliance becomes a SQL query.

The result is a fully functional RAG system that sends zero bytes to external services.
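The storage, chunking, and audit pieces of that list can be sketched with nothing but the standard library. Several assumptions are baked in: vectors are stored as JSON text, the table and function names are illustrative, and tiny hand-written vectors stand in for real sentence-transformers embeddings.

```python
import json
import math
import sqlite3
from typing import List


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> List[str]:
    """Fixed-size character chunks; each chunk repeats the previous `overlap` characters."""
    if not text:
        return []
    step = size - overlap   # requires overlap < size
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def cosine_similarity(a_json: str, b_json: str) -> float:
    a, b = json.loads(a_json), json.loads(b_json)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.create_function("cosine_similarity", 2, cosine_similarity)  # callable from SQL
    conn.execute("CREATE TABLE IF NOT EXISTS chunks "
                 "(id INTEGER PRIMARY KEY, text TEXT, embedding TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS audit_log "
                 "(ts TEXT DEFAULT CURRENT_TIMESTAMP, query TEXT, chunk_id INTEGER, score REAL)")
    return conn


def add_chunk(conn: sqlite3.Connection, text: str, embedding: List[float]) -> None:
    conn.execute("INSERT INTO chunks (text, embedding) VALUES (?, ?)",
                 (text, json.dumps(embedding)))


def search(conn: sqlite3.Connection, query: str, query_embedding: List[float], top_k: int = 3):
    rows = conn.execute(
        "SELECT id, text, cosine_similarity(embedding, ?) AS score "
        "FROM chunks ORDER BY score DESC LIMIT ?",
        (json.dumps(query_embedding), top_k),
    ).fetchall()
    # Audit trail: one row per retrieved chunk
    conn.executemany("INSERT INTO audit_log (query, chunk_id, score) VALUES (?, ?, ?)",
                     [(query, row[0], row[2]) for row in rows])
    return rows


conn = open_store()
add_chunk(conn, "cats", [1.0, 0.0])   # toy 2-d vectors, not real embeddings
add_chunk(conn, "dogs", [0.0, 1.0])
hits = search(conn, "feline", [0.9, 0.1], top_k=1)
```

With the audit table in place, a compliance review is a query such as `SELECT ts, query, chunk_id FROM audit_log`.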

Why This Matters

Privacy is a constraint that sharpens design. When you eliminate cloud dependencies, you also eliminate latency from network round trips, vendor lock-in from proprietary APIs, and data exposure from third-party processing. Your compliance team can audit the entire query history with a single SQL statement. Your security team can verify that zero data leaves the network perimeter. Your finance team can predict costs exactly, because local compute has a fixed price.

Regulated industries (healthcare, legal, defense, finance) operate under data residency rules. Sending patient records or classified documents through an external embedding API violates those rules by design. A private RAG system satisfies the rules by architecture, not by policy or procedural override. The architecture itself enforces the boundary.

Performance improves too. Local embeddings on a modern GPU or Apple Silicon reach hundreds of embeddings per second. SQLite handles millions of vectors with sub-millisecond lookups when you pre-filter by collection. The bottleneck shifts from network latency to disk I/O, which you control entirely.

Walkthrough

Run Your Own LLM on a Laptop: The Complete Guide

Michal Szalinski — Sun, 31 May 2026 21:10:51 GMT

Your data leaves your machine every time you call a cloud LLM API. Every prompt, every document, every private thought flows through someone else’s infrastructure. You pay for the privilege with money and privacy. Bastion believes you should own the entire stack. Run the model on your hardware, keep your data on your disk, and control every parameter of inference. This guide shows you exactly how to do it, from scanning your laptop’s capabilities to serving a local model at production quality.

The Idea (60 Seconds)

Local LLM inference runs large language models on your own hardware instead of renting compute from OpenAI, Anthropic, or Google. Three engines dominate the landscape:

llama.cpp: The workhorse. Runs on CPU, GPU, or both. Optimized for constrained hardware. Use this on laptops with limited VRAM.
vLLM: The throughput king. Built for NVIDIA GPUs with at least 16GB VRAM. Serves multiple concurrent requests with paged attention.
Ollama: The simplicity layer. Wraps llama.cpp with a friendly CLI. Best for developers who want a working model in under five minutes.

The core decision matrix maps your VRAM to the right model and quantization:

| VRAM | Model Size | Quantization | Example |
|------|------------|--------------|---------|
| 6 GB | 7B params | Q4_K_M | Llama-3.1-8B Q4_K_M |
| 12 GB | 13B params | Q5_K_M | Mistral-Nemo Q5_K_M |
| 24 GB | 34B params | Q8_0 | Command-R Q8_0 |
| 24 GB | 70B params | Q4_K_M | Llama-3.1-70B Q4_K_M |

CPU-only users can still run 7B models at Q4_K_M with 16GB system RAM; expect 3 to 8 tokens per second depending on your processor. The experience remains usable for interactive chat and code assistance.
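A quick way to sanity-check a model against your memory is to estimate the weight size from parameter count and bits per weight. The sketch below assumes approximate bits-per-weight figures for each quantization and a 1 GB headroom default; both are rough assumptions, and the estimate covers weights only, not the KV cache or runtime overhead.

```python
# Approximate bits per weight for common GGUF quantizations (assumed round figures)
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5}


def estimate_weights_gb(params_billion: float, quant: str) -> float:
    """Estimated size of the model weights in GB (excludes KV cache and overhead)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits(params_billion: float, quant: str, memory_gb: float,
         headroom_gb: float = 1.0) -> bool:
    """True when the estimated weights plus headroom fit in the given memory."""
    return estimate_weights_gb(params_billion, quant) + headroom_gb <= memory_gb


for quant in BITS_PER_WEIGHT:
    print(f"8B at {quant}: about {estimate_weights_gb(8, quant):.1f} GB")
```

Treat the output as a first filter before downloading; a benchmark run on your own machine is what confirms the fit.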

Why This Matters

Privacy is the obvious reason, and it matters more than most people admit. Every prompt you send to a cloud API is logged, stored, and potentially used for training. Your proprietary code, your client data, your personal reflections all become someone else’s data point the moment they leave your network.

Cost compounds over time. A heavy API user spending $50 monthly on tokens saves the full amount after the first month of local inference. The hardware you already own pays for itself.

Network latency disappears. A local model has zero network round-trips, zero queuing behind other users, and zero rate limits, so response time depends only on your own hardware. Your iteration loop tightens accordingly.

Control matters for serious practitioners. You choose the model, the quantization, the context window, the sampling parameters. You decide when to upgrade, which version to pin, and how to batch requests. Cloud APIs make all those decisions for you.

Resilience rounds out the argument. Local inference works during internet outages, in air-gapped environments, and in regions with unreliable connectivity. Your capability stays online even when the cloud goes dark.

Walkthrough

Step 1: Scan Your Hardware

Before downloading any model, understand what your machine can handle. The localmllm scan command detects your available RAM, GPU, VRAM, and CPU cores automatically:

python localmllm.py scan

Output looks like this:

Hardware Scan Results
  System RAM : 32 GB
  GPU        : NVIDIA RTX 3080 Laptop
  VRAM       : 8 GB
  CPU Cores  : 8
  CPU Freq   : 3.2 GHz

Write down the VRAM number. It determines everything that follows.

Step 2: Choose Your Engine

Pick your inference engine based on the scan results:

8 GB VRAM or less: llama.cpp with GPU offloading. Offload as many layers as VRAM allows, run the rest on CPU.
16 GB VRAM or more: vLLM for maximum throughput, or Ollama for simplicity.
CPU only: llama.cpp with all layers on CPU. Acceptable for 7B models at Q4_K_M.

Ollama is the fastest path to a working model. Install it, pull a model, and start chatting:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b
ollama run llama3.1:8b

For more control, build llama.cpp from source:

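git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Recent llama.cpp versions build with CMake; add -DGGML_CUDA=ON for NVIDIA GPU offloading
cmake -B build
cmake --build build --config Release -j
./build/bin/llama-server -m models/llama-3.1-8b-q4_k_m.gguf -ngl 99 --port 8080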

The -ngl 99 flag offloads all layers to GPU. Reduce this number if you run out of VRAM.

Step 3: Select Your Model and Quantization

Apply the decision matrix from the Idea section. Higher-bit quantizations such as Q8_0 mean better quality and larger files. Lower-bit quantizations such as Q4_K_M mean smaller files and faster inference. Q4_K_M hits the sweet spot for most use cases: minimal quality loss, significant size reduction.

Download GGUF files from Hugging Face. The TheBloke and bartowski namespaces host well-organized quantized models. Verify that you have enough free disk space for the file before downloading.

Step 4: Benchmark Before Committing

Always benchmark before you commit disk space and trust a model for real work. The localmllm benchmark command runs a standardized inference test:

python localmllm.py benchmark --model models/llama-3.1-8b-q4_k_m.gguf

You get three key metrics:

Tokens per second: Sustained generation speed. Above 10 t/s feels interactive. Below 5 t/s feels sluggish.
Time to first token: Prefill latency. Under 2 seconds keeps conversations feeling responsive.
Peak memory usage: Confirms the model fits your hardware with room for the operating system.

If the benchmark reveals memory pressure, drop to a lower quantization or reduce the context window size. Both adjustments free RAM and VRAM.
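The first two metrics can be measured from any token stream. The sketch below is not the localmllm benchmark; it is a minimal timer that assumes your engine exposes generation as an iterable of tokens, with an injectable `clock` so the arithmetic can be checked without a model.

```python
import time
from typing import Callable, Dict, Iterable


def measure_generation(stream: Iterable[str],
                       clock: Callable[[], float] = time.perf_counter) -> Dict[str, float]:
    """Consume a token stream; report time to first token and sustained tokens per second."""
    start = clock()
    first_token_at = None
    count = 0
    for _ in stream:
        now = clock()
        if first_token_at is None:
            first_token_at = now     # prefill ends when the first token arrives
        count += 1
    end = clock()
    if first_token_at is None:
        return {"tokens": 0, "time_to_first_token_s": float("nan"), "tokens_per_second": 0.0}
    decode_time = end - first_token_at
    # Sustained speed counts only tokens generated after the first one
    tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return {"tokens": count,
            "time_to_first_token_s": first_token_at - start,
            "tokens_per_second": tps}


# Scripted clock readings (hypothetical): start, three token arrivals, end
ticks = iter([0.0, 1.0, 1.1, 1.2, 1.2])
metrics = measure_generation(["a", "b", "c"], clock=lambda: next(ticks))
```

Separating prefill from decode keeps a slow first token from dragging down the sustained speed figure, which is why the two numbers are reported independently.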

Step 5: Serve Your Model

Launch a local inference server and point your applications at it. The localmllm serve command generates the optimal configuration and starts the server:

python localmllm.py serve --model models/llama-3.1-8b-q4_k_m.gguf

Your model now exposes an OpenAI-compatible API at http://localhost:8080/v1. Any tool that supports the OpenAI API format connects directly. Point your code editor, your chat client, or your automation scripts at localhost and keep every request on your machine.
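Talking to that endpoint from Python needs only the standard library. This sketch assumes the server accepts the usual OpenAI-style chat completions route under the base URL above; the `model` value is a placeholder, since what a local server expects there varies by engine.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"   # address from the serve step


def build_chat_request(prompt: str, model: str = "local-model") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request ("local-model" is a placeholder)."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 60.0) -> str:
    """Send the prompt to the local server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]


request = build_chat_request("Summarize this file in one sentence.")
```

Calling `chat("...")` requires the server from this step to be running; building the request does not.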

The Prompt Toolkit

1. Local LLM Architect Prompt

Copy this XML prompt into your favorite LLM (even a cloud one, since this is a one-time planning call) and get a tailored local inference plan:


You are a local LLM infrastructure architect. Given the user's hardware specifications and use case, produce a complete deployment plan.




  Amount in GB
  Name of GPU or "none"
  Amount in GB or "0"
  Number of cores

Describe what you will use the model for: coding, writing, chat, research, analysis




  Name and size
  Specific quant level (e.g. Q4_K_M)
  llama.cpp | vLLM | Ollama
  
    Key configuration parameters: GPU layers, context window, thread count, batch size
  
  Recommended size in tokens
  
    Estimated range
    Estimated range
    Expected RAM + VRAM usage
  
  Model file size estimate




1. Always fit the model within available VRAM plus RAM with 4 GB headroom for the OS.
2. Prefer GPU offloading over CPU-only inference whenever VRAM is available.
3. Q4_K_M is the default quantization. Increase only when VRAM allows.
4. Context window size affects memory linearly. Default to 4096 unless the use case demands more.
5. For coding tasks, prefer Code Llama or DeepSeek Coder. For general chat, prefer Llama 3.1 or Mistral.
6. When VRAM is 0, recommend CPU-only llama.cpp with thread count equal to physical cores minus 1.

Use it once, get your plan, then run everything locally forever.

2. localmllm CLI

The localmllm.py CLI automates the entire workflow from hardware scanning and model recommendation through benchmarking and server launch. Download it from the Bastion CLI downloads section. Five commands cover the full lifecycle:

# Detect your hardware capabilities
python localmllm.py scan

# Get a model recommendation for your use case
python localmllm.py recommend --use-case coding

# Generate inference engine configuration
python localmllm.py config --engine llama.cpp

# Benchmark a model on your hardware
python localmllm.py benchmark --model ./models/llama-3.1-8b-q4_k_m.gguf

# Launch a local inference server
python localmllm.py serve --model ./models/llama-3.1-8b-q4_k_m.gguf

Full source code is available in the Bastion CLI downloads. Every command runs offline with zero cloud dependencies: your hardware, your models, your rules.

Caveats

Local inference requires honest expectations about what your hardware can deliver. A laptop with 8GB VRAM runs a 7B model beautifully but chokes on a 70B model regardless of quantization tricks. Respect the memory limits.

Model quality matters: a Q4 quantized 7B model produces good output for coding assistance and casual chat, but noticeably weaker output than GPT-4 or Claude for complex reasoning, long-form writing, and nuanced analysis, so match the model to the task.

Disk space adds up fast. A single 70B Q4_K_M file consumes 40GB. Keep models you actively use and delete the rest. The localmllm benchmark command tells you whether a model earns its disk space before you commit to keeping it.

Updates require manual effort. Cloud APIs improve automatically. Local models stay frozen until you download a newer version. Subscribe to model release feeds or check Hugging Face weekly for updates.

Philosophy

Bastion stands on one principle: own your compute the way you own your tools. A carpenter keeps their saws in their own shop. A writer keeps their manuscripts in their own desk. A practitioner of AI should keep their models on their own hardware.

Every cloud dependency is a lease, and leases expire: terms change, prices increase, and services shut down. Local inference is ownership. You paid for the hardware, you downloaded the model, and you run the server. You control access, your prompts stay private, and your cost is fixed.

The open source ecosystem made this possible. Llama.cpp, vLLM, Ollama, and the thousands of model contributors built the infrastructure of independence. Bastion curates and distills their work into actionable guides. Use them. Build on them. Own your stack.

CTA

Ready to cut the cord on cloud LLMs? Download the localmllm.py CLI from the Bastion CLI downloads section, run scan to profile your hardware, and have a local model serving requests in under thirty minutes. The complete source code lives in the same directory, ready for customization.

Bastion Series articles give you the full blueprint: prompts, tools, and philosophy. Subscribe to unlock every article and every CLI tool in the archive. Your data stays on your machine. Your capability stays under your control.


The Offer Equation

Michal Szalinski — Thu, 28 May 2026 21:26:00 GMT

You sit frozen at your keyboard, cursor blinking in the price field of your checkout page. $97? $497? $2,000? You refresh Twitter, see a peer charging $5,000 for what looks like the same thing, and feel your stomach drop. You pick a number that feels safe. Safe means low. Low means you just signed up for burnout.

Every person selling their first digital product or service hits this wall. The price field is the single highest-leverage decision you will make, and most people treat it like a feeling. “What feels right?” is the wrong question. There is a formula, and it has four levers. Pull them deliberately and your offer becomes magnetic. Ignore them and you join the graveyard of competent people who priced themselves into irrelevance.

The Idea (60 Seconds)

Alex Hormozi distilled every offer into a single equation in $100M Offers:

Value = (Dream Outcome × Perceived Likelihood) ÷ (Time Delay × Effort/Sacrifice)

Four levers: two in the numerator (bigger is better) and two in the denominator (smaller is better).

Dream Outcome: How badly does the client want the result? Paint the specific end state. “Build a SaaS” is vague. “Ship a profitable micro-SaaS that earns $5K MRR in 90 days” is a dream outcome.
Perceived Likelihood: How confident is the client that you can deliver? Social proof, guarantees, and track record crank this lever. A money-back promise moves likelihood from doubt to “I trust this.”
Time Delay: How long until the client gets the result? Faster always wins. “Learn to code in 12 months” loses to “Deploy your first app this weekend.”
Effort/Sacrifice: How much work does the client do? Less effort wins. Done-for-you beats done-with-you. Done-with-you beats DIY.

Maximize the top. Minimize the bottom. The number that comes out is what people perceive your offer is worth. Pricing follows perception.
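To make the equation concrete, here is a minimal sketch that scores two offers. The 1-10 scales, the `Offer` class, and the sample numbers are illustrative assumptions, not part of Hormozi's book; the point is only to show how the four levers interact.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    name: str
    dream_outcome: float         # 1-10: how badly the client wants the result
    perceived_likelihood: float  # 1-10: how confident they are you can deliver
    time_delay: float            # 1-10: how long until the result (lower is better)
    effort: float                # 1-10: how much work the client does (lower is better)


def value_score(offer: Offer) -> float:
    """Value = (Dream Outcome x Perceived Likelihood) / (Time Delay x Effort/Sacrifice)."""
    denominator = offer.time_delay * offer.effort
    if denominator <= 0:
        raise ValueError("time_delay and effort must be positive")
    return (offer.dream_outcome * offer.perceived_likelihood) / denominator


# Hypothetical offers with made-up lever settings.
diy_course = Offer("DIY course", 6, 4, 8, 8)
done_for_you = Offer("Done-for-you build", 9, 8, 2, 2)

for offer in (diy_course, done_for_you):
    print(f"{offer.name}: {value_score(offer):.2f}")
```

The absolute numbers mean nothing on their own. What matters is the ratio between two versions of the same offer: cutting time delay and effort moves the score far more than a small bump to the dream outcome.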

Why This Matters

Pricing is the number one anxiety for new sellers. It drives three catastrophic decisions:

Hourly pricing because it feels fair. You become labor. Your income ceiling is locked to your calendar. A developer billing $150/hour earns $150/hour forever, or until they raise the rate and lose clients who see them as a commodity.
Underpricing to win first clients. You land the account at a discount and train the client to negotiate on price forever. The relationship starts as a bargain hunt instead of a partnership.
Feature-listing instead of transforming. “6 coaching calls, email support, worksheets” is a menu. Menus invite comparison shopping. Bridges invite commitment. “Go from stuck to shipped in 30 days” is a bridge.

These mistakes share a root cause: treating price as a gut check instead of a calculated hypothesis. The Value Equation and five pricing processes replace that gut check with a system.

Walkthrough

Step 1: Build the Offer with the Value Equation

Running AI Agents on a Laptop GPU - My 6GB VRAM Setup That Actually Works

Michal Szalinski — Tue, 26 May 2026 21:01:13 GMT

You’ve seen the posts. “I’m running a 70B parameter model on my home server with 48GB VRAM.” Cool story. Most of us are staring at a laptop with 6GB of VRAM and 32GB of system RAM, wondering if personal AI agents are beyond our reach.

They’re within reach. I’m writing this on my everyday laptop in Melbourne, and my AI crew is running in the background right now. Water reminders, posture nudges, health research, meal planning, coding help, all happening locally, privately, and fast enough to keep pace.

Here’s the setup, the models, the use cases, and the honest performance numbers.

The Idea (60 Seconds)

You can run useful AI agents on a 6GB VRAM laptop today. The trick is picking the right models for the right tasks, and using a hybrid local/cloud approach that keeps costs under $5/month. Local inference handles 80% of daily agent work. Cloud fills the gap for complex reasoning. The framework connecting everything matters more than the hardware.

The Minimum Viable Setup

RTX 3060 (6GB VRAM) or equivalent
16GB+ system RAM (32GB preferred)
Ollama or llama.cpp (free)
A model at Q4_K_M quantization
An agent framework that supports local + cloud switching

That’s it. Zero custom builds. Zero ML expertise. Zero $3,000 GPU.

The whole system costs me roughly $5/month in cloud API calls on top of the free local inference. Most days the cloud stays untouched.

Why This Setup, Not the Others

Cloud-only is expensive at scale. Every API call costs money. When your agent runs 50+ tool calls per day, the bills add up. Local inference for routine tasks, cloud for complex ones, keeps costs negligible.
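The routing idea can be sketched in a few lines. This is an illustration under stated assumptions, not how Hermes Agent decides: the model identifiers, the set of routine task kinds, and the prompt-length cutoff are all hypothetical placeholders.

```python
from dataclasses import dataclass

# Hypothetical model identifiers; substitute whatever you actually run.
LOCAL_MODEL = "local-4b-q4"
CLOUD_MODEL = "cloud-frontier"

# Task kinds treated as routine enough for a small local model (illustrative).
ROUTINE_TASKS = {"extract", "classify", "reminder", "tool_call"}


@dataclass
class Task:
    kind: str
    prompt: str


def pick_model(task: Task, max_local_chars: int = 8000) -> str:
    """Send routine, short-prompt tasks to the local model; everything else to the cloud."""
    if task.kind in ROUTINE_TASKS and len(task.prompt) <= max_local_chars:
        return LOCAL_MODEL
    return CLOUD_MODEL


print(pick_model(Task("reminder", "Drink water")))
print(pick_model(Task("research", "Compare three studies on sleep and memory")))
```

A rule this crude is usually enough to start with. You can tighten it later by logging which local answers you had to redo in the cloud.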

Big VRAM builds are overkill for daily agents. Most agent tasks involve structured data extraction, simple reasoning, and tool orchestration. A 4B model handles these at 25-40 tokens/second. The extra VRAM goes unused 80% of the time.

Hermes Agent is built for this. Most AI frameworks assume you’re running cloud-only or have a server rack. Hermes is designed to work with whatever you’ve got: local models via Ollama, cloud models via OpenRouter, and seamless switching between them with /model. The framework adds 200MB of overhead. The bottleneck lives elsewhere.

Walkthrough: My Model Lineup (And Why Each One Earns Its VRAM)

Your Content Is a Production Pipeline, Build It Like One

Michal Szalinski — Sun, 24 May 2026 21:01:01 GMT

You told yourself you’d post weekly. It’s been six weeks. Your Substack dashboard mocks you with that sad “0 posts this month” counter. You open a blank document, stare at it, close it, open Hacker News instead. The guilt loop continues.

Meanwhile, the AI bros on X are posting three times a day about “content leverage” while clearly using the same ChatGPT template as everyone else. Quantity up, quality sideways, audience numb.

There’s a third option. You can treat content the way you treat production software: as a pipeline with intake, quality control, assembly, finishing, distribution, and feedback. Skip any step and you get either silence or garbage. Run every step and you get consistent, high-quality output while you sleep.

I know because I built it. This article you’re reading? It came out of that pipeline. The other articles in this series? Same pipeline. Six Python scripts, five cron jobs, one environment file, zero frameworks.

Here’s the architecture.

The Idea (60 Seconds)

Content is a manufacturing problem, and manufacturing problems have manufacturing solutions. You need a system that discovers ideas, filters them, drafts them, QA’s them, generates visuals, publishes, distributes, and measures. Each stage is a script. Each script runs on a schedule. The human touches two points: approving ideas (5 minutes) and reviewing QA failures (15 minutes, rare). Everything else is automated.

Why This Pipeline, Not Manual Blogging

Most people treat content as inspiration plus typing. They wait for the muse, then labor over every sentence. It’s artisanal. Admirable. And completely unscalable past a few posts per month. The pipeline approach treats content as what it actually is for a technical blog: a manufacturing process. The ideas are raw materials. The scoring is quality control on intake. The drafting is assembly. The QA is inspection. The hero image is finishing. The distribution is logistics. The analytics are customer feedback.

The output: 2–3 articles per week. The cost: ~$0.50 per article in LLM and image generation tokens. The human time: under 30 minutes per day.
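As a rough picture of "each stage is a script," the sketch below chains three toy stages and stops at the two human checkpoints. The stage functions, the scoring rule, and the status names are invented for illustration; the real pipeline's scripts are not shown in this excerpt.

```python
from typing import Any, Callable, Dict, List, Tuple

Article = Dict[str, Any]
Stage = Tuple[str, Callable[[Article], Article]]


def score_idea(article: Article) -> Article:
    # Toy intake filter: a real scorer would likely call an LLM or a rubric.
    article["score"] = 8 if "pipeline" in article["idea"].lower() else 3
    return article


def draft(article: Article) -> Article:
    article["draft"] = f"Draft about {article['idea']}"
    return article


def qa(article: Article) -> Article:
    # Toy inspection: a real QA stage would check links, length, tone, and so on.
    article["qa_passed"] = len(article["draft"]) > 10
    return article


STAGES: List[Stage] = [("score", score_idea), ("draft", draft), ("qa", qa)]


def run_pipeline(idea: str, min_score: int = 5) -> Article:
    """Run stages in order; stop early when an idea is rejected or QA fails."""
    article: Article = {"idea": idea, "status": "in_progress"}
    for name, stage in STAGES:
        article = stage(article)
        if name == "score" and article["score"] < min_score:
            article["status"] = "rejected"
            return article
        if name == "qa" and not article["qa_passed"]:
            article["status"] = "needs_review"  # the rare human checkpoint
            return article
    article["status"] = "ready_to_publish"
    return article


print(run_pipeline("Content pipeline architecture")["status"])
print(run_pipeline("Cat pictures")["status"])
```

Keeping every stage as a function that takes and returns the same dictionary is what makes it easy to run each one from its own cron job later.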

The Framework: Six Stages, Six Scripts

Build Your Own Cline Alternative in 200 Lines

Michal Szalinski — Fri, 22 May 2026 21:00:29 GMT

Your AI coding assistant vanishes overnight. Cline gets abandoned. Roo Code stops responding to issues. The VS Code extension that automated your file operations, ran terminal commands, and integrated with your preferred AI models suddenly throws deprecation warnings.

You’re back to copying code snippets manually. Context switching between terminal and editor. Explaining the same codebase structure to ChatGPT every session. The 40% productivity boost from autonomous coding assistance evaporates because someone else controlled the tools you relied on.

What if you could build your own AI coding assistant in an afternoon, own the entire stack, and customize it exactly for your workflow?

The Idea (60 Seconds)

You’ll create a minimal VS Code extension that handles file operations, executes terminal commands, and connects to any OpenAI-compatible API. The 200-line implementation provides autonomous coding capabilities through a simple chat interface that can read your codebase, modify files, and run commands. Setup takes 30 minutes. The result gives you permanent control over your AI coding workflow.

Why Build This, Beyond Waiting for Alternatives

Dependency risk drops to zero. Commercial tools get discontinued. Open source projects get abandoned. Your custom extension lives in your codebase under your control. Zero external dependencies means zero abandonment risk.

Customization becomes unlimited. You control the prompts, the model endpoints, and the file operation logic. Add project-specific commands. Integrate with your deployment scripts. Modify the behavior to match your exact workflow.

API flexibility stays open. Connect to OpenAI, Anthropic, local Ollama instances, or any OpenAI-compatible endpoint. Switch providers by changing one configuration line. Your tool adapts to whatever AI infrastructure you prefer.
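The "one configuration line" claim is easiest to see in code. The extension itself would be written for VS Code, so treat this Python sketch as an illustration of the idea only. The provider table, the placeholder model names, and the `build_chat_request` helper are assumptions; verify base URLs against your provider's documentation. The request is built but never sent.

```python
import json
import urllib.request

# Hypothetical provider table; model names are placeholders.
PROVIDERS = {
    "ollama": {"base_url": "http://localhost:11434/v1", "model": "your-local-model"},
    "openai": {"base_url": "https://api.openai.com/v1", "model": "your-cloud-model"},
}

ACTIVE_PROVIDER = "ollama"  # the one configuration line you change


def build_chat_request(prompt: str, api_key: str = "",
                       provider: str = ACTIVE_PROVIDER) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for an OpenAI-compatible endpoint."""
    cfg = PROVIDERS[provider]
    payload = {
        "model": cfg["model"],
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=f"{cfg['base_url']}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_chat_request("Summarize this repository's structure")
print(request.full_url)
```

Because every provider in the table accepts the same payload shape, the rest of the tool never needs to know which one is active.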

Walkthrough

How to Give Claude Perfect Memory

Michal Szalinski — Wed, 20 May 2026 21:09:50 GMT

By default, Claude’s memory is basically decorative. It forgets context mid-conversation. You re-explain yourself constantly. Even after you do, the next session starts from zero.

Most people have been living with this for months, assuming it’s just how LLMs work. It’s how LLMs work absent a system. With a system, everything changes.

I use Claude every single day. More screen time than any other app on my Mac. I need it sharp, consistent, and carrying forward every decision, preference, and hard-won lesson from the sessions before.

The Idea (60 Seconds)

Three layers of memory, each building on the last. Layer one takes five minutes and covers 90%+ of users.

Why Build a Memory System Instead of Re-explaining

Every time you start a new Claude session, you burn tokens re-establishing context. Over a month, that compounds into hours of wasted time and inconsistent outputs. A memory system pays for itself on day one. Layer one alone saves ten minutes per session. Layer three makes Claude genuinely useful for long-running projects where consistency across weeks matters.

Layer two takes about an hour and changes how Claude operates entirely.

Layer three turns Claude into a self-evolving second brain, trained on all your data, with persistent search and recall across every conversation you’ve ever had.

Here are all three.

Layer 1: Basic Memory (5 Minutes)

Four quick wins. Minutes to set up. Immediate improvement in every conversation.

1. Memory Editing Tool

Go to Settings → Memory right now.

This is the most overlooked page in Claude. Most people have zero awareness it exists.

What you’ll find: everything Claude has stored about you, accumulated passively across every conversation. Preferences, facts, habits, working styles. Left alone, your memory fills up with garbage fast.

The fix: read through everything on this page. Delete anything outdated, inaccurate, or irrelevant. Then manually add the context you actually want Claude to carry permanently.

Stick to the basics here (your role, key preferences). We’ll build highly specific systems soon.

2. Project Instructions

If you use Claude Projects (you should), fill in your Project Instructions field.

My advice: create projects for all your most-used workflows, then voice-prompt all your context into a Google Doc and upload it as a PDF for each project.

3. Tell Claude Directly

The simplest memory hack on this list. Mid-conversation, just tell Claude what to remember.

Things like:

“Remember that I prefer responses under 400 words.”
“Remember that my role is [x].”
“Update your memory with [x].”

Claude stores these immediately. You can also tell it to forget things: “Forget that I mentioned [x].”

4. Memory Imports and Exports

If you’ve been using ChatGPT (or another LLM) and have built up significant context there, you have two options to transfer it:

a) Tell ChatGPT you’re switching platforms and ask it to generate a memory export document: “I’m switching this project to Claude, give me a summary document...”

b) Use Import/Export in Claude. In Settings → Memory, you can import full data from other LLMs.

These four edits cover 90%+ of users and make an immediate impact on how Claude responds.

The next section is for people who want a real system.

Layer 2: Context File System (~60 Minutes)

Layer 1 fixes the basic memory problems. Layer 2 builds something more powerful: a file-based memory architecture that lives on your computer and loads automatically into Cowork and Claude Code.

The concept: instead of prompting Claude for context every time, you store all of that context in .MD desktop files that Claude has access to. You can also attach these markdown files to any LLM or AI agent system.

Create a new desktop folder, label it “Claude Master Folder”, and build these four markdown files within it (Claude can help you do this):

1. Instructions.md

This file tells Claude all your rules and instructions:

## Who you are
## What you do
## Rules
## What good outputs look like

Important to include: “Update Memory.md with my preferences over time.”

This line is crucial. It’s how you get Claude to create a running memory log of your data in the second markdown file.

2. Memory.md

This is the “brain” of Claude, continuously updated over time.

## Preferences
## Corrections
## Patterns
## Decisions

Now whenever you say something like “stop using em dashes,” Claude goes into the memory file and updates it.
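Claude makes this edit itself, but it helps to see what the update amounts to. The sketch below appends a bullet under a section heading in Memory.md. The `add_memory` helper is hypothetical and the bullet format is an assumption about how you might lay out the file.

```python
from pathlib import Path


def add_memory(memory_file: Path, section: str, note: str) -> None:
    """Append '- note' to the end of the '## section' block, creating the section if absent."""
    heading = f"## {section}"
    if memory_file.exists():
        lines = memory_file.read_text(encoding="utf-8").splitlines()
    else:
        lines = []
    if heading not in lines:
        lines += ["", heading] if lines else [heading]

    start = lines.index(heading) + 1
    # The section ends at the next "## " heading or at the end of the file.
    end = start
    while end < len(lines) and not lines[end].startswith("## "):
        end += 1
    # Step back over trailing blank lines so the bullet sits with existing entries.
    insert_at = end
    while insert_at > start and lines[insert_at - 1].strip() == "":
        insert_at -= 1

    lines.insert(insert_at, f"- {note}")
    memory_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Calling `add_memory(Path("Memory.md"), "Preferences", "Stop using em dashes")` leaves the other sections untouched, which is the property you want from any tool that edits this file.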

3. Context.md

The specific context file for a given project. What’s in this file changes depending on your project. You can also create a general “business context” or “life context” markdown mega file.

4. Archive Copies

This one is purely protective but worth doing.

Claude will update your memory files automatically as you work. Occasionally, it overwrites something incorrectly or makes a change you missed. Absent a backup system, that context is gone.

The fix: once a week, copy your entire master folder (Instructions, Memory, Context, and everything else) into a separate archive folder that Claude has zero access to. Label it with the date.

If anything breaks or gets overwritten incorrectly, restore from the archive.
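If you would rather script the weekly copy than do it by hand, a minimal sketch looks like this. The folder names and the `archive_master_folder` helper are assumptions; point the archive root at a location you have not shared with Claude.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def archive_master_folder(master: Path, archive_root: Path,
                          today: Optional[date] = None) -> Path:
    """Copy the whole master folder into archive_root, labeled with the date."""
    stamp = (today or date.today()).isoformat()
    destination = archive_root / f"{master.name} {stamp}"
    if destination.exists():
        raise FileExistsError(f"Archive already exists: {destination}")
    archive_root.mkdir(parents=True, exist_ok=True)
    shutil.copytree(master, destination)  # copies every file and subfolder
    return destination
```

Restoring is the same operation in reverse: copy the dated folder back over the master folder, or pull out the single file that was overwritten.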

Setting It Up

Just create a new folder called “Claude Master Folder,” attach it to a new Cowork chat, and paste this prompt:

Go into my "Claude Master Folder" in my connected workspace and build 
these four markdown files inside it:

Instructions.md - includes sections for: Who You Are, What You Do, 
Rules, What Good Outputs Look Like, and a line telling Claude to 
update Memory.md with my preferences over time.

Memory.md - includes sections for: Preferences, Corrections, 
Patterns, Decisions, and Personal Context. Pre-fill with placeholder 
examples so I know what to add.

Context.md - includes sections for: About This Project/Business, 
Audience, Key People & Collaborators, Active Projects & Priorities, 
Tools & Stack, and Important Background/History. Use a template 
format with placeholders I can fill in.

Archive-Guide.md - a step-by-step guide explaining why to archive,
how to do it weekly (duplicate the folder, rename with the date, 
move it somewhere Claude has zero access to), what to include, 
how to restore if something breaks, and where to store the backups.

Anytime you’re working in Cowork or Claude Code, attach your Master Folder and Claude uses it as a mini memory database. It edits the memory markdown file, leaving you with something you can attach to any LLM, new chat, or AI agent.

This system is a complete game-changer. But Layer 3 takes it further.

Layer 3: AI Second Brain (1-2 Hours)

This is the deepest level. It requires setup and ongoing maintenance, but for those who build it, it’s the most advanced memory system available for Claude today.

Two options depending on how you work. Option 1 is the fast path. Option 2 is the power-user path, requiring 1-2 hours of dedicated building.

Keep in mind: for your AI second brain memory vault to be effective, you have to spend time maintaining it and updating your databases. This is a living system; a set-and-forget approach produces decay.

Option 1: Claude x Notion (5 Minutes)

Connecting Claude to Notion is the highest-leverage thing you can do in 5 minutes.

Go to Claude → Settings → Connectors, then enable the Notion connector.

Once connected, Claude reads your Notion workspace directly inside any chat.

All your tasks, CRMs, notes, and tables are now accessible to Claude and editable by it.

I recommend creating a new “Memory Database” where you store all your AI preferences, rules, and important AI context. As you’re working with Claude, you can say: “Send this to my Notion Memory Database.”

You can then export this Notion data to other LLMs or AI platforms via a CSV file or by using the Notion MCP connector.

This setup is similar to Layer 2, except you gain Notion’s built-in board views, to-do lists, and additional functionality.

Option 2: Claude x Obsidian x AI Engram (1-2 Hours)

This is the setup I personally use. It combines three things:

Obsidian for local markdown storage (your files, your machine, your control)
Karpathy’s LLM Knowledge Base schema for structuring how Claude organizes and compounds knowledge over time
AI Engram for persistent search and memory across every conversation

Here’s why this stack matters: Layer 2 gives Claude a folder of files to read. Layer 3 Option 2 gives Claude a searchable, evolving knowledge system that compounds with every conversation.

Step 1: Download Obsidian

Go to obsidian.md and download the app.

Create a new Vault (think of this as a desktop folder where Claude Code stores and accesses your data). Your data stays local. Zero cloud dependency.

Step 2: Point Claude at Your Vault

Open the Claude desktop app and click ‘Select Folder.’ Point it at your Obsidian Vault folder. Claude now has direct read and write access to everything inside it.

Step 3: Inject the Knowledge Base Schema

Paste Andrej Karpathy’s LLM Knowledge Base system prompt into the chatbox. This is the instruction set that tells Claude Code how to build, maintain, and evolve your wiki over time.

The prompt is available here: gist.github.com/karpathy/442a6bf555914893e9891c11519de94f

I wrote about this system in detail in my earlier article, “Build an LLM Knowledge Base That Actually Compounds.” The key architecture:

your-vault/
├── raw/          # Immutable source documents (AI reads, never modifies)
├── wiki/         # AI-maintained wiki with domain folders
│   ├── index.md  # Navigation hub
│   └── log.md    # Append-only action log
├── outputs/      # Generated reports and query answers
└── AGENTS.md     # Schema defining how the AI organizes, ingests, and queries

The AGENTS.md schema is the single most important file. It defines identity, architecture, conventions, and workflows. Every wiki page gets YAML frontmatter. Wiki-links cross-reference topics. Source citations are required. Contradictions get flagged.

Three core workflows defined in the schema:

Ingest: Read a source, extract key information, create/update summary pages, update index, add backlinks, flag contradictions, log it. A single source touches 10-15 wiki pages.
Query: Read index first, find relevant pages, synthesize answer with citations, offer to file insights back into wiki.
Lint (monthly): Check contradictions, stale claims, orphan pages, missing cross-references, unattributed claims. Output a severity-leveled report.
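One of the lint checks, orphan pages, is simple enough to sketch without an LLM. The code below assumes pages link to each other with `[[wiki-links]]` and that page names match file names without the `.md` extension. The `find_orphans` helper is hypothetical and is not part of the schema.

```python
import re
from typing import Dict, Iterable, List, Set

# Captures the target of [[target]], [[target|alias]], and [[target#heading]].
WIKI_LINK = re.compile(r"\[\[([^\]|#]+)")


def find_orphans(pages: Dict[str, str], roots: Iterable[str] = ("index",)) -> List[str]:
    """Return pages that no other page links to. `pages` maps page name to markdown text."""
    linked: Set[str] = set()
    for name, text in pages.items():
        for target in WIKI_LINK.findall(text):
            target = target.strip()
            if target != name:  # a page linking to itself does not count
                linked.add(target)
    root_set = set(roots)
    return sorted(p for p in pages if p not in linked and p not in root_set)


wiki = {
    "index": "Start here: [[agents]] and [[memory|Memory systems]]",
    "agents": "Agents rely on [[memory#rrf]].",
    "memory": "Hybrid search notes.",
    "stale": "An old page that mentions [[agents]].",
}
print(find_orphans(wiki))
```

To run it against a real vault you could build the dictionary with `{p.stem: p.read_text() for p in Path("wiki").rglob("*.md")}`, keeping in mind that pages with the same name in different folders would collide.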

This system alone is powerful. But it has a gap: every new conversation starts with zero recall of past conversations. Claude reads your wiki files, sure, but it has zero memory of the decisions, preferences, and insights from previous chat sessions.

That gap is exactly what AI Engram fills.

Step 4: Install AI Engram

AI Engram is an MCP (Model Context Protocol) server that gives Claude persistent conversation memory and deep search over your markdown workspace. It runs entirely locally. Zero cloud services. Zero API calls.

pip install ai-engram
# or clone from github.com/MikeS071/ai-engram

Add it to your Claude Desktop MCP config:

{
  "mcpServers": {
    "ai-engram": {
      "command": "python",
      "args": ["aiengram_mcp.py"],
      "cwd": "/path/to/your/vault"
    }
  }
}

AI Engram gives Claude 13 new tools, split into two groups:

Content Search (6 tools):

| Tool | What It Does |
| --- | --- |
| search_blog | BM25 keyword search with relevance scoring and snippets |
| semantic_search_blog | Meaning-based search via sentence-transformer embeddings |
| build_index | Pre-build or refresh the semantic embedding cache |
| list_blog_files | List markdown files, filterable by collection |
| blog_stats | File counts and word totals across collections |
| read_blog_file | Read full markdown file content (with fuzzy path matching) |

Conversation Memory (7 tools):

| Tool | What It Does |
| --- | --- |
| remember | Store a memory with category and optional tags |
| recall | Semantic search across stored memories |
| recall_all | Cross-search memories AND blog content via RRF fusion |
| list_memories | Browse memories by category, newest first |
| forget | Delete a specific memory by ID |
| memory_stats | Memory counts by category and storage size |
| get_system_prompt | Load the context memory protocol instructions |

The search pipeline combines BM25 (keyword) and semantic (embedding) search via Reciprocal Rank Fusion. BM25 catches exact terms. Semantic catches meaning. Together, they find things that either approach alone would miss.
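Reciprocal Rank Fusion itself is only a few lines. This is a generic sketch of the technique, not AI Engram's actual code: each document earns 1 / (k + rank) from every ranking it appears in, and the sums decide the final order. The value k = 60 is a commonly used default; the value AI Engram uses is not stated here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the two searches.
bm25_hits = ["a.md", "b.md", "c.md"]
semantic_hits = ["b.md", "d.md", "a.md"]
print(reciprocal_rank_fusion([bm25_hits, semantic_hits]))
```

Because only ranks are used, the raw BM25 and embedding scores never need to be put on the same scale, which is the main reason this fusion method is popular for hybrid search.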

Step 5: How Memory Actually Works

AI Engram stores memories as JSONL entries (append-only, easy to inspect, easy to recover). Each memory has an ID, category, content, tags, timestamp, and source. Six categories:

| Category | Use Case |
| --- | --- |
| decision | Architectural choices, workflow rules, rejected approaches |
| preference | Tool choices, formatting styles, workflow preferences |
| insight | Key learnings, patterns discovered, breakthroughs |
| context | Background information, project state, environment details |
| task | Completed work, milestones, deliverables |
| note | General purpose, anything worth persisting |
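An append-only JSONL store is easy to picture with a small sketch. The functions below borrow the tool names `remember` and `list_memories` for readability, but they are a hypothetical reimplementation, and the exact key names in AI Engram's file are an assumption.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List, Optional

CATEGORIES = {"decision", "preference", "insight", "context", "task", "note"}


def remember(path: Path, content: str, category: str = "note",
             tags: Optional[List[str]] = None, source: str = "chat") -> Dict:
    """Append one memory as a single JSON line and return the stored entry."""
    if category not in CATEGORIES:
        raise ValueError(f"Unknown category: {category}")
    entry = {
        "id": uuid.uuid4().hex[:8],
        "category": category,
        "content": content,
        "tags": list(tags or []),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": source,
    }
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def list_memories(path: Path, category: Optional[str] = None) -> List[Dict]:
    """Read every entry, optionally filter by category, and return newest first."""
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    entries = [json.loads(line) for line in lines if line.strip()]
    if category is not None:
        entries = [e for e in entries if e["category"] == category]
    return sorted(entries, key=lambda e: e["timestamp"], reverse=True)
```

Since nothing is ever rewritten in place, a corrupted line costs you one memory rather than the whole file, and you can inspect or prune the store with any text editor.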

The Context Memory Protocol works like this:

At conversation start, Claude calls recall_all with a relevant query, then list_memories with category “decision” to load workflow decisions from past sessions.

During conversation, Claude automatically stores decisions, preferences, completed tasks, important context, insights, and notes using the remember tool.

The result: every conversation builds on every conversation before it. Decisions persist. Preferences stick. Insights compound.

The Final Product

Your Obsidian Vault now contains:

your-vault/
├── raw/                      # Source documents (immutable)
├── wiki/                     # Evolving knowledge base
│   ├── index.md              # Navigation hub
│   ├── log.md                # Append-only action log
│   └── [domain folders]      # Topic-organized wiki pages
├── outputs/                  # Generated reports
├── AGENTS.md                 # Knowledge base schema
├── .aiengram_memory.jsonl    # Persistent conversation memory
└── .aiengram_cache.pkl       # Semantic embedding cache

Claude reads your wiki. Claude searches your files with hybrid BM25+semantic search. Claude remembers every decision across every session. Your knowledge base compounds. Your memory persists.

Where This System Breaks

Context window ceiling. Around 100 articles or 400K words, selective reading via the index introduces blind spots. Claude reads the index first and may miss relevant pages further down.

Error compounding. The AI writes a subtle mistake into your wiki. A later query uses that mistake. It files back insights reinforcing the error. This is the compounding downside of a compounding system.

Hallucination persists. Your wiki looks authoritative with citations and structured formatting. But the AI can still synthesize false connections. The structure makes mistakes look more credible.

Cost adds up. Frontier models run $1-2 per ingest operation. At ten sources a day, that is $10-20 daily, or roughly $300-600 a month. Cheaper models work for simple updates; reserve frontier models for complex ingestion.

AI Engram requires maintenance. The JSONL memory file grows. Occasionally you need to review, prune, and forget outdated memories. A set-and-forget approach produces the same decay as Layer 1’s unmanaged memory page.

Scaling caps out around 10K sources. This system serves individuals and small teams well. Enterprise-scale knowledge management requires a different architecture.

Which Layer Should You Build?

| Layer | Time | Best For |
| --- | --- | --- |
| 1: Basic Memory | 5 minutes | Everyone. Start here. |
| 2: Context Files | ~60 minutes | Power users with repeatable workflows |
| 3 Option 1: Notion | 5 minutes | People already in Notion who want visual dashboards |
| 3 Option 2: Obsidian + Engram | 1-2 hours | People who want local control, deep search, and persistent memory across sessions |

My recommendation: start at Layer 1 today. Build Layer 2 this week. Graduate to Layer 3 Option 2 when you’re ready to stop repeating yourself across every conversation.

The difference between Claude with default memory and Claude with a second brain is the difference between a goldfish and an elephant. Same fishbowl. Completely different relationship with time.

This article was built from real systems: the LLM Knowledge Base architecture (covered in detail at archonhq.ai) and AI Engram (github.com/MikeS071/ai-engram), an open-source MCP server for persistent AI memory. Both run locally. Both compound. Go build yours.


Clone Hermes Agent's Architecture for Your Own AI Assistant

Michal Szalinski — Mon, 18 May 2026 21:00:59 GMT

Your AI assistant forgets the conversation context after three exchanges. The tool calling fails when you chain multiple operations. The memory system breaks when handling complex workflows that span multiple sessions.

You’re cobbling together OpenAI function calls with custom prompt engineering while fighting race conditions in multi-step processes. The assistant that worked for simple Q&A completely falls apart when you need it to research, analyze, and execute a series of dependent tasks.

Meanwhile, Nous Research’s Hermes Agent handles complex workflows flawlessly. Multi-turn conversations maintain perfect context. Tool execution chains together seamlessly. The architecture scales from simple queries to sophisticated automation.

The Idea (60 Seconds)

You’ll reverse-engineer Hermes Agent’s core design patterns to build a production-grade AI assistant framework. The implementation uses a modular plugin system, persistent memory management, and standardized tool interfaces that handle complex workflows reliably. Setup takes 2 hours. The result gives you an assistant architecture that scales from basic chat to autonomous task execution.

Why This Architecture, Beyond Simple Function Calling

Memory persistence solves context degradation. Standard chat implementations lose context as conversations grow. Hermes uses structured memory that maintains conversation state, user preferences, and task history across sessions. Your assistant remembers what you discussed yesterday and builds on previous work.

Plugin modularity enables unlimited expansion. Function calling requires hardcoded tool definitions. The Hermes pattern uses a plugin interface where tools register themselves dynamically. Add new capabilities by dropping Python files into a plugins directory. Zero core code changes.

Execution planning prevents tool chaos. Naive implementations call tools randomly based on user input. Hermes creates execution plans that sequence tool calls logically, handle dependencies, and recover from failures. That is the difference between handling “search for Python tutorials” and handling “search for Python tutorials, summarize the top 3, create a learning plan, and schedule practice sessions.”
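Dependency-aware sequencing can be sketched with the standard library's topological sorter. How Hermes Agent actually builds its plans is not shown in this article, so the step names and the `plan_order` helper below are assumptions used only to illustrate the ordering problem.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def plan_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return steps in an order where every step runs after the steps it depends on.

    `dependencies` maps each step to the set of steps that must finish first.
    Raises graphlib.CycleError if the steps depend on each other in a loop.
    """
    return list(TopologicalSorter(dependencies).static_order())


# The multi-step request from the paragraph above, written as a dependency map.
plan = {
    "search": set(),
    "summarize": {"search"},
    "learning_plan": {"summarize"},
    "schedule": {"learning_plan"},
}
print(plan_order(plan))
```

A planner built this way fails loudly on circular dependencies instead of looping, and independent steps can later be run in parallel because the graph makes their independence explicit.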

Walkthrough

1. Core Agent Framework

Create the base agent class that handles conversation flow and tool coordination:

# agent.py
import json
import asyncio
from typing import Dict, List, Any, Optional
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Message:
    role: str
    content: str
    timestamp: datetime
    metadata: Optional[Dict[str, Any]] = None  # None means no metadata attached

class HermesAgent:
    def __init__(self, model_client, memory_store, plugin_manager):
        self.model = model_client
        self.memory = memory_store
        self.plugins = plugin_manager
        self.conversation_id = None
        
    async def process_message(self, user_input: str) -> str:
        # Load conversation context
        context = await self.memory.get_context(self.conversation_id)
        
        # Create execution plan
        plan = await self.create_execution_plan(user_input, context)
        
        # Execute plan steps
        results = []
        for step in plan.steps:
            result = await self.execute_step(step)
            results.append(result)
            
        # Generate response
        response = await self.synthesize_response(results, user_input)
        
        # Store conversation state
        await self.memory.store_exchange(
            self.conversation_id, user_input, response, results
        )
        
        return response

2. Memory Management System

Implement persistent memory that maintains context across sessions:

# memory.py
import sqlite3
import json
from typing import Dict, List, Optional

class MemoryStore:
    def __init__(self, db_path: str):
        self.db_path = db_path
        self.init_database()
        
    def init_database(self):
        conn = sqlite3.connect(self.db_path)
        conn.execute('''
            CREATE TABLE IF NOT EXISTS conversations (
                id TEXT PRIMARY KEY,
                created_at TIMESTAMP,
                last_active TIMESTAMP,
                context_summary TEXT
            )
        ''')
        conn.execute('''
            CREATE TABLE IF NOT EXISTS messages (
                id INTEGER PRIMARY KEY,
                conversation_id TEXT,
                role TEXT,
                content TEXT,
                timestamp TIMESTAMP,
                metadata TEXT,
                FOREIGN KEY (conversation_id) REFERENCES conversations (id)
            )
        ''')
        conn.commit()
        conn.close()
        
    async def get_context(self, conversation_id: str) -> Dict:
        conn = sqlite3.connect(self.db_path)
        
        # Get recent messages
        messages = conn.execute('''
            SELECT role, content, timestamp, metadata 
            FROM messages 
            WHERE conversation_id = ? 
            ORDER BY timestamp DESC 
            LIMIT 20
        ''', (conversation_id,)).fetchall()
        
        # Get conversation summary
        summary = conn.execute('''
            SELECT context_summary 
            FROM conversations 
            WHERE id = ?
        ''', (conversation_id,)).fetchone()
        
        conn.close()
        
        return {
            'messages': [
                {
                    'role': msg[0], 
                    'content': msg[1], 
                    'timestamp': msg[2],
                    'metadata': json.loads(msg[3] or '{}')
                } 
                for msg in reversed(messages)
            ],
            'summary': summary[0] if summary else None
        }
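The agent in step 1 calls `store_exchange`, which the MemoryStore listing above does not define. Here is one way it might look against the same two tables, written as a standalone function so it can be tested in isolation. The metadata layout and the choice to store tool results on the assistant row are assumptions.

```python
import json
import sqlite3
from contextlib import closing
from datetime import datetime, timezone
from typing import Any, List


def store_exchange(db_path: str, conversation_id: str, user_input: str,
                   response: str, results: List[Any]) -> None:
    """Persist one user/assistant exchange and refresh the conversation's last_active."""
    now = datetime.now(timezone.utc).isoformat()
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:  # commits on success, rolls back if any statement fails
            conn.execute(
                "INSERT OR IGNORE INTO conversations (id, created_at, last_active) "
                "VALUES (?, ?, ?)",
                (conversation_id, now, now),
            )
            conn.execute(
                "UPDATE conversations SET last_active = ? WHERE id = ?",
                (now, conversation_id),
            )
            conn.executemany(
                "INSERT INTO messages (conversation_id, role, content, timestamp, metadata) "
                "VALUES (?, ?, ?, ?, ?)",
                [
                    (conversation_id, "user", user_input, now, "{}"),
                    # default=str keeps non-JSON-serializable tool results from raising.
                    (conversation_id, "assistant", response, now,
                     json.dumps({"results": results}, default=str)),
                ],
            )
```

Inside the async class you could wrap the call with `await asyncio.to_thread(store_exchange, ...)` so the blocking SQLite work stays off the event loop. Both rows share one timestamp, so a query that needs strict ordering should break ties on the autoincrementing `id` column.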

3. Plugin System Architecture

Build the modular tool interface that enables dynamic capability expansion:

# plugins.py
import importlib.util  # the util submodule must be imported explicitly
import os
from abc import ABC, abstractmethod
from typing import Dict, Any, List

class Plugin(ABC):
    @property
    @abstractmethod
    def name(self) -> str:
        pass
        
    @property
    @abstractmethod
    def description(self) -> str:
        pass
        
    @abstractmethod
    async def execute(self, parameters: Dict[str, Any]) -> Any:
        pass
        
    @abstractmethod
    def get_schema(self) -> Dict:
        pass

class PluginManager:
    def __init__(self, plugins_dir: str):
        self.plugins_dir = plugins_dir
        self.plugins: Dict[str, Plugin] = {}
        self.load_plugins()
        
    def load_plugins(self):
        for filename in os.listdir(self.plugins_dir):
            if filename.endswith('.py') and filename != '__init__.py':
                module_name = filename[:-3]
                spec = importlib.util.spec_from_file_location(
                    module_name, 
                    os.path.join(self.plugins_dir, filename)
                )
                module = importlib.util.module_from_spec(spec)
                spec.loader.exec_module(module)
                
                # Find Plugin subclasses
                for attr_name in dir(module):
                    attr = getattr(module, attr_name)
                    if (isinstance(attr, type) and 
                        issubclass(attr, Plugin) and 
                        attr != Plugin):
                        plugin_instance = attr()
                        self.plugins[plugin_instance.name] = plugin_instance
                        
    def get_available_tools(self) -> List[Dict]:
        return [
            {
                'name': plugin.name,
                'description': plugin.description,
                'schema': plugin.get_schema()
            }
            for plugin in self.plugins.values()
        ]

4. Example Plugin Implementation

Create a web search plugin that follows the standard interface:

# plugins/web_search.py
import aiohttp
import json
from plugins import Plugin

class WebSearchPlugin(Plugin):
    @property
    def name(self) -> str:
        return "web_search"
        
    @property
    def description(self) -> str:
        return "Search the web for current information"
        
    async def execute(self, parameters):
        query = parameters.get('query')
        max_results = parameters.get('max_results', 5)
        
        # Use your preferred search API
        async with aiohttp.ClientSession() as session:
            url = "https://api.search.brave.com/res/v1/web/search"
            headers = {"X-Subscription-Token": "your_api_key"}  # placeholder: load from config or env
            params = {"q": query, "count": max_results}
            
            async with session.get(url, headers=headers, params=params) as response:
                response.raise_for_status()  # fail loudly instead of parsing an error body
                data = await response.json()
                
        results = []
        for item in data.get('web', {}).get('results', []):
            results.append({
                'title': item.get('title'),
                'url': item.get('url'),
                'description': item.get('description')
            })
            
        return {'results': results, 'query': query}
        
    def get_schema(self):
        return {
            'type': 'object',
            'properties': {
                'query': {'type': 'string', 'description': 'Search query'},
                'max_results': {'type': 'integer', 'description': 'Maximum results to return'}
            },
            'required': ['query']
        }

5. Execution Planning

Implement the planning system that sequences tool calls intelligently:

# planner.py
import json  # needed for json.dumps / json.loads below
from typing import List, Dict, Optional
from dataclasses import dataclass

@dataclass
class ExecutionStep:
    tool_name: str
    parameters: Dict
    depends_on: Optional[List[str]] = None
    step_id: Optional[str] = None

class ExecutionPlanner:
    def __init__(self, model_client, plugin_manager):
        self.model = model_client
        self.plugins = plugin_manager
        
    async def create_plan(self, user_input: str, context: Dict) -> List[ExecutionStep]:
        available_tools = self.plugins.get_available_tools()
        
        planning_prompt = f"""
        User request: {user_input}
        Available tools: {json.dumps(available_tools, indent=2)}
        
        Create a step-by-step execution plan. Each step should use one tool.
        Consider dependencies between steps.
        
        Respond with a JSON array of steps:
        [
            {{
                "step_id": "step_1",
                "tool_name": "web_search",
                "parameters": {{"query": "Python tutorials"}},
                "depends_on": []
            }}
        ]
        """
        
        response = await self.model.complete(planning_prompt)
        steps_data = json.loads(response)
        
        return [ExecutionStep(**step) for step in steps_data]
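
The planner returns steps with depends_on fields, but nothing above decides the order in which they run. The following is a minimal sketch of one way to do that, assuming step_id values are unique and every depends_on entry names another step in the same plan. order_steps is a hypothetical helper, not part of the code above, and the "summarize" tool in the demo is made up for illustration. It relies on graphlib.TopologicalSorter from the standard library (Python 3.9+).

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List


@dataclass
class ExecutionStep:
    tool_name: str
    parameters: Dict
    depends_on: List[str] = field(default_factory=list)
    step_id: str = ""


def order_steps(steps: List[ExecutionStep]) -> List[ExecutionStep]:
    """Return steps so that every step comes after the steps it depends on."""
    by_id = {step.step_id: step for step in steps}
    if len(by_id) != len(steps):
        raise ValueError("duplicate step_id in plan")

    # Map each step to the set of steps that must finish before it
    graph = {}
    for step in steps:
        missing = [dep for dep in step.depends_on if dep not in by_id]
        if missing:
            raise ValueError(f"{step.step_id} depends on unknown steps: {missing}")
        graph[step.step_id] = set(step.depends_on)

    try:
        ordered_ids = list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError(f"plan contains a dependency cycle: {exc.args[1]}") from exc
    return [by_id[step_id] for step_id in ordered_ids]


plan = [
    # "summarize" is a hypothetical tool used only for this demo
    ExecutionStep("summarize", {"source": "step_1"}, ["step_1"], "step_2"),
    ExecutionStep("web_search", {"query": "Python tutorials"}, [], "step_1"),
]
print([step.step_id for step in order_steps(plan)])  # ['step_1', 'step_2']
```

Rejecting unknown dependencies and cycles up front matters here because the plan comes from a language model, which can emit step ids that do not exist.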

Caveats

Model quality determines planning effectiveness. The execution planner relies on the language model understanding tool capabilities and sequencing logic. Weaker models create inefficient plans or miss dependencies. GLM-5.1 level capability becomes essential for complex workflows.

Memory storage grows indefinitely. The SQLite implementation accumulates conversation history permanently. Add cleanup routines for conversations older than 30 days or implement conversation archiving to prevent database bloat.
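
A cleanup routine could look roughly like the sketch below. It reuses the messages and conversations tables and the conversation_id, timestamp, and id columns from the memory code, and it assumes timestamps are stored as text that SQLite's datetime() can parse. prune_old_conversations is a hypothetical name, and the demo schema is a guess at the real one. Note that it prunes old messages and then removes conversations left with no messages, which is one possible reading of "older than 30 days".

```python
import sqlite3


def prune_old_conversations(conn: sqlite3.Connection, max_age_days: int = 30) -> int:
    """Delete messages older than max_age_days, then drop emptied conversations.

    Assumes messages.timestamp holds text that SQLite's datetime() can parse,
    for example 'YYYY-MM-DD HH:MM:SS' in UTC. Returns the number of deleted messages.
    """
    cutoff = f"-{int(max_age_days)} days"
    with conn:  # one transaction: commit on success, roll back on error
        deleted = conn.execute(
            "DELETE FROM messages WHERE datetime(timestamp) < datetime('now', ?)",
            (cutoff,),
        ).rowcount
        conn.execute(
            "DELETE FROM conversations WHERE NOT EXISTS ("
            "SELECT 1 FROM messages WHERE messages.conversation_id = conversations.id)"
        )
    return deleted


# Demo against an in-memory database with an assumed schema
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE conversations (id TEXT PRIMARY KEY, context_summary TEXT);
    CREATE TABLE messages (
        conversation_id TEXT, role TEXT, content TEXT, timestamp TEXT, metadata TEXT
    );
    INSERT INTO conversations VALUES ('old', NULL), ('new', NULL);
    INSERT INTO messages VALUES ('old', 'user', 'hi', datetime('now', '-45 days'), NULL);
    INSERT INTO messages VALUES ('new', 'user', 'hi', datetime('now'), NULL);
""")
print(prune_old_conversations(conn))  # 1
```

If you would rather archive than delete, the same WHERE clauses can feed an INSERT INTO an archive table before the DELETE statements run.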

Plugin isolation remains minimal. Plugins execute in the same Python process with full system access. Malicious or buggy plugins can crash the entire agent. Consider sandboxing for production deployments handling untrusted plugins.
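
Short of real sandboxing, you can at least stop a hanging or crashing async plugin from taking the agent down with it. The sketch below is not a sandbox: it only converts timeouts and exceptions into error results, and a plugin that blocks the event loop or touches the filesystem still needs process-level isolation. run_plugin_guarded is a hypothetical helper, and the two demo plugins are stand-ins for plugin.execute(parameters).

```python
import asyncio
from typing import Any, Awaitable, Dict


async def run_plugin_guarded(call: Awaitable[Any], timeout_s: float = 10.0) -> Dict[str, Any]:
    """Await a plugin call, turning timeouts and exceptions into error results."""
    try:
        return {"ok": True, "result": await asyncio.wait_for(call, timeout=timeout_s)}
    except asyncio.TimeoutError:
        return {"ok": False, "error": f"timed out after {timeout_s}s"}
    except Exception as exc:  # a buggy plugin should not crash the whole agent
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}


async def slow_plugin() -> str:
    await asyncio.sleep(1)
    return "done"


async def broken_plugin() -> str:
    raise RuntimeError("boom")


async def demo():
    return [
        await run_plugin_guarded(slow_plugin(), timeout_s=0.05),
        await run_plugin_guarded(broken_plugin()),
    ]


results = asyncio.run(demo())
print(results)
```

Returning a structured error instead of raising lets the planner decide whether to retry, skip the step, or report the failure to the user.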

Philosophy

Building your own agent architecture creates compound advantages over time. Each plugin you add can combine with the ones already installed, so the system’s capabilities compound rather than merely add up. The memory system learns your preferences and work patterns. The execution planner gets better at sequencing tasks for your specific use cases.

The Hermes architecture pattern scales from personal assistant to team automation platform. Start with web search and file operations. Add calendar integration, code analysis, and deployment tools. The modular design grows with your needs while maintaining reliability.

You own the entire stack. Zero vendor dependencies. Zero API rate limits. Zero feature deprecation risk.

Build Yours

Start with the core agent framework and memory system. Build one plugin. Test the execution planning with simple two-step workflows. The architecture becomes clear once you see it running.

What’s the first capability you’ll add to your agent? Drop your plugin ideas in the comments.

Your ICP Is a Trap

Michal Szalinski — Sun, 17 May 2026 07:26:23 GMT

You spend six weeks building an AI agent that automates invoice processing for small businesses. You launch. Crickets. You posted in three Discord servers, sent 40 cold DMs, ran $200 in ads. Zero paying customers. The product works. The demos are smooth. Sales stay at zero.

The problem stared you in the face the whole time. Your ICP was “small business owners who need automation.” That describes 30 million people and excites exactly zero of them. You defined your ideal customer by demographics, by role, by company size. You listed who they are. You failed to ask whether they care, whether they spend, whether you can reach them, and whether you have any right to win.

The Idea (60 Seconds)

Your Ideal Customer Profile is a trap when it answers the wrong question. Most builders define ICP by demographics: age, income, job title, company size. Those attributes describe a person. They fail to predict behavior.

A strong ICP answers one question: Who is actively trying to solve this problem right now, has the ability to pay, and can be reached?

Urgency and situation beat demographics every time. A 42-year-old CFO at a logistics company drowning in manual reconciliation is your ICP. “CFOs at mid-market companies” is a demographic label that includes thousands of people perfectly happy with their spreadsheets.

The 4-Filter Test screens your ICP before you invest a single build hour. Pain. Market. Access. Fit. Each filter eliminates weak assumptions. Pass all four, and you have a target worth building for.

Two complementary question sequences sharpen the result. The Narrowing Funnel, derived from Alex Hormozi’s framework, starts broad and drills to urgency. The Lighthouse Client Method, created by Rmosh, grounds your ICP in a real human being instead of an abstract persona.

Why This Matters

Every AI builder hits the same wall. You learn prompt engineering. You master agent frameworks. You ship something that works. Then you realize you built it for everybody, which means you built it for an audience of zero.

Generic ICPs produce generic messaging. Generic messaging produces low conversion and high churn. You attract people who kind of sort of need your thing. They sign up, poke around, and leave. Your retention numbers look like a cliff.

The cost compounds fast. Six weeks of building for the wrong audience means six weeks of code you may need to rewrite, six weeks of positioning you need to undo, and six weeks of motivation burned on a product zero people wanted.

The antidote is simple and ruthless: filter before you build. The 4-Filter Test takes 30 minutes and saves months.

Walkthrough

The 4-Filter Test

Run your ICP through these four filters in order. Fail any single one, and you stop. Revisit your assumptions. Pick a different target. Do it all before writing a single line of code.

Filter 1: Pain. Are real people experiencing this problem and actively seeking solutions?

This is the urgency filter. People complain about many things. People seek solutions for far fewer. Your ICP must have a problem painful enough that they are already looking for help, googling alternatives, posting in forums, asking colleagues.

Test: Search for the problem in Reddit, Twitter, industry Slack channels. If people are posting about it and asking for recommendations, pain is real. If you find only vague complaints, the pain is too low to drive purchase behavior.

Example: “Bookkeeping is tedious” is a complaint. “I spent 12 hours last weekend reconciling invoices and I am still behind” is a pain signal. The second person buys. The first person scrolls past your ad.

Filter 2: Market. Is there a group spending money on solutions already?

Existing spend proves willingness to pay. If zero people are spending money to solve this problem, you are fighting human inertia and budget allocation at the same time. That is a losing battle.

Test: Search for existing products, agencies, consultants, or freelancers serving this problem. Check their pricing pages. Look for G2 or Capterra listings. Paid competitors validate the market. Zero competitors usually signals zero market, and first-mover advantage is a myth for solo builders.

Example: Automation tools for real estate agents exist everywhere, and agents pay for them. That signals a market. A tool for “people who want to journal more creatively” faces a market of free alternatives and low willingness to pay.

Filter 3: Access. Can you reach these people through channels you can actually use?

A perfect ICP locked behind an unreachable channel is useless. If your target is Fortune 500 CTOs and your only channel is a Twitter account with 200 followers, you lack access. Access means you can put your message in front of your ICP repeatedly, at low cost, starting this week.

Test: List every channel where your ICP spends time. Then honestly assess whether you can show up there. Do you have followers there? Do you know someone who does? Can you write content they read? Can you cold-email them effectively?

Example: React developers are reachable through Twitter, Dev.to, Discord, GitHub, and conference communities. Mid-market hospital administrators are reachable through expensive trade shows and closed networks. Pick the ICP you can actually reach.

Filter 4: Fit. Does your skill or experience give you an edge with this group?

You need earned advantage. Domain knowledge, professional network, technical expertise, or lived experience that lets you build something better or faster than a random competitor. Fit is your moat at the earliest stage.

Test: Ask yourself what you know about this ICP that most people lack. If the answer is “zero,” you are competing on execution alone against people who have both execution and insight.

Example: A former tax accountant building automation for tax firms has massive fit. A career developer building automation for dental practices has zero fit. Both can build the product. The former builds the right product faster.

The Narrowing Funnel (Hormozi-Derived)

Once your ICP passes all four filters, sharpen it with this question sequence. Each question narrows the field.

1. Who specifically? Start broad: “Business owners.” Narrow: “E-commerce business owners.” Narrower: “E-commerce business owners doing $1M to $10M in revenue.” Each level removes people who dilute your message.
2. What is their situation? Describe the context that creates the problem. “E-commerce owners managing inventory across three warehouses with a team of five and lacking a dedicated operations person.”
3. What is the painful version? Find the acute symptom. “SKU mismatches causing stockouts on best-selling items during peak season.” This is what keeps them up at night.
4. What triggers them to seek help right now? Identify the event that converts latent frustration into active purchasing. “Black Friday inventory errors cost them $50K in lost sales last year, and Q4 is eight weeks away.” That is urgency.
5. What is the outcome they would pay for? State the result in their language. “Eliminate SKU mismatches so every order ships correct and on time.” Sell the outcome, not the feature.

The Lighthouse Client Method

The Narrowing Funnel gives you a precise segment. The Lighthouse Client Method grounds it in a real human being.

1. Identify one person you would love to help. A specific individual. A former colleague, a client you worked with, a person from a community you belong to. Someone you can picture clearly.
2. Map their entire day. From morning to evening, what do they do? Where do they spend time? What tools do they open? What meetings drain them? What tasks feel like wasted effort?
3. Find the friction point they complain about most. The thing they mention unprompted. The task that makes them groan. The process they describe as “the worst part of my week.”
4. Build for that person, then generalize. Create the solution that eliminates their specific friction. Then ask: who else has this same friction in this same context? Those people are your ICP.

This method works because it anchors your product in observed behavior instead of imagined needs. You solve a real problem for a real person. Other people with the same problem recognize themselves in your messaging because it describes their actual experience.

The Two Big Beginner Mistakes

Mistake 1: “My ICP is everyone who might pay.” This feels safe. It is the opposite of safe. Broad targeting produces generic messaging. Generic messaging converts at a fraction of specific messaging. You attract marginal customers who churn fast because the product serves everyone poorly instead of serving someone exceptionally.

Fix: Define your ICP by best-fit criteria and disqualifiers. Write down who you serve and who you deliberately exclude. Disqualifiers sharpen your positioning as much as qualifiers. “We help e-commerce operators doing $1M to $10M. Enterprise teams and solopreneurs fall outside our focus.”

Mistake 2: Choosing a niche based on passion or identity, assuming the market rewards authenticity. Passion is a starting point. It falls short as a standalone strategy. The market rewards value, and value requires craft. Building for a niche you love but lack skill in produces mediocre products that compete against people with genuine expertise.

Fix: Replace passion-first with craft plus pull. Craft means your skill gives you an edge. Pull means the market signals demand. When craft and pull align, you have a sustainable position. When they misalign, you have a hobby.

The Prompt Toolkit

ICP Extraction Prompt

Copy the prompt below, replace the placeholder with your business idea, and paste it into any LLM.

You are a ruthless ICP analyst. You eliminate weak assumptions and surface the truth about whether a business idea has a viable target customer.

Run the 4-Filter Test on the business idea below. Score each filter from 1 to 10. For each filter, provide the score, the reasoning, and the specific evidence a builder should gather to validate or invalidate the score. Be brutally honest. Affirmative evidence only; discard wishful thinking.


 
Filter 1: Pain
Are real people experiencing this problem and actively seeking solutions right now?
Scoring: 10 = People post daily in public forums begging for a fix. 5 = People complain occasionally. 1 = You assume the pain exists based on logic alone.

Filter 2: Market
Is there a group already spending money to solve this problem?
Scoring: 10 = Multiple paid products with pricing pages and reviews. 5 = One or two niche tools exist. 1 = Zero paid solutions exist.

Filter 3: Access
Can you reach these people through channels you can actually use this week?
Scoring: 10 = You already have an audience or direct connection. 5 = You can reach them through public communities. 1 = They hide behind gatekeepers and enterprise sales cycles.

Filter 4: Fit
Does your skill or experience give you an edge with this group?
Scoring: 10 = You have years of domain expertise and a network. 5 = You have adjacent skills. 1 = You have zero connection to this world.

Return your response in this exact structure:

Idea: One-sentence restatement of the idea
Pain (score out of 10): Reasoning and evidence to gather
Market (score out of 10): Reasoning and evidence to gather
Access (score out of 10): Reasoning and evidence to gather
Fit (score out of 10): Reasoning and evidence to gather
Total: X/40
Verdict: PASS if total >= 28, CONDITIONAL if 20-27, FAIL if below 20
Next step: One specific thing the builder should do next




[PASTE YOUR BUSINESS IDEA HERE]

Lighthouse Client Prompt

Copy the prompt below, paste it into any LLM, and answer its questions honestly.

You are a product strategist who specializes in grounding abstract customer profiles in real human behavior. Your method is the Lighthouse Client Method: find one real person, observe their actual day, and surface the friction that drives purchasing.

Walk me through the Lighthouse Client Method step by step. Ask me one question at a time. Wait for my answer before proceeding to the next step. Complete all four steps.


 
Step 1: Ask me to name one specific person I would love to help. This must be a real individual I can picture clearly: a former colleague, a past client, someone from a community I belong to. Ask for their first name (or alias), their role, and their industry.

Step 2: Ask me to describe this person's typical workday from morning to evening. Prompt me to include: what tools they open, what meetings they attend, what tasks consume their time, and what feels like wasted effort. Probe for specifics.

Step 3: Ask me to identify the single task this person complains about most. The thing they mention unprompted. The process that makes them groan. Ask what they have tried to fix it and why those attempts fell short.

Step 4: Based on everything I shared, produce a one-paragraph ICP statement in this format: "People like [name], who are [role] at [type of company], who struggle with [specific friction] because [root cause], and who would pay for [outcome]."

After I complete all four steps, output:

Lighthouse client: Summary of the person I described
Core friction: The specific pain you identified
ICP statement: The one-paragraph ICP statement from Step 4
Validation plan: Three specific actions I should take this week to confirm this friction exists for five more people

ICP Validation CLI

Save the script below as icp_check.py, set your OPENROUTER_API_KEY environment variable, and run it.

import argparse, json, os, sys, urllib.request

def main():
    p = argparse.ArgumentParser(description="4-Filter ICP Assessment via OpenRouter")
    p.add_argument("idea", help="Your business idea description")
    p.add_argument("--model", default="google/gemini-2.0-flash-001")
    args = p.parse_args()
    key = os.environ.get("OPENROUTER_API_KEY", "")
    if not key:  # assert is stripped under `python -O`, so exit explicitly
        sys.exit("Set OPENROUTER_API_KEY env var")
    prompt = f"""Score this business idea on the 4-Filter ICP Test. Each filter gets 1-10.
Filters: Pain (active problem seekers?), Market (existing spend?), Access (reachable channels?), Fit (your edge?).
Idea: {args.idea}
Return JSON only: {{"pain": int, "market": int, "access": int, "fit": int, "total": int, "verdict": "PASS|CONDITIONAL|FAIL"}}"""
    body = json.dumps({"model": args.model, "messages": [{"role": "user", "content": prompt}]}).encode()
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=body,
        headers={"Authorization": f"Bearer {key}", "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        reply = json.load(resp)["choices"][0]["message"]["content"]
    # Models sometimes wrap JSON in markdown fences; keep only the outermost {...} span
    r = json.loads(reply[reply.find("{"):reply.rfind("}") + 1])
    print(f"Pain: {r['pain']}/10 | Market: {r['market']}/10 | Access: {r['access']}/10 | Fit: {r['fit']}/10")
    print(f"Total: {r['total']}/40 | Verdict: {r['verdict']}")

if __name__ == "__main__":
    main()
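
The script prints whatever total and verdict the model reports, and a model can return a total that does not match its own four scores. One way to guard against that is to recompute both locally, as in the sketch below. score_icp is a hypothetical helper; the thresholds are the ones from the ICP Extraction Prompt (PASS at 28 or more, CONDITIONAL at 20 to 27, FAIL below 20).

```python
from typing import Dict

FILTERS = ("pain", "market", "access", "fit")


def score_icp(scores: Dict[str, int]) -> Dict[str, object]:
    """Validate the four 1-10 filter scores and derive total and verdict locally."""
    for name in FILTERS:
        value = scores.get(name)
        # bool is a subclass of int in Python, so reject it explicitly
        if not isinstance(value, int) or isinstance(value, bool) or not 1 <= value <= 10:
            raise ValueError(f"{name} must be an integer from 1 to 10, got {value!r}")
    total = sum(scores[name] for name in FILTERS)
    if total >= 28:
        verdict = "PASS"
    elif total >= 20:
        verdict = "CONDITIONAL"
    else:
        verdict = "FAIL"
    return {"total": total, "verdict": verdict}


print(score_icp({"pain": 8, "market": 7, "access": 5, "fit": 6}))
# {'total': 26, 'verdict': 'CONDITIONAL'}
```

In the CLI you could pass the parsed reply through this function and print its total and verdict instead of the model's.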

Caveats

The 4-Filter Test eliminates bad ICPs fast. It can also create false confidence if you lie to yourself on any filter. Confirmation bias is the enemy. Run each filter assuming your ICP fails, and look for evidence that it passes. The opposite approach, seeking evidence that confirms your hope, leads to the same wasted months the test is designed to prevent.

The Lighthouse Client Method risks overfitting to one person. Your lighthouse client may have idiosyncratic needs that diverge from the broader market. After building for them, validate that the problem generalizes. Talk to five more people in the same segment. If three of five describe the same friction, you have product-market signal. If only one of five does, you have a consulting client.

Markets shift. An ICP that passes all four filters today may fail in six months as conditions change. Revisit the test quarterly. Treat your ICP as a hypothesis, and treat revenue as the experiment result.

Philosophy

The best product strategy starts with ruthless selection, and selection means elimination. Every person you exclude from your ICP makes your messaging sharper for the people who remain. Every filter you apply removes a possible path to wasted effort.

Building AI tools is easier than ever. The moat has moved from technical execution to problem selection. The builders who win are the ones who chose the right problem before they wrote a single function. The 4-Filter Test is how you choose correctly.

Specificity is generosity. A vague ICP leaves every reader uncertain whether this product serves them. A precise ICP tells the right people, “this was built for you, and you can see it.” That clarity converts.

This is the first entry in the Caliber Series, a paid column on building and selling AI tools. The next article breaks down how to validate your ICP in 48 hours using nothing but free tools and five conversations. Upgrade to access the full series.


Build Your Own VS Code AI Agent independently of GitHub Copilot

Michal Szalinski — Sat, 16 May 2026 21:37:49 GMT

Your GitHub Copilot subscription hits $10/month. The completions feel sluggish when your internet connection drops. The model suggestions lean generic, trained on everything yet tuned for none of your codebase’s specific patterns.

You’re paying monthly for AI assistance while being locked into Microsoft’s inference servers, data policies, and model choices. Meanwhile, your local machine sits idle with 32GB of RAM and a capable GPU.

What if you could build a VS Code extension that routes completions through your local AI models, processes requests in 200ms, and learns your coding patterns autonomously?

The Idea (60 Seconds)

You’ll create a custom VS Code extension using the Language Server Protocol to intercept completion requests and route them through local models like Ollama or LM Studio. The system provides real-time code suggestions, context-aware completions, and chat functionality while running entirely offline. Setup takes 30 minutes. The result replaces Copilot with a faster, customizable, cost-free alternative.

Why Local Models Instead of Cloud APIs

Latency drops to milliseconds. Cloud completions travel to Microsoft’s servers and back. Local inference happens on your machine. The difference between 800ms and 150ms changes how you code.

Context stays private. Your proprietary code remains on your hardware. Zero data leaves your network. Zero logs hit external servers. Your IP stays yours.

Customization becomes possible. You control the model, the prompts, and the training data. Fine-tune on your codebase. Adjust temperature for your preferences. Switch models per project.

Costs disappear. The subscription fee vanishes. Inference runs on hardware you already own. Usage scales with your machine’s capacity, not with monthly limits.

Walkthrough

Clone Needle: Build a 26M Parameter Tool-Calling Model

Michal Szalinski — Fri, 15 May 2026 05:06:48 GMT

Your production app calls OpenAI’s API 847 times per day. Each function call costs about $0.015. Over a 30-day month that is roughly 25,400 calls, and the bill hits $380. Your CFO asks pointed questions about “AI infrastructure costs” in the quarterly review.

Meanwhile, your tool-calling needs are embarrassingly simple. Parse JSON. Validate schemas. Route function calls. Extract parameters. A 175B parameter model feels like hiring a PhD to sort your mail.

What if you could distill those capabilities into a 26M parameter model that runs locally, costs zero per inference, and handles 90% of your tool-calling workload?

The Idea (60 Seconds)

You’ll build a lightweight tool-calling model by distilling Gemini’s function-calling behavior into a compact transformer. The 26M parameter model runs locally, processes tool calls in 50-100ms, and handles structured JSON output with schema validation. Training takes 4 hours on a single GPU. The result replaces expensive API calls for routine function routing and parameter extraction.

Why Distillation Instead of Fine-tuning

Training a small model directly on labeled examples gives it only hard targets, and a student initialized with random weights has to learn tool-calling from scratch. Distillation starts with a teacher model that already excels at function calls. You’re copying expertise instead of building it.

Data efficiency matters more than parameter count. Fine-tuning needs 50K+ examples to learn tool-calling patterns. Distillation works with 5K teacher-student pairs because the student learns from the teacher’s full output distribution, not just input-output mappings.

Gemini’s tool-calling is already production-tested. Google spent millions optimizing function call accuracy. Distillation captures that optimization in a model you own completely.

The math is simple: 5K distillation examples vs 50K fine-tuning examples. 4 hours vs 40 hours. $20 in compute vs $200.

Walkthrough

1. Generate Teacher-Student Data

Start by collecting Gemini’s tool-calling behavior across diverse function schemas:

# data_generation.py
import google.generativeai as genai
import json
import random  # used by random.choice below
from typing import List, Dict

class ToolCallDataGenerator:
    def __init__(self, api_key: str):
        genai.configure(api_key=api_key)
        self.model = genai.GenerativeModel('gemini-1.5-flash')
        
    def generate_function_call_data(self, schemas: List[Dict], num_examples: int = 5000):
        examples = []
        
        for i in range(num_examples):
            # Sample random function schema
            schema = random.choice(schemas)
            
            # Generate natural language request
            prompt = self.create_natural_prompt(schema)
            
            # Get Gemini's function call response
            response = self.model.generate_content(
                prompt,
                tools=[schema],
                tool_config={'function_calling_config': {'mode': 'ANY'}}
            )
            
            call = response.candidates[0].content.parts[0].function_call
            if call:
                examples.append({
                    'input': prompt,
                    'function_schema': schema,
                    # Plain name/args so the example can be serialized;
                    # response.text is omitted because it raises on function-call parts
                    'teacher_output': {'name': call.name, 'args': dict(call.args)}
                })
                
        return examples
    
    def create_natural_prompt(self, schema: Dict) -> str:
        # Generate varied natural language that would trigger this function
        function_name = schema['function']['name']
        
        templates = {
            'weather': [
                "What's the weather like in {city}?",
                "Check the forecast for {city}",
                "Is it raining in {city} today?"
            ],
            'calculator': [
                "Calculate {expression}",
                "What's {expression}?",
                "Solve {expression} for me"
            ]
        }
        
        # Fill templates with realistic data
        return self.fill_template(templates.get(function_name, ["Use the {function_name} function"]))
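
The listing calls fill_template without defining it. Below is a minimal sketch of what such a helper might look like, written as a standalone function. The sample values, the function_name argument, and the optional rng parameter are all assumptions made for illustration; placeholders without a sample value are left untouched rather than raising.

```python
import random
from typing import Dict, List, Optional

# Hypothetical sample values; swap in data that matches your own schemas.
SAMPLE_VALUES: Dict[str, List[str]] = {
    "city": ["San Francisco", "Warsaw", "Tokyo"],
    "expression": ["12 * 7", "(3 + 5) / 2", "2 ** 10"],
}


class _KeepUnknown(dict):
    """Leave placeholders that have no sample value as they are."""

    def __missing__(self, key: str) -> str:
        return "{" + key + "}"


def fill_template(templates: List[str], function_name: str = "",
                  rng: Optional[random.Random] = None) -> str:
    """Pick one template at random and fill its placeholders with sample values."""
    rng = rng or random.Random()
    template = rng.choice(templates)
    values = _KeepUnknown(
        {name: rng.choice(options) for name, options in SAMPLE_VALUES.items()}
    )
    if function_name:
        values["function_name"] = function_name
    return template.format_map(values)


print(fill_template(["What's the weather like in {city}?", "Check the forecast for {city}"]))
print(fill_template(["Use the {function_name} function"], function_name="get_weather"))
```

Passing a seeded random.Random makes the generated dataset reproducible, which helps when you compare training runs.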

2. Build the Student Model Architecture

Create a compact transformer optimized for tool-calling output:

# model.py
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForCausalLM

class ToolCallingModel(nn.Module):
    def __init__(self, vocab_size: int = 32000, d_model: int = 512, n_heads: int = 8, n_layers: int = 6):
        super().__init__()
        
        # 26M parameters: 6 layers, 512 hidden, 8 heads
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = nn.Parameter(torch.randn(2048, d_model))
        
        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads) for _ in range(n_layers)
        ])
        
        self.ln_final = nn.LayerNorm(d_model)
        self.output_head = nn.Linear(d_model, vocab_size)
        
        # Special tokens for function calling
        self.function_start_token = vocab_size - 4
        self.function_end_token = vocab_size - 3
        self.param_sep_token = vocab_size - 2
        
    def forward(self, input_ids, attention_mask=None):
        seq_len = input_ids.shape[1]
        
        # Embeddings + positional encoding
        x = self.embedding(input_ids) + self.pos_encoding[:seq_len]
        
        # Transformer layers
        for block in self.transformer_blocks:
            x = block(x, attention_mask)
            
        x = self.ln_final(x)
        return self.output_head(x)
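
TransformerBlock is not defined in the listing, so the true size of this model depends on how you implement it. The sketch below is a back-of-the-envelope count that assumes a standard block: four attention projections with biases, a two-layer MLP with a 4x hidden width, and two LayerNorms. count_parameters is a hypothetical helper. Under those assumptions the defaults come out well above 26M, so reaching that target would likely need some mix of a smaller vocabulary, tied input and output embeddings, or a narrower MLP.

```python
def count_parameters(vocab_size: int = 32000, d_model: int = 512,
                     n_layers: int = 6, max_positions: int = 2048,
                     mlp_ratio: int = 4, tie_output_head: bool = False) -> int:
    """Estimate trainable parameters for the architecture sketched above."""
    embedding = vocab_size * d_model
    positions = max_positions * d_model
    # Q, K, V and output projections, each d_model x d_model plus a bias
    attention = 4 * (d_model * d_model + d_model)
    hidden = mlp_ratio * d_model
    # Up-projection and down-projection, each with a bias
    mlp = (d_model * hidden + hidden) + (hidden * d_model + d_model)
    layer_norms = 2 * (2 * d_model)  # two LayerNorms per block, weight + bias each
    block = attention + mlp + layer_norms
    final_norm = 2 * d_model
    # With tied embeddings the head reuses the embedding matrix and adds only a bias
    head = vocab_size if tie_output_head else d_model * vocab_size + vocab_size
    return embedding + positions + n_layers * block + final_norm + head


print(f"{count_parameters():,}")                      # 52,763,904
print(f"{count_parameters(tie_output_head=True):,}")  # 36,379,904
```

The two embedding-sized matrices account for more than half of the untied total, which is why vocabulary size matters more than depth at this scale.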

3. Implement Knowledge Distillation Training

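Train the student to match both the teacher’s hard outputs and its softened output distribution. This requires a teacher whose logits you can read and whose vocabulary matches the student’s: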

# distillation_trainer.py
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationTrainer:
    def __init__(self, student_model, teacher_model, tokenizer):
        self.student = student_model
        self.teacher = teacher_model
        self.tokenizer = tokenizer
        
        # Distillation hyperparameters
        self.temperature = 4.0
        self.alpha = 0.7  # Weight for distillation loss
        self.beta = 0.3   # Weight for hard target loss
        
    def distillation_loss(self, student_logits, teacher_logits, hard_targets):
        # Soft target loss (knowledge distillation)
        soft_loss = nn.KLDivLoss(reduction='batchmean')(
            F.log_softmax(student_logits / self.temperature, dim=-1),
            F.softmax(teacher_logits / self.temperature, dim=-1)
        ) * (self.temperature ** 2)
        
        # Hard target loss (actual function calls)
        hard_loss = F.cross_entropy(
            student_logits.view(-1, student_logits.size(-1)),
            hard_targets.view(-1),
            ignore_index=-100
        )
        
        return self.alpha * soft_loss + self.beta * hard_loss
    
    def train_step(self, batch):
        input_ids = batch['input_ids']
        function_call_targets = batch['function_call_targets']
        
        # Get teacher predictions (no gradients)
        with torch.no_grad():
            teacher_logits = self.teacher(input_ids).logits
            
        # Get student predictions
        student_logits = self.student(input_ids)
        
        # Calculate distillation loss
        loss = self.distillation_loss(
            student_logits, 
            teacher_logits, 
            function_call_targets
        )
        
        return loss
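
To see what the loss is doing, here is the same formula for a single token position in plain Python, using the hyperparameters from the trainer (temperature 4.0, alpha 0.7, beta 0.3). It is an illustrative sketch rather than a replacement for the PyTorch version. It also makes one constraint visible: the soft term needs the teacher's logits, so a hosted API that returns only the final function call can supply hard targets but not this term.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: List[float], teacher_logits: List[float],
                      target_index: int, temperature: float = 4.0,
                      alpha: float = 0.7, beta: float = 0.3) -> float:
    # Soft loss: KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 so its gradient magnitude stays comparable to the hard loss
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0)
    soft *= temperature ** 2
    # Hard loss: cross-entropy against the correct token at temperature 1
    hard = -math.log(softmax(student_logits)[target_index])
    return alpha * soft + beta * hard


teacher = [4.0, 1.0, 0.5]
print(distillation_loss([4.0, 1.0, 0.5], teacher, target_index=0))  # student agrees with teacher
print(distillation_loss([0.5, 1.0, 4.0], teacher, target_index=0))  # student disagrees
```

When the student's logits equal the teacher's, the soft term is zero and only the weighted hard loss remains; the second call shows both terms growing as the student drifts away.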

4. Create the Inference Server

Build a FastAPI server that handles tool calls with JSON schema validation:

# inference_server.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Dict, List  # used in the request/response models below
import torch
import json

app = FastAPI()

class ToolCallRequest(BaseModel):
    prompt: str
    available_functions: List[Dict]
    max_tokens: int = 150

class ToolCallResponse(BaseModel):
    function_name: str
    parameters: Dict
    confidence: float

@app.post("/tool-call", response_model=ToolCallResponse)
async def generate_tool_call(request: ToolCallRequest):
    try:
        # Tokenize input with function schemas
        input_text = format_prompt_with_schemas(request.prompt, request.available_functions)
        tokens = tokenizer.encode(input_text, return_tensors='pt')
        
        # Generate function call
        with torch.no_grad():
            output = model.generate(
                tokens,
                max_length=tokens.shape[1] + request.max_tokens,
                temperature=0.1,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id
            )
        
        # Parse function call from output
        generated_text = tokenizer.decode(output[0][tokens.shape[1]:], skip_special_tokens=True)
        function_call = parse_function_call(generated_text)
        
        # Validate against schema
        validate_function_call(function_call, request.available_functions)
        
        return ToolCallResponse(
            function_name=function_call['name'],
            parameters=function_call['parameters'],
            confidence=calculate_confidence(output)
        )
        
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
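
The server calls `parse_function_call` and `validate_function_call` without showing them. A minimal sketch of what they might look like follows, assuming the model emits a JSON object with `name` and `parameters` keys (the shape the response model reads). Both bodies are illustrative, not the original implementation, and the validation covers only unknown names, unexpected parameters, enums, and `required`.

```python
import json

def parse_function_call(generated_text: str) -> dict:
    """Extract the first JSON object from model output and check its shape."""
    start = generated_text.find("{")
    if start == -1:
        raise ValueError("no JSON object in model output")
    # raw_decode parses one JSON value and ignores any trailing text
    call, _ = json.JSONDecoder().raw_decode(generated_text[start:])
    if not isinstance(call, dict) or "name" not in call:
        raise ValueError("function call must be an object with a 'name'")
    if not isinstance(call.setdefault("parameters", {}), dict):
        raise ValueError("'parameters' must be an object")
    return call

def validate_function_call(call: dict, available_functions: list) -> None:
    """Raise ValueError if the call does not match a declared function schema."""
    schemas = {f["name"]: f.get("parameters", {}) for f in available_functions}
    if call["name"] not in schemas:
        raise ValueError(f"unknown function: {call['name']}")
    schema = schemas[call["name"]]
    properties = schema.get("properties", {})
    for key, value in call["parameters"].items():
        if key not in properties:
            raise ValueError(f"unexpected parameter: {key}")
        allowed = properties[key].get("enum")
        if allowed is not None and value not in allowed:
            raise ValueError(f"{key} must be one of {allowed}")
    for key in schema.get("required", []):
        if key not in call["parameters"]:
            raise ValueError(f"missing required parameter: {key}")
```

For full JSON Schema coverage (types, nested objects, formats), a dedicated validator library would be a better fit than this hand-rolled check.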

5. CLI Interface

# Install and run
pip install torch transformers fastapi uvicorn

# Start the server (the script defines `app` but has no __main__ entry point)
uvicorn inference_server:app --port 8000

# Test a function call
curl -X POST "http://localhost:8000/tool-call" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is the weather in San Francisco?",
    "available_functions": [{
      "name": "get_weather",
      "parameters": {
        "type": "object",
        "properties": {
          "city": {"type": "string"},
          "units": {"type": "string", "enum": ["celsius", "fahrenheit"]}
        }
      }
    }]
  }'

Caveats

Complex reasoning fails. The 26M model handles straightforward parameter extraction and function routing. Multi-step reasoning, ambiguous queries, and edge cases still need the teacher model or GPT-4.

Schema validation is strict. The model learns patterns from training data. Novel function schemas or unusual parameter types can break inference. Keep a fallback to cloud APIs for schema mismatches.

Training data quality determines ceiling performance. Bad teacher examples create bad student behavior. Gemini occasionally generates malformed function calls. Clean your distillation dataset aggressively.

Performance benchmarks from my testing: 87% accuracy on single-function calls, 72% on multi-function scenarios, 15ms average inference time on RTX 4090.

Philosophy

Tool-calling models represent the future of local AI inference. Most production applications need structured output, parameter extraction, and API routing. These tasks require precision over creativity.

The distillation approach captures expert behavior in compact models you control completely. Zero API dependencies. Zero per-inference costs. Zero data leaving your infrastructure.

This pattern extends beyond tool-calling. Distill code generation, text classification, structured data extraction. Build a library of specialized models that replace expensive API calls with fast local inference.

The 26M parameter model becomes your function-calling foundation. Expand it. Specialize it. Deploy it everywhere.

Build Your Clone

Start with the data generation script above. Collect 5K Gemini examples across your target function schemas. Train the distillation model on a single GPU for 4 hours. Deploy the inference server.

Your tool-calling costs drop to zero. Your inference speed increases 10x. Your data stays local.

What function-calling use case will you tackle first?


Software 3.0 Is Not a Silver Bullet: Why Engineering Expertise Still Wins with LLMs

Michal Szalinski — Mon, 04 May 2026 03:18:41 GMT

Andrej Karpathy calls it Software 3.0. YC calls it the biggest shift since high-level languages. The framing is seductive: LLMs are the new operating system, prompting is the new programming, and anyone can build software now.

Except they can’t.

I’ve watched the same pattern repeat for a while. A non-engineer pastes a vague request into ChatGPT, gets something that looks plausible, ships it, and watches it fall apart under real conditions. Meanwhile, I take the same starting point and end up with a robust CLI that runs autonomously on a cron job, handles edge cases, and compounds value every day.

Same models. Same access. Radically different outcomes. The difference isn’t the prompt. It’s the engineering.


The Idea (60 Seconds)

Software 3.0 is real. LLMs are a new kind of runtime. But the “anyone can code now” narrative is incomplete. What Karpathy and the YC lecture get right is the mental model shift: LLMs as operating systems, not smarter search engines. What they understate is how much engineering discipline that OS still demands.

The LLM is a lossy runtime. It hallucinates. It forgets context. It produces plausible garbage. Treating it like a magic oracle gets you slop. Treating it like a savant with cognitive issues (brilliant but unreliable) and engineering around those limitations is what separates working solutions from demo toys.

The engineers who master this don’t just write better prompts. They build systems around the LLM: scaffolding, verification loops, fallback chains, state management, and evaluation frameworks. The prompt is the interface. The engineering is the product.

Why Non-Engineers Produce Slop

When a non-engineer asks an LLM to “build me a tool that extracts data from a website and sends a daily email,” they get a script. It works. Once. On the happy path. With the data formatted exactly the way they showed in their example.

When I build the same thing, here’s what happens in my head before I type a single prompt:

How does this fail when the website changes its layout?
What happens when the API returns empty data instead of the expected format?
Where does state live so we can detect anomalies?
How do we make this idempotent so re-running doesn’t send duplicate emails?
What’s the time budget for each operation, and what’s the hard ceiling?

These aren’t prompt engineering tricks. These are engineering instincts: problem decomposition, error handling, state management, verification. The LLM doesn’t eliminate the need for these. It accelerates the implementation but makes the consequences of skipping them more dangerous, because now your broken code runs automatically.
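
As one concrete instance of those instincts, here is a minimal sketch of the idempotency question: a send wrapper that records what it has already delivered so a re-run skips duplicates. The function name, the JSON ledger file, and the `send` callable are all hypothetical stand-ins for whatever mail client and state store you actually use.

```python
import hashlib
import json
from pathlib import Path

def send_once(recipient, report_date, body, send, ledger_path):
    """Call `send` only if this (recipient, date) pair has not been sent before."""
    ledger_file = Path(ledger_path)
    sent = set(json.loads(ledger_file.read_text())) if ledger_file.exists() else set()

    # One key per recipient per day: re-running the job maps to the same key
    key = hashlib.sha256(f"{recipient}|{report_date}".encode()).hexdigest()
    if key in sent:
        return False  # already delivered on a previous run

    send(recipient, body)
    sent.add(key)
    ledger_file.write_text(json.dumps(sorted(sent)))
    return True
```

The ledger is written after the send, so a crash between the two steps can still produce one duplicate. Whether to prefer a possible duplicate or a possible missed email is a design choice the sketch leaves open.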

The YC lecture nails this indirectly: “you have to design around their limitations rather than expecting perfect, human-like reliability.” That’s engineering. That’s always been engineering. The medium changed. The discipline didn’t.

This is why I believe there isn’t going to be a drop in demand for engineering jobs. In fact, there is sufficient evidence from past events (such as the industrial revolution, the internet, digitisation of everything) that greater capability almost always demands greater supply of skills. People just need to adapt to new ways of working.

The LLM as Lossy Runtime

Karpathy’s operating system metaphor is useful, but it breaks down at the edges. An OS has defined syscalls, documented error codes, and deterministic behavior for the same inputs. An LLM has none of these.

A better mental model: the LLM is a brilliant but unreliable non-deterministic co-processor. It can execute tasks that would take you hours in seconds, but it sometimes returns wrong results with absolute confidence. Your job as the engineer isn’t to write the perfect prompt that prevents errors. It’s to build the harness that catches them.

Here’s what that harness looks like in practice.

Technique 1: Structured Scaffolding, Not Free-Form Prompts

The single biggest upgrade from slop to production: never accept free-form LLM output for anything you plan to use programmatically.

When you ask “write me a CLI,” you get markdown, prose explanations, and code in varying states of completeness. When you ask for a specific JSON schema, you get a parseable artifact you can pipe into your build system.

You are an expert Python engineer. Always respond in valid JSON with this exact schema:

{
  "thinking": "step-by-step reasoning and trade-offs",
  "code": "complete, ready-to-run code with inline comments",
  "explanation": "usage, edge cases, and assumptions",
  "tests": ["list of test cases or pytest snippets"]
}

Only output the JSON. No markdown. No prose outside the schema.

This is the LLM equivalent of typed function signatures. You wouldn’t write a production API that returns unstructured text. Don’t let your LLM do it either.

Pair this with your provider’s structured-output support (for example, structured outputs in the OpenAI API or a tool-use schema in Claude), and you’ve turned a probabilistic text generator into a data pipeline with a predictable shape, at least at the interface boundary.
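
On the receiving side, the schema only helps if you enforce it. A minimal sketch of a parser for the four-field schema above, using only the standard library; the function name and error messages are my own, and a schema library would do the same job with less code.

```python
import json

# Field names come from the prompt's schema; the types are the obvious reading
REQUIRED_FIELDS = {"thinking": str, "code": str, "explanation": str, "tests": list}

def parse_structured_reply(raw: str) -> dict:
    """Parse a model reply and enforce the four-field schema from the prompt."""
    reply = json.loads(raw)  # raises json.JSONDecodeError on non-JSON output
    if not isinstance(reply, dict):
        raise ValueError("reply must be a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in reply:
            raise ValueError(f"missing field: {field}")
        if not isinstance(reply[field], expected_type):
            raise ValueError(f"{field} must be {expected_type.__name__}")
    return reply
```

Anything that raises here never reaches your build system, which is the point of treating the schema like a typed function signature.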

Technique 2: Self-Critique Loops

The generate-once-and-ship pattern is how you get slop. The generate-critique-revise pattern is how you get quality.

First, plan the solution step-by-step. Consider architecture,
error handling, and UX.

[After getting initial output]

Now critique the above as a senior staff engineer. Score 1-10 on:
correctness, robustness, usability, maintainability.
List specific fixes. Then output a revised version in the same
JSON schema.

Two or three cycles of this consistently produce dramatically better output. The LLM is excellent at critiquing its own work when given clear evaluation criteria; it just won’t do it unprompted.

This mirrors how senior engineers actually write code: draft, review, revise. The difference is the cycle time. What takes a human hours takes the LLM seconds. The bottleneck shifts from writing to evaluating.
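
The loop itself is a few lines of ordinary code. This sketch assumes a hypothetical `call_llm` function that takes a prompt string and returns the model's text; swap in whichever client you use. The critique wording is condensed from the prompt above.

```python
from typing import Callable

CRITIQUE_PROMPT = (
    "Critique the draft below as a senior staff engineer. Score 1-10 on "
    "correctness, robustness, usability, maintainability. List specific "
    "fixes, then output a revised version in the same JSON schema.\n\n"
)

def generate_with_critique(task: str, call_llm: Callable[[str], str], cycles: int = 2) -> str:
    """Draft once, then run `cycles` critique-and-revise passes."""
    draft = call_llm(task)
    for _ in range(cycles):
        # Each pass feeds the previous draft back with the evaluation criteria
        draft = call_llm(CRITIQUE_PROMPT + draft)
    return draft
```

A fixed cycle count keeps cost predictable. Stopping early once the self-assigned scores cross a threshold is a possible refinement, at the price of trusting the model's own scoring.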

Technique 3: Context Packaging as Modular Architecture

Most people paste their entire codebase into a prompt and hope for the best. Engineers treat context like code modules: structured, versioned, and reusable.


[paste relevant code, docs, or previous outputs here]



- Use Typer + Rich
- Zero unnecessary dependencies
- Full --help support
- Graceful error handling



[Specific request]

XML tags aren’t magic. But they give the LLM clear boundaries between different types of information. The model processes tagged sections more reliably than unstructured walls of text.

This scales. Maintain a library of context blocks: your project’s architecture, your coding conventions, your error handling patterns. Swap them in and out as needed. This is the LLM equivalent of import statements.
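
A small sketch of what that library can look like in code. The tag names `context`, `constraints`, and `task` are illustrative assumptions, as are the helper names; the constraint list is the one from the example above.

```python
def tag(name: str, body: str) -> str:
    """Wrap a block of text in an XML-style tag pair."""
    return f"<{name}>\n{body.strip()}\n</{name}>"

def build_prompt(blocks: dict, task: str) -> str:
    """Assemble reusable context blocks, then append the specific request."""
    sections = [tag(name, body) for name, body in blocks.items()]
    sections.append(tag("task", task))
    return "\n\n".join(sections)

# A reusable block you might keep in version control next to the code
CONSTRAINTS = """
- Use Typer + Rich
- Zero unnecessary dependencies
- Full --help support
- Graceful error handling
"""

prompt = build_prompt(
    {"context": "def main(): ...", "constraints": CONSTRAINTS},
    "Add a --json output flag",
)
print(prompt)
```

Because each block is a plain string keyed by name, swapping one in or out is a dictionary edit, not a prompt rewrite.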

Technique 4: Model Routing, Not Model Loyalty

No single model is best at everything. The engineers getting the most from LLMs aren’t loyal to one model — they route tasks based on what each model does best.

In my daily pipeline:

Fast model (Gemini Flash, Haiku) for data extraction, formatting, and routine summarization
Strong model (Claude Opus) for judgment calls — clinical pattern recognition, nuanced coaching recommendations, anything where missing a subtle signal costs more than the extra latency

The routing logic itself is deterministic code. The models are interchangeable components. When a new model drops that’s faster or cheaper, I swap it in for the appropriate tier and verify against my test suite.

This is operating system thinking. You don’t write code that only runs on one processor. You write against abstractions and let the runtime choose the best execution path.
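
The router can be as plain as a lookup table. In this sketch the task-type labels and the model identifiers are placeholders I made up; the two tiers mirror the fast/strong split described above.

```python
# Task type -> tier. The routing is deterministic code, not a model decision.
ROUTES = {
    "extraction": "fast",
    "formatting": "fast",
    "summarization": "fast",
    "judgment": "strong",
}

# Tier -> model identifier. Placeholder IDs: swap in real ones per provider.
MODELS = {"fast": "fast-model-id", "strong": "strong-model-id"}

def pick_model(task_type: str) -> str:
    """Return the model ID for a task; unknown task types get the strong tier."""
    tier = ROUTES.get(task_type, "strong")
    return MODELS[tier]
```

Sending unknown task types to the strong tier is an assumption that errs on the side of quality over cost. Swapping in a new model means changing one entry in `MODELS` and re-running the test suite.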

Technique 5: Verification as Architecture

The YC lecture mentions speeding up the “generation → verification loop.” That undersells it. Verification isn’t a loop; it’s the architecture.

Every LLM output in my production system goes through verification before it’s used:

Schema validation — does the JSON match the expected structure?
Content validation — are the key fields populated? Are values in plausible ranges?
State validation — does this output contradict our stored history? (A client who had 33 check-ins yesterday doesn’t have one today.)
Quality validation — is the report complete? Does it have all 5 required sections?

If any validation fails, the system doesn’t silently accept bad output. It retries with a different approach, falls back to stored data, or escalates to a human. The LLM is just one component in a larger system with defined invariants.

This is the engineering discipline that separates “I built a cool demo” from “this runs my business while I sleep.”
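
A compressed sketch of the four checks as one function. The field names `checkins` and `sections` are hypothetical, and the state check assumes the check-in count is cumulative, so it should never go down between runs.

```python
import json

REQUIRED_SECTIONS = 5  # the report is expected to have five sections

def verify_report(raw: str, previous_checkins: int) -> list:
    """Return a list of failures; an empty list means the output is usable."""
    # Schema validation: is it JSON with the expected top-level shape?
    try:
        report = json.loads(raw)
    except json.JSONDecodeError:
        return ["schema: output is not valid JSON"]
    if not isinstance(report, dict):
        return ["schema: output is not a JSON object"]

    failures = []
    checkins = report.get("checkins")
    if not isinstance(checkins, int) or checkins < 0:
        # Content validation: key field present and in a plausible range
        failures.append("content: checkins must be a non-negative integer")
    elif checkins < previous_checkins:
        # State validation: contradicts what we stored on the last run
        failures.append("state: check-in count went down since the last run")

    sections = report.get("sections")
    if not isinstance(sections, list) or len(sections) != REQUIRED_SECTIONS:
        # Quality validation: is the report complete?
        failures.append(f"quality: expected {REQUIRED_SECTIONS} sections")
    return failures
```

The caller decides what a non-empty list means: retry, fall back to stored data, or escalate to a human.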

The Skills Layer: Where It All Compounds

Here’s where Software 3.0 genuinely shifts the game. Individual prompts are disposable. Skills are compound interest.

A skill is a reusable, versioned, self-contained workflow that packages:

The system prompt (personality, constraints, output format)
The context (domain knowledge, project conventions)
The tools (what the agent can call)
The verification (what “done” looks like)
The fallbacks (what happens when things break)

I have a skill that analyzes fitness check-in data and generates coaching reports. It took 50+ iterations to get right. Now it runs twice daily on a cron job, processing 18 clients autonomously. Each run costs about $0.30 in LLM tokens. The equivalent human effort would be 4+ hours of a coach’s time.

The skill is the artifact. The prompts that built it are long gone. This is the shift from “prompting” to “engineering”: you’re not optimizing a single interaction, you’re building a system that improves with every edge case you handle and every failure mode you fix.
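
To make the five parts concrete, here is one way a skill could be modelled in code. This is a sketch of the idea, not how any particular agent framework stores skills; the class, its field names, and `run_skill` are all hypothetical, and `call_llm` stands in for your model client.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Skill:
    name: str
    version: str
    system_prompt: str   # personality, constraints, output format
    context: str         # domain knowledge, project conventions
    tools: tuple = ()    # names of the tools the agent may call
    # What "done" looks like: by default, any non-empty output
    verify: Callable[[str], bool] = lambda output: bool(output.strip())
    # What happens when things break: by default, return an empty string
    fallback: Callable[[], str] = lambda: ""

def run_skill(skill: Skill, call_llm: Callable[[str], str], request: str) -> str:
    """Run one request through the skill and fall back if verification fails."""
    output = call_llm(f"{skill.system_prompt}\n\n{skill.context}\n\n{request}")
    return output if skill.verify(output) else skill.fallback()
```

Every edge case you discover becomes a tighter `verify` or a better `fallback`, and the `version` field records that the skill changed.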

Non-engineers can create impressive one-off outputs. Engineers create systems that compound.

The Honest Truth About Software 3.0

Karpathy is right that we’re in the 1960s of LLMs. The 1960s weren’t democratic. Assembly language existed, but the people who built reliable systems were the ones who understood memory management, error handling, and hardware constraints.

We’re in the same place now, just with natural language instead of assembly. The barrier to entry is lower: you can produce something that runs on your first try. But the barrier to reliability hasn’t moved. Building something that handles edge cases, degrades gracefully, and runs unattended for months still requires engineering discipline.

The tools will improve. Context windows will grow. Models will get more reliable. The verification loops will tighten. But the core skill (treating the LLM as a brilliant, unreliable component in a larger system you engineer) will remain the differentiator.

Software 3.0 doesn’t democratize great software. It supercharges the people who already know how to build it. The gap between engineers and non-engineers isn’t closing; it’s widening, because engineers are building compound systems while everyone else is still optimizing single prompts.

The engineers who master this now, who build skills, design verification architectures, and treat the LLM as a lossy runtime to engineer around, will define the next decade of software.

Everyone else will keep wondering why their “automated” workflow broke again this morning.

Building production AI automation? ArchonHQ gives you the skills architecture, verification frameworks, and orchestration tools to turn LLMs from toys into reliable systems. Stop prompting. Start engineering.

Automate Anything with AI Skills and CLIs - Your New Superpower in 2026

Michal Szalinski — Fri, 17 Apr 2026 07:04:19 GMT

You’ve seen it. A fitness coach spending three hours every Friday night reviewing client checkins, scrolling through twenty clients, each with weight data, waist measurements, sleep scores, nutrition compliance, workout logs, injury reports, and subjective mood ratings. Copy the numbers. Compare to last week. Write the feedback. Send the email. Next client. Repeat until midnight.

That’s data entry with a human in the loop.

This pattern repeats across every small business: a SaaS platform that holds your data hostage, a manual review process that scales linearly with clients, and an expert whose time gets eaten by the 80% of the work that’s pattern-matching rather than judgment.

Here’s what I built to fix it, and the pattern you can copy for almost any repetitive knowledge workflow.

The Idea (60 Seconds)

You can automate almost any repetitive workflow by stacking five layers:

Reverse-engineer the data source (even absent a formal API)
Build a CLI that extracts and structures the data
Use an LLM to do the analysis a human used to do manually
Package it as a skill so your AI agent can repeat the process reliably
Schedule it with cron so it runs on autopilot

Most people stop at layer one: they ask ChatGPT a question and get an answer. That’s a conversation. Automation is when the system runs autonomously.

I’m going to show you exactly how I built all five layers for a fitness coaching business. The platform lacked a public API. The data was locked behind a login screen. The analysis required professional domain knowledge. And the whole thing needed to run daily and email reports to a real coach with real clients.

The result: 75% less manual work. A coach who used to spend 12+ hours per week reviewing checkins now spends 2 hours scanning AI-generated reports and producing videos based on the analysis for their clients. Instant value-add, zero duplication.

Why Skills and CLIs Are the Underrated Superpower of 2026

Everyone’s talking about AI chatbots and agents. The real unlock goes unmentioned.

The AI maturity ladder looks like this:

Level | What You Do | Runs Without You?
1. Chat | Ask a question, get an answer | No
2. Prompt Library | Reuse tested prompts | No
3. CLI | Script that takes arguments and runs | Yes (manually triggered)
4. Skill | Packaged workflow your AI agent can load and execute | Yes (agent-triggered)
5. Cron | Scheduled autonomous execution | Yes (fully automatic)

Most people are at level 1 or 2. They have folders full of prompts. They paste them into ChatGPT and copy the output. That’s better than a blank page, but it plateaus.

A CLI compounds. You build it once, debug it, and it works forever. You can pipe data into it. You can chain it with other CLIs. You can schedule it.

A skill compounds harder. A skill is a SKILL.md file that packages your entire workflow: triggers, inputs, steps, gotchas, and all the hard-won lessons from debugging. Your AI agent reads it and knows exactly what to do. Every bug you fix, every edge case you discover, gets baked in permanently. The skill is the artifact. The automation is the side effect.

Cron is the payoff. When your CLI runs on a schedule autonomously, you’ve shipped automation. Production.

This is the pattern most people are missing in 2026. They’re using AI to answer questions when they should be using it to replace themselves.

Reverse-Engineering as an AI Superpower

Here’s the uncomfortable truth: the most valuable AI skill in 2026 is reverse-engineering, outranking even prompting.

Most business data is locked in SaaS apps lacking an export button, API documentation, or webhooks. The vendor wants you inside their walled garden. Your data is their leverage.

AI makes the downstream analysis trivially easy: pass data to an LLM, get insights back. But the upstream problem hasn’t changed: you still need to get the data out first. And that requires a craft skill that most “AI practitioners” have yet to develop.

I’m going to walk through the exact reverse-engineering process I used on Kahunas.io, a fitness coaching platform with zero public API documentation. These techniques work on almost any web app.

Build with AI Like a Professional Engineering Team

Michal Szalinski — Tue, 14 Apr 2026 06:43:38 GMT

The AI software slop problem

You’ve seen it. Maybe you’ve done it. Open a coding agent, type “build me a SaaS app,” and watch it spit out 2,000 lines of code in twelve seconds. It compiles. It even runs. You ship it.

Three weeks later you’re debugging why the auth middleware silently skips validation on expired tokens. The database queries have no error handling. There are hardcoded API keys in config files. The tests, if they exist, test that functions return something, not that they return the right thing. Nobody reviewed the architecture because there was no architect. Nobody caught the security holes because there was no security review. Nobody checked if the code actually matched the spec because there was no spec.

This is the AI software slop problem. The code looks professional. It reads like professional code. But it was written by a single agent working in isolation with no oversight, no review gates, and no engineering process. It’s the software equivalent of a kid copying homework: the answers look right until you check the working.

The fix isn’t better prompts. The fix isn’t a smarter model. The fix is the same thing that stopped human developers from shipping garbage: process and specialisation. You need an engineering team, not a code fountain.

In my tests, the same issue exists no matter what coding agent or model you use. I tried building fairly sizable projects using Claude Code and Opus 4.6, Codex and GPT-5.4 or GPT-5.3-codex. I tried OpenCode with Kimi-K2.5, Minimax-M2.7 and Droid Agent with GLM-5.1. Some agents are marginally better than others because they have a built-in harness. However, until you really sit down and map what needs to happen from an engineering delivery and quality assurance point of view, you’re just a kid with crayons painting pretty pictures that are completely unmaintainable. Great for an MVP or testing an idea, but not great for delivering production-grade software.

This harness and approach changes that.

The Basic Idea

Professional software teams don’t work the way most people use AI coding agents.

In a real team, you have an architect who designs the system and writes Architecture Decision Records, but never writes implementation code. You have a planner who breaks features into phased delivery plans with exact file paths and dependencies. You have developers who write code test-first and refuse to touch files outside their scope. You have code reviewers who catch what the developer missed. You have a security reviewer who runs OWASP Top 10 checks and issues a BLOCK/WARN/PASS verdict. You have a TDD (Test Driven Development) guide who enforces red-green-refactor. You have a database reviewer, an E2E test runner, a refactoring specialist.

Each role has a narrow, well-defined responsibility. Each role has hard boundaries: things it will not do. The architect never writes code. The security reviewer never edits files. The developer never adds helpers that aren’t in the spec.

What if you could give your AI coding agent the same structure?

That’s what ai-dev-harness does. It’s an open-source framework that installs a complete set of agent profiles, skills, lifecycle policies, and working rules into your project. Your coding agent doesn’t just write code; it works within an engineering system that keeps the code honest.

Why This Setup and What You Get

The harness gives you three things that solo AI coding doesn’t:

1. Agent profiles with hard boundaries. Fourteen specialist roles, each with explicit permissions and restrictions. The architect reads code and produces ADRs; its tool access is limited to read, grep, and glob. The code-agent writes code and runs tests, but must declare a scope boundary before touching anything. The security-reviewer scans for vulnerabilities and produces a structured report with a verdict, but never edits a file. The planner creates phased implementation plans with exact file paths, but never writes code. These aren’t soft suggestions: they’re enforced by the agent’s tool configuration and written into each profile’s “What NOT to Do” section.

2. Skills that encode engineering practices. Eighteen skills covering the full development lifecycle: api-design, backend-patterns, frontend-patterns, database-migrations, deployment-patterns, docker-patterns, security-review, e2e-testing, tdd-workflow, verification-loop, coding-standards, golang-patterns, python-patterns, postgres-patterns, and more. Each skill activates contextually: the tdd-workflow skill triggers when writing new features or fixing bugs, the verification-loop skill triggers after completing significant changes, and the security-review skill triggers when handling auth or data protection.

3. A lifecycle policy that routes work to the right agent. A TOML configuration maps ticket types to specialist profiles. A new feature goes to the code-agent. A code quality gap goes to the code-reviewer. A security concern goes to the security-reviewer. A database change goes to the database-reviewer. You don’t have to remember which agent does what; the harness routes it automatically.

Together, these three layers mean your AI agent works within a system and has standards to follow, review gates to pass, and specialist roles to lean on. The result is code that survives contact with production, because it was built like production code.

The Architecture Principles Every Project Starts With

Before you write a single line of code, before you even spec out a feature, there are principles that should be baked into every solution. These aren’t optional. They’re not “nice to haves” you add later. They’re the difference between a system that works and a system that keeps working.

The harness encodes these into its agent profiles, skills, and working rules. But even if you never use the harness, these are the non-negotiables. Print them out. Stick them on the wall. Reference them in every code review. If your project violates one of these, you need a damn good reason documented in an ADR.

1. Modularity, single responsibility, high cohesion, low coupling. Every file does one thing. Every function does one thing. If you can’t describe what a file does in one sentence, it does too much. Target 200-400 lines per file. Hard stop at 800. When a file crosses 800 lines, it’s telling you it has too many responsibilities. Split it.

2. Explicit error handling, no silent failures, ever. Every async call has error handling. Every external dependency call has a try/catch or equivalent. No silent catch blocks. No swallowing errors and continuing. If something fails, the system knows it failed, logs it, and responds appropriately. A silent failure is a lying system. Lying systems kill people’s data.

3. Input validation at boundaries. Validate all external input at the system boundary (API endpoints, message consumers, file parsers) before it reaches business logic. Internal code trusts internal data. External code trusts nothing. This is where injection attacks live. Validate early, validate once, validate completely.

4. Immutability by default across async boundaries. Shared mutable state across async boundaries is a bug factory. Data that crosses an async boundary should be immutable: copied, not referenced. If two concurrent processes can modify the same object, you will get race conditions. Not maybe. Will.

5. Security, defense in depth, least privilege. Authentication on every protected route. Authorization checked on every operation. No hardcoded secrets, all config via environment variables. No PII in logs. Errors sanitized before reaching clients. Security is not a layer you add at the end. It’s a constraint you design for from the start.

6. Stateless services where possible. Design for horizontal scaling from day one. If your service holds session state, it can’t be scaled by adding instances. Push state to the client or to a dedicated state store. Stateless services are easier to deploy, easier to scale, easier to recover. Design for 10x before needing 100x.

7. API contracts before implementation. Define your API surface first: method, path, request body, response body, auth requirements, error responses. Write it down. Share it. Build against it. This is the contract between your frontend and backend, between your services, between your team. If the contract is wrong, the implementation doesn’t matter.

8. Test-driven development, tests before code, always. Write failing tests that describe the expected behaviour. Then write the minimum code to make them pass. Then refactor. Red-green-refactor. Every time. No exceptions. Target 80% coverage on branches, functions, and lines. If coverage is below threshold, write more tests; never lower the threshold.

9. Phased delivery, each phase independently mergeable. Break work into phases that can ship on their own. M0: repo and infrastructure. M1: foundation. M2: features. M3: quality and operations. If a phase can’t be merged without the next phase, the plan is wrong. This isn’t bureaucracy; it’s how you keep the blast radius of any bug bounded.

10. No hardcoded values, config is external. API URLs, feature flags, timeout values, retry counts, rate limits: these all change between environments. Hardcode them and you’re deploying code to change a timeout. Externalize them and you’re editing a config file. Every magic number in your codebase is a deployment risk.

11. Consistent patterns over clever solutions. When the existing codebase uses a pattern, use that pattern. Don’t introduce a “better” approach that only you understand. Consistency beats cleverness every time. If the pattern is genuinely wrong, write an ADR, get agreement, then change the pattern everywhere. Don’t leave two patterns coexisting.

12. Logging, security-sensitive operations are always logged. Auth attempts, permission changes, data access, payment operations. If it’s security-sensitive and it’s not logged, you have no forensics when something goes wrong. Log the operation, the actor, the timestamp, and the outcome. Not the sensitive data itself, only the fact that the operation happened.

13. Dependency management, know what you depend on. Run dependency audits regularly. No high or critical CVEs in your dependencies. Pin your versions. Know what each dependency does and why it’s there. A dependency you don’t understand is a supply-chain attack vector you can’t defend against.

14. Documentation stays in sync with code. Stale documentation is worse than no documentation because it’s actively misleading. When code changes, docs change. README, API docs, ADRs, runbooks, all of it. The doc-updater profile exists because this is hard for humans. It’s harder for AI agents, which will happily implement a feature and forget the README exists.

15. Design for failure. Every external call will fail. Every database will have slow days. Every third-party API will time out. Design for it. Circuit breakers, retries with backoff, fallback responses, graceful degradation. If your system assumes the network is reliable, the network will teach you otherwise.

These fifteen principles are the starting point. Not the ending point: you’ll add domain-specific principles as you learn more about your problem. But if your project violates any of these, the violation should be intentional, documented, and justified. Not accidental.

The harness doesn’t just suggest these; it enforces them through agent constraints, skill workflows, and review gates. But even without the harness, this list is your pre-flight checklist. Run through it before every project. Run through it during every review gate. If something’s missing, fix it before you ship.
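
Principle 15 names retries with backoff. A minimal sketch of that one pattern follows, with a helper name and defaults of my own choosing. It also respects principle 2: when retries run out, the error is re-raised, not swallowed.

```python
import random
import time

def call_with_retries(operation, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run `operation`, retrying failures with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            # In real code, catch only the transient errors you expect
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure, never swallow it
            delay = base_delay * (2 ** attempt)  # 0.5s, 1s, 2s, ...
            # Jitter keeps many clients from retrying at the same instant
            sleep(delay + random.uniform(0, delay / 2))
```

The `sleep` parameter is injectable so tests can run without waiting. In line with principle 10, the attempt count and base delay would normally come from config, not from the defaults here.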

Step 1: Install and Configure Your Coding Agent

Pick your coding agent. I use Factory’s Droid, but this setup works with Codex, Claude, or any agent that can read a repo and follow instructions. The harness is agent-agnostic: it’s a set of files and conventions, not a vendor lock-in.

Install your agent’s CLI and make sure it can read files, write files, run shell commands, and grep your codebase. That’s the baseline. If your agent can do those four things, the harness works.

Don’t skip the shell access. Half the value of this system comes from running tests, builds, and security scans. If your agent can’t execute commands, you lose the verification-loop skill, the security-reviewer’s automated scans, and the TDD workflow. You’re left with a very expensive syntax highlighter.

Step 2: Initialise Your New Project Folder

Create your project directory and point your coding agent at the harness repo. The harness bootstraps your project with:

.agents/profiles/: the fourteen specialist roles
.agents/skills/: the eighteen engineering practice skills
lifecycle-policy.toml: ticket-type-to-agent routing
AGENTS.md: working rules, conventions, and coding standards
CODEX_INITIAL_PROMPT.md: the initialisation and phased delivery model
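If you want a quick sanity check that the bootstrap actually produced these files, something like the following Python sketch would do. It assumes all five paths sit directly under the project root, which the list above implies but does not state; the function name is my own.

```python
from pathlib import Path

# Paths taken from the list above. That they all live directly under the
# project root is an assumption of this sketch.
EXPECTED_PATHS = [
    ".agents/profiles",
    ".agents/skills",
    "lifecycle-policy.toml",
    "AGENTS.md",
    "CODEX_INITIAL_PROMPT.md",
]


def missing_harness_paths(project_root: str) -> list[str]:
    """Return the expected harness paths that do not exist under project_root."""
    root = Path(project_root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]
```

An empty list means every expected file and folder is in place.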

The initialisation prompt (in the Prompt Library below) tells the agent to read the entire harness repo, follow the setup instructions, install dependencies, create a GitHub repo, and then learn all the agent roles and skills. This is not a quick step; it takes a few minutes. Let it run. The agent is reading every profile, every skill, every working rule. It’s building a mental model of your engineering team.

Once initialised, your project has a structure that any coding agent can pick up and work within. The rules live in the repo, not in your head.

Step 3: Enable Your Engineering Team to Do Work

This is where it gets real. The harness defines fourteen specialist profiles. Here’s what each one does and why it matters:

Architect: Read-only. Gathers evidence before recommending. Produces Architecture Decision Records with explicit trade-off analysis (Pros/Cons/Alternatives/Decision with single clear rationale). Never writes implementation code. The architect exists to stop you from jumping straight to coding without understanding the problem.

Planner: Read-only. Breaks features into independently deliverable phases with exact file paths, dependencies, complexity estimates, and testing strategy. Each phase must be mergeable on its own. The planner catches “update the API” vagueness and forces specificity: which file, which function, which endpoints.

Code Agent: The implementer. Executes one scoped ticket at a time. Must declare assumptions, scope boundary, and what it’s not building before touching any files. TDD process: write failing tests, implement minimum code, run quality gates, fix failures, commit. Never touches files outside scope. Never adds helpers not in the spec.

Code Reviewer: Read-only. Reviews code for quality, patterns, and consistency. Catches what the code agent missed.

Security Reviewer: Read-only. Runs automated scans for hardcoded secrets, dependency vulnerabilities, SQL injection patterns, and sensitive data in logs. Then does a manual OWASP Top 10 review on changed code. Produces a structured report with BLOCK/WARN/PASS verdict. Any CRITICAL or HIGH finding blocks the phase.

TDD Guide: Writes failing tests before implementation exists. Enforces red-green-refactor. Requires 80% coverage on branches, functions, and lines. If coverage is below threshold, writes more tests, never lowers the threshold.

Database Reviewer: Reviews schema changes, migration safety, and query patterns.

E2E Runner: Runs end-to-end test suites (Playwright) covering critical user flows.

Doc Updater: Keeps documentation in sync with code changes.

Refactor Cleaner: Removes duplication, improves naming, reduces file sizes. Targets 200-400 lines per file, hard stop at 800.

Build Error Resolver: Fixes build failures and CI pipeline issues.

Go Build Resolver / Go Reviewer: Go-specific build and review specialists.

Python Reviewer: Python-specific review with pattern enforcement.

The key insight: each profile has a “What NOT to Do” section. These boundaries are what make the system work. Without them, every agent turns into the same generic code generator.
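Some of these boundaries are easy to check mechanically. As one example, here is a small Python sketch of the Refactor Cleaner’s size limits (200-400 lines per file, hard stop at 800). The function names and verdict strings are hypothetical, and I’m assuming files shorter than 200 lines are fine.

```python
from pathlib import Path


def classify_file_length(line_count: int) -> str:
    """Classify a file against the 200-400 line target and 800-line hard stop."""
    if line_count > 800:
        return "over hard stop"
    if line_count > 400:
        return "above target"
    return "ok"  # assumption: being under the 200-line lower bound is not a problem


def oversized_files(root: str, suffix: str = ".py") -> dict[str, str]:
    """Map each file under `root` that exceeds the target to its verdict."""
    report = {}
    for path in sorted(Path(root).rglob(f"*{suffix}")):
        with path.open(encoding="utf-8", errors="replace") as fh:
            count = sum(1 for _ in fh)
        verdict = classify_file_length(count)
        if verdict != "ok":
            report[str(path)] = verdict
    return report
```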

Step 4: Create Your First Feature or Build Spec

Now you use the “Spec Out” prompt from the Prompt Library. You tell the agent what you want to build and it produces two things: a production-ready build spec and a phased checklist roadmap.

The build spec covers architecture decisions, data models, API contracts, error handling strategy, and security considerations. It follows the harness standards: modularity (single responsibility, 200-400 lines per file), explicit error handling (no silent catches), input validation at boundaries, and immutability by default across async boundaries.

The checklist roadmap breaks the work into independently deliverable phases: M0 (repo activation and infrastructure), M1 (foundation, core models and base endpoints), M2 (product core, feature implementation), M3 (quality and operations, security hardening, E2E tests, monitoring). Each phase has clear success criteria and can be merged on its own.

Here’s the important part: the prompt asks the agent to ask you up to 30 questions, one by one, to clarify anything uncertain before building starts. This pattern is deceptively powerful. It forces the AI to extrapolate on solution components you haven’t thought about: authentication strategy, error surface area, data migration paths, rate limiting, cache invalidation. It’s like having a senior engineer sit next to you and say “have you considered what happens when the payment provider times out?” before you write a single line of code.

Don’t skip the 30 questions. This is where the harness earns its keep. Because the agent has read all the profiles and skills, it will surface concerns that the architect profile would raise, security issues the security-reviewer would catch, and edge cases the TDD guide would test for. Answer the questions. Lock in the assumptions. Then build from a position of certainty, and you end up with a much higher-quality solution.

Step 5: Create Quality Validation Gates for Every Phase

Every phase of your checklist roadmap needs a review gate. The “Add a Review Gate” prompt in the Prompt Library appends a verification step to each phase that the agent cannot skip past.

The review gate does three things: compares the code developed in that phase against the build spec, identifies any functionality gaps, and closes them. It enforces 80% test coverage minimum. The phase cannot be called complete until the gate passes.

This is the single most important step in the entire process. Without review gates, AI agents drift. They forget parts of the spec. The review gate forces the agent to audit its own work against the plan before moving on.

The gate works because the agent already has the spec in context. It’s comparing what it built against what it was told to build. This catches two classes of problems: missed functionality (the spec said to validate input at boundaries, but the agent only validated on the happy path) and quality gaps (the spec required 80% coverage, but the agent only wrote happy-path tests).

Run the review gate after every phase, especially the ones that feel straightforward. That’s where complacency lives.

Step 6: Do Walk-throughs of All Your Screens and Update Specs

This step should perhaps be the first step of this whole process. Let me explain.

In a normal cycle of deciding what you want to build, you spend some time thinking about your requirements and then writing them down. Often this happens in a simple document, bullet point lists, etc. This is OK if your system is fairly simple to conceptualise, but what happens if you want to build something a bit more complex? How will the AI actually know what you’re trying to accomplish?

Some people I’ve spoken to say, “well, just talk to the AI”. True, talking to the AI one prompt after another makes the approach iterative, which is fine. But you never get a true picture of the entire system and, more importantly, without that holistic view you can’t make good architectural decisions or trade-offs.

Maybe there is a better way?

There is. Let’s say your system is a SaaS app with several screens with complex, non-trivial use case(s). Writing down a bunch of bullet points is probably not going to cut it. Instead try this:

Actually design the UI before the AI agent starts building anything. You can easily use https://stitch.withgoogle.com or any other UI designer (e.g. www.figma.com) and design how the user interaction is meant to occur. Essentially, convert your thinking process into a tangible look and feel so you can confirm that your idea works. For example, how we did this design recently:

Once you’re happy with your UI prototype, do a complete walk-through with a speech-to-text recorder. Just talk to yourself or a friend as you walk through every screen, every component, button and text-box. Ask yourself, why is this here? What is it meant to do? What was the expected behaviour? How does this work?

Then, feed the exported UI prototype code and the walk-through transcript into your AI agent and say:

"Read the UI code and the transcript and update all specs to ensure the system works as I expect. The transcript is your source of truth."

You’d be amazed how much better and quicker you can build complex, great-looking, high-quality apps using this approach.

I had a genuine “aha” moment when I finally put this in place for one of my projects.

Step 7: Build Your System Phase by Phase

This is where the phased delivery model pays off. You don’t type “build everything” and hope for the best. You build M0, verify it passes, then M1, verify it passes, and so on.

The harness enforces a natural cadence:

M0 (Repo Activation): Initialise the project, install dependencies, set up CI, create the GitHub repo. No feature code yet. Just infrastructure.
M1 (Foundation): Core data models, base API structure, auth middleware, database migrations. The smallest slice that compiles and runs.
M2 (Product Core): Feature implementation. This is where the code-agent does the heavy lifting, guided by the planner’s phased breakdown. Each feature gets its own TDD cycle.
M3 (Quality and Operations): Security hardening (the security-reviewer runs full OWASP Top 10), E2E tests (the e2e-runner covers critical user flows), monitoring, documentation. The polish phase.

Each phase is independently mergeable. If M1 passes but M2 has issues, you can ship M1 while you debug M2. This is not accidental: the planner profile explicitly requires that each phase can be delivered independently. Plans that require all phases to complete before anything works are rejected.

When you build phase by phase, debugging is tractable. If something breaks in M2, you know it wasn’t broken in M1 because M1 passed its review gate. The blast radius of any bug is bounded by the current phase.

Step 8: Test and Fix --> Test and Fix

This step isn’t a one-time action. It’s a loop. Run tests, find failures, fix them, run tests again. Repeat until green.

The harness gives you multiple testing layers:

Unit tests: Written by the TDD guide before implementation. Test individual functions and components. Target < 50ms per test. Mock only external dependencies (database, HTTP, file system). Never mock the thing being tested.

Integration tests: API endpoints, database operations, service interactions. Test that the pieces connect correctly.

E2E tests: Playwright tests covering critical user flows. Login, search, create, update, delete. The full journey.

Security scans: The security-reviewer runs automated checks for hardcoded secrets, vulnerable dependencies, SQL injection patterns, and sensitive data in logs. Then a manual OWASP Top 10 pass.

Verification loop: A six-phase check: build, type check, lint, test suite with coverage, security scan, diff review. The loop produces a structured report: BUILD PASS/FAIL, TYPES PASS/FAIL, LINT PASS/FAIL, TESTS X/Y passed with Z% coverage, SECURITY PASS/FAIL, X files changed, Overall: READY/NOT READY for PR.

The verification loop runs after every significant change. Not just at the end. After each phase, after each feature, after each refactor. It’s the safety net that catches regressions before they compound.
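To show the shape of that report, here is a rough Python sketch of a verification loop. It is not the harness’s verification-loop skill: the `checks` callables are placeholders you would wire to your own build, type-check, lint, test, and scan commands, and the diff review, coverage figure, and changed-file count are left out for brevity.

```python
from typing import Callable

# Phase names follow the report format described above.
PHASES = ["BUILD", "TYPES", "LINT", "TESTS", "SECURITY"]


def run_verification(checks: dict[str, Callable[[], bool]]) -> str:
    """Run each phase's check in order and return a structured report.

    `checks` maps a phase name to a function returning True on success.
    A missing or crashing check is reported as FAIL.
    """
    lines = []
    all_passed = True
    for phase in PHASES:
        try:
            passed = checks[phase]()
        except Exception:
            passed = False  # missing key or a check that raised
        all_passed = all_passed and passed
        lines.append(f"{phase}: {'PASS' if passed else 'FAIL'}")
    lines.append(f"Overall: {'READY' if all_passed else 'NOT READY'} for PR")
    return "\n".join(lines)
```

Every phase runs even after an earlier failure, so one report shows everything that needs fixing.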

When tests fail, the harness has a specific flow: the build-error-resolver profile handles build failures, the code-agent fixes functional test failures, the security-reviewer addresses security findings. You’re not debugging alone: the right specialist is assigned by the lifecycle policy.
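The routing described here boils down to a lookup table. The sketch below is a hypothetical in-memory version in Python; the real lifecycle-policy.toml format isn’t shown in this article, so the failure-type keys are my own labels and only the three profile names come from the text above.

```python
# Hypothetical routing table; keys are illustrative labels, values are the
# profile names mentioned in the article.
ROUTING = {
    "build_failure": "build-error-resolver",
    "test_failure": "code-agent",
    "security_finding": "security-reviewer",
}


def route_failure(kind: str) -> str:
    """Return the specialist profile responsible for a failure type."""
    try:
        return ROUTING[kind]
    except KeyError:
        raise ValueError(f"no specialist configured for failure type {kind!r}") from None
```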

Keep looping until everything is green. Don’t move to the next phase with failing tests. That’s not discipline, that’s engineering.

Where This System Breaks

Nothing is perfect. Here’s where this approach falls down:

The agent can ignore the rules. The profiles and skills are instructions, not compiled constraints. A sufficiently confused or corner-cutting agent will violate its “What NOT to Do” list. The review gates catch most of this, but they’re not foolproof. You still need to read the code.

Context window pressure. Fourteen profiles, eighteen skills, working rules, and your entire build spec add up to a lot of context. For long sessions, the strategic-compact skill helps by suggesting compaction at logical boundaries (after research, before implementation; after a milestone; before a context shift). But if you’re working on a massive codebase, you’ll feel the token pressure.

The 30-questions pattern can stall. If the agent asks thirty questions and you don’t know the answers, you’ll spend more time researching than building. This is a feature, not a bug (it means you’re designing before coding), but it can feel slow if you just want to see something working.

Multi-agent coordination is still emergent. The harness defines roles and routing, but it doesn’t have a central orchestrator that automatically spins up the security reviewer after the code agent finishes. You, the human, decide when to invoke which agent. The lifecycle policy suggests the right routing, but you drive the sequence.

The harness can’t fix bad requirements. If your build spec is vague, contradictory, or incomplete, no amount of review gates will save you. The 30-questions pattern helps surface ambiguities, but if you answer “I don’t know” to half of them, the spec will have holes that show up as bugs later.

None of these are reasons to skip the system. They’re reasons to understand its limits and apply it where it fits: production software that needs to work correctly, securely, and maintainably. For everything else, a solo coding agent is fine. Just don’t pretend the output is production-ready.

The Complete Prompt Library

Initialise a New Project

Use this prompt right at the start of any new project. The assumption here is that the Droid coding agent is being used. If you use something else, like Codex or Claude, that’s OK. The project structure will still work correctly.

cd ~/new-project-dir
droid

Read the repo in [https://github.com/MikeS071/ai-dev-harness] and follow all instructions to initialise a new project in [new-project-dir]. Ensure all dependencies are installed and are ready to use and remote github repo for this new project has been created and initialised. Once initialised, read the [new-project-dir] and learn all the instructions, skills, agent roles so all capabilities can be used to build this new project.

Spec Out a New Feature or Project

Use this prompt to start designing a new feature or system. The harness knows the standards, structure and broad requirements based on good system design. Asking it to spec things out produces a production-ready build spec together with a referenceable checklist roadmap, so the AI agent can execute when ready and you can keep track of progress during a multi-day development cycle.
You may also notice that there is an “Ask me up to 30 questions...” instruction at the end. That pattern is important: it tells the AI to lock in the 30 most critical assumptions. You’d be surprised how well this works, and it carries over to other scenarios because it forces the AI to extrapolate on solution components that you may not have thought about. It’s like having a senior engineer sit side by side with you, giving you advice.

I want to design and spec out a new feature/system/solution called [feature-name]. Help me plan this out, design a comprehensive and production-ready build spec and also produce a checklist roadmap for all phases of the delivery. Ensure all standards, instructions and good-system design principles are followed. Ask me up to 30 most critical questions one by one to clarify anything that is still uncertain and needs to be locked-in before building starts.

Add a Review Gate to Each Phase

Use this prompt to add a review gate for each phase of the build spec checklist. The idea here is to minimise hallucinations and missed code. While AI agents are great, they do miss things, so forcing the agent to review its own work improves your chances of getting a high-quality, working solution. Use this prompt as a second pass over the build spec/checklist once it has been generated.

Add a review gate to each phase by adding this exact prompt to the end of each checklist/roadmap phase. Each phase cannot be called complete until the review gate has passed:
"Review the code developed in this phase and compare to the existing build specs. Identify any functionality gaps and if you find any material gaps then close them by building relevant code. Ensure no material gaps exist and that test coverage is 80% or higher. The phase cannot be called complete until the review gate is successfully passed."

Build the System

Once you have the production-ready build specs and checklist/roadmap, it is a simple matter of just saying “build it” to the AI Agent. I suggest you build the system phase by phase to manage complexity and to check that phases pass before moving on to the next section. This makes debugging or fixing issues a lot simpler.

Build Phase M0 (or M1 or M2 etc)

Build an LLM Knowledge Base That Actually Compounds

Michal Szalinski — Fri, 10 Apr 2026 07:52:30 GMT

The Problem You Already Know You Have

You have knowledge scattered everywhere. Articles saved in 4 apps. Bookmarks from 2023 you’ll never revisit. Notes from meetings in a folder you forgot existed.

When you ask AI a question about your stuff, it starts from zero every time. Upload docs, ask a question, get an answer. Next session? Forgotten everything. That’s how ChatGPT file uploads, NotebookLM, and most RAG systems work. Zero accumulation.

The Idea (60 Seconds)

Instead of the AI searching your raw files every time, the AI reads your sources once and compiles a structured wiki. Summaries, cross-references, connections between ideas, contradictions flagged. All maintained by the AI. All in simple markdown files.

Next time you ask a question, the AI doesn’t dig through raw documents. It reads the wiki it already built. The connections are already there. Every new source you add makes the wiki richer. Every question you ask can get filed back in. Knowledge compounds instead of resetting.

No database. No embeddings. No vector store. Just folders and text files.

Why This Setup, Not The Others

Three things make this particular stack worth your time:

Factory Droid reads and writes local files natively. No copy-paste. No uploading. The AI operates directly on your filesystem: reading PDFs, creating wiki pages, updating the index, all in one pass.
OpenRouter gives you any model. I run glm-5.1 through OpenRouter. You can swap to Claude, GPT-4o, Gemini, Llama, or any model that drops next week. One config change. No vendor lock-in on the intelligence layer.
Obsidian renders the wiki as it’s built. Graph view, backlinks, search, YAML properties, all work automatically because the AI writes standard Obsidian-compatible markdown. You see the knowledge base grow in real time.

What You Need

Factory Droid - AI coding agent that reads/writes local files (factory.ai)
OpenRouter API key - model gateway that lets you use any LLM (openrouter.ai)
Obsidian - markdown editor with wiki-link support (obsidian.md)
10+ source documents on a topic you care about
30 minutes for initial setup, then 10 minutes per source after that

No special software beyond these three. No accounts beyond these two. No plugins to install.

Step 1: Install and Configure Factory Droid (5 Minutes)

Install Droid:

# macOS / Linux
curl -fsSL https://app.factory.ai/cli | sh

Authenticate:

droid login

Add Your Custom Model via OpenRouter

This is how you run any model through Droid, not just the default ones. I use glm-5.1. You can use whatever OpenRouter supports.

Edit ~/.factory/settings.json (create it if it doesn’t exist):

{
  "customModels": [
    {
      "model": "z-ai/glm-5.1",
      "displayName": "GLM-5.1 [OpenRouter]",
      "baseUrl": "https://openrouter.ai/api/v1",
      "apiKey": "YOUR_OPENROUTER_API_KEY",
      "provider": "generic-chat-completion-api",
      "maxOutputTokens": 65536
    }
  ]
}

To use a different model, change the model field to any model ID from openrouter.ai/models. The displayName is just what shows up in the Droid UI.

To select your custom model in a session:

droid --model "z-ai/glm-5.1"

Or type /model inside a running Droid session to switch on the fly.

Step 2: Create the Folder Structure (2 Minutes)

Create this anywhere on your machine:

my-knowledge-base/
├── raw/           # Your source material. AI reads but never modifies.
│   └── assets/    # Images, screenshots, diagrams
├── wiki/          # AI-maintained wiki. You read. AI writes.
├── outputs/       # Reports, analyses, answers from queries
└── AGENTS.md      # The schema file that makes this whole thing work

Three folders, one file. If you’re spending more than 2 minutes here, you’re overthinking it.
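If you would rather script it than click around, a minimal Python sketch of the same layout looks like this. The function name is my own; it creates an empty AGENTS.md placeholder and never overwrites one that already exists.

```python
from pathlib import Path


def create_knowledge_base(root: str) -> Path:
    """Create the raw/, wiki/ and outputs/ folders plus an empty AGENTS.md."""
    base = Path(root)
    # raw/assets implies raw/, so three mkdir calls cover the whole tree.
    for folder in ("raw/assets", "wiki", "outputs"):
        (base / folder).mkdir(parents=True, exist_ok=True)
    schema = base / "AGENTS.md"
    if not schema.exists():
        schema.touch()  # placeholder only; the schema content comes in Step 3
    return base
```

Calling `create_knowledge_base("my-knowledge-base")` a second time is harmless, since existing folders and files are left alone.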

Step 3: Write Your Schema File (The Step Everyone Skips)

The schema is the difference between a generic chatbot and a disciplined wiki maintainer. It tells your AI what the knowledge base is about, how to organize it, and what to do when you add sources, ask questions, or run maintenance.
As you’ll see in the schema below, I have multiple knowledge domains in my structure. You don’t have to set up multi-domain folders; one will do just fine. If that’s the case, just remove or rename what you don’t want.

Save this as AGENTS.md in your project root:

# Knowledge Base Schema

## Identity
This is a personal knowledge base about [YOUR TOPIC HERE].
Maintained by an LLM agent. The human curates sources and asks questions. The LLM does everything else.

## Architecture
- raw/ contains immutable source documents. NEVER modify files in raw/.
- wiki/ contains the compiled wiki. The LLM owns this directory entirely.
  - wiki/architecture/ -- Enterprise and solution architecture
  - wiki/resilience-ops/ -- Resilience, operations, SRE
  - wiki/data-ai/ -- Data platforms, ML, AI
  - wiki/security/ -- Security, IAM, data protection
  - wiki/software-engineering/ -- Software design, practices, CI/CD
  - wiki/ai-automation/ -- AI for business process automation
  - wiki/index.md -- Master index of all pages by domain
  - wiki/log.md -- Append-only chronological record
  - Cross-cutting pages (e.g. contradictions-and-tensions.md) live at wiki/ root
  - Each domain folder has a home.md landing page listing its pages
- outputs/ contains generated reports, analyses, and query answers.

## Wiki Conventions
- Every topic gets its own .md file in the appropriate domain folder under wiki/
- Every wiki file starts with YAML frontmatter:
  ---
  title: [Topic Name]
  created: [Date]
  last_updated: [Date]
  source_count: [Number of raw sources that informed this page]
  status: [draft | reviewed | needs_update]
  ---
- After frontmatter, a one-paragraph summary
- Use [[topic-name]] for internal links between wiki pages
- Every factual claim cites its source: [Source: filename.md]
- When new info contradicts existing content, flag explicitly:
  > CONTRADICTION: [old claim] vs [new claim] from [source]

## Index and Log
- wiki/index.md lists every page by domain with a one-line description
- wiki/log.md is append-only chronological record
- Log entry format: ## [YYYY-MM-DD] action | Description
  (Actions: ingest, query, lint, update)

## Ingest Workflow
When processing a new source:
1. Read the full source document
2. Discuss key takeaways with user
3. Create or update a summary page in the appropriate wiki/ domain folder
4. Update wiki/index.md and the domain's home.md
5. Update ALL relevant entity and concept pages across the wiki
6. Add backlinks from existing pages to new content
7. Flag any contradictions with existing wiki content
8. Append entry to wiki/log.md
9. A single source should touch 10-15 wiki pages

## Query Workflow
When answering a question:
1. Read wiki/index.md first to find relevant pages
2. Read all relevant wiki pages
3. Synthesize answer with [Source: page-name] citations
4. If answer reveals new insights, offer to file it back into wiki/
5. Save valuable answers to outputs/

## Lint Workflow (Monthly)
Check for:
- Contradictions between pages
- Stale claims superseded by newer sources
- Orphan pages with no inbound links
- Concepts mentioned but never explained
- Missing cross-references
- Claims without source attribution
Output: wiki/lint-report-[date].md with severity levels

## Focus Areas
[List 3-5 topics this knowledge base covers]

Customise three things before saving:

The [YOUR TOPIC HERE] line -- make it specific (“Enterprise Architecture for Financial Services” not just “Architecture”)
The domain folders -- rename/add/remove to match your topics (I have 6 domains; you might have 3 or 10)
The Focus Areas -- list the 3-5 domains this KB covers

This file is read by Droid at the start of every session. It’s the single most important file in the entire system.
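Because the schema fixes the frontmatter fields, you can spot-check pages the AI writes without asking the AI. Below is a rough Python sketch that does this with plain string handling; it is a naive line-based check, not a real YAML parser, and the function name is hypothetical.

```python
REQUIRED_FIELDS = ("title", "created", "last_updated", "source_count", "status")
ALLOWED_STATUSES = {"draft", "reviewed", "needs_update"}


def frontmatter_problems(page_text: str) -> list[str]:
    """Return a list of problems found in a wiki page's frontmatter."""
    lines = page_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return ["missing opening --- line"]
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # reached the closing delimiter
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    else:
        # The loop ended without ever seeing a closing delimiter.
        return ["missing closing --- line"]
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in fields]
    status = fields.get("status")
    if status is not None and status not in ALLOWED_STATUSES:
        problems.append(f"unknown status: {status}")
    return problems
```

An empty list means the page has all five fields and a status the schema allows.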

Step 4: Fill Your Raw Folder (10 Minutes of Dumping, Zero Organising)

Open raw/ and dump everything in:

Copy-paste articles into .md or .txt files
Export notes from whatever app you’re using now
Save screenshots and diagrams to raw/assets/
Drop in PDFs (Droid can extract text from them)
Paste in research papers, competitor breakdowns, internal docs
Dump bookmarks you’ve been hoarding for months

Don’t organise it. Don’t rename anything. Don’t clean it up. That’s the AI’s job.

If you have PDFs, Droid will handle extraction automatically. The Anthropic PDF skill (github.com/anthropics/skills/tree/main/skills/pdf) uses pdfplumber and pypdf under the hood. If Droid doesn’t have these installed, it will install them as part of the ingest. No manual setup needed.

The Obsidian Web Clipper browser extension converts any web article to markdown in one click. Set a hotkey to pull all images locally so the AI can reference them.

The goal is volume. Not perfection.

Step 5: Run Your First Ingest

Open your project in Droid:

cd my-knowledge-base
droid

Then paste this prompt:

INGEST PROMPT (single source):

Read the schema in AGENTS.md. Then process [FILENAME] from raw/. Read it fully, discuss key takeaways with me, then: create a summary page in the appropriate wiki/ domain folder, update wiki/index.md and the domain's home.md, update all relevant concept and entity pages across the wiki, add backlinks, flag any contradictions, and append to wiki/log.md. Use the PDF skill (https://github.com/anthropics/skills/tree/main/skills/pdf) to read the PDFs or convert them to md format.

Start with one source at a time. Read the summaries. Check the updates. Guide the AI on what to emphasise. This produces dramatically better results than batch-processing everything at once.

What happens during an ingest:

Droid reads the full source document from raw/
It discusses key takeaways with you (this is your quality gate)
It creates a summary page in the right domain folder
It creates cross-cutting concept pages that connect to existing content
It updates the index and domain home page
It adds backlinks from existing pages to the new content
It flags contradictions with existing wiki content
It appends a log entry to wiki/log.md

A single good source will touch 10-15 wiki pages. That’s the compounding in action.
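The log entry appended in the last step follows the format defined in the schema: ## [YYYY-MM-DD] action | Description. As a small illustration, here is how that format could be produced in Python. The helper names are my own; in practice Droid writes these entries itself.

```python
from datetime import date

# The four actions listed in the schema's "Index and Log" section.
ALLOWED_ACTIONS = ("ingest", "query", "lint", "update")


def format_log_entry(day: date, action: str, description: str) -> str:
    """Build one log line in the schema's format."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action {action!r}; expected one of {ALLOWED_ACTIONS}")
    return f"## [{day.isoformat()}] {action} | {description}"


def append_log_entry(log_path: str, entry: str) -> None:
    """Append an entry to the log file without touching earlier entries."""
    # Mode "a" only ever writes at the end, which keeps the log append-only.
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(entry + "\n")
```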

For PDFs specifically, add this to your ingest prompt:

Use the PDF skill (pdfplumber/pypdf) to extract text from the PDF before processing.

For batch ingest (less supervised, use after you trust the system):

Read AGENTS.md. Process all unprocessed files in raw/ sequentially. For each: create summary in the appropriate domain folder, update index and home.md, update relevant pages, add backlinks, flag contradictions, log the ingest. Proceed automatically.

After 5-10 sources, your wiki/ folder will have an index, a log, domain home pages, and 15-30 interconnected pages. That’s when things click.

Step 6: Set Up Obsidian (3 Minutes)

Your wiki is already Obsidian-compatible. It uses markdown files, [[wiki-links]], and YAML frontmatter. You just need to point Obsidian at it.

If Obsidian is already installed:

Open Obsidian
Click “Open folder as vault”
Select your wiki/ folder
Done

If you need to install Obsidian:

# Linux (Flatpak)
flatpak install flathub md.obsidian.Obsidian

# macOS
brew install --cask obsidian

# Or download from obsidian.md

What works immediately in Obsidian:

[[wiki-links]] - all cross-page links are clickable and navigable
Graph view - click the graph icon to see the interconnected page structure
Backlinks panel - right sidebar shows which pages link to the current page
YAML frontmatter - properties like title, status, source_count appear in the properties panel
Search - Ctrl+Shift+F for global search across all pages
Folder navigation - domain folders show up in the file explorer

The vault opens on index.md as the landing page with the full catalogue of all wiki pages by domain.

Step 7: Start Querying Your Knowledge Base

Once you have 10+ wiki pages, the system becomes genuinely useful.

QUERY PROMPT:

Read wiki/index.md. Based on what's in the knowledge base, answer: [YOUR QUESTION]. Cite which wiki pages informed your answer. If this reveals new connections worth preserving, create a new page in the appropriate wiki/ domain folder and update the index.

Questions that extract the most value:

“What are the three biggest gaps in this knowledge base?”
“Which sources disagree with each other, and on what?”
“What should I research next based on what’s here?”
“Write a 500-word briefing on [topic] using only wiki content”
“What connections exist between [concept A] and [concept B]?”
“What contradictions or tensions exist across the sources?”

The critical loop: good answers should be filed back into the wiki. A comparison, an analysis, a connection you discovered. These compound just like ingested sources do. Every question makes the next answer better.

Step 8: Run Monthly Health Checks

This is the step nobody does. It’s the step that prevents the whole system from slowly rotting.

LINT PROMPT:

Run a full health check on wiki/ per the lint workflow in AGENTS.md. Output to wiki/lint-report-[date].md with severity levels. Suggest 3 articles to fill the biggest knowledge gaps.

Why this matters: when the AI writes something slightly wrong and you save it back, the next answer builds on the wrong thing. Two months later, you have five pages reinforcing the same error. Health checks catch this before it snowballs.

One check per month. Ten minutes of your time. Non-negotiable if you want the system to stay trustworthy.
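Some lint checks don’t need an LLM at all, which sidesteps the problem of the AI auditing its own mistakes. As one example, this Python sketch finds orphan pages (pages with no inbound [[wiki-links]]). It assumes links use the page’s file name without the .md extension, treats index.md and log.md as exempt entry points, and treats pages with the same file name in different folders as one page.

```python
import re
from pathlib import Path

# Captures the link target in [[target]], [[target|alias]] or [[target#heading]].
WIKI_LINK = re.compile(r"\[\[([^\]|#]+)")


def find_orphan_pages(wiki_dir: str) -> list[str]:
    """Return page names (file stems) that no other page links to."""
    pages = {
        p.stem: p.read_text(encoding="utf-8")
        for p in Path(wiki_dir).rglob("*.md")
    }
    linked = set()
    for name, text in pages.items():
        for target in WIKI_LINK.findall(text):
            # Keep only the last path component so [[folder/page]] also matches.
            target = target.strip().split("/")[-1]
            if target != name:  # a page linking to itself does not count
                linked.add(target)
    exempt = {"index", "log"}  # entry points, exempt by assumption
    return sorted(n for n in pages if n not in linked and n not in exempt)
```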

Step 9: Let It Compound

After 4-6 weeks of consistent use, you’re not just searching notes. You’re querying a structured knowledge system that understands connections between your sources better than you do.

Three ways to accelerate the compounding:

File exploration outputs back. When the AI generates a comparison or analysis you find valuable, save it into wiki/ or outputs/. That way your own explorations and queries accumulate in the knowledge base.

Add visual outputs. Have the AI render answers as markdown tables, charts, or slide decks. These become reusable assets, not throwaway chat messages.

Version control everything. Your wiki is just markdown files. Initialize a git repo. You get full history, branching, and the ability to undo anything the AI messes up.

cd my-knowledge-base
git init
git add .
git commit -m "Initial knowledge base"

Where This System Breaks

This is a nascent pattern, not a finished product. Karpathy himself called it “a hacky collection of scripts.” Here’s what you need to know:

Context Window Ceiling

The wiki works at ~100 articles and ~400K words. But even 128K-token context windows only hold ~96K words, roughly a quarter of a wiki that size. The AI reads selectively through the index, which means it can miss things. Research on long-context LLMs describes a “lost in the middle” effect, where information in the centre of long inputs gets deprioritised. Your query results will have blind spots. Accept this.
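
The arithmetic behind that ceiling can be sketched as follows. It assumes roughly 0.75 English words per token, the ratio implied by the 128K-token and ~96K-word figures above. Real ratios vary by tokenizer and by text, and the function names are illustrative.

```python
WORDS_PER_TOKEN = 0.75  # assumption: implied by 128K tokens ~ 96K words


def words_that_fit(context_tokens: int, words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Approximate number of words a context window can hold."""
    return int(context_tokens * words_per_token)


def fraction_visible(wiki_words: int, context_tokens: int) -> float:
    """Upper bound on the share of the wiki the model can read in one pass."""
    if wiki_words <= 0:
        return 1.0
    return min(1.0, words_that_fit(context_tokens) / wiki_words)


capacity = words_that_fit(128_000)
share = fraction_visible(400_000, 128_000)
print(f"{capacity:,} words fit; at most {share:.0%} of a 400K-word wiki per query")
```

This is an upper bound. The schema, the index, your question, and the model's answer all share the same window, so the real share is lower.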

Error Compounding

The AI writes a wiki page with a subtle mistake. You query against it. The mistake enters your answer. You file that answer back. Now two pages reinforce the same error. Monthly linting helps, but the AI doing the linting has the same blind spots as the AI that made the error. This is the single biggest risk.

Hallucination Doesn’t Disappear

The wiki approach reduces hallucination because the AI grounds answers in your sources. But it doesn’t eliminate it. The AI can still synthesise connections that don’t exist in the source material. And because the wiki looks authoritative (clean markdown, cross-references, citations), you’re more likely to trust incorrect information. Don’t.

Cost Isn’t Zero

Every ingest, every query, every lint check costs tokens. A single source that touches 10-15 pages can run $1-2 in API calls with frontier models. Cheaper than a research assistant. Not free. Using OpenRouter with cost-efficient models like glm-5.1 helps, but it’s still not zero.
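
One rough way to sanity-check that figure before a batch ingest is sketched below. Every number in it is a placeholder assumption, not a quoted price: substitute the per-million-token rates of whichever model you use and your own measured token counts per page. The `Pricing` class and `ingest_cost` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """Per-million-token prices in USD. Placeholder values, not real quotes."""
    input_per_mtok: float
    output_per_mtok: float


def ingest_cost(pages_touched: int, input_tokens_per_page: int,
                output_tokens_per_page: int, pricing: Pricing) -> float:
    """Estimated USD cost of one ingest that reads and rewrites each touched page."""
    input_cost = pages_touched * input_tokens_per_page * pricing.input_per_mtok / 1_000_000
    output_cost = pages_touched * output_tokens_per_page * pricing.output_per_mtok / 1_000_000
    return input_cost + output_cost


# Hypothetical numbers for illustration only.
example = Pricing(input_per_mtok=5.0, output_per_mtok=20.0)
cost = ingest_cost(pages_touched=12, input_tokens_per_page=10_000,
                   output_tokens_per_page=3_000, pricing=example)
print(f"Estimated cost per source: ${cost:.2f}")
```

Multiply the result by the number of files waiting in raw/ to see what an unsupervised batch run might cost before you start it.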

It Doesn’t Scale to Enterprise

The index-file approach works without RAG at ~100 articles. At 10,000+ sources, this pattern breaks. The index grows too large. Consistency across thousands of pages becomes impossible. You’ll need the infrastructure this system was designed to avoid. Know the ceiling.

What To Do About The Breakpoints

| Problem | Mitigation |
| --- | --- |
| Error compounding | Monthly lint checks. Cross-check critical claims manually. Never trust blindly on high-stakes decisions. |
| Context limits | Keep each wiki focused on one domain. Multiple domains? Multiple knowledge bases. |
| Cost | Use frontier models for ingest and complex queries. Cheaper models for simple updates. |
| Hallucination | The schema requires source citations on every claim. If a page makes a claim without [Source: filename], linting flags it. |
| Scale | Accept this is a personal tool, not enterprise infrastructure. If you outgrow it, that’s a good problem. |
| Model bias | Swap models in OpenRouter with one config change. Re-lint after switching to catch interpretation differences. |
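
The citation rule lends itself to a cheap local pre-check before you spend tokens on an AI lint. This is a minimal sketch, assuming citations appear literally as [Source: filename] and that a "claim" can be approximated as any non-heading paragraph. The function names are hypothetical, and the actual lint workflow is whatever your AGENTS.md defines.

```python
import re
from pathlib import Path
from typing import Dict, List

# Matches a literal tag such as [Source: paper.pdf]
SOURCE_TAG = re.compile(r"\[Source:\s*[^\]]+\]")


def uncited_paragraphs(markdown_text: str) -> List[str]:
    """Return paragraphs that contain no [Source: ...] tag.

    Headings and blank blocks are skipped; everything else is treated as a claim.
    """
    flagged = []
    for block in re.split(r"\n\s*\n", markdown_text):
        paragraph = block.strip()
        if not paragraph or paragraph.startswith("#"):
            continue
        if not SOURCE_TAG.search(paragraph):
            flagged.append(paragraph)
    return flagged


def scan_wiki(wiki_dir: str = "wiki") -> Dict[str, List[str]]:
    """Map each markdown file under wiki_dir to its uncited paragraphs."""
    report = {}
    for path in sorted(Path(wiki_dir).rglob("*.md")):
        missing = uncited_paragraphs(path.read_text(encoding="utf-8"))
        if missing:
            report[str(path)] = missing
    return report


if __name__ == "__main__":
    for page, paragraphs in scan_wiki().items():
        print(f"{page}: {len(paragraphs)} paragraph(s) without a source tag")
```

Expect false positives on structural pages such as wiki/index.md and wiki/log.md, which list links instead of making claims. The check confirms that a tag is present, not that the cited source supports the claim, so it complements the manual cross-checks in the table.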

Your Complete Prompt Library

Every prompt from this guide, collected in one place:

SCHEMA: Copy the full AGENTS.md template from Step 3.

INGEST (one source):

Read the schema in AGENTS.md. Then process [FILENAME] from raw/. Read it fully, discuss key takeaways with me, then: create a summary page in the appropriate wiki/ domain folder, update wiki/index.md and the domain's home.md, update all relevant concept and entity pages across the wiki, add backlinks, flag any contradictions, and append to wiki/log.md.

INGEST (batch, less supervised):

Read AGENTS.md. Process all unprocessed files in raw/ sequentially. For each: create summary in the appropriate domain folder, update index and home.md, update relevant pages, add backlinks, flag contradictions, log the ingest. Proceed automatically.

QUERY:

Read wiki/index.md. Answer: [QUESTION]. Cite wiki pages. If this answer is worth preserving, offer to file it as a new wiki page in the appropriate domain folder.

LINT:

Run a full health check on wiki/ per the lint workflow in AGENTS.md. Output to wiki/lint-report-[date].md with severity levels. Suggest 3 articles to fill gaps.

EXPLORE:

Read wiki/index.md and identify the 5 most interesting unexplored connections between existing topics. For each, explain what insight it might reveal and what source would help confirm it.

BRIEF:

Based on everything in wiki/, write a 500-word executive briefing on [TOPIC]. Cite sources. Structure it as: current state, key tensions, open questions, recommended next steps.

CONTRADICTIONS:

Read all wiki pages and identify every place where guidance in one page conflicts with, undermines, or creates tension with guidance in another page. Categorise as: explicit contradiction, implicit tension, acknowledged trade-off, or vague guidance. Rate severity as high/medium/low.

Go Build It

The difference between bookmarking Karpathy’s gist and benefiting from it is one afternoon.

Pick your topic. Create the folders. Copy the schema. Drop in what you already have. Run your first ingest.

Then do it again tomorrow with another source. And next week with five more.

The wiki gets smarter every time. That’s the whole point.

Three folders. One schema. One custom model. An AI that does the grunt work you’d never do yourself.

Stop collecting bookmarks. Start compiling knowledge.