Software 3.0 Is Not a Silver Bullet: Why Engineering Expertise Still Wins with LLMs
Andrej Karpathy calls it Software 3.0. YC calls it the biggest shift since high-level languages. The framing is seductive: LLMs are the new operating system, prompting is the new programming, and anyone can build software now.
Except they can’t.
I’ve watched the same pattern repeat for a while. A non-engineer pastes a vague request into ChatGPT, gets something that looks plausible, ships it, and watches it fall apart under real conditions. Meanwhile, I take the same starting point and end up with a robust CLI that runs autonomously on a cron job, handles edge cases, and compounds value every day.
Same models. Same access. Radically different outcomes. The difference isn’t the prompt. It’s the engineering.
The Idea (60 Seconds)
Software 3.0 is real. LLMs are a new kind of runtime. But the “anyone can code now” narrative is incomplete. What Karpathy and the YC lecture get right is the mental-model shift: LLMs as operating systems, not smarter search engines. What they understate is how much engineering discipline that OS still demands.
The LLM is a lossy runtime. It hallucinates. It forgets context. It produces plausible garbage. Treating it like a magic oracle gets you slop. Treating it as a savant, brilliant but unreliable, and engineering around those limitations, is what separates working solutions from demo toys.
The engineers who master this don’t just write better prompts. They build systems around the LLM: scaffolding, verification loops, fallback chains, state management, and evaluation frameworks. The prompt is the interface. The engineering is the product.
Why Non-Engineers Produce Slop
When a non-engineer asks an LLM to “build me a tool that extracts data from a website and sends a daily email,” they get a script. It works. Once. On the happy path. With the data formatted exactly the way they showed in their example.
When I build the same thing, here’s what happens in my head before I type a single prompt:
How does this fail when the website changes its layout?
What happens when the API returns empty data instead of the expected format?
Where does state live so we can detect anomalies?
How do we make this idempotent so re-running doesn’t send duplicate emails?
What’s the time budget for each operation, and what’s the hard ceiling?
These aren’t prompt engineering tricks. These are engineering instincts: problem decomposition, error handling, state management, verification. The LLM doesn’t eliminate the need for these. It accelerates the implementation but makes the consequences of skipping them more dangerous, because now your broken code runs automatically.
The YC lecture nails this indirectly: “you have to design around their limitations rather than expecting perfect, human-like reliability.” That’s engineering. That’s always been engineering. The medium changed. The discipline didn’t.
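Two of those instincts, idempotency and failing loudly on bad input, can be sketched in a few lines. This is an illustrative sketch, not the actual tool from the anecdote; the state file name and the send step are hypothetical stand-ins.

```python
import json
from pathlib import Path

STATE_FILE = Path("daily_report_state.json")  # hypothetical state location

def already_sent(date: str) -> bool:
    """Idempotency check: has this date's email already gone out?"""
    if not STATE_FILE.exists():
        return False
    return date in json.loads(STATE_FILE.read_text()).get("sent", [])

def mark_sent(date: str) -> None:
    """Record the send so a re-run doesn't produce a duplicate email."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {"sent": []}
    state["sent"].append(date)
    STATE_FILE.write_text(json.dumps(state))

def run_daily_job(date: str, rows: list[dict]) -> str:
    if already_sent(date):
        return "skipped: already sent"
    if not rows:  # empty data from upstream: fail loudly, don't email garbage
        return "error: no data extracted"
    # ... format and send the email here ...
    mark_sent(date)
    return "sent"
```

Running this twice for the same date sends exactly one email, and an empty extraction aborts instead of mailing a blank report, which is the difference between a script and a system.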
This is why I don’t believe demand for engineering jobs will drop. If anything, past shifts (the industrial revolution, the internet, the digitisation of everything) suggest that greater capability almost always demands a greater supply of skills. People just need to adapt to new ways of working.
The LLM as Lossy Runtime
Karpathy’s operating system metaphor is useful. An OS has defined syscalls, documented error codes, and deterministic behavior for the same inputs. An LLM has none of these.
A better mental model: the LLM is a brilliant but unreliable non-deterministic co-processor. It can execute tasks that would take you hours in seconds, but it sometimes returns wrong results with absolute confidence. Your job as the engineer isn’t to write the perfect prompt that prevents errors; it’s to build the harness that catches them.
Here’s what that harness looks like in practice.
Technique 1: Structured Scaffolding, Not Free-Form Prompts
The single biggest upgrade from slop to production: never accept free-form LLM output for anything you plan to use programmatically.
When you ask “write me a CLI,” you get markdown, prose explanations, and code in varying states of completeness. When you ask for a specific JSON schema, you get a parseable artifact you can pipe into your build system.
You are an expert Python engineer. Always respond in valid JSON with this exact schema:
{
"thinking": "step-by-step reasoning and trade-offs",
"code": "complete, ready-to-run code with inline comments",
"explanation": "usage, edge cases, and assumptions",
"tests": ["list of test cases or pytest snippets"]
}
Only output the JSON. No markdown. No prose outside the schema.
This is the LLM equivalent of typed function signatures. You wouldn’t write a production API that returns unstructured text. Don’t let your LLM do it either.
Pair this with the structured output features in the OpenAI and Claude APIs, and you’ve turned a probabilistic text generator into a deterministic data pipeline, at least at the interface boundary.
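Enforcing the schema above takes only a small gatekeeper on your side. A minimal sketch, assuming the four-key schema from the prompt; anything that doesn’t parse or is missing a key gets rejected rather than silently used:

```python
import json

# The four keys the system prompt demands
REQUIRED_KEYS = {"thinking", "code", "explanation", "tests"}

def parse_llm_response(raw: str) -> dict:
    """Reject anything that isn't the exact JSON schema we asked for."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"not valid JSON: {e}") from e
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(data["tests"], list):
        raise ValueError("'tests' must be a list")
    return data
```

Downstream code only ever sees a validated dict, so markdown wrappers, prose preambles, and half-finished JSON die at the boundary instead of inside your build system.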
Technique 2: Self-Critique Loops
The generate-once-and-ship pattern is how you get slop. The generate-critique-revise pattern is how you get quality.
First, plan the solution step-by-step. Consider architecture,
error handling, and UX.
[After getting initial output]
Now critique the above as a senior staff engineer. Score 1-10 on:
correctness, robustness, usability, maintainability.
List specific fixes. Then output a revised version in the same
JSON schema.
Two or three cycles of this consistently produce dramatically better output. The LLM is excellent at critiquing its own work when given clear evaluation criteria; it just won’t do it unprompted.
This mirrors how senior engineers actually write code: draft, review, revise. The difference is the cycle time. What takes a human hours takes the LLM seconds. The bottleneck shifts from writing to evaluating.
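The loop itself is a dozen lines of deterministic code. A sketch under one assumption: `call_llm(prompt) -> str` is a stand-in for whatever model client you use, not a real library call.

```python
def critique_and_revise(task: str, call_llm, rounds: int = 2) -> str:
    """Generate once, then run critique/revise cycles.
    `call_llm(prompt) -> str` is a hypothetical model client."""
    draft = call_llm(f"Plan the solution step-by-step, then solve:\n{task}")
    for _ in range(rounds):
        critique = call_llm(
            "Critique the following as a senior staff engineer. "
            "Score 1-10 on correctness, robustness, usability, "
            f"maintainability, and list specific fixes:\n{draft}"
        )
        draft = call_llm(
            "Revise the draft using this critique. Same JSON schema.\n"
            f"Critique:\n{critique}\nDraft:\n{draft}"
        )
    return draft
```

The prompts live in code, not in a chat window, so the cycle runs the same way every time and can be dropped into any pipeline.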
Technique 3: Context Packaging as Modular Architecture
Most people paste their entire codebase into a prompt and hope for the best. Engineers treat context like code modules: structured, versioned, and reusable.
<context>
[paste relevant code, docs, or previous outputs here]
</context>
<requirements>
- Use Typer + Rich
- Zero unnecessary dependencies
- Full --help support
- Graceful error handling
</requirements>
<task>
[Specific request]
</task>
XML tags aren’t magic. But they give the LLM clear boundaries between different types of information. The model processes tagged sections more reliably than unstructured walls of text.
This scales. Maintain a library of context blocks: your project’s architecture, your coding conventions, your error handling patterns. Swap them in and out as needed. This is the LLM equivalent of import statements.
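In code, that library is just a dict of named blocks plus an assembler. A sketch with hypothetical block names; the tag layout mirrors the template above:

```python
# A tiny library of reusable context blocks, swapped in per task.
# Block names and contents here are illustrative.
CONTEXT_BLOCKS = {
    "conventions": "- Use Typer + Rich\n- Zero unnecessary dependencies",
    "error_handling": "- Full --help support\n- Graceful error handling",
}

def build_prompt(task: str, blocks: list[str], context: str = "") -> str:
    """Assemble a tagged prompt from named requirement blocks."""
    requirements = "\n".join(CONTEXT_BLOCKS[name] for name in blocks)
    return (
        f"<context>\n{context}\n</context>\n"
        f"<requirements>\n{requirements}\n</requirements>\n"
        f"<task>\n{task}\n</task>"
    )
```

Version the blocks alongside your code and every prompt in your pipeline inherits the same conventions, the way modules inherit from shared imports.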
Technique 4: Model Routing, Not Model Loyalty
No single model is best at everything. The engineers getting the most from LLMs aren’t loyal to one model — they route tasks based on what each model does best.
In my daily pipeline:
Fast model (Gemini Flash, Haiku) for data extraction, formatting, and routine summarization
Strong model (Claude Opus) for judgment calls — clinical pattern recognition, nuanced coaching recommendations, anything where missing a subtle signal costs more than the extra latency
The routing logic itself is deterministic code. The models are interchangeable components. When a new model drops that’s faster or cheaper, I swap it in for the appropriate tier and verify against my test suite.
This is operating system thinking. You don’t write code that only runs on one processor. You write against abstractions and let the runtime choose the best execution path.
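The routing tier described above reduces to a lookup table plus a guard. A minimal sketch; the model names and task types are placeholders, not real model identifiers:

```python
# Hypothetical tier names; swap in whatever your providers offer.
ROUTES = {
    "extract": "fast-model",      # data extraction, formatting
    "summarize": "fast-model",    # routine summarization
    "judgment": "strong-model",   # nuanced calls where misses are costly
}

def route(task_type: str) -> str:
    """Deterministic routing: the task type, not model loyalty, picks the model."""
    if task_type not in ROUTES:
        raise ValueError(f"unknown task type: {task_type}")
    return ROUTES[task_type]
```

When a better model ships, upgrading a tier is a one-line change to the table, verified against your test suite; none of the calling code moves.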
Technique 5: Verification as Architecture
The YC lecture mentions speeding up the “generation → verification loop.” That undersells it. Verification isn’t a loop; it’s the architecture.
Every LLM output in my production system goes through verification before it’s used:
Schema validation — does the JSON match the expected structure?
Content validation — are the key fields populated? Are values in plausible ranges?
State validation — does this output contradict our stored history? (A client who had 33 check-ins yesterday doesn’t have one today.)
Quality validation — is the report complete? Does it have all 5 required sections?
If any validation fails, the system doesn’t silently accept bad output. It retries with a different approach, falls back to stored data, or escalates to a human. The LLM is just one component in a larger system with defined invariants.
This is the engineering discipline that separates “I built a cool demo” from “this runs my business while I sleep.”
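The four validation layers can be composed into one gate that returns every failure at once. This is a sketch, not the production system: the field names, the 0-50 plausibility range, and the five-section rule are illustrative stand-ins for your own invariants.

```python
def verify_report(report: dict, history: dict) -> list[str]:
    """Run every validation layer; return the list of failures (empty = pass)."""
    failures = []
    # 1. Schema: required fields present
    for key in ("client_id", "checkins_today", "sections"):
        if key not in report:
            failures.append(f"schema: missing {key}")
    if failures:
        return failures  # no point checking content without the fields
    # 2. Content: values in plausible ranges
    if not 0 <= report["checkins_today"] <= 50:
        failures.append("content: checkins_today out of range")
    # 3. State: output must not contradict stored history
    prev = history.get(report["client_id"], 0)
    if prev > 0 and report["checkins_today"] == 0:
        failures.append("state: active client suddenly has zero check-ins")
    # 4. Quality: report must be complete
    if len(report["sections"]) != 5:
        failures.append("quality: expected 5 sections")
    return failures
```

A non-empty failure list is what triggers the retry, fallback, or human escalation; the LLM output never reaches downstream code unexamined.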
The Skills Layer: Where It All Compounds
Here’s where Software 3.0 genuinely shifts the game. Individual prompts are disposable. Skills are compound interest.
A skill is a reusable, versioned, self-contained workflow that packages:
The system prompt (personality, constraints, output format)
The context (domain knowledge, project conventions)
The tools (what the agent can call)
The verification (what “done” looks like)
The fallbacks (what happens when things break)
I have a skill that analyzes fitness check-in data and generates coaching reports. It took 50+ iterations to get right. Now it runs twice daily on a cron job, processing 18 clients autonomously. Each run costs about $0.30 in LLM tokens. The equivalent human effort would be 4+ hours of a coach’s time.
The skill is the artifact. The prompts that built it are long gone. This is the shift from “prompting” to “engineering”: you’re not optimizing a single interaction, you’re building a system that improves with every edge case you handle and every failure mode you fix.
Non-engineers can create impressive one-off outputs. Engineers create systems that compound.
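The five ingredients of a skill map cleanly onto one data structure. A sketch, assuming `generate`, `verify`, and `fallback` are your own functions rather than any particular framework’s API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Skill:
    """A versioned, self-contained workflow wrapped around an LLM call."""
    name: str
    version: str
    system_prompt: str                # personality, constraints, output format
    context: str                      # domain knowledge, project conventions
    generate: Callable[[str], dict]   # the model call: prompt -> structured output
    verify: Callable[[dict], bool]    # what "done" looks like
    fallback: Callable[[], dict]      # what happens when things break
    max_retries: int = 2

    def run(self, task: str) -> dict:
        prompt = f"{self.system_prompt}\n{self.context}\n{task}"
        for _ in range(self.max_retries + 1):
            output = self.generate(prompt)
            if self.verify(output):
                return output
        return self.fallback()  # degrade gracefully instead of shipping slop
```

Because the whole workflow lives in one versioned object, every edge case you fix and every failure mode you handle lands in the skill, not in a chat transcript you’ll never find again.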
The Honest Truth About Software 3.0
Karpathy is right that we’re in the 1960s of LLMs. The 1960s weren’t democratic. Assembly language existed, but the people who built reliable systems were the ones who understood memory management, error handling, and hardware constraints.
We’re in the same place now, just with natural language instead of assembly. The barrier to entry is lower: you can produce something that runs on your first try. But the barrier to reliability hasn’t moved. Building something that handles edge cases, degrades gracefully, and runs unattended for months still requires engineering discipline.
The tools will improve. Context windows will grow. Models will get more reliable. The verification loops will tighten. But the core skill (treating the LLM as a brilliant, unreliable component in a larger system you engineer) will remain the differentiator.
Software 3.0 doesn’t democratize great software. It supercharges the people who already know how to build it. The gap between engineers and non-engineers isn’t closing; it’s widening, because engineers are building compound systems while everyone else is still optimizing single prompts.
The engineers who master this now, the ones who build skills, design verification architectures, and treat the LLM as a lossy runtime to engineer around, will define the next decade of software.
Everyone else will keep wondering why their “automated” workflow broke again this morning.
Building production AI automation? ArchonHQ gives you the skills architecture, verification frameworks, and orchestration tools to turn LLMs from toys into reliable systems. Stop prompting. Start engineering.


