AI Agents That Survive Production
The agent demos of the last two years share a shape: give a model some tools, a goal, and a loop, then watch it plan, browse, write, and execute. It looks like the future. Then you run the same agent fifty times on real work and discover the demo was a highlight reel. Somewhere between run twelve and run thirty it deleted the wrong file, re-tried a failed payment, or spent forty minutes confidently solving the wrong problem.
None of this means agents don't work. It means the ones that survive production look very different from the ones that win demo day. Having run agent systems continuously for real workloads — including our own infrastructure — the survivors share a handful of traits.
Narrow beats general, every time
The production survivors are boring on purpose. They do one job: triage inbound tickets, reconcile one report, keep one pipeline healthy. The general-purpose "digital employee" fails not because the model is weak but because the error surface is unbounded — no one can review, test, or even describe everything it might do. A narrow agent has a small enough action space that a human can reason about its worst case. That property, not intelligence, is what earns production access.
The loop needs walls, not vibes
Every autonomous loop eventually goes wrong; the design question is what happens then. Survivors run with hard budgets — maximum steps, maximum spend, maximum runtime — and stop rather than improvise when they hit one. They treat irreversible actions differently from reversible ones: reading is free, drafting is cheap, sending and deleting require either a human checkpoint or an allowlist earned over months of clean runs. The rule of thumb that has held up for us: an agent may retry twice, then it must stop and report. Endless self-correction burns money and, worse, hides failures.
An agent that stops and says "I'm stuck, here's why" is production-grade. An agent that keeps trying is a demo that hasn't failed publicly yet.
Memory is a liability until it's designed
Giving an agent memory sounds like an upgrade; done casually, it's contamination. Yesterday's half-correct conclusion becomes today's confident premise. Survivors separate what may be remembered (verified facts, explicit decisions, stable preferences) from what must expire (guesses, intermediate reasoning, anything unverified). The memory that matters most in production is the humble kind: what did I already try, and what happened — the audit trail that lets a human debug the agent's week in ten minutes.
Observability is the actual product
The difference between an agent you trust and one you don't is rarely the model — it's whether you can see what it did. Every tool call, every decision point, every token spent, traced and queryable. When something goes wrong (it will), the question "what exactly happened at 3 a.m. Tuesday" must have a five-minute answer. Teams that treat tracing as an afterthought end up with a system nobody dares to extend, because nobody can prove what it currently does.
The human is part of the architecture
The most successful pattern isn't full autonomy — it's an agent that does 95% of the work and routes the 5% judgment calls to a person, batched, with context, at a sensible cadence. Not because models can't decide, but because accountability can't be delegated to them. The org chart still ends at a human. Systems designed around that fact ship; systems designed to pretend otherwise stall in review.
| Demo agent | Production agent | |
|---|---|---|
| Scope | General-purpose | One narrow job |
| On failure | Retries indefinitely | Stops after 2, reports |
| Memory | Everything, unfiltered | Verified facts only; guesses expire |
| Tracing | None | Every tool call, queryable |
| Judgment calls | Made by the agent | Routed to a human, batched |
Agents are quietly becoming normal infrastructure — cron jobs that can read, write, and reason. The winners won't be the most autonomous. They'll be the most auditable, the most bounded, and the easiest to hand to the person who owns the process. Build for that, and production stops being the place where agents go to die.