
AI Agents That Survive Production

The agent demos of the last two years share a shape: give a model some tools, a goal, and a loop, then watch it plan, browse, write, and execute. It looks like the future. Then you run the same agent fifty times on real work and discover the demo was a highlight reel. Somewhere between run twelve and run thirty it deleted the wrong file, retried a failed payment, or spent forty minutes confidently solving the wrong problem.

None of this means agents don't work. It means the ones that survive production look very different from the ones that win demo day. We have run agent systems continuously on real workloads, including our own infrastructure, and the survivors share a handful of traits.

↓ steeply: Agent task success rate falls off sharply as the number of required sequential steps grows.
Source: METR, "Measuring AI Ability to Complete Long Tasks," 2025

2 retries: The retry budget that has held up best for our own agents before forcing a stop-and-report.
Internal operating rule, Ayra ix

Narrow beats general, every time

The production survivors are boring on purpose. They do one job: triage inbound tickets, reconcile one report, keep one pipeline healthy. The general-purpose "digital employee" fails not because the model is weak but because the error surface is unbounded — no one can review, test, or even describe everything it might do. A narrow agent has a small enough action space that a human can reason about its worst case. That property, not intelligence, is what earns production access.
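One way to make "small enough action space" concrete is an explicit tool registry that refuses anything outside the agent's declared job. The following is a minimal sketch, not a reference implementation: `NarrowToolbox`, `Tool`, `ActionNotAllowed`, and the two example tools are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


class ActionNotAllowed(Exception):
    """Raised when the agent requests a tool outside its declared scope."""


@dataclass(frozen=True)
class Tool:
    name: str
    run: Callable[[str], str]
    worst_case: str  # human-written, one-line description of the worst outcome


class NarrowToolbox:
    """Hypothetical registry: the agent can only call what is listed here."""

    def __init__(self, tools: List[Tool]) -> None:
        self._tools: Dict[str, Tool] = {t.name: t for t in tools}

    def call(self, name: str, arg: str) -> str:
        # Anything not registered is rejected instead of improvised.
        if name not in self._tools:
            raise ActionNotAllowed(f"{name!r} is outside this agent's scope")
        return self._tools[name].run(arg)

    def worst_cases(self) -> List[str]:
        # The full review surface: one line per tool.
        return [f"{t.name}: {t.worst_case}" for t in self._tools.values()]


# Example tools for a ticket-triage agent (stubs, assumed for illustration).
toolbox = NarrowToolbox([
    Tool("read_ticket", lambda tid: f"ticket {tid} body", "reads a ticket it should not"),
    Tool("set_label", lambda label: f"labelled {label}", "applies the wrong label"),
])
```

With only two tools registered, `toolbox.worst_cases()` is a list a reviewer can read in seconds, which is the property the paragraph above argues for.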

Pattern
Reliability by scope (Illustrative)
One narrow job: 92%
A few related jobs: 68%
General "digital employee": 31%
Illustrative — directional reliability as an agent's action space widens, consistent with our own operating experience. Not a controlled benchmark.

The loop needs walls, not vibes

Every autonomous loop eventually goes wrong; the design question is what happens then. Survivors run with hard budgets — maximum steps, maximum spend, maximum runtime — and stop rather than improvise when they hit one. They treat irreversible actions differently from reversible ones: reading is free, drafting is cheap, sending and deleting require either a human checkpoint or an allowlist earned over months of clean runs. The rule of thumb that has held up for us: an agent may retry twice, then it must stop and report. Endless self-correction burns money and, worse, hides failures.
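The budgets and the retry rule described above could be sketched roughly as follows. This is an illustration under stated assumptions rather than the loop we actually run: `run_bounded`, `Step`, `Report`, `Stop`, and the default limits are all hypothetical, and the "approved" set stands in for whatever checkpoint or allowlist a real system uses.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, FrozenSet, List


class Stop(Enum):
    DONE = "done"
    STEP_BUDGET = "step budget exhausted"
    SPEND_BUDGET = "spend budget exhausted"
    TIME_BUDGET = "runtime budget exhausted"
    RETRIES = "retry limit reached"
    NEEDS_APPROVAL = "irreversible action needs a human"


@dataclass
class Step:
    name: str
    action: Callable[[], None]
    irreversible: bool = False
    cost: float = 0.0  # assumed per-attempt cost, in whatever unit you budget


@dataclass
class Report:
    reason: Stop
    completed: List[str] = field(default_factory=list)
    detail: str = ""


def run_bounded(
    steps: List[Step],
    max_steps: int = 20,
    max_spend: float = 1.0,
    max_seconds: float = 60.0,
    max_retries: int = 2,
    approved: FrozenSet[str] = frozenset(),
) -> Report:
    start = time.monotonic()
    spent = 0.0
    report = Report(reason=Stop.DONE)
    for index, step in enumerate(steps):
        # Budgets are checked once, before each step starts.
        if index >= max_steps:
            report.reason = Stop.STEP_BUDGET
            return report
        if spent + step.cost > max_spend:
            report.reason = Stop.SPEND_BUDGET
            return report
        if time.monotonic() - start > max_seconds:
            report.reason = Stop.TIME_BUDGET
            return report
        # Irreversible actions run only if explicitly approved.
        if step.irreversible and step.name not in approved:
            report.reason, report.detail = Stop.NEEDS_APPROVAL, step.name
            return report
        last_error = None
        for _attempt in range(1 + max_retries):  # first try plus retries
            spent += step.cost  # retries are charged too
            try:
                step.action()
                break
            except Exception as exc:
                last_error = exc
        else:
            # Every attempt failed: stop and report instead of improvising.
            report.reason = Stop.RETRIES
            report.detail = f"{step.name}: {last_error}"
            return report
        report.completed.append(step.name)
    return report
```

The design choice worth noting is that every exit path returns a `Report` with a reason, so a stop is always visible to whoever owns the process.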

An agent that stops and says "I'm stuck, here's why" is production-grade. An agent that keeps trying is a demo that hasn't failed publicly yet.

Memory is a liability until it's designed

Giving an agent memory sounds like an upgrade; done casually, it's contamination. Yesterday's half-correct conclusion becomes today's confident premise. Survivors separate what may be remembered (verified facts, explicit decisions, stable preferences) from what must expire (guesses, intermediate reasoning, anything unverified). The memory that matters most in production is the humble kind: what did I already try, and what happened — the audit trail that lets a human debug the agent's week in ten minutes.
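The split between what may be remembered and what must expire can be sketched as a small store in which unverified entries carry a time-to-live and can never overwrite a verified fact. This is a minimal illustration; `AgentMemory`, `Entry`, and the one-hour default TTL are assumptions, not a description of any particular product.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    value: str
    verified: bool
    expires_at: Optional[float]  # None means "keep until explicitly replaced"


class AgentMemory:
    """Hypothetical store: verified facts persist, guesses expire."""

    def __init__(self, guess_ttl_seconds: float = 3600.0) -> None:
        self._entries: Dict[str, Entry] = {}
        self._guess_ttl = guess_ttl_seconds

    def remember_verified(self, key: str, value: str) -> None:
        self._entries[key] = Entry(value, verified=True, expires_at=None)

    def remember_guess(self, key: str, value: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        existing = self._entries.get(key)
        if existing is not None and existing.verified:
            return False  # a guess never overwrites a verified fact
        self._entries[key] = Entry(value, verified=False, expires_at=now + self._guess_ttl)
        return True

    def recall(self, key: str, now: Optional[float] = None) -> Optional[Entry]:
        now = time.time() if now is None else now
        entry = self._entries.get(key)
        if entry is None:
            return None
        if entry.expires_at is not None and now >= entry.expires_at:
            del self._entries[key]  # expired guesses are dropped on read
            return None
        return entry
```

Callers can check `entry.verified` before treating a recalled value as a premise, which keeps yesterday's guess from silently becoming today's fact.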

Observability is the actual product

The difference between an agent you trust and one you don't is rarely the model — it's whether you can see what it did. Every tool call, every decision point, every token spent, traced and queryable. When something goes wrong (it will), the question "what exactly happened at 3 a.m. Tuesday" must have a five-minute answer. Teams that treat tracing as an afterthought end up with a system nobody dares to extend, because nobody can prove what it currently does.
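As a rough sketch of what "traced and queryable" might mean at its simplest, the decorator below records every tool call, including failures, and a helper answers time-window questions. The names `traced`, `TRACE`, `calls_between`, and `lookup_invoice` are hypothetical, and an in-memory list stands in for whatever trace store a real deployment would use.

```python
import functools
import json
import time
from typing import Any, Callable, Dict, List

TRACE: List[Dict[str, Any]] = []  # stand-in for a real trace store


def traced(tool: Callable[..., Any]) -> Callable[..., Any]:
    """Record name, arguments, outcome, and timing of every call to `tool`."""

    @functools.wraps(tool)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        record: Dict[str, Any] = {
            "tool": tool.__name__,
            "args": repr(args),
            "kwargs": repr(kwargs),
            "started_at": time.time(),
        }
        try:
            result = tool(*args, **kwargs)
            record["status"] = "ok"
            record["result"] = repr(result)
            return result
        except Exception as exc:
            record["status"] = "error"
            record["error"] = f"{type(exc).__name__}: {exc}"
            raise  # tracing must not swallow the failure
        finally:
            record["ended_at"] = time.time()
            TRACE.append(record)

    return wrapper


def calls_between(start: float, end: float) -> List[Dict[str, Any]]:
    """Answer 'what happened in this window' from the trace."""
    return [r for r in TRACE if start <= r["started_at"] <= end]


def dump_jsonl() -> str:
    """Serialize the trace as one JSON object per line."""
    return "\n".join(json.dumps(r) for r in TRACE)


@traced
def lookup_invoice(invoice_id: str) -> dict:
    # Stub tool, assumed for illustration.
    return {"id": invoice_id, "status": "paid"}
```

Because the record is appended in `finally`, failed calls show up in the trace alongside successful ones, which is usually where the debugging starts.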

The human is part of the architecture

The most successful pattern isn't full autonomy — it's an agent that does 95% of the work and routes the 5% judgment calls to a person, batched, with context, at a sensible cadence. Not because models can't decide, but because accountability can't be delegated to them. The org chart still ends at a human. Systems designed around that fact ship; systems designed to pretend otherwise stall in review.
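The routing described above could look something like the sketch below: routine decisions are handled by the agent, while designated judgment calls are queued with their context and handed to a person in batches. `ReviewQueue`, `Decision`, and the example decision kinds are hypothetical; which kinds count as judgment calls is an assumption each team has to make for its own process.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass
class Decision:
    kind: str
    summary: str
    context: str          # what the reviewer needs to decide quickly
    proposed_action: str  # the agent's recommendation, not its verdict


@dataclass
class ReviewQueue:
    # Assumed examples of decisions a human must own.
    judgment_kinds: FrozenSet[str] = frozenset({"refund", "account_closure"})
    batch_size: int = 10
    pending: List[Decision] = field(default_factory=list)

    def route(self, decision: Decision) -> str:
        if decision.kind in self.judgment_kinds:
            self.pending.append(decision)
            return "queued_for_human"
        return "handled_by_agent"

    def next_batch(self) -> List[Decision]:
        # Hand the reviewer a bounded batch instead of a stream of pings.
        batch = self.pending[: self.batch_size]
        self.pending = self.pending[self.batch_size:]
        return batch
```

A scheduler would call `next_batch()` at whatever cadence suits the reviewer, so the human sees a short list with context rather than an interruption per decision.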

Comparison
Demo agent vs. production agent
| | Demo agent | Production agent |
| --- | --- | --- |
| Scope | General-purpose | One narrow job |
| On failure | Retries indefinitely | Stops after 2, reports |
| Memory | Everything, unfiltered | Verified facts only; guesses expire |
| Tracing | None | Every tool call, queryable |
| Judgment calls | Made by the agent | Routed to a human, batched |

Agents are quietly becoming normal infrastructure — cron jobs that can read, write, and reason. The winners won't be the most autonomous. They'll be the most auditable, the most bounded, and the easiest to hand to the person who owns the process. Build for that, and production stops being the place where agents go to die.