Agentic Everything: Are We Building the Right Abstractions?

Ayra ix · 8 min read

A skeptic's take on the AI agent hype cycle

I asked my own AI agent to book a calendar meeting. I had an orchestration system running, the calendar tool was wired up, and the agent had been performing well on simpler tasks. I gave it a natural language instruction, confirmed when it asked for clarification, and watched it go.

It booked the meeting. Right person. Right time zone. Well-written agenda that I couldn't have drafted better. The only problem: I'd asked for the meeting "this Tuesday" and it booked the previous Tuesday. Not next Tuesday. Not a future date. A date that had already passed by six days.

The failure mode was aesthetically perfect. Confident. Polished. Completely wrong. The agent did everything its framework defined as correct. It just didn't understand time. That's not a bug — it's a fundamental characteristic of the abstraction I was using.

Before I shipped any fix, I asked myself a question I now ask before building any agentic system: what happens when this is wrong? In this case: someone gets invited to a meeting that already happened. Embarrassing, not catastrophic. That answer shaped everything about the fix — and about how I build agents now.

The Colleague Mental Model Will Burn You

Most people, myself included at first, model LLM agents as smart colleagues you can delegate to. You'd tell a colleague "book this for Tuesday" and they'd know you meant the next upcoming Tuesday. They'd check. They'd ask if it was ambiguous. They'd flag if something looked off. That's the mental model every agent framework demo is built on, and it's wrong in a way that will bite you every time.

An LLM agent is a stateless text predictor with tools attached. Input: whatever you put in the context window. Output: next tokens, possibly including a tool call. Between calls, nothing persists. There are no goals in any meaningful sense — only probability distributions over next tokens conditioned on the prompt. My calendar agent didn't "think" it was booking the right date. It generated the most statistically plausible token sequence given its input, and that sequence happened to include a past date. When the colleague mental model meets that reality, you get aesthetically perfect failures.
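One way to design for a stateless predictor is to keep date arithmetic out of the model entirely: resolve the relative phrase in ordinary code against an explicit reference date, then refuse anything in the past. The sketch below is illustrative only and is not the fix described later in this article. The function names are hypothetical, and the rule that "this Tuesday" means the first Tuesday on or after today is an assumption.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_upcoming_weekday(name: str, today: date) -> date:
    """Return the first date on or after `today` that falls on weekday `name`."""
    target = WEEKDAYS.index(name.strip().lower())  # ValueError on unknown names
    days_ahead = (target - today.weekday()) % 7    # 0 if today is that weekday
    return today + timedelta(days=days_ahead)


def assert_not_in_past(proposed: date, today: date) -> date:
    """Reject any proposed booking date that has already passed."""
    if proposed < today:
        raise ValueError(
            f"refusing to book {proposed}: it is before today ({today})"
        )
    return proposed
```

With a reference date of Monday 13 May 2024, "Tuesday" resolves to 14 May 2024, and a proposal of 7 May 2024 (six days earlier) is rejected before any calendar tool is called.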

"The colleague mental model is the root cause of most agent failures. Design for a stateless text predictor with tools, not for a smart teammate."

Why Most Agent Frameworks Are Solving Yesterday's Problem

The first wave of agent frameworks was built to make GPT-4's capabilities accessible: chain-of-thought, tool use, multi-step planning. That was genuinely useful work. The problem is that the hard part was never capability. My calendar agent had capability. It used tools correctly, confirmed before acting, formatted its output beautifully. What it lacked was reliability, observability, and controllability, and most frameworks still treat those as afterthoughts.

Reliability: I couldn't trust the output without verification. Observability: I couldn't see why it chose a past date. Controllability: I had no way to intercept the action before it executed. The framework gave me none of these. Frameworks are excellent for demos. They're how you build a prototype in an afternoon. But demos are always the success case — the agent books the meeting correctly, the code runs, the question gets answered. Nobody posts the demo where the agent confidently books six days in the past.

What Actually Fails in Production

Ambiguous instructions where a colleague would ask a clarifying question — agents often don't. Time-sensitive context (the calendar problem). Tasks requiring persistent state across sessions. Anything where "almost right" is worse than nothing: financial calculations, security configurations, anything you're embarrassed to have sent. The pattern: agents fail at the edges of their training distribution, and they fail confidently, without signaling uncertainty. That last part is the dangerous bit.

A Production Agentic Pipeline That Works

My listing evaluation pipeline is the most useful agentic system I've built, and it works precisely because I was honest about what the agent could and couldn't be trusted to do. The pipeline ingests records on a schedule, filters against configurable qualification gates (role fit, location constraints, application friction), evaluates each passing record against a quality rubric, and surfaces the top matches in a periodic digest.
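The filter, evaluate, and rank shape of such a pipeline can be sketched in a few lines of Python. This is not the actual pipeline: the Listing fields, the gate functions, and the score callable (a stand-in for the rubric evaluation) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Listing:
    # Hypothetical fields; the real record schema is not described here.
    title: str
    remote: bool
    application_steps: int


Gate = Callable[[Listing], bool]
Scorer = Callable[[Listing], float]


def build_digest(
    listings: Iterable[Listing],
    gates: list[Gate],
    score: Scorer,
    top_n: int = 3,
) -> list[tuple[float, Listing]]:
    """Keep listings that pass every gate, score them, return the best top_n."""
    qualified = [item for item in listings if all(gate(item) for gate in gates)]
    scored = [(score(item), item) for item in qualified]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]
```

Because the gates are plain functions in a list, changing a qualification rule means changing one entry, and only records that pass every gate are ever sent to the more expensive scoring step.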

What it doesn't do: execute irreversible actions. That step requires me — a human — to review each match and approve. That single checkpoint eliminates the "confident and wrong" failure mode entirely. The cost is a short daily review. The benefit is that nothing consequential happens without my explicit sign-off. I made peace with the tradeoff before I built the system, not after something went wrong.
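A human checkpoint like this can be enforced in code rather than by convention. The following is a minimal sketch, assuming a hypothetical ProposedAction record whose status only a reviewer changes; the names are illustrative.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class ProposedAction:
    description: str
    status: Status = Status.PENDING  # nothing is approved by default


def approve(action: ProposedAction) -> None:
    """Called only from the human review step."""
    action.status = Status.APPROVED


def execute_if_approved(action: ProposedAction, perform: Callable[[], str]) -> str:
    """Run `perform` only when a human has approved the action."""
    if action.status is not Status.APPROVED:
        raise PermissionError(
            f"'{action.description}' is {action.status.value}; "
            "human approval required"
        )
    return perform()
```

The design choice is that the irreversible step cannot run on a pending or rejected action, so a confident but wrong proposal stops at the gate instead of executing.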

The orchestration runs in n8n, not an agent framework. This is not accidental. n8n is a workflow engine: a series of nodes that each do one thing, with explicit branching and defined error states at every step. I can open any past run and see exactly what happened at each node. If a step fails, the workflow stops and I get an alert. If I need to change a filter criterion, I change one node. That's observability, controllability, and maintainability, the three things agent frameworks consistently underdeliver on.
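The properties described here (one job per step, a record of every step, and a stop on the first failure) do not depend on any particular tool. The sketch below shows them in plain Python. It is not n8n's API, and the StepResult and run_workflow names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class StepResult:
    name: str
    ok: bool
    output: Any = None
    error: Optional[str] = None


def run_workflow(
    steps: list[tuple[str, Callable[[Any], Any]]],
    payload: Any,
) -> list[StepResult]:
    """Run steps in order, logging each one; stop at the first failure."""
    log: list[StepResult] = []
    for name, step in steps:
        try:
            payload = step(payload)
        except Exception as exc:
            log.append(StepResult(name, ok=False, error=repr(exc)))
            break  # hard stop; a real system would raise an alert here
        log.append(StepResult(name, ok=True, output=payload))
    return log
```

The returned log is the run history: it shows what each step produced and which step failed, and later steps never run on bad input.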

I've used LangChain and others. They're fast for prototyping. But for anything I need to trust in production, I come back to structured workflows. The "agent" behavior emerges from the workflow design, not from framework magic. That's the point — I want to understand exactly why it did what it did.

Before You Build an Agentic System

Answer "what happens when this is wrong?" first. That answer, not your preference for autonomy, determines whether you need human-in-the-loop

Define explicit failure states, retry limits, and a hard stop, rather than hoping the LLM figures it out

Make sure you can see what happened at each step after the fact, not just whether the final output looked right
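A retry limit with a hard stop can be as small as the sketch below. The HardStop exception and the call_with_retry_limit helper are hypothetical names, and the default of three attempts is an arbitrary assumption.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class HardStop(RuntimeError):
    """Raised when the retry budget is exhausted."""


def call_with_retry_limit(
    fn: Callable[[], T],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> T:
    """Call `fn` up to `max_attempts` times, then stop for good."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # fixed pause between attempts
    # Explicit failure state: the caller must handle HardStop.
    raise HardStop(f"gave up after {max_attempts} attempts") from last_error
```

The original error is chained to the HardStop, so the step that gave up can still be inspected after the fact.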

What the Agent Hype Cycle Misses

The demos are all success cases. The failure mode demos don't get posted because "agent booked a meeting in the past" is embarrassing, not impressive. The missing content is the median case for complex agentic tasks: tool calls that return unexpected error formats, LLMs that confidently misuse an API, agents that hit their context limit mid-task and truncate without signaling. Design for the median, not the demo.
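Designing for the median case can start with refusing to trust a tool's reply until its shape has been checked. The sketch below assumes a hypothetical calendar tool that returns a dict with event_id and start fields, or a dict with an error key on failure. Real tools will differ.

```python
from typing import Any

# Hypothetical response shape for an imaginary calendar tool.
REQUIRED_FIELDS = {"event_id": str, "start": str}


def validate_tool_result(raw: Any) -> dict:
    """Return `raw` if it matches the expected shape, else raise ValueError."""
    if not isinstance(raw, dict):
        raise ValueError(f"expected a dict, got {type(raw).__name__}")
    if "error" in raw:
        raise ValueError(f"tool reported an error: {raw['error']!r}")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(raw.get(field), expected_type):
            raise ValueError(f"missing or malformed field: {field}")
    return raw
```

Failing loudly here turns an unexpected error format into an explicit failure state, instead of text the model may confidently misread.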

The Architecture Follows the Answer

Back to the question: what happens when this is wrong? If the answer is catastrophic — wrong data written, wrong person emailed, money moved — you need human review before the action executes. If the answer is annoying but recoverable — wrong file created, wrong API call that can be retried — automate with hard retry limits and rollback. If the answer is unnoticeable — an imperfect summary, a slightly off classification — fully automate and move on.
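The three answers map naturally onto a small lookup table. The sketch below is one possible encoding: the Impact and Policy names are hypothetical, and the retry count of 3 is an assumed placeholder, not a recommendation from this article.

```python
from dataclasses import dataclass
from enum import Enum


class Impact(Enum):
    CATASTROPHIC = "catastrophic"
    RECOVERABLE = "annoying but recoverable"
    UNNOTICEABLE = "unnoticeable"


@dataclass(frozen=True)
class Policy:
    human_review_before_action: bool
    max_retries: int
    rollback_on_failure: bool


POLICIES = {
    # Wrong data written, wrong person emailed, money moved.
    Impact.CATASTROPHIC: Policy(
        human_review_before_action=True, max_retries=0, rollback_on_failure=False
    ),
    # Wrong file created, retryable API call. The limit of 3 is assumed.
    Impact.RECOVERABLE: Policy(
        human_review_before_action=False, max_retries=3, rollback_on_failure=True
    ),
    # Imperfect summary, slightly off classification.
    Impact.UNNOTICEABLE: Policy(
        human_review_before_action=False, max_retries=0, rollback_on_failure=False
    ),
}


def policy_for(impact: Impact) -> Policy:
    """Look up the handling policy for a given failure impact."""
    return POLICIES[impact]
```

Writing the table down forces the question to be answered per action before the system is built, rather than after something goes wrong.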

My calendar agent sat in the "annoying but recoverable" category. So I didn't rebuild the orchestration. I didn't switch frameworks. I added 60 seconds of wait time after the booking, then sent myself a confirmation with the booked date for review before the invite went out. That's it. That's the production fix. The aesthetically perfect failure became an aesthetically adequate system that hasn't misbehaved since.

The question you answer before you build is the difference between a system that runs reliably and one that books meetings in the past.
