Multi-Agent Systems: When AI Talks to AI and Things Get Weird
Orchestration patterns, failure modes, and lessons from 6 months in production
Last Tuesday, two parts of my submit-verify automation pipeline disagreed with each other. My submission agent reported a successful form post — record REF-2847, confirmation number APP-2847, status COMPLETE. My verification agent, which I added after the third time the submission agent claimed success on a record that never actually persisted, queried the external portal API and found zero records. No submission. Status: NOT_SUBMITTED.
Both agents had evidence. The submission agent had a screenshot of a confirmation page. The verification agent had a JSON response from the portal API with an empty results array. Both were describing the same event — just 60 seconds apart.
The Conversation Log That Started This Post
The orchestrator halted the branch, fired an ntfy alert to my phone, and waited. I checked the portal manually two minutes later. The submission was there.
The external portal had a 60-second processing delay between form submission and API visibility. The submission agent was correct at T+0. The verification agent was correct when it queried 58 seconds later, still inside that window. They were both describing ground truth, just from different moments. This is not a hallucination problem. It is a distributed state problem with a confidence display bug layered on top.
Welcome to multi-agent systems in production.
What "Production-Grade" Actually Means
I want to say this upfront, because everything else in this post flows from it: a multi-agent system is a distributed system. Not metaphorically. Literally. You have processes passing messages, shared state that can be observed at different times, external APIs that can fail or lag, and retry logic that can amplify partial failures into full disasters. Every failure mode from microservices applies. The frameworks don't protect you from any of it.
My definition of production-grade for an agent system: failures are contained, visible, and recoverable without human intervention for 95% of cases. The 5% that need human attention are surfaced immediately with enough context to resolve in under 10 minutes. That's it. Not "agents that succeed" — agents that fail gracefully.
Three Orchestration Patterns, and When Each One Bit Me
Most tutorials describe these patterns theoretically. Here's how they played out in my actual pipeline.
Sequential. Agent A runs, passes output to Agent B, which passes output to Agent C. Each step waits for the previous one. This is the only pattern I use for anything that involves an irreversible action — form submissions, emails, database writes. The execution log in n8n is a clean, ordered record of every step: what the input was, what the agent returned, when it happened. When something breaks, I open the execution history, find the failed node, read the exact error. It's boring to describe and invaluable at 11pm when a submission silently failed.
The cost is speed. My sequential submit pipeline takes about 90 seconds per record. I don't care. I would rather process fewer records correctly than more records quickly.
Parallel. I use parallel execution for the ingestion phase — hitting five external sources simultaneously instead of one at a time. That part works well because the tasks are genuinely independent and the merge contract is trivial: union the results, deduplicate by record ID, done. The mistake I made early on was trying to parallelize the submit-and-verify step to speed things up. That's what produced the incident above. The agents were racing against portal processing time and the orchestrator had no way to know whose snapshot was newer.
The rule I now follow: parallel is safe when sub-tasks are independent AND the merge contract is defined before writing a line of code. "We'll figure out merging later" is how you build the submit/verify contradiction I described.
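As a rough sketch, the ingestion merge contract described above (union the results, deduplicate by record ID) could look like this in Python. The `record_id` key, the "first source seen wins" tie-break, and the sample data are assumptions made for illustration, not details from the pipeline.

```python
from typing import Iterable


def merge_records(source_results: Iterable[list[dict]]) -> list[dict]:
    """Union results from all sources and deduplicate by record ID.

    Assumed contract: every record is a dict with a "record_id" key, and
    when two sources return the same ID, the first one seen wins.
    """
    merged: dict[str, dict] = {}
    for results in source_results:
        for record in results:
            # setdefault keeps the existing record if the ID was already seen.
            merged.setdefault(record["record_id"], record)
    return list(merged.values())


# Hypothetical output of two parallel ingestion branches.
source_a = [{"record_id": "REF-1", "source": "a"}, {"record_id": "REF-2", "source": "a"}]
source_b = [{"record_id": "REF-2", "source": "b"}, {"record_id": "REF-3", "source": "b"}]

merged = merge_records([source_a, source_b])
print([r["record_id"] for r in merged])  # ['REF-1', 'REF-2', 'REF-3']
```

Writing the tie-break down is the point: once "which copy wins" is explicit, the merge no longer depends on which parallel branch happens to finish first.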
Hierarchical. I don't use this yet at scale. My orchestrator is n8n, which sees the entire workflow as a flat graph — there's no agent reasoning about what to delegate. What I've learned from hitting the limits of this: hierarchical orchestration makes sense when you have genuine ambiguity that requires judgment, not just coordination. If you can express the logic as a decision tree, use sequential with branches. Save hierarchical for things that can't be expressed as a decision tree. And don't add it until you have strong observability, because debugging a conversation between a supervisor agent and three sub-agents where one of them had context drift halfway through is not something you want to do at midnight.
The Circuit Breaker That Saved My Pipeline
After the contradicting-agents incident, I added a circuit breaker to every workflow step that touches an external system. The implementation is a single Postgres table:
circuit_breaker table (Postgres):
workflow_id TEXT, step_id TEXT, fail_count INT DEFAULT 0, last_fail TIMESTAMPTZ, cleared_by TEXT, cleared_at TIMESTAMPTZ
Before each n8n node runs: SELECT fail_count FROM circuit_breaker WHERE workflow_id=$1 AND step_id=$2. If fail_count >= 2, skip execution, fire ntfy alert, stop. On success: reset fail_count to 0. On failure: increment fail_count. A human reviews the alert, confirms the step is safe, and runs UPDATE circuit_breaker SET fail_count=0, cleared_by='aj', cleared_at=now() WHERE workflow_id=$1 AND step_id=$2 to re-enable it. The WHERE clause matters: without it, the UPDATE clears every breaker in the table.
Two consecutive failures on the same step = halt. The n8n execution history gives me the exact inputs and outputs for both failed runs. The ntfy alert fires to my phone within seconds. I can look at both failure records, understand what happened, and decide whether to clear the breaker or fix the underlying issue first.
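The check-run-record cycle can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the n8n implementation: in-memory SQLite stands in for Postgres so the example runs anywhere, a primary key on (workflow_id, step_id) is added so the upsert has a conflict target, and `run_step`, `clear_breaker`, and the `alert` callback (standing in for ntfy) are hypothetical names.

```python
import sqlite3
from typing import Callable

FAIL_LIMIT = 2  # two consecutive failures on the same step halts it


def make_db() -> sqlite3.Connection:
    # Assumption: SQLite in place of Postgres; primary key added for the upsert.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE circuit_breaker ("
        "workflow_id TEXT, step_id TEXT, fail_count INT DEFAULT 0, "
        "last_fail TEXT, cleared_by TEXT, cleared_at TEXT, "
        "PRIMARY KEY (workflow_id, step_id))"
    )
    return db


def run_step(db: sqlite3.Connection, workflow_id: str, step_id: str,
             step: Callable[[], object], alert: Callable[[str], None]) -> str:
    key = (workflow_id, step_id)
    row = db.execute(
        "SELECT fail_count FROM circuit_breaker WHERE workflow_id=? AND step_id=?",
        key,
    ).fetchone()
    if row is not None and row[0] >= FAIL_LIMIT:
        # Breaker is open: do not execute, tell a human, stop.
        alert(f"circuit breaker open for {workflow_id}/{step_id}")
        return "skipped"
    try:
        step()
    except Exception:
        # Record the failure; create the row on first failure, else increment.
        db.execute(
            "INSERT INTO circuit_breaker (workflow_id, step_id, fail_count, last_fail) "
            "VALUES (?, ?, 1, datetime('now')) "
            "ON CONFLICT (workflow_id, step_id) DO UPDATE SET "
            "fail_count = fail_count + 1, last_fail = datetime('now')",
            key,
        )
        return "failed"
    # Success resets the consecutive-failure counter.
    db.execute(
        "UPDATE circuit_breaker SET fail_count=0 WHERE workflow_id=? AND step_id=?",
        key,
    )
    return "ok"


def clear_breaker(db: sqlite3.Connection, workflow_id: str, step_id: str,
                  cleared_by: str) -> None:
    # Manual re-enable after a human has reviewed the failures.
    db.execute(
        "UPDATE circuit_breaker SET fail_count=0, cleared_by=?, "
        "cleared_at=datetime('now') WHERE workflow_id=? AND step_id=?",
        (cleared_by, workflow_id, step_id),
    )


db = make_db()
alerts: list[str] = []


def flaky_submit() -> None:
    raise TimeoutError("portal timed out")


outcomes = [run_step(db, "submit_flow", "post_form", flaky_submit, alerts.append)
            for _ in range(3)]
print(outcomes)  # ['failed', 'failed', 'skipped']

clear_breaker(db, "submit_flow", "post_form", "aj")
after_clear = run_step(db, "submit_flow", "post_form", lambda: None, alerts.append)
print(after_clear)  # ok
```

The third call never reaches the external system. That is the whole value of the breaker: the retry that would have made things worse is replaced by an alert.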
This has triggered four times since I built it. Each time it caught something that would have become a worse problem with another retry: once a rate-limit that needed a 10-minute backoff, once a malformed target URL that was crashing the browser automation, once the portal processing delay issue described above, and once a network timeout that was masking a VPN connectivity drop on the NAS.
Most agent frameworks don't implement circuit breakers because the demos don't need them. Demos run once. They don't accumulate failure state. They don't run while the network is flapping or while an external API is rate-limiting or while a portal is mid-batch. The framework community is solving for "make the agent succeed in a demo" not "make the agent fail gracefully in production." Those are different problems.
The other thing frameworks skip: idempotency. If an agent submits a form and the orchestrator doesn't get the confirmation back (network error, timeout), it will retry — and you'll submit twice. Every action with a side effect needs an idempotency key checked before execution. My submit agent writes a fingerprint (record_id + entity_id + date) to Postgres before attempting. If the fingerprint exists, it skips the submission and returns the cached result. This is standard practice in payment systems. It's almost never discussed in AI agent architecture posts.
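A minimal sketch of that fingerprint check follows, assuming a SHA-256 hash over the three fields and a plain dict standing in for the Postgres table. `IdempotentSubmitter`, `fake_submit`, and the sample entity ID and date are hypothetical, chosen only to make the example runnable.

```python
import hashlib
from typing import Callable


def fingerprint(record_id: str, entity_id: str, date: str) -> str:
    # Assumed key format: the three fields joined with "|" and hashed.
    raw = f"{record_id}|{entity_id}|{date}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class IdempotentSubmitter:
    def __init__(self, submit: Callable[[str], dict]):
        self._submit = submit
        self._store: dict[str, dict] = {}  # stand-in for a Postgres table

    def submit_once(self, record_id: str, entity_id: str, date: str) -> dict:
        key = fingerprint(record_id, entity_id, date)
        if key in self._store:
            # Fingerprint exists: skip the side effect, return what we have.
            return self._store[key]
        # Write the fingerprint BEFORE attempting, so a crash mid-submit
        # leaves a PENDING marker instead of inviting a blind retry.
        self._store[key] = {"status": "PENDING"}
        result = self._submit(record_id)
        self._store[key] = result
        return result


calls: list[str] = []


def fake_submit(record_id: str) -> dict:
    calls.append(record_id)
    return {"status": "COMPLETE", "record_id": record_id}


submitter = IdempotentSubmitter(fake_submit)
first = submitter.submit_once("REF-2847", "entity-1", "2024-01-01")
second = submitter.submit_once("REF-2847", "entity-1", "2024-01-01")
print(len(calls))  # 1
```

If the submit call raises, the key stays PENDING and a retry returns that marker rather than posting again, which is when the verification step has to decide what actually happened. With a real database, the check-and-write should be a single atomic statement (for example, an insert that does nothing on conflict) so that two concurrent workers cannot both pass the check.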
What Breaks First (In Order)
After six months of running this pipeline in production, here's the order in which things broke:
Context drift arrives first. Agent A passes a summary to Agent B. Agent B summarizes further. By Agent C, the nuance from A's original observation is gone. The mitigation is to pass structured data — Postgres IDs, timestamps, status codes — not natural language. Let the LLM interpret; don't let it be the source of the record. The database is the source of truth. The agent's output is a candidate fact.
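One way to picture a structured handoff is a small typed message that carries IDs, timestamps, and status codes between agents instead of prose. The `Handoff` class and its field names below are illustrative assumptions, not the pipeline's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Handoff:
    record_id: str    # database ID, never a description of the record
    status: str       # status code, e.g. "COMPLETE" or "NOT_SUBMITTED"
    observed_at: str  # ISO 8601 timestamp of when the status was observed
    source_step: str  # which step produced this candidate fact


def to_message(handoff: Handoff) -> str:
    # Serialize with sorted keys so identical handoffs produce identical text.
    return json.dumps(asdict(handoff), sort_keys=True)


def from_message(message: str) -> Handoff:
    # Raises TypeError if a field is missing or an unexpected one appears,
    # which is what we want: a malformed handoff should fail loudly.
    return Handoff(**json.loads(message))


sent = Handoff("REF-2847", "COMPLETE", "2024-01-01T12:00:00Z", "submit_agent")
received = from_message(to_message(sent))
print(received == sent)  # True
```

Because nothing is summarized, the third agent in a chain sees exactly the fields the first agent recorded, and the `observed_at` field makes "whose snapshot is newer" a comparison rather than a guess.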
Retry amplification arrives second. An agent fails. The orchestrator retries. The agent fails again, differently, because the world changed between attempt one and attempt two. On attempt three the error message is about something completely different from attempt one. By now you've spent 3x the tokens, potentially made three partial side effects, and the error log is misleading because the most recent failure isn't the root cause. This is why the circuit breaker stops at two, not three, and why the ntfy alert includes the first failure's context, not just the most recent one.
The confident wrong answer arrives third, and it's the hardest. A 14B local model states an incorrect conclusion with exactly the same formatting and tone as a correct one. There is no built-in uncertainty signal. I handle this with a verify step after any consequential action, and with structured output that includes an explicit confidence field on any decision that routes to automatic action. Low confidence routes to human review. The agent is never the last gate.
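Routing on an explicit confidence field might be sketched like this. The 0.8 threshold, the JSON shape, and the route names are assumptions for illustration; the only behavior taken from the text above is that anything low-confidence goes to a human.

```python
import json

CONFIDENCE_THRESHOLD = 0.8  # hypothetical cutoff; tune against real outcomes


def route(raw_output: str) -> str:
    """Return "auto" or "human_review" for an agent's structured output."""
    try:
        decision = json.loads(raw_output)
        confidence = float(decision["confidence"])
    except (ValueError, KeyError, TypeError):
        # Unparseable output or a missing/invalid field is treated as
        # low confidence, never as permission to act.
        return "human_review"
    if not 0.0 <= confidence <= 1.0:
        # Out-of-range values (including NaN) are also rejected.
        return "human_review"
    return "auto" if confidence >= CONFIDENCE_THRESHOLD else "human_review"


print(route('{"action": "submit", "confidence": 0.95}'))  # auto
print(route('{"action": "submit", "confidence": 0.40}'))  # human_review
print(route("Sure! I have submitted the form."))          # human_review
```

A self-reported confidence number from a model is itself a candidate fact, so the "auto" route here would still be followed by the verify step rather than replacing it.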
"Agents are not sources of truth. They are producers of candidate facts. The database is the source of truth."
The One Thing Tutorials Never Cover
Every multi-agent tutorial shows you how to build the agents. Almost none show you how to watch them run. I use two layers: LangFuse for LLM call tracing (every call to Ollama, every prompt, every response, latency, token count) and n8n's built-in execution history for workflow-level visibility (every node, every input/output, every error). When the submit/verify contradiction happened, I opened the n8n execution for that run and saw both agent responses side-by-side with their timestamps. The 58-second gap between SUBMIT_AGENT's submission and VERIFY_AGENT's query was right there in the logs. Without that, I would have spent an hour hypothesizing. With it, I had root cause in four minutes.
If you don't have full execution logs with timestamps for every agent call, you are not running a production system. You are running a demo that hasn't failed yet.
Six Months In
The portal incident resolved itself. The submission was there all along — the verify agent had just queried too fast. The orchestrator now waits 60 seconds before running the verification step for any action that involves an external portal submission. That is a hack. It works. Most production systems are full of hacks like this, each one encoding a specific failure mode that someone hit at a bad time.
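The fixed wait can be reproduced in miniature with a fake clock. In this sketch, `FakePortal` is a hypothetical stand-in for the external portal and `verify_after_delay` is an illustrative name; the sleep function is injectable so the example runs instantly instead of actually waiting a minute.

```python
import time
from typing import Callable

PORTAL_VISIBILITY_DELAY_S = 60  # delay before submissions appear in the API


def verify_after_delay(verify: Callable[[], bool],
                       delay_s: float = PORTAL_VISIBILITY_DELAY_S,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    # Wait out the portal's processing window, then run the check once.
    sleep(delay_s)
    return verify()


class FakePortal:
    """Simulated portal whose records become visible after a fixed delay."""

    def __init__(self, visible_after_s: float):
        self.visible_after_s = visible_after_s
        self.clock = 0.0  # seconds since the submission

    def sleep(self, seconds: float) -> None:
        self.clock += seconds  # advance simulated time, no real waiting

    def has_record(self) -> bool:
        return self.clock >= self.visible_after_s


portal = FakePortal(visible_after_s=60)
immediate = portal.has_record()  # what a too-early verify sees
delayed = verify_after_delay(portal.has_record, sleep=portal.sleep)
print(immediate, delayed)  # False True
```

The same check gives two different answers depending only on when it runs, which is the incident in two lines.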
The pipeline running today is not elegant. It is auditable. Every decision is in a log. Every failure has a circuit breaker. Every irreversible action has an idempotency check and a verification step. It does not cascade failures when one part breaks, and it doesn't silently succeed when it actually failed. Those properties are harder to build than the agents themselves.
Build the observability first. Add the circuit breakers before you go to production. Define the merge contract before you write the parallel step. The agent prompts are the easy part. The distributed systems plumbing is where multi-agent pipelines actually live or die — and almost nobody is writing about it.