What Broke When We Let Agents Talk to Each Other

I wrote about building a three-agent orchestration system a few months back. That post was about what we built. This one is about what broke.

Every failure here cost us real time, real money, or real credibility with a client. Some cost all three. I'm writing this down because when we were building the system I couldn't find anything honest about what actually goes wrong in production multi-agent setups. Everything I read was either theoretical or suspiciously clean.

The Cascading Hallucination

A client's workflow involved ingesting product data from an ERP system, normalizing it, and pushing it to their e-commerce platform. The ERP export had a field called product_category_l3 that was empty for roughly 30% of the records. The planner noticed the missing data and, trying to be helpful, added a step to the execution plan: "Infer category from product description using reference taxonomy."

There was no reference taxonomy. The planner had invented a data source that didn't exist. But the hallucination was plausible enough that the executor didn't flag it as an error. Instead it interpreted the instruction as "classify the product based on description text" and used its own judgment to assign categories. Confidently. About 70% of the assignments were reasonable. The other 30% were nonsense — kitchen appliances classified as automotive parts, software licenses categorized as perishable food.

Then the reviewer looked at the output and saw that category fields were populated where they had previously been empty. It compared the output against a schema that said the field should be a non-null string from a known set. The hallucinated categories were valid strings. They passed schema validation. The reviewer approved the workflow.
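The gap is easy to reproduce: a schema check validates form, not provenance. A minimal sketch of the kind of check the reviewer was effectively running (field and category names here are hypothetical):

```python
# Hypothetical known set of valid category strings
VALID_CATEGORIES = {"automotive", "kitchen", "software", "perishable_food"}

def schema_check(record: dict) -> bool:
    """Schema-level validation: the field must be a non-null string
    drawn from the known set. Says nothing about whether the value
    is actually correct for this product."""
    value = record.get("product_category_l3")
    return isinstance(value, str) and value in VALID_CATEGORIES

# A hallucinated-but-well-formed value passes exactly like a correct one:
kitchen_mixer = {"product_category_l3": "automotive"}  # wrong, but valid
assert schema_check(kitchen_mixer)
```

Any validator that only tests membership in a known set will approve confidently wrong values, which is precisely what happened here.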

The client's e-commerce platform dutifully displayed kitchen mixers in the automotive accessories section. We got a phone call.

We did not catch this ourselves. The client caught it. That was the worst part — our entire three-agent pipeline, designed with validation as a first-class concern, had confidently produced and approved garbage without triggering a single alert.

What we changed: the planner now produces a manifest of every external resource its plan references. Before the executor begins, a deterministic (non-LLM) step checks that every resource actually exists and is accessible. Hard gate — anything missing, the workflow pauses for human review. We also built semantic drift detection that flags when the distribution of values in a field shifts significantly between input and output. And we added provenance tracking so every data transformation carries a tag: where did this value come from? "The agent inferred it" is held to a different scrutiny standard than "it came from a source system."
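The hard gate itself is deliberately boring: a plain set check before execution, with no LLM in the loop. A minimal sketch (resource names and the function shape are illustrative, not our actual code):

```python
def verify_manifest(manifest: list[str], registered_sources: set[str]) -> list[str]:
    """Deterministic pre-execution gate. Every external resource the
    planner's manifest references must resolve to a registered, accessible
    source. Returns the missing resources; a non-empty result pauses the
    workflow for human review."""
    return [resource for resource in manifest if resource not in registered_sources]

# Hypothetical example: the planner claims two inputs, but only one exists.
manifest = ["erp_export", "reference_taxonomy"]
registered = {"erp_export", "ecommerce_api"}

missing = verify_manifest(manifest, registered)
if missing:
    # Hard gate: do not hand the plan to the executor.
    print(f"Pausing workflow, unknown resources: {missing}")
```

The point is that this step is not a judgment call. An invented data source fails a set-membership test every time, which is exactly the property an LLM-based check could not give us.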

That took about three weeks of engineering to get right. Not a fast fix.
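For the semantic drift detection mentioned above, one simple statistic that works for categorical fields is total variation distance between a field's value distributions before and after the transformation. This is a sketch of the idea with an arbitrary threshold, not our exact implementation:

```python
from collections import Counter

def value_distribution(values: list) -> dict:
    """Relative frequency of each distinct value in the field."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

def drift_score(before: list, after: list) -> float:
    """Total variation distance between the two distributions:
    0.0 means identical, 1.0 means completely disjoint."""
    d_before = value_distribution(before)
    d_after = value_distribution(after)
    keys = set(d_before) | set(d_after)
    return 0.5 * sum(abs(d_before.get(k, 0.0) - d_after.get(k, 0.0)) for k in keys)

DRIFT_THRESHOLD = 0.3  # illustrative; tune per field

def flag_drift(before: list, after: list) -> bool:
    return drift_score(before, after) > DRIFT_THRESHOLD
```

In the cascading hallucination incident, 30% of a field's values going from empty to populated is a large distributional shift, exactly the kind of change this check exists to surface for review.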

The Retry Storm

We were running about fourteen client workflows simultaneously. Several of them used the same third-party data enrichment API, which had a rate limit of 100 requests per minute across our account. Each executor instance was tracking its own usage against the rate limit, but they weren't coordinating with each other.

Eight workflows hit the enrichment step within a 90-second window. Each executor instance calculated that it had budget for its 15-20 requests. Collectively they fired around 140 requests in one minute. The API returned 429s for about 40 of them. Each instance, seeing failures, backed off and retried. Eight instances doing synchronized backoff meant the retries hit the next minute's quota window simultaneously. This oscillation ran for over 20 minutes before circuit breakers finally tripped.

Eight workflows stalled, and we got a stern email from the API provider about our usage patterns.

Our cost monitoring caught it within a few minutes — eight workflows simultaneously burning retry budgets was an obvious anomaly. The fix took longer because we had to understand why eight independent agents were all misbehaving at once.

We introduced a shared resource broker. Instead of each executor instance directly calling rate-limited APIs, they request a resource ticket from a central broker that tracks global usage across all concurrent workflows. The broker runs a token bucket per external API and queues requests when the bucket is empty. It adds a small amount of latency per call, which is acceptable, and it gives us one place to see every external API interaction across every workflow.
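The broker's core is a standard token bucket guarded by a lock so that concurrent executors draw from one shared budget. A minimal sketch, assuming in-process coordination (a real deployment would likely back this with something like Redis; class and method names are illustrative):

```python
import threading
import time

class ResourceBroker:
    """Central token-bucket broker: one bucket per rate-limited API,
    shared across every executor instance in the process."""

    def __init__(self):
        self._lock = threading.Lock()
        # api name -> [capacity, refill_per_sec, current_tokens, last_refill_time]
        self._buckets = {}

    def register(self, api: str, capacity: int, refill_per_sec: float) -> None:
        with self._lock:
            self._buckets[api] = [capacity, refill_per_sec, float(capacity), time.monotonic()]

    def acquire(self, api: str, timeout: float = 30.0) -> bool:
        """Block until a ticket is available or the timeout expires.
        Returns False on timeout so the caller can queue or defer."""
        deadline = time.monotonic() + timeout
        while True:
            with self._lock:
                cap, rate, tokens, last = self._buckets[api]
                now = time.monotonic()
                tokens = min(cap, tokens + (now - last) * rate)  # refill
                if tokens >= 1.0:
                    self._buckets[api] = [cap, rate, tokens - 1.0, now]
                    return True
                self._buckets[api] = [cap, rate, tokens, now]
            if time.monotonic() >= deadline:
                return False
            time.sleep(0.05)  # brief wait before re-checking the bucket
```

Because every executor asks the same broker, eight workflows can no longer each conclude independently that they have budget: the bucket is the single source of truth for the account-wide limit.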

The deeper lesson: agents that operate independently but share resources aren't truly independent. Either give them fully isolated resources or build explicit coordination. The middle ground — each agent thinking it's independent while actually competing for shared resources — is where incidents live.

The Delegation Loop

This one was the most expensive. A workflow step required transforming address data from a European format to a US format. The planner treated it as a single transformation step. The executor attempted it, produced output, the reviewer rejected it because the postal code format was wrong. Executor retried with adjusted logic. Reviewer rejected again — state abbreviation wasn't standard.

On the third failure, our escalation logic kicked in. But a recently deployed update had introduced an "auto-remediation" path: on repeated failures, the step would be sent back to the planner for re-planning. The planner looked at the failure context and decided the step needed to be broken into two sub-steps. It sent the new plan to the executor.

The postal code sub-step succeeded. The state abbreviation step failed because the input data had province names, not state names. Reviewer rejected. Back to the planner. The planner decided the step needed a country detection sub-step first. New plan to the executor. Country detection worked, but now the workflow had branched into a tree of sub-steps, each generating more sub-steps on failure.

By the time we caught it, the workflow had generated 847 sub-steps from what was originally a single address transformation task. LLM costs for this one workflow: over $100. Elapsed time: nearly four hours.

Our cost-per-workflow alert had triggered, but it was a weekend and the on-call engineer didn't respond for an hour and a half. By then the damage was done.

The fixes: workflows now have a maximum re-planning depth of two. If a step fails after being re-planned twice, it goes to a human — no exceptions, no further auto-remediation. Each workflow has a maximum step count budgeted at three times the original plan size. Cost alerts now automatically pause execution at twice the estimated cost rather than just notifying. And after the planner re-plans a step, there's a cooldown before it can re-plan any step in the same workflow again.
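As noted, none of this is complex; the guards fit in a few lines. A sketch using the thresholds above (class and method names are illustrative, and the cooldown is omitted for brevity):

```python
class ReplanGuard:
    """Convergence guards for agent-to-agent delegation:
    - at most 2 re-plans per step, then escalate to a human
    - total step count capped at 3x the original plan size
    - hard pause once spend reaches 2x the estimated cost"""

    MAX_REPLAN_DEPTH = 2
    STEP_BUDGET_MULTIPLIER = 3
    COST_PAUSE_MULTIPLIER = 2

    def __init__(self, original_plan_size: int, estimated_cost: float):
        self.step_budget = original_plan_size * self.STEP_BUDGET_MULTIPLIER
        self.cost_ceiling = estimated_cost * self.COST_PAUSE_MULTIPLIER
        self.steps_executed = 0
        self.replan_counts = {}  # step_id -> times re-planned

    def allow_replan(self, step_id: str) -> bool:
        """False means: stop auto-remediation, route to a human."""
        self.replan_counts[step_id] = self.replan_counts.get(step_id, 0) + 1
        return self.replan_counts[step_id] <= self.MAX_REPLAN_DEPTH

    def allow_step(self, cost_so_far: float) -> bool:
        """Checked before every step; False pauses the workflow."""
        self.steps_executed += 1
        return (self.steps_executed <= self.step_budget
                and cost_so_far < self.cost_ceiling)
```

Under these limits, the address-transformation incident would have stopped after the second re-plan instead of growing to 847 sub-steps.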

None of that is technically complex. The incident exposed a fundamental assumption we'd made: that agent escalation paths always converge. They don't. Without explicit convergence guarantees, agent-to-agent delegation can diverge indefinitely.

What I Tell People Now

When someone asks me about building multi-agent systems, I tell them three things:

Your agents will find failure modes you didn't imagine. Single-agent systems fail in ways you can roughly predict. Multi-agent systems fail in emergent ways — agents interacting to produce failures that no individual agent would produce alone. You can't enumerate these in advance. You need observability that lets you investigate novel failures quickly, because there will be novel failures.

Treat agent communication as a distributed systems problem. Everything you know about distributed systems applies: messages can be lost, duplicated, or reordered. State can be inconsistent. Timeouts happen. Partial failures are the norm. If your multi-agent architecture doesn't have idempotency, retry budgets, dead-letter queues, and circuit breakers, you're not ready for production.
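To make the last item concrete, a circuit breaker is only a few lines. This is a minimal sketch of the general pattern (not any particular library's API): trip open after consecutive failures, allow a probe after a cooldown, close again on success.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures,
    half-open (one probe allowed) after a cooldown, closed on success."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.consecutive_failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: normal operation
        if time.monotonic() - self.opened_at >= self.reset_after:
            return True  # half-open: permit a single probe call
        return False  # open: fail fast, don't hit the dependency

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

In the retry storm, breakers like this did eventually trip; the lesson is that they need to exist from day one, with thresholds tight enough to stop an oscillation in minutes rather than twenty.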

The monitoring system is not optional infrastructure — it's the product. Our multi-agent platform is only as trustworthy as our ability to understand what it's doing. We have more code in our monitoring and evaluation stack than in the agents themselves. That ratio feels right to me now. It didn't feel right when we were building it.
