Multi-Agent Orchestration Patterns That Survived Production
A year ago, I would have told you multi-agent systems were an interesting research direction but not ready for production. I was wrong. Over the past eight months, my team has deployed several distinct multi-agent architectures across three enterprise products, processing millions of tasks. Some worked beautifully. Some failed in ways the academic papers never warned us about.
The gap between multi-agent research literature and production engineering is enormous. Papers describe elegant architectures with clean diagrams. Production has retry queues, partial failures, cost overruns, and users who submit inputs that no one anticipated. The patterns I'm describing here aren't theoretically optimal. They're the ones that kept running at 3 AM when I was asleep.
A bit of context: we build enterprise data processing and content intelligence products. Our agents handle document analysis, data extraction and enrichment, compliance checking, content generation with quality assurance. The team is 14 engineers. Our monthly LLM spend across all products runs somewhere in the $8K-$14K range, which I mention because cost is a real constraint that shapes every architectural decision we make.
I'm going to focus on three patterns. Honestly, we're still figuring out when to use what — but these three are the ones I'd reach for first.
Pattern 1: The Supervisor
One coordinator agent receives a complex task, breaks it into subtasks, delegates each to a specialist agent, collects the results, and assembles the final output. The supervisor maintains a task queue and a result store. Specialist agents are stateless workers that process one subtask at a time.
In our implementation, the supervisor is a Claude-based agent with a system prompt that defines the decomposition strategy, the available specialists, and the assembly logic. The specialists are smaller, cheaper models — often fine-tuned for narrow tasks. Communication goes through a Redis-backed task queue; results land in PostgreSQL.
The supervisor pattern works best when tasks have clear, well-defined boundaries. Our best implementation processes incoming business documents. The supervisor receives a PDF, identifies its type — invoice, contract, report, correspondence — and delegates to specialists: one for financial data extraction, one for entity recognition, one for clause analysis, one for metadata tagging. Each specialist is deeply optimized for its domain. Because the specialists run in parallel, total latency is determined by the slowest specialist plus supervisor overhead, not the sum of all agents. That's the key performance insight.
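As a sketch, the fan-out looks something like this. The specialist registry, the function names, and the stubbed outputs are all illustrative, not our production code (which goes through a Redis queue with results in PostgreSQL); the point is that delegation is concurrent, so latency is bounded by the slowest specialist.

```python
# Minimal sketch of supervisor fan-out with stubbed specialists.
# SPECIALISTS and run_specialist are hypothetical names for illustration.
from concurrent.futures import ThreadPoolExecutor

SPECIALISTS = {
    "financial": lambda doc: {"totals": "..."},
    "entities": lambda doc: {"parties": "..."},
    "clauses": lambda doc: {"flags": "..."},
    "metadata": lambda doc: {"tags": "..."},
}

def run_specialist(name, doc):
    # In production this would enqueue a subtask and await a worker;
    # here each specialist is a stub function.
    return name, SPECIALISTS[name](doc)

def supervise(doc, subtasks):
    # Fan out subtasks in parallel: total latency is bounded by the
    # slowest specialist plus overhead, not the sum of all of them.
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        futures = [pool.submit(run_specialist, s, doc) for s in subtasks]
        results = dict(f.result() for f in futures)
    # Assembly step: merge specialist outputs into one record.
    return {"doc": doc, "results": results}

out = supervise("invoice.pdf", ["financial", "entities"])
```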
Where it breaks down: ambiguous tasks that don't decompose cleanly. We tried applying it to open-ended research questions where the supervisor had to decide what information to gather before it even understood what the question really required. The decomposition step became a bottleneck — the supervisor would either over-decompose or miss critical aspects. Error rate on ambiguous tasks was several times higher than on well-structured documents.
Cascading errors are the other failure mode. If a specialist produces bad output, the supervisor often can't detect it and incorporates it into the final result. We added a validation layer between specialists and the supervisor for structural errors, but semantic validation — catching the wrong date extracted from a contract — requires a separate quality agent, which starts to look like a different pattern entirely.
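The structural layer can be as small as a required-field check at the boundary. The `REQUIRED` schema and `validate_structural` helper below are hypothetical names for illustration, and note what this catches: missing fields and wrong types, not a wrong date that is structurally a perfectly valid date.

```python
# Sketch of a structural validation layer between specialists and the
# supervisor. The schema contents are illustrative; semantic checks
# (is this date actually the contract date?) need a separate agent.
REQUIRED = {"financial": {"totals": str}, "entities": {"parties": str}}

def validate_structural(specialist, output):
    """Return a list of structural errors; an empty list means pass."""
    errors = []
    for field, ftype in REQUIRED.get(specialist, {}).items():
        if field not in output:
            errors.append(f"{specialist}: missing field {field!r}")
        elif not isinstance(output[field], ftype):
            errors.append(f"{specialist}: {field!r} has wrong type")
    return errors

errs = validate_structural("financial", {"totals": "120.00"})
```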
Pattern 2: Chain of Responsibility
Agents process in sequence, each transforming the output of the previous agent. Agent A extracts raw data, Agent B validates and normalizes it, Agent C enriches it with external data, Agent D formats it for the target system. We implement this as an AWS Step Functions state machine, with intermediate results carried through the execution state. Each agent has a well-defined input and output schema, enforced at the boundaries.
Pipelines with clear transformation stages are the sweet spot. Our data enrichment pipeline is the cleanest example: extract entities from raw text, validate them against known constraints, augment with additional context from our knowledge base, format for the target schema. Four stages, each independently testable.
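A minimal sketch of such a chain, with illustrative stage functions and a required output key per stage standing in for the real schema enforcement. The trace of intermediate outputs is what makes the backward binary search possible later.

```python
# Sketch of a four-stage chain with schema checks at the boundaries.
# Stage functions and field names are illustrative, not our pipeline.
def extract(doc):   return {"entities": ["ACME Corp"]}
def validate(x):    return {**x, "valid": True}
def enrich(x):      return {**x, "context": "known vendor"}
def format_out(x):  return {"record": x}

STAGES = [
    (extract, "entities"),
    (validate, "valid"),
    (enrich, "context"),
    (format_out, "record"),
]

def run_chain(doc):
    trace = []  # log every intermediate output for later debugging
    state = doc
    for stage, required_key in STAGES:
        state = stage(state)
        if required_key not in state:  # output-schema check at the boundary
            raise ValueError(f"{stage.__name__} violated its output schema")
        trace.append((stage.__name__, state))
    return state, trace

result, trace = run_chain("raw text")
```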
The beauty of this pattern is debuggability. When something goes wrong, you inspect the output at each stage and identify exactly where the error was introduced. We log every intermediate output. Our debugging workflow is essentially: look at the final bad output, binary-search backward through the chain to find the stage that introduced the error. Average time to root-cause a production issue is somewhere around eight minutes. Compare that to debugging a monolithic agent that does everything in one pass, where root cause analysis can take hours.
The chain pattern also makes it trivial to swap individual stages. When we upgraded our extraction agent from one model to another, we only had to validate that the output schema was compatible. We ran both versions in parallel for a week, compared outputs, and switched over with zero downtime.
Error propagation is the Achilles heel. A bad extraction in Stage 1 flows through every subsequent stage, and each stage can amplify the error. We had a case where the extraction agent misidentified a document's language, which caused the validation agent to apply the wrong rule set, which caused the enrichment agent to pull irrelevant context, which produced a final output that was confidently wrong in four different ways. It passed all schema validations. That was a bad day.
Latency is also additive by design. Unlike the supervisor pattern, the chain is sequential. Under load, queue times between stages can push total latency to 25+ seconds. For real-time use cases, that's often too slow.
Pattern 3: The Debate
Two agents independently process the same task and produce competing outputs. A judge agent evaluates them — optionally asks the advocates to defend their positions — and selects or synthesizes the best answer. The competing agents don't see each other's outputs until the judge initiates the debate phase.
Our implementation uses three agents: two "advocates" that process the task independently (we use different models or different prompts to ensure diversity of approach), and one judge with access to both outputs simultaneously and a rubric defining evaluation criteria.
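Sketched with stubbed model calls, the layout looks like this. The flag strings and the agree/disagree rubric are placeholders for real advocate outputs and a real judge prompt; the structural point is that the judge is the only agent that sees both outputs.

```python
# Sketch of the two-advocate + judge layout with stubbed outputs.
# Function names and flag strings are illustrative placeholders.
def advocate_conservative(doc):
    # Prompted to flag anything that might be a compliance issue.
    return {"flags": ["late-payment clause", "ambiguous term"]}

def advocate_precise(doc):
    # Prompted to flag only definite issues with supporting evidence.
    return {"flags": ["late-payment clause"]}

def judge(doc, a, b):
    # The judge sees both outputs at once. The rubric here is simply
    # "keep flags both advocates found, escalate the rest" -- a stand-in
    # for a real evaluation-criteria prompt.
    agreed = sorted(set(a["flags"]) & set(b["flags"]))
    disputed = sorted(set(a["flags"]) ^ set(b["flags"]))
    return {"confirmed": agreed, "needs_review": disputed}

doc = "contract.pdf"
verdict = judge(doc, advocate_conservative(doc), advocate_precise(doc))
```

The disagreement set doubles as the triage signal described below: confirmed items flow through, disputed items route to a human.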
The debate pattern shines for high-stakes decisions where accuracy matters more than speed or cost. We use it for compliance document review, where a missed clause or misinterpreted obligation can have serious financial consequences. Advocate A is prompted to be thorough and conservative — flag anything that might be a compliance issue. Advocate B is prompted to be precise and specific — only flag definite issues with supporting evidence. The judge reconciles their findings.
Single-agent compliance review achieves about 89% accuracy on our benchmark dataset. The debate pattern gets us to roughly 97%. That gap might not sound dramatic, but in compliance work, the difference between catching 89% and 97% of issues is the difference between "mostly useful" and "production-grade."
The pattern also produces naturally better-calibrated confidence signals. When both advocates agree, confidence is high. When they disagree, we flag for human review. Items where advocates disagree are many times more likely to contain genuine edge cases than items where they agree. We use this disagreement signal as a triage mechanism: agreement goes straight through, disagreement goes to human review.
The cost problem is real. The debate pattern is 2.5x to 3x more expensive than a single agent. For compliance, the accuracy improvement justifies it. We tried it for content categorization and found that a small accuracy improvement absolutely did not justify the cost increase. Pick your battles.
Latency is also a concern — at least two sequential phases, sometimes three if you add a rebuttal round. Our compliance debate takes somewhere between 18 and 25 seconds per document. Fine for batch workloads, too slow for user-facing real-time applications.
The subtlest failure mode: when both advocates are wrong in the same way. If they share a systematic bias — which happens when both models were trained on similar data — the judge has no disagreement signal and confidently produces a wrong answer. We mitigate this by using genuinely different models for the two advocates, but certain classes of errors are correlated across models in ways that are hard to fully escape.
What I Wish I'd Known Earlier
Start with the simplest pattern that could work. We over-engineered our first system with a complex supervisor-debate hybrid that was nearly impossible to debug. Our second system used a simple chain and worked better in production despite being architecturally "boring." Complexity should be earned, not assumed.
Observability is not optional. Every agent interaction, every intermediate output, every decision point needs to be logged and traceable. We spent roughly 30% of our development time on observability infrastructure, and it was the best investment we made. When production breaks at 2 AM, the difference between "I can see exactly what happened" and "I need to reproduce this" is the difference between a 15-minute fix and a 3-hour investigation.
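The shape of that tracing can be as small as a decorator around every agent call. `traced` below is a hypothetical helper, with `print` standing in for a real log sink; the essentials are a trace id that follows the task, the agent name, timing, and the full output.

```python
# Sketch of per-call agent tracing via a decorator. The log record
# shape is illustrative; a real sink would replace print().
import functools
import json
import time
import uuid

def traced(agent_fn):
    @functools.wraps(agent_fn)
    def wrapper(*args, trace_id=None, **kwargs):
        trace_id = trace_id or str(uuid.uuid4())
        start = time.monotonic()
        result = agent_fn(*args, **kwargs)
        record = {
            "trace_id": trace_id,
            "agent": agent_fn.__name__,
            "elapsed_ms": round((time.monotonic() - start) * 1000, 1),
            "output": result,
        }
        print(json.dumps(record))  # stand-in for a real log sink
        return result
    return wrapper

@traced
def extract_entities(text):
    return {"entities": ["ACME Corp"]}

out = extract_entities("ACME Corp invoice")
```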
Cost modeling has to happen before architecture selection. We designed our first debate system without modeling the cost, deployed it, and discovered it was burning through budget on a task that wasn't worth it. The napkin math takes 30 minutes and can save thousands.
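The napkin math itself is short. Every number below is an assumption to replace with your own: token counts, per-token prices, and monthly volume are illustrative only. The useful output is the ratio between patterns, not the dollar figures.

```python
# Napkin cost model: debate pattern vs. a single agent.
# All prices, token counts, and volumes are assumed placeholders.
PRICE_IN = 3.00 / 1_000_000    # assumed $ per input token
PRICE_OUT = 15.00 / 1_000_000  # assumed $ per output token

def call_cost(tokens_in, tokens_out):
    return tokens_in * PRICE_IN + tokens_out * PRICE_OUT

single = call_cost(4_000, 1_000)
# Debate: two advocates each read the document; the judge reads the
# document plus both advocate outputs and writes a shorter verdict.
debate = 2 * call_cost(4_000, 1_000) + call_cost(6_000, 500)

monthly_docs = 50_000
print(f"single: ${single * monthly_docs:,.0f}/mo")
print(f"debate: ${debate * monthly_docs:,.0f}/mo ({debate / single:.1f}x)")
```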
And honestly — we're still figuring out when to use what. After eight months of running these patterns in production, I can tell you what each one is good at and where each one falls apart. I can't tell you there's a clean decision tree that always gives you the right answer. The patterns are composable building blocks, not mutually exclusive choices. Your mileage will vary based on task characteristics, cost constraints, and how much pain you're willing to take on debugging complexity. The underlying principles hold, though: start simple, observe everything, design for failure, and let the problem drive the architecture.
Updated February 19, 2026