Twelve Engineers, Three Agents, One Platform
Our CTO walked into a planning session mid-year and said something that stuck with me: "We're spending nearly half our engineering hours on workflow orchestration that a well-designed agent system could handle. Figure it out." He wasn't wrong, and he wasn't being glib about it. We'd been building enterprise integration products for mid-market companies, and the pattern was always the same — ingest data from three or four systems, apply business rules, transform it, validate it, push it somewhere else. Our developers were writing essentially the same state machines over and over, just with different field names. The repetition was fine. The edge cases were where it got expensive, and we were drowning in edge cases.
About six months later, we had a multi-agent orchestration system running in production. Built by twelve engineers. Handling workflow automation that previously required dedicated developer effort for every single client onboarding. This is roughly how that went — what worked, what blew up, and what I'd do differently.
Why Twelve Was the Right Number
I've managed teams of five and teams closer to forty. For this project, twelve was deliberate. Three senior engineers each owned one agent's architecture — they had real depth in LLM integrations, prompt engineering, distributed systems. Four mid-level engineers built the orchestration layer, the message bus between agents, the state management, the API gateway. Three engineers focused entirely on evaluation, testing, and observability — honestly the best staffing decision I made on the whole project. Two engineers handled infrastructure: Kubernetes, model serving, cost monitoring, the deployment pipeline.
Why not more? Because agent systems have a coordination overhead that scales badly with team size. When you're designing how three autonomous software entities talk to each other, every engineer needs to hold the full system in their head. At twenty people you get information loss at the seams. At twelve, everyone fits in one room, everyone hears every design decision, and nobody's surprised by what another agent does.
We also had a rule: no one works on more than one agent at a time. Cross-agent work happened in paired sessions between leads. This stopped the kind of implicit coupling that quietly kills multi-agent systems later.
The Three Agents
We tried a few architectures before this one. The first prototype was a single monolithic agent with a massive system prompt. Fine for demos, fell apart in production. It couldn't hold coherent reasoning across a 15-step workflow while also validating data quality and managing error recovery. Too much going on in one context window.
What we landed on was three specialized agents, each with a clear job.
The Planner
Given a workflow specification — what data sources, what transformations, what destination — the planner would decompose it into a directed acyclic graph of tasks, estimate resource requirements, identify dependencies, and produce a typed execution plan in JSON.
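The shape of such a plan can be sketched with a dataclass and a topological sort. The `PlanStep` fields and names below are illustrative, not the actual schema we shipped:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PlanStep:
    id: str
    op: str                      # e.g. "fetch", "transform", "write"
    depends_on: List[str] = field(default_factory=list)
    complexity: float = 0.0      # used downstream for executor model routing

def execution_order(steps):
    """Topologically sort the plan's DAG; raises if the planner emitted a cycle."""
    remaining = {s.id: set(s.depends_on) for s in steps}
    order = []
    while remaining:
        ready = [sid for sid, deps in remaining.items() if not deps]
        if not ready:
            raise ValueError("cycle in execution plan")
        for sid in sorted(ready):
            order.append(sid)
            del remaining[sid]
            for deps in remaining.values():
                deps.discard(sid)
    return order
```

Validating the plan as a DAG at this boundary means a malformed plan fails before the executor ever sees it.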
We ran it on Claude 3.5 Sonnet, later bumping to Opus for complex multi-source workflows. The reasoning capability mattered here because planning required understanding business logic, anticipating failure modes, making trade-off decisions about parallelism versus sequential safety. Not a job for a small model.
The planner consumed roughly 60% of our total LLM spend. When we experimented with a cheaper model for this role, the execution plans had subtle logical errors that only surfaced several steps downstream. Debugging those was miserable. You don't economize on the component that decides what to do.
The Executor
The executor took the planner's output and ran it. Each task was a structured operation: call an API, transform data according to a schema, write to a database, send a notification. The executor didn't need to reason about why it was doing something. It needed to do it reliably, handle retries, and report status.
We ran a mix of smaller models here. For straightforward data transformations and API calls, fine-tuned open-source models on our own infrastructure. For tasks that required interpretation — parsing unstructured data, handling ambiguous field mappings — we routed to Claude Haiku. The routing decision was based on a complexity score the planner assigned to each step.
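The routing decision itself was small. The threshold and model identifiers below are illustrative stand-ins for the two tiers described above:

```python
def pick_model(complexity: float) -> str:
    """Route a plan step to an executor model tier based on the planner's
    complexity score (0.0-1.0). Threshold and names are illustrative."""
    if complexity < 0.5:
        return "local-finetuned"   # deterministic transforms, straightforward API calls
    return "claude-haiku"          # interpretation: unstructured data, ambiguous mappings
```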
That tiered approach cut execution costs by roughly 70% compared to running everything through a frontier model.
The Reviewer
This was, in my opinion, the agent that made the whole thing actually work in production. Its job was quality assurance: validating outputs against expected schemas, checking data consistency across steps, detecting anomalies, deciding whether a completed workflow met acceptance criteria.
The critical design decision was making it a gate, not an advisor. The reviewer didn't suggest fixes. It passed or failed. If it failed a workflow step, that step went back to the executor with the specific failure reason. The executor would retry with that additional context. Fail twice, the workflow paused and a human got paged.
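The gate semantics fit in a few lines. `review_and_dispatch` and its return values are hypothetical names, but the pass/retry/page behavior matches the flow described above:

```python
MAX_RETRIES = 2  # fail twice on the same step and a human gets paged

def review_and_dispatch(step_output, validate, retry_count):
    """Gate, not advisor: pass or fail only. On failure the step goes back
    to the executor with the specific failure reason as added context."""
    ok, reason = validate(step_output)
    if ok:
        return ("pass", None)
    if retry_count + 1 >= MAX_RETRIES:
        return ("page_human", reason)
    return ("retry", reason)
```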
It needed enough reasoning to catch subtle issues — a field that was technically valid but semantically wrong, a transformation that preserved format but lost precision — but it didn't need the planner's full capability. We ran it on Sonnet and gave it access to historical workflow data so it could compare current outputs against baselines and flag statistical outliers.
How the Agents Communicate
This is where most multi-agent tutorials get hand-wavy, and where we spent the most design time.
We built a typed message protocol. Every message between agents had a schema version, a correlation ID tracing back to the original workflow request, a payload conforming to a versioned schema, and a confidence score from the sending agent. That confidence score turned out to be essential — when the planner produced a plan with low confidence on certain steps (usually because the input data was ambiguous), the reviewer would apply stricter validation thresholds on those steps.
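A stripped-down version of that envelope, with illustrative field names (production carried many more fields), might look like this, including the way low sender confidence tightens the reviewer's threshold:

```python
import uuid
from dataclasses import dataclass
from typing import Any, Dict

@dataclass(frozen=True)
class AgentMessage:
    schema_version: str          # e.g. "2.3"
    correlation_id: str          # traces back to the original workflow request
    sender: str                  # "planner" | "executor" | "reviewer"
    payload: Dict[str, Any]      # must conform to the versioned schema
    confidence: float            # 0.0-1.0, set by the sending agent

def stricter_threshold(msg: AgentMessage, base: float = 0.8) -> float:
    """Reviewer tightens validation when the sender was unsure.
    The base threshold and scaling factor here are made-up values."""
    return min(0.99, base + (1.0 - msg.confidence) * 0.15)
```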
The message bus ran on Redis Streams. Not because it was the most sophisticated option, but because the team already knew it, it was fast, and it gave us the consumer group semantics we needed for exactly-once processing. We looked at Kafka briefly and decided the operational overhead wasn't worth it at our throughput levels.
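A minimal sketch of the consumer side, assuming the redis-py client; the stream and group names are made up. Acking only after the handler succeeds gives at-least-once delivery, which, combined with an idempotency key in the payload, approximates the exactly-once semantics consumer groups are used for here:

```python
import json

def consume_batch(client, stream, group, consumer, handler, count=10):
    """Read pending messages for this consumer group, process, then ack.
    If the handler raises, the message stays pending and is redelivered."""
    entries = client.xreadgroup(group, consumer, {stream: ">"}, count=count, block=1000)
    processed = []
    for _, messages in entries:
        for msg_id, fields in messages:
            payload = json.loads(fields[b"payload"])
            handler(payload)                    # may raise; message stays pending
            client.xack(stream, group, msg_id)  # ack only on success
            processed.append(msg_id)
    return processed
```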
State management was the hardest part. Each workflow had a state object in PostgreSQL tracking every agent interaction, every decision, every retry. We versioned these states so we could replay workflows for debugging. The state object averaged around 45KB per workflow, which sounds small until you multiply it by thousands of concurrent workflows and realize your database is doing a lot of read-modify-write cycles.
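The read-modify-write cycle is easiest to show with an in-memory stand-in for the PostgreSQL table. The optimistic version check below is a sketch of the idea, not the actual schema:

```python
class StaleStateError(Exception):
    pass

class StateStore:
    """In-memory stand-in for a versioned workflow-state table: every write
    must name the version it read, so concurrent writers can't clobber
    each other, and old versions can be kept around for replay."""
    def __init__(self):
        self.rows = {}  # workflow_id -> (version, state_dict)

    def read(self, wf_id):
        return self.rows.get(wf_id, (0, {}))

    def write(self, wf_id, expected_version, new_state):
        current_version, _ = self.rows.get(wf_id, (0, {}))
        if current_version != expected_version:
            raise StaleStateError(f"expected v{expected_version}, found v{current_version}")
        self.rows[wf_id] = (current_version + 1, new_state)
        return current_version + 1
```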
The First Month in Production
We deployed with a single client's workflows. The first week was humbling.
Day one, the planner generated a plan that referenced an API endpoint that didn't exist. It had hallucinated the endpoint based on patterns in its training data. The reviewer caught it during validation, but the error message was unhelpful — "endpoint unreachable" rather than "this endpoint was fabricated." We added a pre-execution validation step where the executor checks all external resources before beginning work.
Day three brought our first context window crisis. A complex workflow had 47 steps, and by the time the planner was handling steps 40-47, the earlier context had been pushed out of the effective window. The later steps contradicted decisions made earlier. We solved this by having the planner produce plans in chunks of about 15 steps, with a compressed summary of decisions and constraints carried forward.
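The chunking fix is simple to sketch. `chunk_plan` is a hypothetical name, and the default summary here just carries step IDs forward, where the real system carried a compressed summary of decisions and constraints:

```python
def chunk_plan(steps, chunk_size=15, summarize=None):
    """Split a long plan into chunks, carrying a compressed summary of
    everything decided so far into each subsequent chunk's context."""
    summarize = summarize or (lambda chunk: [s["id"] for s in chunk])
    chunks, carried = [], []
    for i in range(0, len(steps), chunk_size):
        chunk = steps[i:i + chunk_size]
        chunks.append({"context_summary": list(carried), "steps": chunk})
        carried.extend(summarize(chunk))
    return chunks
```

A 47-step plan becomes four chunks, and the planner working on steps 31-45 sees a summary of the first 30 decisions instead of nothing.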
Day five: cost spike. A malformed input triggered the executor into a retry loop where each retry escalated to a more expensive model. Forty-something retries later, one workflow had cost us around $34 in API calls. We implemented hard budget caps per workflow and circuit breakers that halt execution after three consecutive failures on the same step.
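Both guards are small pieces of per-workflow state. The class and the default values below are illustrative:

```python
class BudgetExceeded(Exception):
    pass

class CircuitOpen(Exception):
    pass

class WorkflowGuard:
    """Hard per-workflow budget cap plus a circuit breaker that halts after
    three consecutive failures on the same step."""
    def __init__(self, budget_usd=5.0, max_consecutive_failures=3):
        self.budget = budget_usd
        self.spent = 0.0
        self.failures = {}  # step_id -> consecutive failure count
        self.max_failures = max_consecutive_failures

    def charge(self, cost):
        self.spent += cost
        if self.spent > self.budget:
            raise BudgetExceeded(f"spent ${self.spent:.2f} of ${self.budget:.2f}")

    def record(self, step_id, success):
        if success:
            self.failures[step_id] = 0  # any success resets the streak
            return
        self.failures[step_id] = self.failures.get(step_id, 0) + 1
        if self.failures[step_id] >= self.max_failures:
            raise CircuitOpen(f"step {step_id} failed {self.failures[step_id]} times in a row")
```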
Day eight: the reviewer rejected a perfectly valid workflow because the output format had trailing whitespace that didn't match the reference schema. That led to a two-hour investigation and a disagreement on the team — one of my senior engineers argued we were over-engineering the QA layer, that it should be a deterministic pipeline rather than a full agent. We went back and forth on it, and honestly I'm still not sure he was wrong. What we ended up with was a tiered approach: structural and semantic validation stayed strict, cosmetic validation became lenient with warnings instead of failures.
By the end of the first month, we'd processed a few thousand workflows with roughly a 91% fully-automated success rate. About 9% required human intervention, mostly for edge cases in data formats we hadn't anticipated. That improved to somewhere around 95% by month three as we fed failure cases back into prompt engineering.
What the Numbers Looked Like
Before the agent system, setting up and validating a new client workflow integration took a developer roughly three to four days. We were onboarding maybe eight to ten new client workflows a month.
After: the agent system handles a new workflow setup in a few hours, including validation. We now onboard around 40-50 new workflows a month with the same team size, because developers work on the platform instead of individual client setups. Error rates on agent-orchestrated workflows also dropped: manual setups averaged a couple of bugs found in the first week; agent-orchestrated workflows average fewer than one, and those tend to be edge cases in client data rather than logic errors.
Average LLM cost per workflow execution came in well under a dollar. At fully-loaded engineering costs, the ROI math was obvious enough that it wasn't a hard conversation with leadership.
What We Learned About Agent Boundaries
A few things I'd tell anyone building something like this:
Agents should have opinions, not flexibility. Early on we made the planner too accommodating — if input data was ambiguous, it would generate multiple possible plans and ask the user to choose. That sounds reasonable. In practice it just pushed complexity to the user. We changed it to be opinionated: it picks the best plan, documents its reasoning, and lets the user override if they disagree. The default is a decision, not a question.
The boundaries between agents should feel slightly awkward. If it ever feels natural for one agent to "just quickly" do something that's another agent's job, your boundaries are wrong. We had a constant temptation to let the executor do light validation — just check if the field's not null before moving on. Every time we gave in to that, we ended up with inconsistent validation logic split between the executor and the reviewer. The rule became absolute: executor executes, reviewer validates. Even if it means a round trip that adds latency.
Handoff protocols need more metadata than you think they do. The first version of our inter-agent messages had five fields. Production has somewhere around nineteen. We added confidence scores, alternative options that were considered and rejected, estimated resource usage, retry count, compliance flags. Every single one of those fields exists because we had a production incident where we needed that information and didn't have it.
Invest in the evaluation team before you think you need to. Those three engineers I put on evaluation and observability from day one built the system that let us sleep. They created a framework that replayed historical workflows through new agent versions. They built dashboards showing per-agent latency, cost, and error rates in real time. They wrote the alerting that caught the $34 retry loop on day five instead of day fifty. If I'd waited to build evaluation infrastructure "after we have something working," we'd still be debugging in production.
Cost management is an architecture concern. We initially treated LLM costs like cloud infrastructure costs — something you monitor and optimize later. Wrong. In a multi-agent system, cost is a design variable. Using a bigger model for planning and a smaller one for execution wasn't an optimization; it was an architectural choice that shaped the entire system. We eventually built cost awareness into every agent: each one tracks its own spend per workflow and can make degradation decisions if the budget's running low.
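Degradation under a budget can be as simple as walking a capability-ordered tier list. The tier names and per-call cost estimates below are made up:

```python
def choose_model(remaining_budget_usd, tiers):
    """Pick the most capable model the remaining per-workflow budget allows,
    degrading tier by tier as spend approaches the cap.
    `tiers` is ordered most-capable-first as (name, estimated_cost) pairs."""
    for name, est_cost in tiers:
        if est_cost <= remaining_budget_usd:
            return name
    return None  # budget exhausted: pause the workflow instead of overspending
```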
What I'd Do Differently
I'd probably start with two agents, not three. The reviewer could have been a validation layer within the orchestration framework rather than a full agent. We made it an agent because it felt architecturally clean, but the operational overhead of maintaining a third agent — its own prompts, its own model selection, its own scaling — was significant. I'm genuinely not sure the QA agent needs the autonomy of a full agent rather than a deterministic validation pipeline with LLM-assisted edge case handling. That debate never fully resolved on the team.
I'd also build the replay system before writing a single agent. The ability to replay workflows deterministically was the most valuable debugging tool we had. We built it in month two. If we'd built it in week one, the first month would've been far less painful.
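A record-and-replay layer over the model calls is the core of such a system. This in-memory sketch (hypothetical names) keys recordings on a hash of model and prompt, so re-running a workflow hits the recording instead of the API:

```python
import hashlib

class ReplayCache:
    """Record model responses during live runs; replay them deterministically
    when re-running a workflow for debugging."""
    def __init__(self):
        self.recorded = {}

    def _key(self, model, prompt):
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def call(self, model, prompt, live_fn=None):
        k = self._key(model, prompt)
        if k in self.recorded:
            return self.recorded[k]  # replay mode: no API call, no nondeterminism
        if live_fn is None:
            raise KeyError("no recording for this call and no live backend")
        self.recorded[k] = live_fn(model, prompt)
        return self.recorded[k]
```

Persist `recorded` alongside the workflow state and any historical run can be replayed through a new agent version byte-for-byte.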
And I'd hire a dedicated prompt engineer, or at least designate one person as the prompt standard owner across all three agents. We distributed prompt engineering across the three agent leads. Each developed their own style. By month three, the planner's prompts read like technical specifications, the executor's read like step-by-step instructions, the reviewer's read like exam rubrics. That inconsistency made cross-agent debugging harder than it needed to be.
Where It Stands Now
The system handles a couple hundred workflows per day across roughly a dozen and a half client integrations. LLM spend is somewhere around $2-3K per month, which is essentially a rounding error compared to the engineering salary costs it displaces. The twelve-person team has shifted focus from building the agent system to extending it — new data source connectors, a self-service workflow designer for non-technical users, and experiments with a fourth agent that handles monitoring and auto-remediation of failed workflows.
That fourth agent is still pretty rough, incidentally. Some things are still broken.
The biggest surprise was that the agents themselves weren't the hard part. The hard part was the same as it always is: understanding the problem domain well enough to decompose it into clear responsibilities, building the connective tissue that lets components work together without stepping on each other, and instrumenting everything so you know what's actually happening in production. Agents just gave us a new way to express those decompositions. The engineering discipline underneath is the same.
Updated December 8, 2025