Part V · Agents, Tool Use, Workflow Orchestration
Chapter 68 ~15 min read

Multi-agent patterns and when they're worth it

"Multi-agent systems are interesting. Single-agent systems usually work better. Pick wisely."

In Chapter 67 we covered single-agent patterns — one LLM in a loop with tools. This chapter covers multi-agent patterns: systems where multiple LLMs collaborate, each with their own role, prompt, or specialty.

The honest pitch upfront: multi-agent systems are usually worse than single-agent systems. They’re more expensive, slower, harder to debug, and often produce worse results because the agents disagree, repeat work, or get stuck in loops. The exceptions where multi-agent actually wins are narrow.

This chapter covers the patterns, the cases where they help, and the cases where they don’t. The goal is to help you avoid the common trap of “multi-agent looks fancy, let’s use it” when single-agent would do better.

Outline:

  1. The motivating cases.
  2. The supervisor pattern.
  3. The critic-actor pattern.
  4. Hierarchical agents.
  5. The conversational multi-agent pattern.
  6. Sequential pipelines.
  7. Parallel agents.
  8. The honest assessment.
  9. Frameworks: AutoGen, CrewAI, LangGraph.
  10. When multi-agent is actually worth it.

68.1 The motivating cases

Why would you want multiple agents?

(1) Specialization. A “researcher” agent optimized for retrieving information and a “writer” agent optimized for synthesis might each be better at their narrower task than a single agent doing both.

(2) Different models for different tasks. Use a cheap model for simple subtasks and an expensive one for hard subtasks. The cheap agents handle the bulk; the expensive agent handles the synthesis.

(3) Parallelism. If a task can be decomposed, multiple agents can work in parallel and finish faster.

(4) Decomposition. A task too complex for a single agent (because of context length limits, tool overload, or sheer complexity) might be tractable when split into sub-tasks for sub-agents.

(5) Verification. Have one agent do the work and another agent check it. Two heads are better than one (sometimes).

(6) Adversarial dynamics. A “red team” agent tries to break the “blue team” agent’s output. Useful for robustness testing.

These are the reasons you might consider multi-agent. The reality, in most cases, is that single-agent works just as well or better. We’ll get to the honest assessment in §68.8.

68.2 The supervisor pattern

The most common multi-agent pattern: a supervisor agent that routes requests to specialized worker agents.

[User query]
     |
     v
[Supervisor]
     |
     +--> [Researcher agent] (handles factual questions)
     +--> [Coder agent]      (handles code-related tasks)
     +--> [Writer agent]     (handles creative tasks)
     +--> [Math agent]       (handles math problems)
     |
     v
[Synthesized response]

The supervisor reads the user’s query, decides which worker is best suited, dispatches the query to that worker, and returns the result. The supervisor might call multiple workers and combine the results.

Supervisor pattern: a top-level agent routes each query to one of several specialized worker agents and aggregates their results; routing adds an extra LLM call without guaranteed gain.
The supervisor pattern is intuitive but fragile: cross-category tasks confuse the router, and a single well-prompted agent with all tools usually matches or beats it.

The worker agents are each prompted to be specialists. Each has its own system prompt, possibly its own tool set, possibly its own model.

Implementation:

def supervisor_agent(query):
    # The supervisor decides which worker to call
    decision = llm.generate(
        f"Which worker should handle this query: {query}? "
        f"Options: researcher, coder, writer, math. "
        f"Answer with a single word."
    ).lower()

    if "researcher" in decision:
        return researcher_agent(query)
    elif "coder" in decision:
        return coder_agent(query)
    elif "writer" in decision:
        return writer_agent(query)
    elif "math" in decision:
        return math_agent(query)
    else:
        # Fall back to a default worker when the routing output is
        # ambiguous; otherwise this function would return None.
        return researcher_agent(query)

This pattern is intuitive and works for clearly distinct task categories. It struggles when:

  • Tasks span categories (a query that needs both research and coding).
  • The supervisor’s routing is wrong (it picks the wrong worker).
  • The workers have overlapping capabilities (the routing is arbitrary).

For most production use, the supervisor pattern is barely better than a single agent with all the tools. The single agent can handle tool selection internally; you don’t need a separate routing layer.
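For comparison, the single-agent alternative registers every tool with one agent and lets the model choose among them internally. A runnable sketch of that idea, with a keyword stub standing in for the model's tool choice (every name here is illustrative, not a real API):

```python
# Single-agent alternative to the supervisor: one agent, all tools.
# A stub "LLM" picks a tool by keyword so the control flow is runnable.

def search_web(query):
    return f"[web results for: {query}]"

def solve_math(query):
    return f"[math answer for: {query}]"

TOOLS = {"search": search_web, "math": solve_math}

def stub_llm_pick_tool(query):
    # Stand-in for the model's internal tool selection.
    return "math" if any(c.isdigit() for c in query) else "search"

def single_agent(query):
    # One agent with every tool registered; tool selection happens
    # inside the (stub) model call, not in a separate routing layer.
    tool = TOOLS[stub_llm_pick_tool(query)]
    return tool(query)

print(single_agent("what is 2 + 2"))    # routed to the math tool
print(single_agent("latest LLM news"))  # routed to the web tool
```

The routing layer disappears: there is no separate supervisor call to get wrong, and cross-category queries can trigger multiple tool calls in one agent loop.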

68.3 The critic-actor pattern

A different pattern: one agent does the work, another agent critiques it.

[User query]
    |
    v
[Actor] → produces draft
    |
    v
[Critic] → evaluates draft, suggests improvements
    |
    v
[Actor] → produces revised draft
    |
    ... (loop until critic is satisfied)

The actor and critic might be the same model with different prompts, or two different models. The advantage of two different models: each has different blind spots.

This is structurally the same as the reflection pattern from Chapter 67, just with two agents instead of one. The conceptual distinction is mostly cosmetic.

The critic-actor pattern is useful for tasks where verification is meaningful (math, code, structured output). Like reflection, it adds little when the critic has no objective criteria to check against.
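The loop itself is simple. A runnable sketch with stub agents (in a real system, actor() and critic() would each be an LLM call with its own prompt, possibly its own model; the functions here are toy placeholders):

```python
# Critic-actor loop sketch. Stub functions stand in for LLM calls
# so the control flow is runnable end to end.

def actor(task, feedback=None):
    # Produce a draft; incorporate critic feedback on later rounds.
    draft = f"draft for: {task}"
    if feedback:
        draft += f" (revised: {feedback})"
    return draft

def critic(draft):
    # Return (approved, feedback). A real critic would be an LLM
    # judging the draft against task-specific criteria.
    approved = "revised" in draft
    return approved, None if approved else "add detail"

def critic_actor(task, max_rounds=3):
    feedback = None
    draft = None
    for _ in range(max_rounds):
        draft = actor(task, feedback)
        approved, feedback = critic(draft)
        if approved:
            return draft
    return draft  # give up after max_rounds; return the last draft
```

The max_rounds cap matters: without it, an unsatisfiable critic loops forever, which is one of the failure modes this chapter warns about.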

68.4 Hierarchical agents

A more elaborate pattern: multi-level hierarchies. A top-level “manager” agent decomposes the task into sub-tasks. Each sub-task is dispatched to a “team lead” agent, which decomposes it further into specific actions for “worker” agents.

[Manager]
   |
   +-- [Team Lead 1]
   |     +-- [Worker A]
   |     +-- [Worker B]
   |
   +-- [Team Lead 2]
   |     +-- [Worker C]
   |     +-- [Worker D]

Each level is an LLM with its own prompt and role.

This pattern has been tried in research and in some commercial products. The empirical result: it usually doesn’t work better than a flat single-agent system. The hierarchy adds latency (each level is a round trip) and the lower-level agents often fail to follow the manager’s intent precisely.

Hierarchical agents are mostly an interesting research direction, not a production pattern. The exception is for very large tasks that genuinely don’t fit in a single agent’s context window.

68.5 The conversational multi-agent pattern

A different style: multiple agents talk to each other in a shared conversation. Each agent has its own role and personality. The conversation is the medium for collaboration.

This is the pattern in AutoGen (Microsoft) and similar frameworks. Example:

[User]: Plan a marketing campaign for our new product.
[Product Manager Agent]: I think we should focus on the technical features.
[Marketer Agent]: I disagree. We should focus on the customer benefits.
[Product Manager]: Good point. Let's combine both — features framed as benefits.
[Marketer]: I'll write the messaging. You handle the timeline.
... [agents continue to collaborate]

The conversation continues until some stopping condition (a target reached, max iterations, etc.).

This is the most “agent-like” pattern in the sense that it mimics human collaboration. It’s also the most expensive (many LLM calls), the slowest, and the most prone to failure (the agents can get stuck in loops, disagree forever, or drift from the original goal).

Conversational multi-agent is good for demos and research. For production, the simpler patterns usually work better.
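The underlying control flow is a round-robin loop over agents sharing a transcript. A minimal runnable sketch of that loop with stub agents and a keyword stopping condition (this mimics the shape of AutoGen-style group chats, not AutoGen's actual API; all names are illustrative):

```python
# Round-robin conversational loop sketch. Each "agent" is a stub
# function mapping the shared transcript to its next message.

def pm_agent(history):
    return "PM: focus on the technical features"

def marketer_agent(history):
    # Push back once, then converge and signal completion.
    if len(history) < 3:
        return "Marketer: focus on the customer benefits"
    return "Marketer: DONE, combine features framed as benefits"

def group_chat(task, agents, max_turns=8):
    history = [f"User: {task}"]
    for turn in range(max_turns):
        speaker = agents[turn % len(agents)]
        history.append(speaker(history))
        if "DONE" in history[-1]:  # stopping condition
            break
    return history

transcript = group_chat("plan a campaign", [pm_agent, marketer_agent])
```

Note the two guards: a max_turns cap and a termination keyword. Real conversational frameworks need both, because nothing else stops agents from disagreeing forever.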

68.6 Sequential pipelines

A simpler multi-agent pattern: a fixed pipeline of agents, each handling one stage of the work.

[Input]
   |
   v
[Agent 1: Parse and structure the request]
   |
   v
[Agent 2: Retrieve relevant information]
   |
   v
[Agent 3: Synthesize an answer]
   |
   v
[Agent 4: Format the response]
   |
   v
[Output]

Each agent has a specific job and a specific output format. The output of one agent is the input of the next. The pipeline is deterministic — no looping, no decision-making about which agent to call.

This is basically a workflow, not an agent pattern. It works well because:

  • Each stage is constrained to a specific task, so each LLM call is simpler.
  • The flow is predictable, easier to debug.
  • You can swap models per stage based on the task complexity.

The downside: it’s not very different from a single LLM with multiple prompts. You could implement the same thing with one LLM that does all four steps in one call, with similar results.

Sequential pipelines are useful when the stages have different latency requirements (you can use a fast model for parsing, a slow model for synthesis) or different cost requirements.
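The wiring is just a fold over a fixed list of stages. A runnable sketch with stub stages (in practice each stage would be an LLM call with its own prompt, and possibly its own model; every name here is illustrative):

```python
# Sequential pipeline sketch: deterministic, no routing, no loops.
# Each stage transforms the state and passes it to the next stage.

def parse_stage(text):
    return {"request": text.strip().lower()}

def retrieve_stage(state):
    state["docs"] = [f"doc about {state['request']}"]
    return state

def synthesize_stage(state):
    state["answer"] = f"answer based on {len(state['docs'])} doc(s)"
    return state

def format_stage(state):
    return state["answer"].capitalize()

PIPELINE = [parse_stage, retrieve_stage, synthesize_stage, format_stage]

def run_pipeline(text):
    state = text
    for stage in PIPELINE:  # fixed order; the flow never branches
        state = stage(state)
    return state
```

Because the stage list is plain data, swapping a cheap model into parse_stage and an expensive one into synthesize_stage is a one-line change per stage.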

68.7 Parallel agents

When sub-tasks are independent, you can run multiple agents in parallel and merge their results.

Parallel agents fan out to independent sub-tasks and fan back in to a synthesizer, cutting wall-clock time to roughly max(sub-task latency) instead of the serial sum of sub-task latencies.
Parallel agents are the clearest multi-agent win: independent sub-tasks run concurrently so total latency equals the slowest sub-task, not their sum.

For example, a research task that requires gathering information from three different sources could spawn three parallel agents (one per source), then merge their findings.

import asyncio

async def parallel_research(query):
    # The three agents are independent, so they run concurrently;
    # total latency is roughly the slowest agent's, not the sum.
    results = await asyncio.gather(
        web_research_agent(query),
        academic_research_agent(query),
        internal_kb_agent(query),
    )
    return synthesize(results)

This pattern is the most clearly beneficial multi-agent case. The parallelism gives you actual speedup (not just complexity for its own sake). The agents don’t need to coordinate (they’re independent). The merging is straightforward.

For tasks with parallelizable structure, parallel agents are worth using. They genuinely beat single-agent loops on latency.

68.8 The honest assessment

The empirical observation across multi-agent research and production: multi-agent systems usually don’t outperform well-designed single-agent systems.

Why?

(1) Single-agent systems can use tools. A single agent with the right tools can do anything multi-agent systems do. The “specialization” of multi-agent often disappears when the single agent has access to specialized tools.

(2) Inter-agent communication is lossy. When agent A passes information to agent B via text, some information is lost (compared to A doing the work itself). The handoff costs accuracy.

(3) Agents disagree. Multiple LLMs reading the same input often produce different opinions. Resolving the disagreement is its own coordination problem.

(4) Compounding failure rates. If each agent has a 5% failure rate, three agents in series have a ~14% combined failure rate (1 − 0.95³ ≈ 0.143). Multi-agent systems are more fragile.

(5) Latency multiplies. Each agent is its own LLM call. Multi-agent systems are slower than single-agent systems, often by 3-10×.

(6) Cost multiplies. Same reason as latency.

The Anthropic research blog has been particularly clear about this. Anthropic’s internal experiments consistently find that a well-prompted single agent with good tools beats most multi-agent systems.

Multi-agent cost factors compound: with a 5% per-agent failure rate, three agents in series succeed with probability 0.95³ ≈ 0.857, a ~14% combined failure rate, and with N agents in series, cost and latency scale by N while failure rates compound. Add lossy handoffs between agents, and this reliability degradation is one of the strongest arguments for keeping multi-agent systems flat and minimal.
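The arithmetic generalizes to any chain length. A quick check of the failure-rate compounding, assuming independent per-agent failures:

```python
# Failure rate of N agents in series, assuming each agent fails
# independently with the same per-agent failure rate.

def chain_failure_rate(per_agent_failure, n_agents):
    p_all_succeed = (1 - per_agent_failure) ** n_agents
    return 1 - p_all_succeed

# 3 agents at 5% each: 1 - 0.95**3 = 0.142625, i.e. ~14%.
print(round(chain_failure_rate(0.05, 3), 3))
```

At 10 agents and 5% each, the chain fails about 40% of the time, which is why long agent chains rarely survive contact with production.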

The places where multi-agent does help:

Parallel decomposable tasks. When you genuinely have independent sub-tasks, parallel agents win on latency.

Different model tiers for different stages. When you can save cost by using a small model for some stages and a large model for others.

Verification with a different lens. When having a critic from a different model family catches errors the actor missed.

Roleplay / simulation. When the use case is genuinely about multiple personas (e.g., simulating a debate).

Very large tasks that don’t fit in a single agent’s context.

For everything else, start with a single agent. Add a second agent only when you can articulate a specific reason it’ll help.

68.9 Frameworks

The major multi-agent frameworks:

AutoGen (Microsoft). The original conversational multi-agent framework. Popular for prototyping. Has a learning curve.

CrewAI. Role-based agents that “crew” together. Easy to start, opinionated. Good for proof-of-concept.

LangGraph. Graph-based workflow framework that supports multi-agent patterns. The most flexible and the one I’d use for production.

OpenAI Swarm (now Agents SDK). OpenAI’s lightweight multi-agent framework. Simple and unopinionated.

Anthropic Claude Code / Computer Use. Anthropic’s tools for building agents that interact with computers. Single-agent by default but supports multi-agent patterns.

Custom code. As always, the patterns are simple enough that you can roll your own.

For production multi-agent, LangGraph or custom code are the right defaults. AutoGen and CrewAI are good for prototyping but the abstractions can get in the way.

68.10 When multi-agent is actually worth it

A short list of cases where I’d reach for multi-agent over single-agent:

(1) Genuinely parallel decomposable tasks. E.g., research from multiple sources, code review from multiple perspectives, generation of multiple drafts.

(2) Cost optimization with heterogeneous model tiers. E.g., use Llama 8B for routine sub-tasks and Llama 70B (or GPT-4) for critical reasoning. The cost savings can be 5-10× without sacrificing much quality.

(3) Verification by a different model. E.g., have GPT-4 produce code and Claude review it. Different blind spots catch different bugs.

(4) Simulation / roleplay. E.g., simulating a customer-agent interaction, or running a debate between AI personas.

(5) Adversarial testing. E.g., a “red team” agent tries to break a “blue team” agent’s output for safety testing.

(6) Tasks too large for one context window. E.g., processing a 1000-page document with a 32k-context model. Each agent handles a chunk; a synthesizer combines.
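Case (6) is structurally a map-reduce: per-chunk agents in the map step, a synthesizer in the reduce step. A runnable sketch with stub agents (chunk sizes and function names are illustrative; real stages would be LLM calls):

```python
# Map-reduce sketch for documents larger than one context window.
# Each chunk gets its own "agent" call; a synthesizer merges results.

def chunk(text, size):
    # Naive fixed-size chunking; real systems split on structure.
    return [text[i:i + size] for i in range(0, len(text), size)]

def summarize_chunk(piece):
    # Stand-in for an LLM call on a single chunk.
    return f"summary({len(piece)} chars)"

def synthesize(summaries):
    # Stand-in for a final LLM call over all chunk summaries.
    return " | ".join(summaries)

doc = "x" * 2500
result = synthesize([summarize_chunk(c) for c in chunk(doc, 1000)])
```

The per-chunk calls are independent, so this combines naturally with the parallel-agent pattern from §68.7.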

For most other use cases — chatbots, RAG, tool use, automation — single-agent is the right choice.

68.11 The mental model

Eight points to take into Chapter 69:

  1. Multi-agent systems usually don’t outperform single-agent systems. Be skeptical of multi-agent designs.
  2. The supervisor pattern routes to specialized workers. Often replaceable with a single agent + tools.
  3. The critic-actor pattern is structurally the same as reflection.
  4. Hierarchical agents are research-stage. Rarely production-worthy.
  5. Conversational multi-agent is the most “agent-like” but also the most fragile.
  6. Parallel agents are the clearest multi-agent win when sub-tasks are independent.
  7. Cost multiplies, latency multiplies, failure rates compound. Multi-agent costs add up.
  8. For most use cases: single agent + good tools + good prompt > multi-agent.

In Chapter 69 we look at the protocol that’s standardizing how agents access tools: MCP.


Read it yourself

  • Wu et al., AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation (Microsoft, 2023).
  • The CrewAI documentation.
  • The LangGraph multi-agent examples.
  • The Anthropic blog post “Building effective agents” (the case for keeping it simple).
  • Examples of multi-agent failures in production (search “AutoGen production issues”).

Practice

  1. Why does single-agent often beat multi-agent? List five reasons.
  2. Construct a use case where parallel multi-agent is clearly the right choice.
  3. Why does the critic-actor pattern have the same effect as reflection?
  4. For a task that needs research + synthesis + writing, would you use multi-agent or single-agent? Argue.
  5. Compute the latency for a 4-agent sequential pipeline at 5 seconds per agent. Compare to a single agent doing all four steps in one call.
  6. Read the Anthropic “Building effective agents” post. Identify the main claims.
  7. Stretch: Build the same task as both single-agent and multi-agent. Compare quality, cost, and latency. Which wins?