
Multi-Agent Orchestration

A single agent has a single context window, a single thread of execution, and a single set of tools. Multi-agent systems break those limits: tasks too large for one context window get decomposed across specialists; independent subtasks run in parallel; a critic agent reviews the worker's output before it is accepted. This section covers why and how to structure multiple agents, and the lab builds a three-agent research pipeline — orchestrator, researcher, and critic — from scratch.

01 · WHY MULTI-AGENT?

Three Problems That One Agent Cannot Solve Alone

Multi-agent systems add significant engineering complexity — more LLM calls, more failure surfaces, harder debugging. They are only justified when they solve a problem that a single agent genuinely cannot. Anthropic's guidance on multi-agent systems identifies three classes of tasks where the multi-agent approach pays off.

📐
Tasks too large for a single context window

A 500-page codebase, a year of email history, a corpus of 10,000 documents — no context window holds all of it. A multi-agent system decomposes the task: the orchestrator divides the work, subagents process chunks in parallel, and results are synthesized. This is the most common and clearest justification for multiple agents.

Independent subtasks that benefit from parallelism

A research task with three independent sub-questions can be answered roughly three times faster by running three agents simultaneously than by running one agent sequentially. Parallelism is the clearest performance win in multi-agent systems, but it only works when subtasks truly do not depend on each other's output.

🔍
Quality improvement through specialized roles

A worker agent and a critic agent — each with a focused system prompt — produce better outputs than a single agent asked to both do the work and check it. The generator and evaluator roles actively conflict when merged: the agent that wrote the answer is poorly positioned to objectively critique it. Separation of concerns applies to agents too.

The complexity cost is real. Every agent you add is another LLM call, another failure point, another source of inconsistent output format, and another thing to log and debug. Multi-agent systems are not a default upgrade from single-agent — they are a deliberate architectural choice with a concrete justification.
02 · ORCHESTRATION PATTERNS

How Agents Are Structured and Connected

The relationship between agents in a multi-agent system determines how information flows, where errors propagate, and what each agent is responsible for. Three patterns dominate production deployments.

PATTERN A — ORCHESTRATOR + WORKERS

A central orchestrator LLM decomposes the goal, delegates subtasks to worker agents, collects results, and synthesizes the final output. Workers are stateless — they receive a task, complete it, and return a result. The orchestrator maintains the overall plan and handles worker failures. Workers can be specialized: a web-search agent, a code-execution agent, a data-analysis agent.

Orchestrator → [Worker A, Worker B, Worker C] → Synthesizer
PATTERN B — GENERATOR + CRITIC

A worker agent generates an output (draft, code, plan, summary). A separate critic agent evaluates it against a rubric and returns structured feedback — score, specific issues, and suggested revisions. The orchestrator (or a loop) decides whether to accept the output, send it back to the generator for revision, or escalate to a human. This pattern significantly improves output quality for writing, coding, and analysis tasks.

Generator → output → Critic → [accept | revise] → (loop or done)
PATTERN C — PARALLEL WORKERS + FAN-IN

The orchestrator fans out independent subtasks to multiple workers running concurrently (asyncio.gather or ThreadPoolExecutor). When all workers complete, a fan-in step collects results and synthesizes. Use only when subtasks are genuinely independent — if any worker's output must feed into another worker, use sequential orchestration instead.

Orchestrator → [Worker A ∥ Worker B ∥ Worker C] → Fan-in → Result
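The fan-out/fan-in shape maps directly onto `asyncio.gather`. A minimal sketch, where the `worker` coroutine is a placeholder for a real async agent call:

```python
import asyncio

async def worker(name: str, subtask: str) -> str:
    # Placeholder for a real agent call (e.g. an async API request).
    await asyncio.sleep(0.01)
    return f"{name}: {subtask} done"

async def fan_out_fan_in(subtasks: list[str]) -> list[str]:
    # Fan out: one worker per independent subtask, all running concurrently.
    # gather() preserves argument order, so results line up with subtasks.
    return await asyncio.gather(
        *(worker(f"worker-{i}", s) for i, s in enumerate(subtasks))
    )

results = asyncio.run(fan_out_fan_in(["a", "b", "c"]))
print(results)  # ['worker-0: a done', 'worker-1: b done', 'worker-2: c done']
```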
03 · AGENT COMMUNICATION

How Agents Pass Information Between Each Other

Agents in a multi-agent system communicate by passing structured messages — not by sharing memory directly. Each agent has its own context window; the only way one agent's output reaches another is by being explicitly passed as input. This is not a limitation — it is a feature that makes multi-agent systems debuggable and auditable.

CHANNEL 01
Direct function calls (same process)
The orchestrator calls a Python function that runs a subagent: result = run_researcher(subtask). The simplest implementation — no networking, no serialization overhead. Works when all agents run in the same process. Used in this section's lab.
SIMPLEST — START HERE
CHANNEL 02
Tool calls (subagent as tool)
The orchestrator treats a subagent as a tool it can call. The orchestrator's tool schema includes a delegate_to_researcher tool. When invoked, the tool executor calls the subagent and returns its result as a tool result. The orchestrator decides when and how to delegate, giving it full autonomy over subagent invocation.
WIDELY USED (2024–2026)
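A sketch of this delegation channel: the subagent is exposed to the orchestrator through an ordinary tool schema, and the tool executor routes the call to the subagent function. The names here (`delegate_to_researcher`, `run_researcher`) are illustrative, not a fixed API:

```python
def run_researcher(subtask: str) -> str:
    # Stand-in for a real subagent invocation (its own model call, prompt, tools).
    return f"findings on: {subtask}"

DELEGATE_TOOL = {
    "name": "delegate_to_researcher",
    "description": "Delegate a focused research subtask to the researcher subagent.",
    "input_schema": {
        "type": "object",
        "properties": {"subtask": {"type": "string"}},
        "required": ["subtask"],
    },
}

def execute_tool(name: str, tool_input: dict) -> str:
    # From the orchestrator's point of view this is just another tool call;
    # the executor hides the fact that a second agent runs behind it.
    if name == "delegate_to_researcher":
        return run_researcher(tool_input["subtask"])
    raise ValueError(f"unknown tool: {name}")
```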
CHANNEL 03
Message queues (distributed)
For distributed multi-agent systems where agents run on separate machines, messages are passed via a queue (Redis, SQS, Kafka). The orchestrator publishes subtask messages; workers consume and publish results. Enables true parallelism across hosts and handles backpressure when workers are slower than the orchestrator.
EMERGING (2024–2026)
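The publish/consume loop can be sketched in-process, with the standard library's `queue` module standing in for Redis or SQS (same shape, no network):

```python
import queue
import threading

task_q = queue.Queue()    # orchestrator publishes subtasks here
result_q = queue.Queue()  # workers publish results here

def worker() -> None:
    # Consume subtask messages until a None sentinel arrives.
    while True:
        subtask = task_q.get()
        if subtask is None:
            break
        result_q.put(f"done: {subtask}")

t = threading.Thread(target=worker)
t.start()
for s in ["subtask-a", "subtask-b"]:
    task_q.put(s)          # orchestrator publishes
task_q.put(None)           # sentinel: no more work
t.join()

results = sorted(result_q.get() for _ in range(2))
print(results)
```

A real queue backend adds what this sketch lacks: durability across restarts, multiple consumer hosts, and backpressure when workers fall behind.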
CHANNEL 04
Shared state store
Agents read and write to a shared store (a database, a key-value store, a file) rather than passing messages directly. The orchestrator writes the task spec; the worker reads it, writes results back; the orchestrator reads results. Requires careful concurrency handling — two agents writing to the same key simultaneously produces a race condition.
PRODUCTION PATTERN
04 · TRUST BETWEEN AGENTS

Orchestrators, Subagents, and the Trust Hierarchy

When an orchestrator instructs a subagent to take an action, the subagent should not automatically trust that instruction any more than it would trust a user message. A subagent that blindly executes any instruction it receives from an orchestrator is vulnerable to two attack vectors: a compromised orchestrator, and prompt injection through data the orchestrator retrieved and passed along.

ANTHROPIC'S GUIDANCE ON MULTI-AGENT TRUST

Anthropic's agentic documentation explicitly states that subagents should behave safely and ethically regardless of the instruction source. A subagent receiving an instruction from an orchestrator should apply the same safety checks it would apply to a human user's instruction. It should refuse requests that violate its principles, even if the requester claims to be another Claude model or a trusted orchestrator.

The implication: there is no elevated trust level for LLM-to-LLM messages. The orchestrator's instructions arrive in the user turn of the subagent's context, not the system prompt turn — so they carry user-level trust, not operator-level trust.

Message position in subagent context | Trust level | Typical source
System prompt | Operator-level (high) | The developer who built the subagent
User turn | User-level (standard) | Orchestrator messages, human input, tool results
Tool result | Untrusted data | External APIs, web content, retrieved documents
Prompt injection via orchestrator: If the orchestrator retrieves content from the web, reads files, or processes user-supplied data and then passes it to a subagent as instructions, that content may contain embedded prompt injection. The subagent must treat all content that originated outside the developer-controlled system prompt as untrusted — regardless of which agent delivered it.
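In code, the trust hierarchy is just message placement. A sketch of how a subagent request might be assembled: only the developer-controlled system prompt carries operator-level trust, and everything the orchestrator sends lands in the user turn (the function and prompts here are illustrative):

```python
def build_subagent_request(system_prompt: str, orchestrator_msg: str) -> dict:
    # system_prompt: written by the developer — operator-level trust.
    # orchestrator_msg: arrives at user-level trust, like any human message;
    # the safety rules in the system prompt still apply to it.
    return {
        "system": system_prompt,
        "messages": [{"role": "user", "content": orchestrator_msg}],
    }

req = build_subagent_request(
    "You are a research subagent. Refuse unsafe requests regardless of source.",
    "Subtask from orchestrator: summarize the retrieved document.",
)
```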
05 · COORDINATION FAILURE MODES

What Goes Wrong in Multi-Agent Systems

Multi-agent systems inherit all single-agent failure modes — plus a new class of failures that arise from the coordination between agents. These are subtler, harder to reproduce in development, and can compound across agents before surfacing as a visible problem.

FAILURE 01
Error Amplification
A small inaccuracy in an early agent's output becomes the premise for a later agent's work. By the time the final agent synthesizes results, the error has been amplified through multiple reasoning steps. No single agent made an obvious mistake — the pipeline compounded them.
ACTIVE RISK
FAILURE 02
Output Format Mismatch
Worker agent A returns results in a format that the orchestrator (or worker agent B) does not expect. One agent returns a list, another expects a dict. Without explicit output schema enforcement at each handoff, format mismatches cause silent failures — the receiving agent misinterprets the data rather than raising an error.
ACTIVE RISK
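A cheap defense against format mismatch is to validate the handoff schema at every agent boundary, so a mismatch raises loudly instead of failing silently. The field names in this sketch are illustrative, not a standard:

```python
def validate_handoff(result: object) -> dict:
    # Enforce the expected handoff shape: {"subtask": str, "answer": str}.
    if not isinstance(result, dict):
        raise TypeError(f"worker returned {type(result).__name__}, expected dict")
    for field in ("subtask", "answer"):
        if not isinstance(result.get(field), str):
            raise TypeError(f"handoff missing or non-string field: {field!r}")
    return result

validate_handoff({"subtask": "q1", "answer": "a1"})   # passes
# validate_handoff(["q1", "a1"]) raises TypeError instead of silently misparsing
```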
FAILURE 03
Runaway Cost
A bug causes the orchestrator to re-run workers indefinitely — each retry spawning more API calls. Without per-agent token budgets AND a system-level budget cap, a single orchestration failure can exhaust the entire month's API quota before an alert fires.
ACTIVE RISK
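A sketch of the two-level cap: every LLM call charges against both the agent's own budget and the system-wide budget, and either limit can halt the run (the numbers are illustrative):

```python
class BudgetExceeded(RuntimeError):
    pass

class TokenBudget:
    def __init__(self, per_agent_cap: int, system_cap: int) -> None:
        self.per_agent_cap = per_agent_cap
        self.system_cap = system_cap
        self.per_agent: dict[str, int] = {}
        self.total = 0

    def charge(self, agent: str, tokens: int) -> None:
        # Call before (or right after) every LLM call; raising halts the run
        # instead of letting a retry loop spend the whole quota.
        self.per_agent[agent] = self.per_agent.get(agent, 0) + tokens
        self.total += tokens
        if self.per_agent[agent] > self.per_agent_cap:
            raise BudgetExceeded(f"{agent} exceeded per-agent cap")
        if self.total > self.system_cap:
            raise BudgetExceeded("system-wide budget exhausted")

budget = TokenBudget(per_agent_cap=10_000, system_cap=25_000)
budget.charge("researcher", 4_000)   # fine
budget.charge("critic", 2_000)       # fine
```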
FAILURE 04
Context Pollution
An orchestrator accumulates all worker outputs in its context window as the system runs. After many worker calls, the orchestrator's context is dominated by prior results and reasoning traces — crowding out the original task instructions and causing instruction-following degradation (Lost in the Middle, Section 04).
OPERATIONAL RISK
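One mitigation is to keep only a compact digest of each worker result in the orchestrator's context instead of the full transcript. A sketch using simple truncation; a real system might use an LLM-generated summary instead:

```python
def digest(result: str, max_chars: int = 200) -> str:
    # Keep only a short digest of each worker result in the orchestrator context,
    # so the original task instructions stay proportionally prominent.
    if len(result) <= max_chars:
        return result
    return result[:max_chars].rstrip() + " …[truncated]"

orchestrator_context: list[str] = []
for worker_output in ["short result", "x" * 1_000]:
    orchestrator_context.append(digest(worker_output))
```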
SOURCES USED IN THIS SECTION

Verified References

Every claim in this section is grounded in one of these sources.

Source | Type | Covers | Recency
Anthropic — Building Effective Agents | Official guide (Anthropic) | Why multi-agent, orchestration patterns, trust model, when not to use | 2024
Anthropic — Agentic & Multi-Agent Docs | Official docs | Subagent trust levels, orchestrator guidance, minimal footprint | Maintained 2024–2026
Wu et al. — AutoGen (arXiv:2308.08155) | Academic paper (Microsoft) | Multi-agent conversation framework, agent roles, human-in-the-loop patterns | 2023
LangGraph Documentation | Official docs | Graph-based multi-agent orchestration, state management, parallel execution | Maintained 2024–2026
HANDS-ON LAB

Build a Three-Agent Research Pipeline

You will build a pipeline with three specialized agents: an Orchestrator that decomposes a research question into subtasks, a Researcher that answers each subtask using a search tool, and a Critic that scores each answer before the orchestrator synthesizes a final report. The complete script is multi_agent.py.

🔬
Section 14 Lab — Multi-Agent Research Pipeline
6 STEPS · PYTHON · ~50 MIN
1
Create the file and shared infrastructure
BASH
touch multi_agent.py
PYTHON — multi_agent.py
import os, json
import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
MODEL = "claude-haiku-4-5-20251001"  # cheap model for all three agents

# ── Shared knowledge base (stands in for real search) ─────────────
KNOWLEDGE = {
    "transformer architecture":
        "The Transformer uses self-attention to process all tokens in parallel. "
        "Introduced by Vaswani et al. (2017). Replaced recurrent networks for NLP.",
    "rlhf":
        "RLHF fine-tunes LLMs using human preference rankings. Three stages: SFT, "
        "reward model training, PPO. Introduced at scale by Ouyang et al. (2022).",
    "rag":
        "RAG retrieves relevant documents at query time and injects them as context. "
        "Solves the knowledge staleness problem without retraining. Lewis et al. (2020).",
    "react agents":
        "ReAct agents interleave Thought, Action, and Observation in a loop. "
        "Enables dynamic tool use with visible reasoning. Yao et al. (2022).",
    "mcp":
        "Model Context Protocol is Anthropic's open standard for connecting LLM agents "
        "to tools and data sources via a standard wire protocol. Released Nov 2024.",
}

SEARCH_TOOLS = [{
    "name": "search",
    "description": "Search the knowledge base for information on a topic.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

def search(query: str) -> str:
    q = query.lower()
    for key, val in KNOWLEDGE.items():
        if key in q:
            return val
    return f'No results found for "{query}".'
2
Implement the Orchestrator agent

The orchestrator takes the user's research question, decomposes it into 2–3 focused subtasks, and returns them as a JSON list. It will also be responsible for the final synthesis step.

PYTHON — multi_agent.py (continued)
ORCHESTRATOR_SYSTEM = """You are a research orchestrator.
Given a research question, decompose it into 2–3 focused subtasks.
Each subtask should be answerable with a single targeted search.
Return ONLY a JSON array of subtask strings.
Example: ["What is X?", "How does Y work?", "What are the tradeoffs of Z?"]"""

SYNTHESIZER_SYSTEM = """You are a research synthesizer.
Given a research question and a list of (subtask, answer, quality_score) triples,
write a concise 3–5 sentence summary that answers the original question.
Only use information from the provided answers — do not add outside knowledge."""


def orchestrate(question: str) -> list[str]:
    """Decompose a research question into focused subtasks."""
    response = client.messages.create(
        model=MODEL, max_tokens=256,
        system=ORCHESTRATOR_SYSTEM,
        messages=[{"role": "user", "content": f"Research question: {question}"}],
    )
    raw = response.content[0].text.strip()
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        start, end = raw.find("["), raw.rfind("]") + 1
        return json.loads(raw[start:end])


def synthesize(question: str, results: list[dict]) -> str:
    """Synthesize all research results into a final answer."""
    results_text = "\n\n".join(
        f"Subtask: {r['subtask']}\nAnswer: {r['answer']}\nQuality: {r['score']}/5"
        for r in results
    )
    response = client.messages.create(
        model=MODEL, max_tokens=400,
        system=SYNTHESIZER_SYSTEM,
        messages=[{
            "role": "user",
            "content": f"Original question: {question}\n\nResearch results:\n{results_text}",
        }],
    )
    return response.content[0].text
3
Implement the Researcher subagent

The researcher runs a mini ReAct loop — it can call the search tool as many times as needed, then produces a concise answer. Note the system prompt explicitly states its role and scope.

PYTHON — multi_agent.py (continued)
RESEARCHER_SYSTEM = """You are a focused research assistant.
You will receive a single subtask. Use the search tool to find relevant information,
then write a concise 2–3 sentence answer grounded in what you found.
Do not add information beyond what the search tool returns."""


def research(subtask: str) -> str:
    """Run the researcher agent on a single subtask. Returns a text answer."""
    messages = [{"role": "user", "content": subtask}]

    for _ in range(4):  # small iteration cap per subagent
        response = client.messages.create(
            model=MODEL, max_tokens=256,
            system=RESEARCHER_SYSTEM,
            tools=SEARCH_TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason == "end_turn":
            return "".join(b.text for b in response.content if hasattr(b, "text"))

        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = search(block.input["query"])
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })
            messages.append({"role": "user", "content": tool_results})

    return "Researcher did not complete within iteration budget."
4
Implement the Critic subagent

The critic receives the subtask and the researcher's answer, and returns a structured JSON critique with a 1–5 quality score and specific feedback. The orchestrator uses the score to decide whether to accept or flag the answer.

PYTHON — multi_agent.py (continued)
CRITIC_SYSTEM = """You are a research quality critic.
Given a subtask and a proposed answer, evaluate the answer on:
- Relevance: does it address the subtask?
- Groundedness: does it only use verifiable information?
- Conciseness: is it clear and appropriately brief?

Return ONLY a JSON object with this exact structure:
{"score": integer 1-5, "feedback": "one sentence of specific feedback"}
5 = excellent, 3 = acceptable but could improve, 1 = missing or wrong."""


def critique(subtask: str, answer: str) -> dict:
    """Run the critic agent. Returns {"score": int, "feedback": str}."""
    response = client.messages.create(
        model=MODEL, max_tokens=128,
        system=CRITIC_SYSTEM,
        messages=[{
            "role": "user",
            "content": f"Subtask: {subtask}\n\nProposed answer: {answer}",
        }],
    )
    raw = response.content[0].text.strip()
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: extract JSON from response if wrapped in prose
        start = raw.find("{")
        end = raw.rfind("}") + 1
        return json.loads(raw[start:end])
5
Wire the pipeline together and run it
PYTHON — multi_agent.py (continued)
def run_pipeline(question: str) -> None:
    print(f"\n{'='*60}\nQUESTION: {question}\n{'='*60}")

    # 1. Orchestrator decomposes the question
    subtasks = orchestrate(question)
    print(f"\n[Orchestrator] Decomposed into {len(subtasks)} subtasks:")
    for i, s in enumerate(subtasks, 1):
        print(f"  {i}. {s}")

    results = []
    for subtask in subtasks:
        print(f"\n[Researcher] Working on: {subtask}")

        # 2. Researcher answers the subtask
        answer = research(subtask)
        print(f"  Answer: {answer[:100]}...")

        # 3. Critic scores the answer
        review = critique(subtask, answer)
        score = review.get("score", 0)
        feedback = review.get("feedback", "no feedback")
        print(f"  [Critic] Score: {score}/5 — {feedback}")

        results.append({"subtask": subtask, "answer": answer, "score": score})

    # 4. Orchestrator synthesizes the final report
    report = synthesize(question, results)
    avg_score = sum(r["score"] for r in results) / len(results)

    print(f"\n{'='*60}")
    print(f"FINAL REPORT (avg quality: {avg_score:.1f}/5):")
    print(f"{'='*60}")
    print(report)


if __name__ == "__main__":
    run_pipeline(
        "How do modern LLM agents learn to follow instructions "
        "and use tools effectively?"
    )
BASH
python multi_agent.py
EXPECTED OUTPUT (abridged)
============================================================
QUESTION: How do modern LLM agents learn to follow instructions and use tools effectively?
============================================================

[Orchestrator] Decomposed into 3 subtasks:
  1. How does RLHF train LLMs to follow instructions?
  2. How do ReAct agents use tools at runtime?
  3. What role does the Transformer architecture play?

[Researcher] Working on: How does RLHF train LLMs to follow instructions?
  Answer: RLHF fine-tunes LLMs using human preference rankings...
  [Critic] Score: 4/5 — Clear and grounded but could mention the KL penalty.

[Researcher] Working on: How do ReAct agents use tools at runtime?
  Answer: ReAct agents interleave Thought, Action, and Observation...
  [Critic] Score: 5/5 — Concise, accurate, and well-sourced.

[Researcher] Working on: What role does the Transformer architecture play?
  Answer: The Transformer uses self-attention to process all tokens...
  [Critic] Score: 4/5 — Good but could connect more directly to tool use.

============================================================
FINAL REPORT (avg quality: 4.3/5):
============================================================
Modern LLM agents learn to follow instructions through RLHF, which fine-tunes
the model on human preference rankings across three stages... [continues]
What to observe: a three-subtask run makes eight LLM calls in total (1 orchestrator decomposition + 3 researcher runs + 3 critic reviews + 1 synthesis), plus extra calls whenever a researcher needs more than one tool-use round. Each agent has a focused system prompt and receives only the information it needs; no agent accumulates the full pipeline context.
6
Extension: run researcher subtasks in parallel

The three researcher calls are independent, so they can run simultaneously. Replace the sequential loop with ThreadPoolExecutor to run all researchers concurrently and cut wall-clock time roughly threefold.

PYTHON — replace the research+critique loop in run_pipeline
from concurrent.futures import ThreadPoolExecutor, as_completed

def research_and_critique(subtask: str) -> dict:
    """Run researcher + critic for one subtask. Safe to call in parallel."""
    answer = research(subtask)
    review = critique(subtask, answer)
    return {
        "subtask": subtask,
        "answer": answer,
        "score": review.get("score", 0),
        "feedback": review.get("feedback", ""),
    }


# Replace the sequential loop with this.
# Note: as_completed() yields results in completion order, not subtask order;
# sort afterwards if the final report should follow the decomposition order.
with ThreadPoolExecutor(max_workers=len(subtasks)) as executor:
    futures = {executor.submit(research_and_critique, s): s for s in subtasks}
    results = [future.result() for future in as_completed(futures)]

print(f"\nAll {len(results)} subtasks completed in parallel.")
for r in results:
    print(f"  [{r['score']}/5] {r['subtask'][:60]}")
Rate limit awareness: Running 3 agents in parallel triples your tokens-per-minute consumption. If you hit a 429, the SDK's built-in retry with backoff handles it — but set max_workers conservatively (2–3) for accounts on lower-tier rate limits.
