
Deploying Agents

An agent that works in a notebook is not a production agent. Production means handling rate limits and API errors gracefully, enforcing token and cost budgets, logging every decision in a structured way so you can debug failures after the fact, and rolling out changes safely. This section covers the engineering discipline of taking an agent from prototype to deployed service.

01 · DEPLOYMENT PATTERNS

Where and How Agents Run in Production

The right deployment pattern depends on the agent's latency requirements, task duration, and invocation frequency. LLM agents have unusual execution profiles compared to traditional services: they are compute-light on the host (calls are API-bound, not CPU-bound), but they run for seconds to minutes rather than milliseconds, and they may make many sequential API calls within a single invocation.

Each pattern below lists its execution model, what it is best for, and the limit to watch.

PATTERN 01 — SERVERLESS FUNCTION (AWS Lambda, GCP Cloud Run, etc.)
Execution model: event-triggered, short-lived container.
Best for: low-frequency tasks, chatbot webhooks, simple tool agents.
Limit to watch: execution timeout (15 min on Lambda); multi-step agents may exceed it.

PATTERN 02 — ASYNC JOB QUEUE (Celery, Cloud Tasks, SQS worker)
Execution model: task enqueued; a worker picks it up and runs it to completion.
Best for: long-running agents, batch processing, background research tasks.
Limits to watch: queue depth, dead-letter handling, result retrieval pattern.

PATTERN 03 — CONTAINERIZED SERVICE (Docker + Kubernetes / Cloud Run)
Execution model: persistent process with an HTTP or gRPC API surface.
Best for: high-frequency agents with shared state (vector store, session cache).
Limit to watch: horizontal scaling requires stateless design or distributed state.

PATTERN 04 — STREAMING / SSE (Server-Sent Events, WebSocket)
Execution model: long-lived connection; output tokens streamed incrementally to the client.
Best for: user-facing chat agents where latency to first token matters.
Limits to watch: connection timeout, client reconnect logic, partial state on disconnect.
Secrets management: Never hardcode API keys. Inject them as environment variables at runtime via your platform's secrets manager (AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault). The Anthropic SDK reads ANTHROPIC_API_KEY from the environment by default — no code change needed, just configure the secret at deploy time.
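A minimal sketch of the fail-fast pattern for injected secrets: check at startup that the environment variable exists, so a missing secret becomes an immediate, obvious deploy failure rather than a 401 mid-run. The helper name `require_env` is illustrative, not from any SDK.

```python
import os

def require_env(name: str) -> str:
    """Read a required secret from the environment, failing fast at startup."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Inject it via your platform's secrets manager, "
            "never hardcode it in source."
        )
    return value

# The Anthropic SDK reads ANTHROPIC_API_KEY itself, but an explicit startup
# check surfaces a misconfigured deployment before the first request arrives.
```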
02 · OBSERVABILITY

Logs, Traces, and Metrics for Agent Loops

Traditional services emit request-level logs: one log line per HTTP call. Agents are different — a single user request may trigger 5–20 LLM calls, dozens of tool calls, and thousands of tokens. Without structured, per-iteration logging, debugging a failure means reconstructing a black box.

The three pillars of observability apply to agents, but with agent-specific dimensions:

PILLAR 01 — LOGS
Structured JSON logs per iteration
Log every LLM call as a structured JSON event: timestamp, model, iteration number, stop_reason, input_tokens, output_tokens, tool calls made, tool results received, latency_ms. Store in a log aggregator (CloudWatch, Datadog, Loki) for querying.
PRODUCTION REQUIREMENT
PILLAR 02 — TRACES
Distributed traces across LLM calls
A trace groups all LLM calls and tool calls in a single agent run under one trace_id. OpenTelemetry is the vendor-neutral standard. LLM-focused observability platforms (LangSmith, Langfuse, Weave) add LLM-aware span types with token counts and prompt content.
WIDELY USED (2023–2026)
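The core idea of trace propagation can be sketched without pulling in OpenTelemetry: hold the current trace ID in a `contextvars.ContextVar` so every log event in the same run picks it up implicitly. The function names here are illustrative, not part of any tracing SDK.

```python
import uuid
import contextvars

# One ContextVar holds the current trace ID; log helpers read it implicitly,
# so nothing has to thread a trace_id argument through the whole call stack.
_trace_id: contextvars.ContextVar[str] = contextvars.ContextVar("trace_id", default="")

def start_trace() -> str:
    """Begin a new agent run: generate a trace ID and make it ambient."""
    tid = uuid.uuid4().hex
    _trace_id.set(tid)
    return tid

def traced_event(event: str, **fields) -> dict:
    """Attach the ambient trace ID so all events in one run group together."""
    return {"trace_id": _trace_id.get(), "event": event, **fields}
```

Every LLM call, tool call, and run summary logged through `traced_event` then shares the same trace_id, which is exactly what a log-aggregator query filters on.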
PILLAR 03 — METRICS
Agent-level metrics
Track per-run aggregates: total_tokens, total_cost_usd, num_iterations, num_tool_calls, success_rate, p50/p95/p99 latency. Alert when cost-per-run exceeds budget or when error rate spikes. Prometheus + Grafana or a managed APM tool.
PRODUCTION REQUIREMENT
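A minimal sketch of aggregating per-run records into the metrics listed above, using a nearest-rank percentile. In production you would export these to Prometheus rather than compute them by hand; the record keys here are assumptions matching this section's log fields.

```python
import math

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. percentile(latencies, 95) for p95."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize_runs(runs: list[dict]) -> dict:
    """Aggregate per-run records into the metrics worth alerting on."""
    latencies = [r["duration_ms"] for r in runs]
    return {
        "runs": len(runs),
        "total_cost_usd": round(sum(r["cost_usd"] for r in runs), 4),
        "success_rate": sum(1 for r in runs if r["success"]) / len(runs),
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
    }
```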
AGENT-SPECIFIC — EVALS
Offline evaluation datasets
A curated set of (input, expected_behavior) pairs run on every code change. Not just unit tests — LLM-as-judge or rule-based scorers on agent outputs. Without evals, you cannot safely change prompts, tools, or models without risking regression.
EMERGING STANDARD (2024–2026)
03 · RETRIES & ERROR HANDLING

Making Agents Resilient to Transient Failures

In production, the Anthropic API — like any external service — returns transient errors: rate limit responses (HTTP 429), overload responses (HTTP 529), and occasional network timeouts. An agent that crashes on the first 429 is not a production agent.

EXPONENTIAL BACKOFF WITH JITTER — THE STANDARD RETRY PATTERN

Wait base_delay × 2^attempt + random_jitter between retries. The exponential growth prevents hammering a service that is already overloaded. The jitter (random offset) prevents the "thundering herd" — all clients retrying at the same instant after a shared outage.

Retry pattern — conceptual
delay = base_delay * (2 ** attempt) + random.uniform(0, jitter_max)
time.sleep(min(delay, max_delay))
Rate limit (HTTP 429): Retry with exponential backoff. Honor the retry-after header if present.
API overload (HTTP 529): Same as rate limit; retry with backoff.
Server error (HTTP 500, 502, 503): Retry up to max_retries. If still failing, surface the error to the caller; do not silently swallow it.
Invalid request (HTTP 400): Do NOT retry. Fix the request. 400s are client errors; retrying wastes quota.
Auth error (HTTP 401): Do NOT retry. Check the API key. Alert immediately; this is a configuration or secret-rotation issue.
Tool exception (no HTTP status): Catch all exceptions in the tool executor. Return the error as a string tool result; never propagate it into the agent loop as an exception.
The Anthropic Python SDK handles retries automatically: By default it retries up to 2 times on 429 and 529 with exponential backoff. You can configure max_retries when creating the client: anthropic.Anthropic(max_retries=4). For most production uses, the SDK's built-in retry is sufficient — you only need custom retry logic if you need fine-grained control over backoff strategy or circuit breaking.
04 · COST & BUDGET CONTROL

Agents Can Be Expensive — By Design or By Bug

A ReAct agent that loops 50 times because it cannot find a tool result will burn significant API budget before timing out. In production, you need both soft limits (warn when approaching budget) and hard limits (terminate the run when exceeded) enforced in code — not just as billing alerts.

CONTROL 01
Token budget per run
Accumulate usage.input_tokens + usage.output_tokens across every API call in a single agent run. When the total crosses a threshold, stop the loop and return partial results with a clear explanation. The Anthropic API returns usage on every response.
PRODUCTION REQUIREMENT
CONTROL 02
Iteration cap
Hard limit on the number of LLM calls per run (e.g., MAX_ITERATIONS = 20). Prevents infinite loops caused by tool failures or model confusion. Always terminate with a message — never silently stop. Return whatever was accomplished before the limit.
PRODUCTION REQUIREMENT
CONTROL 03
Wall-clock timeout
Set a timeout on the entire agent run (e.g., 120 seconds). Tool calls can hang, and a stuck tool blocks the loop indefinitely. Use asyncio.wait_for in async code, or run the agent in a worker thread via concurrent.futures.ThreadPoolExecutor and wait with a timeout; Python's signal-based alarms work only in the main thread on Unix.
PRODUCTION REQUIREMENT
CONTROL 04
Model tier selection
Match model capability to task complexity. Use a smaller/faster model for classification, routing, and simple Q&A; reserve the most capable models for synthesis, complex reasoning, and high-stakes decisions. Routing logic pays for itself within hours at scale.
COST OPTIMIZATION
05 · TESTING & ROLLOUT

How to Change Agents Without Breaking Production

Changing a system prompt, swapping a model, or modifying a tool schema can alter agent behavior in ways that are invisible until you see them in production logs. The solution is a testing and rollout discipline borrowed from software engineering — adapted for the stochastic, hard-to-unit-test nature of LLM outputs.

LAYER 1 — UNIT TESTS FOR TOOLS

Tool executor functions are pure Python — test them like any function. Call execute_tool("web_search", {"query": "test"}) and assert on the output format. These tests are fast, deterministic, and catch regressions in tool logic before they affect the agent loop.
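A pytest-style sketch of such tests. The stand-in executor below mirrors the signature used in this section's lab; in a real suite you would import the production `execute_tool` instead of redefining it.

```python
# Stand-in executor with the same (name, tool_input) -> str contract as the lab's.
def execute_tool(name: str, tool_input: dict) -> str:
    if name == "calculator":
        try:
            return str(eval(tool_input["expression"], {"__builtins__": {}}))
        except Exception as e:
            return f"Tool error: {type(e).__name__}: {e}"
    return f"Unknown tool: {name}"

def test_calculator_happy_path():
    assert execute_tool("calculator", {"expression": "2 + 2"}) == "4"

def test_tool_errors_are_strings_not_exceptions():
    # The executor must never raise; errors come back as string results.
    result = execute_tool("calculator", {"expression": "1 / 0"})
    assert result.startswith("Tool error")

def test_unknown_tool_is_reported():
    assert execute_tool("nope", {}) == "Unknown tool: nope"
```

Note the second test: it pins down the error-handling contract from the retries section (tool exceptions become string results), which is exactly the kind of regression a prompt-focused eval would miss.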

LAYER 2 — EVALUATION DATASETS

A curated set of (user_input, expected_behavior) pairs. "Expected behavior" for agents is rarely an exact string — it is a rubric: "agent called the correct tool," "agent did not hallucinate a source," "agent completed the task in under 5 iterations." Score with rule-based checks or LLM-as-judge. Run on every PR before merge.

LAYER 3 — SHADOW MODE

Run the new agent version on real traffic in parallel with the production version, but suppress its output. Compare logged outputs for divergence. When divergence is low and eval scores are equal or better, promote the new version. This is the safest way to validate model or prompt changes.
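The divergence comparison can be sketched as a paired diff over logged outputs. Exact string match is used here for simplicity; for agent text a semantic comparison (embedding similarity or LLM-as-judge) is more realistic, which is why the `same` predicate is pluggable. The function name is illustrative.

```python
def divergence_rate(prod_outputs: list[str], shadow_outputs: list[str],
                    same=lambda a, b: a.strip() == b.strip()) -> float:
    """Fraction of paired requests where the shadow version diverged from prod."""
    if len(prod_outputs) != len(shadow_outputs):
        raise ValueError("outputs must be paired per request")
    if not prod_outputs:
        return 0.0
    diverged = sum(1 for a, b in zip(prod_outputs, shadow_outputs) if not same(a, b))
    return diverged / len(prod_outputs)
```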

LAYER 4 — CANARY / A/B ROLLOUT

Route a small fraction (1–5%) of real traffic to the new version. Monitor error rate, cost-per-run, latency, and downstream success metrics. Gradually increase traffic to 100% if metrics hold. Roll back immediately if any metric degrades beyond a threshold.
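Traffic splitting is usually done with stable hash-based bucketing rather than per-request randomness, so the same user always sees the same agent version and never flips mid-conversation. A minimal sketch, with an illustrative function name:

```python
import hashlib

def route_to_canary(user_id: str, canary_fraction: float) -> bool:
    """Stable bucketing: hash the user ID to a uniform value in [0, 1)."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < canary_fraction
```

Ramping from 1% to 100% is then just raising `canary_fraction` in config; rollback is setting it to 0, and users already on the canary revert cleanly.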

SOURCES USED IN THIS SECTION

Verified References

Every claim in this section is grounded in one of these sources.

Anthropic — Agentic & Tool Use Docs (official docs, maintained 2024–2026): agentic patterns, minimal footprint, iteration limits, token budgets.
Anthropic — API Error Reference (official docs, maintained 2024–2026): error codes, retry guidance, rate limit headers.
Anthropic — Rate Limits (official docs, maintained 2024–2026): tokens-per-minute and requests-per-minute limits by tier.
Anthropic — Models Overview (official docs, maintained 2024–2026): model tiers, capability vs cost trade-offs, recommended use cases.
OpenTelemetry Documentation Official docs (CNCF) Traces, spans, metrics, logs — vendor-neutral observability standard Maintained 2021–2026
HANDS-ON LAB

Build a Production-Ready Agent Wrapper

You will wrap a basic ReAct-style agent with production engineering: structured JSON logging, exponential backoff retries, a token budget guard, an iteration cap, and a simple evaluation harness. The complete script is prod_agent.py.

Section 11 Lab — Production Agent Wrapper
6 STEPS · PYTHON · ~45 MIN
STEP 1 — Create the file and set up imports
BASH
pip install anthropic   # already installed if you did prior labs
touch prod_agent.py
PYTHON — prod_agent.py
import os, json, time, random, logging
from datetime import datetime, timezone
import anthropic

# ── Structured logger ────────────────────────────────────────────
# Use JSON lines format so every log entry is machine-parseable.
logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("prod_agent")

def jlog(event: str, **kwargs) -> None:
    """Emit a structured JSON log line."""
    record = {"ts": datetime.now(timezone.utc).isoformat(), "event": event, **kwargs}
    log.info(json.dumps(record))

# ── Client ────────────────────────────────────────────────────────
# max_retries=4: SDK retries 429/529 with built-in exponential backoff.
client = anthropic.Anthropic(
    api_key=os.environ["ANTHROPIC_API_KEY"],
    max_retries=4,
)

# ── Budget constants ──────────────────────────────────────────────
MAX_ITERATIONS = 10      # hard iteration cap per run
TOKEN_BUDGET    = 20_000  # total tokens across all calls in one run
MODEL           = "claude-haiku-4-5-20251001"
STEP 2 — Implement the tool registry and a logged LLM call

The llm_call wrapper logs every API call as a structured event and accumulates token usage into a run-level counter. All tool calls are also logged.

PYTHON — prod_agent.py (continued)
TOOLS = [
    {
        "name": "web_search",
        "description": "Search the web for factual information.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "calculator",
        "description": "Evaluate a Python arithmetic expression.",
        "input_schema": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
]

SEARCH_DB = {
    "population of earth": "Earth's population is approximately 8.1 billion as of 2024.",
    "speed of light": "The speed of light in a vacuum is 299,792,458 metres per second.",
}

def execute_tool(name: str, tool_input: dict) -> str:
    jlog("tool_call", tool=name, input=tool_input)
    t0 = time.monotonic()
    try:
        if name == "web_search":
            q = tool_input["query"].lower()
            for key, val in SEARCH_DB.items():
                if key in q:
                    result = val
                    break
            else:
                result = f'No results for "{tool_input["query"]}".'
        elif name == "calculator":
            result = str(eval(tool_input["expression"], {"__builtins__": {}}))
        else:
            result = f"Unknown tool: {name}"
    except Exception as e:
        result = f"Tool error: {type(e).__name__}: {e}"

    latency_ms = int((time.monotonic() - t0) * 1000)
    jlog("tool_result", tool=name, result=result[:120], latency_ms=latency_ms)
    return result


def llm_call(messages: list, run_state: dict, iteration: int) -> anthropic.types.Message:
    """Thin wrapper: calls the API and logs usage."""
    t0 = time.monotonic()
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=SYSTEM,  # defined in step 3; Python resolves the name at call time
        tools=TOOLS,
        messages=messages,
    )
    latency_ms = int((time.monotonic() - t0) * 1000)

    # Accumulate tokens
    run_state["input_tokens"]  += response.usage.input_tokens
    run_state["output_tokens"] += response.usage.output_tokens
    total = run_state["input_tokens"] + run_state["output_tokens"]

    jlog(
        "llm_call",
        iteration=iteration,
        stop_reason=response.stop_reason,
        input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens,
        total_tokens_so_far=total,
        latency_ms=latency_ms,
    )
    return response
STEP 3 — Implement the agent loop with budget guards

The agent loop enforces both the iteration cap and the token budget. When either limit is hit, the loop terminates and returns partial results with a clear explanation of what happened.

PYTHON — prod_agent.py (continued)
SYSTEM = """You are a research assistant. Use tools to answer the user's question.
When you have a complete answer, respond with a final text message."""


def run_agent(task: str, trace_id: str | None = None) -> dict:
    """Run the agent loop with full production instrumentation."""
    trace_id = trace_id or str(random.randint(100_000, 999_999))
    run_state = {"input_tokens": 0, "output_tokens": 0}

    jlog("run_start", trace_id=trace_id, task=task)
    t_run = time.monotonic()

    messages = [{"role": "user", "content": task}]
    final_answer = None
    stop_reason_outer = "unknown"

    for iteration in range(1, MAX_ITERATIONS + 1):

        # ── Token budget check ───────────────────────────────────
        total_so_far = run_state["input_tokens"] + run_state["output_tokens"]
        if total_so_far >= TOKEN_BUDGET:
            stop_reason_outer = "token_budget_exceeded"
            jlog("budget_exceeded", trace_id=trace_id, total_tokens=total_so_far)
            final_answer = (
                f"[Agent stopped: token budget of {TOKEN_BUDGET:,} exceeded at "
                f"{total_so_far:,} tokens. Partial work completed in {iteration-1} iterations.]"
            )
            break

        response = llm_call(messages, run_state, iteration)
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason == "end_turn":
            stop_reason_outer = "end_turn"
            final_answer = "".join(
                b.text for b in response.content if hasattr(b, "text")
            )
            break

        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })
            messages.append({"role": "user", "content": tool_results})

    else:
        # for-loop exhausted — iteration cap hit
        stop_reason_outer = "iteration_cap"
        final_answer = (
            f"[Agent stopped: reached {MAX_ITERATIONS} iteration limit without completing the task.]"
        )

    duration_ms = int((time.monotonic() - t_run) * 1000)
    total_tokens = run_state["input_tokens"] + run_state["output_tokens"]

    jlog(
        "run_end",
        trace_id=trace_id,
        stop_reason=stop_reason_outer,
        total_input_tokens=run_state["input_tokens"],
        total_output_tokens=run_state["output_tokens"],
        total_tokens=total_tokens,
        duration_ms=duration_ms,
    )

    return {
        "trace_id": trace_id,
        "answer": final_answer,
        "total_tokens": total_tokens,
        "stop_reason": stop_reason_outer,
    }
STEP 4 — Add a simple evaluation harness

The eval harness runs a set of test cases and scores each answer with a keyword check. In a real system you would use an LLM-as-judge or a domain-specific rubric — this simple version demonstrates the pattern.

PYTHON — prod_agent.py (continued)
EVAL_CASES = [
    {
        "task": "What is the approximate population of Earth?",
        "must_contain": ["8", "billion"],
        "must_not_contain": ["trillion", "million people"],
    },
    {
        "task": "What is 1234 multiplied by 5678?",
        "must_contain": ["7,006,652", "7006652"],
        "must_not_contain": [],
    },
    {
        "task": "What is the speed of light and how many kilometres does it travel in one second?",
        "must_contain": ["299", "792"],
        "must_not_contain": [],
    },
]


def run_eval() -> None:
    print("\n" + "="*60)
    print("EVALUATION RUN")
    print("="*60)

    passed = 0
    for i, case in enumerate(EVAL_CASES, 1):
        print(f"\n[Case {i}] {case['task']}")
        result = run_agent(case["task"])
        answer = (result["answer"] or "").lower()

        # Normalize digit grouping so "7,006,652" and "7006652" both match.
        normalized = answer.replace(",", "")
        ok = (
            all(kw.lower().replace(",", "") in normalized for kw in case["must_contain"]) and
            all(kw.lower() not in answer for kw in case["must_not_contain"])
        )

        status = "PASS ✓" if ok else "FAIL ✗"
        print(f"  Status:  {status}")
        print(f"  Tokens:  {result['total_tokens']:,}")
        print(f"  Answer:  {result['answer'][:120]}")
        if ok:
            passed += 1

    print(f"\nResult: {passed}/{len(EVAL_CASES)} cases passed.")
    print("="*60)
STEP 5 — Run the agent and the eval harness
PYTHON — add to the bottom of prod_agent.py
if __name__ == "__main__":
    # Single run with a trace ID
    result = run_agent(
        "What is the population of Earth, and how many people is that per square kilometre of land area? Land area = 148,940,000 km².",
        trace_id="demo-001",
    )
    print(f"\nFINAL ANSWER:\n{result['answer']}")
    print(f"Trace: {result['trace_id']} | Tokens: {result['total_tokens']:,} | Stop: {result['stop_reason']}")

    # Eval harness
    run_eval()
BASH
python prod_agent.py 2>&1 | tee run.jsonl
EXPECTED LOG OUTPUT (abridged)
{"ts": "2026-04-04T10:00:00Z", "event": "run_start", "trace_id": "demo-001", "task": "What is the population..."}
{"ts": "...", "event": "llm_call", "iteration": 1, "stop_reason": "tool_use", "input_tokens": 412, "output_tokens": 48, "total_tokens_so_far": 460, "latency_ms": 621}
{"ts": "...", "event": "tool_call", "tool": "web_search", "input": {"query": "population of earth"}}
{"ts": "...", "event": "tool_result", "tool": "web_search", "result": "Earth's population is approximately 8.1 billion...", "latency_ms": 0}
{"ts": "...", "event": "llm_call", "iteration": 2, "stop_reason": "tool_use", "input_tokens": 521, "output_tokens": 54, "total_tokens_so_far": 1035, "latency_ms": 588}
{"ts": "...", "event": "tool_call", "tool": "calculator", "input": {"expression": "8100000000 / 148940000"}}
{"ts": "...", "event": "tool_result", "tool": "calculator", "result": "54.38...", "latency_ms": 0}
{"ts": "...", "event": "llm_call", "iteration": 3, "stop_reason": "end_turn", "input_tokens": 630, "output_tokens": 72, "total_tokens_so_far": 1737, "latency_ms": 701}
{"ts": "...", "event": "run_end", "trace_id": "demo-001", "stop_reason": "end_turn", "total_tokens": 1737, "duration_ms": 1950}
What to observe: Every LLM call is a parseable JSON event with tokens, latency, and stop_reason. Tool calls and results are separately logged. The run_end event gives you the complete cost summary. Pipe this to a file and you can query it with jq or load it into any log aggregator.
STEP 6 — Extension: add a custom exponential backoff retry for tool calls

The Anthropic SDK handles LLM API retries. But your tool calls to external services need their own retry logic. Add a decorator that wraps any tool function with exponential backoff and jitter.

PYTHON — add to prod_agent.py
import functools

def retry_with_backoff(max_retries: int = 3, base_delay: float = 1.0, max_delay: float = 30.0):
    """Decorator: retry a function on exception with exponential backoff + jitter."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as e:
                    if attempt == max_retries:
                        raise
                    delay = min(base_delay * (2 ** attempt) + random.uniform(0, 1), max_delay)
                    jlog("tool_retry", attempt=attempt + 1, error=str(e), delay_s=round(delay, 2))
                    time.sleep(delay)
        return wrapper
    return decorator


# Usage: decorate any external tool call that may fail transiently
@retry_with_backoff(max_retries=3)
def call_external_api(endpoint: str) -> str:
    # Replace with a real requests.get() or httpx call
    if random.random() < 0.5:
        raise ConnectionError("Simulated transient failure")
    return f"Response from {endpoint}"

