Deploying Agents
An agent that works in a notebook is not a production agent. Production means handling rate limits and API errors gracefully, enforcing token and cost budgets, logging every decision in a structured way so you can debug failures after the fact, and rolling out changes safely. This section covers the engineering discipline of taking an agent from prototype to deployed service.
Where and How Agents Run in Production
The right deployment pattern depends on the agent's latency requirements, task duration, and invocation frequency. LLM agents have unusual execution profiles compared to traditional services: they are compute-light on the host (calls are API-bound, not CPU-bound), but they run for seconds to minutes rather than milliseconds, and they may make many sequential API calls within a single invocation.
| Pattern | Execution model | Best for | Limit to watch |
|---|---|---|---|
| Serverless function (AWS Lambda, GCP Cloud Run, etc.) | Event-triggered, short-lived container | Low-frequency tasks, chatbot webhooks, simple tool agents | Execution timeout (15 min on Lambda). Multi-step agents may exceed it. |
| Async job queue (Celery, Cloud Tasks, SQS worker) | Task enqueued, worker picks up and runs to completion | Long-running agents, batch processing, background research tasks | Queue depth, dead-letter handling, result retrieval pattern |
| Containerized service (Docker + Kubernetes / Cloud Run) | Persistent process, HTTP or gRPC API surface | High-frequency agents with shared state (vector store, session cache) | Horizontal scaling requires stateless design or distributed state |
| Streaming / SSE (Server-Sent Events, WebSocket) | Long-lived connection, incremental output tokens streamed to client | User-facing chat agents where latency to first token matters | Connection timeout, client reconnect logic, partial state on disconnect |
Credential handling is the same in every pattern: the Anthropic SDK reads ANTHROPIC_API_KEY from the environment by default — no code change needed, just configure the secret at deploy time.
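The serverless pattern in the table can be sketched as a thin handler around an agent run. The event shape, `handler` signature, and `run_agent` stub below are illustrative, not a real cloud provider API; the lab later in this section shows a fully instrumented agent loop.

```python
import json

def run_agent(task: str) -> dict:
    """Stand-in for a real agent loop (kept trivial for this sketch)."""
    return {"answer": f"(answer for: {task})", "total_tokens": 0}

def handler(event: dict, context=None) -> dict:
    """Lambda-style entry point: parse the task, run the agent, return JSON."""
    task = json.loads(event["body"])["task"]
    result = run_agent(task)
    return {"statusCode": 200, "body": json.dumps(result)}
```

The key constraint from the table still applies: if the agent loop can run longer than the platform's execution timeout, move it behind a job queue instead.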
Logs, Traces, and Metrics for Agent Loops
Traditional services emit request-level logs: one log line per HTTP call. Agents are different — a single user request may trigger 5–20 LLM calls, dozens of tool calls, and thousands of tokens. Without structured, per-iteration logging, debugging a failure means reconstructing a black box.
The three pillars of observability (logs, traces, and metrics) apply to agents, but with agent-specific dimensions: per-iteration log events instead of per-request lines, one trace per run with a span for each LLM call and tool call, and metrics for tokens, cost, latency, and error rate per run.
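As a concrete sketch of the trace pillar, here is an illustrative in-memory model of "one trace per run, one span per call". The field names are hypothetical; a production system would emit these through the OpenTelemetry SDK rather than a hand-rolled dict.

```python
import time
import uuid

def new_trace() -> dict:
    """One trace per agent run."""
    return {"trace_id": uuid.uuid4().hex, "spans": []}

def record_span(trace: dict, name: str, start: float, end: float, **attrs) -> None:
    """One span per LLM call or tool call, with agent-specific attributes."""
    trace["spans"].append(
        {"name": name, "duration_ms": int((end - start) * 1000), **attrs}
    )

trace = new_trace()
t0 = time.monotonic()
# ... an LLM call would happen here ...
record_span(trace, "llm_call", t0, time.monotonic(), iteration=1, input_tokens=412)
```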
Making Agents Resilient to Transient Failures
In production, the Anthropic API — like any external service — returns transient errors: rate limit responses (HTTP 429), overload responses (HTTP 529), and occasional network timeouts. An agent that crashes on the first 429 is not a production agent.
Use exponential backoff with jitter: wait base_delay × 2^attempt + random_jitter between retries. The exponential growth prevents hammering a service that is already overloaded. The jitter (random offset) prevents the "thundering herd", where every client retries at the same instant after a shared outage.

```python
delay = base_delay * (2 ** attempt) + random.uniform(0, jitter_max)
time.sleep(min(delay, max_delay))
```
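For intuition about the delays this formula produces, here is a small sketch that computes the whole schedule up front (parameter names follow the snippet above; the defaults are illustrative):

```python
import random

def backoff_schedule(attempts: int, base_delay: float = 1.0,
                     jitter_max: float = 1.0, max_delay: float = 30.0) -> list:
    """Delays for successive retries: base_delay * 2^attempt plus jitter, capped."""
    return [
        min(base_delay * (2 ** attempt) + random.uniform(0, jitter_max), max_delay)
        for attempt in range(attempts)
    ]

# With the defaults, five attempts wait roughly 1-2s, 2-3s, 4-5s, 8-9s, 16-17s.
```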
| Error type | HTTP status | Correct response |
|---|---|---|
| Rate limit | 429 | Retry with exponential backoff. Check retry-after header if present. |
| API overload | 529 | Same as rate limit — retry with backoff. |
| Server error | 500, 502, 503 | Retry up to max_retries. If still failing, surface error to caller — do not silently swallow. |
| Invalid request | 400 | Do NOT retry. Fix the request. 400s are client errors — retrying wastes quota. |
| Auth error | 401 | Do NOT retry. Check API key. Alert immediately — this is a configuration or secret-rotation issue. |
| Tool exception | n/a | Catch all exceptions in the tool executor. Return as a string error result — never propagate to the agent loop as an exception. |
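The table above can be encoded as a small decision function that the retry layer consults. This is a sketch; the category strings are made up for illustration.

```python
def classify_api_error(status: int) -> str:
    """Map an HTTP status to a retry decision, per the table above."""
    if status in (429, 529):
        return "retry_with_backoff"   # rate limit / overload
    if status in (500, 502, 503):
        return "retry_up_to_max"      # server error; surface if still failing
    if status == 401:
        return "alert_and_fail"       # auth / secret-rotation issue
    if 400 <= status < 500:
        return "fail_fast"            # client error - fix the request
    return "fail_fast"
```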
The Anthropic SDK retries rate-limit and overload errors for you — configure max_retries when creating the client: anthropic.Anthropic(max_retries=4). For most production uses, the SDK's built-in retry is sufficient — you only need custom retry logic if you need fine-grained control over backoff strategy or circuit breaking.
Agents Can Be Expensive — By Design or By Bug
A ReAct agent that loops 50 times because it cannot find a tool result will burn significant API budget before timing out. In production, you need both soft limits (warn when approaching budget) and hard limits (terminate the run when exceeded), enforced in code — not just as billing alerts.
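The soft/hard split can be sketched as a single check the agent loop calls before each iteration. The threshold values here are illustrative:

```python
SOFT_TOKEN_LIMIT = 15_000   # warn when crossed
HARD_TOKEN_LIMIT = 20_000   # terminate the run when crossed

def check_budget(total_tokens: int) -> str:
    """Return the action the agent loop should take for the current spend."""
    if total_tokens >= HARD_TOKEN_LIMIT:
        return "terminate"
    if total_tokens >= SOFT_TOKEN_LIMIT:
        return "warn"
    return "ok"
```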
To enforce a token budget, accumulate usage.input_tokens + usage.output_tokens across every API call in a single agent run. When the total crosses a threshold, stop the loop and return partial results with a clear explanation. The Anthropic API returns usage on every response.

For wall-clock limits, enforce a timeout with the signal module, asyncio.wait_for, or a thread via concurrent.futures.ThreadPoolExecutor.

How to Change Agents Without Breaking Production
Changing a system prompt, swapping a model, or modifying a tool schema can alter agent behavior in ways that are invisible until you see them in production logs. The solution is a testing and rollout discipline borrowed from software engineering — adapted for the stochastic, hard-to-unit-test nature of LLM outputs.
Tool executor functions are pure Python — test them like any function. Call execute_tool("web_search", {"query": "test"}) and assert on the output format. These tests are fast, deterministic, and catch regressions in tool logic before they affect the agent loop.
An eval suite is a curated set of (user_input, expected_behavior) pairs. "Expected behavior" for agents is rarely an exact string — it is a rubric: "agent called the correct tool," "agent did not hallucinate a source," "agent completed the task in under 5 iterations." Score with rule-based checks or LLM-as-judge. Run on every PR before merge.
In shadow mode, run the new agent version on real traffic in parallel with the production version, but suppress its output. Compare logged outputs for divergence. When divergence is low and eval scores are equal or better, promote the new version. This is the safest way to validate model or prompt changes.
For a canary rollout, route a small fraction (1–5%) of real traffic to the new version. Monitor error rate, cost-per-run, latency, and downstream success metrics. Gradually increase traffic to 100% if metrics hold. Roll back immediately if any metric degrades beyond a threshold.
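Canary routing can be as simple as a deterministic hash-based traffic split. This sketch is stable per user, so a given user never flips between versions mid-session; the function name and bucket granularity are illustrative.

```python
import hashlib

def route_to_canary(user_id: str, canary_fraction: float = 0.05) -> bool:
    """Deterministically send ~canary_fraction of users to the new version."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return bucket < canary_fraction * 10_000
```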
Verified References
Every claim in this section is grounded in one of these sources.
| Source | Type | Covers | Recency |
|---|---|---|---|
| Anthropic — Agentic & Tool Use Docs | Official docs | Agentic patterns, minimal footprint, iteration limits, token budgets | Maintained 2024–2026 |
| Anthropic — API Error Reference | Official docs | Error codes, retry guidance, rate limit headers | Maintained 2024–2026 |
| Anthropic — Rate Limits | Official docs | Tokens-per-minute, requests-per-minute limits by tier | Maintained 2024–2026 |
| Anthropic — Models Overview | Official docs | Model tiers, capability vs cost trade-offs, recommended use cases | Maintained 2024–2026 |
| OpenTelemetry Documentation | Official docs (CNCF) | Traces, spans, metrics, logs — vendor-neutral observability standard | Maintained 2021–2026 |
Build a Production-Ready Agent Wrapper
You will wrap a basic ReAct-style agent with production engineering: structured JSON logging, exponential backoff retries, a token budget guard, an iteration cap, and a simple evaluation harness. The complete script is prod_agent.py.
```bash
pip install anthropic   # already installed if you did prior labs
touch prod_agent.py
```
```python
import os, json, time, random, logging
from datetime import datetime, timezone

import anthropic

# ── Structured logger ────────────────────────────────────────────
# Use JSON lines format so every log entry is machine-parseable.
logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("prod_agent")

def jlog(event: str, **kwargs) -> None:
    """Emit a structured JSON log line."""
    record = {"ts": datetime.now(timezone.utc).isoformat(), "event": event, **kwargs}
    log.info(json.dumps(record))

# ── Client ───────────────────────────────────────────────────────
# max_retries=4: SDK retries 429/529 with built-in exponential backoff.
client = anthropic.Anthropic(
    api_key=os.environ["ANTHROPIC_API_KEY"],
    max_retries=4,
)

# ── Budget constants ─────────────────────────────────────────────
MAX_ITERATIONS = 10      # hard iteration cap per run
TOKEN_BUDGET = 20_000    # total tokens across all calls in one run
MODEL = "claude-haiku-4-5-20251001"
```
The llm_call wrapper logs every API call as a structured event and accumulates token usage into a run-level counter. All tool calls are also logged.
```python
TOOLS = [
    {
        "name": "web_search",
        "description": "Search the web for factual information.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "calculator",
        "description": "Evaluate a Python arithmetic expression.",
        "input_schema": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
]

SEARCH_DB = {
    "population of earth": "Earth's population is approximately 8.1 billion as of 2024.",
    "speed of light": "The speed of light in a vacuum is 299,792,458 metres per second.",
}

def execute_tool(name: str, tool_input: dict) -> str:
    jlog("tool_call", tool=name, input=tool_input)
    t0 = time.monotonic()
    try:
        if name == "web_search":
            q = tool_input["query"].lower()
            for key, val in SEARCH_DB.items():
                if key in q:
                    result = val
                    break
            else:
                result = f'No results for "{tool_input["query"]}".'
        elif name == "calculator":
            result = str(eval(tool_input["expression"], {"__builtins__": {}}))
        else:
            result = f"Unknown tool: {name}"
    except Exception as e:
        result = f"Tool error: {type(e).__name__}: {e}"
    latency_ms = int((time.monotonic() - t0) * 1000)
    jlog("tool_result", tool=name, result=result[:120], latency_ms=latency_ms)
    return result

def llm_call(messages: list, run_state: dict, iteration: int) -> anthropic.types.Message:
    """Thin wrapper: calls the API and logs usage."""
    t0 = time.monotonic()
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=SYSTEM,  # system prompt, defined alongside the agent loop
        tools=TOOLS,
        messages=messages,
    )
    latency_ms = int((time.monotonic() - t0) * 1000)
    # Accumulate tokens
    run_state["input_tokens"] += response.usage.input_tokens
    run_state["output_tokens"] += response.usage.output_tokens
    total = run_state["input_tokens"] + run_state["output_tokens"]
    jlog(
        "llm_call",
        iteration=iteration,
        stop_reason=response.stop_reason,
        input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens,
        total_tokens_so_far=total,
        latency_ms=latency_ms,
    )
    return response
```
The agent loop enforces both the iteration cap and the token budget. When either limit is hit, the loop terminates and returns partial results with a clear explanation of what happened.
```python
SYSTEM = """You are a research assistant. Use tools to answer the user's question. \
When you have a complete answer, respond with a final text message."""

def run_agent(task: str, trace_id: str | None = None) -> dict:
    """Run the agent loop with full production instrumentation."""
    trace_id = trace_id or str(random.randint(100_000, 999_999))
    run_state = {"input_tokens": 0, "output_tokens": 0}
    jlog("run_start", trace_id=trace_id, task=task)
    t_run = time.monotonic()

    messages = [{"role": "user", "content": task}]
    final_answer = None
    stop_reason_outer = "unknown"

    for iteration in range(1, MAX_ITERATIONS + 1):
        # ── Token budget check ───────────────────────────────────
        total_so_far = run_state["input_tokens"] + run_state["output_tokens"]
        if total_so_far >= TOKEN_BUDGET:
            stop_reason_outer = "token_budget_exceeded"
            jlog("budget_exceeded", trace_id=trace_id, total_tokens=total_so_far)
            final_answer = (
                f"[Agent stopped: token budget of {TOKEN_BUDGET:,} exceeded at "
                f"{total_so_far:,} tokens. Partial work completed in {iteration-1} iterations.]"
            )
            break

        response = llm_call(messages, run_state, iteration)
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason == "end_turn":
            stop_reason_outer = "end_turn"
            final_answer = "".join(
                b.text for b in response.content if hasattr(b, "text")
            )
            break

        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })
            messages.append({"role": "user", "content": tool_results})
    else:
        # for-loop exhausted — iteration cap hit
        stop_reason_outer = "iteration_cap"
        final_answer = (
            f"[Agent stopped: reached {MAX_ITERATIONS} iteration limit without completing the task.]"
        )

    duration_ms = int((time.monotonic() - t_run) * 1000)
    total_tokens = run_state["input_tokens"] + run_state["output_tokens"]
    jlog(
        "run_end",
        trace_id=trace_id,
        stop_reason=stop_reason_outer,
        total_input_tokens=run_state["input_tokens"],
        total_output_tokens=run_state["output_tokens"],
        total_tokens=total_tokens,
        duration_ms=duration_ms,
    )
    return {
        "trace_id": trace_id,
        "answer": final_answer,
        "total_tokens": total_tokens,
        "stop_reason": stop_reason_outer,
    }
```
The eval harness runs a set of test cases and scores each answer with a keyword check. In a real system you would use an LLM-as-judge or a domain-specific rubric — this simple version demonstrates the pattern.
```python
EVAL_CASES = [
    {
        "task": "What is the approximate population of Earth?",
        "must_contain": ["8", "billion"],
        "must_not_contain": ["trillion", "million people"],
    },
    {
        "task": "What is 1234 multiplied by 5678?",
        "must_contain": ["7,006,652", "7006652"],
        "must_not_contain": [],
    },
    {
        "task": "What is the speed of light and how many kilometres does it travel in one second?",
        "must_contain": ["299", "792"],
        "must_not_contain": [],
    },
]

def run_eval() -> None:
    print("\n" + "=" * 60)
    print("EVALUATION RUN")
    print("=" * 60)
    passed = 0
    for i, case in enumerate(EVAL_CASES, 1):
        print(f"\n[Case {i}] {case['task']}")
        result = run_agent(case["task"])
        # Normalise commas so "7,006,652" and "7006652" both match.
        norm = lambda s: s.lower().replace(",", "")
        answer = norm(result["answer"] or "")
        ok = (
            all(norm(kw) in answer for kw in case["must_contain"]) and
            all(norm(kw) not in answer for kw in case["must_not_contain"])
        )
        status = "PASS ✓" if ok else "FAIL ✗"
        print(f"  Status: {status}")
        print(f"  Tokens: {result['total_tokens']:,}")
        print(f"  Answer: {(result['answer'] or '')[:120]}")
        if ok:
            passed += 1
    print(f"\nResult: {passed}/{len(EVAL_CASES)} cases passed.")
    print("=" * 60)
```
```python
if __name__ == "__main__":
    # Single run with a trace ID
    result = run_agent(
        "What is the population of Earth, and how many people is that per "
        "square kilometre of land area? Land area = 148,940,000 km².",
        trace_id="demo-001",
    )
    print(f"\nFINAL ANSWER:\n{result['answer']}")
    print(f"Trace: {result['trace_id']} | Tokens: {result['total_tokens']:,} | Stop: {result['stop_reason']}")

    # Eval harness
    run_eval()
```
```bash
python prod_agent.py 2>&1 | tee run.jsonl
```
```json
{"ts": "2026-04-04T10:00:00Z", "event": "run_start", "trace_id": "demo-001", "task": "What is the population..."}
{"ts": "...", "event": "llm_call", "iteration": 1, "stop_reason": "tool_use", "input_tokens": 412, "output_tokens": 48, "total_tokens_so_far": 460, "latency_ms": 621}
{"ts": "...", "event": "tool_call", "tool": "web_search", "input": {"query": "population of earth"}}
{"ts": "...", "event": "tool_result", "tool": "web_search", "result": "Earth's population is approximately 8.1 billion...", "latency_ms": 0}
{"ts": "...", "event": "llm_call", "iteration": 2, "stop_reason": "tool_use", "input_tokens": 521, "output_tokens": 54, "total_tokens_so_far": 1035, "latency_ms": 588}
{"ts": "...", "event": "tool_call", "tool": "calculator", "input": {"expression": "8100000000 / 148940000"}}
{"ts": "...", "event": "tool_result", "tool": "calculator", "result": "54.38...", "latency_ms": 0}
{"ts": "...", "event": "llm_call", "iteration": 3, "stop_reason": "end_turn", "input_tokens": 630, "output_tokens": 72, "total_tokens_so_far": 1737, "latency_ms": 701}
{"ts": "...", "event": "run_end", "trace_id": "demo-001", "stop_reason": "end_turn", "total_tokens": 1737, "duration_ms": 1950}
```
The run_end event gives you the complete cost summary for the run. Pipe the output to a file and you can query it with jq or load it into any log aggregator.
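The token totals in run_end convert directly to dollars. Here is a sketch where the per-million-token prices are parameters rather than hardcoded values, since pricing varies by model and changes over time; look up the current rates for the model you deploy.

```python
def run_cost_usd(input_tokens: int, output_tokens: int,
                 usd_per_mtok_in: float, usd_per_mtok_out: float) -> float:
    """Cost of one agent run, computed from its run_end token totals."""
    return (input_tokens * usd_per_mtok_in
            + output_tokens * usd_per_mtok_out) / 1_000_000
```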
The Anthropic SDK handles LLM API retries. But your tool calls to external services need their own retry logic. Add a decorator that wraps any tool function with exponential backoff and jitter.
```python
import functools

def retry_with_backoff(max_retries: int = 3, base_delay: float = 1.0, max_delay: float = 30.0):
    """Decorator: retry a function on exception with exponential backoff + jitter."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as e:
                    if attempt == max_retries:
                        raise
                    delay = min(base_delay * (2 ** attempt) + random.uniform(0, 1), max_delay)
                    jlog("tool_retry", attempt=attempt + 1, error=str(e), delay_s=round(delay, 2))
                    time.sleep(delay)
        return wrapper
    return decorator

# Usage: decorate any external tool call that may fail transiently
@retry_with_backoff(max_retries=3)
def call_external_api(endpoint: str) -> str:
    # Replace with a real requests.get() or httpx call
    if random.random() < 0.5:
        raise ConnectionError("Simulated transient failure")
    return f"Response from {endpoint}"
```