Long-Horizon Agents
Single-turn agents answer questions. Long-horizon agents complete projects — spanning dozens of steps, hours of clock time, and multiple sessions. This section covers the unique engineering challenges of state, context, planning, recovery, and evaluation that arise when agents must sustain coherent work across an extended timeline.
Multi-Step, Persistent, and Fault-Tolerant by Design
A long-horizon task cannot be answered in a single LLM call. Each step depends on the outputs of earlier steps — tool results, written files, API responses — forming a directed dependency graph, not a linear list. Long-horizon agents run for minutes to hours, not seconds, and must survive failures mid-run.
Step N depends on step N−1. No single prompt can shortcut a sequence that requires reading files, running tests, interpreting failures, and revising — each phase builds on the last.
Runs lasting minutes to hours expose concerns single-turn agents never face: API timeouts, session expiry, machine restarts, billing cap hits, and user disconnects mid-task.
The agent must track what it has done and what remains. Without explicit state management, a restarted agent either repeats completed work (costly) or skips it (silent partial completion).
SWE-bench (Jimenez et al., 2023) gives an agent a real GitHub issue and checks whether its generated patch passes the repository's test suite. Each task requires reading dozens of files, reproducing a bug, editing code, running tests, interpreting failures, and iterating — a canonical long-horizon evaluation.
1. Read the issue, understand the expected behaviour
2. Explore file structure, find relevant code
3. Reproduce the bug via tests
4. Hypothesise fix → edit source files
5. Run tests → observe failures → refine fix
6. Repeat steps 4–5 until tests pass → submit patch
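The iterate-until-green loop in steps 4–6 can be sketched as a bounded retry loop. This is a minimal illustration, not the SWE-bench harness; `propose_fix` and `run_tests` are hypothetical stand-ins for LLM and tool calls:

```python
def fix_until_green(propose_fix, run_tests, max_iters=5):
    """Iterate hypothesise-fix -> run-tests until the suite passes.

    propose_fix(failures) returns a candidate patch given the last
    observed failures (None on the first attempt); run_tests(patch)
    returns a (passed, failures) tuple. Both are placeholders for
    real LLM and tool calls.
    """
    failures = None
    for attempt in range(1, max_iters + 1):
        patch = propose_fix(failures)
        passed, failures = run_tests(patch)
        if passed:
            return patch, attempt
    return None, max_iters  # give up: surface failures to a replanning step
```

Bounding the loop matters: an agent that iterates forever on an unfixable bug burns the token budget without surfacing the failure.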
Durable State Outside the Context Window
The central challenge of long-horizon agents is durable state: persisting enough information that the agent can resume after a crash, timeout, or deliberate pause — without replaying all prior work. All mutable task state must live outside the context window, in a database, file system, or task queue.
"task_id": "proj-abc123",
"status": "in_progress", # pending | in_progress | done | failed
"plan": ["step1", "step2", ...],
"completed": ["step1"], # steps already done
"artifacts": {"step1": "/tmp/outline.md"},
"last_checkpoint": "2026-04-04T10:32:00Z"
}
Checkpoint only at milestone boundaries — natural stopping points where an artefact is complete. Checkpointing after every LLM call wastes I/O; checkpointing only at phase completion balances durability with overhead.
Analogy: a video game saves at the end of a level, not after every enemy killed. If you die mid-level you replay only that level.
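A milestone-boundary checkpoint can be as simple as an atomic JSON write. A minimal sketch, assuming the task-state dict shown above (function names are illustrative):

```python
import json
import os

def save_checkpoint(state: dict, path: str) -> None:
    """Persist task state at a milestone boundary.

    Write to a temp file first, then rename: os.replace is atomic,
    so a crash mid-write never leaves a corrupt checkpoint behind.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path: str):
    """Return the last saved state, or None for a fresh run."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```

On startup the agent calls `load_checkpoint`; a `None` result means start from the beginning, anything else means skip every step listed in `completed`.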
When an agent resumes from a checkpoint, it may replay the step that failed. Tool calls should be idempotent: calling "write file X with content Y" twice should produce the same result as calling it once — preventing duplicate emails or duplicate DB inserts on recovery.
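Idempotency for a file-write tool can be achieved by checking whether the desired state already holds before mutating. A sketch under that assumption (the function name and return convention are illustrative):

```python
import os

def write_file_idempotent(path: str, content: str) -> bool:
    """Write `content` to `path` only if it differs from what is there.

    Replaying this call after a crash-and-resume is a no-op, so the
    recovered agent cannot corrupt or duplicate the artifact.
    Returns True if a write happened, False if it was a safe replay.
    """
    if os.path.exists(path):
        with open(path, "rb") as f:
            if f.read() == content.encode():
                return False  # already applied: safe replay
    with open(path, "w") as f:
        f.write(content)
    return True
```

The same check-before-act pattern applies to DB inserts (use a unique key) and emails (record a sent-message ID before sending).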
LangGraph's SqliteSaver checkpointer automatically snapshots graph state after each node execution. Pass a thread_id to resume from the last checkpoint — no manual serialisation required.
from langgraph.checkpoint.sqlite import SqliteSaver

checkpointer = SqliteSaver.from_conn_string("tasks.db")
app = graph.compile(checkpointer=checkpointer)
# First run — starts the task
app.invoke({"task": "..."}, config={"configurable": {"thread_id": "run-1"}})
# Resume after crash — picks up from last checkpoint
app.invoke(None, config={"configurable": {"thread_id": "run-1"}})
Preventing Context Overflow Without Losing Progress
Every LLM has a finite context window. A long-horizon agent accumulates tool outputs, intermediate results, and prior reasoning until it hits this limit. Without active management, the agent either crashes or silently drops early context — losing knowledge of work it already completed.
When a phase completes, replace its detailed messages with a compact summary and archive the details externally. The active context always contains: (1) the original task + high-level plan, (2) summaries of completed phases, (3) full detail of the current phase only.
def summarise_phase(client, messages, phase_name):
    resp = client.messages.create(
        model=MODEL, max_tokens=512,
        system="Summarise key facts, decisions, and artefacts. Omit raw tool outputs.",
        messages=[*messages,
                  {"role": "user", "content": f"Summarise '{phase_name}' in ≤200 words."}],
    )
    return resp.content[0].text
Large outputs (file contents, API responses, test logs) should never live in the context window. Write to disk → store the path and a one-line description in context → retrieve on demand via a read-file tool.
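The write-to-disk-and-keep-a-pointer pattern can be sketched as follows; the stub format and function name are assumptions, not a specific framework's API:

```python
import os

def offload_output(step_id: str, data: str, description: str,
                   artifact_dir: str = "artifacts") -> dict:
    """Write a large tool output to disk and return the small stub
    that goes into the context window in its place.

    The agent later retrieves the full content on demand via a
    read-file tool, using the stored path.
    """
    os.makedirs(artifact_dir, exist_ok=True)
    path = os.path.join(artifact_dir, f"{step_id}.txt")
    with open(path, "w") as f:
        f.write(data)
    return {"path": path, "description": description, "bytes": len(data)}
```

A 200 KB test log thus costs the context window one short line ("path + description") instead of tens of thousands of tokens.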
Proactively track cumulative token usage. When approaching ~70% of the context window, trigger summarisation before the window fills — not after the API returns a context_length_exceeded error.
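Tracking the budget proactively takes only a running counter. A minimal sketch, where the 0.7 threshold matches the ~70% rule of thumb above:

```python
class TokenBudget:
    """Track cumulative token usage and signal when to compact context
    before the window fills, rather than reacting to an overflow error."""

    def __init__(self, context_window: int, threshold: float = 0.7):
        self.context_window = context_window
        self.threshold = threshold
        self.used = 0

    def add(self, tokens: int) -> None:
        """Record usage reported by the API after each call."""
        self.used += tokens

    def should_summarise(self) -> bool:
        return self.used >= self.threshold * self.context_window
```

After each LLM call the agent adds the reported usage and, when `should_summarise()` returns True, runs phase summarisation before issuing the next request.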
Planning at Two Levels — and Updating When Reality Diverges
A long task requires a plan at multiple granularities. When a step fails or new information arrives, the agent must replan — updating the execution plan without abandoning the overall objective.
Level 1 — Phase plan: stable, rarely changes. "Phase 1: gather requirements. Phase 2: scaffold codebase. Phase 3: implement. Phase 4: test."
Level 2 — Step plan: generated fresh at the start of each phase, using current agent state as context. Keeps detailed plans accurate without over-committing up front.
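The two levels can be wired together in a driver loop that generates each step plan lazily. A sketch with placeholder callables (`plan_steps` and `run_step` are assumptions standing in for LLM-backed planning and tool execution):

```python
def run_task(phases, plan_steps, run_step):
    """Execute a stable Level-1 phase plan, generating each Level-2
    step plan only when its phase begins, so it reflects the state
    accumulated by earlier phases.

    plan_steps(phase, state) -> list of steps for that phase;
    run_step(step, state) -> artifact produced by that step.
    """
    state = {"completed_phases": [], "artifacts": {}}
    for phase in phases:
        steps = plan_steps(phase, state)  # Level-2 plan, fresh per phase
        for step in steps:
            state["artifacts"][step] = run_step(step, state)
        state["completed_phases"].append(phase)
    return state
```

Because step plans are never generated ahead of time, a surprise in phase 2 cannot invalidate a detailed plan for phase 4 that does not exist yet.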
A replanning step should fire when:
- A required tool call fails
- An intermediate result invalidates downstream assumptions
- A step produces unexpectedly large or complex output
- A human checkpoint reveals a misunderstood requirement
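The four triggers above can be folded into one predicate the agent evaluates after every step. The event schema here is illustrative, not from any specific framework:

```python
def needs_replan(event: dict) -> bool:
    """Return True if a run-time event matches any replanning trigger:
    tool failure, invalidated assumption, oversized output, or a
    human-flagged requirement mismatch."""
    return (
        event.get("tool_failed", False)
        or event.get("assumption_invalidated", False)
        or event.get("output_tokens", 0) > event.get("output_budget", float("inf"))
        or event.get("human_flagged_requirement", False)
    )
```

When the predicate fires, the agent regenerates only the current phase's step plan; the Level-1 phase plan stays intact unless the objective itself changed.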
Insert mandatory human approval gates at high-stakes phase transitions: before writing to production, before sending external communications, before deleting data. Human checkpoints prevent error amplification across phases.
Analogy: a contractor builds the frame, then the homeowner inspects before drywall goes up — because fixing mistakes is cheap before walls are closed and expensive after.
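An approval gate can be enforced in code rather than left to the prompt. A minimal sketch, assuming a hypothetical action dict and an `approve` callable (e.g. a CLI prompt or ticketing hook):

```python
HIGH_STAKES = {"write_production", "send_external", "delete_data"}

def require_approval(action: dict, approve) -> dict:
    """Block high-stakes actions behind a mandatory human gate.

    approve(action) -> bool is the human decision; low-stakes actions
    pass through untouched. Raises instead of silently skipping so
    the rejection surfaces in the agent's trajectory.
    """
    if action["type"] in HIGH_STAKES and not approve(action):
        raise PermissionError(f"Human rejected: {action['type']}")
    return action
```

Enforcing the gate outside the model means a badly-planned phase cannot talk its way past it.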
Measuring Trajectories, Not Just Final Answers
Standard LLM evals measure a single response. Long-horizon evaluation must assess an entire trajectory — including whether intermediate steps were correct, whether the agent recovered from errors, and whether the final output is usable.
Apply the agent's generated patch to a real GitHub repository and run its test suite. The outcome is binary (tests pass / fail), so no human rater is required: fully automated, objective, and resistant to post-hoc rationalisation.
Pitfall: an agent that deletes the failing tests passes SWE-bench but is useless. Validate that changes are behaviourally correct, not just technically passing.
Binary pass/fail misses progress. Step-level scoring awards partial credit for completing phases correctly even if the final result fails — useful for diagnosing where agents break down.
Phase 1 — Reproduce bug:  ✓ 20 pts
Phase 2 — Locate cause:   ✓ 20 pts
Phase 3 — Implement fix:  ~ 10/20 pts
Phase 4 — Pass tests:     ✗ 0/40 pts
────────────────────────────
Total: 50 / 100
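Step-level scoring reduces to a weighted sum over phases. A sketch assuming a hypothetical rubric where each phase reports a completion fraction:

```python
def score_trajectory(results: dict, weights: dict) -> float:
    """Award partial credit per phase.

    results maps phase -> fraction completed (0.0 to 1.0);
    weights maps phase -> maximum points for that phase.
    Phases absent from results score zero.
    """
    return sum(weights[p] * results.get(p, 0.0) for p in weights)
```

Scoring each phase separately is what makes the diagnosis possible: a fleet of agents that all score 40–60 but fail at different phases needs different fixes than a fleet that uniformly dies at phase 4.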
| Trajectory metric | What it measures | Direction |
|---|---|---|
| Steps to completion | Efficiency of the path taken | Lower is better |
| Replanning rate | Quality of the initial plan | Lower is better |
| Recovery rate | Robustness to mid-task failures | Higher is better |
| Token cost per task | Economic viability | Lower is better |
| Human interrupts needed | Degree of autonomy achieved | Lower is better |
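The table's metrics can all be aggregated from a per-step event log. A sketch with an illustrative event schema (boolean flags plus a token count; not any framework's native format):

```python
def trajectory_metrics(events: list) -> dict:
    """Aggregate trajectory metrics from a list of step events.

    Each event is a dict that may carry: replanned, failed, recovered,
    interrupt (booleans) and tokens (int).
    """
    steps = len(events)
    replans = sum(e.get("replanned", False) for e in events)
    failures = [e for e in events if e.get("failed", False)]
    recovered = sum(e.get("recovered", False) for e in failures)
    return {
        "steps": steps,
        "replanning_rate": replans / steps if steps else 0.0,
        "recovery_rate": recovered / len(failures) if failures else 1.0,
        "tokens": sum(e.get("tokens", 0) for e in events),
        "human_interrupts": sum(e.get("interrupt", False) for e in events),
    }
```

Note the denominators differ: replanning rate is per step, while recovery rate is per failure, which is why a run with zero failures scores a perfect recovery rate by convention.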
Verified References
Every claim in this section is grounded in one of these sources. No content is generated from model training data alone.
| Source | Type | Covers | Recency |
|---|---|---|---|
| Jimenez et al. arXiv:2310.06770 | Peer-reviewed paper | SWE-bench — code agent benchmark on real GitHub issues | Oct 2023 |
| Anthropic — Building Effective Agents | Official guide | Patterns for long-running and multi-step agents | 2024 |
| Anthropic — Agentic AI docs | Official docs | State management, tool use, approval patterns | 2025 |
| Park et al. arXiv:2304.03442 | Peer-reviewed paper | Generative Agents — long-running three-tier memory architecture | Apr 2023 |
| Wang et al. arXiv:2305.04091 | Peer-reviewed paper | Plan-and-Solve — explicit planning before execution | May 2023 |
| Shinn et al. arXiv:2303.11366 | Peer-reviewed paper | Reflexion — verbal self-reflection and replanning | Mar 2023 |
| LangGraph docs | Official docs | Persistent state, checkpointing, resumable graphs | 2025 |