Course  /  16 · Long-Horizon Agents

Long-Horizon Agents

Single-turn agents answer questions. Long-horizon agents complete projects — spanning dozens of steps, hours of clock time, and multiple sessions. This section covers the unique engineering challenges of state, context, planning, recovery, and evaluation that arise when agents must sustain coherent work across an extended timeline.

01 · WHAT IS A LONG-HORIZON TASK?

Multi-Step, Persistent, and Fault-Tolerant by Design

A long-horizon task cannot be answered in a single LLM call. Each step depends on the outputs of earlier steps — tool results, written files, API responses — forming a directed dependency graph, not a linear list. Long-horizon agents run for minutes to hours, not seconds, and must survive failures mid-run.

WIDELY USED (2024–2026)
Multi-Step Dependency

Step N depends on step N−1. No single prompt can shortcut a sequence that requires reading files, running tests, interpreting failures, and revising — each phase builds on the last.

WIDELY USED (2024–2026)
Extended Duration

Runs lasting minutes to hours expose concerns single-turn agents never face: API timeouts, session expiry, machine restarts, billing cap hits, and user disconnects mid-task.

WIDELY USED (2024–2026)
Persistent State

The agent must track what it has done and what remains. Without explicit state management, a restarted agent either repeats completed work (costly) or skips it (silent partial completion).

EMERGING PATTERN (2025–2026) — SWE-BENCH: THE CANONICAL LONG-HORIZON BENCHMARK

SWE-bench (Jimenez et al., 2023) gives an agent a real GitHub issue and checks whether its generated patch passes the repository's test suite. Each task requires reading dozens of files, reproducing a bug, editing code, running tests, interpreting failures, and iterating — a canonical long-horizon evaluation.

1. Clone repo, read issue description
2. Explore file structure, find relevant code
3. Reproduce the bug via tests
4. Hypothesise fix → edit source files
5. Run tests → observe failures → refine fix
6. Repeat steps 4–5 until tests pass → submit patch
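Steps 4–6 above form a fix-and-test loop. A minimal sketch of that loop, written generically so the LLM call and test runner are passed in as callables (the helper names and the feedback format are illustrative, not part of SWE-bench itself):

```python
from typing import Callable

def fix_until_green(run_tests: Callable[[], bool],
                    propose_and_apply_fix: Callable[[str], None],
                    issue: str,
                    max_iterations: int = 10) -> bool:
    """Steps 4-6 as a loop: hypothesise a fix, run tests, refine on failure."""
    feedback = issue
    for attempt in range(max_iterations):
        propose_and_apply_fix(feedback)      # step 4: LLM edits source files
        if run_tests():                      # step 5: run the test suite
            return True                      # step 6: tests pass, submit patch
        # Feed the failure back so the next hypothesis can improve.
        feedback = f"{issue}\nAttempt {attempt + 1} failed; revise the fix."
    return False                             # budget exhausted without passing
```

The iteration cap matters in practice: without it, a bad hypothesis can burn tokens indefinitely.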
What changes at scale: A 5-step ReAct loop is a single-turn agent. A 50-step coding pipeline that writes files, runs a test suite, interprets failures, and iterates is a long-horizon agent. The LLM calls are the same — the engineering around them is fundamentally different.
02 · STATE MANAGEMENT & CHECKPOINTING

Durable State Outside the Context Window

The central challenge of long-horizon agents is durable state: persisting enough information that the agent can resume after a crash, timeout, or deliberate pause — without replaying all prior work. All mutable task state must live outside the context window, in a database, file system, or task queue.

MINIMAL DURABLE STATE SCHEMA
task_state = {
  "task_id": "proj-abc123",
  "status": "in_progress", # pending | in_progress | done | failed
  "plan": ["step1", "step2", ...],
  "completed": ["step1"], # steps already done
  "artifacts": {"step1": "/tmp/outline.md"},
  "last_checkpoint": "2026-04-04T10:32:00Z"
}
WIDELY USED (2024–2026) — MILESTONE CHECKPOINTING

Checkpoint only at milestone boundaries — natural stopping points where an artefact is complete. Checkpointing after every LLM call wastes I/O; checkpointing only at phase completion balances durability with overhead.

Analogy: a video game saves at the end of a level, not after every enemy killed. If you die mid-level you replay only that level.
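A milestone checkpoint can be as simple as writing the task-state dict from the schema above to disk whenever a step's artefact is complete (the JSON file name is illustrative):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def checkpoint(task_state: dict, step: str, artifact_path: str,
               state_file: str = "task_state.json") -> None:
    """Persist state only at a milestone boundary: a step's artefact is done."""
    task_state["completed"].append(step)
    task_state["artifacts"][step] = artifact_path
    task_state["last_checkpoint"] = datetime.now(timezone.utc).isoformat()
    Path(state_file).write_text(json.dumps(task_state, indent=2))

def resume(state_file: str = "task_state.json") -> list:
    """On restart, return only the steps that still need doing."""
    state = json.loads(Path(state_file).read_text())
    return [s for s in state["plan"] if s not in state["completed"]]
```

On recovery, the agent replays at most the one phase that was in flight, never the whole task.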

EMERGING PATTERN (2025–2026) — IDEMPOTENT TOOL CALLS

When an agent resumes from a checkpoint, it may replay the step that failed. Tool calls should be idempotent: calling "write file X with content Y" twice should produce the same result as calling it once — preventing duplicate emails or duplicate DB inserts on recovery.
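One way to get idempotency for side-effecting tools is to derive a deterministic key from the call's arguments and skip the action if that key has already been executed. A sketch (in production the seen-keys set would live in durable storage, not process memory):

```python
import hashlib
import json

_executed: set[str] = set()  # in production: a durable table, not memory

def idempotent_call(tool_name: str, args: dict, action) -> str:
    """Run `action` at most once per unique (tool, args) pair.

    Replaying the same call after a crash becomes a no-op instead of a
    duplicate email or duplicate DB insert.
    """
    key = hashlib.sha256(
        json.dumps({"tool": tool_name, "args": args}, sort_keys=True).encode()
    ).hexdigest()
    if key in _executed:
        return "skipped (already executed)"
    result = action(**args)
    _executed.add(key)  # record only after the action succeeds
    return result
```

Note the ordering: the key is recorded only after the action succeeds, so a crash mid-action still allows a retry.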

WIDELY USED (2024–2026) — LANGGRAPH PERSISTENT STATE

LangGraph's SqliteSaver checkpointer automatically snapshots graph state after each node execution. Pass a thread_id to resume from the last checkpoint — no manual serialisation required.

from langgraph.checkpoint.sqlite import SqliteSaver

checkpointer = SqliteSaver.from_conn_string("tasks.db")
app = graph.compile(checkpointer=checkpointer)

# First run — starts the task
app.invoke({"task": "..."}, config={"configurable": {"thread_id": "run-1"}})

# Resume after crash — picks up from last checkpoint
app.invoke(None, config={"configurable": {"thread_id": "run-1"}})
03 · CONTEXT MANAGEMENT OVER LONG TASKS

Preventing Context Overflow Without Losing Progress

Every LLM has a finite context window. A long-horizon agent accumulates tool outputs, intermediate results, and prior reasoning until it hits this limit. Without active management, the agent either crashes or silently drops early context — losing knowledge of work it already completed.

WIDELY USED (2024–2026) — HIERARCHICAL SUMMARISATION

When a phase completes, replace its detailed messages with a compact summary and archive the details externally. The active context always contains: (1) the original task + high-level plan, (2) summaries of completed phases, (3) full detail of the current phase only.

def summarize_phase(client, messages, phase_name):
  resp = client.messages.create(
    model=MODEL, max_tokens=512,
    system="Summarise key facts, decisions, and artefacts. Omit raw tool outputs.",
    messages=[*messages,
      {"role":"user","content":f"Summarise '{phase_name}' in ≤200 words."}]
  )
  return resp.content[0].text
WIDELY USED (2024–2026) — EXTERNAL ARTEFACT MEMORY

Large outputs (file contents, API responses, test logs) should never live in the context window. Write to disk → store the path and a one-line description in context → retrieve on demand via a read-file tool.
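A minimal version of this pattern (the directory location and the one-line description format are illustrative):

```python
from pathlib import Path

ARTIFACT_DIR = Path("artifacts")  # illustrative location

def store_artifact(name: str, content: str, description: str) -> str:
    """Write a large output to disk; return the short line that goes in context."""
    ARTIFACT_DIR.mkdir(exist_ok=True)
    path = ARTIFACT_DIR / name
    path.write_text(content)
    # Only this one-liner enters the context window, not the content itself.
    return f"[artifact] {path}: {description} ({len(content)} chars)"

def read_artifact(path: str) -> str:
    """Tool the agent calls when it actually needs the full content."""
    return Path(path).read_text()
```

A 50 KB test log becomes one line of context; the agent pays the token cost only if it later decides the detail matters.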

EMERGING PATTERN (2025–2026) — TOKEN BUDGET TRACKING

Proactively track cumulative token usage. When approaching ~70% of the context window, trigger summarisation before the window fills — not after the API returns a context_length_exceeded error.
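A rough sketch of proactive budget tracking, using a crude characters-per-token estimate (a real implementation would use the provider's tokenizer, and the window size is illustrative):

```python
CONTEXT_WINDOW = 200_000   # model's context window, in tokens (illustrative)
SUMMARIZE_AT = 0.70        # trigger compaction at ~70% usage

def estimate_tokens(messages: list[dict]) -> int:
    """Crude estimate: roughly 4 characters per token in English text."""
    return sum(len(m["content"]) for m in messages) // 4

def needs_compaction(messages: list[dict]) -> bool:
    """True when the conversation should be summarised *before* it overflows."""
    return estimate_tokens(messages) >= CONTEXT_WINDOW * SUMMARIZE_AT
```

The control loop calls `needs_compaction` after every step and runs hierarchical summarisation when it fires, rather than waiting for the API to reject a request.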

Pitfall — Silent context truncation: Some providers silently drop messages when the context window fills rather than returning an error. Always monitor token usage explicitly and test with long runs during development.
04 · HIERARCHICAL PLANNING & DYNAMIC REPLANNING

Planning at Two Levels — and Updating When Reality Diverges

A long task requires a plan at multiple granularities. When a step fails or new information arrives, the agent must replan — updating the execution plan without abandoning the overall objective.

WIDELY USED (2024–2026) — TWO-LEVEL PLAN

Level 1 — Phase plan: stable, rarely changes. "Phase 1: gather requirements. Phase 2: scaffold codebase. Phase 3: implement. Phase 4: test."

Level 2 — Step plan: generated fresh at the start of each phase, using current agent state as context. Keeps detailed plans accurate without over-committing up front.
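The two levels can be represented directly in durable state, with step plans filled in lazily when each phase begins (the structure below is a sketch, not a prescribed schema):

```python
plan = {
    "phases": [  # Level 1: stable, rarely changes
        "gather requirements",
        "scaffold codebase",
        "implement",
        "test",
    ],
    "current_phase": 2,
    # Level 2: generated fresh when a phase starts; earlier entries are history
    "step_plans": {
        0: ["interview stakeholders", "write spec"],
        1: ["create repo layout", "add build config"],
        2: None,  # not yet generated; will use current agent state as context
    },
}

def start_phase(plan: dict, generate_steps) -> list:
    """Generate the detailed step plan for the current phase on demand."""
    i = plan["current_phase"]
    if plan["step_plans"].get(i) is None:
        plan["step_plans"][i] = generate_steps(plan["phases"][i])
    return plan["step_plans"][i]
```

Deferring Level 2 means each step plan is written with everything the agent has learned so far, instead of speculation made before the task started.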

EMERGING PATTERN (2025–2026) — REPLANNING TRIGGERS

A replanning step should fire when:

  • A required tool call fails
  • An intermediate result invalidates downstream assumptions
  • A step produces unexpectedly large or complex output
  • A human checkpoint reveals a misunderstood requirement
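These triggers can be folded into a single predicate the control loop checks after every step (the field names and the output-size threshold are illustrative):

```python
def should_replan(step_result: dict) -> bool:
    """Return True if any replanning trigger fired for the last step."""
    return any([
        step_result.get("tool_error", False),             # required tool call failed
        step_result.get("invalidates_assumptions", False),# downstream plan now stale
        step_result.get("output_tokens", 0) > 8_000,      # unexpectedly large output
        step_result.get("human_flagged", False),          # checkpoint found a problem
    ])
```

When the predicate fires, the agent regenerates the current phase's step plan (Level 2) while leaving the phase plan (Level 1) intact.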
WIDELY USED (2024–2026) — HUMAN-IN-THE-LOOP CHECKPOINTS

Insert mandatory human approval gates at high-stakes phase transitions: before writing to production, before sending external communications, before deleting data. Human checkpoints prevent error amplification across phases.

Analogy: a contractor builds the frame, then the homeowner inspects before drywall goes up — because fixing mistakes is cheap before walls are closed and expensive after.
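A minimal approval gate looks like the sketch below; the `approve` callback stands in for whatever UI your application uses to ask the human, and the action names are illustrative:

```python
HIGH_STAKES = {"deploy_to_production", "send_external_email", "delete_data"}

def execute_with_gate(action_name: str, action, approve) -> str:
    """Run `action`, but require explicit human approval for high-stakes actions."""
    if action_name in HIGH_STAKES and not approve(action_name):
        return f"blocked: human rejected '{action_name}'"
    return action()
```

Low-stakes actions pass through untouched, so the gate adds friction only where a mistake would be expensive to unwind.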

05 · EVALUATING LONG-HORIZON AGENTS

Measuring Trajectories, Not Just Final Answers

Standard LLM evals measure a single response. Long-horizon evaluation must assess an entire trajectory — including whether intermediate steps were correct, whether the agent recovered from errors, and whether the final output is usable.

EMERGING PATTERN (2025–2026) — SWE-BENCH OUTCOME EVAL

The agent's generated patch is applied to a real GitHub repository and the repository's test suite is run. The outcome is binary (tests pass / fail), so no human rater is required: fully automated, objective, and resistant to post-hoc rationalisation.

Pitfall: an agent that deletes the failing tests passes SWE-bench but is useless. Validate that changes are behaviourally correct, not just technically passing.

EMERGING PATTERN (2025–2026) — STEP-LEVEL PARTIAL CREDIT

Binary pass/fail misses progress. Step-level scoring awards partial credit for completing phases correctly even if the final result fails — useful for diagnosing where agents break down.

Phase 1 — Reproduce bug: ✓ 20 pts
Phase 2 — Locate cause: ✓ 20 pts
Phase 3 — Implement fix: ~ 10/20 pts
Phase 4 — Pass tests: ✗ 0/40 pts
────────────────────────────
Total: 50 / 100
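The rubric above is a weighted sum over phase outcomes, so a grader can compute it directly (weights and phase names mirror the example; the fractional scores per phase are this sketch's convention):

```python
def score_trajectory(phase_results: dict[str, float],
                     weights: dict[str, int]) -> int:
    """Award partial credit per phase: each result is a fraction in [0, 1]."""
    return round(sum(weights[p] * phase_results.get(p, 0.0) for p in weights))

# Weights mirroring the example rubric above.
WEIGHTS = {"reproduce_bug": 20, "locate_cause": 20,
           "implement_fix": 20, "pass_tests": 40}
```

Scoring each phase separately shows *where* the agent broke down, which a single pass/fail bit cannot.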
Trajectory metric         | What it measures                 | Direction
--------------------------|----------------------------------|-----------------
Steps to completion       | Efficiency of the path taken     | Lower is better
Replanning rate           | Quality of the initial plan      | Lower is better
Recovery rate             | Robustness to mid-task failures  | Higher is better
Token cost per task       | Economic viability               | Lower is better
Human interrupts needed   | Degree of autonomy achieved      | Lower is better
Current state of the field (2026): Long-horizon agents remain an active research frontier. Performance on SWE-bench has improved sharply since 2024, but tasks requiring >100 interdependent steps, multi-day persistence, or complex human collaboration are still not reliably solved. The techniques in this section represent best current practice — not a solved problem.
SOURCES USED IN THIS SECTION

Verified References

Every claim in this section is grounded in one of these sources. No content is generated from model training data alone.

Source                               | Type                | Covers                                                          | Recency
-------------------------------------|---------------------|-----------------------------------------------------------------|---------
Jimenez et al., arXiv:2310.06770     | Peer-reviewed paper | SWE-bench: code agent benchmark on real GitHub issues           | Oct 2023
Anthropic, Building Effective Agents | Official guide      | Patterns for long-running and multi-step agents                 | 2024
Anthropic, Agentic AI docs           | Official docs       | State management, tool use, approval patterns                   | 2025
Park et al., arXiv:2304.03442        | Peer-reviewed paper | Generative Agents: long-running three-tier memory architecture  | Apr 2023
Wang et al., arXiv:2305.04091        | Peer-reviewed paper | Plan-and-Solve: explicit planning before execution              | May 2023
Shinn et al., arXiv:2303.11366       | Peer-reviewed paper | Reflexion: verbal self-reflection and replanning                | Mar 2023
LangGraph docs                       | Official docs       | Persistent state, checkpointing, resumable graphs               | 2025
KNOWLEDGE CHECK

Section 16 Quiz

8 questions covering all theory blocks. Select one answer per question, then submit.

Section 16 — Long-Horizon Agents
8 QUESTIONS · MULTIPLE CHOICE · UNLIMITED RETRIES
Question 1 of 8
Which characteristic most distinguishes a long-horizon agent task from a standard single-turn task?
Question 2 of 8
When designing a checkpointing strategy for a long-horizon agent, which approach best balances durability with I/O overhead?
Question 3 of 8
An agent is 40 steps into a 70-step task and approaching its context window limit. What is the most effective mitigation?
Question 4 of 8
In a hierarchical plan-and-execute system, what should ideally trigger a replanning step?
Question 5 of 8
SWE-bench evaluates agents by:
Question 6 of 8
Which of the following best describes the correct placement of a human-in-the-loop approval checkpoint in a long-horizon workflow?
Question 7 of 8
Which failure mode is most specific to long-horizon agents and rarely occurs in single-turn agents?
Question 8 of 8
Which combination of techniques most directly addresses both context overflow and mid-task failure recovery in long-horizon agents?

