Long-Horizon Agents
Single-turn agents answer questions. Long-horizon agents complete projects — spanning dozens of steps, hours of clock time, and multiple sessions. This section covers the unique engineering challenges of state, context, planning, recovery, and evaluation that arise when agents must sustain coherent work across an extended timeline.
Multi-Step, Persistent, and Fault-Tolerant by Design
A long-horizon task cannot be answered in a single LLM call. Each step depends on the outputs of earlier steps — tool results, written files, API responses — forming a directed dependency graph, not a linear list. Long-horizon agents run for minutes to hours, not seconds, and must survive failures mid-run.
Step N depends on step N−1. No single prompt can shortcut a sequence that requires reading files, running tests, interpreting failures, and revising — each phase builds on the last.
Runs lasting minutes to hours expose concerns single-turn agents never face: API timeouts, session expiry, machine restarts, billing cap hits, and user disconnects mid-task.
The agent must track what it has done and what remains. Without explicit state management, a restarted agent either repeats completed work (costly) or skips it (silent partial completion).
SWE-bench (Jimenez et al., 2023) gives an agent a real GitHub issue and checks whether its generated patch passes the repository's test suite. Each task requires reading dozens of files, reproducing a bug, editing code, running tests, interpreting failures, and iterating — a canonical long-horizon evaluation.
1. Read the issue, understand the expected behaviour
2. Explore file structure, find relevant code
3. Reproduce the bug via tests
4. Hypothesise fix → edit source files
5. Run tests → observe failures → refine fix
6. Repeat steps 4–5 until tests pass → submit patch
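The iterate-until-green loop in steps 4–6 can be sketched as a bounded retry loop. This is a minimal illustration, not the SWE-bench harness; `propose_fix` and `run_tests` are hypothetical stand-ins for LLM and tool calls:

```python
def fix_until_green(propose_fix, run_tests, max_iters=5):
    """Iterate hypothesise-fix -> run-tests until the suite passes.

    propose_fix(failures) returns a candidate patch given the last
    observed failures (None on the first attempt); run_tests(patch)
    returns a (passed, failures) tuple. Both are placeholders for
    real LLM and tool calls.
    """
    failures = None
    for attempt in range(1, max_iters + 1):
        patch = propose_fix(failures)
        passed, failures = run_tests(patch)
        if passed:
            return patch, attempt
    return None, max_iters  # give up: surface failures to a replanning step
```

Bounding the loop matters: an agent that iterates forever on an unfixable bug burns the token budget without surfacing the failure.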
Durable State Outside the Context Window
The central challenge of long-horizon agents is durable state: persisting enough information that the agent can resume after a crash, timeout, or deliberate pause — without replaying all prior work. All mutable task state must live outside the context window, in a database, file system, or task queue.
"task_id": "proj-abc123",
"status": "in_progress", # pending | in_progress | done | failed
"plan": ["step1", "step2", ...],
"completed": ["step1"], # steps already done
"artifacts": {"step1": "/tmp/outline.md"},
"last_checkpoint": "2026-04-04T10:32:00Z"
}
Checkpoint only at milestone boundaries — natural stopping points where an artefact is complete. Checkpointing after every LLM call wastes I/O; checkpointing only at phase completion balances durability with overhead.
Analogy: a video game saves at the end of a level, not after every enemy killed. If you die mid-level you replay only that level.
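A milestone-boundary checkpoint can be as simple as an atomic JSON write. A minimal sketch, assuming the task-state dict shown above (function names are illustrative):

```python
import json
import os

def save_checkpoint(state: dict, path: str) -> None:
    """Persist task state at a milestone boundary.

    Write to a temp file first, then rename: os.replace is atomic,
    so a crash mid-write never leaves a corrupt checkpoint behind.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path: str):
    """Return the last saved state, or None for a fresh run."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```

On startup the agent calls `load_checkpoint`; a `None` result means start from the beginning, anything else means skip every step listed in `completed`.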
When an agent resumes from a checkpoint, it may replay the step that failed. Tool calls should be idempotent: calling "write file X with content Y" twice should produce the same result as calling it once — preventing duplicate emails or duplicate DB inserts on recovery.
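Idempotency for a file-write tool can be achieved by checking whether the desired state already holds before mutating. A sketch under that assumption (the function name and return convention are illustrative):

```python
import os

def write_file_idempotent(path: str, content: str) -> bool:
    """Write `content` to `path` only if it differs from what is there.

    Replaying this call after a crash-and-resume is a no-op, so the
    recovered agent cannot corrupt or duplicate the artifact.
    Returns True if a write happened, False if it was a safe replay.
    """
    if os.path.exists(path):
        with open(path, "rb") as f:
            if f.read() == content.encode():
                return False  # already applied: safe replay
    with open(path, "w") as f:
        f.write(content)
    return True
```

The same check-before-act pattern applies to DB inserts (use a unique key) and emails (record a sent-message ID before sending).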
LangGraph's SqliteSaver checkpointer automatically snapshots graph state after each node execution. Pass a thread_id to resume from the last checkpoint — no manual serialisation required.
from langgraph.checkpoint.sqlite import SqliteSaver

checkpointer = SqliteSaver.from_conn_string("tasks.db")
app = graph.compile(checkpointer=checkpointer)
# First run — starts the task
app.invoke({"task": "..."}, config={"configurable": {"thread_id": "run-1"}})
# Resume after crash — picks up from last checkpoint
app.invoke(None, config={"configurable": {"thread_id": "run-1"}})
Preventing Context Overflow Without Losing Progress
Every LLM has a finite context window. A long-horizon agent accumulates tool outputs, intermediate results, and prior reasoning until it hits this limit. Without active management, the agent either crashes or silently drops early context — losing knowledge of work it already completed.
When a phase completes, replace its detailed messages with a compact summary and archive the details externally. The active context always contains: (1) the original task + high-level plan, (2) summaries of completed phases, (3) full detail of the current phase only.
def summarise_phase(client, messages, phase_name):
    resp = client.messages.create(
        model=MODEL, max_tokens=512,
        system="Summarise key facts, decisions, and artefacts. Omit raw tool outputs.",
        messages=[*messages,
                  {"role": "user", "content": f"Summarise '{phase_name}' in ≤200 words."}],
    )
    return resp.content[0].text
Large outputs (file contents, API responses, test logs) should never live in the context window. Write to disk → store the path and a one-line description in context → retrieve on demand via a read-file tool.
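The write-to-disk-and-keep-a-pointer pattern can be sketched as follows; the stub format and function name are assumptions, not a specific framework's API:

```python
import os

def offload_output(step_id: str, data: str, description: str,
                   artifact_dir: str = "artifacts") -> dict:
    """Write a large tool output to disk and return the small stub
    that goes into the context window in its place.

    The agent later retrieves the full content on demand via a
    read-file tool, using the stored path.
    """
    os.makedirs(artifact_dir, exist_ok=True)
    path = os.path.join(artifact_dir, f"{step_id}.txt")
    with open(path, "w") as f:
        f.write(data)
    return {"path": path, "description": description, "bytes": len(data)}
```

A 200 KB test log thus costs the context window one short line ("path + description") instead of tens of thousands of tokens.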
Proactively track cumulative token usage. When approaching ~70% of the context window, trigger summarisation before the window fills — not after the API returns a context_length_exceeded error.
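Tracking the budget proactively takes only a running counter. A minimal sketch, where the 0.7 threshold matches the ~70% rule of thumb above:

```python
class TokenBudget:
    """Track cumulative token usage and signal when to compact context
    before the window fills, rather than reacting to an overflow error."""

    def __init__(self, context_window: int, threshold: float = 0.7):
        self.context_window = context_window
        self.threshold = threshold
        self.used = 0

    def add(self, tokens: int) -> None:
        """Record usage reported by the API after each call."""
        self.used += tokens

    def should_summarise(self) -> bool:
        return self.used >= self.threshold * self.context_window
```

After each LLM call the agent adds the reported usage and, when `should_summarise()` returns True, runs phase summarisation before issuing the next request.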
Planning at Two Levels — and Updating When Reality Diverges
A long task requires a plan at multiple granularities. When a step fails or new information arrives, the agent must replan — updating the execution plan without abandoning the overall objective.
Level 1 — Phase plan: stable, rarely changes. "Phase 1: gather requirements. Phase 2: scaffold codebase. Phase 3: implement. Phase 4: test."
Level 2 — Step plan: generated fresh at the start of each phase, using current agent state as context. Keeps detailed plans accurate without over-committing up front.
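The two levels can be wired together in a driver loop that generates each step plan lazily. A sketch with placeholder callables (`plan_steps` and `run_step` are assumptions standing in for LLM-backed planning and tool execution):

```python
def run_task(phases, plan_steps, run_step):
    """Execute a stable Level-1 phase plan, generating each Level-2
    step plan only when its phase begins, so it reflects the state
    accumulated by earlier phases.

    plan_steps(phase, state) -> list of steps for that phase;
    run_step(step, state) -> artifact produced by that step.
    """
    state = {"completed_phases": [], "artifacts": {}}
    for phase in phases:
        steps = plan_steps(phase, state)  # Level-2 plan, fresh per phase
        for step in steps:
            state["artifacts"][step] = run_step(step, state)
        state["completed_phases"].append(phase)
    return state
```

Because step plans are never generated ahead of time, a surprise in phase 2 cannot invalidate a detailed plan for phase 4 that does not exist yet.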
A replanning step should fire when:
- A required tool call fails
- An intermediate result invalidates downstream assumptions
- A step produces unexpectedly large or complex output
- A human checkpoint reveals a misunderstood requirement
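The four triggers above can be folded into one predicate the agent evaluates after every step. The event schema here is illustrative, not from any specific framework:

```python
def needs_replan(event: dict) -> bool:
    """Return True if a run-time event matches any replanning trigger:
    tool failure, invalidated assumption, oversized output, or a
    human-flagged requirement mismatch."""
    return (
        event.get("tool_failed", False)
        or event.get("assumption_invalidated", False)
        or event.get("output_tokens", 0) > event.get("output_budget", float("inf"))
        or event.get("human_flagged_requirement", False)
    )
```

When the predicate fires, the agent regenerates only the current phase's step plan; the Level-1 phase plan stays intact unless the objective itself changed.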
Insert mandatory human approval gates at high-stakes phase transitions: before writing to production, before sending external communications, before deleting data. Human checkpoints prevent error amplification across phases.
Analogy: a contractor builds the frame, then the homeowner inspects before drywall goes up — because fixing mistakes is cheap before walls are closed and expensive after.
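An approval gate can be enforced in code rather than left to the prompt. A minimal sketch, assuming a hypothetical action dict and an `approve` callable (e.g. a CLI prompt or ticketing hook):

```python
HIGH_STAKES = {"write_production", "send_external", "delete_data"}

def require_approval(action: dict, approve) -> dict:
    """Block high-stakes actions behind a mandatory human gate.

    approve(action) -> bool is the human decision; low-stakes actions
    pass through untouched. Raises instead of silently skipping so
    the rejection surfaces in the agent's trajectory.
    """
    if action["type"] in HIGH_STAKES and not approve(action):
        raise PermissionError(f"Human rejected: {action['type']}")
    return action
```

Enforcing the gate outside the model means a badly-planned phase cannot talk its way past it.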
Measuring Trajectories, Not Just Final Answers
Standard LLM evals measure a single response. Long-horizon evaluation must assess an entire trajectory — including whether intermediate steps were correct, whether the agent recovered from errors, and whether the final output is usable.
Apply the agent's generated patch to a real GitHub repository and run its test suite. The outcome is binary (tests pass / fail), so no human rater is required: fully automated, objective, and resistant to post-hoc rationalisation.
Pitfall: an agent that deletes the failing tests passes SWE-bench but is useless. Validate that changes are behaviourally correct, not just technically passing.
Binary pass/fail misses progress. Step-level scoring awards partial credit for completing phases correctly even if the final result fails — useful for diagnosing where agents break down.
Phase 1 — Reproduce bug:  ✓ 20 pts
Phase 2 — Locate cause:   ✓ 20 pts
Phase 3 — Implement fix:  ~ 10/20 pts
Phase 4 — Pass tests:     ✗ 0/40 pts
────────────────────────────
Total: 50 / 100
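Step-level scoring reduces to a weighted sum over phases. A sketch assuming a hypothetical rubric where each phase reports a completion fraction:

```python
def score_trajectory(results: dict, weights: dict) -> float:
    """Award partial credit per phase.

    results maps phase -> fraction completed (0.0 to 1.0);
    weights maps phase -> maximum points for that phase.
    Phases absent from results score zero.
    """
    return sum(weights[p] * results.get(p, 0.0) for p in weights)
```

Scoring each phase separately is what makes the diagnosis possible: a fleet of agents that all score 40–60 but fail at different phases needs different fixes than a fleet that uniformly dies at phase 4.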
| Trajectory metric | What it measures | Direction |
|---|---|---|
| Steps to completion | Efficiency of the path taken | Lower is better |
| Replanning rate | Quality of the initial plan | Lower is better |
| Recovery rate | Robustness to mid-task failures | Higher is better |
| Token cost per task | Economic viability | Lower is better |
| Human interrupts needed | Degree of autonomy achieved | Lower is better |
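The table's metrics can all be aggregated from a per-step event log. A sketch with an illustrative event schema (boolean flags plus a token count; not any framework's native format):

```python
def trajectory_metrics(events: list) -> dict:
    """Aggregate trajectory metrics from a list of step events.

    Each event is a dict that may carry: replanned, failed, recovered,
    interrupt (booleans) and tokens (int).
    """
    steps = len(events)
    replans = sum(e.get("replanned", False) for e in events)
    failures = [e for e in events if e.get("failed", False)]
    recovered = sum(e.get("recovered", False) for e in failures)
    return {
        "steps": steps,
        "replanning_rate": replans / steps if steps else 0.0,
        "recovery_rate": recovered / len(failures) if failures else 1.0,
        "tokens": sum(e.get("tokens", 0) for e in events),
        "human_interrupts": sum(e.get("interrupt", False) for e in events),
    }
```

Note the denominators differ: replanning rate is per step, while recovery rate is per failure, which is why a run with zero failures scores a perfect recovery rate by convention.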
Verified References
Every claim in this section is grounded in one of these sources. No content is generated from model training data alone.
| Source | Type | Covers | Recency |
|---|---|---|---|
| Jimenez et al. arXiv:2310.06770 | Peer-reviewed paper | SWE-bench — code agent benchmark on real GitHub issues | Oct 2023 |
| Anthropic — Building Effective Agents | Official guide | Patterns for long-running and multi-step agents | 2024 |
| Anthropic — Agentic AI docs | Official docs | State management, tool use, approval patterns | 2025 |
| Park et al. arXiv:2304.03442 | Peer-reviewed paper | Generative Agents — long-running three-tier memory architecture | Apr 2023 |
| Wang et al. arXiv:2305.04091 | Peer-reviewed paper | Plan-and-Solve — explicit planning before execution | May 2023 |
| Shinn et al. arXiv:2303.11366 | Peer-reviewed paper | Reflexion — verbal self-reflection and replanning | Mar 2023 |
| LangGraph docs | Official docs | Persistent state, checkpointing, resumable graphs | 2025 |