Decision-Making & Planning
Knowing how to call a tool is not enough. Real-world tasks require breaking a goal into ordered steps, deciding which step to take next given intermediate results, and recovering when a step fails. This section covers the planning techniques that separate toy demos from production agents: chain-of-thought decomposition, plan-and-execute architecture, Tree of Thoughts search, and principled replanning when things go wrong.
How LLMs Decompose Goals
An LLM's base capability is completing a prompt. Planning is the structured use of that capability to break a complex goal into a sequence of smaller, achievable subgoals. The key insight from the Chain-of-Thought paper (Wei et al., 2022) is that prompting the model to show its reasoning step-by-step — rather than jumping straight to an answer — dramatically improves performance on multi-step tasks.
For agents, Chain-of-Thought is the mechanism behind every planning step. When you ask an agent to "research competitor pricing and write a summary report," it must decompose that into: identify competitors → search each → extract prices → compare → draft report. If the model tries to do all of this in one shot, it hallucinates. If it reasons through each step explicitly, each step becomes a verifiable, correctable unit.
Without decomposition:
Agent: [one giant API call, guesses competitor names, fabricates prices, returns confident-sounding hallucination]

With decomposition:
Step 1: Identify competitors
Step 2: Search "[Competitor A] pricing page"
Step 3: Search "[Competitor B] pricing page"
Step 4: Compare extracted prices
Step 5: Write summary from verified data
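The decomposition above is triggered by how the model is prompted. A minimal sketch of such a prompt builder follows; the wording and step format are illustrative assumptions, not a fixed API:

```python
def decomposition_prompt(goal: str) -> str:
    """Build a prompt that asks the model to plan before acting."""
    # Hypothetical prompt text: the point is to force explicit step listing
    # before any action, so each step becomes a verifiable unit.
    return (
        f"Goal: {goal}\n\n"
        "Before taking any action, list the steps needed to achieve this goal, "
        "one per line, numbered. Then execute them one at a time, "
        "checking each intermediate result before moving on."
    )

print(decomposition_prompt("research competitor pricing and write a summary report"))
```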
Separate Planning from Execution
In a standard ReAct loop, the agent plans and acts interleaved — each iteration decides the next step based on the last observation. This works well for tasks where the path is truly unknown until runtime. But for tasks that can be decomposed upfront, a cleaner architecture is plan-and-execute: one LLM call to produce the full step list, then a separate executor loop that works through that list.
The Plan-and-Solve paper (Wang et al., 2023) showed that explicitly prompting the model to first devise a plan before solving improves accuracy on reasoning benchmarks. The architectural benefit for agents is even bigger: a pre-generated plan can be inspected, logged, and modified before execution begins — enabling human-in-the-loop approval of the plan before the agent takes any irreversible actions.
✓ Enables human-in-the-loop plan approval
✓ Independent steps can run in parallel
✓ Cleaner separation of concerns
✓ Easier to log, replay, and debug
✗ Requires a replanning mechanism for failures
✗ Extra LLM call upfront increases cost and latency
✗ Not suitable for highly dynamic, open-ended tasks
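The planner/executor split can be sketched in a few lines. Both functions below are stubs standing in for LLM calls (the full tool-calling version is built later in this section); the fixed plan and "done:" summaries are illustrative:

```python
def plan(goal: str) -> list[str]:
    # Stub planner: a real implementation would make one LLM call
    # returning a structured step list.
    return [f"search: {goal}", "extract prices", "compare", "draft report"]

def execute(step: str) -> str:
    # Stub executor: a real implementation would run a tool-calling loop per step.
    return f"done: {step}"

def run(goal: str) -> list[str]:
    steps = plan(goal)  # one upfront planning call, inspectable before execution
    return [execute(s) for s in steps]  # separate executor works through the list

print(run("competitor pricing"))
```

Because `plan` returns before `execute` runs, the step list can be logged, diffed, or shown to a human for approval in between.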
Search Over Reasoning Paths
Chain-of-Thought generates a single reasoning path. Tree of Thoughts (ToT) (Yao et al., 2023) generalises this: instead of committing to one reasoning chain, the model explores multiple candidate next steps at each decision point, evaluates them, and pursues the most promising branch — like a search tree over the space of possible reasoning paths.
ToT is overkill for most agent tasks, but it is the right tool for problems where the solution space is large, where greedy step-by-step reasoning fails, and where you have a way to evaluate intermediate states. Classic examples: creative writing with constraints, multi-step math, and strategic planning with competing options.
Chain-of-Thought (single committed path):
└─ Step A
   └─ Step B
      └─ Step C → Answer

Tree of Thoughts (branching search with evaluation):
├─ Path A1 [score: 0.4 ✗]
├─ Path A2 [score: 0.7]
│  ├─ B1 [score: 0.3 ✗]
│  └─ B2 [score: 0.9 → Answer]
└─ Path A3 [score: 0.2 ✗]
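The propose-score-prune loop behind a tree search like this can be sketched as a toy beam search. Here the candidate generator and scorer are deterministic stand-ins for LLM calls (the real ToT paper uses the model for both), and the letter-building task is purely illustrative:

```python
TARGET = "plan"

def candidates(path: str) -> list[str]:
    # Stand-in for "propose next thoughts": try appending each letter.
    return [path + c for c in "aplnx"]

def score(path: str) -> float:
    # Stand-in for "evaluate this partial path": fraction of TARGET matched.
    return sum(a == b for a, b in zip(path, TARGET)) / len(TARGET)

def tot_search(beam_width: int = 2, depth: int = 4) -> str:
    frontier = [""]
    for _ in range(depth):
        expanded = [c for path in frontier for c in candidates(path)]
        expanded.sort(key=score, reverse=True)  # evaluate and rank branches
        frontier = expanded[:beam_width]        # prune to the most promising
    return frontier[0]

print(tot_search())  # → plan
```

Greedy single-path reasoning can dead-end on the first wrong letter; keeping a beam of scored branches lets the search recover, which is exactly the failure mode ToT addresses.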
When to Act, When to Ask, When to Stop
Production agents constantly face ambiguous situations: the user's goal is underspecified, a tool returns unexpected data, or two valid next steps have unknown consequences. Anthropic's guidance is explicit: agents should prefer cautious actions, accept a worse expected outcome in exchange for lower variance, and err on the side of doing less and confirming with users when uncertain about intended scope. The key principle is minimal footprint.
| ACTION | REVERSIBLE? | POLICY |
|---|---|---|
| Read file / search web | Yes | Execute freely |
| Write to a draft / temp file | Yes | Execute freely |
| Overwrite existing file | Partially | Create backup first |
| Send email / Slack message | No | Require explicit human approval |
| Delete records from DB | No | Require explicit human approval |
| Deploy to production | No | Require explicit human approval |
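A policy table like the one above can be enforced with a small gate in front of tool execution. This is a sketch; the action names and policy strings are illustrative and should be mapped to your own tool inventory:

```python
POLICY = {
    "read_file":      "execute",
    "web_search":     "execute",
    "write_draft":    "execute",
    "overwrite_file": "backup_first",
    "send_email":     "require_approval",
    "delete_records": "require_approval",
    "deploy":         "require_approval",
}

def gate(action: str) -> str:
    # Unknown actions default to the most cautious policy (minimal footprint).
    return POLICY.get(action, "require_approval")

print(gate("web_search"))   # → execute
print(gate("send_email"))   # → require_approval
print(gate("new_tool_x"))   # → require_approval (unknown, so cautious default)
```

The cautious default matters more than the table entries: any tool added later is held for approval until someone classifies its reversibility.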
Plans Break — Agents Must Adapt
A plan generated upfront is a best-guess, not a contract. Tool calls fail, APIs return unexpected data, and intermediate results invalidate later steps. A robust agent must detect these situations and replan rather than crashing or blindly retrying the same failing step.
Retrying repeats the same action hoping for a different result. Replanning calls the LLM again with the current state — what has been done, what failed, what remains — and asks for a revised step list. The new plan incorporates the failure as information.
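The retry/replan distinction can be shown in miniature. The `flaky_fetch` endpoint and the hard-coded revision below are hypothetical stand-ins; a real replanner would be an LLM call with the failure in its context:

```python
def flaky_fetch(attempt: int) -> str:
    raise ConnectionError("API timeout")  # this endpoint is simply down

def naive_retry(tries: int = 3) -> str:
    for i in range(tries):  # same action, same result, every time
        try:
            return flaky_fetch(i)
        except ConnectionError:
            pass
    return "gave up"

def replan(failure: str) -> str:
    # Stand-in for an LLM replanning call: the failure message is treated
    # as information and the plan routes around the broken dependency.
    return "web_search fallback" if "timeout" in failure else "retry"

print(naive_retry())                            # → gave up
print(replan("ConnectionError: API timeout"))   # → web_search fallback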
Verified References
Every claim in this section is grounded in one of these sources.
| Source | Type | Covers | Recency |
|---|---|---|---|
| Wei et al. — Chain-of-Thought Prompting | Academic paper | Chain-of-thought reasoning, step-by-step decomposition, emergent planning | 2022 |
| Wang et al. — Plan-and-Solve | Academic paper | Plan-and-execute architecture, planner/executor split, reasoning benchmarks | 2023 |
| Yao et al. — Tree of Thoughts | Academic paper | ToT framework, branching search over reasoning paths, evaluation heuristics | 2023 |
| Shinn et al. — Reflexion | Academic paper | Failure recovery, verbal reinforcement, replanning from failure | 2023 |
| Anthropic — Tool Use & Agentic Guidance | Official docs | Minimal footprint, irreversible actions, human-in-the-loop checkpoints | Maintained 2024–2026 |
| Lilian Weng — LLM Powered Autonomous Agents | Blog / Survey | Planning survey, task decomposition, decision-making under uncertainty | June 2023 |
Build a Plan-and-Execute Agent with Replanning
You will build a plan-and-execute agent that first generates a structured step list, then works through each step with a tool-calling executor. When a step fails, the agent replans — calling the LLM again with current state to generate a revised plan. This is the pattern behind production research and automation agents.
Use the same agent-lab environment from previous sections:
cd agent-lab && source .venv/bin/activate
touch plan_execute_agent.py
Three tools: simulated web search, a calculator, and a flaky_api that fails 50% of the time — so you can watch replanning trigger reliably.
import os, json, random

import anthropic
from dotenv import load_dotenv

load_dotenv()
client = anthropic.Anthropic()

TOOLS = [
    {
        "name": "web_search",
        "description": "Search the web. Returns a short text result.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "calculator",
        "description": "Evaluate a Python arithmetic expression. Returns the result as a string.",
        "input_schema": {
            "type": "object",
            "properties": {
                "expression": {
                    "type": "string",
                    "description": "A safe arithmetic expression, e.g. '365 * 24 * 60 * 60'",
                }
            },
            "required": ["expression"],
        },
    },
    {
        "name": "flaky_api",
        "description": "Fetch data from an external API. Sometimes fails — use web_search as a fallback.",
        "input_schema": {
            "type": "object",
            "properties": {"endpoint": {"type": "string"}},
            "required": ["endpoint"],
        },
    },
]

SEARCH_DB = {
    "python popularity": "Python is ranked #1 in the TIOBE index as of 2024–2026.",
    "tiobe index": "TIOBE index ranks Python #1, C #2, C++ #3, Java #4 (2024–2026).",
    "javascript popularity": "JavaScript dominates web development, ranked top 5 in all major surveys.",
}

def execute_tool(name: str, tool_input: dict) -> str:
    try:
        if name == "web_search":
            q = tool_input["query"].lower()
            for key, val in SEARCH_DB.items():
                if key in q:
                    return val
            return f'No results for "{tool_input["query"]}".'
        if name == "calculator":
            result = eval(tool_input["expression"], {"__builtins__": {}})
            return str(result)
        if name == "flaky_api":
            if random.random() < 0.5:
                raise ConnectionError("API timeout after 30s")
            return f"API response for {tool_input['endpoint']}: status=200, data=42"
        return f"Unknown tool: {name}"
    except Exception as e:
        # Always return errors as strings — never let them propagate
        return f"Tool error: {type(e).__name__}: {e}"
One LLM call that returns a JSON array of steps. When called with a context argument, it acts as a replanner — revising the remaining steps given what has already happened.
PLANNER_SYSTEM = """You are a planning assistant. Given a user goal, produce a step-by-step plan as a JSON array.
Each step: {"step": integer, "description": string, "tool_hint": string}
tool_hint must be one of: web_search, calculator, flaky_api, none
Return ONLY the JSON array, no other text."""

def generate_plan(goal: str, context: str = "") -> list[dict]:
    prompt = f"Goal: {goal}"
    if context:
        prompt += f"\n\nContext (work done and failures so far):\n{context}"
        prompt += "\n\nRevise the remaining steps given this context."
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=512,
        system=PLANNER_SYSTEM,
        messages=[{"role": "user", "content": prompt}],
    )
    raw = response.content[0].text.strip()
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to extracting the first JSON array in the text
        start, end = raw.find("["), raw.rfind("]") + 1
        return json.loads(raw[start:end])
The executor works through each plan step. When a step fails, it triggers replanning with a summary of what has been done and what failed.
EXECUTOR_SYSTEM = """You are an execution assistant. Complete the given step using available tools.
When done, summarise what you did and what you found in 1-2 sentences."""

MAX_REPLAN = 2

def execute_step(step: dict) -> tuple[str, bool]:
    """Execute one plan step. Returns (summary, success)."""
    messages = [{"role": "user", "content": f"Complete this step: {step['description']}"}]
    for _ in range(5):
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=512,
            system=EXECUTOR_SYSTEM,
            tools=TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason == "end_turn":
            summary = "".join(b.text for b in response.content if hasattr(b, "text"))
            failed = "error" in summary.lower() or "failed" in summary.lower()
            return summary, not failed
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input)
                    print(f"  tool: {block.name}({block.input}) → {result[:80]}")
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })
            messages.append({"role": "user", "content": tool_results})
    return "Step did not complete within iteration budget.", False

def run_plan_execute(goal: str) -> None:
    print(f"\n{'='*60}\nGOAL: {goal}\n{'='*60}")
    plan = generate_plan(goal)
    replan_count = 0
    completed: list[str] = []
    while plan and replan_count <= MAX_REPLAN:
        step = plan.pop(0)
        print(f"\n[step {step['step']}] {step['description']}")
        summary, success = execute_step(step)
        print(f"  → {summary}")
        if success:
            completed.append(f"Step {step['step']}: {summary}")
        else:
            print(f"\n  ⚠ Step failed. Replanning... ({replan_count + 1}/{MAX_REPLAN})")
            context = "\n".join(completed) + f"\nFailed: {step['description']}\nReason: {summary}"
            plan = generate_plan(goal, context=context)
            replan_count += 1
            print(f"  Revised plan has {len(plan)} remaining steps.")
    print(f"\n{'='*60}\nCOMPLETED STEPS:")
    for c in completed:
        print(f"  ✓ {c}")
    print(f"{'='*60}")

if __name__ == "__main__":
    run_plan_execute(
        "Research the most popular programming language in 2024–2026, "
        "then calculate how many seconds are in a year."
    )
python plan_execute_agent.py
============================================================
GOAL: Research the most popular programming language ...
============================================================
[step 1] Search for the most popular programming language 2024-2026
tool: web_search({'query': 'python popularity'}) → Python is ranked #1...
→ Found that Python is ranked #1 by the TIOBE index for 2024–2026.
[step 2] Fetch supporting data from the TIOBE API
tool: flaky_api({'endpoint': 'tiobe/rankings'}) → Tool error: ConnectionError: API timeout
→ Step failed due to API timeout error.
⚠ Step failed. Replanning... (1/2)
Revised plan has 2 remaining steps.
[step 2] Search for TIOBE index rankings as fallback
tool: web_search({'query': 'tiobe index'}) → TIOBE index ranks Python #1, C #2...
→ Confirmed via web search: Python #1, C #2, C++ #3.
[step 3] Calculate seconds in a year
tool: calculator({'expression': '365 * 24 * 60 * 60'}) → 31536000
→ There are 31,536,000 seconds in a year.
============================================================
COMPLETED STEPS:
✓ Step 1: Python is ranked #1 by TIOBE for 2024–2026.
✓ Step 2: Confirmed Python #1 via web search fallback.
✓ Step 3: 31,536,000 seconds in a year.
============================================================
About half the time, flaky_api succeeds and the plan runs straight through. The other half, it fails and you see the replanning trigger — the revised plan replaces the broken step with a web search fallback. This is how production agents handle unreliable external dependencies.
After generate_plan returns, print the plan and ask the user to approve it before execution begins. This implements the minimal footprint principle — the user reviews intended actions before any tool runs.
# Print plan and request approval before execution
print("\nProposed plan:")
for s in plan:
    print(f"  {s['step']}. {s['description']} [tool: {s['tool_hint']}]")
approval = input("\nApprove this plan? (yes/no): ").strip().lower()
if approval != "yes":
    print("Plan rejected. Exiting.")
    return