Decision-Making & Planning
Knowing how to call a tool is not enough. Real-world tasks require breaking a goal into ordered steps, deciding which step to take next given intermediate results, and recovering when a step fails. This section covers the planning techniques that separate toy demos from production agents: chain-of-thought decomposition, plan-and-execute architecture, Tree of Thoughts search, and principled replanning when things go wrong.
How LLMs Decompose Goals
An LLM's base capability is completing a prompt. Planning is the structured use of that capability to break a complex goal into a sequence of smaller, achievable subgoals. The key insight from the Chain-of-Thought paper (Wei et al., 2022) is that prompting the model to show its reasoning step-by-step — rather than jumping straight to an answer — dramatically improves performance on multi-step tasks.
For agents, Chain-of-Thought is the mechanism behind every planning step. When you ask an agent to "research competitor pricing and write a summary report," it must decompose that into: identify competitors → search each → extract prices → compare → draft report. If the model tries to do all of this in one shot, it hallucinates. If it reasons through each step explicitly, each step becomes a verifiable, correctable unit.
Without decomposition:
Agent: [one giant API call, guesses competitor names, fabricates prices, returns confident-sounding hallucination]

With decomposition:
Step 1: Identify competitors
Step 2: Search "[Competitor A] pricing page"
Step 3: Search "[Competitor B] pricing page"
Step 4: Compare extracted prices
Step 5: Write summary from verified data
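The decomposition above is triggered by how the model is prompted. A minimal sketch of such a prompt builder follows; the wording and step format are illustrative assumptions, not a fixed API:

```python
def decomposition_prompt(goal: str) -> str:
    """Build a prompt that asks the model to plan before acting."""
    # Hypothetical prompt text: the point is to force explicit step listing
    # before any action, so each step becomes a verifiable unit.
    return (
        f"Goal: {goal}\n\n"
        "Before taking any action, list the steps needed to achieve this goal, "
        "one per line, numbered. Then execute them one at a time, "
        "checking each intermediate result before moving on."
    )

print(decomposition_prompt("research competitor pricing and write a summary report"))
```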
Separate Planning from Execution
In a standard ReAct loop, the agent plans and acts interleaved — each iteration decides the next step based on the last observation. This works well for tasks where the path is truly unknown until runtime. But for tasks that can be decomposed upfront, a cleaner architecture is plan-and-execute: one LLM call to produce the full step list, then a separate executor loop that works through that list.
The Plan-and-Solve paper (Wang et al., 2023) showed that explicitly prompting the model to first devise a plan before solving improves accuracy on reasoning benchmarks. The architectural benefit for agents is even bigger: a pre-generated plan can be inspected, logged, and modified before execution begins — enabling human-in-the-loop approval of the plan before the agent takes any irreversible actions.
✓ Enables human-in-the-loop plan approval
✓ Independent steps can run in parallel
✓ Cleaner separation of concerns
✓ Easier to log, replay, and debug
✗ Requires a replanning mechanism for failures
✗ Extra LLM call upfront increases cost and latency
✗ Not suitable for highly dynamic, open-ended tasks
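The planner/executor split can be sketched in a few lines. Both functions below are stubs standing in for LLM calls (the full tool-calling version is built later in this section); the fixed plan and "done:" summaries are illustrative:

```python
def plan(goal: str) -> list[str]:
    # Stub planner: a real implementation would make one LLM call
    # returning a structured step list.
    return [f"search: {goal}", "extract prices", "compare", "draft report"]

def execute(step: str) -> str:
    # Stub executor: a real implementation would run a tool-calling loop per step.
    return f"done: {step}"

def run(goal: str) -> list[str]:
    steps = plan(goal)  # one upfront planning call, inspectable before execution
    return [execute(s) for s in steps]  # separate executor works through the list

print(run("competitor pricing"))
```

Because `plan` returns before `execute` runs, the step list can be logged, diffed, or shown to a human for approval in between.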
Search Over Reasoning Paths
Chain-of-Thought generates a single reasoning path. Tree of Thoughts (ToT) (Yao et al., 2023) generalises this: instead of committing to one reasoning chain, the model explores multiple candidate next steps at each decision point, evaluates them, and pursues the most promising branch — like a search tree over the space of possible reasoning paths.
ToT is overkill for most agent tasks, but it is the right tool for problems where the solution space is large, where greedy step-by-step reasoning fails, and where you have a way to evaluate intermediate states. Classic examples: creative writing with constraints, multi-step math, and strategic planning with competing options.
Chain-of-Thought (single committed path):
└─ Step A
   └─ Step B
      └─ Step C → Answer

Tree of Thoughts (branching search with evaluation):
├─ Path A1 [score: 0.4 ✗]
├─ Path A2 [score: 0.7]
│  ├─ B1 [score: 0.3 ✗]
│  └─ B2 [score: 0.9 → Answer]
└─ Path A3 [score: 0.2 ✗]
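The propose-score-prune loop behind a tree search like this can be sketched as a toy beam search. Here the candidate generator and scorer are deterministic stand-ins for LLM calls (the real ToT paper uses the model for both), and the letter-building task is purely illustrative:

```python
TARGET = "plan"

def candidates(path: str) -> list[str]:
    # Stand-in for "propose next thoughts": try appending each letter.
    return [path + c for c in "aplnx"]

def score(path: str) -> float:
    # Stand-in for "evaluate this partial path": fraction of TARGET matched.
    return sum(a == b for a, b in zip(path, TARGET)) / len(TARGET)

def tot_search(beam_width: int = 2, depth: int = 4) -> str:
    frontier = [""]
    for _ in range(depth):
        expanded = [c for path in frontier for c in candidates(path)]
        expanded.sort(key=score, reverse=True)  # evaluate and rank branches
        frontier = expanded[:beam_width]        # prune to the most promising
    return frontier[0]

print(tot_search())  # → plan
```

Greedy single-path reasoning can dead-end on the first wrong letter; keeping a beam of scored branches lets the search recover, which is exactly the failure mode ToT addresses.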
When to Act, When to Ask, When to Stop
Production agents constantly face ambiguous situations: the user's goal is underspecified, a tool returns unexpected data, or two valid next steps have unknown consequences. Anthropic's guidance is explicit: agents should prefer cautious actions, accept a worse expected outcome in exchange for lower variance, and err on the side of doing less and confirming with users when uncertain about intended scope. The key principle is minimal footprint.
| ACTION | REVERSIBLE? | POLICY |
|---|---|---|
| Read file / search web | Yes | Execute freely |
| Write to a draft / temp file | Yes | Execute freely |
| Overwrite existing file | Partially | Create backup first |
| Send email / Slack message | No | Require explicit human approval |
| Delete records from DB | No | Require explicit human approval |
| Deploy to production | No | Require explicit human approval |
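A policy table like the one above can be enforced with a small gate in front of tool execution. This is a sketch; the action names and policy strings are illustrative and should be mapped to your own tool inventory:

```python
POLICY = {
    "read_file":      "execute",
    "web_search":     "execute",
    "write_draft":    "execute",
    "overwrite_file": "backup_first",
    "send_email":     "require_approval",
    "delete_records": "require_approval",
    "deploy":         "require_approval",
}

def gate(action: str) -> str:
    # Unknown actions default to the most cautious policy (minimal footprint).
    return POLICY.get(action, "require_approval")

print(gate("web_search"))   # → execute
print(gate("send_email"))   # → require_approval
print(gate("new_tool_x"))   # → require_approval (unknown, so cautious default)
```

The cautious default matters more than the table entries: any tool added later is held for approval until someone classifies its reversibility.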
Plans Break — Agents Must Adapt
A plan generated upfront is a best-guess, not a contract. Tool calls fail, APIs return unexpected data, and intermediate results invalidate later steps. A robust agent must detect these situations and replan rather than crashing or blindly retrying the same failing step.
Retrying repeats the same action hoping for a different result. Replanning calls the LLM again with the current state — what has been done, what failed, what remains — and asks for a revised step list. The new plan incorporates the failure as information.
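The retry/replan distinction can be shown in miniature. The `flaky_fetch` endpoint and the hard-coded revision below are hypothetical stand-ins; a real replanner would be an LLM call with the failure in its context:

```python
def flaky_fetch(attempt: int) -> str:
    raise ConnectionError("API timeout")  # this endpoint is simply down

def naive_retry(tries: int = 3) -> str:
    for i in range(tries):  # same action, same result, every time
        try:
            return flaky_fetch(i)
        except ConnectionError:
            pass
    return "gave up"

def replan(failure: str) -> str:
    # Stand-in for an LLM replanning call: the failure message is treated
    # as information and the plan routes around the broken dependency.
    return "web_search fallback" if "timeout" in failure else "retry"

print(naive_retry())                            # → gave up
print(replan("ConnectionError: API timeout"))   # → web_search fallback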
Verified References
Every claim in this section is grounded in one of these sources.
| Source | Type | Covers | Recency |
|---|---|---|---|
| Wei et al. — Chain-of-Thought Prompting | Academic paper | Chain-of-thought reasoning, step-by-step decomposition, emergent planning | 2022 |
| Wang et al. — Plan-and-Solve | Academic paper | Plan-and-execute architecture, planner/executor split, reasoning benchmarks | 2023 |
| Yao et al. — Tree of Thoughts | Academic paper | ToT framework, branching search over reasoning paths, evaluation heuristics | 2023 |
| Shinn et al. — Reflexion | Academic paper | Failure recovery, verbal reinforcement, replanning from failure | 2023 |
| Anthropic — Tool Use & Agentic Guidance | Official docs | Minimal footprint, irreversible actions, human-in-the-loop checkpoints | Maintained 2024–2026 |
| Lilian Weng — LLM Powered Autonomous Agents | Blog / Survey | Planning survey, task decomposition, decision-making under uncertainty | June 2023 |
Build a Plan-and-Execute Agent with Replanning
You will build a plan-and-execute agent that first generates a structured step list, then works through each step with a tool-calling executor. When a step fails, the agent replans — calling the LLM again with current state to generate a revised plan. This is the pattern behind production research and automation agents.
Use the same agent-lab environment from previous sections:
cd agent-lab && source .venv/bin/activate
touch plan_execute_agent.py
Three tools: simulated web search, a calculator, and a flaky_api that fails 50% of the time — so you can watch replanning trigger reliably.
import os, json, random

import anthropic
from dotenv import load_dotenv

load_dotenv()
client = anthropic.Anthropic()

TOOLS = [
    {
        "name": "web_search",
        "description": "Search the web. Returns a short text result.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "calculator",
        "description": "Evaluate a Python arithmetic expression. Returns the result as a string.",
        "input_schema": {
            "type": "object",
            "properties": {
                "expression": {
                    "type": "string",
                    "description": "A safe arithmetic expression, e.g. '365 * 24 * 60 * 60'",
                }
            },
            "required": ["expression"],
        },
    },
    {
        "name": "flaky_api",
        "description": "Fetch data from an external API. Sometimes fails — use web_search as a fallback.",
        "input_schema": {
            "type": "object",
            "properties": {"endpoint": {"type": "string"}},
            "required": ["endpoint"],
        },
    },
]

SEARCH_DB = {
    "python popularity": "Python is ranked #1 in the TIOBE index as of 2024–2026.",
    "tiobe index": "TIOBE index ranks Python #1, C #2, C++ #3, Java #4 (2024–2026).",
    "javascript popularity": "JavaScript dominates web development, ranked top 5 in all major surveys.",
}

def execute_tool(name: str, tool_input: dict) -> str:
    try:
        if name == "web_search":
            q = tool_input["query"].lower()
            for key, val in SEARCH_DB.items():
                if key in q:
                    return val
            return f'No results for "{tool_input["query"]}".'
        if name == "calculator":
            result = eval(tool_input["expression"], {"__builtins__": {}})
            return str(result)
        if name == "flaky_api":
            if random.random() < 0.5:
                raise ConnectionError("API timeout after 30s")
            return f"API response for {tool_input['endpoint']}: status=200, data=42"
        return f"Unknown tool: {name}"
    except Exception as e:
        # Always return errors as strings — never let them propagate
        return f"Tool error: {type(e).__name__}: {e}"
One LLM call that returns a JSON array of steps. When called with a context argument, it acts as a replanner — revising the remaining steps given what has already happened.
PLANNER_SYSTEM = """You are a planning assistant. Given a user goal, produce a step-by-step plan as a JSON array.
Each step: {"step": integer, "description": string, "tool_hint": string}
tool_hint must be one of: web_search, calculator, flaky_api, none
Return ONLY the JSON array, no other text."""

def generate_plan(goal: str, context: str = "") -> list[dict]:
    prompt = f"Goal: {goal}"
    if context:
        prompt += f"\n\nContext (work done and failures so far):\n{context}"
        prompt += "\n\nRevise the remaining steps given this context."
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=512,
        system=PLANNER_SYSTEM,
        messages=[{"role": "user", "content": prompt}],
    )
    raw = response.content[0].text.strip()
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to extracting the first JSON array in the text
        start, end = raw.find("["), raw.rfind("]") + 1
        return json.loads(raw[start:end])
The executor works through each plan step. When a step fails, it triggers replanning with a summary of what has been done and what failed.
EXECUTOR_SYSTEM = """You are an execution assistant. Complete the given step using available tools.
When done, summarise what you did and what you found in 1-2 sentences."""

MAX_REPLAN = 2

def execute_step(step: dict) -> tuple[str, bool]:
    """Execute one plan step. Returns (summary, success)."""
    messages = [{"role": "user", "content": f"Complete this step: {step['description']}"}]
    for _ in range(5):
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=512,
            system=EXECUTOR_SYSTEM,
            tools=TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason == "end_turn":
            summary = "".join(b.text for b in response.content if hasattr(b, "text"))
            failed = "error" in summary.lower() or "failed" in summary.lower()
            return summary, not failed
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input)
                    print(f"  tool: {block.name}({block.input}) → {result[:80]}")
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })
            messages.append({"role": "user", "content": tool_results})
    return "Step did not complete within iteration budget.", False

def run_plan_execute(goal: str) -> None:
    print(f"\n{'='*60}\nGOAL: {goal}\n{'='*60}")
    plan = generate_plan(goal)
    replan_count = 0
    completed: list[str] = []
    while plan and replan_count <= MAX_REPLAN:
        step = plan.pop(0)
        print(f"\n[step {step['step']}] {step['description']}")
        summary, success = execute_step(step)
        print(f"  → {summary}")
        if success:
            completed.append(f"Step {step['step']}: {summary}")
        else:
            print(f"\n  ⚠ Step failed. Replanning... ({replan_count + 1}/{MAX_REPLAN})")
            context = "\n".join(completed) + f"\nFailed: {step['description']}\nReason: {summary}"
            plan = generate_plan(goal, context=context)
            replan_count += 1
            print(f"  Revised plan has {len(plan)} remaining steps.")
    print(f"\n{'='*60}\nCOMPLETED STEPS:")
    for c in completed:
        print(f"  ✓ {c}")
    print(f"{'='*60}")

if __name__ == "__main__":
    run_plan_execute(
        "Research the most popular programming language in 2024–2026, "
        "then calculate how many seconds are in a year."
    )
python plan_execute_agent.py
============================================================
GOAL: Research the most popular programming language ...
============================================================
[step 1] Search for the most popular programming language 2024-2026
tool: web_search({'query': 'python popularity'}) → Python is ranked #1...
→ Found that Python is ranked #1 by the TIOBE index for 2024–2026.
[step 2] Fetch supporting data from the TIOBE API
tool: flaky_api({'endpoint': 'tiobe/rankings'}) → Tool error: ConnectionError: API timeout
→ Step failed due to API timeout error.
⚠ Step failed. Replanning... (1/2)
Revised plan has 2 remaining steps.
[step 2] Search for TIOBE index rankings as fallback
tool: web_search({'query': 'tiobe index'}) → TIOBE index ranks Python #1, C #2...
→ Confirmed via web search: Python #1, C #2, C++ #3.
[step 3] Calculate seconds in a year
tool: calculator({'expression': '365 * 24 * 60 * 60'}) → 31536000
→ There are 31,536,000 seconds in a year.
============================================================
COMPLETED STEPS:
✓ Step 1: Python is ranked #1 by TIOBE for 2024–2026.
✓ Step 2: Confirmed Python #1 via web search fallback.
✓ Step 3: 31,536,000 seconds in a year.
============================================================
About half the time, flaky_api succeeds and the plan runs straight through. The other half, it fails and you see the replanning trigger — the revised plan replaces the broken step with a web search fallback. This is how production agents handle unreliable external dependencies.
After generate_plan returns, print the plan and ask the user to approve it before execution begins. This implements the minimal footprint principle — the user reviews intended actions before any tool runs.
# Print plan and request approval before execution
print("\nProposed plan:")
for s in plan:
    print(f"  {s['step']}. {s['description']} [tool: {s['tool_hint']}]")
approval = input("\nApprove this plan? (yes/no): ").strip().lower()
if approval != "yes":
    print("Plan rejected. Exiting.")
    return