The agent loop: ReAct, plan-and-execute, reflection
"An agent is a loop of `model.generate()` calls with tool calls in between. The loop is the entire pattern."
In Chapter 66 we covered tool calling — how an LLM invokes external functions. This chapter is about composing tool calls into agents: systems that go beyond a single query/response and instead iterate, taking multiple steps to accomplish a goal.
By the end you’ll understand the canonical agent patterns (ReAct, plan-and-execute, reflection) and when each is appropriate.
Outline:
- The agent definition.
- The basic agent loop.
- ReAct — reasoning + acting.
- Plan-and-execute.
- Reflection.
- The cost of agents.
- When to use which pattern.
- Implementation in modern frameworks.
67.1 The agent definition
An agent is a system that uses an LLM to iteratively take actions in pursuit of a goal. Each iteration:
- The LLM observes the current state.
- The LLM decides on an action.
- The action is executed (typically as a tool call).
- The result is fed back to the LLM as new state.
- The loop continues until the goal is achieved or the agent gives up.
This is different from a single-turn LLM call (where the model produces one response and you’re done) and from a conversational chatbot (where the model produces responses but doesn’t take real-world actions). An agent acts.
The defining property: the agent makes multiple LLM calls in pursuit of a single user goal, with tool calls between them, and the tool results inform the next LLM call. The LLM is in a loop, not a single-shot.
The simplest agent is: take a user query, decide which tool(s) to call, call them, decide what to do next, possibly call more tools, eventually produce a final answer. That’s it.
The patterns in this chapter are different ways to structure this loop. Each has trade-offs around accuracy, latency, and cost.
67.2 The basic agent loop
The simplest agent loop in pseudocode:
import json

def run_agent(user_query, max_iterations=10):
    messages = [
        {"role": "system", "content": "You are an agent. Use tools to answer."},
        {"role": "user", "content": user_query},
    ]
    for _ in range(max_iterations):
        response = llm.generate(messages, tools=AVAILABLE_TOOLS)
        if response.has_tool_calls:
            # The assistant message carrying the tool calls must be
            # appended before the tool results that answer them.
            messages.append(response.message)
            for tool_call in response.tool_calls:
                result = call_tool(tool_call.name, tool_call.arguments)
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": json.dumps(result),
                })
        else:
            # No tool calls: the model has produced its final answer.
            return response.message.content
    return "Agent exceeded max iterations"
Read this carefully. Every agent in Part V is some elaboration of this loop. The structure is:
- Add the user query to the messages.
- Loop:
- Call the LLM with the current messages and the available tools.
- If the LLM emits tool calls, execute them and append the results to messages.
- If the LLM doesn’t emit tool calls, it’s done — return its response.
- Stop after max_iterations to prevent infinite loops.
The max_iterations cap is critical. Without it, an agent that gets stuck calls tools forever. Typical values: 10-30.
This basic loop already works for many agent use cases. It’s the foundation. The patterns below are refinements.
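The loop above leans on two names it doesn't define: `call_tool` and `AVAILABLE_TOOLS`. A minimal sketch of what backs them, assuming a plain dict registry (the tool names, the schema shape, and the calculator tool itself are illustrative, not tied to any particular API):

```python
# A minimal tool registry to back the agent loop. All names and
# schemas here are illustrative.

def calculator(expression: str) -> float:
    """Evaluate a basic arithmetic expression.
    Illustrative only; eval is not safe for untrusted input."""
    return eval(expression, {"__builtins__": {}})

TOOL_REGISTRY = {"calculator": calculator}

# Tool schemas passed to the LLM so it knows what it can call.
AVAILABLE_TOOLS = [
    {
        "name": "calculator",
        "description": "Evaluate an arithmetic expression.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    }
]

def call_tool(name: str, arguments: dict) -> dict:
    """Dispatch a tool call to the matching Python function.
    Errors are returned as data, so the LLM can see and recover
    from them on the next iteration."""
    if name not in TOOL_REGISTRY:
        return {"error": f"unknown tool: {name}"}
    try:
        return {"result": TOOL_REGISTRY[name](**arguments)}
    except Exception as e:
        return {"error": str(e)}
```

Returning errors as tool results rather than raising is a deliberate choice: a raised exception kills the loop, while an error payload gives the model a chance to retry with different arguments.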
67.3 ReAct — reasoning + acting
ReAct (Yao et al., 2022) is the foundational agent paper. The idea: in each iteration, the LLM produces both a reasoning step (what it’s thinking) and an action (the tool call). The two together form a “thought + action” pair.
The format:
Thought: I need to find the population of France. I'll search for it.
Action: search("population of France")
Observation: 67.4 million
Thought: I have the answer. I should also confirm it's the current population.
Action: search("France population 2025")
Observation: 67.5 million
Thought: I have the current answer.
Final Answer: France's population is approximately 67.5 million as of 2025.
The “Thought” steps are the model explicitly reasoning before each action. This makes the model’s decisions visible and gives it a place to do chain-of-thought reasoning that improves its action selection.
Empirically, ReAct outperforms agents that don’t have explicit reasoning steps. The act of writing out the reasoning improves the quality of the decisions. It’s the same effect as chain-of-thought prompting (Chapter 42), applied to tool selection.
In modern implementations, ReAct is often implicit — the model is trained to think before acting, and the structured output format naturally has thinking and tool-call sections. You don’t have to explicitly prompt for “Thought: …” anymore; modern instruction-tuned models do it by default.
ReAct is the default pattern for most agent applications. Almost every framework (LangChain, LlamaIndex, custom) supports it.
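When you do run ReAct in its classic text format rather than via native tool calling, the loop has to parse the Thought/Action/Final Answer lines out of raw model output. A minimal parser sketch, assuming the exact format shown above (the regexes are a convention to match your prompt, not a standard):

```python
import re

# Parse one ReAct step from raw model text of the form:
#   Thought: ...
#   Action: tool_name("argument")
# or a terminating:
#   Final Answer: ...
# The format is whatever your prompt specifies; adjust the
# regexes to match it.

ACTION_RE = re.compile(r'^Action:\s*(\w+)\("(.*)"\)', re.MULTILINE)
FINAL_RE = re.compile(r"^Final Answer:\s*(.*)", re.MULTILINE)

def parse_react_step(text: str):
    """Return ('final', answer), ('action', tool, arg),
    or ('none', None) if the model emitted neither."""
    final = FINAL_RE.search(text)
    if final:
        return ("final", final.group(1).strip())
    action = ACTION_RE.search(text)
    if action:
        return ("action", action.group(1), action.group(2))
    return ("none", None)
```

The `('none', None)` case matters in practice: models sometimes emit a Thought with no Action, and the loop needs a policy for that (typically re-prompting with a reminder of the expected format).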
67.4 Plan-and-execute
A different pattern: plan first, then execute. The agent first writes a complete plan (a sequence of steps), then executes each step. This is more structured than ReAct’s interleaved thought/action.
The plan-and-execute pattern:
[Planning phase]
LLM call 1: "Here's the user's question. Write a plan to answer it."
Plan:
1. Find the inventor of the lightbulb.
2. Find the country where they were born.
3. Find the population of that country.
[Execution phase]
For each step in the plan:
LLM call: execute the step (with tool calls).
Record the result.
[Synthesis]
LLM call: "Here's the original question and the plan results. Produce the final answer."
The advantages of plan-and-execute over ReAct:
- More structured. Easier to debug and reason about.
- Better for multi-step problems. The model can think about the full plan before committing to actions.
- Less prone to wandering. ReAct can get distracted mid-loop; plan-and-execute is committed to the plan.
The disadvantages:
- Less adaptive. The plan is fixed; if a step’s result changes the picture, the agent can’t easily replan.
- More LLM calls. Planning adds an extra round trip.
- Plan errors are catastrophic. A bad plan executed step-by-step is harder to recover from than a bad ReAct trajectory that can self-correct.
Plan-and-execute is good for complex multi-step tasks where the structure is known in advance. For exploratory tasks, ReAct is better.
A hybrid approach is plan-and-replan: write a plan, execute the first step, evaluate the result, possibly update the plan, execute the next step, etc. This combines the structure of planning with the adaptivity of ReAct.
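The three phases above can be sketched as a single driver function. This assumes two illustrative helpers that are not defined here: `llm_call(prompt)` for one LLM round trip, and `execute_step(step)` for running the basic loop from 67.2 on a single plan step.

```python
# A sketch of plan-and-execute. Assumed (illustrative) helpers:
#   llm_call(prompt) -> str       one LLM round trip
#   execute_step(step) -> str     runs the basic agent loop on one step

def plan_and_execute(user_query, llm_call, execute_step):
    # Planning phase: one LLM call produces a numbered plan.
    plan_text = llm_call(
        f"Write a numbered step-by-step plan to answer: {user_query}"
    )
    steps = [line.split(".", 1)[1].strip()
             for line in plan_text.splitlines()
             if line.strip() and line.strip()[0].isdigit()]

    # Execution phase: run each step, carrying prior results forward
    # so later steps can build on earlier ones.
    results = []
    for step in steps:
        context = "\n".join(results)
        results.append(execute_step(f"{step}\nPrior results:\n{context}"))

    # Synthesis phase: one final LLM call combines everything.
    return llm_call(
        f"Question: {user_query}\nPlan results:\n" + "\n".join(results)
        + "\nProduce the final answer."
    )
```

Note how the plan is parsed once and then frozen: this is exactly the rigidity discussed above. The plan-and-replan hybrid would re-run the planning call inside the loop whenever a step's result invalidates the remaining steps.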
67.5 Reflection
Reflection is the pattern where the agent critiques its own output and tries again if it’s not satisfied.
The structure:
Step 1: Generate an initial response.
Step 2: Critique the response. (Is it complete? Correct? Helpful?)
Step 3: If the critique identifies issues, regenerate with feedback.
Step 4: Repeat until satisfied or max iterations.
Reflection adds a self-evaluation layer. The model is asked to look at its own output and find flaws, then improve.
A simple implementation:
def reflect(query, max_reflection_steps=3):
    response = generate_response(query)
    for _ in range(max_reflection_steps):
        critique = llm.generate(
            f"Critique this response to '{query}': {response}"
        )
        # A crude stopping check; production systems use a
        # structured verdict instead of string matching.
        if "looks good" in critique.lower():
            break
        response = llm.generate(
            f"Improve this response based on the critique. "
            f"Original: {response}. Critique: {critique}"
        )
    return response
Reflection is most useful for tasks where the model can verify its own work: math problems (the model can re-derive and check), code (the model can read and check the code), reasoning problems (the model can re-examine its logic). It’s less useful for open-ended creative tasks where there’s no clear “correct.”
The Reflexion paper (Shinn et al., 2023) is the canonical reference. It showed that reflection can substantially improve agent performance on coding and reasoning benchmarks.
The cost: 2-5× more LLM calls per query (one for the initial response, several for critique and rewriting).
67.6 The cost of agents
A realistic agent on a moderately complex query might make:
- 1 initial LLM call to plan or react.
- 2-5 tool calls.
- 2-5 LLM calls between tool calls (one per iteration).
- 1 final LLM call to synthesize the answer.
Total: 5-12 LLM calls per user query. Each is a full prefill + decode cycle.
For a typical chat application, a single user query maps to one LLM call. An agent maps to 5-12. The cost is 5-12× higher per user query, both in compute and in latency.
The implications:
Latency. A single LLM call takes 5-30 seconds. An agent doing 10 calls takes 50-300 seconds. Agents are slow. Users have to wait much longer.
Cost. At $0.50/million tokens, a single chat request costs ~$0.001. An agent costs $0.005-0.01 per query. 10× the cost.
Reliability. Each LLM call has a small chance of going wrong (hallucination, wrong tool, infinite loop), and these chances compound across calls: at a 2% per-call failure rate, a 10-call agent fails roughly 18% of the time (1 − 0.98¹⁰). Agents fail more often than single-shot LLMs.
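The latency and cost figures above are back-of-envelope arithmetic, and it's worth making the arithmetic explicit. A sketch with illustrative parameters (the 2,000-token request size and 15-second call time are assumptions chosen to match the section's $0.50/million-token figure; substitute your own measurements):

```python
# Back-of-envelope agent cost/latency estimate. All parameters are
# illustrative defaults; plug in your model's price and measured
# latencies.

def estimate_agent(llm_calls, tokens_per_call=2000,
                   price_per_million=0.50, seconds_per_call=15):
    """Return (dollars, seconds) for a query using llm_calls LLM calls."""
    cost = llm_calls * tokens_per_call * price_per_million / 1_000_000
    latency = llm_calls * seconds_per_call
    return cost, latency

single_cost, single_latency = estimate_agent(1)   # single-shot chat request
agent_cost, agent_latency = estimate_agent(10)    # a 10-call agent
```

With these defaults, the single-shot request comes out at $0.001 and 15 seconds, and the 10-call agent at $0.01 and 150 seconds: the 10× multiplier the text describes.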
The trade-off: agents do things that single-shot LLMs can’t. For complex tasks (multi-step research, code writing with execution, multi-tool workflows), the cost and latency are worth it. For simple queries (define a word, summarize a document), agents are overkill.
The skill is knowing when to use an agent vs a single-shot LLM. Most teams over-use agents.
67.7 When to use which pattern
The decision tree:
Single-shot LLM:
- Simple Q&A.
- Summarization.
- Translation.
- Anything that doesn’t need tool calls.
Single-shot LLM + RAG:
- Q&A over a knowledge base.
- Anything that needs retrieved information but no other tool calls.
ReAct agent (or basic loop):
- Tasks needing 1-3 tool calls.
- Tasks where the structure is unclear in advance.
- Conversational tool use.
Plan-and-execute:
- Multi-step tasks with known structure.
- Long-running automation.
- Tasks where the user wants visibility into the plan.
Reflection:
- Tasks where verification is possible (math, code, structured output).
- High-stakes tasks where quality matters more than latency.
- After another agent pattern as a refinement step.
Multi-agent (Chapter 68):
- Tasks where specialization helps.
- Very large tasks decomposed across agents.
The default for most agent applications: ReAct. It’s simple, flexible, and works for most cases. Add planning or reflection as needed for specific quality issues.
67.8 Implementation in modern frameworks
The major agent frameworks:
LangChain Agents. The first widely-used framework. Supports ReAct, plan-and-execute, OpenAI tools, custom agents. Mature but sometimes over-abstracted.
LangGraph (LangChain). A graph-based agent framework where you define states and transitions explicitly. More structured than the older LangChain agents. The modern recommended approach within the LangChain ecosystem.
LlamaIndex Agents. Similar feature surface to LangChain. Strong RAG integration.
OpenAI Assistants API. OpenAI’s hosted agent framework. Handles state management, tool calls, threads, files. Easy to start, less flexible.
Anthropic Tool Use. Anthropic’s tool calling primitives, used directly without a framework. Many teams just use the raw API.
CrewAI. A multi-agent framework with role-based agents. Popular for prototyping multi-agent systems.
AutoGen (Microsoft). Multi-agent framework with conversational agents.
Custom code. Many teams build their own agent loops in raw Python. The patterns are simple enough that a framework isn’t always necessary.
For most production teams, LangGraph or custom code is the right choice. LangGraph gives you structure without too much abstraction; custom code gives you full control. The other frameworks are useful for specific scenarios but have higher abstraction costs.
The implementation patterns are simple enough that you should be able to build your own agent loop in 100 lines of Python. Don’t reach for a framework unless you need its features.
67.9 The mental model
Eight points to take into Chapter 68:
- An agent is an LLM in a loop with tool calls. That’s the entire pattern.
- The basic loop: generate → if tool calls, execute and feed back → else return answer. Cap with max iterations.
- ReAct interleaves reasoning and acting. Default pattern.
- Plan-and-execute writes the plan first, then executes. Better for structured tasks.
- Reflection has the agent critique its own output and improve. Useful for verifiable tasks.
- Agents are 5-12× more expensive and slower than single-shot LLMs. Use them when needed, not by default.
- LangGraph or custom code is the right default for most production agents.
- Most teams over-use agents. Single-shot + RAG is enough for most tasks.
In Chapter 68 we look at multi-agent patterns: when it makes sense to have multiple agents collaborating.
Read it yourself
- Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models (2022). The ReAct paper.
- Shinn et al., Reflexion: Language Agents with Verbal Reinforcement Learning (2023).
- The LangGraph documentation and examples.
- The OpenAI Assistants API documentation.
- The LlamaIndex agents documentation.
Practice
- Implement the basic agent loop in 50 lines of Python with one tool (calculator).
- Convert it to ReAct format with explicit Thought/Action/Observation steps.
- Why does plan-and-execute work better for multi-step tasks with known structure? Argue.
- What’s the failure mode of reflection on creative tasks? Construct an example.
- For a task requiring 5 tool calls, compute the latency at 5 seconds per LLM call.
- Why is “use a single-shot LLM” often the right choice over “use an agent”? Make the case.
- Stretch: Build a ReAct agent with three tools (web search, calculator, current date) and test it on a multi-step query.