
Agent Anatomy & Patterns

Explore the core components of an agent: LLM, tools, memory, and loop. Learn the ReAct pattern and determinism vs. emergence.

What Makes an Agent an Agent

You've probably written code that calls an LLM and gets back an answer. Maybe you've chained a few of those calls together — summarize this, then classify that, then format the result. That pipeline is useful, but it isn't an agent. The difference isn't cosmetic, and it isn't just about capability. It's architectural: who decides what happens next, and when. In a scripted pipeline, you do — at design time. In an agent, the model does — at runtime. That single shift in where control flow lives is what makes agent systems both more powerful and fundamentally harder to reason about.

The Core Distinction: Feedback Loops vs. Single Passes

A standard LLM call is a single input-output pass: you assemble a prompt, send it to the model, and receive a response. A chain wires multiple calls together, but the sequence of steps and what data flows where are all determined by the developer when writing the code. The shape of execution is fixed.

An agent is different in kind, not just degree. An agent operates on a feedback loop: it receives an observation, produces an action (which might be a tool call, a question, or a final answer), observes the result of that action, and then decides what to do next. Crucially, the model itself chooses the next action based on what it observes. No developer has enumerated the possible sequences in advance.

Agent Feedback Loop
───────────────────

  ┌──────────────┐
  │ Observation  │ ◄────────────────────────────────┐
  └──────┬───────┘                                  │
         │                                          │
         ▼                                          │
  ┌──────────────┐   tool call?   ┌──────────────┐  │
  │    Model     │───────────────►│  Tool / Env  │  │
  │ (Reasoning)  │                └──────┬───────┘  │
  └──────┬───────┘                       │          │
         │                               │ result   │
         │  final answer?                └──────────┘
         ▼
  ┌──────────────┐
  │   Response   │
  └──────────────┘

The loop continues until the model signals it is done or a stopping condition the developer has set is triggered. For the open-ended tasks agents are built for, neither the number of iterations nor the specific tool sequence is known before the run begins. That's not a limitation; it's the point.

💡 Mental Model: A chain is a recipe: every step is written down before you start cooking. An agent is a chef: given a goal and a pantry, they decide in the moment what to do next based on what they see in front of them. The chef might taste the sauce, adjust, taste again, and substitute an ingredient you didn't anticipate.

Control Flow Lives in the Model, Not the Code

In a scripted pipeline, control flow is encoded in code. The logic is explicit, deterministic, and inspectable by reading the source. In an agent, control flow is deferred to the model at runtime. The developer provides a set of tools and a description of the goal. The model reads the current state and emits either a tool call or a final answer.

Here's a concrete contrast. Suppose you're building a system that answers questions about a codebase. A scripted approach:

# Scripted pipeline: control flow is in the code
def answer_code_question(question: str) -> str:
    search_results = search_codebase(question)  # Step 1: always search
    summary = summarize(search_results)          # Step 2: always summarize
    return format_answer(summary)                # Step 3: always format

Every question goes through the same three steps regardless of what the question actually requires. An agent approach exposes the tools and lets the model decide:

# Agent approach: control flow deferred to the model
tools = [
    search_codebase_tool,   # search the repo index
    read_file_tool,         # read a specific file by path
    run_tests_tool,         # execute the test suite for a module
]

# The developer provides the tools and the goal.
# The model decides which tools to call, in what order,
# and when it has enough information to answer.
agent_loop(goal=question, tools=tools)

The model might call search_codebase once, notice the result points to a specific file, call read_file to inspect it, and then answer. Or it might call search_codebase twice. Or it might determine from the question alone that no tool is needed. None of these paths are enumerated in code; they emerge from the model's reasoning over observations.
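
To make the idea concrete, here is a minimal sketch of what a function like agent_loop might look like. Everything in it is an assumption for illustration: fake_model is a hard-coded stand-in for an LLM, the two tool functions return canned strings, and tools are looked up in a module-level table rather than passed in. The only point is that the tool sequence comes from decisions made at runtime, not from the structure of the code.

```python
from typing import Callable

# Hypothetical stand-ins for the tools named above; real ones would hit a repo index.
def search_codebase(query: str) -> str:
    return "match: src/billing/invoice.py"

def read_file(path: str) -> str:
    return "def total(items): return sum(i.price for i in items)"

TOOLS: dict[str, Callable[[str], str]] = {
    "search_codebase": search_codebase,
    "read_file": read_file,
}

def fake_model(goal: str, observations: list[str]) -> dict[str, str]:
    """Stand-in for an LLM: picks the next action from what it has observed so far."""
    if not observations:
        return {"action": "search_codebase", "input": goal}
    if observations[-1].startswith("match: "):
        return {"action": "read_file", "input": observations[-1].removeprefix("match: ")}
    return {"action": "final", "input": f"Found: {observations[-1]}"}

def agent_loop(goal: str, max_steps: int = 5) -> tuple[str, list[str]]:
    """Run until the model returns a final action or the step budget runs out."""
    observations: list[str] = []
    trace: list[str] = []  # which tools were called, in order
    for _ in range(max_steps):
        decision = fake_model(goal, observations)
        if decision["action"] == "final":
            return decision["input"], trace
        trace.append(decision["action"])
        observations.append(TOOLS[decision["action"]](decision["input"]))
    return "no answer within step limit", trace

answer, trace = agent_loop("Where is the invoice total computed?")
print(trace)  # ['search_codebase', 'read_file']
```

Nothing in agent_loop says "search, then read". Swap in a different model function and the same loop produces a different trace.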

🎯 Key Principle: The developer controls what the model can do (the tool set, the prompt, the stopping conditions). The model controls what it chooses to do at each step. The loop scaffolding and tool execution layer are squarely the developer's responsibility — something we'll make concrete in the sections that follow.

Non-Determinism Is a First-Class Design Concern

Once control flow lives in the model, you inherit a property most backend developers are unaccustomed to designing around: non-determinism at the architectural level.

This is subtler than the well-known fact that LLMs sample stochastically. Even at temperature zero, two logically equivalent phrasings of a question can lead the model to invoke different tool sequences. Add tool results that vary slightly between runs — a database query that returns records in a different order, a clock call that returns a later time — and the model's next decision can diverge.

⚠️ Common Mistake: Treating the tool call sequence as a stable artifact. A test that asserts "when asked to summarize the repository, the agent calls search_codebase first, then read_file" may pass in development and then fail intermittently in CI, not because the agent is broken, but because its execution path is legitimately variable. The right thing to test is the outcome (was the summary accurate?), not the path (which tools were called, in what order). We'll examine this testing pattern in detail in Common Pitfalls When Building Agents.
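
One way to picture an outcome-focused test is the sketch below. The two run_variant functions are hypothetical recordings of agent runs that took different paths to an acceptable answer, and the keyword check in summary_is_acceptable is a deliberately crude placeholder for whatever accuracy check fits your task.

```python
from typing import Callable

AgentRun = Callable[[str], tuple[str, list[str]]]

def run_variant_a(question: str) -> tuple[str, list[str]]:
    # Hypothetical run: search first, then read.
    return ("The repo contains a billing service and a CLI.",
            ["search_codebase", "read_file"])

def run_variant_b(question: str) -> tuple[str, list[str]]:
    # Same question, a different but equally valid path.
    return ("This repo has a CLI plus a billing service.",
            ["read_file", "search_codebase", "read_file"])

def summary_is_acceptable(summary: str) -> bool:
    """Outcome check: the summary must mention the facts we care about."""
    required_facts = ("billing", "cli")
    text = summary.lower()
    return all(fact in text for fact in required_facts)

def check_agent(run: AgentRun, max_tool_calls: int = 10) -> None:
    answer, trace = run("Summarize the repository")
    assert summary_is_acceptable(answer)   # assert on the outcome
    assert len(trace) <= max_tool_calls    # bound the path, do not pin its exact shape

check_agent(run_variant_a)
check_agent(run_variant_b)
```

Both variants pass even though their tool sequences differ. A test pinned to the exact sequence would accept only one of them.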

Non-determinism also surfaces in error handling and retry logic. If an agent calls a tool that returns an error on step three, the model's behavior on the next step is probabilistic. Designing for this means thinking about worst-case execution traces, not just the happy path.

When Agents Are the Wrong Choice

The shift to agentic architecture introduces real costs: harder testing, non-deterministic execution, more complex debugging, and more moving parts that can fail. That cost is justified only in specific circumstances.

Agents are valuable when the task structure cannot be known in advance. If you don't know how many steps a task will take, which sources of information will be relevant, or what intermediate results will look like before you start, an agent's ability to adapt is genuinely useful. Research tasks, debugging assistants, and multi-step planning problems all fit this profile.

Wrong thinking: "Agents are more powerful, so I should use them whenever I use an LLM."

Correct thinking: "Does this task have a variable, unpredictable structure that requires adaptive control flow? If yes, an agent may be warranted. If not, a simpler approach will be cheaper, faster, and easier to maintain."

📋 Quick Reference: Choosing the Right Architecture

Characteristic                                  Best Match
──────────────────────────────────────────────────────────────────────────────────────
Fixed sequence of steps, all known upfront      Single LLM call or scripted chain
Output format is fully specified                Structured generation (JSON mode, etc.)
Steps are dynamic but bounded and enumerable    Conditional chain or router
Number of steps and which tools are needed      Agent
  can't be determined without running
Task requires reacting to observations          Agent
  mid-execution

Generating a formatted invoice from structured data is a fixed task: call the model once with the data and a template. Contrast that with a debugging assistant that takes a stack trace — it might need to search documentation, read source files, look up similar issues, or run a reproducing example, depending entirely on what it finds at each step. That unpredictable, observation-driven structure is exactly what agents are designed for.

A common mistake is reaching for an agent to handle what is actually a routing problem. If you have five possible response templates and want the model to pick one, that's a classification call — one inference call, one structured output. Building a multi-step agent around it adds retry risk and latency with no gain.
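
A routing problem of this kind can be sketched as a single call plus validation. The classify function below is a keyword-matching stand-in for one structured-output model call, and the five template names are invented for the example.

```python
from typing import Callable

TEMPLATES = ("refund", "shipping", "billing", "technical", "other")

def classify(message: str) -> str:
    """Stand-in for one structured-output model call that returns a label."""
    text = message.lower()
    for label in TEMPLATES[:-1]:
        if label in text:
            return label
    return "other"

def route(message: str, classifier: Callable[[str], str] = classify) -> str:
    """One inference call, one label, no loop."""
    label = classifier(message)
    # Validate the output against the closed set before acting on it.
    if label not in TEMPLATES:
        label = "other"
    return f"template:{label}"

print(route("I want a refund for my order"))  # template:refund
```

There is no loop, no tool dispatch, and no retry path, so latency and failure modes stay those of a single call.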

With this distinction established — agents as feedback loops where the model owns control flow — the natural next questions become: what raw material does the model need to make good decisions, and what scaffolding does the developer need to build around it? Those are exactly the questions the next section addresses.


The Four Core Components of an Agent

Every agent, regardless of the framework packaging it or the task it was built for, is assembled from the same four structural elements: a model, tools, memory, and a loop. These are not abstractions — each maps to a specific piece of code with a specific runtime responsibility. Understanding what each component owns, and more importantly what it does not own, is the foundation of building agents that are maintainable and debuggable rather than opaque and fragile.

The Model: Reasoning Engine, Not Execution Engine

The LLM sits at the center of an agent, but its role is narrower than it might first appear. The model does exactly two things: it receives a context window full of information, and it emits a structured response. That response is either a final answer or a tool call — a structured declaration that says "I want to invoke this function with these arguments."

The model does not run code. It does not read files. It does not make HTTP requests. It produces text that describes an action, and the runtime decides whether and how to carry it out. This distinction has direct implications for security, error handling, and trust boundaries.

# What the model actually produces when it "decides" to call a tool
response = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_abc123",
            "type": "function",
            "function": {
                "name": "search_knowledge_base",
                "arguments": '{"query": "refund policy for digital goods", "top_k": 3}'
            }
        }
    ]
}
# The model produced structured data. Your code decides what happens next.

Notice that arguments is a JSON-encoded string, not a parsed object. The model serialized its intent into text; your runtime must deserialize and validate it before passing anything to the actual function. The model's output is a proposal, not an instruction with elevated privileges.
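
A minimal sketch of that deserialization step follows. The function name parse_tool_arguments and its (arguments, error) return convention are assumptions made for this example, not part of any provider SDK.

```python
import json
from typing import Any, Optional

def parse_tool_arguments(raw: str) -> tuple[Optional[dict[str, Any]], Optional[str]]:
    """Return (arguments, error). Exactly one of the two is None."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"arguments were not valid JSON: {exc.msg}"
    if not isinstance(parsed, dict):
        return None, "arguments must be a JSON object"
    return parsed, None

args, error = parse_tool_arguments('{"query": "refund policy for digital goods", "top_k": 3}')
print(args, error)
```

Returning the error as a value, rather than raising, lets the loop hand the problem back to the model as an observation if that is the behavior you want.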

💡 Mental Model: Think of the LLM as a capable advisor sitting in a meeting room. It reads every document you slide under the door, writes a recommendation on a notepad, and slides it back. Whether you act on that recommendation — and exactly how — is your decision. The advisor never leaves the room.

🎯 Key Principle: The model's reasoning quality is a function of context quality. Vague tool descriptions, missing history, and poorly formatted observations degrade its decision-making in ways that are hard to diagnose because the model will still produce confident-sounding output.

Tools: Contracts Between the Model and the World

A tool is not just a Python function. It is a function plus a contract that the model can read. That contract has three required parts:

  • Name — a short, unambiguous identifier (search_knowledge_base, not search or do_search_thing)
  • Description — a prose explanation of what the tool does, what it returns, and critically, when to call it versus when not to
  • Input schema — a typed specification of each parameter, typically expressed as a JSON Schema object

At inference time, the tool definitions are serialized and injected into the model's context. The model reads these definitions and uses them to decide which tool to invoke and with what arguments.

# A well-formed tool definition passed to the model at inference time
tool_definition = {
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": (
            "Retrieves the current status and estimated delivery date for a customer order. "
            "Call this when the user asks about a specific order. "
            "Returns a status string ('processing', 'shipped', 'delivered') and an ISO 8601 date. "
            "Do NOT call this for general product availability questions."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The order ID from the user's confirmation email, e.g. 'ORD-20489'"
                }
            },
            "required": ["order_id"]
        }
    }
}

The description here does more than describe: it specifies the return shape and includes an explicit negative case (Do NOT call this for...). Without negative guidance, a model reasoning about a general product question might still reach for get_order_status because the name sounds related. The description is the primary interface through which the model learns tool intent.

⚠️ Common Mistake: Describing what a tool does without specifying what it returns. If the model doesn't know the output format, it cannot reason about whether the result satisfies the user's need or whether another tool call is required.

When the model emits a tool call, your code must look up the function by name, validate the arguments against the schema, execute the function, and return the result as a new message. This dispatch logic is yours to write and own.
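
The four dispatch steps could be sketched roughly as follows. The validate helper checks only required keys and primitive types, so it is an illustration rather than a full JSON Schema validator, and get_order_status here is a canned hypothetical implementation.

```python
import json
from typing import Any, Callable

def get_order_status(order_id: str) -> dict[str, Any]:
    # Hypothetical implementation; a real one would query an order system.
    return {"status": "shipped", "estimated_delivery": "2025-01-15"}

REGISTRY: dict[str, Callable[..., Any]] = {"get_order_status": get_order_status}
SCHEMAS: dict[str, dict[str, Any]] = {
    "get_order_status": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    }
}
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def validate(args: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Check required keys and primitive types only."""
    problems = [f"missing required field '{key}'"
                for key in schema.get("required", []) if key not in args]
    for key, value in args.items():
        spec = schema["properties"].get(key)
        if spec is None:
            problems.append(f"unexpected field '{key}'")
        elif not isinstance(value, JSON_TYPES[spec["type"]]):
            problems.append(f"field '{key}' must be of type {spec['type']}")
    return problems

def dispatch(call_id: str, name: str, raw_arguments: str) -> dict[str, str]:
    """Look up, validate, execute, and wrap the result as a 'tool' message."""
    if name not in REGISTRY:
        result: dict[str, Any] = {"error": f"unknown tool '{name}'"}
    else:
        try:
            args = json.loads(raw_arguments)
        except json.JSONDecodeError:
            args = None
        if not isinstance(args, dict):
            result = {"error": "arguments must be a JSON object"}
        else:
            problems = validate(args, SCHEMAS[name])
            result = {"error": "; ".join(problems)} if problems else REGISTRY[name](**args)
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}

print(dispatch("call_1", "get_order_status", '{"order_id": "ORD-20489"}'))
```

Every branch ends in a well-formed tool message, so the message history stays coherent even when the call is rejected.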

Memory: Context Window and Beyond

Memory in an agent is a spectrum, and conflating its layers is a reliable source of subtle bugs.

At minimum, every agent has in-context memory: the ordered list of messages that constitutes the conversation so far. Each user message, assistant message, tool call, and tool result is appended to this list and fed back to the model on every iteration. This message history is the agent's working memory during a run — temporary, bounded by the model's context window, and lost when the process ends.

+--------------------------------------------------+
| Context Window (in-context memory)               |
|                                                  |
|  [system prompt]                                 |
|  [user: "What's the status of order ORD-20489?"] |
|  [assistant: <tool_call: get_order_status>]      |
|  [tool result: {"status": "shipped", ...}]       |
|  [assistant: "Your order shipped and will..."]   |
|                                                  |
+--------------------------------------------------+
         ^
         | fed back to model on each loop iteration

When tasks grow complex, the message list grows toward the context limit. This is where external memory becomes necessary. External memory generally takes two forms:

  • Semantic / fact storage: a vector store or structured database holding documents, facts, or summarized knowledge. The agent retrieves relevant chunks and injects them into context rather than carrying everything at once.
  • Episodic storage: records of what the agent did in previous runs — past tool call sequences, outcomes, or user preferences. Useful when you need continuity across sessions.

Both external forms require an explicit retrieval step: something must query the store and insert the results into the current message list before the model sees them. That retrieval logic is code you write, not behavior the model provides automatically.
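
As a rough illustration of that retrieval step, the sketch below scores a handful of invented documents by word overlap and injects the best match into the message list. A real system would more likely use a vector index; the overlap scoring is only a stand-in so the example runs without dependencies.

```python
import re

# Toy store: three invented policy snippets.
DOCUMENTS = [
    "Refunds for digital goods are available within 14 days of purchase.",
    "Standard shipping takes 3 to 5 business days.",
    "Gift cards cannot be exchanged for cash.",
]

def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, top_k: int = 1) -> list[str]:
    """Rank documents by the number of words they share with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(doc)), doc) for doc in DOCUMENTS]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]

def build_messages(user_message: str) -> list[dict[str, str]]:
    """Explicit retrieval step: query the store, then inject results into context."""
    messages: list[dict[str, str]] = []
    snippets = retrieve(user_message)
    if snippets:
        context = "Relevant notes:\n" + "\n".join(f"- {s}" for s in snippets)
        messages.append({"role": "system", "content": context})
    messages.append({"role": "user", "content": user_message})
    return messages

for message in build_messages("What is the refund policy for digital goods?"):
    print(message)
```

The model only ever sees what build_messages chose to include. If retrieval returns nothing, the model reasons without that knowledge.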

🎯 Key Principle: The context window is the working surface on which the model reasons. What goes onto that surface — and how it gets there — is a design decision that affects both cost and capability.

The Loop: The Scaffolding You Own

The loop is the host code that orchestrates the other three components. It is the piece most newcomers underestimate, because frameworks often obscure it behind convenience abstractions. But the loop is where your control over agent behavior actually lives.

┌─────────────────────────────────────────────────────────┐
│                    Agent Loop                           │
│                                                         │
│  ┌──────────┐     ┌──────────┐     ┌─────────────────┐ │
│  │  Model   │────▶│ Response │────▶│ Tool call?      │ │
│  │  call    │     │  parse   │     │                 │ │
│  └──────────┘     └──────────┘     │  YES ──▶ execute│ │
│       ▲                            │         append  │ │
│       │                            │         result  │ │
│       │           ┌──────────┐     │                 │ │
│       └───────────│ Append   │◀────│  NO  ──▶ return │ │
│                   │ to msgs  │     │         answer  │ │
│                   └──────────┘     └─────────────────┘ │
│                                                         │
│  Stopping conditions checked on every iteration:        │
│   • Final answer signal from model                      │
│   • Max iteration count reached                         │
│   • Terminal tool result flagged                        │
└─────────────────────────────────────────────────────────┘

Each time the model returns a tool call, your loop executes the tool, appends the result as a new message with role "tool", and calls the model again with the updated message list. The model now sees its previous reasoning plus the tool result as a new observation.

🎯 Key Principle: The loop is where you enforce policy. Maximum steps, timeouts, blocked tool names, human-in-the-loop approval gates — all of these are implemented in the loop, not inside the model. If you cede control of the loop to a framework you don't understand, you cede control of agent behavior.
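
One possible shape for such a policy check is sketched below. LoopPolicy, check_tool_call, and the deny_all approval hook are names invented for this example; the loop would call the check before executing each tool call and refuse when it returns a reason.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass(frozen=True)
class LoopPolicy:
    max_steps: int = 10
    timeout_seconds: float = 30.0
    blocked_tools: frozenset = frozenset()
    approval_required: frozenset = frozenset()

def deny_all(name: str, args: dict[str, Any]) -> bool:
    """Default approval hook: refuse. A real hook might ask a human reviewer."""
    return False

def check_tool_call(
    policy: LoopPolicy,
    step: int,
    elapsed_seconds: float,
    name: str,
    args: dict[str, Any],
    approve: Callable[[str, dict[str, Any]], bool] = deny_all,
) -> Optional[str]:
    """Return None if the call may proceed, otherwise the reason it is refused."""
    if step >= policy.max_steps:
        return "step budget exhausted"
    if elapsed_seconds > policy.timeout_seconds:
        return "time budget exhausted"
    if name in policy.blocked_tools:
        return f"tool '{name}' is blocked by policy"
    if name in policy.approval_required and not approve(name, args):
        return f"tool '{name}' was not approved"
    return None

policy = LoopPolicy(
    max_steps=3,
    blocked_tools=frozenset({"delete_repo"}),
    approval_required=frozenset({"run_tests"}),
)
print(check_tool_call(policy, 0, 1.0, "delete_repo", {}))
```

Because the check lives in host code, the model cannot talk its way past it. It can only propose a call and observe the refusal.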

How the Four Components Interact

With each component defined, tracing a single agent turn end-to-end shows how they compose:

  1. The loop assembles the current message list (in-context memory) and passes it — along with tool definitions — to the model.
  2. The model reads the context, selects a tool, and emits a structured tool call.
  3. The loop receives the tool call, dispatches it to the matching function, and captures the return value.
  4. The result is appended to the message list as a new "tool" message, extending the in-context memory.
  5. Steps 1–4 repeat until the model emits a final answer or a stopping condition fires.

External memory enters at step 1: before calling the model, the loop may query a vector store and inject retrieved documents into the message list. The model never sees the raw function implementation — only the tool definition and the return value you chose to give it. This means you can summarize, filter, or structure the return value before appending it.
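
Shaping a return value before it reaches the model might look like the sketch below. The field names in the raw result are invented, and shape_result is a hypothetical helper: it keeps an allow-list of fields and caps long lists while recording how many items were dropped.

```python
import json
from typing import Any, Iterable

def shape_result(raw: dict[str, Any], keep: Iterable[str], max_items: int = 3) -> str:
    """Keep only allow-listed fields and cap long lists before the model sees them."""
    shaped: dict[str, Any] = {}
    for key in keep:
        value = raw.get(key)
        if isinstance(value, list) and len(value) > max_items:
            shaped[key] = value[:max_items]
            shaped[f"{key}_omitted"] = len(value) - max_items  # tell the model what was cut
        else:
            shaped[key] = value
    return json.dumps(shaped)

raw_result = {
    "status": "shipped",
    "internal_cost_cents": 1234,  # not something the model needs
    "events": ["created", "paid", "packed", "shipped", "in_transit"],
}
print(shape_result(raw_result, keep=("status", "events")))
```

Recording the omitted count matters: without it, the model has no way to know the list was truncated.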

📋 Quick Reference: Component Responsibilities

Component   Owns                                        Does NOT own
────────────────────────────────────────────────────────────────────────────────────
Model       Reasoning, action selection                 Tool execution, loop control
Tools       Typed contracts + implementations           Deciding when to be called
Memory      Context window + optional external store    Deciding what to retrieve
Loop        Orchestration, stopping conditions          What the model decides

These four components come together in fewer than 40 lines of concrete code, which is exactly what the next section demonstrates.


Anatomy of a Minimal Agent in Code

The best way to understand how a model, tools, memory, and a loop relate to each other is to write the loop yourself — without a framework abstracting it away. What follows is a minimal but complete implementation. The goal is not production code; it is a clear skeleton that shows you exactly which lines correspond to which architectural decisions.

A Minimal Implementation

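The example follows the OpenAI-style chat-completions shape (client.chat.completions.create, finish_reason, tool_calls). Other providers expose similar but not identical interfaces, so field names may differ. The tool is a simple calculator that evaluates arithmetic expressions.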

import json
from typing import Any

# --- Tool definition: data passed to the model at inference time ---
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": (
                "Evaluates a mathematical expression and returns the numeric result. "
                "Use this when the user asks you to compute something. "
                "Input must be a valid arithmetic expression, e.g. '(3 + 4) * 2'."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "The arithmetic expression to evaluate."
                    }
                },
                "required": ["expression"]
            }
        }
    }
]

def calculate(expression: str) -> dict[str, Any]:
    """Execute the tool and return a structured result."""
    try:
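        # eval is used here for brevity only. Emptying __builtins__ is not a
        # sandbox, and model-generated expressions are untrusted input, so
        # replace this with a real expression parser outside of a demo.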
        result = eval(expression, {"__builtins__": {}}, {})
        return {"result": result, "error": None}
    except Exception as e:
        return {"result": None, "error": str(e)}

# Dispatch table: maps tool names to callables
TOOL_REGISTRY = {"calculate": calculate}

def run_agent(client, model: str, user_message: str, max_iterations: int = 10) -> str:
    """A minimal agent loop. Returns the model's final answer."""
    # The message list IS the agent's working memory for this run.
    messages = [{"role": "user", "content": user_message}]

    for iteration in range(max_iterations):  # stopping condition: max iterations
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=TOOLS,
            tool_choice="auto"
        )
        choice = response.choices[0]

        if choice.finish_reason == "stop":  # stopping condition: final answer
            return choice.message.content

        if choice.finish_reason == "tool_calls":
            # Append the assistant's tool-call request to memory
            messages.append(choice.message)

            for tool_call in choice.message.tool_calls:
                name = tool_call.function.name
                args = json.loads(tool_call.function.arguments)

                # Developer's responsibility: dispatch the call
                tool_fn = TOOL_REGISTRY.get(name)
                if tool_fn is None:
                    tool_result = {"error": f"Unknown tool: {name}"}
                else:
                    tool_result = tool_fn(**args)

                # Append result as an observation — this is how the
                # model "sees" what happened.
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": json.dumps(tool_result)
                })

    # Stopping condition: max iterations exhausted
    return "[Agent did not reach a final answer within the iteration limit.]"

Count the lines from for iteration in range to the final return — comfortably under forty. Everything else is setup: tool definitions, the tool function, and the registry that dispatches calls.
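
Because the calculator relies on eval, it is worth seeing what a stricter version could look like. The sketch below is one option under stated assumptions: it walks the parsed syntax tree with Python's ast module and accepts only numbers, the four basic operators, and unary plus or minus. The name calculate_safely and the choice to leave out exponentiation are decisions made for this example.

```python
import ast
import operator
from typing import Any, Union

Number = Union[int, float]

_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def _evaluate(node: ast.AST) -> Number:
    """Recursively evaluate an allow-listed subset of expression nodes."""
    if (isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
        return _BINARY[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
        return _UNARY[type(node.op)](_evaluate(node.operand))
    raise ValueError("unsupported expression")

def calculate_safely(expression: str) -> dict[str, Any]:
    """Same return shape as calculate, but nothing outside the allow-list runs."""
    try:
        tree = ast.parse(expression, mode="eval")
        return {"result": _evaluate(tree.body), "error": None}
    except (SyntaxError, ValueError, ZeroDivisionError) as exc:
        return {"result": None, "error": str(exc)}

print(calculate_safely("(17 * 3) + 4"))  # {'result': 55, 'error': None}
```

Function calls, attribute access, and names all fall through to the ValueError branch, so an input such as `__import__('os')` is rejected rather than executed.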

Tool Definitions Are Just Data

TOOLS is a plain Python list of dictionaries. The model never imports your code, never calls calculate directly, and has no awareness of your Python environment. At inference time, the tool definitions are serialized and included in the request — they become part of the prompt context in a structured form the model has been trained to reason about.

When the model decides a tool should be called, the response contains structured tool-call data: a function name and a JSON-encoded argument payload. Parsing it, validating the arguments, looking up the right function in TOOL_REGISTRY, calling it, and serializing the result are all your responsibility as the developer.

🎯 Key Principle: The model produces descriptions of actions. The runtime produces the actions themselves. This boundary gives you control over what an agent can actually do.

The Message List as Working Memory

Every observation the agent accumulates during a run lives in messages. Here's what the list looks like after one tool call on the query "What is (17 * 3) + 4?":

# State of `messages` after one tool round-trip
[
    # Turn 0: the original user request
    {"role": "user", "content": "What is (17 * 3) + 4?"},

    # Turn 1: the model decides to call a tool
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_abc123",
                "type": "function",
                "function": {
                    "name": "calculate",
                    "arguments": "{\"expression\": \"(17 * 3) + 4\"}"
                }
            }
        ]
    },

    # Turn 2: the tool result appended as an observation
    {
        "role": "tool",
        "tool_call_id": "call_abc123",
        "content": "{\"result\": 55, \"error\": null}"
    }
    # On the next iteration, the model sees all three messages
    # and produces a final natural-language answer.
]

The model has no persistent state between API calls. When you call it on the second iteration, it re-reads the entire messages list from scratch. This is why appending every tool result as a new message is not optional — skipping that step means the model answers the next question without knowing what just happened.

⚠️ Common Mistake: Discarding intermediate messages and only passing the latest tool result to the model. If the task requires three tool calls in sequence, the model needs to see all prior results to chain them correctly.
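
The effect of a missing tool result can be demonstrated with a toy stand-in for the model. The stateless_model function and the simplified message shapes below are assumptions for illustration; the point is only that the function knows nothing beyond the list it is handed.

```python
from typing import Any

def stateless_model(messages: list[dict[str, Any]]) -> dict[str, Any]:
    """Stand-in for an LLM call: its only knowledge is the message list."""
    tool_results = [m["content"] for m in messages if m["role"] == "tool"]
    if tool_results:
        return {"role": "assistant", "content": f"The answer is {tool_results[-1]}."}
    # No observation available, so it asks for the tool (again).
    return {"role": "assistant", "content": None, "tool_call": "calculate"}

history = [
    {"role": "user", "content": "What is (17 * 3) + 4?"},
    {"role": "assistant", "content": None, "tool_call": "calculate"},
    {"role": "tool", "content": "55"},
]

with_result = stateless_model(history)          # full history
without_result = stateless_model(history[:2])   # tool result was never appended

print(with_result["content"])         # The answer is 55.
print(without_result["tool_call"])    # calculate
```

With the full history the stand-in answers; with the tool message dropped it requests the same tool again, which is how a forgotten append turns into a loop.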

Stopping Conditions Must Be Explicit

The loop has exactly two exit paths, each deliberate:

Exit path                                    What it signals
──────────────────────────────────────────────────────────────────────────────
finish_reason == "stop"                      Model issued a final answer
range(max_iterations) exhausted, then the    Safety ceiling reached without a
  return after the loop                        final answer

A third pattern common in practice is a terminal tool result, meaning a tool that explicitly signals completion:

# Example: a terminal tool that commits a structured answer
def submit_answer(answer: str, confidence: float) -> dict[str, Any]:
    """Called by the model when it is ready to commit a final answer.
    The agent loop checks the 'terminal' flag to exit.
    """
    return {"answer": answer, "confidence": confidence, "terminal": True}

# Inside the loop, after getting the tool result:
if tool_result.get("terminal"):
    return tool_result["answer"]  # stopping condition: terminal tool result

The submit_answer pattern is particularly useful for tasks producing structured outputs — a classification decision, a form filled with extracted fields — where a free-text stop response is too ambiguous to parse reliably downstream.
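
For the model to call submit_answer it also needs a tool definition. The sketch below shows one plausible definition plus a variant of the function that rejects out-of-range confidence values without ending the run. The description wording and the minimum and maximum keywords are assumptions; whether a given provider enforces those keywords is not guaranteed, which is why the function checks the range itself.

```python
from typing import Any

SUBMIT_ANSWER_TOOL = {
    "type": "function",
    "function": {
        "name": "submit_answer",
        "description": (
            "Commits the final answer and ends the task. "
            "Call this exactly once, when no further tool calls are needed."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "answer": {"type": "string", "description": "The final answer text."},
                "confidence": {
                    "type": "number",
                    "description": "Confidence in the answer, between 0 and 1.",
                    "minimum": 0,
                    "maximum": 1,
                },
            },
            "required": ["answer", "confidence"],
        },
    },
}

def submit_answer(answer: str, confidence: float) -> dict[str, Any]:
    """Return a terminal result only when the arguments are usable."""
    if not 0.0 <= confidence <= 1.0:
        # Not terminal: the loop keeps going and the model can resubmit.
        return {"error": "confidence must be between 0 and 1", "terminal": False}
    return {"answer": answer, "confidence": confidence, "terminal": True}

print(submit_answer("refund_request", 0.9))
```

Keeping the invalid case non-terminal means a malformed submission becomes an observation the model can correct, rather than a bad value leaking downstream.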

💡 Pro Tip: Set max_iterations based on the expected complexity of your task. If your agent legitimately needs more than ten tool calls to answer a question, that is a signal to reconsider the task decomposition, not to raise the ceiling to one hundred.

This implementation handles the happy path cleanly. A production system would also handle malformed tool-call arguments, rate-limit retries, and context-window exhaustion — but those are extensions of the same loop structure. Frameworks like LangChain and LlamaIndex provide scaffolding for exactly these concerns; underneath every one of them is a loop with the same shape.


Common Pitfalls When Building Agents

Building your first agent loop is surprisingly quick. What takes longer to learn is why agents that look correct on paper fail expensively in production. The failures cluster around a small set of structural mistakes that almost every developer encounters when first moving from single-pass LLM calls to iterative agent loops.

Pitfall 1: Unbounded Loops

An agent loop runs until a stopping condition is met. When developers first sketch a loop, they typically handle the happy path and forget to enforce an upper bound on how many steps that process can take.

What happens when the model becomes confused? It may call the same tool repeatedly with slightly different arguments, never converging. It may alternate between two tools in a cycle. Without a ceiling, the loop runs indefinitely, consuming tokens and API credits.

⚠️ Common Mistake: Treating the model's eventual convergence as guaranteed. Language models do not have a built-in termination guarantee. The stopping condition is your responsibility.

def run_agent(messages: list[dict], tools: list[dict], max_steps: int = 10) -> str:
    """
    Run an agent loop with an explicit upper bound on iterations.
    Returns the final answer text, or raises if the limit is exceeded.
    """
    for step in range(max_steps):
        response = call_model(messages=messages, tools=tools)

        if response.finish_reason == "stop":
            return response.content

        tool_call = response.tool_call
        result = dispatch_tool(tool_call)

        messages.append({"role": "assistant", "tool_call": tool_call})
        messages.append({"role": "tool", "content": result, "tool_call_id": tool_call.id})

    raise RuntimeError(
        f"Agent did not terminate within {max_steps} steps. "
        "Check tool descriptions and model instructions."
    )

Two details matter: the exception message is written for the developer, not passed back to the model (connecting to Pitfall 3 below), and max_steps is a parameter rather than a hard-coded constant, making it adjustable per task.

💡 Pro Tip: Keep a step count in your logs even when the loop terminates normally. Agents that regularly consume eight or nine of a ten-step budget are a warning sign that either the task is too complex for the current tool set, or something in the tool descriptions is causing the model to work harder than necessary.
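
A small helper along these lines can make that signal visible. The function name, the logger name, and the 80% warning threshold are all choices made for this sketch.

```python
import logging

logger = logging.getLogger("agent")

def record_step_usage(steps_used: int, max_steps: int, warn_ratio: float = 0.8) -> bool:
    """Log how much of the step budget a run consumed.

    Returns True when the run came close to the ceiling, so callers can
    count near-misses as a metric.
    """
    near_limit = steps_used / max_steps >= warn_ratio
    level = logging.WARNING if near_limit else logging.INFO
    logger.log(level, "agent finished in %d of %d steps", steps_used, max_steps)
    return near_limit

print(record_step_usage(9, 10))  # True
print(record_step_usage(3, 10))  # False
```

Call it once per run, on the normal return path as well as the failure path, so that runs which barely fit the budget show up before they start failing.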

Pitfall 2: Vague Tool Descriptions

The model cannot inspect your tool's source code. Its only window into what a tool does is the description you provide alongside the name and input schema. A vague or incomplete description does not just reduce readability — it actively degrades the model's ability to reason about when and how to use the tool.

Most vague descriptions fail in one of two ways: they describe the mechanism but not the output, or they omit when to call the tool.

# ❌ Vague: describes only what the tool does mechanically
vague_tool = {
    "name": "get_order_status",
    "description": "Gets the status of an order.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string"}
        },
        "required": ["order_id"]
    }
}

# ✅ Specific: describes the input, the output shape, and when to call it
specific_tool = {
    "name": "get_order_status",
    "description": (
        "Look up the current fulfillment status of a single order by its ID. "
        "Returns a JSON object with fields: status (one of 'processing', "
        "'shipped', 'delivered', 'cancelled'), estimated_delivery_date "
        "(ISO 8601 string, null if not yet shipped), and carrier_tracking_url "
        "(string or null). Call this tool only after you have a valid order_id "
        "from the user or from a prior tool result — do not attempt to infer "
        "or fabricate order IDs."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "The order identifier, formatted as ORD-XXXXXXXX."
            }
        },
        "required": ["order_id"]
    }
}

The specific description names the return fields, gives each field's type and possible values, states when calling is appropriate, and warns against fabricating input.

🎯 Key Principle: Write tool descriptions as if for a competent junior developer who has never seen your codebase and will only have access to this text. If that person could not confidently call the tool correctly from your description alone, neither can the model.

Description quality checklist
──────────────────────────────────────────────
 ✅  What does the tool RETURN? (fields, types)
 ✅  What does each return value MEAN?
 ✅  When is it appropriate to call this tool?
 ✅  When should it NOT be called?
 ✅  What does a valid input look like?
──────────────────────────────────────────────

⚠️ Common Mistake: Updating tool behavior without updating the description. The model's reasoning is driven by the description provided at inference time — if the underlying function changes but the description does not, the model will reason about behavior that no longer exists.
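
Parts of the checklist above can be checked mechanically, which also gives description drift a place to surface in CI. The following is a rough heuristic sketch, not a substitute for review: the function name `lint_tool_description`, the cue phrases, and the 80-character threshold are all assumptions chosen for illustration.

```python
def lint_tool_description(tool: dict, min_length: int = 80) -> list[str]:
    """Return human-readable warnings for one tool definition."""
    warnings = []
    description = tool.get("description", "")
    lowered = description.lower()

    if len(description) < min_length:
        warnings.append("description is very short")
    if "return" not in lowered:
        warnings.append("description does not say what the tool returns")
    # Crude cue phrases that suggest the text covers when (not) to call the tool
    if not any(cue in lowered for cue in ("call this", "use this", "only after", "do not")):
        warnings.append("description does not say when to call the tool")

    properties = tool.get("parameters", {}).get("properties", {})
    for name, schema in properties.items():
        if "description" not in schema:
            warnings.append(f"parameter '{name}' has no description")
    return warnings


# A deliberately vague definition to exercise the checks
vague_example = {
    "name": "get_order_status",
    "description": "Gets the status of an order.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}
problems = lint_tool_description(vague_example)
```

A keyword check like this cannot tell whether a description is accurate, only whether the expected kinds of information appear to be present.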

Pitfall 3: Returning Raw Tracebacks as Tool Results

When a tool call fails, the loop needs to record a result — the message history must stay coherent regardless of what happened. A common first implementation catches the exception and returns str(e) or the full traceback as the tool result.

This creates a subtle failure mode. A raw Python traceback contains information useful to a developer — file paths, line numbers, internal variable names — but it is largely noise to the model. Worse, it may signal to the model that the correct strategy is to retry the same call, which is rarely true when the error is a configuration problem, a missing external dependency, or a permissions failure. The model loops on the error, compounding Pitfall 1.

import json
import traceback

## ❌ Returns a raw traceback, which tends to trigger retry loops
def dispatch_tool_naive(tool_call) -> str:
    try:
        return execute_tool(tool_call.name, tool_call.arguments)
    except Exception:
        return traceback.format_exc()  # Noisy, encourages retry loops


def error_result(kind: str, message: str) -> str:
    # json.dumps escapes quotes and newlines, so the result is always valid JSON
    return json.dumps({"error": kind, "message": message})


## ✅ Returns a structured, actionable error message
def dispatch_tool(tool_call) -> str:
    try:
        return execute_tool(tool_call.name, tool_call.arguments)
    except ValueError as e:
        # Input validation failure: the model should correct its argument
        return error_result("invalid_input", str(e))
    except PermissionError:
        # The model cannot fix this by retrying
        return error_result(
            "permission_denied",
            "Access to this resource is not allowed. Inform the user and stop.",
        )
    except Exception as e:
        # Catch-all: short, non-technical description
        return error_result(
            "tool_failed",
            f"The tool encountered an unexpected error: {type(e).__name__}. "
            "Do not retry. Report the failure to the user.",
        )

The structured error message does three things the raw traceback cannot: it classifies the error so the model can reason about the right response, it strips away implementation details irrelevant to the reasoning task, and it includes explicit behavioral guidance — "do not retry," "inform the user" — which short-circuits the loop behavior that raw tracebacks produce.
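
Because the error is classified, the host loop can also read it and does not have to rely solely on the model following the guidance. The sketch below assumes the three error classes used in the dispatch example (`invalid_input`, `permission_denied`, `tool_failed`); treating only `invalid_input` as retryable is an assumption made for illustration, and `should_stop_after` is a hypothetical helper name.

```python
import json

# Which error classes count as retryable is an assumption for this sketch.
RETRYABLE_ERRORS = {"invalid_input"}


def should_stop_after(tool_result: str) -> bool:
    """Return True when a tool result reports an error that retrying cannot fix."""
    try:
        payload = json.loads(tool_result)
    except json.JSONDecodeError:
        return False  # Not JSON, so not one of the structured errors
    if not isinstance(payload, dict) or "error" not in payload:
        return False  # A normal, successful result
    return payload["error"] not in RETRYABLE_ERRORS
```

A loop could call this after each tool result and end the run early with a clear failure message, which bounds the cost of a non-recoverable error to a single step.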

Pitfall 4: Testing Agent Output as if It Were Deterministic

The natural instinct — sharpened by years of writing unit tests for deterministic functions — is to record a successful run, capture the exact sequence of tool calls, and assert that future runs reproduce it. This produces a passing test suite on the day you write it and a brittle CI pipeline thereafter.

As established in What Makes an Agent an Agent: agents are non-deterministic by design. The model may reach the same correct outcome via a different sequence of tool calls. A test asserting tool_calls == ["search", "summarize"] will fail on a run where the model correctly calls ["search", "filter", "summarize"] — even though both runs produce a valid answer.

FRAGILE: Testing exact tool call sequences
──────────────────────────────────────────────────────────────────
 Run 1:  search("Q3 revenue") → summarize() → "Revenue was $4.2M"
 Run 2:  search("Q3 revenue") → filter(year=2024) → summarize() → "Revenue was $4.2M"

  ❌ assert tool_calls == ["search", "summarize"]  ← fails on Run 2
  ✅ assert final_answer contains "4.2M"            ← passes on both
  ✅ assert summarize_tool.was_called              ← passes on both
──────────────────────────────────────────────────────────────────

## ❌ Brittle: asserts on the exact sequence of tool calls
def test_order_lookup_brittle(agent, mock_tools):
    result = agent.run("What's the status of order ORD-00123?")
    assert mock_tools.calls == [
        ("get_order_status", {"order_id": "ORD-00123"})
    ]


## ✅ Robust: asserts on observable outcomes and side effects
def test_order_lookup_robust(agent, mock_tools):
    result = agent.run("What's the status of order ORD-00123?")

    assert result.final_answer is not None
    assert "ORD-00123" in result.final_answer
    assert mock_tools.was_called_with("get_order_status", order_id="ORD-00123")
    assert result.step_count <= 5

❌ Wrong thinking: "If the tool call sequence changes, the agent is broken."

✅ Correct thinking: "If the outcome or side effects change, the agent is broken. Tool call sequences are the model's implementation detail."

💡 Pro Tip: End-to-end smoke tests can use temperature=0 to check that the system wires together correctly. Behavioral tests should be written to tolerate path variance. Note: temperature=0 reduces but does not eliminate variance — treat deterministic-seeming runs as a convenience, not a guarantee.
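
One way to make a behavioral test tolerate path variance is to run the agent several times and assert on the fraction of runs whose outcome passes a check. This is a minimal sketch under stated assumptions: `outcome_pass_rate` is a hypothetical helper, and `fake_agent` is a stand-in that cycles through canned answers in place of a real model call.

```python
import itertools
from typing import Callable


def outcome_pass_rate(
    run_agent: Callable[[str], str],
    prompt: str,
    check: Callable[[str], bool],
    runs: int = 5,
) -> float:
    """Run the agent several times; return the fraction of answers that pass check."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(1 for _ in range(runs) if check(run_agent(prompt)))
    return passed / runs


# Stand-in agent: the wording varies between runs, the outcome does not
_answers = itertools.cycle(["Revenue was $4.2M", "Q3 revenue came to $4.2M"])


def fake_agent(prompt: str) -> str:
    return next(_answers)


rate = outcome_pass_rate(
    fake_agent, "What was Q3 revenue?", lambda answer: "4.2M" in answer, runs=4
)
```

The threshold to assert against (for example, requiring every run to pass versus most runs) is a judgment call that depends on how much variation the task can tolerate.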

📋 Quick Reference: Test Strategy for Agents

Test Type     Assert On                      Do Not Assert On
──────────────────────────────────────────────────────────────────────────
Behavioral    Final answer correctness       Exact tool call sequence
Side effects  DB writes, API calls made      Argument phrasing variations
Budget        Step count within limit        Which step a tool was called on
Smoke test    Loop terminates without error  Specific intermediate messages

How the Pitfalls Compound

These four pitfalls are not independent. An unbounded loop (Pitfall 1) is often triggered by a vague tool description (Pitfall 2) causing the model to call the wrong tool, which then returns a raw traceback (Pitfall 3), which the model attempts to recover from by looping. The test suite misses all of this because it was written to check exact call sequences rather than outcomes (Pitfall 4).

Pitfall interaction map

  Vague description
       │
       ▼
  Model calls wrong tool ──► Tool fails ──► Raw traceback returned
                                                   │
                                                   ▼
                                          Model retries (loops)
                                                   │
                                                   ▼
                                          No max_steps ──► Token budget exhausted
                                                   │
                                                   ▼
                                          Test checks call sequence ──► Misses the failure

Addressing all four together — enforce a step ceiling, write precise descriptions, return structured errors, test on outcomes — gives you an agent loop that fails loudly and quickly rather than silently and expensively.

🧠 Mnemonic: LDTE (Limit steps, Describe outputs, Translate errors, Evaluate outcomes). These four disciplines cover the most common structural failure modes in early agent implementations.


Key Takeaways and What Comes Next

Every section of this lesson has been building toward a single, durable insight: agents are a specific architectural pattern — a model reasoning inside a loop, calling tools, accumulating memory — that trades the predictability of hardcoded control flow for the flexibility of runtime decision-making.

The Four-Element Structure

Regardless of which framework you use, every agent is assembled from the same four elements: a model, tools, memory, and a loop. Frameworks differ in how much scaffolding they provide out of the box; they do not differ in whether the scaffolding exists.

When something breaks, the four-element structure gives you a checklist:

  • Model: Was the reasoning context complete and correctly formatted?
  • Tools: Was the tool description accurate enough that the model could choose correctly?
  • Memory: Did the model have the observations it needed, or was relevant context missing?
  • Loop: Did the host code correctly dispatch the tool call, append the result, and resume with the right stopping condition?

In practice, many agent bugs fall into the tool or loop categories. The usual cause is not that the model reasoned poorly, but that a tool description was ambiguous or the loop logic had an unhandled edge case.

📋 Quick Reference: The Four Elements

Element  Owner           What Can Go Wrong
───────────────────────────────────────────────────────────────────
Model    API / provider  Incomplete context, bad tool selection
Tools    Developer       Vague descriptions, unhandled errors
Memory   Developer       Missing observations, stale context
Loop     Developer       No step cap, missing stop signal, bad dispatch

Note the pattern: three of the four elements are developer-owned. The model controls reasoning and action selection; everything else is code you write and can inspect directly.

The Developer/Model Boundary Is a Debugging Superpower

When a bug report arrives — "the agent did something unexpected" — the first question is: did the model select the wrong action, or did the loop execute the right action incorrectly? These are different failures with different fixes.

Consider an agent that is supposed to look up a customer record and then summarize it, but instead calls the lookup tool twice and never produces a summary. Two very different causes could produce this symptom:

  1. Model-side: The tool description for "summarize" was missing or unclear, so the model didn't know it could produce a final answer without calling another tool.
  2. Loop-side: The stopping condition checked for a literal string in the response, but the model phrased its conclusion differently, so the loop kept running.

The loop below keeps the two responsibilities visibly separate:

import json
from typing import Any, Callable

def run_agent_loop(
    client,
    model: str,
    messages: list[dict],
    tools: list[dict],
    tool_registry: dict[str, Callable[..., Any]],
    max_steps: int = 10,
) -> str:
    """
    Minimal agent loop with an explicit developer/model boundary.
    The model produces tool calls or a final answer.
    This function owns: dispatching, appending results, and stopping.
    """
    for _step in range(max_steps):
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=tools,
        )
        message = response.choices[0].message

        # --- Developer boundary: we interpret the model's output ---
        if message.tool_calls:
            # Record the assistant turn so each tool result has a matching call
            messages.append(message)

            for tool_call in message.tool_calls:
                name = tool_call.function.name

                # Developer-owned: dispatch and error handling
                if name not in tool_registry:
                    result = {"error": f"Unknown tool: {name}"}
                else:
                    try:
                        # Parse inside the try so malformed JSON arguments
                        # become a tool error instead of crashing the loop
                        args = json.loads(tool_call.function.arguments)
                        result = tool_registry[name](**args)
                    except Exception as exc:
                        result = {"error": str(exc)}

                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    # default=str keeps non-JSON-native return values serializable
                    "content": json.dumps(result, default=str),
                })
        else:
            # No tool call means the model is done reasoning
            return message.content or ""

    return "[Agent reached maximum steps without producing a final answer]"

If the agent loops unexpectedly, you read the loop code. If the agent picks the wrong tool, you revisit the tool descriptions. The boundary makes the search space for bugs smaller.
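
The loop-side cause described earlier, a stopping condition that looks for a literal string, can be contrasted with a structural check in a few lines. This is a sketch only: `ModelMessage` is a hypothetical stand-in for a provider's response message, and the `"FINAL ANSWER:"` marker is an invented example of the kind of phrase a fragile check might depend on.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelMessage:
    """Hypothetical stand-in for a provider's response message."""
    content: Optional[str] = None
    tool_calls: list = field(default_factory=list)


def is_done_fragile(message: ModelMessage) -> bool:
    # Loop-side bug: depends on the model using one exact phrase
    return message.content is not None and "FINAL ANSWER:" in message.content


def is_done_structural(message: ModelMessage) -> bool:
    # Done when the model requested no tool calls, however it words the answer
    return not message.tool_calls


reply = ModelMessage(content="The customer has two open tickets.")
fragile_verdict = is_done_fragile(reply)        # False: the loop would keep running
structural_verdict = is_done_structural(reply)  # True: the loop stops
```

The structural check matches the stopping rule in the loop above, where the absence of tool calls is what ends the run.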

💡 Mental Model: Think of the model as a subcontractor who submits work orders. The host loop is the project manager who reads each work order, executes it, and decides when the job is done. The subcontractor never picks up a hammer; the project manager never decides what to build.

Determinism vs. Emergence: The Central Design Tension

Agents are valuable precisely because they can handle situations their authors didn't anticipate. That same flexibility is what makes them difficult to verify. The behavior that handles the unanticipated case is emergent: it wasn't encoded explicitly, it arose from the model's reasoning given the context at that moment.

  MORE DETERMINISTIC ◄──────────────────────────► MORE EMERGENT

  Rigid scripts                         Fully autonomous agents
  (no emergence)                        (high emergence)

  BENEFITS OF DETERMINISM:            BENEFITS OF EMERGENCE:
  ✓ Testable with exact assertions    ✓ Handles unanticipated inputs
  ✓ Auditable step-by-step            ✓ Composes tools in novel ways
  ✓ Predictable cost and latency      ✓ Adapts without code changes

  COST OF DETERMINISM:                COST OF EMERGENCE:
  ✗ Breaks on unenumerated inputs     ✗ Behavior harder to verify
                                      ✗ Harder to bound token spend
                                      ✗ Regressions can be subtle

The practical question is: where on this spectrum does your task sit? Tasks with well-defined outputs and low tolerance for variation sit closer to the deterministic end. Tasks where the full range of inputs cannot be enumerated sit closer to the emergent end, and your verification strategy shifts from "assert exact output" to "assert correct outcome."

🤔 Did you know? "Emergent" here is used in the technical sense — behavior that arises from component interactions rather than being specified directly. It does not imply that the behavior is unpredictable in principle, only that it isn't traceable to a single line of code. This distinction matters when explaining agent behavior to stakeholders: "the model reasoned its way to a valid approach" is different from "we don't know what it did."

What the Child Lessons Address

The three topics that follow each address one face of the determinism-vs-emergence tension directly:

  • The ReAct pattern gives the model's emergent reasoning a visible structure — Reasoning, then Action — making it observable without eliminating flexibility. It addresses the auditability problem.
  • Memory architectures address what context is available when the model reasons. Better memory design reduces the variance in emergent behavior by ensuring the model has the right information rather than improvising around gaps.
  • The autonomy arc (from assisted to autonomous operation) addresses how much emergence is appropriate at each stage of deployment, and how to increase autonomy incrementally as verification confidence grows.

The Conceptual Shift This Lesson Was Building Toward

Before this lesson, the natural mental model for software behavior is: code runs, you can read it, what it does is what you wrote. That model fails for agents in a specific, structural way: agents contain a reasoning step you did not write and cannot read at the level of individual decisions.

This means the developer's job shifts from specifying behavior to shaping the conditions under which behavior emerges. You specify tools, descriptions, memory contents, stopping conditions, and loop logic. The model does the rest. The quality of your agent depends on how well you shape those conditions.

✅ Correct thinking: I own the scaffolding. I design for the behavior I want by controlling what context and tools the model has access to, and I verify by checking outcomes.

❌ Wrong thinking: I need to anticipate every path the model might take and handle each one explicitly in code.

🧠 Mnemonic: "Models Reason, Developers Run." The Model and its Reasoning are the provider's domain; the Developer Runs the loop, registers the tools, and manages memory. Use it as a first-order heuristic for where to look when something breaks.

The four components, the developer/model boundary, and the determinism-vs-emergence tension describe the structure of agents, not their behavior in any specific domain. The structure is stable across frameworks, model providers, and task types. When you encounter a new framework, map it back to these four elements first — the framework may have different names for the components, but the components will be there.