
Foundations of Agentic AI

Understand what agents truly are, how they differ from assistants and chatbots, and where they fit in the software lifecycle.

Why Agentic AI Changes How We Write Software

The word "agent" gets applied to everything from a simple chatbot with a search button to fully autonomous systems managing cloud infrastructure. That breadth of usage is a problem: when a term covers too much, it stops helping you make design decisions. Before examining where agents fit in a software lifecycle or how to build them well, it's worth establishing what an agent actually is — and why the distinction matters architecturally, not just terminologically.

Traditional LLM integrations are, at their core, stateless function calls. You pass in a prompt, receive a completion, and the interaction is over. The model has no persistent awareness of what came before or what comes after. This is a reasonable model for a large class of tasks: summarization, classification, code explanation, draft generation. The system design assumptions are familiar — inputs in, outputs out, errors handled at the call site.

Agents break every one of those assumptions. They introduce state (what happened in prior steps), goals (what the system is trying to accomplish over multiple steps), and decision loops (repeated cycles of perceiving, reasoning, and acting). These aren't incremental additions to a single LLM call — they represent a qualitatively different class of system, one that can fail in qualitatively different ways.

Cascading tool misuse, goal drift, and unintended side effects don't appear in conventional API integrations because conventional integrations don't take external actions or chain decisions. An agent that writes a file, reads it back, and makes a subsequent decision based on that content has created a dependency chain that doesn't exist in any single-turn interaction. A wrong decision in step two can compound through steps three, four, and five before a human sees any output.

This means that software teams adopting agents must account for non-determinism at the architectural level, not just at the prompt level. A prompt can be tuned; an architecture has to be designed. The rest of this lesson builds that design vocabulary — starting with a precise definition of what an agent is, working through the mechanics of how it executes, mapping it onto the software lifecycle, and finishing with the concrete mistakes that cause most early-stage agent failures.

This lesson focuses on foundational concepts. Agent anatomy, multi-agent patterns, and context engineering are covered in dedicated follow-on lessons.


What an Agent Actually Is: A Working Definition

The Functional Definition

An AI agent is a system that perceives inputs, uses a model to decide on an action, executes that action through tools or APIs, and then uses the result to inform its next decision — repeating this loop until a goal is met or a stopping condition is reached.

That definition has four load-bearing parts. Pull out any one of them and you no longer have an agent in any meaningful architectural sense:

  • Perception — The agent receives structured input: a user goal, a file's contents, the output of a previous tool call, an error message. Without perception of the environment, the system has nothing to reason about.
  • Model-driven decision — An LLM (or equivalent reasoning model) processes that input and decides what to do next. The decision isn't hard-coded; it emerges from reasoning over the current context.
  • Action execution — The system carries out that decision by calling a real tool or API: reading a file, writing code, querying a database, running a test suite. The action has an effect outside the model's context window.
  • Feedback loop — The result of that action flows back in as new input. The agent perceives what happened, reasons again, and acts again. This closed loop is what distinguishes an agent from a one-shot call.

┌─────────────────────────────────────────────────────┐
│                   AGENT LOOP                        │
│                                                     │
│   ┌──────────┐    ┌──────────┐    ┌──────────────┐  │
│   │          │    │          │    │              │  │
│   │ PERCEIVE │───▶│  REASON  │───▶│     ACT      │  │
│   │          │    │  (LLM)   │    │ (tool/API)   │  │
│   └──────────┘    └──────────┘    └──────┬───────┘  │
│        ▲                                 │          │
│        │          result fed back        │          │
│        └─────────────────────────────────┘          │
│                                                     │
│              ▼ stop condition met                   │
│         ┌──────────┐                               │
│         │  OUTPUT  │                               │
│         └──────────┘                               │
└─────────────────────────────────────────────────────┘

The loop isn't an implementation detail — it's the definition. The specific mechanics of how each phase works are covered in the next section. What matters here is that the loop exists at all.

💡 Mental Model: Think of an agent the way you think of a subroutine with a while-loop at its core. A single LLM call is like a function call — you pass inputs, get outputs, done. An agent is that function wrapped in a loop where each iteration can change what the next iteration sees.

The Critical Distinction: Agents vs. Chatbots

A chatbot (or any single-turn LLM call) produces text in response to text. When you ask a chatbot how to sort a list in Python, the conversation looks like this:

User:  "How do I sort a list in Python?"
Model: "Use list.sort() for in-place sorting, or sorted() to return a
        new list. Example: sorted([3, 1, 2]) returns [1, 2, 3]."

The model has produced a useful response, but nothing in the world has changed. No file was touched. No code ran. If the advice is wrong, the only consequence is that the human gets bad information — recoverable, because the human is still the one deciding whether to act on it.

Now contrast that with giving a task to an agent:

User:  "Add input validation to the checkout module."

A properly constructed agent receiving this instruction will:

  1. Read the checkout module's source files to understand existing structure
  2. Identify which inputs need validation based on the code and any related tests
  3. Write the validation logic into the file
  4. Run the existing test suite to check for regressions
  5. Examine the test output and, if tests fail, revise the code and re-run
  6. Return a summary of what was changed and the test results

Here's a simplified representation of what that agent's execution trace might look like:

# Illustrative agent execution trace, simplified for clarity.
# `llm` and the three tool functions are placeholders, not a real client API.
# A real implementation would include error handling, retry logic,
# and structured tool-call parsing (covered in "The Core Loop" section).

tools = {
    "read_file": read_file,
    "write_file": write_file,
    "run_tests": run_tests,
}

messages = [
    {"role": "user", "content": "Add input validation to the checkout module."}
]

MAX_STEPS = 10  # hard cap so the loop always terminates

for step in range(MAX_STEPS):
    response = llm.chat(messages=messages, tools=list(tools.keys()))

    if response.finish_reason == "stop":
        print(response.content)
        break

    tool_name = response.tool_call.name
    tool_args = response.tool_call.arguments

    # Record the model's decision so later steps can see what was already tried
    messages.append({"role": "assistant", "tool_call": {"name": tool_name, "arguments": tool_args}})

    tool_result = tools[tool_name](**tool_args)  # Execute the action

    # Feed the result back: this is the closed loop
    messages.append({"role": "tool", "name": tool_name, "content": str(tool_result)})

The agent here is doing something qualitatively different from a chatbot: it is taking actions with external effects. Files are modified. Tests actually run. The codebase is in a different state at the end than it was at the beginning.

⚠️ Common Mistake: Developers sometimes add a web-search tool to a chatbot and call the result an "agent." If the system makes one external call and returns the result in a single response, it's still essentially a single-turn interaction — an enriched chatbot, not an agent. The loop and sequential decision-making are what matter.

Copilots and Assistants: The Human Stays in the Loop

Between chatbots and fully autonomous agents sits a practical and important category: copilots and assistants. These systems use LLMs to generate suggestions, completions, or plans — but they route every consequential action back through a human before executing it.

A coding assistant that suggests a diff for your review is a copilot. You see the proposed change, you approve or reject it, and only then does the file change. The model is a sophisticated suggestion engine, not an autonomous actor.

 COPILOT PATTERN                  AGENT PATTERN

 Human ──▶ Model                  Human ──▶ Model
             │                              │
             ▼                              ▼
         Suggestion ──▶ Human           Tool call
                            │               │
                            ▼               ▼
                         Action          Result
                                            │
                                            ▼
                                         Model
                                            │
                                            ▼
                                         Tool call
                                            │
                                           ...
                                            │
                                            ▼
                                    Human ◀── Output

This distinction matters for system design, not just terminology. A copilot's failure modes are recoverable because the human catches them before they execute. An agent's failures can compound across multiple steps before a human sees anything — which is why termination conditions and error handling are architectural requirements, not afterthoughts.

🎯 Key Principle: The question to ask is not "does this system use an LLM?" but "does this system make decisions and take actions without a human approving each one?" If the answer is yes, you're in agent territory.

Autonomy as a Spectrum

Autonomy is better thought of as a continuous axis than a binary. The most useful single question to position a system on that axis is: how many sequential decisions does the system make before returning control to a human?

AUTONOMY SPECTRUM

Low ◀──────────────────────────────────────────────▶ High

  Single     Copilot     Bounded     Long-horizon    Fully
  LLM call   (approve    agent       agent           autonomous
              each step) (2–5 steps) (10–50+ steps)  agent

  No loop    Human in    Human at    Human at        Human at
             every step  checkpoints milestones      output only

This isn't just a conceptual axis — it directly determines your design choices. A bounded agent that reads a file and proposes a change after three steps has a very different risk profile from one that autonomously refactors a module, commits the changes, opens a pull request, and merges it after tests pass. Both are agents; their autonomy levels require very different safeguards.

One more nuance worth naming: autonomy level isn't fixed by the agent's design alone. It's also shaped by the tools you give it. An agent with only read-access tools is constrained regardless of how many reasoning steps it takes. Granting write or deploy tools shifts the effective autonomy upward even if the decision loop stays the same.
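
As a rough sketch of how tool grants bound effective autonomy, the snippet below tags each tool with an effect level and filters the registry before the agent ever sees it. The `Effect` levels, the `Tool` dataclass, and the tool names are illustrative assumptions, not part of any particular framework.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable


class Effect(IntEnum):
    """Hypothetical effect levels, ordered from least to most risky."""
    READ = 1
    WRITE = 2
    DEPLOY = 3


@dataclass(frozen=True)
class Tool:
    name: str
    effect: Effect
    func: Callable[..., str]


def scope_tools(tools: list[Tool], max_effect: Effect) -> dict[str, Tool]:
    """Return only the tools whose effect level is at or below max_effect."""
    return {t.name: t for t in tools if t.effect <= max_effect}


# Placeholder implementations, for illustration only.
ALL_TOOLS = [
    Tool("read_file", Effect.READ, lambda path: f"(contents of {path})"),
    Tool("write_file", Effect.WRITE, lambda path, content: f"wrote {path}"),
    Tool("deploy", Effect.DEPLOY, lambda target: f"deployed to {target}"),
]

read_only = scope_tools(ALL_TOOLS, Effect.READ)
read_write = scope_tools(ALL_TOOLS, Effect.WRITE)
```

The decision loop is identical in both cases; only the registry handed to it differs, which is what moves the system along the autonomy axis.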

Putting It Together: A Side-by-Side Comparison

| System Type         | Decision Loop                        | External Actions         | Human Role            | Failure Mode            |
|---------------------|--------------------------------------|--------------------------|-----------------------|-------------------------|
| Single LLM call     | None (one-shot)                      | None                     | Interprets output     | Bad advice              |
| Chatbot             | Conversational turns, no tool loop   | None                     | Acts on output        | Misleading response     |
| Copilot / Assistant | Suggests; human approves each action | After human approval     | Approves every step   | Bad suggestion accepted |
| Agent               | Multi-step, autonomous               | Yes, before human review | Receives final output | Cascading misuse        |
In practice, production systems often blend categories — a mostly-autonomous agent with a human checkpoint inserted at a high-stakes step is a common and often appropriate choice.
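
One way such a checkpoint might look, as a minimal sketch: the dispatcher asks an approval callback before running any tool on a high-stakes list. The `HIGH_STAKES_TOOLS` set, the `approve` callback, and the stand-in tools are assumptions made for illustration; a real system would route the approval through a ticket, chat prompt, or review UI.

```python
from typing import Callable

# Hypothetical set of tools considered high-stakes for this deployment.
HIGH_STAKES_TOOLS = {"merge_pr", "deploy_to_prod"}


def execute_with_checkpoint(
    tool_name: str,
    tool_args: dict,
    registry: dict[str, Callable[..., str]],
    approve: Callable[[str, dict], bool],
) -> str:
    """Run a tool, pausing for human approval first if it is high-stakes."""
    if tool_name in HIGH_STAKES_TOOLS and not approve(tool_name, tool_args):
        # The refusal is returned as a tool result so the model can adapt.
        return f"Denied: a human reviewer declined '{tool_name}'."
    return registry[tool_name](**tool_args)


# Usage with stand-in tools and an approver that rejects everything.
registry = {
    "read_file": lambda path: f"(contents of {path})",
    "merge_pr": lambda number: f"merged PR {number}",
}


def deny_all(name: str, args: dict) -> bool:
    return False


low_stakes = execute_with_checkpoint("read_file", {"path": "a.py"}, registry, deny_all)
high_stakes = execute_with_checkpoint("merge_pr", {"number": 7}, registry, deny_all)
```

Low-stakes calls pass straight through, so the agent stays autonomous everywhere except the steps you have explicitly marked.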

Why the Definition Matters for Engineering

When you know you're building an agent rather than a chatbot integration, a different set of engineering questions becomes mandatory:

  • What is the termination condition? How does the loop end if the model never emits a final answer?
  • What is the tool access scope? Which tools can the agent call, and what's the blast radius of each?
  • What is the autonomy level? At what points — if any — does a human need to confirm before the loop continues?
  • How does state persist across loop iterations? What gets carried forward, and what gets dropped?

These questions don't arise in chatbot or copilot design because those systems don't have a loop to manage. Recognizing an agent for what it is — a looping, acting system — is the prerequisite for designing it responsibly.

Wrong thinking: "An agent is just an LLM with some extra prompting."

Correct thinking: "An agent is a system with a decision loop that takes external actions and uses their results to drive subsequent decisions. Prompting is one part; loop architecture, tool design, and termination logic are equally load-bearing."


The Core Loop: Perceive, Reason, Act

With the definition in hand, we can examine how each phase of the loop maps to concrete implementation components. Every agent implementation — regardless of framework, language, or underlying model — reduces to three repeating phases. Understanding where each phase begins and ends will save hours of debugging when things go wrong, because when an agent misbehaves, the failure is almost always traceable to exactly one of these three stages.

Phase 1: Perceive — Assembling the Context Window

Perception is the process of collecting everything the model needs to reason well and packing it into the context window for the upcoming LLM call. It is discrete: it happens at the start of each loop iteration, and it is deliberately constructed by your code.

The inputs that feed perception can come from several sources:

  • The original user instruction — the goal the agent was given, usually fixed across all iterations
  • Tool outputs — structured results from the previous Act phase
  • Memory retrievals — relevant chunks surfaced from a vector store or conversation history (covered in the follow-on lesson Context Engineering)
  • Environmental state — a list of open files, the result of a test run, an error traceback

The host application's job in this phase is to assemble these inputs into a messages array, formatted as a sequence of system, user, assistant, and tool-result messages. The model never sees your Python objects or database rows directly — it sees only what you serialize into that array.

## Perception: building the messages array before each LLM call
def build_messages(system_prompt: str, history: list[dict]) -> list[dict]:
    """
    Returns a fully assembled messages array for the next LLM call.
    `history` accumulates across loop iterations and includes both
    assistant decisions and tool results from prior steps.
    """
    return [{"role": "system", "content": system_prompt}] + history

⚠️ Common Mistake: Treating the messages array as a pure log and appending everything indefinitely. Context windows have limits, and costs grow proportionally to input length. Every agent needs a strategy for what to keep and what to trim — even if that strategy is as simple as a fixed sliding window.
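
A fixed sliding window could be sketched roughly as follows. This assumes the first history entry is the user's goal and that tool results carry `"role": "tool"`; the function name `trim_history` is made up for this example, and real strategies (summarization, token-based budgets) are more involved.

```python
def trim_history(history: list[dict], max_messages: int) -> list[dict]:
    """Keep the first message (the user's goal) plus the most recent messages.

    A tool-result message only makes sense after the assistant message that
    requested it, so tool results left at the front of the window are dropped.
    """
    if len(history) <= max_messages:
        return list(history)

    goal, rest = history[0], history[1:]
    # Guard against rest[-0:], which would return the whole list.
    window = rest[-(max_messages - 1):] if max_messages > 1 else []

    # Drop tool results whose requesting assistant message was trimmed away.
    while window and window[0].get("role") == "tool":
        window = window[1:]

    return [goal] + window
```

Counting messages rather than tokens is a simplification; it keeps the sketch short but does not guarantee the result fits a given context window.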

Phase 2: Reason — What the Model Actually Produces

Reasoning is the LLM's turn. It receives the assembled context and produces one of two things: a final answer (a text response meant for the end user) or a tool-call decision (a structured object that says "invoke this function with these arguments"). The difference matters enormously to the host application, because each requires a different response.

In practice, the tool-call decision is represented as structured output — most commonly a JSON object. Modern LLM APIs expose this through a dedicated tool-calling interface: you declare available tools with their names, descriptions, and parameter schemas, and when the model decides to use one, it returns a structured object rather than free text.

Assistant message when calling a tool:
{
  "role": "assistant",
  "content": null,
  "tool_calls": [
    {
      "id": "call_abc123",
      "type": "function",
      "function": {
        "name": "read_file",
        "arguments": "{\"path\": \"src/checkout.py\"}"
      }
    }
  ]
}

Notice that arguments is a JSON-encoded string, not a nested object — a detail that trips up many first implementations. Your parsing logic must deserialize it before passing the arguments to the actual function.
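
A minimal sketch of that deserialization step, using only the standard library. The helper name `parse_tool_arguments` and the choice to return an error string instead of raising are assumptions for illustration.

```python
import json
from typing import Optional


def parse_tool_arguments(raw: str) -> tuple[Optional[dict], Optional[str]]:
    """Decode a JSON-encoded arguments string into keyword arguments.

    Returns (arguments, None) on success or (None, error_message) on failure,
    so the caller can pass the error back to the model instead of crashing.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"Error: arguments were not valid JSON ({exc.msg})"
    if not isinstance(parsed, dict):
        return None, "Error: arguments must decode to a JSON object"
    return parsed, None


args, error = parse_tool_arguments('{"path": "src/checkout.py"}')
bad_args, bad_error = parse_tool_arguments('{"path": ')
```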

The key insight about the reasoning phase is that the model does not execute anything. It only declares intent. Whether that action actually happens, and what guardrails apply, is entirely up to the host application. This separation means you can inspect, log, validate, or override the model's decision before any side effect occurs.

💡 Mental Model: Think of the model's output as a signed purchase order, not a payment. The model says "I want to buy this." Your code is the treasury that decides whether to approve and process it.
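
To make the "treasury" role concrete, here is a small sketch of a pre-dispatch check that compares a proposed call against the declared parameter schema. It only checks tool names and argument names; `validate_tool_call` and `SCHEMAS` are hypothetical names, and a fuller version might use a JSON Schema validator for types as well.

```python
def validate_tool_call(name: str, args: dict, schemas: dict[str, dict]) -> list[str]:
    """Return a list of problems with a proposed tool call; empty means OK."""
    if name not in schemas:
        return [f"unknown tool '{name}'"]
    params = schemas[name]
    problems = []
    for required in params.get("required", []):
        if required not in args:
            problems.append(f"missing required argument '{required}'")
    for key in args:
        if key not in params.get("properties", {}):
            problems.append(f"unexpected argument '{key}'")
    return problems


# Parameter schemas keyed by tool name, in the JSON Schema style
# commonly used for tool declarations.
SCHEMAS = {
    "read_file": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    }
}
```

Because nothing has executed yet when this runs, a non-empty problem list can be returned to the model as a tool result, logged, or used to halt the loop.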

Phase 3: Act — Dispatching and Feeding Results Back

Action is where the loop produces real-world effects. The host application reads the model's tool-call object, looks up the corresponding function in a dispatch table, executes it with the decoded arguments, and captures the result. That result then becomes input to the next perception step — appended to the history as a tool-result message.

This feedback mechanism is what makes the loop a loop. The result of acting is immediately perceived in the next iteration, allowing the model to reason about what just happened and decide what to do next.

A clean dispatch pattern uses a registry dictionary rather than a long if/elif chain:

# A simple tool registry: maps tool names to Python callables
from typing import Callable

def read_file(path: str) -> str:
    with open(path, "r") as f:
        return f.read()

def write_file(path: str, content: str) -> str:
    with open(path, "w") as f:
        f.write(content)
    # len() of a str counts characters, not bytes
    return f"Wrote {len(content)} characters to {path}"

TOOL_REGISTRY: dict[str, Callable[..., str]] = {
    "read_file": read_file,
    "write_file": write_file,
}

With this registry, dispatching becomes a single lookup rather than control flow that must be updated every time you add a tool. The result is serialized back as a tool-result message and appended to the history, completing the cycle.
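
The lookup-and-serialize step might be sketched like this. The `dispatch` helper and the message shape are assumptions for illustration; a real tool-calling API typically also expects the call's id on the result message.

```python
from typing import Callable


def dispatch(name: str, args: dict, registry: dict[str, Callable[..., object]]) -> dict:
    """Look up a tool by name, run it, and wrap the outcome as a tool-result message."""
    tool = registry.get(name)
    if tool is None:
        content = f"Error: unknown tool '{name}'"
    else:
        try:
            content = str(tool(**args))
        except Exception as exc:  # report failures to the model rather than crashing
            content = f"Error executing {name}: {exc}"
    return {"role": "tool", "name": name, "content": content}


# Stand-in registry for illustration.
registry = {"add": lambda a, b: a + b}
ok = dispatch("add", {"a": 2, "b": 3}, registry)
missing = dispatch("subtract", {"a": 2, "b": 3}, registry)
bad_args = dispatch("add", {"a": 2}, registry)
```

Adding a tool then means adding one registry entry; the dispatch function itself never changes.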

Termination: How the Loop Ends

A loop without a termination condition is a process that either runs forever or crashes unpredictably. Every agent needs explicit stop logic, and it should be designed before you add the first tool.

There are three primary stop conditions:

  1. Final answer emitted: The model returns a response with no tool calls. This is the happy path — the model determined it had enough information to answer.
  2. Maximum iteration limit reached: A hard cap enforced by your code, regardless of what the model wants to do. This prevents runaway loops.
  3. Error handler intervenes: A tool call fails, returns an unexpected type, or exceeds a timeout. Depending on your policy, this might retry, escalate, or halt the loop.

The maximum iteration limit is not a sign of a poorly designed agent; it is a required safety valve. Well-functioning agents complete in far fewer steps than the limit, so it almost never fires on the happy path. Size the limit to match your task's expected complexity plus a small buffer: a limit of 50 on an agent that should complete in 3–5 steps means 45 or more wasted LLM calls before a runaway is caught.
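
The three stop conditions can be made explicit in the loop's return value. The sketch below stubs out the model (`decide`) and tool dispatch (`act`) as plain callables so the control flow is visible on its own; `StopReason`, `Outcome`, and the halt-on-first-error policy are choices made for this illustration, not the only reasonable ones.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class StopReason(Enum):
    FINAL_ANSWER = "final_answer"
    MAX_ITERATIONS = "max_iterations"
    TOOL_ERROR = "tool_error"


@dataclass
class Outcome:
    reason: StopReason
    steps: int
    detail: str


def run_loop(
    decide: Callable[[list], dict],
    act: Callable[[dict], str],
    max_iterations: int,
) -> Outcome:
    """Drive a loop with explicit stop logic.

    `decide` stands in for the LLM call: it returns {"final": text} or a tool request.
    `act` stands in for tool dispatch and may raise on failure.
    """
    history: list = []
    for step in range(1, max_iterations + 1):
        decision = decide(history)
        if "final" in decision:
            return Outcome(StopReason.FINAL_ANSWER, step, decision["final"])
        try:
            result = act(decision)
        except Exception as exc:
            # Policy choice in this sketch: halt on the first tool failure.
            return Outcome(StopReason.TOOL_ERROR, step, str(exc))
        history.append(result)
    return Outcome(StopReason.MAX_ITERATIONS, max_iterations, "limit reached")
```

Returning a structured outcome rather than a bare string lets the caller distinguish a finished task from a capped or failed one and react differently to each.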

The Minimal Loop: A Complete Code Sketch

The following is a minimal but structurally complete agent loop. It is intentionally simplified — it does not include memory retrieval, streaming, or retry logic for transient failures — but every real agent loop contains these exact bones.

import json
import openai  # representative of any OpenAI-compatible client

## --- Tool registry ---
def read_file(path: str) -> str:
    with open(path, "r") as f:
        return f.read()

TOOL_REGISTRY = {"read_file": read_file}

## --- Tool schema declared to the model ---
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read the contents of a file at the given path.",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }
]

SYSTEM_PROMPT = "You are a coding assistant. Use tools when needed."
MAX_ITERATIONS = 10

def run_agent(user_message: str) -> str:
    client = openai.OpenAI()  # uses OPENAI_API_KEY from environment
    history = [{"role": "user", "content": user_message}]

    for iteration in range(MAX_ITERATIONS):
        # --- PERCEIVE: assemble the full messages array ---
        messages = [{"role": "system", "content": SYSTEM_PROMPT}] + history

        # --- REASON: call the model ---
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=TOOLS,
            tool_choice="auto",
        )
        message = response.choices[0].message

        # --- Termination: final answer ---
        if not message.tool_calls:
            return message.content

        history.append(message.to_dict())

        # --- ACT: dispatch each tool call and collect results ---
        for tool_call in message.tool_calls:
            func_name = tool_call.function.name

            if func_name not in TOOL_REGISTRY:
                result = f"Error: unknown tool '{func_name}'"
            else:
                try:
                    # Decode inside the try block so malformed JSON is reported
                    # to the model like any other tool error.
                    func_args = json.loads(tool_call.function.arguments)
                    result = TOOL_REGISTRY[func_name](**func_args)
                except Exception as exc:
                    result = f"Error executing {func_name}: {exc}"

            # Append tool result to history — feeds the next PERCEIVE step
            history.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": str(result),
            })

    # --- Termination: max iterations reached ---
    return "Agent stopped: maximum iteration limit reached without a final answer."

Walk through what this code does in terms of the three phases:

  • Perceive happens at the start of every for iteration: messages is rebuilt from history, which accumulates tool results as the loop progresses.
  • Reason is the client.chat.completions.create(...) call. The model returns either tool calls or plain content.
  • Act is the for tool_call in message.tool_calls dispatch loop, which runs only when the model requested at least one tool. Each result is appended to history, making it available in the next perception step.

💡 Pro Tip: Tool errors are caught and returned as string results rather than raised as exceptions. This is intentional — passing an error description back into the loop allows the model to reason about what went wrong and potentially recover. Raising an exception here would bypass that recovery path.

(The sketch already handles several tool calls arriving in a single response by dispatching them in order. A production loop would also handle transient API errors with exponential backoff, context-length overflow, and structured logging per iteration.)
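
For the transient-error case, a retry wrapper with exponential backoff might look roughly like this. `TransientError` is a stand-in for whatever retryable exception your client library raises, and the `sleep` parameter is injectable only so the delays can be observed in tests; both are assumptions of this sketch.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for a retryable error such as a rate limit or timeout."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on TransientError with delays of base_delay * 2**attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * (2 ** attempt))
    raise ValueError("max_attempts must be at least 1")
```

In an agent loop, the model call would be wrapped as `call_with_backoff(lambda: client_call(...))`, where `client_call` is your own function. Production versions commonly add random jitter to the delay so that many workers do not retry in lockstep.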

How the Three Phases Map to Failure Modes

One of the most practical payoffs of thinking in perceive-reason-act terms is that failures become immediately locatable:

| Phase    | Failure signature                               | Concrete example                                                     |
|----------|-------------------------------------------------|----------------------------------------------------------------------|
| Perceive | Model seems to "forget" earlier context         | Tool result not appended to history before next call                 |
| Reason   | Model calls wrong tool or hallucinates arguments | Tool schema description is ambiguous or argument names don't match   |
| Act      | Tool returns unexpected output or crashes       | Error not caught, or result not serialized to string before appending |

This diagnostic framing counteracts a common early instinct: blaming the model when something goes wrong. In practice, many early-stage agent bugs originate in the perception or action phases (malformed message arrays, missing tool results, or unhandled exceptions) rather than in the model's reasoning. The model reasons over what it receives; if what it receives is wrong, the reasoning will be wrong too.

🎯 Key Principle: When debugging an agent, trace the messages array that was actually sent on the failing iteration. The mistake is almost always visible there.
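
A small helper for that kind of trace, as a sketch: it renders one compact line per message so a missing tool result or a malformed entry stands out. The function name and the truncation length are arbitrary choices; the `tool_calls` shape assumed here matches the assistant message shown earlier in this lesson.

```python
import json


def summarize_messages(messages: list[dict], max_chars: int = 60) -> list[str]:
    """Render one compact line per message so a failing iteration can be inspected."""
    lines = []
    for index, message in enumerate(messages):
        role = message.get("role", "?")
        if message.get("tool_calls"):
            names = [call["function"]["name"] for call in message["tool_calls"]]
            body = "tool_calls=" + json.dumps(names)
        else:
            body = str(message.get("content"))
        if len(body) > max_chars:
            body = body[:max_chars] + "..."
        lines.append(f"[{index}] {role}: {body}")
    return lines
```

Logging these lines just before each model call gives a per-iteration record of exactly what the model was shown.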


Where Agents Fit in the Software Development Lifecycle

Knowing what an agent is — the perceive-reason-act loop — only gets you halfway. The more consequential question for a software team is where in the development process an agent actually belongs, and equally important, where it does not. The answer depends on how verifiable the task's output is, how costly a mistake is to undo, and how tightly scoped the agent's tool access can be.

The Blast Radius Principle

Blast radius is the scope of harm a single incorrect agent action can cause before a human can intervene. It is a function of two things: what tools the agent can invoke, and how reversible their effects are.

LOWER BLAST RADIUS                        HIGHER BLAST RADIUS
─────────────────────────────────────────────────────────────
read_file()     write_file()     run_tests()    deploy_to_prod()
list_issues()   create_branch()  merge_pr()     delete_database()
search_docs()   edit_code()      close_ticket() rotate_secrets()
     │                │               │               │
  Read-only      Reversible      Semi-reversible   Costly/Irreversible

🎯 Key Principle: Grant agents the minimum tool set required for the specific task, scoped to the minimum environment. Adding tools "in case they're useful" violates this principle and meaningfully raises risk. It's also an epistemic concern: the more tools an agent has available, the harder it becomes for the model to reliably choose the right one.

Stage-by-Stage Placement

Most software delivery moves through five recognizable stages:

PLANNING → CODING → REVIEW → TESTING → OPERATIONS

Agents can appear at every stage — but they should look meaningfully different at each one, reflecting the blast-radius principle in practice.

Planning

In the planning stage, both inputs and outputs are mostly natural-language text. Unlike code, that output has no compiler or test suite to run against, so verifying correctness is hard to automate.

Appropriate: Drafting initial ticket breakdowns from a product brief; linking a new feature request to existing related tickets by semantic search; flagging potential duplicates.

Inappropriate: Autonomously prioritizing the sprint backlog; assigning story points and closing tickets without review; deciding which features to cut.

The appropriate tasks produce drafts for human review, not decisions with downstream consequences. The right pattern: agent produces output in a staging state (a draft, a comment, a proposed change) and a human promotes it.

Coding

The coding stage is where agents currently show the most practical value — and where the temptation to over-automate is strongest.

Appropriate: Scaffolding boilerplate from a well-defined spec (generating a REST controller from an OpenAPI schema); generating an initial test suite for a known interface; implementing a clearly specified, isolated function.

The key qualifier is well-defined. An agent scaffolding code from a precise spec is operating on a bounded task with a verifiable output — the generated code either compiles and passes tests or it doesn't.

Inappropriate: Autonomously refactoring a large legacy module without a human reviewing the diff; redesigning a data model based on inferred intent; committing directly to the main branch.

Here's what a coding-stage agent triggered by a CI event might look like — reading a failing test, locating the relevant source, attempting a targeted fix, and reporting back:

import json
from typing import Callable

def run_coding_agent(failing_test_name: str, tools: dict[str, Callable[..., object]]) -> str:
    """Agent loop triggered by a CI test failure.

    call_llm is a placeholder for your model client. It is assumed to return
    either {"type": "final", "content": ...} or {"tool": ..., "args": {...}}.
    """
    MAX_ITERATIONS = 5
    messages = [
        {
            "role": "system",
            "content": (
                "You are a coding agent. Given a failing test name, "
                "locate the relevant source, diagnose the failure, "
                "and attempt a targeted fix using the tools provided. "
                "Do not modify files outside the module under test."
            ),
        },
        {
            "role": "user",
            "content": f"The following test is failing: {failing_test_name}",
        },
    ]

    for iteration in range(MAX_ITERATIONS):
        response = call_llm(messages=messages, tools=list(tools.keys()))

        if response.get("type") == "final":
            return response["content"]

        tool_name = response["tool"]
        tool_args = response["args"]

        if tool_name not in tools:
            return f"Agent requested unknown tool '{tool_name}'; stopping."

        try:
            tool_result = tools[tool_name](**tool_args)
        except Exception as exc:
            # Report the failure to the model so it can adjust its next step.
            tool_result = f"Error executing {tool_name}: {exc}"

        messages.append({"role": "assistant", "content": json.dumps(response)})
        messages.append({"role": "tool", "name": tool_name, "content": str(tool_result)})

    return "Max iterations reached without resolution."


# The tool set is intentionally narrow: read source, read test output,
# write a targeted patch. No broad filesystem access, no git push.
# run_test_and_capture and write_file_safely are placeholder helpers
# that your CI integration would supply.
from pathlib import Path

ci_tools = {
    "read_source_file": lambda path: Path(path).read_text(),  # closes the file handle
    "read_test_output": lambda test: run_test_and_capture(test),
    "write_patch": lambda path, content: write_file_safely(path, content),
}

Notice that write_patch exists in this tool set, but there is no git_push or create_pr. The agent can propose a fix by writing it to disk; a separate CI step or a human promotes that patch. This is the staged commit pattern: agents write, humans (or automated gates) promote.

Review

Code review is a natural fit for agents operating in a read-heavy, well-structured environment: diffs are machine-readable, style rules are explicit, and many classes of feedback are pattern-matchable.

Appropriate: Triaging a new PR — summarizing what changed, flagging missing test coverage, identifying potential breaking changes in public interfaces, checking commit message conventions.

Inappropriate: Auto-approving or auto-merging pull requests; overriding a human reviewer's rejection; leaving comments that block merges without a human in the loop.

A PR triage agent triggered by a webhook is a clean example of the event-driven async worker pattern. It runs once when the PR is opened, produces a structured comment, and then stops. Note that the triage function below uses a single LLM call rather than a full agent loop — that's intentional, and connects directly to the heuristic discussed further below.

## PR triage agent — triggered once per PR creation event.
## READ-ONLY tool access: it cannot merge, approve, or modify the PR.

def triage_pull_request(pr_payload: dict) -> str:
    """Analyze a PR diff and return a structured triage summary."""
    diff = fetch_pr_diff(pr_payload["pr_number"])
    changed_paths = extract_changed_paths(diff)
    test_coverage = check_test_coverage(changed_paths)
    interface_changes = detect_interface_changes(diff)

    summary_prompt = [
        {
            "role": "system",
            "content": (
                "Summarize this PR for a human reviewer. Note: changed modules, "
                "test coverage gaps, and any public interface changes. "
                "Be concise and specific. Do not approve or reject."
            ),
        },
        {
            "role": "user",
            "content": f"Diff:\n{diff}\n\nCoverage gaps: {test_coverage}\n"
                       f"Interface changes: {interface_changes}",
        },
    ]

    # Single LLM call — no branching, no tool loop needed.
    result = call_llm(messages=summary_prompt)
    post_pr_comment(pr_payload["pr_number"], result["content"])
    return result["content"]

Testing

Testing is arguably the highest-value stage for agent deployment today, because it has the best combination of properties: the task is bounded, the correctness criterion is automatically checkable (do tests run and pass on known-good code?), and errors are cheap to discard.

Appropriate: Generating a test suite for a new function from its type signature and docstring; expanding edge-case coverage; generating property-based test inputs. An agent can write tests, run them, observe which pass and fail, and refine — a tight loop where every iteration produces a verifiable signal.

Inappropriate: Autonomously deleting or disabling flaky tests; modifying the implementation under test to make tests pass; deciding that a failing test represents an incorrect test rather than a real bug.

The last point deserves emphasis. An agent that can both write tests and modify the implementation faces a degenerate optimization path: it can make tests pass by changing what the code does. Tool access must prevent this — the test-generation agent reads the implementation but cannot write to it.
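One way to enforce that boundary in code rather than in the prompt is a write tool that only accepts paths under the test directory. The sketch below is illustrative: `TESTS_ROOT`, `is_inside`, `write_test_file`, and the `tests/` layout are assumptions made for the example, not part of any particular framework.

```python
from pathlib import Path

# Assumed layout: generated tests live under ./tests; everything else is read-only.
TESTS_ROOT = Path("tests")


def is_inside(root: Path, candidate: Path) -> bool:
    """Return True if `candidate` resolves to a location strictly under `root`."""
    # resolve() collapses ".." segments, so "tests/../src/app.py" is rejected.
    return root.resolve() in candidate.resolve().parents


def write_test_file(path: str, content: str) -> str:
    """Hypothetical write tool exposed to the test-generation agent."""
    target = Path(path)
    if not is_inside(TESTS_ROOT, target):
        # Return an informative error instead of raising, so the model can adjust.
        return f"Error: writes are restricted to '{TESTS_ROOT}/'; refused '{path}'"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return f"Wrote {len(content)} characters to {path}"
```

With a tool like this registered in place of a general-purpose file writer, the degenerate path is closed structurally: the agent can still read the implementation through a separate read tool, but no prompt wording is needed to keep it from editing that code.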

Operations

Operations is where blast radius concerns are most acute. Production systems carry real consequences for errors.

Appropriate: Triaging incoming alerts against known runbooks; searching logs for patterns matching a known failure mode; drafting a postmortem outline from structured incident data.

Inappropriate: Autonomously rolling back a deployment based on a spike in error rates; scaling infrastructure resources without a human approval gate; executing database migrations in response to inferred schema drift.

🎯 Key Principle: In operations, the cost of a false positive — an agent taking a drastic action when no action was needed — can exceed the cost of the original incident. Alert triage and runbook lookup are safe; autonomous remediation requires approval gates.
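An approval gate can be sketched as a wrapper that separates proposing an action from executing it. Everything named here (`ProposedAction`, `run_with_approval`, the `approver` callback) is hypothetical; in a real system the approver would be a paging, chat, or ticketing step rather than a Python function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    """A remediation the agent wants to run; nothing executes at proposal time."""
    name: str
    reason: str
    execute: Callable[[], str]


def run_with_approval(
    action: ProposedAction,
    approver: Callable[[ProposedAction], bool],
) -> str:
    """Execute `action` only if the approver (a human, in practice) says yes."""
    if not approver(action):
        return f"Not executed: '{action.name}' was rejected or not approved."
    return action.execute()
```

The agent's tool set would contain only the function that builds a `ProposedAction`; the call to `run_with_approval` lives outside the agent loop, so the model cannot reach the execution path on its own.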

The Event-Driven Async Worker Pattern

Across all the appropriate examples above, a pattern emerges: well-placed agents are triggered by discrete events, run to completion on a bounded task, and then stop. They are not always-on processes with persistent broad permissions scanning for things to do.

EVENT-DRIVEN AGENT LIFECYCLE (recommended)

CI failure event
      │
      ▼
┌─────────────┐     ┌──────────────────┐     ┌─────────────────┐
│ Event queue │────▶│  Agent spins up  │────▶│ Tool set: scoped│
│ (webhook,   │     │  Runs task loop  │     │ to this task    │
│  CI hook,   │     │  Terminates on   │     │ only            │
│  ticket     │     │  stop condition  │     └─────────────────┘
│  assigned)  │     └──────────────────┘              │
└─────────────┘              │                        ▼
                             │               ┌─────────────────┐
                             └──────────────▶│ Output: comment,│
                                             │ draft, report   │
                                             │ (human reviews) │
                                             └─────────────────┘

CONTRAST: ALWAYS-ON AUTONOMOUS PROCESS (avoid)

┌───────────────────────────────────────────────────┐
│ Agent: running continuously, broad permissions,   │
│ self-deciding when to act, no discrete trigger    │
│ → accumulating cost, risk, and drift over time    │
└───────────────────────────────────────────────────┘

The always-on pattern creates three compounding problems: accumulated cost from continuous LLM calls, no clear accountability for when the agent acted and why, and permission scope that grows to cover "all the things it might need to do" rather than being task-specific.
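The recommended lifecycle can be sketched as an event handler that looks up a task-specific tool set, runs one bounded task, and returns. The event types, the stub tools, and `run_task` are assumptions for illustration; `run_task` stands in for whatever loop or single call handles the task.

```python
from typing import Callable, Dict

Tool = Callable[[str], str]


# Stub tools so the sketch runs on its own; real ones would call CI or the VCS host.
def read_ci_log(job_id: str) -> str:
    return f"log for {job_id}"


def fetch_diff(pr_number: str) -> str:
    return f"diff for PR {pr_number}"


def post_comment(target: str) -> str:
    return f"comment posted on {target}"


# Each event type maps to the only tools an agent handling that event may use.
TOOLS_BY_EVENT: Dict[str, Dict[str, Tool]] = {
    "ci_failure": {"read_ci_log": read_ci_log, "post_comment": post_comment},
    "pr_opened": {"fetch_diff": fetch_diff, "post_comment": post_comment},
}


def handle_event(event: dict, run_task: Callable[[dict, Dict[str, Tool]], str]) -> str:
    """Start one bounded task for one event, then return (no long-lived process)."""
    tools = TOOLS_BY_EVENT.get(event["type"])
    if tools is None:
        return f"Ignored: no agent is registered for event type '{event['type']}'"
    return run_task(event, tools)
```

Because the tool set is chosen per event, permissions stay tied to the task, and every run has a trigger that explains why the agent acted.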

The Agent-vs-Single-Call Heuristic

One of the most practically useful questions when designing a feature is whether a task actually requires an agent loop, or whether a single LLM call is sufficient.

If the task requires more than one external action, or if it requires conditional branching based on the result of an action, use an agent loop. Otherwise, use a single LLM call.

TASK                                               │ EXTERNAL ACTIONS             │ BRANCHING ON RESULTS? │ USE AGENT?
───────────────────────────────────────────────────┼──────────────────────────────┼───────────────────────┼────────────────
Summarize a PR diff                                │ 0 (diff is input)            │ No                    │ ❌ Single call
Generate tests, run them, refine                   │ 2+ (write, execute, re-read) │ Yes (pass/fail)       │ ✅ Agent
Draft a ticket from a brief                        │ 0                            │ No                    │ ❌ Single call
Triage a bug: search codebase, link related issues │ 2+ (search, fetch, compare)  │ Yes                   │ ✅ Agent
Summarize this week's incidents                    │ 1 (fetch logs)               │ No                    │ ❌ Single call
Fix a failing test: read, diagnose, patch, re-run  │ 3+                           │ Yes                   │ ✅ Agent

Using a single LLM call when a task needs branching produces brittle pipelines that fail silently on edge cases. Using a full agent loop when a single call would do adds latency, cost, and failure surface for no benefit.
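The heuristic is small enough to write down as a predicate. This is only a restatement of the rule above; the function name and parameters are made up for the example.

```python
def needs_agent_loop(external_actions: int, branches_on_results: bool) -> bool:
    """Apply the heuristic: more than one external action, or branching on a result."""
    return external_actions > 1 or branches_on_results


# A few rows from the comparison above, expressed as calls:
print(needs_agent_loop(0, False))  # summarize a PR diff
print(needs_agent_loop(1, False))  # summarize this week's incidents
print(needs_agent_loop(3, True))   # fix a failing test
```

The first two calls print `False` and the third prints `True`, matching the single-call and agent rows respectively.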

A Lifecycle Summary

STAGE       │ AGENT ROLE          │ TOOL ACCESS     │ HUMAN GATE
────────────┼─────────────────────┼─────────────────┼────────────────
Planning    │ Draft & triage      │ Read + comment  │ Before promote
Coding      │ Scaffold & fix      │ Read + write    │ Before push
Review      │ Summarize & flag    │ Read + comment  │ Always
Testing     │ Generate & iterate  │ Read code +     │ Before merge
            │                     │ write tests     │
Operations  │ Triage & summarize  │ Read logs +     │ Before remediate
            │                     │ read runbooks   │

The human gate is always before the action with downstream consequence: before code is pushed to a shared branch, before a PR is merged, before a remediation action runs in production. The agent operates freely in the space between trigger and gate.

🤔 Did you know? The blast-radius principle maps directly to the principle of least privilege in security engineering — the idea that any component of a system should have access only to the resources it needs for its immediate task. Applying this to agents is less a new idea than a disciplined extension of existing engineering practice to a new class of actor.


Common Misconceptions and Early Mistakes

Building your first agent often feels like it should be straightforward: connect a model to some tools, write a system prompt, and let the loop run. The gap between that expectation and a production-ready agent is where most early-stage bugs live. The mistakes covered here are the predictable failure patterns that emerge when engineers apply single-turn LLM intuitions to a fundamentally different execution model.

Mistake 1: Treating the Agent Loop as Reliable by Default

In a conventional codebase, a function called with valid inputs returns a valid output. In an agent loop that assumption is not safe, and building on it without safeguards leads to systems that hang indefinitely, hammer external APIs, or silently corrupt their own context.

Loop reliability failures come in three distinct flavors: the model returns a malformed tool call (missing field, wrong argument type, unknown tool name); the model repeats the same action because the context hasn't changed enough to drive new reasoning; or the loop enters an infinite cycle where a tool keeps returning an error and the loop driver keeps re-entering without terminating.

import json

MAX_ITERATIONS = 10

def run_agent_loop(client, tools, messages):
    for iteration in range(MAX_ITERATIONS):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
        )

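        message = response.choices[0].message

        # No tool calls (None or an empty list) means the model gave its final answer.
        if not message.tool_calls:
            return message.content

        # Keep the assistant turn in the history so each tool result can reference its call id.
        messages.append(message)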

        for tool_call in message.tool_calls:
            try:
                args = json.loads(tool_call.function.arguments)
                result = dispatch_tool(tool_call.function.name, args)
            except json.JSONDecodeError as e:
                # Malformed tool call — feed the error back rather than crashing
                result = f"Error: could not parse arguments — {e}"
            except Exception as e:
                result = f"Error: tool execution failed — {e}"

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": str(result),
            })

    return "Agent reached maximum iterations without completing the task."

This loop does three things the naive version omits: it caps iterations at a hard limit, wraps tool dispatch in exception handling, and feeds parse errors back to the model rather than crashing. Informative error messages are important — a blank string or silent exception gives the model no signal to change its behavior, and the loop will likely repeat the same broken call.

💡 Pro Tip: Track the last N tool calls and their results. If the model has called the same tool with the same arguments twice in a row and received the same result, inject a system-level message informing it that the action is not producing new information.
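A minimal sketch of that repeat check follows. `RepeatDetector` and `nudge_message` are hypothetical names, and the `"system"` role on the injected message is an assumption: some providers expect a different role for mid-conversation guidance.

```python
import json
from collections import deque


class RepeatDetector:
    """Flags when the most recent tool calls are identical in name, arguments, and result."""

    def __init__(self, window: int = 2):
        self.window = window
        self.recent = deque(maxlen=window)

    def record(self, tool_name: str, args: dict, result: str) -> bool:
        """Record one call; return True if the last `window` calls were all identical."""
        # sort_keys makes argument dicts with different key order compare equal.
        fingerprint = (tool_name, json.dumps(args, sort_keys=True), result)
        self.recent.append(fingerprint)
        return len(self.recent) == self.window and len(set(self.recent)) == 1


def nudge_message() -> dict:
    """Message to append when a repeat is detected (the role is an assumption)."""
    return {
        "role": "system",
        "content": (
            "The last tool call repeated an earlier call with identical arguments "
            "and returned the same result. Try a different action or give a final answer."
        ),
    }
```

In the loop driver, `record` would be called once per tool result, and `nudge_message()` appended to `messages` whenever it returns `True`.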

Mistake 2: Giving Agents Broad Tool Access Upfront

The instinct to register everything the agent might conceivably need sounds sensible. In practice, the opposite dynamic takes hold.

LLMs select tools through a probability-weighted decision process. The more tools in the registered set, the larger the space of plausible-but-wrong choices the model can make. A model with fifteen tools and a task that needs two of them will, with some regularity, reach for a third one that seems superficially relevant.

Agent with 15 tools registered:
  Task: "Check if the test suite passes"
  Model's decision path:
    1. Maybe I should search the codebase first?
    2. Or read the test config file?
    3. Should I check the CI status? Which tool does that?

Agent with 3 tools registered:
  Available tools: run_tests, read_file, get_test_config
  Model's decision path:
    1. Run the tests.

🎯 Key Principle: Start with the minimum viable tool set. Add tools incrementally, in response to a specific observed limitation, not in anticipation of hypothetical needs. Every tool definition also occupies tokens in the context window, so unnecessary tools consume capacity that should go to task-relevant information.

Wrong thinking: "I'll give it all the tools now and restrict them later."

Correct thinking: "I'll give it the two tools this specific task requires, and add more only when I observe the agent failing because a tool is missing."

Mistake 3: Substituting Prompt Length for Loop Architecture

When an agent misbehaves, the system prompt is the first thing many developers reach for. The agent called the wrong tool? Add an instruction. The agent didn't terminate correctly? Add a rule. This pattern produces system prompts that run to thousands of tokens and agents that still misbehave — just differently.

The system prompt is sent with every LLM call, but its influence competes with everything else in the context window. In a long loop, early instructions can be effectively outweighed by the volume of more recent context. Better phrasing will not fix this.

Behavioral guarantees belong in code. If the agent must never call a destructive tool more than once, enforce that in the loop driver. If the agent must terminate when a certain condition is met, check that condition in the loop controller. If tool arguments must conform to a schema, validate them in the dispatch layer.

from pydantic import BaseModel, ValidationError

class RunTestsArgs(BaseModel):
    test_path: str
    timeout_seconds: int = 60

def dispatch_tool(name: str, args: dict) -> str:
    if name == "run_tests":
        try:
            validated = RunTestsArgs(**args)
        except ValidationError as e:
            return f"Invalid arguments for run_tests: {e}"
        return execute_test_runner(validated.test_path, validated.timeout_seconds)
    return f"Unknown tool: {name}"

This validates tool arguments against a schema before executing anything. The model never needs a prompt instruction saying "always provide a valid test_path" — the loop will catch and report the error regardless. Guardrails enforced in code are deterministic; guardrails stated in prompts are probabilistic.
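The same idea applies to the "never call a destructive tool more than once" rule mentioned earlier. A rough sketch, assuming a per-task budget object that sits in front of the dispatcher (`CallBudget` and `guarded_dispatch` are illustrative names, not an existing API):

```python
from typing import Callable, Dict


class CallBudget:
    """Caps how many times each named tool may run within one agent task."""

    def __init__(self, limits: Dict[str, int]):
        self.limits = dict(limits)
        self.used: Dict[str, int] = {}

    def try_consume(self, tool_name: str) -> bool:
        """Return True and count the call if it is allowed; False if the cap is reached."""
        limit = self.limits.get(tool_name)
        if limit is None:
            return True  # tools without an entry are not capped
        count = self.used.get(tool_name, 0)
        if count >= limit:
            return False
        self.used[tool_name] = count + 1
        return True


def guarded_dispatch(
    name: str,
    args: dict,
    budget: CallBudget,
    dispatch: Callable[[str, dict], str],
) -> str:
    """Check the budget before delegating to the real dispatcher."""
    if not budget.try_consume(name):
        return f"Error: '{name}' has reached its call limit for this task and was not executed."
    return dispatch(name, args)
```

A budget such as `CallBudget({"delete_branch": 1})` makes the second destructive call impossible regardless of what the prompt says, and the returned error gives the model a reason to choose a different action.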

💡 Mental Model: Think of the system prompt as the agent's job description — it shapes orientation and goals. Think of the loop architecture as the employment contract — it enforces the non-negotiable rules. Job descriptions get forgotten under pressure; contracts do not.

Mistake 4: Ignoring Latency and Cost Accumulation

A single LLM call feels cheap and fast in isolation. An agent loop is not a single call — it is a sequence of calls, each consuming a context window that grows with every iteration as tool results are appended.

Iteration 1:  [system prompt + task]                      → ~1,000 tokens in
Iteration 2:  [system prompt + task + result_1]            → ~1,800 tokens in
Iteration 3:  [system prompt + task + result_1 + result_2] → ~2,900 tokens in
...
Iteration 10: [system prompt + task + results 1-9]        → ~12,000 tokens in

Total input tokens across 10 iterations: ~50,000+
Comparison: a single-call integration for a similar task: ~1,000–2,000 tokens

The numbers above are illustrative; actual accumulation depends on tool output verbosity and context management strategy. The point is structural: cost and latency scale super-linearly with loop depth when context is not actively managed.
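To make the structural point concrete, here is a small calculation under a simplifying assumption: every tool result adds the same number of tokens and nothing is pruned. The figures (1,000 base tokens, 1,200 tokens per result) are chosen to roughly match the illustration above and are not measurements.

```python
def cumulative_input_tokens(base_tokens: int, tokens_per_result: int, iterations: int) -> int:
    """Total input tokens when each call resends the prompt plus all earlier tool results."""
    return sum(base_tokens + i * tokens_per_result for i in range(iterations))


ten = cumulative_input_tokens(base_tokens=1_000, tokens_per_result=1_200, iterations=10)
twenty = cumulative_input_tokens(base_tokens=1_000, tokens_per_result=1_200, iterations=20)
print(ten)     # 64000
print(twenty)  # 248000
```

Doubling the loop depth from 10 to 20 iterations nearly quadruples total input tokens, because under this assumption the total is n·b + r·n(n−1)/2 for n iterations, base prompt size b, and per-result size r, which grows with the square of n.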

Tool outputs are often the primary driver of context bloat. A tool that returns a full file's contents when only a few lines are relevant inflates every subsequent iteration. The fix is at the tool implementation level: design tools to return minimal, targeted outputs.

Latency compounds even when individual calls are fast. If each LLM call takes two seconds and you have ten iterations, the agent takes at least twenty seconds — before accounting for tool execution time and the fact that longer contexts typically produce slower responses.

LEVER            │ WHAT IT CONTROLS           │ PRACTICE
─────────────────┼────────────────────────────┼──────────────────────────────────────────────────────────
Max iterations   │ Total loop depth           │ Set to realistic task upper bound
Context pruning  │ Input token accumulation   │ Summarize or drop older tool results
Tool output size │ Per-iteration token growth │ Return targeted subsets, not full payloads
Model selection  │ Per-call cost and speed    │ Use smaller models for intermediate steps if appropriate
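As one possible shape for the context-pruning lever, the sketch below replaces the content of older tool results with a short placeholder while leaving the message structure intact. It assumes messages are plain dicts with a `role` key; `prune_tool_results` and the placeholder text are illustrative choices.

```python
from typing import Dict, List


def prune_tool_results(
    messages: List[Dict],
    keep_last: int = 3,
    placeholder: str = "[older tool result omitted]",
) -> List[Dict]:
    """Return a copy of `messages` with all but the newest tool results shortened."""
    tool_indexes = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    # Everything except the last `keep_last` tool results gets shortened.
    to_shrink = set(tool_indexes[:-keep_last]) if keep_last > 0 else set(tool_indexes)

    pruned = []
    for i, message in enumerate(messages):
        if i in to_shrink:
            # Keep role and tool_call_id so the call/result pairing stays valid.
            pruned.append({**message, "content": placeholder})
        else:
            pruned.append(message)
    return pruned
```

Replacing content rather than deleting messages is a deliberate choice here: dropping a tool message outright can break the pairing between a tool call and its result. Summarizing the old result instead of using a fixed placeholder is a natural variation.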

Mistake 5: Conflating Agent Failure with Model Failure

When an agent produces a wrong result or gets stuck, the instinct is to blame the model: upgrade to a more capable one, adjust the prompt, or switch providers. That instinct is wrong often enough to be worth stating explicitly: many early-stage agent bugs originate outside the model, in the infrastructure that surrounds it.

The three most common non-model failure sources:

1. Tool implementation errors. The tool function has a bug — an off-by-one index, a wrong key name, a missing null check. The model correctly decides to call the tool and correctly formats the call; the tool returns garbage. The model cannot inspect whether a result is correct, and continues reasoning on bad data.

2. Missing error propagation. The tool catches an exception internally and returns an empty string or None instead of an informative error message. The model receives a blank result, has no signal that something went wrong, and proceeds as if the tool succeeded.

# ❌ Silent failure: model receives empty string, continues on corrupt context
def read_file_bad(path: str) -> str:
    try:
        with open(path) as f:
            return f.read()
    except Exception:
        return ""  # The model has no idea this failed

# ✅ Propagated error: model receives a signal it can reason about
def read_file_good(path: str) -> str:
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        return f"Error: file not found at path '{path}'"
    except PermissionError:
        return f"Error: no read permission for '{path}'"
    except Exception as e:
        return f"Error: unexpected failure reading file — {type(e).__name__}: {e}"

Silent swallows are particularly insidious because the agent appears to be running correctly — the loop doesn't crash — but its outputs are wrong in ways that can be hard to trace.

3. Context assembly problems. The messages list is assembled incorrectly — tool results appended in the wrong order, a missing tool_call_id, or prior turns truncated in a way that removes critical information. The model reasons on a malformed context and produces behavior that looks like hallucination but is actually a predictable response to bad inputs.

🎯 Key Principle: When an agent fails, work backward through the execution trace before touching the model or prompt. Inspect the exact messages list sent to the model on the failing iteration. Check whether errors were caught and returned, or silently swallowed.
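Part of that inspection can be automated. The sketch below checks the pairing between tool calls and tool results in a messages list. It assumes messages are stored as plain dicts in the OpenAI-style chat format (assistant turns carry a `tool_calls` list of entries with an `id`; tool turns carry a `tool_call_id`); `find_context_problems` is a hypothetical helper, and other providers structure this differently.

```python
from typing import Dict, List, Set


def find_context_problems(messages: List[Dict]) -> List[str]:
    """Report tool results that do not pair up with a preceding tool call."""
    problems: List[str] = []
    pending: Set[str] = set()  # ids of tool calls still waiting for a result

    for index, message in enumerate(messages):
        role = message.get("role")
        if role == "tool":
            call_id = message.get("tool_call_id")
            if call_id is None:
                problems.append(f"message {index}: tool result has no tool_call_id")
            elif call_id not in pending:
                problems.append(f"message {index}: tool result '{call_id}' matches no pending tool call")
            else:
                pending.discard(call_id)
            continue

        # Any non-tool message arriving while calls are pending means a result is missing.
        if pending:
            problems.append(f"message {index}: tool calls {sorted(pending)} were never answered")
            pending.clear()
        if role == "assistant":
            for call in message.get("tool_calls") or []:
                pending.add(call["id"])

    if pending:
        problems.append(f"end of context: tool calls {sorted(pending)} were never answered")
    return problems
```

Running a check like this on the exact list sent on the failing iteration separates "the model reasoned badly" from "the model was given a malformed conversation" before any prompt changes are attempted.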

🧠 Mnemonic — TEC before LLM: Tool output → Error propagation → Context assembly. Exhaust these three before assuming the model is at fault.

A Debugging Checklist

The five mistakes above share a common thread: applying single-call intuitions to a multi-step, stateful system. When an agent misbehaves, work through this triage order:

Agent Failure Triage Order
─────────────────────────────────────────────────────
1. Did the loop terminate correctly?
   └─ Check: max iteration limit, stop conditions, error handlers

2. Did the right tools get called?
   └─ Check: tool set size, tool name/schema accuracy, argument validation

3. Did the tools return useful outputs?
   └─ Check: tool implementation bugs, error propagation, output verbosity

4. Was the context assembled correctly?
   └─ Check: message ordering, tool_call_id linkage, truncation behavior

5. Was the cost/latency within bounds?
   └─ Check: context size per iteration, total iterations, tool output size

6. Only after 1–5 are clear: evaluate model behavior
   └─ Check: reasoning quality on the actual inputs the model received
─────────────────────────────────────────────────────

The instinct to make agents maximally capable from day one — broad tool access, long instructive prompts, ambitious multi-step tasks — bypasses the feedback loops that surface these mistakes early and cheaply. A narrowly scoped agent with explicit termination conditions, a minimal tool set, code-enforced guardrails, and full error propagation will fail in ways you can diagnose. A maximally capable agent built on prompt instructions and optimism will fail in ways that take days to untangle.


Key Takeaways and What Comes Next

The One Structural Insight Worth Keeping

An agent is defined by its loop, not by its model. A system that takes an LLM response, executes an external action, and feeds the result back into the next LLM call is an agent — regardless of whether the model is large or small, proprietary or open. A system that uses the most powerful model available but returns a single text response is not an agent in any architecturally meaningful sense.

This means the engineering challenges of agentic systems — state management, error propagation, loop termination, blast-radius control — are structural properties of the closed loop, not properties of the model you plug into it. Swapping to a better model does not fix a missing iteration limit. A more capable model does not compensate for tools that don't return structured errors. The loop is the system.

The Three Primary Design Levers

When designing or auditing any agentic system, three variables control almost everything else about its behavior and risk profile:

┌─────────────────────────────────────────────────────────────────┐
│                 THE THREE PRIMARY DESIGN LEVERS                 │
├──────────────────────┬──────────────────────────────────────────┤
│ LEVER                │ WHAT YOU'RE CONTROLLING                  │
├──────────────────────┼──────────────────────────────────────────┤
│ Autonomy Level       │ How many sequential decisions the agent  │
│                      │ makes before requiring human input.       │
│                      │ Range: single-step → fully autonomous.   │
├──────────────────────┼──────────────────────────────────────────┤
│ Tool Access Scope    │ Which external actions the agent can     │
│                      │ take. Read-only vs. write vs. deploy.    │
│                      │ Wider scope = larger blast radius.       │
├──────────────────────┼──────────────────────────────────────────┤
│ Termination          │ What stops the loop. Model emits final   │
│ Conditions           │ answer, max iterations hit, error        │
│                      │ threshold exceeded, or explicit signal.  │
└──────────────────────┴──────────────────────────────────────────┘

These levers are interdependent: wider tool access scope demands tighter autonomy controls and more explicit termination conditions. An agent that can only read files can afford more autonomy because the cost of a wrong decision is lower. An agent that can merge pull requests needs tight iteration limits, confirmation steps, and hard error cutoffs.

from typing import Any

# --- Lever 1: Tool Access Scope ---
# Only read-only tools are registered here; read_file, list_directory, and
# search_codebase are assumed to be defined elsewhere in the codebase.
AVAILABLE_TOOLS = {
    "read_file": read_file,
    "list_directory": list_directory,
    "search_codebase": search_codebase,
}

def run_agent(
    instructions: str,
    llm_client: Any,
    max_iterations: int = 8,  # Lever 3: Termination — hard ceiling
) -> str:
    messages = [{"role": "user", "content": instructions}]

    # Lever 2: Autonomy Level — max_iterations controls sequential decisions
    for iteration in range(max_iterations):
        response = llm_client.complete(
            messages=messages,
            tools=list(AVAILABLE_TOOLS.keys()),
        )

        # Lever 3: Termination — model signals it's done
        if response.finish_reason == "stop":
            return response.content

        if response.finish_reason == "tool_call":
            tool_name = response.tool_call["name"]
            tool_args = response.tool_call["arguments"]

            # Lever 1: Scope enforcement — reject unknown tools
            if tool_name not in AVAILABLE_TOOLS:
                raise ValueError(f"Tool '{tool_name}' not in allowed scope")

            tool_result = AVAILABLE_TOOLS[tool_name](**tool_args)
            messages.append({"role": "tool", "content": str(tool_result)})

    # Lever 3: Termination — max iterations reached
    return "Agent reached iteration limit without completing the task."

The max_iterations ceiling, the AVAILABLE_TOOLS dictionary, and the finish_reason checks each correspond directly to one of the three levers. These are decisions to make before writing any other part of the system.

System-Design Concerns Beyond Prompt Engineering

Agentic systems are distributed systems with non-deterministic components, and they need to be designed as such. Four concerns sit at this intersection:

State Management — An agent accumulates state across iterations: growing message history, intermediate tool results, and any external writes it has made. That state can become inconsistent. A tool call that partially succeeds before an error leaves the world in a state neither the agent nor the caller fully understands. Checkpointing and idempotent tool design are the standard mitigations.
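Idempotent tool design can be sketched with an idempotency key: the same logical action, retried after a crash or a repeated model decision, runs at most once. The class below keeps completed keys in memory purely for illustration; a real system would need a durable store, and `IdempotentExecutor` is a hypothetical name.

```python
from typing import Callable, Dict


class IdempotentExecutor:
    """Runs each keyed action at most once and replays the stored result on retries."""

    def __init__(self) -> None:
        self._completed: Dict[str, str] = {}

    def run(self, key: str, action: Callable[[], str]) -> str:
        if key in self._completed:
            # Already done: return the earlier result instead of repeating the side effect.
            return self._completed[key]
        result = action()
        # Recorded only after the action returns, so a failed attempt can be retried.
        self._completed[key] = result
        return result
```

The key would typically be derived from the action and its arguments (for example, tool name plus target path plus a content hash), so that two genuinely different writes are not collapsed into one.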

Error Propagation — Tool failures need to travel back into the agent's context in a form the model can reason about, not silently swallowed. An agent that receives an empty string when a file read fails will continue reasoning as if the file were empty, which is worse than stopping. The pattern for handling this:

from dataclasses import dataclass

@dataclass
class ToolResult:
    success: bool
    value: str | None
    error: str | None  # Never None when success is False

def read_file_safe(path: str) -> ToolResult:
    try:
        with open(path, "r") as f:
            contents = f.read()
        return ToolResult(success=True, value=contents, error=None)
    except FileNotFoundError:
        return ToolResult(
            success=False, value=None,
            error=f"File not found: {path}. Check the path and try list_directory first."
        )
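    except PermissionError:
        return ToolResult(
            success=False, value=None,
            error=f"Permission denied reading {path}. This file is outside agent scope."
        )
    except (OSError, UnicodeDecodeError) as e:
        # Any other read failure still reaches the model as a structured error.
        return ToolResult(
            success=False, value=None,
            error=f"Could not read {path}: {type(e).__name__}: {e}"
        )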

def serialize_tool_result(result: ToolResult) -> str:
    if result.success:
        return result.value
    return f"[TOOL ERROR] {result.error}"

The agent receives "[TOOL ERROR] File not found: checkout.py. Check the path and try list_directory first." rather than an empty string. That's information the model can act on.

Cost Control — Each loop iteration is an LLM call, and context windows grow with each iteration as tool results accumulate. This isn't a reason to avoid agents; it's a reason to set iteration limits and context-trimming policies deliberately.

Blast-Radius Containment — The scope of harm a misbehaving agent can cause is directly proportional to the scope of its tool access. An agent's tool access should be the minimum needed for the task, not the maximum that might eventually be useful.

Concept Map

                    ┌─────────────────────────────┐
                    │   PERCEIVE → REASON → ACT   │
                    │        (closed loop)         │
                    └──────────────┬──────────────┘
                                   │ enables
                    ┌──────────────▼──────────────┐
                    │      EXTERNAL ACTIONS        │
                    │  (what distinguishes agents  │
                    │   from chatbots/assistants)  │
                    └──────────────┬──────────────┘
                                   │ introduces
          ┌────────────────────────┼────────────────────────┐
          │                        │                        │
┌─────────▼────────┐   ┌──────────▼──────────┐   ┌────────▼──────────┐
│  STATE &         │   │   DESIGN LEVERS      │   │  FAILURE MODES    │
│  NON-DETERMINISM │   │  autonomy / tools /  │   │  loop errors,     │
│                  │   │  termination         │   │  cost, blast      │
└──────────────────┘   └─────────────────────┘   └───────────────────┘
          │                        │                        │
          └────────────────────────┴────────────────────────┘
                                   │
                    ┌──────────────▼──────────────┐
                    │    SYSTEM-DESIGN CONCERNS    │
                    │  state / errors / cost /     │
                    │  blast-radius containment    │
                    └─────────────────────────────┘

The closed loop is what makes external actions possible; external actions are what introduce state and non-determinism; that combination is what makes the design levers necessary; and the design levers, when set poorly, are what produce the failure modes.

Core Concepts at a Glance

CONCEPT               │ ONE-LINE DEFINITION                                      │ WHY IT MATTERS
──────────────────────┼──────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────
Agent Loop            │ Perceive → Reason → Act, repeated until a stop condition │ Defines what an agent is: the structure, not the model
Autonomy Level        │ Sequential decisions before human confirmation           │ Controls risk exposure and appropriate deployment context
Tool Access Scope     │ Which external actions the agent is permitted to take    │ Determines blast radius; should be minimum viable
Termination Condition │ What causes the loop to exit cleanly                     │ Required for every agent; not optional
Blast Radius          │ Maximum damage a misbehaving agent can cause             │ Proportional to tool scope; a core risk management metric
Error Propagation     │ How tool failures surface back to the reasoning step     │ Structured errors let the model reason; silent errors corrupt it
Cost Accumulation     │ Latency and token spend grow with each loop iteration    │ Must be accounted for at design time

⚠️ Critical point to carry forward: The three design levers — autonomy level, tool access scope, and termination conditions — should be explicitly revisited when the task changes, when tools are added, or when the agent is deployed in a new environment. An agent scoped for a read-only code review task should not automatically receive write permissions for a related task without resetting all three levers.

What Comes Next in This Roadmap

This lesson has given you the vocabulary and mental models to reason about agents as a class of system. The next two lessons apply that foundation to the problems that matter most in practice:

Agent Anatomy & Patterns takes the single-agent loop covered here and extends it to multi-agent systems — orchestrators that delegate to specialized subagents, parallel execution patterns, and the tradeoffs between different coordination strategies. If you've understood why the loop structure matters, that lesson will show you how to compose loops together without compounding their failure modes.

Context Engineering addresses the perceive phase of the loop in depth — how you assemble, compress, and manage the information the agent reasons over at each step. Context is the agent's only view of the world, and assembling it poorly is one of the most common sources of agent failure that isn't a model problem.

Both lessons build on exactly the framework established here. The perceive-reason-act loop, the distinction between agents and single-turn calls, and the three primary design levers are the reference points those lessons will use without re-explaining them.