Protocols, Tools & Skills
Standardize agent integrations with MCP and A2A, design effective tools, and use progressive skill loading.
Why Protocols and Tools Are the Backbone of Agentic Systems
Imagine you've wired up an LLM to call a function that reads from a database. It works — once, in your notebook, with your credentials, on your machine. Then a teammate tries to run it in a different environment. Then you want the same capability accessible to a second agent that handles scheduling. Then a third agent needs it, but scoped to read-only access. Suddenly, what felt like a clean solution reveals itself as a one-off bridge built for one crossing. The LLM call still works; the surrounding infrastructure has become a tangle of assumptions. This is the core tension that agentic AI surfaces: the gap between a model that can reason about what to do and a system that can reliably do it — repeatedly, inspectably, and across different hosts, agents, and contexts.
This lesson is about closing that gap. The answer isn't more clever prompting or larger models. It's structured protocols and well-defined tools — the infrastructure layer that turns a language model from a conversational interface into a dependable software actor. Understanding why these structures are necessary, and what properties make them work, is the foundation for everything else in this lesson and the child lessons that follow.
From a Single LLM Call to an Agentic Loop
A standard LLM call is stateless and self-contained: you send a prompt, you receive a completion. There's no inherent notion of doing something in the world, no feedback loop, no retry logic, no inspection surface. That simplicity is a feature when you're generating text. It becomes a liability when the model is expected to act.
An agentic loop changes the picture significantly. Rather than a single request-response cycle, the agent operates through repeated steps: it observes some state (the task, previous results, environmental context), reasons about what action to take, invokes a tool or external capability, observes the result, and iterates. This cycle might run for a handful of steps or for dozens, potentially spawning sub-agents or handing off work to other systems.
AGENTIC LOOP: HIGH LEVEL

```
┌─────────────────────────────────────────────────────┐
│                        AGENT                        │
│                                                     │
│  1. OBSERVE ──► 2. REASON ──► 3. INVOKE TOOL        │
│      ▲                              │               │
│      └──────── 4. OBSERVE RESULT ◄──┘               │
│                      │                              │
│            [done?] ──┤                              │
│                      └─► [continue loop]            │
└─────────────────────────────────────────────────────┘
             │                            │
   [LLM Reasoning Layer]       [Tool Execution Layer]
             │                            │
    understands intent            acts on the world
```
The critical observation here is that steps 3 and 4 — invoke and observe result — require the agent to interact with something external. That external interaction needs to be repeatable (same inputs produce predictable outputs), inspectable (you can log, audit, and debug what happened), and composable (you can chain or substitute tools without rewriting the agent's reasoning). Ad-hoc function calls — raw Python callables, inline API requests, hardcoded logic — can satisfy the first requirement in a narrow case. They rarely satisfy the second, and almost never the third.
The moment you want more than one agent, more than one environment, or more than one developer working on the system, ad-hoc wiring becomes a liability. Structured protocols and well-defined tools are how you build the repeatability, inspectability, and composability that an agentic loop requires.
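The observe, reason, invoke, observe-result cycle described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `scripted_model` is a hard-coded stand-in for the LLM reasoning step (not a real model call), and `lookup_capital`, `TOOLS`, and `run_agent` are hypothetical names invented for this example.

```python
from typing import Any, Callable


def lookup_capital(country: str) -> dict[str, Any]:
    """Hypothetical read-only tool used only for this illustration."""
    capitals = {"France": "Paris", "Japan": "Tokyo"}
    if country not in capitals:
        return {"success": False, "data": None, "error": f"Unknown country: {country}"}
    return {"success": True, "data": capitals[country], "error": None}


TOOLS: dict[str, Callable[..., dict[str, Any]]] = {"lookup_capital": lookup_capital}


def scripted_model(history: list[dict[str, Any]]) -> dict[str, Any]:
    """Stand-in for the LLM: calls the tool once, then finishes."""
    observations = [h for h in history if h["role"] == "tool"]
    if not observations:
        return {"action": "call_tool", "tool": "lookup_capital",
                "arguments": {"country": "France"}}
    return {"action": "finish",
            "answer": f"The capital is {observations[-1]['result']['data']}."}


def run_agent(task: str, max_steps: int = 5) -> tuple[str, list[dict[str, Any]]]:
    history: list[dict[str, Any]] = [{"role": "task", "content": task}]  # 1. observe
    for _ in range(max_steps):
        decision = scripted_model(history)                               # 2. reason
        if decision["action"] == "finish":
            return decision["answer"], history
        result = TOOLS[decision["tool"]](**decision["arguments"])        # 3. invoke tool
        history.append({"role": "tool", "tool": decision["tool"],        # 4. observe result
                        "arguments": decision["arguments"], "result": result})
    return "Stopped: step limit reached.", history


answer, trace = run_agent("What is the capital of France?")
```

Because every tool call is appended to `history` as a discrete record, the trace doubles as an inspection surface: you can see which tool ran, with which arguments, and what came back.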
The Three Concerns Structured Protocols Address
When teams move from prototype to production agentic systems, three distinct problems tend to surface — often separately, which makes them hard to recognize as facets of the same underlying issue.
Discoverability: What Can the Agent Do?
Discoverability is the question of how an agent (or its orchestrating layer) knows what capabilities are available at runtime. In a simple system with two or three hardcoded functions, this isn't a problem — the developer knows what's there. But as tool sets grow, as capabilities come from external services, or as tools are loaded dynamically based on context, the agent needs a way to query the available capability surface.
Without a protocol that handles discoverability, you end up with one of two failure modes: either capabilities are hardcoded into the system prompt (brittle, bloated, and impossible to update without a redeploy) or the agent attempts actions without knowing whether the underlying tool exists (leading to hallucinated function calls or silent failures).
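One way to avoid both failure modes is to let the agent (or its orchestrator) enumerate capabilities at runtime. The sketch below assumes a small in-memory registry; `ToolRegistry`, `list_tools`, and `has_tool` are hypothetical names for illustration, not the API of any particular framework.

```python
from typing import Any


class ToolRegistry:
    """Minimal in-memory registry: the single source of truth for what exists."""

    def __init__(self) -> None:
        self._definitions: dict[str, dict[str, Any]] = {}

    def register(self, definition: dict[str, Any]) -> None:
        name = definition["name"]
        if name in self._definitions:
            raise ValueError(f"Tool '{name}' is already registered.")
        self._definitions[name] = definition

    def list_tools(self) -> list[dict[str, Any]]:
        """Enumerate the capability surface, sorted by name for stable output."""
        return [self._definitions[name] for name in sorted(self._definitions)]

    def has_tool(self, name: str) -> bool:
        """Lets the runtime reject calls to tools that do not exist."""
        return name in self._definitions


registry = ToolRegistry()
registry.register({"name": "search_knowledge_base",
                   "description": "Search internal documents. Read-only."})
registry.register({"name": "get_order",
                   "description": "Look up an order by ID. Read-only."})

available_names = [tool["name"] for tool in registry.list_tools()]
```

The list returned by `list_tools` can be handed to the model at request time instead of being pasted into a system prompt, and `has_tool` gives the runtime a cheap check before it attempts a call.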
Invocation: How Does the Agent Act?
Invocation is the question of how a capability is actually called — what parameters it takes, what format they must be in, how errors are reported, and what the shape of the return value is. This is where the analogy to a REST API is most direct: just as a well-designed API specifies request schema, response schema, and error codes, a well-defined tool specifies exactly what goes in and what comes out.
Without consistent invocation contracts, the agent's reasoning layer has to guess. LLMs are surprisingly good at guessing, which masks the problem in testing and reveals it in production — usually when an edge-case input produces an unexpected format and the downstream handler fails silently.
Interoperability: Can Different Agents and Hosts Share Tools?
Interoperability is the hardest concern to appreciate until you've been burned by its absence. A tool built for Agent A, running on Host X, is often completely inaccessible to Agent B running on Host Y — even if both agents could theoretically benefit from the same capability. Without a shared protocol, each integration is bespoke. Teams end up maintaining multiple versions of the same tool, each adapted to a different agent framework or runtime.
This is analogous to what happened with early web services before REST and JSON emerged as dominant conventions: every integration was custom, every client had to understand every server's idiosyncratic interface. Standardization didn't eliminate all complexity, but it dramatically reduced the cost of adding a new client or server to the ecosystem.
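The cost of bespoke integrations shows up as adapter code. The sketch below keeps one neutral tool definition and renders it into two host-specific shapes. The target shapes loosely follow a function-calling style and an MCP-style tool listing; treat the exact field names as assumptions to verify against the current specifications, and `SEARCH_TOOL` and both adapter functions as hypothetical.

```python
from typing import Any

# One neutral definition, maintained in a single place.
SEARCH_TOOL: dict[str, Any] = {
    "name": "search_knowledge_base",
    "description": "Search the internal knowledge base. Read-only.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}


def to_function_calling_style(tool: dict[str, Any]) -> dict[str, Any]:
    """Render the definition in a function-calling shape (field names assumed)."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["input_schema"],
        },
    }


def to_mcp_style(tool: dict[str, Any]) -> dict[str, Any]:
    """Render the definition in an MCP-style tool listing shape (field names assumed)."""
    return {
        "name": tool["name"],
        "description": tool["description"],
        "inputSchema": tool["input_schema"],
    }


host_a_view = to_function_calling_style(SEARCH_TOOL)
host_b_view = to_mcp_style(SEARCH_TOOL)
```

Both hosts see the same schema because both views are derived from one definition. A shared protocol goes a step further by removing the need for per-host adapters like these.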
🎯 Key Principle: Discoverability, invocation, and interoperability are not independent problems — they are three facets of the same design question: how does a reasoning layer interact with an execution layer in a way that is reliable, observable, and reusable? A good protocol addresses all three.
Tool Definitions as Contracts
The concept of a tool definition is worth sitting with carefully, because it's doing more work than it might appear. A tool definition is not simply a function signature. It is the contract between the agent's reasoning layer (which decides what to do) and the execution layer (which does it). This contract has at least three components:
- Input schema — what the tool expects: parameter names, types, whether they're required or optional, and constraints on values
- Output contract — what the tool returns: the shape of a successful result and the shape of an error
- Side effect declaration — what the tool does to the world: does it read, write, delete, send a message, trigger a payment?
The analogy to an API schema is precise and useful. When you consume a REST API, you rely on its published specification (often an OpenAPI document) to understand what endpoints exist, what they accept, and what they return. You don't read the server's source code every time. The specification is the boundary you program against. A tool definition serves the same function for an agent: the LLM reasons against the specification, not against the implementation.
Here's a concrete illustration. Consider a minimal tool definition in the style that many agent frameworks expect:
```python
# Tool definition: schema-first approach
# The agent reasons against this spec, not the underlying implementation
from typing import Any


def get_tool_definition() -> dict[str, Any]:
    return {
        "name": "search_knowledge_base",
        "description": (
            "Search the internal knowledge base for documents matching a query. "
            "Returns a list of matching document summaries. "
            "Does NOT modify any data; read-only operation."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query string."
                },
                "max_results": {
                    "type": "integer",
                    "description": "Maximum number of results to return (1–20).",
                    "minimum": 1,
                    "maximum": 20,
                    "default": 5
                }
            },
            "required": ["query"]
        }
    }
```
Notice what this definition accomplishes. It tells the agent what the tool does (search, read-only), what it needs (a required query string, an optional result count with explicit bounds), and, through the explicit read-only statement in the description, what it does not do (no writes, no side effects). The underlying implementation, whether it calls a vector database, a keyword index, or a hybrid retrieval system, is completely hidden from the agent. The agent can reason about when and why to call this tool using nothing but the definition.
Now contrast that with the ad-hoc approach many teams start with:
```python
# ⚠️ Ad-hoc approach: common in early prototypes, problematic at scale
# The agent has to infer everything from context; no explicit contract exists
def search_kb(q, n=5):
    # implementation hidden
    results = internal_search(q, limit=n)
    return results  # returns... something; the agent has to guess the shape
```
💡 Mental Model: Think of a tool definition the way you'd think of a strongly typed function signature in a statically typed language. The types don't run the program — but they catch entire classes of errors before execution, and they make it possible for tools (IDEs, compilers, other developers) to reason about the code without reading its body. A tool definition does the same thing for an agent reasoning about what to call.
Why "Just Prompt the Agent" Doesn't Scale
A reasonable objection at this point: can't you just describe your tools in the system prompt and let the LLM figure out the details? In practice, teams try this. It works well enough for two or three tools in a controlled environment. The failure modes appear at scale — and they're worth making concrete.
Schema drift happens when the tool's implementation changes but the system prompt isn't updated. The agent continues to call the tool with the old parameter names; the implementation rejects or silently mishandles them. With a formal tool definition that's generated from the same source as the implementation, a change to one is a change to the other, so this kind of drift becomes much harder to introduce.
Token budget pressure happens when tool descriptions are written as prose in the system prompt. A system with fifteen tools, each described in a paragraph, consumes a significant portion of the context window before the agent has processed a single word of the actual task. Structured definitions, especially when loaded selectively based on the task at hand, use the token budget far more efficiently.
Auditability gaps happen when tool invocations aren't discrete, inspectable events. If the agent's tool use is embedded in conversational turns rather than structured calls, you can't easily extract a log of what was called, with what arguments, and what was returned. This matters for debugging, for compliance, and for understanding why an agent produced a given output.
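When every invocation is a structured event, the audit log falls out almost for free. The following sketch wraps each call and records the tool name, arguments, and result; `AUDIT_LOG`, `audited_call`, and `word_count` are hypothetical names, and an in-memory list stands in for whatever log sink a real system would use.

```python
import json
import time
from typing import Any, Callable

AUDIT_LOG: list[dict[str, Any]] = []  # stand-in for a real log sink


def audited_call(tool_name: str, handler: Callable[..., Any],
                 arguments: dict[str, Any]) -> dict[str, Any]:
    """Run a tool and record what was called, with what, and what came back."""
    started = time.time()
    try:
        result = {"success": True, "data": handler(**arguments), "error": None}
    except Exception as exc:  # failures are recorded as events, not lost
        result = {"success": False, "data": None, "error": str(exc)}
    AUDIT_LOG.append({
        "tool": tool_name,
        "arguments": arguments,
        "result": result,
        "started_at": started,
        "duration_s": time.time() - started,
    })
    return result


def word_count(text: str) -> int:
    """Hypothetical read-only tool."""
    return len(text.split())


audited_call("word_count", word_count, {"text": "structured calls are inspectable"})
audited_call("word_count", word_count, {"txt": "wrong parameter name"})

serialized_log = json.dumps(AUDIT_LOG)  # every entry is plain JSON-serializable data
```

The second call uses a misspelled parameter name, and the failure still lands in the log as a discrete entry. That record is unavailable when tool use is buried inside conversational text.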
🤔 Did you know? The concept of a clearly defined interface between reasoning and execution appears across many software patterns — the separation between a query planner and a query executor in a database engine is one example where this boundary has been carefully maintained for decades, precisely because it allows each layer to evolve independently without breaking the other.
The Practical Cost of Skipping Structure
It's worth being direct about what skipping structured tool definitions actually costs — not in the abstract, but concretely.
```python
# Example: structured tool invocation with a well-defined contract
# This is the pattern you want: explicit schema, explicit error handling
from typing import Any


def invoke_tool(tool_name: str, arguments: dict[str, Any]) -> dict[str, Any]:
    """
    Invoke a registered tool by name with validated arguments.
    Returns a structured result with 'success', 'data', and 'error' fields.
    """
    # Tools are discoverable from a central registry. get_tool_registry() is
    # assumed to be defined elsewhere and to return a mapping from tool name
    # to an object exposing validate_arguments() and execute().
    registry = get_tool_registry()

    if tool_name not in registry:
        # ✅ Explicit failure: the agent receives a structured error it can reason about
        return {
            "success": False,
            "data": None,
            "error": f"Tool '{tool_name}' not found in registry."
        }

    tool = registry[tool_name]

    # Validate arguments against the tool's declared schema before execution
    validation_error = tool.validate_arguments(arguments)
    if validation_error:
        return {
            "success": False,
            "data": None,
            "error": f"Invalid arguments: {validation_error}"
        }

    # Execute and return structured result
    try:
        result = tool.execute(arguments)
        return {"success": True, "data": result, "error": None}
    except Exception as exc:
        return {"success": False, "data": None, "error": str(exc)}
```
This pattern has three properties that ad-hoc function calls lack: it routes through a registry (solving discoverability), it validates against a schema before execution (enforcing the invocation contract), and it returns a consistent result envelope (giving the agent a predictable observation format regardless of which tool was called). Each of those properties is a consequence of the structural choices made in the tool definition layer — not of the LLM being more capable.
(This is a simplified illustration — a production system would also handle authentication, rate limiting, and timeouts, which are covered in the tool design child lesson.)
A Map of This Lesson and Its Child Lessons
With the why established, it's worth orienting you to the what — the specific topics this lesson covers and how they connect forward.
LESSON MAP
THIS LESSON: Protocols, Tools & Skills (Foundational)
├── Section 1 [HERE]: Why Protocols and Tools Are the Backbone
│ Sets up the core argument: structure enables reliability
│
├── Section 2: Anatomy of an Agent Tool
│ Goes deep on what a tool IS at the code level —
│ schema, execution boundary, return contract
│
├── Section 3: Protocols as Shared Contracts
│ What standardization buys you; why informal
│ conventions break at scale
│
├── Section 4: Skill Loading and Capability Composition
│ How agents acquire and scope tools at runtime;
│ static vs. dynamic tool sets
│
└── Section 5: Common Mistakes When Defining Tools
Concrete failure patterns and their corrections
CHILD LESSONS (where this foundation is applied)
├── MCP & A2A Deep Dive
│ Model Context Protocol and Agent-to-Agent Protocol;
│ the specific standards that implement the concepts
│ from Sections 2–3 of this lesson
│
└── Tool Design & Skill Loading in Practice
Designing high-quality tools; progressive skill
loading patterns; the specifics that Section 4
introduces but doesn't fully expand
The flow is intentional. This lesson builds the conceptual vocabulary — what is a tool definition, what is a protocol, what is skill loading — so that the child lessons can go deep on specific implementations (MCP, A2A) and design patterns (tool quality, capability composition) without needing to re-establish first principles each time.
🎯 Key Principle: Foundational concepts aren't filler. Understanding why structured protocols matter at the level of agentic loops, discoverability, and invocation contracts is what allows you to evaluate any specific protocol — MCP, A2A, or something that emerges later — rather than just following configuration steps.
📋 Quick Reference Card: The Three Concerns and What Addresses Them
| Concern | The Question It Answers | Addressed By |
|---|---|---|
| 🔍 Discoverability | What tools can the agent access? | Tool registry + protocol-level enumeration |
| ⚙️ Invocation | How does the agent call a tool correctly? | Typed schema + input/output contract |
| 🔗 Interoperability | Can tools work across agents and hosts? | Shared protocol (e.g., MCP, A2A) |
| 📋 Auditability | What did the agent actually do? | Structured invocation envelope + logging |
(Auditability is a consequence of getting the first three right — not a separate concern to solve independently.)
What You're Building Toward
The rest of this lesson fills in the structure that this section has outlined at a high level. Section 2 goes inside a single tool — schema, execution boundary, return contract — so you can reason about any tool regardless of what framework wraps it. Section 3 zooms out to protocols as coordination mechanisms across multiple agents and hosts, building the conceptual foundation for the MCP and A2A child lesson. Section 4 introduces skill loading — how agents acquire capabilities dynamically, rather than having a fixed tool set baked in at startup. Section 5 grounds everything in the mistakes that developers most commonly make, with before-and-after examples.
By the end, you'll have a mental model that makes the specific protocols and design patterns covered in the child lessons legible — not just steps to follow, but design choices you can evaluate and adapt.
Anatomy of an Agent Tool: Inputs, Outputs, and Side Effects
Every agentic system eventually reduces to a deceptively simple question: what can the agent actually do? The answer lives in its tools — and yet "tool" is one of the most overloaded words in the agentic vocabulary. A tool is sometimes described as "a function the agent can call," which is technically true but misses the structural properties that determine whether tools compose cleanly, fail safely, and behave predictably at runtime. This section dissects a tool down to its load-bearing parts so that you can reason about any tool you encounter or build, regardless of which framework wraps it.
What a Tool Actually Is at the Code Level
At its core, a tool is a named, typed function exposed to the agent runtime through a structured contract. That contract has three parts that work together:
- Input schema — a JSON-serializable description of every parameter the tool accepts, including names, types, constraints, and which parameters are required.
- Execution body — the code that runs when the agent calls the tool. It may be deterministic (same inputs always produce same outputs) or effectful (it touches the outside world: databases, APIs, filesystems).
- Output contract — the shape of the value returned to the agent runtime, which the model must be able to parse and reason about.
This three-part structure is not framework-specific. Whether you are looking at a LangChain Tool, an OpenAI function definition, an Anthropic tool use block, or a raw MCP server endpoint, all of them express this same contract — they just use different syntax to do it.
```
┌─────────────────────────────────────────────────────┐
│                    TOOL CONTRACT                    │
│                                                     │
│  ┌──────────────┐   ┌──────────────┐   ┌─────────┐  │
│  │ Input Schema │──▶│  Execution   │──▶│ Output  │  │
│  │              │   │     Body     │   │Contract │  │
│  │ name: str    │   │              │   │         │  │
│  │ type: float  │   │ deterministic│   │ typed   │  │
│  │ required: [] │   │      or      │   │ object  │  │
│  │              │   │  effectful   │   │         │  │
│  └──────────────┘   └──────────────┘   └─────────┘  │
│         ▲                                   │       │
│         │           Agent Runtime           │       │
│         └───────────────────────────────────┘       │
└─────────────────────────────────────────────────────┘
```
The runtime sits between the model and your tool. It receives the model's request to call a tool, validates it against the input schema, invokes the execution body, and hands the structured output back to the model. This intermediary role is why the schema and output contract matter so much — they are the language the runtime and model share.
A Minimal Working Example
Let's make this concrete with a tool that looks up the current price of a product. This example uses a framework-agnostic pattern that maps cleanly onto most agent SDKs.
```python
import json
from dataclasses import asdict, dataclass
from typing import Any

# --- Tool definition ---
# The schema is what the model sees: names, types, descriptions.
PRODUCT_PRICE_SCHEMA = {
    "name": "get_product_price",
    "description": (
        "Look up the current listed price for a product by its SKU. "
        "Returns the price in USD and whether the item is in stock. "
        "Use this when the user asks about pricing or availability."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "sku": {
                "type": "string",
                "description": "The product's stock-keeping unit identifier, e.g. 'WIDGET-42'."
            }
        },
        "required": ["sku"]
    }
}


@dataclass
class PriceResult:
    sku: str
    price_usd: float
    in_stock: bool
    error: str | None = None


def get_product_price_handler(sku: str) -> dict[str, Any]:
    """
    Execution body: reads from a product database.
    This is a read-only tool, so it is safe to retry.
    """
    # Simulated database lookup (replace with real DB call)
    catalog = {
        "WIDGET-42": {"price_usd": 19.99, "in_stock": True},
        "GADGET-7": {"price_usd": 149.00, "in_stock": False},
    }
    if sku not in catalog:
        # Structured error: still a valid output shape
        return asdict(PriceResult(sku=sku, price_usd=0.0, in_stock=False,
                                  error=f"SKU '{sku}' not found in catalog"))
    entry = catalog[sku]
    return asdict(PriceResult(
        sku=sku,
        price_usd=entry["price_usd"],
        in_stock=entry["in_stock"]
    ))


# --- Wiring into an agent loop (simplified) ---
TOOL_REGISTRY = {
    "get_product_price": get_product_price_handler
}


def dispatch_tool_call(tool_name: str, arguments: dict[str, Any]) -> str:
    """
    The agent loop calls this when the model requests a tool.
    Returns a JSON string the model can read on the next turn.
    """
    handler = TOOL_REGISTRY.get(tool_name)
    if handler is None:
        return json.dumps({"error": f"Unknown tool: {tool_name}"})
    try:
        result = handler(**arguments)
    except TypeError as exc:
        # Missing or unexpected argument names: report them instead of crashing the loop
        return json.dumps({"error": f"Invalid arguments for {tool_name}: {exc}"})
    return json.dumps(result)  # Structured JSON back to the model
```
Several design decisions in this snippet are worth unpacking. The tool name, description, and parameters schema are defined separately from the handler function. This separation matters because the schema is what gets sent to the model along with the conversation; the handler never runs until the model actually calls the tool. The dispatch_tool_call function is a stand-in for whatever routing logic your agent framework provides. In practice, LangChain, LlamaIndex, and similar libraries handle this dispatch for you, but the logical structure is identical.
Notice also that errors are returned as structured objects rather than raised exceptions. The model needs to read the error and decide what to do next; an unhandled Python exception just crashes the loop.
Side-Effect Classification: Why It Matters More Than You Think
Not all tools are equal in terms of what they can do to the world. The side-effect class of a tool determines how safely the runtime can retry it, whether it needs user confirmation, and what guardrails are required. Most production systems benefit from recognizing three categories:
📚 Read-only tools perform no writes. They query data, fetch content, or compute values from existing state. Because they leave the world unchanged, they are safe to retry automatically when a network hiccup or timeout occurs. The get_product_price tool above is read-only.
🔧 Write tools modify state — inserting a database record, sending an email, updating a configuration. Retrying a write naively can cause duplicate records or double-sent messages. Write tools require either idempotency (designing the operation so that calling it twice produces the same result as calling it once) or an explicit confirmation step before execution.
🔒 Destructive tools delete data, transfer funds, revoke permissions, or perform actions that are difficult or impossible to reverse. These require the strongest guardrails: explicit user confirmation, audit logging, and ideally a dry-run mode that previews the effect without committing it.
SIDE-EFFECT CLASSIFICATION
| | READ-ONLY | WRITE | DESTRUCTIVE |
|---|---|---|---|
| Examples | GET /products; SELECT query; compute hash | POST /orders; INSERT record; send email | DELETE /account; DROP TABLE; revoke API keys |
| Retry and guardrails | ✅ Safe to retry | ⚠️ Idempotency or confirmation required; design with care | 🔒 Explicit confirmation and audit log required; highest guard |
The reason the agent runtime needs to know the side-effect class is that autonomous agents frequently encounter ambiguous situations: the network timed out, the model issued the same call twice, or the user interrupted mid-task and restarted. A runtime that treats all tools as read-only will happily retry a payment tool until it succeeds — which may mean charging a customer three times. Conversely, a runtime that requires confirmation for every read will be so slow as to be unusable.
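Here is one possible way a runtime could act on the side-effect class. This is a sketch of a single policy, not a recommendation for every system: read-only tools are retried on timeout, write tools get exactly one attempt, and destructive tools are refused unless the caller passes an explicit confirmation. `SideEffect`, `call_with_policy`, and `make_flaky_handler` are hypothetical names.

```python
from enum import Enum
from typing import Any, Callable


class SideEffect(Enum):
    READ_ONLY = "read_only"
    WRITE = "write"
    DESTRUCTIVE = "destructive"


def call_with_policy(handler: Callable[[], Any], side_effect: SideEffect, *,
                     confirmed: bool = False, max_attempts: int = 3) -> dict[str, Any]:
    """Apply a retry/confirmation policy chosen by the tool's side-effect class."""
    if side_effect is SideEffect.DESTRUCTIVE and not confirmed:
        return {"success": False, "data": None,
                "error": "Confirmation required for destructive tool.", "attempts": 0}
    # Only read-only tools are retried automatically in this sketch.
    attempts_allowed = max_attempts if side_effect is SideEffect.READ_ONLY else 1
    last_error = None
    for attempt in range(1, attempts_allowed + 1):
        try:
            return {"success": True, "data": handler(), "error": None, "attempts": attempt}
        except TimeoutError as exc:
            last_error = str(exc)
    return {"success": False, "data": None, "error": last_error,
            "attempts": attempts_allowed}


def make_flaky_handler(failures_before_success: int):
    """Build a handler that times out a fixed number of times, then succeeds."""
    calls = {"count": 0}

    def handler() -> str:
        calls["count"] += 1
        if calls["count"] <= failures_before_success:
            raise TimeoutError("simulated timeout")
        return "ok"

    return handler, calls


read_handler, read_calls = make_flaky_handler(2)
read_result = call_with_policy(read_handler, SideEffect.READ_ONLY)

write_handler, write_calls = make_flaky_handler(2)
write_result = call_with_policy(write_handler, SideEffect.WRITE)

delete_handler, delete_calls = make_flaky_handler(0)
delete_result = call_with_policy(delete_handler, SideEffect.DESTRUCTIVE)
```

The same flaky behavior produces three different outcomes depending only on the declared class. The classification is therefore something the runtime executes against, not merely a label.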
💡 Real-World Example: A common failure pattern is a tool that calls an external notification service to send a user alert. The developer treats it as a simple function call, doesn't mark it as a write tool, and the retry logic on timeout causes three notifications to be sent. The fix — marking the tool as a write tool and adding an idempotency key — is straightforward once the classification framework is in place. The cost of not having the classification is subtle until it suddenly isn't.
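The idempotency-key fix mentioned above can be sketched as follows. `NotificationService` is an in-memory stand-in for the external service, and its `send_alert` signature is an assumption made for illustration; real services expose idempotency in their own ways.

```python
from typing import Any


class NotificationService:
    """In-memory stand-in for an external notification service (hypothetical)."""

    def __init__(self) -> None:
        self.sent: list[dict[str, str]] = []
        self._receipts: dict[str, dict[str, Any]] = {}

    def send_alert(self, user_id: str, message: str, idempotency_key: str) -> dict[str, Any]:
        # A repeated key means this is a retry: return the stored receipt
        # instead of sending the notification again.
        if idempotency_key in self._receipts:
            return {**self._receipts[idempotency_key], "deduplicated": True}
        self.sent.append({"user_id": user_id, "message": message})
        receipt = {
            "success": True,
            "notification_id": f"notif-{len(self.sent)}",
            "deduplicated": False,
        }
        self._receipts[idempotency_key] = receipt
        return receipt


service = NotificationService()
# Simulate a runtime that retries the same logical call three times after timeouts.
receipts = [
    service.send_alert("user-1", "Your order has shipped.", idempotency_key="order-1001-shipped")
    for _ in range(3)
]
```

The key is derived from the logical action rather than from the attempt, so three attempts result in one delivered notification.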
🎯 Key Principle: Side-effect classification is not just documentation — it is an operational contract that the runtime, the tool developer, and the system designer must agree on before deployment.
Why Tool Descriptions Are Load-Bearing
Here is a structural fact about how tool-calling models work: the model selects which tool to invoke based largely on the natural-language description you provide in the schema, together with the tool's name. The parameter types and names matter for correct argument construction, but the description is what gets the model to reach for the right tool in the first place.
This means a vague or misleading description will cause wrong tool selection even when the underlying code is perfectly correct. Consider the difference between these two descriptions for the same tool:
❌ Wrong thinking: "description": "Get product information."
✅ Correct thinking: "description": "Look up the current listed price for a product by its SKU. Returns the price in USD and whether the item is in stock. Use this when the user asks about pricing or availability."
The first description is technically accurate but gives the model no signal about when to use the tool versus, say, a search_product_catalog tool or a get_product_reviews tool. The second description specifies what data it returns, the units (USD), and explicit conditions for use. That last clause — "Use this when the user asks about pricing or availability" — might look redundant to a human reader who sees the tool name, but it gives the model a direct hint that resolves ambiguity when multiple tools are plausible.
⚠️ Common Mistake: Writing descriptions for a human who can read the function name and infer intent, rather than for a model that must pattern-match a user's request against every available tool simultaneously. Treat the description as a disambiguation clause in a contract.
The same discipline applies to parameter descriptions. Each property in your schema's parameters object should explain not just what the parameter is but what format it expects and what happens if it is wrong:
```python
# Weak parameter description
"sku": {
    "type": "string",
    "description": "The SKU."
}

# Strong parameter description
"sku": {
    "type": "string",
    "description": (
        "The product's stock-keeping unit identifier. "
        "Format: uppercase letters and digits separated by a hyphen, "
        "e.g. 'WIDGET-42'. Do not include spaces or special characters."
    )
}
```
The stronger version reduces the chance of the model inventing a plausible-but-incorrect format when it does not have an example in context.
Output Shape Discipline: Structured Objects Over Raw Strings
The output of a tool is fed back into the model's context on the next turn. If that output is a raw string (say, a paragraph of text), the model must parse it to extract the values it needs. Parsing text is something language models are generally capable of, but it introduces unnecessary variability. The model may extract the right value most of the time and a plausible-but-wrong value occasionally, and those occasional misses are hard to catch because the downstream reasoning looks superficially correct.
Returning structured objects (dictionaries with explicit field names and typed values) removes most of that interpretation work. The model can read fields such as price_usd or in_stock directly from the result without needing to interpret prose.
Here is a more elaborate example showing output shape discipline applied to a tool that chains into a downstream tool:
```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class OrderLookupResult:
    order_id: str
    customer_id: str
    status: str  # "pending", "shipped", "delivered", "cancelled", or "not_found"
    line_items: list[dict[str, Any]] = field(default_factory=list)
    estimated_delivery_date: str | None = None  # ISO 8601 date string or None
    error: str | None = None


def get_order_handler(order_id: str) -> dict[str, Any]:
    """
    Returns a structured order object.
    Downstream tools (e.g., initiate_return) can reference
    fields directly without re-parsing.
    """
    # Simulated order store
    orders = {
        "ORD-1001": {
            "customer_id": "CUST-88",
            "status": "shipped",
            "line_items": [
                {"sku": "WIDGET-42", "quantity": 2, "unit_price_usd": 19.99}
            ],
            "estimated_delivery_date": "2026-04-15"
        }
    }
    if order_id not in orders:
        # Same output shape as a successful lookup, with the error field set
        return asdict(OrderLookupResult(
            order_id=order_id,
            customer_id="",
            status="not_found",
            error=f"No order found with ID '{order_id}'"
        ))
    data = orders[order_id]
    return asdict(OrderLookupResult(
        order_id=order_id,
        customer_id=data["customer_id"],
        status=data["status"],
        line_items=data["line_items"],
        estimated_delivery_date=data.get("estimated_delivery_date")
    ))


# When the model receives this output, it can reason:
# "status is 'shipped' and estimated_delivery_date is '2026-04-15',
#  so I should tell the user their order is on its way."
# No prose parsing required.
```
Compare what the model receives with structured output versus raw string output:
STRUCTURED OUTPUT (what we want)
──────────────────────────────────────────────────────
{
"order_id": "ORD-1001",
"customer_id": "CUST-88",
"status": "shipped",
"line_items": [{"sku": "WIDGET-42", "quantity": 2, ...}],
"estimated_delivery_date": "2026-04-15",
"error": null
}
RAW STRING OUTPUT (avoid this)
──────────────────────────────────────────────────────
"Order ORD-1001 for customer CUST-88 has been shipped
and is expected to arrive around April 15th, 2026."
The raw string output is perfectly readable to a human. It is problematic for the model because if a downstream tool needs estimated_delivery_date as an ISO 8601 string (for example, to call a schedule_follow_up tool that takes a date parameter), the model must infer "2026-04-15" from "around April 15th, 2026" — introducing a potential parsing error in a place that is invisible in logs.
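The chaining problem is easy to demonstrate. In the sketch below, `schedule_follow_up` is a hypothetical downstream tool that accepts only an ISO 8601 date; the `days_later` parameter and the return shape are assumptions for illustration. The structured field passes straight through, while the prose phrase is rejected.

```python
from datetime import date, timedelta
from typing import Any


def schedule_follow_up(customer_id: str, after_date: str,
                       days_later: int = 2) -> dict[str, Any]:
    """Hypothetical downstream tool: schedules a follow-up relative to a date."""
    try:
        base = date.fromisoformat(after_date)  # accepts 'YYYY-MM-DD'
    except ValueError:
        return {
            "customer_id": customer_id,
            "follow_up_date": None,
            "error": f"'{after_date}' is not an ISO 8601 date (YYYY-MM-DD).",
        }
    return {
        "customer_id": customer_id,
        "follow_up_date": (base + timedelta(days=days_later)).isoformat(),
        "error": None,
    }


# Structured output from the order lookup: the date field is passed through as-is.
order = {
    "order_id": "ORD-1001",
    "customer_id": "CUST-88",
    "status": "shipped",
    "estimated_delivery_date": "2026-04-15",
    "error": None,
}
from_structured = schedule_follow_up(order["customer_id"], order["estimated_delivery_date"])

# Raw string output: the date has to be reinterpreted first, and a literal copy fails.
from_prose = schedule_follow_up("CUST-88", "around April 15th, 2026")
```

With the structured result, no interpretation step exists that could go wrong. With the prose result, the correctness of the chain depends on the model converting the phrase accurately every time.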
💡 Mental Model: Think of tool outputs as database rows, not paragraphs. A row has named columns with typed values. A paragraph requires interpretation. Agents chain tools the way queries join tables — structured data makes that chaining reliable.
🤔 Did you know? The pattern of returning structured error objects (with an error field rather than raising an exception) also enables the model to handle partial failures gracefully. If a tool returns {"error": "rate limit exceeded, retry after 30s"}, the model can include that information in its reasoning and decide whether to wait, try an alternative tool, or report back to the user — none of which is possible if the tool simply crashes the execution loop.
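To show why a structured error is something a caller can branch on, here is a deterministic stand-in for the decision the model would make. The error shape (a dictionary with `code`, `message`, and `retry_after_seconds`) and the `next_step` function are assumptions for this sketch; they go slightly beyond the plain error string in the example above.

```python
from typing import Any


def next_step(tool_result: dict[str, Any], max_wait_seconds: int = 60) -> dict[str, Any]:
    """Pick a follow-up action from a structured tool result."""
    error = tool_result.get("error")
    if error is None:
        return {"action": "continue"}
    if error.get("code") == "rate_limited":
        wait = error.get("retry_after_seconds", 0)
        if wait <= max_wait_seconds:
            return {"action": "wait_and_retry", "seconds": wait}
        return {"action": "try_alternative_tool"}
    # Anything else is surfaced rather than silently dropped.
    return {"action": "report_to_user", "message": error.get("message", "Unknown error")}


rate_limited = {
    "data": None,
    "error": {"code": "rate_limited", "message": "rate limit exceeded",
              "retry_after_seconds": 30},
}
decision = next_step(rate_limited)
```

None of these branches is reachable if the tool raises an exception that terminates the loop, because there is no result left to inspect.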
📋 Quick Reference Card: Tool Anatomy at a Glance
| Component | 🎯 What It Does | ⚠️ What Goes Wrong Without It |
|---|---|---|
| 🔧 Name | Unique identifier the runtime routes on | Collisions cause wrong handler dispatch |
| 📚 Description | Natural language the model uses for selection | Vague descriptions → wrong tool chosen |
| 📋 Input Schema | Typed parameter contract for validation | Untyped inputs → runtime errors or hallucinated args |
| ⚙️ Execution Body | The code that actually runs | Missing side-effect classification → unsafe retries |
| 🔒 Output Contract | Structured return shape | Raw strings → hallucinated parsing downstream |
Putting It Together: The Tool as a Contract
Thinking about a tool as a contract — rather than just a function — changes how you design it. A contract has obligations on both sides. Your tool obliges itself to accept the inputs described in the schema, execute within reasonable time bounds, and return an output that matches the declared shape. The runtime obliges itself to validate inputs before passing them, handle dispatch errors gracefully, and surface the output to the model without modification.
This framing also clarifies where debugging should start when things go wrong. If the model is calling a tool with malformed arguments, the problem is usually in the input schema or its parameter descriptions: the types or constraints the model was given were ambiguous. If the tool is being selected when it shouldn't be, the problem is in the tool description: it does not disambiguate the tool from its neighbors. If the model is misreading the output, the problem is in the output contract: the shape is too loose. Schema, description, and output contract give you three distinct places to look.
⚠️ Common Mistake: Treating the tool's Python (or TypeScript, or whatever) type signature as the schema. The schema that the model sees is the JSON definition you explicitly declare. If you update the handler's parameters without updating the schema, the model is working from outdated information and will generate calls that fail validation. These two representations must be kept in sync — many frameworks provide decorators or code-generation utilities to help enforce this, and it is worth using them.
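One way to keep the two representations in sync is to derive the declared schema from the handler's signature. The sketch below uses only the standard library (`inspect.signature` and `typing.get_type_hints`); the helper name `schema_from_signature`, the type mapping, and the demo `get_order` handler are assumptions made for illustration, and real frameworks ship their own, more complete utilities.

```python
import inspect
from typing import Any, Callable, get_type_hints

# Assumed mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def schema_from_signature(fn: Callable[..., Any]) -> dict[str, Any]:
    """Build a JSON-Schema-style parameter object from a function signature."""
    hints = get_type_hints(fn)
    properties: dict[str, Any] = {}
    required: list[str] = []
    for name, param in inspect.signature(fn).parameters.items():
        annotation = hints.get(name)
        if annotation not in _JSON_TYPES:
            raise TypeError(f"unsupported annotation for '{name}': {annotation!r}")
        properties[name] = {"type": _JSON_TYPES[annotation]}
        # Parameters without a default value are required.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {"type": "object", "properties": properties, "required": required}


def get_order(order_id: str, include_line_items: bool = False) -> dict:
    """Hypothetical handler used only to demonstrate schema generation."""
    return {"order_id": order_id, "line_items": [] if include_line_items else None}


schema = schema_from_signature(get_order)
```

Because the schema is computed from the function, renaming or adding a parameter changes what the model sees without a second manual edit.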
The simplifications in this section are worth naming: real production tools often have authentication context, rate limiting, timeout handling, and observability hooks wired in. Those concerns layer on top of the core contract described here — they don't replace it. The input schema, execution body, and output contract are the foundation; everything else is built on them. The next section will show how protocols like MCP and A2A standardize this contract across systems, which is where these individual tool definitions connect into a larger composable architecture.
Protocols as Shared Contracts: What Standardization Buys You
Every time an agent calls a tool, something has to agree on how that call is structured. In small systems, that agreement lives informally in the code — a shared function signature, a documented JSON shape, a convention that two developers happened to follow. The informal agreement works fine until something changes independently: the agent framework upgrades its tool-calling format, a new model produces slightly different structured outputs, or a tool author refactors their API. Suddenly the two sides of the call no longer speak the same language, and the integration silently breaks. A protocol is the explicit, durable answer to that fragility. It replaces informal conventions with a written contract that both sides can implement independently and verify against.
This section builds the conceptual foundation for understanding what protocols do, what they must specify to be useful, and why the distinction between different kinds of coordination protocols matters. The specific protocols you'll encounter most in agentic development — MCP and A2A — are covered in detail in the child lesson; the goal here is to give you the mental model that makes those specifics intelligible.
The Hidden Cost of Bespoke Integrations
Imagine a small agentic system where one agent needs to call three tools: a web search utility, a database query function, and a code execution sandbox. The first developer wires the search tool using a dict with keys query and max_results. The database tool expects a sql string plus a connection_id. The sandbox wants a code field and a language enum. Each integration is custom — reasonable enough when each tool was added.
Now the team upgrades their agent host to a newer framework version. The new framework passes tool inputs as a flat JSON object under a parameters key rather than directly in the message body. Every tool integration breaks simultaneously because each one made different assumptions about message structure. The team fixes all three — and then discovers that the database tool's author changed their API, so the connection_id field is now session_token. Another round of fixes.
This is the bespoke integration problem: each custom integration creates a hidden dependency between three independent components: the agent host, the model or planner producing tool calls, and the tool implementation. When any one of those changes, every integration that touches it becomes a potential breakage point. The failure mode compounds as the number of tools grows. With ten tools and three independent change axes, there are thirty integration-to-component dependencies that can drift out of sync, and every additional tool adds three more.
💡 Mental Model: Think of bespoke integrations like electrical adapters improvised with tape and bare wire. Two devices, one connection, works today. Add a third device, change either device, or revisit the system a year later, and the improvisation unravels. A protocol is the standardized plug-and-socket specification that lets any compliant device connect to any compliant outlet.
⚠️ Common Mistake: Teams often treat the bespoke integration problem as a tooling problem — "we just need better documentation." Documentation helps, but it cannot enforce compatibility. A protocol is enforceable in code: a tool that violates the contract fails validation at the boundary, loudly and early, rather than producing silent incorrect behavior deep inside the agent's reasoning loop.
What a Protocol Actually Specifies
A useful integration protocol has to address four distinct concerns. Each one solves a different failure mode. Together, they cover most of what breaks in real agentic systems.
┌─────────────────────────────────────────────────────────┐
│                  Integration Protocol                   │
├──────────────────┬───────────────────┬──────────────────┤
│ Message          │ Capability        │ Error            │
│ Envelope         │ Negotiation       │ Codes            │
│ Format           │                   │                  │
├──────────────────┴───────────────────┴──────────────────┤
│                   Transport Bindings                    │
└─────────────────────────────────────────────────────────┘
Message envelope format defines the structure that wraps every call and response. It specifies what fields are mandatory, what their types are, and where the actual payload lives within the envelope. A well-designed envelope separates protocol-level metadata (message ID, timestamp, version, sender identity) from application-level content (the tool's inputs and outputs). This separation means that protocol infrastructure — logging, routing, tracing — can operate on the envelope without parsing the payload.
Capability negotiation defines how a caller discovers what a callee can do, and in what version. This is the handshake that happens before any functional call. Without it, an agent has to assume the tool supports exactly the interface it was compiled against — which fails the moment the tool adds or removes a capability. With capability negotiation, the agent can ask "what do you support?" and adjust its behavior accordingly. This is what allows a tool registry to serve agents built against different protocol versions without forking the tool implementation.
Error codes define the shared vocabulary for things going wrong. A protocol without a standardized error taxonomy forces every consumer to parse free-text error messages — a notoriously fragile pattern. Standardized error codes let the agent host apply consistent retry logic, distinguish recoverable from non-recoverable failures, and surface meaningful diagnostics to the operator. An error code like TOOL_INPUT_VALIDATION_FAILED is actionable; an uncaught exception with a stack trace is not.
Transport bindings define how messages move between the caller and the callee. A protocol that specifies message format but not transport leaves implementers to invent their own HTTP schemas, gRPC service definitions, or stdio framing — producing exactly the fragmentation the protocol was meant to prevent. Transport bindings specify the concrete serialization (typically JSON or a binary format), the HTTP methods or WebSocket patterns used, and any authentication or session-management requirements.
🎯 Key Principle: A protocol that addresses fewer than these four concerns is a partial standard. Partial standards reduce some integration friction but preserve the rest. They are better than nothing, but teams should be explicit about which gaps they are filling with local convention — and treat those conventions as technical debt.
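To make the error-code concern concrete, here is a minimal sketch of how an agent host might map standardized codes to actions. The taxonomy is hypothetical: `TOOL_INPUT_VALIDATION_FAILED` and `TOOL_TIMEOUT` echo names used in this lesson, while `TOOL_RATE_LIMITED`, `TOOL_NOT_FOUND`, and the action labels are assumptions for illustration. Real protocols define their own codes.

```python
from enum import Enum


class ToolErrorCode(str, Enum):
    """Hypothetical error taxonomy; real protocols define their own codes."""

    TOOL_INPUT_VALIDATION_FAILED = "TOOL_INPUT_VALIDATION_FAILED"
    TOOL_TIMEOUT = "TOOL_TIMEOUT"
    TOOL_RATE_LIMITED = "TOOL_RATE_LIMITED"
    TOOL_NOT_FOUND = "TOOL_NOT_FOUND"


# Codes the host may retry without changing the request.
RETRYABLE = {ToolErrorCode.TOOL_TIMEOUT, ToolErrorCode.TOOL_RATE_LIMITED}


def next_action(code: str, attempt: int, max_attempts: int = 3) -> str:
    """Decide what the agent host does after a failed call."""
    try:
        parsed = ToolErrorCode(code)
    except ValueError:
        return "escalate"  # unknown code: do not guess
    if parsed in RETRYABLE and attempt < max_attempts:
        return "retry"
    if parsed is ToolErrorCode.TOOL_INPUT_VALIDATION_FAILED:
        return "return_to_model"  # the model must repair its arguments
    return "escalate"
```

The decision depends only on the code and the attempt count, so the same policy applies to every tool that reports failures in this vocabulary.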
Here is a minimal illustration of what an envelope-aware tool call looks like in practice. This is not tied to any specific framework — it demonstrates the structural separation the envelope concept introduces:
import json
import uuid
from datetime import datetime, timezone
from typing import Any


def build_tool_call_envelope(
    tool_name: str,
    parameters: dict[str, Any],
    protocol_version: str = "1.0",
    caller_id: str = "agent-main",
) -> dict[str, Any]:
    """
    Constructs a protocol-compliant tool call envelope.

    The envelope separates protocol metadata (id, version, timestamp, caller)
    from the application payload (tool_name, parameters). Downstream
    infrastructure can log or route on metadata without touching the payload.
    """
    return {
        # --- Protocol-level metadata (infrastructure can read this) ---
        "message_id": str(uuid.uuid4()),
        "protocol_version": protocol_version,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "caller_id": caller_id,
        "message_type": "tool_call",
        # --- Application-level payload (tool implementation reads this) ---
        "payload": {
            "tool_name": tool_name,
            "parameters": parameters,
        },
    }


# Example: calling a web-search tool through the envelope
envelope = build_tool_call_envelope(
    tool_name="web_search",
    parameters={"query": "agent integration protocols", "max_results": 5},
)
print(json.dumps(envelope, indent=2))
# Output: a JSON object with metadata at the top level and the
# tool-specific parameters nested under "payload".
The point of this example is not the specific field names — real protocols like MCP define those precisely. The point is the architectural separation: anything that deals with routing, logging, or versioning operates on the outer envelope; anything that deals with tool behavior operates on the inner payload. Mixing these two concerns into a flat structure (as bespoke integrations typically do) makes it impossible to evolve either independently.
Agent-to-Tool vs. Agent-to-Agent: Two Distinct Coordination Problems
Not all protocols are solving the same problem. There is an important distinction between two fundamentally different kinds of coordination that protocols address, and conflating them leads to systems that use the wrong tool for the job.
An agent-to-tool protocol governs the interaction between a single agent and a capability it invokes. The relationship is asymmetric: the agent is the caller, the tool is the callee. The tool does not have goals of its own; it executes a function and returns a result. The coordination problem is primarily about interface compatibility — ensuring the agent can discover what the tool accepts, invoke it correctly, and interpret the response. The tool has no memory of prior calls (unless explicitly designed to), no ability to push updates to the agent, and no independent initiative.
An agent-to-agent protocol governs the interaction between two agents that have their own goals, reasoning loops, and potentially their own sub-tools. The relationship can be symmetric (peers) or asymmetric (orchestrator and sub-agent), but in either case both sides have agency. The coordination problem is fundamentally different: it involves task delegation, status reporting, partial result streaming, failure escalation, and potentially negotiation over how a task should be decomposed. An agent receiving a delegated sub-task may push back, ask for clarification, or return a partial result while still working.
Agent-to-Tool (asymmetric invocation)
┌─────────────┐    tool_call({params})     ┌─────────────┐
│    Agent    │ ─────────────────────────► │    Tool     │
│ (has goals) │ ◄───────────────────────── │ (stateless) │
└─────────────┘      result({output})      └─────────────┘
One caller, one callee. Tool has no independent initiative.

Agent-to-Agent (peer delegation)
┌─────────────────┐   delegate(task)    ┌─────────────────┐
│  Orchestrator   │ ──────────────────► │   Sub-Agent     │
│     Agent       │ ◄────────────────── │  (has goals,    │
│  (has goals)    │   status / result   │   sub-tools)    │
└─────────────────┘                     └─────────────────┘
Both sides have reasoning loops. Sub-agent can push updates,
ask for clarification, or decompose the task further.
⚠️ Common Mistake: Treating agent-to-agent coordination as just "a tool call that returns later." This fails because sub-agents need to report intermediate state, handle ambiguous instructions, and sometimes reject tasks they cannot safely complete. A tool-call protocol has no vocabulary for these interactions. Trying to shoehorn multi-agent coordination into a tool-calling protocol produces systems that either lose intermediate state or silently swallow sub-agent failures.
The practical implication is that a well-architected agentic system will use both kinds of protocols for different purposes: a tool protocol for leaf-level capability invocation (searching, querying, computing), and an agent-delegation protocol for distributing reasoning work across multiple agents. The child lesson covers the concrete protocols that have emerged for each of these purposes; the important thing to understand here is that they are solving genuinely different problems, not just different scales of the same problem.
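The extra vocabulary that delegation needs can be sketched as a small set of task states. Everything here is hypothetical and chosen for illustration: the `TaskState` values, the `TaskUpdate` shape, and the action labels are assumptions, not the message types of any particular agent-to-agent protocol.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TaskState(Enum):
    """Hypothetical lifecycle states for a delegated task."""

    WORKING = "working"
    INPUT_REQUIRED = "input_required"
    COMPLETED = "completed"
    FAILED = "failed"
    REJECTED = "rejected"


@dataclass
class TaskUpdate:
    """One status message pushed from a sub-agent to its orchestrator."""

    task_id: str
    state: TaskState
    detail: str = ""
    partial_result: Optional[str] = None


def handle_update(update: TaskUpdate) -> str:
    """Return the orchestrator's next step for one sub-agent update."""
    if update.state is TaskState.INPUT_REQUIRED:
        return "answer_clarification"
    if update.state is TaskState.REJECTED:
        return "reassign_or_report"
    if update.state is TaskState.FAILED:
        return "escalate"
    if update.state is TaskState.COMPLETED:
        return "collect_result"
    return "keep_waiting"  # WORKING: the update may carry a partial result
```

A plain tool call has only two outcomes, a result or an error. The three additional states here (`WORKING`, `INPUT_REQUIRED`, `REJECTED`) are exactly the interactions a tool-calling protocol has no way to express.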
Versioning and Backwards Compatibility as First-Class Concerns
The most overlooked aspect of protocol design — and the one that causes the most production pain — is versioning. A protocol that does not specify how versions are negotiated and how backwards compatibility is maintained is a protocol that will fragment its ecosystem the moment anyone needs to make a breaking change.
Consider a tool registry serving multiple agents deployed at different times. Agent A was built when the protocol was at version 1.2. Agent B was built against version 1.5, which added a new context_window field to the tool call envelope. The tool itself has been updated to support both versions. Without a versioning mechanism, the registry has three bad options: maintain two separate deployments of every tool, require all agents to upgrade simultaneously, or quietly break Agent A by including fields it does not expect.
Protocol versioning solves this by making the version number part of the capability negotiation handshake. When Agent A connects, it advertises protocol_version: "1.2". The registry responds with the capabilities available at that version. When Agent B connects with protocol_version: "1.5", it gets a superset. Neither agent sees fields it does not understand, and the tool implementation does not need to branch on which agent is calling.
The standard vocabulary for characterizing version changes comes from semantic versioning conventions adapted to protocol design:
- 🔧 Non-breaking additions: new optional fields, new optional capabilities. Old clients ignore what they do not understand.
- ⚠️ Deprecations (staged removals): capabilities marked deprecated remain functional for a migration window before removal. The deprecation notice is non-breaking; the eventual removal is not.
- ❌ Breaking changes: changing the type of an existing field, removing a required field, changing error code semantics. These require a major version increment and explicit negotiation.
Here is a compact example of how capability negotiation with version awareness might be implemented on the tool side:
from dataclasses import dataclass, field


@dataclass
class CapabilityManifest:
    """
    Returned by a tool during the negotiation handshake.

    The tool declares which protocol versions it supports and
    which optional features are available at each level.
    """
    tool_name: str
    supported_protocol_versions: list[str]
    # Features gated by protocol version
    features_by_version: dict[str, list[str]] = field(default_factory=dict)


def negotiate_capabilities(
    requested_version: str,
    manifest: CapabilityManifest,
) -> dict:
    """
    Returns the negotiated capability set for a given protocol version.

    If the requested version is unsupported, raises ValueError with guidance.
    Note: this simplified example uses exact version matching.
    Production implementations typically use semver range logic.
    """
    if requested_version not in manifest.supported_protocol_versions:
        supported = ", ".join(manifest.supported_protocol_versions)
        raise ValueError(
            f"Protocol version '{requested_version}' not supported. "
            f"Supported versions: {supported}"
        )
    available_features = manifest.features_by_version.get(requested_version, [])
    return {
        "negotiated_version": requested_version,
        "tool_name": manifest.tool_name,
        "available_features": available_features,
        "status": "accepted",
    }


# A web-search tool that supports two protocol versions:
# v1.2 offers basic search; v1.5 adds result streaming.
search_tool_manifest = CapabilityManifest(
    tool_name="web_search",
    supported_protocol_versions=["1.2", "1.5"],
    features_by_version={
        "1.2": ["basic_search", "max_results"],
        "1.5": ["basic_search", "max_results", "streaming_results", "context_window"],
    },
)

# Agent A, built against v1.2, negotiates successfully:
result_a = negotiate_capabilities("1.2", search_tool_manifest)
# result_a["available_features"] == ["basic_search", "max_results"]

# Agent B, built against v1.5, gets the full feature set:
result_b = negotiate_capabilities("1.5", search_tool_manifest)
# result_b["available_features"] includes "streaming_results"

# An agent requesting an unsupported version gets an explicit error:
try:
    negotiate_capabilities("2.0", search_tool_manifest)
except ValueError as e:
    print(e)  # Clear guidance on what versions are available
This example is deliberately simplified — production implementations typically include semver range logic, graceful feature degradation paths, and deprecation notices embedded in the manifest. But the structural point holds: both agents get a clear, typed answer about what is available to them, and the tool implementation does not need to guess which client it is talking to.
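As one possible shape for the range logic mentioned above, the sketch below replaces exact matching with a compatibility rule. The rule itself is an assumption made for illustration (same major version, and the highest supported minor version that does not exceed the client's), as are the helper names `parse_version` and `negotiate_version`.

```python
def parse_version(version: str) -> tuple[int, int]:
    """Parse a 'MAJOR.MINOR' string into a tuple of ints."""
    major, minor = version.split(".")
    return int(major), int(minor)


def negotiate_version(requested: str, supported: list[str]) -> str:
    """Pick the highest supported version the client can understand.

    Assumed rule: same major version, minor no greater than the request.
    """
    req_major, req_minor = parse_version(requested)
    candidates = [
        v
        for v in map(parse_version, supported)
        if v[0] == req_major and v[1] <= req_minor
    ]
    if not candidates:
        raise ValueError(f"no compatible version for '{requested}' in {supported}")
    major, minor = max(candidates)
    return f"{major}.{minor}"
```

Under this rule a client built against 1.4 is served at 1.2 rather than rejected, while a 2.x client still gets an explicit error because no supported version shares its major number.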
🤔 Did you know? The most durable protocols in software history — HTTP, SMTP, DNS — share one design trait: they were designed to be extensible without breaking. HTTP's header system lets new capabilities be added without requiring old clients to understand them. The lesson for agent protocols is the same: design for the fields you don't yet know you'll need, by specifying what receivers should do with unrecognized fields (typically: ignore them).
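The "ignore what you do not recognize" rule is short enough to show directly. This is a minimal sketch assuming a hypothetical receiver that understands three envelope fields; the field sets and the function name `read_envelope` are illustrative, not taken from a real protocol.

```python
from typing import Any

# Fields this hypothetical older receiver understands.
KNOWN_FIELDS = {"message_id", "protocol_version", "payload"}
REQUIRED_FIELDS = {"message_id", "payload"}


def read_envelope(raw: dict[str, Any]) -> dict[str, Any]:
    """Keep known fields, drop unknown ones, reject missing required ones."""
    missing = REQUIRED_FIELDS - set(raw)
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return {key: value for key, value in raw.items() if key in KNOWN_FIELDS}


# A newer sender includes a field the receiver has never heard of.
newer_message = {
    "message_id": "m1",
    "protocol_version": "1.5",
    "payload": {"query": "agent protocols"},
    "context_window": 8000,
}
parsed = read_envelope(newer_message)
```

The unknown `context_window` field is dropped silently, while a missing required field still fails loudly. That asymmetry is what lets senders add optional fields without coordinating with every receiver.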
Why Standardization Compounds Over Time
The individual benefits of a good protocol — predictable call structure, capability discovery, standardized errors, version safety — are useful from day one. But the compounding benefit becomes visible over time and at scale.
When every tool in an ecosystem speaks the same protocol, the infrastructure layer can be built once and reused everywhere. A single logging framework can capture structured traces for every tool invocation without tool-specific adapters. A single testing harness can validate any tool's compliance with the contract. A single dashboard can surface error rates and latency across all tools in a registry, because they all report failures using the same vocabulary.
❌ Wrong thinking: "Protocols are overhead for simple integrations. We'll standardize later when we have more tools."
✅ Correct thinking: "Protocols are cheap to adopt at the start and expensive to retrofit. The break-even point comes earlier than it feels like it will."
The retrofitting cost is real and consistently underestimated. It is tempting to skip protocol adoption when you have two tools and one agent. The cost shows up when you have twelve tools, three agent versions, and a production incident at 2am where the difference between TOOL_TIMEOUT and TOOL_INPUT_VALIDATION_FAILED would tell you exactly where to look, but you are parsing free-text exceptions instead.
📋 Quick Reference Card: What Good Protocols Provide
| Concern | 🎯 What It Solves | ⚠️ What Breaks Without It |
|---|---|---|
| 🔧 Message envelope format | Consistent call/response structure | Integration breaks on framework upgrade |
| 🔍 Capability negotiation | Version-safe feature discovery | Silent incompatibility between agent and tool versions |
| ❗ Error codes | Actionable failure vocabulary | Unparseable exceptions, wrong retry logic |
| 🌐 Transport bindings | Concrete wire format and auth | Fragmented HTTP schemas across tools |
| 📦 Versioning | Backwards-compatible evolution | Fork-or-break on every protocol change |
The sections that follow build on this foundation: you will see how tools acquire and compose capabilities at runtime (Section 4), and the child lesson will walk through how MCP and A2A instantiate the four protocol concerns, along with versioning, in concrete, deployable form. The conceptual structure here (envelope, negotiation, errors, and transport, with versioning cutting across all four) is the lens through which those specifics become legible rather than a list of implementation details to memorize.
Skill Loading and Capability Composition in Practice
An agent's power comes not from having every possible tool available at once, but from having the right tools available at the right moment. This distinction might sound subtle, but it has concrete consequences: loading every tool at startup inflates the context window your language model must reason over, exposes capabilities that should only exist in certain task phases, and broadens the attack surface for prompt injection. The alternative — loading tools progressively based on what a task actually requires — is one of the more practical architectural decisions you can make when building production agentic systems. This section walks through that pattern, explains how to scope capabilities by role and trust level, and clarifies when to compose tools internally versus delegate to a sub-agent.
Static vs. Dynamic Tool Loading
Static tool loading means compiling your agent's full toolset at startup time. Every tool is registered before the agent receives its first task, and it remains registered for the agent's entire lifetime. This is the default pattern in most introductory tutorials, and it works reasonably well for narrow-scope agents with five or six tools.
The problems emerge as scope grows. Consider an agent that handles customer support, order management, and internal knowledge retrieval. A static approach registers all three capability groups upfront — perhaps thirty or forty tools total. The agent's system prompt and tool schema descriptions consume a meaningful portion of the context window before a single word of user input arrives. More importantly, the model can see every tool on every turn, which means it can attempt to call a tool that is inappropriate for the current task phase, and you're left enforcing correctness through prompt engineering alone.
Dynamic (progressive) tool loading inverts this: the agent starts with a minimal bootstrap toolset — typically just enough to interpret a task and determine what capabilities it needs — and then fetches and registers additional tools as the task context becomes clear. When a task phase ends, those tools are unregistered. The result is a context window that reflects what the agent should be doing right now, not a complete catalog of everything it could ever do.
STATIC LOADING                    DYNAMIC LOADING
─────────────────                 ─────────────────────────────
Agent startup                     Agent startup
      │                                 │
      ▼                                 ▼
Register ALL tools   ──────────►  Register bootstrap tools only
      │                                 │
      ▼                                 ▼
Task arrives                      Task arrives
      │                                 │
      ▼                                 ▼
Model sees 40 schemas             Model sees 3 schemas
      │                                 │
      ▼                                 ▼
Task executes                     Classify task context
      │                                 │
      ▼                                 ▼
All tools remain active           Load task-specific tools
                                        │
                                        ▼
                                  Task executes
                                        │
                                        ▼
                                  Unregister task tools
💡 Real-World Example: An agent that handles both read-only analytics queries and write operations on a database benefits enormously from progressive loading. During an analytics session, write tools simply aren't registered — so even a compromised prompt cannot invoke them. The protection is structural, not conversational.
⚠️ Common Mistake: Treating dynamic loading as a performance optimization only. The security and correctness benefits — reduced attack surface, phase-appropriate toolsets — often matter more than the context-window savings.
A Minimal Progressive Loading Pattern
The cleanest implementation is a tool registry class that the agent runtime queries at each decision step. The registry maintains the current active toolset and exposes three core operations: register(), unregister(), and get_active_tools(). Here's a working implementation that a real agent loop can call directly:
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Any


@dataclass
class ToolDefinition:
    """Represents a single callable tool with its schema and handler."""
    name: str
    description: str
    parameters: dict[str, Any]  # JSON Schema object
    handler: Callable[..., Any]
    tags: list[str] = field(default_factory=list)  # e.g. ["write", "database"]


class ToolRegistry:
    """Runtime registry for progressive tool loading.

    The agent loop calls get_active_tools() before each model invocation
    to build the tools array. It calls register/unregister as task
    context changes.
    """

    def __init__(self) -> None:
        self._tools: dict[str, ToolDefinition] = {}

    def register(self, tool: ToolDefinition) -> None:
        """Add a tool to the active set. Idempotent on name."""
        self._tools[tool.name] = tool

    def unregister(self, name: str) -> None:
        """Remove a tool from the active set. Silent if not found."""
        self._tools.pop(name, None)

    def unregister_by_tag(self, tag: str) -> None:
        """Bulk-remove all tools carrying a specific tag.

        Useful for clearing an entire capability group at phase boundaries
        (e.g., unregister_by_tag('write') when entering a read-only phase).
        """
        to_remove = [name for name, t in self._tools.items() if tag in t.tags]
        for name in to_remove:
            self._tools.pop(name)

    def get_active_tools(self) -> list[ToolDefinition]:
        """Return the current tool list. Called before every model invocation."""
        return list(self._tools.values())

    def get_schemas(self) -> list[dict[str, Any]]:
        """Return JSON-Schema-compatible dicts suitable for passing to a model API."""
        return [
            {
                "name": t.name,
                "description": t.description,
                "parameters": t.parameters,
            }
            for t in self._tools.values()
        ]
This code defines two things. ToolDefinition is a lightweight dataclass that bundles a tool's name, natural-language description, parameter schema, and the actual callable — plus an optional list of tags used for bulk operations. ToolRegistry is the runtime container: it's a thin wrapper over a dictionary, deliberately kept simple so the agent loop can call it on every turn without overhead.
The unregister_by_tag method deserves attention. Rather than individually tracking which tools belong to which phase, you annotate tools at registration time with semantic tags like "write" or "external-api". At a phase boundary — say, when the agent finishes a write-heavy onboarding step and enters a read-only verification phase — a single registry.unregister_by_tag("write") call clears the entire group. This is a useful heuristic for most cases; in practice you'd extend it with role-based filters (covered next).
🎯 Key Principle: The agent runtime should call get_schemas() (or get_active_tools()) immediately before constructing the model request, not once at startup. This ensures the model always sees the current capability snapshot.
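The load, run, unload cycle can be sketched independently of any registry class. This is a deliberately stripped-down illustration: the tool groups, the names `classify_task`, `run_read_query`, and `update_order`, and the `run_task` function are hypothetical, and the model call is reduced to a comment.

```python
from typing import Any

# Hypothetical tool groups; names and descriptions are illustrative only.
BOOTSTRAP = {"classify_task": {"description": "Label the incoming task."}}
TASK_TOOLS = {
    "analytics": {"run_read_query": {"description": "Run a read-only query."}},
    "order_update": {"update_order": {"description": "Modify an order."}},
}


def run_task(task_type: str, steps: int) -> list[list[str]]:
    """Return the tool names visible to the model at each point in one task."""
    active: dict[str, dict[str, Any]] = dict(BOOTSTRAP)
    seen: list[list[str]] = []
    # Load task-specific tools once the task has been classified.
    loaded = TASK_TOOLS.get(task_type, {})
    active.update(loaded)
    for _ in range(steps):
        # Rebuild the schema list immediately before each model request.
        seen.append(sorted(active))
        # ... call the model with these schemas and dispatch its tool call ...
    # Unregister the task tools at the phase boundary.
    for name in loaded:
        active.pop(name, None)
    seen.append(sorted(active))
    return seen


snapshots = run_task("analytics", steps=2)
```

During the analytics task the model is shown `classify_task` and `run_read_query` on each step; `update_order` never appears, and after the task only the bootstrap tool remains.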
Capability Scoping
Knowing when to load tools only solves half the problem. You also need to control who can use which tools — and doing that per-tool rather than per-registry leads to duplicated logic, inconsistent enforcement, and bugs that only surface under unusual role combinations.
Capability scoping is the practice of restricting the active toolset based on attributes external to the tool itself: the current user's role, the task phase, or a trust level derived from how the agent was invoked. The key design principle is that this filtering lives in the registry layer, not scattered across individual tool handlers.
Here's how that looks in practice:
# Builds on ToolDefinition and ToolRegistry from the previous listing
# (same module, so `from __future__ import annotations` is already in effect).
from dataclasses import dataclass, field
from enum import Enum, auto


class TrustLevel(Enum):
    UNTRUSTED = auto()  # e.g., external user input via public API
    USER = auto()       # authenticated end user
    OPERATOR = auto()   # internal service calling the agent
    SYSTEM = auto()     # orchestrating agent in a multi-agent pipeline


@dataclass
class ScopedToolDefinition(ToolDefinition):
    """Extends ToolDefinition with access constraints."""
    min_trust_level: TrustLevel = TrustLevel.USER
    allowed_roles: list[str] = field(default_factory=list)  # empty = all roles


class ScopedToolRegistry(ToolRegistry):
    """Registry that filters tools by the current execution context."""

    def __init__(self) -> None:
        super().__init__()
        # All stored tools are ScopedToolDefinition instances
        self._tools: dict[str, ScopedToolDefinition] = {}

    def register(self, tool: ScopedToolDefinition) -> None:  # type: ignore[override]
        self._tools[tool.name] = tool

    def get_active_tools(
        self,
        trust_level: TrustLevel = TrustLevel.USER,
        role: str | None = None,
    ) -> list[ScopedToolDefinition]:
        """Return only tools the current context is permitted to use."""
        result = []
        for tool in self._tools.values():
            # Check trust level (auto() assigns increasing values, so a
            # higher value means a more trusted caller)
            if tool.min_trust_level.value > trust_level.value:
                continue
            # Check role allowlist (empty list means no restriction)
            if tool.allowed_roles and role not in tool.allowed_roles:
                continue
            result.append(tool)
        return result

    def get_schemas(
        self,
        trust_level: TrustLevel = TrustLevel.USER,
        role: str | None = None,
    ) -> list[dict]:
        """Return filtered schemas for the model API call."""
        return [
            {"name": t.name, "description": t.description, "parameters": t.parameters}
            for t in self.get_active_tools(trust_level=trust_level, role=role)
        ]
By moving the trust and role check into get_active_tools(), every caller — the agent loop, a test harness, a monitoring wrapper — gets consistent filtering for free. Adding a new tool doesn't require the tool author to remember to check user.role; the registry handles it centrally.
⚠️ Common Mistake: Enforcing role checks inside tool handler functions rather than at the registry level. The handler-level check feels more explicit, but it scatters policy across dozens of files, and the model can still see the tool's schema even when the user can't call it — wasting context tokens and potentially revealing capability information.
POOR PATTERN                      BETTER PATTERN
─────────────────────────         ────────────────────────────
Registry returns ALL tools        Registry filters at get_active_tools()
      │                                 │
      ▼                                 ▼
Model sees admin tools            Model sees only permitted tools
      │                                 │
      ▼                                 ▼
Model calls admin_tool()          Model cannot call what it can't see
      │
      ▼
Handler checks user.role  ← enforcement too late; schema already exposed
💡 Mental Model: Think of the registry as a bouncer, not a guard inside the venue. Checking credentials at the door is more efficient and safer than waiting until someone reaches the bar.
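Hiding a schema only protects you if the dispatcher also refuses calls to tools outside the filtered set, since a model can still emit a tool name it was never shown. The sketch below illustrates that check under stated assumptions: `dispatch`, the two demo handlers, and the `read_only` context are hypothetical names introduced for this example.

```python
from typing import Any, Callable

Handler = Callable[..., Any]


def dispatch(
    tool_name: str,
    arguments: dict[str, Any],
    permitted: dict[str, Handler],
) -> dict[str, Any]:
    """Run a tool only if it is in the permitted set for this context."""
    handler = permitted.get(tool_name)
    if handler is None:
        # The model named a tool it was never shown (hallucinated or injected).
        return {"error": f"tool '{tool_name}' is not available in this context"}
    return {"result": handler(**arguments), "error": None}


# Hypothetical handlers, plus a read-only context that excludes the write tool.
all_handlers: dict[str, Handler] = {
    "read_report": lambda report_id: f"report {report_id}",
    "delete_report": lambda report_id: f"deleted {report_id}",
}
read_only = {name: fn for name, fn in all_handlers.items() if name == "read_report"}

allowed = dispatch("read_report", {"report_id": "r1"}, read_only)
blocked = dispatch("delete_report", {"report_id": "r1"}, read_only)
```

The same filtered set drives both what the model sees and what the dispatcher will run, so visibility and enforcement cannot drift apart.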
Composition vs. Delegation
Once you have a registry of scoped, progressively loaded tools, a natural question arises: when a task is complex enough to require multiple capabilities, should you build one tool that calls another internally, or should you spawn a separate agent to handle part of the work?
These are two distinct architectural patterns with different trade-offs:
Tool composition means one tool's handler internally invokes another tool (or calls a sub-function that does equivalent work). From the agent's perspective, it calls one tool and gets one result. The entire operation is a black box.
Delegation means the agent explicitly spawns or routes to a sub-agent, passing it a task description. The sub-agent has its own context, tool registry, and decision loop. The parent agent receives the sub-agent's final output.
COMPOSITION                          DELEGATION
──────────────────────────           ───────────────────────────
Agent                                Agent
  │                                    │
  ▼                                    ▼
call: enrich_contact()               call: spawn_enrichment_agent(task)
  │                                    │
  ▼  (inside the tool)                 ▼
enrich_contact handler               Sub-agent starts
  ├── calls lookup_email()             ├── picks tools from its own registry
  └── calls verify_domain()            ├── runs its own decision loop
  │                                    └── returns final result
  ▼                                    │
single result returned  ◄──────────────┘
to agent
When Composition Makes Sense
Composition is preferable when the sub-operations are tightly coupled, always executed together, and don't require independent error handling or observability. If enrich_contact always calls lookup_email then verify_domain — and you never need to observe or retry those steps independently — composing them into a single tool keeps the agent's decision space simpler and reduces the number of turns in its loop.
The trade-off is opacity. If the composed tool fails partway through, the agent receives an error from enrich_contact without knowing whether the lookup succeeded but the verification failed, or whether the lookup itself never ran. Your error messages have to carry that detail, or you lose diagnostic resolution.
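One way to keep diagnostic resolution inside a composed tool is to wrap each internal step so the error names the step that failed. The sketch below is illustrative only: `EnrichmentError` is a hypothetical exception class, and the bodies of `lookup_email` and `verify_domain` are stand-ins that simulate a successful lookup followed by a failing verifier.

```python
class EnrichmentError(Exception):
    """Raised by the composed tool; records which internal step failed."""

    def __init__(self, step: str, cause: Exception):
        super().__init__(f"enrich_contact failed at step '{step}': {cause}")
        self.step = step


def lookup_email(name: str) -> str:
    # Hypothetical stand-in for a real directory lookup.
    return f"{name.lower().replace(' ', '.')}@example.com"


def verify_domain(email: str) -> bool:
    # Hypothetical stand-in: simulate the verification service being down.
    raise TimeoutError("domain verifier did not respond")


def enrich_contact(name: str) -> dict:
    """Composed tool: the agent sees one call, the handler runs two steps."""
    try:
        email = lookup_email(name)
    except Exception as exc:
        raise EnrichmentError("lookup_email", exc) from exc
    try:
        verified = verify_domain(email)
    except Exception as exc:
        raise EnrichmentError("verify_domain", exc) from exc
    return {"name": name, "email": email, "verified": verified}


try:
    enrich_contact("Ada Lovelace")
except EnrichmentError as err:
    print(err)  # the message names the failing step
```

The agent still sees a single tool, but the failure now says that the lookup succeeded and the verification did not, which is the distinction the agent needs to choose a sensible next step.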
When Delegation Makes Sense
Delegation is preferable when the sub-task is genuinely independent — it has its own context requirements, might fail and retry on its own, or needs different tool permissions than the parent agent holds. A research agent that delegates to a web-search sub-agent benefits from isolation: the sub-agent can be given narrow read-only tools, and its failures don't pollute the parent agent's context.
Delegation also shines for observability. Because the sub-agent's decisions are explicit in its own trace, you can inspect exactly what it did and why. With composition, you'd need to instrument the tool's internal calls manually.
The trade-off for delegation is failure complexity. When a sub-agent fails, the parent agent has to decide whether to retry, fall back, or escalate — and the handoff protocol for that decision has to be designed explicitly. In composed tools, failure propagation is synchronous and straightforward (an exception bubbles up). In delegation, failure can be asynchronous, partial, or ambiguous.
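What "designed explicitly" might look like in practice: the sub-agent reports a typed outcome, and the parent maps that outcome to a next action. Everything in this sketch (the `Outcome` values, `SubAgentReport`, `next_action`, and the retry limit) is a hypothetical policy chosen for illustration, not a standard handoff protocol.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    DONE = "done"
    TRANSIENT_FAILURE = "transient_failure"  # worth retrying
    PARTIAL = "partial"                      # some work finished
    FATAL = "fatal"                          # retrying will not help


@dataclass
class SubAgentReport:
    outcome: Outcome
    output: str = ""


def next_action(report: SubAgentReport, attempts: int, max_attempts: int = 2) -> str:
    """Decide what the parent agent does with a sub-agent's report."""
    if report.outcome is Outcome.DONE:
        return "accept"
    if report.outcome is Outcome.TRANSIENT_FAILURE and attempts < max_attempts:
        return "retry"
    if report.outcome is Outcome.PARTIAL:
        return "fallback"  # e.g. continue with the partial output
    return "escalate"      # fatal, or retries exhausted


print(next_action(SubAgentReport(Outcome.TRANSIENT_FAILURE), attempts=1))  # retry
print(next_action(SubAgentReport(Outcome.TRANSIENT_FAILURE), attempts=2))  # escalate
```

The point of the typed outcome is that "partial" and "ambiguous" results get a defined path instead of being left to the parent model's interpretation of free text.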
📋 Quick Reference Card:
| | 🔧 Composition | 🤖 Delegation |
|---|---|---|
| 📊 Observability | Low — internal calls are opaque | High — sub-agent has its own trace |
| 🔒 Isolation | Same context and permissions | Separate context and tool scope |
| ⚡ Latency | Lower — single call overhead | Higher — agent startup + loop cost |
| 🔥 Failure handling | Synchronous, simple propagation | Asynchronous, requires protocol |
| 🎯 Best for | Tightly coupled, always-together ops | Independent, retryable sub-tasks |
🤔 Did you know? Composition and delegation map roughly to function calls versus process spawning in operating systems. A function call shares memory and stack; a process gets its own address space. The isolation trade-off is structurally the same. This analogy holds for the most common cases but breaks down at the edges — for instance, sub-agents may share memory stores or tool registries in ways that don't map cleanly to OS process isolation.
⚠️ Common Mistake: Defaulting to delegation for everything because it sounds more "agentic." Delegation adds real latency (another agent startup, another decision loop, another round-trip to the model). For operations that are three lines of Python, composition is almost always the right call. Reserve delegation for work that genuinely needs an independent reasoning loop.
Putting It Together: A Task-Phased Example
Consider a document processing agent with four phases: ingestion, extraction, validation, and storage. Here's how progressive loading, capability scoping, and the composition/delegation choice interact:
# Illustrative agent loop showing phase-based tool management.
# Simplified for clarity: production code would add error handling,
# retry logic, and async support.
def run_document_pipeline(
    document: dict,
    user_role: str,
    trust_level: TrustLevel,
    registry: ScopedToolRegistry,
) -> dict:
    """
    Phases:
      1. Ingest: read-only parsing tools
      2. Extract: NLP/ML tools
      3. Validate: rules-engine tool (composed with field checker)
      4. Store: write tools, operator-trust required
    """
    # ── Phase 1: Ingestion ──────────────────────────────────────────
    ingest_tools = load_ingest_tools()  # returns list[ScopedToolDefinition]
    for t in ingest_tools:
        registry.register(t)

    raw = call_agent_loop(
        task="Parse and normalize the document.",
        document=document,
        schemas=registry.get_schemas(trust_level=trust_level, role=user_role),
    )
    registry.unregister_by_tag("ingest")  # clean up before next phase

    # ── Phase 2: Extraction ─────────────────────────────────────────
    extract_tools = load_extraction_tools()
    for t in extract_tools:
        registry.register(t)

    extracted = call_agent_loop(
        task="Extract entities and relationships.",
        document=raw,
        schemas=registry.get_schemas(trust_level=trust_level, role=user_role),
    )
    registry.unregister_by_tag("extract")

    # ── Phase 3: Validation (composition example) ───────────────────
    # validate_document internally calls check_required_fields().
    # The agent sees one tool, not two. Acceptable here because
    # the two operations always run together and we don't need
    # independent retry on check_required_fields.
    validation_tool = build_composed_validation_tool()  # composes field checker
    registry.register(validation_tool)

    validated = call_agent_loop(
        task="Validate the extracted data against the schema.",
        document=extracted,
        schemas=registry.get_schemas(trust_level=trust_level, role=user_role),
    )
    registry.unregister("validate_document")

    # ── Phase 4: Storage (trust-gated) ──────────────────────────────
    # Write tools require OPERATOR trust, so they won't appear in
    # get_schemas() for USER-level calls and the model can't invoke them.
    store_tools = load_storage_tools()  # tagged "write", min_trust=OPERATOR
    for t in store_tools:
        registry.register(t)

    result = call_agent_loop(
        task="Persist the validated document.",
        document=validated,
        schemas=registry.get_schemas(trust_level=trust_level, role=user_role),
    )
    registry.unregister_by_tag("write")

    return result
This example is simplified — call_agent_loop, load_ingest_tools, and similar functions stand in for real implementations that would include model API calls, error handling, and streaming. What it demonstrates is the structural rhythm: register at phase entry, call get_schemas() with the current context, unregister at phase exit. The trust gate in Phase 4 means that if this function is called with TrustLevel.USER, the storage tools are simply invisible to the model — no prompt engineering required to keep them safe.
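One gap in the register/unregister rhythm is that an exception in the middle of a phase skips the cleanup call. A context manager can make the exit step unconditional. The sketch below uses a deliberately tiny stand-in registry (`TinyRegistry`) and a hypothetical helper (`phase_tools`); neither is part of the registry shown earlier.

```python
from contextlib import contextmanager


class TinyRegistry:
    """Stand-in for the registry above: just enough to show the pattern."""

    def __init__(self):
        self.tools = {}

    def register(self, tool: dict) -> None:
        self.tools[tool["name"]] = tool

    def unregister(self, name: str) -> None:
        self.tools.pop(name, None)


@contextmanager
def phase_tools(registry: TinyRegistry, tools: list):
    """Register tools for one phase and always unregister them afterwards."""
    for tool in tools:
        registry.register(tool)
    try:
        yield registry
    finally:
        # Runs even if the phase raises, so tools never leak into the next phase.
        for tool in tools:
            registry.unregister(tool["name"])


registry = TinyRegistry()
try:
    with phase_tools(registry, [{"name": "parse_pdf"}]):
        raise RuntimeError("phase failed midway")
except RuntimeError:
    pass

print(registry.tools)  # {} : the ingest tool was removed despite the failure
```

With a helper like this, each phase becomes a `with` block, and a failed ingestion step cannot leave parsing tools visible during storage.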
Practical Guidance on Sizing Your Registry
Progressive loading, scoping, and the composition/delegation choice are tools, not mandates. A few heuristics help calibrate how much machinery to build:
🧠 If your agent has fewer than ten tools total, static loading is almost certainly fine. The complexity of a dynamic registry only pays off when the alternative — a large static toolset — is measurably harming context quality or creating security concerns you can't address otherwise.
📚 If your agent crosses a trust boundary — for instance, it accepts input from external users and also has access to internal write APIs — scoped capability loading stops being optional. The structural guarantee that untrusted contexts cannot even see privileged tools is much more robust than prompt-level restrictions.
🔧 If a composed tool's error messages become hard to interpret, that's a signal the composition has grown too opaque and the operations should either be exposed as separate tools or delegated. Error legibility is a proxy for whether your composition boundary is drawn in the right place.
🎯 Delegation should be justified by independence, not by complexity. A complex task that must be executed sequentially by a single reasoning thread benefits more from a well-structured tool chain than from spawning a sub-agent that replicates the same sequence.
These heuristics cover the most common cases. Edge cases — agents that serve radically different user tiers, pipelines that mix human-in-the-loop steps with fully automated ones — may require more nuanced registry designs than what's shown here.
The patterns in this section give you the vocabulary and the mechanism. The next section examines what happens when these patterns are applied incorrectly — the concrete mistakes developers make when defining and registering tools, and how to recognize and fix them before they become production incidents.
Common Mistakes When Defining and Registering Tools
Building a tool-enabled agent feels straightforward until the agent starts doing the wrong thing — calling a tool with the wrong parameters, treating a database error as a successful query result, or silently consuming half the context window with a single API response. These failures share a common origin: the tool definition or registration was ambiguous, incomplete, or mismatched to the environment, and the model filled the gaps with plausible-but-wrong behavior. Unlike a traditional function call where a mismatch produces a runtime exception you can trace, a tool-enabled agent can continue running for several steps before the error surfaces — by which point the agent's state may be corrupted beyond easy recovery.
This section catalogs the five most consequential mistakes practitioners make when defining and registering tools, each illustrated with broken patterns and their corrected counterparts. The goal is not just a checklist but an understanding of why each mistake misleads the model, so you can recognize novel variants as they appear in your own systems.
Mistake 1: Overloaded Tools with Mode Parameters ⚠️
The most tempting shortcut when building tools is consolidation. You have three related operations — say, reading a record, updating it, and deleting it — and it feels efficient to expose a single manage_record tool with a mode parameter that accepts "read", "update", or "delete". The logic is tidy from an engineering standpoint. The problem is that this design pushes branching logic into the model's reasoning layer.
When the model receives a tool schema, it has to predict which tool to call and what parameters to supply, which is essentially a classification task followed by a slot-filling task. An overloaded tool forces the model to also predict which branch of behavior is needed before it can reason about the downstream parameters, because mode: "update" requires a payload field that mode: "read" does not. The result is a schema where parameter relevance depends on another parameter's value. JSON Schema can express that kind of dependency (for example with oneOf or if/then), but the result is verbose, and tool-calling interfaces do not always support or enforce those keywords. In a flat schema, the model sees all parameters as nominally valid and may supply irrelevant ones, omit required ones, or choose the wrong mode entirely when the task description is slightly ambiguous.
❌ Wrong thinking: "One tool with a mode parameter is cleaner and reduces the number of tools the model has to choose from."
✅ Correct thinking: "Three narrow tools with unambiguous names give the model three clear, independently-described options with no internal branching."
# ❌ Broken pattern: overloaded tool with mode parameter
overloaded_tool = {
    "name": "manage_record",
    "description": "Manage a database record.",
    "parameters": {
        "type": "object",
        "properties": {
            "mode": {
                "type": "string",
                "enum": ["read", "update", "delete"]
                # No description of what changes between modes
            },
            "record_id": {"type": "string"},
            "payload": {
                "type": "object",
                # Required for "update", irrelevant for "read"/"delete"
                # but the schema doesn't say so clearly
            }
        },
        "required": ["mode", "record_id"]
    }
}

# ✅ Corrected pattern: three narrow, single-purpose tools
read_record_tool = {
    "name": "read_record",
    "description": "Retrieve a single record by its ID. Returns the full record object.",
    "parameters": {
        "type": "object",
        "properties": {
            "record_id": {
                "type": "string",
                "description": "The unique identifier of the record to retrieve."
            }
        },
        "required": ["record_id"]
    }
}

update_record_tool = {
    "name": "update_record",
    "description": "Update fields on an existing record. Only the fields present in payload are modified.",
    "parameters": {
        "type": "object",
        "properties": {
            "record_id": {
                "type": "string",
                "description": "The unique identifier of the record to update."
            },
            "payload": {
                "type": "object",
                "description": "Key-value pairs of fields to update. Unknown keys are ignored."
            }
        },
        "required": ["record_id", "payload"]
    }
}
The corrected pattern has three times as many tool definitions (a delete_record tool would follow the same shape as read_record), but each one is unambiguously interpretable in isolation. The model never has to reason about which parameters are conditionally relevant, because every parameter in each schema is always relevant.
🎯 Key Principle: A tool's description and parameter list should be independently interpretable without reading any other tool's definition. If understanding one parameter requires knowing the value of another, the tool is a candidate for splitting.
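For completeness, here is what the third narrow tool might look like. This is a sketch that mirrors the shape of read_record; the wording of the description, including the claim that deletion is permanent, is an assumption about the underlying store rather than something the earlier listing specifies.

```python
# Hypothetical third tool, following the same shape as read_record.
delete_record_tool = {
    "name": "delete_record",
    # Assumption: deletion is permanent in the underlying store.
    "description": "Permanently delete a single record by its ID.",
    "parameters": {
        "type": "object",
        "properties": {
            "record_id": {
                "type": "string",
                "description": "The unique identifier of the record to delete.",
            }
        },
        "required": ["record_id"],
    },
}

# Every declared parameter is also required: nothing is conditionally relevant.
declared = set(delete_record_tool["parameters"]["properties"])
required = set(delete_record_tool["parameters"]["required"])
print(declared == required)  # True
```

Stating the destructive nature of the call in the description gives the model a reason to prefer read_record or update_record when the task is ambiguous.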
Mistake 2: Missing or Misleading Parameter Descriptions ⚠️
Parameter names feel self-explanatory to the developer who wrote them. The model, however, encounters the schema without any of the surrounding context the developer had in mind. A parameter named query means something very different in a full-text search tool versus a SQL execution tool versus a vector similarity search tool — and in the absence of a description, the model infers meaning from the parameter name alone, which is often wrong in ways that are subtle and hard to debug.
The problem compounds when parameter names are reused across tools. If your agent has both a search_documents tool and an execute_sql tool, and both have an undescribed query parameter, the model may apply the conventions of one to the other — for instance, passing a natural-language string to the SQL executor because the task was phrased in natural language.
# ❌ Broken pattern: parameter with no description
search_tool_broken = {
    "name": "search_documents",
    "description": "Search for documents.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},    # What kind of query? Keyword? Semantic?
            "limit": {"type": "integer"},   # Limit of what? Max results? Chars?
            "filter": {"type": "object"}    # What fields can be filtered? What format?
        },
        "required": ["query"]
    }
}

# ✅ Corrected pattern: every parameter described precisely
search_tool_fixed = {
    "name": "search_documents",
    "description": (
        "Perform a semantic similarity search over the document index. "
        "Returns documents ranked by relevance to the query text."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": (
                    "Natural-language text describing the topic to search for. "
                    "Do NOT pass SQL, regex, or structured syntax — use plain prose."
                )
            },
            "limit": {
                "type": "integer",
                "description": "Maximum number of documents to return. Defaults to 10 if omitted.",
                "default": 10
            },
            "filter": {
                "type": "object",
                "description": (
                    "Optional metadata filter. Supported keys: 'author' (string), "
                    "'created_after' (ISO 8601 date string), 'tag' (string). "
                    "Omit to search all documents."
                )
            }
        },
        "required": ["query"]
    }
}
The corrected query description does something important beyond just explaining the field: it explicitly rules out formats the model might otherwise try, like SQL or regex. This is negative specification — describing what a parameter is not — and it prevents a class of errors that positive descriptions alone miss.
💡 Pro Tip: For any parameter that could be misinterpreted as a different format, add a sentence of negative specification: "Do NOT pass X — use Y instead." This is especially important for query, input, data, and value parameters, which carry almost no intrinsic meaning.
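Missing descriptions are easy to catch mechanically before a tool ever reaches the model. The following is a minimal sketch of such a check; `undescribed_params` is a hypothetical helper, and it assumes tool definitions use the same `parameters.properties` layout as the examples above.

```python
def undescribed_params(tool_def: dict) -> list:
    """Return the names of parameters that lack a non-empty description."""
    properties = tool_def.get("parameters", {}).get("properties", {})
    return [
        name
        for name, spec in properties.items()
        if not spec.get("description", "").strip()
    ]


broken = {
    "name": "search_documents",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "limit": {
                "type": "integer",
                "description": "Maximum number of documents to return.",
            },
        },
    },
}

print(undescribed_params(broken))  # ['query']
```

A check like this only proves that a description exists, not that it is accurate or includes negative specification, so it complements review rather than replacing it.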
Mistake 3: Not Distinguishing Tool Errors from Tool Results ⚠️
When a tool fails — a network request times out, a database connection is unavailable, an input fails validation — what should the tool return? Many developers return an error message in the same result field used for successful outputs, reasoning that the model will "understand" that the string "Error: connection timed out" means the tool failed. This reasoning is wrong in a way that causes real damage.
Models process tool results as content to reason about, not as signals to branch on — unless the error is surfaced through a distinct, recognized channel. When an error message arrives in the result field, the model typically does one of two things: it treats the error string as a piece of data and continues reasoning from it (leading to nonsensical downstream behavior), or it paraphrases the error in its response rather than retrying or escalating. Neither behavior is what you want.
The correct pattern is to use structured error returns — a separate channel or a typed envelope that the model and your agentic framework can recognize as a failure, distinct from a successful result with unexpected content.
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolResult:
    """Envelope for all tool returns. success=False means the agent loop
    should treat this as a failure, not as data to reason about."""
    success: bool
    data: Any = None                  # Populated when success=True
    error_code: str | None = None     # Machine-readable error class
    error_message: str | None = None  # Human-readable explanation for the model


# `database` below stands in for your data-access layer.

# ❌ Broken pattern: error returned as raw string in the result
def get_customer_broken(customer_id: str) -> str:
    try:
        return database.fetch("customers", customer_id)  # Returns JSON string on success
    except ConnectionError as e:
        return f"Error: {e}"  # Model sees this as a normal result string


# ✅ Corrected pattern: structured ToolResult with explicit success flag
def get_customer(customer_id: str) -> ToolResult:
    try:
        record = database.fetch("customers", customer_id)
        return ToolResult(success=True, data=record)
    except ConnectionError:
        return ToolResult(
            success=False,
            error_code="DB_CONNECTION_FAILED",
            error_message=(
                f"Could not reach the database to retrieve customer {customer_id}. "
                "Do not assume the customer does not exist — retry or escalate."
            )
        )
    except KeyError:
        return ToolResult(
            success=False,
            error_code="RECORD_NOT_FOUND",
            error_message=f"No customer with ID '{customer_id}' exists in the database."
        )
Notice that error_message in the corrected pattern also includes behavioral guidance: "Do not assume the customer does not exist — retry or escalate." Because the model reads this message when deciding what to do next, the error return is an opportunity to steer behavior explicitly. A DB_CONNECTION_FAILED error and a RECORD_NOT_FOUND error warrant very different follow-up actions, and the structured error makes that distinction machine-readable for the framework and human-readable for the model.
⚠️ Common Mistake: Catching all exceptions and returning a single generic "Tool failed" message. This prevents the model from distinguishing recoverable failures (retry on connection timeout) from terminal ones (record does not exist), collapsing decisions that should be different into the same response.
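On the framework side, the machine-readable error code is what lets the loop treat the two failures differently. The sketch below shows one possible policy; `RETRYABLE_CODES`, `call_with_retry`, and the two simulated lookups are hypothetical names, and which codes count as transient is a decision for your own system.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ToolResult:
    success: bool
    data: Any = None
    error_code: Optional[str] = None
    error_message: Optional[str] = None


# Assumption: only connection failures are considered transient here.
RETRYABLE_CODES = {"DB_CONNECTION_FAILED"}


def call_with_retry(tool: Callable[[], ToolResult], max_attempts: int = 3) -> ToolResult:
    """Retry only failures whose error_code is marked transient."""
    result = tool()
    attempts = 1
    while (
        not result.success
        and result.error_code in RETRYABLE_CODES
        and attempts < max_attempts
    ):
        result = tool()
        attempts += 1
    return result


calls = {"flaky": 0, "missing": 0}


def flaky_lookup() -> ToolResult:
    """Simulated tool: fails twice with a transient error, then succeeds."""
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        return ToolResult(False, error_code="DB_CONNECTION_FAILED",
                          error_message="database unreachable")
    return ToolResult(True, data={"id": "c-1"})


def missing_lookup() -> ToolResult:
    """Simulated tool: always reports a terminal not-found error."""
    calls["missing"] += 1
    return ToolResult(False, error_code="RECORD_NOT_FOUND",
                      error_message="no such customer")


print(call_with_retry(flaky_lookup).success, calls["flaky"])      # True 3
print(call_with_retry(missing_lookup).success, calls["missing"])  # False 1
```

The not-found lookup is called exactly once: retrying it would only spend time and tokens on an answer that cannot change.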
Mistake 4: Registering Tools the Agent Cannot Actually Invoke ⚠️
This mistake is architectural rather than definitional, but it surfaces as a tool problem: an agent is given a tool schema for, say, a run_sql_query tool, but at runtime the database connection credentials are missing, the service is unavailable, or the tool simply hasn't been wired up in the current deployment environment. The agent sees the tool as available — it's in the registry — and confidently calls it. The call fails, but not cleanly: depending on the error handling, the agent may retry, hallucinate a plausible result, or continue with a corrupted understanding of what it accomplished.
Ghost tools — tools that are registered but not actually callable — introduce non-determinism into the agent loop that is very hard to debug because the failures look like model reasoning errors, not infrastructure errors.
Agent loop with a ghost tool:

  [Task: "Find all orders over $500"]
           │
           ▼
  ┌─────────────────┐
  │  Tool registry  │ ◄── run_sql_query registered ✓
  │                 │     (but DB credentials absent)
  └────────┬────────┘
           │  Agent calls run_sql_query("SELECT...")
           ▼
  ┌─────────────────┐
  │  Tool executor  │ ── ConnectionError ──► Unstructured error
  └────────┬────────┘                        returned as string
           │
           ▼
  Model receives: "Error: no such host"
           │
           ├─ Interprets as: "query returned no results"
           │
           └─ Continues: "No orders over $500 were found."
                                  ↑
                    Wrong answer, no visible failure
The fix operates at two levels. First, validate tool availability at registration time instead of discovering the problem at call time. A tool that requires a database connection should check that connection during the registration or startup phase and either succeed fully or fail loudly:
from typing import Callable


class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, tool_def: dict, handler: Callable, probe: Callable | None = None):
        """
        Register a tool with an optional availability probe.
        If probe is provided, it is called immediately; if it raises,
        the tool is NOT registered and the error is surfaced at startup.
        """
        tool_name = tool_def["name"]
        if probe is not None:
            try:
                probe()  # e.g., run a SELECT 1 against the database
            except Exception as e:
                # Fail loudly at startup, not silently mid-loop
                raise RuntimeError(
                    f"Tool '{tool_name}' failed availability check: {e}. "
                    f"Resolve the infrastructure issue before registering this tool."
                ) from e
        self._tools[tool_name] = {"definition": tool_def, "handler": handler}
        return self


# Usage: probe runs at registration time.
# get_db_connection, run_sql_tool_schema, and run_sql_handler are defined elsewhere.
def db_probe():
    """Raises if the database is unreachable."""
    conn = get_db_connection()  # Will raise if credentials missing or host unreachable
    conn.execute("SELECT 1")
    conn.close()


registry = ToolRegistry()

# This raises immediately if DB is unavailable, preventing ghost tool registration
registry.register(
    tool_def=run_sql_tool_schema,
    handler=run_sql_handler,
    probe=db_probe
)
Second, in environments where tool availability is dynamic (a user's OAuth token may expire mid-session), build availability re-checking into the tool handler itself and return a structured TOOL_UNAVAILABLE error — not a data-shaped failure — so the framework can decide whether to surface this to the user, attempt re-authentication, or suspend the task.
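A handler-level re-check might look like the sketch below. The `Session` object, its `token_expires_at` field, and the `list_calendar_events` tool are hypothetical; the real check would depend on how your system tracks credentials.

```python
import time
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolResult:
    success: bool
    data: Any = None
    error_code: Optional[str] = None
    error_message: Optional[str] = None


@dataclass
class Session:
    # Hypothetical session object: OAuth token expiry as epoch seconds.
    token_expires_at: float


def list_calendar_events(session: Session, now: Optional[float] = None) -> ToolResult:
    """Re-check availability on every call, before touching the remote service."""
    now = time.time() if now is None else now
    if now >= session.token_expires_at:
        return ToolResult(
            success=False,
            error_code="TOOL_UNAVAILABLE",
            error_message=(
                "Calendar access has expired. Ask the user to re-authenticate; "
                "do not assume the calendar is empty."
            ),
        )
    # Placeholder for the real API call.
    return ToolResult(success=True, data=[])


expired = Session(token_expires_at=time.time() - 60)
print(list_calendar_events(expired).error_code)  # TOOL_UNAVAILABLE
```

Note that the unavailable case and the "no events" case return different shapes: an empty list with success=True means the calendar was read, while TOOL_UNAVAILABLE means it was not.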
💡 Real-World Example: A common manifestation of this mistake is copying a tool registry from a production configuration into a development environment, where several external service connections are absent. The agent runs, produces confident-sounding outputs, and every tool that touches a missing service silently fails. The outputs look plausible because the model interpolates from context, making the errors very hard to spot without per-tool execution logging.
Mistake 5: Unbounded Tool Outputs ⚠️
A tool that returns its output without size constraints will, sooner or later, return something enormous — a full file, a paginated API response that wasn't paginated, a database table with thousands of rows. When this lands in the agent's context window, it consumes space that was allocated for reasoning, tool call history, and the current task. In a multi-step agent loop, one oversized tool result can crowd out the memory of earlier steps, causing the agent to lose track of its progress or repeat work it already did.
The more subtle version of this mistake is not a single catastrophic response but accumulating bloat: ten tool calls each returning a few hundred tokens of unnecessary metadata slowly consume the context budget, and the degradation looks like "the model gets worse at long tasks" rather than "the tools are returning too much data."
🎯 Key Principle: The tool, not the caller, is responsible for bounding its output. The tool knows which parts of its result matter; a generic agent loop does not, and once a 50,000-token result has been appended to the conversation, the context budget is already spent.
The fix involves building three defenses into every tool that could return variable-length output:
- Hard truncation with a visible marker: the tool cuts off at a defined token or character limit and appends a signal (e.g., "[TRUNCATED: 1,240 additional records not shown]"), so the model knows it has a partial view rather than a complete one.
- Pagination parameters: expose page and page_size parameters in the tool schema so the agent can request further data explicitly when needed, rather than receiving everything at once.
- Summary mode: for tools that return structured data, provide a summarize: bool parameter that returns counts, ranges, and key statistics instead of raw rows.
import json

MAX_RESULT_CHARS = 4000  # Tune based on your context budget allocation for tool results


def search_logs(
    query: str,
    page: int = 1,
    page_size: int = 20,
    summarize: bool = False
) -> ToolResult:
    """
    Search application logs. Returns at most page_size results per call.
    If summarize=True, returns counts and time-range metadata instead of raw entries.
    """
    raw_results = log_index.search(query, offset=(page - 1) * page_size, limit=page_size)
    total_count = log_index.count(query)

    if summarize:
        # Return lightweight metadata rather than raw log lines
        summary = {
            "total_matches": total_count,
            "returned_page": page,
            "page_size": page_size,
            "earliest_match": raw_results[0]["timestamp"] if raw_results else None,
            "latest_match": raw_results[-1]["timestamp"] if raw_results else None,
            "sample_messages": [r["message"][:100] for r in raw_results[:3]]  # 3 samples only
        }
        return ToolResult(success=True, data=summary)

    # Serialize and enforce the hard character limit
    serialized = json.dumps(raw_results)
    if len(serialized) > MAX_RESULT_CHARS:
        truncated = serialized[:MAX_RESULT_CHARS]
        truncation_note = (
            f"[TRUNCATED — showing partial page {page} of results. "
            f"Total matches: {total_count}. Use page={page + 1} for more, "
            f"or summarize=True for an overview.]"
        )
        return ToolResult(
            success=True,
            data=truncated + truncation_note
        )

    return ToolResult(success=True, data=raw_results)
This implementation enforces a hard character cap and communicates the truncation explicitly, including a navigation hint (page={page + 1}) the model can use to continue without human intervention. The summarize mode gives the model a low-cost first pass before committing to fetching full results — a pattern especially valuable when the model needs to assess whether a search is productive before retrieving all matching records.
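Slicing the serialized string, as in the listing above, cuts the payload mid-record, so the truncated data no longer parses as JSON. If downstream code or the model benefits from well-formed output, an alternative is to truncate at record boundaries. The helper below, `bound_records`, is a hypothetical sketch of that variant and assumes the default separators used by `json.dumps`.

```python
import json


def bound_records(records: list, max_chars: int) -> tuple:
    """Keep whole records until the serialized list would exceed max_chars.

    Returns (kept_records, omitted_count).
    """
    kept = []
    used = 2  # the enclosing "[" and "]"
    for record in records:
        # json.dumps separates list items with ", " by default.
        size = len(json.dumps(record)) + (2 if kept else 0)
        if used + size > max_chars:
            break
        kept.append(record)
        used += size
    return kept, len(records) - len(kept)


records = [{"message": "x" * 10} for _ in range(5)]
kept, omitted = bound_records(records, max_chars=60)

payload = json.dumps(kept)
note = f"[TRUNCATED: {omitted} additional records not shown]" if omitted else ""
print(len(payload) <= 60, omitted)
```

The trade-off is that a single record larger than the limit yields an empty list, so a production version would still need a per-record cap or a fallback to the string-slicing approach.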
🤔 Did you know? Unbounded outputs are particularly costly in multi-agent pipelines where one agent's tool result becomes another agent's input. A single oversized response can cascade through the pipeline, consuming context in every downstream agent before the first useful reasoning step has occurred.
Seeing the Mistakes as a System
These five mistakes are individually avoidable, but they tend to co-occur: developers building their first tool-enabled agent often write overloaded tools with undescribed parameters, minimal error handling, no availability checking, and no output bounding — all at once. The resulting system is hard to debug because any single observed failure could be caused by any combination of these issues.
A useful diagnostic framing is to ask, for each tool in your registry:
Diagnostic checklist per tool:

  1. Can this tool's purpose be stated in one sentence
     without the word "or"?                              ──► If no, split it.
  2. Would a developer who has never seen your codebase
     know exactly what each parameter expects?           ──► If no, add descriptions.
  3. If this tool fails, does the return value look
     different from a successful return?                 ──► If no, add structured errors.
  4. Has this tool been called successfully in the
     current environment since the last deploy?          ──► If no, add a probe.
  5. What is the largest response this tool could
     plausibly return? Does that fit your context
     budget for tool results?                            ──► If no, add truncation.
Working through this checklist for each registered tool before an agent goes into production surfaces most of the issues described in this section. The checklist is not exhaustive — there are failure modes particular to specific tool types, specific frameworks, and specific deployment environments — but it catches the category of mistakes that appear most frequently across different agent implementations.
📋 Quick Reference Card:
| 🔧 Mistake | ❌ Symptom | ✅ Fix |
|---|---|---|
| 🔀 Overloaded tool | Model chooses wrong mode or omits conditional params | Split into single-purpose tools |
| 📝 Missing descriptions | Model misinterprets parameter format or intent | Describe every param, add negative specs |
| ⚠️ Unstructured errors | Model treats failures as valid data | Return typed ToolResult with success flag |
| 👻 Ghost tools | Non-deterministic mid-loop failures | Probe at registration time |
| 📦 Unbounded output | Context budget consumed by single result | Truncate in tool, expose pagination |
The section that follows consolidates the key concepts from this lesson and connects them to the specifics of MCP, A2A, and the deeper tool design patterns covered in the child lessons.
Key Takeaways and Preparing for MCP, A2A, and Tool Design
This lesson began with a deceptively simple question: why can't agents just call functions directly, the way any other piece of software does? The answer, worked out across five sections, is that agent systems have properties that make informal function calls collapse under their own weight — non-deterministic execution paths, multi-step planning, dynamic capability composition, and distributed deployment across processes or hosts. Structured protocols and well-specified tools aren't bureaucratic overhead; they're the load-bearing structure that keeps those properties manageable. What follows is a consolidation of what you now understand, why it matters, and exactly what the child lessons ahead will build on top of it.
The Three Properties That Make a Tool Real
Every concept in this lesson reduces, at some point, to a single idea: a tool is a typed, described, side-effect-classified function. If you get those three properties right, almost everything else in the protocol and composition layers can work. If you get them wrong, no amount of framework sophistication will save you.
Let's be precise about what each property actually means:
Typed means that inputs and outputs have explicit schemas — not informal documentation, but machine-readable contracts that a host can validate before calling and that a caller can parse after receiving. The type system is the first line of defense against the agent passing a string where an integer is expected, or interpreting a null return as success.
Described means that the tool's purpose, parameter semantics, and constraints are expressed in a form the planning layer can reason about. For LLM-based agents, this is typically a natural-language description embedded in the schema. The description is not documentation for a human developer reading the code — it is an operational input to the model's routing decisions. A vague or misleading description produces wrong tool selection, which is a silent, hard-to-debug failure mode.
Side-effect-classified means that the tool is explicitly labeled as read-only, write-once, idempotent, or destructive. This classification is what allows an orchestrator to safely retry, cache, or parallelize calls without risking data corruption. Without it, the agent loop must treat every tool as potentially destructive, which either breaks retry logic or makes it unsafe.
The following code block shows a minimal but complete tool definition that satisfies all three properties. Notice that the side-effect classification is expressed as a field in the schema metadata, not buried in prose comments:
```python
from pydantic import BaseModel, Field


class SearchDocumentsInput(BaseModel):
    query: str = Field(
        description="Natural language query to search the document index."
    )
    max_results: int = Field(
        default=5,
        ge=1,
        le=20,
        description="Maximum number of documents to return (1–20).",
    )


class SearchResult(BaseModel):
    id: str
    title: str
    snippet: str


class SearchDocumentsOutput(BaseModel):
    # A typed item model keeps the result shape in the schema, not in a comment.
    results: list[SearchResult]
    total_found: int


# Tool descriptor: this is the artifact the agent host registers and reasons about.
SEARCH_TOOL = {
    "name": "search_documents",
    "description": (
        "Search the internal document index by natural language query. "
        "Returns ranked snippets. Does NOT modify any data."
    ),
    "input_schema": SearchDocumentsInput.model_json_schema(),
    "output_schema": SearchDocumentsOutput.model_json_schema(),
    "side_effects": "none",  # Declares: safe to retry, cache, parallelize
    "idempotent": True,
}
```
This is a simplified picture: in production you'd also handle pagination tokens, authentication context, and timeout declarations. Even so, the three core properties are all present and machine-readable. The side_effects: none declaration is the detail informal implementations most often omit, and its absence is a common reason retry logic in agentic systems causes unintended duplicate writes: without a declared classification, the orchestrator cannot tell a safe read from a write that must not be repeated.
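To make the retry point concrete, here is a minimal sketch of how an orchestrator might consult the descriptor before retrying. The helper names (is_safe_to_retry, call_with_retry) and the two sample descriptors are assumptions for illustration; only the side_effects and idempotent fields come from the descriptor shown earlier.

```python
from typing import Any, Callable, Optional

# Hypothetical descriptors using the same "side_effects" / "idempotent" fields
# as the tool descriptor in this lesson; the values are illustrative only.
READ_TOOL = {"name": "search_documents", "side_effects": "none", "idempotent": True}
EMAIL_TOOL = {"name": "send_email", "side_effects": "external", "idempotent": False}


def is_safe_to_retry(tool: dict) -> bool:
    """Retry only when a repeated call cannot duplicate or compound an effect."""
    if tool.get("side_effects") == "none":
        return True
    # A tool with effects may still be retried if it declares itself idempotent.
    return bool(tool.get("idempotent", False))


def call_with_retry(tool: dict, fn: Callable[[], Any], max_attempts: int = 3) -> Any:
    """Call fn, retrying on failure only if the descriptor says it is safe."""
    attempts = max_attempts if is_safe_to_retry(tool) else 1
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:  # Sketch only: real code would catch narrower errors.
            last_error = exc
    raise RuntimeError(f"{tool['name']} failed after {attempts} attempt(s)") from last_error
```

With this shape, a transient failure in a read-only tool is retried up to the limit, while a failed send_email call surfaces after a single attempt so that a human or a higher-level policy can decide what happens next.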
💡 Mental Model: Think of the three properties as the public contract of the tool. The implementation behind them can change freely as long as the contract holds. This is the same principle that makes REST APIs stable across versions — the interface is the promise, not the implementation.
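One way to make the contract enforceable is for the host to check a model-proposed call against the declared input schema before dispatching it. The sketch below hand-rolls a tiny validator over a hand-written schema that mirrors the search tool's input; the function name validate_arguments and the supported keyword subset (type, required, minimum, maximum) are assumptions for illustration. A real host would more likely rely on a full JSON Schema validator or on the typed model itself.

```python
# Hand-written schema shaped like the JSON Schema a typed input model would emit.
SEARCH_INPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "max_results": {"type": "integer", "minimum": 1, "maximum": 20, "default": 5},
    },
    "required": ["query"],
}

_JSON_TYPES = {"string": str, "integer": int, "boolean": bool, "object": dict, "array": list}


def validate_arguments(schema: dict, args: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the call may proceed."""
    errors = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required field: {name}")
    for name, value in args.items():
        spec = props.get(name)
        if spec is None:
            errors.append(f"unknown field: {name}")
            continue
        expected = _JSON_TYPES.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it explicitly for integers.
        wrong_type = expected is not None and (
            not isinstance(value, expected)
            or (expected is int and isinstance(value, bool))
        )
        if wrong_type:
            errors.append(f"{name}: expected {spec['type']}")
            continue
        if "minimum" in spec and value < spec["minimum"]:
            errors.append(f"{name}: below minimum {spec['minimum']}")
        if "maximum" in spec and value > spec["maximum"]:
            errors.append(f"{name}: above maximum {spec['maximum']}")
    return errors
```

Because the check runs against the schema rather than the implementation, the tool behind it can be rewritten freely; the host only notices if the published contract changes.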
Protocols: The Value Is the Contract, Not the Transport
One of the cleaner conceptual distinctions in this lesson is the difference between a protocol and a transport. It's tempting to think of a protocol as a particular wire format or HTTP endpoint convention. That conflates two things that should be separated:
- The transport is the mechanism by which bytes move between agent host and tool implementation: HTTP, stdin/stdout, gRPC, shared memory.
- The protocol is the semantic contract — the envelope structure, negotiation handshake, capability advertisement, error taxonomy, and versioning convention that both sides agree to honor regardless of transport.
The value of a protocol is entirely in the stability of that contract. When the contract is stable, you can swap transports without changing either side's business logic. You can replace the tool implementation without changing the host. You can introduce a new agent host without rewriting the tools. The decoupling is what makes the system composable at scale.
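The transport-swapping claim can be sketched in a few lines. Everything here is hypothetical: the request and response shapes, the handler, and the two toy transports (a direct function call, and newline-delimited JSON over in-memory text streams standing in for stdin/stdout). The point is only that the host logic and the tool logic stay unchanged when the transport changes.

```python
import io
import json
from typing import Callable

# Hypothetical contract: a request is {"tool": str, "arguments": dict},
# a response is {"ok": bool, "result": ...}. Both sides honor it on any transport.
Handler = Callable[[dict], dict]


def search_handler(request: dict) -> dict:
    """Tool-side business logic; knows nothing about how the bytes arrived."""
    query = request["arguments"]["query"]
    return {"ok": True, "result": [f"doc matching {query!r}"]}


def in_process_transport(handler: Handler) -> Handler:
    """Transport 1: a direct function call within the same process."""
    return handler


def line_stream_transport(handler: Handler) -> Handler:
    """Transport 2: newline-delimited JSON over text streams (stdin/stdout stand-in)."""
    def send(request: dict) -> dict:
        to_tool = io.StringIO(json.dumps(request) + "\n")   # host writes one line
        reply = handler(json.loads(to_tool.readline()))     # tool reads and handles it
        from_tool = io.StringIO(json.dumps(reply) + "\n")   # tool writes one line
        return json.loads(from_tool.readline())             # host reads the reply
    return send


def host_call(send: Handler, query: str) -> dict:
    """Host-side logic: identical for every transport."""
    return send({"tool": "search_documents", "arguments": {"query": query}})
```

Calling host_call with either transport wrapped around search_handler yields the same response, which is the property the contract is meant to guarantee.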
An informal convention — "just look at how the other tools are structured and do the same thing" — works fine for two or three tools owned by one team. It breaks when the fourth team joins and interprets "the same thing" differently, or when the first tool is refactored and the undocumented convention silently changes. The protocol layer exists to make that convention explicit, versioned, and machine-checkable.
INFORMAL CONVENTION (breaks at scale)
Agent Host ──── ad-hoc JSON ───► Tool A (Team 1's format)
Agent Host ──── ad-hoc JSON ───► Tool B (Team 2's slightly different format)
Agent Host ──── ad-hoc JSON ───► Tool C (Team 3's format, evolved from Team 1)
Host must contain per-tool adapter logic → O(n) complexity as tools grow
STANDARDIZED PROTOCOL (scales)
Agent Host ──── Protocol Envelope ───► Tool A }
Agent Host ──── Protocol Envelope ───► Tool B } All speak the same contract
Agent Host ──── Protocol Envelope ───► Tool C }
Host contains one protocol adapter → O(1) complexity regardless of tool count
🎯 Key Principle: A protocol standardizes the envelope, negotiation, and error conventions — not the tool's business logic. The business logic stays inside the tool. This is what keeps the protocol layer thin and the tools independently deployable.
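As a rough illustration of those three elements, the sketch below defines a request/response envelope, a version negotiation step, and a small fixed set of error codes. All names, version strings, and codes are invented for this example; this is not the MCP or A2A wire format.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

# Hypothetical protocol versions, listed in the host's order of preference.
SUPPORTED_VERSIONS = ("1.1", "1.0")


@dataclass
class Request:
    id: int
    tool: str
    arguments: dict = field(default_factory=dict)


@dataclass
class Response:
    id: int
    result: Any = None
    error: Optional[dict] = None  # {"code": ..., "message": ...} on failure


def negotiate_version(server_versions: list[str]) -> str:
    """Pick the first host-preferred version the server also advertises."""
    for version in SUPPORTED_VERSIONS:
        if version in server_versions:
            return version
    raise ValueError("no protocol version in common")


def dispatch(request: Request, tools: dict[str, Callable[..., Any]]) -> Response:
    """Wrap any tool's outcome in the same envelope, including failures.

    Error codes used: unknown_tool, invalid_arguments, tool_failure.
    """
    fn = tools.get(request.tool)
    if fn is None:
        return Response(request.id, error={"code": "unknown_tool", "message": request.tool})
    try:
        return Response(request.id, result=fn(**request.arguments))
    except TypeError as exc:
        # Usually wrong or missing arguments. Sketch limitation: a TypeError
        # raised inside the tool body would be reported under this code too.
        return Response(request.id, error={"code": "invalid_arguments", "message": str(exc)})
    except Exception as exc:
        # Anything else raised by the tool itself.
        return Response(request.id, error={"code": "tool_failure", "message": str(exc)})
```

Note that dispatch never inspects what a tool does. It only standardizes how the call and its outcome are wrapped, which is what keeps the protocol layer thin.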
The child lesson on MCP and A2A will show what those three elements (envelope, negotiation, and error conventions) look like in two specific open protocols. What you bring to that lesson is an understanding of why each element exists, so the specific design choices in MCP and A2A read as deliberate rather than arbitrary.
Progressive Skill Loading: An Orchestration Decision, Not a Model Decision
The section on skill loading introduced a distinction that deserves to be restated clearly, because it's the one most frequently blurred in practice: which tools are active at any moment is an orchestration decision, not a model decision.
The model — the LLM at the center of the agent loop — can only reason about the tools it has been told exist. It cannot introspect the full tool registry, decide to add a capability, or remove one that's been made available to it. Those decisions happen at the agent loop level: the code that assembles the context window, selects which tool schemas to include, and dispatches calls. The model works within whatever capability set the loop gives it.
This matters for two practical reasons:
Auditability. If tool selection were a model decision, auditing which capabilities were active during a given run would require inspecting the model's reasoning trace — which is probabilistic and not reliably reproducible. When tool selection is an orchestration decision, it's a deterministic code path that can be logged, replayed, and tested.
Security. Privilege escalation in agentic systems often exploits a gap between what the model has been persuaded to attempt and what the orchestration layer actually allows. Keeping those two levels separated (the model reasons, the orchestrator gates) is a structural defense against prompt injection attacks that try to convince the model to invoke capabilities it shouldn't have: a tool that was never placed in the active set is not available to call, whatever the prompt says.
The following code block shows the pattern: the orchestrator explicitly computes the active tool set based on context, then passes only that set to the model. The model never sees the full registry.
```python
# Assumes SEARCH_TOOL from the earlier example. WRITE_REPORT_TOOL, SEND_EMAIL_TOOL,
# DELETE_RECORD_TOOL, and model_call are placeholders defined elsewhere in the
# application; model_call stands in for whatever LLM client the host uses.

# Full registry: all tools available to this agent application
TOOL_REGISTRY: dict[str, dict] = {
    "search_documents": SEARCH_TOOL,
    "write_report": WRITE_REPORT_TOOL,    # side_effects: "write"
    "send_email": SEND_EMAIL_TOOL,        # side_effects: "external"
    "delete_record": DELETE_RECORD_TOOL,  # side_effects: "destructive"
}


def select_tools_for_context(
    user_role: str,
    task_phase: str,
) -> list[dict]:
    """
    Orchestrator-level decision: derive the active tool set from explicit
    context signals. This function, not the model, controls capability scope.
    """
    active = []

    # All roles can read.
    active.append(TOOL_REGISTRY["search_documents"])

    # Only writers can produce output artifacts.
    if user_role in ("editor", "admin"):
        active.append(TOOL_REGISTRY["write_report"])

    # External side effects only during 'finalize' phase, never during 'draft'.
    if task_phase == "finalize" and user_role == "admin":
        active.append(TOOL_REGISTRY["send_email"])

    # Destructive tools: never exposed automatically.
    # (Requires explicit operator configuration, not covered here.)
    return active


def run_agent_turn(user_role: str, task_phase: str, user_message: str) -> str:
    active_tools = select_tools_for_context(user_role, task_phase)

    # Log the capability decision: this is the auditable record.
    print(f"[ORCH] Active tools for this turn: {[t['name'] for t in active_tools]}")

    # Pass only the active tool schemas to the model.
    response = model_call(
        messages=[{"role": "user", "content": user_message}],
        tools=active_tools,  # Model sees only what the orchestrator permits.
    )
    return response
```
The select_tools_for_context function is deterministic: given the same inputs, it always produces the same tool set, and that decision is logged before the model call. This is the auditable, explicit approach the lesson advocates. The alternative — letting the model request tools dynamically from an unrestricted registry — moves the capability gating decision into a probabilistic process, which breaks both auditability and security boundaries.
⚠️ Critical Point to Remember: Dynamic skill loading is not the same as uncontrolled skill loading. An agent can legitimately acquire new tools at runtime — by passing through the orchestrator's gating logic. The problem arises when tools are added in response to the model's request rather than the system's explicit rules. Always gate on context, role, and phase — never on what the model asks for.
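Gating at selection time is only half of the defense; the dispatcher can enforce the same boundary when a tool call comes back from the model. The following is a minimal sketch under assumed names (execute_tool_call, ToolNotPermitted, and the argument shapes are all hypothetical): a call is executed only if the requested tool was in the active set for this turn, regardless of whether it exists in the wider registry.

```python
from typing import Any, Callable


class ToolNotPermitted(Exception):
    """Raised when the model requests a tool outside the active set."""


def execute_tool_call(
    requested_name: str,
    arguments: dict,
    active_tool_names: set[str],
    implementations: dict[str, Callable[..., Any]],
) -> Any:
    """Run a model-requested call only if the orchestrator activated that tool."""
    if requested_name not in active_tool_names:
        # The tool may exist in the registry, but it was not granted for this turn.
        raise ToolNotPermitted(requested_name)
    return implementations[requested_name](**arguments)
```

A request for delete_record during a turn where only search_documents is active raises ToolNotPermitted before any implementation code runs, so an injected instruction cannot widen the capability set by naming a tool.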
What You Now Understand That You Didn't Before
Before this lesson, a reasonable developer might approach agentic tool integration the way they'd approach any other API call: write a function, expose an endpoint, pass the URL to the agent, and see what happens. That approach works for a single tool in a controlled demo. It produces systems that are unreliable, unauditable, and insecure at any meaningful scale.
Here is the specific conceptual inventory you've built:
📋 Quick Reference Card: Core Concepts Acquired
| 🔧 Concept | 📚 What It Means | 🎯 Why It Matters |
|---|---|---|
| 🔒 Tool anatomy | Typed inputs and outputs, a planner-facing description, side-effect classification | Enables correct tool selection plus safe retry, caching, and parallel execution |
| 📡 Protocol vs. transport | Semantic contract vs. byte-moving mechanism | Lets you swap implementations without changing interfaces |
| 🧩 Envelope structure | Standardized wrapper around tool calls and results | Keeps host/tool logic independent of each other |
| 🔄 Static vs. dynamic loading | Fixed tool set vs. runtime-selected tool set | Dynamic loading requires explicit orchestration gating |
| 🛡️ Orchestration gating | Capability selection happens in code, not in the model | Prerequisite for auditability and security |
| ⚠️ Side-effect taxonomy | None / write / external / destructive | The missing classification that breaks most retry logic |
💡 Real-World Example: A team building a customer-support agent ships with twelve tools in the registry. Six months in, they discover that the agent occasionally calls escalate_to_human during automated batch runs. That tool opens a real ticket and notifies a live agent. It wasn't gated on the batch task phase because the gating logic lived in the model's system prompt, not in the orchestrator. Moving the rule into the orchestrator, as a one-line condition in a function like select_tools_for_context, closes the gap for every future run. Had the gating lived in code from the start, the missing batch rule would have been visible in code review and testable, rather than hidden in prompt wording.
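A sketch of what that fix might look like follows. Apart from escalate_to_human, the tool names and the phase labels are invented for illustration, and the function is a simplified stand-in for a real selection routine.

```python
# Hypothetical tool names and phases, mirroring the scenario described above.
INTERACTIVE_ONLY = {"escalate_to_human"}


def select_support_tools(task_phase: str) -> list[str]:
    """Return the tool names active for a support-agent turn."""
    active = ["lookup_order", "draft_reply", "escalate_to_human"]
    if task_phase == "batch":
        # The fix: tools that reach a live person are never active in batch runs.
        active = [name for name in active if name not in INTERACTIVE_ONLY]
    return active
```

Because the rule is now ordinary code, it can be pinned down with a regression test that asserts escalate_to_human is absent from the batch tool set.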
How the Child Lessons Build on This Foundation
The two child lessons ahead are both grounded in concepts introduced here. Understanding where each one picks up will help you read them with the right frame.
The MCP & A2A Child Lesson
The sections above established why protocols need envelopes, negotiation, and error conventions, but deliberately stayed abstract about the specific choices any real protocol makes. The child lesson on MCP (Model Context Protocol) and A2A (Agent2Agent) fills in the concrete specification: what the envelope looks like in JSON, how capability advertisement works in practice, what the error taxonomy is, and how versioning is handled between a host and a tool server that may have been written by different teams at different times.
Coming into that lesson, your job is to recognize which design decisions map to concepts you already understand. When you see MCP's initialize handshake, you'll recognize it as the negotiation phase. When you see its error code structure, you'll recognize it as the error convention. The protocol is no longer a black box — it's a specific implementation of a pattern you can now describe in the abstract.
The Tool Design & Skill Loading Child Lesson
The tool anatomy and loading patterns covered here are the theoretical prerequisites. The child lesson applies them to production-grade design decisions: how to version a tool schema without breaking existing agents, how to handle partial failures in a tool chain, how to classify side effects for tools whose idempotency depends on their parameters, and how to structure a skill-loading configuration file that operations teams can read and modify without touching application code.
The simplifications in this lesson, particularly the clean separation between "static" and "dynamic" loading, will be refined there. Real systems often blend the two, with a static base set augmented by dynamically loaded specialist tools triggered by task classification. That lesson will show you how to reason about those hybrid patterns without losing the auditable, explicit gating property.
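As a preview of that hybrid shape, here is one minimal way it could look. The tool names, task classes, roles, and the load_tools function are all hypothetical, and the task class is assumed to arrive from some upstream classification step; the only property the sketch is meant to show is that the mapping from task class to tools lives in code and produces a loggable record.

```python
# Hypothetical tool names, task labels, and roles; a sketch of one hybrid pattern.
BASE_TOOLS = ["search_documents"]      # static: always active
SPECIALIST_TOOLS = {                   # dynamic: keyed by task class
    "reporting": ["write_report"],
    "outreach": ["send_email"],
}
ROLES_ALLOWED = {"write_report": {"editor", "admin"}, "send_email": {"admin"}}


def load_tools(task_class: str, user_role: str) -> dict:
    """Combine the static base set with role-gated specialist tools for a task class."""
    candidates = SPECIALIST_TOOLS.get(task_class, [])
    granted = [t for t in candidates if user_role in ROLES_ALLOWED.get(t, set())]
    denied = [t for t in candidates if t not in granted]
    # The returned record doubles as the audit entry for this decision.
    return {
        "task_class": task_class,
        "user_role": user_role,
        "active": BASE_TOOLS + granted,
        "denied": denied,
    }
```

The same inputs always produce the same record, so the decision can be logged before the model call and replayed later when investigating a run.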
Practical Next Steps
Before moving into the child lessons, three practical applications will solidify what you've learned:
1. Audit an existing tool definition you own or have access to. Take any function currently being used as an agent tool and check it against the three properties: Is the input schema machine-readable and validated? Is the description written for the planning layer, not for a human reading the source? Is the side-effect level explicitly declared? Most tools fail at least one of these. The correction is usually small — adding a schema field, rewriting a description from the implementation's perspective to the caller's perspective — but the impact on reliability is disproportionate.
2. Draw the capability boundary for one agent workflow. Pick a workflow with at least three tools and map which tools should be active in each phase (draft, review, finalize, for example). Then ask: is that mapping currently enforced in code, or is it expressed only in the system prompt? If the latter, you've identified a concrete security and reliability gap that the orchestration gating pattern addresses.
3. Read the MCP specification overview. The MCP specification is publicly available. Before starting the child lesson, skim the parts that cover message structure and initialization. The goal is not to understand every detail, but to see how the abstract concepts from this lesson surface in a real protocol document. You'll find the reading significantly easier now than it would have been before this lesson, because you have a mental scaffold to hang the specifics on.
🧠 Mnemonic: TDS — Typed, Described, Side-effect-classified. If a tool satisfies TDS, it's ready for a protocol layer. If it doesn't, fixing the protocol won't help.
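The TDS check from the first next step can be partly automated. The sketch below assumes descriptors shaped like the search tool descriptor in this lesson (input_schema, output_schema, description, side_effects); the function name audit_tool and the five-word description threshold are arbitrary choices for illustration. It can confirm that each property is declared, not that a description is actually good.

```python
VALID_SIDE_EFFECTS = {"none", "write", "external", "destructive"}


def audit_tool(descriptor: dict) -> list[str]:
    """Return the TDS properties a descriptor fails; empty means it passes this check."""
    problems = []
    # Typed: both schemas must be present and declare properties.
    for key in ("input_schema", "output_schema"):
        schema = descriptor.get(key)
        if not isinstance(schema, dict) or "properties" not in schema:
            problems.append(f"typed: {key} missing or has no properties")
    # Described: a non-trivial description must exist (quality still needs a human eye).
    if len(descriptor.get("description", "").split()) < 5:
        problems.append("described: description missing or too short")
    # Side-effect-classified: the level must be one of the known values.
    if descriptor.get("side_effects") not in VALID_SIDE_EFFECTS:
        problems.append("side-effect-classified: side_effects missing or unknown")
    return problems
```

Running this over every descriptor in a registry gives a quick list of tools to fix before the protocol layer is introduced.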
A Final Word on the Right Level of Abstraction
One of the harder skills in building agentic systems is knowing which problems to solve at which layer. The tool layer handles individual function semantics. The protocol layer handles cross-boundary communication contracts. The orchestration layer handles capability selection and task routing. The model layer handles reasoning and planning.
A common failure mode is solving a protocol problem at the tool layer (hardcoding retry logic inside the tool function), or solving an orchestration problem at the model layer (writing capability gating into the system prompt). These misplacements feel expedient in the short term and become load-bearing technical debt in the medium term — the kind that causes the customer-support agent to file real tickets during batch runs.
The conceptual framework from this lesson — tool anatomy, protocol contracts, and orchestration-level gating — is a useful heuristic for assigning problems to the right layer. It covers the most common cases. It won't resolve every edge case in production, and the child lessons will introduce additional nuance. But it gives you a consistent mental structure for asking the right question first: which layer owns this decision?
⚠️ Final Critical Point: The agent loop is code. Treat it with the same rigor you'd apply to any other piece of production software: explicit decisions, auditable state, tested boundaries. The model is a reasoning component inside that loop — a powerful one, but not a substitute for structural design.