Testing & Evaluation
Reimagine the testing pyramid for non-determinism: unit, chain, integration, and production evals.
Why Testing Agents Is Fundamentally Different
Imagine you've just spent a week writing a suite of unit tests for a new microservice. Every test passes. You deploy with confidence. Now imagine doing the same for an AI agent — and discovering that your entire mental model of "testing" is built on an assumption that simply doesn't hold anymore. That assumption? That the same input reliably produces the same output. Welcome to the world of agentic AI, where that bedrock principle crumbles, and where the cost of getting evaluation wrong isn't a broken build — it's a customer's data silently corrupted across a chain of automated steps.
This section is your orientation. Before we can rebuild our testing instincts, we need to understand why they fail — and why the failure is so much more expensive than it looks.
The Determinism Contract You've Always Relied On
Every testing framework you've ever used — pytest, Jest, JUnit, RSpec — was designed around a quiet contract: given a fixed input, a correct function produces a fixed output. That contract is so deeply embedded in how we think about software quality that most developers never articulate it. It's just... obvious.
```python
# Traditional deterministic test — this always passes or always fails
def add(a: int, b: int) -> int:
    return a + b

def test_add():
    assert add(2, 3) == 5  # Deterministic: same input, same output, forever
```
This works beautifully for functions, APIs, and even complex stateful systems — because those systems are ultimately executing deterministic logic. You control the inputs, you know the outputs, and a test is a permanently valid binary statement about the world.
Now consider the equivalent for an agent:
```python
# Agentic system — the "same" input yields different outputs across runs
from my_agent import ResearchAgent

agent = ResearchAgent(model="gpt-4o")
result_1 = agent.run("Summarize the risks of this contract.")
result_2 = agent.run("Summarize the risks of this contract.")  # Same prompt!

# result_1 and result_2 will differ in wording, structure, and possibly substance
# assert result_1 == result_2  ← This would FAIL, and that's expected
```
The agent isn't broken. It's working exactly as designed. But our testing contract — identical input yields identical output — is now invalid by design. The agent uses probabilistic inference at its core, and every invocation samples from a distribution of plausible responses. This isn't a bug to be fixed. It's an architectural reality to be evaluated differently.
💡 Mental Model: Think of a traditional function as a lock: the right key always opens it, the wrong key never does. An agent is more like a skilled human consultant: you can hire a brilliant one, ask them the same question twice, and get two genuinely good-but-different answers. You don't evaluate consultants by checking if their words match a transcript. You evaluate the quality of their reasoning.
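To make the consultant analogy concrete, here is a minimal sketch of what replaces the equality check: property-level assertions that hold across runs even when the wording does not. `toy_agent` is a stand-in that simulates run-to-run variation; a real eval would invoke the agent itself.

```python
import random

def toy_agent(prompt: str) -> str:
    """Stand-in for a real agent: same input, different phrasing each run."""
    templates = [
        "Key risks: an auto-renewal clause and an uncapped liability provision.",
        "The contract's main risks are uncapped liability and automatic renewal.",
    ]
    return random.choice(templates)

def check_properties(answer: str) -> bool:
    """Property-level checks that hold across runs, unlike string equality."""
    mentions_liability = "liability" in answer.lower()
    mentions_renewal = "renewal" in answer.lower()
    is_nonempty = len(answer.strip()) > 0
    return mentions_liability and mentions_renewal and is_nonempty

result_1 = toy_agent("Summarize the risks of this contract.")
result_2 = toy_agent("Summarize the risks of this contract.")

# Exact equality between runs may fail, but both satisfy the same properties
assert check_properties(result_1) and check_properties(result_2)
```

The shift is from "the output equals X" to "the output satisfies properties P1..Pn," which is the pattern every later layer of the pyramid builds on.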
Multiple Failure Surfaces, Each Independent
A traditional microservice has one primary failure surface: the code. If the logic is correct and the infrastructure is up, it works. An agentic system is composed of multiple independently fallible subsystems, each of which can degrade without the others:
┌─────────────────────────────────────────────────────────┐
│ AGENT EXECUTION FLOW │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌───────────┐ │
│ │ LLM Reasoning│───▶│ Tool Selection│───▶│Tool Exec │ │
│ │ │ │ │ │ │ │
│ │ ⚠ Hallucin. │ │ ⚠ Wrong tool │ │ ⚠ API fail│ │
│ │ ⚠ Bad plan │ │ ⚠ Bad params │ │ ⚠ Timeout │ │
│ └──────────────┘ └──────────────┘ └─────┬─────┘ │
│ │ │
│ ┌──────────────┐ ┌──────────────┐ │ │
│ │ Final Output │◀───│Memory Retriev│◀──────────┘ │
│ │ │ │ │ │
│ │ ⚠ Synthesis │ │ ⚠ Wrong docs │ │
│ │ errors │ │ ⚠ Stale data │ │
│ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────┘
Let's name these failure surfaces explicitly:
🧠 LLM Reasoning — The model might misunderstand the task, produce a logically flawed plan, or confidently hallucinate a fact. This failure mode is intrinsic to the language model itself and can't be unit-tested away.
📚 Tool Selection & Parameterization — Even with correct reasoning, an agent might call the right tool with subtly wrong arguments, or choose a suboptimal tool entirely. Did it call search_documents(query="Q2 revenue") when it needed search_documents(query="Q2 revenue", date_range="2024")? The difference might be invisible in a surface-level output check.
🔧 Tool Execution — External tools fail. APIs rate-limit, databases time out, file systems throw permissions errors. How the agent handles these failures is itself testable behavior.
📚 Memory & Retrieval — If the agent uses a vector store or session memory, the quality of retrieved context directly shapes the quality of reasoning. A retrieval system that surfaces the wrong document chunk will poison every downstream step.
🎯 Multi-Step Planning — In long-horizon tasks, early decisions constrain later ones. A planning error at step 2 might not manifest as a visible problem until step 8, making root-cause analysis genuinely hard.
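One cheap way to surface the parameterization failures described above is to assert on the recorded tool call itself rather than on the final output. This sketch assumes a hypothetical trace format (tool name plus an args dict) and an illustrative required-argument table:

```python
# Hypothetical recorded tool call taken from an agent's execution trace
tool_call = {
    "name": "search_documents",
    "args": {"query": "Q2 revenue"},  # note: date_range was omitted
}

# Illustrative contract: which arguments each tool must always receive
REQUIRED_ARGS = {"search_documents": {"query", "date_range"}}

def missing_args(call: dict) -> set:
    """Return the required arguments the agent omitted for this tool call."""
    required = REQUIRED_ARGS.get(call["name"], set())
    return required - set(call["args"])

# The omission is invisible in a surface-level output check, but explicit here
assert missing_args(tool_call) == {"date_range"}
```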
🤔 Did you know? In a 2024 analysis of production agent failures, researchers found that roughly 40% of incorrect final outputs were caused by intermediate tool-use errors — not by the LLM's core reasoning. Testing only the final answer would have missed nearly half the real failure modes.
The Compounding Cost of Wrong Actions
In a traditional application, a bug is typically local. A miscalculated tax rate affects one invoice. A null pointer exception crashes one request. The blast radius is bounded.
In an agentic system, error compounding is the default behavior. Consider an agent tasked with researching a topic, drafting a report, and emailing it to stakeholders:
Step 1: Search web for "Q3 competitor analysis"
└─ Returns slightly stale data (silent degradation)
Step 2: Synthesize findings into a structured report
└─ LLM reasons correctly from incorrect premises
(garbage in, garbage out — but confidently)
Step 3: Draft executive summary
└─ Summary accurately reflects the flawed synthesis
Step 4: Send email to 12 senior stakeholders
└─ 12 people now hold incorrect competitive intelligence
← COST IS NOW MAXIMUM
Notice what happened: no individual step failed in a detectable way. Each step performed correctly given its inputs. Yet the final output is wrong and costly. A test that only checks the email's formatting would pass with flying colors.
This is why evaluation at every layer — not just final output — is essential. If you only run assertions on the agent's final answer, you're flying blind through all the intermediate failure modes that led to it.
🎯 Key Principle: The cost of an undetected error in an agentic system scales with the number of subsequent steps that rely on it. Evaluate early, evaluate often, and evaluate at every architectural layer.
⚠️ Common Mistake: Treating agent evaluation like API integration testing. If you're asserting only on the final output JSON and nothing else, you have no signal about which component failed when your agent starts producing bad answers in production.
The Evaluation Triad: Correctness, Latency, and Cost
Traditional unit tests care about one thing: correctness. Is the output right? Everything else — performance, cost — belongs to other disciplines (load testing, cost accounting).
For agentic systems, these three dimensions are inextricably linked in every evaluation decision you make:
| 📊 Dimension | 🔍 What It Measures | ⚠️ Tradeoff |
|---|---|---|
| 🎯 Correctness | Did the agent accomplish the task with high quality? | More reasoning steps improve correctness but raise cost and latency |
| ⏱️ Latency | How long did the agent take to respond? | Parallelizing tool calls reduces latency but may increase token spend |
| 💰 Token Cost | How many tokens (and dollars) did this run consume? | Cheaper models reduce cost but may degrade correctness on hard tasks |
Here's a concrete example of why you can't ignore any leg of this triad:
```python
# Evaluation harness that tracks all three dimensions together
import time
from dataclasses import dataclass
from my_agent import ResearchAgent
from my_eval import score_answer_quality  # LLM-as-judge or rubric scorer

@dataclass
class EvalResult:
    task_id: str
    correctness_score: float  # 0.0 – 1.0, from rubric or judge
    latency_seconds: float    # wall-clock time for the full run
    total_tokens: int         # input + output tokens consumed
    passed: bool              # composite gate for CI/CD purposes

def evaluate_agent_run(task: dict, agent: ResearchAgent) -> EvalResult:
    """
    Runs a single eval task and captures the full triad of metrics.
    A 'correct' answer that took 90 seconds is a failure in a real-time UX.
    A fast answer that cost $2.40 per call fails the cost budget gate.
    """
    start = time.perf_counter()
    result = agent.run(
        task["prompt"],
        trace=True  # captures intermediate steps for chain-level inspection
    )
    elapsed = time.perf_counter() - start
    correctness = score_answer_quality(result.output, task["expected_behavior"])

    # Composite pass gate: must satisfy ALL three thresholds
    passed = (
        correctness >= 0.85 and       # quality threshold
        elapsed <= 30.0 and           # latency SLA
        result.total_tokens <= 8000   # per-call cost budget
    )
    return EvalResult(
        task_id=task["id"],
        correctness_score=correctness,
        latency_seconds=elapsed,
        total_tokens=result.total_tokens,
        passed=passed,
    )
```
This harness captures a critical insight: an agent can fail its evaluation even when the answer is technically correct, if it consumed ten times the expected tokens to get there, or if it took a minute to respond in a user-facing application. Pure unit testing has no mechanism for even asking these questions.
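In CI you would run such a harness over a whole task set and gate on the aggregate, since any individual run can fail by chance. A minimal sketch, restating the `EvalResult` shape so it stands alone and using an illustrative 90% pass-rate gate:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    task_id: str
    correctness_score: float
    latency_seconds: float
    total_tokens: int
    passed: bool

def suite_passes(results: list[EvalResult], min_pass_rate: float = 0.9) -> bool:
    """Gate CI on the fraction of tasks passing the composite triad check."""
    pass_rate = sum(r.passed for r in results) / len(results)
    return pass_rate >= min_pass_rate

# Illustrative per-task results; in practice these come from evaluate_agent_run
results = [
    EvalResult("t1", 0.92, 12.0, 5100, True),
    EvalResult("t2", 0.88, 18.5, 6400, True),
    EvalResult("t3", 0.61, 44.0, 9900, False),  # fails quality, latency, and cost
]
assert suite_passes(results) is False  # 2/3 pass rate falls below the 0.9 gate
```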
💡 Pro Tip: Define your evaluation triad thresholds before you start building, not after. Teams that define quality, latency, and cost budgets up front build dramatically better evaluation cultures than those who bolt metrics on post-hoc.
The Probabilistic Nature of "Correct"
Even if we solve the triad problem, we face a deeper epistemological challenge: what does correct even mean for an agent's output?
For a calculator function, correctness is binary and provable. For an agent tasked with "draft a persuasive case for expanding into Southeast Asia," correctness is:
- 🧠 Partially subjective — Two reasonable evaluators might disagree on quality
- 📚 Context-dependent — Correct for a startup looks different than correct for a Fortune 500 firm
- 🎯 Multidimensional — Accuracy, completeness, tone, reasoning chain, and citation quality are all independent axes
- 🔧 Distribution-shaped — Rather than pass/fail, we need to ask: "What fraction of runs fall above an acceptable quality threshold?"
❌ Wrong thinking: "I'll just manually review the output once and write a snapshot test."
✅ Correct thinking: "I'll define a rubric that captures the dimensions of quality that matter for this task, run the agent across a representative dataset of 50–100 inputs, and track the distribution of quality scores over time."
This shift — from single-point assertions to distributional evaluation — is one of the most important mental model upgrades you'll make in this entire lesson.
🧠 Mnemonic — DRIP: Think Distribution over Runs, not Individual Point checks. Agents need DRIP evaluation.
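The distributional question reduces to a single function: what fraction of runs clears the bar? A sketch with illustrative scores and an assumed 0.8 acceptance threshold:

```python
def pass_rate(scores: list[float], threshold: float = 0.8) -> float:
    """Fraction of runs whose quality score clears the acceptance threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

# Quality scores from, say, 10 runs of the same task (illustrative numbers)
scores = [0.91, 0.85, 0.78, 0.88, 0.95, 0.82, 0.69, 0.90, 0.87, 0.84]

assert pass_rate(scores, threshold=0.8) == 0.8  # 8 of 10 runs are acceptable
```

Tracking this fraction over time, rather than any single run's score, is what turns a point check into a distributional eval.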
A Preview of the Evaluation Landscape Ahead
Now that we've established why your existing mental models fall short, let's orient toward what replaces them. The rest of this lesson introduces a reimagined evaluation framework built for this new reality. Think of it as a testing pyramid redesigned from the ground up for non-deterministic, multi-step, cost-bearing systems.
╔══════════════════╗
║ PRODUCTION ║ ← Live signals, real users
║ EVALS ║ Latency, errors, satisfaction
╔═╩══════════════════╩═╗
║ INTEGRATION EVALS ║ ← Full agent end-to-end
║ ║ Real tools, real data
╔══╩══════════════════════╩══╗
║ CHAIN EVALS ║ ← Multi-step sequences
║ ║ Reasoning paths, tool chains
╔═══╩════════════════════════════╩═══╗
║ UNIT EVALS ║ ← Isolated components
║ (prompts, tools, retrievers) ║ Fast, cheap, frequent
╚════════════════════════════════════╝
Each layer in this pyramid has a different purpose, a different cost, and a different cadence:
📋 Quick Reference Card:
| 🏗️ Layer | 🎯 What It Tests | ⏱️ Cadence | 💰 Relative Cost |
|---|---|---|---|
| 🔧 Unit Evals | Single prompt, single tool, single retriever | Every commit | Low |
| 🔗 Chain Evals | Multi-step reasoning sequences | Daily / PR merge | Medium |
| 🌐 Integration Evals | Full agent with real tools and data | Pre-release | High |
| 📡 Production Evals | Live traffic, real user tasks | Continuous | Ongoing |
The key insight — which we'll develop throughout this lesson — is that no single layer is sufficient. Unit evals are fast and cheap but miss emergent failures. Production evals catch real failures but are expensive and slow to act on. The discipline of agentic evaluation is learning to instrument all four layers and reason across them coherently.
💡 Real-World Example: A team at a legal-tech company discovered that their contract-review agent had a 94% correctness score in unit evals on individual clause extraction. But in production, users were dissatisfied with 30% of full reviews. The gap? Unit evals didn't capture the agent's tendency to omit clauses when the document was longer than 15 pages — a failure mode that only emerged in the integration layer.
Why This Matters Right Now
Agentic AI is moving from research toy to production infrastructure faster than most engineering teams have updated their quality practices. The teams that will ship reliable agents aren't necessarily the ones with the best models — they're the ones who build the most rigorous, layered evaluation cultures.
The good news: the mental model shift is bounded. You don't need to throw away everything you know about software testing. You need to extend it. Deterministic assertions still matter — they just live at specific layers. Statistical thresholds and qualitative rubrics augment them at others. And production telemetry closes the loop in ways that no offline test suite ever could.
By the end of this lesson, you'll have a concrete vocabulary, a practical pyramid to organize your thinking, and working code patterns you can apply to real agent systems this week.
Let's start rebuilding from the ground up.
Rethinking the Testing Pyramid for Non-Determinism
Every software engineer has internalized the classic testing pyramid: write many fast unit tests at the base, fewer integration tests in the middle, and a small number of slow end-to-end tests at the top. This hierarchy made sense because deterministic code could be tested with binary pass/fail assertions—given the same input, the same output would always emerge. Agentic AI systems shatter that assumption. A language model responding to the same prompt twice will rarely produce identical text. A ReAct-style agent deciding which tool to call next is exercising probabilistic judgment, not executing a lookup table. The classic pyramid's foundation—the idea that correctness can be verified with a simple equality check—no longer holds.
The solution is not to abandon structured testing entirely but to rebuild the pyramid with layers that are explicitly designed for non-determinism. Each layer of the agentic eval pyramid still follows the same intuition as the classic version: lower layers are faster, cheaper, and more numerous, while higher layers are slower, costlier, and more selective. What changes is what each layer asserts on and how it tolerates or accounts for probabilistic variation.
╔══════════════════════════╗
║ PRODUCTION EVALS ║ ← Live traffic, sampling,
║ (Slowest / Costliest) ║ shadow mode, canaries
╠══════════════════════════╣
║ INTEGRATION EVALS ║ ← Full agent, realistic
║ ║ scenarios, e2e goals
╠══════════════════════════╣
║ CHAIN EVALS ║ ← Fixed sub-graphs,
║ ║ pipeline coherence
╠══════════════════════════╣
║ UNIT EVALS ║ ← Single prompt, single
║ (Fastest / Cheapest) ║ tool call, single lookup
╚══════════════════════════╝
Volume: ████████████████ ████████████ ██████ ███
Speed: Fast ──────────────────────────────────► Slow
Cost: Low ──────────────────────────────────► High
Let's walk through each layer carefully, because understanding the responsibilities of each one is the prerequisite for building any evaluation strategy that actually works.
Layer 1: Unit Evals — Isolating the Smallest Testable Pieces
In traditional software, a unit test targets a single function or method. In agentic systems, a unit eval targets the smallest independently testable component of the agent's reasoning: a single prompt, a single tool call, or a single memory lookup, each tested in complete isolation from everything else.
The goal of a unit eval is not to verify that the agent accomplishes a task—it is to verify that one specific building block behaves within acceptable bounds. Consider a prompt template that instructs an LLM to extract structured fields from a customer support message. A unit eval for that prompt might assert that when given a message mentioning a broken laptop screen, the output JSON always contains "category": "hardware", that the priority field is always one of three valid enum values, and that no personally identifiable information leaks into the extracted fields. Notice that these assertions are still specific and bounded, even though the underlying model is probabilistic: you are not asserting on exact wording, but on structural and semantic constraints that should hold across every reasonable response.
Unit evals for tool calls work similarly. If your agent has a search_knowledge_base tool, you can write a unit eval that feeds a well-defined query string directly into the tool's invocation logic and asserts that the returned documents have a relevance score above a threshold, that the response latency stays under 200ms, and that the function never raises an unhandled exception. No LLM is involved here—you are evaluating the tool in isolation.
💡 Real-World Example: A team building a coding assistant discovered that their run_tests tool occasionally returned exit codes without capturing stderr. This failure mode was invisible at higher layers because the agent would silently move on. A unit eval that asserted result.stderr is not None caught it immediately and took milliseconds to run.
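A tool-level unit eval needs no LLM at all. The sketch below substitutes a stub for a real search_knowledge_base tool; the assertions (non-empty results, a relevance floor, a 200ms latency budget) mirror the contract described above, with the document IDs and scores being illustrative:

```python
import time

def search_knowledge_base(query: str) -> list[dict]:
    """Stub for the real tool; assume it returns relevance-scored documents."""
    return [{"doc_id": "kb-17", "score": 0.83}, {"doc_id": "kb-04", "score": 0.71}]

def test_search_tool_contract():
    start = time.perf_counter()
    docs = search_knowledge_base("reset two-factor authentication")
    elapsed = time.perf_counter() - start

    assert docs, "Tool returned no documents"
    assert max(d["score"] for d in docs) >= 0.7, "Top relevance below threshold"
    assert elapsed < 0.2, "Tool exceeded the 200ms latency budget"

test_search_tool_contract()
```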
Unit evals should be fast enough to run in CI on every commit. If a unit eval requires a live LLM call, cap it at a small, reproducible sample using a fixed random seed where your framework supports it, and use a cheaper model tier. The investment pays off: unit evals give you a tight feedback loop that lets developers catch regressions before they ever touch the rest of the agent.
```python
import pytest
from your_agent.prompts import extract_support_fields
from your_agent.llm import call_llm

# A unit eval for a single prompt component.
# We test structure and semantic constraints, not exact wording.
VALID_CATEGORIES = {"hardware", "software", "billing", "account"}
VALID_PRIORITIES = {"low", "medium", "high"}

@pytest.mark.parametrize("message,expected_category", [
    ("My laptop screen is cracked and won't turn on.", "hardware"),
    ("I was charged twice for my subscription this month.", "billing"),
    ("The app crashes every time I open the dashboard.", "software"),
])
def test_extract_support_fields_category(message, expected_category):
    prompt = extract_support_fields(message)  # Renders the prompt template
    response = call_llm(prompt, model="gpt-4o-mini", seed=42)  # Fixed seed for reproducibility

    # Assert structural correctness
    assert "category" in response, "Response must include a 'category' field"
    assert "priority" in response, "Response must include a 'priority' field"

    # Assert value constraints (not exact text)
    assert response["category"] in VALID_CATEGORIES, (
        f"Category '{response['category']}' not in allowed set"
    )
    assert response["priority"] in VALID_PRIORITIES, (
        f"Priority '{response['priority']}' not in allowed set"
    )

    # Assert semantic correctness for the known category
    assert response["category"] == expected_category, (
        f"Expected category '{expected_category}', got '{response['category']}'"
    )
```
This code runs three parametrized test cases against a single prompt component. It does not test any downstream agent behavior—only the prompt's ability to produce structurally valid, semantically appropriate output. The seed=42 parameter is a best-effort reproducibility aid; it won't eliminate all variance, but it reduces it during development.
Layer 2: Chain Evals — Validating Sequences and Sub-Graphs
Once individual components pass their unit evals, the next challenge is ensuring that they compose correctly. A chain eval validates a fixed sequence of agent steps—what you might think of as a sub-graph or a pipeline segment—for coherence, correct tool selection, and intermediate state correctness.
The distinction from unit evals is important: a chain eval is testing the interaction between components, not any single component in isolation. Consider a document Q&A workflow where the agent must (1) rewrite the user's vague question into a precise search query, (2) call the retrieval tool, (3) rank the retrieved passages, and (4) synthesize a grounded answer. A chain eval would run this four-step sequence end-to-end with a controlled input, then assert on the intermediate states: Was the rewritten query specific enough? Did the retrieval tool get called at all—and exactly once? Did the ranking step not discard all results? Is the final answer grounded in the retrieved passages rather than hallucinated?
🎯 Key Principle: Chain evals are your primary defense against emergent composition failures—situations where every component looks fine individually but the handoff between them produces nonsense. These failures are nearly impossible to catch at the unit level and too fine-grained to diagnose at the integration level.
Chain evals tend to be more expensive than unit evals because they involve multiple LLM calls. Keep them tractable by fixing the execution graph—do not test the agent's routing logic at this layer. If your agent can choose between three different sub-graphs, write a separate chain eval for each one. This lets you isolate failures precisely.
⚠️ Common Mistake: Teams sometimes try to use chain evals to test branching logic. As soon as you introduce conditional paths, you have an integration eval in disguise—and you lose the diagnostic precision that makes chain evals valuable. Keep chains linear and deterministic in structure; reserve branching for the integration layer.
```python
from your_agent.chains import document_qa_chain
from your_agent.eval_utils import assert_grounded, assert_query_specificity

def test_document_qa_chain_intermediate_states():
    """
    Chain eval: validates the 4-step doc QA pipeline.
    We assert on intermediate state at each step, not just final output.
    """
    user_question = "What does it say about returns?"

    # Run the chain and capture intermediate states via a tracer
    trace = document_qa_chain.run_with_trace(
        input=user_question,
        tracer="in_memory"  # Captures each step's input/output
    )

    # Step 1: Query rewriting produced a more specific query
    rewritten_query = trace.steps["query_rewriter"].output
    assert len(rewritten_query.split()) >= 5, "Rewritten query is suspiciously short"
    assert_query_specificity(rewritten_query, min_score=0.6)  # Custom semantic scorer

    # Step 2: Retrieval tool was called exactly once
    retrieval_calls = trace.tool_calls("search_documents")
    assert len(retrieval_calls) == 1, (
        f"Expected 1 retrieval call, got {len(retrieval_calls)}"
    )

    # Step 3: Ranking step preserved at least 2 passages
    ranked_passages = trace.steps["passage_ranker"].output
    assert len(ranked_passages) >= 2, "Ranker discarded too many passages"

    # Step 4: Final answer is grounded in retrieved passages
    final_answer = trace.steps["answer_synthesizer"].output
    assert_grounded(answer=final_answer, sources=ranked_passages, threshold=0.7)
```
Notice the use of a tracer that captures each step's output. This is the key instrumentation primitive for chain evals. Most agent frameworks (LangGraph, LlamaIndex, CrewAI, and others) expose hooks for capturing intermediate state—invest time in setting these up early, because without them, chain evals reduce to black-box integration tests.
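If your framework doesn't expose such hooks, the tracing primitive is small enough to build yourself. A minimal in-memory sketch, with illustrative step names and outputs:

```python
class InMemoryTracer:
    """Records each step's output so chain evals can assert on intermediate state."""
    def __init__(self):
        self.steps = {}

    def record(self, step_name: str, output):
        self.steps[step_name] = output

# A chain wired to the tracer would call record() as each step completes
tracer = InMemoryTracer()
tracer.record("query_rewriter", "return policy window for opened electronics")
tracer.record("passage_ranker", ["passage_a", "passage_b"])

# The same style of intermediate-state assertions as the chain eval above
assert len(tracer.steps["query_rewriter"].split()) >= 5
assert len(tracer.steps["passage_ranker"]) >= 2
```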
Layer 3: Integration Evals — End-to-End Goals in Controlled Scenarios
With unit and chain evals covering individual components and fixed pipelines, integration evals zoom out to the full agent operating against realistic but controlled scenarios. The question being asked shifts from "did each step behave correctly?" to "did the agent accomplish the goal, and did it do so safely?"
Integration evals are the agentic equivalent of end-to-end tests, but with two critical modifications. First, because the agent's path to the goal may vary across runs, assertions must be goal-oriented rather than path-oriented. You do not assert that the agent called tools in a specific order; you assert that the final state of the world matches the intended outcome. Second, because agentic systems have side effects—sending emails, modifying databases, calling external APIs—integration evals must explicitly validate side-effect safety. Did the agent attempt to take any irreversible actions it was not authorized to take?
💡 Mental Model: Think of integration evals like hiring an actor to play a customer in a controlled store environment. You are not scripting every word of the interaction; you are checking that by the end, the customer left with the right product, no unauthorized charges appeared, and the store shelves were left in good condition.
Controlling the environment is the core challenge at this layer. Effective integration evals rely on sandboxed tool environments: mock email servers that record sends without delivering them, in-memory database replicas seeded with known data, and stubbed external APIs that return realistic but fixed responses. Without this control, integration evals become flaky—failures could reflect real-world API changes rather than agent regressions.
Integration eval scenarios should be drawn from a golden dataset: a curated collection of realistic tasks with known success criteria. These scenarios should cover not just happy paths but adversarial cases—ambiguous instructions, missing context, deliberately malformed tool responses, and tasks that should be refused. A healthy integration eval suite typically contains 20–100 scenarios, each representing a meaningfully distinct test case.
💡 Pro Tip: Tag your integration eval scenarios with difficulty and category labels (e.g., difficulty=hard, category=multi-step-planning). When a new agent version regresses, these tags tell you immediately which kind of reasoning broke, which dramatically accelerates debugging.
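A sketch of the sandboxing idea: a mock email server that records sends without delivering anything, paired with goal-oriented and side-effect assertions. `run_agent_scenario` is a stand-in for invoking the real agent against the sandboxed tools, and the recipient and subject are illustrative:

```python
class MockEmailServer:
    """Sandboxed email tool: records sends instead of delivering them."""
    def __init__(self):
        self.sent = []

    def send(self, to: str, subject: str, body: str):
        self.sent.append({"to": to, "subject": subject, "body": body})

def run_agent_scenario(email_server) -> dict:
    """Stand-in for a full agent run; a real harness would invoke the agent here."""
    email_server.send(to="cfo@example.com", subject="Q3 report", body="...")
    return {"goal_reached": True}

mailbox = MockEmailServer()
outcome = run_agent_scenario(mailbox)

# Goal-oriented assertion: the final state of the world, not the path taken
assert outcome["goal_reached"]

# Side-effect safety: exactly one send, and only to an authorized domain
assert len(mailbox.sent) == 1
assert mailbox.sent[0]["to"].endswith("@example.com")
```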
Layer 4: Production Evals — Catching What You Cannot Anticipate
Even a thorough integration eval suite cannot cover every situation that real users will produce. Production evals operate on live traffic, using techniques from statistics and deployment engineering to catch distribution shift and edge cases that no one thought to include in a golden dataset.
Three core mechanisms power production evals. Sampling means continuously selecting a fraction of live agent runs—say, 2–5%—and routing them through an evaluation pipeline that scores quality using automated metrics or LLM-as-judge patterns. Sampling lets you monitor aggregate quality trends without evaluating every single interaction. Shadow mode means running a new agent version in parallel with the current production version, feeding both the same inputs, and comparing outputs before the new version takes over any real traffic. Shadow mode is ideal for catching regressions before deployment without exposing users to risk. Canary deployments gradually shift a small percentage of real traffic to a new agent version, watching for quality degradation signals before completing the rollout.
Production Eval Techniques
SAMPLING SHADOW MODE CANARY
────────── ─────────── ──────
Live traffic Live traffic Live traffic
│ │ │
├─ 95% → Normal ┌───┴───┐ ┌────┴────┐
│ response │ │ │ │
└── 5% → Eval Agent v1 Agent v2 90% v1 10% v2
pipeline (serves) (shadows) (serves) (serves)
│ │ │ │ │
Score & Compare outputs Monitor
monitor before shipping metrics
The most insidious failure that production evals catch is distribution shift: the gradual drift between the inputs your agent was built and tested against and the inputs real users actually send. A customer service agent evaluated against a dataset of 2023 support tickets may quietly degrade when new product features generate entirely new categories of user questions. No amount of offline eval coverage prevents this—only continuous monitoring of live traffic reveals it.
⚠️ Common Mistake: Teams often treat production evals as an afterthought, setting them up only after something goes wrong. By then, the agent may have served degraded responses to thousands of users. Production evals should be part of the launch criteria for any agentic system, not a retrospective fix.
🤔 Did you know? Some teams run a "chaos eval" layer alongside production evals, deliberately injecting malformed inputs, tool failures, and adversarial prompts into a small percentage of sampled traffic to continuously stress-test the agent's error-handling and refusal behaviors in production conditions.
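The sampling mechanism itself is simple; the scoring pipeline behind it is where the real work lives. A sketch that routes roughly 5% of live runs into an eval queue, with the agent call stubbed out:

```python
import random

def should_sample(sample_rate: float = 0.05) -> bool:
    """Decide per-request whether this run enters the offline scoring pipeline."""
    return random.random() < sample_rate

def handle_request(prompt: str, eval_queue: list) -> str:
    response = f"response to: {prompt}"  # stand-in for the real agent call
    if should_sample():
        # In production this would enqueue to a message bus, not a list
        eval_queue.append({"prompt": prompt, "response": response})
    return response

queue = []
for i in range(1000):
    handle_request(f"user question {i}", queue)

# Over 1,000 requests, roughly 50 land in the eval queue for scoring
assert 0 < len(queue) < 200
```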
Why All Four Layers Are Non-Negotiable
A natural temptation is to pick one or two layers and invest heavily in them while skipping the others. This always leads to blind spots, because each layer catches a distinct class of failure that is invisible to every other layer.
❌ Wrong thinking: "Our integration evals are comprehensive, so we don't need unit evals."
✅ Correct thinking: "Integration evals tell us the agent failed; unit evals tell us why it failed."
❌ Wrong thinking: "We have excellent test coverage offline, so production evals are redundant."
✅ Correct thinking: "Offline evals test the distribution you imagined; production evals test the distribution that actually exists."
The relationship between layers is both hierarchical and complementary. A failure caught at the unit layer costs almost nothing to fix. The same failure caught at the production layer—after it has been compounding across thousands of interactions—can cost days of investigation and significant user trust. Conversely, a distribution shift in user intent can only be detected at the production layer; no offline eval suite can anticipate inputs that haven't been seen yet.
🧠 Mnemonic: Think of the four layers as UCIP — "Unit, Chain, Integration, Production" — or remember the phrase "Under Careful Investigation, Problems surface" to recall that each layer surfaces a different class of problem.
📋 Quick Reference Card:
| # | 🎯 Layer | 🔧 What It Tests | ⚡ Speed | 💰 Cost | 🔍 Failure Class Caught |
|---|---|---|---|---|---|
| 1 | 🧩 Unit Evals | Single prompt, tool, or memory lookup | Very Fast | Very Low | Component-level bugs, schema violations |
| 2 | 🔗 Chain Evals | Fixed multi-step sub-graph | Fast | Low-Medium | Composition failures, bad tool selection |
| 3 | 🌐 Integration Evals | Full agent on controlled scenarios | Slow | Medium-High | Goal failures, unsafe side effects |
| 4 | 🚀 Production Evals | Live traffic via sampling/shadow/canary | Continuous | Variable | Distribution shift, unseen edge cases |
The reimagined pyramid is not simply the old pyramid with new labels. It represents a fundamentally different philosophy: that correctness in agentic systems is probabilistic and contextual, and that no single evaluation layer can capture the full picture. Each layer asks a question that the others cannot answer. Together, they form a safety net robust enough to catch the full spectrum of failures that agentic systems produce—from the trivial to the catastrophic. The sections that follow will show you exactly how to build the assertions, metrics, and harnesses that make each layer work in practice.
Core Evaluation Primitives: Assertions, Metrics, and Rubrics
Before you can evaluate an agent at any layer of the testing pyramid, you need a vocabulary for expressing what correct even means. In traditional software, correctness is binary: a function either returns 42 or it doesn't. In agentic systems, correctness is a spectrum—a response can be partially right, right in spirit but wrong in format, or right today and subtly wrong tomorrow as the underlying model changes. This section equips you with the three fundamental building blocks for expressing correctness across that spectrum: deterministic assertions, statistical metrics, and rubric-based scoring. Mastering all three—and knowing when to reach for each—is the foundation of a mature agent eval practice.
The Three Primitives at a Glance
Think of these three primitives as tools in a toolbelt, not competing philosophies. Each one occupies a different position in the trade-off between precision, flexibility, and human interpretability.
HIGH PRECISION / LOW HUMAN INTERPRETABILITY
        |
        |   Deterministic Assertions
        |   (JSON schema, regex, call count)
        |
        |   Statistical Metrics
        |   (BLEU, ROUGE, cosine similarity)
        |
        |   Rubric-Based Scoring
        |   (relevance, groundedness, tone)
        |
        v
LOW PRECISION / HIGH FLEXIBILITY / HIGH HUMAN INTERPRETABILITY
Deterministic assertions sit at the top: maximally precise, minimally interpretable. Rubrics sit at the bottom: maximally interpretable, but inherently softer. Statistical metrics occupy the middle ground. A well-designed eval harness uses all three in combination, letting each one do the work it's best suited for.
Deterministic Assertions
Deterministic assertions are the closest analog to classical unit tests. They evaluate a single, measurable property of an agent's output and produce a hard pass or fail with no ambiguity. They work best whenever the agent's output—or some sub-component of it—is structured: a JSON blob, a function call with typed arguments, a date string, a URL.
The three most common forms are:
🔧 JSON schema validation — Confirm that a tool-call payload or agent response conforms to a declared schema. If your agent is supposed to emit a search_query tool call with a query: string and an optional max_results: integer, you can codify that contract in a JSON Schema and validate every observed call against it automatically.
🔧 Tool-call argument checking — Go beyond schema conformance and assert on the values of arguments. If the user asked for results in French, assert that language: "fr" appears in the tool call. This is a behavioral assertion, not just a structural one.
🔧 Regex matching — For constrained free-text fields—phone numbers, ISO dates, email addresses, citation markers—a regex is often the most expressive and maintainable assertion.
🎯 Key Principle: Use deterministic assertions wherever the correct answer is unambiguous. If a human reviewer would never disagree about whether the output passes, a deterministic assertion is the right tool.
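The JSON-schema form of this idea can be sketched with the widely used jsonschema package. The schema below encodes the search_query contract described above; the helper name assert_valid_search_call is illustrative.

```python
## schema_assertions.py — sketch using the jsonschema package; the
## search_query contract mirrors the example described in the text.
import jsonschema

SEARCH_QUERY_SCHEMA = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "max_results": {"type": "integer"},
    },
    "required": ["query"],
    "additionalProperties": False,
}

def assert_valid_search_call(payload: dict) -> None:
    """Raise jsonschema.ValidationError if the tool-call payload is malformed."""
    jsonschema.validate(instance=payload, schema=SEARCH_QUERY_SCHEMA)
```

A payload like {"query": "latest AI news", "max_results": 5} passes; one missing query, or carrying an unknown field, fails with an error that points at the offending key.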
Code Example: Intercepting Tool Calls with pytest
The following example shows how to write a pytest fixture that intercepts an agent's tool calls, then asserts on call count, argument values, and return-value handling. The pattern uses a simple CallRecorder wrapper that captures calls without changing the tool's behavior.
## test_tool_calls.py
import pytest
from unittest.mock import MagicMock, patch
from my_agent import ResearchAgent # your agent class
class CallRecorder:
"""Wraps a callable and records every invocation for later assertion."""
def __init__(self, fn):
self._fn = fn
self.calls = [] # list of {"args": ..., "kwargs": ..., "result": ...}
def __call__(self, *args, **kwargs):
result = self._fn(*args, **kwargs)
self.calls.append({"args": args, "kwargs": kwargs, "result": result})
return result
@property
def call_count(self):
return len(self.calls)
@pytest.fixture
def agent_with_recorded_tools():
"""Spin up the ResearchAgent with its web_search tool wrapped in a recorder."""
agent = ResearchAgent()
recorder = CallRecorder(agent.tools["web_search"])
agent.tools["web_search"] = recorder # swap in the recorder
return agent, recorder
def test_single_search_call_with_correct_arguments(agent_with_recorded_tools):
"""
Given a user query in French, the agent should call web_search exactly once
with the language parameter set to 'fr'.
"""
agent, recorder = agent_with_recorded_tools
# Run the agent on a constrained prompt
agent.run("Recherche les dernières actualités sur l'IA en France.")
# --- Assertion 1: call count ---
assert recorder.call_count == 1, (
f"Expected exactly 1 web_search call, got {recorder.call_count}"
)
# --- Assertion 2: argument values ---
call = recorder.calls[0]
assert call["kwargs"].get("language") == "fr", (
"Agent should pass language='fr' when user query is in French"
)
    query = call["kwargs"].get("query", "")
    assert "IA" in query or "intelligence artificielle" in query.lower(), (
        "Search query should reflect the topic of the user message"
    )
# --- Assertion 3: the agent did not crash on the tool's return value ---
# If the agent surfaces the result to the user, the final response should
# be non-empty and not contain a raw exception traceback.
response = agent.last_response
assert response, "Agent produced an empty response after tool call"
assert "Traceback" not in response, "Agent leaked an exception in its response"
Notice that this test checks three distinct things with one fixture: how many times the tool was called (call-count assertion), what arguments were passed (behavioral assertion), and how the return value was handled (error-handling assertion). Each of these can fail independently, giving you fine-grained signal when a regression occurs.
⚠️ Common Mistake: Asserting only on the final text response when a tool-call argument is wrong. If the agent passes language: "en" instead of "fr" but the search engine happens to return French results anyway, a response-level check will pass while the underlying bug goes undetected. Always assert as close to the source of the behavior as possible.
Statistical Metrics
Deterministic assertions break down the moment the correct output has legitimate variation. If you ask an agent to summarize a research paper, there is no single correct summary—there are many good ones and many bad ones. A regex cannot tell them apart. This is where statistical metrics enter.
Statistical metrics produce a score rather than a binary pass/fail. They compare the agent's output to one or more reference outputs and quantify how close the match is. The key ones to know:
📚 BLEU (Bilingual Evaluation Understudy) — Originally designed for machine translation, BLEU measures n-gram overlap between a candidate string and one or more reference strings. A BLEU score of 1.0 means a perfect match; 0.0 means no shared n-grams. BLEU is fast and deterministic, but it is notoriously poor at rewarding paraphrase—two semantically identical sentences with different wording can score near zero.
📚 ROUGE (Recall-Oriented Understudy for Gisting Evaluation) — A family of metrics (ROUGE-1, ROUGE-2, ROUGE-L) that emphasize recall of reference content. ROUGE-L measures the longest common subsequence, making it more tolerant of word reordering than BLEU. It is widely used for summarization evaluation.
📚 Semantic similarity — Embedding-based metrics, such as cosine similarity between sentence embeddings (e.g., from sentence-transformers), capture meaning rather than surface form. Two sentences that say the same thing in different words will score high. This is the most flexible and the most expensive of the three.
🎯 Key Principle: Statistical metrics are soft signals, not verdicts. Use them to rank outputs or flag regressions across a dataset, not to declare a single response correct or incorrect.
💡 Mental Model: Think of BLEU/ROUGE as spell-checkers for content coverage—useful but limited. Think of semantic similarity as a meaning-checker. Use both, and treat their scores as evidence, not truth.
## eval_metrics.py
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util
## Initialize once and reuse across eval runs
_rouge = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
_embedder = SentenceTransformer("all-MiniLM-L6-v2")
def compute_rouge(candidate: str, reference: str) -> dict:
"""Return a dict of ROUGE scores for a candidate against a single reference."""
scores = _rouge.score(reference, candidate)
return {
"rouge1_f": round(scores["rouge1"].fmeasure, 4),
"rouge2_f": round(scores["rouge2"].fmeasure, 4),
"rougeL_f": round(scores["rougeL"].fmeasure, 4),
}
def compute_semantic_similarity(candidate: str, reference: str) -> float:
"""Return cosine similarity in [0, 1] between candidate and reference embeddings."""
embs = _embedder.encode([candidate, reference], convert_to_tensor=True)
score = util.cos_sim(embs[0], embs[1]).item()
return round(score, 4)
## Example usage in an eval loop
if __name__ == "__main__":
candidate = "The Eiffel Tower is located in Paris, France."
reference = "Paris, France is home to the iconic Eiffel Tower."
rouge_scores = compute_rouge(candidate, reference)
sim_score = compute_semantic_similarity(candidate, reference)
print("ROUGE scores:", rouge_scores)
# ROUGE scores: {'rouge1_f': 0.5714, 'rouge2_f': 0.2667, 'rougeL_f': 0.4286}
# Note: ROUGE is low despite near-identical meaning!
print("Semantic similarity:", sim_score)
# Semantic similarity: 0.9312
# Semantic similarity correctly captures the meaning overlap.
The comment in the output is the critical lesson: ROUGE scores this pair poorly because the n-gram overlap is low, yet any human would agree the two sentences convey the same fact. Semantic similarity gets it right. This is why you should never rely on a single metric—use a panel.
🤔 Did you know? BERTScore, a metric based on contextual embeddings from BERT, has been shown to correlate better with human judgments than BLEU or ROUGE on most NLG tasks. It is increasingly the default choice for teams that want a single statistical metric that generalizes well.
Rubric-Based Scoring
Some qualities of an agent's output simply cannot be captured by string-matching or embedding similarity. Is the response grounded in the retrieved documents, or is the agent hallucinating? Is the tone appropriate for a customer-facing context? Is the reasoning transparent and logically coherent? These are the questions that rubric-based scoring is designed to answer.
A rubric is a set of human-interpretable criteria, each with a defined scale, that judges apply to an output. The judge can be a human annotator, or it can be a language model acting as an automated evaluator (often called an LLM-as-Judge pattern, which gets its own deep-dive in a later lesson).
Common rubric dimensions for agent evaluation:
| Dimension | What it measures | Example scale |
|---|---|---|
| 🎯 Relevance | Does the response address the user's actual intent? | 1–5 |
| 📚 Groundedness | Are claims supported by retrieved context? | 1–5 or binary |
| 🔒 Safety | Does the response avoid harmful or policy-violating content? | Pass/Fail |
| 🧠 Coherence | Is the reasoning logical and internally consistent? | 1–3 |
| 🔧 Completeness | Does the response cover all required sub-tasks? | 0–1 per sub-task |
Rubrics shine when you need explanation alongside a score. A rubric score of 2/5 on groundedness is only useful if you know why—which claim was unsupported, which document was ignored. This explanatory trace is what makes rubrics actionable during debugging.
⚠️ Common Mistake: Writing rubric criteria that are too vague to apply consistently. "Is the response good?" is not a rubric dimension. "Does every factual claim in the response appear verbatim or as a clear paraphrase in at least one retrieved document?" is a rubric dimension. Precision in criteria definition is what separates useful rubrics from expensive noise.
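To make that contrast concrete, here is a sketch of a precisely specified rubric dimension rendered into a judge prompt. The criteria wording, names, and JSON reply format are illustrative assumptions, and the actual judge call (human or LLM) is left out.

```python
## rubric_defs.py — illustrative sketch; criteria text and the JSON
## reply format are assumptions, not a prescribed standard.
GROUNDEDNESS_RUBRIC = {
    "dimension": "groundedness",
    "scale": (1, 5),
    "criteria": {
        5: "Every factual claim appears verbatim or as a clear paraphrase in at least one retrieved document.",
        3: "Most claims are supported; at most one minor claim lacks support in the context.",
        1: "Multiple claims have no support anywhere in the retrieved context.",
    },
}

def build_judge_prompt(response: str, context: str, rubric: dict) -> str:
    """Render a rubric into a prompt a human or LLM judge can apply."""
    lo, hi = rubric["scale"]
    lines = [f"Score the RESPONSE on '{rubric['dimension']}' from {lo} to {hi}."]
    for score in sorted(rubric["criteria"], reverse=True):
        lines.append(f"  {score}: {rubric['criteria'][score]}")
    lines += ["", "CONTEXT:", context, "", "RESPONSE:", response, "",
              'Reply with JSON: {"score": <int>, "note": "<one-sentence justification>"}']
    return "\n".join(lines)
```

Because each scale point has an explicit criterion, two judges applying this prompt share a definition of what a 3 versus a 5 means, and the note field supplies the explanatory trace that makes the score actionable.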
Combining All Three Primitives into a Single Eval Harness
The real power emerges when you stop treating these three primitives as alternatives and start treating them as layers in a single structured result object. Every test case—whether at the unit, chain, integration, or production level of the pyramid—should emit a consistent result that captures all available signal.
Here is a design for an EvalResult dataclass and a harness that populates it:
## eval_harness.py
from dataclasses import dataclass, field
from typing import Optional
import uuid
@dataclass
class EvalResult:
"""
A structured evaluation result that combines all three primitive types.
Every test case, regardless of layer, should produce one of these.
"""
trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
test_case_id: str = "" # e.g., "tc_042"
layer: str = "" # "unit" | "chain" | "integration" | "production"
# Deterministic assertion results
assertions: dict = field(default_factory=dict) # {"schema_valid": True, ...}
assertion_pass: Optional[bool] = None # True if ALL assertions pass
# Statistical metric scores
metrics: dict = field(default_factory=dict) # {"rouge1_f": 0.72, "sem_sim": 0.91}
# Rubric scores
rubric_scores: dict = field(default_factory=dict) # {"relevance": 4, "groundedness": 3}
rubric_notes: dict = field(default_factory=dict) # {"groundedness": "Claim X unsupported"}
# Rolled-up pass/fail and composite score
passed: Optional[bool] = None
composite_score: Optional[float] = None
def finalize(self, metric_thresholds: dict = None, rubric_thresholds: dict = None):
"""
Compute rolled-up pass/fail and a [0, 1] composite score.
A result passes if:
1. All deterministic assertions pass (hard gate), AND
2. All metric scores meet their thresholds (configurable), AND
3. All rubric scores meet their thresholds (configurable)
"""
metric_thresholds = metric_thresholds or {}
rubric_thresholds = rubric_thresholds or {}
# Hard gate: assertions must all pass
if self.assertion_pass is False:
self.passed = False
self.composite_score = 0.0
return
metric_pass = all(
self.metrics.get(k, 0) >= v
for k, v in metric_thresholds.items()
)
rubric_pass = all(
self.rubric_scores.get(k, 0) >= v
for k, v in rubric_thresholds.items()
)
self.passed = metric_pass and rubric_pass
# Composite score: normalize each rubric dimension to [0, 1]
# and average with metric scores
all_scores = list(self.metrics.values())
for key, score in self.rubric_scores.items():
# Assume rubrics are on a 1–5 scale; normalize
all_scores.append((score - 1) / 4)
self.composite_score = round(sum(all_scores) / len(all_scores), 4) if all_scores else None
## --- Example: running a single test case through the full harness ---
def run_eval_case(agent, test_case: dict) -> EvalResult:
result = EvalResult(
test_case_id=test_case["id"],
layer=test_case["layer"],
)
# 1. Run the agent
response = agent.run(test_case["input"])
# 2. Deterministic assertions
schema_valid = validate_json_schema(response, test_case.get("output_schema"))
call_count_ok = agent.tool_call_count == test_case.get("expected_call_count", 1)
result.assertions = {"schema_valid": schema_valid, "call_count": call_count_ok}
result.assertion_pass = all(result.assertions.values())
# 3. Statistical metrics
if "reference_output" in test_case:
result.metrics["rouge1_f"] = compute_rouge(response, test_case["reference_output"])["rouge1_f"]
result.metrics["sem_sim"] = compute_semantic_similarity(response, test_case["reference_output"])
# 4. Rubric scoring (automated judge — simplified stub here)
if "rubric" in test_case:
result.rubric_scores, result.rubric_notes = run_rubric_judge(
response=response,
context=test_case.get("retrieved_context", ""),
rubric=test_case["rubric"],
)
# 5. Roll up
result.finalize(
metric_thresholds={"sem_sim": 0.80},
rubric_thresholds={"groundedness": 3, "relevance": 3},
)
return result
The EvalResult object is the heart of this design. It acts as a structured contract between the eval harness and everything downstream: dashboards, regression trackers, CI gates, and human reviewers. Every result carries a trace_id so you can link it back to the full agent trace in your observability system. Every result carries passed as a hard boolean so CI can gate on it. And every result carries composite_score so you can track trends across model versions.
💡 Pro Tip: Store EvalResult objects as JSON lines (JSONL) in a versioned file alongside your test dataset. This gives you a cheap, diff-friendly eval history that requires no database infrastructure to get started. You can always migrate to a purpose-built eval platform later once your harness is stable.
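A minimal sketch of that storage pattern, assuming results are dataclass instances like the EvalResult above (plain dicts also work):

```python
## results_store.py — sketch; accepts any dataclass instance or dict.
import dataclasses
import json

def append_results_jsonl(results, path: str) -> None:
    """Append one JSON line per result, keys sorted for stable diffs."""
    with open(path, "a") as f:
        for r in results:
            row = dataclasses.asdict(r) if dataclasses.is_dataclass(r) else dict(r)
            f.write(json.dumps(row, sort_keys=True) + "\n")
```

Sorting keys matters more than it looks: it keeps semantically identical results byte-identical across runs, so Git diffs show only real changes.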
Choosing the Right Primitive for the Right Layer
Now that you understand all three primitives, the question becomes: which one belongs at which layer of the testing pyramid? The answer is not exclusive—most layers benefit from a mix—but the primary primitive shifts as you move up:
PRODUCTION EVALS
Primary: Rubric (human or LLM judge on sampled live traffic)
Secondary: Statistical metrics for trend tracking
─────────────────────────────────────────────────────
INTEGRATION EVALS
Primary: Statistical metrics + soft rubrics
Secondary: Deterministic assertions on structured sub-outputs
─────────────────────────────────────────────────────
CHAIN EVALS
Primary: Deterministic assertions on tool calls + metrics on
intermediate outputs
─────────────────────────────────────────────────────
UNIT EVALS
Primary: Deterministic assertions
(individual tools, parsers, schema validators)
This layering principle keeps your fast, cheap tests (deterministic assertions) at the bottom where they run in every CI build, and your expensive, slow tests (LLM-as-Judge rubrics) at the top where they run on a schedule or against sampled production traffic.
🧠 Mnemonic: "Assert at the bottom, score at the top, measure in between." Assertions give you hard edges, scores give you direction, and metrics tell you how far you've traveled.
Putting It All Together
You now have the full toolkit. Deterministic assertions give you hard, fast, unambiguous correctness checks for structured outputs. Statistical metrics give you soft, continuous signals when valid outputs vary. Rubric-based scoring gives you human-interpretable judgments for qualities that resist quantification. And the EvalResult harness gives you a single structured envelope that carries all three, linked to an observability trace, with a clear pass/fail signal for CI and a numeric score for trend tracking.
The next section will build directly on this foundation, walking through the practical patterns for assembling these primitives into a full, repeatable, version-controlled eval suite—covering dataset management, fixture design, and how to run evals both locally and in CI without coupling them to a live LLM endpoint.
📋 Quick Reference Card:
| 🎯 Primitive | 📚 Best For | 🔧 Output | ⚠️ Limitation |
|---|---|---|---|
| 🔒 Deterministic Assertion | Structured outputs, tool calls, schema | Pass/Fail | Cannot handle valid variation |
| 📊 Statistical Metric | Summarization, translation, QA | Float [0,1] | Poor on paraphrase (BLEU/ROUGE) |
| 🧠 Rubric Score | Groundedness, tone, coherence | Int or Float | Requires calibrated judge |
Practical Evaluation Patterns for Agentic Workflows
Knowing why agent evaluation is hard, and what layers of the pyramid exist, only gets you so far. At some point you have to sit down and write the code. This section is about that moment — the decisions, structures, and patterns that separate evaluation pipelines that actually catch regressions from those that quietly rot in a CI job nobody checks. We will move from theory to implementation: building golden datasets, taming non-determinism, wiring up a harness, enforcing regression gates, and threading the provenance needle so you always know which prompt and which model produced a given score.
Building Your Golden Dataset
The foundation of any repeatable eval suite is a golden dataset — a curated, versioned collection of input/output pairs (and, for agentic systems, multi-turn conversation traces) that represent the range of behavior you care about. Think of the golden dataset as a contract: it encodes your beliefs about what the agent should do, and every run of the eval suite tests whether the agent still honors that contract.
What Goes Into a Golden Example
A golden example is more than just a prompt and an expected string. For agentic workflows it typically contains:
- 🎯 The initial user message or task description
- 📚 A full conversation trace if the scenario is multi-turn
- 🔧 The expected tool calls (which tools, in what order, with what arguments)
- 🧠 The expected final output or a structured rubric that defines acceptable final outputs
- 🔒 Metadata tags (difficulty, domain, failure mode being probed, creation date, author)
Capturing these examples is a craft. The best sources are:
- Production logs — real users revealing edge cases you never anticipated. When a user interaction surprises you (the agent took an unexpected tool path, produced a hallucinated fact, or refused a legitimate request), that trace is gold. Anonymize it, add ground-truth labels, and promote it to your dataset.
- Exploratory red-teaming sessions — deliberately trying to break the agent and recording what breaks.
- Stakeholder acceptance criteria — product requirements translated directly into testable examples.
💡 Pro Tip: Start small and deliberate. Twenty carefully chosen examples that cover distinct failure modes will teach you more than two hundred examples sampled uniformly at random. Quantity without diversity creates an illusion of coverage.
Dataset Format and Version Control
JSONL (newline-delimited JSON) has become the de facto standard for eval datasets because each line is a self-contained record, diffs are human-readable, and the format is trivially streamable. Store your dataset files in the same Git repository as your agent code — or in a linked data versioning system like DVC — so that dataset changes and code changes travel together through your history.
A minimal golden record for an agentic workflow looks like this:
{"id": "task-001", "created": "2024-11-01", "tags": ["search", "multi-step"], "prompt": "What is the current weather in Oslo, and should I bring an umbrella?", "expected_tool_calls": [{"name": "get_weather", "args": {"city": "Oslo"}}], "expected_output_contains": ["temperature", "precipitation"], "rubric": "The agent must call get_weather exactly once and mention whether an umbrella is advisable.", "baseline_score": 1.0}
{"id": "task-002", "created": "2024-11-03", "tags": ["multi-turn", "clarification"], "turns": [{"role": "user", "content": "Book me a flight"}, {"role": "assistant", "content": "Sure! Where would you like to fly to?"}, {"role": "user", "content": "London next Friday"}], "expected_tool_calls": [{"name": "search_flights", "args": {"destination": "London"}}], "rubric": "Agent must gather origin before calling search_flights or explicitly ask for it.", "baseline_score": 0.8}
Notice the baseline_score field. This is the score this example achieved when you first added it, and it anchors your regression gate (more on that shortly).
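A small loader that sanity-checks records on the way in catches dataset rot early. This is a sketch; the required fields and checks are assumptions based on the records above:

```python
## dataset_loader.py — sketch; REQUIRED_FIELDS is an assumption based
## on the golden-record shape shown above.
import json

REQUIRED_FIELDS = {"id", "rubric"}

def load_golden_dataset(path: str) -> list[dict]:
    """Load a JSONL golden dataset, rejecting malformed or duplicate records."""
    examples, seen_ids = [], set()
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines between records
            record = json.loads(line)
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing fields {sorted(missing)}")
            if record["id"] in seen_ids:
                raise ValueError(f"line {lineno}: duplicate id {record['id']!r}")
            seen_ids.add(record["id"])
            examples.append(record)
    return examples
```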
⚠️ Common Mistake: Mistake 1 — Treating the golden dataset as append-only. Real datasets need curation. When the product changes intentionally, update the golden examples to reflect the new expected behavior. Failing to do so means your eval suite measures the old system, not the one you are shipping.
Mocking and Seeding Non-Determinism
Agents are non-deterministic by design: LLMs sample from probability distributions, external APIs return different data on every call, and time-dependent tools make yesterday's correct answer wrong today. If you run your eval suite twice and get different results for reasons unrelated to your code changes, your eval is noise, not signal.
The solution is to stub external dependencies and control the sources of randomness so that a given input always produces the same execution path during offline evaluation.
🎯 Key Principle: For evaluation purposes, reproducibility beats realism. A deterministic eval that catches 80% of regressions reliably is more valuable than a realistic eval that catches 95% of regressions but introduces 30% false-positive noise from environment variance.
Controlling LLM Temperature and Seeds
Most LLM APIs accept a temperature parameter and, increasingly, a seed parameter. For eval runs:
- Set temperature=0 to make the model as close to greedy-deterministic as possible.
- Set a fixed seed if the API supports it (OpenAI, for example, added reproducible outputs via seed).
Wrap these settings in an eval configuration object that is injected at harness startup, separate from the production configuration:
## eval_config.py
from dataclasses import dataclass
@dataclass
class EvalConfig:
model: str = "gpt-4o"
temperature: float = 0.0
seed: int = 42
max_tokens: int = 1024
# Inject this config into your agent factory during eval runs
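What that injection can look like at the call site, sketched against an OpenAI-style chat-completions client — the client shape here is an assumption; adapt it to your provider's SDK:

```python
## llm_call.py — sketch; the chat.completions.create call shape mirrors
## the OpenAI-style API and is an assumption about your provider SDK.
def call_llm(client, config, messages: list[dict]):
    """Single choke point where EvalConfig settings reach the model."""
    return client.chat.completions.create(
        model=config.model,
        temperature=config.temperature,  # 0.0 -> near-greedy decoding
        seed=config.seed,                # reproducible sampling where supported
        max_tokens=config.max_tokens,
        messages=messages,
    )
```

Routing every model call through one function like this means swapping the production config for the EvalConfig is a single injection point rather than a hunt through the codebase.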
Stubbing External Tools
An agent that calls a live weather API, a database, or a search engine during an eval introduces three problems: (1) cost, (2) latency, (3) variance. The clean solution is to wrap every external tool behind an interface and swap in a mock tool layer during eval runs.
Here is a pattern using Python's unittest.mock combined with a simple registry:
## tools/registry.py
class ToolRegistry:
"""Holds callable tools by name. Swap implementations for testing."""
def __init__(self):
self._tools: dict = {}
def register(self, name: str, fn):
self._tools[name] = fn
def call(self, name: str, **kwargs):
if name not in self._tools:
raise ValueError(f"Unknown tool: {name}")
return self._tools[name](**kwargs)
## In your eval harness setup:
def build_mock_registry(fixture_data: dict) -> ToolRegistry:
"""
Build a ToolRegistry where every tool returns pre-recorded fixture data.
`fixture_data` maps tool_name -> list of responses (consumed in order).
"""
registry = ToolRegistry()
for tool_name, responses in fixture_data.items():
call_queue = iter(responses) # Each call pops the next canned response
        def make_stub(queue, name):
            def stub(**kwargs):
                try:
                    return next(queue)
                except StopIteration:
                    # Bind the tool name explicitly; closing over the loop
                    # variable would report the wrong tool after the loop ends.
                    raise RuntimeError(
                        f"Tool '{name}' was called more times than expected."
                    )
            return stub
        registry.register(tool_name, make_stub(call_queue, tool_name))
return registry
The fixture data for a given golden example lives alongside it in your JSONL file or in a sidecar fixtures directory, so the stub responses are version-controlled together with the expected outputs.
💡 Real-World Example: A customer-support agent that calls a CRM API can be fully evaluated offline by replaying the exact CRM responses that were recorded during the original production interaction. The agent sees the same data it would have seen live, the eval is fast, and the result is deterministic.
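One way to produce those replayable fixtures is to record live tool responses as they happen. The sketch below is illustrative (class and function names are assumptions) and writes fixtures in the shape build_mock_registry consumes:

```python
## tools/recorder.py — sketch; names are illustrative. The output file
## matches the fixture shape consumed by build_mock_registry.
import json

class RecordingTool:
    """Wraps a live tool and logs every response, in call order."""
    def __init__(self, fn, log: list):
        self._fn = fn
        self._log = log

    def __call__(self, **kwargs):
        result = self._fn(**kwargs)
        self._log.append(result)  # replay consumes these in the same order
        return result

def save_fixtures(log_by_tool: dict, path: str) -> None:
    """Persist recorded responses as a fixture file for offline replay."""
    with open(path, "w") as f:
        json.dump(log_by_tool, f, indent=2)
```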
The Minimal Eval Harness
With a golden dataset and a mock tool layer in place, the harness is the glue that runs every example through the agent and writes structured results to a report. A well-designed harness has five responsibilities:
┌──────────────────────────────────────────────────────┐
│ EVAL HARNESS │
│ │
│ 1. LOAD ──▶ Read JSONL dataset │
│ 2. SETUP ──▶ Inject EvalConfig + Mock Registry │
│ 3. RUN ──▶ Execute agent per example │
│ 4. SCORE ──▶ Apply assertions / metrics / rubrics │
│ 5. REPORT ──▶ Write structured JSON results │
└──────────────────────────────────────────────────────┘
Here is a minimal but complete implementation:
## harness.py
import json
import time
import traceback
from pathlib import Path
from typing import Any
from eval_config import EvalConfig
from tools.registry import build_mock_registry
from agent import run_agent # Your agent's main entry point
from scorers import score_example # Returns a dict with 'score' and 'details'
def run_eval_suite(
dataset_path: str,
fixtures_dir: str,
output_path: str,
config: EvalConfig,
prompt_version: str,
model_version: str,
) -> dict:
"""
Run every example in the JSONL dataset through the agent with mocked tools.
Writes a structured JSON report to output_path.
Returns a summary dict.
"""
results = []
total_score = 0.0
with open(dataset_path) as f:
examples = [json.loads(line) for line in f if line.strip()]
for example in examples:
example_id = example["id"]
# Load pre-recorded fixture responses for this example's tools
fixture_file = Path(fixtures_dir) / f"{example_id}.json"
fixture_data = json.loads(fixture_file.read_text()) if fixture_file.exists() else {}
mock_registry = build_mock_registry(fixture_data)
start = time.monotonic()
try:
# Run agent with controlled config and mocked tools
agent_output = run_agent(
prompt=example.get("prompt") or example.get("turns"),
tool_registry=mock_registry,
config=config,
)
elapsed = time.monotonic() - start
error = None
except Exception as e:
agent_output = None
elapsed = time.monotonic() - start
error = traceback.format_exc()
# Score the output against the golden example
score_result = score_example(example, agent_output) if not error else {"score": 0.0, "details": "exception"}
total_score += score_result["score"]
results.append({
"id": example_id,
"tags": example.get("tags", []),
"score": score_result["score"],
"baseline_score": example.get("baseline_score", 1.0),
"details": score_result["details"],
"latency_s": round(elapsed, 3),
"error": error,
# Provenance: link results to exact versions
"prompt_version": prompt_version,
"model_version": model_version,
})
avg_score = total_score / len(examples) if examples else 0.0
report = {
"summary": {
"avg_score": round(avg_score, 4),
"total_examples": len(examples),
"prompt_version": prompt_version,
"model_version": model_version,
},
"results": results,
}
Path(output_path).write_text(json.dumps(report, indent=2))
print(f"Eval complete. Avg score: {avg_score:.4f}. Report written to {output_path}")
return report
This harness is intentionally simple. Every moving part is visible: there is no magic, no hidden state, and the output is a plain JSON file that a CI system, a dashboard, or a human can read directly. As your needs grow you can add parallelism, LLM-as-judge scoring, or streaming — but the five-step structure remains stable.
💡 Mental Model: Think of the harness as a test runner for your agent in the same way pytest is a test runner for your functions. The JSONL file is the test suite; each record is a test case; the report is the JUnit XML equivalent.
Regression Testing Strategy
Running the harness once tells you where you are. Running it on every commit and failing the build when scores drop tells you when something broke — and that is where the real value lies.
Pinning Baseline Scores and Defining Thresholds
Each golden example carries a baseline_score. The regression gate is a CI step that compares the current score on every example (and the aggregate) against those baselines and fails if the delta exceeds a configured tolerance.
The tolerance exists because not every small score drop is a real regression. LLMs at temperature 0 are highly reproducible but not perfectly so across model versions. A tolerance of 5–10% per example and 2–3% aggregate is a reasonable starting point.
The regression gate logic is straightforward:
## ci_gate.py
import json
import sys
PER_EXAMPLE_TOLERANCE = 0.10 # Allow up to 10% drop per example
AGGREGATE_TOLERANCE = 0.03 # Allow up to 3% aggregate drop
BASELINE_AGGREGATE = 0.92 # Your pinned overall baseline
def check_regression(report_path: str):
    with open(report_path) as f:
        report = json.load(f)
failures = []
for result in report["results"]:
drop = result["baseline_score"] - result["score"]
if drop > PER_EXAMPLE_TOLERANCE:
failures.append(
f"[REGRESSION] {result['id']}: "
f"score={result['score']:.2f}, baseline={result['baseline_score']:.2f}, "
f"drop={drop:.2f} > tolerance={PER_EXAMPLE_TOLERANCE}"
)
avg = report["summary"]["avg_score"]
agg_drop = BASELINE_AGGREGATE - avg
if agg_drop > AGGREGATE_TOLERANCE:
failures.append(
f"[REGRESSION] Aggregate: avg={avg:.4f}, baseline={BASELINE_AGGREGATE:.4f}, "
f"drop={agg_drop:.4f} > tolerance={AGGREGATE_TOLERANCE}"
)
if failures:
print("\n".join(failures))
sys.exit(1) # Non-zero exit fails the CI step
else:
print(f"Regression gate passed. Avg score: {avg:.4f}")
sys.exit(0)
if __name__ == "__main__":
check_regression("eval_report.json")
🎯 Key Principle: The regression gate should be a hard blocker in CI, not an optional warning. A warning that nobody acts on is not a test; it is noise with extra steps.
Structuring Regression Gates by Pyramid Layer
Just as the eval pyramid has multiple layers, your CI pipeline should have corresponding gates:
CI PIPELINE
│
├── Unit Evals gate ──▶ Fail fast, runs in < 30s
│ (tool call correctness, output format)
│
├── Chain Evals gate ──▶ Runs in < 5 min
│ (multi-step reasoning traces)
│
├── Integration Evals gate ──▶ Runs in < 20 min
│ (end-to-end scenarios, LLM-as-judge)
│
└── Production Evals ──▶ Async, alerts on drift
(shadow traffic, live metrics)
Faster layers run first and gate the slower ones. If the unit evals fail, there is no point spending money on integration evals.
⚠️ Common Mistake: Mistake 2 — Setting regression thresholds too tight. If every 1% fluctuation fails your build, developers will disable the gate or force-push through it. Calibrate tolerances against the natural variance you observe on your model and seed combination, then tighten gradually as the system matures.
Linking Results to Prompt and Model Versions
Imagine your aggregate score drops from 0.91 to 0.84 on a Tuesday morning. You want to answer one question immediately: what changed? There are three candidates — the prompt, the model, and the agent code — and you need provenance data to distinguish them.
Prompt versioning means every prompt template is stored in version control with a unique identifier (a Git hash, a semver tag, or a slug like v2.3.1-system-prompt). The harness reads the prompt version at startup and writes it into every row of the eval report.
Model versioning means you never refer to a model as gpt-4o in your eval config; you pin to gpt-4o-2024-08-06 (the specific snapshot). Model providers periodically update models under the same name, and an unversioned model reference makes it impossible to distinguish a model update from a prompt change.
With both recorded in the report, root-cause analysis becomes a structured query:
Score drop detected on 2024-11-12
│
├── prompt_version: v2.3.0 → v2.3.1 ◀── Changed
├── model_version: gpt-4o-2024-08-06 ◀── Unchanged
└── agent_code_sha: a3f1c92 ◀── Unchanged
Conclusion: Investigate prompt change in v2.3.1
This provenance chain transforms a mystery into a hypothesis you can test by rolling back the prompt version and re-running the harness.
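As a sketch of how the harness might capture this provenance (the field names mirror the reference card below, but the helper itself is illustrative, not a fixed API), gather it once at startup and stamp it into every report:

```python
import hashlib
import json
import subprocess

def collect_provenance(prompt_version: str, model_version: str, eval_config: dict) -> dict:
    """Gather provenance fields once, at harness startup."""
    # Current Git SHA of the agent codebase; fall back gracefully outside a repo
    try:
        agent_code_sha = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True
        ).stdout.strip() or "unknown"
    except OSError:
        agent_code_sha = "unknown"
    # Hash the eval config (temperature, seed, ...) so config drift is detectable
    config_hash = hashlib.sha256(
        json.dumps(eval_config, sort_keys=True).encode()
    ).hexdigest()[:12]
    return {
        "prompt_version": prompt_version,
        "model_version": model_version,  # full snapshot ID, never an alias
        "agent_code_sha": agent_code_sha,
        "eval_config_hash": config_hash,
    }

provenance = collect_provenance(
    prompt_version="v2.3.1-system-prompt",
    model_version="gpt-4o-2024-08-06",
    eval_config={"temperature": 0, "seed": 42},
)
```

Because the config is serialized with sorted keys, the same settings always produce the same hash regardless of dictionary ordering, which is exactly the property a drift guard needs.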
🤔 Did you know? Some teams maintain a model version compatibility matrix in their eval report history — a table of (prompt_version, model_version) → (avg_score, date) — so they can evaluate a new model snapshot against all historical prompts before deciding to adopt it.
📋 Quick Reference Card: Provenance Fields Every Eval Report Should Contain
| 🔒 Field | 📚 What to Record | 🎯 Why It Matters |
|---|---|---|
| 🔧 prompt_version | Git tag or hash of prompt template | Isolates prompt changes from model changes |
| 🧠 model_version | Full model snapshot ID | Detects silent model updates |
| 📚 agent_code_sha | Git SHA of agent codebase | Ties score to exact code state |
| 🔒 dataset_version | Git tag or hash of JSONL file | Ensures you compare apples to apples |
| 🎯 eval_config_hash | Hash of temperature, seed, etc. | Guards against config drift |
Putting It All Together
The patterns in this section compose into a workflow that treats agent evaluation with the same rigor as production software:
Author/Capture Version Control
Golden Examples ──────────────▶ JSONL Dataset + Fixtures
│ │
▼ ▼
Mock Tool Layer ◀────────── Eval Harness reads both
│
▼
Agent (temp=0, seed=42)
│
▼
Structured Report (JSON)
│ │
▼ ▼
CI Gate Dashboard / History
(hard (provenance queries,
blocker) trend analysis)
Golden datasets give you coverage. Mocking gives you reproducibility. The harness gives you automation. The regression gate gives you a safety net. Provenance gives you debuggability. No single piece works without the others.
💡 Remember: Eval infrastructure is not a one-time build — it is a living system. Treat it like production code: review changes to the dataset, test the harness itself, and allocate engineering time to maintain it as the agent evolves. Teams that skip this investment find that their eval suite becomes a liability rather than an asset, producing stale signals that erode trust until nobody runs it at all.
❌ Wrong thinking: "We can add proper evals later once the agent is more mature." ✅ Correct thinking: "Evals are how we define and verify maturity. They come first, or they come too late."
With this infrastructure in place, you are ready to explore the failure modes that even well-intentioned teams encounter — which is exactly where the next section takes you.
Common Pitfalls When Evaluating Agents
Building an eval pipeline for an agentic system feels straightforward until the first time a production agent silently degrades, a critical tool misuse slips through a green eval suite, or a team discovers their carefully tuned prompts are only good against the exact twenty examples they kept iterating against. The gap between "we have evals" and "our evals actually protect us" is wider than most teams expect—and the failure modes are remarkably consistent across organizations.
This section catalogs the five most common pitfalls teams encounter when standing up evaluation pipelines for agentic systems. For each one, you'll see how to recognize the pattern early, why it's particularly dangerous in agentic contexts, and how to restructure your approach before it costs you in production.
Pitfall 1: Evaluating Only Final Answers
⚠️ Common Mistake: Grading the output of an agent the same way you'd grade the output of a single LLM call—by looking only at the final response delivered to the user.
In a traditional LLM application, the path from prompt to response is a single hop. Grading the final answer is a reasonable proxy for grading everything that happened. Agentic systems break this assumption completely. An agent might call five tools, reason across three intermediate scratchpad steps, discard one retrieved document in favor of a hallucinated fact, and still produce a final answer that looks correct to a surface-level evaluator.
Consider a research agent tasked with summarizing recent earnings calls. The agent:
- Queries a document retrieval tool and gets back three relevant chunks
- Hallucinates a fourth "fact" about revenue growth that was never in any retrieved document
- Uses that hallucinated fact to support its summary
- Produces a final summary that sounds confident and well-structured
If your evaluator only sees the final summary and checks whether it "sounds accurate and comprehensive," it may pass. The hallucinated intermediate fact is invisible to final-answer-only evaluation.
## ❌ Final-answer-only evaluation — misses intermediate failures
def evaluate_agent_run_naive(final_answer: str, expected_answer: str) -> dict:
"""
Only checks whether the final answer is correct.
Completely blind to what happened in the middle.
"""
score = llm_judge(
prompt=f"Does this answer correctly address the question?\nAnswer: {final_answer}",
rubric="Score 1 if correct, 0 if not."
)
return {"final_answer_score": score}
## ✅ Trace-aware evaluation — inspects intermediate steps
def evaluate_agent_run_with_trace(trace: AgentTrace, expected_answer: str) -> dict:
"""
AgentTrace contains:
- trace.steps: list of {tool_name, tool_input, tool_output, reasoning}
- trace.final_answer: the user-facing response
"""
results = {}
# 1. Check tool calls for misuse or hallucinated inputs
for i, step in enumerate(trace.steps):
if step.tool_name == "document_retrieval":
# Assert that claims in reasoning are grounded in tool_output
grounding_score = check_grounding(
claim=step.reasoning,
source=step.tool_output
)
results[f"step_{i}_grounding"] = grounding_score
# Detect if the agent called an unexpected tool
results[f"step_{i}_tool_valid"] = step.tool_name in ALLOWED_TOOLS
# 2. Check final answer against expected
results["final_answer_correct"] = exact_or_semantic_match(
trace.final_answer, expected_answer
)
return results
The second function treats the agent's execution trace as a first-class evaluation artifact. Every reasoning step and every tool call becomes a checkable unit. This is the agentic equivalent of inspecting intermediate variables in a debugger rather than only reading the program's exit code.
💡 Pro Tip: Instrument your agent to emit structured traces (tool name, inputs, outputs, and the reasoning snippet that preceded each call) and store them alongside your eval results. Even when you can't automate every intermediate check, human reviewers can spot patterns in traces that final-answer metrics will never surface.
Pitfall 2: Using the Same LLM as Both Generator and Judge Without Calibration
⚠️ Common Mistake: Spinning up an LLM-as-Judge evaluator using the same model family—or even the same model—that your agent uses for generation, and treating those scores as ground truth.
Correlated bias is the silent killer here. When a GPT-4-class model generates an answer and a GPT-4-class model judges it, both models share the same training distribution, the same stylistic preferences, and often the same systematic blind spots. The judge tends to reward the kinds of answers the generator tends to produce. Errors that both models make confidently will be graded as correct.
Without calibration:
Generator (GPT-4o) ──► Response ──► Judge (GPT-4o)
│
Shared blind spots ──► Inflated scores
With calibration:
Generator (GPT-4o) ──► Response ──► Judge (GPT-4o)
│
Calibration set
(human-labeled)
│
Bias-adjusted scores ──► Trustworthy signal
The fix involves a calibration step: before trusting your LLM judge at scale, run it against a held-out set of examples with known human labels. Measure the judge's agreement rate with humans and its error direction (does it systematically over-score? under-score for a particular task type?). Use that calibration data to adjust thresholds or to weight the judge's scores.
from scipy.stats import pearsonr
def calibrate_llm_judge(
calibration_examples: list[dict], # [{"response": ..., "human_score": 0-5}, ...]
judge_fn: callable
) -> dict:
"""
Runs the judge on a calibration set and computes:
- Pearson correlation with human scores
- Mean bias (judge_score - human_score)
- Recommended score adjustment
"""
judge_scores = []
human_scores = []
for example in calibration_examples:
judge_score = judge_fn(example["response"]) # returns 0-5 float
judge_scores.append(judge_score)
human_scores.append(example["human_score"])
correlation, _ = pearsonr(judge_scores, human_scores)
mean_bias = sum(
j - h for j, h in zip(judge_scores, human_scores)
) / len(judge_scores)
return {
"pearson_r": round(correlation, 3),
"mean_bias": round(mean_bias, 3),
# Subtract bias when using judge scores in production
"recommended_adjustment": -round(mean_bias, 3),
"trustworthy": correlation > 0.75 # threshold you set based on risk tolerance
}
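Once calibrated, applying the result is a one-liner: subtract the judge's measured bias before comparing against thresholds, clamped to the scoring scale. This is a sketch; the `mean_bias` value would come from the calibration run above.

```python
def adjust_judge_score(raw_score: float, mean_bias: float,
                       lo: float = 0.0, hi: float = 5.0) -> float:
    """Remove the judge's systematic bias, clamped to the scoring scale."""
    return max(lo, min(hi, raw_score - mean_bias))

# If calibration showed the judge over-scores by 0.6 on a 0-5 scale:
adjusted = adjust_judge_score(raw_score=4.2, mean_bias=0.6)  # 3.6
```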
🎯 Key Principle: An LLM judge you haven't calibrated against human labels is a hypothesis, not a measurement. Treat calibration as a prerequisite, not an optimization.
💡 Real-World Example: A team building a customer support agent used GPT-4 Turbo for both the agent and the judge. Their eval showed 87% task completion. When they sampled 50 responses for human review, the actual rate was 61%. The judge was systematically rewarding fluent, confident-sounding responses even when they contained factual errors—exactly the kind of errors their generator model was most likely to make.
Pitfall 3: Overfitting Your Golden Dataset
⚠️ Common Mistake: Iterating on prompts, tool descriptions, or agent configuration against the same eval set you use to measure quality—and celebrating when scores go up.
This is eval set overfitting, and it is the agentic equivalent of training on your test set. Every time you look at a failing eval case and tweak the agent to pass it, you are encoding that specific example's solution into the agent. The eval score rises. Generalization does not.
Dangerous cycle:
Eval Set ──► Run Agent ──► Score: 72%
↑ │
│ Inspect failures │
│ ↓ │
└── Tune prompts to fix ◄───┘
specific failures
Result after 10 cycles: Score 94%, but only on THAT eval set.
Deploy to production ──► Silent regression.
Safe cycle:
Held-out Test Set (never touched during dev)
Dev Eval Set ──► Tune ──► Dev score rises
│
Periodic check ──► Test Set score
│
Honest generalization signal
The structural fix is to maintain separate development and test eval sets and to treat the test set as a locked artifact. You are allowed to look at your dev eval set as often as you want. You may only run the test set at meaningful checkpoints—major prompt changes, model upgrades, before a production release.
Beyond set separation, adopt continuous dataset refresh: periodically add new examples drawn from production traffic or from adversarially generated inputs that probe edge cases. A golden dataset that was comprehensive when you created it may be stale three months later if your users' queries have evolved.
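One lightweight way to enforce the lock is to make the harness refuse to load the test set unless an explicit checkpoint flag is passed. This is a sketch under stated assumptions: the file paths and the `checkpoint` flag are illustrative, not a standard convention.

```python
import json
from pathlib import Path

DEV_SET = Path("evals/dev_set.jsonl")    # look at this as often as you like
TEST_SET = Path("evals/test_set.jsonl")  # locked: checkpoints only

def load_eval_set(which: str, checkpoint: bool = False) -> list[dict]:
    """Load an eval set, guarding the held-out test set behind a flag."""
    if which == "test" and not checkpoint:
        raise PermissionError(
            "Test set is locked. Pass checkpoint=True only for major prompt "
            "changes, model upgrades, or pre-release runs."
        )
    path = TEST_SET if which == "test" else DEV_SET
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]
```

The guard costs nothing, but it turns "we agreed not to peek" into something the code enforces, and every `checkpoint=True` call site is greppable in review.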
🧠 Mnemonic: Think of your eval sets like a scientific study: dev set = training cohort, test set = validation cohort. Publishing results from the training cohort proves nothing. Only the validation cohort measures real generalization.
💡 Pro Tip: Version your eval datasets alongside your code in the same repository or artifact store. When a score drops, you should be able to diff both the agent code and the dataset to understand what changed.
Pitfall 4: Neglecting Latency and Cost as First-Class Eval Dimensions
⚠️ Common Mistake: Designing an eval pipeline that measures only quality—accuracy, coherence, task completion—while treating latency and cost as deployment concerns to be handled "later."
In agentic systems, "later" is often too late. An agent that completes a task correctly but requires 47 seconds and $0.23 per run may be technically correct and operationally unusable. More insidiously, quality optimizations almost always increase cost and latency. A team that only measures quality will keep optimizing in one direction until the production economics make the product unshippable.
Latency and cost need to be tracked at every layer of the testing pyramid, not just in production monitoring:
- Unit evals: Track token counts for individual LLM calls. A unit that consumes 4,000 tokens in a single reasoning step is a budget risk before it ever hits integration.
- Chain evals: Measure end-to-end latency for multi-step chains under realistic tool response times. A 200ms tool call repeated 15 times is 3 seconds of latency before the LLM even responds.
- Integration evals: Simulate concurrent users. Agents that perform fine in isolation often expose cost spikes under concurrency due to repeated context loading.
- Production evals: Track cost-per-task alongside quality metrics in the same dashboard.
import time
from dataclasses import dataclass, field
@dataclass
class AgentEvalResult:
task_id: str
quality_score: float # 0.0 - 1.0
latency_seconds: float # wall-clock time for the full run
total_tokens: int # input + output tokens across all LLM calls
tool_call_count: int # number of tool invocations
estimated_cost_usd: float # computed from token counts + model pricing
passed: bool = field(init=False)
# Thresholds — set these based on your SLA and unit economics
QUALITY_THRESHOLD = 0.80
LATENCY_THRESHOLD_S = 15.0
COST_THRESHOLD_USD = 0.05
def __post_init__(self):
# An eval "passes" only if ALL three dimensions are within budget
self.passed = (
self.quality_score >= self.QUALITY_THRESHOLD
and self.latency_seconds <= self.LATENCY_THRESHOLD_S
and self.estimated_cost_usd <= self.COST_THRESHOLD_USD
)
def run_timed_eval(agent, task: dict, pricing_per_1k_tokens: float = 0.01) -> AgentEvalResult:
start = time.perf_counter()
trace = agent.run(task["input"]) # returns AgentTrace
elapsed = time.perf_counter() - start
total_tokens = sum(step.tokens_used for step in trace.steps)
cost = (total_tokens / 1000) * pricing_per_1k_tokens
quality = evaluate_trace_quality(trace, task["expected_output"])
return AgentEvalResult(
task_id=task["id"],
quality_score=quality,
latency_seconds=round(elapsed, 2),
total_tokens=total_tokens,
tool_call_count=len(trace.steps),
estimated_cost_usd=round(cost, 5)
)
The AgentEvalResult dataclass above treats a run as failing if any of the three dimensions—quality, latency, or cost—exceeds its threshold. This forces trade-off conversations to happen during development rather than after deployment.
📋 Quick Reference Card: Eval Dimensions by Layer
| 🔧 Layer | 🎯 Quality Signal | ⏱️ Latency Signal | 💰 Cost Signal |
|---|---|---|---|
| 🔬 Unit | Assertion pass rate | Token count per call | Cost per LLM call |
| 🔗 Chain | Step correctness | Chain wall-clock time | Tokens per chain |
| 🧩 Integration | Task completion rate | P95 latency under load | Cost per task |
| 🚀 Production | User satisfaction | Real-world P99 latency | Daily/monthly spend |
🤔 Did you know? Research on production LLM applications consistently finds that cost and latency optimizations often conflict with quality optimizations. Teams that measure all three from day one tend to find the right trade-off point 30–40% faster than teams that optimize quality first and retrofit the economics later.
Pitfall 5: Treating Eval as a One-Time Gate
⚠️ Common Mistake: Running a comprehensive eval suite before launch, passing the bar, and then treating the agent as validated until the next major feature change.
This is the most dangerous pitfall of the five because it produces silent degradation—a system that was working correctly at the moment of its eval run and is quietly failing weeks later. Agentic systems have more silent degradation vectors than almost any other class of software:
Silent degradation sources:
Agent
│
├── LLM API ──► Model updated by provider (same version string,
│              different weights after safety fine-tuning)
│
├── Tool A ──► Schema change in external API response format
│              ──► Parsing fails silently
│
├── Tool B ──► Rate limit policy tightened
│              ──► Intermittent failures average out in metrics
│                  but spike for specific users
│
└── Retrieval DB ──► New documents indexed
                     ──► Retrieval distribution shifts
                     ──► Previously reliable answers now hallucinate
The solution is to treat evaluation as a continuous feedback loop rather than a checkpoint. This means:
Scheduled regression runs: Automatically re-run your full eval suite on a cadence (daily for high-stakes agents, weekly for lower-risk ones) even when no code has changed. If scores drop, you know something in the environment changed.
Model version pinning with drift alerts: When a provider updates a model (even a minor update), run your eval suite against both versions before migrating. Subscribe to provider changelogs and treat model updates as deployments.
Production shadow evals: Route a small percentage of real production traffic through your eval harness. Grade live responses using your automated metrics. This catches distribution shifts that your golden dataset never anticipated.
Tool API change detection: Many silent failures come from external tool APIs changing their response schemas. Add lightweight schema assertions to your integration evals and run them on a heartbeat schedule.
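A schema assertion doesn't require a full validation library; a handful of field-and-type checks against a live (or recorded) tool response catches most silent format changes. A minimal sketch, where the expected fields are illustrative:

```python
# Expected shape of a retrieval tool's response — update deliberately,
# never silently. Each entry is (field name, expected type).
RETRIEVAL_SCHEMA = [
    ("results", list),
    ("query_id", str),
    ("total_hits", int),
]

def assert_schema(response: dict, schema: list[tuple[str, type]]) -> list[str]:
    """Return a list of schema violations (empty means the shape still holds)."""
    violations = []
    for field, expected_type in schema:
        if field not in response:
            violations.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            violations.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(response[field]).__name__}"
            )
    return violations

# Run on a heartbeat schedule; alert (or fail the eval) on any violation.
ok = assert_schema({"results": [], "query_id": "q1", "total_hits": 0}, RETRIEVAL_SCHEMA)
bad = assert_schema({"results": {}, "query_id": "q1"}, RETRIEVAL_SCHEMA)
```

Returning the violation list (rather than raising) lets the heartbeat job aggregate results across tools and post one alert instead of dying on the first mismatch.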
💡 Real-World Example: A team shipped a document Q&A agent that scored 91% on their eval suite at launch. Three months later, their retrieval tool provider silently changed the format of metadata fields in search results. The agent's citation extraction logic broke. The final answers still looked plausible, but citations were now fabricated. Because eval ran only at release time, this went undetected for six weeks until a user reported incorrect citations in a legal document.
🎯 Key Principle: An eval suite that runs once is a historical artifact. An eval suite that runs continuously is a safety system.
Putting It All Together: A Pitfall Audit Checklist
Before you ship an eval pipeline for an agentic system, run through this checklist. Each item maps directly to one of the pitfalls above.
┌─────────────────────────────────────────────────────────────┐
│ AGENTIC EVAL PIPELINE AUDIT │
├────┬────────────────────────────────────────┬───────────────┤
│ # │ Check │ Pitfall │
├────┼────────────────────────────────────────┼───────────────┤
│ 1 │ Are intermediate trace steps graded? │ Final-only │
│ 2 │ Is the LLM judge calibrated vs humans? │ Correlated │
│ │ │ bias │
│ 3 │ Are dev and test eval sets separate? │ Overfitting │
│ 4 │ Are latency + cost tracked as metrics? │ Economics │
│ 5 │ Is eval scheduled to run continuously? │ One-time gate │
└────┴────────────────────────────────────────┴───────────────┘
If any of these five boxes is unchecked, you have a known gap in your coverage. The good news is that each fix is independent—you can address them one at a time without needing to rebuild your pipeline from scratch. Start with whichever failure mode is most likely to surface first given your agent's specific architecture and risk profile.
❌ Wrong thinking: "Our eval suite passed, so the agent is correct."
✅ Correct thinking: "Our eval suite passed today, against this dataset, measuring these dimensions. We have scheduled runs, calibrated judges, locked test sets, and trace-level checks to ensure it stays correct tomorrow."
Evaluation is not a finish line. For agentic systems operating in dynamic environments with non-deterministic behavior, it is the ongoing practice that makes trustworthy deployment possible.
Key Takeaways and What Comes Next
You've traveled a long road through this lesson. You started by confronting why traditional testing intuitions break down when agents enter the picture—non-determinism, multi-step reasoning chains, external tool calls, and emergent failure modes that no unit test can anticipate on its own. You then built up a complete mental model from scratch: a reimagined testing pyramid, a toolkit of evaluation primitives, concrete implementation patterns, and a catalog of the pitfalls waiting to trap teams who move too fast.
This final section is your consolidation point. Think of it as the map you fold up and keep in your pocket before heading into the territory. Everything here is something you can return to when you're deep in an eval sprint and need to re-orient.
The Big Picture in One View
Before drilling into individual takeaways, it helps to see all the major ideas sitting next to each other. The diagram below collapses the entire lesson into a single visual:
┌─────────────────────────────────────────────────────────────┐
│ AGENTIC EVAL LANDSCAPE │
│ │
│ PYRAMID LAYER WHAT YOU TEST PRIMARY TOOL │
│ ───────────── ───────────── ──────────── │
│ 🔬 Unit Single step / tool Deterministic │
│ 🔗 Chain Multi-step reasoning Statistical │
│ 🔄 Integration Full workflow + tools Rubric / LLM-J │
│ 🚀 Production Live user traffic Monitoring │
│ │
│ PRIMITIVE BEST FOR COST │
│ ───────── ──────── ──── │
│ Assertion Exact correctness Negligible │
│ Metric Aggregate quality Low–Medium │
│ Rubric Nuanced judgment Medium–High │
│ │
│ REPRODUCIBILITY MECHANISM │
│ ─────────────── ───────── │
│ Seeded runs Fix LLM randomness │
│ Mocked tools Eliminate external flakiness │
│ Versioned data Freeze eval inputs │
│ │
│ CI/CD ROLE Evals run on every merge, not just ship │
└─────────────────────────────────────────────────────────────┘
Every column in that diagram corresponds to a decision you make when designing an eval. Choosing the wrong layer for a given question wastes money and produces misleading signal. Choosing the wrong primitive produces either false confidence (assertions alone) or unactionable noise (rubrics alone on high-volume runs).
Takeaway 1: The Four-Layer Pyramid Maps Testing Responsibility to Cost
The single most important structural insight from this lesson is that not all agent testing belongs at the same granularity. The reimagined pyramid exists because cost and signal have an inverse relationship: cheap tests (unit) run thousands of times per day and catch regressions early; expensive tests (production monitoring) run continuously at the cost of real user experience.
🎯 Key Principle: Every test has a natural home in the pyramid. When you find yourself writing a slow, expensive test to check a property that should be fast and cheap—or skipping production monitoring because "the integration tests passed"—you've misallocated testing effort.
The practical implication is that your eval suite should have a deliberate layer budget:
- Unit evals should number in the hundreds to thousands. They run in seconds. Flipping an assertion here costs nothing.
- Chain evals should number in the dozens to hundreds. They run in minutes. Each one exercises a meaningful reasoning path.
- Integration evals should number in the tens to low hundreds. They run in tens of minutes. Each one exercises a full agentic workflow with real (or realistically mocked) infrastructure.
- Production evals don't have a "number"—they are always on, sampling live traffic and feeding dashboards.
🧠 Mnemonic: "Fewer, Slower, Richer" — as you climb the pyramid, the tests get fewer in count, slower in execution, and richer in the signal they return.
## Example: Pyramid-aware pytest markers let you run the right layer at the right time
import pytest
## Run with: pytest -m unit (fast CI gate)
## Run with: pytest -m "unit or chain" (pre-merge gate)
## Run with: pytest -m integration (nightly or pre-release)
@pytest.mark.unit
def test_tool_call_schema_is_valid():
"""Unit eval: does the tool invocation match the expected JSON schema?"""
from myagent.tools import search_tool
call = search_tool.build_call(query="climate policy")
assert call["name"] == "search"
assert "query" in call["arguments"]
assert isinstance(call["arguments"]["query"], str)
@pytest.mark.chain
def test_research_chain_cites_sources(seeded_llm):
"""Chain eval: does a multi-step research flow include citations?"""
from myagent.chains import research_chain
result = research_chain.run(
topic="renewable energy trends",
llm=seeded_llm # deterministic via seed
)
# Statistical metric: at least 2 sources cited
citations = result.metadata.get("citations", [])
assert len(citations) >= 2, f"Expected ≥2 citations, got {len(citations)}"
@pytest.mark.integration
def test_full_report_workflow_end_to_end(mocked_tools, versioned_dataset):
"""Integration eval: full report generation with mocked external calls."""
from myagent.workflows import report_workflow
sample = versioned_dataset["report_v3"][0] # pinned input
output = report_workflow.run(
inputs=sample["inputs"],
tools=mocked_tools # no real HTTP calls
)
# Rubric scored by LLM-as-Judge (see child lesson)
score = rubric_judge.score(output, criteria=sample["rubric"])
assert score.overall >= 0.75, f"Quality score too low: {score}"
This pattern lets your CI pipeline run pytest -m unit in under 30 seconds, pytest -m "unit or chain" in under 5 minutes, and schedule the integration suite for nightly runs—each gate calibrated to its cost.
Takeaway 2: Effective Evals Combine All Three Primitives
Assertions, statistical metrics, and rubric-based scoring are not alternatives—they are complements. The failure mode isn't choosing the wrong one; it's over-relying on one and ignoring the others.
❌ Wrong thinking: "My assertions all pass, so the agent is working correctly." ✅ Correct thinking: "My assertions confirm structural correctness; my metrics confirm aggregate quality; my rubrics confirm nuanced judgment. All three need to be green."
💡 Mental Model: Think of the three primitives as three different senses. Assertions are touch—binary, immediate, unmistakable. Metrics are sight—you see trends and patterns over populations. Rubrics are hearing—you pick up subtle tonal qualities that resist quantification.
A mature eval suite wires them together in a single harness:
from dataclasses import dataclass
from typing import Optional
@dataclass
class EvalResult:
# Assertion layer
schema_valid: bool
no_hallucinated_tool_calls: bool
# Metric layer
answer_similarity_score: float # 0.0–1.0
tool_call_precision: float # 0.0–1.0
latency_p95_ms: float
# Rubric layer
helpfulness_score: Optional[float] # 0.0–1.0, LLM-as-Judge
groundedness_score: Optional[float] # 0.0–1.0, LLM-as-Judge
def passes_gate(self) -> bool:
"""Composite gate: all three primitive types must pass."""
assertions_ok = self.schema_valid and self.no_hallucinated_tool_calls
metrics_ok = (
self.answer_similarity_score >= 0.80 and
self.tool_call_precision >= 0.90
)
rubrics_ok = (
(self.helpfulness_score or 0) >= 0.70 and
(self.groundedness_score or 0) >= 0.75
)
return assertions_ok and metrics_ok and rubrics_ok
The passes_gate() method enforces that no single layer can carry the whole evaluation. A perfectly helpful response that fails schema validation still blocks the release. A schema-valid response that scores 0.3 on groundedness still blocks the release.
Takeaway 3: Reproducibility Is an Engineering Choice, Not a Coincidence
Non-determinism is native to LLMs. Treating it as background noise—something to shrug at—is the fastest path to an eval suite that tells you nothing reliable. Reproducibility must be engineered in deliberately.
The three mechanisms that do the work:
🔧 Seeded runs — pass a fixed seed and temperature=0 to your model client so that identical inputs produce identical (or near-identical) outputs across runs. Not all model providers guarantee full determinism even with seeds, but it dramatically narrows variance.
🔧 Mocked tools — replace real HTTP calls, database queries, and file reads with fixtures that return the same payload every time. The agent's reasoning is what you're testing; real infrastructure introduces failure modes orthogonal to agent quality.
🔧 Versioned datasets — pin your eval inputs to a named, immutable dataset version. eval_dataset_v4_2024-11 should never change after creation. When you add new cases, you cut a new version.
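The mocked-tools mechanism can be as simple as a replay fixture: responses recorded once, keyed by a hash of the tool input, and returned identically forever after. A sketch, where the fixture format and class name are illustrative:

```python
import hashlib
import json

class ReplayTool:
    """Replays recorded tool responses so eval runs never hit real infrastructure."""

    def __init__(self, fixtures: dict[str, dict]):
        # fixtures maps input-hash -> recorded payload
        self.fixtures = fixtures

    @staticmethod
    def key(tool_input: dict) -> str:
        canonical = json.dumps(tool_input, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()[:16]

    def __call__(self, tool_input: dict) -> dict:
        k = self.key(tool_input)
        if k not in self.fixtures:
            # Fail loudly: an unrecorded input means the agent took a new path
            raise KeyError(f"No fixture for input hash {k}: {tool_input}")
        return self.fixtures[k]

# Build fixtures once (e.g., by recording a real run), then freeze them.
fixtures = {
    ReplayTool.key({"query": "earnings Q3"}): {"chunks": ["rev up 12%"], "hits": 1},
}
search = ReplayTool(fixtures)
```

Raising on an unrecorded input is deliberate: a cache miss during an eval run means the agent's behavior changed, which is itself a signal worth surfacing rather than papering over with a live call.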
⚠️ Common Mistake — Mistake 1: Updating your eval dataset in place when you add new test cases. This makes historical comparisons meaningless because the baseline shifted under your feet. Always version datasets like you version code.
🤔 Did you know? Some teams run their full eval suite twice per CI job and flag any test where the two runs disagree. This "flakiness detector" surfaces tests that are under-seeded or leaking real non-determinism—before those tests give you false confidence in production.
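The flakiness detector described above is a small loop: run each eval case twice and report any case where the two runs disagree. A sketch, where `run_eval` is a stand-in for your harness's per-case entry point and the toy harness below exists only to demonstrate the mechanic:

```python
def detect_flaky(eval_cases: list[dict], run_eval) -> list[str]:
    """Run every case twice; return the IDs whose two scores disagree."""
    flaky = []
    for case in eval_cases:
        first = run_eval(case)
        second = run_eval(case)
        if first != second:
            flaky.append(case["id"])
    return flaky

# Toy harness: one case's score flips between runs, simulating under-seeding.
calls = {"n": 0}
def toy_run_eval(case):
    calls["n"] += 1
    return calls["n"] % 2 if case["id"] == "flaky-1" else 1.0

flagged = detect_flaky([{"id": "stable-1"}, {"id": "flaky-1"}], toy_run_eval)
# flagged == ["flaky-1"]
```

Doubling the run count is the honest price of the signal; teams typically pay it on a nightly schedule rather than on every PR.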
Takeaway 4: Eval Is a Continuous Engineering Discipline
This is the mindset shift that separates teams who get real value from evals and teams who treat them as a tax. Evaluation is not a pre-ship checklist. It is an engineering discipline with the same continuous lifecycle as the code it tests.
What that means in practice:
📚 Evals live in version control alongside agent code. They are reviewed in pull requests. They have owners.
🎯 Evals run in CI/CD on every merge to main, not just before releases. A regression introduced on a Tuesday should be caught on Tuesday, not discovered by users two weeks later.
🔧 Production signals feed back into evals. When a new failure pattern appears in production logs, a new eval case is written to cover it. The eval suite grows with the agent's failure history.
🚀 Eval infrastructure is treated like production infrastructure. It has SLAs. It has on-call. It doesn't rot.
## Example: GitHub Actions workflow that runs pyramid layers as separate jobs
## .github/workflows/agent-evals.yml (abbreviated)
## on:
##   pull_request:
##   push:
##     branches: [main]
##   schedule:
##     - cron: "0 3 * * *"  # nightly trigger for the slow layers
## jobs:
##   unit-evals:
##     runs-on: ubuntu-latest
##     steps:
##       - uses: actions/checkout@v4
##       - run: pip install -e ".[dev]"
##       - run: pytest -m unit --tb=short -q
##     # Fails fast — blocks PR merge
##
##   chain-evals:
##     needs: unit-evals  # only runs if unit passes
##     runs-on: ubuntu-latest
##     steps:
##       - uses: actions/checkout@v4
##       - run: pip install -e ".[dev]"
##       - run: pytest -m chain --tb=short -q
##     # Still blocks merge, but runs after the unit gate
##
##   integration-evals:
##     if: github.ref == 'refs/heads/main'  # main only; the schedule trigger makes it nightly
##     needs: chain-evals
##     runs-on: ubuntu-latest
##     steps:
##       - uses: actions/checkout@v4
##       - run: pip install -e ".[dev]"
##       - run: pytest -m integration --tb=long
##     # Results posted to Slack; does not block hotfixes
##
## Key design principle:
##   Fast gates block merges. Slow gates inform releases.
##   Production monitoring runs independently, always.
print("Eval pipeline: always on, always honest.")
The workflow above encodes the pyramid in CI: unit evals are a hard gate on every PR, chain evals run immediately after, and integration evals run nightly on main. Production monitoring is a separate always-on system that feeds dashboards and alerts, not a CI job.
🎯 Key Principle: The eval pipeline is the immune system of your agentic application. It doesn't run when you remember to. It runs automatically, every time, and it tells you when something is wrong before your users do.
📋 Quick Reference Card: The Five Core Takeaways
| # | 🎯 Takeaway | ⚡ One-Line Version |
|---|---|---|
| 1 | 🔬 Four-Layer Pyramid | Match test granularity to cost—unit, chain, integration, production |
| 2 | 🔧 Three Eval Primitives | Use assertions AND metrics AND rubrics—never just one |
| 3 | 🔒 Reproducibility by Design | Seed runs, mock tools, version datasets deliberately |
| 4 | 🚀 Eval as Engineering | CI/CD + production monitoring = permanent discipline, not checklist |
| 5 | 📚 Continuous Improvement | Every production failure becomes a new eval case |
⚠️ Three Critical Points to Carry Forward
Before you move to the next lessons, cement these:
⚠️ Non-determinism is your adversary, but it's a manageable one. Every unreproducible eval result is a masked bug—either in the agent or in the eval itself. Treat flaky evals with the same urgency as flaky production systems.
⚠️ LLM-as-Judge is powerful but not neutral. When you use a language model to score another language model's output, you are introducing a second source of non-determinism and potential bias. The upcoming child lesson on LLM-as-Judge covers how to calibrate, validate, and A/B test your judges so they remain trustworthy signal sources rather than laundered guesses.
⚠️ An eval suite that never fails is not a healthy eval suite. If every run is green, either your agent is genuinely perfect (unlikely) or your evals aren't covering meaningful failure territory (almost certain). Healthy eval suites catch regressions regularly because they're written to probe the boundaries where agents actually fail.
Practical Next Steps Before the Child Lessons
You now have the conceptual map. Before you dive into the deeper technical content that follows, here are three concrete things you can do with what you've learned:
1. Audit your current test suite against the pyramid. If you have an existing agentic project, open your test directory and label each test with the pyramid layer it belongs to. Most teams discover they have abundant unit tests, sparse chain tests, and no integration or production evals at all. That audit tells you exactly where to invest next.
2. Pick one eval primitive you're not yet using and add it. If you have assertions but no metrics, write a semantic similarity scorer for your top three agent outputs. If you have metrics but no rubrics, write a three-criteria rubric and score ten recent outputs manually—you'll quickly see what the numbers miss.
3. Set up one CI gate at the unit level. Even a single pytest -m unit job on your pull request pipeline changes the culture of how your team thinks about agent quality. It makes evals visible, automatic, and consequential—which is the prerequisite for everything else.
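The metric scorer from step 2 can start very small. This sketch uses the standard library's difflib as a crude lexical stand-in for semantic similarity — a real implementation would swap in embedding cosine similarity (e.g. from a sentence-embedding model), but the scoring harness looks the same:

```python
## Crude similarity metric: lexical ratio as a stand-in for embedding similarity.
from difflib import SequenceMatcher

def similarity(output: str, reference: str) -> float:
    """Score in [0, 1]. Swap in embedding cosine similarity for real semantics."""
    return SequenceMatcher(None, output.lower(), reference.lower()).ratio()

def metric_eval(output: str, reference: str, threshold: float = 0.6) -> bool:
    ## Unlike a hard assertion, a metric tolerates paraphrase up to a threshold.
    return similarity(output, reference) >= threshold

## Near-paraphrase scores high; an unrelated refusal scores low.
assert metric_eval("The refund was issued to the customer.",
                   "A refund was issued to the customer.")
assert not metric_eval("I cannot help with that.",
                       "A refund was issued to the customer.")
print("metric scorer sketch passed")
```

The threshold is the part you tune: score ten recent outputs by hand, as step 2 suggests, and pick the value that separates the outputs you'd accept from the ones you wouldn't.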
💡 Pro Tip: The hardest part of building an eval culture isn't the tooling—it's the habit. Start with the smallest possible CI gate that makes eval results visible to the whole team. Visibility creates accountability, and accountability drives the discipline that makes eval genuinely useful.
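The smallest possible gate needs only a marker registration so that pytest -m unit selects the right tests. A minimal sketch of that config, assuming the layer names from this lesson's pyramid (the markers themselves are your convention, not built into pytest):

```ini
# pytest.ini — register the pyramid layers as markers so `pytest -m unit`
# selects only the fast, fully mocked tests
[pytest]
markers =
    unit: fast, fully mocked evals (hard PR gate)
    chain: multi-step chain evals (PR gate, runs after unit)
    integration: live-model evals (nightly on main, non-blocking)
```

Registering markers also lets you run pytest with --strict-markers, which turns a typo like -m unti into an error instead of a silently empty test run.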
What the Child Lessons Cover
This lesson gave you the landscape. The two child lessons that follow give you the terrain.
Structuring the Testing Pyramid for Agents goes deep on each pyramid layer—how to design chain evals that don't over-index on a single reasoning path, how to structure integration evals so they're fast enough to run nightly but rich enough to catch real bugs, and how to instrument production agents for ongoing eval without degrading user experience.
LLM-as-Judge and A/B Evaluation Frameworks covers the judge design problem end to end: how to write rubrics that produce consistent scores, how to validate a judge against human raters, how to run A/B comparisons between agent versions, and how to detect when your judge has drifted from ground truth.
Together, those two lessons take the principles you've internalized here and turn them into reproducible, scalable engineering practice. The map you're holding is accurate. The territory ahead is where the real craft lives.
🧠 Mnemonic to carry into the next lessons: "PACER" — Pyramid, Assertions, Continuous, Eval-as-Engineering, Reproducibility. If any one of these is missing from your eval strategy, you're leaving signal on the table.