
LLM-as-Judge & A/B Evals

Golden sets, rubrics, pairwise comparison, shadow deployments, and detecting silent regressions in production.

Why Evaluating Agents Is a Different Problem

Imagine you've just shipped a customer-support agent that's been running flawlessly for three weeks. Users seem happy. Tickets are getting resolved. Then, quietly, something changes. A model provider silently updates a base model. A new product category gets added to your catalog. A prompt template gets a two-word tweak from a well-meaning teammate. Nothing crashes. No exception is thrown. No alert fires. But over the next two weeks, your agent starts misclassifying refund requests, hallucinating return-policy details, and occasionally routing urgent complaints to a low-priority queue. You won't find out until your customer satisfaction scores drop — or until an angry tweet goes viral.

This scenario isn't hypothetical. It's the defining operational challenge of building production-grade LLM-based agents, and it's the reason that evaluating these systems demands an entirely different mental model than the testing practices most software engineers have spent years refining. Welcome to the world of agentic AI evaluation, where correctness is probabilistic, regressions are silent, and waiting for user complaints is a strategy you cannot afford.


The Contract Between Tests and Code — and Why Agents Break It

Traditional software testing is built on a foundational assumption so obvious we rarely state it: determinism. Given the same input, a function produces the same output. Always. This is the contract that makes unit tests meaningful. You assert add(2, 3) === 5, and if that assertion ever fails, you know exactly where to look. The test is a precise specification of behavior, and a failing test is a precise signal that the behavior has changed.

LLM-based agents violate this contract completely. When you send the same user message to a language model twice, you may get two different responses — both of which are entirely reasonable, contextually appropriate, and correct by any human standard. This is the non-determinism problem, and it is not a bug. It is an inherent property of probabilistic text generation. Temperature settings, sampling strategies, and the stochastic nature of transformer inference all contribute to output variability that is, in many ways, desirable. A customer-service agent that always responds with identical phrasing would feel robotic and brittle. Variability is a feature. But it makes your test suite nearly useless.

Consider a simple example. Suppose you ask an agent: "What is your return policy for electronics?" A deterministic system might return a fixed string you can assert against. An LLM agent might return any of the following, all of which could be perfectly correct:

  • "Electronics can be returned within 30 days of purchase with original packaging."
  • "You have 30 days to return electronics, as long as they're in their original box."
  • "Our return window for electronics is 30 days. Please keep the original packaging."

A naive assertEqual test would fail on responses two and three even though they're semantically identical to response one. Conversely, a naive test that simply checks for the substring "30 days" would pass on a response that's otherwise completely wrong. Neither approach gives you real signal.

## ❌ The naive approach: brittle exact-match testing
def test_return_policy_response():
    agent = CustomerSupportAgent()
    response = agent.ask("What is your return policy for electronics?")
    
    # This will fail 2 out of 3 times even for correct responses
    assert response == "Electronics can be returned within 30 days of purchase with original packaging."

## ❌ The substring approach: too permissive, misses semantic errors
def test_return_policy_substring():
    agent = CustomerSupportAgent()
    response = agent.ask("What is your return policy for electronics?")
    
    # This passes even if the rest of the response is hallucinated nonsense
    assert "30 days" in response

Both of these tests are worse than no test at all. The exact-match test creates false failures that erode trust in the test suite. The substring test creates false passes that give you a false sense of security. What you actually need is a way to evaluate the semantic correctness, tone, completeness, and safety of a response — which brings us to the evaluation strategies this lesson is built around.

💡 Mental Model: Think of a traditional test as a ruler — precise, binary, and indifferent to context. Evaluating an LLM agent is more like using a rubric to grade an essay. You're assessing multiple dimensions of quality simultaneously, and a human (or a well-designed automated judge) can recognize a correct answer even when the exact words differ.


Silent Regressions: The Most Dangerous Kind of Bug

In classical software development, bugs tend to be loud. A null pointer dereference crashes the process. A database connection failure throws an exception. A misconfigured API returns a 500 error. These failures are observable, traceable, and — critically — they stop execution. Something clearly breaks, and your monitoring stack catches it.

Silent regressions in agent behavior are the opposite in every way. The agent keeps running. It keeps returning responses. No exceptions are thrown. Your uptime dashboards stay green. But the quality of the responses has degraded in ways that are invisible to your infrastructure monitoring. This degradation can be subtle: slightly less accurate answers, mildly inappropriate tone, occasional misrouting of requests, a growing tendency to hallucinate specific product details under certain conditions.

Silent regressions are especially dangerous for three compounding reasons:

🧠 They accumulate invisibly. A single degraded response is undetectable in aggregate metrics. By the time you have enough degraded responses to show up as a blip in your customer satisfaction scores, you may have served thousands of users poorly.

📚 They are hard to attribute. When a user complains three weeks after a regression was introduced, your logs don't have a clean "before" and "after" boundary. You're doing forensic archaeology on a system with many interacting variables.

🔧 They erode trust asymmetrically. Research consistently shows that users who have a bad experience with an AI system are disproportionately unlikely to return, and disproportionately likely to share negative feedback publicly. A silent regression isn't a quiet problem — it's a trust bomb with a slow fuse.

What causes silent regressions in the first place? The list is longer than most teams anticipate:

| Trigger | Why It's Subtle |
| --- | --- |
| 🔄 Model provider updates | Base model weights change without versioning |
| 📝 Prompt template edits | Small wording changes shift model behavior at scale |
| 🗄️ Retrieval corpus changes | New documents alter what the agent "knows" |
| ⚙️ System prompt drift | Accumulated small edits change tone and guardrails |
| 🌡️ Temperature adjustments | Changed sampling creates different output distributions |
| 🔗 Tool schema changes | Downstream API updates alter how tools are invoked |

🎯 Key Principle: A silent regression is not a corner case. For any agent serving real users at any meaningful scale, silent regressions are the default failure mode. Your architecture should assume they will occur and build detection into the system from day one.


The Cost Equation: User Complaints vs. Proactive Eval Pipelines

Let's be precise about the economics here, because this is where the argument for investing in rigorous eval infrastructure becomes impossible to dismiss.

The reactive approach — waiting for user complaints — has a deceptively low apparent cost. You don't pay to build an eval pipeline. You don't pay for the compute to run automated judges. You just wait for signals from the real world and fix problems as they surface. For a team that's resource-constrained or moving fast, this can feel like a reasonable tradeoff.

Here is what that tradeoff actually costs:

REACTIVE REGRESSION DETECTION TIMELINE

 Day 0  ─── Regression introduced (silent)
   │
 Day 3  ─── First affected users experience degraded quality
   │         [no alert fires]
   │
 Day 7  ─── Some users notice something feels "off"
   │         [a few users stop using the feature]
   │
 Day 14 ─── First user complaint tickets arrive
   │         [support team starts investigating]
   │
 Day 17 ─── Engineering team identifies the regression
   │         [2,800+ user interactions served at degraded quality]
   │
 Day 19 ─── Fix deployed
   │         [trust damage already done]
   │
 Day 21 ─── Post-mortem reveals regression was detectable
             on Day 0 with a proper eval pipeline

Contrast this with a proactive eval pipeline:

PROACTIVE REGRESSION DETECTION TIMELINE

 Day 0  ─── New agent version deployed to shadow environment
   │
 Day 0  ─── Eval pipeline runs against golden set
   │         LLM-as-Judge scores flagged as degraded
   │
 Day 0  ─── A/B comparison shows statistical divergence
   │         from baseline on key quality dimensions
   │
 Day 0  ─── Deployment halted automatically
   │         [0 users affected]
   │
 Day 1  ─── Root cause identified and fixed
             [user trust intact]

The compute cost of running an automated eval pipeline — even using a capable LLM-as-Judge — is typically a small fraction of the cost of one meaningful churn event or one viral negative review. More importantly, the speed of detection is measured in minutes rather than weeks.
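To make the economics concrete, here's a back-of-envelope sketch. Every number in it — golden-set size, token counts, prices, churn rate, lifetime value — is an illustrative assumption, not a real figure; plug in your own:

```python
# Back-of-envelope cost sketch. Every number here is an illustrative
# assumption, not a real price or measurement.
GOLDEN_SET_SIZE = 500          # assumed eval cases per run
TOKENS_PER_JUDGE_CALL = 2_000  # assumed input + output tokens per judgment
PRICE_PER_1K_TOKENS = 0.01     # assumed blended judge-model price, USD
RUNS_PER_DAY = 10              # assumed deployments/CI runs per day

daily_eval_cost = (
    GOLDEN_SET_SIZE * TOKENS_PER_JUDGE_CALL / 1_000
    * PRICE_PER_1K_TOKENS * RUNS_PER_DAY
)
print(f"Daily eval cost: ${daily_eval_cost:.2f}")  # $100.00 under these assumptions

# Compare with the reactive path: a small churn rate applied to the
# 2,800+ degraded interactions from the timeline above.
DEGRADED_INTERACTIONS = 2_800
ASSUMED_CHURN_RATE = 0.02      # assumed fraction of affected users lost
ASSUMED_LTV = 200              # assumed lifetime value per user, USD
churn_cost = DEGRADED_INTERACTIONS * ASSUMED_CHURN_RATE * ASSUMED_LTV
print(f"Estimated churn cost: ${churn_cost:.2f}")  # $11200.00
```

Even with these deliberately conservative toy numbers, a single regression's churn cost exceeds months of eval-pipeline compute.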

⚠️ Common Mistake: Treating eval pipelines as a "nice to have" that you'll add after the system is stable. In practice, you only discover how to build a good eval pipeline by doing it during development, when you still have the context to define what "correct" looks like. By the time you're in production firefighting mode, that context is gone.

💡 Real-World Example: A major e-commerce platform running a product-recommendation agent discovered through their eval pipeline that a prompt change intended to make responses more concise had inadvertently reduced the agent's tendency to mention compatibility warnings for electronics accessories. No user had complained yet — the average session was too short for the gap to be obvious. The eval pipeline caught a 34% drop in "safety information completeness" scores within minutes of the prompt change being staged. The business impact of that information gap — returns, negative reviews, potential liability — would have been substantial.


Two Complementary Strategies: LLM-as-Judge and A/B Evals

This lesson is structured around two evaluation strategies that work together to give you comprehensive coverage of agent quality. Understanding why both are necessary — and why neither is sufficient on its own — is the conceptual foundation for everything that follows.

LLM-as-Judge: Automated Semantic Quality Assessment

The first strategy addresses the core problem we identified earlier: you can't use exact-match testing on probabilistic outputs, but you need some automated way to assess quality at scale. The solution is to use a capable language model as an automated evaluator — an LLM-as-Judge.

The insight is elegant: if the problem is that human-quality semantic evaluation doesn't scale, and the reason it doesn't scale is that it requires human judgment, then we can use an LLM to approximate that human judgment at machine speed and cost. A well-designed judge prompt can evaluate an agent's response against a rubric that captures the dimensions you care about: factual accuracy, tone, completeness, safety, adherence to policy, and more.

## A simplified LLM-as-Judge scorer
import json

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """
You are an expert evaluator for a customer support AI agent.
Score the following response on a scale of 1-5 for each dimension.

User query: {query}
Agent response: {response}
Ground truth (reference answer): {reference}

Evaluate on:
1. Factual accuracy (does it match the reference facts?)
2. Completeness (does it address all parts of the query?)
3. Tone appropriateness (professional, empathetic, clear?)
4. Safety (no harmful, misleading, or inappropriate content?)

Return a JSON object with keys: accuracy, completeness, tone, safety.
Each value should be an integer from 1 to 5.
Also include a brief 'reasoning' string explaining the scores.
"""

def score_response(query: str, response: str, reference: str) -> dict:
    """Use an LLM judge to score an agent response against a reference."""
    prompt = JUDGE_PROMPT.format(
        query=query,
        response=response,
        reference=reference
    )

    result = client.chat.completions.create(
        model="gpt-4o",  # Use a capable model as judge
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )

    # The API returns the JSON as a string; parse it into a dict
    scores = json.loads(result.choices[0].message.content)
    return scores  # {"accuracy": N, "completeness": N, "tone": N, "safety": N, "reasoning": "..."}

This approach scales to thousands of evaluations per run, runs automatically on every deployment candidate, and produces structured scores you can track over time. It's not perfect — LLM judges have their own biases and failure modes, which we'll address in depth in Section 2 — but it gives you a signal that's orders of magnitude more meaningful than substring matching.
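Once you have per-response score dicts, tracking quality over time is mostly aggregation. A minimal sketch of the batch step — the score dicts below are hypothetical judge outputs, not real API results:

```python
from collections import defaultdict

def summarize_scores(score_dicts: list[dict]) -> dict[str, float]:
    """Average each numeric scoring dimension across a batch of judge results."""
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for scores in score_dicts:
        for dim, value in scores.items():
            if isinstance(value, (int, float)):  # skip the 'reasoning' string
                totals[dim] += value
                counts[dim] += 1
    return {dim: totals[dim] / counts[dim] for dim in totals}

# Hypothetical judge outputs for three golden-set cases:
batch = [
    {"accuracy": 5, "completeness": 4, "tone": 5, "safety": 5, "reasoning": "..."},
    {"accuracy": 4, "completeness": 4, "tone": 5, "safety": 5, "reasoning": "..."},
    {"accuracy": 3, "completeness": 5, "tone": 4, "safety": 5, "reasoning": "..."},
]
print(summarize_scores(batch))
# e.g. {'accuracy': 4.0, 'completeness': 4.33..., 'tone': 4.66..., 'safety': 5.0}
```

A per-dimension mean like this, recomputed on every deployment candidate, is the time series in which a regression shows up as a visible drop.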

A/B Evals: Comparative Quality Assessment

The second strategy takes a different angle. Rather than asking "is this response good in absolute terms?", A/B evaluation asks "is this response better or worse than the current production baseline?"

This comparative approach has a powerful property: it sidesteps many of the hardest problems in absolute quality scoring. You don't need to define a perfect rubric. You don't need to worry about whether your judge's absolute scores are calibrated. You just need to determine, reliably, which of two responses is preferable — a much easier judgment for both humans and automated judges.

A/B evals are the foundation of shadow deployments, where a new agent version runs in parallel with the production version, processing the same real traffic, and its outputs are evaluated against the baseline without being shown to users. This lets you validate a new version against real-world input distributions before you commit to rolling it out.

SHADOW DEPLOYMENT ARCHITECTURE

  User Request
       │
       ▼
  ┌─────────┐
  │  Router │ ─────────────────────────────────┐
  └────┬────┘                                   │
       │ (live traffic)            (shadow copy)
       ▼                                        ▼
 ┌──────────────┐                    ┌──────────────────┐
 │  Production  │                    │  Candidate Agent │
 │  Agent v1.2  │                    │  (Agent v1.3)    │
 └──────┬───────┘                    └────────┬─────────┘
        │                                     │
        │ (response served to user)            │ (response NOT served)
        ▼                                     ▼
  ┌───────────┐                     ┌─────────────────┐
  │   User    │                     │  Eval Pipeline  │
  └───────────┘                     │  (LLM-as-Judge) │
                                    │  A/B Comparison │
                                    └────────┬────────┘
                                             │
                                             ▼
                                    ┌─────────────────┐
                                    │  Quality Report │
                                    │  v1.3 vs v1.2   │
                                    └─────────────────┘

The two strategies are genuinely complementary. LLM-as-Judge gives you absolute quality scores that let you catch regressions against fixed standards — your rubric doesn't change, so a drop in scores over time is a clear signal. A/B evals give you relative comparisons that are robust to the difficulty of absolute calibration and that anchor your evaluation in the actual distribution of production traffic.
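Pairwise judging does have a well-known pitfall: LLM judges tend to favor whichever response appears first in the prompt. A common mitigation, sketched here with illustrative names (not any specific library's API), is to run each comparison twice with the order swapped and count a win only when both orderings agree:

```python
def pairwise_verdict(first_pass: str, swapped_pass: str) -> str:
    """
    Combine two judge verdicts for the same A/B pair.

    first_pass:   judge's pick when A was shown first ("A", "B", or "tie")
    swapped_pass: judge's pick when B was shown first, already mapped back
                  to the original labels ("A", "B", or "tie")

    Only a verdict that survives the position swap counts; disagreement
    is treated as a tie (likely position bias).
    """
    return first_pass if first_pass == swapped_pass else "tie"

def win_rate(verdicts: list[str], candidate: str = "B") -> float:
    """Fraction of decided (non-tie) comparisons won by the candidate."""
    decided = [v for v in verdicts if v != "tie"]
    if not decided:
        return 0.5  # no signal either way
    return sum(v == candidate for v in decided) / len(decided)

# Hypothetical verdicts from five shadow-traffic comparisons:
verdicts = [pairwise_verdict(a, b) for a, b in
            [("B", "B"), ("B", "A"), ("A", "A"), ("B", "B"), ("tie", "tie")]]
print(win_rate(verdicts))  # candidate wins 2 of 3 decided comparisons
```

A win rate meaningfully above 0.5 across enough comparisons is the signal that the candidate version is a genuine improvement.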

🎯 Key Principle: Use LLM-as-Judge to catch regressions against fixed quality standards. Use A/B evals to validate that a new version is an improvement — or at minimum not a degradation — before it touches users.

💡 Pro Tip: Neither strategy replaces human evaluation entirely. The right architecture includes automated eval pipelines for continuous coverage, a curated golden set of human-verified examples as your ground truth anchor (covered in Section 3), and periodic human review of samples flagged by your automated systems. The automated layer is your smoke detector; human review is your fire investigation.


What Comes Next

The rest of this lesson builds a complete eval architecture layer by layer. In Section 2, you'll implement a full LLM-as-Judge scorer, learning how to design rubrics that produce consistent and meaningful scores, how to structure judge prompts to minimize bias, and how to interpret the output. In Section 3, you'll learn how to build and maintain a golden set — the curated ground-truth dataset that anchors your entire eval pipeline to human judgment. Section 4 dives into the mechanics of A/B evals and shadow deployments, with code for running pairwise comparisons at scale. Section 5 addresses the ongoing monitoring layer for catching quality drift in live production. And Section 6 wraps everything into a reference architecture you can adapt immediately.

By the end, you'll have both the conceptual framework and the practical tools to build eval infrastructure that makes silent regressions impossible to ignore — and that turns deployment from a moment of anxiety into a moment of confidence.

🧠 Mnemonic: Think of your eval pipeline as GRADE: Golden sets anchor truth, Rubrics define quality, Automated judges scale assessment, Differential A/B comparison catches regressions, Evidence-based monitoring catches production drift. GRADE is what separates agents you trust from agents you hope.

The question is never whether your agent will degrade. The question is whether you'll find out before or after your users do.

LLM-as-Judge: Building Automated Quality Scorers

In the previous section, we established why evaluating agents is fundamentally different from testing deterministic software. Now we turn to the most powerful tool in the modern eval arsenal: using a language model to evaluate another language model. This pattern, called LLM-as-Judge, lets you scale quality assessment beyond what any human review team could handle, while still capturing nuanced, semantic understanding that rule-based metrics can never achieve.

What Is LLM-as-Judge?

The core idea is straightforward: rather than manually checking whether your agent's output is good, you write a second prompt that instructs a separate model call to act as an objective evaluator. The judge model receives the agent's input, the agent's output, and a structured set of criteria — a rubric — and returns a score along with a rationale for that score.

This creates a clean separation between the producer (your agent) and the evaluator (the judge). They can be the same underlying model family, or different ones. Many teams use a frontier model like GPT-4o or Claude Sonnet as the judge, even when their production agent uses a smaller, cheaper model.

┌─────────────────────────────────────────────────────┐
│                  EVAL PIPELINE                      │
│                                                     │
│  User Input ──► Agent Model ──► Agent Output        │
│                    │                  │             │
│                    │                  ▼             │
│                    │          ┌───────────────┐     │
│                    └─────────►│  Judge Model  │     │
│                    (context)  │               │     │
│                               │  + Rubric     │     │
│                               └───────┬───────┘     │
│                                       │             │
│                                       ▼             │
│                              Score + Rationale      │
└─────────────────────────────────────────────────────┘

The judge receives the full context: what the user asked, what reference information (if any) the agent had access to, and what the agent produced. It then reasons against each rubric dimension and emits a structured score. That score flows into your eval dashboard, your CI pipeline, or your A/B comparison system.

💡 Mental Model: Think of the judge model as a senior engineer doing a code review — it has full context, a checklist (the rubric), and produces a reasoned verdict. The difference is that this reviewer works at millions of reviews per hour and never gets tired.

Designing Effective Rubrics

A rubric is only as good as its clarity. Vague criteria produce inconsistent scores, and inconsistent scores are worse than no scores — they give you false confidence. Here is how to design rubrics that actually work.

Choose the Right Scoring Dimensions

Not every dimension applies to every task. The most common dimensions, and when to use them:

| Dimension | What It Measures | When to Use |
| --- | --- | --- |
| 🎯 Correctness | Factual accuracy against a known answer | QA, data extraction, math |
| 📚 Groundedness | Claims supported by provided context (no hallucination) | RAG systems, summarization |
| 🔧 Task Completion | Whether the agent actually did what was asked | Tool-using agents, instruction following |
| 🧠 Reasoning Quality | Coherence and soundness of the reasoning chain | CoT agents, analysis tasks |
| 🎯 Tone/Format | Appropriateness of style, length, and structure | Customer-facing responses |
| 🔒 Safety | Absence of harmful, biased, or policy-violating content | Any production system |

For a customer support agent, you might use Correctness, Task Completion, and Tone. For a RAG-based research assistant, Groundedness is non-negotiable. Resist the temptation to score every dimension for every task — each dimension you add increases judge latency and cognitive load on the model, which can reduce score quality.

Define Unambiguous Scoring Scales

A 1–10 scale sounds natural but is a rubric antipattern. The difference between a 6 and a 7 is undefined, which means your judge will be inconsistent across runs. Instead, use anchored ordinal scales where each level is defined by a concrete behavioral description.

Here is a well-defined 4-point Groundedness scale:

Score 1 — Hallucinated: Response makes claims not found in or directly 
           contradicted by the provided context.
Score 2 — Partially Grounded: Most claims are supported, but one or more 
           significant claims are not traceable to the context.
Score 3 — Grounded: All factual claims are traceable to the context; 
           minor interpolations are reasonable and clearly inferential.
Score 4 — Fully Grounded: All claims are explicitly supported by the 
           context, and the response correctly attributes uncertainty 
           where the context is ambiguous.

Each level is a behavioral story, not an abstract number. The judge has to assign a story, which is a much easier task than placing a number on an infinite scale.

🎯 Key Principle: Every rubric level should be distinguishable by a concrete example. If you cannot write a sample output that clearly belongs to level 3 but not level 2, the boundary is too blurry.

Prompt Engineering for Judges

The judge prompt is a first-class engineering artifact. Treat it with the same care you would give a production system prompt.

Chain-of-Thought Rationale

Always instruct the judge to produce its reasoning before its score. This is not just for human interpretability — it measurably improves score accuracy. When the model reasons step by step, it is less likely to pattern-match to surface features and more likely to engage with the actual content. Structure the judge output as:

  1. Analysis — A free-text critique of the response against each rubric dimension
  2. Score — The integer score for each dimension
  3. Overall — An optional normalized aggregate

This ordering matters. If you ask for the score first and the rationale second, the model anchors on the score and reverse-engineers a justification. Always rationale-first.
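A judge output following this rationale-first structure might look like the fragment below (field names and scores are illustrative; the overall value assumes two equally weighted dimensions with a max score of 4):

```json
{
  "analysis": "The response correctly states the 30-day window but omits the original-packaging requirement, so completeness suffers.",
  "scores": {
    "correctness": 4,
    "completeness": 2
  },
  "overall": 0.75
}
```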

Self-Consistency Sampling

Self-consistency is the practice of running the same judge prompt multiple times (typically 3–5 times) and averaging or majority-voting the results. LLMs are stochastic, and a single score can be noisy. For high-stakes decisions — like choosing between two agent versions — sampling three judge outputs and taking the median is a cheap way to buy significant reliability.

Single judge run:  Score = 3    (could be 2 or 4)
Self-consistency:  Scores = [3, 3, 2] → Median = 3  (more reliable)

The cost tradeoff is real: self-consistency triples your judge API costs. Reserve it for batch evals where you need high confidence, not for real-time scoring.
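The aggregation step itself is a few lines. A minimal sketch, with hypothetical sampled scores:

```python
import statistics

def self_consistent_score(sampled_scores: list[int]) -> float:
    """Aggregate repeated judge runs for one response; the median resists outliers."""
    return statistics.median(sampled_scores)

# e.g., three judge samples for the same response:
print(self_consistent_score([3, 3, 2]))  # 3
print(self_consistent_score([3, 1, 3]))  # 3 -- one outlier doesn't tank it
```

The median is preferable to the mean here because a single aberrant sample shifts a mean but leaves the median untouched.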

Preventing Style Bias

One of the most insidious judge failure modes is rewarding verbosity — longer, more formal, more elaborate responses score higher not because they are correct, but because they look impressive. To counteract this, your judge prompt should explicitly instruct the model to penalize unnecessary length and to evaluate conciseness as a positive quality.

Add language like:

"A shorter, accurate response is strictly better than a longer response that adds unnecessary hedging, filler phrases, or redundant restatements. Do not award higher scores for length, formality, or elaborate structure."

⚠️ Common Mistake: Forgetting to include anti-verbosity instructions. Without them, judge models consistently favor longer outputs, creating a perverse incentive in your eval pipeline that will pull your agent toward producing bloated, hedge-heavy responses.

Code Walkthrough: A Reusable Python Judge Class

Let's build a production-ready judge class. It accepts a rubric definition, calls an LLM, parses structured output, and returns a normalized result that can plug into any eval pipeline.

Step 1: Define the Rubric Schema
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RubricDimension:
    """
    A single scoring dimension within a rubric.
    Each level_descriptions entry maps a score (int) to a behavioral description.
    """
    name: str                              # e.g., "groundedness"
    weight: float                          # relative importance, must sum to 1.0 across rubric
    level_descriptions: dict[int, str]     # e.g., {1: "Hallucinated", 2: "Partial", ...}
    max_score: int = field(init=False)

    def __post_init__(self):
        self.max_score = max(self.level_descriptions.keys())

    def to_prompt_text(self) -> str:
        """Render this dimension as a block of text for injection into a judge prompt."""
        lines = [f"### {self.name.upper()} (weight: {self.weight})"]
        for score, description in sorted(self.level_descriptions.items()):
            lines.append(f"  Score {score}: {description}")
        return "\n".join(lines)


## Example: a groundedness rubric dimension
groundedness_dim = RubricDimension(
    name="groundedness",
    weight=0.5,
    level_descriptions={
        1: "Response makes claims not found in or contradicted by the context.",
        2: "Most claims are supported; one or more significant claims are not traceable.",
        3: "All factual claims traceable to context; inferential steps are reasonable.",
        4: "All claims explicitly supported; uncertainty correctly attributed.",
    }
)

task_completion_dim = RubricDimension(
    name="task_completion",
    weight=0.5,
    level_descriptions={
        1: "The agent completely failed to address the user's request.",
        2: "The agent partially addressed the request but missed key elements.",
        3: "The agent addressed all elements of the request adequately.",
        4: "The agent fully completed the task with precision and appropriate detail.",
    }
)

This schema separates structure from prompt text. The to_prompt_text() method lets each dimension render itself into the judge prompt — no string concatenation soup in the calling code.

Step 2: The Judge Class
import json
import re
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from environment


@dataclass
class JudgeScore:
    dimension_scores: dict[str, int]   # raw score per dimension
    rationale: str                     # the judge's chain-of-thought
    normalized_score: float            # weighted aggregate, 0.0 – 1.0


class LLMJudge:
    """
    A reusable LLM-as-Judge evaluator.

    Usage:
        judge = LLMJudge(dimensions=[groundedness_dim, task_completion_dim])
        result = judge.score(
            user_input="What is the refund policy?",
            agent_output="You can return items within 30 days.",
            context="Our refund policy allows returns within 30 days of purchase."
        )
        print(result.normalized_score)  # e.g., 0.875
    """

    SYSTEM_PROMPT = (
        "You are a rigorous, impartial evaluator of AI assistant responses. "
        "You will assess the response against the provided rubric. "
        "A shorter, accurate response is strictly better than a verbose one. "
        "Do not award higher scores for length, formality, or elaborate phrasing. "
        "Always provide your analysis BEFORE your scores."
    )

    def __init__(
        self,
        dimensions: list[RubricDimension],
        model: str = "gpt-4o",
        temperature: float = 0.0,
        n_samples: int = 1,   # set >1 for self-consistency sampling
    ):
        self.dimensions = dimensions
        self.model = model
        self.temperature = temperature
        self.n_samples = n_samples

        # Validate weights sum to ~1.0
        total_weight = sum(d.weight for d in dimensions)
        if not (0.99 <= total_weight <= 1.01):
            raise ValueError(f"Rubric dimension weights must sum to 1.0, got {total_weight}")

    def _build_user_prompt(self, user_input: str, agent_output: str, context: Optional[str]) -> str:
        rubric_block = "\n\n".join(d.to_prompt_text() for d in self.dimensions)
        # Render each dimension as a JSON key placeholder, e.g. "groundedness": <int>
        score_keys = ", ".join(f'"{d.name}": <int>' for d in self.dimensions)

        ctx_block = f"\n\n### CONTEXT PROVIDED TO AGENT\n{context}" if context else ""

        return f"""## EVALUATION TASK

### USER INPUT
{user_input}{ctx_block}

### AGENT OUTPUT
{agent_output}

### RUBRIC
{rubric_block}

### INSTRUCTIONS
1. Write a concise analysis of the agent output for each rubric dimension.
2. Then output a JSON block with your scores in this exact format:

```json
{{
  "rationale": "<your analysis here>",
  "scores": {{{score_keys}}}
}}
```

Output ONLY the analysis and the JSON block. No other text."""

    def _call_judge(self, user_prompt: str) -> dict:
        """Make a single judge call and parse the JSON response."""
        response = client.chat.completions.create(
            model=self.model,
            temperature=self.temperature,
            messages=[
                {"role": "system", "content": self.SYSTEM_PROMPT},
                {"role": "user", "content": user_prompt},
            ],
        )
        raw = response.choices[0].message.content

        # Extract JSON block robustly (the model may wrap it in markdown fences)
        json_match = re.search(r'```json\s*(\{.*?\})\s*```', raw, re.DOTALL)
        if json_match:
            return json.loads(json_match.group(1))
        # Fallback: try parsing the whole response as JSON
        return json.loads(raw)

    def score(
        self,
        user_input: str,
        agent_output: str,
        context: Optional[str] = None,
    ) -> JudgeScore:
        """Run the judge (with optional self-consistency) and return a JudgeScore."""
        user_prompt = self._build_user_prompt(user_input, agent_output, context)

        all_score_dicts: list[dict[str, int]] = []
        last_rationale = ""

        for _ in range(self.n_samples):
            parsed = self._call_judge(user_prompt)
            all_score_dicts.append(parsed["scores"])
            last_rationale = parsed["rationale"]

        # Aggregate: take the median score per dimension across samples
        aggregated_scores: dict[str, int] = {}
        for dim in self.dimensions:
            raw_scores = sorted(s[dim.name] for s in all_score_dicts)
            aggregated_scores[dim.name] = raw_scores[len(raw_scores) // 2]  # median

        # Compute normalized weighted score in [0.0, 1.0]
        normalized = sum(
            (aggregated_scores[d.name] / d.max_score) * d.weight
            for d in self.dimensions
        )

        return JudgeScore(
            dimension_scores=aggregated_scores,
            rationale=last_rationale,
            normalized_score=round(normalized, 4),
        )

A few design decisions worth noting. The weight validation at construction time catches rubric authoring errors immediately rather than silently producing wrong scores. The JSON extraction uses a regex with a plaintext fallback, because frontier models occasionally forget to wrap output in fences. The median aggregation across self-consistency samples is more robust to outliers than a mean — a single runaway score of 1 does not tank an otherwise solid 3/3/3 result.
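To make the outlier-robustness argument concrete, here is a two-line check using Python's `statistics` module (the scores are illustrative):

```python
from statistics import mean, median

samples = [3, 3, 3, 1]  # three solid judge samples plus one runaway low score

print(mean(samples))    # the outlier drags the mean down to 2.5
print(median(samples))  # the median stays at 3.0, ignoring the single outlier
```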

💡 **Pro Tip:** Set `temperature=0.0` for deterministic judge behavior in CI pipelines where you need reproducible scores. Use a small temperature (0.3–0.5) only when running self-consistency sampling where you *want* variance between samples.






#### Bias and Failure Modes in LLM Judges

No tool this powerful comes without failure modes. Understanding them is what separates teams that use LLM-as-Judge well from teams that launder bad agent behavior through a credulous eval pipeline.

##### Position Bias

**Position bias** occurs when a judge assigns higher scores to whichever response appears first in a pairwise comparison — not because of quality, but because of order. This matters most in A/B evals (covered in Section 4), but it can also affect single-output scoring when you ask the judge to compare against a reference answer. The mitigation is to run comparisons in both orders and check for consistency. If the judge consistently prefers "Response A" regardless of which actual output is labeled A, you have a position bias problem.

##### Leniency Bias

**Leniency bias** is the tendency for judge models to avoid assigning low scores, even when an output clearly merits them. This manifests as a distribution of scores clustered at the top of your scale, which destroys your ability to differentiate between mediocre and excellent outputs. To detect it, deliberately include known-bad outputs — garbage responses, hallucinations, refusals that ignore the question — in your calibration set and verify the judge scores them low.

Healthy score distribution:          Leniency-biased distribution:

                                                       █
         █                                             █
      █  █                                          █  █
   █  █  █                                          █  █
   █  █  █  █                                    █  █  █
  ────────────                                  ────────────
   1  2  3  4                                    1  2  3  4

If your score distribution looks like the right chart, your judge is not calibrating against your rubric — it is being polite.

##### Self-Preference Bias

If your judge model and your agent model are from the same family (e.g., both are GPT-4o variants), the judge may subtly prefer outputs that *sound like* its own generation style. This is called **self-preference bias**. You can detect it by comparing judge scores on outputs from your model versus a different model family on a held-out set where humans have rated both. If the judge systematically rates same-family outputs higher than human raters do, switch to a cross-family judge.

##### Calibration Checks Against Human Annotations

The gold standard for validating your judge is **calibration against human annotations**. Collect a small set of 50–200 agent outputs, have human raters score them against the same rubric, and then run your judge on the same set. Compute the **Spearman rank correlation** between human scores and judge scores per dimension.

A correlation above 0.8 indicates a well-calibrated judge. Below 0.6 means the judge is not reliably tracking what humans care about, and you need to revise the rubric or the judge prompt before trusting the scores.

```python
from scipy.stats import spearmanr
import numpy as np

def compute_judge_calibration(
    human_scores: list[int],
    judge_scores: list[int],
    dimension_name: str,
) -> None:
    """Compute and report Spearman correlation between human and judge scores."""
    assert len(human_scores) == len(judge_scores), "Score lists must be same length"

    correlation, p_value = spearmanr(human_scores, judge_scores)

    print(f"Dimension: {dimension_name}")
    print(f"  Spearman r: {correlation:.3f}  (p={p_value:.4f})")

    if correlation >= 0.8:
        print("  ✅ Well-calibrated: judge tracks human judgment reliably")
    elif correlation >= 0.6:
        print("  ⚠️  Moderate calibration: review rubric boundary definitions")
    else:
        print("  ❌ Poor calibration: rubric or judge prompt needs significant revision")

    # Also flag leniency bias
    mean_human = np.mean(human_scores)
    mean_judge = np.mean(judge_scores)
    if mean_judge > mean_human + 0.5:
        print(f"  ⚠️  Leniency bias detected: judge mean ({mean_judge:.2f}) "
              f"exceeds human mean ({mean_human:.2f}) by >0.5 points")

# Example usage
compute_judge_calibration(
    human_scores=[1, 2, 2, 3, 3, 3, 4, 4, 4, 4],
    judge_scores=[2, 2, 3, 3, 3, 4, 4, 4, 4, 4],
    dimension_name="groundedness",
)
```

⚠️ Common Mistake: Running calibration only once at rubric design time and never revisiting it. As your agent evolves and produces qualitatively different outputs, your judge's calibration can drift. Re-run calibration checks whenever you significantly change your agent's system prompt, its tool set, or its underlying model.

#### Putting It All Together

📋 **Quick Reference Card: LLM-as-Judge Design Checklist**

| Step | What to Do | Watch Out For |
| --- | --- | --- |
| 🎯 Rubric Design | 4-point anchored scales per dimension | Ambiguous level boundaries |
| 🧠 Prompt Structure | Rationale before score | Score-first anchoring |
| 🔧 Anti-verbosity | Explicit instructions in judge prompt | Style rewarded over substance |
| 📚 Self-Consistency | Median across 3+ samples for high-stakes evals | 3× cost; reserve for batch evals |
| 🔒 Calibration | Spearman r ≥ 0.8 against human labels | One-time calibration drift |
| 🎯 Bias Checks | Test with known-bad outputs | Leniency hiding real failures |

The LLM-as-Judge pattern is not a silver bullet — it is a power tool that requires careful setup. A poorly designed rubric with an uncalibrated judge can give you the illusion of rigorous evaluation while systematically missing the failure modes that matter most. But a well-engineered judge system, anchored to human annotations and actively monitored for bias, lets you run thousands of eval cases per hour, integrate quality gates into your CI pipeline, and catch regressions before they reach production. That capability is what the rest of this lesson builds on — in the next section, we'll see how to anchor your judge to a golden set that gives it a stable, human-curated ground truth to reason against.

### Golden Sets: Creating and Maintaining Ground-Truth Evaluation Suites

In the previous section, we built an LLM-as-Judge scorer that can evaluate agent outputs automatically. But that judge needs something to judge against. Without a stable, human-vetted reference point, your automated evaluations are measuring consistency rather than correctness — you might be consistently producing bad outputs, and your scorer would happily give them all a green light. This is where golden sets enter the picture.

A golden set is the bedrock of a trustworthy eval pipeline. It represents what you and your domain experts collectively believe "good" looks like, preserved in a format that a machine can compare against. Think of it as writing down your team's collective judgment so that CI/CD can enforce it at scale.

#### What Is a Golden Set?

A golden set (sometimes called a golden dataset or eval suite) is a curated collection of inputs paired with one or more of the following: reference outputs, pass/fail criteria, or rubric-based expectations — all of which have been reviewed and approved by humans. The key word is curated. A golden set is not a random sample of your logs. It is a deliberate, representative, and carefully maintained library of test cases.

Each entry in a golden set is typically called a golden example, and it contains at minimum:

  • 🔧 Input: The prompt, query, or task the agent receives
  • 📚 Expected output or criteria: Either a reference answer, a structured rubric, or explicit pass/fail rules
  • 🎯 Metadata: Tags, source, date added, difficulty level, and which capability the example tests

💡 Mental Model: Think of a golden set the way a law school thinks about landmark cases. Each case is precedent — a documented decision about what the right answer looks like for a specific situation. New decisions should be consistent with the precedent unless you deliberately update it.

Golden Set Structure

┌─────────────────────────────────────────────────────────────────┐
│                        GOLDEN SET v2.4                          │
├────────────────┬──────────────────┬────────────────────────────┤
│   INPUT        │  EXPECTED OUTPUT │   METADATA                 │
│  (prompt /     │  (reference ans  │  (tags, source, date,      │
│   task)        │  or criteria)    │   capability, difficulty)  │
├────────────────┼──────────────────┼────────────────────────────┤
│ Example 001    │  Pass criteria   │  [summarization, hard,     │
│                │  + reference     │   production-traffic]      │
├────────────────┼──────────────────┼────────────────────────────┤
│ Example 002    │  Exact match     │  [tool-calling, medium,    │
│                │  + rubric        │   adversarial]             │
├────────────────┼──────────────────┼────────────────────────────┤
│ Example N...   │  ...             │  ...                       │
└────────────────┴──────────────────┴────────────────────────────┘

🎯 Key Principle: A golden set is only as trustworthy as the process used to create it. If the examples are wrong, ambiguous, or unrepresentative, every regression alert and pass-rate metric built on top of it is measuring the wrong thing.

#### Sourcing Your Golden Examples

One of the most common mistakes teams make is building a golden set entirely from synthetic, idealized examples dreamed up in a conference room. Real agents fail in ways you never anticipate, and your golden set needs to reflect that reality.

##### From Production Traffic

The richest source of golden examples is real production traffic. After your agent has been running for even a few days, you have logs of what users actually asked and what the agent actually produced. Mining these logs gives you examples that are grounded in genuine user intent rather than your assumptions about it.

The workflow looks like this: sample a slice of recent production requests, have a human reviewer (often a domain expert or a member of your product team) evaluate the agent's response, and if the response is good, promote that input-output pair into the golden set. If the response was bad but the input was interesting, write a corrected reference output and add it with a "known failure" tag — these become your regression tests.

💡 Real-World Example: A team building a customer-support agent discovers through log sampling that users frequently ask compound questions like "Can I return this and get a price match?" The agent handles each question separately but fails to synthesize a unified answer. None of the team's synthetic examples covered compound queries. The production log becomes the source of a new golden example category.

##### Adversarial Edge Cases

Adversarial examples are inputs deliberately constructed to probe failure modes: ambiguous phrasing, topic boundary violations, jailbreak attempts, multilingual inputs, extremely long contexts, and queries that seem to request one thing but imply another. These should be sourced through structured red-teaming sessions where engineers, product managers, or even external testers try to break the agent.

Every time a new failure mode is discovered in production or during testing, a representative example should be added to the golden set so that the failure can never silently re-emerge.

##### Domain-Expert Contributions

For agents operating in specialized domains — legal research, medical triage, financial analysis — neither engineers nor product managers are qualified to write reference outputs. You need domain experts to author or at least approve the expected answers. Establish a lightweight contribution process: a form or annotated spreadsheet where domain experts can submit new examples, flag existing ones they disagree with, and approve proposed reference outputs.

🤔 Did you know? Research in NLP evaluation shows that even expert annotators disagree on the correct answer roughly 15–30% of the time on open-ended generation tasks. This is not a reason to give up on golden sets — it is a reason to treat inter-annotator agreement as a metric you must measure.

#### Annotation Guidelines and Inter-Annotator Agreement

A golden set is a shared artifact. If different reviewers would make different pass/fail decisions on the same example, then the golden set is not actually encoding ground truth — it is encoding noise. Annotation guidelines are the written rules that ensure reviewers apply consistent judgment.

Good annotation guidelines specify:

  • 📋 What counts as a passing response (with positive and negative examples)
  • 🔧 Specific rubric dimensions and how to score edge cases within each
  • 📚 Tie-breaking rules when a response is partially correct
  • 🎯 Scope boundaries — which parts of the response are in scope for evaluation

After annotators are trained on the guidelines, measure inter-annotator agreement (IAA) using a metric like Cohen's Kappa (for binary pass/fail) or Krippendorff's Alpha (for multi-level scores). A Kappa below 0.6 is a warning sign that your guidelines are too ambiguous. Below 0.4 means your golden set labels are unreliable and should not be trusted.
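For binary pass/fail labels, Cohen's Kappa is simple enough to compute without a stats library. The sketch below is a minimal, self-contained implementation; the `annotator_a`/`annotator_b` pilot labels are made-up data for illustration:

```python
def cohen_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    """Cohen's Kappa for two annotators' binary pass(1)/fail(0) labels."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)

    # Observed agreement: fraction of examples where the two annotators match
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement, from each annotator's overall pass rate
    pa1, pb1 = sum(labels_a) / n, sum(labels_b) / n
    p_expected = pa1 * pb1 + (1 - pa1) * (1 - pb1)

    if p_expected == 1.0:  # degenerate case: chance agreement is already perfect
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)

# Two annotators labeling the same 10 candidate examples
annotator_a = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
annotator_b = [1, 1, 0, 0, 1, 0, 1, 1, 1, 1]
print(round(cohen_kappa(annotator_a, annotator_b), 3))  # → 0.524
```

A Kappa of 0.524 on this pilot would fall below the 0.6 bar — a signal to revise the guidelines before full annotation, not to proceed.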

⚠️ **Common Mistake #1:** Skipping IAA measurement because it feels like overhead. If two annotators disagree 30% of the time, a 70% pass rate on your golden set could mean anything from "genuinely passing 70% of the time" to "annotator A would say 85% pass, annotator B would say 55% pass." The metric is meaningless without knowing this.

Annotation Reliability Pipeline

  Raw Examples
       │
       ▼
┌─────────────────┐
│ Write Annotation │
│ Guidelines v1    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐     Kappa < 0.6?
│ Pilot Annotation │─────────────────► Revise Guidelines
│ (2+ annotators)  │                        │
└────────┬────────┘                         │
         │ Kappa ≥ 0.6                       │
         ▼                                  │
┌─────────────────┐◄────────────────────────┘
│ Full Annotation  │
│ + Spot Checks    │
└────────┬────────┘
         │
         ▼
   Golden Set (Trusted)

#### Versioning and Drift Management

A golden set is not a static artifact. Your agent evolves, your product scope changes, and what counts as a "correct" response today may be wrong or irrelevant six months from now. Without deliberate version management, your golden set silently becomes a liability: stale examples start generating false positives (alerting on regressions that are actually improvements) or false negatives (missing real regressions because the coverage no longer matches current behavior).

##### Versioning Strategy

Treat your golden set like code. Store it in version control alongside your agent's code, tag releases (e.g., golden-set-v2.4), and maintain a changelog that explains what was added, modified, or retired and why. Every PR that changes the golden set should require at least one reviewer — ideally someone from the domain team, not just engineering.

Semantic versioning works well here:

  • A major version bump (v2 → v3) means the evaluation criteria or scope changed significantly — old pass rates are no longer directly comparable.
  • A minor version bump (v2.3 → v2.4) means new examples were added or existing ones were corrected — pass rates are still directionally comparable.
  • A patch bump (v2.4.0 → v2.4.1) means metadata fixes, typos corrected, no functional change to evaluation logic.

##### When to Retire Stale Examples

An example becomes stale when the world it was written to capture no longer matches the agent's current scope or the product's current requirements. Common triggers for retirement:

  • 🔧 The product feature the example tested was removed or significantly redesigned
  • 📚 The expected output format changed intentionally (e.g., you stopped generating bullet-point summaries)
  • 🎯 The reference answer was found to be factually incorrect after a domain review
  • 🧠 The example tests a capability the agent was explicitly descoped from

⚠️ **Common Mistake #2:** Retiring examples too eagerly to make pass rates look better. Stale example retirement should be documented, reviewed, and based on product decisions — not on the fact that the agent currently fails those examples. Failure is information.

💡 Pro Tip: Add an expiry_review_date field to each golden example's metadata. Any example that hasn't been reviewed in the last 90 days should be automatically flagged in your CI output as a candidate for audit. This creates a lightweight governance loop without requiring a full manual review every sprint.

Golden Set Lifecycle

  Domain Expert /        Engineering /
  Production Traffic     Red Teaming
        │                     │
        └──────────┬──────────┘
                   │
                   ▼
          ┌──────────────┐
          │  Candidate   │
          │   Example    │
          └──────┬───────┘
                 │
                 ▼
          ┌──────────────┐     IAA < 0.6 or
          │  Annotation  │─────ambiguous──► Revise / Discard
          │  + IAA Check │
          └──────┬───────┘
                 │ Approved
                 ▼
          ┌──────────────┐
          │  Golden Set  │◄──────────────────────────┐
          │  (versioned) │                           │
          └──────┬───────┘                           │
                 │                                   │
         ┌───────┴──────────┐              ┌─────────┴───────┐
         │  Periodic Audit  │              │ Scope Change /  │
         │  (90-day flag)   │              │ Product Update  │
         └───────┬──────────┘              └─────────────────┘
                 │
         ┌───────┴──────┐
         │   Retire or  │
         │   Update     │
         └──────────────┘

#### Storing and Loading Golden Sets in Code

Now let's make this concrete. A golden set that lives in a spreadsheet is not an eval pipeline — it is documentation. For a golden set to function as automated quality enforcement, it must be stored in a machine-readable format and integrated into your CI/CD system.

##### Structured Storage Format

JSON and YAML are both excellent choices. JSON is preferable when your pipeline is heavily Python-based and the golden set will be consumed programmatically. YAML is preferable when humans frequently edit the golden set directly and readability matters. The structure below works for either.

Each entry in `golden_set_v2_4.json` represents one golden example. The `criteria` object is consumed by the LLM-as-Judge; `exact_match_fields` drives deterministic checks. (Note that JSON does not support comments, so keep commentary in the changelog rather than in the file itself.)

```json
[
  {
    "id": "GS-001",
    "capability": "summarization",
    "difficulty": "medium",
    "source": "production-traffic",
    "added": "2024-09-12",
    "expiry_review_date": "2024-12-12",
    "tags": ["compound-query", "return-policy"],
    "input": "Can I return an item I bought 45 days ago AND get a price match on the same item?",
    "reference_output": "Our return window is 30 days, so a return is unfortunately not possible for this purchase. However, our price match policy is independent of the return window — if you find the item at a lower price at an eligible retailer within 60 days of purchase, we can match that price. Please contact support with proof of the lower price.",
    "criteria": {
      "accuracy": "Correctly states 30-day return window and 60-day price match window. Does not conflate the two policies.",
      "completeness": "Addresses BOTH the return question and the price match question.",
      "tone": "Empathetic and helpful, not dismissive."
    },
    "exact_match_fields": null
  },
  {
    "id": "GS-002",
    "capability": "tool-calling",
    "difficulty": "hard",
    "source": "adversarial",
    "added": "2024-10-01",
    "expiry_review_date": "2025-01-01",
    "tags": ["tool-selection", "ambiguous-intent"],
    "input": "What's the weather like in the capital of the country that hosted the 2022 World Cup?",
    "reference_output": null,
    "criteria": {
      "tool_selection": "Agent must call a knowledge-lookup tool to identify Qatar as the 2022 host and Doha as its capital BEFORE calling the weather tool.",
      "sequencing": "Two distinct tool calls in correct order: (1) country/capital lookup, (2) weather lookup for Doha.",
      "accuracy": "Final answer reflects current weather for Doha, Qatar."
    },
    "exact_match_fields": {
      "tool_call_sequence": ["knowledge_lookup", "weather_lookup"]
    }
  }
]
```

This format separates criteria-based evaluation (which requires an LLM-as-Judge) from exact-match checks (which are deterministic and should be enforced without calling another LLM). Mixing these in the same entry lets you run a fast deterministic pass first, then escalate to the judge only for examples that require semantic evaluation.

##### CI-Compatible Test Harness

The following example shows a pytest-compatible test harness that loads the golden set, runs the agent against each input, applies both exact-match and LLM-as-Judge evaluation, and reports a pass rate that can gate CI/CD deployments.

```python
# tests/test_golden_set.py
# Run with: pytest tests/test_golden_set.py -v --tb=short

import json
import pytest
from pathlib import Path
from your_agent import run_agent          # Replace with your agent entrypoint
from your_eval import llm_judge_score     # Replace with your LLM-as-Judge scorer

# ── Load the golden set once at module level ──────────────────────────────────
GOLDEN_SET_PATH = Path("evals/golden_set_v2_4.json")

with open(GOLDEN_SET_PATH) as f:
    GOLDEN_EXAMPLES = json.load(f)

# ── Parametrize each example as a separate test case ─────────────────────────
@pytest.mark.parametrize(
    "example",
    GOLDEN_EXAMPLES,
    ids=[ex["id"] for ex in GOLDEN_EXAMPLES],  # Test names become GS-001, GS-002...
)
def test_golden_example(example):
    """
    For each golden example:
    1. Run the agent on the input.
    2. Apply exact-match checks (fast, deterministic).
    3. Apply LLM-as-Judge scoring against the criteria rubric.
    4. Assert pass on all applicable checks.
    """
    agent_output = run_agent(example["input"])

    # ── Step 1: Exact-match checks (skip if none defined) ─────────────────────
    exact_fields = example.get("exact_match_fields") or {}
    for field, expected_value in exact_fields.items():
        actual_value = agent_output.get(field)
        assert actual_value == expected_value, (
            f"[{example['id']}] Exact match failed on '{field}'.\n"
            f"  Expected: {expected_value}\n"
            f"  Got:      {actual_value}"
        )

    # ── Step 2: LLM-as-Judge scoring (skip if no criteria defined) ────────────
    criteria = example.get("criteria")
    if criteria:
        scores = llm_judge_score(
            input_text=example["input"],
            agent_output=agent_output.get("text", ""),
            reference_output=example.get("reference_output"),
            criteria=criteria,
        )
        # Each criterion scored 1–5; we require ≥ 3 on all dimensions
        failing_criteria = {k: v for k, v in scores.items() if v < 3}
        assert not failing_criteria, (
            f"[{example['id']}] LLM judge scored below threshold on: "
            f"{failing_criteria}"
        )


# ── Pass-rate summary hook (place in conftest.py so pytest discovers it) ─────
def pytest_terminal_summary(terminalreporter, exitstatus, config):
    """Print a pass-rate summary at the end of the golden set run."""
    passed = len(terminalreporter.stats.get("passed", []))
    failed = len(terminalreporter.stats.get("failed", []))
    total = passed + failed
    if total > 0:
        rate = (passed / total) * 100
        print(f"\n🏆 Golden Set Pass Rate: {passed}/{total} ({rate:.1f}%)")
        # In CI, you can fail the build when the rate drops below a threshold:
        # e.g., assert rate >= 90.0, f"Pass rate {rate:.1f}% below 90% threshold"
```
This harness gives you pytest's native test discovery and reporting while producing a named test case for every golden example. In CI output, a failure on GS-002 immediately tells you which capability regressed and why, rather than giving you an opaque overall score.

##### Computing and Tracking Pass Rates Over Time

A single pass-rate number is useful; a trend is essential. The code below shows a minimal pattern for writing pass-rate results to a time-series log so you can detect degradation across releases.

```python
# evals/record_pass_rate.py
# Called after the pytest run to persist metrics for trend analysis.

import json
from datetime import datetime, timezone
from pathlib import Path

def record_pass_rate(
    golden_set_version: str,
    agent_version: str,
    passed: int,
    total: int,
    results_by_capability: dict,
):
    """
    Append a pass-rate snapshot to a JSON Lines log file.
    Each line is one run's result — easy to query with pandas or any log system.

    Example output line:
    {"ts": "2024-11-15T14:32:00Z", "golden_set": "v2.4", "agent": "v1.9.2",
     "pass_rate": 0.91, "by_capability": {"summarization": 0.95, "tool-calling": 0.82}}
    """
    log_path = Path("evals/pass_rate_log.jsonl")
    snapshot = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "golden_set": golden_set_version,
        "agent": agent_version,
        "pass_rate": round(passed / total, 4) if total > 0 else None,
        "passed": passed,
        "total": total,
        "by_capability": results_by_capability,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(snapshot) + "\n")
    if snapshot["pass_rate"] is not None:
        print(f"📊 Recorded pass rate {snapshot['pass_rate']:.1%} for agent {agent_version}")
```

With this log in place, you can plot pass-rate trends per capability over agent versions and immediately see if a code change caused tool-calling to drop from 82% to 65% while summarization held steady — which is far more actionable than a single aggregate number.
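That per-capability comparison can be computed directly from the log. The sketch below is illustrative: the `detect_capability_regressions` helper and the 5-point drop threshold are assumptions, not a prescribed API:

```python
import json
from pathlib import Path

def detect_capability_regressions(log_path: str, drop_threshold: float = 0.05) -> list[str]:
    """Compare the two most recent runs in the JSONL log and flag any
    capability whose pass rate dropped by more than drop_threshold."""
    lines = Path(log_path).read_text().splitlines()
    if len(lines) < 2:
        return []  # need at least two runs to compare
    previous, latest = (json.loads(line) for line in lines[-2:])

    alerts = []
    for capability, rate in latest["by_capability"].items():
        prev_rate = previous["by_capability"].get(capability)
        if prev_rate is not None and prev_rate - rate > drop_threshold:
            alerts.append(
                f"{capability}: {prev_rate:.0%} → {rate:.0%} "
                f"(agent {previous['agent']} → {latest['agent']})"
            )
    return alerts
```

Run after each eval cycle, this turns the trend log into an actionable alert: a message like `tool-calling: 82% → 65%` points straight at the regressed capability.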

#### Putting It All Together

📋 **Quick Reference Card: Golden Set Health Checklist**

| Dimension | ✅ Healthy Signal | ⚠️ Warning Sign |
| --- | --- | --- |
| 📚 Coverage | Examples span all agent capabilities and difficulty levels | 80%+ of examples test only the happy path |
| 🎯 IAA | Cohen's Kappa ≥ 0.6 on pilot annotation | Annotators disagree on 30%+ of examples |
| 🔒 Versioning | Stored in git, tagged, with changelog | Lives in a shared spreadsheet with no history |
| 🧠 Freshness | Expiry review dates are set; stale examples flagged | Examples older than 6 months with no review |
| 🔧 CI Integration | Pass rate computed per run and logged as trend | Evals run manually before releases only |
| 📚 Adversarial Depth | Dedicated adversarial and edge-case examples present | All examples are clean, idealized inputs |

Wrong thinking: "Our golden set is done once we build it. We maintain it by adding more examples."

Correct thinking: "Our golden set is a living artifact that requires the same review discipline as production code — additions, retirements, IAA checks, and version control."

Golden sets are the anchor that makes everything else in your eval pipeline trustworthy. The LLM-as-Judge scorer from Section 2 is only as good as the criteria it is judging against, and those criteria — when formalized and version-controlled in your golden set — become the institutional memory of what quality means for your agent. In the next section, we will use this golden set as the reference point for A/B evaluations, where we pit a new agent version against a live baseline to decide whether a change is safe to ship.

### A/B Evals and Shadow Deployments: Comparing Agent Versions in Production

You've built a new version of your agent. It uses a better prompt, a smarter retrieval strategy, or a freshly fine-tuned model. Offline evals on your golden set look promising. But before you flip the switch and serve it to real users, you need to answer a harder question: is this version actually better in the wild? That's the problem A/B evals and shadow deployments are designed to solve — and solving it well requires careful methodology, a bit of statistics, and the right instrumentation in your production stack.

#### Why Pairwise Comparison Beats Absolute Scoring

When evaluating agent quality, your instinct might be to assign each response an absolute score — say, 1 through 5 — and compare averages. This approach has a fundamental weakness: rater variance. Different human raters, and even different LLM judges, interpret scales inconsistently. One judge's "4" is another's "3." Over a large evaluation set, this noise can easily swamp the real signal you're trying to detect, especially when the difference between two agent versions is subtle.

Pairwise comparison — also called A/B evaluation — sidesteps this problem entirely. Instead of asking "how good is this response?", you ask "which of these two responses is better?" You present the judge (human or LLM) with the prompt, Response A from your baseline agent, and Response B from your candidate agent, and elicit a preference: A, B, or tie.

🎯 Key Principle: Relative judgments are cognitively easier and more consistent than absolute ones. Humans (and LLMs) are far better at saying "this one is clearer" than they are at calibrating a precise numeric score.

The result is a preference rate — the fraction of trials where the candidate beats the baseline. A candidate with a 55% preference rate over the baseline is meaningfully better; one at 48% is probably a wash or a regression. This framing also maps naturally onto business decisions: you're not asking "is our agent good?", you're asking "should we ship this version?"

There's one important trap to avoid when designing pairwise evals: position bias. LLM judges, like humans, have a tendency to prefer whichever response appears first, or whichever response is longer. To neutralize this, always run each comparison twice — once with A first, once with B first — and only count a preference when the judge is consistent across both orderings. Inconsistent judgments (A wins in one order, B wins in the other) should be classified as ties.
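The both-orders protocol can be wrapped in a small helper. This is an illustrative sketch: `judge_fn` is an assumed callable that returns `"first"`, `"second"`, or `"tie"` for the two responses it is shown:

```python
def debiased_preference(judge_fn, prompt: str, baseline: str, candidate: str) -> str:
    """Run the pairwise judge in both orders; only a consistent verdict counts."""
    # Order 1: baseline shown first
    verdict_1 = judge_fn(prompt, baseline, candidate)
    # Order 2: candidate shown first
    verdict_2 = judge_fn(prompt, candidate, baseline)

    # Map each positional verdict back to which *agent* won
    winner_1 = {"first": "baseline", "second": "candidate", "tie": "tie"}[verdict_1]
    winner_2 = {"first": "candidate", "second": "baseline", "tie": "tie"}[verdict_2]

    # Inconsistent judgments across orderings are classified as ties
    return winner_1 if winner_1 == winner_2 else "tie"
```

Across an eval set, the candidate's preference rate is then `verdicts.count("candidate") / len(verdicts)` — with any position-driven judgments already folded into the tie bucket.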

Pairwise Eval Flow:

┌─────────────────────────────────────────────────┐
│  Prompt: "Explain recursion to a 10-year-old"  │
└───────────────────┬─────────────────────────────┘
                    │
          ┌─────────┴──────────┐
          ▼                    ▼
   ┌─────────────┐      ┌─────────────┐
   │  Response A │      │  Response B │
   │  (Baseline) │      │ (Candidate) │
   └──────┬──────┘      └──────┬──────┘
          └──────────┬─────────┘
                     ▼
          ┌──────────────────────┐
          │    LLM-as-Judge      │
          │  (with rubric)       │
          └──────────┬───────────┘
                     │
      ┌──────────────┼───────────────┐
      ▼              ▼               ▼
  "A is better"  "B is better"   "Tie"

💡 Pro Tip: When writing the judge prompt for pairwise evals, instruct it to reason before declaring a winner. Chain-of-thought reasoning reduces the impact of superficial biases like length or formatting, because the judge is forced to articulate why one response is better.
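One way to operationalize this tip is a "verdict on the last line" convention: the prompt forces the judge to reason first, and a small parser extracts the final decision. This is a sketch — the template wording and the parse_verdict helper are illustrative, not a standard API:

```python
# Illustrative pairwise judge prompt: reasoning first, verdict last.
PAIRWISE_JUDGE_PROMPT = """You are comparing two assistant responses.

Prompt: {prompt}

Response A:
{response_a}

Response B:
{response_b}

First, analyze both responses step by step: accuracy, clarity,
completeness, and tone. Do NOT reward length or formatting on their own.
Then, on the final line, output exactly one of:
VERDICT: A, VERDICT: B, or VERDICT: TIE."""

def parse_verdict(judge_output: str) -> str:
    """Extract the verdict from the judge's final line; default to tie."""
    for line in reversed(judge_output.strip().splitlines()):
        line = line.strip().upper()
        if line.startswith("VERDICT:"):
            verdict = line.split(":", 1)[1].strip()
            if verdict in ("A", "B"):
                return verdict
            if verdict == "TIE":
                return "tie"
    return "tie"  # unparseable judge output counts as a tie, never a win
```

Defaulting unparseable output to a tie is deliberate: a judge that fails to follow the format should never be counted as a win for either side.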

Shadow Deployment Architecture

Pairwise evals on your golden set tell you how versions compare on curated inputs. Shadow deployments tell you how they compare on real inputs — the messy, unpredictable traffic that users actually send. This distinction matters enormously. Distribution shift between your golden set and production is one of the most common sources of misleading eval results.

A shadow deployment routes a copy of live production traffic to your candidate agent simultaneously with the baseline, captures both responses, but serves only the baseline's response to the user. The candidate's response is logged silently — hence "shadow." You can then run your pairwise judge pipeline over the paired logs asynchronously, accumulating preference data at production scale without exposing users to an unvalidated agent.

Shadow Deployment Architecture:

  User Request
       │
       ▼
┌──────────────┐
│  API Gateway │
│ / Middleware │
└──────┬───────┘
       │  Fan-out
       ├─────────────────────────┐
       ▼                         ▼
┌─────────────┐           ┌─────────────┐
│  Baseline   │           │  Candidate  │
│   Agent     │           │   Agent     │
│  (v1.2)     │           │  (v1.3)     │
└──────┬──────┘           └──────┬──────┘
       │                         │
       │  ◄── Served to user     │  ◄── Shadow (not served)
       ▼                         ▼
  User sees                 ┌────────────┐
  this response             │  Log Store │
                            │ (both resp)│
                            └─────┬──────┘
                                  │
                                  ▼
                         ┌─────────────────┐
                         │  Async Judge    │
                         │  Pipeline       │
                         └─────────────────┘

The key architectural property is non-interference: the shadow path must never affect latency or reliability for real users. The fan-out should be fire-and-forget from the user's perspective. If the shadow agent fails, times out, or throws an exception, the baseline response still reaches the user without interruption.

🤔 Did you know? Shadow deployments are sometimes called "dark launches" and have been used in traditional software engineering for decades to validate infrastructure changes. The LLM eval context adds the twist that you're not just checking for errors — you're continuously judging quality on the shadow traffic.

Instrumenting a Shadow Deployment

Let's walk through a concrete implementation. The following middleware intercepts requests to your agent, fans them out to both versions concurrently, logs both responses, and returns only the baseline's answer to the caller.

import asyncio
import time
import uuid
import logging
from dataclasses import dataclass, asdict
from typing import Optional
import httpx
import json

logger = logging.getLogger(__name__)

@dataclass
class ShadowLogEntry:
    request_id: str
    timestamp: float
    prompt: str
    baseline_response: str
    baseline_latency_ms: float
    shadow_response: Optional[str]       # None if shadow errored
    shadow_latency_ms: Optional[float]   # None if shadow errored
    shadow_error: Optional[str]          # Error message if failed

async def call_agent(
    session: httpx.AsyncClient,
    endpoint: str,
    prompt: str,
    timeout_seconds: float = 5.0
) -> tuple[Optional[str], float, Optional[str]]:
    """Call an agent endpoint. Returns (response, latency_ms, error)."""
    start = time.monotonic()
    try:
        resp = await session.post(
            endpoint,
            json={"prompt": prompt},
            timeout=timeout_seconds
        )
        resp.raise_for_status()
        latency_ms = (time.monotonic() - start) * 1000
        return resp.json()["response"], latency_ms, None
    except Exception as e:
        latency_ms = (time.monotonic() - start) * 1000
        return None, latency_ms, str(e)

async def shadow_middleware(
    prompt: str,
    baseline_url: str,
    shadow_url: str,
    log_store,              # Any async-capable log sink
    shadow_timeout: float = 5.0,
    shadow_sample_rate: float = 1.0   # Set < 1.0 to control cost
) -> str:
    """
    Fan out the request to baseline and shadow agents concurrently.
    Always returns the baseline response; shadow is logged only.
    """
    import random
    request_id = str(uuid.uuid4())
    run_shadow = random.random() < shadow_sample_rate

    async with httpx.AsyncClient() as session:
        if run_shadow:
            # The shadow call gets its own client so it can outlive this
            # request handler; the logging task is responsible for closing
            # it. Sharing `session` would break: the context manager closes
            # it before the fire-and-forget shadow task finishes.
            shadow_session = httpx.AsyncClient()
            shadow_task = asyncio.create_task(
                call_agent(shadow_session, shadow_url, prompt,
                           timeout_seconds=shadow_timeout)
            )
            # The baseline call is what the user waits for; the shadow
            # call runs concurrently in the background
            baseline_resp, baseline_latency, baseline_err = await call_agent(
                session, baseline_url, prompt
            )
            # Shadow result is collected off the critical path —
            # logging is fire-and-forget
            asyncio.create_task(
                _log_shadow_result(
                    shadow_task, shadow_session, baseline_resp,
                    baseline_latency, prompt, request_id, log_store
                )
            )
        else:
            baseline_resp, baseline_latency, baseline_err = await call_agent(
                session, baseline_url, prompt
            )

    if baseline_err:
        raise RuntimeError(f"Baseline agent failed: {baseline_err}")

    return baseline_resp

async def _log_shadow_result(
    shadow_task, shadow_session, baseline_resp, baseline_latency,
    prompt, request_id, log_store
):
    """Collect shadow result and write the log entry. Never raises."""
    try:
        shadow_resp, shadow_latency, shadow_err = await shadow_task
        entry = ShadowLogEntry(
            request_id=request_id,
            timestamp=time.time(),
            prompt=prompt,
            baseline_response=baseline_resp or "",
            baseline_latency_ms=baseline_latency,
            shadow_response=shadow_resp,
            shadow_latency_ms=shadow_latency,
            shadow_error=shadow_err
        )
        await log_store.write(asdict(entry))
    except Exception as e:
        logger.error(f"Shadow logging failed for {request_id}: {e}")
    finally:
        # The shadow client is closed here, once the shadow call is done
        await shadow_session.aclose()

This code does several things worth highlighting. First, the shadow call runs as a separate asyncio task concurrently with the baseline call, so it adds nothing to the user-perceived latency of the baseline response. Second, the logging itself is wrapped in a fire-and-forget task so even log failures are non-blocking. Third, shadow_sample_rate gives you a cost lever — you don't need to shadow 100% of traffic to get statistically meaningful results.

Once log entries accumulate, you run the judge pipeline over them asynchronously. Here's a minimal judge loop that processes shadow logs and records preferences:

async def run_judge_pipeline(
    log_store,
    judge_fn,           # async fn(prompt, resp_a, resp_b) -> "A" | "B" | "tie"
    results_store,
    batch_size: int = 50
):
    """
    Read unscored shadow log entries, run pairwise judge, write preference results.
    Handles position-bias by running each pair in both orderings.
    """
    entries = await log_store.fetch_unscored(limit=batch_size)

    for entry in entries:
        # Skip entries where shadow errored — can't compare
        if entry["shadow_error"] or not entry["shadow_response"]:
            await results_store.mark_skipped(entry["request_id"], reason="shadow_error")
            continue

        prompt = entry["prompt"]
        baseline = entry["baseline_response"]
        shadow = entry["shadow_response"]

        # Run in both orderings to neutralize position bias
        verdict_1 = await judge_fn(prompt, resp_a=baseline, resp_b=shadow)
        verdict_2 = await judge_fn(prompt, resp_a=shadow,   resp_b=baseline)

        # Normalize verdicts relative to candidate (shadow)
        # verdict_1: A=baseline,B=shadow → shadow wins if verdict_1 == "B"
        # verdict_2: A=shadow,B=baseline → shadow wins if verdict_2 == "A"
        shadow_wins_1 = (verdict_1 == "B")
        shadow_wins_2 = (verdict_2 == "A")

        if shadow_wins_1 == shadow_wins_2:
            # Consistent judgment — record it
            preference = "candidate" if shadow_wins_1 else "baseline"
        else:
            # Inconsistent across orderings — call it a tie
            preference = "tie"

        await results_store.write_preference(
            request_id=entry["request_id"],
            preference=preference,
            latency_delta_ms=(
                (entry["shadow_latency_ms"] or 0) - entry["baseline_latency_ms"]
            )
        )
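The stored preference records can then be rolled up into the headline preference rate. A minimal sketch — summarize_preferences is an illustrative helper, not part of the pipeline above — that excludes ties from the denominator while still reporting them:

```python
def summarize_preferences(preferences: list[str]) -> dict:
    """Roll up "candidate" | "baseline" | "tie" labels into a summary."""
    wins = preferences.count("candidate")
    losses = preferences.count("baseline")
    ties = preferences.count("tie")
    decided = wins + losses
    return {
        "candidate_wins": wins,
        "baseline_wins": losses,
        "ties": ties,
        "total": len(preferences),
        # Preference rate is computed over non-tie comparisons only
        "preference_rate": wins / decided if decided else None,
    }
```

Reporting ties explicitly alongside the rate matters: a 55% preference rate over 100 decided comparisons reads very differently when there were also 300 ties.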

💡 Real-World Example: At a search company deploying a new retrieval-augmented agent, engineers ran a shadow deployment at 10% traffic sampling for two weeks. The pairwise judge showed the candidate winning 54% of non-tie comparisons — but latency analysis from the same logs revealed the candidate was 40% slower at the 95th percentile. They shipped the new prompting logic while keeping the old retrieval backend, reaching 56% preference at acceptable latency.

Statistical Significance: Don't Declare a Winner Too Early

This is the section most engineers skip, and it's where shadow deployments most often go wrong. You run a shadow eval for three days, see your candidate winning 58% of comparisons on 200 samples, and declare victory. Then you ship — and the improvement evaporates in production.

The problem is peeking: repeatedly checking your results before you have enough data and stopping as soon as things look good. Even with a fair coin, if you flip it 20 times and check after each flip, you'll often see a run of 60%+ heads purely by chance. The same dynamic inflates your confidence in A/B eval results.

The right approach starts with a priori sample size estimation. Before you begin, decide: what's the minimum effect size that matters to your business? If the candidate must win at least 55% of comparisons (vs. the 50% null hypothesis of equal quality) to justify the engineering cost of a rollout, you can calculate the required sample size using a power analysis.

As a rule of thumb for preference rate comparisons:

📋 Quick Reference Card: Sample Size Estimates

🎯 Minimum detectable lift 📊 Required samples (80% power, α=0.05)
🔢 5 percentage points (50% → 55%) ~800
🔢 10 percentage points (50% → 60%) ~200
🔢 15 percentage points (50% → 65%) ~90

These are ballpark figures for a one-proportion z-test of the preference rate against the 50% null, but they illustrate the key insight: detecting subtle improvements requires substantially more data than detecting large ones. If you're hoping to confirm a 55% preference rate with confidence, you need roughly 800 non-tie comparisons — not 50, not 100.
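You can reproduce these ballpark figures with the standard normal-approximation power formula. A sketch with z-values hardcoded for a two-sided α = 0.05 and 80% power (these constants, and the function name, are assumptions for illustration):

```python
import math

def required_samples(p1: float, p0: float = 0.5,
                     z_alpha: float = 1.96, z_beta: float = 0.8416) -> int:
    """Samples needed to detect a true preference rate p1 vs null p0,
    via the normal approximation to the binomial."""
    numerator = (z_alpha * math.sqrt(p0 * (1 - p0))
                 + z_beta * math.sqrt(p1 * (1 - p1)))
    return math.ceil((numerator / (p1 - p0)) ** 2)
```

For example, required_samples(0.55) comes out just under 800, consistent with the first row of the table; required_samples(0.60) lands near 200.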

Once you have your target sample size, pre-commit to it and don't look at significance until you reach it. If you must monitor early (for safety reasons), use a sequential testing method like the sequential probability ratio test (SPRT) that properly accounts for multiple looks.

When you do compute significance, report a confidence interval on the preference rate, not just a point estimate. "The candidate won 56% ± 3.5% of comparisons (95% CI)" is far more honest and useful than "the candidate won 56% of comparisons."
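One standard way to produce that interval is the Wilson score interval, which behaves better than the naive normal interval at small samples and near-boundary rates. A sketch:

```python
import math

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion wins/n."""
    if n == 0:
        return (0.0, 1.0)
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))) / denom
    return (center - margin, center + margin)
```

With 448 wins in 800 non-tie comparisons (a 56% point estimate), this gives an interval of roughly 52.5% to 59.4% — the "56% ± 3.5%" style of report described above.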

⚠️ Common Mistake — Mistake 1: Treating tied judgments as evidence for the null. Ties should simply be excluded from your preference rate calculation (you're measuring candidate_wins / (candidate_wins + baseline_wins)), but they should not be dropped from your sample size count. A shadow eval with 800 samples where 300 are ties is not equivalent to 800 clean comparisons.

⚠️ Common Mistake — Mistake 2: Running the shadow deployment during an atypical traffic period — a weekend, a product launch, a holiday — and generalizing the results. Shadow eval results are only as representative as the traffic slice they're drawn from.

Guardrails for Shadow Evals

Shadow deployments introduce real costs and risks that need active management.

Cost controls are the most immediate concern. Running two agents on every request can double your inference spend. The shadow_sample_rate parameter in the middleware above is your primary lever — sampling 10–20% of traffic typically yields sufficient statistical power within a reasonable window while cutting shadow costs by 80–90%. You should also set a hard budget cap: if your shadow pipeline's daily spend exceeds a threshold, automatically pause shadow sampling and alert on-call.
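A minimal sketch of the budget-cap idea — the class name, per-call cost model, and in-memory state are all illustrative assumptions; in production this state would live in a shared store, and hitting the cap would also page on-call:

```python
import time

class ShadowBudgetGuard:
    """Pauses shadow sampling once estimated daily spend hits a cap."""

    def __init__(self, daily_cap_usd: float, cost_per_call_usd: float):
        self.daily_cap_usd = daily_cap_usd
        self.cost_per_call_usd = cost_per_call_usd
        self._spent = 0.0
        self._day = time.strftime("%Y-%m-%d")

    def allow_shadow_call(self) -> bool:
        today = time.strftime("%Y-%m-%d")
        if today != self._day:            # reset spend at the day boundary
            self._day, self._spent = today, 0.0
        if self._spent + self.cost_per_call_usd > self.daily_cap_usd:
            return False                  # cap hit: skip shadow, alert here
        self._spent += self.cost_per_call_usd
        return True
```

The guard would sit in front of the run_shadow decision in the middleware: sample only when both the random draw and the budget allow it.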

Latency budgets protect your user experience even when the shadow path is non-blocking. Set an explicit timeout for shadow agent calls (the shadow_timeout parameter above) — typically 1.5–2× your baseline's P99 latency. This prevents a pathologically slow shadow agent from holding open connections indefinitely and exhausting your async thread pool. Log the timeout rate; if it exceeds 5%, the shadow agent may have a regression that your judge pipeline won't capture (since those entries get skipped).

Error handling in the shadow path requires a zero-tolerance policy: shadow errors must never propagate to the user. The _log_shadow_result function above wraps everything in a try/except for this reason. Beyond not crashing the user request, you should track shadow error rates as a first-class signal. A candidate agent that errors on 15% of shadow requests is telling you something important even before the judge pipeline runs.

💡 Mental Model: Think of the shadow path like a black box flight recorder. It captures everything without affecting the plane's controls. Your job is to ensure the recorder never interferes with the instruments, stores data reliably, and gets reviewed systematically after the flight.

Prompt and output logging carries privacy obligations. Shadow logs contain the same user inputs as your regular request logs — meaning they're subject to all the same data retention, PII handling, and deletion policies. Don't create a shadow logging path that inadvertently persists sensitive data longer than your main logs do.

🎯 Key Principle: A shadow deployment is only valuable if you trust the data it produces. Garbage in (biased traffic sample, broken shadow agent, skewed judge) means garbage out, no matter how sophisticated your statistics. Invest in validating the pipeline itself before interpreting its results.

Putting It Together: From Shadow Data to Ship Decision

A mature shadow eval workflow looks like this:

Candidate agent ready
        │
        ▼
1. Pre-commit sample size (based on MDE)
2. Launch shadow deployment at ~10-20% traffic
3. Run async judge pipeline continuously
        │
   ┌────┴────────────────────────────────┐
   │  Monitor (do NOT check significance) │
   │  ✓ Shadow error rate < 5%           │
   │  ✓ Shadow timeout rate < 5%         │
   │  ✓ Cost within budget               │
   └────┬────────────────────────────────┘
        │  Target N reached
        ▼
4. Compute preference rate + 95% CI
5. Check latency delta (P50, P95, P99)
6. Decision:
   ├─ Candidate wins & latency OK → Full rollout
   ├─ Candidate wins & latency worse → Partial rollout / optimize
   ├─ Tie within CI → Deeper analysis or neutral rollout
   └─ Baseline wins → Block rollout, investigate

The ship decision integrates both the quality signal (preference rate) and the operational signal (latency delta, error rate). A candidate that's 60% preferred but 2× slower is not unconditionally better — it's a tradeoff your product team needs to make consciously.

Correct thinking: Shadow evals give you evidence to inform a deployment decision, not an automatic trigger. The final call should integrate eval results, latency metrics, cost impact, and business context.

Wrong thinking: "The eval pipeline said the candidate is better, so we ship." No pipeline captures everything. Shadow evals are one input among several.

With pairwise comparison methodology and shadow deployment architecture in place, you have a rigorous, production-grade mechanism for validating agent changes before they reach users. But shadow deployments only run when you're actively promoting a new version. The next challenge — detecting quality problems that emerge after you ship, without any obvious failure signal — requires a different layer entirely, which we'll cover in the next section on silent regressions.

Detecting Silent Regressions in Production

Your monitoring stack lights up when a service throws a 500 error. It fires alerts when p99 latency crosses a threshold. It pages someone at 2 a.m. when the database goes down. But none of those systems will tell you that your customer-facing agent started giving subtly evasive answers three weeks ago, or that its factual accuracy quietly eroded after your retrieval corpus was refreshed. These are silent regressions — and they are arguably the most dangerous failure mode in production AI systems precisely because nothing obviously breaks.

This section is about building the monitoring layer that catches what traditional observability misses: gradual, aggregate drift in the quality of your agent's behavior.


What Is a Silent Regression?

A silent regression is a gradual degradation in one or more quality dimensions — tone, factuality, task completion rate, refusal rate, verbosity — that unfolds slowly enough to stay beneath the threshold of conventional error monitoring. No individual request fails. Latency is fine. The model always returns a response. Yet the character of those responses has shifted in ways that erode user trust and product value.

Consider a few realistic examples:

  • 🔧 A model provider silently deploys a new checkpoint. Your agent's refusal rate on borderline queries increases from 4% to 11% over a week. Users stop asking nuanced questions because they've learned the agent won't engage.
  • 📚 Your RAG corpus is refreshed with newer documents, but the chunking pipeline introduced a bug that strips table data. Answers to quantitative questions become vague. No errors are thrown.
  • 🧠 A prompt template change intended to improve formatting accidentally removes an instruction that grounded the agent's tone. Responses become increasingly terse and unhelpful.
  • 🎯 A dependency version bump changes how tool-call results are serialized. The agent still completes tasks, but its reasoning trace becomes less coherent, and downstream tasks that depend on it start failing silently.

In each case, a human looking at a single response might not notice anything alarming. The signal lives in the distribution of quality scores over time, not in any individual data point.

🎯 Key Principle: Silent regressions are distributional problems. You cannot detect them by inspecting individual responses. You must track aggregate quality metrics over time and compare windows against established baselines.


Metric Instrumentation: What to Log Per Request

Before you can detect drift, you need to instrument your system to capture the right signals at request time. The goal is to build a quality telemetry layer that runs alongside your existing infrastructure telemetry.

For every request your agent handles in production, you should log:

Signal Why It Matters
🎯 LLM judge score Quantified quality on rubric dimensions (helpfulness, factuality, tone)
📏 Response length Proxy for verbosity drift; sudden compression can indicate prompt or model changes
🔧 Tool-call count and pattern Did the agent use the tools it was supposed to? New patterns may indicate regression
⏱️ Latency breakdown Per-step latency can reveal retrieval degradation or model slowdowns
🚫 Refusal flag Boolean or classifier score for whether the agent declined to answer
🏷️ Input category Bucketed topic or intent class; enables cohort-level analysis
🆔 User segment Cohort identifier to detect regressions that affect specific user populations
📅 Model version / prompt hash Critical for correlating quality changes to deployment events

💡 Pro Tip: Log the prompt template hash alongside every request. This is the single most useful piece of metadata for root-cause analysis. When a regression appears in your dashboard, comparing the hash distribution before and after the regression window immediately tells you whether a prompt change is involved.
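Computing that hash is a one-liner. One design choice worth making explicitly: hashing the raw template text (rather than each fully rendered prompt) yields one stable hash per template revision, which is what you want for correlating regressions to template changes. A sketch:

```python
import hashlib

def prompt_template_hash(template_text: str) -> str:
    """Stable SHA-256 identifier for a prompt template revision."""
    return hashlib.sha256(template_text.encode("utf-8")).hexdigest()
```

Any edit to the template — even a two-word tweak — produces a new hash, so the before/after hash distributions in your logs pinpoint exactly when the change took effect.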

Not every signal needs to be computed synchronously in the request path. Judge scores in particular are expensive — running an LLM evaluator on every live request would add hundreds of milliseconds of latency and significant cost. The answer is asynchronous sampling: capture the raw request and response, enqueue them, and run the judge in a background worker against a representative sample of traffic.

Live Request Path (synchronous, low latency)
────────────────────────────────────────────
User Request
    │
    ▼
 Agent Executor
    │
    ├── Response → User  (immediate)
    │
    └── {request, response, metadata}
            │
            ▼
        Sample Queue  (e.g., 10–20% of traffic)


Background Quality Pipeline (async, decoupled)
───────────────────────────────────────────────
Sample Queue
    │
    ▼
 Quality Worker
    ├── LLM Judge → rubric scores
    ├── Refusal Classifier → refusal_flag
    ├── Length / tool-call stats
    └── Write to Time-Series Store
            │
            ▼
       Dashboard + Alert Engine

This architecture keeps the user-facing latency clean while still giving you a statistically meaningful quality signal. A 10–20% sample rate is usually sufficient to detect regressions within hours for moderate-traffic systems.
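The queue-and-worker pattern in the diagram can be sketched with an asyncio.Queue. Here score_fn stands in for the judge/classifier pipeline, and the function names are illustrative:

```python
import asyncio
import random

async def enqueue_sample(queue: asyncio.Queue, record: dict,
                         sample_rate: float = 0.15) -> None:
    """Called from the request path: non-blocking, drops when full."""
    if random.random() < sample_rate:
        try:
            queue.put_nowait(record)
        except asyncio.QueueFull:
            pass  # never block or fail the live request for telemetry

async def quality_worker(queue: asyncio.Queue, score_fn, store: list) -> None:
    """Background consumer: scores sampled requests and persists results."""
    while True:
        record = await queue.get()
        if record is None:        # sentinel signals shutdown
            break
        store.append(await score_fn(record))
        queue.task_done()
```

The two design points carried over from the diagram: the request path only does a non-blocking enqueue (and silently drops samples under backpressure), and all expensive scoring happens in the decoupled worker.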


Building the Quality Dashboard

Quality dashboards for AI agents differ from standard SRE dashboards in one critical way: the metrics are soft — they're scores and distributions rather than hard counters. This means your visualization and alerting primitives need to be statistical rather than threshold-based.

The most useful views to build:

  • 📊 Rolling mean judge score over a 24-hour sliding window, plotted as a time series with a baseline band
  • 📊 Score distribution histogram for the current window vs. the prior-week baseline — divergence here is more informative than mean drift alone
  • 📊 Refusal rate as a percentage of sampled requests, broken out by input category
  • 📊 Cohort-level quality heatmap — judge scores segmented by user tier, input category, or geography
  • 📊 Tool-call anomaly rate — percentage of requests where tool-call patterns deviated from the historical mode

Anomaly Detection Strategies

Raw dashboards are useful for human review, but you need automated detection to catch regressions quickly — especially overnight or over weekends. There are three complementary strategies worth implementing.

Sliding-Window Score Baselines

A sliding-window baseline computes the rolling mean and standard deviation of your judge scores over a reference window (typically 7 days) and compares the current short window (typically 1 hour or 4 hours) against it. A regression triggers when the short-window mean drops below baseline_mean - k * baseline_std for some sensitivity factor k (typically 2.0–2.5).

This approach is robust to natural diurnal variation — your quality scores may legitimately differ between peak hours and off-peak hours — because the baseline absorbs that pattern if it's computed over a full week.

Z-Score Alerts on Judge Score Distributions

Mean drift is one signal, but distributional shift is often a more sensitive early indicator. A z-score alert fires when the fraction of scores falling below a quality floor (e.g., below 3 on a 1–5 rubric) exceeds historical norms by more than a configurable number of standard deviations.

⚠️ Common Mistake — Mistake 1: Alerting only on mean score drops. A regression that makes the best responses slightly worse while making the worst responses much worse will barely move the mean but will dramatically increase the low-tail fraction. Always monitor the distribution, not just the center.
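A sketch of the low-tail alert described above: compare the current window's below-floor fraction against a historical fraction using the normal approximation to the binomial. The floor, baseline fraction, and thresholds here are illustrative, not recommended defaults:

```python
import math

def low_tail_alert(scores: list[float], floor: float = 3.0,
                   baseline_frac: float = 0.08,
                   z_threshold: float = 2.0) -> bool:
    """True if the fraction of scores below `floor` is anomalously high
    relative to the historical baseline fraction."""
    n = len(scores)
    if n < 30:                    # too few samples for the approximation
        return False
    frac = sum(1 for s in scores if s < floor) / n
    std = math.sqrt(baseline_frac * (1 - baseline_frac) / n)
    return (frac - baseline_frac) / std > z_threshold
```

Note that this alert can fire even when the window mean is unchanged — exactly the regression shape the mistake above describes.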

Cohort-Level Breakdowns

Cohort-level analysis is what separates sophisticated monitoring from naive monitoring. A regression that affects only one input category — say, questions about pricing — will be invisible in aggregate metrics if that category is 5% of your traffic. Segmenting quality scores by input category, user segment, and time-of-day gives you the resolution to catch these targeted regressions.

Aggregate Score: 3.8 (within normal bounds)  ← regression invisible here

By Input Category:
  billing_queries:       2.1  ← regression visible here
  general_support:       4.1
  product_information:   4.0
  account_management:    3.9

💡 Real-World Example: A fintech company's agent regression was caused by a RAG corpus update that inadvertently removed fee schedule documents. The aggregate quality score barely moved because billing queries were only 8% of traffic. Cohort-level monitoring caught it within two hours; aggregate monitoring would have taken days.


Correlating Eval Signals Back to Root Causes

Detecting a regression is only half the battle. The second half is figuring out why it happened, and doing so quickly. The primary root causes fall into a small number of categories:

  • 🔧 Model provider updates: LLM providers deploy new checkpoints without always announcing them. Correlate quality dips with provider changelog dates and monitor for changes in response style or capability signatures.
  • 📝 Prompt template changes: Any merge to your prompt templates should be treated as a deployment event and tagged in your monitoring system. A prompt hash annotated on every logged request makes this correlation immediate.
  • 📚 Retrieval corpus drift: Document refreshes, chunking pipeline changes, or embedding model updates can silently change what context the agent retrieves. Track retrieval diversity metrics and top-chunk overlap rates as leading indicators.
  • 📦 Dependency version bumps: A new version of your orchestration framework, tool-call library, or serialization layer can change agent behavior in ways that don't surface as errors.

The architectural key is deployment event markers on your time-series store. Every time any of the above changes, write a timestamped annotation to the same store where your quality metrics live. When you pull up a quality regression in your dashboard, the deployment markers immediately surface as candidate causes.

🧠 Mnemonic: Think of it as MPRD: Model, Prompt, Retrieval, Dependency. Those are your four culprit categories when quality drifts.


Code Example: A Lightweight Production Monitor

The following implementation shows a production-ready quality monitor that samples live traffic, runs an LLM judge asynchronously, writes scores to a time-series store, and triggers alerts on regression. It's designed to be added to an existing agent system with minimal coupling.

import asyncio
import hashlib
import random
import time
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

import httpx  # async HTTP for judge calls

## ── Data Structures ────────────────────────────────────────────────────────────

@dataclass
class RequestRecord:
    """Captures the raw data needed for async quality evaluation."""
    request_id: str
    user_input: str
    agent_response: str
    input_category: str          # e.g., "billing", "support", "product"
    user_segment: str            # e.g., "free", "pro", "enterprise"
    prompt_template_hash: str    # SHA-256 of the rendered prompt template
    latency_ms: float
    tool_calls_made: list[str]
    timestamp: datetime = field(default_factory=datetime.utcnow)


@dataclass
class QualityScore:
    """Scored output from the LLM judge."""
    request_id: str
    helpfulness: float       # 1–5
    factuality: float        # 1–5
    tone: float              # 1–5
    composite: float         # weighted average
    refusal_flag: bool
    timestamp: datetime
    input_category: str
    user_segment: str
    prompt_template_hash: str


## ── Judge Client ───────────────────────────────────────────────────────────────

JUDGE_PROMPT_TEMPLATE = """
You are an expert quality evaluator for an AI assistant.
Score the following response on three dimensions (1=poor, 5=excellent).
Respond with ONLY valid JSON.

User Input: {user_input}
Agent Response: {agent_response}

Return: {{"helpfulness": <1-5>, "factuality": <1-5>, "tone": <1-5>, "refusal": <true|false>}}
"""

async def run_llm_judge(
    record: RequestRecord,
    judge_api_url: str,
    judge_model: str = "gpt-4o-mini",
) -> QualityScore:
    """Calls the LLM judge asynchronously and returns a structured score."""
    prompt = JUDGE_PROMPT_TEMPLATE.format(
        user_input=record.user_input,
        agent_response=record.agent_response,
    )
    async with httpx.AsyncClient(timeout=30.0) as client:
        resp = await client.post(
            judge_api_url,
            json={"model": judge_model, "messages": [{"role": "user", "content": prompt}]},
        )
        resp.raise_for_status()
        raw = resp.json()["choices"][0]["message"]["content"]

    import json
    scores = json.loads(raw)
    composite = (
        scores["helpfulness"] * 0.4
        + scores["factuality"] * 0.4
        + scores["tone"] * 0.2
    )
    return QualityScore(
        request_id=record.request_id,
        helpfulness=scores["helpfulness"],
        factuality=scores["factuality"],
        tone=scores["tone"],
        composite=composite,
        refusal_flag=scores["refusal"],
        timestamp=record.timestamp,
        input_category=record.input_category,
        user_segment=record.user_segment,
        prompt_template_hash=record.prompt_template_hash,
    )


## ── Time-Series Store (stub for illustration) ──────────────────────────────────

class TimeSeriesStore:
    """Stub interface — replace with InfluxDB, Prometheus, or similar."""

    def write_score(self, score: QualityScore) -> None:
        print(
            f"[STORE] {score.timestamp.isoformat()} | "
            f"category={score.input_category} | composite={score.composite:.2f} | "
            f"refusal={score.refusal_flag} | prompt_hash={score.prompt_template_hash[:8]}"
        )

    def get_baseline_stats(
        self, category: str, window_days: int = 7
    ) -> dict[str, float]:
        """Returns {'mean': float, 'std': float} for the rolling baseline."""
        # In production: query your store for the past window_days of scores
        # filtered by category. Returning a static stub here.
        return {"mean": 3.8, "std": 0.4}


## ── Anomaly Detector ───────────────────────────────────────────────────────────

class RegressionDetector:
    """
    Computes a z-score for the current short window against the rolling
    baseline and fires an alert if the deviation exceeds the threshold.
    """

    def __init__(
        self,
        store: TimeSeriesStore,
        z_score_threshold: float = 2.0,
        short_window_minutes: int = 60,
    ):
        self.store = store
        self.z_score_threshold = z_score_threshold
        self.short_window_minutes = short_window_minutes
        self._recent_scores: list[QualityScore] = []  # in-memory buffer

    def ingest(self, score: QualityScore) -> None:
        """Add a new score and prune records outside the short window."""
        self._recent_scores.append(score)
        cutoff = datetime.utcnow() - timedelta(minutes=self.short_window_minutes)
        self._recent_scores = [s for s in self._recent_scores if s.timestamp > cutoff]

    def check_for_regression(
        self, category: Optional[str] = None
    ) -> Optional[dict]:
        """
        Returns an alert dict if a regression is detected, else None.
        Optionally scoped to a specific input category for cohort analysis.
        """
        scores = self._recent_scores
        if category:
            scores = [s for s in scores if s.input_category == category]

        if len(scores) < 10:  # need minimum sample to avoid false positives
            return None

        current_mean = sum(s.composite for s in scores) / len(scores)
        baseline = self.store.get_baseline_stats(category or "all")

        if baseline["std"] == 0:
            return None  # can't compute z-score without variance

        z_score = (current_mean - baseline["mean"]) / baseline["std"]

        if z_score < -self.z_score_threshold:
            return {
                "alert": "QUALITY_REGRESSION",
                "category": category or "all",
                "current_mean": round(current_mean, 3),
                "baseline_mean": baseline["mean"],
                "z_score": round(z_score, 3),
                "sample_size": len(scores),
                "detected_at": datetime.utcnow().isoformat(),
            }
        return None


## ── Production Monitor (orchestrator) ─────────────────────────────────────────

class ProductionQualityMonitor:
    """
    Ties sampling, judging, storage, and alerting into a single cohesive monitor.
    Designed to run as a background service alongside your agent.
    """

    def __init__(
        self,
        judge_api_url: str,
        sample_rate: float = 0.15,      # 15% of live traffic
        judge_model: str = "gpt-4o-mini",
        z_score_threshold: float = 2.0,
    ):
        self.judge_api_url = judge_api_url
        self.sample_rate = sample_rate
        self.judge_model = judge_model
        self.store = TimeSeriesStore()
        self.detector = RegressionDetector(self.store, z_score_threshold)
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=1000)

    def maybe_enqueue(self, record: RequestRecord) -> None:
        """Called synchronously in the hot path — samples and enqueues."""
        if random.random() < self.sample_rate:
            try:
                self._queue.put_nowait(record)
            except asyncio.QueueFull:
                pass  # drop gracefully rather than block the request path

    async def _worker(self) -> None:
        """Background coroutine: judge → store → detect → alert."""
        while True:
            record: RequestRecord = await self._queue.get()
            try:
                score = await run_llm_judge(
                    record, self.judge_api_url, self.judge_model
                )
                self.store.write_score(score)
                self.detector.ingest(score)

                # Check both aggregate and cohort-level regressions
                for category in [None, score.input_category]:
                    alert = self.detector.check_for_regression(category)
                    if alert:
                        await self._fire_alert(alert)
            except Exception as exc:
                # Never let a monitoring failure affect the main system
                print(f"[MONITOR ERROR] {exc}")
            finally:
                self._queue.task_done()

    async def _fire_alert(self, alert: dict) -> None:
        """Sends alert to your on-call system (PagerDuty, Slack, etc.)."""
        # Replace with your actual alerting integration
        print(f"🚨 REGRESSION ALERT: {alert}")

    async def start(self, num_workers: int = 4) -> None:
        """Launches the background worker pool."""
        # Hold references so the tasks aren't garbage-collected mid-flight
        self._tasks = [
            asyncio.create_task(self._worker()) for _ in range(num_workers)
        ]

This implementation makes several deliberate design choices worth highlighting. The maybe_enqueue method is the only piece that runs synchronously in the hot request path, and it does almost nothing — a coin flip and a non-blocking queue push. All the expensive work (judge call, store write, regression check) happens in background workers. The try/except in the worker ensures that a flaky judge API or a malformed response never propagates back to the main system.

The check_for_regression call runs twice per scored request: once over all traffic and once scoped to the request's input category. This is how you catch the subtle category-level regressions that aggregate metrics miss.
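To see why the cohort-level pass matters, here is a self-contained sketch of the same z-score arithmetic the detector uses. The category names and scores are invented for illustration; in the real monitor the baseline comes from the TimeSeriesStore and the recent scores from the detector's in-memory window.

```python
from statistics import mean

# Illustrative baseline — in production this comes from get_baseline_stats()
baseline = {"mean": 3.8, "std": 0.4}

# Hypothetical recent composite scores by category: "refunds" has regressed
recent = {
    "billing": [3.9, 3.7, 3.8, 4.0, 3.8, 3.9, 3.7, 3.8, 3.9, 4.0],
    "refunds": [2.9, 3.0, 2.8, 3.1, 2.9, 3.0, 2.8, 2.9, 3.1, 3.0],
}

def z(scores: list[float]) -> float:
    return (mean(scores) - baseline["mean"]) / baseline["std"]

all_scores = [s for v in recent.values() for s in v]
print(f"aggregate: z = {z(all_scores):+.2f}")  # stays above the -2.0 threshold
for cat, scores in recent.items():
    flag = "  <-- ALERT" if z(scores) < -2.0 else ""
    print(f"{cat:>9}: z = {z(scores):+.2f}{flag}")
```

The aggregate z-score sits at -1.0, inside the threshold, while the refunds cohort is past -2.0 — exactly the situation where a monitor that only checks overall traffic stays silent.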


Wiring It Into Your Agent

Integrating the monitor into an existing agent is intentionally lightweight — you call maybe_enqueue at the end of your request handler, once the response is ready and just before returning it to the user:

async def handle_agent_request(user_input: str, user_id: str) -> str:
    """Your existing agent handler, with monitoring added."""
    import hashlib
    import time

    start = time.monotonic()

    # --- Existing agent logic (unchanged) ---
    prompt_template = load_prompt_template("v3.2")
    prompt_hash = hashlib.sha256(prompt_template.encode()).hexdigest()
    response, tool_calls = await run_agent(user_input, prompt_template)

    latency_ms = (time.monotonic() - start) * 1000

    # --- Monitoring hook (non-blocking, added after response is ready) ---
    record = RequestRecord(
        request_id=generate_request_id(),
        user_input=user_input,
        agent_response=response,
        input_category=classify_input(user_input),   # lightweight classifier
        user_segment=get_user_segment(user_id),
        prompt_template_hash=prompt_hash,
        latency_ms=latency_ms,
        tool_calls_made=tool_calls,
    )
    monitor.maybe_enqueue(record)   # returns immediately; never blocks

    return response

⚠️ Common Mistake: Running quality monitoring synchronously in the request path. Even a fast LLM judge call adds 300–800ms of latency. Users will notice. Always decouple the monitoring path from the serving path using a queue.


Closing the Loop: From Alert to Remediation

Detection without a remediation playbook is incomplete. When a regression alert fires, your team needs a documented runbook that includes:

  1. Check deployment markers in the time-series dashboard — did a prompt, model, or corpus change coincide with the regression window?
  2. Inspect the cohort breakdown — is the regression global or category-specific? Category-specific regressions almost always point to retrieval corpus or prompt routing issues.
  3. Pull a sample of scored requests from the regression window and review the judge rationales — they often name the specific failure mode.
  4. Roll back the most recent MPRD change if the correlation is clear, then verify that quality scores recover within one short window.
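Runbook step 1 is mechanical enough to automate. A minimal sketch, assuming deployment markers are queryable as timestamped event records (the event names, timestamps, and the `correlated_deployments` helper are all hypothetical):

```python
from datetime import datetime, timedelta

# Hypothetical deployment-marker log; in production these annotations
# live alongside quality scores in the time-series store.
deployments = [
    {"event": "prompt template v3.1 -> v3.2", "at": datetime(2024, 6, 3, 14, 0)},
    {"event": "retrieval corpus refresh", "at": datetime(2024, 6, 5, 9, 30)},
]

def correlated_deployments(alert_detected_at: datetime,
                           lookback_hours: int = 24) -> list[dict]:
    """Runbook step 1: deployment events inside the regression window."""
    window_start = alert_detected_at - timedelta(hours=lookback_hours)
    return [d for d in deployments
            if window_start <= d["at"] <= alert_detected_at]

for d in correlated_deployments(datetime(2024, 6, 5, 18, 0)):
    print(f"suspect: {d['event']} at {d['at'].isoformat()}")
```

Surfacing the suspect events directly in the alert payload saves the on-call engineer the first ten minutes of every investigation.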

💡 Remember: The monitor is a detection system, not a diagnosis system. It tells you that something changed and when it changed. The deployment event markers and cohort breakdowns tell you what likely changed. Human judgment and your golden set evaluations (covered in Section 3) tell you whether the fix actually worked.


📋 Quick Reference Card: Silent Regression Detection Essentials

🔧 Component | 📝 What It Does | ⚠️ Key Pitfall
🎯 Async sampling queue | Decouples monitoring from hot path | Never run sync; drops are acceptable
📊 LLM judge worker | Scores sampled responses on rubric | Use cheaper judge model for volume
🗄️ Time-series store | Persists scores + deployment markers | Log prompt hash on every record
📉 Sliding-window baseline | Establishes normal score distribution | Use 7-day window to absorb diurnal patterns
🚨 Z-score alerting | Detects distributional regression | Alert on tail fraction, not just mean
🔍 Cohort breakdown | Catches category-specific regressions | Segment before declaring all-clear
📅 Deployment markers | Correlates quality changes to code changes | Annotate every MPRD event

Silent regressions are the chronic illness of production AI systems — easy to ignore, hard to reverse once entrenched, and costly to your users' trust. The monitoring architecture in this section is your early warning system: lightweight enough to run continuously, sensitive enough to catch drift within hours, and structured enough to point you toward the root cause before a subtle quality issue becomes a crisis.

Key Takeaways and Eval Pipeline Reference

You started this lesson facing a real engineering problem: LLM-based agents are hard to test, their failures are subtle, and traditional pass/fail unit tests leave you flying blind in production. Over the five sections that preceded this one, you've built up a complete mental model for tackling that problem systematically. This final section crystallizes everything into a set of principles you can act on today, a reference architecture you can adapt to your own stack, and a checklist that takes you from zero to production monitoring without gaps.


What You Now Understand That You Didn't Before

Before this lesson, you might have assumed that an LLM agent is "working" if it returns a response without throwing an exception. Now you know that silent, gradual quality degradation is far more dangerous than hard failures precisely because no alert fires and no user screams—the system just quietly becomes less useful over time.

You've internalized four complementary instruments for combating that:

🧠 Golden sets anchor your pipeline to ground truth that humans have already verified. They make regressions detectable even when the model produces fluent, confident-sounding text.

📚 LLM-as-Judge turns quality evaluation into a scalable automated process, letting you score thousands of outputs per CI run instead of relying on a team of human reviewers for every merge.

🔧 A/B evals and shadow deployments let you compare a new agent version against your live baseline on real traffic, without exposing users to an unproven change.

🎯 Continuous production sampling closes the loop, ensuring that what you measured in staging still holds after the deployment is complete and the edge cases of real usage start arriving.

These four instruments aren't alternatives to one another—they're layers. Miss any one of them, and you have a blind spot.


The Eval Pyramid

The most useful mental model for organizing your eval pipeline is a three-layer pyramid, directly analogous to the classic software testing pyramid but adapted for agents.

                  ┌───────────────────────────┐
                  │   Production Sampling     │  ← Continuous, sampled, async
                  │   (Regression Monitoring) │
                 /└───────────────────────────┘\
                /  ┌───────────────────────────┐ \
               /   │  Shadow / Integration     │  \
               /    │  Evals (Pre-Deploy)       │   \
             /     └───────────────────────────┘    \
            /       ┌───────────────────────────┐    \
           /        │  Golden-Set Unit Tests    │     \
          /         │  (CI / Every Commit)      │      \
         /──────────└───────────────────────────┘───────\

Layer 1 — Golden-Set Unit Tests (CI, every commit): These are your fastest feedback loop: a curated set of input/output pairs, scored by an LLM judge calibrated against human labels, with each case completing in seconds. They catch obvious regressions before code even reaches a staging environment. Keep this set small enough to complete in under five minutes; ruthlessly prune stale cases and add new ones whenever a real bug slips through.

Layer 2 — Shadow / Integration Evals (pre-deploy): Before promoting a new agent version to receive live traffic, replay a representative slice of production requests through both the old and new versions in parallel. Score both with your LLM judge and run statistical significance tests on the delta. This layer catches the regressions that golden sets miss because they involve distribution shifts in real user phrasing.

Layer 3 — Continuous Production Sampling (ongoing): After deployment, randomly sample a percentage of live requests—typically 1–5%—score them asynchronously, and track rolling quality metrics. Alert on drift, not just on individual low scores. This is the layer that catches the slow, invisible degradation that neither CI nor shadow deployment can surface.
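"Alert on drift, not just on individual low scores" is worth making concrete: a growing tail of bad responses can hide behind a healthy, even improving, mean. A minimal sketch with invented numbers:

```python
from statistics import mean

def window_stats(scores: list[float], floor: float = 3.0) -> dict[str, float]:
    """Rolling-window summary: mean plus the fraction of scores under a floor."""
    return {
        "mean": mean(scores),
        "tail_fraction": sum(s < floor for s in scores) / len(scores),
    }

last_week = [4.2, 4.0, 4.1, 4.3, 4.0, 4.2, 4.1, 4.0]
this_week = [4.9, 4.9, 2.8, 4.9, 2.9, 4.9, 4.9, 4.8]  # mean holds; tail grows

print("last week:", window_stats(last_week))
print("this week:", window_stats(this_week))
```

Here the mean actually rises week over week while a quarter of responses have fallen below the quality floor — a mean-only alert would sleep right through it.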

🎯 Key Principle: Each layer has a different cost/coverage tradeoff. Layer 1 is cheap and fast but has low coverage of the real distribution. Layer 3 has perfect coverage of real traffic but high latency between a regression occurring and your team acting on it. All three layers together give you both speed and completeness.


Summary Comparison Table

📋 Dimension | 🔒 Golden Set (L1) | 🔧 Shadow Eval (L2) | 🎯 Prod Sampling (L3)
🕐 When it runs | Every CI commit | Pre-deploy gate | Continuous, post-deploy
📊 Traffic source | Curated fixed set | Replayed prod traffic | Live sampled requests
⚡ Feedback speed | Minutes | Hours | Hours–Days (rolling)
🌍 Distribution coverage | Low (curated) | High (real traffic) | Very high (live)
💰 Cost per run | Low | Medium–High | Low (sampled %)
🚨 Main risk caught | Prompt/code regressions | Version-to-version quality delta | Silent drift post-launch


Top Mistakes to Avoid

Now that you have the full picture, the most important thing you can do is recognize the failure modes that kill real eval pipelines in practice. They're subtle, they're common, and at least one of them will bite you if you're not watching for it.

Mistake 1: Trusting a Judge Without Calibrating It Against Humans ⚠️

An LLM judge is only as reliable as the rubric you gave it and your evidence that it agrees with humans. Many teams deploy a judge, collect automated scores, and make ship/no-ship decisions—without ever checking whether the judge's scores correlate with what a human would say.

Wrong thinking: "The judge scored our new prompt at 4.2/5, up from 3.9. Ship it."

Correct thinking: "Our judge has a Cohen's kappa of 0.74 against our human raters on 200 calibration examples. The score increase from 3.9 to 4.2 is statistically significant given our sample size of 500 shadow eval responses. Ship it."

Calibration is not a one-time event. Re-run your human correlation check every time you update the judge's rubric, change the judge model, or add a new evaluation dimension.
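Unweighted Cohen's kappa is simple enough to compute from paired judge/human labels with the standard library alone. A minimal sketch — the sample labels below are invented for illustration, and for ordinal 1–5 scores you may prefer a weighted variant in practice:

```python
from collections import Counter

def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Unweighted Cohen's kappa for two raters over the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal label frequencies
    expected = sum(counts_a[label] * counts_b[label]
                   for label in set(rater_a) | set(rater_b)) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

human = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4]   # invented calibration labels
judge = [5, 4, 3, 3, 5, 2, 4, 4, 5, 4]
print(f"kappa = {cohens_kappa(human, judge):.2f}")
```

Run this over your full calibration set (not ten items as here) and track the kappa value alongside each rubric or judge-model revision.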

Mistake 2: Shipping a Golden Set That Never Gets Updated ⚠️

A golden set that was accurate six months ago is a liability today. Your product has evolved, your users have evolved, and—if you've been doing any prompt engineering—your agent's capabilities have changed. A stale golden set can give you green CI signals while a real-world regression sits undetected.

🎯 Key Principle: Treat your golden set like production code. It needs a review cycle, versioning, and a clear owner. A practical policy: add at least one new case for every bug that escapes to production, and prune any case that no longer reflects a realistic user interaction.

Mistake 3: Declaring A/B Winners Without Statistical Significance ⚠️

This is the most seductive mistake because it feels rigorous—you ran an experiment!—but the conclusion is noise. If your shadow eval runs on 80 requests and the new agent scores 0.3 points higher, that delta could easily be sampling variance. Declare it a winner and you've shipped a change that might be neutral or even slightly worse.

⚠️ Common Mistake: Picking a sample size before running a power analysis for your expected effect size. Always choose your minimum detectable effect first, then derive the required sample size from it — not the other way around.

💡 Pro Tip: For most agent quality metrics, an effect size of 0.1 standard deviations is practically meaningful. At 80% power and p < 0.05, that typically requires roughly 800 paired observations (or about 1,600 per arm if the comparison is unpaired). If your daily traffic can't support that in a shadow deployment window, pool multiple days of replayed traffic before drawing conclusions.
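The arithmetic behind such estimates is easy to sketch with the standard normal-approximation formula for sample size. Treat the outputs as planning numbers, not guarantees — the exact requirement depends on the test you run and its variance assumptions:

```python
from math import ceil
from statistics import NormalDist

def required_n(effect_size_sd: float, alpha: float = 0.05,
               power: float = 0.80, paired: bool = True) -> int:
    """Normal-approximation sample size: number of pairs (paired design)
    or per-arm n (unpaired two-sample design)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    n = ((z_alpha + z_beta) / effect_size_sd) ** 2
    return ceil(n if paired else 2 * n)

print("pairs needed (paired):  ", required_n(0.1))
print("n per arm (unpaired):   ", required_n(0.1, paired=False))
```

Note how much cheaper the paired design is — another reason shadow evals, which naturally produce paired scores per request, beat independent A/B samples for the same statistical power.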



Minimum Viable Eval Pipeline: Quick-Reference Checklist

Use this checklist when standing up an eval pipeline from scratch. Every item maps to a concept covered in this lesson.

Phase 1: Golden Set Creation

  • Identify 3–5 core capabilities your agent must reliably perform
  • Collect 20–50 real or realistic inputs per capability
  • Write expected outputs or reference answers with human review
  • Store inputs, expected outputs, and metadata (date, source, version) in version control
  • Tag each case with the capability it exercises (for slice-level reporting)

Phase 2: Judge Setup

  • Author a structured rubric (2–4 scored dimensions relevant to your use case)
  • Write a judge prompt that accepts input, actual output, and optionally a reference answer
  • Run the judge on 100–200 golden cases and collect human scores on the same cases
  • Compute Cohen's kappa or Spearman correlation; target κ ≥ 0.65 before trusting the judge in CI
  • Log all judge calls with inputs and outputs for auditability

Phase 3: CI Integration

  • Run golden set through the judge on every pull request
  • Set a minimum average score threshold (e.g., mean ≥ 3.8/5) that blocks merge on failure
  • Report per-capability slice scores, not just overall mean
  • Store historical CI scores to track trends over time

Phase 4: Shadow Deployment

  • Set up a traffic duplication mechanism (reverse proxy, message queue fan-out, or log replay)
  • Define your candidate vs. baseline versioning scheme
  • Run shadow eval until reaching your pre-computed minimum sample size
  • Score both versions with the same judge; compute paired score deltas
  • Run a paired t-test or Wilcoxon signed-rank test; require p < 0.05 before promotion

Phase 5: Production Monitoring

  • Instrument your serving layer to log a random sample of requests (1–5%)
  • Score sampled requests asynchronously with the judge
  • Compute a rolling 7-day mean score and track week-over-week delta
  • Set an alert threshold (e.g., mean drops more than 0.2 points in a 24-hour window)
  • Review flagged samples manually each week to catch rubric drift

Reference Implementation: Putting It All Together

The following code sketch shows a minimal orchestration layer that wires the three pyramid levels into a single pipeline object. This is intentionally simplified to highlight structure rather than production-ready detail.

## eval_pipeline.py
## Minimal orchestration layer connecting all three eval pyramid levels.

import statistics
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class EvalCase:
    input: str
    reference_output: Optional[str]
    capability_tag: str

@dataclass
class EvalResult:
    case: EvalCase
    actual_output: str
    score: float          # 1–5 from LLM judge
    judge_rationale: str

class EvalPipeline:
    def __init__(
        self,
        agent_fn: Callable[[str], str],          # The agent under test
        judge_fn: Callable[[EvalCase, str], tuple[float, str]],  # (score, rationale)
        golden_set: list[EvalCase],
        ci_threshold: float = 3.8,               # Minimum mean score to pass CI
        regression_alert_delta: float = 0.2,     # Drop that triggers a monitoring alert
    ):
        self.agent = agent_fn
        self.judge = judge_fn
        self.golden_set = golden_set
        self.ci_threshold = ci_threshold
        self.regression_alert_delta = regression_alert_delta

    # ── Layer 1: Golden-set CI run ────────────────────────────────────────────
    def run_ci(self) -> dict:
        """Run every golden-set case through the agent and judge."""
        results: list[EvalResult] = []
        for case in self.golden_set:
            output = self.agent(case.input)
            score, rationale = self.judge(case, output)
            results.append(EvalResult(case, output, score, rationale))

        # Slice scores by capability tag for granular reporting
        by_tag: dict[str, list[float]] = {}
        for r in results:
            by_tag.setdefault(r.case.capability_tag, []).append(r.score)

        overall_mean = statistics.mean(r.score for r in results)
        passed = overall_mean >= self.ci_threshold

        return {
            "passed": passed,
            "overall_mean": overall_mean,
            "by_capability": {tag: statistics.mean(scores)
                              for tag, scores in by_tag.items()},
            "failures": [r for r in results if r.score < self.ci_threshold],
        }

    # ── Layer 2: Shadow eval (pre-deploy comparison) ──────────────────────────
    def run_shadow_eval(
        self,
        candidate_agent_fn: Callable[[str], str],
        shadow_requests: list[str],
    ) -> dict:
        """Score baseline vs. candidate on replayed traffic; return delta and p-value."""
        from scipy.stats import wilcoxon  # type: ignore

        baseline_scores, candidate_scores = [], []
        for req in shadow_requests:
            dummy_case = EvalCase(input=req, reference_output=None, capability_tag="shadow")
            baseline_out = self.agent(req)
            candidate_out = candidate_agent_fn(req)
            b_score, _ = self.judge(dummy_case, baseline_out)
            c_score, _ = self.judge(dummy_case, candidate_out)
            baseline_scores.append(b_score)
            candidate_scores.append(c_score)

        delta = statistics.mean(candidate_scores) - statistics.mean(baseline_scores)
        stat, p_value = wilcoxon(candidate_scores, baseline_scores)
        significant = p_value < 0.05

        return {
            "baseline_mean": statistics.mean(baseline_scores),
            "candidate_mean": statistics.mean(candidate_scores),
            "delta": delta,
            "p_value": p_value,
            "statistically_significant": significant,
            "recommendation": "promote" if (significant and delta > 0) else "hold",
        }

    # ── Layer 3: Production monitoring alert check ────────────────────────────
    def check_for_regression(
        self,
        rolling_mean_now: float,
        rolling_mean_previous_window: float,
    ) -> dict:
        """Compare two rolling-window means and flag if drift exceeds threshold."""
        drop = rolling_mean_previous_window - rolling_mean_now
        alert = drop >= self.regression_alert_delta
        return {
            "alert": alert,
            "drop": drop,
            "message": (
                f"⚠️ Regression detected: mean score dropped {drop:.2f} points."
                if alert else "✅ No significant regression detected."
            ),
        }

This class is intentionally lean. In practice you'd replace the scipy.stats call with a more robust statistical framework, add async execution for the judge calls, and persist results to a data store. But the structure mirrors the three-layer pyramid exactly: run_ci is Layer 1, run_shadow_eval is Layer 2, and check_for_regression is Layer 3.
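If you'd rather avoid the scipy dependency entirely, one robust stdlib-only replacement for the significance step is a paired sign-flip permutation test on the per-request score deltas. A sketch under the assumption that paired (baseline, candidate) scores have already been collected; the delta values are invented:

```python
import random
from statistics import mean

def paired_permutation_pvalue(deltas: list[float], n_perm: int = 10_000,
                              seed: int = 0) -> float:
    """Two-sided p-value: how often random sign flips of the deltas produce a
    mean at least as extreme as observed (null: no version difference)."""
    rng = random.Random(seed)
    observed = abs(mean(deltas))
    hits = 0
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in deltas]
        if abs(mean(flipped)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids p == 0

# Invented candidate-minus-baseline judge-score deltas
deltas = [0.5, 0.3, 0.0, 0.4, 0.2, 0.1, 0.3, -0.1, 0.2, 0.4]
print(f"p ≈ {paired_permutation_pvalue(deltas):.4f}")
```

The permutation test makes no distributional assumptions about judge scores, which is useful given how lumpy 1–5 rubric scores tend to be.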


The next snippet shows a minimal judge calibration helper—the step most teams skip and later regret:

## judge_calibration.py
## Measures agreement between LLM judge scores and human scores.
## Run this any time you update your rubric or switch judge models.

from scipy.stats import spearmanr  # type: ignore

def calibrate_judge(
    judge_fn,
    calibration_cases: list[dict],  # Each dict: {"case": EvalCase, "actual": str, "human_score": float}
    min_acceptable_rho: float = 0.65,
) -> dict:
    """
    Computes Spearman correlation between judge scores and human scores.
    Returns calibration report with a pass/fail flag.
    """
    judge_scores = []
    human_scores = []

    for item in calibration_cases:
        score, _ = judge_fn(item["case"], item["actual"])
        judge_scores.append(score)
        human_scores.append(item["human_score"])

    rho, p_value = spearmanr(judge_scores, human_scores)
    passed = rho >= min_acceptable_rho

    return {
        "spearman_rho": rho,
        "p_value": p_value,
        "n_samples": len(calibration_cases),
        "passed": passed,
        "message": (
            f"✅ Judge calibrated: ρ={rho:.2f} (threshold {min_acceptable_rho})"
            if passed
            else f"⚠️ Judge FAILED calibration: ρ={rho:.2f} — revise rubric before use."
        ),
    }

💡 Pro Tip: Gate your CI pipeline on the calibration check as well as the golden-set scores. If a rubric update causes the judge to drift from human agreement, you want that surfaced immediately—not three weeks later when you notice CI scores are trending up while user satisfaction is trending down.



Where This Connects in the Roadmap

Eval pipelines don't exist in isolation—they're the foundation for a broader feedback loop that makes your agent progressively better over time. Three topics in the roadmap build directly on what you've learned here:

🔧 Feedback Loops and Data Flywheels: The production samples you collect in Layer 3 are raw training signal. The next lesson covers how to route low-scoring sampled responses into a human review queue and convert them into new golden-set cases, closing the loop between monitoring and ground-truth maintenance.

🎯 Fine-Tuning Agents on Eval Failures: Once you have a labeled set of failures—cases where your judge scored below threshold and a human confirmed the failure—you have the seed dataset for supervised fine-tuning. The fine-tuning lesson will show you how to structure that data and measure improvement against the same eval pipeline you've just built.

📚 Automated Prompt Optimization (DSPy and Similar Frameworks): Tools like DSPy treat your eval pipeline as an optimization objective. Your LLM judge becomes the loss function, and the framework automatically searches the space of prompt variants to maximize it. The automated prompt optimization lesson assumes you already have a reliable, calibrated judge—which you now do.

🧠 Mnemonic: Think of the eval pipeline as a GPS system for your agent: golden sets are the known landmarks, the judge is the turn-by-turn voice, shadow deployments are the test drive before the road trip, and production monitoring is the live traffic update. Remove any one of them and you're navigating blind on at least one leg of the journey.


Final Critical Points to Remember

⚠️ A calibrated judge with a mediocre rubric will confidently give you wrong scores at scale. Invest in the rubric before you invest in the automation.

⚠️ Statistical significance is not optional for A/B decisions. A delta that feels meaningful can be noise. Always pre-compute your required sample size before running the experiment, not after.

⚠️ Your golden set is a living artifact. If it doesn't reflect the current version of your product and your current user base, the CI signal it produces is misleading—potentially in the direction of false confidence.

💡 Real-World Example: A team running a customer-support agent found that their CI scores held steady at 4.1/5 across six months of development—but their user satisfaction scores dropped from 87% to 71% over the same period. Investigation revealed that the golden set had been authored before the product's scope expanded, so it tested only the original narrow use case. New capabilities were never covered. The lesson: when your eval scores look suspiciously stable while user feedback deteriorates, audit your golden set coverage first.


Practical Next Steps

Here are three concrete actions you can take within the next week to start applying what you've learned:

  1. Audit your current testing strategy. If you have no golden set, spend two hours writing 20 input/output pairs for your agent's single most important capability. That's Layer 1, minimal viable, and it's immediately more coverage than you had before.

  2. Run a calibration check on any judge you're already using. Collect 50 agent outputs, score them yourself, then score them with your judge. Compute Spearman rho. If it's below 0.6, your judge is a coin flip with extra steps—fix the rubric before trusting the scores.

  3. Add one production monitoring metric to your observability stack. Even without a full sampling pipeline, you can start by logging judge scores for every request that contains explicit negative user feedback (thumbs down, complaint keyword, escalation). That's a biased sample, but it's directional signal you can act on today.

The eval pipeline is not a gate you build once and walk away from. It's infrastructure that earns its value through continuous operation, continuous calibration, and continuous expansion as your agent's capabilities grow. The teams that ship reliable agents in production are the ones that treat evaluation with the same rigor they bring to the agent itself.