Why Rigorous Eval Exists
Classical metrics broke down, human eval doesn't scale, and the cost of being wrong is the organizing principle that determines how much rigor your eval pipeline actually needs.
The Evaluation Gap: Why LLM Outputs Resist Simple Scoring
Imagine you've just shipped a new version of your LLM-powered customer support assistant. Your team spent two weeks tuning the prompt, switched to a newer model checkpoint, and ran it against a handful of test cases. Everything looked fine. Two days after deployment, a senior engineer notices the assistant is now giving subtly evasive answers to billing questions — not wrong exactly, but unhelpful in a way that's driving up escalations. Nobody caught it because nobody had a systematic way to catch it. This is the evaluation gap: the distance between "we looked at some outputs and they seemed okay" and "we have principled evidence that this system behaves as intended across the inputs that actually matter."
Closing that gap requires understanding why LLM outputs are structurally harder to evaluate than the outputs of the software most engineers have spent their careers building.
The Deterministic Contract That LLMs Break
Traditional software operates under a contract that most engineers internalize so thoroughly they forget it's a contract at all: given the same inputs, the system produces the same output, and that output either satisfies the specification or it doesn't. A sorting function either returns a correctly sorted list or it doesn't. A payment processor either charges the right amount or it doesn't.
This deterministic, binary structure makes evaluation straightforward in principle. You write a test suite, enumerate the expected outputs, and run a pass/fail check.
# Traditional software evaluation: deterministic and binary
def sort_integers(nums: list[int]) -> list[int]:
    return sorted(nums)

def test_sort_integers():
    assert sort_integers([3, 1, 2]) == [1, 2, 3]     # exactly one correct answer
    assert sort_integers([]) == []                   # empty input, one correct answer
    assert sort_integers([-1, 0, 1]) == [-1, 0, 1]   # negative numbers, one answer
This works because the space of correct outputs has exactly one member for each input. LLMs shatter this assumption. Ask a language model to "summarize this customer complaint in two sentences" and there are thousands of valid outputs. A test that checks for any specific string will fail on every valid output except the one it was written against.
# Why naive string matching fails for LLM outputs
expected = "The customer is unhappy with the shipping delay and wants a refund."
# All three are valid summaries; all would fail an exact-match test
output_a = "Customer reports dissatisfaction due to delayed shipment and is requesting reimbursement."
output_b = "A shipping delay has left this customer frustrated and seeking a refund."
output_c = "The customer's complaint centers on a late delivery; they are asking for their money back."
assert output_a == expected # ❌ fails
assert output_b == expected # ❌ fails
assert output_c == expected # ❌ fails
This isn't a quirk to engineer around — it's a fundamental property of open-ended text generation. Evaluating within that space requires a richer framework than pass/fail matching.
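One way to see the difference is to swap the exact-match assertion for a graded score. The sketch below uses a token-overlap F1, chosen here only because it is simple; the function name `token_f1` and the lowercase regex tokenization are illustrative assumptions, not a recommended metric. It gives the three valid summaries partial credit instead of a flat failure.

```python
import re
from collections import Counter

def token_f1(output: str, reference: str) -> float:
    """Graded overlap score in [0, 1]; 1.0 only when the token multisets match."""
    out_tokens = re.findall(r"[a-z0-9']+", output.lower())
    ref_tokens = re.findall(r"[a-z0-9']+", reference.lower())
    if not out_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: each shared token counts at most as often as it appears in both
    overlap = sum((Counter(out_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

expected = "The customer is unhappy with the shipping delay and wants a refund."
candidates = {
    "output_a": "Customer reports dissatisfaction due to delayed shipment and is requesting reimbursement.",
    "output_b": "A shipping delay has left this customer frustrated and seeking a refund.",
    "output_c": "The customer's complaint centers on a late delivery; they are asking for their money back.",
}
for name, text in candidates.items():
    print(f"{name}: {token_f1(text, expected):.2f}")
```

Each summary now lands somewhere between 0 and 1, but equally valid paraphrases receive noticeably different scores, which is a first hint that surface overlap is itself a blunt instrument.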
Quality Has No Single Number
The difficulty compounds when you recognize that LLM output quality isn't a single property — it's a bundle of distinct, often partially independent dimensions that resist collapsing into one score.
Consider a prompt asking a model to answer a user's medical question in a reassuring tone without providing specific diagnostic advice. A given output might be fluent, factually accurate, task-complete, tone-appropriate, and safe — but these dimensions interact without reducing to each other. An output can be fluent and factually accurate but tonally wrong. It can be appropriately cautious but so vague it doesn't address the question.
Output Quality Dimensions (illustrative, not exhaustive)
                 FLUENCY
                    │
      ┌─────────────┼─────────────┐
      │             │             │
   FACTUAL        TASK          TONE
  ACCURACY     COMPLETION  APPROPRIATENESS
      │             │             │
      └─────────────┼─────────────┘
                    │
                 SAFETY
One number hides which dimension drove the score.
A 72% might mean "great but slightly off-tone"
or "fluent but factually broken": very different problems.
When you assign a single score to such an output, you're making an implicit weighting decision about which dimensions matter most — and that decision is a design choice, whether you make it consciously or not.
💡 Real-World Example: A document translation system might achieve a high token-overlap score against a reference translation while failing to preserve the register of the original — translating formal legal language into casual phrasing. The aggregate score looks healthy; the output is unusable for its intended purpose.
Evaluation Is a Design Decision
Choosing what to measure is a design decision with consequences, not a neutral technical task. Because you can't measure everything simultaneously with equal weight, every eval pipeline embeds a set of priorities — explicit or not. And because modern LLM development is iterative, the metrics you choose shape what the system gets good at.
🎯 Key Principle: Measurement creates pressure. If your eval pipeline only measures fluency and task completion, your optimization iterations will improve fluency and task completion. Factual accuracy and safety will drift — not because anyone decided to deprioritize them, but because they weren't being watched.
This is Goodhart's Law in its practical form: when a measure becomes a target, it ceases to be a good measure. The design decision has three components:
- What to measure — which quality dimensions your eval operationalizes
- How to weight them — since any aggregate score requires weighting
- What test inputs to use — since the distribution of your eval set determines what "good performance" actually means
# The same outputs, three different eval functions, three different conclusions
outputs = [
    {"response": "Your order will arrive in 3-5 days.", "is_factual": True, "tone": "neutral", "complete": True},
    {"response": "We totally understand your frustration and we're SO sorry!", "is_factual": True, "tone": "overly_casual", "complete": False},
    {"response": "Shipments typically resolve within the standard window.", "is_factual": True, "tone": "appropriate", "complete": False},
]

def eval_factual_only(outputs):
    return sum(o["is_factual"] for o in outputs) / len(outputs)

def eval_completion_only(outputs):
    return sum(o["complete"] for o in outputs) / len(outputs)

def eval_weighted(outputs):
    scores = []
    for o in outputs:
        score = (0.4 * o["is_factual"] +
                 0.4 * o["complete"] +
                 0.2 * (1 if o["tone"] == "appropriate" else 0))
        scores.append(score)
    return sum(scores) / len(scores)

print(f"Factual-only score: {eval_factual_only(outputs):.2f}")        # 1.00, looks perfect
print(f"Completion-only score: {eval_completion_only(outputs):.2f}")  # 0.33, looks broken
print(f"Weighted score: {eval_weighted(outputs):.2f}")                # 0.60, more honest
The same three outputs look anywhere from perfect to broken depending solely on what you measure. The weights themselves are a design decision that should be made deliberately, documented, and revisited as the system's context evolves.
The Default: Ad-Hoc Spot-Checking and Its Failure Mode
With this structural complexity in place, it's worth being concrete about what happens when teams don't build deliberate eval pipelines. The default isn't chaos — it's something more insidious: ad-hoc spot-checking, which produces the feeling of validation without the substance.
Spot-checking typically looks like this: a developer makes a prompt change, runs it against five to ten examples they happen to think of, eyeballs the outputs, decides things look better, and ships. This process has several systematic failure modes.
Availability bias in test case selection. The examples a developer thinks of are almost always clear, well-formed, happy-path cases. Edge cases that actually break systems — unusual phrasings, adversarial queries, inputs that combine features in unexpected ways — are precisely the ones hard to think of in the moment.
No regression detection. Without a stable, versioned test suite, you have no way to know whether a change that improves performance on the examples you tested degrades it elsewhere. The evasive billing answers in the scenario that opened this section are exactly the kind of regression a systematic eval suite is designed to catch.
Inconsistency across sessions. Human judgment about output quality varies across sessions, evaluators, and fatigue levels. Without a consistent scoring function, the evaluation result depends partly on who ran it and when.
No audit trail. Ad-hoc spot-checking leaves no persistent record of what was tested, what the outputs were, or what judgment was applied.
Ad-hoc spot-checking over time:
  Sprint 1        Sprint 2        Sprint 3
┌──────────┐    ┌──────────┐    ┌──────────┐
│ Test:    │    │ Test:    │    │ Test:    │
│ 5 cases  │    │ 7 cases  │    │ 4 cases  │
│ (diff.)  │    │ (diff.)  │    │ (diff.)  │
└──────────┘    └──────────┘    └──────────┘
      │               │               │
      ▼               ▼               ▼
"Looks good"    "Looks fine"    "Seems okay"
Result: no comparability, no regression detection,
        no audit trail, no coverage guarantees.
Systematic eval pipeline over time:
  Sprint 1        Sprint 2        Sprint 3
┌──────────┐    ┌──────────┐    ┌──────────┐
│ Suite:   │    │ Suite:   │    │ Suite:   │
│ 200 cases│───▶│ 200 cases│───▶│ 200 cases│
│ (fixed)  │    │ (fixed)  │    │ (fixed)  │
└──────────┘    └──────────┘    └──────────┘
      │               │               │
      ▼               ▼               ▼
score: 0.81     score: 0.79     score: 0.84
                      ▲
            Regression detected!
          Investigate before ship.
💡 Mental Model: Think of ad-hoc spot-checking as sampling from a distribution you're actively biasing toward cases you already know work. A deliberate eval pipeline is designed to sample from the distribution of inputs your system will actually encounter. The gap between these two distributions is where production failures hide.
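The regression check that a fixed suite makes possible can be sketched in a few lines. The helper name `detect_regressions` and the 0.01 tolerance are assumptions made for illustration; in practice the tolerance should reflect the run-to-run noise of your own metric.

```python
def detect_regressions(
    history: list[tuple[str, float]],
    tolerance: float = 0.01,
) -> list[str]:
    """Return labels of runs whose score fell more than `tolerance` below the previous run."""
    flagged = []
    # Pair each run with the one immediately before it
    for (_, prev_score), (label, score) in zip(history, history[1:]):
        if prev_score - score > tolerance:
            flagged.append(label)
    return flagged

# Scores from the same fixed suite across three sprints
history = [("sprint-1", 0.81), ("sprint-2", 0.79), ("sprint-3", 0.84)]
print(detect_regressions(history))  # ['sprint-2']
```

The comparison is only meaningful because every score comes from the same suite; with a different handful of cases each sprint there is nothing to subtract.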
Why This Demands a Discipline, Not a Tool
The evaluation gap isn't closed by picking the right library or metric. It's closed by treating evaluation as a first-class engineering discipline — a set of principled practices that sit alongside system development, not downstream of it.
This means eval is designed before or during system development, not after. The test set is treated as a versioned artifact with the same rigor as the model itself. The scoring function is a choice that gets documented and revisited. Results are stored in a way that enables historical comparison.
The stakes vary enormously by context. An internal productivity tool where the cost of a bad output is mild user friction demands a different level of eval rigor than a system providing health information where a wrong answer can cause real harm. The principle that the cost of being wrong determines how much rigor your eval pipeline needs is the organizing logic of this entire roadmap — and it starts here, with understanding why the gap exists in the first place.
📋 Quick Reference: Traditional Software vs. LLM Evaluation
| | Traditional Software | LLM Systems |
|---|---|---|
| Output space | Finite, well-defined | Effectively infinite |
| Correctness | Binary (pass/fail) | Graded, multi-dimensional |
| Test method | Exact-match assertion | Scoring function + judgment |
| Coverage | Enumerate cases | Sample from distribution |
| Metric design | Often implicit in spec | Explicit design decision |
| Regression detection | Green/red test suite | Score comparison over time |
With this foundation in place, the next section defines what rigorous evaluation actually looks like in practice — moving from the problem to the engineering properties a well-designed eval pipeline must have.
What Rigorous Eval Actually Means in Practice
Informal testing and rigorous evaluation can look identical from the outside: both involve running a model on some inputs and inspecting the outputs. The difference is entirely structural. Informal testing tells you whether the model does something reasonable on the examples you happened to try. Rigorous evaluation tells you whether the model meets a defined standard across the range of inputs it will encounter in production — and whether that standard is still met after the next change.
Rather than listing abstract virtues, it helps to think of the four core properties as failure modes in reverse: each property exists because a specific class of errors is common and costly when that property is missing.
| Informal testing | Rigorous eval |
|---|---|
| Ad-hoc inputs | Pinned, versioned test set |
| Run once, eyeballed | Reproducible: same result each run |
| Team's memorable cases | Coverage: samples real distribution |
| "Looks about the same" | Sensitivity: detects regressions |
| No baseline to compare | Comparability: stable reference point |
Reproducibility
Reproducibility means that running the same eval suite twice on the same model checkpoint produces the same result. The sources of variation that break it in practice are more numerous than they appear.
Prompt drift: if the prompts in your eval are stored without version control, they will silently change. A paraphrase that seems equivalent often isn't — different phrasing elicits different behavior, and what looks like a quality improvement may just be an unrecorded prompt tweak.
Test set mutation: adding examples changes what the score measures. Removing examples is worse, because it's often motivated by examples the model fails on.
Inference randomness: most inference APIs sample from the model's output distribution by default, so the same prompt can produce different outputs on different runs. The fix is to pin inference parameters: set temperature to 0 for greedy decoding, or fix a random seed when sampling is required.
⚠️ Common Mistake: Assuming temperature=0 guarantees identical outputs. It does so only when the underlying inference implementation is itself deterministic; floating-point non-determinism across hardware and library versions can still introduce small variations. For the highest reproducibility requirements, log the full output text alongside scores rather than relying on score reproducibility alone.
Together, pinned prompts, pinned test sets, and pinned inference parameters define the minimum reproducibility contract.
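One way to make that contract checkable is to bundle the pinned pieces into a single object and fingerprint it. This is a minimal sketch: the `EvalConfig` class, its field names, and the example hash value are all hypothetical, and a real pipeline would likely pin more (model revision, stop sequences, and so on).

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical container for everything that must stay pinned between runs."""
    prompt_template: str
    dataset_sha256: str
    temperature: float = 0.0
    seed: Optional[int] = None
    max_tokens: int = 256

    def fingerprint(self) -> str:
        # sort_keys gives a canonical serialization, so equal configs hash equally
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

base = EvalConfig(prompt_template="Summarize in two sentences:\n{text}", dataset_sha256="3f2a9c0d1b7e")
same = EvalConfig(prompt_template="Summarize in two sentences:\n{text}", dataset_sha256="3f2a9c0d1b7e")
tweaked = EvalConfig(prompt_template="Summarise in two sentences:\n{text}", dataset_sha256="3f2a9c0d1b7e")

print(base.fingerprint() == same.fingerprint())     # True
print(base.fingerprint() == tweaked.fingerprint())  # False: a one-letter prompt edit is a new config
```

Storing the fingerprint next to every score turns silent prompt drift into a visible mismatch.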
Coverage
Coverage is the property that your test set samples the actual input distribution your system will face in production — not the inputs that are easy to construct, memorable to the team, or representative of the happy path.
The failure mode is convenience sampling: test sets built by asking engineers to write examples cluster around clean, well-formed inputs where the task is unambiguous and the model is likely to succeed. This produces optimistic scores that don't predict real-world performance.
🎯 Key Principle: Coverage is not about having more examples — it's about having the right examples. A test set of 50 examples that faithfully samples the production distribution is more useful than 500 examples drawn from a narrow, atypical slice of it.
Achieving good coverage requires sampling from logs of actual usage, stratifying across meaningful dimensions (input length, topic, edge-case categories), and explicitly including failure modes discovered during earlier evaluations.
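The stratification step can be sketched as follows, assuming usage logs are available as dictionaries with a categorical field to stratify on. The `stratified_sample` helper, the `topic` field, and the log records are invented for illustration. Equal allocation per stratum is only one option: it deliberately over-represents rare categories, whereas proportional allocation mirrors production frequencies more closely.

```python
import random
from collections import defaultdict

def stratified_sample(logs: list[dict], key: str, per_stratum: int, seed: int = 0) -> list[dict]:
    """Draw up to `per_stratum` records from each value of `key`, reproducibly."""
    rng = random.Random(seed)  # local RNG so the draw does not depend on global state
    buckets = defaultdict(list)
    for record in logs:
        buckets[record[key]].append(record)
    sample = []
    for stratum in sorted(buckets):  # sorted for a stable iteration order
        group = buckets[stratum]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

# Hypothetical production logs: billing dominates, returns are rare
logs = (
    [{"id": f"b{i}", "topic": "billing"} for i in range(50)]
    + [{"id": f"s{i}", "topic": "shipping"} for i in range(30)]
    + [{"id": f"r{i}", "topic": "returns"} for i in range(3)]
)
test_set = stratified_sample(logs, key="topic", per_stratum=5, seed=42)
print(len(test_set))  # 13: five billing, five shipping, all three returns
```

Fixing the seed matters here for the same reason it matters at inference time: the test set itself has to be reproducible before it can be pinned.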
Sensitivity
Sensitivity is the property that your metric produces meaningfully different scores when the model's output quality changes meaningfully. A metric that returns 0.91 for an excellent model and 0.89 for a model with a systematic factual error has failed at its primary job.
Sensitivity failures are common because metrics are often chosen for convenience. Token-overlap metrics can be insensitive to factual errors if the factually wrong output shares enough surface-level vocabulary with the reference.
A practical way to validate sensitivity: deliberately inject a systematic error into a small subset of outputs, then check whether your metric score drops proportionally. If a 20% injection of clearly wrong answers moves your score by less than a percentage point, the metric is not sensitive enough.
⚠️ Common Mistake: Aggregating scores into a single number too early can mask sensitivity. A model that fails catastrophically on one input subcategory but performs well on others may show an acceptable aggregate score. Reporting per-category breakdowns alongside the aggregate is a practical guard against this.
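A small sketch makes the masking effect concrete. The category names and scores below are invented for illustration, and `score_by_category` is a hypothetical helper rather than part of any library.

```python
from collections import defaultdict

def score_by_category(results: list[dict]) -> dict[str, float]:
    """Mean score per category, so one failing slice cannot hide in the aggregate."""
    grouped = defaultdict(list)
    for row in results:
        grouped[row["category"]].append(row["score"])
    return {category: sum(scores) / len(scores) for category, scores in grouped.items()}

# 100 scored examples: two categories pass everything, one fails everything
results = (
    [{"category": "shipping", "score": 1.0}] * 45
    + [{"category": "billing", "score": 1.0}] * 45
    + [{"category": "refund_policy", "score": 0.0}] * 10
)

aggregate = sum(row["score"] for row in results) / len(results)
print(f"aggregate: {aggregate:.2f}")  # 0.90, looks acceptable
for category, score in score_by_category(results).items():
    print(f"{category}: {score:.2f}")  # refund_policy: 0.00
```

The aggregate of 0.90 and the refund_policy score of 0.00 describe the same run; only the breakdown shows that one slice of users gets a wrong answer every time.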
Comparability
Comparability means that a score produced today can be meaningfully compared to a score produced last month. Without it, you cannot answer the most important operational question: is the system getting better or worse over time?
Comparability requires two things. First, stable baselines: a fixed reference point against which new scores are measured. Second, versioned eval artifacts: every component of the eval — the test set, the prompts, the scoring logic, the inference configuration — must be version-controlled and tagged alongside the score it produces.
💡 Mental Model: Think of an eval result not as a number but as a triple: (score, model_state, eval_state). Two scores are comparable only if their eval_state components are identical. If anything in the eval pipeline changed between runs, you are not measuring the same thing.
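The triple can be enforced mechanically by refusing to subtract scores whose eval state differs. This is a sketch under assumptions: the field names in `EVAL_STATE_KEYS` and the run records are hypothetical, so substitute whatever your own run metadata stores.

```python
# Hypothetical field names; use whatever your run records actually store.
EVAL_STATE_KEYS = ("dataset_hash", "prompt_version", "scorer_version", "inference_params")

def eval_state_diff(run_a: dict, run_b: dict) -> list[str]:
    """Return the eval-state fields that differ between two run records."""
    return [key for key in EVAL_STATE_KEYS if run_a.get(key) != run_b.get(key)]

def score_delta(run_a: dict, run_b: dict) -> float:
    """Score change from run_a to run_b, refusing to compare across eval states."""
    changed = eval_state_diff(run_a, run_b)
    if changed:
        raise ValueError(f"Runs are not comparable; eval state differs in: {changed}")
    return run_b["score"] - run_a["score"]

last_month = {"score": 0.81, "model_id": "ckpt-17", "dataset_hash": "3f2a9c0d1b7e",
              "prompt_version": "v3", "scorer_version": "v1", "inference_params": {"temperature": 0}}
today = {**last_month, "score": 0.84, "model_id": "ckpt-21"}
mutated = {**today, "dataset_hash": "9e8d7c6b5a4f"}  # test set changed between runs

print(f"{score_delta(last_month, today):+.2f}")  # +0.03: only the model changed
print(eval_state_diff(last_month, mutated))       # ['dataset_hash']
```

Note that `model_id` is deliberately left out of the eval-state keys: the model is the thing being compared, so it is allowed to change.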
A Minimal Reproducible Eval Harness
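The four properties above translate directly into engineering decisions in the code that runs your evaluations. The following harness is a minimal sketch that builds reproducibility and comparability into the run itself; coverage depends on what goes into the dataset file, and sensitivity is checked with the mock models that follow.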
import hashlib
import json
import time
from pathlib import Path
from typing import Callable

def load_pinned_dataset(path: str) -> tuple[list[dict], str]:
    """
    Load a fixed test set from a JSON-lines file.
    Computes a content hash so the caller can detect accidental mutation.
    Each record: {"id": str, "prompt": str, "expected": str}
    """
    raw_bytes = Path(path).read_bytes()
    content_hash = hashlib.sha256(raw_bytes).hexdigest()[:12]
    records = [
        json.loads(line)
        for line in raw_bytes.decode("utf-8").splitlines()
        if line.strip()
    ]
    return records, content_hash

def score_exact_match(output: str, expected: str) -> float:
    """Simple deterministic scorer: 1.0 if outputs match exactly, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_eval(
    dataset_path: str,
    model_fn: Callable[[str], str],
    scorer_fn: Callable[[str, str], float],
    model_id: str,
    prompt_version: str,
    output_path: str,
) -> dict:
    """
    Full eval run: load dataset, run inference, score outputs, log results.
    Returns a summary dict with aggregate score and per-example breakdowns.
    """
    records, dataset_hash = load_pinned_dataset(dataset_path)
    if not records:
        # Guard against dividing by zero on an empty or blank dataset file
        raise ValueError(f"No records found in {dataset_path}")

    per_example = []
    for record in records:
        output = model_fn(record["prompt"])
        score = scorer_fn(output, record["expected"])
        per_example.append({
            "id": record["id"],
            "prompt": record["prompt"],
            "expected": record["expected"],
            "output": output,  # raw output stored for retroactive re-scoring
            "score": score,
        })

    aggregate_score = sum(e["score"] for e in per_example) / len(per_example)
    result = {
        "metadata": {
            "model_id": model_id,
            "prompt_version": prompt_version,
            "dataset_path": dataset_path,
            "dataset_hash": dataset_hash,  # detects test set mutation
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        },
        "aggregate_score": round(aggregate_score, 4),
        "n_examples": len(per_example),
        "per_example": per_example,
    }
    Path(output_path).write_text(json.dumps(result, indent=2))
    return result
A few design decisions deserve explicit explanation because each maps to one of the four properties.
The content hash on the dataset (dataset_hash) enforces reproducibility and comparability. If the test set file changes between runs, the hash differs, and the two scores are no longer directly comparable.
Storing raw outputs alongside scores supports retroactive re-scoring. If you later adopt a better scoring function, you can re-score the stored outputs without re-running inference — important when inference is expensive.
The metadata block — model ID, prompt version, dataset hash, and timestamp — is the provenance record that makes comparability possible.
# Mock model: replace with your actual API call in practice.
# A real implementation would pass temperature=0 or a fixed seed.
def mock_model(prompt: str) -> str:
    responses = {
        "What is 2+2?": "4",
        "Translate 'hello' to French.": "Bonjour",
        "What color is the sky?": "blue",
    }
    return responses.get(prompt, "I don't know.")

# Sensitivity validation: inject a known regression, check the score drops.
def mock_regressed_model(prompt: str) -> str:
    """Simulates a model with a systematic error on arithmetic prompts."""
    if "2+2" in prompt:
        return "5"  # wrong answer injected
    return mock_model(prompt)

# If the score drop is proportional to the injection rate,
# the metric is sensitive. If the score barely moves, it is not.
This sensitivity validation pattern — run the eval on a deliberately degraded model and check that the score drops accordingly — is a practical sanity check you can apply to any metric before relying on it for production decisions.
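Run end to end, the check looks like the following. To stay self-contained this sketch redefines a small in-memory dataset and its own baseline and regressed models rather than calling the harness; the three prompts and the helper names are illustrative assumptions.

```python
def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0

# Tiny in-memory test set standing in for the pinned dataset file
dataset = [
    {"prompt": "What is 2+2?", "expected": "4"},
    {"prompt": "Translate 'hello' to French.", "expected": "Bonjour"},
    {"prompt": "What color is the sky?", "expected": "blue"},
]

def baseline_model(prompt: str) -> str:
    answers = {row["prompt"]: row["expected"] for row in dataset}
    return answers.get(prompt, "I don't know.")

def regressed_model(prompt: str) -> str:
    # Inject a systematic error on one of the three prompts
    return "5" if "2+2" in prompt else baseline_model(prompt)

def mean_score(model_fn) -> float:
    scores = [exact_match(model_fn(row["prompt"]), row["expected"]) for row in dataset]
    return sum(scores) / len(scores)

baseline = mean_score(baseline_model)
regressed = mean_score(regressed_model)
print(f"baseline:  {baseline:.2f}")               # 1.00
print(f"regressed: {regressed:.2f}")              # 0.67
print(f"drop:      {baseline - regressed:.2f}")   # 0.33, matching the 1-in-3 injection rate
```

Exact match passes this check trivially because it scores each answer as right or wrong; the pattern earns its keep with softer metrics, where an injected error may barely move the number.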
📋 Quick Reference: The Four Eval Properties
| Property | ❌ Absent | ✅ Present |
|---|---|---|
| Reproducibility | Different score each run on same model | Pinned prompts, test set, inference params |
| Coverage | Test set is the team's happy-path examples | Sampled from real production distribution |
| Sensitivity | Metric doesn't move on real regressions | Validated against injected degradations |
| Comparability | Can't tell if score change is real | Versioned artifacts + stable baseline |
🧠 Mnemonic: RCSC — Reproducibility, Coverage, Sensitivity, Comparability. A result that fails any one of them should not be used to make consequential decisions about the system.
With these four properties defined and illustrated in code, the rest of the roadmap has a stable vocabulary to build on. When a later section argues that human evaluation is the calibration target for automated metrics, the implicit claim is that human judgments, collected carefully, satisfy all four properties — but at a throughput cost that makes them impractical as the primary scoring mechanism for continuous deployment. That tradeoff is the subject of the next section.
Human Evaluation: The Ground Truth Ceiling and Its Scaling Problem
Every automated metric in your evaluation pipeline is, at bottom, an attempt to approximate something: the judgment a knowledgeable human would make about whether a model output is actually useful. That relationship — automated metric as approximation of human judgment — is load-bearing. If your automated scores do not correlate with what humans would say, they are measuring the wrong thing, and any system you optimize against them will drift in the wrong direction.
Why Human Judgment Is the Reference Standard
Consider what you are actually asking when you evaluate an LLM output. You are not asking whether the output matches a template. You are asking whether a real person, attempting a real task, would be better served by having received this output than not. That question draws on contextual knowledge, pragmatic inference, and shared norms that are difficult to formalize. No metric can fully capture it from first principles.
This is why human evaluation occupies the ground truth ceiling: the upper bound of evaluation fidelity that all other methods try to approach. When researchers introduce a new automated metric, the standard validation move is to show correlation with human ratings on a held-out set. When two metrics disagree, human judgment is the tiebreaker.
🎯 Key Principle: Every automated evaluation method derives its legitimacy from how well it tracks human judgment. Human eval is not one method among many — it is the calibration target against which other methods are measured.
The implication is immediate: if you have never collected human labels for your task domain, you are operating on faith that your automated metrics capture what matters.
The Throughput Wall
So why not just use human evaluation for everything? The answer becomes concrete the moment you model the throughput arithmetic.
For nuanced multi-criterion evaluation (judging factual accuracy, source fidelity, and tone appropriateness at the same time), throughput can fall well below a hundred outputs per rater per day once you account for ramp-up time, calibration discussions, and sustained cognitive load.
Now consider what is happening on the model side during active development. A single hyperparameter sweep can produce tens of thousands of outputs in an hour. A batch inference job might produce far more.
Human Throughput vs. Model Throughput
Human annotation team
┌─────────────────────────────────────────────────────────┐
│ 5 annotators × ~80 outputs/day = ~400 outputs/day │
│ 400 outputs/day ÷ 24 hours ≈ 17 outputs/hour │
└─────────────────────────────────────────────────────────┘
│
│ gap: orders of magnitude
▼
Model inference
┌─────────────────────────────────────────────────────────┐
│ Batch job: 50,000 outputs in a few hours │
│ Dev iteration: hundreds of candidates per prompt │
│ A/B experiment: multiple variants × large test set │
└─────────────────────────────────────────────────────────┘
Result: human eval can cover a small, curated sample.
It cannot follow every model change in real time.
The throughput gap is structural, not a temporary inconvenience. Human attention is finite and expensive; model inference is parallelizable and cheap. Any evaluation strategy that relies solely on human scoring will lose coverage as soon as development pace increases — precisely when coverage matters most.
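The arithmetic behind the gap is worth writing out once. The figures below are the illustrative ones from the diagram above (5 annotators, roughly 80 outputs per day each, a 50,000-output batch), not measurements; the five-day working week is an added assumption.

```python
annotators = 5
outputs_per_annotator_per_day = 80   # illustrative figure from the diagram
batch_size = 50_000                  # one batch inference job
working_days_per_week = 5            # assumption

team_daily = annotators * outputs_per_annotator_per_day
days_to_label_batch = batch_size / team_daily
weekly_coverage = (team_daily * working_days_per_week) / batch_size

print(f"team throughput: {team_daily} outputs/day")             # 400
print(f"days to label one batch: {days_to_label_batch:.0f}")    # 125
print(f"share of batch labeled in a week: {weekly_coverage:.0%}")  # 4%
```

Under these assumptions a single batch job produces about half a working year of annotation, which is why human labels have to be spent on a curated sample.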
💡 Real-World Example: A team running weekly model updates might commission human evaluation for major checkpoints — say, a candidate release — while running automated evals on every commit. The human eval validates that the automated pipeline tracks quality correctly; the automated pipeline catches regressions before they reach the human evaluation stage. This division of labor is not a compromise; it is the intended architecture.
Inter-Annotator Agreement: The Noise Inside the Gold Standard
Even accepting throughput limits, a natural response is: run human eval on a representative sample, and trust those labels as ground truth. This is reasonable as far as it goes, but it conceals a complication practitioners routinely underestimate: human raters do not agree with each other as often as intuition suggests.
Inter-annotator agreement (IAA) measures the degree to which independent raters assign the same label to the same output. On tasks with crisp, unambiguous criteria, IAA is typically high. On the tasks where LLMs are actually deployed — is this medical summary accurate? is this tone appropriate for a customer support context? — IAA is often lower than teams expect.
The reasons are structural:
- Criterion underspecification. Rating rubrics that seem clear in a training session become ambiguous at edge cases. "Factually accurate" sounds precise until a rater must decide whether a technically true statement that omits a crucial qualifier counts as accurate.
- Background knowledge variance. A statement that a specialist recognizes as subtly wrong may pass a generalist's check.
- Scale interpretation. On a five-point scale, one rater's "3 — acceptable" is another's "4 — good."
🎯 Key Principle: Human labels are noisy. IAA is not a secondary concern to report in an appendix — it is part of the primary result. A label set with low IAA is not ground truth; it is a distribution of opinions that may or may not converge on a reliable signal.
The standard practice is to compute a statistic that accounts for chance agreement — Cohen's kappa for two raters, Fleiss's kappa for multiple raters, or intraclass correlation for continuous scales. A kappa above 0.6 is a useful working target for most NLP annotation tasks, though this threshold is domain-dependent and should be calibrated against the specific task.
# Minimal inter-annotator agreement calculation
from sklearn.metrics import cohen_kappa_score

# Each list holds one rater's labels for the same 10 outputs
# Labels: 1=poor, 2=acceptable, 3=good
rater_a = [3, 2, 3, 1, 2, 3, 3, 2, 1, 3]
rater_b = [3, 2, 2, 1, 3, 3, 3, 2, 1, 2]

# Unweighted kappa treats labels as nominal categories. For ordinal scales like
# this one, weights="linear" or weights="quadratic" penalizes near misses less.
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.3f}")

# Interpretation (a common rule of thumb):
#   0.00–0.20 : slight agreement
#   0.21–0.40 : fair
#   0.41–0.60 : moderate
#   0.61–0.80 : substantial   <-- reasonable target for most NLP tasks
#   0.81–1.00 : near-perfect
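To see what the chance correction actually does, the same statistic can be computed by hand from its definition, kappa = (p_o - p_e) / (1 - p_e). The sketch below reuses the two rater lists from the example above; the helper name `cohen_kappa` is ours, not a library function.

```python
from collections import Counter

def cohen_kappa(labels_a: list[int], labels_b: list[int]) -> tuple[float, float, float]:
    """Return (observed agreement, chance agreement, kappa) for two raters."""
    n = len(labels_a)
    # p_o: fraction of items on which the raters gave the same label
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: agreement expected if each rater labeled at random with their own label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(labels_a) | set(labels_b))
    return p_o, p_e, (p_o - p_e) / (1 - p_e)

rater_a = [3, 2, 3, 1, 2, 3, 3, 2, 1, 3]
rater_b = [3, 2, 2, 1, 3, 3, 3, 2, 1, 2]

p_o, p_e, kappa = cohen_kappa(rater_a, rater_b)
print(f"observed agreement: {p_o:.3f}")  # 0.700
print(f"chance agreement:   {p_e:.3f}")  # 0.360
print(f"kappa:              {kappa:.3f}")  # 0.531
```

The raters agree on 70% of items, but 36% agreement would be expected by chance alone, so kappa lands at about 0.53: "moderate" on the scale above, and below the 0.6 working target.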
⚠️ Common Mistake: Treating IAA measurement as a one-time calibration exercise at the start of an annotation project. Rater drift — gradual shifts in how individuals apply a rubric over time — means IAA should be re-measured at regular intervals throughout a long annotation campaign.
Labeler Fatigue and Systematic Bias Across Sessions
Even a single rater, evaluated in isolation, does not produce perfectly consistent labels across time. Labeler fatigue is the degradation in judgment quality that accumulates over a sustained annotation session: decision quality deteriorates as cognitive resources deplete, and raters compensate by applying simpler heuristics — rating toward the scale midpoint, anchoring on recent examples, or reducing sensitivity to subtle distinctions.
This creates a subtle but dangerous form of bias. If your annotation workflow assigns outputs in a fixed order, systematic patterns in your test set can create systematic patterns in your label noise. A topic cluster that falls late in the annotation queue will be evaluated under cognitive load, and those labels will be less reliable than average.
Beyond within-session fatigue, cross-session inconsistency means a rater who annotates on Monday and resumes on Thursday may have subtly recalibrated their internal standards in the interim. The practical mitigation is to include anchor examples — a fixed set of pre-labeled outputs at the start of each session — so raters recalibrate to a stable reference before beginning new work.
def check_anchor_drift(
    anchor_labels_by_session: dict[str, list[int]],
    expected_labels: list[int],
    tolerance: int = 1,
) -> dict[str, bool]:
    """
    Flag sessions in which any anchor label deviates from its canonical label
    by more than `tolerance`.

    anchor_labels_by_session: {session_id: [label_for_anchor_1, ...]}
    expected_labels: canonical labels for each anchor example
    tolerance: max allowed deviation per label before flagging drift
    """
    drift_report = {}
    for session_id, labels in anchor_labels_by_session.items():
        drifted = any(
            abs(actual - expected) > tolerance
            for actual, expected in zip(labels, expected_labels)
        )
        drift_report[session_id] = drifted
    return drift_report

anchor_labels = {
    "session_2026-01-08": [3, 2, 4],
    "session_2026-01-15": [3, 1, 4],  # second anchor off by 1
    "session_2026-01-22": [2, 2, 3],  # first and third anchors off by 1
}
expected = [3, 2, 4]

# tolerance=0 flags any deviation. With the default tolerance=1, every session
# here would pass, because no anchor is off by more than one point.
report = check_anchor_drift(anchor_labels, expected, tolerance=0)
for session, drifted in report.items():
    status = "⚠️ drift detected" if drifted else "✅ stable"
    print(f"{session}: {status}")
The Architecture That Follows from These Constraints
Human evaluation is simultaneously the highest-fidelity signal available and fundamentally throughput-limited, inherently noisy, and subject to systematic bias from fatigue and cross-session inconsistency. These properties, taken together, produce a specific architectural conclusion: human evaluation should be used to build and validate automated metrics, not to score every production output.
How human eval fits into a rigorous pipeline
┌─────────────────────────────────────────────┐
│ DEVELOPMENT PHASE │
│ │
│ Collect human labels on curated sample │
│ │ │
│ ▼ │
│ Measure IAA → revise rubric if needed │
│ │ │
│ ▼ │
│ Build automated metric (classical or │
│ LLM-as-judge) │
│ │ │
│ ▼ │
│ Validate: does automated metric correlate │
│ with human labels on held-out set? │
└─────────────────────────────────────────────┘
│
│ if correlation is sufficient
▼
┌─────────────────────────────────────────────┐
│ PRODUCTION / ITERATION PHASE │
│ │
│ Run automated metric on every model │
│ version, every commit, every candidate │
│ │ │
│ ▼ │
│ Periodic human re-validation to check │
│ that automated metric has not drifted │
│ from human judgment │
└─────────────────────────────────────────────┘
The human labels collected in the development phase serve two purposes. First, they define what quality means for this specific task and domain — grounded in real judgments rather than assumed. Second, they provide a labeled dataset against which candidate automated metrics can be validated.
This framing makes LLM-as-judge legible as an architectural solution to a concrete bottleneck — a method for approximating human judgment at throughput that human annotation cannot achieve. Understanding the bottleneck precisely is what lets you evaluate proposed solutions critically rather than accepting them on faith.
A Minimal Working Eval Pipeline: From Outputs to Scores
The ideas so far — what rigorous eval means, why human evaluation can't scale — converge on a practical question: what does a working eval pipeline actually look like in code? This section builds one from the ground up, deliberately keeping it simple enough to read in one sitting while preserving the structural properties that make it worth building at all.
The Four-Stage Architecture
Every eval pipeline can be decomposed into four separable stages:
┌─────────────────────────────────────────────────────────────────┐
│ EVAL PIPELINE STAGES │
├──────────────┬──────────────┬──────────────┬───────────────────┤
│ 1. DATASET │ 2. INFERENCE│ 3. SCORING │ 4. LOGGING │
│ LOADING │ │ │ │
│ │ │ │ │
│ Load & │ Send each │ Compare │ Write results │
│ validate │ prompt to │ output to │ to structured │
│ (prompt, │ model; │ expected; │ log with │
│ expected) │ store raw │ produce │ metadata │
│ pairs │ response │ score │ │
└──────────────┴──────────────┴──────────────┴───────────────────┘
│ │ │ │
independently independently independently independently
testable testable testable testable
The word separable is doing real work. When stages are entangled — when your scoring logic is woven directly into your inference loop — a change in one forces you to re-run the other. Keeping stages independent means you can swap a scoring method without re-querying the model, reprocess old logs when requirements change, and write a unit test for each stage against a fixture without standing up a real LLM.
🎯 Key Principle: Separability is not an aesthetic preference. It's what lets you change scoring logic retroactively — a capability that pays dividends the first time your rubric evolves mid-project.
Why Raw Outputs Must Be Stored Separately from Scores
The single most consequential structural decision in an eval pipeline is storing raw model outputs independently of the scores derived from them. When raw outputs are stored, a scoring change is just a reprocessing job over existing logs. You replay the scorer against stored outputs, write new score fields, and compare before/after without touching the model. This is sometimes called decoupled scoring, and it's the foundation that makes iterative rubric development practical.
💡 Real-World Example: Imagine you're evaluating a summarization assistant and initially score outputs with a regex that checks for the presence of a date. You later realize the regex was too strict. With decoupled scoring, you fix the regex and reprocess. Without it, you'd need to re-run hundreds of inference calls and hope the model responds identically.
Implementing a Scorer Interface
A deterministic scorer depends only on the model's response and the expected value: no randomness, no external calls. The two most common variants are exact-match and regex-based scoring. Both should sit behind a single interface so that a model-based scorer can later slot in without restructuring the pipeline.
import re
from typing import Protocol
class Scorer(Protocol):
"""Interface that any scorer — deterministic or model-based — must satisfy."""
def score(self, output: str, expected: str) -> float:
"""Return a score in [0.0, 1.0] for a single output."""
...
class ExactMatchScorer:
"""Passes (1.0) iff the normalized output equals the normalized expected value."""
def __init__(self, normalize: bool = True):
self.normalize = normalize
def _norm(self, text: str) -> str:
return text.strip().lower() if self.normalize else text
def score(self, output: str, expected: str) -> float:
return 1.0 if self._norm(output) == self._norm(expected) else 0.0
class RegexScorer:
"""Passes (1.0) iff the output matches the compiled regex pattern."""
def __init__(self, pattern: str, flags: int = re.IGNORECASE):
self._pattern = re.compile(pattern, flags)
def score(self, output: str, expected: str) -> float: # noqa: ARG002
return 1.0 if self._pattern.search(output) else 0.0
class ModelBasedScorerStub:
"""
    Placeholder for a model-based scorer (e.g., LLM-as-judge).
    Satisfies the Scorer interface structurally, so pipeline code can be
    written and type-checked against it before a real judge exists.
    Calling score() raises until the real implementation is wired in.
"""
def score(self, output: str, expected: str) -> float: # noqa: ARG002
raise NotImplementedError(
"ModelBasedScorerStub is not yet implemented. "
"Wire in a real LLM judge before running production evals."
)
All three classes satisfy the same Scorer protocol. The pipeline doesn't need to know which scorer it holds; it just calls .score(). This is the structural property that lets you swap a deterministic scorer for a model-based one without modifying any other pipeline code.
⚠️ Common Mistake: Designing the deterministic and model-based scorer as two separate code paths in the pipeline rather than as two implementations of one interface. When the time comes to run both scorers on the same dataset for calibration, a unified interface makes this trivial; separate paths make it painful.
A Complete Working Example
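The following function assembles the stages into one runnable pipeline. For brevity, the dataset is passed in as an in-memory list rather than loaded from disk, and inference and scoring run in the same loop; because every raw output is logged, scoring can still be redone later without re-running inference. In a real system, _mock_llm_call would be replaced by an actual API call with a pinned model identifier and deterministic sampling parameters.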
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
Dataset = list[tuple[str, str]]
def _mock_llm_call(prompt: str, model_id: str) -> str: # noqa: ARG001
"""
Stand-in for a real model call. Returns a predictable response
so the pipeline can be tested end-to-end without network access.
"""
last_word = prompt.strip().split()[-1].rstrip("?!").lower()
return last_word
def run_eval(
dataset: Dataset,
scorer: Scorer,
model_id: str = "mock-model-v1",
prompt_version: str = "v1.0",
log_path: Path | None = None,
) -> dict[str, Any]:
"""
Run a complete eval pipeline over a dataset.
Args:
dataset: List of (prompt, expected_output) pairs.
scorer: Any object satisfying the Scorer protocol.
model_id: Identifier for the model being evaluated.
prompt_version: Version tag for the prompt template in use.
log_path: Optional path to a .jsonl file for result logging.
Returns:
A summary dict with 'pass_rate' and 'examples' (per-example breakdowns).
"""
scorer_id = type(scorer).__name__
run_id = str(uuid.uuid4())[:8]
examples: list[dict[str, Any]] = []
log_file = open(log_path, "a", encoding="utf-8") if log_path else None
try:
for prompt, expected in dataset:
raw_output = _mock_llm_call(prompt, model_id) # Stage 2
score = scorer.score(raw_output, expected) # Stage 3
record: dict[str, Any] = { # Stage 4
"run_id": run_id,
"model_id": model_id,
"prompt_version": prompt_version,
"scorer_id": scorer_id,
"timestamp": datetime.now(timezone.utc).isoformat(),
"prompt": prompt,
"expected": expected,
"raw_output": raw_output, # Stored for retroactive rescoring.
"score": score,
}
examples.append(record)
if log_file:
log_file.write(json.dumps(record) + "\n")
finally:
if log_file:
log_file.close()
scores = [ex["score"] for ex in examples]
pass_rate = sum(scores) / len(scores) if scores else 0.0
return {
"run_id": run_id,
"model_id": model_id,
"prompt_version": prompt_version,
"scorer_id": scorer_id,
"pass_rate": pass_rate,
"n_examples": len(examples),
"examples": examples,
}
## Example usage
if __name__ == "__main__":
test_cases: Dataset = [
("What is the capital of France?", "france"),
("What color is the sky?", "sky"),
("What is 2 + 2?", "four"), # mock echoes "2", not "four" — fails
]
scorer = ExactMatchScorer(normalize=True)
results = run_eval(
dataset=test_cases,
scorer=scorer,
model_id="mock-model-v1",
prompt_version="v1.0",
log_path=Path("eval_results.jsonl"),
)
print(f"Pass rate: {results['pass_rate']:.0%}")
for ex in results["examples"]:
status = "✓" if ex["score"] == 1.0 else "✗"
print(f" {status} output={ex['raw_output']!r:10s} expected={ex['expected']!r}")
Notice that eval_results.jsonl contains one JSON record per example, each with raw_output stored verbatim. If you later swap ExactMatchScorer for a RegexScorer or a model-based scorer, you can reload those records and re-score without touching the mock LLM.
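The reprocessing step described above can be sketched as follows. The helper names `rescore_log` and `contains_expected`, and the `previous_score` field, are hypothetical; the sketch assumes only that each log line carries the `raw_output`, `expected`, `score`, and `scorer_id` fields written by the pipeline.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def rescore_log(
    log_path: Path,
    score_fn: Callable[[str, str], float],
    scorer_id: str,
) -> list[dict]:
    """Re-score stored raw outputs; no model call is made."""
    rescored = []
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            # Keep the old values for before/after comparison.
            record["previous_score"] = record.get("score")
            record["previous_scorer_id"] = record.get("scorer_id")
            record["score"] = score_fn(record["raw_output"], record["expected"])
            record["scorer_id"] = scorer_id
            rescored.append(record)
    return rescored


def contains_expected(output: str, expected: str) -> float:
    """Hypothetical looser scorer: pass if the expected string appears anywhere."""
    return 1.0 if expected.lower() in output.lower() else 0.0


# Tiny stand-in log so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "eval_results.jsonl"
    rows = [
        {"prompt": "Capital of France?", "expected": "paris",
         "raw_output": "The capital is Paris.", "score": 0.0,
         "scorer_id": "ExactMatchScorer"},
        {"prompt": "2 + 2?", "expected": "4",
         "raw_output": "five", "score": 0.0,
         "scorer_id": "ExactMatchScorer"},
    ]
    log_path.write_text(
        "\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8"
    )
    rescored = rescore_log(log_path, contains_expected, "contains_expected")

changed = [r for r in rescored if r["score"] != r["previous_score"]]
print(f"{len(changed)} of {len(rescored)} records changed score")
```

Returning new records rather than overwriting the log keeps the original scores available for comparison; whether to persist the rescored records to a new file is left to the caller.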
⚠️ Common Mistake: Writing only summary statistics to the log file instead of per-example records. Summary stats are derived from raw records — they shouldn't replace the per-example records that make retroactive analysis possible.
What This Pipeline Intentionally Leaves Out
This is a simplified picture. A production pipeline would also handle retry logic and timeouts, concurrent inference, scorer error handling, and content-addressed dataset versioning. None of these are conceptually hard, but they would obscure the structural point. The four-stage skeleton is the pattern; those additions are production hardening.
💡 Mental Model: Think of the four stages as contracts between modules. Dataset loading promises a list of (prompt, expected) pairs. Inference promises to store raw outputs unchanged. Scoring promises a float in [0.0, 1.0]. Logging promises to persist records in a format that supports reprocessing. As long as each module honors its contract, the internals can change independently.
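The first contract, dataset loading, is the one stage the example above skips by taking an in-memory list. A minimal sketch of what "load and validate" might look like is shown here, assuming a JSONL file with `prompt` and `expected` string fields; the function name `load_and_validate` and the specific checks (non-empty strings, no duplicate prompts) are illustrative choices, not requirements from the pipeline above.

```python
import json
import tempfile
from pathlib import Path


def load_and_validate(path: Path) -> list[tuple[str, str]]:
    """Load JSONL rows of {"prompt": ..., "expected": ...}, failing loudly."""
    pairs: list[tuple[str, str]] = []
    seen_prompts: set[str] = set()
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            for key in ("prompt", "expected"):
                value = row.get(key)
                if not isinstance(value, str) or not value.strip():
                    raise ValueError(f"line {line_no}: missing or empty {key!r}")
            if row["prompt"] in seen_prompts:
                raise ValueError(f"line {line_no}: duplicate prompt")
            seen_prompts.add(row["prompt"])
            pairs.append((row["prompt"], row["expected"]))
    if not pairs:
        raise ValueError("dataset is empty")
    return pairs


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.jsonl"
    good.write_text(
        '{"prompt": "What is 2 + 2?", "expected": "4"}\n'
        '{"prompt": "Capital of France?", "expected": "Paris"}\n',
        encoding="utf-8",
    )
    pairs = load_and_validate(good)

    bad = Path(tmp) / "bad.jsonl"
    bad.write_text('{"prompt": "No answer given"}\n', encoding="utf-8")
    try:
        load_and_validate(bad)
        error = None
    except ValueError as exc:
        error = str(exc)

print(pairs)
print(error)
```

Failing at load time keeps malformed rows from surfacing later as confusing scoring errors.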
The pipeline above uses exact-match scoring — appropriate when the answer space is closed and string equality is a meaningful proxy for correctness. The upcoming sections examine exactly where that assumption breaks down, and how LLM-as-judge fits into the same scorer slot without changing anything else.
Common Eval Pitfalls That Invalidate Results
An eval pipeline that produces a clean number is not the same as an eval pipeline that produces a meaningful number. Each pitfall in this section shares the same underlying structure: a pipeline that runs without errors, logs scores to a file, and returns a result that looks authoritative — but is measuring something other than what you think it's measuring.
Pitfall 1: Test Set Leakage
Test set leakage occurs when examples from your evaluation set appear — in whole or in part — in the data used to fine-tune the model or in the few-shot examples embedded in your prompts. The result is an inflated score that does not reflect the model's ability to generalize to unseen inputs.
Leakage is treacherous because it is invisible at runtime. The pipeline runs, scores are computed, and nothing signals that anything is wrong. A score of 90% on a leaked test set might correspond to a true generalization score of 65% — a difference large enough to change a deployment decision.
import hashlib
def fingerprint(text: str) -> str:
"""Produce a stable hash for deduplication."""
return hashlib.sha256(text.strip().lower().encode()).hexdigest()
def check_for_leakage(
eval_examples: list[dict],
training_examples: list[dict],
few_shot_examples: list[dict],
) -> list[dict]:
"""
Return any eval examples whose prompt fingerprint appears in
the training set or few-shot pool.
Each example dict is expected to have a 'prompt' key.
"""
training_fingerprints = {fingerprint(ex["prompt"]) for ex in training_examples}
few_shot_fingerprints = {fingerprint(ex["prompt"]) for ex in few_shot_examples}
contaminated = training_fingerprints | few_shot_fingerprints
return [ex for ex in eval_examples if fingerprint(ex["prompt"]) in contaminated]
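# Toy data so the check runs on its own; in practice these come from your
# eval set, fine-tuning corpus, and few-shot pool.
eval_set = [{"prompt": "How do I update my billing address?"}]
fine_tune_data = [{"prompt": "how do I update my billing address?  "}]
few_shot_pool = [{"prompt": "What is your refund policy?"}]
leaked_items = check_for_leakage(eval_set, fine_tune_data, few_shot_pool)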
if leaked_items:
print(f"⚠️ {len(leaked_items)} eval examples found in training or few-shot data.")
print("Remove these before scoring.")
This check uses exact-match fingerprinting, which catches verbatim copies. Near-duplicate leakage — paraphrased versions of the same scenario — requires embedding-based similarity search. Exact fingerprinting is a necessary first step, not a complete solution.
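Between exact fingerprinting and embedding search there is a cheaper intermediate check worth sketching: word-shingle Jaccard similarity. To be clear, this is a swapped-in technique, not the embedding-based search mentioned above. It catches lightly edited copies (added words, changed punctuation) but not true paraphrases. The function names and the 0.5 threshold are assumptions for illustration.

```python
import re


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Set of word n-grams after lowercasing and stripping punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Intersection over union; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicates(
    eval_prompts: list[str],
    other_prompts: list[str],
    threshold: float = 0.5,  # assumed cutoff; tune on known duplicates
) -> list[tuple[str, str, float]]:
    """Return (eval_prompt, other_prompt, similarity) for pairs at or above threshold."""
    other_sets = [(p, shingles(p)) for p in other_prompts]
    hits = []
    for eval_prompt in eval_prompts:
        eval_set = shingles(eval_prompt)
        for other_prompt, other_set in other_sets:
            similarity = jaccard(eval_set, other_set)
            if similarity >= threshold:
                hits.append((eval_prompt, other_prompt, similarity))
    return hits


eval_prompts = [
    "How do I reset my account password?",
    "What is your refund policy?",
]
training_prompts = [
    "How do I reset my account password, please?",  # lightly edited copy
    "Where can I download my invoices?",
]
hits = near_duplicates(eval_prompts, training_prompts)
for eval_prompt, other_prompt, similarity in hits:
    print(f"{similarity:.2f}  {eval_prompt!r} ~ {other_prompt!r}")
```

The pairwise loop is quadratic, which is acceptable for small eval sets; larger corpora would need an index of some kind.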
🎯 Key Principle: Assemble the eval set before making any decisions about training data or few-shot examples. If the eval set is built first and treated as read-only, every later data choice can be checked against it, and the contamination risk drops substantially.
Pitfall 2: Metric-Task Mismatch
Metric-task mismatch happens when the metric you apply is structurally incompatible with the kind of correctness the task requires. The most common instance is applying token-overlap metrics like BLEU or ROUGE to tasks where multiple valid phrasings exist.
BLEU and ROUGE measure n-gram overlap with a reference string. This is a reasonable proxy when the reference is nearly the only acceptable phrasing. It breaks down badly for summarization, question answering, or instruction following, where many phrasings of the same correct answer are equally valid.
Question: What is the capital of France?
Reference: "Paris is the capital of France."
Model output A: "The capital of France is Paris." ← concise, correct
Model output B: "Paris is the capital of France and has been for centuries, ← verbose
serving as the political and cultural center of the country."
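ROUGE-2 recall for A: 0.60 (word order differs, so only 3 of 5 reference bigrams match)
ROUGE-2 recall for B: 1.00 (contains the reference verbatim)
Recall-oriented ROUGE scores the verbose output B above the concise, correct output A, simply because B happens to contain the reference verbatim. BLEU is precision-oriented, so it would not reward B's extra length, but it still marks A down for nothing more than a different word order. Either way, a metric that rewards reproducing the reference's surface form over producing accurate, well-formed responses is optimizing for the wrong thing.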
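The overlap arithmetic for this example can be checked directly. The sketch below is a simplified stand-in, not a full BLEU or ROUGE implementation: it computes clipped bigram recall (the quantity ROUGE-2 recall is built on) and clipped bigram precision (one of the building blocks of BLEU), with no brevity penalty, stemming, or averaging across n-gram orders. The helper names are hypothetical.

```python
import re
from collections import Counter


def ngrams(text: str, n: int) -> Counter:
    """Multiset of word n-grams, lowercased, punctuation removed."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def overlap(candidate: str, reference: str, n: int = 2) -> tuple[float, float]:
    """Return (recall, precision) of clipped n-gram overlap."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    matched = sum((cand & ref).values())  # Counter & keeps the min count per n-gram
    recall = matched / sum(ref.values()) if ref else 0.0
    precision = matched / sum(cand.values()) if cand else 0.0
    return recall, precision


reference = "Paris is the capital of France."
output_a = "The capital of France is Paris."
output_b = (
    "Paris is the capital of France and has been for centuries, "
    "serving as the political and cultural center of the country."
)

for name, text in (("A", output_a), ("B", output_b)):
    recall, precision = overlap(text, reference)
    print(f"{name}: bigram recall={recall:.2f}  bigram precision={precision:.2f}")
```

Recall ranks B first and precision ranks A first, yet neither number says anything about which answer is more correct or more useful.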
| Task Type | Poor Metric Choice | Better Metric Choice |
|---|---|---|
| Open-ended QA | BLEU/ROUGE | Exact-match on extracted answer, LLM-as-judge |
| Summarization | ROUGE-L alone | Faithfulness score + LLM relevance judge |
| Code generation | Token overlap | Pass rate on unit tests |
| Classification | String match on full output | Label extraction + accuracy |
| Instruction following | BLEU | Rubric-based LLM judge |
⚠️ Common Mistake: Teams reach for BLEU or ROUGE because they are easy to compute and produce a number that feels authoritative. Precision does not imply validity. A metric with the wrong structure gives you high-confidence wrong answers.
Pitfall 3: Evaluating on the Happy Path Only
Happy path bias occurs when the test set is built primarily from examples where the system is expected to succeed, while underrepresenting the inputs where failure is most likely and most costly.
This happens for understandable reasons: when teams sit down to build an eval set, they think of prototypical use cases with clear, unambiguous prompts and clean reference answers. The result is a test set that produces encouraging numbers while systematically missing the failure modes that surface in production.
WHAT GETS INTO THE TEST SET
(without deliberate sampling strategy)
Input Distribution (Real World)
┌──────────────────────────────────────┐
│ ████████████ │
│ ████████████ ← Happy path │
│ ████████████ (well-represented) │
│ │
│ ░░░░ ← Ambiguous inputs │
│ ░░ ← Adversarial inputs │
│ ░ ← Edge cases │
│ ░ ← High-risk categories │
└──────────────────────────────────────┘
░ = underrepresented in eval
█ = overrepresented in eval
The correct approach is to sample from the full input distribution — actively seeking out categories of inputs the model is likely to fail on. For high-risk applications, this means deliberately oversampling failure-prone categories relative to their natural frequency, so that the eval is sensitive to the failures that matter most.
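Deliberate oversampling can be sketched as per-category quotas drawn with a fixed seed. The category names, pool sizes, and quotas below are made up for illustration; the only assumption about the data is that each candidate example carries a `category` key.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(pool: list[dict], quotas: dict[str, int], seed: int = 0) -> list[dict]:
    """Draw a fixed number of examples per category.

    pool:   candidate examples, each with a 'category' key
    quotas: {category: number of examples wanted}
    """
    rng = random.Random(seed)  # fixed seed keeps the eval set reproducible
    by_category: dict[str, list[dict]] = defaultdict(list)
    for example in pool:
        by_category[example["category"]].append(example)

    sample = []
    for category, wanted in quotas.items():
        available = by_category.get(category, [])
        if len(available) < wanted:
            raise ValueError(
                f"{category!r}: wanted {wanted}, only {len(available)} available"
            )
        sample.extend(rng.sample(available, wanted))
    return sample


# Hypothetical pool that mirrors a skewed real-world distribution.
pool = (
    [{"category": "happy_path", "prompt": f"routine question {i}"} for i in range(90)]
    + [{"category": "ambiguous", "prompt": f"ambiguous question {i}"} for i in range(6)]
    + [{"category": "adversarial", "prompt": f"adversarial question {i}"} for i in range(4)]
)
# Rare categories get far more than their natural share of the eval set.
quotas = {"happy_path": 10, "ambiguous": 5, "adversarial": 4}

eval_sample = stratified_sample(pool, quotas, seed=0)
counts = Counter(example["category"] for example in eval_sample)
print(dict(counts))
```

Because the sample no longer mirrors production frequencies, an aggregate score computed on it should not be read as an estimate of production pass rate; per-category results are the meaningful output.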
🤔 Did you know? The happy path problem compounds over time. Teams often add new examples to the eval set when they find failures in production, meaning the eval set slowly becomes a catalog of known failures. Inputs that cause novel failures continue to be underrepresented until they surface in production. A proactive sampling strategy breaks this reactive cycle.
Pitfall 4: Treating a Single Aggregate Score as Sufficient
An aggregate score collapses performance across all examples into one number. That number is useful for headline comparisons, but it actively hides information critical for deployment decisions.
Consider a model that achieves an overall pass rate of 85% across 1,000 eval examples. Now segment the results:
import json
from collections import defaultdict
def segmented_score_report(results: list[dict]) -> dict:
"""
Compute per-category pass rates from a flat list of scored results.
Each result dict is expected to have 'category' and 'passed' keys.
"""
category_counts: dict[str, dict[str, int]] = defaultdict(
lambda: {"passed": 0, "total": 0}
)
for result in results:
cat = result["category"]
category_counts[cat]["total"] += 1
if result["passed"]:
category_counts[cat]["passed"] += 1
return {
cat: {
"pass_rate": round(counts["passed"] / counts["total"], 3),
"n": counts["total"],
}
for cat, counts in category_counts.items()
if counts["total"] > 0
}
## Example output from a segmented report:
example_report = {
    "general": {"pass_rate": 0.93, "n": 800},
"medical": {"pass_rate": 0.51, "n": 100},
"legal": {"pass_rate": 0.48, "n": 50},
"financial": {"pass_rate": 0.62, "n": 50},
}
total = sum(v["n"] for v in example_report.values())
aggregate = sum(v["pass_rate"] * v["n"] for v in example_report.values()) / total
print(f"Aggregate pass rate: {aggregate:.2f}") # → 0.85
The aggregate is 0.85. The medical and legal segments are below 0.55. If the application handles queries in those categories — where errors carry real-world consequences — an 85% aggregate is not a green light. It is a signal that the system performs well on its easy categories and poorly on the ones that matter most.
🎯 Key Principle: Segmented scoring — breaking aggregate results into meaningful subcategories — is a prerequisite for a valid deployment decision in any application where different input categories carry different failure costs. Define your categories before you run the eval, based on your understanding of the application's risk profile.
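One way to turn a segmented report into a deployment gate is a per-category minimum pass rate. The thresholds below are invented for illustration, and `failing_segments` is a hypothetical helper; real thresholds would come from the application's risk assessment.

```python
def failing_segments(
    report: dict[str, dict],
    min_pass_rate: dict[str, float],
    default_min: float = 0.80,  # assumed floor for categories without their own
) -> dict[str, dict]:
    """Return categories whose pass rate is below their required minimum."""
    failures = {}
    for category, stats in report.items():
        required = min_pass_rate.get(category, default_min)
        if stats["pass_rate"] < required:
            failures[category] = {
                "pass_rate": stats["pass_rate"],
                "required": required,
                "n": stats["n"],
            }
    return failures


report = {
    "general": {"pass_rate": 0.93, "n": 800},
    "medical": {"pass_rate": 0.51, "n": 100},
    "legal": {"pass_rate": 0.48, "n": 50},
    "financial": {"pass_rate": 0.62, "n": 50},
}
# Hypothetical requirements: stricter where errors are more costly.
thresholds = {"medical": 0.95, "legal": 0.95, "financial": 0.90}

failures = failing_segments(report, thresholds)
if failures:
    for category, info in failures.items():
        print(
            f"BLOCK {category}: {info['pass_rate']:.2f} < "
            f"{info['required']:.2f} (n={info['n']})"
        )
else:
    print("All segments meet their thresholds.")
```

The gate blocks on any failing segment regardless of the aggregate, which is the behavior the principle above calls for.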
Pitfall 5: Not Versioning the Eval Suite
Eval drift is what happens when the test set, scoring prompt, or scoring logic changes between runs without being recorded. Two score numbers exist — say, 78% last month and 84% this month — and you cannot tell whether the model improved or the eval got easier.
The components that must be versioned:
- Test set — adding, removing, or modifying examples changes what is being measured
- Scoring prompt — if using LLM-as-judge, a rephrased rubric is a different measurement instrument
- Scoring model — if the judge model is updated, the same prompt may produce systematically different scores
- Inference parameters — temperature, sampling strategy, model version identifier
EVAL VERSIONING STRUCTURE
eval-suite/
├── v1.2.0/
│ ├── test_set.jsonl ← pinned, read-only
│ ├── scoring_prompt.txt ← exact judge prompt text
│ ├── config.json ← model id, temperature, seed
│ └── results/
│ ├── run_2026-03-01.jsonl
│ └── run_2026-04-15.jsonl
│
└── v1.3.0/
├── test_set.jsonl
├── scoring_prompt.txt
├── CHANGELOG.md ← "Added 50 medical examples; revised
│ rubric to clarify factual accuracy"
└── results/
import hashlib
import json
import pathlib
from datetime import datetime, timezone
def compute_suite_fingerprint(
test_set_path: str,
scoring_prompt_path: str,
) -> dict[str, str]:
"""Return stable hashes for the core eval artifacts."""
def file_hash(path: str) -> str:
content = pathlib.Path(path).read_bytes()
return hashlib.sha256(content).hexdigest()[:16]
return {
"test_set_hash": file_hash(test_set_path),
"scoring_prompt_hash": file_hash(scoring_prompt_path),
}
def results_are_comparable(path_a: str, path_b: str) -> bool:
"""Return True only if both result files used the same eval artifacts."""
def read_metadata(path: str) -> dict:
with open(path) as f:
return json.loads(f.readline())
meta_a = read_metadata(path_a)
meta_b = read_metadata(path_b)
return (
meta_a["test_set_hash"] == meta_b["test_set_hash"]
and meta_a["scoring_prompt_hash"] == meta_b["scoring_prompt_hash"]
)
If results_are_comparable returns False, the comparison requires explicit justification — it can't silently proceed as if the measurement instrument hadn't changed.
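For that comparison to work, each results file has to carry the hashes in its first line. A minimal sketch of writing such a header follows; the `write_results` helper, the `config` field, and the file names are assumptions for illustration, and the hashing mirrors the truncated SHA-256 used above.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Truncated SHA-256 of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()[:16]


def write_results(
    results_path: Path,
    records: list[dict],
    test_set_path: Path,
    scoring_prompt_path: Path,
    config: dict,
) -> dict:
    """Write a metadata header line, then one JSON record per example."""
    header = {
        "test_set_hash": file_hash(test_set_path),
        "scoring_prompt_hash": file_hash(scoring_prompt_path),
        "config": config,  # model id, temperature, seed, and similar
    }
    with open(results_path, "w", encoding="utf-8") as f:
        f.write(json.dumps(header) + "\n")
        for record in records:
            f.write(json.dumps(record) + "\n")
    return header


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    test_set = root / "test_set.jsonl"
    prompt = root / "scoring_prompt.txt"
    test_set.write_text('{"prompt": "p1", "expected": "e1"}\n', encoding="utf-8")
    prompt.write_text("Rate the answer from 1 to 5.\n", encoding="utf-8")
    config = {"model_id": "hypothetical-model-001", "temperature": 0.0, "seed": 42}

    header_a = write_results(root / "run_a.jsonl", [], test_set, prompt, config)

    # Quietly appending an example changes the test-set hash.
    with open(test_set, "a", encoding="utf-8") as f:
        f.write('{"prompt": "p2", "expected": "e2"}\n')
    header_b = write_results(root / "run_b.jsonl", [], test_set, prompt, config)

comparable = all(
    header_a[key] == header_b[key]
    for key in ("test_set_hash", "scoring_prompt_hash", "config")
)
print(f"comparable: {comparable}")
```

Including the inference config in the comparison extends the check to the fourth item in the versioning list; whether a config difference should block a comparison outright is a policy choice.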
⚠️ Common Mistake: Treating the eval suite as a living document that gets quietly updated rather than explicitly versioned. Fixing a scoring prompt that was producing strange results without creating a new version breaks the comparability of every subsequent run against every prior run.
How the Pitfalls Interact
These five pitfalls are not independent. A team using a leaked test set with BLEU scoring on a happy-path-only sample, reporting only an aggregate number without versioning the suite, can produce a score that is wrong in five simultaneous ways — and the number will still look meaningful.
📋 Quick Reference: Common Eval Pitfalls
| Pitfall | Symptom | Fix |
|---|---|---|
| Test set leakage | Scores drop sharply in production | Fingerprint dedup before training |
| Metric-task mismatch | Verbose outputs outscore concise ones | Match metric to correctness structure |
| Happy path bias | Edge cases fail in prod, not in eval | Sample full input distribution |
| Single aggregate | Segment failures hidden by average | Report per-category pass rates |
| No versioning | Can't explain score changes | Hash and version all eval artifacts |
A score you cannot trust is not just useless — it is actively harmful, because it produces confidence that is not warranted and can drive deployment decisions that cause real failures.
Key Takeaways and What Comes Next
By this point the picture of LLM evaluation has shifted from something informal — run a few prompts, look at the outputs, ship it — to something that resembles a discipline with specific properties, failure modes, and trade-offs. This closing section compresses that into a working mental model and maps it onto the specific problems the next lessons will tackle.
The Four Properties Are Engineering Requirements, Not Adjectives
Reproducibility, coverage, sensitivity, and comparability are engineering properties that pass or fail a checklist. Each one has a specific failure mode when missing:
- Reproducibility requires pinned prompts, pinned test sets, and controlled inference parameters — the same discipline applied to model weights or application code. Without it: two runs on the same model yield different scores.
- Coverage requires the test set to be drawn from the actual production input distribution, not from memorable successes. Without it: edge cases and failure modes are invisible.
- Sensitivity requires a metric that can detect regressions of the magnitude that would change a product decision. Without it: a meaningfully worse model passes undetected.
- Comparability requires versioned eval artifacts so that score changes are attributable to the model or prompt, not to silent eval drift. Without it: score trends are uninterpretable.
As the pitfalls section showed, each property fails in a recognizable way: happy-path sampling undermines coverage, aggregate-only reporting undermines sensitivity, and unversioned changes to the test set, scoring prompt, or inference parameters undermine comparability and reproducibility. Test set leakage and metric-task mismatch go further and undermine the validity of the score itself. Treating these as checkable engineering requirements, not aspirational adjectives, is what separates a pipeline that generates meaningful numbers from one that merely generates numbers.
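Sensitivity in particular can be made concrete with a rough calculation. The sketch below uses the normal approximation to the binomial to estimate a 95% margin of error for a pass rate; this is a back-of-the-envelope tool, assumed here for illustration, and it becomes unreliable for small samples or pass rates near 0 or 1.

```python
import math


def pass_rate_margin(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a pass rate (normal approximation)."""
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n)


for n in (50, 100, 1000):
    margin = pass_rate_margin(0.85, n)
    print(f"n={n:>4}: 0.85 ± {margin:.3f}")
```

At an 85% pass rate the margin is roughly ±0.02 with 1,000 examples and roughly ±0.07 with 100, so a three-point regression is visible in the first case and lost in the noise in the second.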
Human Evaluation as Calibration Target
The distinction between human evaluation as a calibration target versus a production scoring mechanism is one of the more consequential framings in this lesson. Used as a calibration target, human eval validates that an automated metric moves in the same direction as human judgment. Used as a production mechanism, it would need to cover every candidate output — a throughput ceiling no annotation team can clear at modern development scale.
The practical consequence is a layered architecture:
Human Evaluation (periodic, sampled)
|
| calibrates
↓
Automated Eval Pipeline (continuous, full coverage)
|
| flags regressions
↓
Model/Prompt Change Decision
The automated pipeline must approximate human judgment reliably, not just produce numbers quickly. An automated scorer that disagrees with human raters on which outputs are better creates a feedback loop that optimizes the system away from actual quality.
The Pipeline Skeleton
The four-stage structure — dataset loading, inference, scoring, and result logging — is an architectural skeleton that determines what the pipeline can do as it grows more sophisticated. Classical metrics and LLM-as-judge both plug into the scoring stage. Raw output logging makes it possible to apply new scoring methods retroactively. Dataset versioning makes comparisons across time interpretable.
from __future__ import annotations
import json
import hashlib
from datetime import datetime, timezone
from typing import Callable
def load_dataset(path: str) -> list[dict]:
"""Load a versioned JSONL dataset of {prompt, expected} pairs."""
with open(path) as f:
        # Skip blank lines so a stray empty line does not crash the loader.
        return [json.loads(line) for line in f if line.strip()]
def run_inference(prompt: str, model_id: str, seed: int = 42) -> str:
"""Placeholder for a real LLM call with temperature=0 and fixed seed."""
key = hashlib.md5(f"{model_id}:{seed}:{prompt}".encode()).hexdigest()[:8]
return f"[mock output for prompt hash {key}]"
def exact_match_scorer(output: str, expected: str) -> float:
return 1.0 if output.strip() == expected.strip() else 0.0
def stub_llm_judge_scorer(output: str, expected: str) -> float:
"""Interface stub for an LLM-as-judge scorer. Returns float in [0.0, 1.0]."""
return 0.5 # Replace with real LLM call
def log_results(
results: list[dict],
output_path: str,
model_id: str,
prompt_version: str,
) -> None:
metadata = {
"model_id": model_id,
"prompt_version": prompt_version,
"timestamp": datetime.now(timezone.utc).isoformat(),
"n_examples": len(results),
"pass_rate": sum(r["score"] for r in results) / len(results) if results else 0.0,
}
with open(output_path, "w") as f:
f.write(json.dumps(metadata) + "\n")
for record in results:
f.write(json.dumps(record) + "\n")
def run_eval(
dataset_path: str,
model_id: str,
scorer: Callable[[str, str], float],
output_path: str,
prompt_version: str = "v1",
seed: int = 42,
) -> dict:
examples = load_dataset(dataset_path)
results = []
for ex in examples:
output = run_inference(ex["prompt"], model_id, seed=seed)
score = scorer(output, ex["expected"])
results.append({
"prompt": ex["prompt"],
"expected": ex["expected"],
"output": output,
"score": score,
})
log_results(results, output_path, model_id, prompt_version)
pass_rate = sum(r["score"] for r in results) / len(results) if results else 0.0
return {"pass_rate": pass_rate, "n_examples": len(results), "results": results}
The scorer is passed in as a parameter, so swapping exact_match_scorer for stub_llm_judge_scorer — or a real LLM-as-judge implementation — changes nothing else in the pipeline:
## Exact match for a closed-answer task
result_exact = run_eval(
dataset_path="eval_set_v3.jsonl",
model_id="my-model-checkpoint-001",
scorer=exact_match_scorer,
output_path="results_exact_v3.jsonl",
prompt_version="v2",
)
print(f"Exact match pass rate: {result_exact['pass_rate']:.2%}")
## Same pipeline, LLM judge in the scoring slot
result_judge = run_eval(
dataset_path="eval_set_v3.jsonl",
model_id="my-model-checkpoint-001",
scorer=stub_llm_judge_scorer,
output_path="results_judge_v3.jsonl",
prompt_version="v2",
)
print(f"LLM-judge pass rate: {result_judge['pass_rate']:.2%}")
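Because the mock inference here is deterministic, both runs score identical outputs. With a real model, run inference once and apply both scorers to the stored outputs, so that any disagreement is attributable to the scorers rather than to sampling variation. Comparing scorer agreement on the same outputs is itself a diagnostic for scorer validity: exactly the calibration work that connects automated metrics back to human judgment.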
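A scorer-agreement check over stored outputs might look like the sketch below. The two scorers, the sample records, and the 0.5 pass threshold are assumptions for illustration; the records use the `output` and `expected` keys written by the skeleton above.

```python
import re
from typing import Callable

ScoreFn = Callable[[str, str], float]


def strict_match(output: str, expected: str) -> float:
    """Case-sensitive exact match after stripping whitespace."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def normalized_match(output: str, expected: str) -> float:
    """Hypothetical lenient scorer: ignores case and punctuation."""
    def normalize(text: str) -> str:
        return " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    return 1.0 if normalize(output) == normalize(expected) else 0.0


def scorer_agreement(
    records: list[dict],
    scorer_a: ScoreFn,
    scorer_b: ScoreFn,
    pass_threshold: float = 0.5,  # assumed cutoff for treating a score as a pass
) -> tuple[float, list[dict]]:
    """Apply both scorers to each stored output; return agreement rate and disagreements."""
    disagreements = []
    for record in records:
        passed_a = scorer_a(record["output"], record["expected"]) >= pass_threshold
        passed_b = scorer_b(record["output"], record["expected"]) >= pass_threshold
        if passed_a != passed_b:
            disagreements.append(record)
    rate = 1.0 - len(disagreements) / len(records) if records else 0.0
    return rate, disagreements


# Stand-in for records reloaded from a results file.
stored = [
    {"prompt": "Capital of France?", "expected": "Paris", "output": "paris."},
    {"prompt": "2 + 2?", "expected": "4", "output": "4"},
    {"prompt": "Color of the sky?", "expected": "blue", "output": "red"},
]
rate, disagreements = scorer_agreement(stored, strict_match, normalized_match)
print(f"agreement: {rate:.0%}")
for record in disagreements:
    print(f"  review: output={record['output']!r} expected={record['expected']!r}")
```

The disagreement list is usually more informative than the rate: each entry is a concrete case to put in front of a human rater.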
The Consolidated Mental Model
┌─────────────────────────────────────────────────────────────┐
│ EVAL RIGOR FRAMEWORK │
├─────────────────────────────────────────────────────────────┤
│ │
│ WHAT MAKES EVAL RIGOROUS WHAT CAN BREAK IT │
│ ───────────────────────── ──────────────────── │
│ Reproducibility (pinned) ←→ Unpinned seeds/prompts │
│ Coverage (distribution) ←→ Happy-path sampling │
│ Sensitivity (detects diffs) ←→ Blunt aggregate metrics │
│ Comparability (stable base) ←→ Unversioned eval drift │
│ │
├─────────────────────────────────────────────────────────────┤
│ │
│ PIPELINE SKELETON │
│ [Dataset] → [Inference] → [Scoring] → [Logging] │
│ ↑ ↑ ↑ ↑ │
│ versioned deterministic swappable metadata-rich │
│ │
├─────────────────────────────────────────────────────────────┤
│ │
│ SCORING SLOT (simplest to most capable) │
│ Exact match → Classical metrics → LLM-as-judge │
│ │ │ │ │
│ closed tasks partial fit open-ended gen │
│ breaks here (next lessons) │
│ │
├─────────────────────────────────────────────────────────────┤
│ │
│ CALIBRATION HIERARCHY │
│ Human eval (ground truth, doesn't scale) │
│ ↓ calibrates │
│ Automated pipeline (scales, must approximate human) │
│ ↓ determines rigor level │
│ Cost of being wrong │
│ │
└─────────────────────────────────────────────────────────────┘
What the Next Lessons Address
The pipeline skeleton has a slot for a scorer, and this lesson has used exact match and a stub as placeholders. Two questions follow naturally.
Why not just use classical automated metrics? For closed-answer tasks, exact match and its relatives work well. The problem emerges for open-ended generation: summarization, explanation, multi-step reasoning, dialogue. Classical token-overlap metrics like BLEU and ROUGE penalize correct paraphrases and, in their recall-oriented forms, reward verbose outputs that happen to copy the reference's wording. The next lesson examines exactly where these metrics break down and why the failure is structural rather than fixable by tuning.
How much rigor does a given eval pipeline actually need? The four properties are not free. Building and maintaining a rigorous eval pipeline requires engineering time, annotation budget, and ongoing calibration work. The answer to "how much is enough" is determined by the cost of being wrong: what is the worst-case outcome if a regression slips through to production? For a low-stakes content recommendation feature, a lightweight pipeline may be entirely appropriate. For a medical documentation assistant or a legal drafting tool, the same pipeline would be inadequate regardless of its technical architecture. The lesson after classical metrics addresses this cost-of-being-wrong calculus explicitly.
🎯 Key Principle: The cost of being wrong is the organizing variable for eval rigor. Every decision about pipeline sophistication — which scorer to use, how large to make the test set, how often to run human calibration — flows from an honest assessment of what it costs when the system fails silently.