
Rubric Design and Criteria Decomposition

Writing rubrics that are specific enough to be reproducible but flexible enough to catch real failures. Decomposing a single quality dimension into atomic criteria reduces ambiguity and enables per-criterion scoring.

Why Rubric Design Is the Foundation of Reproducible LLM Evaluation

Imagine you've spent three weeks fine-tuning a customer support model. You run it through your evaluation pipeline on a Friday afternoon, and the scores look great β€” helpfulness up 12%, accuracy holding steady. You ship it. On Monday, a colleague reruns the same evaluation on the same outputs and gets completely different numbers. The "improvement" you measured has vanished. Was the model ever better? Did it regress over the weekend? You have no idea. And neither does anyone else.

This scenario plays out constantly in teams building LLM-powered products, and the root cause is almost never the model itself. It's the rubric β€” or more precisely, the absence of a real one. Understanding why rubric design is the engineering foundation of reproducible LLM evaluation, rather than a soft creative exercise, is the first step toward building systems that actually tell you the truth about your model's behavior.

The Core Problem: Natural Language Is Ambiguous by Design

When engineers first set up LLM-as-judge systems, they typically write evaluation criteria that look something like this:

  • Is the response helpful?
  • Is the response accurate?
  • Is the response appropriate for the user?

These feel reasonable. They map to real things we care about. But here's the uncomfortable truth: these words mean different things to different readers, and they mean different things to the same reader on different days. This isn't a flaw in human cognition β€” it's a feature of natural language. Words like "helpful" carry enormous contextual weight that shifts depending on who is asking, what they're building, and what examples they've recently seen.

LLM judges inherit this ambiguity wholesale. When you ask GPT-4 or Claude to rate a response for "helpfulness" without further specification, the model draws on its training to form an implicit definition. That implicit definition may be reasonable, but it isn't your definition, and it isn't stable across:

  • 🧠 Different phrasings of the same prompt
  • πŸ“š Different positions in the context window
  • πŸ”§ Different model versions as the judge model is updated
  • 🎯 Different response styles that trigger different interpretive frames

The result is low inter-rater reliability β€” the same output gets different scores from different evaluation runs. In human annotation studies, this agreement is measured with metrics like Cohen's Kappa. In LLM evaluation, most teams never measure it at all, which means they're flying blind.
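
Cohen's Kappa is straightforward to compute yourself once you have two sets of scores for the same responses β€” for instance, two runs of the same judge. A minimal sketch in pure Python, treating each 1–5 score as a categorical label:

```python
from collections import Counter

def cohens_kappa(ratings_a: list[int], ratings_b: list[int]) -> float:
    """Cohen's kappa for two raters labeling the same items.
    1.0 = perfect agreement, 0.0 = chance-level agreement."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected chance agreement, from each rater's marginal label frequencies
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    expected = sum(
        (freq_a[label] / n) * (freq_b[label] / n)
        for label in set(ratings_a) | set(ratings_b)
    )
    return (observed - expected) / (1 - expected)

# Two evaluation runs scoring the same five responses on a 1-5 scale
run_1 = [5, 4, 3, 5, 2]
run_2 = [5, 3, 3, 5, 2]
print(round(cohens_kappa(run_1, run_2), 2))  # β†’ 0.72
```

A kappa well below 1.0 on repeated runs of the same judge is exactly the "flying blind" signal described above, made measurable.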

πŸ’‘ Real-World Example: A team at a major tech company ran an internal study where they evaluated 200 customer service responses for "quality" using a single holistic prompt. When they resampled the same 200 responses 48 hours later with the same judge model and the same prompt, the scores had shifted by an average of 0.8 points on a 5-point scale β€” with some responses swinging by 2 full points in either direction. No code had changed. No model had been updated. The rubric was simply too vague to constrain the judge's interpretation.

πŸ€” Did you know? Studies on human expert agreement show that even trained domain experts, given the same holistic quality rubric, achieve only 60–70% agreement on complex writing tasks. LLM judges, without precise rubrics, perform no better β€” and often worse, because they lack the ability to ask clarifying questions.

How Vague Rubrics Produce High-Variance Scores

To make this concrete, let's look at what actually happens inside an LLM judge when it encounters a poorly specified criterion. Consider this evaluation prompt:

## ❌ A vague, holistic evaluation prompt
evaluation_prompt = """
You are an expert evaluator. Rate the following customer support response 
on a scale of 1 to 5, where 1 is poor and 5 is excellent.

Response to evaluate:
{response}

Provide your rating and a brief explanation.
"""

This prompt gives the judge model enormous latitude. "Excellent" is undefined. The scale has no anchors. The judge must simultaneously assess correctness, tone, completeness, clarity, and relevance β€” but with no guidance on how to weight them or what counts as evidence for each. In practice, what the judge ends up doing is pattern-matching to its training distribution of "good responses," which varies based on subtle features of the input context.

Now look at what happens when you add even minimal structure:

## βœ… A more structured evaluation prompt (still not fully decomposed)
evaluation_prompt = """
You are an expert evaluator assessing customer support responses.
Rate the response on a scale of 1 to 5 using the following anchors:

1 - Does not address the user's question; contains factual errors; rude or dismissive tone
3 - Partially addresses the question; minor inaccuracies possible; neutral tone
5 - Fully resolves the user's question; factually correct; warm and professional tone

Response to evaluate:
{response}

Provide your rating (1-5) and explain which anchor best describes the response.
"""

This version is better. The anchors give the judge concrete reference points. But notice that a response that's factually perfect but dismissive in tone could score anywhere from 2 to 4 depending on how the judge weighs those two dimensions against each other. The variance has been reduced, but not eliminated.

The path to low-variance, reproducible evaluation runs through rubric decomposition β€” breaking a holistic quality judgment into discrete, independently assessable criteria. That's the central skill this lesson teaches.
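
To make the idea concrete, here is a sketch of what decomposition can look like in code: each atomic criterion carries its own definition and anchored scale, and gets its own judge prompt. The criterion names, anchor wordings, and `build_criterion_prompt` helper are illustrative, not a prescribed API:

```python
# Illustrative decomposition: each atomic criterion is defined and
# scored independently on a short anchored scale.
CRITERIA = {
    "factual_accuracy": {
        "definition": "Every factual claim is supported by the source documents.",
        "anchors": {
            0: "At least one claim contradicts or is absent from the source.",
            1: "All major claims supported; minor imprecision present.",
            2: "All claims fully supported by the source.",
        },
    },
    "completeness": {
        "definition": "The response addresses every sub-question the user asked.",
        "anchors": {
            0: "Addresses none of the sub-questions.",
            1: "Addresses some of the sub-questions.",
            2: "Addresses all of the sub-questions.",
        },
    },
}

def build_criterion_prompt(name: str, spec: dict, response: str) -> str:
    """One judge prompt per criterion -- the judge never has to weigh
    multiple dimensions against each other in a single call."""
    anchor_lines = "\n".join(
        f"{score} = {desc}" for score, desc in sorted(spec["anchors"].items())
    )
    return (
        f"Criterion: {name}\n"
        f"Definition: {spec['definition']}\n"
        f"Scale:\n{anchor_lines}\n\n"
        f"Response to evaluate:\n{response}\n\n"
        f"Output only the integer score."
    )

prompts = {
    name: build_criterion_prompt(name, spec, "{response}")
    for name, spec in CRITERIA.items()
}
```

Because each prompt asks a single, narrowly anchored question, the judge's latitude β€” and therefore score variance β€” shrinks per call.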

🎯 Key Principle: Every degree of ambiguity in a rubric criterion becomes a source of variance in scores. Reproducible evaluation requires reducing ambiguity to the minimum necessary, then testing that the remaining ambiguity is acceptable for your use case.

The Business Cost of Irreproducible Evaluation

If rubric variance only affected individual scores, it might be manageable. The deeper problem is what irreproducible evaluation does to your ability to make decisions over time.

Consider the two critical questions every team building LLM systems needs to answer:

  1. Did this change make the model better or worse?
  2. Is the model behaving worse than it was last month?

Both questions require you to compare scores across time. If your evaluation system has high variance β€” say, Β±1 point on a 5-point scale β€” then a real improvement of 0.3 points is completely invisible in the noise. Conversely, a real regression of 0.4 points looks like noise and gets shipped to production.

This creates two catastrophic failure modes:

High-Variance Rubric β†’ Two Failure Modes

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ FAILURE MODE 1: False Positives                          β”‚
β”‚                                                          β”‚
β”‚ Real state:   Model A β‰ˆ Model B (no meaningful change)   β”‚
β”‚ Measured:     Model A scores 3.8, Model B scores 4.1     β”‚
β”‚ Decision:     Ship Model B          ← WRONG              β”‚
β”‚ Consequence:  Wasted engineering resources,              β”‚
β”‚               unpredictable production behavior          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ FAILURE MODE 2: False Negatives                          β”‚
β”‚                                                          β”‚
β”‚ Real state:   Model B is meaningfully worse              β”‚
β”‚ Measured:     Model A scores 3.9, Model B scores 3.7     β”‚
β”‚ Decision:     Variance is normal, ship Model B  ← WRONG  β”‚
β”‚ Consequence:  Regression reaches users undetected        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The second failure mode is particularly insidious because the costs are invisible. You never know that the regression happened. User satisfaction metrics may eventually surface it β€” weeks later, after the signal is buried under other changes.

⚠️ Common Mistake: Teams often respond to high evaluation variance by running more evaluations and averaging the results. This reduces variance statistically, but it doesn't fix the underlying problem β€” the rubric is still ambiguous, so more samples just give you a more precise measurement of an imprecise construct. You're computing the average of undefined values.

βœ… Correct thinking: Fix the rubric first, then determine how many evaluations you need to achieve your desired confidence level. Precision in measurement comes from definition clarity, not from sample size alone.
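
Once the rubric is fixed and you've measured the residual per-run standard deviation, the standard-error formula tells you how many judge runs you need for a given confidence level. A sketch, assuming independent runs and a normal approximation (`runs_needed` is a hypothetical helper name):

```python
import math

def runs_needed(score_std: float, margin: float, z: float = 1.96) -> int:
    """Number of independent judge runs so the mean score's confidence
    interval is within +/- margin (z = 1.96 for ~95% confidence)."""
    return math.ceil((z * score_std / margin) ** 2)

# With a per-run standard deviation of 0.5 on a 1-5 scale, resolving
# a 0.2-point difference at 95% confidence requires:
print(runs_needed(0.5, 0.2))  # β†’ 25
```

Note the order of operations this enforces: the formula consumes the variance your rubric produces, so halving the standard deviation through better rubric design quarters the number of runs required β€” definition clarity buys you far more than brute-force sampling.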

πŸ’‘ Mental Model: Think of your rubric as a ruler. If the ruler's markings are smeared and hard to read, measuring more objects with it doesn't improve your measurements β€” it just gives you more imprecise data. A well-designed rubric is a ruler with clear, evenly spaced markings.

The Spectrum: Holistic to Fully Decomposed

Rubric design exists on a spectrum, and understanding where different approaches fall β€” and why you might choose each β€” is essential before diving into the mechanics of decomposition.

RUBRIC DESIGN SPECTRUM

β—„β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Ί
β”‚                                                              β”‚
Fully Holistic                          Fully Decomposed

"Rate this response                     "Score each criterion independently:
 1-5 for quality"                        1. Factual accuracy (0-2)
                                         2. Completeness (0-2)
                                         3. Tone appropriateness (0-2)
                                         4. Actionability (0-2)
                                         5. Conciseness (0-2)"

High flexibility                        Minimal ambiguity
High variance                           per criterion
Fast to write                           Slow to write
Hard to debug                           Easy to debug
Captures emergent                       Misses dimensions
qualities not listed                    not in the list

Holistic rubrics are fast to write and can capture gestalt quality β€” the sense that a response just works in a way that's hard to articulate. Experienced human judges using holistic rubrics often agree with each other at high rates, precisely because they share a rich mental model of quality. But LLM judges don't share your mental model, and holistic rubrics give them nowhere to anchor.

Fully decomposed rubrics minimize per-criterion ambiguity by making each dimension independently assessable. The tradeoff is coverage: you can only measure what you explicitly named. A response that scores 5/5 on every listed criterion might still feel wrong to a domain expert because it violates an implicit norm that wasn't captured in the decomposition.

🎯 Key Principle: The goal is not to reach either extreme of the spectrum, but to find the minimum decomposition that achieves your required reproducibility threshold while retaining coverage of the quality dimensions that matter for your specific application.

The right position on this spectrum depends on:

  • 🧠 How stable your input distribution is β€” narrow, predictable inputs can tolerate more holistic rubrics because the judge's implicit model roughly matches your intended one
  • πŸ“š How sensitive your decisions are β€” high-stakes decisions (safety classification, production deployment gates) require more decomposition
  • πŸ”§ How often the rubric needs to change β€” holistic rubrics are easier to update but harder to version and audit
  • 🎯 How you'll use the scores β€” aggregate trend metrics can tolerate more variance than per-response routing decisions

Measuring Your Rubric's Reproducibility

Before you can improve a rubric, you need to measure its current reproducibility. The practical way to do this is to compute intra-rubric consistency: run the same set of responses through the same rubric multiple times (with the judge temperature set low, but not zero) and measure the spread of scores.

import re

import openai
import numpy as np
from collections import defaultdict

def extract_score(raw_output: str) -> float:
    """Minimal score parser: returns the first digit 1-5 found in the
    judge's output. Replace with stricter parsing (e.g. structured
    outputs) for production use."""
    match = re.search(r"[1-5]", raw_output)
    if match is None:
        raise ValueError(f"No score found in judge output: {raw_output!r}")
    return float(match.group())

def measure_rubric_consistency(
    responses: list[str],
    rubric_prompt_template: str,
    judge_model: str = "gpt-4o",
    n_runs: int = 5,
    temperature: float = 0.3
) -> dict:
    """
    Measures how consistently a rubric scores the same responses
    across repeated evaluation runs.
    
    Returns per-response score variance and overall rubric consistency score.
    """
    client = openai.OpenAI()
    
    # Store scores per response across multiple runs
    all_scores = defaultdict(list)  # response_idx -> [score_run1, score_run2, ...]
    
    for run_idx in range(n_runs):
        for resp_idx, response in enumerate(responses):
            prompt = rubric_prompt_template.format(response=response)
            
            result = client.chat.completions.create(
                model=judge_model,
                messages=[{"role": "user", "content": prompt}],
                temperature=temperature,
                max_tokens=256
            )
            
            # Parse score from response (assumes judge outputs a number 1-5)
            raw_output = result.choices[0].message.content
            score = extract_score(raw_output)
            all_scores[resp_idx].append(score)
    
    # Compute per-response variance
    variances = {
        resp_idx: np.var(scores)
        for resp_idx, scores in all_scores.items()
    }
    
    mean_variance = np.mean(list(variances.values()))
    max_variance = np.max(list(variances.values()))
    
    # On a 1-5 scale, a mean variance above 0.25 suggests the rubric needs refinement
    consistency_flag = "ACCEPTABLE" if mean_variance < 0.25 else "NEEDS REFINEMENT"
    
    return {
        "mean_variance": mean_variance,
        "max_variance": max_variance,
        "per_response_variances": variances,
        "consistency_assessment": consistency_flag,
        "recommendation": (
            "Rubric is sufficiently precise for deployment."
            if mean_variance < 0.25
            else "Decompose high-variance criteria into more specific sub-criteria."
        )
    }

This function is your diagnostic tool. Before committing to a rubric for production evaluation, run it across a representative sample of your response distribution. The mean variance tells you the expected noise floor of your evaluation system; the max variance tells you which responses (or response types) the rubric handles worst.

πŸ’‘ Pro Tip: High variance on specific response types is actually useful diagnostic information. If your rubric scores consistently for short, direct answers but inconsistently for long, structured responses, that tells you the rubric's implicit anchors are calibrated for one response style and need to be made explicit for others.
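
One way to operationalize this diagnostic is to bucket the per-response variances by a response feature such as length. The bucketing scheme and the `variance_by_bucket` helper below are illustrative assumptions, not a fixed recipe:

```python
import numpy as np

def variance_by_bucket(responses: list[str], variances: list[float],
                       short_cutoff: int = 200) -> dict[str, float]:
    """Mean score variance for short vs. long responses (by character
    count), to reveal which response styles the rubric handles worst."""
    buckets: dict[str, list[float]] = {"short": [], "long": []}
    for text, var in zip(responses, variances):
        buckets["short" if len(text) < short_cutoff else "long"].append(var)
    return {name: float(np.mean(vals)) for name, vals in buckets.items() if vals}

# Toy data: two short responses score consistently, two long ones do not
responses = ["Short direct answer.", "x" * 500, "Another short one.", "y" * 800]
variances = [0.1, 0.9, 0.2, 0.7]
report = variance_by_bucket(responses, variances)
```

A large gap between buckets tells you the rubric's implicit anchors are calibrated for one style; the fix is to make anchors for the other style explicit, not to discard the noisy responses.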

Framing Rubric Design as an Engineering Discipline

❌ Wrong thinking: "Writing evaluation criteria is a product or editorial task β€” we just describe what 'good' means and hand it to the LLM."

βœ… Correct thinking: Rubric design is a software engineering discipline with testable outputs, measurable quality properties, and iterative refinement cycles.

This reframe matters because it changes how teams allocate resources and time. Writing a rubric isn't a one-hour task done before the "real" work begins. It's an artifact that requires:

  • πŸ”§ Version control β€” rubrics change, and old evaluation runs must be traceable to the rubric version that produced them
  • 🎯 Unit testing β€” you should have a golden set of responses with known scores that any new rubric version must pass
  • πŸ“š Stakeholder review β€” domain experts need to validate that the decomposed criteria actually capture what they care about
  • 🧠 Regression testing β€” when you update a rubric, you need to understand how it changes scores on your historical evaluation set
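
The unit-testing point can be made concrete with a small sketch: a golden set of responses with agreed-upon scores that any candidate rubric version must reproduce within tolerance. The golden-set entries and the `judge_score` callable here are placeholders for your own data and judge:

```python
# Sketch of a rubric regression test. `judge_score` stands in for a call
# that scores one response under the rubric version being tested.
GOLDEN_SET = [
    {"response": "A fully correct, complete answer.", "expected_score": 5},
    {"response": "A polite but factually wrong answer.", "expected_score": 2},
]

def check_rubric_version(judge_score, tolerance: float = 0.5) -> list[str]:
    """Returns a list of failure messages; an empty list means the
    candidate rubric version reproduces the golden scores."""
    failures = []
    for case in GOLDEN_SET:
        got = judge_score(case["response"])
        if abs(got - case["expected_score"]) > tolerance:
            failures.append(f"expected {case['expected_score']}, got {got}")
    return failures

# A mock judge that rates everything 5 fails the second golden case
failures = check_rubric_version(lambda response: 5)
```

Run this in CI on every rubric change, exactly as you would a unit test suite for application code.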

This lesson's remaining sections build directly on this engineering framing. Section 2 covers the structural anatomy of a well-formed rubric β€” what makes criteria specific, complete, and graded. Section 3 walks through the practical decomposition process, teaching you how to take a vague quality dimension and systematically break it into atomic, independently evaluable criteria. Section 4 shows you how rubric structure maps directly to prompt design and code. Sections 5 and 6 address the mistakes practitioners make and the deeper patterns β€” including rubric drift, where rubrics gradually lose calibration as the output distribution changes over time β€” that you'll need to manage in production systems.

🧠 Mnemonic: Think VCARD to remember the engineering properties a production rubric must have:

  • Versioned (traceable to evaluation runs)
  • Calibrated (validated on known examples)
  • Atomic criteria (each dimension independently assessable)
  • Reproducible (low variance on repeated runs)
  • Documented (stakeholders understand and agree on definitions)

πŸ“‹ Quick Reference Card: Rubric Design Fundamentals

🎯 For each property: ❌ what it looks like when missing, βœ… what good looks like.

πŸ”’ Specificity
   ❌ "Is the response helpful?"
   βœ… "Does the response directly answer all sub-questions the user asked?"

πŸ“Š Gradation
   ❌ "Yes / No"
   βœ… "0 = Addresses none, 1 = Addresses some, 2 = Addresses all"

🧩 Decomposition
   ❌ Single "quality" score
   βœ… Separate scores for accuracy, completeness, tone, actionability

πŸ” Reproducibility
   ❌ Variance > 0.5 across runs
   βœ… Variance < 0.25 across runs on a representative sample

πŸ“‹ Coverage
   ❌ Only measures what's easy to measure
   βœ… Explicitly accounts for all dimensions stakeholders care about

The insight that makes everything else in this lesson work is simple but easy to overlook: the LLM judge is not the variable you should be optimizing first. When evaluation scores are inconsistent, the instinct is to try a better model, or a different model, or more elaborate chain-of-thought prompting. Sometimes those things help. But if the rubric is vague, a smarter judge just makes more confident, internally consistent, but still irreproducible decisions. Fix the measurement instrument before you optimize the measurer.

What you're about to learn is how to build rubrics that are specific enough to be reproducible, flexible enough to catch failures you haven't anticipated, and structured enough to be maintained as living engineering artifacts β€” not one-time documents that drift into obsolescence. That's the foundation every reliable LLM evaluation system is built on.

Anatomy of a Well-Formed Rubric: Specificity, Coverage, and Gradation

A rubric is not just a checklist. It is a contract between your evaluation intent and the judge that executes it β€” whether that judge is a human annotator or an LLM. When that contract is vague, two judges reading the same response will reach different verdicts. When it is overly rigid, the rubric becomes a narrow corridor that misses failure modes you never anticipated. Getting this balance right is what separates evaluation that produces trustworthy, reproducible scores from evaluation that produces noise.

This section dissects the structural components of a well-formed rubric: how to tune specificity, ensure full coverage, design meaningful gradation, anchor abstract criteria with concrete examples, and distinguish between what a response contains versus how it reasons. Each component builds on the last, and together they form the engineering skeleton of any reliable LLM evaluation system.


The Specificity–Flexibility Trade-Off

Specificity is the degree to which a rubric criterion constrains a judge's interpretation. A highly specific criterion leaves little room for ambiguity: "The response must name at least one concrete mitigation strategy for the identified risk." A vague criterion leaves too much room: "The response should be helpful."

But specificity has a ceiling. Push it too far and you get a criterion so narrow that it only catches one type of failure while remaining blind to others. Imagine you're evaluating a medical information chatbot and you write: "The response must include the phrase 'consult a doctor' at least once." That criterion is extremely specific β€” but a response could include that phrase and still recommend an unsafe home remedy in the same breath. The rubric passes what the rubric should fail.

🎯 Key Principle: A criterion should constrain the space of acceptable responses, not enumerate the exact tokens those responses must contain. Constraint is about intention, not verbatim matching.

The right level of specificity targets observable behaviors β€” things a judge can verify without inferring hidden intent. Compare these two versions of a criterion for evaluating factual accuracy:

❌ Too vague:
"The response should be accurate."

βœ… Well-specified:
"Every factual claim in the response can be verified against the provided
source documents. If a claim appears in the response but not in the source,
it is treated as unsupported and counts against this criterion."

⚠️ Too narrow:
"The response must quote at least one sentence verbatim from the source document."

The well-specified version tells the judge what to look for (claims), where to verify (source documents), and what constitutes failure (unsupported claims) β€” without mandating a specific surface form.

⚠️ Common Mistake β€” Mistake 1: Writing criteria that are actually implementation hints rather than quality descriptions. "The response should use bullet points" constrains formatting, not quality. Unless formatting is genuinely a quality dimension for your use case, this kind of criterion will penalize stylistically valid responses that present the same information in prose.


Coverage: Mapping the Full Quality Space

A rubric has coverage when its criteria collectively account for every meaningful way a response can succeed or fail β€” without those criteria overlapping so heavily that they double-count the same failure.

Think of the quality space for a response as a terrain. Your criteria are spotlights. If you position them poorly, some areas fall in shadow (uncovered failure modes) while others are hit by multiple beams (redundant criteria that inflate certain dimensions). Neither situation produces fair, informative scores.

QUALITY TERRAIN MAP

       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
       β”‚   Factual       Relevance    Completeness    β”‚
       β”‚   Accuracy      [Dim B]      [Dim C]         β”‚
       β”‚   [Dim A]                                    β”‚
       β”‚                                              β”‚
       β”‚   Tone &        Safety       ??? ◄── SHADOW  β”‚
       β”‚   Register      [Dim E]      (uncovered)     β”‚
       β”‚   [Dim D]                                    β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

       Good rubric: spotlights cover the full terrain with minimal overlap.
       Bad rubric: some regions dark, others triple-lit.

To audit coverage, use a failure mode enumeration exercise before writing criteria. Ask: "What are all the ways a response to this type of task could be bad?" List them exhaustively, then group them into candidate dimensions. Criteria should emerge from this bottom-up analysis, not be borrowed wholesale from a generic rubric template.

πŸ’‘ Real-World Example: A team evaluating a code-generation assistant started with three criteria: correctness, style, and efficiency. After running their rubric on 50 real outputs, they discovered a fourth failure mode they had not anticipated β€” responses that were syntactically correct but silently changed the function signature, breaking callers. None of their three criteria caught this. They added a fourth: interface stability, defined as "the function signature (name, parameters, return type) matches the specification exactly." Coverage audits should be ongoing, not one-time events.

Redundancy is the opposite failure. If you have separate criteria for "completeness" and "thoroughness," you risk scoring the same gap twice. Before finalizing criteria, ask of every pair: "Can a response score differently on these two criteria?" If the answer is almost always no, collapse them.
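
That pairwise question can also be checked empirically: score a sample of responses on all criteria and flag pairs whose scores are highly correlated. A sketch using Pearson correlation β€” the 0.9 threshold is an illustrative choice, not a standard:

```python
import numpy as np

def redundancy_check(scores_by_criterion: dict[str, list[float]],
                     threshold: float = 0.9) -> list[tuple[str, str, float]]:
    """Returns criterion pairs whose score correlation exceeds `threshold`
    across a sample of responses -- candidates for merging."""
    names = list(scores_by_criterion)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = np.corrcoef(scores_by_criterion[names[i]],
                            scores_by_criterion[names[j]])[0, 1]
            if abs(r) > threshold:
                flagged.append((names[i], names[j], round(float(r), 2)))
    return flagged

# Toy scores for six responses on three criteria (0-2 scale)
scores = {
    "completeness": [2, 1, 0, 2, 1, 2],
    "thoroughness": [2, 1, 0, 2, 1, 2],   # moves identically -> redundant
    "tone":         [1, 2, 2, 0, 1, 2],
}
flagged = redundancy_check(scores)
```

Correlation is evidence, not proof β€” two criteria can correlate on your sample yet diverge on rarer inputs β€” so treat flagged pairs as merge candidates to review, not automatic deletions.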

πŸ“‹ Quick Reference Card: Coverage Audit Checklist

  • πŸ” Can a response fail in a way no criterion catches? β†’ Red flag: criteria missing
  • πŸ” Do two criteria always move together? β†’ Red flag: redundancy risk
  • βš–οΈ Does each criterion catch a distinct failure mode? β†’ Healthy sign: good coverage
  • πŸ—ΊοΈ Were criteria derived from real failure examples? β†’ Healthy sign: bottom-up process

Gradation: Designing Score Levels That Carry Information

Once you know what you're measuring, you need to decide how you're measuring it. This is gradation β€” the design of the scoring scale and the behavioral descriptions anchored to each level.

The most common choice is between three archetypes:

  • Binary (pass/fail): Simple, low-ambiguity, but loses nuance. A response that nearly passes is indistinguishable from one that catastrophically fails.
  • Tiered labels (e.g., Poor / Acceptable / Good / Excellent): More informative than binary but requires careful anchoring to prevent label inflation (everything clustering at "Good").
  • Numeric scales (e.g., 1–5): Maximally granular, but the numbers are meaningless without behavioral anchors at each level.

🎯 Key Principle: A score level without a behavioral anchor is an aesthetic preference, not a measurement. "3 out of 5" means nothing unless you specify what a 3 looks like.

Here is what an unanchored versus an anchored scale looks like for the criterion "Factual Accuracy":

❌ UNANCHORED (useless):
1 = Very inaccurate
2 = Inaccurate
3 = Somewhat accurate
4 = Accurate
5 = Very accurate

βœ… ANCHORED (reproducible):
1 = The response contains multiple factual errors that directly contradict
    the source material, or makes claims that could cause harm if acted upon.

2 = The response contains at least one significant factual error (incorrect
    date, name, statistic, or causal relationship) that affects the core
    message, even if some peripheral details are correct.

3 = The response is factually correct on all major claims but contains one
    minor imprecision (rounding, simplification) that does not materially
    mislead the reader.

4 = All claims are factually supported by the source material, with no
    detectable errors or unsupported generalizations.

5 = All claims are accurate, and the response additionally flags uncertainty
    where the source material is ambiguous, rather than presenting uncertain
    information as settled fact.

Notice that the anchored descriptions do several things simultaneously: they specify what counts as a failure at each severity, they give a judge something concrete to look for, and they create a monotonic quality ladder β€” it is impossible for a level-5 response to also qualify as a level-2, because the descriptions are mutually exclusive.

πŸ’‘ Mental Model: Think of score levels as admission criteria for a club. Each club (score level) has explicit membership rules. A response qualifies for a club if and only if it meets that club's rules β€” not because a judge feels it belongs there.

Choosing the Right Scale Granularity

More levels is not always better. A 10-point scale sounds precise, but if judges cannot reliably distinguish a 6 from a 7, you have false precision. Research on human rating reliability suggests that 4–6 levels are the practical sweet spot for most quality dimensions β€” granular enough to be informative, coarse enough that distinctions are defensible.

For LLM judges specifically, binary and 3-level scales tend to produce more consistent scoring than 5-point scales, because the judge is making fewer discrimination decisions. If you need fine-grained scores for downstream ranking, consider running the LLM judge on a 3-level scale and using multiple criteria to produce a composite score, rather than asking a single criterion to carry all the nuance.
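
Here is a sketch of that composite approach, assuming several coarse 0–2 criteria combined with weights (the weighting scheme and function name are illustrative assumptions):

```python
def composite_score(criterion_scores: dict[str, int],
                    weights: dict[str, float]) -> float:
    """Weighted average of per-criterion scores, rescaled to 0-1.
    Each criterion is assumed to be on a 0-2 scale; weights need not
    sum to 1 (they are normalized here)."""
    total_weight = sum(weights.values())
    weighted = sum(weights[name] * score / 2
                   for name, score in criterion_scores.items())
    return round(weighted / total_weight, 3)

# Three coarse criteria, each scored 0-2 by a separate judge call
scores = {"accuracy": 2, "completeness": 1, "tone": 2}
weights = {"accuracy": 0.5, "completeness": 0.3, "tone": 0.2}
print(composite_score(scores, weights))  # β†’ 0.85
```

The fine granularity now comes from aggregating several low-ambiguity judgments, rather than asking the judge to make one high-ambiguity discrimination on a wide scale.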

⚠️ Common Mistake β€” Mistake 2: Using a 1–5 numeric scale without anchors and then averaging scores across criteria. You end up with a number like 3.4 that is statistically meaningless because the individual inputs were not on a common behavioral scale.


Worked Examples and Exemplar Outputs as Anchors

Even the most carefully written behavioral anchor can be interpreted differently in edge cases. The solution is to supplement written anchors with exemplar outputs β€” actual (or constructed) response samples that demonstrate each score level for a given criterion.

Exemplars do something that written descriptions cannot: they resolve ambiguity at the boundary between adjacent score levels. A judge reading an anchor for score 3 and an anchor for score 4 may still be uncertain when a real response falls in the gap. An exemplar at each level eliminates that uncertainty by providing a concrete reference point.

## Example: Structuring exemplars alongside rubric criteria in a judge prompt

RUBRIC_WITH_EXEMPLARS = """
CRITERION: Factual Accuracy
SCORE 1:
  Description: Multiple errors contradicting source material.
  Exemplar: "The Eiffel Tower was built in 1901 by Gustave Cluny as a 
             permanent monument." (Actual: built 1887-1889 by Eiffel, 
             originally temporary.)

SCORE 3:
  Description: Major claims correct; one minor imprecision.
  Exemplar: "The Eiffel Tower, completed around 1889, was designed by 
             Gustave Eiffel." (Minor: construction finished March 1889, 
             but 'around 1889' is acceptable.)

SCORE 5:
  Description: All claims accurate; uncertainty flagged where appropriate.
  Exemplar: "The Eiffel Tower was completed in March 1889, designed by 
             Gustave Eiffel. Sources vary on the exact opening date, with 
             some citing March 31 and others April 6."

Now score the following response on a 1-5 scale using these anchors:
[RESPONSE TO EVALUATE]
"""

This code block shows how to embed exemplars directly in a judge prompt. Notice that exemplars are paired with written descriptions, not used as replacements for them. The description tells the judge what to look for; the exemplar shows them what it looks like in practice.

πŸ’‘ Pro Tip: Include exemplars at the boundary score levels (the edges of your scale) and at least one middle level. You do not need exemplars for every level β€” the ones you include will help judges interpolate the rest.

πŸ€” Did you know? In psychometric research, providing annotators with exemplar responses before scoring tasks is called frame-of-reference training, and studies consistently show it reduces inter-rater variance by 20–40% compared to written descriptions alone. The same effect holds when the "rater" is an LLM.


Output-Level vs. Process-Level Criteria

One of the most important and frequently overlooked distinctions in rubric design is the difference between output-level criteria and process-level criteria.

Output-level criteria evaluate what is present in the response: the claims made, the structure used, the information included or omitted. These criteria are largely surface-verifiable. A judge can confirm them by reading the response against a checklist.

Process-level criteria evaluate how the response reaches its conclusion: the quality of reasoning, the appropriateness of the logical steps, the handling of uncertainty. These criteria require the judge to follow the response's internal logic, not just verify its outputs.

OUTPUT-LEVEL vs. PROCESS-LEVEL CRITERIA

  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚     OUTPUT-LEVEL            β”‚    β”‚     PROCESS-LEVEL           β”‚
  β”‚                             β”‚    β”‚                             β”‚
  β”‚  β€’ Does the response cite   β”‚    β”‚  β€’ Does the response show   β”‚
  β”‚    a source?                β”‚    β”‚    how it weighted sources? β”‚
  β”‚                             β”‚    β”‚                             β”‚
  β”‚  β€’ Is a conclusion stated?  β”‚    β”‚  β€’ Does the reasoning from  β”‚
  β”‚                             β”‚    β”‚    evidence to conclusion   β”‚
  β”‚  β€’ Are all required         β”‚    β”‚    hold without gaps?       β”‚
  β”‚    sections present?        β”‚    β”‚                             β”‚
  β”‚                             β”‚    β”‚  β€’ Does the response        β”‚
  β”‚  β€’ Is the answer within     β”‚    β”‚    acknowledge assumptions  β”‚
  β”‚    the correct range?       β”‚    β”‚    it cannot verify?        β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         Verifiable by reading              Requires following logic

Both types are valid and often both are necessary. A rubric for a medical summarization task might include output-level criteria ("The summary mentions the patient's primary diagnosis") alongside process-level criteria ("When the source document contains conflicting data, the response acknowledges the conflict rather than silently resolving it").

The danger is conflating them. If you try to use an output-level criterion to capture a process-level failure, you will get false positives. A response can state a conclusion and show flawed reasoning β€” the output looks fine, the process is broken. Conversely, a response can exhibit excellent reasoning but omit a required output element. Each failure mode needs its own criterion type.

# Practical example: Separating output and process criteria in a rubric structure

rubric = {
    "criteria": [
        {
            "id": "output_completeness",
            "type": "output",
            "description": "The response addresses all parts of the user's question.",
            "scale": [1, 2, 3],
            "anchors": {
                1: "One or more major parts of the question are not addressed.",
                2: "All major parts are addressed but one is shallow or incomplete.",
                3: "All parts of the question are fully and specifically addressed."
            }
        },
        {
            "id": "reasoning_validity",
            "type": "process",
            "description": "The logical steps from evidence to conclusion are valid.",
            "scale": [1, 2, 3],
            "anchors": {
                1: "The conclusion does not follow from the evidence, or evidence "
                   "is fabricated or misrepresented.",
                2: "The reasoning is generally sound but contains one inferential "
                   "leap not supported by the available evidence.",
                3: "Every step in the reasoning chain is supported by cited "
                   "evidence, and no unsupported inferences are made."
            }
        }
    ]
}

# When prompting an LLM judge, you can evaluate each criterion independently
def format_criterion_prompt(criterion: dict, response_text: str) -> str:
    anchors_text = "\n".join(
        f"  Score {k}: {v}" for k, v in criterion["anchors"].items()
    )
    return f"""
Evaluate the following response on this single criterion only.

CRITERION: {criterion['description']}
TYPE: {criterion['type'].upper()}

SCORING SCALE:
{anchors_text}

RESPONSE TO EVALUATE:
{response_text}

Provide your score (1, 2, or 3) and a one-sentence justification.
Format: SCORE: <number> | REASON: <justification>
"""

This code demonstrates a key architectural principle: rubric criteria are stored as structured data, not hardcoded strings. Separating criterion type ("output" vs. "process"), description, scale, and anchors into distinct fields makes it easy to iterate on individual criteria, add new ones, or adjust anchors without rewriting your entire evaluation prompt. The format_criterion_prompt function then assembles this structure into a judge prompt on demand.
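One piece is still missing before these scores can be aggregated: the judge's `SCORE: <number> | REASON: <justification>` reply has to be parsed back into structured data. Below is a minimal parsing sketch, assuming the judge follows the requested format; the `parse_judge_reply` helper is illustrative, not part of any specific library.

```python
import re

def parse_judge_reply(reply: str) -> dict:
    """Parse a 'SCORE: <number> | REASON: <justification>' judge reply."""
    match = re.search(r"SCORE:\s*(\d+)\s*\|\s*REASON:\s*(.+)", reply, re.DOTALL)
    if match is None:
        raise ValueError(f"Judge reply did not match the expected format: {reply!r}")
    score = int(match.group(1))
    if score not in (1, 2, 3):
        raise ValueError(f"Score {score} is outside the criterion's 1-3 scale")
    return {"score": score, "reason": match.group(2).strip()}

reply = "SCORE: 2 | REASON: One major part of the question is addressed only superficially."
print(parse_judge_reply(reply))
# {'score': 2, 'reason': 'One major part of the question is addressed only superficially.'}
```

Failing loudly on malformed replies is deliberate: a silently defaulted score corrupts aggregate metrics without leaving a trace.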

🧠 Mnemonic: SCAG β€” Specificity, Coverage, Anchored gradation, Gap between output and process. The four structural pillars of a well-formed rubric.

πŸ’‘ Pro Tip: When you first deploy a new rubric, run it on a contrast set β€” a small batch of responses you already know are good and responses you already know are bad. If the rubric cannot reliably separate these two groups, something in the structure is broken. Fix the rubric before scaling your evaluation pipeline.


Putting the Components Together

A well-formed rubric is the product of all four components working in concert. Specificity ensures the judge knows what to look for. Coverage ensures nothing important escapes. Gradation ensures scores carry information. Exemplars and anchors ensure abstract descriptions translate into consistent verdicts. And the output/process distinction ensures that surface failures and reasoning failures are caught by criteria designed for each.

WELL-FORMED RUBRIC ASSEMBLY

  Failure Mode      Coverage        Criterion         Anchored
  Enumeration  ──►  Audit      ──►  Definition   ──►  Scale        ──► Judge Prompt
  (what can go      (distinct,      (observable,      (behavioral
   wrong?)           complete        specific          descriptors
                     criteria)       behavior)         + exemplars)

In the next section, we will take this structural understanding and turn it into a systematic process for decomposing high-level quality goals β€” like "this response should be trustworthy" β€” into the discrete, independently evaluable criteria that give rubrics their scoring power.

Decomposing Quality Dimensions into Evaluable Criteria

When you ask an LLM judge to evaluate whether a response is "good," you are essentially asking it to solve a poorly specified optimization problem. "Good" can mean accurate, clear, complete, appropriately toned, well-structured, and a dozen other things simultaneously β€” and different judges will weight those dimensions differently on each run. The solution is not to write a better definition of "good." It is to stop asking for "good" at all and instead ask a series of smaller, sharper questions that together compose the full picture of quality. This process is called criteria decomposition, and it is the practical engine that transforms a vague quality goal into a reproducible evaluation.

Starting from the Quality Goal

Every decomposition begins with a quality goal: a high-level statement of what a successful response achieves. Examples include "the response is factually correct," "the response is helpful to the user," or "the response follows the brand's communication style." These goals are legitimate β€” they capture real things you care about β€” but they are too coarse for a judge to score consistently.

The decomposition process starts with a single generative question: "What observable properties would a response have if it fully succeeded on this dimension?" This question is powerful because it shifts your attention from abstract ideals to concrete evidence. You are not asking what quality means; you are asking what quality looks like in the text.

Consider the quality goal "the response is helpful." Applying the generative question:

  • A helpful response directly addresses what the user asked, not a related but different question.
  • A helpful response provides enough detail for the user to act on the information.
  • A helpful response does not include so much irrelevant detail that the key information is buried.
  • A helpful response uses language the user can understand given the context they provided.

Each of these is now a candidate sub-criterion: a discrete, observable property that contributes to the overall quality goal. The raw list will be messy and overlapping at first. That is fine. The next step is to find the natural fault lines.

Quality Goal
     β”‚
     β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  "What observable properties would a       β”‚
β”‚   response have if it succeeded here?"     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
     β”‚
     β–Ό
 Raw Observations (messy, overlapping)
  β”œβ”€β”€ Addresses the actual question asked
  β”œβ”€β”€ Provides actionable detail
  β”œβ”€β”€ Avoids irrelevant padding
  └── Uses appropriate language level
     β”‚
     β–Ό
 Identify fault lines β†’ Cluster β†’ Name sub-criteria
  β”œβ”€β”€ Relevance     (addresses the right question)
  β”œβ”€β”€ Completeness  (enough detail to act)
  β”œβ”€β”€ Conciseness   (no harmful padding)
  └── Accessibility (appropriate language level)

Identifying Natural Fault Lines

A fault line in a quality dimension is a boundary where a response can succeed on one side and fail on the other β€” independently. Finding fault lines is how you turn a raw list of observations into a clean set of sub-criteria.

Take the quality goal "correctness" as a worked example. It decomposes along at least three natural fault lines:

🎯 Key Principle: A sub-criterion marks a genuine fault line only if you can construct a realistic response that fails that sub-criterion while passing all the others.

Factual accuracy β€” Does the response state things that are true? A response can fail here by citing a wrong statistic while remaining logically coherent and complete on the topic it addresses.

Logical consistency β€” Do the claims in the response contradict each other or lead to contradictions when combined? A response can fail here even if every individual claim is true β€” for instance, by asserting two premises that cannot both hold simultaneously.

Completeness β€” Does the response cover all the aspects of the question that are necessary for the answer to be usable? A response can be factually accurate and internally consistent while omitting a critical condition that changes the conclusion.

These three sub-criteria partition the space of correctness failures. A response that fails factual accuracy does not automatically fail logical consistency β€” it might be a consistent but false account. A response that fails completeness might be perfectly accurate on everything it does cover. This independence is the signal that you have found a real fault line.

⚠️ Common Mistake: Decomposing into sub-criteria that are so tightly correlated that failing one almost always means failing another. If your sub-criteria are "the response is clear" and "the response is easy to understand," you have not decomposed anything β€” you have paraphrased. True sub-criteria should be independently violable.

The Independence Test

Before finalizing your sub-criteria, apply the independence test to each pair: write or imagine a concrete response that fails sub-criterion A but passes sub-criterion B, and another that fails B but passes A. If you cannot construct both directions, the two sub-criteria are not genuinely independent and should be merged or rethought.
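The independence test can also be recorded as data, so the rubric review leaves an audit trail. The sketch below encodes the worked "correctness" example as ordered pairs; `non_independent_pairs` is an illustrative helper, not an established API.

```python
from itertools import permutations

# Record of the independence test for the "correctness" sub-criteria.
# Each entry answers: "can a realistic response fail the first criterion
# while passing the second?"
can_fail_first_pass_second = {
    ("factual_accuracy", "logical_consistency"): True,   # a coherent but false account
    ("logical_consistency", "factual_accuracy"): True,   # individually true claims that clash
    ("factual_accuracy", "completeness"): True,
    ("completeness", "factual_accuracy"): True,
    ("logical_consistency", "completeness"): True,
    ("completeness", "logical_consistency"): True,
}

def non_independent_pairs(criteria, record):
    """Return ordered pairs with no contrast example; candidates for merging."""
    flagged = []
    for a, b in permutations(criteria, 2):
        if not record.get((a, b), False):
            flagged.append((a, b))
    return flagged

criteria = ["factual_accuracy", "logical_consistency", "completeness"]
print(non_independent_pairs(criteria, can_fail_first_pass_second))  # [] -> all independent
```

An empty result means every pair passed in both directions; any flagged pair points to two sub-criteria that should be merged or rethought.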

πŸ’‘ Real-World Example: A team evaluating a medical Q&A assistant decomposed "safety" into three sub-criteria: (1) the response does not recommend dangerous actions, (2) the response includes appropriate disclaimers when uncertainty exists, and (3) the response directs users to professional care when the question exceeds the system's scope. During the independence test, they found they could construct a response that fails (1) while passing (2) and (3) β€” a response that confidently recommends a dangerous home remedy with no disclaimer and no referral. They could also construct a response that fails (2) alone β€” a medically correct recommendation stated without appropriate hedging. The test confirmed all three were genuine fault lines.

Writing Criterion Prompts That Point to Evidence

Once you have your sub-criteria, you face a second design challenge: how do you phrase each criterion so that the LLM judge looks for evidence in the response rather than forming a global impression?

The key insight is that global impression questions invite the judge to use its priors about what good responses look like, while evidence-directed questions anchor the judge to what is actually written in the response. This distinction is the difference between reproducible and unreproducible scoring.

❌ Wrong thinking: "Is this response factually accurate? Score 1-5."

βœ… Correct thinking: "Identify each factual claim made in the response. For each claim, assess whether it is supported by established knowledge. Score 5 if all claims are accurate, 4 if one minor claim is imprecise but no claims are wrong, 3 if one claim is factually incorrect, 2 if multiple claims are incorrect, 1 if the response is predominantly false."

The evidence-directed version tells the judge where to look (factual claims), what to do (assess each one), and how to map findings to scores (a defined scale). This eliminates the judge's discretion about what counts as relevant evidence, which is the primary source of inter-run variance.

When writing criterion prompts, use this structure:

  1. Scope statement: What part of the response should the judge examine?
  2. Evidence instruction: What specific properties should the judge look for?
  3. Score mapping: What findings correspond to what scores?
  4. Edge case note (optional): How should the judge handle ambiguous or missing content?
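That four-part structure can be enforced with a small assembly function so every criterion prompt is built the same way; `assemble_criterion_prompt` is an illustrative name, not part of the lesson's pipeline.

```python
def assemble_criterion_prompt(
    scope: str,
    evidence_instruction: str,
    score_mapping: str,
    edge_case_note: str = "",
) -> str:
    """Combine the four structural parts of a criterion prompt into one fragment."""
    parts = [
        f"SCOPE: {scope}",
        f"EVIDENCE: {evidence_instruction}",
        f"SCORING: {score_mapping}",
    ]
    if edge_case_note:  # the edge-case note is optional
        parts.append(f"EDGE CASES: {edge_case_note}")
    return "\n".join(parts)

fragment = assemble_criterion_prompt(
    scope="Examine every factual claim made in the response.",
    evidence_instruction="Assess whether each claim is supported by established knowledge.",
    score_mapping="5 = all claims accurate; 3 = one claim incorrect; 1 = predominantly false.",
    edge_case_note="If the response makes no factual claims, score 5 and note the absence.",
)
print(fragment)
```

Building fragments through one function, rather than by hand, guarantees that no criterion silently omits its score mapping or edge-case handling.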

πŸ’‘ Pro Tip: Use the phrase "cite the specific part of the response that supports your score" in your criterion prompt. This forces the judge to ground its reasoning in the text, which makes its explanations auditable and catches hallucinated justifications.

Translating a Vague Rubric into a Structured Criteria Dictionary

The practical artifact of criteria decomposition is a rubric dictionary: a structured data object that pairs each sub-criterion with its description, scoring scale, and evaluation instructions. This object becomes the single source of truth for your evaluation pipeline β€” the same structure that populates your judge prompt also drives your score aggregation logic.

Here is a concrete translation from vague to structured. Suppose your original rubric is:

"Evaluate whether the response is correct and useful. Score 1-5."

After decomposition along the fault lines for correctness and helpfulness, you arrive at the following structured rubric:

# rubric.py
# A structured rubric for evaluating a technical Q&A assistant response.
# Each key is a sub-criterion name; each value is a dict describing how to evaluate it.

RUBRIC = {
    "factual_accuracy": {
        "description": "All factual claims in the response are correct.",
        "weight": 0.30,
        "scale": {
            5: "Every factual claim is correct and verifiable.",
            4: "One minor claim is imprecise but not wrong (e.g., approximate figure).",
            3: "One claim is factually incorrect but does not invalidate the core answer.",
            2: "Multiple claims are incorrect, undermining the response's usefulness.",
            1: "The response is predominantly or dangerously false."
        },
        "instructions": (
            "Identify each distinct factual claim in the response. "
            "For each, assess whether it is accurate based on established knowledge. "
            "Cite the specific claim when assigning a score below 5."
        )
    },
    "logical_consistency": {
        "description": "The claims in the response do not contradict each other.",
        "weight": 0.20,
        "scale": {
            5: "All claims are mutually consistent.",
            4: "A minor tension exists but does not affect the conclusion.",
            3: "A contradiction exists that could confuse the reader.",
            2: "Multiple contradictions make the response unreliable.",
            1: "The response is internally incoherent throughout."
        },
        "instructions": (
            "Read the response and check whether any two claims, when combined, "
            "produce a contradiction. Quote the contradicting statements if found."
        )
    },
    "completeness": {
        "description": "The response covers all aspects of the question necessary for the answer to be actionable.",
        "weight": 0.25,
        "scale": {
            5: "All necessary aspects are addressed; nothing critical is missing.",
            4: "One minor aspect is omitted but the response is still actionable.",
            3: "A significant aspect is missing, reducing the response's usefulness.",
            2: "Multiple significant aspects are missing.",
            1: "The response fails to address the core of the question."
        },
        "instructions": (
            "First, list the aspects of the question that would need to be addressed "
            "for a complete answer. Then assess which of those aspects the response covers. "
            "Identify any that are missing."
        )
    },
    "relevance": {
        "description": "The response addresses what the user actually asked, not a related but different question.",
        "weight": 0.25,
        "scale": {
            5: "The response directly and precisely addresses the user's question.",
            4: "The response addresses the question but includes minor tangents.",
            3: "The response partially addresses the question but drifts significantly.",
            2: "The response mostly addresses a different question.",
            1: "The response is entirely off-topic."
        },
        "instructions": (
            "State in one sentence what the user's question is asking for. "
            "Then assess how directly the response targets that need. "
            "Note any significant content that does not serve the user's actual question."
        )
    }
}

This structure makes several things explicit that were hidden in the original vague rubric: the relative weight of each dimension, the specific evidence the judge should gather, and the precise meaning of each score level. The instructions field is especially important β€” it is the evidence-directed prompt fragment that will be embedded directly into the judge's evaluation prompt.
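Since the dictionary is the single source of truth, it is worth validating its shape once at load time, before any judge call is made. Below is a hedged sketch of such a check, assuming (as in the RUBRIC above) that weights should sum to 1.0 and every scale defines levels 1 through 5; `validate_rubric` is an illustrative helper.

```python
def validate_rubric(rubric: dict) -> None:
    """Fail fast if the rubric dictionary is structurally inconsistent."""
    total_weight = sum(cfg["weight"] for cfg in rubric.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"Criterion weights sum to {total_weight}, expected 1.0")
    for name, cfg in rubric.items():
        missing = {"description", "weight", "scale", "instructions"} - set(cfg)
        if missing:
            raise ValueError(f"Criterion '{name}' is missing fields: {sorted(missing)}")
        if set(cfg["scale"]) != {1, 2, 3, 4, 5}:
            raise ValueError(f"Criterion '{name}' must define scale levels 1 through 5")

# Passes silently on a well-formed rubric; raises on a structural defect.
validate_rubric({
    "factual_accuracy": {
        "description": "All factual claims are correct.",
        "weight": 1.0,
        "scale": {5: "all correct", 4: "minor imprecision", 3: "one error",
                  2: "multiple errors", 1: "predominantly false"},
        "instructions": "Identify each claim and assess its accuracy.",
    }
})
```

Running this once at pipeline startup turns a silent scoring bug (a typo in a field name, a missing scale level) into an immediate, attributable failure.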

Building a Criterion-Level Prompt Renderer

With the rubric dictionary in place, you can programmatically render evaluation prompts for each sub-criterion. This keeps your evaluation logic DRY and ensures that every criterion is evaluated with consistent framing.

# prompt_renderer.py
# Renders a per-criterion evaluation prompt from the rubric dictionary.

def render_criterion_prompt(
    criterion_name: str,
    criterion_config: dict,
    user_question: str,
    model_response: str
) -> str:
    """
    Produces a focused evaluation prompt for a single rubric criterion.
    The judge is given the criterion description, scale, and evidence instructions,
    then asked to score only this one dimension.
    """
    scale_text = "\n".join(
        f"  Score {score}: {label}"
        for score, label in sorted(criterion_config["scale"].items(), reverse=True)
    )

    prompt = f"""You are an expert evaluator. Your task is to assess ONE specific quality dimension of a model response.

### Criterion: {criterion_name.replace('_', ' ').title()}
{criterion_config['description']}

### Evaluation Instructions
{criterion_config['instructions']}

### Scoring Scale
{scale_text}

### User Question
{user_question}

### Model Response
{model_response}

### Your Task
1. Follow the evaluation instructions above to gather evidence from the response.
2. Assign a score from 1 to 5 based strictly on the scale above.
3. Cite the specific part of the response that most influenced your score.

Respond in this format:
SCORE: <integer 1-5>
EVIDENCE: <quoted text from response>
REASONING: <one or two sentences explaining the score>"""

    return prompt


# Example usage
if __name__ == "__main__":
    from rubric import RUBRIC

    question = "How do I safely dispose of old lithium-ion batteries?"
    response = (
        "You can throw lithium-ion batteries in the regular trash. "
        "They are mostly plastic and metal, which decompose naturally. "
        "Check your local electronics retailer for recycling drop-off points."
    )

    # Render the prompt for the 'factual_accuracy' criterion
    prompt = render_criterion_prompt(
        criterion_name="factual_accuracy",
        criterion_config=RUBRIC["factual_accuracy"],
        user_question=question,
        model_response=response
    )
    print(prompt)

Notice that this response is a particularly good test case: it contains an internal contradiction (throw them in the trash, yet use recycling drop-off points), a factual error (lithium-ion batteries do not decompose naturally), and it is only partially complete. Evaluating it criterion by criterion will produce a differentiated score profile β€” low on factual accuracy, low on logical consistency, medium on completeness β€” rather than a single undifferentiated "bad" judgment. That differentiation is exactly what makes the evaluation actionable.
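Here is a compressed sketch of how per-criterion evaluation yields that profile. The `fake_judge` function is a hypothetical stand-in whose canned scores mirror the failures described above; a real pipeline would send each rendered prompt to an LLM and parse the reply.

```python
# Canned scores standing in for real judge calls, mirroring the battery example:
# false decomposition claim, trash/recycling contradiction, partial coverage.
CANNED_SCORES = {"factual_accuracy": 1, "logical_consistency": 1, "completeness": 3}

def fake_judge(criterion_name: str, prompt: str) -> int:
    return CANNED_SCORES[criterion_name]

def score_profile(rubric: dict, question: str, response: str, judge) -> dict:
    """Evaluate each criterion independently and collect a per-criterion profile."""
    return {
        name: judge(name, f"CRITERION: {cfg['description']}\n"
                          f"QUESTION: {question}\nRESPONSE: {response}")
        for name, cfg in rubric.items()
    }

mini_rubric = {
    "factual_accuracy": {"description": "All factual claims are correct."},
    "logical_consistency": {"description": "Claims do not contradict each other."},
    "completeness": {"description": "All necessary aspects are covered."},
}
profile = score_profile(
    mini_rubric,
    question="How do I safely dispose of old lithium-ion batteries?",
    response="You can throw them in the trash. Check your retailer for recycling drop-off.",
    judge=fake_judge,
)
print(profile)  # {'factual_accuracy': 1, 'logical_consistency': 1, 'completeness': 3}
```

Swapping `fake_judge` for a real model call is the only change needed to make this loop production-grade, which is why keeping the judge as an injected function is a useful design choice.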

πŸ’‘ Mental Model: Think of your rubric dictionary as a measurement instrument, like a multimeter. A single "is this circuit working?" question gives you a binary answer. Measuring voltage, current, and resistance separately gives you a diagnostic profile that tells you what is failing and why. Criteria decomposition gives your evaluations the same diagnostic resolution.

Choosing the Right Granularity

Decomposition can be taken too far. If you decompose "factual accuracy" into 15 sub-sub-criteria, each covering a single type of claim, you will generate so many judge calls that the evaluation becomes prohibitively expensive and the results become difficult to interpret. The right granularity is determined by two practical constraints.

Actionability: Each sub-criterion should correspond to a distinct failure mode that a developer or prompt engineer could address independently. If two sub-criteria would always lead to the same fix, merge them.

Evaluability: Each sub-criterion should be assessable by the judge with the context it is given. A criterion like "the response is consistent with the company's undocumented internal policies" cannot be evaluated without additional context that the judge does not have.

πŸ“‹ Quick Reference Card: Decomposition Quality Checklist

  🎯 Independently violable: Can a response fail this without failing the others?
  πŸ”§ Observable:             Can the judge find evidence for this in the text?
  πŸ“š Actionable:             Does failure point to a distinct fix?
  πŸ”’ Non-redundant:          Is this meaningfully different from every other criterion?
  🧠 Appropriately scoped:   Is it neither too broad nor so narrow it rarely fires?

⚠️ Common Mistake: Creating criteria that require the judge to know things outside the response β€” such as whether a cited study actually exists, or whether a recommendation matches a specific product's documentation. If verification requires external lookup, either provide that context in the prompt or replace the criterion with one that is assessable from the text alone (e.g., "does the response acknowledge the limits of its knowledge?").

Putting It Together: The Decomposition Workflow

To close this section, here is the full decomposition workflow as a repeatable process:

1. STATE the quality goal clearly.
   └── "The response should be correct."

2. ASK the generative question.
   └── "What observable properties would a correct response have?"

3. LIST raw observations.
   └── Facts are true, claims don't contradict, nothing critical is missing, ...

4. FIND fault lines β€” cluster observations into groups
   that can fail independently.
   └── Factual accuracy | Logical consistency | Completeness

5. APPLY the independence test to each pair.
   └── Can A fail while B passes? Can B fail while A passes?
   └── Yes β†’ keep separate. No β†’ merge.

6. WRITE evidence-directed criterion prompts.
   └── Scope + Evidence instruction + Score mapping + Edge cases

7. APPLY the evaluability check.
   └── Can the judge score this from the response text alone?

8. ASSEMBLE the rubric dictionary.
   └── name, description, weight, scale, instructions

This workflow is not a one-time exercise. As you run evaluations and examine disagreements between judge runs β€” or between your judge and human raters β€” you will discover criteria that are still too coarse, criterion prompts that are ambiguous in edge cases, or fault lines you missed entirely. The rubric dictionary is a living artifact. Its initial quality sets the ceiling on your evaluation's reproducibility, but ongoing refinement based on observed failures is what keeps that ceiling rising.

🧠 Mnemonic: Remember the decomposition steps as SALFIA: State the goal, Ask the generative question, List observations, Find fault lines, Independence test, Assemble the rubric. Good rubrics are not written in one sitting β€” they are refined through iteration, just like the systems they evaluate.

Implementing Rubric-Based Evaluation in Code

At this point in the lesson, you understand why rubrics matter and how to decompose quality dimensions into atomic criteria. Now it's time to close the loop: turning that conceptual structure into running code that actually evaluates LLM responses. This section walks you through every layer of the pipeline β€” from representing a rubric as a typed data object, to injecting criteria into judge prompts, to parsing scores and rolling them up into a final composite metric.

The goal is a pipeline you could drop into a real evaluation harness today. Every design decision along the way is made explicit, because in reproducible evaluation, undocumented decisions are bugs waiting to surface.


Representing a Rubric as a Data Object

Before you can evaluate anything, you need a canonical representation of your rubric that the rest of the system can depend on. Rubric-as-data is the principle that every criterion β€” its description, scoring scale, weight, and anchor examples β€” should live in a structured object rather than being scattered across prompt strings or configuration files.

Why does this matter? Because when your rubric lives only inside a prompt template, it's invisible to the rest of your codebase. You can't iterate over its criteria programmatically, you can't validate that a score falls within the right range, and you can't reuse the same rubric across different judge models without copy-pasting strings. A typed data object solves all of these problems at once.

The following example uses Pydantic models, which give you data validation and serialization for free:

from pydantic import BaseModel, Field, field_validator
from typing import List, Optional


class AnchorExample(BaseModel):
    """A labeled example that illustrates a specific score level."""
    score: int
    description: str  # What this score level looks like in practice


class Criterion(BaseModel):
    """A single, independently assessable quality dimension."""
    name: str                          # Short identifier, e.g. "factual_accuracy"
    description: str                   # Full natural-language description for the judge
    min_score: int = 1
    max_score: int = 5
    weight: float = 1.0                # Relative weight in composite score
    anchors: List[AnchorExample] = Field(default_factory=list)  # Optional scoring anchors

    @field_validator("weight")
    @classmethod
    def weight_must_be_positive(cls, v):
        if v <= 0:
            raise ValueError("Criterion weight must be positive")
        return v


class Rubric(BaseModel):
    """A complete evaluation rubric composed of multiple criteria."""
    name: str
    description: str
    criteria: List[Criterion]

    def total_weight(self) -> float:
        return sum(c.weight for c in self.criteria)

    def criterion_by_name(self, name: str) -> Optional[Criterion]:
        return next((c for c in self.criteria if c.name == name), None)

This structure enforces several invariants automatically. The field_validator ensures no criterion can be given a zero or negative weight, which would silently corrupt your composite score. The Rubric.total_weight() method will become essential when we normalize weights during aggregation. Notice that AnchorExample is a first-class model β€” anchor examples aren't afterthoughts bolted onto a string; they're typed objects the prompt-building code can iterate over.

πŸ’‘ Pro Tip: Store rubric definitions as JSON or YAML files and load them at runtime with Rubric.model_validate(json.load(f)). This lets non-engineers edit rubric text without touching Python, while your code still enforces the schema.


Building a Three-Criterion Example Rubric

Let's instantiate a concrete rubric for evaluating customer-support responses. This rubric has three criteria β€” factual accuracy, tone appropriateness, and resolution completeness β€” each with different weights reflecting their relative importance:

support_rubric = Rubric(
    name="customer_support_quality",
    description="Evaluates the quality of an AI-generated customer support response.",
    criteria=[
        Criterion(
            name="factual_accuracy",
            description=(
                "Does the response contain only statements that are factually correct? "
                "Penalize any claims that contradict established product documentation, "
                "policies, or general facts."
            ),
            min_score=1,
            max_score=5,
            weight=2.0,  # Highest weight β€” wrong facts damage trust most
            anchors=[
                AnchorExample(score=1, description="Contains multiple factual errors."),
                AnchorExample(score=3, description="Mostly accurate with one minor inaccuracy."),
                AnchorExample(score=5, description="Completely accurate, no errors detected."),
            ],
        ),
        Criterion(
            name="tone_appropriateness",
            description=(
                "Is the tone professional, empathetic, and suited to a customer support context? "
                "Avoid both overly robotic language and inappropriate informality."
            ),
            min_score=1,
            max_score=5,
            weight=1.0,
            anchors=[
                AnchorExample(score=1, description="Tone is rude, dismissive, or highly inappropriate."),
                AnchorExample(score=3, description="Neutral tone; neither warm nor cold."),
                AnchorExample(score=5, description="Warm, empathetic, and professionally appropriate."),
            ],
        ),
        Criterion(
            name="resolution_completeness",
            description=(
                "Does the response fully address the customer's stated problem? "
                "A complete response either resolves the issue or provides a clear next step."
            ),
            min_score=1,
            max_score=5,
            weight=1.5,
            anchors=[
                AnchorExample(score=1, description="Ignores the core problem entirely."),
                AnchorExample(score=3, description="Partially addresses the problem; key steps missing."),
                AnchorExample(score=5, description="Fully resolves the issue or provides an explicit escalation path."),
            ],
        ),
    ],
)

Weights of 2.0, 1.0, and 1.5 reflect a deliberate editorial judgment: a response that sounds warm but gives wrong information is worse than one that's robotic but accurate. That judgment is now documented in code, not buried in a design document or a team member's memory.
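Because these weights do not sum to 1.0, the composite score must be normalized by the total weight (here 4.5) rather than by the criterion count. Below is a minimal aggregation sketch under that assumption; `weighted_composite` is an illustrative helper, and the per-criterion scores are assumed to come from already-parsed judge replies.

```python
def weighted_composite(scores: dict, weights: dict) -> float:
    """Weighted average of per-criterion scores, normalized by total weight."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight

weights = {"factual_accuracy": 2.0, "tone_appropriateness": 1.0, "resolution_completeness": 1.5}

# The editorial judgment in action: warm but wrong loses to robotic but right.
warm_but_wrong = {"factual_accuracy": 2, "tone_appropriateness": 5, "resolution_completeness": 4}
robotic_but_right = {"factual_accuracy": 5, "tone_appropriateness": 3, "resolution_completeness": 4}

print(round(weighted_composite(warm_but_wrong, weights), 2))     # 3.33
print(round(weighted_composite(robotic_but_right, weights), 2))  # 4.22
```

Normalizing inside the function means the rubric author can tune individual weights freely without keeping them summing to any particular total.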


Translating Criteria into Judge Prompts

With the rubric as a data object, the next step is generating a judge prompt for each criterion. The key design choice here is per-criterion prompting: instead of asking the judge to evaluate all three criteria in a single prompt, you make one focused call per criterion. This reduces the cognitive load on the judge model and makes each output easier to parse and validate.

🎯 Key Principle: One criterion, one prompt call. Multi-criteria single prompts invite the judge to average across dimensions mentally, producing less reliable per-criterion scores.

The prompt template has four injectable components: the criterion description, the score scale, the anchor examples, and the actual content being evaluated.



JUDGE_PROMPT_TEMPLATE = """
You are a precise evaluation assistant. Your task is to score the following response on ONE specific criterion.

### Criterion: {criterion_name}
{criterion_description}

### Scoring Scale
Score from {min_score} (lowest) to {max_score} (highest).

### Score Anchors
{anchor_text}

### Content to Evaluate
#### User Question:
{user_question}

#### Response Being Evaluated:
{response_text}

### Your Output
Respond in this EXACT JSON format and nothing else:
{{
  "criterion": "{criterion_name}",
  "score": <integer between {min_score} and {max_score}>,
  "reasoning": "<one or two sentences explaining your score>"
}}
"""


def build_judge_prompt(
    criterion: Criterion,
    user_question: str,
    response_text: str,
) -> str:
    """Render a judge prompt for a single criterion."""
    anchor_text = "\n".join(
        f"- Score {a.score}: {a.description}" for a in criterion.anchors
    ) or "No anchors provided."

    return JUDGE_PROMPT_TEMPLATE.format(
        criterion_name=criterion.name,
        criterion_description=criterion.description,
        min_score=criterion.min_score,
        max_score=criterion.max_score,
        anchor_text=anchor_text,
        user_question=user_question,
        response_text=response_text,
    )

Several design choices deserve attention here. The anchor text is generated from the typed AnchorExample objects: if you add or remove anchors in the rubric definition, the prompt updates automatically. The output format instruction demands a specific JSON shape, which the parsing layer will enforce. And the criterion name is echoed back in the output JSON, which lets you detect hallucinated or misrouted responses during validation.

⚠️ Common Mistake: Asking the judge to score "on a scale of 1-5" without anchors produces wildly different distributions across judge model versions. Always include at least three anchors (low, mid, high) so the model has calibration points.


Parsing and Validating Judge Outputs

The judge will return a string. Your pipeline's reliability depends on how robustly you parse that string and how strictly you enforce the schema. Structured output validation is the practice of treating every judge response as potentially malformed until proven otherwise.

The parsing layer does three things: extracts the JSON from the response (gracefully handling markdown fences the model might wrap it in), validates the schema, and enforces business rules like score range.

import json
import re
from pydantic import BaseModel, field_validator


class CriterionScore(BaseModel):
    """Validated output from a single criterion judge call."""
    criterion: str
    score: int
    reasoning: str

    @field_validator("reasoning")
    @classmethod
    def reasoning_must_be_nonempty(cls, v):
        if not v or not v.strip():
            raise ValueError("Reasoning must not be empty")
        return v.strip()


def parse_judge_response(
    raw_response: str,
    criterion: Criterion,
) -> CriterionScore:
    """
    Extract, parse, and validate a CriterionScore from the judge's raw output.
    Raises ValueError if the response is malformed or violates constraints.
    """
    # Strip markdown code fences if the model wrapped its output
    cleaned = re.sub(r"```(?:json)?\n?", "", raw_response).strip()

    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError as e:
        raise ValueError(
            f"Judge response is not valid JSON: {e}\nRaw: {raw_response!r}"
        ) from e

    score_obj = CriterionScore.model_validate(data)

    # Enforce the criterion's defined score range
    if not (criterion.min_score <= score_obj.score <= criterion.max_score):
        raise ValueError(
            f"Score {score_obj.score} out of range "
            f"[{criterion.min_score}, {criterion.max_score}] "
            f"for criterion '{criterion.name}'"
        )

    # Confirm the model scored the right criterion
    if score_obj.criterion != criterion.name:
        raise ValueError(
            f"Expected criterion '{criterion.name}', "
            f"got '{score_obj.criterion}'"
        )

    return score_obj

The range check is crucial. Without it, a judge that misreads the scale and returns a score of 7 on a 1-5 criterion will silently corrupt your composite metric. The criterion name check catches a subtler failure mode: a judge that copy-pastes output from a previous call or hallucinates a different criterion name entirely.
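To see both checks fire, here's a self-contained sketch using a simplified stand-in for the parsing layer (stdlib only; `check_judge_output` is a hypothetical helper mirroring the pydantic-backed version above):

```python
import json

def check_judge_output(raw: str, expected_name: str, lo: int, hi: int) -> dict:
    """Simplified stand-in for parse_judge_response: JSON, range, and name checks."""
    data = json.loads(raw)
    if not (lo <= data["score"] <= hi):
        raise ValueError(f"score {data['score']} out of range [{lo}, {hi}]")
    if data["criterion"] != expected_name:
        raise ValueError(f"expected '{expected_name}', got '{data['criterion']}'")
    return data

# Judge misread the scale: 7 on a 1-5 criterion.
bad_scale = '{"criterion": "factual_accuracy", "score": 7, "reasoning": "Solid."}'
try:
    check_judge_output(bad_scale, "factual_accuracy", 1, 5)
except ValueError as e:
    print(f"rejected: {e}")

# Judge echoed a stale criterion name from a previous call.
misrouted = '{"criterion": "tone_appropriateness", "score": 4, "reasoning": "Fine."}'
try:
    check_judge_output(misrouted, "factual_accuracy", 1, 5)
except ValueError as e:
    print(f"rejected: {e}")
```

Both malformed outputs are rejected loudly instead of flowing into the composite.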

💡 Mental Model: Think of parse_judge_response as a contract assertion: it's not just parsing data, it's enforcing that the judge fulfilled the exact task you assigned.


Aggregating Per-Criterion Scores into a Composite Metric

Once you have a validated CriterionScore for each criterion, the final step is aggregation. The simplest approach is a weighted average, where each criterion's score is multiplied by its weight, summed, and divided by the total weight. This produces a composite score on the same scale as the individual criteria.

Composite Score Formula:

  composite = Σ (score_i × weight_i) / Σ weight_i

For our three-criterion rubric:
  weights = [2.0, 1.0, 1.5]   →  total = 4.5
  scores  = [4,   3,   5  ]

  composite = (4×2.0 + 3×1.0 + 5×1.5) / 4.5
            = (8.0 + 3.0 + 7.5) / 4.5
            = 18.5 / 4.5
            ≈ 4.11

The aggregation function should also return per-criterion breakdowns alongside the composite, because the composite alone hides information. A composite of 4.11 means very different things depending on whether every criterion scored in the same middling band or a perfect factual-accuracy score is masking a poor resolution score.

from dataclasses import dataclass
from typing import Dict


@dataclass
class EvaluationResult:
    """Full output of a rubric evaluation run."""
    rubric_name: str
    composite_score: float
    per_criterion: Dict[str, CriterionScore]  # criterion name β†’ validated score
    weights_used: Dict[str, float]            # explicit record of weighting decisions


def aggregate_scores(
    rubric: Rubric,
    criterion_scores: Dict[str, CriterionScore],
) -> EvaluationResult:
    """
    Combine per-criterion scores into a weighted composite.
    Requires a score for every criterion defined in the rubric.
    """
    missing = [c.name for c in rubric.criteria if c.name not in criterion_scores]
    if missing:
        raise ValueError(f"Missing scores for criteria: {missing}")

    total_weight = rubric.total_weight()
    weighted_sum = sum(
        criterion_scores[c.name].score * c.weight
        for c in rubric.criteria
    )

    return EvaluationResult(
        rubric_name=rubric.name,
        composite_score=round(weighted_sum / total_weight, 4),
        per_criterion=criterion_scores,
        weights_used={c.name: c.weight for c in rubric.criteria},
    )

The weights_used field in EvaluationResult might seem redundant; you can always look at the rubric definition. But storing the weights that were actually applied to a specific evaluation run is a reproducibility artifact. Six months from now, when you update criterion weights, you'll be able to look at historical results and know exactly what weighting schema produced each composite score.
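A minimal sketch of that persistence pattern, using only the standard library (the EvalRecord shape mirrors EvaluationResult's fields; the record values are illustrative):

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class EvalRecord:
    """Flat, serializable snapshot of one evaluation run."""
    rubric_name: str
    composite_score: float
    scores: dict[str, int]          # criterion name -> score
    weights_used: dict[str, float]  # weights applied in THIS run

def to_jsonl_line(record: EvalRecord) -> str:
    """One JSON object per line: append-friendly and replayable later."""
    return json.dumps(asdict(record), sort_keys=True)

record = EvalRecord(
    rubric_name="customer_support_quality",
    composite_score=4.11,
    scores={"factual_accuracy": 4, "tone_appropriateness": 3,
            "resolution_completeness": 5},
    weights_used={"factual_accuracy": 2.0, "tone_appropriateness": 1.0,
                  "resolution_completeness": 1.5},
)
line = to_jsonl_line(record)
print(line)
```

In production you'd append each line to a log file or database; when weights change later, every historical record still carries the weights that produced its composite.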

⚠️ Common Mistake: Normalizing scores to 0-1 before multiplying by weights can seem elegant, but it introduces a subtle problem: if different criteria have different scale ranges (e.g., one is 1-5 and another is 1-10), your normalization math must account for that, or you're silently biasing the composite. Keeping scores on their native scales and normalizing only at the end is safer.
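To make the bias concrete, here's a self-contained sketch (invented scores) contrasting raw averaging across mixed scales with min-max normalization that does account for each criterion's (min, max) range:

```python
def raw_composite(scores: list[float], weights: list[float]) -> float:
    """Weighted average on native scales. Biased when scales differ."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

def normalized_composite(scores, ranges, weights) -> float:
    """Map each score to 0-1 on its own (min, max) scale, then weight."""
    norm = [(s - lo) / (hi - lo) for s, (lo, hi) in zip(scores, ranges)]
    return sum(n * w for n, w in zip(norm, weights)) / sum(weights)

ranges = [(1, 5), (1, 10)]  # criterion A is 1-5, criterion B is 1-10

# Perfect on A, middling on B -- versus middling on A, perfect on B:
a_raw = raw_composite([5, 5], [1.0, 1.0])    # 5.0
b_raw = raw_composite([3, 10], [1.0, 1.0])   # 6.5: the 1-10 scale dominates

a_norm = normalized_composite([5, 5], ranges, [1.0, 1.0])   # ~0.72
b_norm = normalized_composite([3, 10], ranges, [1.0, 1.0])  # 0.75

print(a_raw, b_raw, a_norm, b_norm)
```

The raw comparison makes the second response look 1.5 points better, purely because the 1-10 criterion can contribute twice as many raw points; range-aware normalization shrinks that gap to what the scores actually mean.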


End-to-End Pipeline

Now let's put all the pieces together. The following example simulates a full evaluation run using a mock judge function; in production, you'd replace mock_judge with your actual LLM API call.

from typing import Callable


def evaluate_response(
    rubric: Rubric,
    user_question: str,
    response_text: str,
    judge_fn: Callable[[str], str],  # fn(prompt) -> raw judge response string
) -> EvaluationResult:
    """
    Run a full rubric evaluation on a single LLM response.

    Args:
        rubric:        The rubric defining evaluation criteria.
        user_question: The original user input the response is answering.
        response_text: The LLM response being evaluated.
        judge_fn:      A callable that sends a prompt to the judge model
                       and returns the raw string response.

    Returns:
        An EvaluationResult with per-criterion scores and a composite metric.
    """
    criterion_scores: Dict[str, CriterionScore] = {}

    for criterion in rubric.criteria:
        prompt = build_judge_prompt(criterion, user_question, response_text)
        raw_response = judge_fn(prompt)

        try:
            score = parse_judge_response(raw_response, criterion)
        except ValueError as e:
            # In production: log the error, retry with temperature=0, or flag for human review
            raise RuntimeError(
                f"Failed to parse judge response for criterion '{criterion.name}': {e}"
            ) from e

        criterion_scores[criterion.name] = score

    return aggregate_scores(rubric, criterion_scores)


## --- Demo Run ---

SAMPLE_QUESTION = (
    "My order #4821 hasn't arrived after 10 days. "
    "Your website says delivery takes 5-7 business days. What should I do?"
)

SAMPLE_RESPONSE = (
    "I'm sorry to hear your order is delayed β€” that's frustrating, and I completely understand. "
    "Orders typically arrive within 5-7 business days, so yours is past the expected window. "
    "Here's what I recommend: First, check your tracking link (sent in your confirmation email). "
    "If tracking shows no movement for 3+ days, please reply here with your order number and "
    "we'll open a trace with the carrier and send a replacement if needed. "
    "We'll get this sorted for you as quickly as possible."
)

## Simulated judge responses for the demo
_MOCK_RESPONSES = {
    "factual_accuracy": '{"criterion": "factual_accuracy", "score": 5, "reasoning": "The 5-7 business day claim matches stated policy. No factual errors detected."}',
    "tone_appropriateness": '{"criterion": "tone_appropriateness", "score": 5, "reasoning": "Response is empathetic, professional, and appropriately warm."}',
    "resolution_completeness": '{"criterion": "resolution_completeness", "score": 4, "reasoning": "Provides clear next steps but does not proactively initiate the carrier trace."}',
}

def mock_judge(prompt: str) -> str:
    """Returns pre-baked responses keyed on criterion name in the prompt."""
    for name, response in _MOCK_RESPONSES.items():
        if name in prompt:
            return response
    raise ValueError("Unknown criterion in prompt")


result = evaluate_response(
    rubric=support_rubric,
    user_question=SAMPLE_QUESTION,
    response_text=SAMPLE_RESPONSE,
    judge_fn=mock_judge,
)

print(f"Rubric: {result.rubric_name}")
print(f"Composite Score: {result.composite_score:.2f} / 5.00")
print()
for name, score in result.per_criterion.items():
    weight = result.weights_used[name]
    print(f"  [{name}] Score: {score.score}/5  Weight: {weight}")
    print(f"    Reasoning: {score.reasoning}")

Running this produces:

Rubric: customer_support_quality
Composite Score: 4.67 / 5.00

  [factual_accuracy] Score: 5/5  Weight: 2.0
    Reasoning: The 5-7 business day claim matches stated policy. No factual errors detected.
  [tone_appropriateness] Score: 5/5  Weight: 1.0
    Reasoning: Response is empathetic, professional, and appropriately warm.
  [resolution_completeness] Score: 4/5  Weight: 1.5
    Reasoning: Provides clear next steps but does not proactively initiate the carrier trace.

The composite of 4.67 reflects that the response is very strong but not perfect: (5×2.0 + 5×1.0 + 4×1.5) / 4.5 = 21 / 4.5 ≈ 4.67. The slight incompleteness on resolution, weighted at 1.5, pulls the composite below a perfect 5.0. Crucially, you can see why the score landed where it did, which is precisely the audit trail reproducible evaluation requires.


The Full Pipeline at a Glance

Rubric Definition (Pydantic)
         │
         ▼
  For each Criterion
         │
         ├──► build_judge_prompt(criterion, question, response)
         │              │
         │              ▼
         │       judge_fn(prompt)  ◄── LLM API call
         │              │
         │              ▼
         │    parse_judge_response(raw, criterion)
         │              │
         │        [validate JSON]
         │        [check score range]
         │        [verify criterion name]
         │              │
         │              ▼
         │       CriterionScore ✓
         │
         ▼
  aggregate_scores(rubric, all_scores)
         │
         ▼
   EvaluationResult
   ├─ composite_score
   ├─ per_criterion breakdown
   └─ weights_used (audit record)

📋 Quick Reference Card: Pipeline Components

🔧 Component             📚 Responsibility       🎯 Key Guarantee
🏗️ Rubric / Criterion    Canonical definition    Schema validation at load time
📝 build_judge_prompt    Prompt assembly         Anchors always injected from data
🔍 parse_judge_response  Output extraction       Score range + reasoning enforced
➕ aggregate_scores      Weighted rollup         Missing criteria raise errors
📊 EvaluationResult      Final output            Weights recorded as audit trail

🤔 Did you know? Storing weights_used alongside each evaluation result is the same pattern production ML systems use to track hyperparameters alongside model metrics: it's the difference between a score and a reproducible score.


Where This Pipeline Goes Next

This implementation is intentionally minimal; it's a foundation, not a ceiling. In practice, you'll extend it in several directions: adding retry logic with exponential backoff for malformed judge responses, running criteria in parallel to reduce latency, supporting multiple judge models and comparing their scores for calibration, and persisting EvaluationResult objects to a database for trend analysis over time.
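The retry extension, for instance, might look like this sketch (judge_with_retry and the flaky mock are hypothetical names; in a real pipeline, parse_fn would be parse_judge_response bound to a criterion):

```python
import json
import time
from typing import Callable

def judge_with_retry(
    prompt: str,
    judge_fn: Callable[[str], str],   # sends the prompt, returns raw text
    parse_fn: Callable[[str], dict],  # raises ValueError when malformed
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> dict:
    """Re-ask the judge when its output fails parsing, backing off 1s, 2s, 4s..."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return parse_fn(judge_fn(prompt))
        except ValueError as e:
            last_error = e
            if attempt < max_attempts - 1:
                time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"Judge failed after {max_attempts} attempts: {last_error}")

# Demo: a judge that returns garbage once, then valid JSON.
calls = {"n": 0}
def flaky_judge(prompt: str) -> str:
    calls["n"] += 1
    return "not json" if calls["n"] == 1 else '{"score": 4}'

result = judge_with_retry("score this", flaky_judge, json.loads, base_delay=0.0)
print(result, "after", calls["n"], "calls")
```

Because json.JSONDecodeError subclasses ValueError, json.loads works as a drop-in parse_fn for the demo.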

The key insight is that every extension point is already clean because the rubric is a data object. Want to add a new criterion? Edit the Rubric definition and every downstream component (prompts, parsers, aggregators) adapts automatically. Want to change weights for an A/B test? Update a single field and the audit trail captures the change.

💡 Remember: The code is the documentation. When weighting decisions, anchor examples, and score ranges live in typed Python objects rather than in someone's memory or a slide deck, your evaluation system becomes something you can reason about, version-control, and hand off to a colleague, which is exactly what reproducible evaluation demands.

Common Rubric Design Mistakes and How to Avoid Them

Even practitioners who understand the theory of rubric design often produce rubrics that fail in practice. The gap between a rubric that looks coherent and one that performs coherently is surprisingly large. A rubric can read fluently, cover all the right topics, and still produce noisy, contradictory, or misleading scores when a real LLM judge applies it at scale. This section catalogs the five most common failure modes, diagnoses why each one degrades evaluation quality, and shows concrete before-and-after fixes you can apply immediately.

Think of this as a debugging guide. Just as code has recognizable bug patterns (off-by-one errors, null pointer exceptions, race conditions), rubrics have their own recurring pathologies. Once you can name them, you can spot them in minutes rather than hours of confused score analysis.


Mistake 1: Criterion Conflation

Criterion conflation occurs when a single rubric criterion bundles two or more logically distinct failure modes together. The result is a criterion that can fail in multiple independent ways, making it impossible to determine which underlying property caused a low score.

Consider this deceptively reasonable-looking criterion:

"The response is accurate and well-organized."

At a glance this seems fine. But accuracy and organization are orthogonal properties. A response can be perfectly accurate but chaotic in structure (a data dump of correct facts). It can be beautifully organized but factually wrong (a crisp five-paragraph essay full of errors). When a judge scores this criterion a 2 out of 5, you have no idea whether the response was inaccurate, disorganized, or both. Your evaluation data is opaque.

Conflation diagnostic:

  Criterion: "Accurate AND well-organized"
         |
         β–Ό
   Score: 2/5  ← What caused this?
         |
    β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”
    β”‚         β”‚
    β–Ό         β–Ό
  Inaccurate  Disorganized
  (unknown)   (unknown)

  You cannot distinguish which failure occurred.

The fix is atomic decomposition: split any criterion that contains a conjunction ("and", "as well as", "while also") or an implicit multi-part requirement into separate, independently scoreable criteria.

❌ Wrong thinking: "Accuracy and clarity are both about quality, so one criterion covers both."
✅ Correct thinking: "Accuracy and clarity can fail independently. Each deserves its own criterion so failures are diagnosable."

Here is the same rubric rewritten with decomposed criteria:

## Before: conflated criterion
conflated_rubric = {
    "criteria": [
        {
            "name": "accuracy_and_organization",
            "description": "The response is accurate and well-organized.",
            "scale": "1-5"
        }
    ]
}

## After: atomically decomposed criteria
decomposed_rubric = {
    "criteria": [
        {
            "name": "factual_accuracy",
            "description": (
                "All factual claims in the response are correct. "
                "No false statements, hallucinated figures, or misattributed sources."
            ),
            "scale": "1-5"
        },
        {
            "name": "structural_organization",
            "description": (
                "The response presents information in a logical sequence. "
                "Related ideas are grouped together. "
                "The reader can follow the argument without re-reading."
            ),
            "scale": "1-5"
        }
    ]
}

With the decomposed version, a response that is accurate but disorganized will score high on factual_accuracy and low on structural_organization, giving you actionable signal. You can now fix the right thing in your system.

💡 Pro Tip: A quick heuristic: if you can construct a hypothetical response that scores differently on two parts of the same criterion, those parts should be separate criteria.


Mistake 2: Underspecified Score Anchors

Score anchors are the definitions attached to each point on a rating scale. They tell the judge what observable properties distinguish a score of 3 from a score of 4. When anchors are underspecified, or missing entirely, judges fill the gap with their own implicit standards, which vary across prompts, sessions, and model versions.

This is the most common rubric mistake, and it is also the subtlest. A rubric can list perfectly decomposed criteria and still produce noisy scores if the scale labels are vague.

⚠️ Common Mistake: Using ordinal labels like "Poor / Fair / Good / Excellent" without defining what observable evidence maps to each label. These labels feel descriptive but carry almost no information for a judge trying to decide between a 3 and a 4.

Here is a typical underspecified anchor set for a "completeness" criterion:

Score  Label
  1    Very incomplete
  2    Mostly incomplete
  3    Somewhat complete
  4    Mostly complete
  5    Fully complete

The labels are pure synonyms of the numbers. A judge reading this has no idea what "somewhat complete" means in observable terms. Does it mean 50% of required elements are present? 70%? Does it matter which elements are missing?

The fix is to write behavioral anchors: definitions that describe what you would observe in a response at each score level.

## Underspecified anchors: useless for consistent scoring
bad_anchors = {
    1: "Very incomplete",
    2: "Mostly incomplete",
    3: "Somewhat complete",
    4: "Mostly complete",
    5: "Fully complete"
}

## Behavioral anchors: observable, reproducible
good_anchors = {
    1: (
        "Fewer than half of the required elements are present. "
        "Critical components (e.g., the answer to the user's core question) "
        "are entirely missing."
    ),
    2: (
        "At least half of required elements are present, but one or more "
        "critical components are missing or addressed only superficially "
        "(one sentence where a full explanation is needed)."
    ),
    3: (
        "All critical components are present. At least one secondary component "
        "is missing or underdeveloped. The user could act on the response but "
        "would likely need to ask a follow-up question."
    ),
    4: (
        "All critical and secondary components are present. Minor details "
        "or edge cases are omitted, but nothing a typical user would notice "
        "as a gap."
    ),
    5: (
        "All required elements are present, including edge cases and caveats "
        "relevant to the task. No reasonable follow-up question about "
        "completeness could be asked."
    )
}

Notice that each anchor in good_anchors describes what you would find in the response, not just how complete it seems. This is the key distinction. An LLM judge (or human rater) can apply these anchors mechanically to a response and arrive at the same score across sessions.
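As a sanity check that these anchors really are mechanical, here's a hypothetical deterministic scorer: given counts of observed components, it maps observations straight to an anchor level (a simplification of the anchor text above, where "required elements" for the low anchors is read as critical components only):

```python
def completeness_score(
    critical_present: int,
    critical_total: int,
    secondary_missing: int,
    edge_cases_covered: bool,
) -> int:
    """Map observed response properties onto behavioral completeness anchors."""
    if critical_present < critical_total:
        # Anchor 1 vs 2: is at least half of the critical set present?
        return 2 if critical_present * 2 >= critical_total else 1
    if secondary_missing > 0:
        return 3  # all criticals present, secondary gaps remain
    return 5 if edge_cases_covered else 4

print(completeness_score(1, 3, 0, False))  # 1
print(completeness_score(2, 3, 0, False))  # 2
print(completeness_score(3, 3, 1, False))  # 3
print(completeness_score(3, 3, 0, True))   # 5
```

No word like "somewhat" appears anywhere in the decision path: two raters (or two judge runs) that observe the same components must emit the same score.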

🎯 Key Principle: Score anchors should describe evidence, not impressions. Replace adjectives like "mostly" and "somewhat" with specific thresholds, observable elements, and named components.

💡 Real-World Example: In medical rubric design, anchor vagueness has real consequences. A rubric item like "patient history is adequate" scores consistently only when "adequate" is defined as specific observable items: chief complaint, duration, relevant medications, and allergies. The same principle applies to LLM rubrics.


Mistake 3: Input-Agnostic Criteria

Input-agnostic criteria are rubric dimensions written as if quality expectations are constant across all possible inputs. They fail to account for how the definition of "good" shifts when the task type, domain, or user intent changes.

Imagine a rubric criterion for "response length appropriateness":

"The response is an appropriate lengthβ€”neither too short nor too long."

This criterion is technically evaluating something real (length appropriateness), but what counts as appropriate depends entirely on the input. A one-sentence answer is appropriate for "What is the capital of France?" and catastrophically insufficient for "Explain the trade-offs between monolithic and microservices architectures."

Input-agnostic criterion failure:

  Criterion: "Appropriate length"
       β”‚
  β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚                     β”‚
  β–Ό                     β–Ό
Simple factual Q     Complex technical Q
"Capital of France?" "Explain microservices"
       β”‚                     β”‚
  1 sentence = βœ…        1 sentence = ❌
  1 page     = ❌        1 page     = βœ…

  Same criterion, opposite expectations.
  A fixed rubric cannot handle both.

The fix involves two strategies. The first is to parameterize criteria by task type: build separate rubric templates for distinct task categories (factual QA, long-form explanation, code generation, creative writing) and select the appropriate template at runtime.

The second is to reference the input explicitly in the criterion description, rather than defining quality in the abstract:

def build_length_criterion(task_type: str) -> dict:
    """
    Returns a length criterion whose anchors are calibrated
    to the expected task complexity.
    """
    anchors = {
        "factual_qa": {
            "description": (
                "The response length matches the complexity of the question. "
                "Simple factual questions (single entity, single fact) should "
                "be answered in 1-3 sentences. Adding unrequested elaboration "
                "counts against this criterion."
            ),
            "scale": {
                1: "Substantially too long (more than 2 paragraphs for a simple factual question).",
                3: "Appropriate core answer present; minor padding or minor under-explanation.",
                5: "Length is precisely calibrated to question complexityβ€”no padding, no truncation."
            }
        },
        "technical_explanation": {
            "description": (
                "The response covers the topic with sufficient depth given "
                "the complexity indicated in the user's question. Shallow "
                "coverage of a complex topic counts against this criterion."
            ),
            "scale": {
                1: "Fewer than 2 paragraphs for a topic requiring substantial explanation.",
                3: "Core concepts covered but key nuances or trade-offs missing.",
                5: "Depth matches complexity: all major sub-topics addressed, "
                   "edge cases noted where relevant."
            }
        }
    }
    return anchors.get(task_type, anchors["factual_qa"])

## Usage
criterion = build_length_criterion("technical_explanation")
print(criterion["description"])

This approach ensures the rubric is task-aware rather than task-agnostic. The same underlying quality dimension (length appropriateness) is operationalized differently depending on what kind of response quality actually looks like for that task.

⚠️ Common Mistake: Writing a single universal rubric and applying it to all task types. This is the evaluation equivalent of using a single unit test to validate all code paths.


Mistake 4: Overloading the Judge

Judge overloading occurs when a single evaluation prompt includes so many criteria that the LLM judge cannot maintain coherent focus across all of them simultaneously. The result is scoring that is noisier, less internally consistent, and more prone to contradictions between criteria.

This is a cognitive load problem applied to language models. Research on LLM attention and instruction-following shows that model performance on any individual instruction degrades as the total number of instructions in the prompt grows. A rubric with twelve criteria in one prompt is not twelve separate evaluations; it is one very confused evaluation.

Judge overloading effect:

  12-criterion prompt
  │
  ├── Criterion 1  → Judge attention: ████████░░  (high early)
  ├── Criterion 2  → Judge attention: ███████░░░
  ├── Criterion 3  → Judge attention: ██████░░░░
  │   ...
  ├── Criterion 9  → Judge attention: ████░░░░░░  (degraded)
  ├── Criterion 10 → Judge attention: ███░░░░░░░
  ├── Criterion 11 → Judge attention: ██░░░░░░░░
  └── Criterion 12 → Judge attention: █░░░░░░░░░  (minimal)

  Late criteria are systematically under-weighted.
  Contradictions between early and late criteria go unnoticed.

The fix is criterion batching: group criteria into thematically related sets and run a separate judge call for each batch. A practical upper limit is four to six criteria per judge call.

from typing import Any

def evaluate_in_batches(
    response: str,
    all_criteria: list[dict],
    judge_fn,  # callable that takes (response, criteria) -> scores
    batch_size: int = 4
) -> dict[str, Any]:
    """
    Splits a large rubric into batches and runs one judge call
    per batch, then merges results. Reduces judge overloading.
    """
    all_scores = {}

    # Split criteria list into batches of at most batch_size
    for i in range(0, len(all_criteria), batch_size):
        batch = all_criteria[i : i + batch_size]
        batch_label = f"batch_{i // batch_size + 1}"

        print(f"Evaluating {batch_label}: {[c['name'] for c in batch]}")

        # Each batch gets its own focused judge call
        batch_scores = judge_fn(response=response, criteria=batch)

        # Merge scores into the master dictionary
        all_scores.update(batch_scores)

    return all_scores


## Example rubric with many criteria: would overload a single call
large_rubric = [
    {"name": "factual_accuracy"},
    {"name": "source_citation"},
    {"name": "claim_hedging"},
    {"name": "logical_coherence"},
    {"name": "structural_organization"},
    {"name": "transition_quality"},
    {"name": "tone_appropriateness"},
    {"name": "vocabulary_level"},
    {"name": "completeness"},
    {"name": "conciseness"},
]

## With batch_size=4, this creates 3 separate focused judge calls
## instead of one overloaded call with 10 criteria

Batching does increase latency and cost proportionally to the number of batches, but it reliably produces higher-quality scores per criterion. For production systems where evaluation quality is the bottleneck, this trade-off is almost always worth it.

💡 Mental Model: Think of an LLM judge like a human reviewer doing a structured walkthrough. If you hand them a ten-page checklist to complete in one sitting while reading a single document, their attention will flag. Breaking the checklist into focused review passes produces better results, even though it takes longer.

🧠 Mnemonic: FOCUS stands for Fewer criteria, One theme per call, Clear scope, Unambiguous task, Separate batches for separate dimensions.


Mistake 5: Neglecting Adversarial Test Cases During Rubric Design

Adversarial test cases are edge-case inputs deliberately chosen to expose rubric failures. Most practitioners design rubrics by testing them on typical examples: responses that are clearly good, clearly bad, or middling. This selection bias means rubrics that perform fine on average inputs will produce absurd or contradictory scores on inputs that sit near boundary conditions.

The failure mode is subtle: a rubric can be internally consistent and produce reasonable scores across hundreds of normal examples, then completely break down on inputs that weren't considered during design. By the time you discover the breakage in production, you may have already used corrupted evaluation data to make system decisions.

⚠️ Common Mistake: Validating a rubric only on the examples you used to write it. This is circular: of course it scores those examples reasonably.

Here are four categories of adversarial test cases every rubric should survive:

📋 Quick Reference Card: Adversarial Test Case Types

🎯 Type                      📚 Description                                         🔧 What it exposes
🔒 Correct but suspicious    True statement that sounds wrong                       Bias toward confident-sounding responses
📊 Wrong but fluent          Confident, well-organized false claim                  Over-weighting style over substance
🧠 Minimal valid             One-sentence answer that fully satisfies the question  Bias toward longer responses
🔄 Off-topic but polished    Eloquent response that doesn't address the question    Style masking relevance failure

The process for adversarial rubric validation should happen before you deploy the rubric at scale:

## A lightweight adversarial test harness for rubric validation

adversarial_test_cases = [
    {
        "label": "correct_but_suspicious",
        "input": "What is the boiling point of water at 8,848 meters altitude?",
        "response": "Approximately 70Β°C (158Β°F), due to reduced atmospheric pressure.",
        # This is factually correct. A biased rubric might penalize it
        # for sounding 'wrong' relative to the familiar 100Β°C answer.
        "expected_accuracy_score": 5,
        "expected_range": (4, 5)  # Acceptable range for automated checking
    },
    {
        "label": "wrong_but_fluent",
        "input": "Who invented the telephone?",
        "response": (
            "The telephone was famously invented by Thomas Edison in 1876, "
            "a landmark moment in American innovation that transformed "
            "global communication forever."
        ),
        # Factually wrong (Alexander Graham Bell), but well-written.
        # A rubric conflating style and accuracy would over-score this.
        "expected_accuracy_score": 1,
        "expected_range": (1, 2)
    },
    {
        "label": "minimal_but_valid",
        "input": "What does HTTP stand for?",
        "response": "HyperText Transfer Protocol.",
        # Completely correct and appropriately concise.
        # A length-biased rubric would penalize brevity.
        "expected_completeness_score": 5,
        "expected_range": (4, 5)
    },
    {
        "label": "off_topic_but_polished",
        "input": "What are the main causes of World War I?",
        "response": (
            "World War II arose from a complex interplay of economic depression, "
            "the rise of fascism, and the failures of the Treaty of Versailles. "
            "Historians continue to debate the relative weight of these factors."
        ),
        # Answers the wrong war. Polished but irrelevant.
        # A rubric without a relevance criterion will miss this entirely.
        "expected_relevance_score": 1,
        "expected_range": (1, 2)
    }
]

def run_adversarial_validation(rubric, judge_fn, test_cases):
    """
    Runs a rubric against adversarial test cases and flags
    cases where the judge score falls outside the expected range.
    """
    failures = []
    for case in test_cases:
        scores = judge_fn(response=case["response"], criteria=rubric["criteria"])
        low, high = case["expected_range"]
        for key in case:
            # Keys like "expected_accuracy_score" name the criterion under test.
            if not (key.startswith("expected_") and key.endswith("_score")):
                continue
            crit = key[len("expected_"):-len("_score")]
            if crit in scores:
                actual = scores[crit]
                if not (low <= actual <= high):
                    failures.append({
                        "test_case": case["label"],
                        "criterion": crit,
                        "expected_range": (low, high),
                        "actual_score": actual
                    })
    return failures

Running this validation before deployment catches rubric failures while you can still fix them. A rubric that fails on the wrong_but_fluent case is telling you that your accuracy criterion isn't strong enough to override stylistic quality. A rubric that fails on minimal_but_valid is telling you your completeness anchors are implicitly penalizing brevity.

πŸ€” Did you know? The practice of adversarial rubric testing mirrors red-teaming in securityβ€”the goal is to find the cases where your system's assumptions break, before an attacker (or an edge-case input) finds them for you.

πŸ’‘ Pro Tip: Maintain a living library of adversarial test cases organized by criterion type. Every time a rubric produces a surprising score in production, add the input to the library. Over time, this library becomes your most valuable rubric debugging tool.


Putting It All Together: A Rubric Diagnostic Checklist

Before deploying any rubric, run it through this five-point diagnostic. Each question maps directly to one of the mistakes covered in this section.

Rubric Diagnostic Flow:

  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ 1. CONFLATION CHECK                                 β”‚
  β”‚    Does any criterion contain "and", "as well as",  β”‚
  β”‚    or imply two distinct failure modes?             β”‚
  β”‚    YES β†’ Split into separate criteria               β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ 2. ANCHOR CHECK                                     β”‚
  β”‚    Can you distinguish a 3 from a 4 using only      β”‚
  β”‚    observable evidence in the response?             β”‚
  β”‚    NO β†’ Rewrite anchors as behavioral descriptions  β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ 3. TASK-AWARENESS CHECK                             β”‚
  β”‚    Would the same response score differently on     β”‚
  β”‚    this criterion if the task type changed?         β”‚
  β”‚    YES β†’ Parameterize by task type                  β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ 4. LOAD CHECK                                       β”‚
  β”‚    Does the rubric contain more than 6 criteria     β”‚
  β”‚    in a single prompt?                              β”‚
  β”‚    YES β†’ Split into thematic batches                β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ 5. ADVERSARIAL CHECK                                β”‚
  β”‚    Has the rubric been tested on fluent-but-wrong,  β”‚
  β”‚    correct-but-suspicious, minimal-valid, and       β”‚
  β”‚    off-topic-but-polished examples?                 β”‚
  β”‚    NO β†’ Run adversarial validation before deploy    β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
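Two of these checks are mechanical enough to automate as a pre-deployment lint. A minimal sketch, assuming each criterion is a dict with `name` and `description` fields; the conjunction scan produces false positives, so treat its output as a prompt for manual review, not a verdict:

```python
import re

# Conjunctions that often signal a conflated criterion (check 1).
CONJUNCTIONS = re.compile(r"\b(and|as well as|along with)\b", re.IGNORECASE)

def conflation_warnings(criteria):
    """Flag criteria whose description may bundle two failure modes."""
    return [c["name"] for c in criteria if CONJUNCTIONS.search(c["description"])]

def exceeds_load(criteria, max_per_prompt=6):
    """Check 4: too many criteria for a single judge prompt."""
    return len(criteria) > max_per_prompt
```

Checks 2, 3, and 5 require human judgment or an adversarial test run; they cannot be reduced to string matching.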

A rubric that passes all five checks is not guaranteed to be perfect, but it has survived the most common failure modes. The goal is not perfectionβ€”it is diagnosable imperfection. When your rubric produces a surprising score, you should be able to trace the failure to a specific, fixable cause rather than attributing it to mysterious model variance.

🎯 Key Principle: Rubric quality is measured not by how well it scores typical cases, but by how clearly it fails on edge casesβ€”and whether those failures are diagnosable and correctable.

With these five mistakes identified and their fixes in hand, you are equipped to audit existing rubrics systematically and design new ones that hold up under real-world conditions. The next and final section consolidates these principles and prepares you for the more specialized rubric techniques covered in the child lessons.

Key Takeaways and Preparing for Deeper Patterns

You have now traveled the full arc of rubric design as an engineering discipline. You started by seeing why vague evaluation criteria are the root cause of inconsistent LLM judgments, moved through the anatomy of well-formed rubrics, learned how to decompose quality dimensions into evaluable criteria, implemented a working pipeline in code, and catalogued the mistakes that silently corrupt evaluation systems. This final section crystallizes what you now understand, makes the connections between ideas explicit, and points you toward the specialized techniques that build on this foundation.


The Central Idea: A Rubric Is a Contract

🎯 Key Principle: A rubric is a contract between the evaluation designer and the judge. Every term in that contract that is left undefined is a clause the judge will interpret differently on different days, with different inputs, or after different model updates.

This framing matters because it shifts how you think about your own rubrics. You are not writing instructions β€” you are writing a specification. Just as a software contract specifies behavior under all observable conditions, a rubric must specify scoring behavior under all the response variations your system is likely to encounter. When the contract is ambiguous, the verdict varies. When the verdict varies, you cannot trust the measurement. When you cannot trust the measurement, you cannot improve the system.

The practical consequence is that rubric ambiguity is a bug, not a stylistic choice. Every vague adjective ("clear," "relevant," "appropriate") is a potential source of variance. Every undefined boundary between score levels is a place where two judges β€” human or model β€” will disagree. Debugging your rubric is as important as debugging your application code.

πŸ’‘ Mental Model: Think of your rubric the way a compiler thinks about type definitions. A loosely typed system will accept almost anything and fail silently at runtime. A strictly typed system catches mismatches early and produces predictable behavior. Strict rubrics are the strongly typed evaluation systems of LLM evaluation.


What You Now Know That You Didn't Before

Let's make the learning gain explicit before moving on. Here is a summary of the conceptual shifts this lesson was designed to produce:

Before this lesson, you might have approached LLM evaluation by asking a judge model something like: "Rate this response from 1 to 10 for quality." That single-number request contains at least five unstated assumptions about what quality means, how to weight competing considerations, and where the boundaries between score levels fall.

After this lesson, you understand that the same evaluation goal should be expressed as a structured rubric with decomposed criteria, each criterion anchored to observable behaviors and mapped to a defined score scale. The judge model's job is not to exercise subjective judgment β€” it is to apply your specification to the evidence in front of it.

πŸ“‹ Quick Reference Card: Before vs. After Rubric Design Training

πŸ” Concept ❌ Before βœ… After
🎯 Evaluation goal "Rate for quality" Decomposed into 4–6 measurable criteria
πŸ“ Score scale Undefined 1–10 Anchored levels with behavioral descriptions
🧩 Criterion structure Single holistic prompt Independently violable atomic criteria
πŸ’¬ Judge instruction "Use your judgment" Explicit chain-of-thought scoring protocol
πŸ”’ Reproducibility Varies by run Stable across runs and model versions
πŸ”§ Implementation Ad hoc prompting Rubric as data, scores as structured output
πŸ“š Failure detection Silent drift Per-criterion score tracking and alerting

Decomposition Is the Primary Tool

Decomposition is the act of taking a single vague quality goal and breaking it into discrete, independently assessable criteria. It is the most important practical skill this lesson teaches, and it is worth restating exactly why it works.

A vague goal like "the response should be helpful" contains at least four separable ideas: that the response addresses the user's actual question, that it provides accurate information, that it is expressed at an appropriate level of detail, and that it does not introduce unnecessary confusion. Each of those ideas can be violated independently. A response can be accurate but at the wrong level of detail. It can be detailed but address the wrong question. It can address the right question but include confusing tangents.

When you collapse all four ideas into one criterion, a judge model that encounters a response which is accurate and detailed but addresses the wrong question has no guidance about how to score it. It will make a judgment call, and that call will not be consistent across runs.

When you decompose into four criteria, each with its own score, you get four signals instead of one. The aggregated score is more stable because noise in one criterion does not corrupt all the others. And the per-criterion breakdown tells you where a response failed, not just that it failed β€” which is the difference between actionable feedback and a useless number.
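Expressed as data, that decomposition might look like the sketch below. The criterion names and descriptions are illustrative, and `where_it_failed` is a hypothetical helper showing how per-criterion scores become actionable:

```python
# The four separable ideas inside "helpful", as independent criteria.
helpfulness_criteria = [
    {"name": "addresses_question",
     "description": "Responds to the question the user actually asked."},
    {"name": "accuracy",
     "description": "Every verifiable claim in the response is correct."},
    {"name": "detail_level",
     "description": "Depth of detail matches what the question calls for."},
    {"name": "clarity",
     "description": "Introduces no confusing tangents or contradictions."},
]

def where_it_failed(scores, threshold=3):
    """Per-criterion breakdown: which criteria fell below threshold."""
    return [name for name, score in scores.items() if score < threshold]
```

A response that is accurate and detailed but answers the wrong question now produces a specific signal (`addresses_question` below threshold) instead of one muddled number.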

🧠 Mnemonic: SIAM β€” the four properties of good criteria:

  • Specific: describes observable, not inferred, behavior
  • Independently violable: can fail without other criteria failing
  • Anchored: each score level has a behavioral description
  • Mapped: connects to a well-defined, bounded score scale

If a criterion fails any one of the SIAM tests, revise it before deploying.


The Implementation Pattern Is Universal

One of the most durable ideas from the code-focused section of this lesson is that the implementation pattern β€” rubric as data, criteria as prompt components, scores as structured outputs β€” is not tied to any specific framework or model provider. It is an architectural pattern that holds regardless of whether you are using OpenAI, Anthropic, a local model, or a future provider that does not yet exist.

Here is that pattern expressed in its most minimal form:

## rubric_as_data.py
## The rubric lives in a data structure, not hardcoded in a prompt string.
## This makes it versionable, testable, and reusable across different judge calls.

from dataclasses import dataclass
from typing import List

@dataclass
class ScoreLevel:
    score: int
    label: str
    description: str  # Observable behavior at this level

@dataclass
class Criterion:
    name: str
    description: str
    score_levels: List[ScoreLevel]
    weight: float = 1.0  # For weighted aggregation

@dataclass
class Rubric:
    name: str
    criteria: List[Criterion]
    version: str  # Track rubric versions like software versions

    def to_prompt_section(self) -> str:
        """Serialize the rubric into a structured prompt section."""
        lines = [f"## Evaluation Rubric: {self.name} (v{self.version})\n"]
        for i, criterion in enumerate(self.criteria, 1):
            lines.append(f"### Criterion {i}: {criterion.name}")
            lines.append(criterion.description)
            lines.append("\nScore Levels:")
            for level in criterion.score_levels:
                lines.append(
                    f"  {level.score} β€” {level.label}: {level.description}"
                )
            lines.append("")
        return "\n".join(lines)

    def weighted_score(self, raw_scores: dict) -> float:
        """Aggregate per-criterion scores into a single weighted total."""
        total_weight = sum(c.weight for c in self.criteria)
        weighted_sum = sum(
            raw_scores.get(c.name, 0) * c.weight
            for c in self.criteria
        )
        return weighted_sum / total_weight if total_weight > 0 else 0.0

This code block shows the data-first approach: the rubric is a Python object with a version string, and it can serialize itself into a prompt section. The weighted_score method demonstrates that aggregation logic belongs in your code, not in the judge's output.

Now here is the companion piece β€” the structured output schema that captures the judge's scoring:

## structured_scoring_output.py
## Forces the judge model to return scores in a machine-parseable format.
## Chain-of-thought reasoning is captured alongside the score for auditability.

from pydantic import BaseModel, Field
from typing import List, Optional

class CriterionScore(BaseModel):
    criterion_name: str
    reasoning: str = Field(
        description="Step-by-step reasoning about this criterion before assigning a score"
    )
    score: int = Field(
        ge=1, le=5,
        description="Score from 1 (lowest) to 5 (highest) for this criterion"
    )
    confidence: Optional[float] = Field(
        default=None, ge=0.0, le=1.0,
        description="Judge's self-reported confidence, 0.0 to 1.0"
    )

class RubricEvaluationResult(BaseModel):
    evaluator_model: str
    rubric_version: str
    criterion_scores: List[CriterionScore]
    overall_notes: Optional[str] = None

    def summary(self, rubric) -> dict:
        """Return a summary dict with per-criterion scores and weighted total."""
        raw_scores = {
            cs.criterion_name: cs.score
            for cs in self.criterion_scores
        }
        return {
            "per_criterion": raw_scores,
            "weighted_total": rubric.weighted_score(raw_scores),
            "rubric_version": self.rubric_version,
            "model": self.evaluator_model,
        }

The confidence field in CriterionScore is worth highlighting: it gives the judge a channel to signal uncertainty. A score of 3 with confidence 0.9 means something very different from a score of 3 with confidence 0.4. Low-confidence scores are candidates for human review, not direct inclusion in aggregate metrics.
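One way to act on that signal is a triage step between the judge and your metrics. A sketch, assuming criterion scores arrive as dicts; the threshold is an assumption to tune against your human review capacity:

```python
CONFIDENCE_FLOOR = 0.6  # illustrative threshold, not a recommended value

def triage_scores(criterion_scores):
    """Split scores into auto-accepted vs. flagged for human review."""
    accepted, needs_review = [], []
    for cs in criterion_scores:
        confidence = cs.get("confidence")
        if confidence is not None and confidence < CONFIDENCE_FLOOR:
            needs_review.append(cs)
        else:
            # Missing confidence is treated as accepted here; you may
            # prefer the opposite policy for high-stakes evaluations.
            accepted.append(cs)
    return accepted, needs_review
```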

πŸ’‘ Pro Tip: Version your rubric data the same way you version your application code. When a rubric changes, scores from the old version are not directly comparable to scores from the new version. Store rubric_version alongside every evaluation result so you can segment your analysis by rubric version and detect when a change in aggregate scores reflects a real quality shift versus a measurement change.
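That storage discipline needs nothing beyond the standard library. The function name and the list-based store below are hypothetical; the point is that every stored result carries the full rubric text and a content hash, not just a version label:

```python
import hashlib

def record_evaluation(result_store, rubric_text, rubric_version, scores):
    """Append scores together with the full rubric text and a content
    hash, so later analysis can segment by exact rubric wording."""
    entry = {
        "rubric_version": rubric_version,
        "rubric_sha256": hashlib.sha256(rubric_text.encode("utf-8")).hexdigest(),
        "rubric_text": rubric_text,  # full text, not just a name or number
        "scores": scores,
    }
    result_store.append(entry)
    return entry
```

The hash makes silent edits detectable: if two results share a version string but different hashes, someone changed the rubric without bumping the version.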


Connecting the Mistakes to the Principles

The mistakes catalogued in this lesson β€” criterion overlap, undefined score boundaries, implicit context assumptions, coverage gaps, and rubric drift β€” each map directly back to one of the core principles. Understanding those connections helps you diagnose future problems faster.

MISTAKE                     ROOT PRINCIPLE VIOLATED
─────────────────────────────────────────────────────────────
Criterion overlap          β†’ Independent violability (SIAM "I")
Undefined score boundaries β†’ Anchoring (SIAM "A")
Implicit context           β†’ Specificity (SIAM "S")
Coverage gaps              β†’ Coverage (rubric anatomy)
Rubric drift               β†’ Mapping + versioning
─────────────────────────────────────────────────────────────

When a score distribution looks wrong β€” too compressed, too bimodal, or shifting over time β€” the first diagnostic question is: which principle does this pattern suggest is being violated? Compressed scores often indicate undefined boundaries. Bimodal distributions often indicate overlapping criteria that force the judge to pick a side rather than score independently. Scores that drift over time without a corresponding rubric version change often indicate unanchored criteria that the judge is interpreting differently as its context changes.
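These diagnostic heuristics are cheap to compute. A sketch, with thresholds that are assumptions to be tuned on your own score data:

```python
from collections import Counter
from statistics import pstdev

def distribution_flags(scores, scale=(1, 5)):
    """Map score-distribution shape to the rubric issue it most often signals."""
    flags = []
    low, high = scale
    if pstdev(scores) < 0.5:  # illustrative threshold
        flags.append("compressed: suspect undefined score boundaries")
    counts = Counter(scores)
    # Bimodal heuristic: the two extreme scores dominate the distribution.
    if counts[low] + counts[high] > 0.7 * len(scores):  # illustrative threshold
        flags.append("bimodal: suspect overlapping criteria")
    return flags
```

Drift over time needs the time dimension, so it is not covered here; the anchor test sets discussed below in the rubric drift material are the right tool for that.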

⚠️ Critical Point to Remember: Rubric drift is the most dangerous failure mode because it is invisible in the scores themselves. A rubric that was reliable in January can produce systematically different scores in June β€” not because your system changed, but because the judge model was updated, or the distribution of inputs shifted, or a small wording change in a prompt template altered how the rubric text was interpreted. Always store the full rubric text alongside each evaluation result, not just the rubric name or version number.


What the Child Lessons Cover: Three Extensions of This Foundation

The techniques covered in the child lessons each take one specific aspect of rubric design further than this foundational lesson could go. Understanding how they extend the foundation will help you decide which to prioritize.

Atomic Criteria

Atomic criteria push decomposition to its logical limit. Where this lesson teaches you to break a quality dimension into four to six criteria, the atomic approach asks: can each criterion be broken into its smallest possible unit β€” one that can only be scored pass or fail, with no intermediate states?

The advantage of atomic criteria is maximum reproducibility: a binary criterion has no ambiguous middle ground. The trade-off is verbosity and the risk that a long list of atomic criteria becomes unwieldy to maintain. The child lesson on this topic covers how to determine when atomic decomposition is worth the overhead and how to manage rubrics that have dozens of atomic criteria without losing coherence.
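The shape of an atomic rubric can be illustrated in a few lines. The statements below are hypothetical examples; each would be put to the judge as an independent yes/no question, and the aggregate is simply the pass rate:

```python
# Hypothetical atomic decomposition of a single "completeness" criterion.
atomic_completeness = [
    "The response answers the question that was actually asked.",
    "The response includes every fact the question requires.",
    "The response contains no unrequested tangents.",
]

def aggregate_atomic(verdicts):
    """Fraction of atomic checks passed: a reproducible score in [0.0, 1.0]."""
    return sum(verdicts) / len(verdicts)
```

Because each verdict is binary, there is no ambiguous middle ground for the judge to interpret differently across runs.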

Chain-of-Thought Scoring

This lesson introduced structured output with a reasoning field. The chain-of-thought scoring child lesson goes much deeper: it covers how to design the judge prompt so that the reasoning genuinely constrains the score (rather than being generated after the score is already decided), how to evaluate the quality of the reasoning itself, and how to use reasoning traces as diagnostic signals for rubric improvement.

πŸ€” Did you know? Research on LLM judge consistency has found that forcing a model to produce explicit reasoning before the score β€” rather than after β€” significantly reduces score variance on borderline cases. The reasoning acts as a commitment device: the model must construct an argument before it commits to a number.

Rubric Drift

The child lesson on rubric drift treats the problem as a monitoring and reliability engineering challenge. It covers how to detect drift statistically (using control charts and distributional shift tests), how to create anchor test sets β€” a fixed set of inputs with known correct scores that you can re-run after any rubric or model change to check for drift β€” and how to implement automated alerts when drift is detected in production evaluation pipelines.

Here is a minimal example of an anchor test approach that the drift lesson expands on significantly:

## anchor_test.py
## An anchor test set contains inputs with known ground-truth scores.
## Re-running the evaluator on this set after any change detects drift.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AnchorCase:
    input_text: str
    response_text: str
    expected_scores: dict  # {criterion_name: expected_score}
    tolerance: int = 1  # Acceptable deviation from expected score

def run_anchor_tests(
    anchor_cases: List[AnchorCase],
    evaluator_fn,  # Callable that returns RubricEvaluationResult
    rubric,
) -> Tuple[bool, List[dict]]:
    """
    Run all anchor cases and report failures.
    Returns (all_passed: bool, failure_report: list).
    """
    failures = []
    for case in anchor_cases:
        result = evaluator_fn(case.input_text, case.response_text)
        raw_scores = {
            cs.criterion_name: cs.score
            for cs in result.criterion_scores
        }
        for criterion, expected in case.expected_scores.items():
            actual = raw_scores.get(criterion)
            if actual is None or abs(actual - expected) > case.tolerance:
                failures.append({
                    "criterion": criterion,
                    "expected": expected,
                    "actual": actual,
                    "input_snippet": case.input_text[:100],
                })
    return len(failures) == 0, failures

The tolerance parameter reflects a practical reality: asking for exact score reproduction across model updates is usually too strict. A deviation of one point on a five-point scale is often within acceptable variance. A deviation of two points signals that the rubric is being applied differently and warrants investigation.


Practical Applications and Next Steps

Here are three concrete ways to apply what you have learned immediately:

πŸ”§ Application 1: Audit an existing evaluation. Take an evaluation you are already running β€” even an informal one β€” and write down the implicit criteria you are using. Now apply the SIAM test to each. How many criteria are specific? How many are independently violable? How many have anchored score levels? The gaps you find are your rubric debt.

πŸ“š Application 2: Decompose one quality dimension from scratch. Pick a quality dimension that matters to your system β€” accuracy, tone, format adherence, safety β€” and spend 30 minutes decomposing it using the process from Section 3. Aim for four to six criteria. Write behavioral descriptions for each score level. Then test the rubric by scoring five responses by hand and checking whether the scores feel right and whether the reasoning was forced to be explicit.

🎯 Application 3: Implement the rubric-as-data pattern. Take one existing evaluation prompt and refactor it so that the rubric lives in a data structure (a Python dataclass, a JSON file, a YAML config) and the prompt is generated from that structure. Add a version field. Store the full rubric text alongside the first set of results you generate. This single structural change makes every subsequent improvement to your evaluation system traceable.


Final Summary

🎯 Key Principle: Rubric design is an engineering discipline. It produces specifications, not preferences. Good rubrics are versionable, decomposed, anchored, and stored alongside the results they generate.

The ideas in this lesson are mutually reinforcing. Decomposition makes criteria specific. Specificity enables anchoring. Anchoring enables consistent scoring. Consistent scoring enables meaningful aggregation. Meaningful aggregation enables system improvement. And rubric versioning ensures that when aggregate scores change, you can determine whether the change reflects your system or your measurement.

⚠️ Final Critical Points:

  1. Rubric drift is silent. It will not announce itself. You must proactively monitor for it with anchor test sets and distributional checks.
  2. Decomposition without anchoring is incomplete. Breaking a quality goal into five criteria gives you five opportunities for ambiguity if none of the score levels are described with behavioral specificity.
  3. The implementation pattern is the enabler. Rubric as data, criteria as prompt components, scores as structured output β€” this pattern is what makes rubric-based evaluation maintainable at scale. Without it, even a well-designed rubric degrades into ad hoc prompting over time.

You now have the foundation. The child lessons will take you into territory where this foundation is stress-tested by scale, by edge cases, and by the practical demands of running evaluation systems in production. Each of those lessons assumes you understand what a well-formed rubric looks like and why decomposition is the right tool. You do now.