G-Eval and Structured Output
Two distinct advances: G-Eval, a scoring architecture built on a specific insight from the original paper, and structured output, an engineering pattern for making judge responses machine-readable and pipeline-composable.
Why Evaluation Architecture Matters: From Ad Hoc Prompts to Systematic Judging
Imagine you've just shipped a new LLM-powered feature. You want to know if it's actually working — so you write a quick prompt: "Rate this response from 1 to 10 and explain why." You paste a few outputs in, eyeball the scores, and feel reasonably confident. Ship it. Come back a week later to run the same check, and you get wildly different numbers for the same inputs. The model says "8" one day and "5" the next. You have no idea what changed. This scenario, repeated across teams and projects, is exactly why evaluation architecture matters.
The difference between a fragile evaluation script and a reliable evaluation pipeline is not just about prompting skill. It's about two orthogonal architectural decisions that most practitioners conflate, skip, or discover only after something breaks in production. This lesson is about those two decisions: G-Eval, a principled approach to how a judge reasons and scores, and structured output, an engineering pattern that governs how that score is returned and consumed. Together, they transform LLM evaluation from an art project into something that behaves like infrastructure.
The Problem Nobody Warns You About
When developers first reach for an LLM as a judge, the instinct makes sense. Language models understand nuance. They can evaluate fluency, factual accuracy, tone, and helpfulness in ways that simple string-matching metrics cannot. The early prototype almost always works: you write a prompt, get back a verdict, read it, and nod. It seems fine.
The problems emerge at scale, and they emerge quietly.
Free-text verdicts are the first failure mode. When a judge model returns a paragraph of reasoning followed by a score buried somewhere in prose, your code needs to extract that score reliably. You write a regex. It works for 94% of cases. The other 6% silently return None, get treated as zero, or crash the pipeline at midnight on a Saturday. You don't notice until someone asks why your evaluation dashboard has been flat for three days.
# The fragile pattern — don't do this
import re

def extract_score(judge_response: str) -> float | None:
    # Hoping the model says "Score: X" somewhere in its response
    match = re.search(r'Score:\s*(\d+(?:\.\d+)?)', judge_response)
    if match:
        return float(match.group(1))
    # Silent failure: returns None, or you default to 0
    return None  # ← this will cause downstream chaos

# Example judge response that breaks this regex:
response = "I would rate this response highly, perhaps a 7 out of 10, because..."
print(extract_score(response))  # Returns None — the score is there, but invisible
This code block illustrates the core fragility. The model's response is reasonable — a human would read it and find the score immediately. But the machine can't reliably parse prose, and when it fails, it fails silently. Multiply this across thousands of evaluation runs, and you have a metrics system that lies to you.
Inconsistent reasoning is the second failure mode, and it's more insidious. Without a defined scoring rubric applied consistently, the same LLM judge will weight criteria differently across runs. One evaluation might penalize verbosity; another will reward it. The judge isn't wrong in either case — it's just not anchored. The scores become a function of prompt ordering, temperature sampling, and whatever conversational context the model has implicitly constructed, rather than a function of the response quality you actually care about.
Aggregation becomes meaningless when both of these problems compound. If you can't trust that a "7" from Run A means the same thing as a "7" from Run B, you cannot compute meaningful averages, track improvements over time, or compare two models against each other. Your evaluation numbers have the appearance of rigor without any of the substance.
🤔 Did you know? A 2023 study found that LLM judges without explicit chain-of-thought reasoning showed inter-run score variance of up to 30% on identical inputs — comparable to the variance between human annotators who explicitly disagree. Structured, criteria-anchored evaluation reduced that variance substantially.
Two Advances, Two Different Problems
This is where the architecture conversation begins. The community has developed two distinct responses to the evaluation reliability problem, and understanding that they solve different problems is the first conceptual unlock of this lesson.
┌─────────────────────────────────────────────────────────┐
│ THE EVALUATION RELIABILITY PROBLEM │
└─────────────────────┬───────────────────────────────────┘
│
┌───────────┴────────────┐
│ │
▼ ▼
┌─────────────────┐ ┌──────────────────────┐
│ G-EVAL │ │ STRUCTURED OUTPUT │
│ │ │ │
│ HOW the judge │ │ HOW the score is │
│ reasons & scores│ │ returned & consumed │
│ │ │ │
│ • CoT expansion │ │ • Schema validation │
│ • Prob-weighted │ │ • Machine-readable │
│ scoring │ │ • Pipeline-safe │
│ • Anchored │ │ • Type-guaranteed │
│ rubrics │ │ │
└─────────────────┘ └──────────────────────┘
│ │
└───────────┬────────────┘
│
▼
┌───────────────────────┐
│ RELIABLE EVALUATION │
│ PIPELINE │
└───────────────────────┘
G-Eval addresses the reasoning problem. It introduces a specific architecture: given an evaluation criterion (say, "coherence" or "factual accuracy"), the judge first expands that criterion into a detailed chain-of-thought rubric, then applies each step of that rubric systematically before producing a score. Critically, the G-Eval paper also proposes scoring based on the probability distribution over score tokens, rather than just reading the token the model happened to sample. This makes the score a more stable signal than a single sampled output. We'll go deep on the mechanics in the next section — but conceptually, G-Eval is about making the judgment process principled and reproducible.
Structured output addresses the consumption problem. It's an engineering pattern — often enforced via API features like JSON mode, function calling, or schema-constrained generation — that guarantees the model's response conforms to a predefined shape. Instead of hoping the model says "Score: 7", you define a schema that requires a field called score of type integer with a range constraint, and the model runtime enforces it. Your downstream code never needs a regex. The response is a Python object. It either validates or it throws an error you can catch.
🎯 Key Principle: G-Eval and structured output are orthogonal. You can implement G-Eval reasoning without structured output (and suffer parsing problems). You can use structured output without G-Eval-style reasoning (and get machine-readable but unreliable scores). Reproducibility demands both.
What Reproducibility Actually Requires
The word "reproducibility" gets thrown around loosely in ML contexts. Here, it has a precise meaning: given the same input, the same evaluation criteria, and the same judge model, your evaluation system should produce scores that are consistently interpretable — even if they're not bit-for-bit identical across runs (which, given sampling, they often won't be).
Reproducibility in LLM evaluation has two layers:
Semantic reproducibility means the score reflects the same underlying judgment each time. If your rubric says "coherence" means the response stays on topic and transitions logically between ideas, then every run should be evaluating those things — not whatever the model happens to interpret "coherence" to mean today. G-Eval achieves this by making the rubric explicit and applying it through structured chain-of-thought, rather than leaving interpretation implicit.
Mechanical reproducibility means the score can be reliably extracted, stored, and compared across runs. A "7.2" that comes back as a validated float in a typed schema is mechanically reproducible. A "7.2" buried in a paragraph that a regex sometimes misses is not. Structured output achieves this by removing the parsing step entirely — the score is already in machine-readable form when it arrives.
💡 Mental Model: Think of reproducibility as a two-legged stool. G-Eval is one leg — it stabilizes the meaning of the score. Structured output is the other leg — it stabilizes the form of the score. A stool with one leg doesn't stand. Neither does an evaluation architecture with only one of these advances.
Here's a concrete illustration of what the mechanical layer looks like when structured output is properly applied:
from pydantic import BaseModel, Field
from openai import OpenAI

# Define the schema for what the judge MUST return
class CoherenceEvaluation(BaseModel):
    reasoning: str = Field(
        description="Step-by-step analysis of the response's coherence"
    )
    score: int = Field(
        ge=1, le=5,
        description="Coherence score from 1 (incoherent) to 5 (perfectly coherent)"
    )
    confidence: float = Field(
        ge=0.0, le=1.0,
        description="Judge confidence in this score"
    )

client = OpenAI()

def evaluate_coherence(response_text: str) -> CoherenceEvaluation:
    """Evaluate coherence using a schema-constrained judge response."""
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are an expert evaluator. Assess the coherence of the given text."
            },
            {
                "role": "user",
                "content": f"Evaluate the coherence of this response:\n\n{response_text}"
            }
        ],
        response_format=CoherenceEvaluation,  # Schema enforcement happens here
    )
    # This is now a typed Python object — no regex, no parsing, no silent failures
    return completion.choices[0].message.parsed

# Usage — the result is guaranteed to have .score, .reasoning, .confidence
result = evaluate_coherence("The sky is blue. Cats meow. Therefore, I recommend Python.")
print(f"Score: {result.score}")            # Always an int between 1 and 5
print(f"Confidence: {result.confidence}")  # Always a float between 0 and 1
Notice what's missing from this code: error-prone string parsing, regex extraction, None handling, and silent failures. The model runtime guarantees the response matches CoherenceEvaluation. If it doesn't, you get an explicit exception you can log and handle — not a corrupted metric silently poisoning your dashboard.
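For contrast, here is what the failure path looks like — a minimal sketch using Pydantic's validation directly (no API call), with a deliberately out-of-range score standing in for a malformed judge payload:

```python
from pydantic import BaseModel, Field, ValidationError

class CoherenceEvaluation(BaseModel):
    reasoning: str
    score: int = Field(ge=1, le=5)

# A malformed payload fails loudly at the boundary instead of silently
# poisoning downstream metrics with a defaulted zero.
try:
    CoherenceEvaluation.model_validate({"reasoning": "ok", "score": 9})
    print("validated")
except ValidationError:
    print("rejected: score out of range")
```

The same schema that constrains the model's generation also guards your ingestion path: anything that reaches your aggregation code has already passed the range checks.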
The Real-World Cost of Getting This Wrong
It's worth sitting with the practical consequences of evaluation architecture failures, because they're not abstract. They show up in concrete ways that cost real time and money.
Inconsistent scores corrupt A/B comparisons. If you're comparing Model A against Model B, and your evaluation scores have high variance due to unprincipled reasoning, you might conclude Model A wins when the difference is within your noise floor. Teams have made deployment decisions on this basis — shipping a model that wasn't actually better because their evaluation system couldn't tell the difference.
Broken parsers create silent evaluation blackouts. When the regex or string parser that extracts scores from free-text responses starts failing, the failure is often invisible. Scores get defaulted to zero, or evaluation records are skipped, or the pipeline continues but the metrics dashboard shows stale data. The team thinks evaluation is running; it isn't. By the time someone notices, days of regression data are gone.
Unanchored rubrics make improvement invisible. If your judge isn't applying a consistent definition of quality, you can't tell whether a model improved between versions. The score changes, but you don't know if that's because the model got better or because the judge weighted something differently this time. Your ability to iterate confidently evaporates.
⚠️ Common Mistake: Treating evaluation as a one-time script rather than a pipeline component. Evaluation that runs once and gets eyeballed is fine for a quick sanity check. But once you need to compare versions, track regressions, or run evaluation in CI/CD, it must be a reliable component with stable inputs and outputs — which means it needs architecture, not just a prompt. ⚠️
💡 Real-World Example: A team building a customer support summarization system implemented an LLM judge to evaluate summary quality. Their judge returned free text, and they extracted scores with a regex. After a model API update changed the judge model's response formatting, the regex started failing on ~15% of evaluations. Those failures silently defaulted to a score of 0. For two weeks, their quality metrics showed a dramatic decline that triggered an incident review — until someone noticed the evaluation pipeline itself was broken, not the summarization model. The incident cost three engineers two days of investigation. A schema-constrained response format would have surfaced the failure immediately as a validation error.
A Preview of How These Two Advances Work Together
Think of a well-designed LLM judge as having two layers, like a well-designed API:
┌──────────────────────────────────────────────────────────┐
│ EVALUATION REQUEST │
│ (text to evaluate + evaluation criteria) │
└──────────────────────────────┬───────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ LAYER 1: G-EVAL REASONING LAYER │
│ │
│ 1. Expand criteria into step-by-step rubric (CoT) │
│ 2. Apply each rubric step to the evaluated text │
│ 3. Derive score using probability-weighted method │
│ │
│ → Produces: principled, anchored judgment │
└──────────────────────────────┬───────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ LAYER 2: STRUCTURED OUTPUT LAYER │
│ │
│ Schema: { reasoning: str, score: int, flags: list } │
│ Validation: enforced by runtime │
│ Result: typed Python/JSON object │
│ │
│ → Produces: machine-readable, pipeline-safe response │
└──────────────────────────────┬───────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ DOWNSTREAM EVALUATION PIPELINE │
│ (aggregation, dashboards, regression detection, CI) │
└──────────────────────────────────────────────────────────┘
Layer 1 (G-Eval) ensures the judgment is sound and consistent. Layer 2 (structured output) ensures the result is consumable and safe. Neither layer can do the other's job. A principled judgment returned as prose is still a parsing nightmare. A machine-readable response that contains an unprincipled score is reliable garbage.
In the sections that follow, you'll go deep on each layer individually — understanding the intellectual contribution of the G-Eval paper, then understanding structured output as a deliberate engineering pattern — before seeing them wired together in a working implementation.
📋 Quick Reference Card: The Two Advances at a Glance
| | 🧠 G-Eval | 🔧 Structured Output |
|---|---|---|
| 🎯 Solves | Inconsistent reasoning & scoring | Unreliable score extraction |
| 🔒 Mechanism | CoT rubric expansion + prob-weighted scoring | Schema-constrained generation |
| 📚 Layer | Semantic (meaning of the score) | Mechanical (form of the score) |
| 🔧 Without it | Scores vary by implicit interpretation | Scores require fragile parsing |
| ✅ With it | Anchored, criteria-grounded judgment | Typed, validated, pipeline-safe output |
🧠 Mnemonic: "GIST" — G-Eval handles the Intelligence of scoring, Structured output handles the Transport. Get the intelligence right, get the transport right, get reliable evaluation.
The evaluation architecture conversation is ultimately about taking something that looks like it works in a notebook and making it something that actually works in production — at scale, over time, across model versions, in automated pipelines. That transformation requires both advances. Let's look at each one properly, starting with the intellectual core of G-Eval.
G-Eval at a Glance: The Insight Behind the Architecture
Before diving into implementation details, it helps to understand why G-Eval exists — what problem it was designed to solve, and what intellectual leap made it work. This section gives you that conceptual foundation. The dedicated G-Eval child lesson will go deeper into the architecture, prompt engineering, and calibration techniques; here, the goal is to build the mental model you'll need to make sense of everything that follows.
The Problem G-Eval Was Built to Solve
Imagine you ask an LLM to score a piece of text on a scale from 1 to 5 for "coherence." You write a prompt like:
Rate the coherence of the following text on a scale from 1 to 5, where 1 is incoherent and 5 is perfectly coherent. Return only the number.
Text: {candidate_text}
This seems reasonable. But in practice, it produces scores that are frustratingly noisy. Run the same prompt twice on the same text and you might get a 3, then a 4. Ask the model to score two texts and it tends to anchor on the first one, inflating or deflating the second score based on contrast rather than absolute quality. Most critically, when researchers compare these scores against human judgments, the correlation is often disappointingly low.
The root issue is that direct scoring prompts ask the model to do too much in a single step. The word "coherence" is doing enormous semantic work — it implicitly bundles together logical flow, pronoun resolution, topic consistency, transition quality, and more. The model is expected to silently unpack all of that, weigh each dimension, and compress the result into a single number. There's no visible reasoning, no structured decomposition, and no reliable way to audit what the model actually evaluated.
🤔 Did you know? The original G-Eval paper (Liu et al., 2023, "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment") demonstrated that their approach achieved a Spearman correlation with human judgments of up to 0.435 on summarization tasks — substantially higher than previous automated metrics including BERTScore and BLEURT, which hovered around 0.2–0.3.
G-Eval's central insight is that this single-step compression is the fundamental flaw. The solution is to split the process into two distinct phases before any scoring happens.
The Core Insight: Two-Phase Evaluation
G-Eval reframes scoring as a two-phase process:
- Criteria decomposition — Given a rubric criterion, have the model generate a detailed, step-by-step list of evaluation sub-tasks specific to that criterion.
- Form-filling evaluation — Use those generated steps as a structured checklist to guide the actual scoring of the candidate text.
This is conceptually similar to how a skilled human evaluator works. A teacher grading an essay doesn't just stare at it and produce a gut-feel number. They apply a rubric: Does the thesis appear in the introduction? Are claims supported by evidence? Do paragraphs transition logically? G-Eval asks the LLM to first construct that rubric operationally, then apply it.
╔══════════════════════════════════════════════════════════╗
║ G-EVAL: TWO-PHASE ARCHITECTURE ║
╠══════════════════════════════════════════════════════════╣
║ ║
║ INPUT: Criterion (e.g., "coherence") + Task Description ║
║ │ ║
║ ▼ ║
║ ┌─────────────────────────────────────────────────┐ ║
║ │ PHASE 1: Criteria Decomposition │ ║
║ │ │ ║
║ │ LLM generates evaluation steps: │ ║
║ │ 1. Check logical flow between sentences │ ║
║ │ 2. Verify pronoun references are clear │ ║
║ │ 3. Confirm topic consistency across paragraphs │ ║
║ │ 4. Assess transition quality │ ║
║ └─────────────────────┬───────────────────────────┘ ║
║ │ ║
║ ▼ ║
║ ┌─────────────────────────────────────────────────┐ ║
║ │ PHASE 2: Form-Filling Evaluation │ ║
║ │ │ ║
║ │ LLM applies steps to candidate text │ ║
║ │ and produces a score │ ║
║ └─────────────────────┬───────────────────────────┘ ║
║ │ ║
║ ▼ ║
║ OUTPUT: Score (with probability weighting applied) ║
╚══════════════════════════════════════════════════════════╝
The criteria decomposition phase is essentially a form of chain-of-thought prompting applied to the rubric itself rather than to the candidate text. By generating explicit evaluation steps before seeing the text to be scored, the model commits to a consistent interpretation of the criterion. This dramatically reduces the variance you see when scoring different texts — the goalposts are set before the game begins.
💡 Mental Model: Think of Phase 1 as writing the answer key before grading the exam. When you write the key first, every student's paper gets evaluated against the same standard. When you grade without a key, you unconsciously adjust your standards based on what you've already seen.
The Probability-Weighting Trick
The second major contribution of the G-Eval paper is more subtle but equally important. Once the model has applied its evaluation steps to a candidate text, how do you extract the score?
The naive approach is to ask the model to output a number and parse it:
# Naive approach: parse a single generated token
response = llm.generate(prompt)
score = int(response.strip())  # Fragile! What if it says "4 out of 5"?
This is fragile in multiple ways. The model might add qualifications ("I'd give this a 4 because..."), use decimals, or occasionally produce a score outside your intended range. More fundamentally, a single sampled token obscures the model's actual uncertainty. When a model is genuinely torn between a 3 and a 4, forcing it to output one number throws away information.
G-Eval's solution is probability-weighted scoring: instead of sampling a single output token, query the model for the probabilities it assigns to each valid score token and compute a weighted average.
Score tokens: ["1", "2", "3", "4", "5"]
Model probabilities: [0.02, 0.08, 0.25, 0.45, 0.20]
Weighted score = (1×0.02) + (2×0.08) + (3×0.25) + (4×0.45) + (5×0.20)
= 0.02 + 0.16 + 0.75 + 1.80 + 1.00
= 3.73
This yields a continuous score rather than a discrete integer, which carries meaningful information: a score of 3.73 tells you something different from 4.0. It signals that the model sees this text as almost a 4, with some residual probability on 3 — perhaps the text has strong structure but one noticeable logical gap.
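The arithmetic above is easy to verify directly (the probabilities are the illustrative ones from the example, not real model output):

```python
# Illustrative probabilities from the example above, one per score token
scores = [1, 2, 3, 4, 5]
probs = [0.02, 0.08, 0.25, 0.45, 0.20]

# Expected value over the score distribution
weighted = sum(s * p for s, p in zip(scores, probs))
print(f"{weighted:.2f}")  # 3.73
```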
Here's a simplified implementation of what this looks like in practice:
import openai
import numpy as np

def probability_weighted_score(
    prompt: str,
    score_tokens: list[str] = ["1", "2", "3", "4", "5"],
    model: str = "gpt-4o"
) -> float:
    """
    Query the model for token probabilities over score tokens
    and return a probability-weighted score.
    """
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,      # We only need the first score token
        logprobs=True,     # Request log probabilities
        top_logprobs=10    # Get top 10 token probabilities
    )
    # Extract log probabilities from the response
    top_logprobs = response.choices[0].logprobs.content[0].top_logprobs
    # Build a dict of token -> probability
    token_probs = {
        entry.token: np.exp(entry.logprob)  # Convert log prob to prob
        for entry in top_logprobs
    }
    # Compute weighted average over valid score tokens
    total_prob = 0.0
    weighted_sum = 0.0
    for token in score_tokens:
        prob = token_probs.get(token, 0.0)
        weighted_sum += int(token) * prob
        total_prob += prob
    # Normalize in case not all score tokens appear in top_logprobs
    if total_prob == 0:
        raise ValueError("None of the score tokens appeared in top logprobs")
    return weighted_sum / total_prob
This code requests log probabilities from the API (logprobs=True), converts them from log-space to probability-space with np.exp(), and then computes the weighted average only over the tokens you consider valid scores. The normalization step at the end is important: if your score range is 1–5 but the model's top-10 tokens only include 3, 4, and 5, you normalize against the probability mass that did appear rather than assuming the missing tokens have zero probability.
⚠️ Common Mistake: Not all LLM APIs expose token-level log probabilities, and those that do often limit how many top tokens they return. If top_logprobs=10 doesn't include all your score tokens, your normalization must account for the missing probability mass. Always add a fallback that detects when score token coverage is low and either widens the search or falls back to a sampled response.
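A quick numeric sketch of the normalization step, assuming only three of the five score tokens surfaced in the top logprobs:

```python
# Only "3", "4", "5" appeared in top_logprobs; 0.1 of the mass is unobserved
partial = {"3": 0.2, "4": 0.5, "5": 0.2}

weighted_sum = sum(int(tok) * p for tok, p in partial.items())
total_prob = sum(partial.values())  # 0.9, not 1.0

# Dividing by the observed mass rather than 1.0 keeps the score unbiased
print(round(weighted_sum / total_prob, 2))  # 4.0
```

Dividing by 1.0 instead would yield 3.6 — a systematic underestimate caused purely by truncation of the returned token list, not by the judge's actual assessment.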
Why This Reduces Bias
Two specific failure modes of naive scoring prompts are addressed by the G-Eval approach:
Positional bias occurs when the order in which you present texts affects their scores — the model unconsciously anchors on the first example and scores subsequent ones relative to it rather than against an absolute standard. G-Eval's Phase 1 decomposition combats this by fixing the evaluation criteria before any candidate text is introduced. The model can't retroactively shift the standard based on what it has already seen.
Arbitrary score assignment occurs when the model has no principled basis for choosing between adjacent scores. Without explicit sub-criteria, "is this text a 3 or a 4?" is genuinely ambiguous. With decomposed steps, the model has a partial checklist to reason against: "Steps 1, 2, and 4 are satisfied; step 3 is partially satisfied — that points toward a 4 rather than a 5."
The probability-weighting trick addresses a third problem: threshold sensitivity. When a model generates a score by sampling, tiny changes in prompt wording can flip a borderline case from a 3 to a 4. The weighted approach smooths over this cliff-edge behavior — a text that deserves "3 or maybe 4" gets a score of 3.6, which is both more honest and more stable across prompt variations.
🎯 Key Principle: G-Eval's two advances work at different levels. Criteria decomposition operates at the reasoning level — it makes the evaluation process explicit and reproducible. Probability weighting operates at the output extraction level — it makes score retrieval continuous and uncertainty-aware. You need both to get the full benefit.
A Concrete Example: Coherence Scoring
To make this tangible, here's what the two-phase process looks like for scoring the coherence of a news summary:
# Phase 1: Criteria decomposition
criteria_expansion_prompt = """
You will be evaluating text summaries for COHERENCE.

Coherence refers to the collective quality of all sentences — whether
the summary is well-structured and well-organized, and whether it makes
sense as a unified whole rather than a collection of unrelated sentences.

Please write a detailed list of evaluation steps for assessing coherence.
Be specific. Each step should be an actionable check a human evaluator
could perform on any summary.
"""

# The model might generate steps like:
# 1. Read the summary and check whether sentences follow a logical order.
# 2. Verify that pronouns and references resolve correctly.
# 3. Check that the summary does not introduce contradictory information.
# 4. Assess whether transitions between sentences are smooth.
# 5. Confirm the summary reads as a unified whole rather than isolated facts.

# Phase 2: Form-filling evaluation (using the generated steps)
scoring_prompt = """
Evaluation Steps:
{generated_steps}

Source Document:
{source_document}

Summary:
{candidate_summary}

Using the evaluation steps above, assess the coherence of the summary.
Rate the coherence on a scale from 1 (incoherent) to 5 (perfectly coherent).

Score:
"""
The key structural detail here is that {generated_steps} is populated with the output of Phase 1 before the candidate summary is introduced. The evaluation framework is built in one LLM call; the scoring happens in a second call that treats those steps as fixed instructions.
💡 Real-World Example: In practice, you can cache the Phase 1 output. If you're scoring 500 summaries against the same coherence rubric, you run Phase 1 once to generate the evaluation steps, then run Phase 2 five hundred times reusing those same steps. This makes G-Eval both principled and efficient at scale.
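A sketch of that caching pattern, using `functools.lru_cache` with a stub standing in for the real Phase 1 call (`call_judge_model` is a placeholder, not an actual API):

```python
from functools import lru_cache

def call_judge_model(prompt: str) -> str:
    # Stub for the real Phase 1 call to the judge model
    return "1. Check logical flow.\n2. Verify references.\n3. Assess transitions."

@lru_cache(maxsize=None)
def get_evaluation_steps(criterion: str, task: str) -> str:
    """Run Phase 1 once per (criterion, task) pair; reuse thereafter."""
    prompt = f"Write detailed evaluation steps for assessing {criterion} in a {task}."
    return call_judge_model(prompt)

# First call generates the rubric; the remaining 499 reuse the cached copy
get_evaluation_steps("coherence", "news summary")
for _ in range(499):
    get_evaluation_steps("coherence", "news summary")
print(get_evaluation_steps.cache_info().hits)  # 499
```

In a real pipeline you would likely persist the generated steps alongside your evaluation results as well, so that the exact rubric used for a given run is auditable later.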
What This Section Covers vs. What Comes Next
This section has given you the conceptual skeleton of G-Eval:
- 🧠 The core insight: decompose criteria before scoring to improve human alignment
- 📚 The two-phase structure: criteria expansion followed by form-filling evaluation
- 🔧 The probability trick: weighted scoring over token probabilities for continuous, bias-reduced output
- 🎯 The bias story: why these techniques address positional bias and score arbitrariness
What this section intentionally does not cover:
- The precise prompt templates and formatting conventions that maximize G-Eval performance
- How to calibrate G-Eval scores against a gold-standard human annotation set
- Multi-criteria G-Eval setups where you run separate evaluations for fluency, coherence, and factuality and aggregate them
- How G-Eval behaves differently across model families and sizes
All of that lives in the dedicated G-Eval child lesson, where we'll examine the full architecture, walk through prompt engineering choices, and look at calibration workflows. The goal here was to give you the why — the intellectual motivation that makes the architecture feel inevitable rather than arbitrary.
⚠️ Common Mistake: Treating G-Eval's two phases as optional. Some practitioners skip the criteria decomposition phase to save on API calls and just append a scoring rubric inline with the prompt. This loses most of the benefit. The chain-of-thought expansion isn't decoration — it's the mechanism that locks in a consistent interpretation of the criterion before any candidate text can influence it.
💡 Pro Tip: Even before you implement the full probability-weighting machinery, simply adding Phase 1 criteria decomposition to your scoring prompts will produce a measurable improvement in score consistency. It's the higher-leverage of the two advances for most practical use cases.
🧠 Mnemonic: Think of G-Eval as "Grade with a Key" — Generate the answer key first (Phase 1), then grade against it (Phase 2). That ordering is everything.
Bridging to What Comes Next
With the G-Eval architecture in your mental model, the next section introduces the second major advance covered in this lesson: structured output as an engineering pattern. While G-Eval solves the question of how the judge reasons, structured output solves the question of how the judge communicates its results in a way that's machine-readable, schema-validated, and safely composable into evaluation pipelines.
These two advances are complementary. G-Eval gives you scores you can trust. Structured output gives you scores your pipeline can consume reliably. Section 4 will show you how they wire together into a working implementation — but first, Section 3 will make sure you understand what structured output actually is and why it matters as a deliberate engineering choice rather than just a formatting convenience.
Structured Output as an Engineering Pattern for Judge Responses
When engineers first start building LLM-based evaluation systems, they often reach for the simplest possible solution: append "Respond in JSON" to the judge prompt and hope for the best. This works often enough in development that it feels like a solved problem — until it fails silently in production, swallowing a malformed response and corrupting a metric that nobody notices for days. Structured output is not simply a formatting instruction you give to a language model. It is a deliberate engineering pattern — a set of mechanisms that shift the responsibility for schema conformance from the model's probabilistic generation process to deterministic enforcement infrastructure. Understanding the difference, and building pipelines that exploit it, is what separates a toy evaluation script from a production-grade judging system.
The Gap Between Asking for JSON and Enforcing a Schema
Let's be precise about what actually happens when you write "Output your evaluation as JSON with fields: score, reasoning" in a prompt. You are making a soft request: the model has been trained on enough JSON that it will comply most of the time, but compliance is a statistical tendency, not a guarantee. The model might:
- Prefix the JSON with a natural language sentence ("Sure, here is my evaluation: {...}")
- Use single quotes instead of double quotes
- Emit trailing commas that are valid JavaScript but illegal JSON
- Nest fields differently than you expected
- Hallucinate additional fields because they seemed contextually appropriate
- Truncate the response mid-object if it hits a token limit
Every one of these failure modes requires defensive parsing code — and defensive parsing code is exactly the kind of brittle glue that turns a clean pipeline into a maintenance burden.
Schema enforcement moves the constraint from the prompt layer to the infrastructure layer. Instead of telling the model what you want, you configure the API or decoding engine to only permit token sequences that are valid under your schema. The model's generation is constrained so that structurally invalid outputs are literally impossible, not merely unlikely.
Soft Request Path:
Prompt ──► LLM Generation ──► Raw Text ──► Your Parser ──► Maybe Valid Object
(probabilistic) (fragile) (sometimes fails)
Schema Enforcement Path:
Prompt ──► LLM Generation ──► Enforced Schema ──► Guaranteed Valid Object
(probabilistic) (deterministic) (always succeeds structurally)
The critical insight is that even with schema enforcement, the content of the fields is still probabilistic — the model decides what score to assign, what reasoning to write. You are enforcing structure, not correctness. This distinction matters: structured output gives you reliable plumbing, not reliable judgment. You still need to design your prompts and rubrics carefully to get good judgment.
The Four Mechanisms of Structured Output
Different systems offer different enforcement mechanisms, each with different tradeoffs in terms of strictness, flexibility, and latency.
JSON Mode
JSON mode is the lightest enforcement available in most hosted model APIs. When you enable it, the API guarantees that the response will be valid, parseable JSON — but it says nothing about the shape of that JSON. You might ask for {score: int, reasoning: str} and receive {"evaluation": {"numeric_rating": 3, "explanation": "..."}}. The structure is valid JSON; it just is not the structure you wanted. JSON mode is useful as a baseline but insufficient for production evaluation pipelines where downstream code expects specific field names.
Response Schemas (Structured Output APIs)
Several major model providers now offer response schema enforcement, where you supply a JSON Schema definition and the API guarantees that the response conforms to it — correct field names, correct types, required fields present. OpenAI calls this "Structured Outputs"; other providers use similar terminology. This is substantially stronger than JSON mode: not only is the response valid JSON, it matches the exact shape you declared.
The tradeoff is that schema support varies across providers and model versions, and very complex schemas (deep nesting, many anyOf branches) can sometimes degrade generation quality as the constrained decoding fights against the model's natural completion tendencies.
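With Pydantic, the JSON Schema you hand to such an API can be generated directly from a model class rather than written by hand. A minimal sketch (the `Verdict` model is hypothetical):

```python
from pydantic import BaseModel, Field

class Verdict(BaseModel):
    """Minimal judge verdict: a bounded score plus its justification."""
    score: int = Field(ge=1, le=5)
    reasoning: str

# The provider receives an explicit JSON Schema, not a prose request
schema = Verdict.model_json_schema()
assert schema["properties"]["score"]["type"] == "integer"
assert schema["properties"]["score"]["minimum"] == 1
assert schema["properties"]["score"]["maximum"] == 5
assert set(schema["required"]) == {"score", "reasoning"}
```

Because the schema is derived from the same class your downstream code uses, the declared shape and the consumed shape cannot drift apart.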
Grammar-Constrained Decoding
At the infrastructure level, tools like llama.cpp, Outlines, and guidance implement grammar-constrained decoding — the most rigorous enforcement available. Here, you define a formal grammar (often as a context-free grammar or a JSON Schema compiled to one), and the token sampling process is modified so that only tokens that can lead to a valid completion under the grammar have nonzero probability. Structurally invalid outputs become literally impossible at the decoding level.
This approach is most relevant when you are self-hosting models or running evaluation infrastructure where you control the entire stack. For teams running evaluation pipelines on their own GPU infrastructure, grammar-constrained decoding is often the right default.
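A toy illustration of the idea — not a real decoder: at each sampling step, the grammar masks every token that cannot extend to a valid completion, so invalid structure never gets a chance to be sampled.

```python
def allowed_next_tokens(prefix: str, vocab: list[str]) -> list[str]:
    """Return the vocab tokens that keep the output inside the grammar.

    The 'grammar' here accepts exactly the strings "1".."5" — a stand-in
    for a JSON Schema compiled to a grammar in a real constrained decoder.
    """
    valid_completions = {"1", "2", "3", "4", "5"}
    return [
        t for t in vocab
        if any(c.startswith(prefix + t) for c in valid_completions)
    ]

vocab = ["1", "2", "3", "4", "5", "six", "{", '"']
# At the start, only the digit tokens 1-5 survive the mask
assert allowed_next_tokens("", vocab) == ["1", "2", "3", "4", "5"]
# Once "3" has been emitted, no further token is legal — generation stops
assert allowed_next_tokens("3", vocab) == []
```

Real implementations do this over the model's full tokenizer vocabulary at every decoding step, which is why they require control of the inference stack.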
Tool and Function Calling as a Forcing Function
Tool calling (sometimes called function calling) is an indirect but highly reliable enforcement mechanism available in most hosted model APIs. Instead of asking the model to return structured data, you declare a "function" with a typed parameter schema and instruct the model that it must call this function to complete its task. The model's response is then a structured function call with validated arguments rather than a free-form text completion.
This is widely used in production because:
- It is well-supported across providers
- The schema enforcement is robust
- It aligns with the model's fine-tuned behavior (models are specifically trained to produce well-formed tool calls)
- It is easy to layer into existing OpenAI-compatible client code
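A sketch of what a forced tool call looks like with an OpenAI-style client. The tool name `record_evaluation` and its fields are hypothetical; the `tools`/`tool_choice` shape is the standard function-calling format:

```python
# Declare the judge's verdict as a typed function the model must call
judge_tool = {
    "type": "function",
    "function": {
        "name": "record_evaluation",
        "description": "Record the judge's score and reasoning for one response.",
        "parameters": {
            "type": "object",
            "properties": {
                "score": {"type": "integer", "minimum": 1, "maximum": 5},
                "reasoning": {"type": "string"},
            },
            "required": ["score", "reasoning"],
        },
    },
}

# Forcing this specific tool turns the completion into a structured call
# rather than free text (sketch, not run here):
#   client.chat.completions.create(
#       model=..., messages=...,
#       tools=[judge_tool],
#       tool_choice={"type": "function", "function": {"name": "record_evaluation"}},
#   )
assert judge_tool["function"]["parameters"]["required"] == ["score", "reasoning"]
```

The model's reply then arrives as `tool_calls` with JSON arguments conforming to the declared parameter schema, which your code parses once at the boundary.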
🎯 Key Principle: Choose the enforcement mechanism that matches your deployment context. Hosted APIs → response schemas or tool calling. Self-hosted models → grammar-constrained decoding. Never → prompt-only JSON requests for anything that feeds production metrics.
Schema Design Principles for Judge Responses
Once you have chosen an enforcement mechanism, you face a design question: what should the schema actually look like? Judge responses have specific characteristics that inform good schema design.
The most important principle is separation of concerns: your schema should separate the score, the reasoning, and optionally a confidence or uncertainty signal into distinct typed fields, rather than encoding them all in a single string. This sounds obvious but is frequently violated in practice.
❌ Wrong thinking: Ask the model to return "reasoning_and_score": "This response scores 3/5 because..." and parse the number from the string.
✅ Correct thinking: Define score as an integer field with minimum: 1, maximum: 5, reasoning as a string field, and confidence as an optional number field. Parse nothing — just access the fields.
Here is a well-designed judge response schema using Pydantic, which integrates cleanly with structured output APIs:
from pydantic import BaseModel, Field
from typing import Literal, Optional
class CriterionEvaluation(BaseModel):
"""Evaluation result for a single scoring criterion."""
criterion_name: str = Field(
description="The name of the criterion being evaluated"
)
score: int = Field(
ge=1, le=5,
description="Score from 1 (poor) to 5 (excellent)"
)
reasoning: str = Field(
description="Step-by-step reasoning that justifies the score"
)
confidence: Literal["high", "medium", "low"] = Field(
description="Judge's confidence in this score"
)
class JudgeResponse(BaseModel):
"""Complete structured response from an LLM judge."""
criteria_evaluations: list[CriterionEvaluation] = Field(
description="Individual evaluation for each criterion"
)
overall_score: float = Field(
ge=1.0, le=5.0,
description="Weighted aggregate score across all criteria"
)
summary: str = Field(
description="Brief overall assessment of the evaluated response"
)
evaluation_flags: Optional[list[str]] = Field(
default=None,
description="Any concerns or edge cases the judge noted"
)
Notice several design decisions here. First, the score field uses ge=1, le=5 validation — even if the schema enforcement passes, Pydantic will raise a ValidationError if the model somehow returns a 6 or a 0. This is defense in depth. Second, confidence uses a Literal type rather than a free string — the model cannot return "somewhat confident" when you expected "medium". Third, evaluation_flags is optional: the model can omit it when nothing unusual occurred, preventing the model from fabricating flags just to fill the field. Fourth, reasoning is a string field adjacent to its score — this co-location matters for debuggability. When you inspect a judge's output and a score looks wrong, the reasoning is right there in the same object.
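The defense-in-depth point is directly checkable. A cut-down model (hypothetical `MiniCriterion`, mirroring the fields above) rejects out-of-range scores and off-vocabulary confidence values at construction time:

```python
from typing import Literal
from pydantic import BaseModel, Field, ValidationError

class MiniCriterion(BaseModel):
    """Cut-down CriterionEvaluation for illustration."""
    score: int = Field(ge=1, le=5)
    confidence: Literal["high", "medium", "low"]

# In-range values construct normally
ok = MiniCriterion(score=4, confidence="high")
assert ok.score == 4

# A score of 6 is rejected even though it is a perfectly valid integer
try:
    MiniCriterion(score=6, confidence="high")
    raise AssertionError("expected ValidationError for score=6")
except ValidationError:
    pass

# Free-text confidence is rejected by the Literal type
try:
    MiniCriterion(score=3, confidence="somewhat confident")
    raise AssertionError("expected ValidationError for bad confidence")
except ValidationError:
    pass
```

Even if an API-level enforcement bug let a bad value through, construction of the typed object fails loudly instead of corrupting downstream metrics.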
💡 Pro Tip: Include a criterion_name field even for single-criterion evaluations. It makes logs self-documenting and makes schema evolution easier — when you add a second criterion later, your log format is already prepared for it.
How Structured Output Enables Safe Pipeline Composition
The real payoff of structured output is not parsing convenience — it is safe pipeline composition. When every component in your evaluation pipeline emits and consumes well-typed objects, you can wire components together, aggregate results, log uniformly, and branch conditionally without writing a single line of string manipulation code.
Consider a typical evaluation pipeline:
Input ──► Judge Prompt ──► LLM API ──► JudgeResponse object
│
┌──────────────────────────────┤
│ │
▼ ▼
Aggregate scores Log to database
across N runs (structured insert)
│
▼
Conditional branch:
overall_score < 2.0 → flag for human review
overall_score >= 4.0 → auto-approve
else → secondary judge
Every arrow in this diagram can be implemented as clean typed function calls. The aggregate_scores function receives list[JudgeResponse] and returns float. The log_to_database function maps a JudgeResponse directly to a row schema. The conditional_branch reads judge_response.overall_score as a float — no casting, no .get() with defaults, no try/except around a float() call on an extracted string.
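The conditional branch, for example, collapses to a few typed comparisons. A minimal sketch using the thresholds from the diagram (the function name is illustrative):

```python
def route_judgement(overall_score: float) -> str:
    """Route a judge result based on its aggregate score."""
    if overall_score < 2.0:
        return "human_review"      # low scores get a human in the loop
    if overall_score >= 4.0:
        return "auto_approve"      # high scores pass through automatically
    return "secondary_judge"       # the ambiguous middle gets a second opinion

assert route_judgement(1.5) == "human_review"
assert route_judgement(4.2) == "auto_approve"
assert route_judgement(3.0) == "secondary_judge"
```

No parsing, no casting, no exception handling — the router trusts the type system because the boundary already validated the object.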
⚠️ Common Mistake: Treating structured output as a presentation concern rather than an architectural one. Teams sometimes add structured output late in a project, after logging and aggregation code has already been written against string parsing. Retrofitting is painful. Design your schema first, before writing any downstream code.
Here is what this pipeline composition looks like in practice, using the OpenAI client with structured output:
from openai import OpenAI
from pydantic import ValidationError
client = OpenAI()
def run_judge(
system_prompt: str,
user_content: str,
model: str = "gpt-4o"
) -> JudgeResponse:
"""
Run a structured LLM judge and return a validated JudgeResponse.
Raises ValidationError if the response does not conform to schema.
"""
completion = client.beta.chat.completions.parse(
model=model,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_content}
],
response_format=JudgeResponse, # Schema enforcement at API level
)
# .parse() returns a parsed Pydantic object directly
# ValidationError is raised here if schema conformance fails
result = completion.choices[0].message.parsed
if result is None:
# Model refused or could not complete — handle gracefully
raise ValueError("Judge returned null response; check content policy flags")
return result
def aggregate_judge_responses(
responses: list[JudgeResponse]
) -> dict[str, float]:
"""Aggregate scores across multiple judge runs."""
# Clean aggregation with no string parsing needed
scores = [r.overall_score for r in responses]
return {
"mean": sum(scores) / len(scores),
"min": min(scores),
"max": max(scores),
"high_confidence_mean": sum(
r.overall_score for r in responses
# Filter by confidence field — typed, no string comparison risk
if any(e.confidence == "high" for e in r.criteria_evaluations)
) / max(1, sum(
1 for r in responses
if any(e.confidence == "high" for e in r.criteria_evaluations)
))
}
Notice that aggregate_judge_responses accesses r.overall_score directly as a float and e.confidence directly as a string literal — no casting, no key-error handling, no defensive get() calls. The type system does the work that would otherwise require defensive code.
💡 Real-World Example: A production evaluation pipeline at a mid-sized AI company was logging judge results to a data warehouse for trend analysis. Before structured output, the logging code extracted the score from the judge's text using a regex, which worked for 97% of responses. The remaining 3% were silently dropped. Over three weeks, this produced a systematic upward bias in reported metrics — dropped responses were disproportionately from edge cases where the judge wrote "I would rate this a 2.5" instead of just a number. Structured output eliminated both the regex and the silent drop.
Validation on Receipt: Defense in Depth
Even with API-level schema enforcement, defensive validation on receipt is good practice. APIs can have bugs; model providers update their structured output implementations; you might swap to a different provider that has slightly different enforcement semantics. A thin validation layer at the boundary of your pipeline catches these surprises before they propagate.
The pattern is simple: treat the boundary between the LLM API and your application code as a deserialization boundary, exactly as you would treat any external data source. Data that comes in gets validated; validated data flows through the rest of the system as trusted typed objects.
from pydantic import ValidationError
import logging
logger = logging.getLogger(__name__)
def safe_run_judge(
    system_prompt: str,
    user_content: str,
) -> tuple[JudgeResponse | None, bool]:
"""
Run judge with full error handling.
Returns (response, is_valid) tuple.
Logs failures rather than raising, enabling bulk eval jobs to continue.
"""
try:
response = run_judge(system_prompt, user_content)
# Secondary validation: business rules beyond schema structure
# Schema guarantees score is 1-5, but we can add further checks
if response.overall_score < 1.0 or response.overall_score > 5.0:
logger.warning(
"Score out of range despite schema enforcement: %s",
response.overall_score
)
return None, False
# Validate reasoning is substantive (not just whitespace)
for criterion in response.criteria_evaluations:
if len(criterion.reasoning.strip()) < 20:
logger.warning(
"Suspiciously short reasoning for criterion '%s'",
criterion.criterion_name
)
# Don't fail — but flag it
if response.evaluation_flags is None:
response.evaluation_flags = []
response.evaluation_flags.append(
f"thin_reasoning:{criterion.criterion_name}"
)
return response, True
except ValidationError as e:
# This should be rare with API-level enforcement, but handle it
logger.error("Schema validation failed: %s", e.errors())
return None, False
except ValueError as e:
# Null response / content policy refusal
logger.error("Judge returned null: %s", e)
return None, False
except Exception as e:
# Network errors, rate limits, etc.
logger.error("Unexpected error in judge: %s", e)
return None, False
This pattern separates three concerns cleanly: schema validation (handled by Pydantic via the API client), business rule validation (the custom checks on score range and reasoning length), and infrastructure errors (the catch-all exception handler). Each concern is visible and independently maintainable.
🎯 Key Principle: Schema validation guarantees structure. Business rule validation guarantees semantics. Infrastructure error handling guarantees availability. All three are necessary; none of them substitutes for the others.
A Schema Design Quick Reference
📋 Quick Reference Card: Judge Response Schema Design
| Priority | Field | Type | Why |
|---|---|---|---|
| 🎯 Required | score | int with bounds | Forces discrete, bounded values; prevents "3.7" vs "4" ambiguity |
| 🧠 Required | reasoning | str (min length) | Enables audit trail; supports human review of judge decisions |
| 📚 Recommended | confidence | Literal["high","medium","low"] | Enables confidence-weighted aggregation and flagging |
| 🔧 Recommended | criterion_name | str | Makes logs self-documenting; supports multi-criterion schemas |
| 🔒 Optional | evaluation_flags | list[str] \| None | Captures edge cases without polluting the primary score fields |
| ⚙️ Optional | overall_score | float with bounds | Pre-computed weighted aggregate; avoids recomputing downstream |
🧠 Mnemonic: SRCCF — Score, Reasoning, Confidence, Criterion name, Flags. These are the five fields that cover 95% of judge response schemas.
Putting It Together
Structured output is not a feature you toggle on to get cleaner-looking outputs. It is the engineering foundation that makes LLM-based evaluation trustworthy enough to sit inside production pipelines. When a judge response is schema-enforced, typed, and validated at the boundary, every component downstream — aggregators, loggers, dashboards, conditional routers — can be written to assume correctness rather than defend against malformation. The cognitive overhead of "what if the model returned something weird?" disappears from every component except the one place where you explicitly handle it: the boundary.
The choice of enforcement mechanism (response schemas, tool calling, grammar-constrained decoding) is a deployment decision, not a philosophical one. Pick the strongest mechanism your infrastructure supports. Design your schema with separation of concerns, bounded types, and explicit confidence signals. Validate at the boundary. After that, the rest of your pipeline can be as clean as any other typed software system.
In the next section, we will see how this structured output foundation combines with G-Eval's rubric expansion and probability-weighted scoring to produce a complete, production-ready evaluation architecture.
Wiring G-Eval and Structured Output Together: A Working Implementation
Up to this point, you have seen G-Eval and structured output as separate ideas: one is a scoring architecture grounded in chain-of-thought rubric expansion, the other is an engineering pattern for making judge responses machine-readable. This section is where those two threads are woven together into a single, runnable evaluation pipeline. By the end, you will have a working mental model and real code you can adapt, covering everything from prompt design to batch result persistence.
Designing the Prompt Template
The prompt is the foundation of any G-Eval implementation. Its job is twofold: first, it must communicate the rubric criteria clearly enough that the model can reason through them step by step; second, it must tell the model exactly what shape its output should take so that your structured output schema can parse it reliably.
A well-designed G-Eval prompt template has four named regions. Think of them as layers in a sandwich:
┌─────────────────────────────────────────────────────┐
│ SYSTEM ROLE │
│ "You are an expert evaluator. Reason carefully │
│ before assigning scores." │
├─────────────────────────────────────────────────────┤
│ EVALUATION CRITERIA │
│ Numbered list of dimensions with anchored │
│ descriptions (e.g., coherence 1–5 scale) │
├─────────────────────────────────────────────────────┤
│ EVALUATION STEPS (G-Eval's key insight) │
│ Model-generated chain-of-thought instructions │
│ that unpack HOW to assess each criterion │
├─────────────────────────────────────────────────────┤
│ DOCUMENT + RESPONSE UNDER EVALUATION │
│ The actual content to score, clearly delimited │
└─────────────────────────────────────────────────────┘
The Evaluation Steps region is what distinguishes G-Eval from a naive "rate this on a scale of 1 to 5" prompt. In the original paper, those steps are themselves generated by a separate LLM call — the model is asked to produce a detailed checklist for how a human expert would evaluate each criterion. You then inject those generated steps into every subsequent scoring call. This means the model arrives at the scoring task having already "rehearsed" the reasoning path, which consistently produces more calibrated scores.
For a self-contained implementation, you can either pre-generate and hard-code the evaluation steps (cheaper, deterministic) or regenerate them at runtime (more adaptive but adds latency and cost). For most production pipelines, pre-generation is the right choice.
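If you opt for runtime generation instead, the step-generation call is itself just a templated prompt sent to the model once per criterion. A hedged sketch — the helper and prompt wording below are illustrative, not from the paper:

```python
STEP_GEN_TEMPLATE = (
    "You are designing an evaluation rubric. For the criterion below, write a "
    "numbered checklist of concrete steps a human expert would follow to "
    "assess it.\n\n"
    "Criterion: {name}\n"
    "Definition: {definition}"
)

def build_step_generation_prompt(name: str, definition: str) -> str:
    """Build the prompt used to pre-generate G-Eval evaluation steps
    for a single criterion; the completion is then pasted into the
    Evaluation Steps region of every scoring call."""
    return STEP_GEN_TEMPLATE.format(name=name, definition=definition)

prompt = build_step_generation_prompt(
    "Coherence",
    "The summary should be well-structured and logically ordered.",
)
assert "Coherence" in prompt
assert "checklist" in prompt
```

Running this once per criterion, reviewing the output, and hard-coding the result gives you the cost profile of pre-generation with a human check on the rubric.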
Here is a concrete prompt template encoding three G-Eval dimensions — coherence, relevance, and fluency — formatted to pair with a structured output schema:
EVAL_SYSTEM_PROMPT = """
You are an expert NLP evaluator. Your task is to score a model-generated summary
against a source document using three quality dimensions.
For each dimension, follow the evaluation steps precisely, write your chain-of-thought
reasoning in the corresponding `reasoning` field, then assign an integer score
within the stated range.
### Evaluation Criteria
1. **Coherence** (1–5): The summary should be well-structured and logically ordered.
A score of 1 means the summary is incoherent or self-contradictory.
A score of 5 means the summary reads as a unified, well-organized whole.
2. **Relevance** (1–5): The summary should contain only information present in the
source document. Penalize hallucinated or irrelevant content.
A score of 1 means the summary is mostly irrelevant or fabricated.
A score of 5 means every sentence is grounded in the source.
3. **Fluency** (1–3): The summary should be grammatically correct and easy to read.
A score of 1 means frequent grammatical errors impede reading.
A score of 3 means the summary reads naturally without errors.
### Evaluation Steps
For Coherence:
- Read the summary from start to finish without consulting the source.
- Identify whether each sentence follows logically from the previous one.
- Check for contradictions, abrupt topic shifts, or missing connective tissue.
For Relevance:
- Read each sentence in the summary and locate its supporting evidence in the source.
- Flag any claim that cannot be traced to the source.
- Count the proportion of well-supported sentences.
For Fluency:
- Scan for grammatical errors, awkward phrasing, and punctuation problems.
- Note whether the vocabulary is appropriate for the domain.
Return your evaluation as a structured JSON object matching the provided schema.
"""
EVAL_USER_TEMPLATE = """
### Source Document
{source_document}
### Summary Under Evaluation
{summary}
"""
Notice how each evaluation step is written as an explicit procedural checklist. This is the G-Eval insight operationalized: rather than leaving the model to invent its own rubric on the fly, you hand it a pre-reasoned procedure. The model's job is then to execute that procedure and record its findings.
💡 Pro Tip: Keep your delimiters (### Source Document, ### Summary Under Evaluation) consistent across every call in your pipeline. Inconsistent delimiters are one of the most common causes of silent parsing failures when the model accidentally incorporates delimiter text into its reasoning.
Mapping Dimensions onto a Typed Schema
With the prompt designed, the next step is defining a typed response schema that enforces the structure you need downstream. Using Pydantic (Python's de facto data validation library) gives you both the schema definition and runtime validation in a single declaration.
The schema mirrors the three scoring dimensions exactly. Each dimension gets a reasoning field (the chain-of-thought trace) and a score field with a constrained integer range:
from pydantic import BaseModel, Field
from typing import Literal
class DimensionScore(BaseModel):
"""A single scored evaluation dimension with chain-of-thought reasoning."""
reasoning: str = Field(
description="Step-by-step reasoning the model used to arrive at the score."
)
score: int = Field(
description="Integer score within the dimension's defined range."
)
class SummaryEvaluation(BaseModel):
"""Structured G-Eval output for a summary evaluation task."""
coherence: DimensionScore = Field(
description="Coherence score (1–5): logical flow and organization."
)
relevance: DimensionScore = Field(
description="Relevance score (1–5): fidelity to the source document."
)
fluency: DimensionScore = Field(
description="Fluency score (1–3): grammatical correctness and readability."
)
def weighted_score(self, weights: dict[str, float] | None = None) -> float:
"""Compute a normalized weighted composite score in [0, 1]."""
if weights is None:
# Default: coherence and relevance weighted higher than fluency
weights = {"coherence": 0.45, "relevance": 0.45, "fluency": 0.10}
# Normalize each dimension to [0, 1] before weighting
normalized = {
"coherence": (self.coherence.score - 1) / (5 - 1), # range 1–5
"relevance": (self.relevance.score - 1) / (5 - 1), # range 1–5
"fluency": (self.fluency.score - 1) / (3 - 1), # range 1–3
}
return sum(normalized[dim] * weights[dim] for dim in weights)
The weighted_score method normalizes each dimension to a [0, 1] interval before applying weights. This is essential: mixing raw scores from different scales (1–5 vs. 1–3) without normalization would arithmetically favor higher-range dimensions regardless of weights.
🎯 Key Principle: Normalize before you weight. A raw score of 3 on a 1–3 scale represents perfect fluency, but 3 on a 1–5 scale represents mediocre coherence. Treating them as equivalent numbers is a category error that silently distorts your composite scores.
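A quick numeric check of that principle, using the same min-max normalization as `weighted_score`:

```python
# A raw 3 means very different things on different scales once normalized
coherence_norm = (3 - 1) / (5 - 1)   # 3 on a 1-5 scale -> 0.5 (mediocre)
fluency_norm = (3 - 1) / (3 - 1)     # 3 on a 1-3 scale -> 1.0 (perfect)
assert coherence_norm == 0.5
assert fluency_norm == 1.0

# With the default weights, all-3 raw scores produce a 0.55 composite,
# correctly reflecting mediocre coherence/relevance but perfect fluency
weights = {"coherence": 0.45, "relevance": 0.45, "fluency": 0.10}
composite = (
    coherence_norm * weights["coherence"]
    + coherence_norm * weights["relevance"]   # relevance 3/5 normalizes the same
    + fluency_norm * weights["fluency"]
)
assert abs(composite - 0.55) < 1e-9
```

Summing the raw 3s directly would have treated perfect fluency and mediocre coherence as interchangeable — the category error the principle warns against.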
Code Walkthrough: Sending the Judge Call and Extracting the Score
With prompt and schema in hand, the core judge call is straightforward. The example below uses the OpenAI Python SDK with its beta.chat.completions.parse endpoint, which handles structured output natively by injecting the schema into the API request and returning a validated Python object:
import openai
from openai import OpenAI
client = OpenAI() # reads OPENAI_API_KEY from environment
def run_geval_judge(
source_document: str,
summary: str,
model: str = "gpt-4o-mini",
) -> SummaryEvaluation:
"""
Send a G-Eval judge call and return a validated SummaryEvaluation object.
Args:
source_document: The reference text the summary was generated from.
summary: The candidate summary to evaluate.
model: The OpenAI model to use as judge.
Returns:
A SummaryEvaluation instance with per-dimension scores and reasoning.
"""
user_message = EVAL_USER_TEMPLATE.format(
source_document=source_document,
summary=summary,
)
# `parse` enforces the schema at the API level and returns a typed object
completion = client.beta.chat.completions.parse(
model=model,
messages=[
{"role": "system", "content": EVAL_SYSTEM_PROMPT},
{"role": "user", "content": user_message},
],
response_format=SummaryEvaluation, # Pydantic class, not an instance
temperature=0, # determinism for reproducibility
)
# The SDK populates `parsed` with a fully validated SummaryEvaluation
evaluation: SummaryEvaluation = completion.choices[0].message.parsed
return evaluation
# --- Example usage ---
if __name__ == "__main__":
doc = (
"Scientists at the University of Edinburgh have discovered that "
"migratory birds use magnetic fields to navigate, a finding confirmed "
"by experiments in which birds with disrupted magnetic sense lost "
"their sense of direction entirely."
)
candidate_summary = (
"Researchers found that birds rely on Earth's magnetic field for "
"navigation, proven by experiments showing directionless flight "
"when this sense was blocked."
)
result = run_geval_judge(doc, candidate_summary)
print(f"Coherence — score: {result.coherence.score}/5")
print(f" Reasoning: {result.coherence.reasoning}")
print(f"Relevance — score: {result.relevance.score}/5")
print(f" Reasoning: {result.relevance.reasoning}")
print(f"Fluency — score: {result.fluency.score}/3")
print(f" Reasoning: {result.fluency.reasoning}")
print(f"Composite — weighted score: {result.weighted_score():.3f}")
Setting temperature=0 is a deliberate choice for reproducibility. When the same document-summary pair is re-evaluated later (for auditing or debugging), you want identical scores. Non-zero temperature introduces stochastic variation that makes regression testing against historical baselines unreliable.
⚠️ Common Mistake: Passing response_format=SummaryEvaluation() (an instance) instead of response_format=SummaryEvaluation (the class). The SDK expects the class itself so it can extract the JSON schema. Passing an instance may silently fall back to unstructured output depending on the SDK version.
Handling the Probability-Weighted Score Variant
The G-Eval paper's most distinctive contribution is its probability-weighted scoring approach. Instead of taking the model's stated score at face value, you ask for the log-probabilities of the top candidate score tokens and compute an expected value. This produces a continuous score that is more sensitive to near-ties (e.g., "probably a 3, but 4 is plausible") than a hard argmax.
🤔 Did you know? The G-Eval paper showed that probability-weighted scores correlate more strongly with human judgments than single-token scores on several NLP benchmarks. The gap is especially pronounced for coherence, where the model is often genuinely uncertain between adjacent scores.
Not all APIs expose token-level log-probabilities for structured output calls. When they do, the flow looks like this:
Judge prompt ──► API call with logprobs=True ──► raw completion
│
┌─────────────────────────┘
│
Extract top-k token logprobs
for the score position
│
Convert logprobs → probabilities
(softmax over candidate tokens)
│
Compute E[score] = Σ p(token) × token_value
│
Return continuous weighted score
The code below implements this variant using the standard chat.completions.create endpoint (not .parse), because structured output endpoints do not always expose per-token log-probabilities:
import math
from collections import defaultdict
def run_geval_logprob_score(
source_document: str,
summary: str,
dimension: Literal["coherence", "relevance", "fluency"],
score_range: tuple[int, int], # e.g. (1, 5) or (1, 3)
model: str = "gpt-4o-mini",
) -> float:
"""
Return a probability-weighted G-Eval score for a single dimension.
The model is prompted to output ONLY a single integer score token.
We capture the log-probabilities of the top candidate tokens and
compute an expected value over the valid score range.
Returns:
A float in [score_range[0], score_range[1]] representing E[score].
"""
low, high = score_range
valid_tokens = [str(s) for s in range(low, high + 1)]
# Minimal prompt: ask for a single score token only
logprob_prompt = (
f"Based on the evaluation criteria for {dimension}, output ONLY "
f"a single integer score between {low} and {high}. "
f"Do not output any other text."
)
user_message = EVAL_USER_TEMPLATE.format(
source_document=source_document,
summary=summary,
) + f"\n\n{logprob_prompt}"
response = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": EVAL_SYSTEM_PROMPT},
{"role": "user", "content": user_message},
],
max_tokens=1, # We only want the single score token
logprobs=True,
top_logprobs=10, # Capture top-10 candidate tokens
temperature=0,
)
# Extract log-probability entries for the first (and only) token
top_logprobs = response.choices[0].logprobs.content[0].top_logprobs
# Filter to valid score tokens and convert log-probs → probs
score_probs: dict[int, float] = defaultdict(float)
for entry in top_logprobs:
token = entry.token.strip()
if token in valid_tokens:
score_probs[int(token)] += math.exp(entry.logprob)
if not score_probs:
# Fallback: uniform distribution if no valid score tokens surfaced
return (low + high) / 2.0
# Normalize (probabilities may not sum to 1 after filtering)
total = sum(score_probs.values())
expected_score = sum(s * (p / total) for s, p in score_probs.items())
return expected_score
💡 Real-World Example: Suppose the judge is evaluating a summary with mediocre coherence. The log-probability distribution over score tokens might assign 0.45 probability to "3", 0.35 to "4", and 0.20 to "2". The hard-argmax score is 3. The probability-weighted score is 0.45×3 + 0.35×4 + 0.20×2 = 3.15 — a more nuanced signal that downstream aggregation can exploit.
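The expected value is just a probability-weighted sum over the candidate scores — the same computation `run_geval_logprob_score` performs after normalization:

```python
# Probabilities the judge assigned to each candidate score token
probs = {3: 0.45, 4: 0.35, 2: 0.20}

# E[score] = sum of score * probability
expected = sum(score * p for score, p in probs.items())
assert abs(expected - 3.15) < 1e-9

# The continuous score sits between the argmax (3) and the runner-up (4),
# capturing the judge's genuine uncertainty between adjacent scores
assert 3 < expected < 4
```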
Integrating into a Batch Evaluation Loop
A single judge call is useful for debugging; a batch loop is what makes evaluation a pipeline component. The goals of a production batch loop are: process a dataset of document-summary pairs, persist results in a structured format for later analysis, handle transient API failures gracefully, and generate a reproducible audit trail.
Dataset (JSONL)
│
▼
┌──────────────┐ retry/backoff ┌─────────────────┐
│ Batch Loop │ ──────────────────► │ Judge (G-Eval) │
│ │ ◄──────────────────── │ + Structured │
└──────────────┘ SummaryEvaluation │ Output │
│ └─────────────────┘
▼
Results (JSONL) ← one record per evaluated pair
│
▼
Aggregate stats CSV ← mean scores, std dev, pass/fail flags
The implementation below ties all prior pieces together. It reads from a JSONL dataset, runs the G-Eval judge, serializes results back to JSONL (one record per pair), and writes a summary CSV:
import json
import csv
import time
from pathlib import Path
from typing import Iterator
def load_dataset(path: str) -> Iterator[dict]:
"""Yield individual records from a JSONL evaluation dataset."""
with open(path) as f:
for line in f:
line = line.strip()
if line: # skip blank lines
yield json.loads(line)
def evaluate_dataset(
dataset_path: str,
results_path: str,
weights: dict[str, float] | None = None,
max_retries: int = 3,
retry_delay: float = 2.0,
) -> None:
"""
Run G-Eval over an entire dataset and persist results for reproducibility.
Each input record must have 'id', 'source_document', and 'summary' fields.
Output JSONL has one record per input, adding evaluation fields.
"""
results_file = Path(results_path)
# Determine already-processed IDs to support resumable runs
processed_ids: set[str] = set()
if results_file.exists():
with open(results_file) as f:
for line in f:
if line.strip():
record = json.loads(line)
processed_ids.add(record["id"])
with open(results_file, "a") as out: # append mode for resumability
for record in load_dataset(dataset_path):
item_id = record["id"]
if item_id in processed_ids:
print(f"[SKIP] {item_id} already evaluated.")
continue
for attempt in range(1, max_retries + 1):
try:
evaluation = run_geval_judge(
source_document=record["source_document"],
summary=record["summary"],
)
composite = evaluation.weighted_score(weights)
result = {
"id": item_id,
"coherence_score": evaluation.coherence.score,
"coherence_reason": evaluation.coherence.reasoning,
"relevance_score": evaluation.relevance.score,
"relevance_reason": evaluation.relevance.reasoning,
"fluency_score": evaluation.fluency.score,
"fluency_reason": evaluation.fluency.reasoning,
"composite_score": round(composite, 4),
"model": "gpt-4o-mini",
"timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
}
out.write(json.dumps(result) + "\n")
out.flush() # ensure data hits disk even if the loop crashes
print(f"[OK] {item_id} — composite: {composite:.3f}")
break # success: move to next record
except Exception as e:
if attempt == max_retries:
print(f"[FAIL] {item_id} after {max_retries} attempts: {e}")
else:
                        time.sleep(retry_delay * (2 ** (attempt - 1)))  # exponential backoff: 2s, 4s, ...
def aggregate_results(results_path: str, summary_csv_path: str) -> None:
"""Compute per-dimension means and write a summary CSV."""
records = []
with open(results_path) as f:
for line in f:
if line.strip():
records.append(json.loads(line))
if not records:
print("No results to aggregate.")
return
dimensions = ["coherence_score", "relevance_score", "fluency_score", "composite_score"]
means = {
dim: sum(r[dim] for r in records) / len(records)
for dim in dimensions
}
with open(summary_csv_path, "w", newline="") as f:
writer = csv.DictWriter(f, fieldnames=["dimension", "mean", "n"])
writer.writeheader()
for dim, mean in means.items():
writer.writerow({"dimension": dim, "mean": round(mean, 4), "n": len(records)})
print(f"\nAggregation complete ({len(records)} records):")
for dim, mean in means.items():
print(f" {dim}: {mean:.4f}")
Several design decisions in this batch loop are worth calling out explicitly.
Append mode with resume support. Opening the results file in append mode and tracking processed IDs at startup means the loop is idempotent: if it crashes halfway through a 10,000-item dataset, you can re-run it and it will pick up from where it left off without re-evaluating already-completed items or duplicating records.
Flush after every write. Calling out.flush() after each line ensures that completed evaluations are written to disk even if the process is killed unexpectedly. Without this, Python's write buffer may hold multiple records in memory, losing them on crash.
Exponential backoff on retry. Doubling the delay on each failed attempt gives 2s and then 4s pauses before the second and third tries, which is usually enough to ride out transient rate-limit responses from most APIs without burning through your retry budget too quickly.
💡 Mental Model: Think of the results JSONL file as a write-ahead log for your evaluation run. Each appended line is a durable record of a completed unit of work. If anything goes wrong, the log tells you exactly where to resume.
📋 Quick Reference Card: Batch Loop Design Decisions
| 🔧 Decision | ✅ Chosen Approach | ❌ Common Alternative |
|---|---|---|
| 🔒 File mode | Append ("a") + resume check | Overwrite ("w") each run |
| 🔧 Flush strategy | After every record | At end of loop |
| 📚 Retry logic | Exponential backoff | Fixed delay or no retry |
| 🎯 Score normalization | Per-dimension to [0,1] | Raw score arithmetic |
| 🧠 Temperature | 0 for reproducibility | Default (often 1.0) |
Putting It All Together
The full pipeline now looks like this in practice: you call evaluate_dataset() with a JSONL file of source-summary pairs, and it produces a companion JSONL of fully reasoned, schema-validated evaluation records. You call aggregate_results() to turn that into a CSV summary you can share with stakeholders or track across model versions.
Because every output record contains both the raw dimension scores and the chain-of-thought reasoning strings, the results file serves double duty: it is machine-readable for downstream metrics computation, and human-readable for manual audits. When a score looks surprising, you do not have to re-run the evaluation — you already have the model's reasoning on disk.
🎯 Key Principle: The combination of G-Eval's chain-of-thought rubric expansion and structured output's schema enforcement transforms evaluation from a one-off script into a durable pipeline artifact. The reasoning is auditable, the scores are validated, and the results are reproducible — exactly the properties you need to trust your evaluation system as a component of a larger LLM workflow.
Common Pitfalls: Where G-Eval and Structured Output Break in Practice
Even well-intentioned implementations of G-Eval-style judges with structured output can silently degrade into something unreliable and hard to debug. The failure modes discussed in this section are not hypothetical — they are patterns that show up repeatedly when teams first move from prototype evaluation scripts into production pipelines. Understanding them before you hit them will save you hours of confusing debugging and, more importantly, prevent you from drawing false conclusions from corrupted evaluation data.
This section walks through five distinct failure modes, each rooted in a specific conceptual misunderstanding. By naming them precisely and showing what they look like in code, you will develop the instinct to recognize them as you write your own judges.
Pitfall 1: Confusing Prompt-Level JSON Instructions with Schema Enforcement
The most pervasive mistake in the space is assuming that telling a model to "respond in JSON" is the same thing as enforcing structured output. It is not — not even close.
Prompt-level JSON instructions are just text. You are asking the model, politely, to format its response a certain way. The model might comply most of the time. But it will also occasionally add a preamble like "Sure, here's the JSON:", wrap the output in markdown fences (```json), return invalid JSON with a trailing comma, or simply decide the question was better answered in prose. Any of these responses will crash a naive json.loads() call.
Schema enforcement, by contrast, is a constraint applied at the token generation level. The model's output is constrained to tokens that are valid continuations of a well-formed JSON document matching a specific schema. There is no "sometimes" — either the schema is enforced or it is not.
## ❌ Prompt-level JSON instruction — fragile, not structured output
import openai
client = openai.OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": "You are an evaluation judge. Always respond in valid JSON."
},
{
"role": "user",
"content": "Evaluate this response for coherence. Score from 1-5."
}
]
# No response_format — this is just asking nicely
)
## This will raise json.JSONDecodeError roughly 5-15% of the time
import json
result = json.loads(response.choices[0].message.content)
Compare that with actual schema enforcement using the response_format parameter with a Pydantic model:
## ✅ Schema-enforced structured output — reliable, pipeline-safe
from pydantic import BaseModel, Field
from openai import OpenAI
client = OpenAI()
class CoherenceJudgment(BaseModel):
reasoning_steps: list[str] = Field(
description="Step-by-step reasoning through the rubric criteria"
)
score: int = Field(
ge=1, le=5,
description="Integer coherence score from 1 (incoherent) to 5 (fully coherent)"
)
confidence: float = Field(
ge=0.0, le=1.0,
description="Judge confidence in this score"
)
response = client.beta.chat.completions.parse(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are an evaluation judge."},
{"role": "user", "content": "Evaluate this response for coherence."}
],
response_format=CoherenceJudgment # Schema is enforced, not requested
)
## Barring a model refusal, this cannot raise a parsing error — the schema
## is enforced during generation, not checked after the fact
judgment = response.choices[0].message.parsed
print(judgment.score) # Always an int between 1 and 5
⚠️ Common Mistake: Adding response_format={"type": "json_object"} and thinking you have structured output. json_object mode guarantees valid JSON, but it does not validate against any schema. You can still get {"score": "five out of five"} instead of {"score": 5}. Always use a schema-backed format.
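To see that gap concretely: the payload below is perfectly valid JSON, so json_object mode would accept it, yet the score field carries the wrong type — only schema validation catches that. This is a stdlib-only sketch; a real pipeline would validate with a Pydantic model or JSON Schema as shown above.

```python
import json

# Valid JSON that json_object mode could happily produce...
raw = '{"score": "five out of five"}'
parsed = json.loads(raw)  # parses without error

# ...but nothing enforced the *type* of the score field
print(type(parsed["score"]).__name__)  # str, not int
```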
💡 Mental Model: Think of prompt-level JSON instructions as asking someone to "write neatly." Schema enforcement is handing them a form with labeled boxes that only accept specific input types. One is a social request; the other is a technical constraint.
Pitfall 2: Vague or Overlapping Rubric Criteria
G-Eval's power comes from its chain-of-thought criteria expansion step: the judge model generates explicit reasoning steps before committing to a score. But this step can backfire dramatically when the rubric criteria are poorly designed.
Consider a rubric that asks the judge to evaluate both "Is the response accurate?" and "Is the response factually grounded?" These two criteria overlap substantially. When the model expands them into reasoning steps, it will often reason about the same underlying dimension twice, sometimes reaching inconsistent conclusions. The expanded steps might say:
- Step 2: The response correctly states that Paris is the capital of France — accuracy is high.
- Step 4: The response does not cite any sources, so factual grounding is low.
The model then faces a contradictory signal when producing a score. Does strong accuracy override weak grounding? The chain-of-thought has no principled way to resolve this, and the resulting score will be noisy and hard to interpret.
Vague criteria cause a different problem. A criterion like "Is the response good?" gives the model no concrete anchor for its reasoning steps. The expanded chain-of-thought will fill in its own interpretation of "good," which may differ between runs, between models, and between temperature settings. You have now introduced a hidden random variable into what you thought was a deterministic evaluation.
❌ Poorly designed rubric criteria:
Criteria:
1. Is the response accurate?
2. Is the response factually grounded?
3. Is the response good overall?
4. Does the response demonstrate quality?
Result of CoT expansion:
Step 1 and Step 2 overlap → contradictory signals
Step 3 and Step 4 are undefined → model invents criteria
Score reflects noise, not signal
✅ Well-designed rubric criteria:
Criteria:
1. Does every factual claim in the response match the provided reference?
2. Is the logical flow from premise to conclusion free of gaps or contradictions?
3. Does the response address all explicit sub-questions in the prompt?
Result of CoT expansion:
Each step has a concrete, testable anchor
Steps do not overlap — they cover orthogonal dimensions
Score reflects a principled aggregation of distinct signals
🎯 Key Principle: Each rubric criterion should be orthogonal (not overlapping), anchored (referencing something concrete in the input), and falsifiable (a model could, in principle, fail this criterion even if it passes all others).
💡 Pro Tip: Before deploying a rubric, test it by deliberately constructing two adversarial examples — one that should pass criterion A but fail criterion B, and one that reverses that. If your chain-of-thought expansion cannot distinguish them cleanly, your criteria are too intertwined.
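The Pro Tip's adversarial test can be sketched as a tiny harness. Everything here is hypothetical scaffolding: `judge` is assumed to return a per-criterion pass/fail mapping for two named criteria A and B, and the toy keyword judge merely stands in for a real judge call.

```python
# Hypothetical harness for the adversarial rubric check described above.
# `judge` is assumed to map text -> {"A": bool, "B": bool} per-criterion verdicts.

def rubric_is_separable(judge, passes_a_fails_b: str, passes_b_fails_a: str) -> bool:
    """True if the judge cleanly distinguishes the two adversarial examples."""
    v1 = judge(passes_a_fails_b)
    v2 = judge(passes_b_fails_a)
    return v1["A"] and not v1["B"] and v2["B"] and not v2["A"]

# Stubbed usage with a toy keyword judge (no API call needed):
toy_judge = lambda text: {"A": "accurate" in text, "B": "grounded" in text}
print(rubric_is_separable(toy_judge, "accurate but uncited", "grounded citations only"))  # True
```

If this returns False for your real rubric and judge, the two criteria are too intertwined to produce interpretable per-criterion signals.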
Pitfall 3: Schema Fields That Are Too Permissive
A structured output schema is only as strong as its type constraints. Defining score as a str field defeats the entire purpose of schema enforcement for a numeric evaluation.
This failure mode is subtle because it does not cause an error. The schema is technically satisfied. But downstream pipeline components expecting a number will either crash or, worse, silently coerce the string — and a coercion from "4" to 4 will work, while a coercion from "four out of five" to 4 will not, leaving you with inconsistent data depending on how the model chose to express the score that day.
## ❌ Too permissive — 'score' is a free string
class BadJudgmentSchema(BaseModel):
reasoning: str
score: str # Model might return "4", "4/5", "four", "high", ...
verdict: str # Model might return "pass", "Pass", "PASS", "yes", ...
## ✅ Properly constrained — types and ranges are enforced
from enum import Enum
from pydantic import model_validator
class Verdict(str, Enum):
PASS = "pass"
FAIL = "fail"
BORDERLINE = "borderline"
class GoodJudgmentSchema(BaseModel):
reasoning_steps: list[str] = Field(
min_length=2,
description="At least 2 explicit reasoning steps before scoring"
)
score: int = Field(
ge=1, le=5,
description="Integer score: 1=poor, 3=acceptable, 5=excellent"
)
verdict: Verdict = Field(
description="Binary classification after scoring"
)
    # If score >= 4: verdict must be pass; if score <= 2: verdict must be fail.
    # Enforce this cross-field constraint with a Pydantic v2 model validator
    # (requires `from pydantic import model_validator`):
    @model_validator(mode="after")
    def check_verdict_consistency(self) -> "GoodJudgmentSchema":
        if self.score >= 4 and self.verdict == Verdict.FAIL:
            raise ValueError("Score >= 4 cannot have verdict FAIL")
        if self.score <= 2 and self.verdict == Verdict.PASS:
            raise ValueError("Score <= 2 cannot have verdict PASS")
        return self
⚠️ Common Mistake: Defining an int field for score but omitting ge and le bounds. The schema will accept -999 or 10000 as valid scores. Always bound numeric ranges explicitly.
🤔 Did you know? Some teams use Literal types from Python's typing module to constrain scores to a fixed set of allowed values, such as Literal[1, 2, 3, 4, 5]. This is an alternative to ge/le that makes the allowed values explicit in the schema definition itself.
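A minimal stdlib sketch of that idea — mirroring the membership check a Literal-typed schema field enforces, without assuming any particular validation library (`validate_score` is an illustrative helper, not a library function):

```python
from typing import Literal, get_args

ScoreValue = Literal[1, 2, 3, 4, 5]  # allowed values are explicit in the type

def validate_score(value: int) -> int:
    # A Literal-constrained schema field performs exactly this membership check
    allowed = get_args(ScoreValue)
    if value not in allowed:
        raise ValueError(f"score must be one of {allowed}, got {value!r}")
    return value

print(validate_score(4))  # 4
# validate_score(0) would raise ValueError
```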
Pitfall 4: Ignoring Token Probability Availability
One of the most intellectually important contributions of the G-Eval paper is probability-weighted scoring: instead of taking the model's argmax output ("4"), you collect the token probabilities for all valid score tokens and compute a weighted average. This produces a continuous score that captures the model's uncertainty rather than discarding it.
For example, if the model assigns 60% probability to "4" and 35% probability to "3", the weighted score is 0.60 × 4 + 0.35 × 3 = 3.45 — meaningfully different from a hard 4. This weighted score is more stable across repeated runs and more sensitive to near-threshold responses.
The failure mode here is silent: if you implement a G-Eval-style judge but never check whether your API supports token probability extraction, you will fall back to argmax decoding without any warning. Your evaluations will look valid but will have lost a key property of the architecture.
## Checking for logprob availability and falling back gracefully
import math
from openai import OpenAI
client = OpenAI()
def probability_weighted_score(
prompt_messages: list[dict],
valid_scores: list[int] = [1, 2, 3, 4, 5]
) -> dict:
"""
Attempt probability-weighted scoring; fall back to argmax with a warning.
Returns dict with 'score', 'method', and optionally 'score_distribution'.
"""
response = client.chat.completions.create(
model="gpt-4o",
messages=prompt_messages,
max_tokens=1, # We only need the score token
logprobs=True, # Request token probabilities
top_logprobs=10 # Get top-10 token probabilities
)
choice = response.choices[0]
# Check if logprobs were actually returned
if choice.logprobs is None or not choice.logprobs.content:
# Silent fallback is dangerous — always log this
print("⚠️ WARNING: Logprobs unavailable. Falling back to argmax decoding.")
print(" Probability-weighted scoring is disabled for this call.")
raw_score = choice.message.content.strip()
try:
return {"score": int(raw_score), "method": "argmax"}
except ValueError:
raise RuntimeError(f"Argmax fallback returned non-integer: '{raw_score}'")
# Build probability distribution over valid score tokens
    token_probs = {
        t.token.strip(): math.exp(t.logprob)  # strip whitespace; convert log-prob to probability
for t in choice.logprobs.content[0].top_logprobs
}
# Filter to only valid score tokens
valid_probs = {
str(s): token_probs.get(str(s), 0.0)
for s in valid_scores
}
total_prob = sum(valid_probs.values())
if total_prob < 0.01: # Essentially no probability mass on valid scores
raise RuntimeError(
f"Model assigned <1% probability to any valid score token. "
f"Top tokens: {list(token_probs.keys())[:5]}"
)
# Normalize and compute weighted score
weighted_score = sum(
float(token) * (prob / total_prob)
for token, prob in valid_probs.items()
)
return {
"score": round(weighted_score, 3),
"method": "probability_weighted",
"score_distribution": valid_probs
}
This implementation does three things correctly: it requests logprobs explicitly, it checks that logprobs were actually returned before using them, and it logs a warning rather than silently degrading to argmax. The warning is not optional — silent degradation is how evaluation bugs go undetected for weeks.
⚠️ Common Mistake: Combining probability-weighted scoring with structured output enforcement at the same time. These two techniques operate at different layers and can conflict. Structured output constrains the generation process in ways that may alter or suppress logprob data. Use probability-weighted scoring in a separate, minimal completion call (just the score token), not inside a full structured output response.
ARCHITECTURE: Separating CoT Generation from Score Extraction
┌─────────────────────────────────────────────────────────────┐
│ Call 1: Structured Output (CoT + Reasoning Steps) │
│ Model: gpt-4o │
│ response_format: CoTReasoningSchema │
│ Output: { reasoning_steps: [...], preliminary_score: 4 } │
└──────────────────────┬──────────────────────────────────────┘
│ reasoning_steps injected into prompt
▼
┌─────────────────────────────────────────────────────────────┐
│ Call 2: Logprob Extraction (Score Token Only) │
│ Model: gpt-4o │
│ max_tokens=1, logprobs=True, top_logprobs=10 │
│ Output: logprobs over ["1","2","3","4","5"] │
└──────────────────────┬──────────────────────────────────────┘
│
▼
Weighted score: 3.72
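Wired together, the two calls might look like the sketch below. Both callables are hypothetical stand-ins: `cot_judge` for Call 1's structured CoT step and `score_from_logprobs` for Call 2's single-token logprob extraction — neither name comes from a real API.

```python
# Hypothetical glue for the two-call pattern in the diagram above.

def two_call_judge(context: str, answer: str, cot_judge, score_from_logprobs) -> float:
    reasoning_steps = cot_judge(context, answer)        # Call 1: structured CoT
    scoring_prompt = "\n".join(reasoning_steps) + "\nFinal score (1-5):"
    return score_from_logprobs(scoring_prompt)          # Call 2: weighted score

# Stubbed usage (no API calls needed):
fake_cot = lambda ctx, ans: ["Step 1: claims match the context.", "Step 2: one minor gap."]
fake_scorer = lambda prompt: 3.72
print(two_call_judge("ctx", "ans", fake_cot, fake_scorer))  # 3.72
```

The key design property is that the reasoning steps flow from Call 1 into Call 2's prompt, so the weighted score is conditioned on the same chain-of-thought the judge produced.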
Pitfall 5: Treating a Single Judge Call as Ground Truth
The most dangerous pitfall is conceptual rather than technical: assuming that one call to a judge LLM produces a reliable, repeatable score. It does not.
LLM judges exhibit judge variance — the same model, the same prompt, the same input can produce different scores across runs, especially for borderline cases. This variance has several sources: temperature-induced randomness in the model's reasoning, sensitivity to minor prompt wording differences, and the inherent ambiguity of natural language rubric criteria even when well-designed.
The practical consequence is stark. Imagine your evaluation pipeline returns a score of 3 for a model response on a coherence rubric with a pass threshold of 3.5. Has the response actually failed? Or did you happen to draw a low sample from a distribution centered at 3.8? A single call cannot answer this question.
Judge Score Distribution for a Borderline Response (n=20 samples)
Score │ Count │ Bar
──────┼────────┼────────────────────────
1 │ 0 │
2 │ 1 │ █
3 │ 6 │ ██████
4 │ 11 │ ███████████
5 │ 2 │ ██
──────┼────────┼────────────────────────
Mean: 3.7 │ Std Dev: 0.73 │ Single sample might be 2 OR 5
Conclusion: A pass/fail threshold at 3.5 would classify this
response differently depending on which sample you happened to draw.
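The summary statistics of that distribution are easy to recompute with the standard library (counts taken from the histogram above):

```python
import statistics

# Counts from the borderline-response distribution above (n = 20 samples)
counts = {2: 1, 3: 6, 4: 11, 5: 2}
samples = [score for score, n in counts.items() for _ in range(n)]

print(statistics.mean(samples))             # 3.7
print(round(statistics.stdev(samples), 2))  # 0.73 (sample standard deviation)
```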
The mitigation is repeated sampling with aggregation: run the judge multiple times (typically 3–5 for production, up to 20 for research settings) and aggregate the results. Aggregation can take several forms:
🧠 Mean score across N runs — simple and robust to outliers when N ≥ 5.
📚 Majority vote on categorical verdicts — appropriate when your schema uses a Verdict enum rather than a continuous score.
🔧 Confidence-weighted mean — if your schema includes a confidence field, weight each score by the judge's reported confidence.
🎯 Ensemble across judge models — run the same rubric with two different judge models and average, reducing model-specific bias.
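Two of these options can be sketched side by side — plain mean versus confidence-weighted mean — over hypothetical judge outputs that include a confidence field (the three runs below are made-up illustration data):

```python
# Hypothetical outputs from three judge runs, each with a reported confidence
runs = [
    {"score": 4, "confidence": 0.9},
    {"score": 3, "confidence": 0.4},
    {"score": 4, "confidence": 0.8},
]

plain_mean = sum(r["score"] for r in runs) / len(runs)
conf_weighted = (
    sum(r["score"] * r["confidence"] for r in runs)
    / sum(r["confidence"] for r in runs)
)

print(round(plain_mean, 2))     # 3.67
print(round(conf_weighted, 2))  # 3.81
```

Note how the low-confidence "3" is discounted in the weighted variant, pulling the aggregate toward the scores the judge was more sure about.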
import statistics
from typing import Callable
def stable_judge_score(
judge_fn: Callable[[], dict],
n_samples: int = 5,
score_field: str = "score"
) -> dict:
"""
Run a judge function n_samples times and return aggregated statistics.
judge_fn should return a dict with at least a 'score' key.
"""
results = [judge_fn() for _ in range(n_samples)]
scores = [r[score_field] for r in results]
mean_score = statistics.mean(scores)
std_dev = statistics.stdev(scores) if len(scores) > 1 else 0.0
# Flag high-variance results for human review
HIGH_VARIANCE_THRESHOLD = 1.0 # Adjust based on your score range
needs_review = std_dev > HIGH_VARIANCE_THRESHOLD
return {
"mean_score": round(mean_score, 3),
"std_dev": round(std_dev, 3),
"individual_scores": scores,
"n_samples": n_samples,
"needs_human_review": needs_review,
"recommendation": (
"Flag for review: high judge variance detected"
if needs_review
else "Score is stable across samples"
)
}
💡 Real-World Example: A team building an automated essay grading pipeline discovered that their single-call judge had a standard deviation of 1.2 points on a 5-point scale for essays that mixed strong and weak paragraphs. By switching to 5-sample averaging, they reduced the standard deviation to 0.3 and cut false positives at their pass threshold by 40%.
⚠️ Common Mistake: Increasing sample count to fight variance caused by a bad rubric. Repeated sampling reduces variance from stochasticity, but it cannot fix systematic bias introduced by overlapping or vague criteria. Fix the rubric first, then tune the sample count.
Putting It All Together: A Diagnostic Checklist
Before deploying any G-Eval-style judge with structured output, run through this checklist:
📋 Quick Reference Card: Pre-Deployment Judge Audit
| Check | What to verify | ✅ Pass | ❌ Fail |
|---|---|---|---|
| 🔒 Schema enforcement | Using response_format with a typed schema | Pydantic model with bounds | Plain json_object or no format |
| 🎯 Rubric quality | Criteria are orthogonal and anchored | Distinct, falsifiable criteria | Vague or overlapping criteria |
| 🔧 Field types | Score field has tight constraints | int with ge/le | str or unbounded int |
| 📚 Logprob handling | Logprob availability is checked explicitly | Warning on fallback | Silent argmax fallback |
| 🧠 Sample count | Score aggregated across multiple runs | N ≥ 3 with std dev tracked | Single call treated as final |
These five failure modes cluster into two categories: engineering failures (pitfalls 1, 3, and 4) that introduce silent parsing and data integrity errors, and design failures (pitfalls 2 and 5) that produce systematically misleading evaluation results. Both categories can make your evaluation pipeline look functional while quietly returning numbers you cannot trust.
🎯 Key Principle: A judge that appears to work is not the same as a judge you have evidence to trust. Catching these pitfalls is not about being cautious — it is about building the kind of evaluation infrastructure where a score of 3.7 actually means something consistent, reproducible, and interpretable.
Key Takeaways and What Comes Next
You started this lesson with a problem: LLM evaluation is fragile, inconsistent, and hard to automate reliably. You're finishing it with two concrete architectural tools that, used together, solve that problem in a principled way. Before diving into the child lessons where each topic gets its full treatment, let's lock in what you now understand — and make sure it sticks.
The Two Central Ideas, Stated Clearly
Every concept in this lesson flows from two independent but complementary advances. It's worth stating them plainly before layering on nuance.
Advance 1: G-Eval is a reasoning architecture, not a better prompt.
The temptation when you first encounter G-Eval is to summarize it as "just ask the model to think step by step before scoring." That undersells it significantly. The actual insight from the G-Eval paper is a two-stage pipeline: first, use the model to expand your abstract evaluation criteria into concrete, step-by-step sub-questions tailored to the specific task — criteria expansion. Second, rather than taking the model's verbally stated score at face value, sample the token probability distribution over the score tokens and compute a probability-weighted score. The result is a judge that's more calibrated, less sensitive to superficial phrasing variation, and more aligned with human judgments than a naive "rate this 1-5" prompt.
Advance 2: Structured output is an engineering contract, not a formatting trick.
The second temptation is to treat structured output as a convenience — asking the model to "respond in JSON" so you can parse it more easily. That also undersells it. Schema-enforced structured output is a contract between the judge and everything downstream. When you define a Pydantic model or JSON Schema and enforce it at the API level, you're guaranteeing that every consumer of your judge's output — whether that's a database write, a dashboard, an alert trigger, or another model — can rely on a stable, validated interface. It decouples judge logic from downstream consumers and makes your evaluation pipeline composable and robust to model behavior drift.
🎯 Key Principle: G-Eval improves the quality of what the judge thinks. Structured output improves the reliability of what the judge returns. One operates on cognition; the other operates on interface design.
Independence and Complementarity
One of the most practically important things to understand is that these two advances are orthogonal. You can adopt either one without the other, and both deliver real value independently.
                         Structured Output
               ┌──────────────────┬──────────────────┐
               │    No Schema     │      Schema      │
┌──────────────┼──────────────────┼──────────────────┤
│ G-Eval       │ Better quality,  │ Best: quality +  │
│ Reasoning    │ hard to automate │ reliability      │
├──────────────┼──────────────────┼──────────────────┤
│ Naive        │ Fragile          │ Parseable,       │
│ Prompt       │ baseline         │ but noisy        │
└──────────────┴──────────────────┴──────────────────┘
The quadrant view above tells the story clearly:
- 🔧 Naive prompt, no schema: The starting point most teams are at. Inconsistent scoring, manual parsing, brittle pipelines.
- 🔧 G-Eval reasoning, no schema: Better scores, but your pipeline is still fragile. You're extracting scores with regex and hoping the model follows your format instructions.
- 🔧 Naive prompt, with schema: Parseable output, but the scores themselves are noisy. You've solved the engineering problem without solving the evaluation quality problem.
- 🎯 G-Eval reasoning, with schema: The target state. High-quality, calibrated scores delivered through a reliable, machine-readable interface.
💡 Real-World Example: A team building a RAG evaluation harness might adopt structured output first — it's a lower-risk engineering change that immediately improves pipeline reliability. Once that's stable, they layer in G-Eval-style criteria expansion to improve score quality. The two changes can be shipped independently and their effects measured separately.
The Minimum Viable Judge Checklist
Before introducing the full reference card, here's the mental model that ties the checklist together: a production-ready LLM judge has four layers, each addressing a distinct failure mode.
Layer 4: Score Extraction Strategy
↑ How do you get a number out reliably?
Layer 3: Schema-Enforced Response
↑ How do you guarantee the format?
Layer 2: Chain-of-Thought Expansion
↑ How do you make reasoning explicit?
Layer 1: Rubric
↑ What are you actually measuring?
Skip layer 1 and the judge has no grounding — it scores based on vibes. Skip layer 2 and the reasoning is opaque and hard to audit. Skip layer 3 and the pipeline breaks when the model deviates from your format instructions. Skip layer 4 and you end up with a string where you expected a float.
📋 Quick Reference Card: Minimum Viable Judge
| Layer | 🎯 What It Addresses | ✅ Done Right | ❌ Common Failure |
|---|---|---|---|
| 🔒 Rubric | Grounds the evaluation in explicit criteria | Named dimensions, concrete descriptions, defined score anchors | "Rate the quality of this response" with no further guidance |
| 🧠 CoT Expansion | Makes reasoning auditable and consistent | Model generates step-by-step sub-criteria before scoring | Asking for a score with no reasoning step |
| 📋 Schema Enforcement | Guarantees machine-readable output | Pydantic model or JSON Schema enforced at API level | Asking for JSON in the prompt and parsing with regex |
| 🔧 Score Extraction | Converts judge output to pipeline-usable values | Probability-weighted scoring or validated field access | int(response.split("Score:")[1]) |
Consolidating the Lesson in Code
The best way to cement these principles is to see them expressed together in a minimal but complete implementation. The snippet below is a distillation — not the full working example from Section 4, but a compact reference version that shows all four layers of the checklist in under 50 lines.
from pydantic import BaseModel, Field
from openai import OpenAI
client = OpenAI()
```python
from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()

# ── Layer 1: Rubric ──────────────────────────────────────────────
RUBRIC = """
Dimension: Factual Accuracy
Description: Does the response contain only claims that are verifiably true
based on the provided context? Penalize hallucinated facts, misattributions,
and unsupported assertions.
Score anchors:
1 — Multiple factual errors that significantly mislead the reader.
3 — Mostly accurate with minor unsupported claims.
5 — Fully accurate; every claim is grounded in the provided context.
"""

# ── Layer 2: CoT expansion prompt ────────────────────────────────
EXPANSION_PROMPT = """
You are an evaluation assistant. Given the rubric below, generate 3-5
specific yes/no questions that a scorer should answer before assigning a score.
Rubric: {rubric}
Output only the questions, one per line.
"""

# ── Layer 3: Schema-enforced response ────────────────────────────
class JudgeOutput(BaseModel):
    reasoning_steps: list[str] = Field(
        description="Answers to each sub-question from the expansion step"
    )
    score: float = Field(
        ge=1.0, le=5.0,
        description="Final score on the rubric's 1-5 scale"
    )
    confidence: float = Field(
        ge=0.0, le=1.0,
        description="Judge's confidence in this score"
    )

# ── Layer 4: Score extraction via validated field access ─────────
def run_judge(context: str, response: str) -> JudgeOutput:
    # Step A: expand the rubric into concrete sub-questions
    expansion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": EXPANSION_PROMPT.format(rubric=RUBRIC)}],
    )
    sub_questions = expansion.choices[0].message.content

    # Step B: score using the expanded criteria + schema enforcement
    result = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"""Context: {context}\nResponse: {response}\n
Sub-questions to answer before scoring:\n{sub_questions}\n
Now score the response on the Factual Accuracy rubric.""",
        }],
        response_format=JudgeOutput,  # schema enforced at API level
    )
    # Score is accessed as a typed field — no parsing, no regex
    return result.choices[0].message.parsed
```
Notice what each layer is doing in this compact version:
- Layer 1 (Rubric): `RUBRIC` gives the judge named dimensions and score anchors. The model knows what "5" means before it starts reasoning.
- Layer 2 (CoT Expansion): The first API call generates sub-questions from the rubric. These become the scaffolding for the scoring step.
- Layer 3 (Schema): `response_format=JudgeOutput` enforces the Pydantic model at the API level. The model cannot return malformed output.
- Layer 4 (Extraction): `.parsed` gives you a fully typed `JudgeOutput` object. Its `score` is a `float`, not a string to be parsed.
💡 Pro Tip: Even if you're not yet implementing probability-weighted scoring (the deeper G-Eval mechanism covered in the child lesson), the confidence field in the schema approximates that signal. A judge that assigns a score of 4.2 with confidence 0.4 is telling you something different from one that assigns 4.2 with confidence 0.9.
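One practical use of that confidence signal is routing. A minimal sketch (the `triage` helper and the 0.6 threshold are illustrative choices, not part of the lesson's pattern) that sends low-confidence verdicts to a human queue instead of trusting the number:

```python
# Hypothetical triage helper: route low-confidence judge verdicts to
# human review instead of acting on the score. Threshold is arbitrary.
def triage(score: float, confidence: float,
           threshold: float = 0.6) -> str:
    if confidence < threshold:
        return "human_review"  # the judge is unsure; don't trust the number
    return "pass" if score >= 4.0 else "fail"

print(triage(4.2, 0.9))  # confident judge → "pass"
print(triage(4.2, 0.4))  # same score, uncertain judge → "human_review"
```

The same score produces different downstream actions depending on how sure the judge was, which is exactly the signal a single verbal score throws away.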
What You Now Understand That You Didn't Before
It's worth being explicit about the conceptual shift this lesson was designed to produce. If you came in thinking about LLM evaluation as "write a prompt that asks the model to score something," you're leaving with a fundamentally different mental model.
❌ Wrong thinking: "I'll ask the model to rate the response 1-5 and parse the number out of its reply."
✅ Correct thinking: "I'll define a rubric with explicit anchors, use criteria expansion to generate concrete sub-questions, enforce a schema so the output is typed and validated, and extract the score from a structured field — not from free text."
The shift isn't just about sophistication for its own sake. It's about reproducibility. An evaluation system built on the wrong thinking will give you different results on Tuesday than it gave on Monday, will break when the model is updated, and will require manual inspection to debug. An evaluation system built on the right thinking behaves like a software component: testable, versioned, and predictable.
🤔 Did you know? The G-Eval paper found that probability-weighted scoring correlated more strongly with human judgments than using the model's verbally stated score — even when the verbal score and the probability-weighted score disagreed. This is because the probability distribution captures the model's uncertainty in a way that a single token cannot.
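The mechanism fits in a few lines. Assuming you have the judge's log-probabilities for each candidate score token at the scoring position (the distribution below is invented for illustration, not real model output), the weighted score is just the expectation over that distribution:

```python
import math

# Hypothetical logprobs for the score tokens "1".."5" — in practice
# these come from the API's logprobs output at the scoring position.
score_logprobs = {"1": -4.2, "2": -3.1, "3": -1.6, "4": -0.4, "5": -2.0}

# Convert logprobs to probabilities and renormalize over the five tokens
probs = {s: math.exp(lp) for s, lp in score_logprobs.items()}
total = sum(probs.values())

# G-Eval-style weighted score: E[score] = sum_i p(s_i) * s_i
weighted = sum(int(s) * p / total for s, p in probs.items())
print(round(weighted, 2))  # → 3.81, not the modal "4"
```

Note that the most likely single token is "4", but the weighted score lands below it because the distribution leans toward "3" — that gap is the uncertainty information the paper found valuable.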
🧠 Mnemonic: Think R-E-S-E: Rubric → Expand → Schema → Extract. The four layers of the minimum viable judge, in order.
Critical Points to Carry Forward
⚠️ G-Eval is not a model — it's a method. You can implement G-Eval-style evaluation with any sufficiently capable LLM. The paper describes an architecture, not a specific system. Don't confuse "using G-Eval" with "using a particular model or API."
⚠️ Schema enforcement is not the same as asking for JSON in the prompt. Prompt-level format instructions are suggestions. API-level schema enforcement (via response_format, tool calling, or equivalent mechanisms) is a guarantee. These are categorically different in production reliability.
⚠️ Criteria expansion is task-specific. The sub-questions generated from your rubric should be different for a summarization task than for a code generation task. If you're reusing the same expanded criteria across different task types, you're not getting the benefit of the expansion step — you're just adding latency.
Practical Next Steps
Here are three concrete things you can do immediately with what you've learned:
1. Audit your existing evaluation prompts against the MVJ checklist. Take any judge prompt you're currently using and run it through the four-layer checklist. Does it have a rubric with score anchors? Does it request explicit reasoning before scoring? Is the output schema-enforced? Is score extraction robust? Most existing judge prompts fail at least two of these four checks.
```python
# A quick audit helper — check which layers your judge implements
def audit_judge(prompt: str, uses_schema_enforcement: bool,
                uses_structured_extraction: bool) -> dict:
    has_rubric = any(keyword in prompt.lower()
                     for keyword in ["rubric", "criteria", "score anchor",
                                     "dimension", "1 —", "5 —"])
    has_cot = any(keyword in prompt.lower()
                  for keyword in ["step by step", "sub-question",
                                  "before scoring", "think through"])
    return {
        "rubric": has_rubric,
        "chain_of_thought_expansion": has_cot,
        "schema_enforcement": uses_schema_enforcement,
        "structured_extraction": uses_structured_extraction,
        "score": sum([has_rubric, has_cot,
                      uses_schema_enforcement, uses_structured_extraction]),
        "ready_for_production": all([has_rubric, has_cot,
                                     uses_schema_enforcement,
                                     uses_structured_extraction]),
    }

# Example usage
print(audit_judge(
    prompt="Rate the quality of this response from 1-5.",
    uses_schema_enforcement=False,
    uses_structured_extraction=False,
))  # → score: 0, ready_for_production: False
```
This audit helper is intentionally simple — it's a starting point for reflection, not a production tool. The value is in the conversation it starts about which layers your current judges are missing.
2. Define one schema for your most important judge.
Pick the single evaluation dimension that matters most to your project — relevance, faithfulness, coherence, whatever it is — and define a Pydantic model for its output. Include reasoning_steps as a list of strings and score as a validated float. Ship that schema before adding criteria expansion. Measure the improvement in pipeline reliability.
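As a sketch of this step (the `FaithfulnessJudgment` name and fields are an example, not a prescribed schema), the schema alone already buys you validation: an out-of-range score is rejected at the boundary rather than flowing silently into your metrics:

```python
from pydantic import BaseModel, Field, ValidationError

class FaithfulnessJudgment(BaseModel):
    reasoning_steps: list[str] = Field(
        description="Explicit reasoning recorded before the score"
    )
    score: float = Field(ge=1.0, le=5.0)  # validated 1-5 scale

# A valid output parses into a typed object
good = FaithfulnessJudgment(
    reasoning_steps=["Claim 1 is grounded in the context."], score=4.0
)
print(good.score)  # 4.0

# An out-of-range score fails loudly here, not silently downstream
try:
    FaithfulnessJudgment(reasoning_steps=[], score=7.0)
except ValidationError:
    print("rejected: score out of range")
```

Shipping this one schema first, before adding criteria expansion, gives you a clean baseline for measuring the reliability gain.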
3. Run an A/B comparison between naive and G-Eval-style scoring. Take a sample of 50 inputs and run them through both a naive prompt judge and a G-Eval-style judge with criteria expansion. Have a human rate a subset. Compare correlation coefficients. This is the fastest way to internalize why the architecture matters — and to make the case to your team.
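The comparison in step 3 can be scored with a rank correlation. Here is a minimal pure-Python Spearman sketch (ties are broken by position rather than averaged, so for real data prefer `scipy.stats.spearmanr`), applied to invented example ratings:

```python
def spearman(xs: list[float], ys: list[float]) -> float:
    # Rank each list (1 = smallest), then compute Spearman's rho from
    # rank differences: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    def ranks(vals: list[float]) -> list[int]:
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Invented example: human ratings vs. two judges on five items
human = [2, 4, 3, 5, 1]
naive_judge = [3, 3, 4, 4, 2]   # loosely tracks the humans
geval_judge = [2, 4, 3, 5, 1]   # tracks them closely

print(spearman(human, geval_judge))  # 1.0 — perfect rank agreement
print(spearman(human, naive_judge))  # lower — weaker agreement
```

A consistently higher correlation for the G-Eval-style judge on your own data is the evidence that makes the architectural argument to your team.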
Where to Go Next: The Child Lessons
This lesson introduced both advances at a conceptual and practical level. The child lessons go deeper on each.
📚 G-Eval Architecture and Variants — This lesson dissects the original G-Eval paper in detail: exactly how criteria expansion is prompted, how probability-weighted scoring is computed across token logprobs, what variants have emerged since the paper, and how to calibrate a G-Eval judge against human annotations. If you want to understand why the architecture works at a mechanistic level and how to tune it, this is where to go.
📚 Structured Output for Judges — This lesson covers schema design as a discipline: how to model evaluation rubrics as typed schemas, how to handle provider-specific differences in schema enforcement (OpenAI's response_format, Anthropic's tool calling, open-source alternatives), how to version schemas as your evaluation criteria evolve, and how to compose multiple judge outputs into aggregate evaluation reports. If you want to build evaluation pipelines that hold up in production, this is where to go.
Neither child lesson assumes you've read the other. But if you've worked through this lesson, you have the context to get full value from both — and to understand how the concepts they teach fit together into a coherent evaluation architecture.
Final Summary Table
📋 Quick Reference Card: G-Eval vs. Structured Output
| Dimension | 🧠 G-Eval | 📋 Structured Output |
|---|---|---|
| 🎯 Core contribution | Reasoning architecture | Engineering interface pattern |
| 🔧 What it improves | Score quality and calibration | Pipeline reliability and composability |
| 📚 Origin | Academic paper (NLP evaluation research) | Software engineering discipline |
| 🔒 Key mechanism | Criteria expansion + probability weighting | Schema enforcement at API level |
| 🧠 Can be used independently | Yes | Yes |
| 🎯 Value when combined | Calibrated scores delivered through reliable interface | Same — they reinforce each other |
| ⚠️ Main failure mode | Generic criteria, wrong model tier | Prompt-only format instructions, brittle regex extraction |
| 📚 Goes deeper in | G-Eval Architecture and Variants | Structured Output for Judges |
You now have the conceptual foundation, the implementation pattern, the pitfall awareness, and the architectural vocabulary to build LLM judges that behave like real software components. The rest is practice — and the child lessons are where that practice deepens.