
G-Eval (2026): Architecture and Variants

The core idea of G-Eval remains: treat LLM outputs as distributions over structured judgments, not single sampled scores. Early implementations relied on token-level log probabilities over discrete score tokens to reduce variance. In modern systems, direct access to token probabilities is often limited or abstracted away. As a result, G-Eval-style methods have evolved toward:

  • Structured prompting + deterministic decoding to elicit stable, rubric-aligned scores
  • Self-consistency and multi-pass aggregation to approximate underlying uncertainty
  • Pairwise and ranking-based evaluation instead of absolute scalar scoring
  • Calibration layers using smaller or open-weight models where logprobs remain accessible

Related variants:

  • GPTScore: Uses conditional likelihood as a proxy for quality, still relevant in open-weight or research settings
  • FActScore / decomposition methods: Break evaluation into atomic claims or units, improving interpretability and robustness
  • Benchmark design (e.g., SummEval → modern eval suites): Shifts toward task-specific, distribution-aware, and human-aligned evaluation protocols

Key shift (2026): The field is moving from probability-as-score toward consensus-as-signal and structure-as-control. The original insight still holds conceptually, but practical implementations now rely less on raw token probabilities and more on controlled generation and aggregation strategies.

Why G-Eval? The Case for Structured LLM Judgment

Imagine you've just built a summarization pipeline and you want to know: is it any good? You write a quick prompt — "Rate this summary from 1 to 10" — and fire it at GPT-4. You get back a 7. You run it again. You get a 6. Once more: an 8. Three calls, three answers, no consensus. You're not measuring quality anymore; you're sampling noise. This is the problem that G-Eval was designed to solve, and if you've ever tried to use an LLM as an automated judge, you've almost certainly felt this frustration firsthand.

G-Eval reframes the entire act of LLM-based evaluation. Instead of treating a single model response as the score, it treats the model's output as a probability distribution over judgments — a fundamentally different and far more principled approach. To understand why that matters, we need to look carefully at what goes wrong when we ignore it.


The Fundamental Problem: Single-Sample Scoring Is Statistical Noise

When you ask an LLM to score a piece of text, you are — whether you realize it or not — drawing a single sample from a complex conditional distribution. The model doesn't have an opinion; it has a probability landscape shaped by its training, its context window, and the exact wording of your prompt. A single decode operation gives you one point on that landscape. It tells you very little about the shape of the distribution underneath.

High variance is the first consequence. The same text, evaluated with the same prompt, can receive meaningfully different scores across runs simply due to temperature-driven stochasticity. Even at low temperatures, subtle prompt formatting differences — a trailing space, a newline, a slightly different rubric phrasing — can shift scores by one or two points on a 1–10 scale.

Non-reproducibility is the second and more damaging consequence. In any serious evaluation context — comparing model versions, tracking regression, auditing output quality — you need scores that are stable enough to mean something. A metric with high variance is not just imprecise; it is unreliable as a signal. You can't tell whether a score of 7 reflects genuine quality or a lucky sample.

import openai

client = openai.OpenAI()

## Naive LLM-as-judge: the variance problem illustrated
def naive_score(summary: str, temperature: float = 0.7) -> int:
    """
    Single-sample scoring — the approach G-Eval was designed to replace.
    Run this multiple times and observe the variance.
    """
    prompt = f"""Rate the following summary on a scale of 1 to 10 for coherence.
    
 Summary: {summary}
 
 Score (just the number):"""
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=5
    )
    return int(response.choices[0].message.content.strip())

## Demonstrate variance: same input, multiple runs
example_summary = "The economy grew last quarter. Inflation remained stable."
scores = [naive_score(example_summary) for _ in range(5)]
print(f"Scores across 5 runs: {scores}")
print(f"Range: {max(scores) - min(scores)} points — this is your noise floor")
## Typical output: Scores across 5 runs: [7, 6, 8, 7, 6]
## Range: 2 points — meaningful variance for a 10-point scale

This code block isn't just illustrative — it's a diagnostic tool. If you run it on your own evaluation setup and observe a range of 2 or more points across five calls, you're working with a noisy judge. That noise will corrupt any downstream comparison you try to make.

⚠️ Common Mistake: Treating temperature=0 as a solution. Setting temperature to zero reduces but does not eliminate variance. Deterministic decoding still produces a single point estimate from the model's distribution. You've reduced sampling noise, but you haven't addressed the deeper issue: the model's most likely response to a poorly structured prompt may still be systematically biased or miscalibrated.


The G-Eval Insight: Scores Are Distributions, Not Point Estimates

🎯 Key Principle: The quality score for any piece of text is not a single number that exists in the world waiting to be discovered. It is a distribution of plausible judgments, shaped by evaluation criteria, scorer perspective, and inherent ambiguity in the text itself.

This is the conceptual breakthrough at the heart of G-Eval. The researchers behind the original paper — Liu et al., working with the NLG evaluation benchmark SummEval — observed that human annotators themselves don't perfectly agree on quality scores. Inter-annotator agreement on dimensions like coherence, consistency, fluency, and relevance is measurable but imperfect. A "true" quality score, in this framing, is the central tendency of a distribution of informed human judgments.

If human scores are distributed, then a good automated evaluator should approximate that distribution — not collapse it to a single number. G-Eval achieves this by:

  1. 📋 Using structured rubrics to constrain what the model is evaluating (reducing prompt ambiguity)
  2. 🔧 Accessing token-level log probabilities over discrete score tokens (in original implementations)
  3. 🧠 Computing a weighted expectation across possible score values rather than taking the argmax

The result is a continuous score with lower variance and better alignment with human judgments. In the original SummEval experiments, G-Eval substantially outperformed both traditional reference-based metrics (ROUGE, BERTScore) and naive LLM prompting in correlation with human ratings.

## Conceptual illustration of the G-Eval scoring mechanism
## (original implementation using log probabilities)

import numpy as np

def geval_expected_score(
    logprobs_by_token: dict[str, float],
    score_range: range = range(1, 6)
) -> float:
    """
    Compute the expected score as a weighted average over score token probabilities.
    
    This is the core G-Eval insight: instead of taking the highest-probability score,
    compute the expectation over the full distribution of score tokens.
    
    Args:
        logprobs_by_token: dict mapping score tokens ('1','2',...) to log probabilities
        score_range: the valid score values
    Returns:
        float: expected score, a continuous value capturing distributional uncertainty
    """
    # Convert log probabilities to probabilities
    probs = {}
    for score in score_range:
        token = str(score)
        if token in logprobs_by_token:
            probs[score] = np.exp(logprobs_by_token[token])
        else:
            probs[score] = 0.0
    
    # Normalize (in case not all score tokens are present)
    total = sum(probs.values())
    if total == 0:
        return float(np.mean(list(score_range)))
    
    normalized = {k: v / total for k, v in probs.items()}
    
    # Compute weighted expectation — this is the G-Eval score
    expected = sum(score * prob for score, prob in normalized.items())
    return expected

## Example: model assigns probability mass across score tokens
example_logprobs = {
    "1": -3.5,   # low probability
    "2": -2.1,   # moderate probability  
    "3": -0.8,   # high probability
    "4": -1.4,   # moderate probability
    "5": -3.2    # low probability
}

score = geval_expected_score(example_logprobs)
print(f"G-Eval expected score: {score:.3f}")
## Output: G-Eval expected score: 3.163
## Compare to argmax (mode): 3 — similar here, but diverges when distribution is bimodal

The critical line is the expectation computation. Rather than asking "what is the model's best guess?", G-Eval asks "what is the model's center of gravity across all plausible scores?" This is a meaningful difference when the model is genuinely uncertain — for example, when a summary is coherent but contains a factual inconsistency. A naive decode might round to 3 or 4; the weighted expectation might settle at 3.4, preserving the signal that quality is somewhere in between.
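
To see the divergence concretely, here is a hypothetical bimodal case (the log-probability values below are invented for illustration) run through the same function:

## Bimodal case: probability mass split between scores 1 and 5
bimodal_logprobs = {
    "1": -0.9,   # high probability
    "2": -3.0,
    "3": -4.0,
    "4": -3.0,
    "5": -0.8    # high probability
}

print(f"G-Eval expected score: {geval_expected_score(bimodal_logprobs):.3f}")
## Output: G-Eval expected score: 3.088
## Argmax would report 5 and hide the split entirely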

💡 Mental Model: Think of G-Eval's scoring like a poll, not a single vote. Asking one person how they feel gives you noise. Asking many people and averaging gives you signal. G-Eval uses the model's internal probability distribution as a stand-in for that population of voters.


From SummEval to a General Framework: The Human-Alignment Objective

SummEval was a pivotal benchmark in NLG evaluation, providing a large-scale dataset of news summaries annotated by both humans and automated metrics across four dimensions: coherence, consistency, fluency, and relevance. Its central finding was uncomfortable: most automated metrics correlated poorly with human judgment. ROUGE, despite its ubiquity, essentially measures surface-level n-gram overlap — a proxy that breaks down completely for abstractive summarization, where a fluent and accurate summary might share few words with the reference.

The researchers behind G-Eval used SummEval as their proving ground precisely because it offered ground truth in the form of expert human ratings. The design question became: can we construct an LLM-based evaluator that matches what informed humans actually care about?

The answer required three things:

  • 🎯 Explicit evaluation criteria: Human judges don't score vaguely — they apply specific, articulable standards. A rubric for coherence might specify: "Does the text flow logically from one sentence to the next? Are there contradictions? Is the structure clear?" Encoding these criteria in a prompt forces the model to reason about the same dimensions a human would.

  • 🔧 Chain-of-thought reasoning: Before assigning a score, G-Eval prompts the model to explain its reasoning. This serves two purposes: it improves calibration (models that articulate a reasoning chain tend to score more consistently) and it makes the evaluation interpretable.

  • 📚 Probabilistic aggregation: Rather than trusting a single sample, G-Eval extracts the distribution over score tokens to compute a stable expected value.

The combination of these three elements is what defines the G-Eval architecture. Each element addresses a specific failure mode of naive LLM judging, and together they produce a system whose outputs correlate more strongly with human ratings than any prior automated approach at the time of publication.

🤔 Did you know? In the original G-Eval paper, the method achieved Spearman correlations with human judgments exceeding 0.50 on the coherence dimension of SummEval — compared to near-zero or negative correlations for ROUGE-based metrics. The gap was not marginal; it was the difference between a useful signal and a misleading one.


The 2026 Evolution: From Probability-as-Score to Consensus-as-Signal

The original G-Eval architecture had a practical dependency: access to token-level log probabilities from the model's output layer. This was straightforward when evaluating with open-weight models or early API versions that exposed logprobs. By 2025–2026, the landscape had shifted considerably. Many production LLM APIs either don't expose log probabilities at all, limit access to top-k tokens, or return probabilities that are post-processed in ways that make them unreliable as quality proxies.

This constraint forced a pragmatic evolution in how G-Eval-style evaluation is actually implemented:

Original G-Eval (2023)          Modern G-Eval-style (2026)
─────────────────────           ───────────────────────────
Single prompt                   Structured multi-step prompt
     │                                    │
     ▼                                    ▼
Extract logprobs               Elicit score + reasoning
over score tokens                         │
     │                                    ▼
     ▼                          Multiple independent runs
Weighted expectation            (self-consistency passes)
= final score                             │
                                          ▼
                                 Aggregate by mean/median
                                 = final score

The key shift is from probability-as-score to consensus-as-signal. When you can't directly access the distribution over score tokens, you approximate it by sampling multiple independent evaluations and aggregating them. This is the same statistical insight — scores are distributions, not points — implemented through repeated sampling rather than direct probability extraction.

Self-consistency aggregation has emerged as the dominant practical pattern. Rather than one call to the judge, you make N calls (typically 5–10), collect the scores, and report the mean or median. The variance across runs becomes an explicit quality signal: high variance indicates that the text being evaluated sits in a genuinely ambiguous region of the quality space.

Pairwise and ranking-based evaluation has also gained traction as a complementary approach. Instead of asking "how good is this text on a 1–5 scale?", you ask "which of these two texts is better, and why?" Pairwise judgments tend to be more stable than absolute scalar scores because the comparison provides implicit context that anchors the model's evaluation. A text that might score anywhere from 3 to 5 in isolation often has a clear winner when compared directly to a specific alternative.

💡 Real-World Example: An engineering team evaluating a customer service response generator might run each candidate response through a G-Eval-style rubric five times, average the scores, and flag any response with a standard deviation above 1.0 for human review. The SD threshold acts as an automatic uncertainty detector — it catches the cases where the automated judge itself is unsure, which are exactly the cases where human oversight adds the most value.
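
A minimal sketch of that flagging logic, with the 1.0 standard-deviation threshold as a tunable parameter (the score lists are illustrative):

import statistics

def aggregate_and_flag(scores: list[int], sd_threshold: float = 1.0) -> dict:
    """Aggregate self-consistency scores; flag high-variance cases for human review."""
    sd = statistics.stdev(scores)
    return {
        "mean": statistics.mean(scores),
        "stdev": round(sd, 2),
        "needs_human_review": sd > sd_threshold,
    }

print(aggregate_and_flag([4, 4, 3, 5, 4]))  # tight distribution: automated score stands
print(aggregate_and_flag([2, 5, 3, 5, 1]))  # wide distribution: flagged for review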


Where G-Eval Fits: The Broader Evaluation Ecosystem

G-Eval does not exist in isolation. Understanding where it sits relative to sibling methods sharpens your intuition for when to use it and when to reach for something else.

📋 Quick Reference Card: LLM Evaluation Methods Compared

Method             | 🎯 Core Mechanism                                      | 📚 Best For                                                  | ⚠️ Limitation
🔧 G-Eval          | Rubric-guided scoring with distributional aggregation  | Holistic quality dimensions (coherence, relevance)          | Needs careful rubric design; can still be prompt-sensitive
📊 GPTScore        | Conditional log-likelihood of reference given input    | Reference-based quality; works well with open-weight models | Requires logprob access; sensitive to reference quality
🔬 FActScore       | Atomic claim decomposition + verification              | Factual accuracy in long-form generation                    | Computationally expensive; needs a retrieval component
🧮 ROUGE/BERTScore | N-gram overlap / semantic similarity to reference      | Quick sanity checks on extractive tasks                     | Poor correlation with human judgment on abstractive text

The relationship between these methods reflects a broader principle: no single evaluation approach is universally best. G-Eval excels at capturing holistic quality judgments that align with what humans care about in free-form text — but it can miss factual errors that a decomposition-based method like FActScore would catch. GPTScore remains valuable in research settings where open-weight models give you full probability access, enabling more rigorous statistical treatment. The methods are complementary, and production evaluation pipelines increasingly combine them.

🎯 Key Principle: Use G-Eval when you need a stable, human-aligned quality score for a holistic dimension. Use FActScore when factual accuracy is the primary concern. Use GPTScore when you have logprob access and want a reference-grounded signal. Use all three when the stakes are high enough to warrant the compute.

Wrong thinking: "I'll just use G-Eval for everything — it's the most sophisticated method."

Correct thinking: "I'll match the evaluation method to the specific failure mode I'm trying to detect. G-Eval for quality and coherence, FActScore for factual grounding, pairwise ranking when absolute scores feel unstable."


Why This Section Matters for the Rest of the Lesson

Everything that follows in this lesson builds on the conceptual foundation laid here. The G-Eval architecture (Section 2) is easier to understand when you know why each component exists — the rubric exists because vague prompts produce vague distributions; the chain-of-thought exists because reasoning before scoring improves calibration; the aggregation exists because distributions are more trustworthy than samples.

The modern variants (Section 3) make sense as pragmatic adaptations to a world where direct logprob access isn't always available — the core insight is preserved, but the mechanism changes. The implementation patterns (Section 4) will be immediately applicable if you keep the variance problem front of mind: every design decision in a G-Eval pipeline is ultimately an answer to the question how do we reduce noise and increase alignment with what humans actually care about?

🧠 Mnemonic: DRAC stands for Distribution over scores, Rubric to constrain the prompt, Aggregation to reduce variance, and Correlation with human judgment as the north star. If you remember DRAC, you remember the four pillars of G-Eval's design philosophy.

The shift from probability-as-score to consensus-as-signal isn't just a technical workaround for limited API access. It reflects a deeper maturation in how the field thinks about automated evaluation: not as a lookup of some ground-truth quality value, but as a process of constructing a reliable estimate from inherently noisy signals. That framing — evaluation as estimation under uncertainty — is what makes G-Eval more than a clever prompt. It's a principled methodology.

G-Eval Architecture: Chain-of-Thought Rubrics and Scoring Mechanics

Understanding why G-Eval was designed the way it was is the first step. Now we need to understand how it actually works — the mechanical details that transform a vague instruction like "rate this summary" into a principled, reproducible judgment. G-Eval's architecture is elegant in its simplicity: it is fundamentally a structured prompting strategy, combined with controlled decoding, that forces the model into the role of a consistent, rubric-following evaluator rather than a free-form responder.

This section dissects that architecture piece by piece.

The Three-Part Prompt Structure

The heart of G-Eval is its three-part prompt structure. Rather than asking a model to score an output in a single open-ended question, G-Eval decomposes the evaluation task into three distinct instructional layers that are assembled into a single prompt at inference time. Each layer serves a specific cognitive function.

┌─────────────────────────────────────────────────────────────┐
│                   G-EVAL PROMPT STRUCTURE                   │
├─────────────────────────────────────────────────────────────┤
│  LAYER 1: Task Description                                  │
│  ─────────────────────────────────────────────────────────  │
│  "You are evaluating a machine-generated summary of a       │
│   news article. Your task is to assess the COHERENCE of    │
│   the summary..."                                           │
│                                                             │
│  LAYER 2: Evaluation Criteria Definition                    │
│  ─────────────────────────────────────────────────────────  │
│  "Coherence (1–5): Does the summary form a unified,         │
│   logical whole? Consider sentence ordering, topic          │
│   consistency, and narrative flow..."                       │
│                                                             │
│  LAYER 3: Chain-of-Thought Scoring Instructions             │
│  ─────────────────────────────────────────────────────────  │
│  "Step 1: Read the source document carefully.               │
│   Step 2: Identify any coherence issues in the summary.     │
│   Step 3: Based on your analysis, assign a score..."        │
└─────────────────────────────────────────────────────────────┘

Layer 1: Task Description establishes the evaluative context. It tells the model what kind of output is being evaluated (a summary? a dialogue response? a code snippet?), what the source material is, and what role the model should adopt. This matters more than it might seem: without explicit role framing, large language models tend to drift toward generic helpfulness behaviors rather than strict evaluative judgment. By anchoring the model as an assessor with a specific task, you reduce the probability of it slipping into explanation mode or hedging excessively.

Layer 2: Evaluation Criteria Definition is where the rubric lives. This is the most important layer for reproducibility. Rather than leaving "coherence" or "fluency" as intuitive concepts the model must interpret on its own, you provide an operational definition: what the criterion means, what counts as evidence for high vs. low scores, and what the scoring scale represents at each level. This is sometimes called rubric decomposition — the act of breaking a holistic quality dimension into observable, checkable sub-components.

Layer 3: Chain-of-Thought Scoring Instructions is where G-Eval borrows heavily from the chain-of-thought prompting literature. Instead of asking the model to jump directly to a score, you instruct it to reason through the evaluation step by step before committing to a number. This serves two purposes: it forces deliberate analysis rather than reflex scoring, and it creates an intermediate reasoning trace that you can inspect for debugging or calibration purposes.

🎯 Key Principle: The three-part structure is not just organizational tidiness — each layer reduces a different source of variance. Layer 1 reduces role ambiguity. Layer 2 reduces criterion ambiguity. Layer 3 reduces scoring-without-reasoning shortcuts.

Why Rubric Decomposition Reduces Ambiguity

One of the most common failure modes in naive LLM evaluation is criterion conflation — the model collapses multiple quality dimensions into a single gestalt impression and scores based on that impression. A summary might be factually accurate but poorly organized; a response might be coherent but subtly wrong. Without explicit rubric decomposition, the evaluator model tends to average these signals in unpredictable ways.

Consider the difference between these two prompts for evaluating coherence:

Wrong thinking: "Rate the coherence of this summary from 1 to 5."

Correct thinking: "Coherence measures whether the summary reads as a unified whole. Assess: (a) whether sentences follow a logical order, (b) whether the topic remains consistent throughout, (c) whether transitions between ideas are smooth, and (d) whether the summary avoids contradicting itself. A score of 5 means all four properties hold strongly; a score of 1 means the summary is disjointed and hard to follow."

The second version gives the model a decomposed checklist of what coherence actually means operationally. Research on human raters consistently shows that rubric specificity is the strongest predictor of inter-rater agreement — and the same principle applies to LLM raters. When the model knows exactly what to look for, its judgments become more stable across runs, more aligned with human intuitions, and less sensitive to superficial features of the text being evaluated.

💡 Real-World Example: In the original G-Eval paper (Liu et al., 2023), the authors found that adding detailed criteria definitions improved correlation with human judgments by a statistically significant margin over prompts that simply named the dimension. The rubric wasn't decorative — it was load-bearing.

Deterministic and Near-Deterministic Decoding

Even with a perfectly constructed prompt, LLM inference has an inherent stochastic element: temperature sampling introduces randomness at each token generation step. For evaluation tasks, this is problematic. If you run the same prompt twice and get different scores, your evaluation pipeline is unreliable — you're sampling from a distribution, not measuring a stable signal.

G-Eval addresses this with deterministic or near-deterministic decoding settings. In practice, this means:

  • 🔧 Temperature = 0 (or close to 0): Setting temperature to zero makes the model select the highest-probability token at each step, effectively making generation deterministic. Most API providers support this.
  • 🔧 Fixed random seed: Where APIs expose seed parameters, fixing the seed ensures reproducibility across identical inputs even when temperature is slightly above zero.
  • 🔧 Top-p / top-k restrictions: Constraining the sampling pool further reduces variance for borderline tokens.
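
In API terms, these settings translate into a call like the following (a sketch assuming an OpenAI-style client; eval_prompt is a placeholder for an assembled evaluation prompt, and seed support varies by provider):

import openai

client = openai.OpenAI()
eval_prompt = "..."  # placeholder: an assembled G-Eval prompt

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": eval_prompt}],
    temperature=0,  # greedy decoding: highest-probability token at each step
    seed=42,        # best-effort reproducibility where the provider supports it
    top_p=1.0,      # no nucleus truncation on top of temperature 0
)
print(response.choices[0].message.content)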

⚠️ Common Mistake: Using default API settings (often temperature=0.7 or higher) for evaluation runs. This introduces scoring variance that looks like meaningful signal but is actually noise. Always explicitly set temperature for evaluation workloads.

The original G-Eval paper went one step further: rather than sampling a single score token, it used token-level log probabilities over the discrete score tokens ("1", "2", "3", "4", "5") and computed a weighted average as the final score. This is conceptually elegant — it treats the score as an expectation over the model's probability distribution rather than a single argmax sample. In practice:

Score = Σ (score_value_i × P(score_token_i | context))

Example:
P("1") = 0.02
P("2") = 0.08  
P("3") = 0.25
P("4") = 0.48
P("5") = 0.17

Weighted score = (1×0.02)+(2×0.08)+(3×0.25)+(4×0.48)+(5×0.17) = 3.70

This weighted expectation approach reduces variance significantly compared to sampling a single score. In practice, however, direct access to token log probabilities is increasingly restricted in modern API-based deployments. The practical implication is that the field has shifted toward alternative variance-reduction strategies — multi-pass aggregation, self-consistency, pairwise ranking — which we cover in Section 3.

🤔 Did you know? The difference between using argmax scoring (pick the single most likely score) and probability-weighted expectation can be surprisingly large for borderline cases. A response that scores 3 under argmax might score 3.8 under weighted expectation — a meaningful difference when you're comparing systems.

Structured Output Schemas: Constraining the Score Space

Beyond decoding settings, G-Eval relies on structured output schemas to ensure the model produces a parseable, well-formed score rather than a discursive response. Free-form responses like "I would give this a 3.5 because the summary is mostly coherent but has a few awkward transitions" are hard to parse reliably at scale and introduce inconsistency in the evaluation pipeline.

Modern LLM APIs offer several mechanisms for enforcing structured outputs:

  • JSON mode / structured outputs: Force the model to return a JSON object with a predefined schema, e.g., {"score": 4, "reasoning": "..."}
  • Function calling / tool use: Define a "submit_evaluation" function that accepts only integer scores in a valid range
  • Constrained decoding: Open-weight models accessed through frameworks like vLLM or Outlines can apply grammar-based constraints that literally prevent the model from generating tokens outside the allowed set

The simplest and most portable approach for API-based systems is to explicitly instruct the model to output only the integer score on the final line, and then parse that final line programmatically. This is less robust than schema enforcement but works across virtually all providers.
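
For providers that do support schema enforcement, a JSON-mode version looks like this (a sketch assuming the OpenAI-style response_format parameter; prompt is a placeholder for the assembled evaluation prompt):

import json
import openai

client = openai.OpenAI()
prompt = "..."  # placeholder: an assembled G-Eval prompt

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": 'Return a JSON object of the form {"score": <integer 1-5>, "reasoning": "<string>"}.'},
        {"role": "user", "content": prompt},
    ],
    response_format={"type": "json_object"},  # force syntactically valid JSON
    temperature=0,
)

parsed = json.loads(response.choices[0].message.content)
score, reasoning = parsed["score"], parsed["reasoning"]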

Code Example: Building a G-Eval Prompt Template for Coherence

Let's put this all together with a concrete implementation. The following code constructs a complete G-Eval prompt for coherence evaluation, assembling all three layers into a single prompt string ready for API submission.

from dataclasses import dataclass

@dataclass
class GEvalPromptConfig:
    """Configuration for a G-Eval evaluation prompt."""
    task_description: str
    criterion_name: str
    criterion_definition: str
    score_range: tuple[int, int]
    scoring_steps: list[str]


## Define the coherence evaluation configuration
COHERENCE_CONFIG = GEvalPromptConfig(
    task_description=(
        "You are an expert evaluator assessing the quality of machine-generated "
        "summaries of news articles. Your task is to evaluate a single quality "
        "dimension: COHERENCE."
    ),
    criterion_name="Coherence",
    criterion_definition=(
        "Coherence measures the collective quality of all sentences in the summary. "
        "A coherent summary should (a) present sentences in a logical order, "
        "(b) maintain consistent focus on a central topic, "
        "(c) use smooth transitions between ideas, and "
        "(d) avoid contradictions or non-sequiturs. "
        "Score 1: The summary is largely incoherent or disjointed. "
        "Score 2: Some logical order exists but significant gaps or jumps occur. "
        "Score 3: Mostly coherent with occasional transitions that feel abrupt. "
        "Score 4: Well-organized with minor coherence issues. "
        "Score 5: Fully coherent, logical, and easy to follow throughout."
    ),
    score_range=(1, 5),
    scoring_steps=[
        "Read the source document carefully to understand the original content.",
        "Read the summary and identify any sentences that feel out of order or disconnected.",
        "Check whether the summary maintains a consistent topic focus throughout.",
        "Note any abrupt transitions or logical jumps between sentences.",
        "Based on your analysis, assign a coherence score from 1 to 5.",
    ]
)


def build_geval_prompt(
    config: GEvalPromptConfig,
    source_document: str,
    generated_summary: str,
) -> str:
    """Assemble a complete G-Eval prompt from config and input texts."""
    
    # Layer 1: Task description
    prompt_parts = [
        config.task_description,
        "",  # blank line for readability
    ]
    
    # Layer 2: Evaluation criteria definition
    prompt_parts += [
        f"## Evaluation Criterion: {config.criterion_name}",
        config.criterion_definition,
        "",
    ]
    
    # Input texts
    prompt_parts += [
        "## Source Document",
        source_document.strip(),
        "",
        "## Generated Summary",
        generated_summary.strip(),
        "",
    ]
    
    # Layer 3: Chain-of-thought scoring instructions
    prompt_parts.append("## Evaluation Steps")
    for i, step in enumerate(config.scoring_steps, 1):
        prompt_parts.append(f"Step {i}: {step}")
    
    prompt_parts += [
        "",
        # Force structured output: reasoning first, then score on last line
        "Provide your step-by-step reasoning, then on the FINAL LINE output "
        f"ONLY a single integer between {config.score_range[0]} and "
        f"{config.score_range[1]}. Do not include any other text on the final line.",
    ]
    
    return "\n".join(prompt_parts)


## Example usage
source = """Scientists at MIT announced a breakthrough in quantum computing 
Wednesday, demonstrating a 1000-qubit processor that maintains coherence 
for record-breaking durations at room temperature."""

summary_to_evaluate = """MIT researchers showed a new quantum chip. 
Room temperature is unusual for quantum systems. The chip has many qubits. 
Coherence time was improved significantly by the team."""

prompt = build_geval_prompt(COHERENCE_CONFIG, source, summary_to_evaluate)
print(prompt)

This code produces a fully formed G-Eval prompt ready for submission. Notice how the GEvalPromptConfig dataclass separates the evaluation design from the prompt assembly logic — this is a useful pattern because it lets you define multiple evaluation criteria (coherence, fluency, relevance, consistency) as separate configs and reuse the same assembly function for all of them.

Now let's add the API call layer with deterministic settings:

import openai
import re
from typing import Optional

client = openai.OpenAI()


def run_geval(
    prompt: str,
    model: str = "gpt-4o",
    temperature: float = 0.0,  # deterministic decoding
    seed: Optional[int] = 42,  # fixed seed for reproducibility
) -> dict:
    """
    Execute a G-Eval prompt and parse the integer score from the response.
    
    Returns a dict with 'score' (int), 'reasoning' (str), and 'raw_response' (str).
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        seed=seed,
        max_tokens=512,
    )
    
    raw_text = response.choices[0].message.content.strip()
    
    # Parse the score from the final line
    lines = [line.strip() for line in raw_text.split("\n") if line.strip()]
    final_line = lines[-1] if lines else ""
    
    # Extract the integer score; raise if malformed
    score_match = re.fullmatch(r"[1-5]", final_line)
    if not score_match:
        raise ValueError(
            f"Score parsing failed. Final line was: {repr(final_line)}\n"
            f"Full response: {raw_text}"
        )
    
    score = int(final_line)
    reasoning = "\n".join(lines[:-1])  # everything except the score line
    
    return {
        "score": score,
        "reasoning": reasoning,
        "raw_response": raw_text,
    }


## Evaluate the summary
result = run_geval(prompt)
print(f"Coherence Score: {result['score']}/5")
print(f"Reasoning:\n{result['reasoning']}")

💡 Pro Tip: The seed parameter in the OpenAI API does not guarantee perfect determinism across different server instances or model versions, but it dramatically improves reproducibility within a session. Always log both the seed value and the model version string when running evaluation experiments so results can be traced and replicated.

Finally, let's add a multi-run averaging wrapper — a simple form of variance reduction when log probabilities aren't available:

import statistics


def run_geval_with_averaging(
    prompt: str,
    n_runs: int = 3,
    model: str = "gpt-4o",
    temperature: float = 0.3,  # slight variance for diversity
) -> dict:
    """
    Run G-Eval multiple times and return the average score.
    
    Using a small non-zero temperature + averaging approximates the
    probability-weighted scoring that logprobs would provide directly.
    """
    scores = []
    reasonings = []
    
    for run_idx in range(n_runs):
        result = run_geval(
            prompt,
            model=model,
            temperature=temperature,
            seed=42 + run_idx,  # different seed per run
        )
        scores.append(result["score"])
        reasonings.append(result["reasoning"])
    
    return {
        "mean_score": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "scores": scores,
        "reasonings": reasonings,
    }


## Example: three-run averaged evaluation
averaged_result = run_geval_with_averaging(prompt, n_runs=3)
print(f"Mean Coherence Score: {averaged_result['mean_score']:.2f}")
print(f"Score Std Dev: {averaged_result['stdev']:.2f}")
print(f"Individual Scores: {averaged_result['scores']}")

The run_geval_with_averaging function introduces a small but important design choice: using a slightly non-zero temperature (0.3) combined with different seeds per run, then averaging. This is the modern approximation of probability-weighted scoring when logprobs are unavailable — we're sampling from the model's implicit distribution and computing an empirical mean. Section 3 will examine this and related strategies in much more depth.

Putting the Architecture Together

The full G-Eval architecture, as implemented above, can be visualized as a pipeline:

  EVALUATION DESIGN PHASE
  ─────────────────────────────────────────────────────────
  Define criterion → Write rubric levels → Specify CoT steps
           │
           ▼
  PROMPT ASSEMBLY PHASE  
  ─────────────────────────────────────────────────────────
  Task description + Criteria definition + CoT instructions
  + Source text + Generated output
           │
           ▼
  INFERENCE PHASE
  ─────────────────────────────────────────────────────────
  LLM call with:
  • temperature=0 (or low + averaged)
  • fixed seed
  • max_tokens budget
           │
           ▼
  OUTPUT PARSING PHASE
  ─────────────────────────────────────────────────────────
  Extract integer score from structured response
  Validate range [1–5]
  Optionally extract reasoning trace
           │
           ▼
  AGGREGATION PHASE (optional)
  ─────────────────────────────────────────────────────────
  Average over N runs OR weight by log probabilities
  → Final score: float in [1.0, 5.0]

Each phase has a clear failure point, and understanding the architecture in these discrete stages is crucial for debugging. When scores are unexpectedly high or low, the problem usually lives in one of three places: criterion definition (ambiguous rubric), inference settings (wrong temperature), or output parsing (malformed responses silently defaulting to a fallback value).

📋 Quick Reference Card:

🔧 Component            | 📚 Purpose                       | ⚠️ Common Failure
🎯 Task description     | Sets evaluator role and context  | Too generic, model drifts from judging
📋 Criterion definition | Defines what to measure          | Under-specified, model conflates dimensions
🧠 CoT steps            | Forces deliberate reasoning      | Missing steps, model skips analysis
🔒 Decoding settings    | Controls output variance         | Default temperature introduces noise
🔧 Output schema        | Ensures parseable score          | Free-form responses break parsing
📚 Aggregation          | Reduces residual variance        | Single-sample scores are noisy

The architecture we've built here is intentionally modular. You can swap out criteria configs for different quality dimensions, adjust the number of averaging runs based on your cost budget, and layer in log-probability weighting if your infrastructure provides access to it. That modularity is, in many ways, the real engineering achievement of G-Eval — not any single clever trick, but the discipline of decomposing evaluation into stages that can each be reasoned about, tested, and improved independently.

💡 Mental Model: Think of G-Eval as a structured interview protocol, not a free-form conversation. Just as a well-designed interview rubric produces more reliable assessments than an open discussion, a well-designed G-Eval prompt produces more reliable scores than asking an LLM to "just judge" an output. The architecture is the discipline.

Modern G-Eval Variants: From Absolute Scores to Aggregation and Ranking

The original G-Eval paper demonstrated something elegant: if you could read the log-probabilities of score tokens like "1", "2", "3", "4", and "5" from a language model, you could construct a weighted expected score that was far more stable than a single sampled number. That insight remains valid today. What has changed is the plumbing. Most production-facing LLM APIs have quietly closed the door on direct token-probability access, either abstracting it away entirely or restricting it to narrow use cases. This section is about what principled practitioners have built in response — and why the resulting variants often work just as well, or better, than the original formulation.

The shift is not merely a workaround. It reflects a deeper maturation in how the field thinks about LLM-based evaluation: moving from probability-as-score (a single model's confidence over discrete tokens) toward consensus-as-signal (the agreement structure across multiple passes, models, or judgment formats). Each variant we cover here embodies that philosophy in a slightly different way.


Self-Consistency Aggregation: Approximating Distributional Uncertainty

The cleanest conceptual substitute for token-probability weighting is self-consistency aggregation — running the same evaluation prompt multiple times with non-zero temperature, then combining the resulting scores to estimate the underlying distribution.

Here is the intuition. When a model assigns token probabilities over {1, 2, 3, 4, 5}, it is telling you that the output is not a single answer but a weighted mixture. Self-consistency sampling approximates that mixture by drawing from it repeatedly. Run the scorer ten times. If eight runs return 4, one returns 3, and one returns 5, your aggregate score of 4.0 captures roughly the same signal as a probability-weighted expectation would — without requiring logprob access.

Single-pass G-Eval (classic):          Multi-pass G-Eval (modern):

Prompt ──► LLM ──► logprobs           Prompt ──► LLM ──► score_1
              {1: 0.05,                       ──► LLM ──► score_2
               2: 0.10,                       ──► LLM ──► score_3
               3: 0.20,          ≈            ──► LLM ──► score_4
               4: 0.50,                       ──► LLM ──► score_5
               5: 0.15}                              │
                  │                           Aggregate (mean/CI)
             Weighted sum
             E[score] = 3.60                  E[score] ≈ 3.60

The advantage beyond accessibility is that multi-pass aggregation also gives you a confidence interval — something the single probability-weighted score cannot provide on its own. If your ten passes return {3, 3, 3, 3, 3, 3, 3, 3, 3, 3}, you have high confidence in the score 3. If they return {1, 2, 4, 5, 3, 4, 2, 5, 3, 4}, the score is genuinely ambiguous, and the wide confidence interval tells your downstream system to treat this evaluation with suspicion.

💡 Mental Model: Think of each scoring pass as a human annotator completing a rubric independently. The mean gives you the consensus estimate; the standard deviation tells you how much disagreement exists. A tight distribution signals a clear-cut case; a wide one flags a genuinely hard judgment.

⚠️ Common Mistake: Running multi-pass aggregation at temperature 0 defeats the purpose entirely. At temperature 0, the model is deterministic — every pass returns the same score, and your "confidence interval" collapses to a single point. You need temperature ≥ 0.3 (often 0.5–0.7) to get meaningful variance. The tradeoff is that higher temperature introduces more noise, so you typically want 5–15 passes rather than 2–3.

Here is a practical implementation:

import openai
import numpy as np
from typing import Optional

def multi_pass_geval(
    system_prompt: str,
    user_prompt: str,
    n_passes: int = 10,
    temperature: float = 0.5,
    score_range: tuple = (1, 5),
    client: Optional[openai.OpenAI] = None,
) -> dict:
    """
    Run multi-pass G-Eval scoring and return aggregate statistics.
    
    Returns a dict with keys: mean, std, ci_low, ci_high, raw_scores, valid_passes
    """
    if client is None:
        client = openai.OpenAI()
    
    raw_scores = []
    min_score, max_score = score_range

    for _ in range(n_passes):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            temperature=temperature,
            max_tokens=10,  # We only need a short numeric response
        )
        
        raw_text = response.choices[0].message.content.strip()
        
        # Parse the score — handle both "4" and "Score: 4" formats
        try:
            score = float(''.join(c for c in raw_text if c.isdigit() or c == '.'))
            if min_score <= score <= max_score:
                raw_scores.append(score)
        except ValueError:
            pass  # Skip unparseable responses

    if not raw_scores:
        raise ValueError("No valid scores returned across all passes.")

    scores_arr = np.array(raw_scores)
    mean = float(np.mean(scores_arr))
    std = float(np.std(scores_arr, ddof=1)) if len(scores_arr) > 1 else 0.0  # sample std
    
    # 95% confidence interval using normal approximation
    margin = 1.96 * std / np.sqrt(len(scores_arr))

    return {
        "mean": round(mean, 3),
        "std": round(std, 3),
        "ci_low": round(max(min_score, mean - margin), 3),
        "ci_high": round(min(max_score, mean + margin), 3),
        "raw_scores": raw_scores,
        "valid_passes": len(raw_scores),
    }

This function wraps the repeated API calls, handles parse failures gracefully, and returns both the point estimate and the confidence interval. In production, you would add retry logic, cost tracking, and caching for deterministic base cases.

💡 Pro Tip: If cost is a concern, use a cheaper model for the multi-pass runs (e.g., gpt-4o-mini) and reserve the stronger model for a single reference pass. The variance estimate from the cheaper model is usually sufficient to flag uncertain cases.


Pairwise and Ranking-Based G-Eval

Absolute scalar scores carry a hidden burden: they require calibration. When you ask a model to rate a summary on a 1–5 coherence scale, you are implicitly assuming the model has a stable, consistent interpretation of what "3" means across different summaries, different contexts, and different prompting sessions. That assumption is fragile. Research consistently shows that LLM judges exhibit positional bias, verbosity bias, and anchoring effects that distort absolute scores in ways that are hard to detect without ground truth.

Pairwise G-Eval sidesteps this problem by replacing the absolute score with a relative preference judgment. Instead of asking "rate this summary from 1 to 5", you ask "which of these two summaries is more coherent, A or B?". The comparative framing is cognitively simpler for the model, anchors the judgment in a concrete contrast, and eliminates the need for a shared scale.

Absolute G-Eval:                     Pairwise G-Eval:

Summary A ──► Score: 3.8             Summary A ─┐
                                                  ├──► Preferred: A
Summary B ──► Score: 3.6             Summary B ─┘

"Is 0.2 a meaningful difference?"    "A is directly better than B"
(Calibration required)               (No scale needed)

From pairwise preferences you can construct a ranking over a candidate set using algorithms like TrueSkill, Bradley-Terry, or simple win-rate aggregation. This is particularly powerful in model comparison tasks — if you are evaluating five different summarization systems, pairwise G-Eval gives you a tournament bracket that is far more interpretable than five separate score distributions.

🎯 Key Principle: Use absolute scoring when you need a threshold ("flag outputs below 3/5") and pairwise ranking when you need ordering ("which system is best?"). The two modes answer fundamentally different questions.

⚠️ Common Mistake: Pairwise evaluation scales quadratically with the number of candidates. Comparing 10 systems exhaustively requires 45 pairs; 20 systems requires 190. In practice, use Swiss-system tournament scheduling or adaptive sampling (only compare candidates that are close in current estimated rank) to keep costs manageable.
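
The quadratic growth is easy to verify with itertools:

from itertools import combinations

for n in (5, 10, 20):
    systems = [f"system_{i}" for i in range(n)]
    print(f"{n} systems -> {len(list(combinations(systems, 2)))} pairwise comparisons")
## 5 systems -> 10 pairwise comparisons
## 10 systems -> 45 pairwise comparisons
## 20 systems -> 190 pairwise comparisons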

A practical pairwise scorer looks like this:

from enum import Enum

class Preference(Enum):
    A = "A"
    B = "B"
    TIE = "TIE"

def pairwise_geval(
    criterion: str,
    context: str,
    output_a: str,
    output_b: str,
    client,
    model: str = "gpt-4o",
) -> Preference:
    """
    Ask the model to choose between two outputs on a given criterion.
    Returns a Preference enum value.
    """
    system_prompt = (
        "You are an expert evaluator. You will be given a context and two candidate "
        "outputs, labeled A and B. Your task is to decide which output better satisfies "
        f"the following criterion: {criterion}\n\n"
        "Respond with ONLY one of: A, B, or TIE. No explanation."
    )
    
    user_prompt = (
        f"Context:\n{context}\n\n"
        f"Output A:\n{output_a}\n\n"
        f"Output B:\n{output_b}\n\n"
        "Which output better satisfies the criterion? Respond with A, B, or TIE."
    )
    
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0,  # Deterministic for preference judgments
        max_tokens=5,
    )
    
    verdict = response.choices[0].message.content.strip().upper()
    
    if "A" in verdict and "B" not in verdict:
        return Preference.A
    elif "B" in verdict and "A" not in verdict:
        return Preference.B
    else:
        return Preference.TIE


def compute_win_rates(results: list[tuple[str, str, Preference]]) -> dict[str, float]:
    """
    Given a list of (system_a_name, system_b_name, Preference) tuples,
    compute win rates for each system (ties count as 0.5 wins each).
    """
    wins = {}
    totals = {}
    
    for a_name, b_name, pref in results:
        for name in (a_name, b_name):
            wins.setdefault(name, 0.0)
            totals.setdefault(name, 0)
        
        totals[a_name] += 1
        totals[b_name] += 1
        
        if pref == Preference.A:
            wins[a_name] += 1.0
        elif pref == Preference.B:
            wins[b_name] += 1.0
        else:  # TIE
            wins[a_name] += 0.5
            wins[b_name] += 0.5
    
    return {
        name: round(wins[name] / totals[name], 3)
        for name in wins
    }

Notice that the pairwise scorer uses temperature=0. Unlike multi-pass aggregation, pairwise judgments are structurally simpler — the model is making a binary (or ternary) choice rather than placing a point on a continuous scale. Deterministic decoding is appropriate here; if you want variance estimates, run each pair multiple times and track flip rate.

🤔 Did you know? Position bias in pairwise evaluation is real and measurable. Models tend to favor whichever output appears first ("A") in a non-trivial fraction of cases. A robust pairwise pipeline always evaluates each pair in both orders (A vs B, then B vs A) and resolves contradictions as ties.
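
A position-debiased wrapper around pairwise_geval might look like this (a sketch; the helper name is illustrative):

def debiased_pairwise(
    criterion: str,
    context: str,
    output_a: str,
    output_b: str,
    client,
) -> Preference:
    """Evaluate the pair in both orders; positional contradictions resolve to TIE."""
    forward = pairwise_geval(criterion, context, output_a, output_b, client)
    backward = pairwise_geval(criterion, context, output_b, output_a, client)

    # In the reversed order, a "B" verdict means the original output A won
    backward_in_original_labels = {
        Preference.A: Preference.B,
        Preference.B: Preference.A,
        Preference.TIE: Preference.TIE,
    }[backward]

    if forward == backward_in_original_labels:
        return forward  # consistent verdict across both orders
    return Preference.TIE  # order-dependent verdict: treat as a tie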


GPTScore: Conditional Likelihood as a Quality Proxy

GPTScore is a related but architecturally distinct approach that predates the G-Eval framework and remains relevant in specific deployment contexts. Rather than prompting a model to explicitly output a score, GPTScore uses the model's conditional generation likelihood as a proxy for quality. The intuition: a high-quality output should receive high probability under a well-calibrated language model conditioned on the appropriate context and instruction.

Formally, given a generation hypothesis h and a context c with a quality-indicating instruction d, GPTScore computes:

GPTScore(h | d, c) = (1/|T|) * Σ_{t ∈ T} log P(h_t | h_<t, d, c)

where T is the set of token positions in h. Dividing the summed log-probabilities by |T| gives the mean log-probability of the hypothesis tokens under the model; a higher mean log-probability means the model finds the output more plausible given the instruction.

The critical practical constraint is that GPTScore requires direct access to token log-probabilities — which is exactly what closed APIs increasingly restrict. This is why GPTScore is most relevant today in two specific contexts:

  • 🔧 Open-weight models (LLaMA 3, Mistral, Qwen) where you have full logit access via HuggingFace or vLLM
  • 🔧 Research and benchmark contexts where you control the inference stack end-to-end

When you do have logprob access, GPTScore has a meaningful advantage: it produces a continuous, unbounded score without requiring the model to follow an explicit output format. There is no parsing step, no risk of the model outputting "I think this summary is quite good" instead of a number. The score is read directly from the inference engine.

💡 Real-World Example: Suppose you are building a quality filter for a RAG pipeline using a self-hosted Mistral 7B instance. You want to flag generated answers that are unlikely given the retrieved context. GPTScore gives you a lightweight, format-free quality signal that can run on every inference request without an additional API call.
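
When you control the inference stack, the computation is short. Here is a minimal sketch using HuggingFace transformers (shown with gpt2 so it runs anywhere; substitute any open-weight causal LM such as a Mistral checkpoint):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def gptscore(context: str, hypothesis: str) -> float:
    """Mean log-probability of the hypothesis tokens, conditioned on the context."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    hyp_ids = tokenizer(hypothesis, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, hyp_ids], dim=1)

    with torch.no_grad():
        logits = model(input_ids).logits

    # Logits at position i predict token i+1, so hypothesis tokens
    # (positions C .. C+H-1) are predicted by logits at C-1 .. C+H-2
    hyp_start = ctx_ids.shape[1]
    log_probs = torch.log_softmax(logits[0, hyp_start - 1:-1], dim=-1)
    token_lp = log_probs.gather(1, hyp_ids[0].unsqueeze(1)).squeeze(1)
    return token_lp.mean().item()

instruction = "Summarize faithfully: "  # the quality-indicating instruction d
print(gptscore(instruction + "The cat sat on the mat.", "A cat sat on a mat."))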


Calibration Layers: Bridging Open and Closed Models

The final variant addresses a structural asymmetry in the current LLM landscape: the most capable models (which produce the most reliable judgments) are usually the ones with restricted logprob access, while the models with open logprob access are smaller and potentially less calibrated.

A calibration layer is a post-processing step that uses a smaller, logprob-accessible model to re-scale or validate the scores produced by a larger model. The architecture looks like this:

┌─────────────────────────────────────────────────────┐
│               Calibration Layer Pipeline             │
│                                                      │
│  Input ──► Large Closed Model ──► Raw Score (0–5)    │
│               (GPT-4o, Claude)         │             │
│                                        ▼             │
│              Small Open Model ──► Logprob Signal     │
│           (Mistral-7B, Llama-3-8B)     │             │
│                                        ▼             │
│                              Calibrated Score        │
│                           (regression or percentile) │
└─────────────────────────────────────────────────────┘

In practice, calibration layers work in two main modes:

Mode 1 — Regression calibration: Collect a labeled dataset of (large-model score, human score) pairs. Fit a lightweight regression (isotonic regression works well) that maps raw model scores to human-aligned scores. Apply this mapping at inference time.
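
Mode 1 can be prototyped in a few lines with scikit-learn's IsotonicRegression. The paired scores below are placeholder values purely for illustration:

import numpy as np
from sklearn.isotonic import IsotonicRegression

## Hypothetical (large-model score, human score) calibration pairs
raw_judge_scores = np.array([1.2, 2.0, 2.5, 3.1, 3.6, 4.0, 4.4, 4.9])
human_scores = np.array([1.0, 1.5, 2.5, 2.8, 3.5, 4.2, 4.5, 5.0])

## Fit a monotone mapping from raw judge scores to human-aligned scores
calibrator = IsotonicRegression(y_min=1.0, y_max=5.0, out_of_bounds="clip")
calibrator.fit(raw_judge_scores, human_scores)

print(calibrator.predict([2.7, 3.4, 4.6]))  # calibrated versions of new raw scores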

Mode 2 — Confidence gating: Use the small model's logprob to estimate whether the large model's score is plausible. If the small model assigns very low probability to the large model's verdict, flag the case for human review or a second large-model pass.

📋 Quick Reference Card: G-Eval Variants Comparison

🎯 Variant               | 🔧 Access Required           | 📊 Output Type         | 💰 Cost Profile
Classic G-Eval           | Logprobs                     | Weighted score         | Low (1 pass)
Multi-Pass Aggregation   | Completions only             | Score + CI             | Medium (N passes)
Pairwise G-Eval          | Completions only             | Preference/rank        | Medium-High
GPTScore                 | Logprobs (required)          | Continuous likelihood  | Low
Calibration Layer        | Completions + open logprobs  | Re-scaled score        | Low overhead

🎯 Key Principle: No single variant dominates. The right choice depends on three factors: what API access you have, whether you need absolute scores or relative rankings, and how much inference budget you can spend per evaluation.


Putting It Together: A Multi-Pass Scorer with Confidence Estimation

The code below integrates the multi-pass approach with a simple confidence-gating mechanism. If the confidence interval is too wide (indicating genuine scoring ambiguity), the function can optionally escalate to a pairwise comparison against a reference output.

import openai
import numpy as np
from dataclasses import dataclass
from typing import Optional

@dataclass
class GEvalResult:
    mean_score: float
    std: float
    ci_low: float
    ci_high: float
    ci_width: float
    raw_scores: list[float]
    valid_passes: int
    flagged_uncertain: bool


COHERENCE_SYSTEM_PROMPT = """
You are an expert evaluator assessing the coherence of a text summary.
Coherence measures whether the summary reads as a logically organized,
consistent whole — not just a collection of related facts.

Evaluation steps:
1. Read the source document and the summary.
2. Check whether the summary maintains a clear logical flow.
3. Check whether claims in the summary are internally consistent.
4. Assign a score from 1 to 5 using these anchors:
   1 = Incoherent, contradictory, or randomly ordered
   2 = Mostly disjointed with some coherent passages
   3 = Partially coherent but with notable structural issues
   4 = Mostly coherent with minor issues
   5 = Fully coherent, logically organized throughout

Respond with ONLY the integer score (1, 2, 3, 4, or 5).
"""


def evaluate_coherence(
    source_document: str,
    summary: str,
    n_passes: int = 8,
    temperature: float = 0.5,
    uncertainty_threshold: float = 1.5,  # Flag if CI width exceeds this
    client: Optional[openai.OpenAI] = None,
) -> GEvalResult:
    """
    Evaluate summary coherence using multi-pass G-Eval.
    
    Args:
        source_document: The original document being summarized.
        summary: The candidate summary to evaluate.
        n_passes: Number of scoring passes to run.
        temperature: Sampling temperature (must be > 0 for variance).
        uncertainty_threshold: CI width above which to flag as uncertain.
        client: OpenAI client instance.
    
    Returns:
        GEvalResult with score statistics and uncertainty flag.
    """
    if client is None:
        client = openai.OpenAI()
    
    user_prompt = (
        f"Source Document:\n{source_document}\n\n"
        f"Summary to Evaluate:\n{summary}\n\n"
        "Score (1-5):"
    )
    
    raw_scores = []
    
    for pass_idx in range(n_passes):
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",  # Cost-efficient for repeated passes
                messages=[
                    {"role": "system", "content": COHERENCE_SYSTEM_PROMPT},
                    {"role": "user", "content": user_prompt},
                ],
                temperature=temperature,
                max_tokens=5,
            )
            
            raw_text = response.choices[0].message.content.strip()
            
            # Extract integer score from response
            digits = [c for c in raw_text if c.isdigit()]
            if digits:
                score = int(digits[0])  # Take first digit
                if 1 <= score <= 5:
                    raw_scores.append(float(score))
                    
        except Exception as e:
            # Log and continue — don't let one failed pass break the batch
            print(f"Pass {pass_idx} failed: {e}")
            continue
    
    if len(raw_scores) < 3:
        raise ValueError(
            f"Too few valid scores ({len(raw_scores)}) to compute reliable statistics."
        )
    
    arr = np.array(raw_scores)
    mean = float(np.mean(arr))
    std = float(np.std(arr, ddof=1))  # Sample std (ddof=1) for the CI estimate
    
    # 95% confidence interval (normal approximation; a t critical value would be
    # slightly wider at small n, but this suffices as a routing signal)
    margin = 1.96 * std / np.sqrt(len(arr))
    ci_low = max(1.0, mean - margin)
    ci_high = min(5.0, mean + margin)
    ci_width = ci_high - ci_low
    
    return GEvalResult(
        mean_score=round(mean, 3),
        std=round(std, 3),
        ci_low=round(ci_low, 3),
        ci_high=round(ci_high, 3),
        ci_width=round(ci_width, 3),
        raw_scores=raw_scores,
        valid_passes=len(raw_scores),
        flagged_uncertain=ci_width > uncertainty_threshold,
    )


# Example usage
if __name__ == "__main__":
    doc = "Researchers at MIT have developed a new battery chemistry..."
    summary_a = "MIT researchers created a better battery."
    
    result = evaluate_coherence(doc, summary_a, n_passes=8)
    
    print(f"Mean Score: {result.mean_score}")
    print(f"95% CI: [{result.ci_low}, {result.ci_high}] (width: {result.ci_width})")
    print(f"Uncertain: {result.flagged_uncertain}")
    print(f"Raw scores: {result.raw_scores}")

This implementation uses gpt-4o-mini for the repeated passes, keeping cost manageable. The flagged_uncertain field is the practical payoff: downstream systems can route flagged evaluations to a human reviewer or to a second-opinion pass with a stronger model, while confident evaluations flow through automatically.

💡 Pro Tip: Track your flagged_uncertain rate across a dataset. If more than 15–20% of evaluations are flagged, your rubric or prompt is likely ambiguous — not the outputs being evaluated. Revisit your evaluation criteria before investing in more compute.
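
As a rough sketch of that tracking step, assuming you have accumulated GEvalResult objects from a dataset sweep:

def uncertain_rate(results: list[GEvalResult]) -> float:
    """Fraction of evaluations flagged as uncertain across a dataset."""
    return sum(r.flagged_uncertain for r in results) / len(results)

# If uncertain_rate(...) exceeds ~0.15, suspect the rubric before the outputs.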

🧠 Mnemonic: PARC. Pairwise for ranking, Aggregation for uncertainty, Regression for calibration, Conditional likelihood for open models. Four tools, four contexts.


The variants covered in this section collectively represent the field's response to a practical constraint — and in the process, they have enriched the G-Eval toolkit. Self-consistency aggregation gives us uncertainty quantification. Pairwise ranking gives us bias resistance. GPTScore gives us a format-free signal for open-weight deployments. Calibration layers give us a bridge between model capability and logprob accessibility. In the next section, we will move from these conceptual building blocks into the full implementation patterns and API integration details that turn these ideas into production-ready evaluation pipelines.

Implementing G-Eval in Practice: Patterns and Integration

Understanding G-Eval's architecture is one thing; building a system that reliably produces stable, interpretable scores in production is another. This section bridges that gap. We'll move from theory to code, covering the decisions that matter most when you sit down to implement a G-Eval pipeline: how to slice your task into well-defined evaluation dimensions, how to write prompts that consistently elicit structured scores, how to parse and validate those outputs defensively, and how to wire everything together into a reusable evaluator class that feeds into reporting infrastructure.

Designing Evaluation Dimensions

The first and most consequential decision in any G-Eval implementation is how to decompose quality into evaluation dimensions. A dimension is a single, independently assessable axis of quality — coherence, relevance, faithfulness, and fluency being the canonical four from the original SummEval benchmark work (SummEval itself labels the faithfulness dimension "consistency"), though real tasks often require custom additions.

The key design principle is orthogonality: each dimension should capture something that the others cannot. When dimensions overlap, judges — human or LLM — conflate them, and you lose the diagnostic value of having separate scores. Coherence asks whether the text hangs together logically. Relevance asks whether it addresses what was asked. Faithfulness asks whether claims are grounded in source material. Fluency asks whether the language is grammatically natural. These four can all fail independently: a response can be fluent but unfaithful, or relevant but incoherent.

🎯 Key Principle: Each evaluation dimension should be answerable without consulting the other dimensions. If scoring dimension A requires you to think about dimension B, they are not truly independent.

For task-specific deployments, you'll often need to add or replace dimensions. A customer support evaluator might add tone appropriateness and resolution completeness. A code-generation evaluator might replace fluency with syntactic correctness and add test coverage alignment. The right set of dimensions is the one that, taken together, captures all the ways a response could fail for your users.

Score anchors — explicit definitions of what each integer score means — are equally important. Without them, the model's interpretation of "a 3 out of 5" drifts across prompts and runs. A well-anchored rubric looks like this:

1 - The response is almost entirely incoherent; ideas are disconnected and the reader cannot follow the logic.
2 - The response has significant coherence problems; some ideas connect but the overall structure is confusing.
3 - The response is mostly coherent with occasional lapses in logical flow.
4 - The response is coherent and well-structured with minor issues.
5 - The response is fully coherent; every sentence follows logically from the previous one.

This level of specificity dramatically reduces variance compared to a prompt that simply says "rate coherence from 1 to 5."

Prompt Engineering Patterns That Elicit Structured Scores

Once your dimensions and anchors are defined, the challenge is writing a prompt that reliably produces a score in the expected format. Three patterns have proven most effective in practice: step labels, explicit score anchors embedded in the prompt, and output format constraints.

Step labels instruct the model to reason before scoring, creating a chain-of-thought trace that both improves score quality and gives you interpretable evidence for the score. The pattern looks like this:

Step 1: Read the source document carefully.
Step 2: Read the generated summary.
Step 3: Identify any claims in the summary that are not supported by the source.
Step 4: Based on your analysis, assign a faithfulness score from 1 to 5 using the rubric below.
Step 5: Output your score as a JSON object with the key "faithfulness_score".

The numbered steps do two things: they force the model to do work before scoring (reducing anchoring bias toward the middle of the scale), and they create a predictable structure you can parse even when the final score format varies slightly.

Output format constraints deserve their own attention. In 2026, most production G-Eval implementations use JSON-mode or structured output APIs to enforce format at the decoding level rather than relying on the model to comply voluntarily. Where that is unavailable, a combination of a format instruction at the end of the prompt and a regex-based fallback parser is the standard defensive strategy.

💡 Pro Tip: Place the output format instruction at both the beginning and end of the prompt. Models with long context windows sometimes "forget" early instructions by the time they generate the final score. Repeating the constraint at both positions costs almost nothing and meaningfully reduces malformed outputs.

Here is a complete prompt template that combines all three patterns:

FAITHFULNESS_PROMPT = """
You are an expert evaluator assessing the faithfulness of a summary to its source document.
Your output MUST be a JSON object with exactly this structure: {{"reasoning": "...", "score": <integer 1-5>}}

RUBRIC:
1 - The summary contains multiple claims not supported by the source.
2 - The summary contains at least one clear factual hallucination.
3 - The summary is mostly faithful with minor unsupported inferences.
4 - The summary is faithful; all claims are supported, with trivial paraphrasing.
5 - The summary is perfectly faithful; every claim is directly traceable to the source.

SOURCE DOCUMENT:
{source}

SUMMARY TO EVALUATE:
{summary}

Step 1: Identify all factual claims in the summary.
Step 2: For each claim, determine whether it is supported by the source document.
Step 3: Assign a score from 1 to 5 using the rubric above.
Step 4: Output ONLY a JSON object with keys "reasoning" and "score".
"""

The double braces around the JSON structure example are escapes for Python's str.format method, which renders them as literal braces; the {source} and {summary} placeholders are filled at runtime via .format().
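
For illustration, filling the template at runtime (the document and summary values here are hypothetical):

prompt = FAITHFULNESS_PROMPT.format(
    source="Researchers at MIT developed a new battery chemistry...",
    summary="MIT researchers created a better battery.",
)
# After .format(), the {{...}} example renders as literal braces:
# {"reasoning": "...", "score": <integer 1-5>}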

Parsing and Validating Structured Outputs

Even with excellent prompt engineering and structured output APIs, malformed responses happen. A production G-Eval system must handle them gracefully rather than crashing or silently propagating garbage scores.

Malformed responses fall into several categories: the model returns prose instead of JSON, it returns JSON with the wrong keys, it returns a score outside the valid range, or it returns a non-integer where an integer is expected. Each requires a different handling strategy.

The recommended approach is a validation cascade:

Raw LLM Output
      │
      ▼
┌─────────────────────┐
│  JSON parse attempt │──── success ──▶ schema validation
└─────────────────────┘
         │ fail
         ▼
┌─────────────────────┐
│  Regex extraction   │──── success ──▶ range validation
│  (score from text)  │
└─────────────────────┘
         │ fail
         ▼
┌─────────────────────┐
│  Retry with         │──── success ──▶ normal path
│  stricter prompt    │
└─────────────────────┘
         │ fail
         ▼
  Log failure + return
  null / sentinel value

The regex fallback is important because models sometimes wrap valid JSON in markdown code fences or add a preamble sentence before the JSON object. A pattern like r'\{[^{}]*"score"\s*:\s*(\d+)[^{}]*\}' will extract the score from most near-miss outputs.
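
A quick sanity check of that pattern on a fenced near-miss output (the raw string is a contrived example):

import re

SCORE_PATTERN = re.compile(r'\{[^{}]*"score"\s*:\s*(\d+)[^{}]*\}')

# A typical near-miss: valid JSON wrapped in a markdown code fence with a preamble
raw = 'Here is my evaluation:\n```json\n{"reasoning": "Well grounded.", "score": 4}\n```'

match = SCORE_PATTERN.search(raw)
print(int(match.group(1)) if match else None)  # prints: 4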

⚠️ Common Mistake: Clamping out-of-range scores silently. If a model returns a score of 7 on a 1–5 scale, that is a signal that something is wrong with your prompt or the model's instruction-following — not just a number to clip to 5. Log the anomaly, investigate the pattern, and treat it as a prompt quality signal rather than a data cleaning problem.

A Reusable G-Eval Evaluator Class

The following implementation brings together everything covered so far into a class that accepts a rubric definition and returns structured results with metadata. It is designed for real use: it handles retries, logs raw outputs for auditability, supports both JSON-mode APIs and text-mode with fallback parsing, and computes basic consistency metrics when multiple passes are requested.

import json
import re
import logging
from dataclasses import dataclass, field
from typing import Optional
from openai import OpenAI

logger = logging.getLogger(__name__)


@dataclass
class RubricDimension:
    """Defines a single evaluation dimension with its prompt template and scoring range."""
    name: str
    prompt_template: str  # Must contain {source} and {target} placeholders
    min_score: int = 1
    max_score: int = 5
    weight: float = 1.0   # For weighted aggregation across dimensions


@dataclass
class EvalResult:
    """Structured result from a single dimension evaluation."""
    dimension: str
    score: Optional[float]
    reasoning: Optional[str]
    raw_output: str
    passes: list[float] = field(default_factory=list)  # Scores from multi-pass runs
    is_valid: bool = True
    failure_reason: Optional[str] = None

    @property
    def consistency(self) -> Optional[float]:
        """Standard deviation of multi-pass scores; lower is more consistent."""
        if len(self.passes) < 2:
            return None
        mean = sum(self.passes) / len(self.passes)
        variance = sum((p - mean) ** 2 for p in self.passes) / len(self.passes)
        return variance ** 0.5


class GEvalEvaluator:
    """
    Reusable G-Eval pipeline supporting multi-pass aggregation,
    structured output parsing, and per-dimension rubrics.
    """

    def __init__(
        self,
        dimensions: list[RubricDimension],
        model: str = "gpt-4o",
        num_passes: int = 3,         # Multi-pass for self-consistency
        temperature: float = 0.3,    # Low but nonzero for variance sampling
        use_json_mode: bool = True,
    ):
        self.dimensions = dimensions
        self.model = model
        self.num_passes = num_passes
        self.temperature = temperature
        self.use_json_mode = use_json_mode
        self.client = OpenAI()

    def _call_llm(self, prompt: str) -> str:
        """Single LLM call with optional JSON mode enforcement."""
        kwargs = {
            "model": self.model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": self.temperature,
        }
        if self.use_json_mode:
            kwargs["response_format"] = {"type": "json_object"}
        response = self.client.chat.completions.create(**kwargs)
        return response.choices[0].message.content

    def _parse_output(self, raw: str, dim: RubricDimension) -> tuple[Optional[float], Optional[str]]:
        """Validation cascade: JSON parse → regex fallback → None."""
        # Attempt 1: Clean JSON parse
        try:
            data = json.loads(raw)
            score = float(data.get("score", data.get("Score", None)))
            reasoning = data.get("reasoning", data.get("Reasoning", ""))
            if dim.min_score <= score <= dim.max_score:
                return score, reasoning
            else:
                logger.warning(
                    f"Out-of-range score {score} for dimension '{dim.name}'. "
                    f"Expected [{dim.min_score}, {dim.max_score}]."
                )
                return None, reasoning  # Don't silently clamp — return None
        except (json.JSONDecodeError, TypeError, ValueError):
            pass

        # Attempt 2: Regex extraction from prose
        match = re.search(r'"?score"?\s*[:\-]?\s*(\d+(?:\.\d+)?)', raw, re.IGNORECASE)
        if match:
            score = float(match.group(1))
            if dim.min_score <= score <= dim.max_score:
                logger.info(f"Regex fallback succeeded for dimension '{dim.name}'.")
                return score, None

        return None, None

    def evaluate_dimension(
        self, dim: RubricDimension, source: str, target: str
    ) -> EvalResult:
        """Run multi-pass evaluation for a single dimension."""
        prompt = dim.prompt_template.format(source=source, target=target)
        scores = []
        last_reasoning = None
        last_raw = ""

        for pass_num in range(self.num_passes):
            try:
                raw = self._call_llm(prompt)
                last_raw = raw
                score, reasoning = self._parse_output(raw, dim)
                if score is not None:
                    scores.append(score)
                    if reasoning:
                        last_reasoning = reasoning
            except Exception as e:
                logger.error(f"LLM call failed on pass {pass_num} for '{dim.name}': {e}")

        if not scores:
            return EvalResult(
                dimension=dim.name,
                score=None,
                reasoning=None,
                raw_output=last_raw,
                is_valid=False,
                failure_reason="All passes failed to produce a valid score.",
            )

        aggregated_score = sum(scores) / len(scores)  # Mean aggregation
        return EvalResult(
            dimension=dim.name,
            score=round(aggregated_score, 3),
            reasoning=last_reasoning,
            raw_output=last_raw,
            passes=scores,
        )

    def evaluate(
        self, source: str, target: str
    ) -> dict[str, EvalResult]:
        """Evaluate all dimensions and return a results dict keyed by dimension name."""
        results = {}
        for dim in self.dimensions:
            results[dim.name] = self.evaluate_dimension(dim, source, target)
        return results

    def aggregate_score(self, results: dict[str, EvalResult]) -> Optional[float]:
        """Compute a weighted composite score across all valid dimensions."""
        dim_map = {d.name: d for d in self.dimensions}
        total_weight = 0.0
        weighted_sum = 0.0
        for name, result in results.items():
            if result.is_valid and result.score is not None:
                w = dim_map[name].weight
                weighted_sum += result.score * w
                total_weight += w
        return round(weighted_sum / total_weight, 3) if total_weight > 0 else None

This class encodes several important design decisions. The num_passes parameter enables self-consistency aggregation, which as discussed in the previous section has replaced direct logprob access as the primary variance-reduction mechanism. The consistency property on EvalResult surfaces pass-level standard deviation, giving you a signal when a particular (source, target) pair produces unstable scores — often a sign that the input is ambiguous or the rubric is under-specified for that case.

💡 Real-World Example: In a production summarization evaluation pipeline, you might set num_passes=5 for dimensions like faithfulness where hallucination detection is critical, but num_passes=1 for fluency where scores tend to be stable. This lets you allocate API budget where variance is actually a problem.

Connecting G-Eval Outputs to Downstream Reporting

A G-Eval pipeline that runs scores but doesn't surface them systematically is only half the system. The other half is the reporting layer: logging scores with enough context to be useful, tracking inter-run consistency, and making results visible in dashboards.

Structured logging is the foundation. Every evaluation run should emit a log record that includes the dimension name, the aggregated score, the per-pass scores, the model used, the prompt template hash (so you know which version of the rubric produced the score), and a content hash of the inputs (so you can deduplicate and trace back to specific examples). Avoid logging raw source and target text at high volume; log a content hash and keep the full text in a separate store.

Here is a lightweight logging pattern that integrates with the evaluator class above:

import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def log_eval_run(
    results: dict,
    source: str,
    target: str,
    composite_score: Optional[float],
    run_metadata: Optional[dict] = None,
) -> dict:
    """
    Produces a structured log record for one complete evaluation run.
    Suitable for writing to a JSON log file, a database, or a
    monitoring platform like W&B, MLflow, or a custom dashboard.
    """
    def _hash(text: str) -> str:
        return hashlib.sha256(text.encode()).hexdigest()[:12]

    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source_hash": _hash(source),
        "target_hash": _hash(target),
        "composite_score": composite_score,
        "dimensions": {},
        "metadata": run_metadata or {},
    }

    for name, result in results.items():
        record["dimensions"][name] = {
            "score": result.score,
            "passes": result.passes,
            "consistency_std": result.consistency,
            "is_valid": result.is_valid,
            "failure_reason": result.failure_reason,
        }

    return record


# Usage: write to a JSONL log file for downstream analysis
def append_to_log(record: dict, log_path: str = "eval_log.jsonl") -> None:
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")

The JSONL format (one JSON object per line) is intentional: it is easy to stream, append, and process with tools like jq, pandas, or any log aggregation service. Each record is self-contained, which means you can analyze score trends over time, compare rubric versions, or investigate specific failure cases without reconstructing context.

Inter-run consistency is a metric you should track at the dataset level, not just the example level. After running your evaluator over a fixed test set multiple times (with different random seeds or on different days), compute the Pearson correlation or mean absolute difference between run scores. If that correlation drops below ~0.85 for your core dimensions, something has changed — either the model's behavior, your API configuration, or the prompt template — and the change needs investigation before you can trust trend comparisons.
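
A minimal sketch of that dataset-level check, with two illustrative score vectors aligned by example:

import numpy as np

# Scores for the same fixed test set from two separate runs, aligned by example
run_a = np.array([4.0, 3.5, 2.0, 4.5, 3.0, 5.0, 2.5])
run_b = np.array([4.0, 3.0, 2.5, 4.5, 3.5, 4.5, 2.0])

pearson_r = np.corrcoef(run_a, run_b)[0, 1]
mean_abs_diff = float(np.mean(np.abs(run_a - run_b)))

print(f"Inter-run Pearson r: {pearson_r:.3f}")
print(f"Mean absolute difference: {mean_abs_diff:.3f}")

if pearson_r < 0.85:
    print("Below threshold: investigate model, API config, or prompt changes.")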

📋 Quick Reference Card: Reporting Metrics to Track

📊 Metric                   | 🔧 How to Compute                             | 🎯 Healthy Range
📈 Per-dimension mean score | Mean across all examples in a run             | Baseline-relative
🔄 Pass-level std dev       | Std dev of multi-pass scores per example      | < 0.5 on 1–5 scale
🔁 Inter-run correlation    | Pearson r between two full run score vectors  | > 0.85
❌ Parse failure rate       | % of examples with no valid score returned    | < 2%
⚠️ Out-of-range rate        | % of raw scores outside valid range           | < 1%
🏆 Composite score trend    | Rolling mean of weighted composite over time  | Monitor for drift

Dashboard integration depends on your infrastructure, but the log record schema above maps directly onto what most experiment tracking tools expect. In MLflow, each evaluation run becomes an experiment run with metrics logged per dimension. In Weights & Biases, the record["dimensions"] dict maps onto a custom summary table. For teams without dedicated ML infrastructure, a simple pandas aggregation over the JSONL file, rendered in a Jupyter notebook or Streamlit app, is often sufficient.

💡 Pro Tip: Track the parse failure rate as a first-class metric in your dashboard. A sudden spike in parse failures is often the first signal that a model API change or prompt regression has occurred — it shows up before score trends move, giving you earlier warning.

🤔 Did you know? The choice of mean vs. median for multi-pass aggregation matters more than it might seem. Mean aggregation is more sensitive to the occasional outlier pass where the model returns an extreme score. For dimensions where you observe heavy-tailed pass distributions — common with faithfulness evaluation on adversarial inputs — median aggregation can produce more robust composite scores.
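
A two-line illustration of the difference, with a made-up heavy-tailed set of passes:

import numpy as np

passes = [4, 4, 4, 4, 1]     # one outlier pass on an adversarial input
print(np.mean(passes))       # 3.4: pulled down by the single outlier
print(np.median(passes))     # 4.0: robust to it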

Pulling It Together: A Full Evaluation Pipeline

With the evaluator class, the logging function, and the rubric design principles in hand, the full pipeline looks like this:

Rubric Definition (RubricDimension objects)
         │
         ▼
   GEvalEvaluator
  (model, passes, temp)
         │
    ┌────┴────┐
    │         │
    ▼         ▼
Dimension 1  Dimension 2  ...  Dimension N
(multi-pass) (multi-pass)      (multi-pass)
    │         │
    └────┬────┘
         │
    EvalResult dict
         │
    ┌────┴───────────────┐
    │                    │
    ▼                    ▼
Aggregate Score    Log Record (JSONL)
                         │
                    ┌────┴────┐
                    │         │
                    ▼         ▼
               Dashboard   Trend
               (live)      Analysis

The rubric definition is the only component that changes between tasks. The evaluator, logging, and reporting infrastructure are reusable across every evaluation scenario in your system — summarization quality, RAG faithfulness, code generation correctness, conversational coherence. That reusability is the practical payoff of the G-Eval abstraction: once you have invested in the infrastructure, the marginal cost of adding a new evaluation dimension is writing a single well-anchored prompt template.

🎯 Key Principle: Build the infrastructure once around the abstraction, not around the task. A G-Eval evaluator that requires rewriting core parsing logic for each new dimension is a sign that the dimension's prompt template is doing too much work that should be in the framework.

With a working pipeline in place, the natural next question is: where does this break? The following section maps the most common failure modes — from sycophancy bias to dimension collapse — and shows you how to detect them before they corrupt your evaluation results.

Common Pitfalls: Where G-Eval Breaks Down

G-Eval's architecture is elegant in theory: give an LLM a rubric, ask it to reason through evaluation criteria, and collect stable scores. In practice, however, the gap between a well-designed G-Eval pipeline and a broken one is often invisible until it causes real damage—wasted engineering cycles, misguided model comparisons, or production systems quietly degrading in quality. This section catalogs the most common failure modes, explains why each one occurs at a mechanistic level, and gives you concrete tools to detect and correct them before they corrupt your evaluation signal.

🎯 Key Principle: G-Eval failures are rarely catastrophic crashes. They are silent, systematic distortions that make bad outputs look acceptable and obscure meaningful differences between system versions. Detection requires active instrumentation, not just inspection.


Pitfall 1: Rubric Underspecification

Rubric underspecification occurs when the evaluation criteria in your prompt are so vague that the judge model cannot distinguish between the dimension you care about and a proxy it finds easier to evaluate. The most common proxy is surface fluency—grammatical correctness, sentence variety, and stylistic polish—because these features are salient, consistent across outputs, and strongly represented in LLM training data.

Consider a rubric for evaluating a customer support response on the dimension of "helpfulness":

Rate how helpful this response is on a scale of 1 to 5.

This prompt gives the judge no anchor for what "helpful" means in context. Does helpfulness mean the response answers the user's literal question? That it provides actionable next steps? That it escalates appropriately when the issue is out of scope? In the absence of specificity, the judge model will pattern-match to the most statistically probable interpretation of "helpful" from its training distribution—which, for a polished-sounding response, often means it scores high regardless of whether it actually solves the user's problem.

The failure signature: You run your rubric on a set of responses and notice that outputs with confident, well-formed prose consistently score 4–5, even when they contain factual errors or miss the user's actual question. Meanwhile, a terse but accurate response scores 2–3. The rubric is measuring fluency, not helpfulness.

How to fix it: Rubrics must specify observable behaviors tied to the evaluation dimension. Each criterion should describe something the judge can verify by reading the output, not an impression it needs to form.

# Underspecified rubric — avoid this
vague_rubric = """
Evaluate the helpfulness of the following customer support response.
Score from 1 to 5, where 1 is not helpful and 5 is very helpful.
"""

# Specified rubric — prefer this
specific_rubric = """
Evaluate the following customer support response on the dimension of HELPFULNESS.
Score from 1 to 5 using these criteria:

5 - Directly addresses the user's stated issue, provides at least one actionable next step,
    and does not introduce irrelevant information.
4 - Addresses the user's issue but omits one actionable step or includes minor tangents.
3 - Partially addresses the issue; the user would need to ask at least one follow-up question.
2 - Acknowledges the issue but provides no actionable guidance.
1 - Fails to address the issue or provides incorrect information.

Do NOT score based on tone, grammar, or length unless they directly affect comprehension.
"""

The second rubric eliminates the fluency shortcut by explicitly deprioritizing tone and grammar. It also gives the judge a behavioral anchor at every point on the scale, reducing the interpretive freedom that leads to proxy scoring.

💡 Pro Tip: After drafting a rubric, test it on a deliberately broken output—one that sounds polished but is factually wrong or off-topic. If your judge scores it above 3, your rubric is probably underspecified.


Pitfall 2: Anchor Collapse

Anchor collapse (sometimes called centrality bias or severity avoidance) is the tendency for LLM judges to cluster scores around the middle of a scale, avoiding extreme values even when the evaluated outputs are clearly at the high or low end of quality. This is not random noise—it is a systematic compression of the score distribution that destroys the discriminative power of your evaluation.

The mechanism is rooted in how instruction-tuned models are trained. RLHF and preference-based fine-tuning reward balanced, hedged responses. A model trained to avoid overconfidence will naturally soften extreme judgments. When asked to rate something a 1 or a 5, it experiences something analogous to a confidence penalty and retreats toward safer middle ground.

Ideal score distribution (meaningful variance):

  Count
  |
30|        ██
25|     ██ ██ ██
20|  ██ ██ ██ ██ ██
15|  ██ ██ ██ ██ ██
10|  ██ ██ ██ ██ ██
 5|  ██ ██ ██ ██ ██
  +--+--+--+--+--+--
     1  2  3  4  5   Score

Anchor-collapsed distribution (centrality bias):

  Count
  |
50|        ██
40|     ██ ██ ██
30|     ██ ██ ██
20|  ██ ██ ██ ██ ██
10|  ██ ██ ██ ██ ██
 5|  ██ ██ ██ ██ ██
  +--+--+--+--+--+--
     1  2  3  4  5   Score
        ^peak at 3

How to detect it: Compute the score distribution across a calibration set of at least 50–100 outputs spanning a known quality range. If more than 60% of scores fall within one point of the scale midpoint, you likely have anchor collapse. A quick Python diagnostic:

import numpy as np
from collections import Counter

def detect_anchor_collapse(scores: list[int], scale_min=1, scale_max=5, threshold=0.60):
    """
    Flags potential anchor collapse if too many scores cluster around the midpoint.
    scores: list of integer scores from your G-Eval pipeline
    threshold: fraction of scores within ±1 of midpoint that triggers a warning
    """
    midpoint = (scale_min + scale_max) / 2
    # For a 1-5 scale, midpoint is 3.0; window is [2, 4]
    window_low = midpoint - 1
    window_high = midpoint + 1

    scores_arr = np.array(scores)
    in_window = np.sum((scores_arr >= window_low) & (scores_arr <= window_high))
    fraction = in_window / len(scores_arr)

    distribution = Counter(scores)
    print(f"Score distribution: {dict(sorted(distribution.items()))}")
    print(f"Mean: {np.mean(scores_arr):.2f}, Std Dev: {np.std(scores_arr):.2f}")
    print(f"Fraction within ±1 of midpoint ({window_low}–{window_high}): {fraction:.1%}")

    if fraction > threshold:
        print(f"⚠️  Anchor collapse detected: {fraction:.1%} of scores in central window (threshold: {threshold:.0%})")
        print("Consider: explicit anchoring examples, pairwise ranking, or rescaling prompts.")
    else:
        print("✅ Score distribution appears healthy.")

# Example usage
sample_scores = [3, 3, 2, 3, 4, 3, 3, 3, 2, 3, 3, 4, 3, 3, 2]
detect_anchor_collapse(sample_scores)

How to fix it: Three strategies work well in combination:

🔧 Explicit behavioral anchors — As shown in the rubric example above, define what a 1 and a 5 look like in concrete terms. The judge needs permission to assign extreme scores.

🔧 Pairwise ranking — Instead of asking for absolute scores, ask the judge which of two outputs is better. Pairwise judgments are far more stable because they eliminate the scale-positioning problem entirely. Pairwise results can be converted to ratings using Elo or Bradley-Terry models, as shown in the sketch after this list.

🔧 Rescaling prompts — Prompt the judge to "use the full range of the scale" and remind it that a 1 should be assigned to genuinely poor outputs and a 5 to genuinely excellent ones. This helps, but is less reliable than structural changes.
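
To make the pairwise-to-rating conversion concrete, here is a minimal Elo update loop; the system names and verdict tuples are hypothetical, and a Bradley-Terry fit is the more principled choice for offline analysis.

def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update for a single pairwise judge verdict."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    actual_a = 1.0 if a_wins else 0.0
    delta = k * (actual_a - expected_a)
    return r_a + delta, r_b - delta

# Fold a stream of judge verdicts into ratings (hypothetical verdicts)
ratings = {"system_a": 1000.0, "system_b": 1000.0}
verdicts = [("system_a", "system_b", True),
            ("system_a", "system_b", True),
            ("system_a", "system_b", False)]

for a, b, a_won in verdicts:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], a_won)

print(ratings)  # system_a drifts above 1000, system_b below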


Pitfall 3: Prompt Sensitivity Drift

Prompt sensitivity drift is the phenomenon where small, semantically equivalent rewording of the evaluation prompt produces large, reproducible shifts in scores. This is one of the most insidious G-Eval failure modes because it is easy to dismiss as noise when it is actually a systematic vulnerability.

🤔 Did you know? Studies on LLM-as-judge systems have found score shifts of 0.5–1.5 points on a 5-point scale from changes as minor as moving the scoring instruction from the beginning to the end of the prompt, or substituting "rate" for "evaluate."

The root cause is that LLMs are not executing a logical procedure—they are completing a token sequence. The entire prompt, including its framing, word choice, and structural ordering, conditions the probability distribution over score tokens. Two prompts that a human reader would treat as equivalent may position the model in very different regions of its behavior space.

Prompt Version A:  "You are a strict evaluator. Rate the factual accuracy..."
                          ↓
                   Mean score: 2.8

Prompt Version B:  "You are a helpful evaluator. Rate the factual accuracy..."
                          ↓
                   Mean score: 3.6

Difference: 0.8 points — from a single adjective change

How to detect it: Maintain a prompt regression suite—a fixed set of 20–30 outputs with established reference scores (either human-annotated or from a well-validated prior run). Whenever you modify your evaluation prompt, re-run this suite and compare score distributions. A mean shift greater than 0.2 points or a rank-order correlation below 0.9 (Spearman's ρ) should trigger a review.
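
A minimal sketch of that regression check, assuming scipy is available; the two lists are the same fixed suite scored under the old and new prompt versions.

import numpy as np
from scipy.stats import spearmanr

# Reference suite scored under the old and new prompt versions (illustrative)
old_prompt_scores = [4, 3, 5, 2, 4, 3, 4, 5, 2, 3]
new_prompt_scores = [4, 3, 4, 2, 4, 3, 3, 5, 2, 3]

mean_shift = abs(float(np.mean(new_prompt_scores)) - float(np.mean(old_prompt_scores)))
rho, _ = spearmanr(old_prompt_scores, new_prompt_scores)

print(f"Mean shift: {mean_shift:.2f} | Spearman rho: {rho:.2f}")
if mean_shift > 0.2 or rho < 0.9:
    print("Prompt change altered scoring behavior: review before adopting.")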

How to fix it:

  • Freeze prompt versions and treat them like software artifacts. Use version control (Git) for all evaluation prompts, not just model code.
  • Multi-pass aggregation across slightly varied prompt phrasings (the self-consistency approach from Section 3) reduces sensitivity to any single wording by averaging over the variation.
  • Structural prompting discipline: keep rubric criteria in the same order, use the same scale label consistently ("1 to 5" not "one to five" in some runs), and avoid persona instructions ("strict," "helpful") that carry strong connotations.

⚠️ Common Mistake: Treating prompt iteration as low-stakes experimentation. Every prompt change to an evaluation pipeline is a potential validity threat. Changes should go through the same review process as changes to the system being evaluated.


Pitfall 4: Conflating Multi-Dimensional Quality into a Single Score

One of G-Eval's most powerful features is its support for dimension-specific evaluation: you can define separate rubrics for fluency, factual accuracy, relevance, and coherence, and score them independently. The pitfall—and it is an extremely common one—is collapsing all of these into a single composite score before analysis.

Wrong thinking: "I'll average the four dimension scores to get an overall quality score. That gives me one number to track."

Correct thinking: "I'll track dimension scores separately. An average hides the diagnostic information I need to know why a system's quality changed."

Consider a summarization system that improves between version A and version B:

Dimension           | Version A | Version B | Change
📝 Fluency          | 4.2       | 4.5       | +0.3
🎯 Relevance        | 3.8       | 3.9       | +0.1
✅ Factual Accuracy | 3.5       | 3.1       | -0.4
🔗 Coherence        | 3.9       | 4.2       | +0.3
Average             | 3.85      | 3.93      | +0.08

The composite score suggests Version B is a modest improvement. The dimension breakdown reveals a critical regression in factual accuracy that the composite obscures. If you were tracking only the average, you would ship Version B with confidence and introduce a factual accuracy problem into production.

💡 Real-World Example: Retrieval-augmented generation (RAG) systems are particularly vulnerable to this pattern. Prompt tuning that makes responses more fluent often increases hallucination rates simultaneously. A single composite score will average these opposing effects into a misleadingly stable number.

The structural fix is architectural: design your G-Eval pipeline to return a dictionary of dimension scores, never a single float, and build your dashboards and alerting around the full vector.

from dataclasses import dataclass
from typing import Optional

@dataclass
class GEvalResult:
    """
    Structured output from a G-Eval pipeline.
    Dimension scores are stored separately; composite is computed
    only on explicit request and never stored as the primary result.
    """
    fluency: float
    relevance: float
    factual_accuracy: float
    coherence: float
    metadata: Optional[dict] = None

    def composite(self, weights: Optional[dict] = None) -> float:
        """Compute a weighted composite — for reporting only, not primary storage."""
        if weights is None:
            weights = {"fluency": 0.2, "relevance": 0.3,
                       "factual_accuracy": 0.4, "coherence": 0.1}
        return (
            self.fluency * weights["fluency"] +
            self.relevance * weights["relevance"] +
            self.factual_accuracy * weights["factual_accuracy"] +
            self.coherence * weights["coherence"]
        )

    def regression_alert(self, baseline: "GEvalResult", threshold: float = 0.2) -> list[str]:
        """Flag any dimension that has dropped by more than the threshold."""
        alerts = []
        for dim in ["fluency", "relevance", "factual_accuracy", "coherence"]:
            delta = getattr(self, dim) - getattr(baseline, dim)
            if delta < -threshold:
                alerts.append(f"⚠️ {dim} regressed by {abs(delta):.2f} points")
        return alerts

# Usage example
baseline = GEvalResult(fluency=4.2, relevance=3.8, factual_accuracy=3.5, coherence=3.9)
current = GEvalResult(fluency=4.5, relevance=3.9, factual_accuracy=3.1, coherence=4.2)

alerts = current.regression_alert(baseline)
for alert in alerts:
    print(alert)
# Output: ⚠️ factual_accuracy regressed by 0.40 points

🧠 Mnemonic: Think of G-Eval dimension scores as vital signs, not a single health score. A patient with normal temperature, normal blood pressure, but dangerously low oxygen is not "mostly healthy." You need the full panel.


Pitfall 5: Over-Trusting High-Capability Judge Models Without Validation

The final and perhaps most dangerous pitfall is treating a capable judge model—GPT-4o, Claude Sonnet, Gemini Ultra—as a ground truth source simply because it is large and well-regarded. Model capability does not transfer automatically to evaluation validity on your specific task. A judge model may be excellent at general reasoning while systematically miscalibrated for your domain, your output format, or your definition of quality.

The failure manifests in several ways:

Over-trust failure modes:

  General capability              Task-specific validity
  ─────────────────              ──────────────────────
  High                           May be low because:
  ↓                              ↓
  GPT-4o scores well             - Domain is specialized (legal, medical, code)
  on public benchmarks     →     - Output format is unusual (structured data, poetry)
                                 - Quality definition is idiosyncratic to your use case
                                 - Judge has systematic biases (length, verbosity)

One well-documented form of over-trust bias is verbosity preference: large instruction-tuned models consistently rate longer responses higher than shorter ones, even when the shorter response is more accurate and appropriate. If your task rewards conciseness (a one-sentence answer to a simple factual question), an unvalidated judge will systematically misevaluate your outputs.

The validation protocol is non-negotiable for any production G-Eval deployment:

🔧 Step 1 — Collect a calibration set. Gather 50–100 outputs from your actual system. These should span the quality range you expect in production, not cherry-picked examples.

🔧 Step 2 — Collect human annotations. Have at least 2–3 domain-knowledgeable annotators score each output using the same rubric you plan to use with the judge model. Compute inter-annotator agreement (Cohen's κ or Krippendorff's α) to verify the rubric itself is clear.

🔧 Step 3 — Compute judge-human correlation. Run your G-Eval pipeline on the calibration set and compute Spearman's ρ between judge scores and averaged human scores. A correlation below 0.7 is a red flag. Below 0.5 means the judge is not reliably measuring what humans care about. (A sketch of this computation follows these steps.)

🔧 Step 4 — Inspect failure cases. Do not stop at the correlation coefficient. Pull the 10–15 examples where judge and human scores diverge most. These cases will reveal the systematic bias pattern: is the judge favoring length? Rewarding confident tone regardless of accuracy? Penalizing domain-specific terminology it doesn't recognize?

🔧 Step 5 — Recalibrate the rubric. Use the failure case analysis to add specificity to the rubric, add explicit instructions to counter identified biases (e.g., "Do not adjust your score based on response length"), and re-validate.
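
A minimal sketch of Steps 2 and 3, assuming scipy and scikit-learn are available; the annotation and judge arrays are illustrative.

import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# Step 2: inter-annotator agreement on the calibration set (illustrative labels)
annotator_1 = [4, 3, 5, 2, 4, 3, 1, 5, 2, 3]
annotator_2 = [4, 3, 4, 2, 4, 2, 1, 5, 3, 3]
kappa = cohen_kappa_score(annotator_1, annotator_2, weights="quadratic")
print(f"Inter-annotator agreement (quadratic kappa): {kappa:.2f}")

# Step 3: judge-human correlation against averaged human scores
human_mean = np.mean([annotator_1, annotator_2], axis=0)
judge_scores = [4.2, 3.0, 4.8, 2.4, 3.6, 2.2, 1.4, 4.8, 2.8, 3.2]
rho, _ = spearmanr(judge_scores, human_mean)
print(f"Judge-human Spearman rho: {rho:.2f}")  # below 0.7 is a red flag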

💡 Pro Tip: Validation is not a one-time event. As your system evolves and as judge model versions update (model providers frequently update models in place), schedule quarterly re-validation against a held-out set of human-annotated examples. Score drift between judge versions has caused silent evaluation failures in production pipelines.

⚠️ Common Mistake: Using the judge model to evaluate outputs from a different judge model's family without accounting for self-preference bias. OpenAI models have been shown to prefer outputs from other OpenAI models; Anthropic models show similar patterns. Cross-family validation—using an open-weight judge to spot-check a closed-model judge—is a useful safeguard.


A Unified Diagnostic Checklist

Before trusting any G-Eval pipeline in production, run through this checklist:

📋 Quick Reference Card: G-Eval Health Check

Check                   | Signal                                 | Action if Failed
🎯 Rubric specificity   | Test on polished-but-wrong output      | Add behavioral anchors per scale point
📊 Score distribution   | Std dev < 0.8 on 1–5 scale             | Add anchors, switch to pairwise
🔁 Prompt stability     | Score shift > 0.2 on regression suite  | Freeze prompt, use multi-pass
📐 Dimension separation | Using composite as primary metric      | Restructure pipeline to return dict
🧪 Human correlation    | Spearman ρ < 0.7                       | Re-validate rubric, inspect failures
📅 Temporal stability   | Not re-validated in > 90 days          | Schedule re-validation against held set

Putting It Together: A Failure Mode Interaction

These pitfalls rarely appear in isolation. A particularly common cascade begins with rubric underspecification, which makes scores susceptible to surface features, which in turn causes anchor collapse (the model scores most outputs as "acceptable"), which makes prompt sensitivity drift harder to detect (there's little variance to shift), which encourages the practitioner to collapse dimensions into a single score (since all dimensions look similarly flat), which finally creates false confidence that the judge is stable—until a spot-check against human annotations reveals the system has been measuring fluency all along.

Failure Cascade:

 Vague rubric
     ↓
 Proxy scoring (fluency instead of target dimension)
     ↓
 Compressed score range (anchor collapse)
     ↓
 Low variance masks prompt sensitivity drift
     ↓
 Composite score hides dimension signal
     ↓
 No human validation → false confidence
     ↓
 Silent evaluation failure in production

The good news is that the cascade has a single structural intervention point: rubric specificity. A well-anchored, dimension-specific rubric with behavioral definitions at each scale point makes all downstream pitfalls less likely and easier to detect when they do occur. Validate early, track distributions, and never promote an evaluation pipeline to production without at least one pass of human-judge correlation analysis on your actual task data.

Key Takeaways and What Comes Next

You have now traveled the full arc of G-Eval: from the motivating frustration with single-sample LLM scores, through the architectural choices that make structured judgment reliable, across the landscape of modern variants that have emerged as the ecosystem has matured, and into the practical realities of implementation and failure. This final section cements what you have learned, gives you a quick-reference toolkit you can return to, and positions you for the deeper explorations ahead.

What You Now Understand That You Didn't Before

Before this lesson, you might have thought of LLM-based evaluation as simply "ask the model to score this output." That framing collapses important distinctions. Here is what the G-Eval lens has added to your mental model.

First, you now understand that a single LLM score is a sample from a distribution, not a ground truth. The same rubric, the same input, and the same model will produce different scores across runs unless you deliberately control for variance. G-Eval's original contribution was recognizing this and addressing it by aggregating over multiple samples — weighted by token probabilities when those were accessible, averaged over multiple passes when they were not.

Second, you understand the role that prompt structure plays in stabilizing outputs. A vague instruction like "rate the quality from 1 to 10" produces noisy, hard-to-reproduce scores. A chain-of-thought rubric that defines each criterion, walks the model through an explicit reasoning step, and constrains the output format produces scores that are far more interpretable and consistent. The structure is not just cosmetic — it is load-bearing.

Third, you now see the 2026 landscape clearly. The original G-Eval paper leaned on token-level log probabilities over discrete score tokens as the aggregation mechanism. That mechanism is still theoretically elegant, but direct probability access has become limited or abstracted away in most production API contexts. The field has responded by moving toward controlled generation plus aggregation: deterministic or low-temperature decoding to stabilize individual passes, followed by multi-pass aggregation or self-consistency voting to approximate the distributional signal that log probabilities used to provide.

Fourth, you have a vocabulary for choosing among variants. Absolute scoring with aggregation when you need fine-grained, criterion-level feedback. Pairwise ranking when relative comparisons are more meaningful than absolute scores. Calibration layers using open-weight models when you do have logprob access and want to recover some of the original probabilistic grounding.

🎯 Key Principle: G-Eval's enduring insight is not about token probabilities specifically — it is about treating evaluation as a distribution estimation problem. The mechanism for estimating that distribution has evolved, but the core commitment to aggregation over single-sample scoring remains.


The 2026 Landscape at a Glance

The diagram below summarizes how implementation strategy has shifted from the original paper to current practice:

ORIGINAL G-EVAL (2023)
─────────────────────────────────────────────────────────
 Prompt + Rubric
        │
        ▼
   LLM (with logprob access)
        │
        ▼
  Token probabilities over {1, 2, 3, 4, 5}
        │
        ▼
  Weighted average → final score

MODERN G-EVAL (2026)
─────────────────────────────────────────────────────────
 Prompt + Structured Rubric + CoT instruction
        │
        ▼
   LLM (low temperature, JSON output mode)
        │
  ┌─────┴──────┐
  │  Run N=5   │  (self-consistency or multi-pass)
  └─────┬──────┘
        │
        ▼
  Parse & validate each output
        │
        ▼
  Aggregate (mean / majority vote / weighted)
        │
        ▼
  Optional: calibration layer (open-weight model
  with logprobs to re-weight or validate)
        │
        ▼
  Final score + reasoning trace

The endpoints are similar — a reliable, rubric-aligned score — but the path through the middle has changed substantially.


Choosing the Right Variant: A Decision Framework

One of the most practically useful things you can take away from this lesson is a clear decision procedure for selecting which G-Eval variant fits your situation.

Do you need criterion-level scores (not just a single number)?
    │
    ├─ YES ──► Absolute scoring with per-criterion rubric + multi-pass aggregation
    │           Best for: feedback generation, regression testing, quality dashboards
    │
    └─ NO ──► Do you have two or more candidates to compare?
                │
                ├─ YES ──► Pairwise ranking or tournament-style evaluation
                │           Best for: model selection, A/B testing, RLHF data collection
                │
                └─ NO ──► Do you have logprob access (open-weight or research API)?
                            │
                            ├─ YES ──► Calibrated scoring with probability weighting
                            │           Best for: research, high-stakes decisions, calibration audits
                            │
                            └─ NO ──► Structured absolute scoring with 3-5 passes
                                        Best for: most production use cases

💡 Real-World Example: A team building a customer support response quality system used absolute scoring with five passes and per-criterion rubrics (accuracy, tone, completeness, brevity). They found that the per-criterion breakdown was more actionable for their QA team than a single aggregate score — annotators could pinpoint which criterion was failing rather than re-reading the whole response to guess why it scored low.



Quick-Reference Implementation Checklist

Use this checklist every time you stand up a G-Eval pipeline. Each item corresponds to a failure mode covered in earlier sections.

📋 Quick Reference Card: G-Eval Implementation Checklist

#  | 🎯 Checkpoint               | ✅ What to verify                                                 | ⚠️ Risk if skipped
1  | 🔧 Rubric specificity       | Each criterion has a definition and anchored score examples      | Vague rubrics produce inconsistent scores
2  | 🔒 Decoding settings        | Temperature ≤ 0.2 for absolute scoring; consistent across passes | High temperature inflates variance
3  | 📚 Number of passes         | Minimum 3; prefer 5 for high-stakes evals                        | Single-pass results are unreliable
4  | 🔧 Output validation        | Schema check on every response; log and retry on parse failure   | Silent failures corrupt aggregates
5  | 🎯 Aggregation method       | Documented and consistent (mean, median, majority vote)          | Switching methods mid-run invalidates comparisons
6  | 🧠 Human spot-check cadence | Weekly sample review for ongoing pipelines                       | Model drift goes undetected
7  | 📚 Position bias control    | Randomize candidate order in pairwise evals                      | Position bias skews rankings
8  | 🔒 Calibration audit        | Correlate LLM scores against human scores on a held-out set      | Miscalibrated scores mislead downstream decisions

Here is a minimal implementation that encodes several of these checklist items:

import json
import re
from statistics import mean, stdev
from typing import Optional

# A production-grade G-Eval runner with validation and aggregation
def run_geval(
    llm_client,
    rubric_prompt: str,
    input_text: str,
    candidate_output: str,
    n_passes: int = 5,
    temperature: float = 0.1,
    score_field: str = "score",
    score_range: tuple = (1, 5),
) -> dict:
    """
    Runs a multi-pass G-Eval evaluation with validation and aggregation.
    Returns a summary dict with mean score, std dev, and reasoning traces.
    """
    full_prompt = rubric_prompt.format(
        input=input_text,
        output=candidate_output
    )

    scores = []
    traces = []
    failures = 0

    for pass_idx in range(n_passes):
        try:
            # Low temperature for stability; JSON mode for reliable parsing
            response = llm_client.complete(
                prompt=full_prompt,
                temperature=temperature,
                response_format={"type": "json_object"},
                max_tokens=512,
            )
            parsed = json.loads(response.text)

            # Validate score exists and is in range
            raw_score = parsed.get(score_field)
            if raw_score is None:
                raise ValueError(f"Missing '{score_field}' field in response")

            score = float(raw_score)
            lo, hi = score_range
            if not (lo <= score <= hi):
                raise ValueError(f"Score {score} outside range [{lo}, {hi}]")

            scores.append(score)
            traces.append(parsed.get("reasoning", ""))

        except (json.JSONDecodeError, ValueError, KeyError) as e:
            # Log parse failures but don't abort — retry logic could go here
            failures += 1
            print(f"Pass {pass_idx + 1} failed: {e}")

    if not scores:
        raise RuntimeError("All evaluation passes failed — check prompt and API settings")

    return {
        "mean_score": round(mean(scores), 3),
        "std_dev": round(stdev(scores), 3) if len(scores) > 1 else 0.0,
        "n_valid": len(scores),
        "n_failed": failures,
        "scores": scores,
        "reasoning_traces": traces,
    }

This runner handles the three most common silent failure modes: JSON parse errors, missing score fields, and out-of-range values. It logs failures without aborting, so you can audit pass-level reliability over time.

💡 Pro Tip: Track std_dev across your evaluation runs as a health metric. If standard deviation on a stable test set starts climbing, it often signals a model update on the provider side or a rubric that has become ambiguous relative to new types of inputs entering your pipeline.


The Core Lessons, Stated Plainly

Before moving on, it helps to state the core lessons in unambiguous terms.

🧠 Mnemonic: S-A-V. Structure the rubric, Aggregate over passes, Validate every output. These three actions separate reliable G-Eval pipelines from brittle ones.

Structure the rubric. The quality of your evaluation is bounded by the quality of your rubric. A rubric that defines criteria vaguely, omits scoring anchors, or conflates multiple dimensions into a single score will produce scores that are neither reproducible nor actionable. Invest time here before you write a line of code.

Aggregate over passes. A single LLM call gives you a point estimate from a distribution. Whether you weight by token probabilities, average over multiple low-temperature samples, or use self-consistency voting, the principle is the same: you need more than one draw to characterize the distribution reliably. In practice, five passes is a reasonable default for most use cases.
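As a sketch of the aggregation options just mentioned, here are three interchangeable ways to collapse per-pass scores into one number; the function name and the example scores are illustrative:

from collections import Counter
from statistics import mean, median

def aggregate_scores(scores: list[float], strategy: str = "mean") -> float:
    # Collapse N per-pass scores into a single score.
    if strategy == "mean":      # smooth, but sensitive to outlier passes
        return mean(scores)
    if strategy == "median":    # robust to a single wild pass
        return median(scores)
    if strategy == "vote":      # self-consistency: most frequent score wins
        return Counter(scores).most_common(1)[0][0]
    raise ValueError(f"Unknown strategy: {strategy}")

# e.g. aggregate_scores([4, 4, 5, 4, 2], "median") -> 4, vs. mean -> 3.8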

Validate every output. Evaluation pipelines that silently accept malformed responses will produce corrupted aggregates that you will trust until a human spot-check reveals the problem. Build schema validation and range checking into the pipeline from day one.

⚠️ Critical point to remember: The biggest practical risk in G-Eval deployments is not rubric quality or aggregation strategy — it is silent parse failures accumulating in your aggregate. A pipeline that runs five passes but silently drops two bad parses and averages only three scores will appear to work correctly while producing systematically biased results. Always log your n_valid and n_failed counts and set an alert threshold.
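A minimal guard along these lines, assuming the summary dict returned by run_geval above (the 0.2 threshold is an arbitrary placeholder you should tune to your pass count):

def check_pass_health(result: dict, max_failure_rate: float = 0.2) -> dict:
    # Raise when too many passes were silently dropped from the aggregate.
    total = result["n_valid"] + result["n_failed"]
    failure_rate = result["n_failed"] / total if total else 1.0
    if failure_rate > max_failure_rate:
        raise RuntimeError(
            f"{result['n_failed']}/{total} passes failed to parse; "
            "the aggregate may be biased, so investigate before trusting it"
        )
    return result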



What the Next Lessons Cover

This lesson has given you a complete picture of G-Eval's architecture and its 2026 variants, but it has deliberately kept two adjacent topics at arm's length. The next lessons address both of them directly.

Token Probability Scoring: The Probabilistic Foundations

G-Eval's original design was grounded in a specific probabilistic intuition: if you ask a model to assign a score and you observe the probability mass the model places over each discrete score token, you have a richer signal than the single highest-probability token alone. The expected score under that distribution is more stable, better calibrated, and more informative than a greedy sample.
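As a sketch of that computation, assuming an API that exposes the top-k log probabilities at the position where the score token is generated (the top_logprobs dict below is a hypothetical shape, not a specific provider's response format):

import math

def expected_score(top_logprobs: dict[str, float],
                   valid_scores=("1", "2", "3", "4", "5")) -> float:
    # E[s] = sum over s of s * p(s), with p renormalized over the valid
    # score tokens, since the top-k list usually also contains non-score
    # tokens whose mass must be excluded.
    probs = {t: math.exp(lp) for t, lp in top_logprobs.items() if t in valid_scores}
    total = sum(probs.values())
    if total == 0:
        raise ValueError("No valid score tokens found in top_logprobs")
    return sum(int(t) * p / total for t, p in probs.items())

# e.g. expected_score({"4": math.log(0.6), "5": math.log(0.3), "3": math.log(0.1)})
# -> 0.6*4 + 0.3*5 + 0.1*3 = 4.2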

The next lesson on token probability scoring digs into this foundation systematically. You will learn:

🔧 How log probabilities are computed and what they represent
📚 Why probability-weighted scoring is theoretically superior to sampling-based scoring
🎯 In which settings logprob access is still available and how to use it correctly
🧠 How to use open-weight models as calibration oracles even when your primary evaluation model is a closed API

If you have been working with APIs that expose logprobs — Mistral, certain OpenAI endpoints in research tiers, or self-hosted models via vLLM or llama.cpp — this lesson will give you the mathematical and practical tools to exploit that access fully.

FActScoring: Claim-Level Decomposition for Factuality

G-Eval treats a model output as a whole and scores it against a rubric. This works well for dimensions like fluency, coherence, or tone — dimensions that are inherently holistic. It works less well for factuality, where a single response might contain fifteen factual claims, twelve of which are correct and three of which are hallucinated.

A holistic score of 4/5 on factuality obscures which claims are wrong and makes it difficult to diagnose the failure or track improvement over time. FActScoring and its relatives address this by decomposing the evaluation unit from the full response to the individual atomic claim.

The lesson on FActScoring will cover:

🔧 How to decompose a model output into atomic factual claims
📚 How to score each claim independently against a reference corpus or knowledge source
🎯 How to aggregate claim-level scores into response-level metrics while preserving interpretability
🧠 How decomposition-based methods relate to G-Eval — they are complementary, not competing

💡 Mental Model: Think of G-Eval and FActScoring as operating at different granularities. G-Eval evaluates the paragraph; FActScoring evaluates the sentences within it. For many real systems, you want both: G-Eval for overall quality dimensions and FActScoring for factuality specifically.


Here is a minimal illustration of how the two approaches can be composed in a single evaluation pipeline:

# Composing G-Eval (holistic) with claim-level decomposition (FActScore-style)
# This shows the architectural relationship — not a full implementation

def evaluate_response_comprehensive(
    llm_client,
    knowledge_client,
    response: str,
    source_document: str,
    geval_rubric: str,
) -> dict:
    """
    Runs G-Eval for holistic quality dimensions and claim-level
    decomposition for factuality. Returns a unified evaluation report.
    """

    # --- G-Eval pass: holistic quality (fluency, coherence, relevance) ---
    geval_result = run_geval(
        llm_client=llm_client,
        rubric_prompt=geval_rubric,
        input_text=source_document,
        candidate_output=response,
        n_passes=5,
    )

    # --- Claim decomposition: factuality ---
    # Step 1: Extract atomic claims from the response
    decomposition_prompt = (
        "Extract all atomic factual claims from the following text. "
        'Return a JSON object with a single "claims" key containing '
        "an array of strings, each a single verifiable claim.\n\n"
        f"Text: {response}"
    )
    claims_raw = llm_client.complete(
        prompt=decomposition_prompt,
        temperature=0.0,
        response_format={"type": "json_object"},
    )
    claims = json.loads(claims_raw.text).get("claims", [])

    # Step 2: Score each claim against the knowledge source
    # (In a real system, knowledge_client.verify() would use
    # retrieval + entailment or a fine-tuned NLI model)
    claim_results = []
    for claim in claims:
        supported = knowledge_client.verify(claim, source=source_document)
        claim_results.append({"claim": claim, "supported": supported})

    n_supported = sum(1 for c in claim_results if c["supported"])
    factscore = n_supported / len(claim_results) if claim_results else None

    # --- Unified report ---
    return {
        "holistic_quality": geval_result,
        "factuality": {
            "factscore": round(factscore, 3) if factscore is not None else None,
            "n_claims": len(claim_results),
            "n_supported": n_supported,
            "claim_details": claim_results,
        },
    }

This structure illustrates the architectural relationship between the two methods. G-Eval handles the holistic dimensions where rubric-based scoring is appropriate; claim decomposition handles factuality where granular verification is required. The two results can be presented side by side in a quality dashboard, giving both aggregate signals and interpretable drill-downs.


Practical Next Steps

Before you move to the next lesson, here are three concrete actions that will anchor what you have learned:

1. Audit an existing evaluation prompt you use today. Apply the rubric specificity checklist: Does each criterion have a clear definition? Are there anchored score examples for at least the low, middle, and high points? Is the output format explicitly constrained? Most evaluation prompts in the wild fail at least one of these checks.

2. Add a multi-pass wrapper to your most critical evaluation call. Even three passes at low temperature will substantially reduce the variance of your scores and give you a standard deviation metric to track over time. The implementation cost is low; the reliability gain is significant.

3. Identify one evaluation task in your system where factuality matters most. This is the task where FActScoring will give you the most leverage. Prepare a list of the types of factual claims your model makes, and think about what a reference source for verifying those claims would look like. You will be ready to apply the techniques in the FActScoring lesson immediately.

🤔 Did you know? The SummEval benchmark, one of the early datasets used to validate G-Eval's correlation with human judgments, evaluated model outputs across four dimensions: coherence, consistency, fluency, and relevance. Modern evaluation suites have expanded this to dozens of task-specific dimensions — but the structural lesson from SummEval remains: multi-dimensional rubrics produce more actionable and more reproducible evaluations than single-number quality scores.



Summary: The G-Eval Architecture in One View

📋 Quick Reference Card: G-Eval Architecture and Variants

🎯 Dimension | 📚 Original G-Eval | 🔧 Modern G-Eval (2026)
🔒 Core mechanism | Token log-probability weighting | Multi-pass aggregation + structured output
🧠 Aggregation signal | Probability distribution over score tokens | Mean/median/vote across N low-temp passes
🔧 Decoding strategy | Varied temperature to sample distribution | Low temperature (≤ 0.2) for each pass
📚 Output format | Free-form with score extraction | JSON schema with validation
🎯 Logprob dependency | Required | Optional (used for calibration when available)
🔒 Best variant for feedback | Absolute scoring, per-criterion | Same, with explicit CoT per criterion
🧠 Best variant for comparison | Pairwise (less common originally) | Pairwise ranking with position randomization
🔧 Calibration approach | Inherent in probability weighting | Separate calibration layer (open-weight model)

The through-line across all of these variants is the same insight that motivated the original work: a single LLM judgment is a sample, not a truth. Structure your prompts to elicit the right distribution, aggregate enough samples to characterize it, and validate your outputs to ensure the aggregate is clean. Everything else — the specific mechanism for weighting, the choice of model, the format of the rubric — is an implementation detail that should be chosen based on what your environment makes possible.

You are now equipped to build G-Eval pipelines that are reproducible, interpretable, and robust to the practical constraints of 2026's LLM ecosystem. The next two lessons will deepen your toolkit: one by grounding you in the probabilistic theory G-Eval was built on, and the other by extending your evaluation capabilities to the granularity of individual factual claims.