
Deterministic Scoring: Rules, Trees, and DAGs

Datalog-style rules, decision trees, and DAG metrics as a family of deterministic scoring engines. Each is fully reproducible, versionable, and auditable. Trade-offs in expressiveness, debuggability, and implementation complexity.

Why Determinism Matters in LLM Evaluation

Imagine you've built an LLM-powered customer support assistant, and your compliance team asks a simple question: "How do we know the system is scoring responses consistently?" You run the same response through your LLM judge twice. You get a 7 out of 10 the first time, and a 6 out of 10 the second. The compliance officer raises an eyebrow. "Which score is correct?" You don't have a good answer — and that's the problem this lesson solves.

This section sets up the conceptual foundation for everything that follows. We'll look at why deterministic scoring engines behave differently from probabilistic LLM-as-judge approaches, explore the real-world scenarios where reproducibility isn't optional, and introduce the three engine families — rule systems, decision trees, and DAGs — that give you a path to evaluation you can actually defend.


The Reproducibility Problem

Large language models are, at their core, probabilistic machines. Even at temperature zero, hardware-level floating-point non-determinism and serving details such as dynamic batching mean that the same prompt fed into a model API can produce subtly or dramatically different outputs across runs. When you're using an LLM as a judge — asking it to score another model's output — this variability compounds.

Consider the following scenario:

import openai

client = openai.OpenAI()

prompt = """
Score the following customer support response on a scale of 1-10 for helpfulness.
Respond with only a number.

Response: "I understand your frustration. Your refund will be processed in 3-5 business days."
"""

## Run the same judge prompt multiple times
scores = []
for i in range(5):
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # Even at temp=0, variance can appear
        seed=42           # Seed helps but does not guarantee determinism
    )
    score = result.choices[0].message.content.strip()
    scores.append(score)
    print(f"Run {i+1}: {score}")

## You might see: ['8', '8', '7', '8', '9']
## Or even:       ['8', '8', '8', '8', '8']  <- looks stable...
## ...until you change the model version, API provider, or hardware.
print(f"Scores across runs: {scores}")

This code illustrates the problem concisely. Even with temperature=0.0 and a fixed seed, OpenAI's documentation explicitly states that determinism across different runs, model versions, or hardware configurations is not guaranteed. The seed parameter improves reproducibility within a session, but it is not a contract.

Now scale this up. You're running 50,000 evaluations per day across a production pipeline. A 1-point variance on a 10-point scale might seem minor, but it means your pass/fail thresholds are being applied to scores that drift based on factors you cannot observe or control. This is the reproducibility problem: the mapping from input to score is mediated by a stochastic process that you don't fully control.

🤔 Did you know? Research on LLM-as-judge consistency has shown that the order in which options are presented to a model judge can shift its preferences — a phenomenon called positional bias. The same response can score differently simply because it appeared first or second in the prompt.

The deeper issue is epistemic. When a score changes between runs, you can't easily answer: Was the first score wrong? The second? Both? There's no ground truth to appeal to. You're left with averages, error bars, and uncomfortable uncertainty — which is fine for research but often unacceptable in production.


When Determinism Is Non-Negotiable

For many evaluation tasks — exploratory research, A/B testing between model versions, generating qualitative feedback — probabilistic LLM judges are genuinely useful and their variance is tolerable. But there's a distinct class of scenarios where variance is not tolerable at all.

Regulatory and Compliance Contexts

In regulated industries — healthcare, finance, legal — systems that make or influence decisions must be auditable. If your LLM assistant is helping triage patient intake forms, and a regulator asks "Show me every case where the system scored below the threshold and why", you need to produce a consistent, reproducible answer. A log entry that says "GPT-4o gave it a 4/10 on April 3rd" is not auditable in any meaningful sense, because you cannot re-run that exact evaluation and get the same answer.

The EU AI Act and similar frameworks increasingly require traceability: the ability to trace any output back to the exact logic that produced it. Probabilistic judges fail this requirement by design.

Debugging and Root Cause Analysis

When a deployed LLM system starts behaving badly — users complaining, quality metrics dropping — you need to isolate what changed. Was it the model? The prompt? The retrieval context? Your evaluation pipeline itself?

If your evaluation scores are themselves non-deterministic, you can't distinguish signal from noise in the debugging process. A drop in average score might reflect a real model regression, or it might reflect natural variance in your judge. Deterministic scoring gives you a stable baseline against which real changes stand out clearly.

Versioning and CI/CD Integration

Modern ML systems treat evaluation as a gate in the deployment pipeline — similar to unit tests in software. A new model version ships only if it clears the evaluation threshold. But if the threshold check itself is probabilistic, you've introduced a flaky test into your CI/CD pipeline. Engineers running the same commit twice might get different pass/fail decisions. This erodes trust in the entire evaluation system.
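To make the gate concrete, here is a minimal sketch of a deterministic evaluation gate expressed as a pytest test. The required-field scorer, the golden cases, and the 0.8 threshold are all illustrative assumptions rather than part of any particular framework; the point is that re-running the same commit always produces the same pass/fail decision.

import pytest

# A tiny deterministic scorer for illustration: one point per required field present.
REQUIRED_FIELDS = ["greeting", "resolution", "timeline"]

def score_release_candidate(output: dict) -> float:
    present = sum(1 for f in REQUIRED_FIELDS if output.get(f))
    return present / len(REQUIRED_FIELDS)

# Hypothetical golden cases; in practice these would be versioned alongside the code.
GOLDEN_CASES = [
    {"id": "refund-basic",   "output": {"greeting": True, "resolution": True, "timeline": True}},
    {"id": "shipping-delay", "output": {"greeting": True, "resolution": True, "timeline": True}},
]

@pytest.mark.parametrize("case", GOLDEN_CASES, ids=lambda c: c["id"])
def test_release_candidate_clears_threshold(case):
    score = score_release_candidate(case["output"])
    # Deterministic scorer: the same commit always yields the same verdict, so this
    # test can never be flaky in the way an LLM-judged gate can.
    assert score >= 0.8, f"{case['id']} scored {score:.2f}, below the 0.80 gate"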

💡 Real-World Example: A fintech company building an LLM-powered loan document checker needed every scoring decision to be replayable. Their legal team required that a document scored "insufficient disclosure" on January 15th could be re-evaluated on March 10th and produce the same score with the same reasoning. LLM judges couldn't satisfy this requirement. A deterministic rule engine could.



The Core Promise of Deterministic Engines

A deterministic scoring engine makes a simple but powerful guarantee: identical inputs always produce identical scores, and every score comes with a traceable reason.

This isn't just about getting the same number twice. It's about the entire causal chain being inspectable. When a deterministic engine scores a response as 6 out of 10, you can answer:

  • 🔧 What rules fired?
  • 🔧 What data was extracted from the input?
  • 🔧 Which conditions were met or failed?
  • 🔧 How did intermediate scores combine into the final score?

None of these questions have clear answers in an LLM judge. The model's internal reasoning is opaque — even if you ask it to explain itself, that explanation is itself generated probabilistically and may not accurately reflect the actual computation.

🎯 Key Principle: Deterministic scoring is not about being "smarter" than an LLM judge. It's about being transparent and accountable. You're trading the LLM's nuanced language understanding for a system whose behavior you can fully explain, version, and defend.

Here's a minimal example of what a deterministic scorer looks like in practice:

from dataclasses import dataclass
from typing import Dict, Any

@dataclass
class ScoringResult:
    score: float
    max_score: float
    reasons: list[str]  # Traceable explanation for every decision

def score_customer_response(response: Dict[str, Any]) -> ScoringResult:
    """
    Deterministic scorer for customer support responses.
    Same input dict -> same ScoringResult, every single time.
    """
    score = 0.0
    max_score = 10.0
    reasons = []

    # Rule 1: Acknowledgment of customer issue (2 points)
    acknowledgment_phrases = ["understand", "sorry", "apologize", "hear you"]
    if any(phrase in response["text"].lower() for phrase in acknowledgment_phrases):
        score += 2.0
        reasons.append("PASS: Contains acknowledgment phrase (+2.0)")
    else:
        reasons.append("FAIL: No acknowledgment phrase found (+0.0)")

    # Rule 2: Concrete timeline provided (3 points)
    import re
    timeline_pattern = r'\d+[-–]\d+\s*(business\s*)?days?'
    if re.search(timeline_pattern, response["text"], re.IGNORECASE):
        score += 3.0
        reasons.append("PASS: Contains concrete timeline (+3.0)")
    else:
        reasons.append("FAIL: No concrete timeline detected (+0.0)")

    # Rule 3: Response length appropriate (2 points)
    word_count = len(response["text"].split())
    if 10 <= word_count <= 100:
        score += 2.0
        reasons.append(f"PASS: Word count {word_count} in range [10,100] (+2.0)")
    else:
        reasons.append(f"FAIL: Word count {word_count} outside range [10,100] (+0.0)")

    # Rule 4: No prohibited phrases (3 points)
    prohibited = ["can't help", "not my problem", "read the manual"]
    violations = [p for p in prohibited if p in response["text"].lower()]
    if not violations:
        score += 3.0
        reasons.append("PASS: No prohibited phrases found (+3.0)")
    else:
        reasons.append(f"FAIL: Prohibited phrases found: {violations} (+0.0)")

    return ScoringResult(score=score, max_score=max_score, reasons=reasons)

## Test it
response = {
    "text": "I understand your frustration. Your refund will be processed in 3-5 business days."
}
result = score_customer_response(response)
print(f"Score: {result.score}/{result.max_score}")
for reason in result.reasons:
    print(f"  {reason}")
## Score: 10.0/10.0
##   PASS: Contains acknowledgment phrase (+2.0)
##   PASS: Contains concrete timeline (+3.0)
##   PASS: Word count 13 in range [10,100] (+2.0)
##   PASS: No prohibited phrases found (+3.0)
## -> Run it again tomorrow: identical output. Every time.

This scorer does something an LLM judge cannot: it produces an audit trail. Every line in reasons corresponds to a specific, inspectable decision. You can version this code in git, write unit tests for it, and replay it on historical data with complete confidence.

⚠️ Common Mistake: Confusing determinism with correctness. A deterministic scorer can be consistently wrong if its rules are poorly designed. Determinism guarantees reproducibility, not quality. You still need to validate that your rules actually capture what you care about.


Deterministic and Probabilistic: Better Together

At this point you might be thinking: if deterministic scoring is so reliable, why use LLM judges at all? The answer is that both approaches have genuine strengths, and the most powerful evaluation pipelines use them in combination.

Here's the honest trade-off:

┌─────────────────────────────────────────────────────────────────┐
│           DETERMINISTIC vs. PROBABILISTIC EVALUATION            │
├─────────────────────────┬───────────────────────────────────────┤
│   Deterministic Engine  │        LLM-as-Judge                   │
├─────────────────────────┼───────────────────────────────────────┤
│ ✅ Fully reproducible   │ ⚠️  Varies across runs                │
│ ✅ Auditable trace      │ ⚠️  Opaque reasoning                  │
│ ✅ Zero inference cost  │ ⚠️  API cost per evaluation           │
│ ✅ Versionable in git   │ ✅  Understands nuance                │
│ ⚠️  Brittle on novelty  │ ✅  Generalizes to new patterns       │
│ ⚠️  Requires spec work  │ ✅  Low setup cost                    │
│ ⚠️  Misses semantics    │ ✅  Captures tone, coherence          │
└─────────────────────────┴───────────────────────────────────────┘

The architectural pattern that emerges from this trade-off is the hybrid evaluation pipeline: deterministic engines handle everything that can be specified precisely — format compliance, required fields, prohibited content, structural constraints — while LLM judges handle the semantic dimensions that resist formal specification, such as tone, helpfulness, and reasoning quality.

  LLM Output
      │
      ▼
┌─────────────────────────────────┐
│   DETERMINISTIC LAYER (fast)    │
│  Rules → Trees → DAGs           │
│  Format, Safety, Structure      │
│  → Score: 0-40 pts              │
└────────────────┬────────────────┘
                 │
                 ▼
        ┌────────────────┐
        │ Hard Fail?     │──── YES ──→ REJECT (skip LLM judge)
        └────────┬───────┘
                 │ NO
                 ▼
┌─────────────────────────────────┐
│   PROBABILISTIC LAYER (slower)  │
│   LLM-as-Judge                  │
│   Coherence, Tone, Helpfulness  │
│   → Score: 0-60 pts             │
└────────────────┬────────────────┘
                 │
                 ▼
        Combined Score + Audit Log

In this pipeline, the deterministic layer acts as a gate and a ground truth anchor. Responses that fail hard constraints (e.g., contain PII, are too short, violate safety rules) are rejected immediately without spending inference budget on an LLM judge. The responses that pass then receive probabilistic scoring for the nuanced dimensions.
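Here is a minimal sketch of that gating logic in code, reusing the score_customer_response function from earlier in this section. The llm_judge_score helper is a hypothetical wrapper around an LLM-as-judge call, and the point scaling (deterministic score times 4 to reach 0-40) is an illustrative assumption.

from dataclasses import dataclass

@dataclass
class PipelineResult:
    accepted: bool
    deterministic_points: float      # 0-40, from the deterministic layer
    llm_points: float                # 0-60, only populated if the gate passes
    audit: list[str]

def evaluate_hybrid(response: dict) -> PipelineResult:
    """Deterministic gate first; spend LLM-judge budget only on responses that pass."""
    det = score_customer_response(response)    # deterministic layer from earlier
    audit = list(det.reasons)

    # Treat any failed deterministic rule as a hard fail (a simplification; real
    # pipelines usually reserve hard-fail status for safety and format rules).
    if any(r.startswith("FAIL") for r in det.reasons):
        audit.append("REJECTED at deterministic gate; LLM judge skipped")
        return PipelineResult(False, det.score * 4, 0.0, audit)

    # Probabilistic layer: hypothetical helper assumed to return a 0-60 judge score.
    llm_points = llm_judge_score(response["text"])
    audit.append(f"LLM judge awarded {llm_points:.1f}/60")
    return PipelineResult(True, det.score * 4, llm_points, audit)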

💡 Mental Model: Think of deterministic scoring as the checklist a pilot completes before takeoff. It doesn't tell you whether the flight will be comfortable — that depends on weather, passengers, and dozens of factors that can't be pre-specified. But it guarantees that the doors are closed, fuel is loaded, and instruments are calibrated. You can't fly well without the checklist passing. The LLM judge is the pilot's judgment; the deterministic engine is the checklist.



Three Engine Families: A Preview

The rest of this lesson explores three distinct families of deterministic scoring engines, each with different trade-offs in expressiveness, debuggability, and implementation complexity. Understanding when to reach for which is one of the most practical skills in building evaluation infrastructure.

Rule Systems

Rule-based scoring engines express evaluation logic as a collection of conditions and weights. Inspired by Datalog and expert systems, they let you write declarative statements like "if the response contains a greeting AND is under 200 words, add 2 points." Rules are easy to read, easy to test, and easy to hand to domain experts who don't write code. Their weakness is that complex interdependencies between rules can become hard to manage as the rule set grows.

We'll cover these in depth in Section 2, including how to implement weighted boolean rule engines in Python.

Decision Trees

Decision tree scorers encode evaluation logic as a branching structure where each node asks a yes/no question and routes the input down a specific path. They're ideal when the scoring logic is fundamentally sequential — "first check X, then if X passes check Y, otherwise check Z." Decision trees are visually intuitive and generate naturally readable audit logs ("the response failed at node 3: no concrete resolution provided"). Their limitation is that they can become unwieldy for logic that is genuinely cross-cutting rather than hierarchical.

Section 3 expands on trees and also introduces their generalization: DAG-based metrics.

DAGs (Directed Acyclic Graphs)

DAG metric engines are the most expressive of the three. A DAG allows multiple metrics to depend on shared upstream computations, enabling rich, dependency-aware scoring flows. For example, extracting named entities from a response might feed into three different downstream metrics — and with a DAG, that extraction happens once, with results shared across all dependents. DAGs are powerful but require more upfront design work and carry higher implementation complexity.

📋 Quick Reference Card: The Three Deterministic Engine Families

🔧 Engine                            | 📚 Best For                               | ⚠️ Watch Out For
🎯 Rules (Datalog-style rule sets)   | Cross-cutting, declarative conditions     | Rule interaction complexity
🌲 Trees (decision tree scorers)     | Sequential, hierarchical logic            | Combinatorial path explosion
🔗 DAGs (directed acyclic graphs)    | Shared computation, complex dependencies  | Design and maintenance overhead

All three share the core deterministic promise: same input, same output, traceable why. They differ in how they structure the logic that gets you there.

🧠 Mnemonic: Think R-T-D: Rules Tell you what to check. Trees Tell you in what order. DAGs Tell you how checks relate to each other.


Setting Expectations for What Follows

By the end of this lesson, you'll have working implementations of all three engine families, a clear mental model for when to reach for each, and the ability to build hybrid evaluation pipelines that combine the strengths of deterministic and probabilistic scoring.

The goal isn't to abandon LLM judges — they remain invaluable for the semantic dimensions of quality that resist formalization. The goal is to earn confidence in the system as a whole by ensuring that everything that can be deterministic is deterministic. That's what compliance officers, debugging sessions, and CI/CD pipelines actually need from you.

❌ Wrong thinking: "Deterministic scoring is just regex and string matching — it's too simple for real evaluation."

✅ Correct thinking: "Deterministic scoring is a formal specification of evaluation criteria. Its power comes from its precision and verifiability, not its sophistication. Simple rules, correctly specified, are more valuable than complex rules that can't be tested."

The transition from Section 1 to Section 2 is a transition from why to how. You now understand what's at stake. Let's build the machinery.


Rule-Based Scoring: Logic, Conditions, and Datalog-Style Engines

Deterministic scoring begins with the most fundamental building block in formal logic: the rule. A rule is a statement of the form if this condition holds, then this conclusion follows. When you build a scoring engine from rules, you are encoding your evaluation criteria as explicit, inspectable, and replayable logic — the same response evaluated twice will always produce the same score, and you can always point to exactly which rules fired and why.

This section walks you through the anatomy of scoring rules, the conceptual framework of Datalog-inspired forward-chaining evaluation, and the practical Python machinery for implementing a rule engine you can use today.

Anatomy of a Scoring Rule

Every scoring rule has the same skeleton, regardless of how it is expressed. Understanding each part is essential before you write a single line of code.

A predicate is a function that takes an extracted field (or a set of fields) from the LLM response and returns a boolean. For example, word_count >= 50 is a predicate operating on a numeric field. "sorry" in response_text.lower() is a predicate operating on a string field. Predicates are the atoms of your rule language — they are the smallest units of testable truth.

A condition is a logical combination of predicates. Conditions can be simple (a single predicate) or compound (predicates joined with AND, OR, and NOT). The condition defines when a rule is eligible to fire.

A weight is a numeric value associated with a rule that contributes to the final composite score when the rule fires. Weights can be positive (rewarding a quality) or negative (penalizing a flaw). A rule with weight 0.0 and a boolean verdict functions as a hard gate — a pass/fail check.

A terminal verdict is a special kind of rule outcome that short-circuits further evaluation. If the response contains harmful content, you do not want to continue scoring tone and formatting — you want to immediately return a score of zero and an explanation. Terminal rules implement this early-exit behavior.

Rule Structure:

┌─────────────────────────────────────────────────────┐
│                     SCORING RULE                    │
├─────────────────┬───────────────────────────────────┤
│  CONDITION      │  IF word_count >= 50              │
│                 │  AND NOT contains_apology         │
│                 │  AND language == "en"             │
├─────────────────┼───────────────────────────────────┤
│  WEIGHT         │  +0.25  (contributes to score)    │
├─────────────────┼───────────────────────────────────┤
│  VERDICT        │  PASS / FAIL / TERMINAL_FAIL      │
└─────────────────┴───────────────────────────────────┘

🎯 Key Principle: A scoring rule is not a heuristic guess — it is an explicit contract between your evaluation criteria and your scoring engine. Every rule should be traceable back to a concrete quality requirement.

Datalog-Inspired Thinking: Facts, Derived Facts, and Forward Chaining

To build richer rule systems, it helps to borrow a mental model from Datalog, a declarative logic programming language originally designed for database query reasoning. Datalog operates on two kinds of things: facts (ground truths known directly) and derived facts (truths inferred by applying rules to existing facts).

Applied to LLM evaluation, your facts are the fields you extract directly from the response: the raw text, its word count, the detected language, whether a disclaimer is present, the sentiment score from an off-the-shelf classifier, and so on. These are base-level observations.

Derived facts are computed by applying rules to base facts. For example, is_too_short is a derived fact produced by the rule word_count < 50 → is_too_short. Once derived, is_too_short can be used as an input to other rules — for instance, is_too_short AND is_factual_question → penalize_brevity.

This layered derivation is called forward chaining: you start with your base facts, apply rules to derive new facts, then apply more rules to those derived facts, propagating conclusions until no new facts can be derived. The result is a rich fact base that your final scoring rules can query.

Forward-Chaining Evaluation Flow:

Extracted Fields (Facts)
  word_count = 42
  language = "en"
  contains_apology = True
  sentiment = "negative"
         │
         ▼
Derivation Layer (Derived Facts)
  is_too_short      ← word_count < 50
  is_apologetic     ← contains_apology == True
  is_poor_sentiment ← sentiment == "negative"
         │
         ▼
Composite Rules
  penalize_quality  ← is_too_short AND is_apologetic
  flag_for_review   ← is_poor_sentiment OR penalize_quality
         │
         ▼
Final Score Aggregation
  score = sum(weight for each fired rule)

💡 Mental Model: Think of your extracted fields as rows in a database table. Derivation rules are SQL views that compute new columns from existing ones. Your final scoring rules are queries over those views.

Implementing a Rule Engine in Python

Let's build a clean, extensible rule engine from scratch using Python dataclasses and callable predicates. The design prioritizes readability and serializability — every rule is data, and the evaluation loop is a small, auditable function.

from dataclasses import dataclass, field
from typing import Callable, Dict, Any, List, Optional

## A "fact base" is just a dictionary of extracted fields.
FactBase = Dict[str, Any]

@dataclass
class ScoringRule:
    """
    A single rule in the scoring engine.
    
    - name:      Human-readable identifier (used in audit logs)
    - predicate: Callable that takes a FactBase and returns bool
    - weight:    Score contribution when predicate is True
    - terminal:  If True and predicate fires, evaluation stops immediately
    - version:   Rule version string for change tracking
    """
    name: str
    predicate: Callable[[FactBase], bool]
    weight: float = 0.0
    terminal: bool = False
    version: str = "1.0.0"

@dataclass
class EvaluationResult:
    """Structured output from the rule engine."""
    score: float
    fired_rules: List[str]
    terminal_rule: Optional[str]
    passed: bool
    details: Dict[str, Any] = field(default_factory=dict)


def evaluate(fact_base: FactBase, rules: List[ScoringRule]) -> EvaluationResult:
    """
    Forward-chain through all rules, accumulate weights, and
    short-circuit on any terminal rule that fires.
    """
    score = 0.0
    fired_rules = []
    terminal_rule = None

    for rule in rules:
        try:
            fired = rule.predicate(fact_base)
        except Exception as e:
            # Predicates should never crash silently — surface the error.
            raise RuntimeError(f"Rule '{rule.name}' raised: {e}")

        if fired:
            fired_rules.append(rule.name)
            score += rule.weight

            if rule.terminal:
                terminal_rule = rule.name
                break  # Short-circuit: do not evaluate remaining rules.

    return EvaluationResult(
        score=round(score, 4),
        fired_rules=fired_rules,
        terminal_rule=terminal_rule,
        passed=(terminal_rule is None and score >= 0.0),
    )

This engine does four things: it iterates over an ordered list of rules, evaluates each predicate against the fact base, accumulates weights for fired rules, and stops early if a terminal rule fires. The EvaluationResult dataclass preserves a full audit trail — you always know which rules fired, not just what the final score was.

💡 Real-World Example: A customer support bot evaluation might define a terminal rule contains_pii_leak with weight -1.0 and terminal=True. The moment a response leaks PII, evaluation stops and the response is flagged regardless of how polished its tone is.

Composing Rules with AND, OR, and NOT Logic

Single-predicate rules are useful, but real quality criteria are compound. "The response should be concise but not so short as to be unhelpful" requires combining multiple conditions. Python's lambda and closure system makes this composition clean.

## ── Helper combinators for logical composition ──────────────────────────────

def AND(*predicates):
    """Returns a predicate that fires only if ALL sub-predicates fire."""
    def combined(fb: FactBase) -> bool:
        return all(p(fb) for p in predicates)
    combined.__name__ = "AND(" + ", ".join(p.__name__ for p in predicates) + ")"
    return combined

def OR(*predicates):
    """Returns a predicate that fires if ANY sub-predicate fires."""
    def combined(fb: FactBase) -> bool:
        return any(p(fb) for p in predicates)
    combined.__name__ = "OR(" + ", ".join(p.__name__ for p in predicates) + ")"
    return combined

def NOT(predicate):
    """Returns a predicate that fires when the sub-predicate does NOT fire."""
    def combined(fb: FactBase) -> bool:
        return not predicate(fb)
    combined.__name__ = f"NOT({predicate.__name__})"
    return combined


## ── Base predicates (atoms) ──────────────────────────────────────────────────

def is_too_short(fb: FactBase) -> bool:
    return fb.get("word_count", 0) < 50

def is_too_long(fb: FactBase) -> bool:
    return fb.get("word_count", 0) > 500

def contains_apology(fb: FactBase) -> bool:
    text = fb.get("response_text", "").lower()
    return any(phrase in text for phrase in ["i'm sorry", "i apologize", "unfortunately"])

def is_english(fb: FactBase) -> bool:
    return fb.get("detected_language", "") == "en"

def cites_source(fb: FactBase) -> bool:
    return fb.get("citation_count", 0) >= 1


## ── Composed rules ───────────────────────────────────────────────────────────

RULE_SET = [
    ScoringRule(
        name="appropriate_length",
        predicate=AND(NOT(is_too_short), NOT(is_too_long)),
        weight=0.20,
        version="1.1.0",
    ),
    ScoringRule(
        name="confident_tone",
        predicate=NOT(contains_apology),
        weight=0.15,
        version="1.0.0",
    ),
    ScoringRule(
        name="english_with_citation",
        predicate=AND(is_english, cites_source),
        weight=0.25,
        version="1.0.0",
    ),
    ScoringRule(
        name="flagged_as_harmful",       # Terminal: short-circuit on harm
        predicate=lambda fb: fb.get("harm_score", 0.0) > 0.8,
        weight=-1.0,
        terminal=True,
        version="1.2.0",
    ),
]

## ── Usage ────────────────────────────────────────────────────────────────────

fact_base = {
    "response_text": "The capital of France is Paris.",
    "word_count": 7,
    "detected_language": "en",
    "citation_count": 0,
    "harm_score": 0.02,
}

result = evaluate(fact_base, RULE_SET)
print(f"Score: {result.score}")
print(f"Fired rules: {result.fired_rules}")
print(f"Passed: {result.passed}")
## Score: 0.15
## Fired rules: ['confident_tone']
## Passed: True

Notice that appropriate_length did not fire because the response was too short (7 words < 50). The english_with_citation rule did not fire because citation_count is 0. Only confident_tone fired, contributing 0.15 to the score. The audit trail tells this story immediately.

⚠️ Common Mistake: Naming rules by their implementation rather than their intent. A rule named word_count_gte_50 is harder to reason about in audit logs than one named meets_minimum_length. Name rules after the quality property they test, not the threshold that implements it.

Derivation Layers: Building Derived Facts Before Scoring

For more complex scenarios, you will want a derivation pass that enriches the fact base before the scoring rules run. This separates the concern of what is true from the concern of how we score it — a clean separation that makes both layers easier to test and evolve independently.

A derivation layer is just a function that takes a raw fact base, applies a list of derivation rules, and returns an enriched fact base. Each derivation rule is itself a callable that adds new keys to the dictionary.

from typing import Tuple

## Derivation rules return (key, value) pairs to add to the fact base.
DerivationRule = Callable[[FactBase], Tuple[str, Any]]

def derive_is_too_short(fb: FactBase) -> Tuple[str, Any]:
    return ("is_too_short", fb.get("word_count", 0) < 50)

def derive_is_apologetic(fb: FactBase) -> Tuple[str, Any]:
    text = fb.get("response_text", "").lower()
    apologetic = any(p in text for p in ["i'm sorry", "i apologize", "unfortunately"])
    return ("is_apologetic", apologetic)

def derive_quality_flagged(fb: FactBase) -> Tuple[str, Any]:
    # Derived from other derived facts — this is the layering power.
    flagged = fb.get("is_too_short", False) and fb.get("is_apologetic", False)
    return ("quality_flagged", flagged)


def enrich_fact_base(
    raw_facts: FactBase,
    derivations: List[DerivationRule]
) -> FactBase:
    """Apply derivation rules in order, enriching the fact base at each step."""
    enriched = dict(raw_facts)  # Never mutate the original.
    for derive in derivations:
        key, value = derive(enriched)  # Later rules can see earlier derived facts.
        enriched[key] = value
    return enriched


## Derivations run in order — earlier derivations feed later ones.
DERIVATION_PIPELINE = [
    derive_is_too_short,
    derive_is_apologetic,
    derive_quality_flagged,   # Depends on the two above.
]

raw = {
    "response_text": "I'm sorry, I don't know the answer.",
    "word_count": 8,
}

enriched = enrich_fact_base(raw, DERIVATION_PIPELINE)
## enriched now contains:
## { "is_too_short": True, "is_apologetic": True, "quality_flagged": True, ... }

🎯 Key Principle: The derivation layer is where domain knowledge lives. The scoring layer is where quality thresholds live. Keep them separate and you can update either without touching the other.

Versioning Rule Sets as Plain Data

One of the most powerful properties of a rule-based engine is that the rule set itself is serializable. If you express predicates as strings (evaluated at load time) and store everything else as data, your rule sets become diffable artifacts you can commit to version control, review in pull requests, and compare across releases.

Here is a YAML representation of the same rules we built above:

## rule_set_v1.2.0.yaml
meta:
  name: "customer_support_quality"
  version: "1.2.0"
  created_at: "2024-08-01"
  author: "eval-platform-team"

rules:
  - name: "appropriate_length"
    version: "1.1.0"
    weight: 0.20
    terminal: false
    condition:
      and:
        - field: word_count
          op: gte
          value: 50
        - field: word_count
          op: lte
          value: 500

  - name: "confident_tone"
    version: "1.0.0"
    weight: 0.15
    terminal: false
    condition:
      not:
        field: contains_apology
        op: eq
        value: true

  - name: "flagged_as_harmful"
    version: "1.2.0"
    weight: -1.0
    terminal: true
    condition:
      field: harm_score
      op: gt
      value: 0.8

This YAML file is your rule set's source of truth. Your Python engine loads it at startup, compiles the condition expressions into callable predicates, and proceeds as before. When you change a threshold from 0.8 to 0.7, a git diff shows exactly which rule changed, which version field was bumped, and when. This is the operational discipline that makes deterministic scoring genuinely auditable.
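As a sketch of that load-and-compile step, assuming PyYAML is installed and that conditions use only the field/op/value, and, or, and not forms shown above, the loader below builds ScoringRule objects for the engine defined earlier. The operator names and helper functions are assumptions about one possible condition grammar, not a fixed standard.

import operator
import yaml  # PyYAML, assumed available

OPS = {"eq": operator.eq, "gt": operator.gt, "gte": operator.ge,
       "lt": operator.lt, "lte": operator.le}

def compile_condition(cond: dict):
    """Recursively turn a condition dict from the YAML file into a callable predicate."""
    if "and" in cond:
        parts = [compile_condition(c) for c in cond["and"]]
        return lambda fb: all(p(fb) for p in parts)
    if "or" in cond:
        parts = [compile_condition(c) for c in cond["or"]]
        return lambda fb: any(p(fb) for p in parts)
    if "not" in cond:
        inner = compile_condition(cond["not"])
        return lambda fb: not inner(fb)
    # Leaf comparison: {field, op, value}. A missing field raises KeyError loudly
    # rather than producing a silently wrong score.
    field_name, op, value = cond["field"], OPS[cond["op"]], cond["value"]
    return lambda fb: op(fb[field_name], value)

def load_rule_set(path: str) -> list[ScoringRule]:
    """Build ScoringRule objects (defined earlier) from the versioned YAML file."""
    with open(path) as f:
        spec = yaml.safe_load(f)
    return [
        ScoringRule(
            name=r["name"],
            predicate=compile_condition(r["condition"]),
            weight=r["weight"],
            terminal=r["terminal"],
            version=r["version"],
        )
        for r in spec["rules"]
    ]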

💡 Pro Tip: Include a changelog block in your YAML file alongside the rules. Each version bump should record why the threshold changed — a regression in production, a new policy requirement, a dataset audit. Scores without provenance are just numbers; scores with provenance are evidence.

⚠️ Common Mistake: Storing rule logic in application code rather than in versioned data files. When rules live in code, a change to scoring criteria requires a code deployment. When rules live in data, they can be updated, reviewed, and rolled back independently of application releases.

📋 Quick Reference Card: Rule Engine Component Summary

🔧 Component          | 📚 Role                                    | 🎯 Best Practice
🔍 Predicate          | Tests a single condition on the fact base  | Keep stateless and pure
🔗 Condition          | Logical combination of predicates          | Use AND/OR/NOT combinators
⚖️ Weight             | Numeric score contribution                 | Use signed floats; sum to 1.0
🚨 Terminal Rule      | Short-circuit on critical failure          | Always evaluate first
🗂️ Derivation Layer   | Enriches fact base before scoring          | Keep separate from scoring rules
📁 Rule Set File      | Versioned data source for rules            | Store in YAML/JSON with metadata

Connecting the Pieces

At this point, you have all the components of a working rule-based scoring engine: base predicates that test individual fields, logical combinators that compose them into rich conditions, a derivation layer that builds derived facts before scoring, a forward-chaining evaluator that produces auditable results, and a data-driven rule set format that supports versioning and diffing.

This family of mechanisms — simple, transparent, and fully reproducible — forms the bedrock of deterministic evaluation. In the next section, we extend this foundation into tree-structured and graph-structured scoring flows, where the order and dependencies between scoring decisions become first-class citizens of the evaluation architecture.

🧠 Mnemonic: P-C-W-T: Predicate tests facts, Condition composes predicates, Weight scores conditions, Terminal gates the whole pipeline. Every rule is just these four elements arranged deliberately.

Decision Trees and DAG Metrics as Scoring Engines

Flat rule lists take you surprisingly far, but they have a ceiling. Once your scoring logic starts depending on which conditions fired earlier, or when certain criteria only make sense after others have been met, a linear list of rules starts to bend under the weight of its own conditionals. This section introduces two more powerful structures from the deterministic toolkit: decision trees and Directed Acyclic Graphs (DAGs). Both are fully reproducible and auditable — every score can be traced back to a precise execution path — but each offers a different shape of expressiveness.

Decision Trees as Interpretable Scoring Structures

A decision tree organizes scoring logic as a hierarchy of binary (or multi-way) conditions. Each internal node poses a question about the input. Each branch represents one possible answer. Each leaf node holds a final score, label, or action. The evaluator walks the tree from root to leaf, making one decision at each node, until it arrives at a terminal value.

This structure maps naturally onto the way human evaluators actually reason about complex outputs. Consider a customer-support chatbot evaluation. A human reviewer might think: "First, did the response answer the question at all? If not, it's a zero — don't bother checking tone. If yes, was the answer factually correct? If not, partial credit. If yes, was the tone appropriate? If yes, full marks." That reasoning is already a tree.

                    [Did the response address the question?]
                           /                    \
                         NO                    YES
                          |                     |
                      Score: 0        [Is the answer factually correct?]
                                           /              \
                                         NO              YES
                                          |               |
                                      Score: 0.4   [Is the tone appropriate?]
                                                      /         \
                                                    NO          YES
                                                     |           |
                                                 Score: 0.7   Score: 1.0

Notice what this tree gives you that a flat rule list cannot express cleanly: mutually exclusive branches. The tone check only runs if the factual check already passed. There is no risk of a tone bonus accidentally inflating the score of a factually wrong response. In a flat rule list, you would have to add explicit guard conditions — if factually_correct and tone_appropriate — and manage those interactions manually. In a tree, the structure itself enforces the evaluation order.

🎯 Key Principle: A decision tree's branching structure encodes preconditions implicitly. A node deeper in the tree is only evaluated when all ancestor conditions on that path have been satisfied. This eliminates entire classes of scoring inconsistencies.

When Trees Outperform Flat Rule Lists

Flat rules shine when criteria are genuinely independent. If checking for profanity, checking response length, and checking for a required disclaimer are all orthogonal, a weighted rule list is simpler and equally correct. But trees earn their keep in three specific situations.

Hierarchical criteria occur when some dimensions of quality only matter in the presence of others. Code generation evaluation is a classic case: if the output is not valid Python syntax, there is no point checking whether the algorithm is efficient. The syntactic validity check gates all downstream quality checks.

Mutually exclusive scoring regimes arise when the same surface feature means different things depending on context. A short response might score highly for a yes/no question but poorly for an explanation request. A tree can route those two cases to completely separate sub-trees, each calibrated for its context, without any cross-contamination.

Hierarchical penalty/bonus structures appear when you want a bonus to apply only within a passing tier, not universally. For example, a response that passes a basic quality bar might earn bonus points for citing sources; a response that fails the basic bar should not receive that bonus even if it happens to include citations.

💡 Mental Model: Think of a flat rule list as a checklist where every item is evaluated regardless. A decision tree is an interview where each follow-up question depends on the previous answer. The interview is more efficient and more coherent — but harder to modify on the fly.

Let's implement a lightweight decision tree scorer in Python. The design uses plain dictionaries to represent nodes, making the tree fully serializable to JSON or YAML for versioning.

from typing import Any, Callable, Dict, Optional

## A node is either a decision node (with condition + branches)
## or a leaf node (with a final score).
DecisionTree = Dict[str, Any]

def evaluate_tree(node: DecisionTree, context: Dict[str, Any]) -> float:
    """
    Walk a decision tree and return the leaf score.

    Node schema:
      - Leaf:     {"score": float, "label": str}
      - Decision: {"condition": callable, "if_true": node, "if_false": node}
    """
    if "score" in node:
        # Leaf node: return the terminal score
        return node["score"]

    # Decision node: evaluate the condition and recurse
    condition_fn: Callable[[Dict[str, Any]], bool] = node["condition"]
    branch = node["if_true"] if condition_fn(context) else node["if_false"]
    return evaluate_tree(branch, context)


## --- Build the customer-support evaluation tree ---

support_eval_tree: DecisionTree = {
    "condition": lambda ctx: ctx.get("addresses_question", False),
    "if_false": {"score": 0.0, "label": "did_not_address"},
    "if_true": {
        "condition": lambda ctx: ctx.get("factually_correct", False),
        "if_false": {"score": 0.4, "label": "addressed_but_wrong"},
        "if_true": {
            "condition": lambda ctx: ctx.get("tone_appropriate", False),
            "if_false": {"score": 0.7, "label": "correct_poor_tone"},
            "if_true":  {"score": 1.0, "label": "excellent"},
        },
    },
}

## --- Run two test cases ---

case_a = {"addresses_question": True, "factually_correct": True, "tone_appropriate": False}
case_b = {"addresses_question": True, "factually_correct": False, "tone_appropriate": True}

print(f"Case A score: {evaluate_tree(support_eval_tree, case_a)}")  # 0.7
print(f"Case B score: {evaluate_tree(support_eval_tree, case_b)}")  # 0.4
## Note: Case B has good tone but wrong facts — the tree correctly ignores the tone bonus.

This implementation is deliberately minimal. Real deployments typically serialize the tree to YAML (replacing lambdas with named condition references), which makes it diffable in version control and auditable by non-engineers.

⚠️ Common Mistake: Trying to store conditions as raw lambda functions in JSON — lambdas are not serializable. Instead, register conditions in a lookup dictionary keyed by string names and reference those names in your tree definition. The evaluator resolves names to functions at load time.
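One way to apply that advice is sketched below: named condition functions live in a registry, and the tree references them by string, so the tree itself is pure data that can be stored as JSON or YAML and diffed in version control. The registry contents mirror the conditions used in the example tree above.

# Named conditions live in code; the tree references them by string.
CONDITION_REGISTRY = {
    "addresses_question": lambda ctx: ctx.get("addresses_question", False),
    "factually_correct":  lambda ctx: ctx.get("factually_correct", False),
    "tone_appropriate":   lambda ctx: ctx.get("tone_appropriate", False),
}

serializable_tree = {
    "condition": "addresses_question",          # a string, not a lambda
    "if_false": {"score": 0.0, "label": "did_not_address"},
    "if_true": {
        "condition": "factually_correct",
        "if_false": {"score": 0.4, "label": "addressed_but_wrong"},
        "if_true": {
            "condition": "tone_appropriate",
            "if_false": {"score": 0.7, "label": "correct_poor_tone"},
            "if_true":  {"score": 1.0, "label": "excellent"},
        },
    },
}

def evaluate_named_tree(node, context):
    """Same walk as evaluate_tree, but resolves condition names at evaluation time."""
    if "score" in node:
        return node["score"]
    condition_fn = CONDITION_REGISTRY[node["condition"]]
    branch = node["if_true"] if condition_fn(context) else node["if_false"]
    return evaluate_named_tree(branch, context)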

Directed Acyclic Graphs for Metric Computation

Decision trees improve on flat rules, but they still have one limitation: a tree is a single path. You traverse root to one leaf, and that's your score. Some evaluation problems don't fit that shape. They require computing multiple sub-scores that then combine, where some sub-scores depend on others.

This is the natural domain of Directed Acyclic Graphs (DAGs). In a DAG scorer, each node represents a metric — a scoring function that produces a numeric value. Directed edges represent dependencies: an edge from node A to node B means B requires A's output before it can run. The graph is acyclic because cycles would create impossible dependency loops (B needs A which needs B).

              [raw_text]
             /    |     \
            /     |      \
     [length] [fluency] [factual_accuracy]
          \      /            |
           \    /             |
       [readability]          |
                 \            |
                  \           |
              [composite_quality_score]

In this DAG, readability depends on both length and fluency. composite_quality_score depends on both readability and factual_accuracy. The evaluator must compute nodes in an order that respects all dependencies — this is called a topological sort.

🎯 Key Principle: A DAG scorer separates what to compute (the graph structure) from in what order to compute it (the topological sort). You define dependencies declaratively; the engine figures out the execution order automatically.

Implementing a DAG Scorer in Python

The implementation has three parts: defining the graph, performing a topological sort, and executing nodes in sorted order while passing intermediate results forward.

from collections import deque
from typing import Any, Callable, Dict, List

## Each node definition:
##   - "deps":    list of node names this node depends on
##   - "compute": function(inputs: dict) -> float
##                inputs is a dict of {dep_name: dep_score} for all deps

NodeDef = Dict[str, Any]
Graph   = Dict[str, NodeDef]


def topological_sort(graph: Graph) -> List[str]:
    """
    Kahn's algorithm: returns nodes in dependency-first order.
    Raises ValueError if a cycle is detected.
    """
    # Build in-degree count and adjacency list
    in_degree = {name: 0 for name in graph}
    dependents: Dict[str, List[str]] = {name: [] for name in graph}

    for name, node in graph.items():
        for dep in node.get("deps", []):
            in_degree[name] += 1
            dependents[dep].append(name)

    # Start with all nodes that have no dependencies
    queue = deque([n for n, deg in in_degree.items() if deg == 0])
    order: List[str] = []

    while queue:
        current = queue.popleft()
        order.append(current)
        for dependent in dependents[current]:
            in_degree[dependent] -= 1
            if in_degree[dependent] == 0:
                queue.append(dependent)

    if len(order) != len(graph):
        raise ValueError("Cycle detected in DAG — scoring graph is invalid.")

    return order


def evaluate_dag(graph: Graph, raw_input: Dict[str, Any]) -> Dict[str, float]:
    """
    Execute all nodes in topological order, collecting scores.
    Returns a dict of {node_name: score} for every node in the graph.
    """
    order = topological_sort(graph)
    scores: Dict[str, float] = {}

    for name in order:
        node = graph[name]
        deps = node.get("deps", [])
        # Gather this node's dependency scores (or raw_input for root nodes)
        dep_scores = {dep: scores[dep] for dep in deps}
        scores[name] = node["compute"](dep_scores if deps else raw_input)

    return scores


## --- Define the evaluation DAG ---

eval_graph: Graph = {
    # Root nodes: read directly from raw_input
    "length_score": {
        "deps": [],
        "compute": lambda inp: min(len(inp["text"].split()) / 100, 1.0),
    },
    "fluency_score": {
        "deps": [],
        # Placeholder: a real impl would call a grammar checker
        "compute": lambda inp: inp.get("fluency", 0.8),
    },
    "factual_accuracy": {
        "deps": [],
        "compute": lambda inp: inp.get("factual_score", 0.9),
    },
    # Derived node: depends on two roots
    "readability": {
        "deps": ["length_score", "fluency_score"],
        "compute": lambda d: 0.4 * d["length_score"] + 0.6 * d["fluency_score"],
    },
    # Final node: depends on a derived node and another root
    "composite_quality": {
        "deps": ["readability", "factual_accuracy"],
        "compute": lambda d: 0.5 * d["readability"] + 0.5 * d["factual_accuracy"],
    },
}

## --- Run the scorer ---

raw = {"text": "The mitochondria is the powerhouse of the cell. " * 5,
       "fluency": 0.85,
       "factual_score": 0.95}

all_scores = evaluate_dag(eval_graph, raw)
for metric, score in all_scores.items():
    print(f"{metric:25s}: {score:.3f}")

The output shows every intermediate score alongside the final composite — a complete audit trail. If composite_quality comes out lower than expected, you can immediately see whether readability or factual_accuracy is the culprit, and drill further into length_score versus fluency_score if readability is the weak link.

💡 Pro Tip: Store the full all_scores dictionary, not just the final value, alongside every evaluated LLM output. This makes regression analysis across model versions trivial — you can see exactly which sub-metric changed when a new model ships.

⚠️ Common Mistake: Forgetting to handle root nodes correctly. Root nodes have no dependencies, so their compute function receives raw_input, while derived nodes receive dep_scores. Always distinguish these two calling conventions in your implementation — mixing them is a frequent source of KeyError bugs.

Visualizing DAG Execution Flow

It helps to see how the topological sort translates into an execution timeline:

Execution Wave 1 (no dependencies):
  ├─ length_score     → 0.400
  ├─ fluency_score    → 0.850
  └─ factual_accuracy → 0.950

Execution Wave 2 (depends only on Wave 1):
  └─ readability      → 0.4×0.400 + 0.6×0.850 = 0.670

Execution Wave 3 (depends on Wave 2 + Wave 1):
  └─ composite_quality → 0.5×0.670 + 0.5×0.950 = 0.810

Final audit trail: all five scores recorded.

Nodes in the same wave have no dependencies on each other, which means they can run in parallel if performance matters. For CPU-bound metric computation at scale, this property is valuable: your topological sort gives you the parallelism structure for free.
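If you want to exploit that parallelism, one possible sketch using only the standard library is shown below. It groups ready nodes into waves and runs each wave in a thread pool; this assumes the compute functions are thread-safe and is a variation on, not part of, the evaluate_dag implementation above.

from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict

def evaluate_dag_parallel(graph: Graph, raw_input: Dict[str, Any]) -> Dict[str, float]:
    """Like evaluate_dag, but runs each dependency wave concurrently."""
    remaining = dict(graph)
    scores: Dict[str, float] = {}

    with ThreadPoolExecutor() as pool:
        while remaining:
            # A wave is every remaining node whose dependencies are all computed.
            wave = [name for name, node in remaining.items()
                    if all(d in scores for d in node.get("deps", []))]
            if not wave:
                raise ValueError("Cycle detected in DAG: scoring graph is invalid.")

            futures = {}
            for name in wave:
                node = remaining.pop(name)
                deps = node.get("deps", [])
                arg = {d: scores[d] for d in deps} if deps else raw_input
                futures[name] = pool.submit(node["compute"], arg)

            # Collect the wave's results before starting the next wave.
            for name, fut in futures.items():
                scores[name] = fut.result()

    return scores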

🤔 Did you know? The topological sort used here (Kahn's algorithm) has O(V + E) time complexity, where V is the number of metrics and E is the number of dependency edges. Even for very large evaluation graphs with hundreds of metrics, the sort itself is essentially free compared to the cost of actually computing the metrics.

Trade-Off Comparison Across the Three Families

Now that you've seen all three families in action — flat rule lists (from the previous section), decision trees, and DAGs — it's worth comparing them directly across the dimensions that matter most for production evaluation systems.

📋 Quick Reference Card: Deterministic Scorer Families

Dimension                     | 🔧 Rule Lists                                 | 🌳 Decision Trees                               | 🔗 DAG Metrics
🎯 Expressiveness             | Low–Medium: independent criteria only         | Medium–High: hierarchical, exclusive branches   | High: arbitrary dependencies, reuse of sub-scores
🔍 Debuggability              | Excellent: linear trace                       | Very good: single path from root to leaf        | Good: full score map; more nodes to inspect
🔒 Auditability               | Excellent                                     | Excellent                                       | Excellent (every sub-score recorded)
🧠 Implementation complexity  | Very low                                      | Low–Medium                                      | Medium (requires topological sort)
📚 Versionability             | Simple YAML/JSON                              | Moderate (condition serialization)              | Moderate (graph + condition serialization)
🔧 Best for                   | Independent quality checks, compliance flags  | Gated criteria, multi-regime scoring            | Composite metrics with reusable sub-scores

The table reveals a clear progression. Rule lists are your default starting point — reach for them first. When you find yourself writing guard conditions like if A_passed and B_passed, that's a signal to reach for a tree. When you notice the same intermediate calculation being re-used across multiple final metrics, or when your scoring logic is genuinely a computation graph rather than a flow, reach for a DAG.

❌ Wrong thinking: "DAGs are strictly better than trees, so I should always use DAGs."

✅ Correct thinking: "DAGs add implementation complexity and cognitive overhead. Use the simplest structure that correctly encodes your scoring logic."

💡 Real-World Example: A content moderation pipeline at a large platform might use all three in concert. A rule list checks for banned keywords and hard policy violations (fast, binary, independent). A decision tree routes content into different scoring regimes based on content category (different criteria apply to news vs. fiction vs. user reviews). A DAG computes composite safety scores that combine toxicity, bias, and factual-risk sub-scores, where the factual-risk calculation reuses the bias sub-score as an input.

Combining Trees and DAGs

Nothing prevents using these structures together. A common pattern is to use a DAG to compute sub-scores, then feed those sub-scores as inputs to a decision tree that produces the final categorical verdict.

DAG Phase (numeric sub-scores):
  length_score  ─┐
  fluency_score ─┴─► readability_score ─┐
  factual_score ────────────────────────┤─► [inputs to tree]
  citation_score ───────────────────────┘

Tree Phase (categorical verdict based on DAG outputs):
  [readability_score >= 0.6?]
       ├─ NO  → verdict: "REJECT"
       └─ YES → [factual_score >= 0.8?]
                    ├─ NO  → verdict: "REVIEW"
                    └─ YES → verdict: "APPROVE"

This hybrid approach gives you the computation reuse benefits of a DAG and the categorical gating benefits of a tree, without cramming both concerns into a single structure.
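A sketch of how the two phases connect in code, reusing evaluate_dag and eval_graph from earlier; the verdict thresholds (0.6 and 0.8) and the condition names are illustrative assumptions.

# Tree phase: conditions read from the DAG's sub-score dictionary.
VERDICT_CONDITIONS = {
    "readable_enough": lambda s: s.get("readability", 0.0) >= 0.6,
    "factual_enough":  lambda s: s.get("factual_accuracy", 0.0) >= 0.8,
}

verdict_tree = {
    "condition": "readable_enough",
    "if_false": {"label": "REJECT"},
    "if_true": {
        "condition": "factual_enough",
        "if_false": {"label": "REVIEW"},
        "if_true":  {"label": "APPROVE"},
    },
}

def evaluate_verdict_tree(node, sub_scores):
    """Walk the tree using the DAG's sub-scores as the decision context."""
    if "condition" not in node:
        return node["label"]
    fn = VERDICT_CONDITIONS[node["condition"]]
    branch = node["if_true"] if fn(sub_scores) else node["if_false"]
    return evaluate_verdict_tree(branch, sub_scores)

def score_and_decide(raw_input):
    sub_scores = evaluate_dag(eval_graph, raw_input)            # DAG phase: numeric sub-scores
    verdict = evaluate_verdict_tree(verdict_tree, sub_scores)   # Tree phase: categorical verdict
    return verdict, sub_scores                                  # Keep both for the audit log

# Example: score_and_decide(raw) -> ("APPROVE", {...all sub-scores...})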

🧠 Mnemonic: Think of the DAG as your calculator and the tree as your decision-maker. The calculator crunches numbers; the decision-maker interprets them into actions.

Practical Guidance on Choosing a Structure

When you sit down to design a new evaluation pipeline, the choice of structure is rarely obvious from first principles. A more reliable approach is to start from your scoring requirements and let them guide you:

🔧 Start with a rule list if:

  • Each criterion has a clear, independent definition
  • Criteria don't gate each other
  • You need non-engineers to read and modify the scoring logic quickly

🌳 Upgrade to a decision tree if:

  • Some criteria only make sense after others pass
  • You have distinct scoring regimes for different input types
  • You want the structure itself to prevent illogical score combinations

🔗 Upgrade to a DAG if:

  • Some sub-computations feed into multiple higher-level metrics
  • Your scoring logic is genuinely a pipeline of dependent transformations
  • You want to record and analyze intermediate sub-scores across your dataset

All three structures share the properties that make them valuable for production LLM evaluation: full reproducibility (same input always yields same output), version control compatibility (the structure can be serialized and diffed), and auditability (every scoring decision can be traced to a specific node or rule). These aren't minor conveniences — in regulated industries, in A/B testing, and in any context where you need to explain why a score changed between model versions, these properties are non-negotiable.

The next section puts these structures to work in a complete end-to-end implementation, showing how raw LLM outputs flow through extraction, into whichever structure fits the evaluation task, and out to final scores that inform model decisions.

Building a Deterministic Scorer in Practice

Theory becomes valuable only when it ships. In the previous sections, we explored rules, decision trees, and DAGs as distinct deterministic scoring primitives. Now we pull those primitives together into a single, coherent system: a scorer you could realistically deploy to evaluate LLM outputs in a production pipeline. We will work through a concrete scenario end to end—from defining what data the scorer accepts, through the Python implementation of a multi-criterion engine, to packaging the output so downstream systems can consume it reliably.

The scenario we will use throughout this section is customer support response evaluation. An LLM generates responses to customer queries, and we need to score those responses on three dimensions: factual accuracy (did the model cite correct information?), tone appropriateness (was the language professional and empathetic?), and completeness (did the response address all sub-questions in the customer's message?). Each dimension has its own scoring logic, and a final composite score must be produced with a full reasoning trace.

Defining the Input Contract

The single most important design decision you will make for a deterministic scorer is not which algorithm to use—it is what data the scorer accepts as input. We call this the input contract: a formal, typed specification of the structured fields that flow from an upstream extraction step into the scoring engine.

Why does this boundary matter so much? Because a deterministic scorer is only as reproducible as its inputs. If the scorer receives raw LLM text and extracts features internally, you have hidden nondeterminism inside the scorer itself. The correct architecture separates extraction (which may involve an LLM call) from evaluation (which must be deterministic). The extracted fields are the interface between those two worlds.

┌─────────────────────┐      ┌──────────────────────┐      ┌─────────────────────┐
│   Raw LLM Output    │─────▶│  Extraction Layer    │─────▶│  Deterministic      │
│   (free text)       │      │  (LLM or regex or    │      │  Scoring Engine     │
│                     │      │   classifier)        │      │  (rules / tree/DAG) │
└─────────────────────┘      └──────────────────────┘      └─────────────────────┘
                                         │
                              Produces structured
                              ExtractedFields object
                              (the input contract)

For our customer support scenario, the extracted fields might include: a boolean indicating whether the response cited a knowledge-base article, an integer count of customer sub-questions detected, an integer count of sub-questions addressed, a string enum for detected tone, and a float confidence score from a toxicity classifier. These are the fields our scorer will trust absolutely—they are typed, bounded, and validated before the scoring engine ever sees them.

🎯 Key Principle: The input contract should be defined as a schema—not as a loose dictionary. Use Python dataclasses or Pydantic models to enforce types at the boundary. Any field that could be None or out of range should be caught before it enters the scorer.
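
For teams that prefer Pydantic at this boundary, a minimal sketch of the same contract might look like the following. The model name ExtractedFieldsModel is illustrative, and the fields mirror the dataclass version defined in the next section.

from enum import Enum
from pydantic import BaseModel, Field, model_validator


class ToneLabel(str, Enum):
    PROFESSIONAL = "professional"
    NEUTRAL = "neutral"
    INFORMAL = "informal"
    HOSTILE = "hostile"


class ExtractedFieldsModel(BaseModel):
    """Illustrative Pydantic variant of the input contract (not the reference implementation)."""
    cited_kb_article: bool
    sub_questions_detected: int = Field(ge=0)
    sub_questions_addressed: int = Field(ge=0)
    tone: ToneLabel
    toxicity_score: float = Field(ge=0.0, le=1.0)  # bounds enforced by the schema

    @model_validator(mode="after")
    def addressed_cannot_exceed_detected(self) -> "ExtractedFieldsModel":
        if self.sub_questions_addressed > self.sub_questions_detected:
            raise ValueError("Cannot address more sub-questions than were detected")
        return self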

Step-by-Step Python Implementation

We will implement a scorer that combines a rule layer (for tone and citation checks) with a small decision tree (for completeness scoring). The two sub-scores are then combined by a weighted aggregation node, mirroring the DAG pattern described in the previous section.

Defining the Data Model

The first code block establishes the input contract and the output contract together. A score result object should be just as strongly typed as the input—it is the artifact that every downstream system will consume.

from __future__ import annotations
from dataclasses import dataclass, field, asdict
from enum import Enum
from typing import Optional
import json


class ToneLabel(str, Enum):
    PROFESSIONAL = "professional"
    NEUTRAL = "neutral"
    INFORMAL = "informal"
    HOSTILE = "hostile"


@dataclass
class ExtractedFields:
    """Input contract: structured fields produced by the extraction layer."""
    cited_kb_article: bool          # Did the response cite a knowledge-base source?
    sub_questions_detected: int     # How many sub-questions were in the customer query?
    sub_questions_addressed: int    # How many sub-questions does the response address?
    tone: ToneLabel                 # Classified tone of the response
    toxicity_score: float           # 0.0 (clean) to 1.0 (highly toxic)

    def __post_init__(self):
        # Validate bounds immediately on construction
        if not (0.0 <= self.toxicity_score <= 1.0):
            raise ValueError(f"toxicity_score must be in [0, 1], got {self.toxicity_score}")
        if self.sub_questions_addressed > self.sub_questions_detected:
            raise ValueError("Cannot address more sub-questions than were detected")


@dataclass
class DimensionScore:
    """Score for a single evaluation dimension, with its reasoning trace."""
    dimension: str          # e.g. "tone", "completeness"
    score: float            # Normalized 0.0–1.0
    weight: float           # Weight in the final composite
    reasoning: str          # Human-readable explanation of how the score was derived
    passed: bool            # Convenience flag: did this dimension meet the threshold?


@dataclass
class ScorerResult:
    """Output contract: the full scored result returned to downstream systems."""
    composite_score: float
    passed: bool
    dimensions: list[DimensionScore] = field(default_factory=list)
    scorer_version: str = "1.0.0"

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

Notice that ExtractedFields.__post_init__ validates invariants immediately. This is your first line of defense: bad data should raise loudly at the boundary, not produce a silently wrong score inside the engine.
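
As a quick illustration of that fail-loud behavior:

## An out-of-range field is rejected at construction time, before any scoring runs.
try:
    ExtractedFields(
        cited_kb_article=True,
        sub_questions_detected=2,
        sub_questions_addressed=1,
        tone=ToneLabel.NEUTRAL,
        toxicity_score=1.7,  # invalid: outside [0, 1]
    )
except ValueError as exc:
    print(f"Rejected at the boundary: {exc}")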

Implementing the Rule Layer and Decision Tree

With the data model in place, the second code block builds the actual scoring logic. We implement tone and citation scoring as weighted rules, and completeness scoring as a decision tree. All three produce DimensionScore objects, which the aggregator combines.

from typing import Callable

## ── Constants ────────────────────────────────────────────────────────────────
PASS_THRESHOLD = 0.65       # Composite score must meet or exceed this to "pass"
TOXICITY_HARD_CUTOFF = 0.4  # Any response at or above this fails tone outright


## ── Tone Scorer (rule-based) ──────────────────────────────────────────────────
def score_tone(fields: ExtractedFields) -> DimensionScore:
    """
    Hard-failure rule: toxic responses score 0.0 regardless of tone label.
    Otherwise, tone labels map to scores via a lookup table.
    """
    TONE_SCORES = {
        ToneLabel.PROFESSIONAL: 1.0,
        ToneLabel.NEUTRAL:      0.75,
        ToneLabel.INFORMAL:     0.40,
        ToneLabel.HOSTILE:      0.0,
    }

    # Hard rule: toxicity classifier overrides tone label
    if fields.toxicity_score >= TOXICITY_HARD_CUTOFF:
        return DimensionScore(
            dimension="tone",
            score=0.0,
            weight=0.35,
            reasoning=(
                f"HARD FAIL: toxicity_score={fields.toxicity_score:.2f} "
                f">= cutoff={TOXICITY_HARD_CUTOFF}. Tone label "
                f"'{fields.tone}' ignored."
            ),
            passed=False,
        )

    raw_score = TONE_SCORES[fields.tone]
    reasoning = (
        f"Tone label '{fields.tone}' maps to score {raw_score:.2f}. "
        f"Toxicity score {fields.toxicity_score:.2f} is below cutoff."
    )
    return DimensionScore(
        dimension="tone",
        score=raw_score,
        weight=0.35,
        reasoning=reasoning,
        passed=raw_score >= 0.5,
    )


## ── Citation Scorer (rule-based) ──────────────────────────────────────────────
def score_citation(fields: ExtractedFields) -> DimensionScore:
    """Binary rule: cited = 1.0, not cited = 0.5 (partial credit, not a hard fail)."""
    score = 1.0 if fields.cited_kb_article else 0.5
    reasoning = (
        "Response cited a knowledge-base article." if fields.cited_kb_article
        else "No knowledge-base citation found; partial credit applied."
    )
    return DimensionScore(
        dimension="citation",
        score=score,
        weight=0.25,
        reasoning=reasoning,
        passed=score >= 0.5,
    )


## ── Completeness Scorer (decision tree) ──────────────────────────────────────
def score_completeness(fields: ExtractedFields) -> DimensionScore:
    """
    Decision tree over (sub_questions_detected, sub_questions_addressed).

    Tree structure:
      Root: Were any sub-questions detected?
       ├── No (0 detected) → score 1.0  (nothing to address)
       └── Yes
            ├── All addressed (ratio = 1.0) → score 1.0
            ├── Most addressed (ratio >= 0.75) → score 0.75
            ├── Half addressed (ratio >= 0.50) → score 0.50
            └── Below half → score 0.25
    """
    detected = fields.sub_questions_detected
    addressed = fields.sub_questions_addressed

    # Leaf 1: no sub-questions detected — trivially complete
    if detected == 0:
        return DimensionScore(
            dimension="completeness",
            score=1.0,
            weight=0.40,
            reasoning="No sub-questions detected in the query; completeness is trivially satisfied.",
            passed=True,
        )

    ratio = addressed / detected

    # Leaf 2–5: branching on coverage ratio
    if ratio == 1.0:
        score, label = 1.0, "all"
    elif ratio >= 0.75:
        score, label = 0.75, "most"
    elif ratio >= 0.50:
        score, label = 0.50, "half"
    else:
        score, label = 0.25, "fewer than half"

    reasoning = (
        f"{label.capitalize()} sub-questions addressed "
        f"({addressed}/{detected} = {ratio:.0%}). "
        f"Completeness score: {score:.2f}."
    )
    return DimensionScore(
        dimension="completeness",
        score=score,
        weight=0.40,
        reasoning=reasoning,
        passed=score >= 0.5,
    )


## ── Aggregator ────────────────────────────────────────────────────────────────
def aggregate(dimensions: list[DimensionScore]) -> tuple[float, bool]:
    """Weighted average of dimension scores. Any hard-failed dimension can veto."""
    total_weight = sum(d.weight for d in dimensions)
    composite = sum(d.score * d.weight for d in dimensions) / total_weight
    passed = composite >= PASS_THRESHOLD
    return round(composite, 4), passed


## ── Top-Level Scorer Entry Point ──────────────────────────────────────────────
def score_response(fields: ExtractedFields) -> ScorerResult:
    """Run all dimension scorers and return a fully traced ScorerResult."""
    dims = [
        score_tone(fields),
        score_citation(fields),
        score_completeness(fields),
    ]
    composite, passed = aggregate(dims)
    return ScorerResult(
        composite_score=composite,
        passed=passed,
        dimensions=dims,
    )

The architecture is clean: each dimension scorer is a pure function from ExtractedFields to DimensionScore. The aggregator is also a pure function. You can swap, add, or remove dimension scorers without touching the others—exactly the modularity that makes this system maintainable.

💡 Pro Tip: Keep each dimension scorer in its own function (or even its own module). This makes it trivial to version individual scoring rules independently, which matters when you need to roll back a tone rule without affecting completeness logic.
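
As a minimal sketch of that idea (the registry name and the version strings below are illustrative, not part of the reference implementation):

from typing import Callable

## Hypothetical per-dimension version registry. Each scorer module would own its entry.
DIMENSION_SCORERS: dict[str, tuple[str, Callable[[ExtractedFields], DimensionScore]]] = {
    "tone":         ("1.2.0", score_tone),
    "citation":     ("1.0.1", score_citation),
    "completeness": ("2.0.0", score_completeness),
}

def score_with_provenance(fields: ExtractedFields) -> list[tuple[str, DimensionScore]]:
    """Run every registered scorer and pair each result with the rule version that produced it."""
    return [(version, scorer(fields)) for version, scorer in DIMENSION_SCORERS.values()]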

Returning Scored Results with Reasoning Traces

A score without an explanation is a liability. When an evaluation system flags a response as failing, a developer or content reviewer needs to understand why—and they need to understand it without re-running the scorer or reading source code. This is why every DimensionScore carries a reasoning string, and why the ScorerResult aggregates them into a complete audit trail.

The reasoning trace is not documentation—it is a first-class output. Consider how the result looks when serialized:

## Example usage
example_fields = ExtractedFields(
    cited_kb_article=False,
    sub_questions_detected=4,
    sub_questions_addressed=3,
    tone=ToneLabel.NEUTRAL,
    toxicity_score=0.05,
)

result = score_response(example_fields)
print(result.to_json())

This produces output like:

{
  "composite_score": 0.7125,
  "passed": true,
  "dimensions": [
    {
      "dimension": "tone",
      "score": 0.75,
      "weight": 0.35,
      "reasoning": "Tone label 'neutral' maps to score 0.75. Toxicity score 0.05 is below cutoff.",
      "passed": true
    },
    {
      "dimension": "citation",
      "score": 0.5,
      "weight": 0.25,
      "reasoning": "No knowledge-base citation found; partial credit applied.",
      "passed": true
    },
    {
      "dimension": "completeness",
      "score": 0.75,
      "weight": 0.40,
      "reasoning": "Most sub-questions addressed (2/3 = 67%). Completeness score: 0.75.",
      "passed": true
    }
  ],
  "scorer_version": "1.0.0"
}

Every number has a sentence next to it. A reviewer can read this output and immediately understand what the scorer observed and how it reached its conclusion. This is what auditability looks like in practice.

⚠️ Common Mistake: Writing reasoning text that merely restates the score. The reasoning should explain the logic that was applied (which inputs were observed, which rule or branch fired), not simply echo the number, which already lives in the score field. Reasoning that says "score is 0.75 because score is 0.75" provides zero information.

Unit-Testing Deterministic Scorers

Deterministic scorers are among the most testable artifacts in software engineering. Given the same input, they must always return the same output—which means every test is a specification, and every specification is a test. There are three complementary testing strategies you should apply.

Boundary Case Tests

Boundary cases are the inputs that sit exactly on a decision boundary—the ratio equal to exactly 0.75, the toxicity score equal to exactly 0.4, the case where zero sub-questions are detected. These are the inputs where bugs hide. For every threshold in your scorer, write a test for the value exactly at the threshold, one just above, and one just below.

import pytest
from your_module import (
    ExtractedFields, ToneLabel, score_response,
    score_completeness, score_tone, TOXICITY_HARD_CUTOFF
)


## ── Boundary: toxicity cutoff ─────────────────────────────────────────────────
class TestToneScorerBoundaries:
    def test_toxicity_exactly_at_cutoff_fails(self):
        fields = ExtractedFields(
            cited_kb_article=True,
            sub_questions_detected=0,
            sub_questions_addressed=0,
            tone=ToneLabel.PROFESSIONAL,
            toxicity_score=TOXICITY_HARD_CUTOFF,  # exactly 0.4
        )
        result = score_tone(fields)
        assert result.score == 0.0
        assert result.passed is False
        assert "HARD FAIL" in result.reasoning

    def test_toxicity_just_below_cutoff_passes(self):
        fields = ExtractedFields(
            cited_kb_article=True,
            sub_questions_detected=0,
            sub_questions_addressed=0,
            tone=ToneLabel.PROFESSIONAL,
            toxicity_score=TOXICITY_HARD_CUTOFF - 0.001,
        )
        result = score_tone(fields)
        assert result.score == 1.0
        assert result.passed is True


## ── Boundary: completeness ratio ──────────────────────────────────────────────
class TestCompletenessBoundaries:
    @pytest.mark.parametrize("detected,addressed,expected_score", [
        (4, 4, 1.0),   # all
        (4, 3, 0.75),  # exactly 75%
        (4, 2, 0.50),  # exactly 50%
        (4, 1, 0.25),  # below 50%
        (0, 0, 1.0),   # no sub-questions: trivially complete
    ])
    def test_completeness_score_boundaries(
        self, detected, addressed, expected_score
    ):
        fields = ExtractedFields(
            cited_kb_article=True,
            sub_questions_detected=detected,
            sub_questions_addressed=addressed,
            tone=ToneLabel.NEUTRAL,
            toxicity_score=0.0,
        )
        result = score_completeness(fields)
        assert result.score == expected_score


## ── Regression fixture ────────────────────────────────────────────────────────
class TestRegressionFixtures:
    """Pin known input→output pairs. Any change to scorer logic will break these,
    prompting an intentional review rather than a silent regression."""

    KNOWN_GOOD = {
        "composite_score": 0.7125,
        "passed": True,
    }

    def test_regression_known_input(self):
        fields = ExtractedFields(
            cited_kb_article=False,
            sub_questions_detected=4,
            sub_questions_addressed=3,
            tone=ToneLabel.NEUTRAL,
            toxicity_score=0.05,
        )
        result = score_response(fields)
        assert result.composite_score == self.KNOWN_GOOD["composite_score"]
        assert result.passed == self.KNOWN_GOOD["passed"]


## ── Input contract validation ─────────────────────────────────────────────────
class TestInputContractValidation:
    def test_toxicity_out_of_range_raises(self):
        with pytest.raises(ValueError, match="toxicity_score"):
            ExtractedFields(
                cited_kb_article=True,
                sub_questions_detected=1,
                sub_questions_addressed=1,
                tone=ToneLabel.NEUTRAL,
                toxicity_score=1.5,  # invalid
            )

    def test_addressed_exceeds_detected_raises(self):
        with pytest.raises(ValueError, match="Cannot address more"):
            ExtractedFields(
                cited_kb_article=True,
                sub_questions_detected=2,
                sub_questions_addressed=5,  # impossible
                tone=ToneLabel.NEUTRAL,
                toxicity_score=0.0,
            )

🎯 Key Principle: Regression fixtures are your change-detection system. When you modify a rule, the fixture test will fail—not because the new behavior is wrong, but because it is different. That forced pause is the point. Update the fixture only after deliberately confirming the new behavior is correct.

💡 Mental Model: Think of your boundary case tests as a map of all the decision boundaries in your scorer. If you can visualize the scorer as a set of regions (like tiles on a floor), every boundary test sits on a tile edge. Full boundary coverage means you have walked every edge.

Connecting the Scorer Output to Downstream Systems

A scorer that produces a ScorerResult object is useful inside a single Python process. To be useful across a pipeline—feeding a dashboard, triggering an alert, being stored in a database, or being consumed by a downstream evaluation aggregator—that result must cross a serialization boundary. This is where schema validation becomes essential.

The to_json() method we defined earlier is a starting point, but raw JSON without a schema is an implicit contract that will silently break when someone renames a field. The production-grade approach is to define a JSON Schema (or use a Pydantic model, which auto-generates one) so that any consumer can validate the payload they receive.

from pydantic import BaseModel, field_validator, model_validator
from typing import List


class DimensionScoreSchema(BaseModel):
    dimension: str
    score: float
    weight: float
    reasoning: str
    passed: bool

    @field_validator("score", "weight")
    @classmethod
    def must_be_unit_interval(cls, v: float) -> float:
        if not (0.0 <= v <= 1.0):
            raise ValueError(f"Expected value in [0, 1], got {v}")
        return v


class ScorerResultSchema(BaseModel):
    composite_score: float
    passed: bool
    dimensions: List[DimensionScoreSchema]
    scorer_version: str

    @model_validator(mode="after")
    def weights_must_sum_to_one(self) -> "ScorerResultSchema":
        total = sum(d.weight for d in self.dimensions)
        if abs(total - 1.0) > 1e-6:
            raise ValueError(
                f"Dimension weights must sum to 1.0, got {total:.6f}"
            )
        return self


def validated_score_response(fields: ExtractedFields) -> ScorerResultSchema:
    """Score a response and return a Pydantic-validated result."""
    raw = score_response(fields)
    # Convert dataclass to dict, then validate through Pydantic
    return ScorerResultSchema(**{
        "composite_score": raw.composite_score,
        "passed": raw.passed,
        "dimensions": [
            {
                "dimension": d.dimension,
                "score": d.score,
                "weight": d.weight,
                "reasoning": d.reasoning,
                "passed": d.passed,
            }
            for d in raw.dimensions
        ],
        "scorer_version": raw.scorer_version,
    })

With this wrapper, the weights-sum-to-one invariant is enforced at the output boundary, catching any accidental configuration drift where a developer added a new dimension without rebalancing the weights. The Pydantic model also generates a JSON Schema automatically (ScorerResultSchema.model_json_schema()), which you can commit to your repository and use in OpenAPI definitions, contract tests, or data platform schemas.
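
As a sketch of what committing and enforcing that schema can look like (the file path, directory layout, and use of the jsonschema package are assumptions, not prescribed choices):

import json
from pathlib import Path

import jsonschema  # assumption: the jsonschema package is available for contract tests

## Write the generated schema to a file that lives in version control (path is illustrative).
schema_path = Path("schemas/scorer_result.schema.json")
schema_path.parent.mkdir(parents=True, exist_ok=True)
schema_path.write_text(json.dumps(ScorerResultSchema.model_json_schema(), indent=2))

def test_output_matches_committed_schema():
    """Contract test: every payload the scorer emits must validate against the committed schema."""
    committed = json.loads(schema_path.read_text())
    payload = json.loads(validated_score_response(example_fields).model_dump_json())
    jsonschema.validate(instance=payload, schema=committed)  # raises on any drift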

📋 Quick Reference Card: Scorer Output Integration Checklist

Step What to Do Why It Matters
🔧 Serialize Use model.model_dump_json() Produces validated JSON
🔒 Schema Commit model_json_schema() to repo Enables contract testing
🎯 Version Include scorer_version in output Enables result provenance
📚 Trace Include full dimensions array Powers audit and debugging
🧠 Validate Run Pydantic on ingest, not just on emit Catches upstream drift

🤔 Did you know? Pydantic V2's model_json_schema() generates a JSON Schema draft 2020-12 document that can be used directly with tools like jsonschema, ajv (JavaScript), or published as part of an OpenAPI specification—giving you cross-language contract enforcement for free.

Putting It All Together

Let's review the full data flow one more time, now that each piece is concrete:

Extracted Fields (validated dataclass)
         │
         ▼
 ┌───────────────────────────────────────────┐
 │            score_response()               │
 │                                           │
 │  score_tone()   ─────────────────────┐    │
 │  (rule-based)                        │    │
 │                                      ▼    │
 │  score_citation() ──────────── aggregate()│──▶ ScorerResult
 │  (rule-based)                      │      │    (dataclass)
 │                                    │      │        │
 │  score_completeness() ─────────────┘      │        ▼
 │  (decision tree)                          │  validated_score_response()
 └───────────────────────────────────────────┘  (Pydantic schema)
                                                      │
                                          ┌───────────┴───────────┐
                                          ▼                       ▼
                                     JSON output          Downstream
                                   (with traces)          systems

Every node in this flow is a pure function. Every edge carries a typed, validated object. The reasoning trace is produced inline, never as an afterthought. The output schema is a committed artifact, not an assumption.

Wrong thinking: "I'll add the reasoning traces later once the scoring logic is stable." ✅ Correct thinking: Reasoning traces are part of the scoring logic, not a presentation layer. Build them in from the first iteration.

💡 Real-World Example: A team at a content moderation company built a deterministic scorer for LLM-generated policy explanations. By embedding reasoning traces in every dimension score, they were able to present a human reviewer with a one-sentence justification for each failed check—reducing review time per item from four minutes to under ninety seconds, because reviewers no longer needed to re-derive why something failed.

With the full implementation in hand—input contracts, dimension scorers, reasoning traces, test suites, and validated output—you have a scoring engine that is reproducible by construction, auditable by design, and extensible without risk. The next section turns to what goes wrong in practice: the anti-patterns that undermine each of these properties, and how to recognize them before they reach production.

Common Pitfalls and Anti-Patterns in Deterministic Scoring

Deterministic scorers promise reproducibility, auditability, and debuggability — but only when they are built with discipline. In practice, engineers consistently fall into the same traps when constructing rule-based and graph-based evaluation systems. The irony is that the very properties that make deterministic scorers appealing (explicit logic, traceable paths, versioned rules) become liabilities when the underlying design is sloppy. A poorly maintained rule set is harder to debug than a probabilistic model, because you wrote every line of it and still cannot reason about it. This section catalogs the most common failure modes, explains exactly why they occur, and shows how to avoid them.


Pitfall 1: Rule Explosion

Rule explosion occurs when a rule set grows organically, without architectural discipline, until it becomes too large to reason about, too fragile to change, and internally contradictory. It is the single most common failure mode in production deterministic scorers, and it almost always begins with good intentions.

The pattern is predictable. A team starts with five clean rules for scoring LLM responses. A new edge case appears — a user complains that the scorer misclassified a particular output — so they add a rule to handle it. A few weeks later, a second edge case appears, and another rule is added. After six months, the rule set has forty-seven rules, several of which are logical negations of earlier rules added under time pressure. Nobody on the team can confidently predict the output of the scorer on a new input without running it.

💡 Real-World Example: A content moderation team built a deterministic scorer with rules like if contains_profanity AND NOT is_quoted_text THEN flag. Over eighteen months, the rule set grew to 200+ rules. Rule 47 said if is_educational_context THEN suppress_flag. Rule 183 said if contains_medical_terminology THEN suppress_flag. Rule 201 said if is_educational_context AND contains_medical_terminology THEN flag. Nobody noticed the contradiction until a medical education chatbot started producing inconsistent scores on identical inputs due to non-deterministic rule evaluation order.

The antidote to rule explosion is not fewer rules — it is rule governance. Every rule should have three mandatory fields: a unique identifier, a rationale comment, and a date. Rules that handle special cases should reference the general rule they override. Any rule that has been inactive (never fired) for more than ninety days should be reviewed for removal.

from dataclasses import dataclass, field
from typing import Callable, Optional
from datetime import date

@dataclass
class GovernedRule:
    """A rule with mandatory governance metadata."""
    rule_id: str                        # Unique, stable identifier
    description: str                    # Human-readable intent
    rationale: str                      # Why this rule exists
    added_date: date                    # When it was added
    added_by: str                       # Who is accountable
    overrides: Optional[str] = None     # Which rule_id this supersedes
    condition: Optional[Callable] = field(default=None, repr=False)
    score_delta: float = 0.0
    fire_count: int = field(default=0, repr=False)  # Tracked at runtime

    def evaluate(self, features: dict) -> float:
        if self.condition(features):
            self.fire_count += 1
            return self.score_delta
        return 0.0

## Example governed rule
rule = GovernedRule(
    rule_id="CONTENT_001",
    description="Penalize responses containing unsupported medical claims",
    rationale="FDA compliance requirement, see ticket ENG-4421",
    added_date=date(2024, 3, 15),
    added_by="alice@example.com",
    overrides=None,
    condition=lambda f: f.get("contains_medical_claim") and not f.get("claim_is_cited"),
    score_delta=-0.4
)

This code enforces governance by making rationale and added_by mandatory fields at the data model level. If you cannot explain why a rule exists, you should not be allowed to ship it. Tracking fire_count at runtime also gives you the data you need to prune stale rules during periodic reviews.

⚠️ Common Mistake: Treating the rule set as append-only. Rules should have a lifecycle: proposed, active, deprecated, archived. Without a deprecation path, dead rules accumulate and contradict live ones.
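
As a sketch of what that lifecycle could look like, assuming you add a status field to GovernedRule (the enum and helper names below are illustrative):

from enum import Enum
from datetime import date

class RuleStatus(Enum):
    PROPOSED = "proposed"
    ACTIVE = "active"
    DEPRECATED = "deprecated"
    ARCHIVED = "archived"

def active_rules(rules: list[GovernedRule]) -> list[GovernedRule]:
    """Only ACTIVE rules participate in scoring; deprecated rules stay in version control but no longer fire."""
    return [r for r in rules if getattr(r, "status", RuleStatus.ACTIVE) is RuleStatus.ACTIVE]

def rules_due_for_review(rules: list[GovernedRule], today: date, max_idle_days: int = 90) -> list[GovernedRule]:
    """Flag rules that have never fired and are older than the idle window."""
    return [
        r for r in rules
        if r.fire_count == 0 and (today - r.added_date).days > max_idle_days
    ]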



Pitfall 2: Hardcoding Thresholds Without Documented Rationale

Threshold opacity is subtler than rule explosion but just as damaging. A threshold is any numeric boundary that converts a continuous measurement into a categorical judgment: response length > 500 tokens triggers a penalty, similarity score < 0.7 triggers a mismatch flag, latency > 2000ms triggers a quality deduction. These numbers feel precise and authoritative, but they are often arbitrary — chosen by a developer on a Tuesday afternoon who never wrote down why.

The problem surfaces during calibration. Six months after deployment, a stakeholder asks: "Why is the threshold 0.7 and not 0.65?" Nobody knows. The developer who set it has moved to another team. Changing the threshold without understanding why it was set risks breaking the scorer in non-obvious ways. So the threshold stays, even though the underlying model it was calibrated against has been replaced.

🎯 Key Principle: A threshold that cannot be explained is a threshold that cannot be safely changed. Unexplained thresholds are technical debt that compounds with every model update.

from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class DocumentedThreshold:
    """
    A threshold with mandatory provenance documentation.
    Forces engineers to record why a value was chosen.
    """
    name: str
    value: float
    unit: str                        # e.g., 'tokens', 'cosine_similarity', 'ms'
    rationale: str                   # Why this specific value
    calibration_dataset: str        # What data was used to set it
    calibration_date: date
    owner: str
    review_date: Optional[date] = None  # When to re-examine

    def check(self, measured_value: float) -> bool:
        """Returns True if the measured value exceeds the threshold."""
        return measured_value > self.value

    def describe(self) -> str:
        return (
            f"{self.name} = {self.value} {self.unit}\n"
            f"  Rationale: {self.rationale}\n"
            f"  Calibrated on: {self.calibration_dataset} ({self.calibration_date})\n"
            f"  Owner: {self.owner}"
        )

## Example: a well-documented semantic similarity threshold
similarity_threshold = DocumentedThreshold(
    name="semantic_match_minimum",
    value=0.72,
    unit="cosine_similarity",
    rationale=(
        "P95 of human-labeled 'acceptable' pairs on eval set v3 "
        "was 0.69; added 0.03 margin to reduce false positives. "
        "See analysis: docs/thresholds/semantic_match_v3.md"
    ),
    calibration_dataset="eval_set_v3_human_labeled_500pairs",
    calibration_date=date(2024, 6, 1),
    owner="bob@example.com",
    review_date=date(2024, 12, 1)  # Re-examine if embedding model changes
)

print(similarity_threshold.describe())

This pattern makes the calibration story explicit and linkable. The review_date field is particularly important: thresholds should be tied to the conditions under which they were calibrated. When the embedding model changes, when the evaluation dataset is refreshed, or when the distribution of LLM outputs shifts, thresholds need to be revisited. A scheduled review date makes this visible in a way that a raw float in a config file never does.
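
A small helper, sketched here rather than taken from any library, can surface overdue thresholds in CI or a scheduled report instead of in an incident:

from datetime import date
from typing import Optional

def overdue_thresholds(
    thresholds: list[DocumentedThreshold],
    today: Optional[date] = None,
) -> list[DocumentedThreshold]:
    """Return thresholds whose scheduled review date has already passed."""
    today = today or date.today()
    return [t for t in thresholds if t.review_date is not None and t.review_date < today]

## Example: run in a scheduled job and report anything overdue
for t in overdue_thresholds([similarity_threshold]):
    print(f"Threshold '{t.name}' is past its review date ({t.review_date}); owner: {t.owner}")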

💡 Pro Tip: Store your DocumentedThreshold objects in version control alongside your rule definitions. A diff that changes value=0.72 to value=0.68 but leaves rationale unchanged is a code review red flag — the rationale may no longer be valid.


Pitfall 3: Conflating Extraction Errors with Scoring Errors

One of the most insidious failure modes in deterministic scoring pipelines is error conflation: treating a bad score as evidence of a bad rule, when the actual problem is a bad extracted feature that was fed into an otherwise correct rule.

Consider the architecture of a typical deterministic scorer:

Raw LLM Output
      │
      ▼
┌─────────────┐
│  Extraction │  ← Parse fields: length, entities, citations, tone flags
│    Layer    │
└──────┬──────┘
       │  Extracted Features
       ▼
┌─────────────┐
│   Scoring   │  ← Apply rules, trees, or DAG to features
│    Layer    │
└──────┬──────┘
       │  Score + Explanation
       ▼
    Output

These two layers have completely different failure modes. The extraction layer can fail because a regex is too greedy, because an NLP model misclassifies an entity, or because the output format changed and the parser no longer handles it correctly. The scoring layer can fail because a rule has a logic error, a threshold is miscalibrated, or a DAG dependency is incorrect.

When engineers debug a bad score, they almost always start at the scoring layer — examining rules, checking thresholds, tracing DAG paths. This is backwards. The most common source of bad scores is a bad extracted feature, not a bad rule.

❌ Wrong thinking: "The score is wrong, so the rules are wrong." ✅ Correct thinking: "The score is wrong. First, print every extracted feature and verify it is correct. Then examine the rules."

The fix is to make the two layers independently observable:

from dataclasses import dataclass, field
from typing import Any

@dataclass
class ExtractionResult:
    """Holds extracted features with per-field provenance."""
    features: dict[str, Any]
    extraction_trace: dict[str, str] = field(default_factory=dict)  # feature -> method used
    extraction_warnings: list[str] = field(default_factory=list)

@dataclass
class ScoringResult:
    """Holds the final score with full audit trail."""
    score: float
    extraction: ExtractionResult      # The inputs that produced this score
    rules_fired: list[str]            # Which rule IDs contributed
    rule_contributions: dict[str, float]  # rule_id -> score delta
    scoring_warnings: list[str] = field(default_factory=list)

    def debug_report(self) -> str:
        """Produces a two-layer debug report separating extraction from scoring."""
        lines = ["=== EXTRACTION LAYER ==="]
        for feat, val in self.extraction.features.items():
            method = self.extraction.extraction_trace.get(feat, "unknown")
            lines.append(f"  {feat}: {val!r}  [via {method}]")
        if self.extraction.extraction_warnings:
            lines.append("  WARNINGS:")
            for w in self.extraction.extraction_warnings:
                lines.append(f"    ⚠️  {w}")

        lines.append("\n=== SCORING LAYER ===")
        for rule_id, delta in self.rule_contributions.items():
            lines.append(f"  {rule_id}: {delta:+.3f}")
        lines.append(f"  FINAL SCORE: {self.score:.3f}")
        if self.scoring_warnings:
            for w in self.scoring_warnings:
                lines.append(f"  ⚠️  {w}")

        return "\n".join(lines)

When a score is incorrect, call debug_report() and read the extraction layer first. If citation_count shows 0 but the response clearly contains citations, the extraction layer has a bug. Stop there — no amount of rule-tuning will fix an extractor that cannot see the citations.
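
For illustration, here is what that first look might reveal on a hypothetical failing input (the rule IDs and extraction method names are made up for the example):

## Hypothetical failing case: the response clearly contains citations,
## but the extractor reported zero, so the penalty rule fired incorrectly.
result = ScoringResult(
    score=0.35,
    extraction=ExtractionResult(
        features={"citation_count": 0, "word_count": 812, "tone": "formal"},
        extraction_trace={
            "citation_count": "citation_regex_v3",
            "word_count": "whitespace_split",
            "tone": "tone_classifier_v2",
        },
        extraction_warnings=["citation regex matched 0 spans despite '[1]'-style markers in the text"],
    ),
    rules_fired=["NO_CITATION_PENALTY"],
    rule_contributions={"NO_CITATION_PENALTY": -0.45},
)
print(result.debug_report())
## The extraction section shows citation_count=0 is wrong, so the fix belongs
## in the extractor, not in the NO_CITATION_PENALTY rule.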

⚠️ Common Mistake: Adding a compensating rule (e.g., if citation_count == 0 AND length > 800 THEN boost_score) to work around an extraction bug. This adds a rule that will misfire on genuinely uncited long responses and makes the system harder to reason about. Fix the extractor.



Pitfall 4: Cycle Detection Failure in Hand-Rolled DAGs

When engineers implement their own DAG-based scoring engines without using a library that provides topological sort guarantees, they frequently introduce cyclic dependencies — situations where metric A depends on metric B, which depends on metric C, which depends on metric A. The consequences range from obvious (an infinite loop that crashes the scorer) to silent (an incorrect evaluation order that produces wrong scores without any error).

The silent case is more dangerous. If your hand-rolled DAG evaluator resolves dependencies using a greedy traversal rather than a proper topological sort, it may evaluate nodes in an order that happens to work for your test cases but breaks for inputs with different feature distributions.

  Correct DAG (acyclic):               Broken DAG (cyclic):

  [clarity]   ──► [coherence]          [clarity]   ──► [coherence]
  [clarity]   ──► [final_score]        [coherence] ──► [fluency]
  [coherence] ──► [final_score]        [fluency]   ──► [coherence]   ◄── closes the loop
  [fluency]   ──► [final_score]
                                       (cycle: coherence → fluency → coherence)

🤔 Did you know? Python's built-in graphlib.TopologicalSorter (added in Python 3.9) raises a graphlib.CycleError with the cycle's members if a cyclic dependency is detected. Using it costs you five lines of code and eliminates an entire class of bugs.

Here is a minimal pattern that makes cycle detection automatic:

import graphlib
from typing import Callable

class SafeDAGScorer:
    """
    A DAG-based scorer that validates acyclicity at construction time.
    Raises graphlib.CycleError immediately if a cycle is introduced,
    rather than failing silently at evaluation time.
    """

    def __init__(self):
        self._nodes: dict[str, Callable] = {}       # node_id -> compute function
        self._dependencies: dict[str, set] = {}     # node_id -> {dependency_ids}

    def add_node(self, node_id: str, dependencies: list[str], fn: Callable):
        """Add a scoring node. Validates the DAG remains acyclic after each addition."""
        self._nodes[node_id] = fn
        self._dependencies[node_id] = set(dependencies)
        self._validate_acyclic()  # Fail fast: check on every addition, not just at runtime

    def _validate_acyclic(self):
        """Raises graphlib.CycleError if any cycle exists in the current graph."""
        sorter = graphlib.TopologicalSorter(self._dependencies)
        # static_order() raises CycleError before yielding any nodes if a cycle exists
        list(sorter.static_order())

    def evaluate(self, base_features: dict) -> dict:
        """Evaluate all nodes in topologically sorted order."""
        sorter = graphlib.TopologicalSorter(self._dependencies)
        results = dict(base_features)  # seed with extracted features
        for node_id in sorter.static_order():
            if node_id in self._nodes:
                results[node_id] = self._nodes[node_id](results)
        return results

## Usage
scorer = SafeDAGScorer()
scorer.add_node("fluency", [], lambda f: 1.0 if f["grammar_errors"] == 0 else 0.5)
scorer.add_node("coherence", ["fluency"], lambda f: f["fluency"] * f["logical_flow"])
scorer.add_node("final", ["fluency", "coherence"], lambda f: 0.4*f["fluency"] + 0.6*f["coherence"])

## This would raise graphlib.CycleError immediately:
## scorer.add_node("fluency", ["coherence"], lambda f: ...)  # creates a cycle

The critical design choice here is fail fast: _validate_acyclic() is called on every add_node() invocation, not just when evaluate() is called. This means cycle errors surface at scorer construction time — in your tests, your CI pipeline, before any LLM output is ever evaluated. A cycle that would have silently corrupted thousands of production evaluations instead fails loudly the moment a developer makes the mistake.
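
A test that pins the fail-fast behavior might look like this minimal sketch:

import graphlib
import pytest

def test_cycle_is_rejected_at_construction_time():
    scorer = SafeDAGScorer()
    scorer.add_node("fluency", [], lambda f: 1.0)
    scorer.add_node("coherence", ["fluency"], lambda f: f["fluency"])
    # Re-registering "fluency" with a dependency on "coherence" closes a cycle;
    # the scorer must reject it here, long before evaluate() is ever called.
    with pytest.raises(graphlib.CycleError):
        scorer.add_node("fluency", ["coherence"], lambda f: f["coherence"])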


Pitfall 5: Over-Engineering and Under-Engineering the Representation

The final pitfall is a mismatch between the complexity of your scoring logic and the complexity of your scoring representation. It manifests in two opposite directions.

Over-engineering happens when an engineer reaches for a DAG (or a decision tree) when a flat list of rules would be clearer, faster, and easier to maintain. DAGs are appropriate when metrics have genuine dependencies — when computing metric B requires the already-computed value of metric A. If your metrics are independent (each one reads only from extracted features, never from other computed metrics), a flat rule list with no dependency graph is both correct and simpler.

Flat rules are fine when:         DAGs are justified when:

feature_A ──► rule_1 ──► score   feature_A ──► metric_X ──┐
feature_B ──► rule_2 ──► score                              ├──► final_score
feature_C ──► rule_3 ──► score   metric_X  ──► metric_Y ──┘

(No inter-rule dependencies)      (metric_Y genuinely needs metric_X's output)

🧠 Mnemonic: "Reach for the graph when the metrics talk to each other." If your metrics only talk to extracted features, a list will do.

Under-engineering is the mirror failure: using a flat rule list to implement logic that is fundamentally branching and mutually exclusive. The clearest signal that you need a tree or DAG instead of a flat list is when you find yourself writing rules with interacting conditions that are hard to keep consistent:

## ❌ Under-engineered: flat rules trying to express a decision tree
rules = [
    Rule("if is_factual_query AND has_citation",       score=1.0),
    Rule("if is_factual_query AND NOT has_citation",    score=0.3),
    Rule("if NOT is_factual_query AND is_creative",     score=0.8),
    Rule("if NOT is_factual_query AND NOT is_creative", score=0.5),
    # What happens if is_factual_query AND is_creative are both True?
    # What's the score? Two rules fire. Now what?
]

## ✅ Better: a decision tree makes the mutual exclusivity explicit
##
##            [is_factual_query?]
##           /                  \
##         YES                   NO
##          |                     |
##    [has_citation?]      [is_creative?]
##     /         \           /         \
##   YES          NO       YES          NO
## score=1.0  score=0.3  score=0.8   score=0.5

When multiple flat rules can fire simultaneously on the same input and their combined effect is not well-defined, you are implementing a decision tree badly. The flat-rule representation forces you to manually maintain the invariant that exactly one "branch" fires for any given input — an invariant the structure cannot enforce. A decision tree makes the branching structure explicit and impossible to violate.

📋 Quick Reference Card: Choosing the Right Representation

🔧 Signal 🎯 Right Choice
📚 Independent conditions, additive scoring Flat rule list
🔒 Mutually exclusive branches, single path per input Decision tree
🧠 Metrics depend on other metrics DAG
🎯 Fewer than ~10 rules, no branching Flat rule list
🔧 Need to explain every decision path Decision tree or DAG with trace
📚 Logic is growing but mostly additive Governed flat rule list

💡 Mental Model: Think of the three representations as tools with different fit. A flat rule list is a checklist — great for independent criteria. A decision tree is a flowchart — great for mutually exclusive paths. A DAG is a dependency graph — great for metrics that build on each other. Using a dependency graph to implement a checklist is over-engineering. Using a checklist to implement a flowchart is under-engineering.



Putting It Together: A Diagnostic Checklist

When your deterministic scorer produces unexpected results, work through this checklist before changing any rules or thresholds:

🔧 Step 1: Check the extraction layer. Print every extracted feature for the failing input. Verify each one is correct before touching the scoring layer.

📚 Step 2: Check for conflicting rules. Search for rules that could simultaneously fire on this input and produce contradictory score adjustments. If found, check the governance metadata — which rule was added later and why?

🎯 Step 3: Verify threshold provenance. If a threshold boundary is causing the issue, check its DocumentedThreshold record. Is the calibration dataset still valid? Has the underlying model changed?

🔒 Step 4: Run cycle detection. If using a hand-rolled DAG, run topological sort explicitly and verify the output order before diagnosing score values.

🧠 Step 5: Audit representation fit. Is the logic you're trying to express genuinely suited to the representation you're using? If flat rules are encoding branching logic, the representation is fighting the problem.

Deterministic scorers are powerful precisely because they are fully inspectable. Every one of these pitfalls is ultimately a form of lost inspectability — rule sets that are too large to inspect, thresholds with no paper trail, features that are wrong but look right, graphs that are invalid but run anyway, and representations that obscure the logic they are meant to encode. Build your scorer so that any correct score is easy to explain and any incorrect score is easy to diagnose, and you will have avoided the most costly mistakes in this space.

Key Takeaways and What Comes Next

You've traveled a significant distance in this lesson. You started with the question of why reproducibility matters in LLM evaluation, moved through three distinct families of deterministic scoring engines, built a working scorer end-to-end, and mapped the failure modes that trip up even experienced teams. Before moving forward, it's worth pausing to consolidate what you now understand — and to chart where the remaining lessons in this roadmap will take you.

The central insight of this lesson is deceptively simple: deterministic scoring engines and LLM-based extractors are complementary, not competing. The LLM handles ambiguity — parsing natural language, resolving intent, extracting structured signals. The deterministic engine handles judgment — applying rules, weights, and dependency logic to those signals with perfect reproducibility. Neither half works well without the other.


The Three Engine Families at a Glance

Before diving into decision guidance, let's crystallize the key properties of each engine family side by side. This table is designed as a reference you can return to whenever you're scoping a new evaluation task.

Dimension 🔷 Rule Engines 🌲 Decision Trees 🔗 DAG Metrics
Reproducibility ✅ Perfect — same inputs always yield same outputs ✅ Perfect — deterministic traversal ✅ Perfect — topological order is fixed
Expressiveness 🟡 Moderate — Boolean and weighted logic; struggles with sequential dependencies 🟡 Moderate — Excellent for branching; poor at fan-in (multiple prerequisites) ✅ High — Handles complex dependencies, multi-path scoring, and conditional aggregation
Debuggability ✅ High — each rule fires or doesn't; easy to audit per-rule ✅ High — trace the path taken through the tree 🟡 Moderate — requires explicit trace logging; DAG structure can obscure flow
Versionability ✅ Easy — rules are declarative text or dicts ✅ Easy — tree structure serializes cleanly to JSON/YAML 🟡 Medium — graph topology must be carefully versioned alongside node logic
Implementation complexity 🟢 Low — a few Python functions suffice 🟢 Low-to-medium — recursive traversal is straightforward 🔴 Medium-to-high — requires topological sort, node registry, dependency resolution
Best fit Independent, combinable criteria with clear weights Mutually exclusive branches based on a primary signal Hierarchical metrics with explicit upstream/downstream dependencies

🎯 Key Principle: Expressiveness and debuggability trade off against each other as you move from rules → trees → DAGs. Choose the simplest engine that can faithfully represent your rubric — complexity you don't need is complexity that will bite you during an incident.


Decision Guide: Matching the Engine to the Task

Choosing the right engine family is less about technical preference and more about the shape of your rubric. The following guide walks through the key questions you should ask before committing to an architecture.

START HERE
│
├─ Are all your scoring criteria independent of each other?
│   (i.e., criterion A doesn't affect whether criterion B applies)
│   │
│   ├─ YES → Do you need weighted combination of multiple criteria?
│   │         │
│   │         ├─ YES → 🔷 RULE ENGINE
│   │         │         (Weighted boolean rules, easy to audit per-criterion)
│   │         │
│   │         └─ NO  → 🔷 RULE ENGINE (simple)
│   │                   (Unweighted pass/fail rules are just a special case)
│   │
│   └─ NO → Do dependencies form a strict hierarchy
│            (one upstream determines which downstream logic applies)?
│            │
│            ├─ YES → Is the branching factor high (>3 branches per node)?
│            │         │
│            │         ├─ YES → 🔗 DAG METRICS
│            │         │         (Trees become unwieldy at high branching factors)
│            │         │
│            │         └─ NO  → 🌲 DECISION TREE
│            │                   (Readable branching logic, easy to trace)
│            │
│            └─ NO → Do multiple upstream metrics contribute to one downstream?
│                     (fan-in pattern)
│                     │
│                     ├─ YES → 🔗 DAG METRICS
│                     │         (DAGs handle fan-in natively; trees cannot)
│                     │
│                     └─ NO  → Reconsider your rubric structure;
│                               you may have implicit dependencies
│                               that need to be made explicit

💡 Mental Model: Think of rule engines as spreadsheet formulas (each cell is independent), decision trees as flowcharts (one path through), and DAGs as dependency graphs (like a Makefile — a target is only built when all its prerequisites are satisfied).

Auditability Pressure Overrides Expressiveness Preference

There's one situation where the decision guide above should be overridden: when your evaluation system will be reviewed by non-engineers — compliance teams, product managers, legal stakeholders. In that context, favor the engine with the highest human-readability, even if it means accepting some architectural awkwardness.

A decision tree that a product manager can read in a YAML file is worth more in a regulated environment than a DAG that's technically correct but requires a graph-visualization tool to explain. Debuggability is not just a developer concern — it's an organizational trust concern.
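
As an illustration of what that reviewable artifact might look like, here is a hypothetical completeness rubric expressed as a plain nested mapping and dumped to YAML (using PyYAML, which is an assumption about available tooling):

import yaml  # PyYAML; an assumption, not a requirement of the scorer itself

## A hypothetical rubric a reviewer can read line by line without touching Python.
completeness_tree = {
    "question": "Were any sub-questions detected?",
    "no": {"score": 1.0, "reason": "Nothing to address"},
    "yes": {
        "question": "What fraction of sub-questions were addressed?",
        "all": {"score": 1.0},
        "at_least_75_percent": {"score": 0.75},
        "at_least_50_percent": {"score": 0.5},
        "below_50_percent": {"score": 0.25},
    },
}

print(yaml.safe_dump(completeness_tree, sort_keys=False))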

⚠️ Common Mistake: Choosing DAGs because they feel more sophisticated, then discovering that no one outside the team can interpret the audit logs. Match engine complexity to your actual auditability requirements, not your aesthetic preferences.


The Deterministic Scorer as One Half of the Hybrid Pattern

Everything in this lesson assumes a critical upstream dependency: the LLM extraction step has already done its job. The deterministic scorer receives clean, schema-validated, structured data. It doesn't know how that data was produced — it only knows how to score it.

This clean separation is architectural, not incidental. It means:

  • 🔧 The scorer can be tested independently using synthetic fixture data, without ever calling an LLM.
  • 🔒 The scorer's behavior is fully reproducible even as the upstream LLM model version changes — as long as the extraction schema stays stable.
  • 📚 The scorer can be swapped without touching the extraction logic, and vice versa.

The diagram below shows how the two halves connect:

┌─────────────────────────────────────────────────────────────────────────┐
│                        HYBRID EVALUATION PIPELINE                       │
├──────────────────────────────┬──────────────────────────────────────────┤
│   PROBABILISTIC HALF         │   DETERMINISTIC HALF                     │
│                              │                                          │
│  Raw LLM Output              │  Structured Fields                       │
│       │                      │       │                                  │
│       ▼                      │       ▼                                  │
│  ┌──────────┐                │  ┌──────────────────────────────┐        │
│  │   LLM    │ ─── JSON ───►  │  │  Rule Engine / Tree / DAG    │        │
│  │Extractor │                │  │                              │        │
│  └──────────┘                │  │  - Applies versioned rubric  │        │
│                              │  │  - Produces trace log        │        │
│  Handles:                    │  │  - Emits final score         │        │
│  • Ambiguity                 │  └──────────────────────────────┘        │
│  • Natural language          │                                          │
│  • Implicit signals          │  Handles:                                │
│                              │  • Deterministic judgment                │
│                              │  • Weighted aggregation                  │
│                              │  • Auditability                          │
├──────────────────────────────┴──────────────────────────────────────────┤
│  Schema validation at the boundary is the contract between both halves  │
└─────────────────────────────────────────────────────────────────────────┘

🎯 Key Principle: The schema at the boundary between the two halves is the contract. Break the schema, break the pipeline. This is why schema validation at ingestion — not as an afterthought — is non-negotiable in production.


Production-Ready Deterministic Scorer Checklist

Before you ship a deterministic scorer, run through this checklist. Each item represents a lesson learned from real production failures.

## Example: A production-ready scorer module structure
## This illustrates the checklist items as code artifacts

from dataclasses import dataclass, field
from typing import Any
import jsonschema
import json

## ✅ CHECKLIST ITEM 1: Schema-validated inputs
EXTRACTION_SCHEMA = {
    "type": "object",
    "required": ["has_citation", "citation_count", "tone", "word_count"],
    "properties": {
        "has_citation": {"type": "boolean"},
        "citation_count": {"type": "integer", "minimum": 0},
        "tone": {"type": "string", "enum": ["formal", "neutral", "informal"]},
        "word_count": {"type": "integer", "minimum": 0},
    },
    "additionalProperties": False,  # Reject unexpected fields
}

## ✅ CHECKLIST ITEM 2: Versioned rules (version is first-class)
RULES_VERSION = "v2.3.1"  # Bump this when logic changes

RULES = [
    {"name": "has_citation", "weight": 0.4, "field": "has_citation", "expected": True},
    {"name": "adequate_citations", "weight": 0.2, "field": "citation_count", "min": 2},
    {"name": "formal_tone", "weight": 0.2, "field": "tone", "expected": "formal"},
    {"name": "sufficient_length", "weight": 0.2, "field": "word_count", "min": 150},
]

@dataclass
class ScorerResult:
    score: float
    passed: bool
    rules_version: str
    # ✅ CHECKLIST ITEM 3: Trace output — every decision is logged
    trace: list[dict] = field(default_factory=list)

def score_response(raw_fields: dict[str, Any]) -> ScorerResult:
    # ✅ CHECKLIST ITEM 4: Validate at entry — never trust upstream silently
    try:
        jsonschema.validate(instance=raw_fields, schema=EXTRACTION_SCHEMA)
    except jsonschema.ValidationError as e:
        raise ValueError(f"Schema validation failed: {e.message}")

    trace = []
    total_score = 0.0

    for rule in RULES:
        field_val = raw_fields[rule["field"]]
        if "expected" in rule:
            fired = field_val == rule["expected"]
        elif "min" in rule:
            fired = field_val >= rule["min"]
        else:
            fired = False

        contribution = rule["weight"] if fired else 0.0
        total_score += contribution

        # ✅ Trace: record every rule's input, outcome, and contribution
        trace.append({
            "rule": rule["name"],
            "input_value": field_val,
            "fired": fired,
            "weight": rule["weight"],
            "contribution": contribution,
        })

    return ScorerResult(
        score=round(total_score, 4),
        passed=total_score >= 0.7,
        rules_version=RULES_VERSION,
        trace=trace,
    )

This module demonstrates four of the five checklist items in working code. The fifth — unit tests — lives outside the module itself:

## ✅ CHECKLIST ITEM 5: Unit tests for every scoring path
import pytest

def test_full_pass():
    result = score_response({
        "has_citation": True,
        "citation_count": 3,
        "tone": "formal",
        "word_count": 200,
    })
    assert result.score == 1.0
    assert result.passed is True
    assert result.rules_version == "v2.3.1"

def test_missing_citation_fails():
    result = score_response({
        "has_citation": False,
        "citation_count": 0,
        "tone": "formal",
        "word_count": 200,
    })
    # has_citation (0.4) and adequate_citations (0.2) both fail
    assert result.score == pytest.approx(0.4)  # only tone + length pass
    assert result.passed is False

def test_schema_violation_raises():
    with pytest.raises(ValueError, match="Schema validation failed"):
        score_response({
            "has_citation": "yes",  # Wrong type — should be boolean
            "citation_count": 2,
            "tone": "formal",
            "word_count": 200,
        })

def test_trace_completeness():
    result = score_response({
        "has_citation": True,
        "citation_count": 1,  # Below minimum of 2
        "tone": "neutral",     # Not "formal"
        "word_count": 160,
    })
    # Verify trace has one entry per rule
    assert len(result.trace) == 4
    # Verify the unfired rules are correctly recorded
    unfired = [r for r in result.trace if not r["fired"]]
    assert len(unfired) == 2

Unit tests serve a second function beyond correctness: they document the intended behavior of the rubric. When a future engineer asks "why does a word count of 149 fail but 150 passes?", the test test_missing_citation_fails and its siblings answer that question with executable precision.

💡 Pro Tip: Version your rules with semantic versioning (MAJOR.MINOR.PATCH). A MAJOR bump means scores from v1.x and v2.x are not directly comparable — downstream dashboards must segment by version. A MINOR bump is backward-compatible rubric expansion. A PATCH is a bug fix that doesn't change intended behavior. Store rules_version in every score record in your database.
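
As a sketch of how a dashboard or analysis job might enforce that comparability rule (assuming plain MAJOR.MINOR.PATCH version strings):

def scores_comparable(version_a: str, version_b: str) -> bool:
    """Scores are directly comparable only when the MAJOR rules version matches."""
    return version_a.split(".")[0] == version_b.split(".")[0]

assert scores_comparable("2.3.1", "2.4.0") is True    # MINOR bump: still comparable
assert scores_comparable("1.9.2", "2.0.0") is False   # MAJOR bump: segment separately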


What You Now Understand That You Didn't Before

Let's be explicit about the conceptual shifts this lesson has produced:

Before this lesson, you might have assumed that LLM-as-judge meant using an LLM to generate scores — and that deterministic methods were too rigid to handle the nuance of real evaluation tasks.

After this lesson, you understand:

  • 🧠 Determinism is a design choice, not a limitation. Rule engines, decision trees, and DAGs are not simpler than LLM judges — they are differently powerful. Their power is in reproducibility, auditability, and the ability to unit-test your rubric.
  • 📚 The hybrid pattern separates concerns correctly. LLMs handle the extraction of meaning from language. Deterministic engines handle the application of judgment. Mixing these responsibilities into a single LLM call sacrifices both debuggability and reproducibility.
  • 🔧 Engine choice is a rubric structure question. You don't choose a DAG because it's cool — you choose it because your rubric has fan-in dependencies that a rule engine cannot represent cleanly.
  • 🎯 Trace output is not optional in production. A score without a trace is a black box. Every deterministic scorer must emit a human-readable record of which rules fired, what inputs they saw, and how contributions aggregated.
  • 🔒 Schema validation at the boundary is the load-bearing wall. It's the contract that lets both halves of the pipeline evolve independently without breaking each other silently.

Preview: What the Upcoming Lessons Cover

This lesson gave you the engine families and the architectural pattern. The remaining lessons in this roadmap go deeper on three fronts:

Encoding Rubrics as Formal Rules

The next lesson focuses on the craft of rubric translation — taking a natural-language evaluation standard ("the response should be accurate, concise, and properly cited") and encoding it as a set of formal, testable rules. This is harder than it sounds. Natural language rubrics are ambiguous; formal rules are not. The lesson covers techniques for surfacing hidden assumptions in rubrics, resolving conflicts between criteria, and keeping rules maintainable as product requirements evolve.

Building Audit Trails

A later lesson goes deep on what "audit trail" means in practice for LLM evaluation systems — how to structure trace logs so they're queryable, how to attach them to score records in a database, and how to build tooling that lets a human reviewer reconstruct exactly why a particular output received a particular score. This becomes critical when your evaluation system is used to make high-stakes decisions (content moderation, automated grading, compliance checks).
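As a small preview of that lesson, the sketch below persists one score record with its trace serialized as JSON. The table shape, record ID, and file name are placeholders for illustration; only score_response comes from the module earlier in this lesson.

## A minimal sketch of attaching a trace to a stored score record.
## Table shape, record ID, and file name are placeholders, not a prescribed design.
import json
import sqlite3

conn = sqlite3.connect("scores.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS score_records (
        response_id   TEXT PRIMARY KEY,
        score         REAL,
        passed        INTEGER,
        rules_version TEXT,
        trace_json    TEXT
    )
""")

result = score_response({
    "has_citation": True,
    "citation_count": 3,
    "tone": "formal",
    "word_count": 200,
})
conn.execute(
    "INSERT OR REPLACE INTO score_records VALUES (?, ?, ?, ?, ?)",
    ("resp-001", result.score, int(result.passed), result.rules_version, json.dumps(result.trace)),
)
conn.commit()

# A reviewer (or a dashboard query) can now reconstruct exactly why resp-001 got its score.
row = conn.execute(
    "SELECT rules_version, trace_json FROM score_records WHERE response_id = ?", ("resp-001",)
).fetchone()
print(row[0], json.loads(row[1]))

Because the trace is stored with the score rather than recomputed later, the audit record survives subsequent rubric changes: what the reviewer sees is what the scorer actually saw at the time.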

Recognizing When the Hybrid Pattern Is Overkill

Not every LLM evaluation task warrants the full hybrid architecture. A later lesson in the roadmap covers the conditions under which a simpler approach — a single LLM-as-judge call with a well-designed prompt — is the right call. The hybrid pattern has real costs: two pipeline stages, a schema to maintain, two sets of tests. Understanding when those costs are justified (and when they aren't) is as important as knowing how to build the pattern correctly.

💡 Real-World Example: A team building an internal chatbot for HR questions might start with a simple LLM judge prompt. When that system moves to handling compliance-adjacent queries — and the legal team asks "how exactly did you score this response?" — that's the moment the hybrid pattern earns its complexity budget.


The Production-Ready Scorer Checklist (Quick Reference)

📋 Quick Reference Card: Deterministic Scorer Production Checklist

  1. 🔒 Schema-validated inputs. Why it matters: prevents silent corruption from upstream changes. Red flag if missing: scores drift mysteriously after an LLM model update.
  2. 📦 Versioned rules. Why it matters: enables score comparability across time. Red flag if missing: regression analysis produces meaningless comparisons.
  3. 🔍 Trace output per score. Why it matters: enables human review and debugging. Red flag if missing: you can't explain why a score changed between runs.
  4. 🧪 Unit tests for all scoring paths. Why it matters: catches regressions when rules change. Red flag if missing: a rubric change silently breaks edge cases.
  5. 📝 Documented schema contract. Why it matters: enables independent evolution of both pipeline halves. Red flag if missing: teams step on each other's changes.

🧠 Mnemonic: S-V-T-U-D stands for Schema, Version, Trace, Unit tests, Documented contract. Or: "Some Very Thorough Users Debug".


Final Critical Points to Carry Forward

⚠️ The deterministic scorer is only as good as its inputs. If the LLM extractor produces malformed, inconsistent, or hallucinated field values, no amount of elegant rule logic will save your scores. Invest in extraction quality validation before you invest in scoring sophistication.

⚠️ Reproducibility means the same result given the same inputs, not the same result given the same raw LLM output. If the extractor is non-deterministic (and, as discussed earlier, even temperature-0 LLM calls carry no determinism guarantee), two runs of the full pipeline on the same raw output may still produce different scores. The deterministic scorer eliminates variance within its own stage; it cannot eliminate upstream variance. Account for this in your evaluation methodology.

⚠️ Rules, trees, and DAGs are not static artifacts. They are living documents that must be maintained, tested, and versioned with the same discipline as production code. A rubric that isn't updated when the product requirement changes is actively misleading — it will score outputs against a standard that no longer reflects what the team cares about.


Three Practical Next Steps

Where should you go from here? Choose the next step that matches where you are:

🔧 If you're starting a new evaluation system: Begin with rule engines. Write the simplest version of your rubric as a Python dict of weighted conditions (a minimal starting point is sketched after these three options). Get it unit-tested and schema-validated. Only add a tree or DAG structure if you hit an expressiveness limit that you can't solve within the rule engine.

📚 If you have an existing LLM-as-judge setup: Identify the evaluation dimensions where you most need reproducibility (compliance criteria, safety checks, factual accuracy flags). Extract those dimensions from the LLM judge prompt into a deterministic rule layer. Leave the subjective dimensions (tone quality, helpfulness) in the LLM judge for now. Hybridize incrementally.

🎯 If you're preparing an evaluation system for production review: Run through the five-item checklist above. If any item is missing, treat it as a blocking issue before launch. Specifically, if you don't have trace output, you will not be able to answer the first hard question a stakeholder asks about a disputed score.
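For the first of those paths, "simplest version" really can mean a handful of lines. The rule names, fields, and weights below are placeholders rather than a recommendation; the point is that the rubric starts as plain, testable data.

## A minimal starting rubric as plain data. Rule names, fields, and weights are placeholders.
STARTER_RULES = {
    "answers_question": {"field": "answers_question", "expected": True, "weight": 0.5},
    "pii_free":         {"field": "pii_free",         "expected": True, "weight": 0.3},
    "adequate_length":  {"field": "word_count",       "min": 50,        "weight": 0.2},
}

def starter_score(fields: dict) -> float:
    """Weighted sum of fired conditions, the same shape the fuller scorer above uses."""
    total = 0.0
    for rule in STARTER_RULES.values():
        value = fields[rule["field"]]
        fired = value == rule["expected"] if "expected" in rule else value >= rule["min"]
        total += rule["weight"] if fired else 0.0
    return round(total, 4)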

The deterministic scoring pattern you've learned in this lesson is one of the most durable tools in the LLM evaluation toolkit. Unlike prompt-based judges, it doesn't degrade when models are updated, doesn't require a live API call to audit, and doesn't leave stakeholders wondering how a decision was made. Used correctly, it turns your evaluation rubric into a transparent, testable, versionable artifact — one that earns trust precisely because it has no mystery.