Datalog for Deterministic Scoring
Datalog is a logic programming language with guaranteed termination, no side effects, and a facts-plus-rules model that maps directly onto extracted evidence and rubric criteria. Encode your scoring logic as Datalog rules and every score becomes reproducible, auditable, and safe to run on LLM-generated inputs. This lesson covers why Datalog fits this problem where SQL, Prolog, and hand-rolled boolean logic fall short.
This is the scoring layer most LLM-as-judge tutorials skip. Datalog gives you a rule engine with three properties that matter for production evaluation: termination is guaranteed by the language itself, rules compose without surprising interactions, and the evaluation trace is a natural audit log. These properties aren't nice-to-haves — they're what makes a scoring layer defensible when the decision has real consequences.
The problem, stated up front: once you've extracted structured facts from an LLM output, something has to turn those facts into a score. A second LLM call reintroduces the variance you just removed. A hand-rolled rules engine gets unwieldy past a dozen criteria. Datalog sits in the middle: declarative enough to read like a rubric, restricted enough to guarantee termination, expressive enough to handle real scoring logic with thresholds, levels, and aggregation.
Why Your Scoring Layer Is the Weakest Link
You've built an LLM evaluation pipeline. You're proud of it. The extraction step pulls structured facts from model outputs with impressive reliability. But here's the question that should keep you up at night: what happens to those facts after you extract them? The answer — the mechanism that converts structured evidence into a score — is almost certainly the weakest link in your system. It's the part most tutorials skip, the part most teams bolt on as an afterthought, and paradoxically, it's the part that determines whether your entire evaluation pipeline is trustworthy or just expensive theater.
This section is about that gap. We'll examine why the scoring layer deserves as much engineering attention as the extraction layer, why the most obvious fixes make the problem worse, and how Datalog — a logic programming language with roots in database theory — turns out to be the right tool for the job.
The Extraction-Plus-Scoring Split
When you evaluate an LLM's output, you're implicitly doing two things, whether you realize it or not. First, you're making claims about what the output contains — did the model cite a source? Did it stay within a word count? Did it use a threatening tone? Second, you're judging those claims against a rubric — does citing two sources pass? Does a mildly negative tone disqualify an otherwise perfect response?
Most naive pipelines smash these two concerns together. A single LLM prompt receives the raw output and returns a score, handling both extraction and judgment in one opaque step. This feels efficient. It's actually catastrophic for reproducibility.
Separation of concerns is the architectural principle that fixes this. When you split extraction from scoring, you create a stable contract between two systems: the extractor produces a structured record of facts about the output, and the scorer applies deterministic rules to that record. The extractor can be an LLM — and often should be, because natural language understanding is genuinely hard. But once you have facts, the scoring should never require another round of inference.
Raw LLM Output
│
▼
┌─────────────┐
│ EXTRACTOR │ ← LLM or structured parser
│ (can be │ Produces facts from output
│ an LLM) │
└─────────────┘
│
│ {citation_count: 3,
│ tone: "neutral",
│ word_count: 412, ...}
│
▼
┌─────────────┐
│ SCORER │ ← Deterministic rules engine
│ (must be │ Converts facts → score
│ NOT LLM) │
└─────────────┘
│
▼
Score + Audit Trail
This split delivers a concrete benefit: your extraction can improve over time without touching your scoring logic, and your rubric can evolve without retraining your extractor. More importantly, it creates a reproducibility boundary. Given identical facts, the scorer must always return identical scores. That's only achievable if the scorer is deterministic — and no second LLM call can give you that.
💡 Mental Model: Think of the extractor as a witness and the scorer as a judge. The witness observes and reports facts. The judge applies fixed rules to those facts. You wouldn't ask the witness to decide the verdict.
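The scorer side of that contract can be sketched in a few lines. This is a minimal illustration, not a full rubric; the fact fields (citation_count, tone) are hypothetical. The property that matters is that the scorer is a pure function of the facts dict, so identical facts always yield identical scores.

```python
# A minimal sketch of the scorer side of the contract. The fact fields
# (citation_count, tone) are hypothetical. The key property: score() is
# a pure function of the facts dict -- no model call, no randomness.
def score(facts: dict) -> dict:
    citations_ok = facts.get("citation_count", 0) >= 2
    tone_ok = facts.get("tone") in {"neutral", "positive"}
    return {
        "pass": citations_ok and tone_ok,
        "criteria": {"citations": citations_ok, "tone": tone_ok},
    }

facts = {"citation_count": 3, "tone": "neutral"}
# The reproducibility boundary: identical facts, identical score, every run.
assert score(facts) == score(facts)
print(score(facts)["pass"])
# True
```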
Why a Second LLM Call Defeats the Purpose
Here's a failure mode that's more common than it should be. A team builds a careful extraction step, gets structured JSON out of the model, and then... passes that JSON back to an LLM to score it. "Is this a good response given these facts? Score 1–10." It feels principled because at least the scoring is operating on structured data. But it has quietly reintroduced everything the extraction step was supposed to eliminate.
LLM variance is the core problem. The same prompt, given to the same model, at the same temperature, can return different scores on different runs. Not dramatically different — but different enough to make your evaluation pipeline non-deterministic. If you're running nightly eval sweeps, you might see a model's score drift by 0.3 points between Monday and Wednesday without the model or the rubric changing at all. You'll spend hours debugging a phantom regression.
🤔 Did you know? Even at temperature=0, repeated runs of the same prompt against the same model can yield different outputs. Batched inference, GPU kernel scheduling, and floating-point non-associativity all introduce residual non-determinism at serving time, which is why LLM judges can disagree with themselves across otherwise identical runs.
Beyond variance, there's the opacity problem. When an LLM scores a set of facts, you cannot reliably explain why it gave that score. You can ask it to explain, but that explanation is itself generated by the model — it may be post-hoc rationalization rather than a genuine account of the scoring logic. In a production context where a low score means a response gets suppressed, a document gets flagged, or a vendor gets penalized, "the model felt like it" is not an acceptable audit trail.
Finally, LLM scoring is expensive. Every score requires an API call. For large-scale evaluations — thousands of responses, run repeatedly — the cost is non-trivial. Deterministic rule execution is essentially free by comparison.
❌ Wrong thinking: "Using an LLM to score LLM outputs is fine because the scoring LLM is smarter and more flexible than hard-coded rules."
✅ Correct thinking: "Flexibility in the scoring layer is a liability, not an asset. Scoring needs to be predictable, auditable, and cheap. LLMs are none of those things."
The Scaling Problem with Hand-Rolled Logic
Okay, so no second LLM call. The natural next move is to write the scoring logic yourself — a Python function that takes the extracted facts dict and returns a score. This works fine for a rubric with three or four criteria. It stops working around criterion number twelve.
Here's why. Imagine a rubric that scores a model response on five independent criteria: citation quality, tone, factual accuracy, response length, and format compliance. Each criterion has a pass/fail outcome. Your scoring function might look like this:
def score_response(facts: dict) -> dict:
    """
    Hand-rolled scoring logic for a 5-criterion rubric.
    Works fine at this scale. Note how criteria interact.
    """
    citations_ok = facts["citation_count"] >= 2
    tone_ok = facts["tone"] in {"neutral", "positive"}
    accurate = facts["accuracy_score"] >= 0.85
    length_ok = 200 <= facts["word_count"] <= 600
    format_ok = facts["has_headers"] and facts["has_summary"]
    # Simple pass/fail: all criteria must pass
    passed = all([citations_ok, tone_ok, accurate, length_ok, format_ok])
    return {
        "score": "pass" if passed else "fail",
        "criteria": {
            "citations": citations_ok,
            "tone": tone_ok,
            "accuracy": accurate,
            "length": length_ok,
            "format": format_ok,
        },
    }
This is readable. A colleague can audit it in two minutes. But now your product manager adds seven more criteria. Some criteria should only apply in certain contexts — format compliance matters for customer-facing responses but not internal summaries. Some criteria have weighted importance — accuracy failures should always fail the response regardless of other scores, but a minor length overage should only fail if two other criteria also fail. Some criteria have interactions — tone is evaluated differently when the topic is a complaint.
Your function grows. Conditionals nest. You add flags like is_customer_facing and topic_type and the boolean logic becomes a tangle that nobody wants to touch. A year later, you have a 400-line function with seven layers of nesting, and when a stakeholder asks "why did this response fail?", the honest answer is "I need to trace through the logic manually to find out."
⚠️ Common Mistake 1: Treating the scoring function as "just a few if statements" and never refactoring it. Boolean logic for rubric scoring has combinatorial complexity — \(2^n\) possible combinations of \(n\) binary criteria, each potentially subject to different interaction rules. A rubric with 15 criteria has 32,768 possible combinations. Hand-rolled conditionals cannot express or test this space reliably.
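The arithmetic behind that warning is easy to check:

```python
from itertools import product

# 2**n pass/fail combinations for n binary criteria: enumerable at
# n = 5, hopeless to hand-test at n = 15.
for n in (5, 12, 15):
    print(f"{n} criteria -> {2 ** n:,} combinations")

# Sanity-check the formula by actually enumerating the n = 5 case
assert len(list(product((True, False), repeat=5))) == 2 ** 5
```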
The deeper problem is that hand-rolled logic lacks compositional safety. When you add a new rule, you cannot easily reason about whether it interacts with existing rules in unexpected ways. You find out at runtime, usually in production, usually at the worst possible moment.
# What this looks like after 12 months of feature additions.
# This is a cautionary example, not a model to follow.
def score_response_v12(facts: dict, context: dict) -> dict:
    citations_ok = facts["citation_count"] >= 2
    if context.get("is_technical"):
        citations_ok = facts["citation_count"] >= 4  # Technical docs need more
    tone_ok = facts["tone"] in {"neutral", "positive"}
    if context.get("topic") == "complaint":
        # Complaints can be neutral-negative but not hostile
        tone_ok = facts["tone"] in {"neutral", "neutral-negative"}
    if facts.get("escalation_flag"):
        tone_ok = True  # Escalations are exempt from tone scoring??
    # ... 300 more lines of this
    # Bug introduced on line 247 because someone didn't know
    # about the escalation_flag exemption when they added
    # the new 'professional_services' context type
This isn't a hypothetical. This is what production scoring logic looks like at most companies that don't invest in a proper rules engine.
💡 Real-World Example: A major enterprise AI platform ran a postmortem on a scoring regression that had incorrectly passed hundreds of non-compliant responses over three weeks. The root cause was a conditional branch added by one engineer that inadvertently short-circuited a safety check added by a different engineer the previous quarter. Neither engineer had documentation of the other's change. The fix required a full audit of 400 lines of nested boolean logic.
What 'Defensible' Scoring Means in Production
Let's be concrete about what's at stake. In a research prototype, a flaky scoring layer is annoying. In production, it can be genuinely damaging.
Consider a few scenarios where your scoring layer makes decisions with real consequences:
🔧 Content moderation: An LLM-generated response scores above the threshold and gets published. If someone is harmed by that content, "our scoring model gave it 7.4 out of 10" is not a defensible explanation to regulators, users, or a court.
🎯 Vendor evaluation: You're using LLM eval to score outputs from multiple model providers and decide which one gets a contract renewal. A non-deterministic scoring layer means the comparison is meaningless — you're not measuring model quality, you're measuring the noise floor of your evaluator.
📚 Automated grading: Educational platforms are using LLM evaluation to score student assignments. A score that changes between runs is not a grade — it's a lottery.
🔒 Compliance checking: A financial services firm uses LLM evaluation to verify that generated documents comply with disclosure requirements. "We checked and it passed" needs to mean something reproducible and auditable.
In all of these contexts, defensibility means three things: the score can be reproduced from the same inputs, the reasoning behind the score can be explained to a non-technical stakeholder, and the scoring logic can be audited by a third party. No second LLM call satisfies any of these. Hand-rolled boolean logic satisfies the first, fails the second, and makes the third painful.
🎯 Key Principle: A scoring layer is defensible when any score can be traced from the input facts through the applied rules to the output, with no gaps or probabilistic steps. This requires a rule engine, not an LLM and not unstructured code.
The Three Datalog Properties That Solve This
This is where Datalog enters the picture. Datalog is a declarative logic programming language — a restricted subset of Prolog designed specifically for database-style reasoning. It was developed in the 1970s and 80s as a foundation for deductive databases, and it has a set of properties that turn out to be precisely what a scoring layer needs.
We'll cover each property in depth in later sections. Here's the preview that motivates the rest of the lesson:
Property 1: Guaranteed Termination
Datalog programs always terminate. This is not a claim about typical behavior or best-case performance — it is a mathematical guarantee enforced by the language's design. Datalog forbids function symbols, so rules can never construct new values: every derivable fact is built from constants that already appear in the facts and rules. The set of derivable facts is therefore finite, and evaluation must reach a fixpoint in bounded time. Stratified semantics (which governs rules with negation, covered later) preserves this guarantee.
For a scoring layer, this matters enormously. You don't want your evaluation pipeline to hang on a pathological input. You don't want to write timeout logic around your scoring function. You want a language where the question "will this terminate?" has a guaranteed answer of yes.
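To see why the no-new-values restriction forces termination, here is a toy bottom-up evaluator in plain Python. It is an illustration of the semantics, not how production engines work; the edge/reachable relations are the classic textbook example. Because no rule can mint constants that aren't already in the facts, only finitely many facts are derivable, so the fixpoint loop must stop — even on a cyclic input that would send a naive Prolog query into an infinite descent.

```python
# Toy bottom-up evaluation: facts are tuples ("relation", arg1, ...).
# The rule function maps the current fact set to newly derived facts.
def reachable_rules(facts):
    derived = set()
    # reachable(X, Y) :- edge(X, Y).
    for f in facts:
        if f[0] == "edge":
            derived.add(("reachable", f[1], f[2]))
    # reachable(X, Z) :- reachable(X, Y), edge(Y, Z).
    for r in facts:
        if r[0] == "reachable":
            for e in facts:
                if e[0] == "edge" and e[1] == r[2]:
                    derived.add(("reachable", r[1], e[2]))
    return derived

def fixpoint(facts, rules):
    facts = set(facts)
    while True:
        new = rules(facts) - facts
        if not new:            # nothing new derivable: fixpoint reached
            return facts
        facts |= new           # no rule mints constants, so this is bounded

edges = {("edge", "a", "b"), ("edge", "b", "c"), ("edge", "c", "a")}  # a cycle
result = fixpoint(edges, reachable_rules)
reachable = sorted(f for f in result if f[0] == "reachable")
print(len(reachable))  # 9: every node reaches every node, yet the loop halted
```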
Property 2: Composable Rules Without Surprising Interactions
Datalog rules compose predictably. When you add a new rule, it can reference existing rules and extend the inference chain, but it cannot modify or shadow existing rules. The semantics are monotonic: adding facts or rules can only derive more conclusions, never fewer. This means you can extend a Datalog scoring program with confidence — new criteria add new derivations without breaking existing ones.
This is the property that hand-rolled boolean logic lacks. In Python, adding a new conditional can accidentally short-circuit an existing one. In Datalog, that's structurally impossible.
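Monotonicity can be demonstrated in miniature with plain Python; relation names here are hypothetical stand-ins for rubric criteria. Each rule maps the current fact set to newly derived facts, derivation iterates to a fixpoint, and adding a rule can only enlarge the result set.

```python
# Monotonic composition in miniature (relation names are hypothetical).
def derive(facts, rules):
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new = rule(facts) - facts
            if new:
                facts |= new
                changed = True
    return facts

# cites_enough(R) :- citation_count(R, N), N >= 2.
def cites_enough(fs):
    return {("cites_enough", f[1]) for f in fs
            if f[0] == "citation_count" and f[2] >= 2}

# well_sourced(R) :- cites_enough(R).
def well_sourced(fs):
    return {("well_sourced", f[1]) for f in fs if f[0] == "cites_enough"}

facts = {("citation_count", "resp_1", 3)}
before = derive(facts, [cites_enough])
after = derive(facts, [cites_enough, well_sourced])
assert before <= after                                # old conclusions all survive
assert ("well_sourced", "resp_1") in after - before   # the new rule only adds
```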
Property 3: The Evaluation Trace Is a Natural Audit Log
When Datalog derives a conclusion, it does so through a chain of rule applications. Every fact in the output was derived from facts in the input through explicitly named rules. This derivation history is not an optional debug feature — it's inherent to how Datalog evaluation works. With a small amount of instrumentation, you can surface the complete reasoning chain for any score as a structured audit log.
This is the explainability story that neither LLM scoring nor hand-rolled logic can match. "Response failed because citation_count = 1 failed to satisfy citation_rule_technical which requires citation_count >= 4 given is_technical = true" is something a compliance officer can read and verify.
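A sketch of how such an audit line can be produced, mirroring the quoted example; the rule name and thresholds are hypothetical. The point is that each rule returns both its verdict and a self-describing explanation built from the exact facts and thresholds it applied, so nothing needs to be reconstructed after the fact.

```python
def citation_rule_technical(facts):
    # Hypothetical rule: technical responses need >= 4 citations, others >= 2.
    required = 4 if facts.get("is_technical") else 2
    ok = facts["citation_count"] >= required
    verdict = "satisfies" if ok else "fails"
    audit = (f"citation_count = {facts['citation_count']} {verdict} "
             f"citation_rule_technical (requires >= {required}, "
             f"is_technical = {bool(facts.get('is_technical'))})")
    return ok, audit

facts = {"citation_count": 1, "is_technical": True}
ok, audit_line = citation_rule_technical(facts)
print(audit_line)
# citation_count = 1 fails citation_rule_technical (requires >= 4, is_technical = True)
```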
📋 Quick Reference Card:
| 🔧 Approach | 📊 Reproducible | 💡 Explainable | 🔒 Safe to Extend | ⚡ Cost |
|---|---|---|---|---|
| 🤖 Second LLM Call | ❌ Non-deterministic | ❌ Post-hoc only | ❌ Opaque | 💸 High |
| 🐍 Hand-rolled Python | ✅ Deterministic | ⚠️ Manual tracing | ⚠️ Fragile at scale | ✅ Low |
| 🗄️ SQL Rules | ✅ Deterministic | ⚠️ Query-by-query | ⚠️ Joins get complex | ✅ Low |
| 📐 Datalog | ✅ Deterministic | ✅ Trace is derivation | ✅ Monotonic composition | ✅ Low |
Why Not SQL? Why Not Prolog?
Two alternatives deserve a quick dismissal before we proceed.
SQL can express many scoring rules, especially when your extracted facts live in a database. But SQL's declarative power breaks down when rules need to reference other rules recursively (though recursive CTEs help), and SQL provides no natural derivation trace. More fundamentally, SQL is a query language for data — the mental model is "filter and aggregate rows," not "derive conclusions from evidence." Rubrics are not filters. They're inference chains.
Prolog is Datalog's more powerful ancestor. The problem is that power cuts both ways. Prolog allows arbitrary recursion, which means Prolog programs can loop infinitely. For a scoring layer processing untrusted LLM output, feeding a Prolog engine facts that trigger an infinite loop is a denial-of-service vulnerability. Datalog is Prolog with the dangerous parts removed — you lose Turing completeness but you gain the termination guarantee that makes it safe to run on adversarial inputs.
🧠 Mnemonic: SAFE — Strictly terminates, Audit trail built-in, Facts-plus-rules model, Extensible without breakage. That's Datalog for scoring.
Setting Up the Problem for the Rest of the Lesson
Everything in this lesson flows from the problem we've just established. You have extracted facts. You need a score. The naive approaches — another LLM call, a hand-rolled function — fail in predictable and serious ways. Datalog offers three properties that directly address those failures: termination guarantees, composable rules, and built-in audit traces.
In the sections that follow, we'll move from motivation to mechanics. You'll learn the core Datalog building blocks with a focus on the mental model most useful for scoring pipelines. You'll see how to translate real rubric structures — pass/fail gates, multi-level quality tiers, threshold aggregation — into runnable Datalog rules using a Python-accessible engine. You'll learn how to surface the derivation trace as a structured audit log. And you'll see the common pitfalls that trip practitioners up when they first adopt this approach.
💡 Pro Tip: Before moving on, take stock of your current evaluation pipeline. Where does the scoring happen? Is it deterministic? Could you reproduce any historical score from stored inputs alone? If the answer to either of those last questions is "not reliably," the rest of this lesson is written specifically for you.
The scoring layer is where trust in your evaluation system is won or lost. It's time to build it right.
Datalog Fundamentals: Facts, Rules, and Queries
Before you can trust a score, you need to trust the mechanism that produces it. In the previous section we established why the scoring layer is where evaluation pipelines tend to break down. Now we build the foundation: a working mental model of Datalog that will carry you through the rest of this lesson. No prior logic programming experience is assumed. If you have written SQL queries or used pandas to filter a dataframe, you already have enough intuition to follow along.
The Mental Model: A Database That Knows the Rules
The single most useful way to think about Datalog in a scoring context is this: Datalog is a database where you store extracted evidence, and the rubric lives in the query layer. Every fact your LLM extractor produces becomes a row in that database. Every criterion in your rubric becomes a rule. When you run the engine, it derives every score that follows logically from those inputs.
This maps directly onto the two-layer architecture you want in a production evaluation pipeline:
┌─────────────────────────────────────────────────────┐
│ DATALOG ENGINE │
│ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ EDB (Evidence) │ │ IDB (Rubric) │ │
│ │ │ │ │ │
│ │ extracted by LLM │ │ written by you │ │
│ │ or classical NLP │──▶│ rules that derive │ │
│ │ │ │ scores from facts │ │
│ │ Ground truth only. │ │ No hard-coded │ │
│ │ No inference here. │ │ data. Pure logic. │ │
│ └─────────────────────┘ └─────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Query Results │ │
│ │ (scores + │ │
│ │ derivations) │ │
│ └──────────────────┘ │
└─────────────────────────────────────────────────────┘
Datalog formalizes this separation with two named components. The EDB (Extensional Database) contains facts — ground-truth assertions that you supply directly. The IDB (Intensional Database) contains rules — derived relations that the engine computes from the EDB. This distinction is not cosmetic. It enforces a discipline that hand-rolled scoring code almost never has: the data layer and the logic layer cannot accidentally bleed into each other.
Facts: The Extensional Database
A fact in Datalog is an unconditional assertion about the world. It takes the form of a relation name followed by one or more arguments in parentheses. There is no body, no condition — it is simply true.
In an LLM evaluation context, your LLM extractor or a classical NLP step produces structured output that you load directly into the EDB as facts. Here is what that looks like for a response-quality scoring pipeline:
% Facts extracted from LLM output (the EDB)
% response_length(ResponseId, WordCount)
response_length(resp_001, 312).
response_length(resp_002, 47).
% cites_source(ResponseId, SourceId)
cites_source(resp_001, src_a).
cites_source(resp_001, src_b).
% contains_disclaimer(ResponseId)
contains_disclaimer(resp_001).
% toxicity_score(ResponseId, Score) -- 0.0 to 1.0
toxicity_score(resp_001, 0.02).
toxicity_score(resp_002, 0.71).
Each line ending in a period is one fact. response_length(resp_001, 312) asserts that the response identified as resp_001 has a word count of 312. Notice that each fact is ground — it contains no variables, only concrete values. This is the defining characteristic of EDB facts: they represent direct observations, not derived conclusions.
💡 Mental Model: Think of EDB facts as rows in a database table. response_length is a two-column table with a response ID column and a word count column. cites_source is a two-column table with a response ID and a source ID. The relation name is the table name. This analogy holds almost perfectly for the EDB layer.
🎯 Key Principle: Every piece of evidence your LLM extractor produces should map to exactly one fact relation. If your extractor returns a JSON object with five fields, you may need five separate relations — one per logical claim. Bundling multiple claims into a single relation conflates things that your rules will want to reason about independently.
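The one-relation-per-claim flattening can be sketched directly; the payload fields here are a hypothetical extractor output, not a fixed schema.

```python
import json

# Flatten one extractor JSON object into one fact per logical claim.
raw = json.loads('''{"response_id": "resp_001", "citation_count": 3,
                     "tone": "neutral", "word_count": 412, "has_summary": true}''')

rid = raw.pop("response_id")
# One fact per field: (relation_name, response_id, value)
facts = [(field, rid, value) for field, value in raw.items()]
for fact in facts:
    print(fact)
```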
Rules: The Intensional Database
A rule in Datalog has two parts separated by the symbol :-, which you should read as "if" or "provided that". The left side is the head — the relation being derived. The right side is the body — a conjunction of conditions that must all be satisfied for the head to hold.
head(X, Y) :- condition_one(X, Z), condition_two(Z, Y).
Read this as: "For any X and Y, head(X, Y) is true if there exists some Z such that condition_one(X, Z) is true and condition_two(Z, Y) is true."
Let's translate a real rubric criterion. Suppose your rubric says: A response passes the length check if it contains at least 100 words. In Datalog:
% Rules (the IDB)

% A response passes the length gate if its word count is >= 100
passes_length_gate(ResponseId) :-
    response_length(ResponseId, WordCount),
    WordCount >= 100.

% A response is safe if its toxicity score is below 0.1
response_is_safe(ResponseId) :-
    toxicity_score(ResponseId, Score),
    Score < 0.1.

% A response has adequate citations if it cites at least one source
has_citation(ResponseId) :-
    cites_source(ResponseId, _).

% A response passes basic quality checks if it passes all three gates
passes_basic_quality(ResponseId) :-
    passes_length_gate(ResponseId),
    response_is_safe(ResponseId),
    has_citation(ResponseId).
Several things are worth unpacking here. First, notice how closely the rules read like the original rubric language. passes_length_gate(ResponseId) is almost self-documenting. This is a deliberate design property of Datalog — the restricted syntax forces a degree of clarity that general-purpose code rarely achieves.
Second, notice _ in cites_source(ResponseId, _). The anonymous variable (underscore) means "there exists some source ID here, but I don't care what it is." This is Datalog's way of expressing existence without binding a name.
Third, the rule passes_basic_quality composes the three earlier rules without re-specifying their internals. This composability is central to maintaining rubrics as they grow. Each rule is a named, reusable building block.
Variables, Unification, and Pattern Matching
Variables in Datalog are tokens that begin with an uppercase letter (in most syntaxes). They do not refer to mutable storage slots the way variables do in Python or Java. Instead, they are placeholders that the engine fills in by unification — the process of finding values that make all conditions in a rule body simultaneously true.
When the Datalog engine evaluates passes_length_gate(ResponseId), it:
- Looks at every fact in the response_length relation
- For each fact response_length(id, count), binds ResponseId = id and WordCount = count
- Checks whether count >= 100
- If yes, adds passes_length_gate(id) to the derived facts
This process — scanning a relation and binding variables to matching values — is called unification. It is the engine beneath all Datalog evaluation, and it is what makes rules composable without explicit iteration.
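The binding process just described can be written out as a plain Python loop. This is a toy illustration of the semantics only; real engines use indexes and semi-naive evaluation rather than rescanning whole relations.

```python
# The EDB relation, as a list of (ResponseId, WordCount) rows
response_length = [("resp_001", 312), ("resp_002", 47)]

passes_length_gate = set()
for rid, count in response_length:   # scan every fact in the relation
    # unification: ResponseId = rid, WordCount = count
    if count >= 100:                 # body condition
        passes_length_gate.add(rid)  # derive the head fact

print(passes_length_gate)
# {'resp_001'}
```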
💡 Real-World Example: Suppose your extraction step produces fifty response facts. You write one rule: passes_length_gate(ResponseId) :- response_length(ResponseId, W), W >= 100. The engine automatically applies that rule to all fifty responses. You write the rule once; the engine handles the iteration. Compare this to hand-rolled Python where you would write a loop, a conditional, and then wonder whether you handled edge cases consistently across all fifty.
🧠 Mnemonic: Think of variables in Datalog rules as column filters with capture. When you write response_length(ResponseId, WordCount), you are saying "match any row in the response_length table and capture both columns under these names for use in the rest of the rule."
How Datalog Differs from SQL
If Datalog's EDB looks like a relational database, you might wonder why you shouldn't just use SQL. The differences are real and they matter for scoring pipelines specifically.
NULL handling. SQL's three-valued logic (true / false / NULL) is a perennial source of subtle bugs. A NULL in a JOIN condition silently drops rows. A NULL in an aggregate changes the result in ways that surprise even experienced SQL users. Datalog has no NULL. Every value in the EDB is a concrete term. If a fact is absent, it is simply absent — which brings us to the next point.
Recursion. SQL supports recursion via WITH RECURSIVE, but it requires careful handling to avoid infinite loops and has historically had quirks in how different databases implement it. Datalog's evaluation model guarantees termination for every valid program: function symbols, the usual source of runaway recursion, are excluded by definition. You cannot accidentally write a scoring rule that runs forever.
Declarative semantics. SQL SELECT statements are procedural in character — you specify join order, and query planners can surprise you. Datalog rules have a clean declarative semantics: a rule defines a relation, not a procedure for computing it. The engine chooses the evaluation strategy.
⚠️ Common Mistake — Mistake 1: Trying to replicate SQL-style LEFT OUTER JOIN behavior in Datalog using negation. In SQL, LEFT JOIN returns rows from the left table even when there is no match on the right. In Datalog, the equivalent requires negation as failure, which has specific safety restrictions. We cover this properly in section five. For now, resist the impulse to reach for negation as a direct substitute for outer joins.
How Datalog Differs from Prolog
Datalog is often described as a subset of Prolog, which is technically accurate but practically misleading for the audience writing scoring rules. The differences are features, not limitations.
Guaranteed termination. Prolog can execute programs that never halt. Search can follow infinite chains of inference. Datalog's restrictions — no function symbols that build new compound terms, and stratified evaluation — guarantee that every valid Datalog program terminates. For a scoring system running in a CI pipeline or responding to API calls, non-termination is not a theoretical concern; it is a production incident waiting to happen.
No procedural escape hatches. Prolog has cut (!), a control operator that prunes the search tree for performance or to implement negation. Cut makes Prolog programs order-dependent: the same program produces different results depending on clause ordering. Datalog has no cut. Rule evaluation is order-independent. You can add a new scoring rule anywhere in your file without worrying about whether it interacts with the evaluation order of existing rules.
No side effects. Prolog programs can execute arbitrary I/O as a side effect of evaluation. This means a malicious or malformed input fact could theoretically trigger side effects if your rules were written carelessly. Datalog is purely declarative — evaluation produces derived facts and nothing else. Running your scoring rules on untrusted LLM output cannot cause your scoring engine to write to disk or make network calls.
📋 Quick Reference Card: Datalog vs. SQL vs. Prolog
| Feature | 🔒 Datalog | 🗄️ SQL | 🧩 Prolog |
|---|---|---|---|
| 🔄 Recursion safety | Guaranteed termination | Requires care with RECURSIVE | Can infinite-loop |
| ❓ Missing values | Closed-world (absent = false) | NULL / three-valued logic | Unification failure |
| 📋 Evaluation order | Order-independent | Planner-dependent | Clause-order-dependent |
| 🔧 Side effects | None | Possible (triggers, etc.) | Possible (assert, I/O) |
| 🎯 Termination guarantee | Yes (no function symbols) | No | No |
| 🔒 Safe on untrusted input | Yes | Depends on implementation | No |
The Closed-World Assumption
Datalog operates under the closed-world assumption (CWA): anything not explicitly asserted or derivable from the rules is considered false. This is the opposite of the open-world assumption used in systems like OWL ontologies, where absence of information means "unknown" rather than "false."
❌ Wrong thinking: "The closed-world assumption is a limitation because the LLM might have said something we failed to extract."
✅ Correct thinking: "The closed-world assumption is what makes scores deterministic and auditable. If the extractor did not produce a fact, we score as if the criterion was not met. This is a policy choice, and it should be a deliberate one."
In a scoring pipeline, the closed-world assumption translates to a specific and defensible stance: a response only gets credit for criteria that are evidenced in the extracted facts. If your extractor missed a citation because the LLM output was ambiguously formatted, the score reflects that miss. This surfaces extraction failures rather than hiding them behind lenient scoring.
💡 Pro Tip: Design your extraction schema to be conservative — extract only what you can assert with high confidence. The CWA then ensures that uncertain extractions don't silently inflate scores. A fact you don't assert costs you a score on that criterion. A fact you assert incorrectly can cause a false pass. The asymmetry favors under-extraction over over-extraction when you are operating under the CWA.
Here is a complete worked example that ties together facts, rules, the CWA, and variable unification in a single runnable snippet using pyDatalog, a Python-accessible Datalog engine:
from pyDatalog import pyDatalog

# Declare all relation names before use
pyDatalog.create_terms(
    'ResponseId, WordCount, Score, SourceId',
    'response_length, toxicity_score, cites_source, contains_disclaimer',
    'passes_length_gate, response_is_safe, has_citation, passes_basic_quality'
)

# ── EDB: load extracted facts ──────────────────────────────────────────────
# These would come from your LLM extraction step in production
+ response_length('resp_001', 312)   # resp_001 has 312 words
+ response_length('resp_002', 47)    # resp_002 has only 47 words
+ toxicity_score('resp_001', 0.02)   # safe
+ toxicity_score('resp_002', 0.71)   # unsafe
+ cites_source('resp_001', 'src_a')  # resp_001 cites source A
+ cites_source('resp_001', 'src_b')  # resp_001 also cites source B
# resp_002 cites nothing — absent facts are false under the CWA

# ── IDB: scoring rules (the rubric) ────────────────────────────────────────
passes_length_gate(ResponseId) <= (
    response_length(ResponseId, WordCount) &
    (WordCount >= 100)
)

response_is_safe(ResponseId) <= (
    toxicity_score(ResponseId, Score) &
    (Score < 0.1)
)

has_citation(ResponseId) <= (
    cites_source(ResponseId, SourceId)  # SourceId exists but value unused
)

passes_basic_quality(ResponseId) <= (
    passes_length_gate(ResponseId) &
    response_is_safe(ResponseId) &
    has_citation(ResponseId)
)

# ── Queries ────────────────────────────────────────────────────────────────
print("Responses passing basic quality check:")
print(passes_basic_quality(ResponseId))  # Should return only resp_001
print("\nResponses passing length gate:")
print(passes_length_gate(ResponseId))    # resp_001 only
print("\nSafe responses:")
print(response_is_safe(ResponseId))      # resp_001 only
Running this produces:
Responses passing basic quality check:
ResponseId
resp_001
Responses passing length gate:
ResponseId
resp_001
Safe responses:
ResponseId
resp_001
Notice that resp_002 fails silently but traceably: it fails passes_length_gate because 47 < 100, it fails response_is_safe because 0.71 ≥ 0.1, and it fails has_citation because no cites_source facts exist for it (CWA: absence means false). The composite rule passes_basic_quality correctly excludes it. Every derivation is fully traceable — a property we will exploit in section four to build audit logs.
🤔 Did you know? Datalog was developed in the 1970s and 1980s primarily as a database query language that could express recursive queries cleanly. It was long considered an academic curiosity because production databases chose SQL. In recent years it has seen a revival in program analysis tools (Doop, Soufflé), distributed systems (Bloom, Dedalus), and now LLM evaluation pipelines — precisely because its restrictions that once seemed limiting turn out to be exactly what you want when correctness and auditability matter.
Putting It Together: The Scoring Pipeline Data Flow
With facts, rules, variables, and the closed-world assumption in place, the full data flow of a Datalog-based scoring pipeline becomes clear:
LLM Response Text
│
▼
┌─────────────────┐
│ LLM Extractor │ (structured extraction prompt)
│ or NLP step │
└────────┬────────┘
│ JSON / structured output
▼
┌─────────────────────────────────────┐
│ EDB Loader │
│ Convert extracted fields → facts │
│ e.g. {"word_count": 312} → │
│ response_length(id, 312). │
└────────┬────────────────────────────┘
│ Ground facts
▼
┌─────────────────────────────────────┐
│ Datalog Engine │
│ EDB (facts) + IDB (rubric rules) │
│ → derives score relations │
└────────┬────────────────────────────┘
│ Derived facts (scores + traces)
▼
┌─────────────────────────────────────┐
│ Score Consumer │
│ CI pass/fail, dashboard, audit log │
└─────────────────────────────────────┘
The critical insight is that the Datalog engine is the only place where inference happens. The extractor produces facts — it makes no scoring decisions. The score consumer reads results — it does no inference. All logic is centralized, declarative, and auditable in the IDB.
This architecture means that when a score changes unexpectedly, you have exactly two places to look: the extraction step (did the facts change?) and the rules (did the rubric change?). There is no third place — no implicit behavior hiding in control flow, no NULL cascading through a join, no Prolog search path that chose a different clause ordering.
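The two-place triage discipline can be sketched in plain Python. The helper below is illustrative (the fact shapes and rule-set version hashes are assumptions, not part of any engine): given the fact sets and rule hashes of two runs, it names the layer to inspect first.

```python
def triage_score_change(old_facts: set, new_facts: set,
                        old_rules_hash: str, new_rules_hash: str) -> str:
    """Localize a score change to the extraction layer, the rubric, or neither."""
    causes = []
    if old_facts != new_facts:
        # symmetric difference: facts present in one run but not the other
        causes.append("extraction (facts changed: %s)" %
                      sorted(old_facts ^ new_facts))
    if old_rules_hash != new_rules_hash:
        causes.append("rubric (rules changed)")
    return "; ".join(causes) or "neither layer changed, so scores must match"

old = {("response_length", "resp_001", 312)}
new = {("response_length", "resp_001", 298)}
print(triage_score_change(old, new, "v1", "v1"))
```

Because Datalog evaluation is deterministic, the "neither changed" branch is a genuine invariant: identical facts plus identical rules cannot yield different scores.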
💡 Pro Tip: When onboarding a new rubric criterion, start by writing the Datalog rule first, before touching the extractor. The rule will make explicit exactly what facts it needs. Then implement the extractor to produce those facts. Working in this order prevents the common mistake of extracting rich structure that the scoring rules never actually use.
In the next section, we will move from these fundamentals to the full vocabulary of scoring logic: thresholds, multi-level quality tiers, and aggregation over multiple evidence facts — all expressed as composable Datalog rules with runnable code examples.
Encoding Scoring Logic: Thresholds, Levels, and Aggregation
At this point in the pipeline, you have done the hard work: an LLM has responded to a prompt, a structured extraction step has pulled out verifiable facts, and those facts are sitting in a clean, typed representation. Now something has to turn that evidence into a score. This section is about doing that conversion in a way that is deterministic, readable, and composable — using Datalog rules that map almost directly onto the language of a rubric.
The path from "extracted facts" to "final score" sounds deceptively simple. In practice, most teams fall into one of two traps. They either write a second LLM call to "judge" the first output (reintroducing all the variance they just removed), or they write a Python function full of nested if-elif branches that becomes unmaintainable past a dozen criteria. Datalog offers a third path: a declarative rule engine where each rubric criterion becomes a rule, rules compose without surprising interactions, and the evaluation itself is guaranteed to terminate.
Setting Up a Datalog Engine in Python
For the examples in this section we will use pyDatalog, a pure-Python Datalog implementation that requires no external processes and integrates naturally into an evaluation pipeline. Where performance matters at scale, Soufflé (a compiled Datalog engine from Oracle Labs) is an excellent alternative accessible via subprocess; the rule syntax is nearly identical, so the conceptual translation is straightforward.
Install pyDatalog with a single pip command:
pip install pyDatalog
With pyDatalog, facts and rules are expressed directly in Python using a domain-specific syntax. The key mental model is that you are populating an in-memory fact database and then asking queries against it — exactly as you would with a relational database, except that the query language lets you write recursive, composable rules rather than flat SQL.
Here is a minimal setup that demonstrates the pattern we will expand throughout the section:
from pyDatalog import pyDatalog
## Declare all term names that will be used as predicates or variables
pyDatalog.create_terms(
'criterion, response_id, satisfied, score_level, final_verdict,\
X, Y, Z, N'
)
## Load extracted facts — in a real pipeline these come from your
## extraction layer, not from hand-coding
+ satisfied('resp_001', 'has_citation')
+ satisfied('resp_001', 'word_count_ok')
+ satisfied('resp_001', 'no_hallucinated_entity')
## deliberately omitting 'structured_summary' and 'hedges_uncertainty'
## to simulate a partial response
## Verify the facts loaded correctly
print(satisfied('resp_001', X)) # prints all satisfied criteria for resp_001
The + operator inserts a fact into the Datalog store. Removing a fact uses -. This is how your extraction adapter will feed evidence into the scoring engine: parse the LLM output, derive boolean or numeric facts, and assert them before querying.
💡 Pro Tip: Keep your fact-loading code in a dedicated adapter function that takes your extraction schema as input and returns nothing — its only side effect is populating the Datalog store. This keeps the boundary between extraction and scoring explicit and testable.
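A minimal sketch of that adapter boundary, using a plain Python set as a stand-in for the Datalog store; the extraction-schema keys and fact shapes here are hypothetical:

```python
FACT_STORE: set[tuple] = set()  # stand-in for the Datalog EDB

def load_extraction(extraction: dict) -> None:
    """Adapter: convert an extraction-schema dict into ground facts.
    Returns nothing; its only side effect is populating the fact store."""
    rid = extraction["response_id"]
    FACT_STORE.add(("response_length", rid, extraction["word_count"]))
    for src in extraction.get("citations", []):
        FACT_STORE.add(("cites_source", rid, src))

load_extraction({"response_id": "resp_001", "word_count": 312,
                 "citations": ["src_a", "src_b"]})
print(sorted(FACT_STORE))
```

The adapter is trivially unit-testable: feed it a schema dict, assert on the facts it produced, and no scoring logic is ever in scope.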
Binary Pass/Fail Rules: Single-Criterion Gates
The simplest scoring structure is a hard gate: a response either satisfies a criterion or it does not, and failure on any gate disqualifies the response regardless of how well it does elsewhere. Citation requirements in legal document review and toxicity checks in content moderation are archetypal examples.
In Datalog, a hard gate is a rule that derives a fails_gate fact whenever a required criterion is absent:
from pyDatalog import pyDatalog
pyDatalog.create_terms(
'satisfied, fails_gate, passes_all_gates,\
required_criterion, response_id,\
R, C'
)
## Define which criteria are mandatory gates
+ required_criterion('has_citation')
+ required_criterion('no_hallucinated_entity')
+ required_criterion('no_toxic_content')
## Rule: a response fails a gate if a required criterion is not satisfied
## (simplified — see the grounding warning below)
fails_gate(R, C) <= required_criterion(C) & ~satisfied(R, C)
## Rule: a response passes all gates only if it fails none of them
## (Datalog negation-as-failure: passes when no fails_gate fact can be derived)
passes_all_gates(R) <= ~fails_gate(R, C) # simplified; see warning below
## Load facts for two responses
+ satisfied('resp_001', 'has_citation')
+ satisfied('resp_001', 'no_hallucinated_entity')
## resp_001 is missing 'no_toxic_content'
+ satisfied('resp_002', 'has_citation')
+ satisfied('resp_002', 'no_hallucinated_entity')
+ satisfied('resp_002', 'no_toxic_content')
## Query
print("Gates failed by resp_001:", fails_gate('resp_001', C))
print("Gates failed by resp_002:", fails_gate('resp_002', C))
print("Passes all gates:", passes_all_gates(R))
⚠️ Common Mistake — Negation and Unbound Variables: The passes_all_gates rule above is written for illustration. In strict Datalog (and in Soufflé), negation over an unbound variable requires the variable to appear positively elsewhere in the rule body — otherwise the engine cannot enumerate the domain. In pyDatalog you would typically write passes_all_gates(R) <= response(R) & ~fails_gate(R, C) where response(R) is a fact that enumerates all response IDs. The fails_gate rule has the same weakness: R appears only in the negated literal, so a strict engine needs fails_gate(R, C) <= response(R) & required_criterion(C) & ~satisfied(R, C). Always ground your negated predicates.
The power of rule chaining becomes visible when you compose gate checks. A disqualified rule can call fails_gate, and a route_to_human_review rule can call disqualified. Each layer adds meaning without modifying the layers below — this is the composability property that makes Datalog rubrics maintainable as they grow.
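The layering can be sketched in plain Python, with each predicate as a function that calls only the layer below. This is an illustration of the composition pattern, not the Datalog engine; the function names mirror the text.

```python
REQUIRED = {"has_citation", "no_hallucinated_entity", "no_toxic_content"}

def fails_gate(satisfied: set[str]) -> set[str]:
    """Bottom layer: which required criteria are missing."""
    return REQUIRED - satisfied

def disqualified(satisfied: set[str]) -> bool:
    """Middle layer: calls fails_gate, adds no knowledge of REQUIRED itself."""
    return bool(fails_gate(satisfied))

def route_to_human_review(satisfied: set[str]) -> bool:
    """Top layer: adds meaning without modifying the layers below."""
    return disqualified(satisfied)

print(route_to_human_review({"has_citation", "no_hallucinated_entity"}))
```

Each layer can be tested in isolation, and changing the set of required criteria touches exactly one definition, just as adding a gate fact touches exactly one Datalog relation.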
Ordinal Scoring Levels: Representing Quality Tiers Without if-elif Ladders
Many rubrics are not binary — they assign quality tiers such as Bronze/Silver/Gold or scores on a 1-to-5 scale. In Python, the natural reflex is an if-elif ladder: check the highest threshold first, fall through to lower ones. This approach works but becomes brittle as criteria multiply, because the order of branches carries implicit logic that is easy to break when editing.
Datalog handles ordinal levels through independent per-tier rules: a separate rule for each tier, each expressing the conditions for that tier on its own. The engine evaluates all rules and derives whichever tiers are supported by the evidence — no branching, no order dependency.
Consider a rubric for an AI-generated research summary with three quality tiers:
- Bronze: The response satisfies at least two of five criteria.
- Silver: The response satisfies at least four of five criteria.
- Gold: The response satisfies all five criteria.
from pyDatalog import pyDatalog
pyDatalog.create_terms(
'satisfied, criterion_count, quality_tier,\
response_id, all_criteria,\
R, C, N, X'
)
## Enumerate all five rubric criteria
+ all_criteria('has_citation')
+ all_criteria('word_count_ok')
+ all_criteria('no_hallucinated_entity')
+ all_criteria('structured_summary')
+ all_criteria('hedges_uncertainty')
## Aggregation rule: count how many criteria are satisfied for response R
## pyDatalog writes aggregates in function notation; len_ counts the
## bindings of C, grouped by the head key R
(criterion_count[R] == len_(C)) <= satisfied(R, C)
## Tier rules: each tier checks the count independently
## Note: in pyDatalog, numeric comparisons use standard Python operators
quality_tier(R, 'bronze') <= (criterion_count[R] == N) & (N >= 2)
quality_tier(R, 'silver') <= (criterion_count[R] == N) & (N >= 4)
quality_tier(R, 'gold') <= (criterion_count[R] == N) & (N == 5)
## Load facts — resp_001 satisfies 3 of 5 criteria
+ satisfied('resp_001', 'has_citation')
+ satisfied('resp_001', 'word_count_ok')
+ satisfied('resp_001', 'no_hallucinated_entity')
## resp_002 satisfies all 5
+ satisfied('resp_002', 'has_citation')
+ satisfied('resp_002', 'word_count_ok')
+ satisfied('resp_002', 'no_hallucinated_entity')
+ satisfied('resp_002', 'structured_summary')
+ satisfied('resp_002', 'hedges_uncertainty')
print("resp_001 tiers:", quality_tier('resp_001', X)) # bronze only
print("resp_002 tiers:", quality_tier('resp_002', X)) # bronze, silver, gold
Notice that resp_002 will match all three tier rules simultaneously — Bronze, Silver, and Gold are all derived. This is correct behavior: the tiers are cumulative labels, and your downstream logic selects the highest one. The key insight is that the rules are not mutually exclusive by default — you choose the maximum in a query, not by writing exclusion logic into the rules.
💡 Mental Model: Think of each tier rule as a spotlight, not a filter. Bronze lights up whenever its condition is true. Silver lights up independently. Gold lights up independently. Your query asks "what is the highest tier that is lit?" The rules never need to know about each other.
🎯 Key Principle: Separating tier conditions from tier selection keeps each rule readable and individually testable. You can unit-test the Silver rule without worrying about whether the Gold rule interferes.
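Tier selection then lives entirely in the query layer. A minimal sketch of "highest lit tier" in plain Python, assuming the derived labels arrive as a set of strings:

```python
TIER_ORDER = ["bronze", "silver", "gold"]  # lowest to highest

def highest_tier(lit: set[str]) -> str:
    """Select the highest tier among all derived (lit) tier labels."""
    ranked = [t for t in TIER_ORDER if t in lit]
    return ranked[-1] if ranked else "none"

print(highest_tier({"bronze", "silver", "gold"}))  # gold
print(highest_tier({"bronze"}))                    # bronze
```

The rules stay cumulative and mutually ignorant; only this one selection function knows that Gold outranks Silver.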
Aggregation Rules: Counting, Weighting, and Deriving Final Verdicts
Real rubrics rarely treat all criteria equally. A citation requirement might be worth more than a word-count check. Weighted scoring requires aggregating across criteria with different numeric contributions and comparing the total against a threshold.
Datalog supports aggregation through built-in functions (sum_, len_, min_, max_ in pyDatalog; sum, count, min, max in Soufflé). The pattern is to first define criterion weights as facts, then aggregate with a rule.
Here is a complete weighted scoring example using a five-criterion rubric:
from pyDatalog import pyDatalog
pyDatalog.create_terms(
'satisfied, criterion_weight, weighted_sum, weighted_score,\
passes_weighted_threshold, final_verdict,\
response_id, criterion,\
R, C, W, S, X'
)
## --- Rubric definition: each criterion and its weight ---
## Weights sum to 10; threshold for passing is 7
+ criterion_weight('has_citation', 3)
+ criterion_weight('word_count_ok', 1)
+ criterion_weight('no_hallucinated_entity', 3)
+ criterion_weight('structured_summary', 2)
+ criterion_weight('hedges_uncertainty', 1)
## --- Aggregation: sum weights of satisfied criteria per response ---
## pyDatalog writes aggregates in function notation: the result is grouped
## by the head key (here R), and for_each=C makes each satisfied criterion
## contribute its weight exactly once
(weighted_sum[R] == sum_(W, for_each=C)) <= (
satisfied(R, C) &
criterion_weight(C, W)
)
## Wrap the function in a predicate so downstream rules and queries keep
## the familiar weighted_score(R, S) shape
weighted_score(R, S) <= (weighted_sum[R] == S)
## --- Threshold gate: pass if weighted score >= 7 ---
passes_weighted_threshold(R) <= weighted_score(R, S) & (S >= 7)
## --- Final verdict combines gate check and threshold ---
## Ground R with a positive literal before negating; a response with no
## satisfied facts derives no verdict at all (the pipeline example later
## in this section handles that with an all_responses relation)
final_verdict(R, 'PASS') <= passes_weighted_threshold(R)
final_verdict(R, 'FAIL') <= satisfied(R, C) & ~passes_weighted_threshold(R)
## --- Load extracted facts for two responses ---
## resp_001: citation + no hallucination (weight 3+3=6, below threshold)
+ satisfied('resp_001', 'has_citation')
+ satisfied('resp_001', 'no_hallucinated_entity')
## resp_002: citation + no hallucination + summary (3+3+2=8, above threshold)
+ satisfied('resp_002', 'has_citation')
+ satisfied('resp_002', 'no_hallucinated_entity')
+ satisfied('resp_002', 'structured_summary')
print("resp_001 score:", weighted_score('resp_001', S))
print("resp_001 verdict:", final_verdict('resp_001', X))
print("resp_002 score:", weighted_score('resp_002', S))
print("resp_002 verdict:", final_verdict('resp_002', X))
The aggregation clause sums the weights of each response's satisfied criteria, effectively a GROUP BY on the response ID in SQL terms, expressed inline in the rule. The result is a single derived value per response: its total weighted score.
⚠️ Common Mistake — Double-Counting in Aggregations: If a criterion appears multiple times in your fact database (possible if your extraction layer is not idempotent), it will be counted or summed multiple times. Use a deduplication step before loading facts, or add a rule that selects distinct criterion-response pairs before aggregating.
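A minimal deduplication sketch to run before fact loading, in plain Python; the fact shape is the (response, criterion) pair used throughout this section:

```python
def dedupe_facts(raw: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Collapse repeated (response, criterion) pairs while keeping order,
    so a non-idempotent extractor cannot inflate counts or sums."""
    seen, out = set(), []
    for pair in raw:
        if pair not in seen:
            seen.add(pair)
            out.append(pair)
    return out

raw = [("resp_001", "has_citation"), ("resp_001", "has_citation"),
       ("resp_001", "word_count_ok")]
print(dedupe_facts(raw))
```

Running this in the adapter keeps the Datalog store honest even when upstream extraction retries or re-runs.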
🤔 Did you know? Soufflé, the compiled Datalog engine used by the Doop program analysis framework, can evaluate aggregation rules over millions of facts in seconds. For large-scale LLM evaluation pipelines scoring thousands of responses, switching from pyDatalog to Soufflé via subprocess can reduce scoring time by two orders of magnitude — with no changes to the rule logic itself.
End-to-End Worked Example: Five-Criterion Rubric Scoring
Let us now walk through a complete pipeline — from raw LLM output to final score — to show how all the pieces fit together. The scenario: you are evaluating AI-generated medical literature summaries against a five-criterion rubric used by a clinical research team.
The rubric criteria and their weights are:
┌─────────────────────────────────┬────────┬───────────────────────────────┐
│ Criterion │ Weight │ How Extracted │
├─────────────────────────────────┼────────┼───────────────────────────────┤
│ has_citation │ 3 │ regex: [Author, Year] pattern │
│ word_count_ok (150-400 words) │ 1 │ len(response.split()) │
│ no_hallucinated_entity │ 3 │ NER + knowledge base lookup │
│ structured_summary (has headers)│ 2 │ markdown heading detection │
│ hedges_uncertainty │ 1 │ keyword list match │
└─────────────────────────────────┴────────┴───────────────────────────────┘
Pass threshold: weighted score >= 7
Gold tier: all 5 criteria satisfied
Silver tier: score >= 7
Bronze tier: score >= 4
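Before wiring up the engine, it helps to have a pure-Python oracle of this rubric to cross-check the Datalog layer against in tests. A sketch under the weights and thresholds listed above (the oracle is a test aid, not part of the pipeline):

```python
WEIGHTS = {"has_citation": 3, "word_count_ok": 1, "no_hallucinated_entity": 3,
           "structured_summary": 2, "hedges_uncertainty": 1}

def score_oracle(satisfied: set[str]) -> dict:
    """Pure-Python reference scoring for the five-criterion rubric."""
    s = sum(w for c, w in WEIGHTS.items() if c in satisfied)
    tier = ("gold" if len(satisfied & WEIGHTS.keys()) == 5 else
            "silver" if s >= 7 else
            "bronze" if s >= 4 else "none")
    return {"score": s, "tier": tier, "verdict": "PASS" if s >= 7 else "FAIL"}

print(score_oracle({"has_citation", "no_hallucinated_entity", "structured_summary"}))
```

If the Datalog rules and this oracle ever disagree on the same fact set, one of them misstates the rubric, and the disagreement itself is the bug report.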
The architecture of the scoring pipeline looks like this:
LLM Response (raw text)
│
▼
┌───────────────────┐
│ Extraction Layer │ ← regex, NER, word count, keyword match
└────────┬──────────┘
│ boolean + numeric facts
▼
┌───────────────────┐
│ Datalog Store │ ← satisfied('resp_X', 'criterion_Y')
└────────┬──────────┘
│
┌───────┴────────┐
│ Rule Engine │ ← weights, thresholds, tier rules
└───────┬────────┘
│
┌───────┴────────────────────┐
│ Derived Facts │
│ weighted_score(R, 8) │
│ quality_tier(R, 'silver') │
│ final_verdict(R, 'PASS') │
└────────────────────────────┘
Here is the complete Python implementation of this pipeline, combining extraction simulation with the full Datalog scoring layer:
from pyDatalog import pyDatalog
## ── Term declarations ──────────────────────────────────────────────────────
pyDatalog.create_terms(
'satisfied, criterion_weight, all_responses,\
weighted_sum, weighted_score, quality_tier, final_verdict,\
passes_threshold, fails_hard_gate, required_criterion,\
R, C, W, S, N, T'
)

def load_rubric():
    """Assert rubric definition facts (weights and hard gates)."""
    # Criterion weights
    + criterion_weight('has_citation', 3)
    + criterion_weight('word_count_ok', 1)
    + criterion_weight('no_hallucinated_entity', 3)
    + criterion_weight('structured_summary', 2)
    + criterion_weight('hedges_uncertainty', 1)
    # Hard gates (failure disqualifies regardless of score)
    + required_criterion('no_hallucinated_entity')

def load_rules():
    """Define all Datalog scoring rules."""
    # Hard gate failure: all_responses(R) grounds R before the negation
    fails_hard_gate(R) <= (
        all_responses(R) & required_criterion(C) & ~satisfied(R, C)
    )
    # Weighted score aggregation: pyDatalog writes aggregates in function
    # notation, grouped by the head key R; the wrapper predicate keeps the
    # familiar weighted_score(R, S) shape for the rules below
    (weighted_sum[R] == sum_(W, for_each=C)) <= (
        satisfied(R, C) &
        criterion_weight(C, W)
    )
    weighted_score(R, S) <= (weighted_sum[R] == S)
    # Threshold check: bind R and S before negating the gate check
    passes_threshold(R) <= (
        weighted_score(R, S) &
        (S >= 7) &
        ~fails_hard_gate(R)
    )
    # Quality tiers (cumulative — query for highest)
    quality_tier(R, 'bronze') <= weighted_score(R, S) & (S >= 4)
    quality_tier(R, 'silver') <= weighted_score(R, S) & (S >= 7)
    quality_tier(R, 'gold') <= weighted_score(R, S) & (S == 10)
    # Final verdict
    final_verdict(R, 'PASS') <= passes_threshold(R)
    final_verdict(R, 'FAIL') <= all_responses(R) & ~passes_threshold(R)
def extract_and_load_facts(response_id: str, response_text: str):
    """Simulate extraction layer — in production this calls real extractors."""
    + all_responses(response_id)
    # Word count check (150-400 words)
    word_count = len(response_text.split())
    if 150 <= word_count <= 400:
        + satisfied(response_id, 'word_count_ok')
    # Citation check (simplified pattern)
    if '[' in response_text and ']' in response_text:
        + satisfied(response_id, 'has_citation')
    # Hallucination check (simulated; passes unless the text carries an explicit flag)
    if 'HALLUCINATION_FLAG' not in response_text:
        + satisfied(response_id, 'no_hallucinated_entity')
    # Structured summary (markdown heading)
    if '##' in response_text or '**' in response_text:
        + satisfied(response_id, 'structured_summary')
    # Hedging language
    hedge_words = ['may', 'might', 'suggests', 'indicates', 'appears']
    if any(w in response_text.lower() for w in hedge_words):
        + satisfied(response_id, 'hedges_uncertainty')
def score_response(response_id: str) -> dict:
    """Query derived facts and return a scoring report."""
    score = weighted_score(response_id, S)
    score_value = score[0][0] if score else 0
    tiers = [row[0] for row in quality_tier(response_id, T)]
    highest_tier = (max(tiers, key=['bronze', 'silver', 'gold'].index)
                    if tiers else 'none')
    verdict = final_verdict(response_id, T)
    verdict_value = verdict[0][0] if verdict else 'FAIL'
    return {
        'response_id': response_id,
        'weighted_score': score_value,
        'highest_tier': highest_tier,
        'verdict': verdict_value,
        'failed_hard_gate': bool(fails_hard_gate(response_id))
    }
## ── Run the pipeline ───────────────────────────────────────────────────────
load_rubric()
load_rules()
## Two sample responses (text abbreviated for illustration)
resp_001_text = """## Summary
This study [Smith, 2023] suggests that the intervention may reduce
marker levels. Results indicate a possible effect on outcomes.
""" # Short — will fail word_count_ok; has citation, summary, hedges, no hallucination
resp_002_text = """## Clinical Summary
The randomized controlled trial [Jones et al., 2022] examined outcomes
across 500 patients. Results indicate that treatment may reduce
primary endpoint events by 23%. The study appears to demonstrate
clinical significance, though longer follow-up periods are warranted.
Methodological limitations include selection bias and short follow-up.
Further research is needed to confirm these findings in diverse populations.
""" # ~80 words shown; assume full version is 200 words in practice
extract_and_load_facts('resp_001', resp_001_text)
extract_and_load_facts('resp_002', resp_002_text)
for rid in ['resp_001', 'resp_002']:
    report = score_response(rid)
    print(f"\n=== {report['response_id']} ===")
    print(f"  Weighted Score : {report['weighted_score']} / 10")
    print(f"  Highest Tier   : {report['highest_tier']}")
    print(f"  Hard Gate Fail : {report['failed_hard_gate']}")
    print(f"  Final Verdict  : {report['verdict']}")
Running this pipeline produces structured, deterministic output for every response. The same facts will always produce the same score. There is no sampling temperature, no prompt variation, no model-version drift — the score is a logical consequence of the facts and the rules, full stop.
💡 Real-World Example: A legal technology company evaluating AI contract summaries uses exactly this pattern. Their extraction layer runs specialized NLP models to detect defined terms, party names, and obligation clauses. The Datalog scoring layer then applies a 12-criterion rubric with three hard gates (all key parties identified, no invented clause numbers, governing law stated). Because the scoring layer is pure Datalog, their compliance team can read the rules directly — no machine learning black box sits between the evidence and the score.
Putting It Together: What the Architecture Gives You
The combination of fact loading, hard gate rules, ordinal tier rules, and weighted aggregation rules gives you a scoring layer with properties that no other common approach provides simultaneously:
📋 Quick Reference Card: Scoring Approaches Compared
| Property | 🔧 Second LLM Call | 🔧 if-elif Python | 🎯 Datalog Rules |
|---|---|---|---|
| 🔒 Deterministic | ❌ No | ✅ Yes | ✅ Yes |
| 📚 Readable as rubric | ⚠️ Partially | ❌ Degrades fast | ✅ Yes |
| 🧠 Composable rules | ❌ No | ⚠️ With effort | ✅ Structural |
| 🎯 Audit trail built-in | ❌ No | ❌ No | ✅ Yes |
| 🔒 Termination guaranteed | ❌ No | ⚠️ By convention | ✅ Language-level |
Each rule you write is simultaneously a scoring mechanism, a specification document, and a unit-testable assertion. When a stakeholder asks "why did this response receive a Silver instead of Gold?" you can point them to the quality_tier rule and the weighted_score fact — no interpretation required.
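Because each rule is an assertion, boundary tests stay tiny. A sketch of unit-testing the Silver condition in isolation, using a pure-Python restatement of the rule as a test oracle (the helper names are illustrative, not the Datalog engine):

```python
WEIGHTS = {"has_citation": 3, "word_count_ok": 1, "no_hallucinated_entity": 3,
           "structured_summary": 2, "hedges_uncertainty": 1}

def silver(satisfied: set[str]) -> bool:
    """Restatement of the Silver rule: weighted score >= 7."""
    return sum(WEIGHTS[c] for c in satisfied) >= 7

def test_silver_boundary():
    # 3 + 3 + 2 = 8: just over the threshold
    assert silver({"has_citation", "no_hallucinated_entity", "structured_summary"})
    # 3 + 3 = 6: just under
    assert not silver({"has_citation", "no_hallucinated_entity"})

test_silver_boundary()
print("silver rule tests passed")
```

The Gold and Bronze rules get the same treatment independently, with no need to mock or order the other tiers.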
The next section shows how to surface the derivation chain of these rules as a structured audit log — turning Datalog's internal reasoning process into an explanation that stakeholders and debugging workflows can consume directly.
The Audit Log Is the Trace: Making Scores Explainable
Every scoring decision your evaluation pipeline makes will eventually face a challenge: Why did this response score a 2 instead of a 3? The challenge might come from a product manager auditing a regression, a developer debugging a sudden drop in pass rates, or a compliance officer who needs to demonstrate that your automated evaluation meets a defensibility standard. In most LLM-as-judge setups, the honest answer to "why" is some version of "the model said so" — which is not an answer that survives scrutiny.
Datalog changes this relationship fundamentally. Because Datalog scoring is purely declarative — a fixed set of rules applied to a fixed set of facts — every score is the direct product of a derivation tree: a traceable chain from raw evidence facts through intermediate conclusions to the final verdict. That derivation tree is not a separate artifact you have to instrument. It is the computation. Your only job is to capture it and format it usefully.
This section shows you how to do exactly that: how to extract the derivation trace from a Datalog evaluation, format it as a structured audit log, and use that log to separate extraction failures from logic failures — the two most common and most different failure modes in a scoring pipeline.
How Derivation Trees Map to Score Explanations
Recall the mental model from earlier in this lesson: Datalog operates on a set of base facts (the extracted evidence from your LLM output) and a set of rules (your rubric, encoded declaratively). When you issue a query — say, score(Response, Level) — Datalog's backward-chaining or semi-naive forward-chaining evaluation constructs a proof that the query is true. That proof is the derivation tree.
A derivation tree looks like this conceptually:
score(response_42, "meets_expectations")
├── quality_tier(response_42, "meets_expectations")
│ ├── accuracy_ok(response_42)
│ │ ├── cited_sources(response_42, 3) ← base fact
│ │ └── threshold_citations(3) ← base fact (constant)
│ ├── completeness_ok(response_42)
│ │ ├── addressed_criteria(response_42, 4) ← base fact
│ │ └── required_criteria(4) ← base fact (constant)
│ └── NOT flagged_harmful(response_42) ← negation-as-failure
│ └── [no base fact harm_signal(response_42) exists]
Each leaf of this tree is either a base fact (something extracted from the LLM output) or a constant (something baked into your rubric). Each internal node is a derived fact — a conclusion your rules generated. The root is the final score.
This structure is a human-readable explanation for free. "The response scored meets_expectations because it had 3 cited sources (at or above the threshold of 3), addressed all 4 required criteria, and no harm signals were detected." That sentence is a mechanical translation of the derivation tree. No LLM needed to write it; no extra instrumentation needed to capture it.
🎯 Key Principle: In Datalog, the explanation is the computation. You are not adding explainability on top of a black-box scorer — you are reading the scorer's own work.
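That mechanical translation can be sketched directly. The function below assumes the derivation tree arrives as a nested dict mirroring the diagram above; this is a hypothetical shape, since real engines expose provenance in different forms:

```python
def explain(node: dict) -> str:
    """Flatten a derivation tree's leaves into one explanatory sentence."""
    leaves = []
    def walk(n):
        if not n.get("children"):
            leaves.append(n["label"])  # base fact or negation-as-failure note
        for child in n.get("children", []):
            walk(child)
    walk(node)
    return "Derived %s because: %s." % (node["label"], "; ".join(leaves))

tree = {"label": "score(response_42, meets_expectations)", "children": [
    {"label": "cited_sources(response_42, 3)"},
    {"label": "addressed_criteria(response_42, 4)"},
    {"label": "no harm_signal fact exists"},
]}
print(explain(tree))
```

Every word of the output traces back to a node of the tree, which is exactly why no generative step is needed to produce the explanation.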
Capturing Which Rules Fired
Python-accessible Datalog engines (pyDatalog, or Soufflé driven via subprocess) expose derivation information in different ways, but the pattern is consistent: you query not just for the result but for the supporting facts that made the result true.
Here is a practical pattern using pyDatalog that captures rule firings explicitly by structuring your rules to emit provenance tuples alongside scores:
from pyDatalog import pyDatalog
## Declare all terms we'll use
pyDatalog.create_terms(
'Response, Level, N, Threshold, Rule',
'score, quality_tier, accuracy_ok, completeness_ok',
'cited_sources, required_citations',
'addressed_criteria, required_criteria',
'flagged_harmful, rule_fired'
)
## --- Base facts for response_42 ---
+cited_sources('response_42', 3)
+addressed_criteria('response_42', 4)
+required_citations(3) # rubric constant
+required_criteria(4) # rubric constant
## NOTE: no flagged_harmful('response_42') fact → negation-as-failure succeeds
## --- Scoring rules ---
accuracy_ok(Response) <= (
cited_sources(Response, N) &
required_citations(Threshold) &
(N >= Threshold)
)
completeness_ok(Response) <= (
addressed_criteria(Response, N) &
required_criteria(Threshold) &
(N >= Threshold)
)
quality_tier(Response, 'meets_expectations') <= (
accuracy_ok(Response) &
completeness_ok(Response) &
~flagged_harmful(Response)
)
score(Response, Level) <= quality_tier(Response, Level)
## --- Provenance rules: emit a record for every rule that fires ---
rule_fired(Response, 'accuracy_ok', 'cited_sources_threshold') <= accuracy_ok(Response)
rule_fired(Response, 'completeness_ok', 'criteria_threshold') <= completeness_ok(Response)
rule_fired(Response, 'quality_tier_meets', 'combined_gate') <= (
quality_tier(Response, 'meets_expectations')
)
## --- Queries ---
print(score('response_42', Level)) # → meets_expectations
print(rule_fired('response_42', Rule, N)) # all fired rules with labels
The rule_fired relation is the key pattern here. By adding a provenance rule for every substantive scoring rule, you get a flat list of every rule that contributed to the final score — without any runtime instrumentation of the engine itself. This works because Datalog rules are additive: adding rule_fired clauses cannot change the values of score or quality_tier. The provenance is a read-only shadow of the computation.
💡 Pro Tip: Name your provenance labels with a two-part convention: (rule_name, rubric_criterion). This makes the audit log self-documenting — a reader can look up the rubric criterion by name without having to trace back through your rule file.
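The two-part labels then render into self-documenting audit lines with no lookup table. A minimal sketch in plain Python, reusing the label values from the example above:

```python
def format_provenance(fired: list[tuple[str, str]]) -> list[str]:
    """Render (rule_name, rubric_criterion) pairs as readable audit lines."""
    return ["rule %-20s satisfied criterion %s" % (rule, criterion)
            for rule, criterion in fired]

for line in format_provenance([("accuracy_ok", "cited_sources_threshold"),
                               ("completeness_ok", "criteria_threshold")]):
    print(line)
```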
Formatting the Derivation Trace as a Structured Artifact
A list of fired rules is useful for debugging but not for stakeholders. What stakeholders need is a structured artifact that pairs the numeric score with a narrative explanation, the raw facts that drove it, and enough metadata to reconstruct the evaluation later. Here is a practical Python function that assembles all three layers into a single JSON artifact:
import json
from datetime import datetime, timezone
from pyDatalog import pyDatalog

## Ensure the label variable used below exists (create_terms is idempotent;
## Level, N, and Rule were declared in the previous snippet)
pyDatalog.create_terms('Criterion')

def build_audit_record(response_id: str) -> dict:
    """
    Assemble a structured audit record from the pyDatalog session that
    already holds the facts, rules, and provenance relations above.

    Parameters
    ----------
    response_id : str
        The identifier of the LLM response being scored.

    Returns
    -------
    dict
        A fully structured audit record ready for JSON serialization.
    """
    # 1. Collect the final score
    score_result = score(response_id, Level)
    final_score = score_result[0][0] if score_result else "no_score"
    # 2. Collect the base facts relevant to this response.
    #    After an in-line query succeeds, Variable.v() returns its first binding.
    base_facts = {
        "cited_sources": (N.v() if cited_sources(response_id, N) else None),
        "addressed_criteria": (N.v() if addressed_criteria(response_id, N) else None),
    }
    # 3. Collect all rules that fired. After a query, Variable.data holds
    #    the list of bindings for that column.
    fired = []
    if rule_fired(response_id, Rule, Criterion):
        fired = [
            {"rule": r, "criterion": c}
            for r, c in zip(Rule.data, Criterion.data)
        ]
    # 4. Build a human-readable explanation from the fired rules
    explanations = {
        "accuracy_ok": "Response met the citation threshold.",
        "completeness_ok": "Response addressed all required criteria.",
        "quality_tier_meets": "All gate conditions satisfied; no harm signals detected.",
    }
    narrative_steps = [
        explanations.get(entry["rule"], f"Rule '{entry['rule']}' fired.")
        for entry in fired
    ]
    return {
        "response_id": response_id,
        "score": final_score,
        "evaluated_at": datetime.now(timezone.utc).isoformat(),
        "base_facts": base_facts,
        "rules_fired": fired,
        "explanation": " ".join(narrative_steps),
        "schema_version": "1.0"
    }

## Example output
record = build_audit_record('response_42')
print(json.dumps(record, indent=2))
This produces output like:
{
"response_id": "response_42",
"score": "meets_expectations",
"evaluated_at": "2024-11-15T14:32:07+00:00",
"base_facts": {
"cited_sources": 3,
"addressed_criteria": 4
},
"rules_fired": [
{"rule": "accuracy_ok", "criterion": "cited_sources_threshold"},
{"rule": "completeness_ok", "criterion": "criteria_threshold"},
{"rule": "quality_tier_meets", "criterion": "combined_gate"}
],
"explanation": "Response met the citation threshold. Response addressed all required criteria. All gate conditions satisfied; no harm signals detected.",
"schema_version": "1.0"
}
Notice the schema_version field. As your rubric evolves, the rules that can fire will change. Versioning your audit schema alongside your rule set means historical records remain interpretable even after you ship rubric updates.
Separating Extraction Errors from Logic Errors
This is where the audit log earns its place in production. When a score looks wrong, there are exactly two places the problem can live:
LLM Output
│
▼
┌─────────────┐ extraction error lives here
│ Extractor │ ← (wrong facts, missing facts, schema mismatch)
└─────────────┘
│
▼ base facts
┌─────────────┐ logic error lives here
│ Datalog │ ← (wrong rules, wrong thresholds, wrong negation)
│ Scorer │
└─────────────┘
│
▼
Score
Extraction errors mean the base facts in your audit record don't accurately reflect the LLM output. Logic errors mean the base facts are correct but the rules turn them into the wrong score. These require completely different fixes — extraction errors need changes to your parsing or extraction prompt; logic errors need changes to your Datalog rules.
Without an audit log, diagnosing which type of failure you have requires re-running the whole pipeline manually and inspecting intermediate state. With the audit log, you can answer the question immediately:
1. Look at base_facts in the audit record. Do they accurately represent what the LLM output actually said? If cited_sources: 2 but the response clearly contained 4 citations, you have an extraction error. Fix your extractor.
2. Look at rules_fired. Given that the base facts are correct, did the right rules fire? If cited_sources: 4 is accurate but accuracy_ok didn't fire when your threshold is 3, you have a logic error: perhaps a threshold constant was set incorrectly or a rule condition has a typo.
🎯 Key Principle: The audit log is a seam between your extraction layer and your scoring layer. It lets you test each layer independently by treating the base_facts snapshot as a test fixture.
💡 Real-World Example: Imagine your eval pipeline starts showing a 15% drop in meets_expectations scores after a new LLM model is deployed. Without an audit log, you might assume the new model is worse. With the audit log, you sample 20 failing records and check base_facts: you find that cited_sources is consistently null — the new model changed its citation formatting, breaking your extractor's regex. The Datalog rules are fine. The fix is one line in your parser, not a rubric overhaul.
⚠️ Common Mistake — Mistake 1: Treating an unexpected score as a logic error without checking the base facts first. Always audit the facts before auditing the rules. The extraction layer is more brittle than the scoring layer and breaks more often.
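This seam can be exercised directly in a unit test. The sketch below uses a pure-Python stand-in for the Datalog rules (score_from_facts is a hypothetical helper with the same thresholds as the example record; in a real suite you would load the stored base_facts into the Datalog engine instead):

```python
# Hypothetical stand-in for the Datalog scorer, using the example
# thresholds: 3 citations, 4 addressed criteria.
def score_from_facts(base_facts: dict) -> str:
    accuracy_ok = (base_facts.get("cited_sources") or 0) >= 3
    completeness_ok = (base_facts.get("addressed_criteria") or 0) >= 4
    if accuracy_ok and completeness_ok:
        return "meets_expectations"
    return "below_expectations"

# The base_facts snapshot from a stored audit record acts as the fixture
fixture = {"cited_sources": 3, "addressed_criteria": 4}
assert score_from_facts(fixture) == "meets_expectations"

# Null facts from a broken extraction must not be scored as a pass
assert score_from_facts({"cited_sources": None}) == "below_expectations"
```

The fixture never touches the extractor, so a failing assertion here points unambiguously at the scoring layer.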
Storing Facts and Traces for Historical Reconstruction
Audit logs are only useful if they're complete enough to reconstruct any historical score from scratch. This means you need to store not just the output of the scoring run but the full inputs to the Datalog engine. The practical storage pattern has three components:
┌────────────────────────────────────────────────────────┐
│ Score Store │
│ │
│ record_id → { │
│ response_id, ← what was scored │
│ score, ← the result │
│ base_facts, ← full snapshot of input facts │
│ rules_fired, ← provenance trace │
│ rule_set_hash, ← hash of the .dl rule file │
│ evaluated_at, ← UTC timestamp │
│ schema_version ← audit schema version │
│ } │
└────────────────────────────────────────────────────────┘
The rule set hash is the element most teams forget. Your Datalog rules are code — they change. If you store only the rules_fired list (rule names) without a hash of the actual rule file, you cannot verify that the rules named in an old audit record are the same rules you have today. Hashing the rule file at evaluation time and storing that hash alongside each record closes this gap. To reconstruct a historical score, you check out the rule file at the commit whose hash matches, load the stored base_facts, and re-run the engine. The score must be identical.
import hashlib
import pathlib
from datetime import datetime, timezone
def hash_rule_file(path: str) -> str:
"""Compute a stable SHA-256 hash of a Datalog rule file."""
content = pathlib.Path(path).read_bytes()
return hashlib.sha256(content).hexdigest()[:16] # short hash for readability
def build_full_audit_record(
response_id: str,
base_facts: dict,
rules_fired: list,
final_score: str,
rule_file_path: str = "rubric.dl"
) -> dict:
"""
Build a complete, self-contained audit record that supports
full reconstruction of any historical score.
"""
return {
"response_id": response_id,
"score": final_score,
"evaluated_at": datetime.now(timezone.utc).isoformat(),
"base_facts": base_facts, # full input snapshot
"rules_fired": rules_fired, # provenance trace
"rule_set_hash": hash_rule_file(rule_file_path), # rule version pin
"schema_version": "1.0"
}
## Reconstruction check: given a historical record, re-score and compare
def verify_historical_score(record: dict, rule_file_path: str) -> bool:
"""
Verify that re-running the Datalog scorer on the stored base_facts
produces the same score as the stored result.
Returns True if score matches, False if a discrepancy is detected.
"""
current_hash = hash_rule_file(rule_file_path)
if current_hash != record["rule_set_hash"]:
print(f"⚠️ Rule set has changed since this record was created.")
print(f" Stored hash: {record['rule_set_hash']}")
print(f" Current hash: {current_hash}")
# In a real system: check out the correct rule file version before re-running
return False
# Re-run scoring from base_facts (simplified: assume loader function exists)
# replayed_score = run_datalog_from_facts(record["base_facts"], rule_file_path)
# return replayed_score == record["score"]
print("✅ Rule set matches. Ready for reconstruction.")
return True
💡 Pro Tip: Store audit records in an append-only log (a database table with no UPDATE or DELETE permissions for the scoring process, or an immutable object store like S3 with versioning enabled). This makes the audit log legally and operationally defensible — no score can be quietly changed after the fact.
Making the Audit Log Consumable for Stakeholders
A JSON blob of rule names and fact values is sufficient for a developer but opaque to a product manager or compliance reviewer. The final step is building a lightweight rendering layer that translates the structured audit record into something a non-technical reader can understand and sign off on.
The key insight is that this rendering is entirely deterministic — you're not asking an LLM to explain the score, you're mapping rule names to pre-written sentences. This keeps the explanation layer as auditable as the scoring layer itself:
## Rubric-level explanation templates — maintained alongside the rule file
RULE_EXPLANATIONS = {
"accuracy_ok": (
"Citation requirement satisfied: response provided {cited_sources} sources "
"(minimum required: {required_citations})."
),
"completeness_ok": (
"Completeness requirement satisfied: response addressed {addressed_criteria} "
"of {required_criteria} required criteria."
),
"quality_tier_meets": (
"All quality gate conditions were satisfied and no harmful content was detected. "
"Score: meets_expectations."
),
"quality_tier_below": (
"One or more quality gate conditions were not satisfied. "
"Score: below_expectations."
),
}
def render_audit_narrative(record: dict) -> str:
"""Render a human-readable narrative from a structured audit record."""
facts = record["base_facts"]
lines = [
f"Evaluation Report — Response ID: {record['response_id']}",
f"Score: {record['score'].upper()}",
f"Evaluated at: {record['evaluated_at']}",
"",
"Scoring Steps:",
]
for i, entry in enumerate(record["rules_fired"], start=1):
template = RULE_EXPLANATIONS.get(entry["rule"], f"Rule '{entry['rule']}' applied.")
# Fill template with fact values; threshold placeholders (e.g. required_citations) must also be stored in base_facts, or the raw template is used as a fallback
try:
explanation = template.format(**facts)
except KeyError:
explanation = template
lines.append(f" {i}. {explanation}")
lines += [
"",
f"Rule set version: {record['rule_set_hash']}",
f"Audit schema: {record['schema_version']}",
]
return "\n".join(lines)
This produces output a stakeholder can read in thirty seconds:
Evaluation Report — Response ID: response_42
Score: MEETS_EXPECTATIONS
Evaluated at: 2024-11-15T14:32:07+00:00
Scoring Steps:
1. Citation requirement satisfied: response provided 3 sources (minimum required: 3).
2. Completeness requirement satisfied: response addressed 4 of 4 required criteria.
3. All quality gate conditions were satisfied and no harmful content was detected.
Score: meets_expectations.
Rule set version: 3a9f12bc
Audit schema: 1.0
⚠️ Common Mistake — Mistake 2: Writing explanation templates that are too generic to be useful. "Rule accuracy_ok fired" tells a developer nothing a stakeholder can act on. Templates should read like a rubric evaluator wrote them, not like a log message. Invest time in these sentences — they are the face of your evaluation system to everyone who isn't reading the Datalog file.
The Complete Picture
Putting the full pipeline together, the flow from LLM output to defensible score looks like this:
LLM Output
│
▼
┌──────────────────────────────────────┐
│ Extractor │
│ (structured fact extraction) │
└──────────────┬───────────────────────┘
│ base_facts dict
▼
┌──────────────────────────────────────┐
│ Datalog Scorer │
│ (rules + queries + provenance rules) │
└──────────────┬───────────────────────┘
│ score + rules_fired
▼
┌──────────────────────────────────────┐
│ Audit Record Builder │
│ (base_facts + trace + hash + time) │
└──────────────┬───────────────────────┘
│ structured JSON record
┌──────────┴──────────┐
▼ ▼
Append-only Narrative
Score Store Renderer
(reconstruction) (stakeholders)
Every component in this pipeline is deterministic and independently testable. The extractor can be unit-tested against fixture LLM outputs. The Datalog rules can be tested against fixture fact sets. The audit record builder can be tested against fixture rule traces. The narrative renderer can be tested against fixture audit records. Nothing in this pipeline requires an LLM to explain itself.
🧠 Mnemonic: F-R-H-T — Facts, Rules, Hash, Time. These four things in every audit record are what make a score reconstructible, defensible, and debuggable.
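The mnemonic can be enforced at write time with a small guard (field names follow the audit record built earlier; the function name is ours):

```python
# F-R-H-T: the four fields every audit record needs to be reconstructible.
REQUIRED_FIELDS = ("base_facts", "rules_fired", "rule_set_hash", "evaluated_at")

def is_reconstructible(record: dict) -> bool:
    """Reject records missing any of Facts, Rules, Hash, Time."""
    return all(record.get(field) is not None for field in REQUIRED_FIELDS)

complete = {"base_facts": {}, "rules_fired": [], "rule_set_hash": "3a9f12bc",
            "evaluated_at": "2024-11-15T14:32:07+00:00"}
assert is_reconstructible(complete)
assert not is_reconstructible({"base_facts": {}, "rules_fired": []})
```

Running this check before a record enters the append-only store turns a silent gap into an immediate failure.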
📋 Quick Reference Card:
| 🔧 Component | 📚 What It Stores | 🎯 What It Enables |
|---|---|---|
| 🔒 base_facts | Extracted evidence snapshot | Detect extraction errors |
| 🔒 rules_fired | Provenance trace | Detect logic errors |
| 🔒 rule_set_hash | Hash of .dl rule file | Historical reconstruction |
| 🔒 evaluated_at | UTC timestamp | Timeline auditing |
| 🔒 schema_version | Audit format version | Long-term interpretability |
| 🔒 explanation | Rendered narrative | Stakeholder communication |
The audit log is not a feature you add to your scoring pipeline after it works. It is the mechanism by which your pipeline demonstrates that it works — to the developer fixing a regression at midnight, to the product manager questioning a benchmark, and to the compliance reviewer who needs to sign off on a consequential automated decision. Datalog gives it to you for free. All you have to do is capture it.
Common Pitfalls When Using Datalog for Scoring
By the time you've encoded your rubric as Datalog rules and wired them into an LLM evaluation pipeline, it's tempting to believe the hard work is done. The logic is declarative. The scoring is deterministic. What could go wrong? Quite a lot, as it turns out — and the failures are particularly insidious because Datalog doesn't crash when things go wrong. It just quietly returns an empty result set, which your pipeline happily interprets as a zero score.
This section is a field guide to the mistakes that practitioners make repeatedly when adopting Datalog for LLM scoring. Most of them don't show up in toy examples. They emerge when real LLM outputs meet real rubric logic, and they can corrupt your evaluation results for days before anyone notices.
Pitfall 1: Schema Mismatch and the Silent Zero
The most common failure mode in a Datalog scoring pipeline is the schema mismatch: the facts your extraction step produces don't match the shape that your rules expect. Datalog doesn't throw a type error. It simply finds no facts that satisfy the rule's body, derives nothing, and the score defaults to zero.
Consider a rule that expects:
% Rule expects: citation(ResponseId, CitationId, Source, Page)
citation_present(R) :- citation(R, _, _, _).
But your LLM extraction step returns facts shaped like:
## Extracted facts from LLM output
facts = [
("has_citation", "resp_001", "Smith 2021", "12"), # 4-tuple
]
When you load these into your Datalog engine, the predicate name is has_citation, not citation. Your citation_present rule never fires. The response gets a zero on citation coverage. No error. No warning. Just silence.
⚠️ Common Mistake — Mistake 1: Assuming that because extraction and scoring are separate steps, their interfaces are automatically compatible. They are not. The predicate names, arities (number of arguments), and argument positions between extracted facts and rule heads must be explicitly agreed upon and validated.
The fix requires treating the boundary between extraction and scoring as a contract. Define your fact schema before writing either the extraction prompt or the scoring rules, and validate facts against that schema at load time:
from dataclasses import dataclass
from typing import Tuple, List
## Define the schema as a dataclass — the contract
@dataclass
class CitationFact:
response_id: str
citation_id: str
source: str
page: int # Note: must be int, not string
def validate_and_load_citations(
raw_facts: List[Tuple],
engine
) -> Tuple[int, List[str]]:
"""Validate facts against schema before loading into Datalog engine.
Returns (loaded_count, list_of_errors)."""
loaded = 0
errors = []
for i, fact in enumerate(raw_facts):
if len(fact) != 4:
errors.append(f"Fact {i}: expected 4 args, got {len(fact)}")
continue
try:
validated = CitationFact(
response_id=str(fact[0]),
citation_id=str(fact[1]),
source=str(fact[2]),
page=int(fact[3]) # Will raise if not numeric
)
# Load into engine only after validation
engine.assert_fact(
'citation',
validated.response_id,
validated.citation_id,
validated.source,
validated.page
)
loaded += 1
except (ValueError, TypeError) as e:
errors.append(f"Fact {i}: {e}")
return loaded, errors
This approach means a schema mismatch produces an explicit error log rather than a silent zero. When your pipeline reports a suspiciously low score on a criterion, the first place to look is always the validation log.
💡 Pro Tip: Build a schema registry — a single Python module that defines all fact shapes as dataclasses or Pydantic models. Both your extraction prompt templates and your Datalog rule files import from this registry. If the contract changes, it changes in one place and validation failures surface immediately during CI.
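A registry can be as small as one module. A sketch (the module layout and names are illustrative):

```python
# schema_registry.py: hypothetical single source of truth for fact shapes,
# imported by both the extraction templates and the rule-loading code.
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class CitationFact:
    response_id: str
    citation_id: str
    source: str
    page: int

@dataclass(frozen=True)
class ExtractionStatusFact:
    response_id: str
    domain: str
    status: str  # one of: ok, parse_error, timeout, skipped

# Predicate name -> schema: the contract both sides validate against
FACT_SCHEMAS = {
    "citation": CitationFact,
    "extraction_status": ExtractionStatusFact,
}

def arity(predicate: str) -> int:
    """Expected argument count for a predicate; CI can compare this
    against the arities actually used in the .dl rule file."""
    return len(fields(FACT_SCHEMAS[predicate]))

assert arity("citation") == 4
assert arity("extraction_status") == 3
```

A CI step that parses predicate arities out of the rule file and compares them to arity() catches schema drift before it becomes a silent zero.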
Pitfall 2: Negation-as-Failure and the Missing Evidence Problem
Datalog uses negation-as-failure (NAF): if a fact cannot be derived, it is treated as false. This is a closed-world assumption — the database is assumed to be complete. In a controlled database, this is fine. In an LLM evaluation pipeline, it creates a dangerous ambiguity.
Consider two very different situations:
Situation A: The LLM response does NOT contain any citations.
Situation B: The extraction step failed to parse the citations section.
In both cases, your Datalog database contains zero citation/4 facts for that response. Under negation-as-failure, your rule citation_absent(R) :- not citation(R, _, _, _) fires for both situations. You cannot tell the difference between a response that genuinely has no citations and one where the extraction was broken.
Extraction outcome Datalog sees
───────────────────────────────── ─────────────────────────
LLM response: no citations ───► citation/4: empty ───► citation_absent fires ✓
Extraction parse error ───► citation/4: empty ───► citation_absent fires ✗
Extraction skipped (timeout) ───► citation/4: empty ───► citation_absent fires ✗
⚠️ Common Mistake — Mistake 2: Conflating "criterion not met" with "criterion not extractable." If your extraction step had any uncertainty, timeout, or parse failure, the absence of a fact is epistemically meaningless — but Datalog will treat it identically to a confirmed absence.
The solution is to make extraction status a first-class fact. Every extraction attempt should assert an outcome fact, not just the extracted content:
% Extraction status facts — always asserted, regardless of content found
% extraction_status(ResponseId, Domain, Status)
% Status ∈ {ok, parse_error, timeout, skipped}
% Only penalize for absent citations if extraction succeeded
citation_criterion_failed(R) :-
extraction_status(R, citations, ok),
not citation(R, _, _, _).
% Flag as unevaluable if extraction failed
citation_criterion_unevaluable(R) :-
extraction_status(R, citations, parse_error).
citation_criterion_unevaluable(R) :-
extraction_status(R, citations, timeout).
% Score only when we have confident signal
citation_score(R, 0) :- citation_criterion_failed(R).
citation_score(R, 1) :- citation(R, _, _, _). % at least one citation found
% No citation_score fact for unevaluable responses — don't silently score them
This pattern means your scoring layer can distinguish three outcomes — pass, fail, and unevaluable — and your reporting layer can flag unevaluable responses for human review rather than silently assigning them a zero.
🎯 Key Principle: In an LLM evaluation pipeline, the closed-world assumption of Datalog's negation-as-failure is always a modeling choice, never a ground truth. Make that choice explicit by encoding extraction confidence as facts.
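On the extraction side, the discipline is one try/except that always asserts a status fact. A sketch, assuming a minimal fact store with an assert_fact method and a toy citation parser (both hypothetical):

```python
class FactStore:
    """Minimal stand-in for a Datalog engine's fact-loading API."""
    def __init__(self):
        self.facts = []
    def assert_fact(self, *fact):
        self.facts.append(fact)

def parse_citations(text: str):
    """Toy parser: lines like 'Smith 2021, p.12'; raises ValueError otherwise."""
    out = []
    for line in text.splitlines():
        if not line.strip():
            continue
        source, _, page = line.rpartition(", p.")
        if not source:
            raise ValueError(f"unparseable citation line: {line!r}")
        out.append((source, int(page)))
    return out

def extract_citations(response_id: str, text: str, engine: FactStore) -> None:
    # Always assert a status fact, whatever happens, so the rules can
    # distinguish "no citations" from "extraction failed"
    try:
        citations = parse_citations(text)
    except ValueError:
        engine.assert_fact("extraction_status", response_id, "citations", "parse_error")
        return
    engine.assert_fact("extraction_status", response_id, "citations", "ok")
    for i, (source, page) in enumerate(citations):
        engine.assert_fact("citation", response_id, f"cit_{i}", source, page)

store = FactStore()
extract_citations("resp_1", "Smith 2021, p.12", store)
assert ("extraction_status", "resp_1", "citations", "ok") in store.facts

broken = FactStore()
extract_citations("resp_2", "garbled ###", broken)
assert ("extraction_status", "resp_2", "citations", "parse_error") in broken.facts
```

The key property is that the status fact is asserted on every code path; the content facts are conditional, the status fact never is.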
Pitfall 3: Rule Interaction Surprises
Datalog rules compose cleanly in theory. In practice, when multiple rules can derive the same predicate with different conditions, the interactions can produce scores you didn't intend.
The classic case is overlapping rules that each contribute to an aggregate. Suppose you have a quality level predicate and you add rules incrementally as the rubric evolves:
% Original rule: high quality if 3+ citations
quality_level(R, high) :-
citation_count(R, N), N >= 3.
% Added later: high quality if expert sources cited
quality_level(R, high) :-
has_expert_source(R).
% Added by different team member: high quality if length threshold met
quality_level(R, high) :-
response_length(R, L), L >= 500.
Now consider an aggregation rule downstream:
% Count how many "high" derivations exist — WRONG
high_quality_count(R, Count) :-
Count = #count { R2 : quality_level(R2, high), R2 = R }.
Because Datalog derives quality_level(R, high) three times for a response that satisfies all three conditions, and because the semantics are set-based (each unique fact is stored once), the count is still one. But if you later refactor rules to parametrize which criterion triggered the high rating:
quality_level(R, high, citations) :- citation_count(R, N), N >= 3.
quality_level(R, high, sources) :- has_expert_source(R).
quality_level(R, high, length) :- response_length(R, L), L >= 500.
Now a response that satisfies all three gets three quality_level facts, and any aggregation downstream sees all three. If your score weights high-quality derivations, you've inadvertently triple-counted.
💡 Mental Model: Think of each Datalog rule as a generator. Multiple rules generating the same head predicate produce a union of results. For categorical outcomes (like quality levels), union is usually what you want. For anything that feeds into counting or summation, unintended union is a multiplication bug.
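The union-versus-counting distinction is easy to see outside Datalog:

```python
# Three rules all derive the same two-place fact: set semantics dedupe it.
derivations = [("resp_1", "high")] * 3          # fired by citations, sources, length
assert len(set(derivations)) == 1               # Datalog stores one fact

# Parametrizing the head by criterion makes each derivation a distinct fact.
parametrized = {("resp_1", "high", "citations"),
                ("resp_1", "high", "sources"),
                ("resp_1", "high", "length")}
assert len(parametrized) == 3                   # downstream counts now see three
```

The refactor didn't change any rule body, only the head's arity, yet every aggregate over the predicate now sees three rows where it previously saw one.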
The discipline that prevents this is single-derivation predicates for scores: keep the final score predicate derived by exactly one rule, and ensure all multi-source logic is resolved into a single canonical form before that final rule fires:
% Resolve multiple quality signals into a single canonical judgment
any_high_quality_signal(R) :- citation_count(R, N), N >= 3.
any_high_quality_signal(R) :- has_expert_source(R).
any_high_quality_signal(R) :- response_length(R, L), L >= 500.
% Single rule derives the score predicate
quality_score(R, 2) :- any_high_quality_signal(R). % 2 = high tier
quality_score(R, 1) :- not any_high_quality_signal(R), has_basic_structure(R).
quality_score(R, 0) :- not any_high_quality_signal(R), not has_basic_structure(R).
Now quality_score is derived at most once per response, and the multi-source logic is cleanly encapsulated in any_high_quality_signal.
Pitfall 4: Over-Normalizing Facts into Too Many Relations
Practitioners with a relational database background often apply aggressive normalization to their fact schemas. What feels like good design discipline — breaking facts into minimal, non-redundant relations — creates Datalog rules that are nearly impossible to read or maintain.
Imagine a response evaluation where you want to score whether an answer addresses the user's question. A heavily normalized schema might look like:
% Over-normalized: response metadata split across 6 relations
response(R, text). % just the id and a text marker
response_metadata(R, field, value). % EAV-style: (id, 'length', '450')
response_segment(R, seg_id). % each sentence is its own entity
segment_text(seg_id, text).
segment_label(seg_id, label_type, label). % ('sent_1', 'relevance', 'on-topic')
topic_alignment(R, topic_id).
A rule that checks whether the response is on-topic and long enough now requires joining five relations:
adequate_response(R) :-
response(R, _),
response_metadata(R, length, L_str), atom_number(L_str, L), L >= 300,
response_segment(R, Seg),
segment_label(Seg, relevance, 'on-topic'),
topic_alignment(R, _).
This is not readable as a rubric criterion. A non-engineer stakeholder reviewing the scoring logic cannot map this back to the original intent. Every modification requires tracing through five relation definitions.
❌ Wrong thinking: "More normalized = more correct and maintainable."
✅ Correct thinking: "Normalized enough to avoid redundancy, but not so normalized that the rules lose their semantic clarity."
A pragmatic approach is to use domain-meaningful composite facts — facts whose predicate names and argument positions directly reflect the concepts in your rubric:
% Domain-meaningful: one fact captures the evaluative unit
response_evaluated(
R, % response id
Length, % character count
OnTopic, % boolean atom: on_topic or off_topic
TopicId % which topic
).
% Now the rule reads like the rubric
adequate_response(R) :-
response_evaluated(R, Length, on_topic, _),
Length >= 300.
The rule now reads like the criterion it encodes. A rubric change (say, raising the length threshold to 400) requires changing exactly one number in exactly one place.
🧠 Mnemonic: The Rubric Readability Test — if a domain expert who knows the rubric but not Datalog cannot read a rule and confirm it matches the criterion it's supposed to encode, the rule is over-normalized. Simplify the schema until the rule becomes self-evident.
Pitfall 5: Performance Traps with Large Fact Sets
Datalog engines are designed to handle recursive queries efficiently, but they're not designed to perform arbitrary joins over millions of facts without guidance. In LLM evaluation pipelines that run batch scoring over large corpora, naive rule design can produce quadratic or worse join behavior.
The most common performance trap is unconstrained cross-joins that appear in rules intended to compare responses:
% DANGEROUS: compares every response pair — O(n²) behavior
similar_response_pair(R1, R2) :-
response_topic(R1, T),
response_topic(R2, T),
R1 != R2.
For a batch of 10,000 responses sharing a common topic, this rule generates 100 million pairs. The Datalog engine will attempt to enumerate them all.
The fix is to bound joins early using more selective predicates, and to avoid pairwise comparison patterns entirely where possible:
% SAFE: aggregate at topic level instead of generating pairs
topic_response_count(T, Count) :-
Count = #count { R : response_topic(R, T) }.
% Check if a response is in a high-competition topic
high_competition_topic_response(R) :-
response_topic(R, T),
topic_response_count(T, Count),
Count >= 10.
This achieves the same downstream goal (identifying responses in crowded topics) with linear rather than quadratic work.
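Rough arithmetic on the work each shape implies, for the 10,000-response batch above:

```python
from collections import Counter

n = 10_000
responses = {f"resp_{i}": "shared_topic" for i in range(n)}

# Pairwise rule: every ordered pair with R1 != R2 on the shared topic
ordered_pairs = n * (n - 1)
assert ordered_pairs == 99_990_000  # ~10^8 derivations to enumerate

# Aggregate rule: one linear pass to count, one linear pass to flag
topic_counts = Counter(responses.values())
crowded = [r for r, t in responses.items() if topic_counts[t] >= 10]
assert len(crowded) == n  # same signal, ~2n work instead of n^2
```

The two passes produce the same downstream signal as the pairwise rule while touching each response a constant number of times.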
⚠️ Common Mistake — Mistake 3: Writing rules that feel natural in SQL (where query planners handle join ordering) but create catastrophic plans in Datalog (where join ordering is often determined by rule body order). In most Datalog engines, predicates listed earlier in a rule body are evaluated first. Put the most selective (smallest result set) predicates first.
Testing Strategy: Making Your Scoring Rules Defensible
None of the above pitfalls are theoretical. They emerge in production, often after a rubric change introduces a new rule that interacts unexpectedly with existing ones. The only sustainable defense is a testing discipline built specifically for Datalog scoring rules.
Property-Based Tests for Rules
Property-based testing generates many random fact combinations and checks that rules satisfy invariants. For scoring rules, the key invariants are typically: scores are within expected ranges, every response gets exactly one score value per criterion, and negation rules don't fire when extraction status is non-ok.
from hypothesis import given, strategies as st
from your_datalog_engine import DatalogEngine, load_rules
engine = DatalogEngine()
load_rules(engine, "scoring_rules.dl")
@given(
response_id=st.text(min_size=1, max_size=20),
citation_count=st.integers(min_value=0, max_value=20),
extraction_ok=st.booleans()
)
def test_citation_score_invariants(response_id, citation_count, extraction_ok):
"""Citation scores must be 0 or 1, never absent for ok extractions."""
engine.reset()
status = "ok" if extraction_ok else "parse_error"
engine.assert_fact("extraction_status", response_id, "citations", status)
if extraction_ok:
for i in range(citation_count):
engine.assert_fact("citation", response_id, f"cit_{i}", "source", i)
scores = engine.query("citation_score", response_id, "?")
if extraction_ok:
# Must have exactly one score
assert len(scores) == 1, f"Expected 1 score, got {len(scores)}"
score_value = scores[0][1]
assert score_value in (0, 1), f"Score {score_value} out of range"
if citation_count > 0:
assert score_value == 1
else:
assert score_value == 0
else:
# Must have no score — unevaluable responses are not scored
assert len(scores) == 0, "Unevaluable response should not receive a score"
Fact Fixture Libraries
A fact fixture library is a collection of named, curated fact sets that represent canonical scenarios — the minimum facts needed to trigger each scoring outcome. These serve as the ground truth for unit tests and as documentation for what the rules mean:
## fixtures/citation_fixtures.py
CITATION_FIXTURES = {
"high_citation_quality": [
("extraction_status", "resp_1", "citations", "ok"),
("citation", "resp_1", "c1", "Smith 2021", 12),
("citation", "resp_1", "c2", "Jones 2019", 45),
("citation", "resp_1", "c3", "Park 2023", 8),
],
"zero_citations_ok_extraction": [
("extraction_status", "resp_2", "citations", "ok"),
# No citation facts — confirmed empty
],
"extraction_failed": [
("extraction_status", "resp_3", "citations", "parse_error"),
# No citation facts — but extraction failed
],
}
## Tests that use fixtures are readable and maintainable
def test_high_citation_quality_scores_one():
engine = load_fixture("high_citation_quality")
scores = engine.query("citation_score", "resp_1", "?")
assert scores == [("resp_1", 1)]
def test_extraction_failure_produces_no_score():
engine = load_fixture("extraction_failed")
scores = engine.query("citation_score", "resp_3", "?")
assert scores == [] # No score, not a zero
Regression Suites for Rubric Changes
Every time a rubric changes — a threshold shifts, a criterion is added, a weight is adjusted — the change should be validated against a regression suite of historical scored responses. The key question is not just "does the new rule produce valid output" but "which previously-scored responses change their score, and are those changes intentional?"
Rubric Change Workflow
1. Propose rubric change
│
▼
2. Run regression suite against historical scored facts
│
▼
3. Generate diff report:
┌────────────────────────────────────┐
│ Responses whose score changed: 47 │
│ Score increased (0→1): 31 │
│ Score decreased (1→0): 16 │
│ Responses unchanged: 9,953 │
└────────────────────────────────────┘
│
▼
4. Human review of changed scores:
Are the 16 decreases intentional?
Are the 31 increases intentional?
│
▼
5. Approve → merge → update baseline
This workflow transforms rubric changes from ad-hoc edits into auditable, reviewable diffs. The scoring rules — and their tests — become the specification of what your rubric means.
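Step 3's diff report is a small function over two score maps (the function name is ours; scores here are the 0/1 criterion scores used earlier):

```python
def score_diff(baseline: dict, candidate: dict) -> dict:
    """Compare response_id -> score maps from the old and new rule sets."""
    changed = {r: (baseline[r], candidate[r])
               for r in baseline
               if r in candidate and baseline[r] != candidate[r]}
    return {
        "changed": len(changed),
        "increased": sum(1 for old, new in changed.values() if new > old),
        "decreased": sum(1 for old, new in changed.values() if new < old),
        "unchanged": sum(1 for r in baseline if candidate.get(r) == baseline[r]),
    }

report = score_diff({"a": 0, "b": 1, "c": 1}, {"a": 1, "b": 0, "c": 1})
assert report == {"changed": 2, "increased": 1, "decreased": 1, "unchanged": 1}
```

The human-review step then samples from the changed set rather than from the whole corpus, which is what keeps the review tractable at ten thousand responses.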
💡 Real-World Example: An evaluation team at a content moderation company discovered that a rubric threshold change intended to catch more policy violations was also flagging a category of technical documentation that had always been compliant. The regression suite caught 340 spurious flag changes in the test corpus before the change was deployed. Without the suite, those regressions would have surfaced as unexplained spikes in human review queue volume.
Putting It Together: A Pitfall Checklist
📋 Quick Reference Card: Datalog Scoring Pitfall Prevention
| | Pitfall | Detection Signal | Prevention |
|---|---|---|---|
| 🔒 | Schema mismatch | Silent zero scores | Schema registry + load-time validation |
| ⚠️ | NAF misuse | Failed extractions score as zero | Extraction status as first-class fact |
| 🔧 | Rule interaction | Scores exceed expected range | Single-derivation score predicates |
| 📚 | Over-normalization | Rules unreadable to domain experts | Rubric readability test |
| 🎯 | Performance traps | Batch scoring timeouts | Bound joins early; avoid pairwise rules |
| 🧠 | No test suite | Rubric changes break old scores silently | Fixtures + property tests + regression suite |
The consistent thread across all these pitfalls is that Datalog's silence is deceptive. The language guarantees termination and determinism, but it does not guarantee that your rules mean what you think they mean. The discipline that makes Datalog scoring defensible in production is the same discipline that makes any formal system trustworthy: explicit contracts at every boundary, test coverage that exercises the semantics you care about, and a workflow that forces rubric changes to confront their consequences before deployment.
🤔 Did you know? The hardest bugs in Datalog scoring pipelines are almost never logic errors in individual rules — they're boundary condition failures: the fact that wasn't asserted when extraction returned an empty string, the rule that fires on the empty set in a way that was never tested, the aggregate that silently returns zero because no facts matched. Investing in boundary condition tests pays off disproportionately.
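To make that concrete, here is a minimal sketch of boundary-condition tests for a fact loader, in plain Python. The `to_facts` helper and its field names are hypothetical stand-ins for whatever loader your pipeline actually uses:

```python
REQUIRED_FIELDS = {"response_id", "citation_count", "toxicity_flag"}

def to_facts(doc: dict) -> list:
    """Convert an extraction dict into (predicate, id, value) tuples,
    failing loudly on missing evidence instead of asserting nothing."""
    missing = REQUIRED_FIELDS - doc.keys()
    if missing:
        raise ValueError(f"missing evidence: {sorted(missing)}")
    rid = doc["response_id"]
    return [(key, rid, value) for key, value in doc.items() if key != "response_id"]

# Boundary case 1: an empty extraction must fail loudly, not score as zero.
try:
    to_facts({"response_id": "r1"})
    raise AssertionError("loader should have raised on missing evidence")
except ValueError as err:
    assert "citation_count" in str(err)

# Boundary case 2: zero-valued evidence is still a fact. The rules, not the
# loader, decide what a zero means.
facts = to_facts({"response_id": "r1", "citation_count": 0, "toxicity_flag": False})
assert ("citation_count", "r1", 0) in facts
```

The first case is the architectural point: a loader that raises on missing evidence converts the most dangerous silent failure, scoring an empty extraction as zero, into a loud one.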
Key Takeaways and Extending the Pattern
You started this lesson with a scoring layer that was the weakest link in your evaluation pipeline — a second LLM call that reintroduced variance, or a hand-rolled boolean tangle that nobody wanted to touch after the third edge case. You end it with a pattern that is declarative enough to read like a rubric, restricted enough to guarantee termination, and structured enough to produce an audit log your stakeholders can actually inspect.
This final section consolidates what you've learned, gives you a decision framework for when to reach for Datalog versus lighter or heavier tools, and maps out the natural extensions that will carry this pattern from prototype into production.
The Three Non-Negotiable Properties, Revisited
Everything in this lesson flows from three properties that Datalog brings to your scoring layer. They are worth naming precisely one more time, because the temptation to compromise on any one of them is real — and the cost of doing so compounds.
🎯 Key Principle: Termination guarantees safety. Datalog's stratified evaluation model means the engine will always halt. There are no infinite loops hiding in a rubric rule, no recursive descent that blows a call stack when an LLM response is malformed. When your input is adversarial or simply unexpected, the evaluator doesn't hang — it produces a result. This is the property that makes Datalog defensible in production, especially when the scores drive consequential decisions.
🎯 Key Principle: Composability controls complexity. Rules in Datalog don't share mutable state. A rule that defines quality_tier cannot silently modify the fact that defines citation_count. This means you can add a new rubric criterion — a new rule — without auditing every existing rule for surprise interactions. The rule file scales like a rubric document, not like a codebase.
🎯 Key Principle: Traces enable accountability. Every derived fact has a proof tree. The score isn't a number that appeared; it's a conclusion that follows from specific base facts through named rules. That proof tree is your audit log. It answers "why did this response score a 2?" with a structured, reproducible answer — which is exactly what you need when a score is challenged.
🧠 Mnemonic: T-C-T — Termination, Composability, Traceability. If a scoring tool can't give you all three, you're trading one class of problem for another.
When Datalog Is the Right Fit
Datalog is not the answer to every scoring problem. Part of using it well is knowing when it earns its keep and when it doesn't.
📋 Quick Reference Card: Scoring Tool Decision Guide
| 🔍 Situation | ✅ Reach For | 💬 Reason |
|---|---|---|
| 📋 Rubric has 3–4 binary gates, no aggregation | Simple boolean logic or SQL WHERE | Overhead of Datalog isn't justified |
| 📊 Rubric has thresholds, tiers, and aggregated counts | Datalog | Composable rules, safe aggregation, readable logic |
| 🔁 Scoring requires recursive relationships (e.g., dependency graphs) | Datalog (with care) or a graph engine | Datalog handles stratified recursion; deeper needs may warrant a dedicated graph tool |
| 🎲 Score is inherently probabilistic (semantic similarity, fluency) | Embedding similarity + threshold gate in Datalog | Let a model produce the numeric score; let Datalog decide what that score means |
| 🏗️ Scoring logic changes every sprint and is owned by non-engineers | Datalog rule files versioned with model configs | Declarative syntax is readable; rule files are reviewable in a PR |
| 🌐 Need to query persistent historical facts across many evaluations | Xtdb or Datomic with Datalog query layer | Adds bitemporal history and persistence without leaving the Datalog paradigm |
💡 Mental Model: Think of Datalog as the rule engine layer in a three-layer stack. The bottom layer is your LLM extractor, producing structured facts. The middle layer is Datalog, turning facts into scores via rubric rules. The top layer is your reporting and decision system, consuming scored results. The moment you blur the line between any two layers — asking the extractor to score, or asking the decision system to reason — you lose the properties that made each layer trustworthy.
The Quick-Reference Pipeline Pattern
The end-to-end pattern you've built across this lesson has five steps. Every implementation you write should map cleanly onto these five steps, even if the tooling changes.
┌─────────────────────────────────────────────────────────────────┐
│ DATALOG SCORING PIPELINE │
│ │
│ 1. EXTRACTION SCHEMA │
│ LLM output → structured JSON via extraction prompt │
│ Defines which facts are available downstream │
│ │ │
│ ▼ │
│ 2. FACT LOADER │
│ JSON → Datalog base facts (EDB) │
│ Validates schema; surfaces missing evidence early │
│ │ │
│ ▼ │
│ 3. RULE FILE │
│ Rubric criteria encoded as Datalog rules (IDB) │
│ Versioned alongside model and prompt configs │
│ │ │
│ ▼ │
│ 4. QUERY │
│ Ask for derived facts: scores, tiers, violations │
│ Engine evaluates; guaranteed to terminate │
│ │ │
│ ▼ │
│ 5. SCORED RESULT + TRACE │
│ Score + the derivation path that produced it │
│ Structured audit log, ready for CI or human review │
└─────────────────────────────────────────────────────────────────┘
Here is a minimal but complete implementation of this pattern using pyDatalog, a Datalog engine embedded in Python and suited for prototyping and moderate-scale pipelines:
from pyDatalog import pyDatalog
import json
## ── Step 1 & 2: Load extraction output as Datalog base facts ──────────────────
## Simulated extraction schema output from an LLM judge call
extracted = {
    "response_id": "resp_001",
    "citation_count": 3,
    "has_disclaimer": True,
    "toxicity_flag": False,
    "word_count": 420,
    "source_diversity": 2  # number of distinct source domains cited
}
## Declare Datalog terms (variables and predicate names)
pyDatalog.create_terms(
    'X, N, '
    'citation_count, has_disclaimer, toxicity_flag, word_count, source_diversity, '
    'passes_citation_gate, passes_safety_gate, passes_length_gate, '
    'quality_tier'
)
def load_facts(doc: dict):
    """Convert extraction output dict into Datalog base facts (EDB)."""
    rid = doc["response_id"]
    + citation_count(rid, doc["citation_count"])
    + has_disclaimer(rid, doc["has_disclaimer"])
    + toxicity_flag(rid, doc["toxicity_flag"])
    + word_count(rid, doc["word_count"])
    + source_diversity(rid, doc["source_diversity"])
load_facts(extracted)
## ── Step 3: Rule file — rubric criteria as Datalog rules (IDB) ────────────────
## Gate 1: must have at least 2 citations
passes_citation_gate(X) <= (
    citation_count(X, N) &
    (N >= 2)
)
## Gate 2: must not be flagged for toxicity
passes_safety_gate(X) <= (
    toxicity_flag(X, False)
)
## Gate 3: response must be between 100 and 800 words
passes_length_gate(X) <= (
    word_count(X, N) &
    (N >= 100) &
    (N <= 800)
)
## Quality tier: GOOD requires all gates plus >= 2 distinct sources
quality_tier(X, 'GOOD') <= (
    passes_citation_gate(X) &
    passes_safety_gate(X) &
    passes_length_gate(X) &
    source_diversity(X, N) &
    (N >= 2)
)
## Quality tier: ACCEPTABLE requires core gates but fewer than 2 distinct sources.
## Note: negating quality_tier(X, 'GOOD') here would make quality_tier depend
## negatively on itself, which is unstratified; test the distinguishing fact instead.
quality_tier(X, 'ACCEPTABLE') <= (
    passes_citation_gate(X) &
    passes_safety_gate(X) &
    passes_length_gate(X) &
    source_diversity(X, N) &
    (N < 2)
)
## Quality tier: FAILING if any core gate fails. Datalog expresses disjunction
## as separate rules; each grounds X against a base fact before negating.
quality_tier(X, 'FAILING') <= (citation_count(X, N) & ~passes_citation_gate(X))
quality_tier(X, 'FAILING') <= (citation_count(X, N) & ~passes_safety_gate(X))
quality_tier(X, 'FAILING') <= (citation_count(X, N) & ~passes_length_gate(X))
## ── Step 4 & 5: Query and surface scored result ───────────────────────────────
rid = "resp_001"
result = quality_tier(rid, X)
print(f"Response: {rid}")
print(f"Quality tier: {result}")
## Surface which gates passed (the natural trace)
gates = {
    "citation": len(passes_citation_gate(rid)) > 0,
    "safety": len(passes_safety_gate(rid)) > 0,
    "length": len(passes_length_gate(rid)) > 0,
}
print(f"Gate trace: {json.dumps(gates, indent=2)}")
This code block does exactly what the pipeline diagram describes. The load_facts function is your fact loader — it validates the schema implicitly by asserting each field as a typed Datalog fact. The rule block is your rule file — in a production system, these rules would live in a .dl file loaded at startup, versioned in source control alongside your prompt config. The query at the bottom is your evaluation step, and the gate trace is the beginning of your audit log.
Extension Directions
Probabilistic Scoring Layers on Top of Deterministic Gates
Not every scoring criterion is binary. Semantic quality, fluency, and relevance are continuous — they're best measured by embedding similarity or a calibrated LLM probe that returns a float. The pattern that works in production is: let the probabilistic layer produce numbers, let Datalog decide what those numbers mean.
Concretely, your extraction schema includes a field like semantic_relevance_score: 0.73. Your fact loader asserts semantic_relevance(resp_001, 0.73). Your Datalog rule translates the float into a discrete tier:
## Probabilistic score from embedding similarity already extracted as a fact
## semantic_relevance(ResponseID, Score) — loaded by fact loader
pyDatalog.create_terms(
    'semantic_relevance, passes_relevance_gate, relevance_tier'
)
## Assert the probabilistic score as a base fact (done by the fact loader in practice)
+ semantic_relevance('resp_001', 0.73)
## Hard gate: any score below 0.5 is a hard failure regardless of other criteria
passes_relevance_gate(X) <= (
    semantic_relevance(X, N) &
    (N >= 0.5)
)
## Tier classification: the Datalog layer makes the policy decision
relevance_tier(X, 'HIGH') <= (
    semantic_relevance(X, N) & (N >= 0.85)
)
relevance_tier(X, 'MEDIUM') <= (
    semantic_relevance(X, N) & (N >= 0.5) & (N < 0.85)
)
relevance_tier(X, 'LOW') <= (
    semantic_relevance(X, N) & (N < 0.5)
)
## Compose with other gates: the probabilistic score is now just another fact,
## and the rubric logic remains fully deterministic. (In a real rule file this
## would replace the earlier 'GOOD' rule rather than add a second way to derive it.)
quality_tier(X, 'GOOD') <= (
    passes_citation_gate(X) &
    passes_safety_gate(X) &
    passes_length_gate(X) &
    relevance_tier(X, 'HIGH')
)
This pattern gives you the best of both worlds. The probabilistic model contributes numeric signal. The Datalog layer makes the policy decision — where the threshold sits, what tier a number maps to — in a way that is auditable, version-controlled, and adjustable without retraining anything.
💡 Pro Tip: Store the raw probabilistic score in the audit log alongside the tier. When you later recalibrate a threshold (e.g., moving the MEDIUM/HIGH boundary from 0.85 to 0.80), you can replay the Datalog evaluation over historical facts without re-running the embedding model.
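That replay can be a few lines of plain Python over the stored log. A sketch, assuming illustrative record fields and a hypothetical `retier` helper that mirrors the tier rules above:

```python
# Hypothetical audit-log records: the raw score is stored next to the tier.
audit_log = [
    {"response_id": "r1", "semantic_relevance": 0.82, "tier": "MEDIUM"},
    {"response_id": "r2", "semantic_relevance": 0.91, "tier": "HIGH"},
    {"response_id": "r3", "semantic_relevance": 0.41, "tier": "LOW"},
]

def retier(score: float, high_cut: float) -> str:
    """Re-derive the tier from the stored raw score under a new boundary."""
    if score >= high_cut:
        return "HIGH"
    return "MEDIUM" if score >= 0.5 else "LOW"

# Replay with the MEDIUM/HIGH boundary moved from 0.85 to 0.80.
# No embedding model in the loop, just the stored facts.
replayed = {r["response_id"]: retier(r["semantic_relevance"], high_cut=0.80)
            for r in audit_log}
changed = [r["response_id"] for r in audit_log
           if retier(r["semantic_relevance"], 0.80) != r["tier"]]
# r1 crosses into HIGH under the new boundary; r2 and r3 are unchanged
```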
Versioning Rubric Rule Files Alongside Model Versions
Your rubric is a policy document. When the model changes, the rubric may need to change with it. When the rubric changes, historical scores computed under the old rubric become incomparable with new scores. The fix is to treat your .dl rule file as a first-class artifact in your version control system — committed, tagged, and referenced by the evaluation run that used it.
❌ Wrong thinking: "The rule file is config; I'll just edit it in place."
✅ Correct thinking: "The rule file is policy; every version is immutable and tagged. A run record references a rule file version, not a mutable path."
A practical convention: name your rule files rubric_v{major}.{minor}.dl and store them under evals/rubrics/ in the same repository as your prompt configs. Your CI evaluation job pins both a prompt version and a rubric version. Score regressions are now debuggable: you can diff rubric_v1.2.dl against rubric_v1.3.dl and see exactly which criteria changed.
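One way to make a run record reference an immutable rubric version, sketched with Python's standard library (the paths and field names are illustrative):

```python
import hashlib
import json

def run_record(prompt_version: str, rubric_path: str, rubric_text: str) -> dict:
    """Reference the rubric by version tag AND content hash, so the record
    stays meaningful even if a tagged file is later edited in place."""
    return {
        "prompt_version": prompt_version,
        "rubric_file": rubric_path,
        "rubric_sha256": hashlib.sha256(rubric_text.encode("utf-8")).hexdigest(),
    }

rubric_text = "passes_citation_gate(X) <= citation_count(X, N) & (N >= 2)"
record = run_record("prompt_v3.1", "evals/rubrics/rubric_v1.2.dl", rubric_text)
print(json.dumps(record, indent=2))
```

Hashing the rubric text alongside the version tag means an accidental in-place edit of a tagged file is detectable: the hash in the run record no longer matches the file.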
CI-Based Evaluation Pipelines
The five-step pattern maps directly onto a CI job. Each step becomes a stage:
CI JOB: eval_on_pull_request
│
├── Stage 1: Generate responses
│ Run model against eval dataset; save raw outputs to artifacts/
│
├── Stage 2: Extract facts
│ Run extraction prompt over raw outputs; produce facts.jsonl
│
├── Stage 3: Load facts + run Datalog
│ Load facts.jsonl; apply rubric_v{version}.dl; produce scores.jsonl
│
├── Stage 4: Aggregate + threshold check
│ Compute pass rate, tier distribution; compare to baseline
│
└── Stage 5: Gate or report
Fail the PR if pass rate drops below threshold;
Post score summary as a PR comment
The Datalog layer makes Stage 3 fast, deterministic, and cheap — no LLM calls, no network dependencies, no variance between runs on the same facts. That's what makes the CI gate meaningful: if the score changes, it's because the model's output changed or the rubric changed, not because the evaluator rolled differently.
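Stage 4 needs no LLM calls either. A sketch of the aggregate-and-threshold check in plain Python, using this lesson's tier names; the baseline and tolerance figures are illustrative:

```python
def gate(scores: list, baseline_pass_rate: float, tolerance: float = 0.02):
    """Stage 4: compute the pass rate over scored results and compare to baseline.
    Returns (pass_rate, ok) so Stage 5 can fail the PR or post a summary."""
    passed = sum(1 for s in scores if s["tier"] in ("GOOD", "ACCEPTABLE"))
    rate = passed / len(scores)
    return rate, rate >= baseline_pass_rate - tolerance

# scores.jsonl from Stage 3, already parsed
scores = [{"tier": "GOOD"}, {"tier": "ACCEPTABLE"},
          {"tier": "FAILING"}, {"tier": "GOOD"}]
rate, ok = gate(scores, baseline_pass_rate=0.70)
# rate is 0.75, so this PR passes against a 0.70 baseline
```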
Tooling Options for Moving Beyond the Prototype
Not all Datalog engines are equal. The right choice depends on where you are in the maturity curve.
🔧 pyDatalog — Best for prototyping and integration with Python-native pipelines. The syntax embeds directly in Python, making it easy to mix fact loading, rule definition, and result processing in a single script. Performance is adequate for single-document scoring up to hundreds of facts. Not the choice for high-throughput batch evaluation.
🔧 Soufflé — The performance choice. Soufflé is a Datalog engine developed at Oracle Labs and widely used in program analysis. It compiles Datalog rules to parallel C++ and can evaluate over millions of facts in seconds. For batch evaluation runs over large datasets, Soufflé is the right tool. Rules are written in a .dl file; facts are loaded from CSV or SQLite. The integration boundary with Python is a subprocess call or a compiled shared library.
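For a flavor of that boundary, here is how this lesson's citation gate might look in Soufflé syntax. This is a sketch; the relation names and file conventions are illustrative:

```prolog
// Declarations give Soufflé the typed schema the rules run against.
.decl citation_count(id: symbol, n: number)
.input citation_count              // read from citation_count.facts (TSV)

.decl passes_citation_gate(id: symbol)
.output passes_citation_gate       // written to passes_citation_gate.csv

// Same rule as the pyDatalog version, in classic Datalog notation.
passes_citation_gate(id) :- citation_count(id, n), n >= 2.
```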
🔧 Xtdb (formerly CRUX) — The persistent-fact-store choice. Xtdb is a bitemporal database with a Datalog query layer. Every fact has a valid time and a transaction time, which means you can query "what did the evaluator believe about this document as of last Tuesday's model version" with no additional infrastructure. This is the right choice when you need to accumulate evaluation history across many runs and query it retroactively — for drift detection, regression analysis, or compliance reporting.
📋 Quick Reference Card: Tooling Decision Matrix
| 🔧 Tool | 📊 Scale | 🏗️ Persistence | 🎯 Best For |
|---|---|---|---|
| 🐍 pyDatalog | Single doc to small batch | In-memory only | Prototyping, notebook exploration |
| ⚡ Soufflé | Millions of facts | CSV / SQLite input | High-throughput batch eval in CI |
| 🗄️ Xtdb | Large, accumulating datasets | Bitemporal, persistent | Historical queries, drift detection |
🤔 Did you know? Soufflé is used in production at several large technology companies to analyze billions of program facts for security vulnerability detection — the same termination and composability properties that make it useful there make it well-suited for LLM evaluation rubrics.
What You Now Know That You Didn't Before
Let's be precise about the shift this lesson represents.
Before this lesson, the common pattern for scoring LLM outputs was one of three things: a second LLM call (cheap to write, expensive to trust), a SQL query over extracted fields (works until the rubric gets complex), or a hand-rolled Python function (works until it doesn't, and then nobody knows why). All three approaches make the scoring layer opaque — you get a number, but the path from evidence to conclusion is either probabilistic, implicit, or buried in imperative code.
After this lesson, you have a pattern where:
🧠 The scoring logic is declared, not programmed. Anyone who can read a rubric can read a Datalog rule file.
📚 The evaluation is reproducible. The same facts plus the same rules always produce the same score. No temperature, no sampling, no variance.
🔧 The trace is structural. The audit log isn't a comment in a Python function — it's the derivation tree of the proof, queryable and serializable.
🎯 The system is safe to extend. Adding a new rubric criterion means adding a new rule. It doesn't require auditing existing rules for side effects.
🔒 The pipeline is version-controlled end to end. Model config, prompt config, extraction schema, and rubric rule file are all artifacts with versions. A score regression is debuggable because you can diff any layer.
⚠️ Critical point to remember: The most common failure mode after adopting this pattern is letting the boundary between extraction and scoring blur. If your extraction prompt starts making evaluative judgments — "the response cited three relevant sources" instead of "the response cited three sources" — you've moved scoring logic into the LLM layer where it is no longer reproducible. Keep extraction schema fields factual and atomic. Let the Datalog rules do the evaluating.
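The contrast is easiest to see side by side; the field names below are illustrative:

```python
# ❌ Evaluative field: the judgment ("relevant") happens inside the extraction
# prompt, where it is probabilistic and unreproducible.
evaluative_extraction = {"cited_relevant_sources": True}

# ✅ Factual, atomic fields: the Datalog rules decide what counts as relevant,
# deterministically and under version control.
factual_extraction = {"citation_count": 3, "source_diversity": 2}
```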
⚠️ Critical point to remember: Rubric rule files are policy documents. Treat them with the same rigor you apply to code. Review them in pull requests. Tag them at release. Reference them by version in run records. A score that can't be traced to a specific rubric version is not an auditable score.
⚠️ Critical point to remember: Datalog's stratified negation is powerful but requires care. If your rules use negation (~), verify that the negated predicates are fully grounded before the rule fires. Negation over incomplete facts produces false confidence — a common source of phantom PASSING scores for responses that simply had no evidence extracted.
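A minimal illustration of that grounding discipline, in the same pyDatalog syntax as this lesson's examples. The `extraction_status` predicate is an assumption here: it is the "extraction status as a first-class fact" pattern from the pitfall checklist, asserted by the fact loader for every response.

```python
pyDatalog.create_terms('extraction_status, safety_ok')

## UNSAFE: X is never grounded before the negation is evaluated; a response
## with no extracted facts at all can slip through and produce a phantom pass.
## safety_ok(X) <= ~toxicity_flag(X, True)

## SAFE: ground X against a fact every response is guaranteed to have,
## its extraction status, before applying the negation.
safety_ok(X) <= (
    extraction_status(X, 'ok') &
    ~toxicity_flag(X, True)
)
```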
Recommended Next Steps
Here are three concrete actions to take this from lesson to practice:
1. Instrument one existing evaluation with this pattern. Take a rubric you're already running — even a simple one — and re-implement it as a Datalog rule file using pyDatalog. Don't change the rubric logic yet. Just translate it. The exercise will surface every implicit assumption baked into your current scoring code, and the resulting rule file will be your baseline for future changes.
2. Add the rule file to your version control and CI. Even before you have a full CI evaluation pipeline, commit the rule file alongside your prompt configs. Write a short README that explains what version of the model and prompt the rubric was designed against. This single step makes your evaluation history navigable.
3. Design your extraction schema before your rules. The most productive order of operations is: write the rubric in plain English → identify the atomic facts each criterion requires → design the extraction schema to surface those facts → write the Datalog rules. Working in this order prevents the schema-rule mismatch pitfalls covered in the previous section and forces clarity about what the model is actually being asked to produce.
💡 Real-World Example: A team running automated evaluation for a customer-support response generator used this exact sequence. They started with seven rubric criteria encoded as a Python function (200 lines, no tests). After translating to Datalog, the rule file was 40 lines and readable by their QA team without Python knowledge. More importantly, when a new model version changed the citation behavior, the score regression was diagnosed in one PR diff — the rule file hadn't changed, the extraction schema hadn't changed, and the facts showed fewer citations. The root cause was unambiguous.
The scoring layer is no longer the weakest link. It is the most transparent part of your evaluation pipeline — and that changes what it means to trust a score.