LLM as Judge: Reproducible Evaluation for LLM Systems - Learning Roadmap | Nemorize

Learning Topics

This roadmap covers the following topics:

Why Rigorous Eval Exists
  • ⚪ Classical Metrics Failed
    • ⚪ BLEU and ROUGE Failure Cases
    • ⚪ Human Eval Doesn't Scale
  • ⚪ Cost-of-Being-Wrong Framework
    • ⚪ Defining Cost Tiers
    • ⚪ Eval as an Architecture Decision
The LLM Judge Premise
  • ⚪ Where LLM Judges Shine
    • ⚪ Strengths With Evidence
    • ⚪ Where LLM Judges Struggle
  • ⚪ The Right Tool Decision
    • ⚪ Judge vs. Metric vs. Pipeline
Core Judging Patterns
  • ⚪ Rubric Design and Criteria Decomposition
    • ⚪ Atomic Criteria vs. Holistic Rubrics
    • ⚪ Chain-of-Thought Scoring
    • ⚪ Rubric Drift
  • ⚪ Pointwise, Pairwise, and Reference-Based Modes
    • ⚪ Pointwise Scoring and Its Biases
    • ⚪ Pairwise Comparison and Position Bias
    • ⚪ Reference-Based Judging
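The pairwise-comparison topics above center on position bias: judges tend to favor whichever answer appears first. A minimal sketch of the standard mitigation, judging each pair in both orders and only accepting verdicts that survive the swap (the `judge` callable and its `"first"`/`"second"`/`"tie"` return values are hypothetical):

```python
def debiased_pairwise_verdict(judge, answer_a, answer_b):
    """Run a pairwise judge in both presentation orders to cancel
    position bias. `judge(first, second)` is a hypothetical callable
    returning "first", "second", or "tie". A win counts only if it
    is consistent under order-swapping; anything else is a tie.
    """
    v1 = judge(answer_a, answer_b)  # A shown in the first slot
    v2 = judge(answer_b, answer_a)  # B shown in the first slot
    if v1 == "first" and v2 == "second":
        return "A"  # A wins regardless of position
    if v1 == "second" and v2 == "first":
        return "B"  # B wins regardless of position
    return "tie"    # inconsistent verdicts: treat as positional noise
```

A judge that always prefers the first slot produces only ties under this scheme, which is exactly the signal that the raw verdicts were position-driven.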
G-Eval and Structured Output
  • ⚪ G-Eval: Architecture and Variants
    • ⚪ Token Probability Scoring
    • ⚪ FActScore and Fact Decomposition
  • ⚪ Structured Output for Judges
    • ⚪ Schema Design for Eval Payloads
    • ⚪ Constrained Decoding and Tool-Use Patterns
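The token-probability-scoring topic refers to G-Eval's key idea: instead of taking the single score token the judge happens to sample, weight each candidate score by the probability the model assigned to its token, yielding a continuous expected score. A sketch under the assumption that the caller has already pulled per-score-token probabilities from the model's logprobs:

```python
def g_eval_expected_score(score_token_probs):
    """G-Eval-style scoring: probability-weighted expectation over
    candidate score tokens rather than a single sampled score.

    `score_token_probs` maps score values (e.g. 1..5) to the model's
    probabilities for emitting that score token; probabilities are
    renormalized in case mass fell on non-score tokens.
    """
    total = sum(score_token_probs.values())
    return sum(s * p / total for s, p in score_token_probs.items())
```

The expectation smooths out the quantization of a 1-to-5 scale: a judge split between 4 and 5 reports 4.5 instead of flipping between the two across runs.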
Systematic Failure Modes
  • ⚪ Self-Preference and Verbosity Bias
    • ⚪ Detecting Self-Preference
    • ⚪ Verbosity Bias in Practice
  • ⚪ Bias-to-Mode Mapping
    • ⚪ Position Bias Measurement
    • ⚪ Rubric Drift Over Time
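One crude but common probe for the verbosity-bias topic above is the correlation between answer length and judge score on a labeled set; a strongly positive value suggests the judge rewards length itself. A dependency-free Pearson-correlation sketch (illustrative only: a real audit would control for actual answer quality):

```python
def verbosity_bias_correlation(lengths, scores):
    """Pearson correlation between answer lengths and judge scores.
    A value near +1 on a quality-controlled set is a red flag for
    verbosity bias. Returns 0.0 if either series is constant.
    """
    n = len(lengths)
    mean_l = sum(lengths) / n
    mean_s = sum(scores) / n
    cov = sum((l - mean_l) * (s - mean_s) for l, s in zip(lengths, scores))
    std_l = sum((l - mean_l) ** 2 for l in lengths) ** 0.5
    std_s = sum((s - mean_s) ** 2 for s in scores) ** 0.5
    if std_l == 0 or std_s == 0:
        return 0.0
    return cov / (std_l * std_s)
```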
The Hybrid Pattern: Extraction Plus Deterministic Rules
  • ⚪ Why Split Extraction from Judgment
    • ⚪ Designing the Fact Schema
    • ⚪ Extraction Failure Modes
  • ⚪ Deterministic Scoring: Rules, Trees, and DAGs
    • ⚪ Encoding Rubrics as Rules
    • ⚪ Audit Trails and Reproducibility
    • ⚪ When the Hybrid Pattern Is Overkill
  • ⚪ Datalog for Deterministic Scoring
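The hybrid pattern above splits the LLM's job (extract facts) from the scoring job (apply deterministic rules), so every score comes with an audit trail. A minimal sketch in which the fact schema, rule format `(name, predicate, points)`, and the example rubric are all hypothetical:

```python
def score_with_rules(facts, rules):
    """Hybrid-pattern scoring: an LLM has already extracted `facts`
    (a plain dict); deterministic rules then score them, and the
    returned trail records exactly which rule fired for what.
    """
    trail, total = [], 0
    for name, predicate, points in rules:
        passed = bool(predicate(facts))
        awarded = points if passed else 0
        total += awarded
        trail.append({"rule": name, "passed": passed, "points": awarded})
    return total, trail

# Hypothetical rubric for a support-bot answer, encoded as rules.
RULES = [
    ("cites_policy", lambda f: f.get("policy_cited", False), 2),
    ("no_hallucinated_refund", lambda f: not f.get("promised_refund", False), 3),
]
```

Because the rules are pure functions of the extracted facts, rescoring the same extraction is bit-for-bit reproducible, which is the property the Datalog variant of this idea pushes further.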
Meta-Evaluation and Production
  • ⚪ Meta-Evaluation: Testing the Judge
    • ⚪ Human Correlation and Benchmark Suites
    • ⚪ Adversarial Test Cases
  • ⚪ Production: Cost, Latency, and Drift
    • ⚪ Cost Architecture and Model Tiering
    • ⚪ Drift Monitoring and Judge Maintenance
    • ⚪ When LLM-as-Judge Is the Wrong Tool
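Meta-evaluation as listed above usually reduces to rank correlation between judge scores and human labels on a shared set; Spearman correlation is a standard choice because judges and humans rarely share a calibrated scale. A dependency-free sketch with average ranks for ties:

```python
def spearman(judge_scores, human_scores):
    """Spearman rank correlation between judge and human scores on
    the same items. Ties receive the average of their rank positions.
    Returns 0.0 if either ranking is constant.
    """
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
                j += 1                      # extend the tie group
            avg = (i + j) / 2 + 1           # 1-based average rank
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rj, rh = ranks(judge_scores), ranks(human_scores)
    n = len(rj)
    mj, mh = sum(rj) / n, sum(rh) / n
    cov = sum((a - mj) * (b - mh) for a, b in zip(rj, rh))
    sj = sum((a - mj) ** 2 for a in rj) ** 0.5
    sh = sum((b - mh) ** 2 for b in rh) ** 0.5
    return cov / (sj * sh) if sj and sh else 0.0
```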
Building Your Eval Stack
  • ⚪ CI, Nightly, and Audit Pipeline Design
    • ⚪ What Goes in Each Layer
    • ⚪ The Eval Stack Decision Framework
  • ⚪ Putting It Into Production
    • ⚪ Eval as Living Infrastructure
    • ⚪ From Demo to Defensible System
