LLM as Judge: Reproducible Evaluation for LLM Systems - Learning Roadmap | Nemorize

Learning Topics

This roadmap covers the following topics:

Why Rigorous Eval Exists
  • ⚪ Classical Metrics Failed
    • ⚪ BLEU and ROUGE Failure Cases
    • ⚪ Human Eval Doesn't Scale
  • ⚪ Cost-of-Being-Wrong Framework
    • ⚪ Defining Cost Tiers
    • ⚪ Eval as an Architecture Decision
The LLM Judge Premise
  • ⚪ Where LLM Judges Shine
    • ⚪ Strengths With Evidence
    • ⚪ Where LLM Judges Struggle
  • ⚪ The Right Tool Decision
    • ⚪ Judge vs. Metric vs. Pipeline
Core Judging Patterns
  • ⚪ Rubric Design and Criteria Decomposition
    • ⚪ Atomic Criteria vs. Holistic Rubrics
    • ⚪ Chain-of-Thought Scoring
    • ⚪ Rubric Drift
  • ⚪ Pointwise, Pairwise, and Reference-Based Modes
    • ⚪ Pointwise Scoring and Its Biases
    • ⚪ Pairwise Comparison and Position Bias
    • ⚪ Reference-Based Judging
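The pairwise-comparison topics above center on position bias: judges tend to favor whichever answer appears first. A minimal sketch of the standard mitigation, judging each pair in both orders and only accepting verdicts that survive the swap (the `judge` callable and its `"first"`/`"second"`/`"tie"` return values are hypothetical):

```python
def debiased_pairwise_verdict(judge, answer_a, answer_b):
    """Run a pairwise judge in both presentation orders to cancel
    position bias. `judge(first, second)` is a hypothetical callable
    returning "first", "second", or "tie". A win counts only if it
    is consistent under order-swapping; anything else is a tie.
    """
    v1 = judge(answer_a, answer_b)  # A shown in the first slot
    v2 = judge(answer_b, answer_a)  # B shown in the first slot
    if v1 == "first" and v2 == "second":
        return "A"  # A wins regardless of position
    if v1 == "second" and v2 == "first":
        return "B"  # B wins regardless of position
    return "tie"    # inconsistent verdicts: treat as positional noise
```

A judge that always prefers the first slot produces only ties under this scheme, which is exactly the signal that the raw verdicts were position-driven.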
G-Eval and Structured Output
  • ⚪ G-Eval: Architecture and Variants
    • ⚪ Token Probability Scoring
    • ⚪ FActScore and Fact Decomposition
  • ⚪ Structured Output for Judges
    • ⚪ Schema Design for Eval Payloads
    • ⚪ Constrained Decoding and Tool-Use Patterns
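The token-probability-scoring topic refers to G-Eval's key idea: instead of taking the single score token the judge happens to sample, weight each candidate score by the probability the model assigned to its token, yielding a continuous expected score. A sketch under the assumption that the caller has already pulled per-score-token probabilities from the model's logprobs:

```python
def g_eval_expected_score(score_token_probs):
    """G-Eval-style scoring: probability-weighted expectation over
    candidate score tokens rather than a single sampled score.

    `score_token_probs` maps score values (e.g. 1..5) to the model's
    probabilities for emitting that score token; probabilities are
    renormalized in case mass fell on non-score tokens.
    """
    total = sum(score_token_probs.values())
    return sum(s * p / total for s, p in score_token_probs.items())
```

The expectation smooths out the quantization of a 1-to-5 scale: a judge split between 4 and 5 reports 4.5 instead of flipping between the two across runs.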
Systematic Failure Modes
  • ⚪ Self-Preference and Verbosity Bias
    • ⚪ Detecting Self-Preference
    • ⚪ Verbosity Bias in Practice
  • ⚪ Bias-to-Mode Mapping
    • ⚪ Position Bias Measurement
    • ⚪ Rubric Drift Over Time
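One crude but common probe for the verbosity-bias topic above is the correlation between answer length and judge score on a labeled set; a strongly positive value suggests the judge rewards length itself. A dependency-free Pearson-correlation sketch (illustrative only: a real audit would control for actual answer quality):

```python
def verbosity_bias_correlation(lengths, scores):
    """Pearson correlation between answer lengths and judge scores.
    A value near +1 on a quality-controlled set is a red flag for
    verbosity bias. Returns 0.0 if either series is constant.
    """
    n = len(lengths)
    mean_l = sum(lengths) / n
    mean_s = sum(scores) / n
    cov = sum((l - mean_l) * (s - mean_s) for l, s in zip(lengths, scores))
    std_l = sum((l - mean_l) ** 2 for l in lengths) ** 0.5
    std_s = sum((s - mean_s) ** 2 for s in scores) ** 0.5
    if std_l == 0 or std_s == 0:
        return 0.0
    return cov / (std_l * std_s)
```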
The Hybrid Pattern: Extraction Plus Deterministic Rules
  • ⚪ Why Split Extraction from Judgment
    • ⚪ Designing the Fact Schema
    • ⚪ Extraction Failure Modes
  • ⚪ Deterministic Scoring: Rules, Trees, and DAGs
    • ⚪ Encoding Rubrics as Rules
    • ⚪ Audit Trails and Reproducibility
    • ⚪ When the Hybrid Pattern Is Overkill
  • ⚪ Datalog for Deterministic Scoring
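The hybrid pattern above splits the LLM's job (extract facts) from the scoring job (apply deterministic rules), so every score comes with an audit trail. A minimal sketch in which the fact schema, rule format `(name, predicate, points)`, and the example rubric are all hypothetical:

```python
def score_with_rules(facts, rules):
    """Hybrid-pattern scoring: an LLM has already extracted `facts`
    (a plain dict); deterministic rules then score them, and the
    returned trail records exactly which rule fired for what.
    """
    trail, total = [], 0
    for name, predicate, points in rules:
        passed = bool(predicate(facts))
        awarded = points if passed else 0
        total += awarded
        trail.append({"rule": name, "passed": passed, "points": awarded})
    return total, trail

# Hypothetical rubric for a support-bot answer, encoded as rules.
RULES = [
    ("cites_policy", lambda f: f.get("policy_cited", False), 2),
    ("no_hallucinated_refund", lambda f: not f.get("promised_refund", False), 3),
]
```

Because the rules are pure functions of the extracted facts, rescoring the same extraction is bit-for-bit reproducible, which is the property the Datalog variant of this idea pushes further.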
Meta-Evaluation and Production
  • ⚪ Meta-Evaluation: Testing the Judge
    • ⚪ Human Correlation and Benchmark Suites
    • ⚪ Adversarial Test Cases
  • ⚪ Production: Cost, Latency, and Drift
    • ⚪ Cost Architecture and Model Tiering
    • ⚪ Drift Monitoring and Judge Maintenance
    • ⚪ When LLM-as-Judge Is the Wrong Tool
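Meta-evaluation as listed above usually reduces to rank correlation between judge scores and human labels on a shared set; Spearman correlation is a standard choice because judges and humans rarely share a calibrated scale. A dependency-free sketch with average ranks for ties:

```python
def spearman(judge_scores, human_scores):
    """Spearman rank correlation between judge and human scores on
    the same items. Ties receive the average of their rank positions.
    Returns 0.0 if either ranking is constant.
    """
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
                j += 1                      # extend the tie group
            avg = (i + j) / 2 + 1           # 1-based average rank
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rj, rh = ranks(judge_scores), ranks(human_scores)
    n = len(rj)
    mj, mh = sum(rj) / n, sum(rh) / n
    cov = sum((a - mj) * (b - mh) for a, b in zip(rj, rh))
    sj = sum((a - mj) ** 2 for a in rj) ** 0.5
    sh = sum((b - mh) ** 2 for b in rh) ** 0.5
    return cov / (sj * sh) if sj and sh else 0.0
```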
Building Your Eval Stack
  • ⚪ CI, Nightly, and Audit Pipeline Design
    • ⚪ What Goes in Each Layer
    • ⚪ The Eval Stack Decision Framework
  • ⚪ Putting It Into Production
    • ⚪ Eval as Living Infrastructure
    • ⚪ From Demo to Defensible System
