LLM as Judge: Reproducible Evaluation for LLM Systems - Learning Roadmap | Nemorize
Learning Topics
This roadmap covers the following topics:
✅ Why Rigorous Eval Exists
- ⚪ Classical Metrics Failed
- ⚪ BLEU and ROUGE Failure Cases
- ⚪ Human Eval Doesn't Scale
- ⚪ Cost-of-Being-Wrong Framework
- ⚪ Defining Cost Tiers
- ⚪ Eval as an Architecture Decision
✅ The LLM Judge Premise
- ⚪ Where LLM Judges Shine
- ⚪ Strengths With Evidence
- ⚪ Where LLM Judges Struggle
- ⚪ The Right Tool Decision
- ⚪ Judge vs. Metric vs. Pipeline
✅ Core Judging Patterns
- ⚪ Rubric Design and Criteria Decomposition
- ⚪ Atomic Criteria vs. Holistic Rubrics
- ⚪ Chain-of-Thought Scoring
- ⚪ Rubric Drift
- ⚪ Pointwise, Pairwise, and Reference-Based Modes
- ⚪ Pointwise Scoring and Its Biases
- ⚪ Pairwise Comparison and Position Bias
- ⚪ Reference-Based Judging
✅ G-Eval and Structured Output
- ⚪ G-Eval: Architecture and Variants
- ⚪ Token Probability Scoring
- ⚪ FActScore and Fact Decomposition
- ⚪ Structured Output for Judges
- ⚪ Schema Design for Eval Payloads
- ⚪ Constrained Decoding and Tool-Use Patterns
⚪ Systematic Failure Modes
- ⚪ Self-Preference and Verbosity Bias
- ⚪ Detecting Self-Preference
- ⚪ Verbosity Bias in Practice
- ⚪ Bias-to-Mode Mapping
- ⚪ Position Bias Measurement
- ⚪ Rubric Drift Over Time
⚪ The Hybrid Pattern: Extraction Plus Deterministic Rules
- ⚪ Why Split Extraction from Judgment
- ⚪ Designing the Fact Schema
- ⚪ Extraction Failure Modes
- ⚪ Deterministic Scoring: Rules, Trees, and DAGs
- ⚪ Encoding Rubrics as Rules
- ⚪ Audit Trails and Reproducibility
- ⚪ When the Hybrid Pattern Is Overkill
- ✅ Datalog for Deterministic Scoring
⚪ Meta-Evaluation and Production
- ⚪ Meta-Evaluation: Testing the Judge
- ⚪ Human Correlation and Benchmark Suites
- ⚪ Adversarial Test Cases
- ⚪ Production: Cost, Latency, and Drift
- ⚪ Cost Architecture and Model Tiering
- ⚪ Drift Monitoring and Judge Maintenance
- ⚪ When LLM-as-Judge Is the Wrong Tool
⚪ Building Your Eval Stack
- ⚪ CI, Nightly, and Audit Pipeline Design
- ⚪ What Goes in Each Layer
- ⚪ The Eval Stack Decision Framework
- ⚪ Putting It Into Production
- ⚪ Eval as Living Infrastructure
- ⚪ From Demo to Defensible System