Evaluation & Quality Metrics
Establish comprehensive evaluation frameworks for retrieval quality, generation accuracy, and end-to-end performance.
Evaluation & Quality Metrics for AI Search & RAG
Master the evaluation and quality metrics for AI Search and Retrieval-Augmented Generation (RAG) systems with free flashcards and spaced repetition practice. This lesson covers retrieval metrics, generation quality assessment, end-to-end evaluation frameworks, and production monitoring strategies: essential concepts for building reliable RAG systems that deliver accurate, relevant results.
Welcome to RAG Evaluation
Building a RAG system is one thing; knowing whether it works well is entirely another. Without proper evaluation metrics, you're flying blind, unable to detect when your system retrieves irrelevant documents, generates hallucinated content, or fails to answer user questions accurately.
In 2026, evaluation has become the cornerstone of RAG development. Organizations have learned the hard way that production RAG systems can silently degrade over time as data distributions shift, models update, or user behavior changes. Quality metrics serve as your early warning system, helping you catch problems before users do.
This lesson breaks down the complete evaluation landscape into digestible components: retrieval metrics (how well you find relevant information), generation metrics (how well you produce answers), and end-to-end metrics (how well the whole system performs). You'll learn practical techniques used by leading AI teams to ensure their RAG systems maintain high quality in production.
Core Concepts: The Evaluation Landscape
The Three Pillars of RAG Evaluation
RAG systems have three distinct evaluation surfaces, each requiring different metrics:
| Evaluation Surface | What It Measures | Key Metrics |
|---|---|---|
| Retrieval Quality | How well relevant documents are found | Precision@K, Recall@K, MRR, NDCG |
| Generation Quality | How accurate and useful the generated answer is | Faithfulness, Relevance, Coherence |
| End-to-End Quality | Overall system performance from user perspective | Answer Accuracy, Latency, User Satisfaction |
Why separate them? Because a RAG system can fail in different ways at different stages:
- Retrieval fails → the right documents never surface
- Generation fails → a wrong answer despite the right documents
- Integration fails → slow response times ruin the user experience
Retrieval Metrics Deep Dive
Retrieval evaluation assumes you have ground truth relevance judgments: knowledge of which documents should be retrieved for each query. Let's explore the core metrics:
Precision@K and Recall@K
Precision@K measures what percentage of your top K retrieved documents are actually relevant:
Precision@K = (# relevant docs in top K) / K
Recall@K measures what percentage of all relevant documents you captured in your top K:
Recall@K = (# relevant docs in top K) / (total # relevant docs)
Example: For query "What is vector search?"
- Total relevant documents in corpus: 10
- Top 5 retrieved: 3 are relevant, 2 are not
- Precision@5 = 3/5 = 0.60 (60% of retrieved docs are relevant)
- Recall@5 = 3/10 = 0.30 (captured 30% of all relevant docs)
Pro Tip: These metrics trade off! Higher K usually increases recall but decreases precision. Choose K based on your generation model's context window and processing capability.
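To make the trade-off concrete, here is a minimal sketch of both metrics, assuming binary relevance judgments (a set of relevant document IDs per query); the document IDs below are made up for illustration:
def precision_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of the top-k retrieved documents that are relevant
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k
def recall_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of all relevant documents that made it into the top k
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0
## The worked example above: 10 relevant docs in the corpus, 3 of them in the top 5
retrieved = ['d3', 'd7', 'd1', 'd9', 'd4']
relevant = {'d1', 'd2', 'd3', 'd5', 'd6', 'd7', 'd8', 'd10', 'd11', 'd12'}
print(precision_at_k(retrieved, relevant, 5))  # 0.6
print(recall_at_k(retrieved, relevant, 5))     # 0.3
Sweeping K over a few values (3, 5, 10) with these two helpers is the quickest way to see the precision/recall trade-off described in the tip above.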
Mean Reciprocal Rank (MRR)
MRR focuses on the rank position of the first relevant document:
MRR = average(1 / rank_of_first_relevant_doc)
If the first relevant doc is at position 1, you get 1.0. At position 2, you get 0.5. At position 10, you get 0.1.
Why it matters: For RAG systems, the first few documents often matter most. If relevant content is buried at position 20, your LLM might never use it effectively.
Example calculation:
| Query | First Relevant at Position | Reciprocal Rank |
|---|---|---|
| Query 1 | 1 | 1.0 |
| Query 2 | 3 | 0.333 |
| Query 3 | 2 | 0.5 |
| MRR | | 0.611 |
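A small sketch that reproduces the table above; it assumes each query's result is a ranked list of document IDs plus the set of IDs judged relevant (the IDs here are illustrative):
def reciprocal_rank(retrieved_ids, relevant_ids):
    # 1 / position of the first relevant document, 0 if none was retrieved
    for position, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0
def mean_reciprocal_rank(results):
    # results: list of (retrieved_ids, relevant_ids) pairs, one per query
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in results) / len(results)
## First relevant doc at positions 1, 3, and 2, as in the table
results = [
    (['a', 'x', 'y'], {'a'}),
    (['x', 'y', 'b'], {'b'}),
    (['x', 'c', 'y'], {'c'}),
]
print(round(mean_reciprocal_rank(results), 3))  # 0.611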
Normalized Discounted Cumulative Gain (NDCG)
NDCG is the most sophisticated retrieval metric, accounting for:
- Relevance grades (not just binary relevant/not-relevant)
- Position bias (higher ranked docs matter more)
Formula breakdown:
## DCG (Discounted Cumulative Gain)
DCG@K = sum(relevance[i] / log2(i + 1) for i in range(1, K+1))
## NDCG (Normalized DCG)
NDCG@K = DCG@K / IDCG@K
## where IDCG@K = DCG of the ideal ranking
Real example: query "Python memory management", with relevance graded from 0 (not relevant) to 3 (highly relevant):
| Position | Retrieved Doc | Relevance | Discount (log₂(i+1)) | Contribution |
|---|---|---|---|---|
| 1 | Doc A | 3 | 1.0 | 3.0 |
| 2 | Doc B | 2 | 1.585 | 1.26 |
| 3 | Doc C | 0 | 2.0 | 0.0 |
| 4 | Doc D | 1 | 2.322 | 0.43 |
| DCG@4 | | | | 4.69 |
The ideal ordering of these four documents is [3, 2, 1, 0], so IDCG@4 = 3.0 + 1.26 + 0.5 + 0 = 4.76 and NDCG@4 = 4.69 / 4.76 ≈ 0.99. The score is close to 1 because the only ranking error is the irrelevant Doc C sitting above the marginally relevant Doc D.
Important: NDCG requires graded relevance judgments, which are expensive to collect. Many teams start with binary relevance (relevant/not) and upgrade to graded judgments for critical queries.
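A compact sketch of the formulas above; it assumes the graded relevance scores are already aligned with the retrieved ranking, and it builds the ideal ordering by sorting those same grades (as the example does):
import math
def dcg_at_k(relevances, k):
    # relevances: graded relevance of each retrieved doc, in rank order
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))
def ndcg_at_k(relevances, k):
    ideal = sorted(relevances, reverse=True)  # ideal reordering of the same docs
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0
## The "Python memory management" example: retrieved relevances [3, 2, 0, 1]
print(round(dcg_at_k([3, 2, 0, 1], 4), 2))   # 4.69
print(round(ndcg_at_k([3, 2, 0, 1], 4), 2))  # 0.99
If highly relevant documents exist in the corpus but were never retrieved, compute IDCG from the query's corpus-wide relevance grades instead of sorting only the retrieved ones.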
Generation Quality Metrics
Once you've retrieved documents, the LLM must generate a high-quality answer. Generation metrics evaluate different quality dimensions:
Faithfulness (Groundedness)
Faithfulness measures whether the generated answer is supported by the retrieved documents; essentially, does your system hallucinate?
Calculation approaches:
- Claim-based verification:
## Pseudo-code for faithfulness scoring
def calculate_faithfulness(answer, retrieved_docs):
claims = extract_claims(answer) # Break answer into atomic claims
supported = 0
for claim in claims:
if is_supported_by_docs(claim, retrieved_docs):
supported += 1
return supported / len(claims)
- NLI-based verification: Use Natural Language Inference models to check if retrieved docs entail the answer:
faithfulness_scores = []
for doc in retrieved_docs:
score = nli_model.predict(premise=doc, hypothesis=answer)
faithfulness_scores.append(score)
faithfulness = max(faithfulness_scores) # At least one doc supports it
Example evaluation:
Query: "When was Python created?"
Retrieved doc: "Python was created by Guido van Rossum and first released in 1991."
Generated answer: "Python was created in 1991 by Guido van Rossum."
Faithfulness: 1.0 (fully supported)
Generated answer: "Python was created in 1989 by Guido van Rossum."
Faithfulness: 0.5 (date is hallucinated, creator is correct)
Answer Relevance
Answer Relevance measures whether the generated answer actually addresses the user's question:
def relevance_by_embedding(question, answer):
    # Method 1: embedding similarity between question and answer
    q_embedding = embed(question)
    a_embedding = embed(answer)
    return cosine_similarity(q_embedding, a_embedding)
def relevance_by_llm_judge(question, answer):
    # Method 2: LLM-as-judge (llm.score stands in for your judge call)
    prompt = f"""Rate how well this answer addresses the question (0-1):
Question: {question}
Answer: {answer}
Score:"""
    return llm.score(prompt)
Example:
- Query: "What are the benefits of vector databases?"
- Answer 1: "Vector databases enable fast similarity search, support high-dimensional data, and scale to billions of vectors." → Relevance: 0.95
- Answer 2: "Databases store data in tables with rows and columns." → Relevance: 0.3 (talks about databases but misses vector-specific benefits)
Coherence and Fluency
Coherence evaluates logical flow and structure:
- Do sentences connect logically?
- Is the answer well-organized?
- Are there contradictions?
Fluency evaluates language quality:
- Grammatically correct?
- Natural phrasing?
- Appropriate vocabulary?
These are typically measured via:
- LLM-as-judge scoring (GPT-4, Claude); a sketch of this approach follows the list
- Specialized evaluation models (e.g., fine-tuned BERT classifiers)
- Human evaluation (gold standard but expensive)
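As a sketch of the first approach, here is a coherence judge that mirrors the faithfulness judge shown later in this lesson; it reuses the same legacy openai.ChatCompletion call, and the rubric wording is an assumption rather than a standard:
import openai
def llm_judge_coherence(answer):
    prompt = f"""You are an expert evaluator. Rate the coherence of the answer below:
- 1.0: ideas flow logically, well organized, no contradictions
- 0.5: understandable but loosely organized or partially repetitive
- 0.0: disjointed, contradictory, or incoherent
Answer: {answer}
Return only a number between 0 and 1.
Score:"""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0  # As deterministic as possible
    )
    return float(response.choices[0].message.content.strip())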
End-to-End RAG Metrics
Context Precision and Context Recall
These metrics bridge retrieval and generation:
Context Precision = Are the retrieved documents relevant to generating the correct answer?
def context_precision(retrieved_docs, ground_truth_answer):
relevant_docs = [doc for doc in retrieved_docs
if is_useful_for_answer(doc, ground_truth_answer)]
return len(relevant_docs) / len(retrieved_docs)
Context Recall = Did we retrieve all necessary information to answer correctly?
def context_recall(retrieved_docs, ground_truth_answer):
required_facts = extract_facts(ground_truth_answer)
covered_facts = [fact for fact in required_facts
if any(contains(doc, fact) for doc in retrieved_docs)]
return len(covered_facts) / len(required_facts)
Answer Correctness
The ultimate metric: Is the answer correct?
Measurement approaches:
- Exact match (for factoid questions):
ground_truth = "1991"
generated = "Python was created in 1991"
correct = ground_truth in generated # True
- Semantic similarity (for open-ended questions):
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')  # same model used in Example 1 below
gt_embedding = model.encode(ground_truth_answer)
gen_embedding = model.encode(generated_answer)
correctness = util.cos_sim(gt_embedding, gen_embedding).item()
- LLM-based evaluation:
prompt = f"""Compare the generated answer with the ground truth.
Score correctness from 0 (completely wrong) to 1 (perfectly correct).
Ground truth: {ground_truth}
Generated: {generated_answer}
Score:"""
correctness = llm.score(prompt)
Advanced Evaluation Patterns
The RAGAS Framework
RAGAS (Retrieval-Augmented Generation Assessment) provides a comprehensive evaluation suite:
RAGAS Metric Components
| Metric | Formula | What It Catches |
|---|---|---|
| Context Precision | Σ(Precision@k × rel(k)) / total_relevant | Irrelevant docs in context |
| Context Recall | count(GT_facts ∩ Retrieved_facts) / count(GT_facts) | Missing information |
| Faithfulness | count(Supported_claims) / count(Total_claims) | Hallucinations |
| Answer Relevance | cos_sim(question, answer) | Off-topic responses |
Using RAGAS in practice:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)
## Your RAG system outputs (exact field names vary slightly across ragas versions)
data = {
    'question': ['What is vector search?'],
    'answer': ['Vector search finds similar items using embeddings...'],
    'contexts': [['Doc1 text...', 'Doc2 text...']],
    'ground_truths': [['Vector search uses embeddings...']]
}
result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision]
)
print(result)
## {'faithfulness': 0.95, 'answer_relevancy': 0.92,
##  'context_recall': 0.88, 'context_precision': 0.78}
LLM-as-Judge Evaluation
Concept: Use a powerful LLM (GPT-4, Claude) to evaluate outputs from your RAG system.
Advantages:
- No labeled data required
- Handles nuanced quality dimensions
- Can evaluate open-ended responses
- Fast iteration
Disadvantages:
- Expensive (API costs)
- Potential bias
- Not deterministic
- Judge quality varies
Example implementation:
def llm_judge_faithfulness(answer, context):
prompt = f"""You are an expert evaluator. Assess if the answer is
fully supported by the context. Return a score from 0 to 1.
Context: {context}
Answer: {answer}
Evaluation criteria:
- 1.0: Every claim in the answer is directly supported
- 0.5: Some claims supported, some not verifiable
- 0.0: Answer contains unsupported or contradicting claims
Return only a number between 0 and 1.
Score:"""
response = openai.ChatCompletion.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=0 # Deterministic
)
return float(response.choices[0].message.content.strip())
Best Practice: Use LLM-as-judge for development, but collect human labels for critical production metrics.
Human Evaluation Best Practices
Human evaluation remains the gold standard. Structured approach:
1. Define clear rubrics:
| Score | Correctness | Completeness | Clarity |
|---|---|---|---|
| 5 | Completely accurate | All info included | Perfect clarity |
| 4 | Mostly accurate, minor errors | Key info present | Clear with minor issues |
| 3 | Partially correct | Missing some info | Understandable but unclear |
| 2 | Mostly incorrect | Missing most info | Confusing |
| 1 | Completely wrong | No relevant info | Incomprehensible |
2. Sample strategically:
## Don't evaluate everything - sample intelligently
samples = [
random_sample(n=50), # Random baseline
high_confidence_sample(n=25), # Where model was confident
low_confidence_sample(n=25), # Where model struggled
edge_cases(n=25), # Known difficult queries
recent_queries(n=25) # Latest production traffic
]
3. Measure inter-annotator agreement:
from sklearn.metrics import cohen_kappa_score
## Two annotators rate the same 100 examples
rater1_scores = [4, 5, 3, ...]
rater2_scores = [4, 4, 3, ...]
kappa = cohen_kappa_score(rater1_scores, rater2_scores)
print(f"Inter-rater agreement: {kappa:.2f}")
## 0.8+ is good, below 0.6 means unclear guidelines
Production Monitoring Metrics
Evaluation doesn't stop at deployment. Monitor continuously:
System Health Metrics
## Track these in your production dashboard
metrics_to_monitor = {
'latency_p50': 'median response time',
'latency_p95': '95th percentile response time',
'latency_p99': '99th percentile response time',
'retrieval_time': 'time to fetch documents',
'generation_time': 'LLM inference time',
'error_rate': 'failed requests / total requests',
'timeout_rate': 'timed out requests / total',
'cache_hit_rate': 'cached responses / total'
}
Example monitoring setup:
import time
from prometheus_client import Histogram, Counter
## Define metrics
response_time = Histogram('rag_response_seconds', 'RAG response time')
error_counter = Counter('rag_errors_total', 'Total RAG errors')
def rag_pipeline(query):
start = time.time()
try:
# Retrieval
docs = retriever.search(query)
# Generation
answer = generator.generate(query, docs)
response_time.observe(time.time() - start)
return answer
except Exception as e:
error_counter.inc()
raise e
Quality Drift Detection
Concept: Automated detection when quality degrades over time.
class QualityDriftDetector:
def __init__(self, baseline_metrics, threshold=0.1):
self.baseline = baseline_metrics
self.threshold = threshold
def detect_drift(self, current_metrics):
alerts = []
for metric_name, baseline_value in self.baseline.items():
current_value = current_metrics.get(metric_name)
if current_value is None:
continue
# Check for significant drop
drop = baseline_value - current_value
if drop > self.threshold:
alerts.append({
'metric': metric_name,
'baseline': baseline_value,
'current': current_value,
'drop': drop
})
return alerts
## Usage
detector = QualityDriftDetector({
'faithfulness': 0.92,
'answer_relevance': 0.88,
'context_precision': 0.85
})
current = {
'faithfulness': 0.78, # Dropped!
'answer_relevance': 0.87,
'context_precision': 0.84
}
alerts = detector.detect_drift(current)
if alerts:
send_alert(f"Quality drift detected: {alerts}")
Detailed Examples
Example 1: Building a Complete Evaluation Pipeline
Scenario: You're deploying a RAG system for customer support. You need end-to-end evaluation.
Step 1: Create evaluation dataset
import pandas as pd
## Collect diverse test cases
eval_data = pd.DataFrame([
{
'query': 'How do I reset my password?',
'ground_truth': 'Click forgot password, enter email, follow link',
'category': 'account',
'difficulty': 'easy'
},
{
'query': 'Why was I charged twice for my subscription?',
'ground_truth': 'Contact billing team with transaction IDs for refund',
'category': 'billing',
'difficulty': 'medium'
},
{
'query': 'Can I export my data in GDPR-compliant format?',
'ground_truth': 'Yes, go to Settings > Privacy > Export Data',
'category': 'privacy',
'difficulty': 'hard'
}
])
print(f"Evaluation set size: {len(eval_data)}")
Step 2: Run RAG system and collect outputs
def run_evaluation(eval_data, rag_system):
results = []
for idx, row in eval_data.iterrows():
query = row['query']
# Run RAG pipeline
retrieved_docs = rag_system.retrieve(query)
generated_answer = rag_system.generate(query, retrieved_docs)
results.append({
'query': query,
'retrieved_docs': retrieved_docs,
'generated_answer': generated_answer,
'ground_truth': row['ground_truth'],
'category': row['category']
})
return pd.DataFrame(results)
results_df = run_evaluation(eval_data, my_rag_system)
Step 3: Calculate metrics
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')
def calculate_all_metrics(results_df):
metrics = []
for idx, row in results_df.iterrows():
# Faithfulness: check if answer supported by docs
faithfulness = check_faithfulness(
row['generated_answer'],
row['retrieved_docs']
)
# Answer correctness: semantic similarity with ground truth
gt_emb = model.encode(row['ground_truth'])
ans_emb = model.encode(row['generated_answer'])
correctness = util.cos_sim(gt_emb, ans_emb).item()
# Context precision: are retrieved docs relevant?
context_precision = evaluate_context_precision(
row['retrieved_docs'],
row['ground_truth']
)
metrics.append({
'query': row['query'],
'faithfulness': faithfulness,
'correctness': correctness,
'context_precision': context_precision,
'category': row['category']
})
return pd.DataFrame(metrics)
metrics_df = calculate_all_metrics(results_df)
## Aggregate results
print("\nOverall Metrics:")
print(metrics_df[['faithfulness', 'correctness', 'context_precision']].mean())
print("\nMetrics by Category:")
print(metrics_df.groupby('category')[['faithfulness', 'correctness', 'context_precision']].mean())
Expected output:
Overall Metrics:
faithfulness 0.87
correctness 0.82
context_precision 0.79
Metrics by Category:
faithfulness correctness context_precision
category
account 0.95 0.92 0.88
billing 0.82 0.78 0.75
privacy 0.84 0.76 0.74
Insight: Account queries perform best, privacy queries need improvement.
Example 2: Debugging Low Faithfulness Scores
Problem: Your RAG system has faithfulness score of 0.65 (below target of 0.85).
Investigation approach:
## Step 1: Find examples with low faithfulness
low_faithfulness = metrics_df[metrics_df['faithfulness'] < 0.7]
for idx, row in low_faithfulness.iterrows():
print(f"\n{'='*60}")
print(f"Query: {row['query']}")
print(f"Answer: {results_df.loc[idx, 'generated_answer']}")
print(f"\nRetrieved docs:")
for doc in results_df.loc[idx, 'retrieved_docs']:
print(f" - {doc[:100]}...")
print(f"\nFaithfulness: {row['faithfulness']:.2f}")
Common root causes:
| Symptom | Root Cause | Solution |
|---|---|---|
| Answer contains facts not in docs | Model hallucinating | Add stronger grounding prompt: "Only use information from context" |
| Relevant docs not retrieved | Retrieval failure | Improve embeddings, adjust chunk size, add metadata filters |
| Answer combines multiple docs incorrectly | Context confusion | Add doc source attribution, reduce context length |
Solution implementation:
## Improved prompt with grounding instructions
SYSTEM_PROMPT = """You are a helpful assistant. Answer the question using
ONLY the information provided in the context below. If the context doesn't
contain enough information to answer fully, say "I don't have enough
information to answer that completely."
Context:
{context}
Question: {question}
Answer:"""
## Before fix: Faithfulness 0.65
## After fix: Faithfulness 0.86
Example 3: A/B Testing RAG Improvements
Scenario: You want to test whether increasing chunk size from 256 to 512 tokens improves answer quality.
import numpy as np
from scipy import stats
## Variant A: 256 token chunks
variant_a_scores = [0.78, 0.82, 0.75, 0.80, 0.79, 0.81, 0.77, 0.83]
## Variant B: 512 token chunks
variant_b_scores = [0.85, 0.88, 0.84, 0.87, 0.86, 0.89, 0.85, 0.90]
## Statistical test
t_stat, p_value = stats.ttest_ind(variant_a_scores, variant_b_scores)
print(f"Variant A mean: {np.mean(variant_a_scores):.3f}")
print(f"Variant B mean: {np.mean(variant_b_scores):.3f}")
print(f"Improvement: {(np.mean(variant_b_scores) - np.mean(variant_a_scores)) * 100:.1f}%")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Improvement is statistically significant")
else:
    print("Not enough evidence of improvement")
Output:
Variant A mean: 0.794
Variant B mean: 0.868
Improvement: 7.4%
P-value: 0.0000
Improvement is statistically significant
Sample size calculation:
from statsmodels.stats.power import tt_ind_solve_power
## How many samples per variant are needed to detect a 5-point improvement?
effect_size = 0.05 / np.std(variant_a_scores)  # Cohen's d approximation
required_n = tt_ind_solve_power(  # solves for the missing argument (here nobs1)
    effect_size=effect_size,
    alpha=0.05,      # Significance level
    power=0.8,       # 80% power
    alternative='larger'
)
print(f"Required sample size per variant: {int(np.ceil(required_n))}")
Example 4: Continuous Evaluation in Production
Setup real-time monitoring:
import logging
import json
import time
from datetime import datetime
import numpy as np  # used for the aggregate stats in the daily report
class RAGMonitor:
def __init__(self, log_file='rag_metrics.jsonl'):
self.log_file = log_file
self.logger = logging.getLogger('rag_monitor')
def log_interaction(self, query, answer, retrieved_docs,
latency, user_feedback=None):
# Calculate metrics
faithfulness = self.calculate_faithfulness(answer, retrieved_docs)
record = {
'timestamp': datetime.now().isoformat(),
'query': query,
'answer': answer,
'num_docs_retrieved': len(retrieved_docs),
'latency_ms': latency * 1000,
'faithfulness': faithfulness,
'user_feedback': user_feedback # thumbs up/down if available
}
# Write to log file
with open(self.log_file, 'a') as f:
f.write(json.dumps(record) + '\n')
return record
def generate_daily_report(self):
# Read logs from past 24 hours
logs = self.read_recent_logs(hours=24)
report = {
'total_queries': len(logs),
'avg_latency_ms': np.mean([l['latency_ms'] for l in logs]),
'p95_latency_ms': np.percentile([l['latency_ms'] for l in logs], 95),
'avg_faithfulness': np.mean([l['faithfulness'] for l in logs]),
'positive_feedback_rate': self.calculate_feedback_rate(logs)
}
return report
## Usage in production
monitor = RAGMonitor()
def handle_user_query(query):
start = time.time()
docs = retriever.retrieve(query)
answer = generator.generate(query, docs)
latency = time.time() - start
# Log everything
monitor.log_interaction(
query=query,
answer=answer,
retrieved_docs=docs,
latency=latency
)
return answer
Daily report example:
report = monitor.generate_daily_report()
print(json.dumps(report, indent=2))
Output:
{
"total_queries": 1247,
"avg_latency_ms": 340,
"p95_latency_ms": 580,
"avg_faithfulness": 0.87,
"positive_feedback_rate": 0.82
}
Common Mistakes to Avoid
1. Evaluating Only on Easy Questions
Wrong approach:
eval_queries = [
"What is Python?",
"Who created Python?",
"When was Python created?"
]
Right approach:
eval_queries = [
# Simple factoid
"What is Python?",
# Multi-hop reasoning
"Compare Python's memory management to Java's",
# Ambiguous query
"Python performance issues",
# Requires synthesis
"Explain best practices for Python async programming",
# Edge case
"Python 2 vs Python 3 unicode handling differences"
]
Why it matters: Production queries are diverse and challenging. Easy-only evaluation gives false confidence.
2. Ignoring Retrieval Metrics
Wrong mindset: "My LLM is powerful, retrieval quality doesn't matter much."
Right approach: Track retrieval metrics separately:
## Always measure both!
retrieval_metrics = {
'recall@5': 0.78,
'precision@5': 0.65,
'mrr': 0.82
}
generation_metrics = {
'faithfulness': 0.88,
'relevance': 0.85
}
Reality check: Even GPT-4 can't fix bad retrieval. If relevant docs aren't retrieved, the answer will be wrong.
3. Not Testing Edge Cases
Missing test cases:
- Queries with no relevant documents in corpus
- Queries requiring information from multiple documents
- Queries with contradictory information in different docs
- Very long/complex queries
- Queries with typos or informal language
Comprehensive test suite:
edge_cases = {
'no_relevant_docs': [
"What is the meaning of life?" # Philosophical, not in docs
],
'multi_doc_synthesis': [
"Summarize all product features across documentation"
],
'contradictions': [
"What is the refund policy?" # If policy changed, old docs conflict
],
'long_context': [
"Explain the complete history of our company..." # 500 word query
],
'informal_language': [
"how do i cancel my sub lol" # Casual, typos
]
}
4. Using Only Automatic Metrics
Pure automation:
## Relying solely on computed metrics
if faithfulness > 0.8 and relevance > 0.8:
print("System is good!")
Hybrid approach:
## Combine automatic + human evaluation
auto_metrics = calculate_automatic_metrics(results)
## Sample for human review
high_risk = results[
(auto_metrics['faithfulness'] < 0.7) |
(auto_metrics['user_feedback'] == 'negative')
]
human_review_sample = high_risk.sample(n=50)
send_for_human_evaluation(human_review_sample)
Why: Automatic metrics miss nuanced issues. Human evaluation catches quality problems machines miss.
5. Ignoring Latency in Evaluation
Quality-only focus:
## Only tracking accuracy
metrics = {'accuracy': 0.92}
Quality + Performance:
metrics = {
'accuracy': 0.92,
'latency_p50': 250, # ms
'latency_p95': 450,
'timeout_rate': 0.02
}
## Set SLAs
if metrics['latency_p95'] > 500:
alert("Latency SLA violation!")
Reality: A system with 95% accuracy but 10-second latency is unusable.
6. Not Tracking Metrics Over Time
One-time evaluation:
## Evaluate once at launch
initial_metrics = evaluate(test_set)
print(f"Accuracy: {initial_metrics['accuracy']}")
## Never check again
Continuous monitoring:
import matplotlib.pyplot as plt
## Track weekly
weekly_metrics = [
{'week': 1, 'accuracy': 0.92, 'faithfulness': 0.88},
{'week': 2, 'accuracy': 0.91, 'faithfulness': 0.87},
{'week': 3, 'accuracy': 0.87, 'faithfulness': 0.82}, # Degrading!
]
plt.plot([m['week'] for m in weekly_metrics],
[m['accuracy'] for m in weekly_metrics])
plt.title('RAG System Quality Over Time')
plt.xlabel('Week')
plt.ylabel('Accuracy')
plt.show()
Why: Systems degrade over time due to data drift, model updates, or corpus changes.
Key Takeaways
Quick Reference Card: RAG Evaluation Essentials
Retrieval Metrics:
| Metric | What it measures |
|---|---|
| Precision@K | % of retrieved docs that are relevant |
| Recall@K | % of relevant docs that were retrieved |
| MRR | 1 / rank of first relevant document |
| NDCG | Accounts for graded relevance + position |
Generation Metrics:
| Metric | What it measures |
|---|---|
| Faithfulness | Are claims supported by retrieved docs? |
| Relevance | Does answer address the question? |
| Coherence | Is answer logically structured? |
End-to-End Metrics:
| Metric | What it measures |
|---|---|
| Answer Correctness | Semantic similarity to ground truth |
| Context Precision | Are retrieved docs useful for answer? |
| Context Recall | Did we retrieve all needed information? |
Production Monitoring:
| Metric | What it measures |
|---|---|
| Latency (p50, p95, p99) | Response time distribution |
| Error Rate | % of failed requests |
| User Feedback | Thumbs up/down, follow-up queries |
| Quality Drift | Metric degradation over time |
Remember:
- Evaluate at multiple levels: Retrieval, generation, and end-to-end
- Combine approaches: Automatic metrics + LLM-as-judge + human evaluation
- Track over time: Quality can degrade; monitor continuously
- Balance quality and performance: Fast wrong answers are still wrong
- Test edge cases: Real production queries are messy and diverse
- Use statistics: A/B test improvements with proper sample sizes
- Set up alerts: Catch quality drops before users complain
Further Study
- RAGAS Framework Documentation - https://docs.ragas.io/ - Comprehensive guide to RAG evaluation metrics with Python implementations
- Pinecone RAG Evaluation Guide - https://www.pinecone.io/learn/rag-evaluation/ - Practical tutorial on evaluating RAG systems in production
- LangChain Evaluation Module - https://python.langchain.com/docs/guides/evaluation - Tools and frameworks for evaluating LLM applications including RAG systems
Master RAG evaluation with these comprehensive metrics and techniques. Remember: you can't improve what you don't measure!