RAG Architecture & Implementation
Build Retrieval-Augmented Generation systems that ground LLM outputs in retrieved facts, sharply reducing hallucinations.
Master Retrieval-Augmented Generation (RAG) architecture with free flashcards and hands-on implementation guidance. This lesson covers RAG system design, vector databases, embedding strategies, retrieval mechanisms, and prompt engineering: essential skills for building modern AI search applications that combine large language models with external knowledge bases.
Welcome
Retrieval-Augmented Generation represents a paradigm shift in how we build AI applications. Rather than relying solely on a language model's parametric knowledge (what it learned during training), RAG systems dynamically retrieve relevant information from external sources and incorporate it into the generation process. This approach dramatically reduces hallucinations, enables access to up-to-date information, and allows AI systems to work with proprietary or domain-specific data.
In this comprehensive lesson, you'll learn how to architect and implement production-grade RAG systems from the ground up. We'll explore the complete pipeline, from document ingestion and chunking strategies to vector search optimization and context-aware generation. Whether you're building a chatbot for customer support, an internal knowledge assistant, or a research tool, understanding RAG architecture is crucial for creating reliable, accurate AI applications in 2026.
Core Concepts
What is RAG?
Retrieval-Augmented Generation (RAG) is an architectural pattern that enhances large language models by retrieving relevant information from external knowledge sources before generating a response. Think of it as giving your AI a reference library it can consult before answering questions.
TRADITIONAL LLM vs RAG

  Traditional LLM          RAG System
  ---------------          ----------
  User Query               User Query
       |                        |
       v                        v
  LLM only                 Retrieval step
       |                        |
       v                        v
  Response                 Relevant docs
                                |
  Limited to                    v
  training data            LLM + context
                                |
                                v
                           Response
                           (up to date, grounded)
The RAG Pipeline Architecture
A complete RAG system consists of two main phases: indexing (offline) and retrieval-generation (online).
RAG ARCHITECTURE

INDEXING PHASE (offline)

  Documents -> Chunking -> Embedding -> Vector Store

        | (creates the searchable index)
        v

RETRIEVAL-GENERATION PHASE (online)

  User Query -> Query Embedding -> Vector Search
    -> Retrieved Chunks -> LLM (query + context) -> Response
Document Chunking Strategies
Chunking is the process of breaking documents into smaller, semantically meaningful pieces. This is critical because:
- Embedding models have token limits (typically 512-8192 tokens)
- Retrieval precision improves with focused chunks
- Context windows are limited in LLMs
| Strategy | Method | Best For | Considerations |
|---|---|---|---|
| Fixed-Size | Split every N tokens/characters | Uniform content, technical docs | May break semantic units |
| Sentence-Based | Split on sentence boundaries | Narrative text, articles | Preserves meaning but varies in size |
| Paragraph-Based | Split on paragraph breaks | Well-structured documents | Chunks may be too large |
| Semantic | Use embeddings to find natural breaks | Complex documents, mixed content | Computationally expensive |
| Recursive | Try multiple delimiters hierarchically | Code, structured data | Most flexible, widely used |
Pro Tip: Use overlapping chunks (50-200 token overlap) to prevent information loss at boundaries. If a key concept is split across chunks, the overlap ensures it appears complete in at least one chunk.
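The fixed-size strategy with overlap can be sketched in a few lines; character counts stand in for tokens here, and `chunk_text` is an illustrative helper, not a library function:

```python
def chunk_text(text: str, chunk_size: int = 20, overlap: int = 5) -> list[str]:
    """Split text into fixed-size chunks; each chunk repeats the last
    `overlap` characters of the previous one, so content at a boundary
    always appears whole in at least one chunk."""
    step = chunk_size - overlap  # advance less than chunk_size to overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("RAG systems retrieve relevant context before generating.", 20, 5)
# Adjacent chunks share a 5-character overlap:
assert chunks[0][-5:] == chunks[1][:5]
```

A production splitter (e.g. a recursive splitter) would additionally prefer paragraph and sentence boundaries over raw offsets.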
Embeddings and Vector Representations
Embeddings transform text into high-dimensional vectors (typically 384-1536 dimensions) where semantically similar text appears closer together in vector space.
EMBEDDING TRANSFORMATION

"cat"    -> [0.2, 0.8, 0.1, ...]  \
                                    close in vector space (similar meaning)
"kitten" -> [0.3, 0.7, 0.2, ...]  /

"dog"    -> [0.4, 0.6, 0.3, ...]    moderately close

"car"    -> [0.9, 0.1, 0.8, ...]    far apart (different meaning)
Popular embedding models in 2026:
| Model | Dimensions | Max Tokens | Best Use Case |
|---|---|---|---|
| text-embedding-3-small | 1536 | 8191 | Cost-effective, general purpose |
| text-embedding-3-large | 3072 | 8191 | Highest accuracy, complex queries |
| voyage-2 | 1024 | 4000 | Domain-specific retrieval |
| cohere-embed-v3 | 1024 | 512 | Multilingual support |
| BGE-large-en-v1.5 | 1024 | 512 | Open-source, self-hosted |
Vector Databases
Vector databases store embeddings and enable fast similarity search. Unlike traditional databases that search for exact matches, vector DBs find semantically similar content using distance metrics.
VECTOR SIMILARITY SEARCH

          Query vector
               *
           /   |   \
          o    o    o      <- top-k nearest neighbors
        o    o    o           (most relevant chunks)
      o    o    o

Distance metrics:
- Cosine similarity (most common)
- Euclidean distance
- Dot product
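These metrics are simple to implement directly. A toy sketch below reuses the illustrative "cat"/"kitten"/"car" vectors from the embedding diagram (real embeddings have hundreds of dimensions):

```python
import math

def dot(a, b):
    # Dot product: larger = more aligned (sensitive to vector magnitude)
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # 1.0 = identical direction; only direction matters, not magnitude
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    # Straight-line distance; smaller = more similar
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

cat, kitten, car = [0.2, 0.8, 0.1], [0.3, 0.7, 0.2], [0.9, 0.1, 0.8]
# "kitten" lands closer to "cat" than "car" does under both metrics
assert cosine_similarity(cat, kitten) > cosine_similarity(cat, car)
assert euclidean_distance(cat, kitten) < euclidean_distance(cat, car)
```

Vector databases implement the same comparisons, but over approximate nearest-neighbor indexes rather than brute-force loops.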
Leading vector database options:
| Database | Type | Strengths | Ideal For |
|---|---|---|---|
| Pinecone | Managed cloud | Fully managed, scalable, easy setup | Production apps, startups |
| Weaviate | Open-source/cloud | GraphQL API, hybrid search, modules | Complex data relationships |
| Qdrant | Open-source | Rust-based, fast, filtering support | Self-hosted, performance-critical |
| Chroma | Open-source | Simple API, Python-first, lightweight | Development, prototyping |
| pgvector | PostgreSQL extension | Integrates with existing PostgreSQL | Projects already using Postgres |
| Milvus | Open-source | Highly scalable, distributed | Large-scale enterprise apps |
Retrieval Strategies
Beyond basic vector similarity search, advanced RAG systems employ sophisticated retrieval techniques:
1. Hybrid Search
Combines dense retrieval (vector similarity) with sparse retrieval (keyword/BM25) for better precision:
HYBRID SEARCH

  Query: "How to optimize React?"
              |
        +-----+-----+
        |           |
   Vector      Keyword
   search      search (BM25)
        |           |
   Results A   Results B
        |           |
        +-----+-----+
              |
   Reciprocal Rank Fusion
              |
   Final ranked results
2. Contextual Compression
Retrieve large chunks but extract only relevant portions before sending to the LLM:
## Pseudo-code
initial_docs = vector_store.similarity_search(query, k=10)
compressed_docs = compressor.compress(initial_docs, query)
response = llm.generate(query, context=compressed_docs)
3. Multi-Query Retrieval
Generate multiple variations of the user's query to capture different phrasings:
- Original: "How do I improve my code?"
- Variant 1: "What are code optimization techniques?"
- Variant 2: "Best practices for clean code"
- Variant 3: "How to refactor legacy code"
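A minimal sketch of multi-query retrieval: run each variant through the retriever and merge the deduplicated results. The stub dictionary stands in for a real vector search, and `multi_query_retrieve` is an illustrative helper:

```python
def multi_query_retrieve(queries, retrieve, k=5):
    """Run retrieval for each query variant and merge the results,
    keeping first-seen order and dropping duplicates."""
    seen, merged = set(), []
    for q in queries:
        for doc_id in retrieve(q):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged[:k]

# Stub retriever: maps each query variant to ranked document ids
fake_results = {
    "How do I improve my code?": ["d1", "d2"],
    "What are code optimization techniques?": ["d2", "d3"],
    "Best practices for clean code": ["d3", "d4"],
}
docs = multi_query_retrieve(list(fake_results), fake_results.get)
assert docs == ["d1", "d2", "d3", "d4"]
```

In practice the variants themselves are generated by an LLM, and the merge step often uses rank fusion instead of simple first-seen ordering.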
4. Parent-Child Chunking
Store small chunks for retrieval but include surrounding context when found:
PARENT-CHILD STRATEGY

+--------------------------------------+
| Parent Document                      |  <- stored separately
+--------------------------------------+
| [Chunk 1]  [Chunk 2]  [Chunk 3]      |
|                ^                     |
|          retrieved chunk             |
+--------------------------------------+
                 |
                 v
  Return the entire parent document (or
  the surrounding N chunks) for full context
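One way to sketch the parent-child idea: index small child chunks for search, but hand the LLM the parent's full text. The `doc_a` records below are purely illustrative:

```python
# Parents are stored separately; only children are embedded and searched.
parents = {
    "doc_a": "Full parent document text containing chunks 1, 2 and 3 ...",
}
children = [
    {"id": "doc_a#1", "parent": "doc_a", "text": "chunk 1 text"},
    {"id": "doc_a#2", "parent": "doc_a", "text": "chunk 2 text"},
]

def expand_to_parent(retrieved_child_id):
    """Given the id of a retrieved child chunk, return its parent's full text."""
    child = next(c for c in children if c["id"] == retrieved_child_id)
    return parents[child["parent"]]

context = expand_to_parent("doc_a#2")
assert context.startswith("Full parent document")
```

The small chunks keep retrieval precise; the parent lookup restores the surrounding context that the LLM needs.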
Prompt Engineering for RAG
The prompt template determines how retrieved context is presented to the LLM. A well-designed prompt includes:
- System instructions (role, behavior guidelines)
- Retrieved context (formatted clearly)
- User query (the actual question)
- Output format (structured response expectations)
Example RAG prompt template:
You are a helpful assistant answering questions based on provided context.
RULES:
- Answer only using information from the context below
- If the answer isn't in the context, say "I don't have enough information"
- Cite specific parts of the context in your answer
- Be concise and accurate
CONTEXT:
{retrieved_chunks}
QUESTION:
{user_query}
ANSWER:
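A small helper can assemble this template; the `[1]`, `[2]` chunk-numbering scheme is one possible convention for enabling citations, not a fixed standard:

```python
RAG_TEMPLATE = """You are a helpful assistant answering questions based on provided context.

RULES:
- Answer only using information from the context below
- If the answer isn't in the context, say "I don't have enough information"
- Cite chunks by their number, e.g. [1]

CONTEXT:
{retrieved_chunks}

QUESTION:
{user_query}

ANSWER:"""

def build_prompt(chunks, query):
    # Number each retrieved chunk so the model can cite "[1]", "[2]", ...
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return RAG_TEMPLATE.format(retrieved_chunks=context, user_query=query)

prompt = build_prompt(["Remote work is allowed 3 days/week."],
                      "What is the remote work policy?")
assert "[1] Remote work" in prompt
```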
Pro Tip: Include attribution markers in your chunks (e.g., source document, page number) so the LLM can cite sources in its response.
Metadata Filtering
Metadata enhances retrieval by allowing pre-filtering before vector search:
| Metadata Type | Example | Use Case |
|---|---|---|
| Temporal | date, timestamp | "Show results from last 6 months" |
| Source | document_id, url, author | "Search only in technical docs" |
| Categorical | department, topic, language | "Filter by HR department" |
| User-specific | user_id, permissions | "Show only documents I can access" |
| Content-based | doc_type, file_format | "Search only PDF files" |
Filtered vector search flow:
Query: "2024 sales strategy"

Filters:
- year = 2024
- department = "Sales"
- type = "Strategy Doc"
        |
        v
Filter metadata FIRST
(reduces search space)
        |
        v
Vector similarity search
(on filtered subset)
        |
        v
Top-k relevant results
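The flow above can be sketched as a pre-filter followed by ranking. `filtered_search` and the toy records are illustrative; a real vector database pushes the filter down into the index rather than scanning in Python:

```python
def filtered_search(records, filters, query_vector, similarity, k=3):
    """Apply metadata equality filters first, then rank only the
    surviving records by vector similarity."""
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in filters.items())
    ]
    candidates.sort(key=lambda r: similarity(query_vector, r["vector"]), reverse=True)
    return candidates[:k]

records = [
    {"id": 1, "vector": [1.0, 0.0], "metadata": {"year": 2024, "department": "Sales"}},
    {"id": 2, "vector": [0.9, 0.1], "metadata": {"year": 2023, "department": "Sales"}},
    {"id": 3, "vector": [0.0, 1.0], "metadata": {"year": 2024, "department": "Sales"}},
]
dot = lambda a, b: sum(x * y for x, y in zip(a, b))
hits = filtered_search(records, {"year": 2024}, [1.0, 0.0], dot)
assert [r["id"] for r in hits] == [1, 3]  # record 2 was excluded before ranking
```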
Evaluation Metrics
Measuring RAG system performance requires evaluating both retrieval quality and generation quality:
Retrieval Metrics:
- Precision@k: What fraction of top-k results are relevant?
- Recall@k: What fraction of all relevant docs are in top-k?
- MRR (Mean Reciprocal Rank): Average of 1/rank of first relevant result
- NDCG (Normalized Discounted Cumulative Gain): Considers ranking quality
Generation Metrics:
- Faithfulness: Does the answer align with retrieved context?
- Answer Relevancy: Does it address the user's question?
- Context Precision: Are retrieved chunks actually relevant?
- Context Recall: Was all necessary information retrieved?
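The retrieval metrics above are straightforward to implement directly; here is a self-contained sketch with a worked toy example:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top-k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(retrievals, relevant_sets):
    """Mean of 1/rank of the first relevant result per query (0 if none)."""
    total = 0.0
    for retrieved, relevant in zip(retrievals, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1 / rank
                break
    return total / len(retrievals)

retrieved = ["d3", "d1", "d7"]   # system's ranked output for one query
relevant = {"d1", "d2"}          # ground-truth relevant documents
assert precision_at_k(retrieved, relevant, 3) == 1 / 3  # 1 of top-3 relevant
assert recall_at_k(retrieved, relevant, 3) == 1 / 2     # found 1 of 2 relevant
assert mrr([retrieved], [relevant]) == 1 / 2            # first hit at rank 2
```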
Try this: Use frameworks like RAGAS (RAG Assessment) or TruLens to automatically evaluate your system:
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
result = evaluate(
dataset=test_questions,
metrics=[faithfulness, answer_relevancy]
)
print(result.scores)
Real-World Implementation Examples
Example 1: Basic RAG System with LangChain
Let's build a minimal RAG system for a company knowledge base:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import DirectoryLoader
## Step 1: Load documents
loader = DirectoryLoader('./docs', glob="**/*.txt")
documents = loader.load()
## Step 2: Chunk documents
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separators=["\n\n", "\n", " ", ""]
)
chunks = text_splitter.split_documents(documents)
## Step 3: Create embeddings and store in vector DB
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db"
)
## Step 4: Create retrieval chain
llm = OpenAI(temperature=0)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff", # All context in one prompt
retriever=vector_store.as_retriever(
search_kwargs={"k": 4} # Retrieve top 4 chunks
),
return_source_documents=True
)
## Step 5: Query the system
query = "What is our remote work policy?"
result = qa_chain({"query": query})
print(f"Answer: {result['result']}")
print(f"\nSources: {result['source_documents']}")
What's happening:
- Document loading: Reads all `.txt` files from a directory
- Chunking: Splits into 1000-character chunks with 200-character overlap
- Embedding: Converts chunks to vectors using OpenAI's model
- Storage: Persists vectors in Chroma (local SQLite-based vector DB)
- Retrieval: Finds 4 most similar chunks to the query
- Generation: LLM generates answer using retrieved context
Example 2: Advanced RAG with Metadata Filtering
Adding temporal and source filtering for a news search system:
import chromadb
from sentence_transformers import SentenceTransformer
## Initialize a persistent Chroma client (current chromadb API) and a
## local embedding model; sentence-transformers is used here as an
## example embedder, but any embedding model works
client = chromadb.PersistentClient(path="./news_db")
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
collection = client.get_or_create_collection(
name="news_articles",
metadata={"description": "News articles with temporal metadata"}
)
## Add documents with rich metadata
articles = [
    {
        "text": "New AI regulations announced in EU...",
        "metadata": {
            "source": "TechNews",
            "date": "2024-03-15",
            "date_int": 20240315,  # numeric copy: Chroma range filters need numbers
            "category": "regulation",
            "author": "Jane Smith"
        }
    },
    # ... more articles
]
for idx, article in enumerate(articles):
embedding = embedding_model.encode(article["text"])
collection.add(
embeddings=[embedding.tolist()],
documents=[article["text"]],
metadatas=[article["metadata"]],
ids=[f"article_{idx}"]
)
## Query with metadata filtering
query = "AI regulations"
query_embedding = embedding_model.encode(query)
results = collection.query(
query_embeddings=[query_embedding.tolist()],
n_results=5,
    where={
        "$and": [
            {"date_int": {"$gte": 20240101}},             # on/after Jan 1, 2024
            {"category": {"$eq": "regulation"}},          # only regulation news
            {"source": {"$in": ["TechNews", "AIDaily"]}}  # trusted sources
        ]
    }
)
print(f"Found {len(results['documents'][0])} relevant articles")
for doc, meta in zip(results['documents'][0], results['metadatas'][0]):
print(f"\n{meta['date']} - {meta['source']}")
print(doc[:200] + "...")
Key improvements:
- Structured metadata: Each chunk has date, source, category, author
- Complex filtering: Combine multiple conditions with `$and` / `$or`
- Pre-filtering: Reduces search space before vector comparison
- Provenance tracking: Users see where information comes from
Example 3: Hybrid Search with Reciprocal Rank Fusion
Combining dense and sparse retrieval for optimal results:
import weaviate
from rank_bm25 import BM25Okapi
import numpy as np
class HybridRAG:
def __init__(self, documents):
self.documents = documents
# BM25 for keyword search
tokenized_docs = [doc.split() for doc in documents]
self.bm25 = BM25Okapi(tokenized_docs)
# Weaviate for vector search
self.client = weaviate.Client("http://localhost:8080")
self.setup_weaviate_schema()
self.index_documents()
def reciprocal_rank_fusion(self, rankings_list, k=60):
"""Combine multiple rankings using RRF"""
fused_scores = {}
for rankings in rankings_list:
for rank, doc_id in enumerate(rankings, start=1):
if doc_id not in fused_scores:
fused_scores[doc_id] = 0
fused_scores[doc_id] += 1 / (k + rank)
# Sort by fused score
sorted_docs = sorted(
fused_scores.items(),
key=lambda x: x[1],
reverse=True
)
return [doc_id for doc_id, score in sorted_docs]
def search(self, query, top_k=5):
# Sparse retrieval (BM25)
tokenized_query = query.split()
bm25_scores = self.bm25.get_scores(tokenized_query)
bm25_rankings = np.argsort(bm25_scores)[::-1][:20]
# Dense retrieval (Vector)
vector_results = self.client.query.get(
"Document",
["content", "doc_id"]
).with_near_text({
"concepts": [query]
}).with_limit(20).do()
vector_rankings = [
int(doc["doc_id"])
for doc in vector_results["data"]["Get"]["Document"]
]
# Fuse results
final_rankings = self.reciprocal_rank_fusion(
[bm25_rankings.tolist(), vector_rankings]
)
return [
self.documents[doc_id]
for doc_id in final_rankings[:top_k]
]
## Usage
rag = HybridRAG(my_documents)
results = rag.search("machine learning best practices")
Why this works better:
- BM25 catches exact term matches and rare keywords
- Vector search captures semantic similarity and synonyms
- RRF gives balanced weight to both approaches
- Result: Higher precision and recall than either method alone
Example 4: Production RAG with Monitoring
A production-ready system with observability:
## NOTE: illustrative sketch; the openai, pinecone, and trulens_eval
## calls below follow older client APIs and may differ in current releases
from datetime import datetime
from trulens_eval import Tru
from trulens_eval.feedback import Groundedness
from langchain.embeddings import OpenAIEmbeddings
import openai
import pinecone
class ProductionRAG:
def __init__(self):
# Initialize components
self.embeddings = OpenAIEmbeddings()
pinecone.init(
api_key="your-key",
environment="us-west1-gcp"
)
self.vector_index = pinecone.Index("prod-knowledge-base")
# Set up monitoring
self.tru = Tru()
self.groundedness = Groundedness(
groundedness_provider=openai
)
def retrieve(self, query, filters=None, k=5):
"""Retrieve with optional metadata filtering"""
query_vector = self.embeddings.embed_query(query)
results = self.vector_index.query(
vector=query_vector,
top_k=k,
include_metadata=True,
filter=filters
)
return [
{
"text": match["metadata"]["text"],
"source": match["metadata"]["source"],
"score": match["score"]
}
for match in results["matches"]
]
def generate(self, query, context_chunks):
"""Generate with monitoring"""
context = "\n\n".join([
f"[Source: {chunk['source']}]\n{chunk['text']}"
for chunk in context_chunks
])
prompt = f"""Answer based only on the context below.
Context:
{context}
Question: {query}
Answer:"""
response = openai.ChatCompletion.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=0.1
)
answer = response.choices[0].message.content
# Log for monitoring
self.log_interaction(query, context_chunks, answer)
return answer
def log_interaction(self, query, context, answer):
"""Log to monitoring system"""
# Check groundedness (does answer match context?)
groundedness_score = self.groundedness.score(
source=context,
statement=answer
)
# Log to TruLens for dashboard visualization
self.tru.log({
"query": query,
"context_count": len(context),
"answer_length": len(answer),
"groundedness": groundedness_score,
"timestamp": datetime.now().isoformat()
})
# Alert if quality metrics drop
if groundedness_score < 0.7:
self.send_alert(
f"Low groundedness detected: {groundedness_score}"
)
## Deployment
rag = ProductionRAG()
## Serve via API
from fastapi import FastAPI
app = FastAPI()
@app.post("/query")
async def query_rag(query: str, filters: dict = None):
chunks = rag.retrieve(query, filters)
answer = rag.generate(query, chunks)
return {
"answer": answer,
"sources": [c["source"] for c in chunks]
}
Production features:
- Managed vector database (Pinecone) for reliability
- Monitoring and observability (TruLens dashboard)
- Quality metrics (groundedness, relevance)
- Alerting for degraded performance
- API interface for integration
- Source attribution in responses
Common Mistakes
1. Chunks Too Large or Too Small
Mistake: Using 5000-token chunks that exceed context windows, or 50-token chunks that lack context.
Solution: Aim for 500-1000 tokens per chunk with 10-20% overlap. Test different sizes with your specific content.
2. Ignoring Metadata
Mistake: Storing only text without source, date, or category information.
Solution: Always include metadata for filtering, attribution, and debugging. Store at minimum: source, timestamp, document_id.
3. Not Evaluating Retrieval Quality
Mistake: Assuming vector search always returns relevant results.
Solution: Create a test set of query-document pairs. Measure precision@k and recall@k regularly. A/B test different retrieval strategies.
4. Single Retrieval Strategy
Mistake: Relying only on vector similarity without considering keywords.
Solution: Use hybrid search combining dense (vector) and sparse (BM25) retrieval, especially for domain-specific terminology.
5. Poor Prompt Engineering
Mistake: Passing raw retrieved chunks without clear instructions.
Solution: Structure prompts with:
- Clear role definition
- Explicit rules ("only use context," "cite sources")
- Formatted context sections
- Output format specifications
6. No Context Compression
Mistake: Sending all retrieved chunks verbatim, wasting context window.
Solution: Use contextual compression to extract only relevant sentences from each chunk, or implement re-ranking to prioritize the best chunks.
7. Ignoring Token Costs
Mistake: Retrieving 10,000 tokens of context for every query.
Solution: Balance retrieval depth with cost. Start with k=3-5 chunks. Use semantic caching for repeated queries.
8. Static Chunking Without Document Structure
Mistake: Splitting a structured document (with headers, tables) using fixed character counts.
Solution: Use document-aware chunking that respects structure. For code, split by functions. For articles, by sections.
Key Takeaways
RAG enhances LLMs by retrieving relevant external information before generation, reducing hallucinations and enabling access to current data.
The pipeline has two phases: offline indexing (chunk → embed → store) and online retrieval-generation (query → search → generate).
Chunking strategy matters: Balance semantic completeness with token limits. Use overlap to prevent information loss at boundaries.
Embeddings capture semantics: Similar concepts cluster in vector space. Choose embedding models based on your language, domain, and performance needs.
Vector databases enable fast similarity search: Pinecone, Weaviate, Qdrant, and others provide scalable infrastructure for production RAG.
Advanced retrieval improves results: Hybrid search, multi-query, parent-child chunking, and contextual compression all boost accuracy.
Metadata enables filtering: Add temporal, source, and categorical metadata to narrow search before vector comparison.
Evaluation is essential: Measure both retrieval quality (precision, recall) and generation quality (faithfulness, relevance).
Prompt engineering controls behavior: Structure prompts clearly with role, rules, context, and output format sections.
Production requires monitoring: Track groundedness, latency, token usage, and user satisfaction to maintain quality over time.
Quick Reference Card: RAG Architecture
| Component | Purpose | Key Choices |
|---|---|---|
| Chunking | Break docs into searchable pieces | 500-1000 tokens, 10-20% overlap |
| Embeddings | Convert text to vectors | text-embedding-3-small, voyage-2, BGE |
| Vector DB | Store and search vectors | Pinecone (managed), Qdrant (self-hosted) |
| Retrieval | Find relevant chunks | Hybrid search, k=3-5, metadata filters |
| Generation | Create answer from context | Structured prompt, temp=0-0.3 |
| Evaluation | Measure quality | Precision@k, faithfulness, RAGAS |
Common Pattern:
Documents → Chunk (1000 tok) → Embed (384-1536 dim) →
Store (Vector DB) → Query → Retrieve (top-5) →
Prompt (context + query) → LLM → Answer
Cost Optimization:
- Cache embeddings and responses
- Use smaller embedding models for dev
- Compress context before generation
- Implement semantic caching
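Semantic caching, the last item above, can be sketched as a nearest-cached-query lookup; the 0.95 threshold and the toy vectors below are arbitrary illustrative choices:

```python
import math

def cos(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def semantic_cache_lookup(cache, query_vector, similarity, threshold=0.95):
    """Return a cached answer if any cached query embedding is close
    enough; otherwise None, and the caller runs the full RAG pipeline."""
    best_answer, best_score = None, threshold
    for cached_vector, answer in cache:
        score = similarity(query_vector, cached_vector)
        if score >= best_score:
            best_answer, best_score = answer, score
    return best_answer

cache = [([1.0, 0.0], "cached answer about PTO policy")]
# Near-duplicate query: cache hit, no LLM call needed
assert semantic_cache_lookup(cache, [0.99, 0.01], cos) == "cached answer about PTO policy"
# Unrelated query: cache miss
assert semantic_cache_lookup(cache, [0.0, 1.0], cos) is None
```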
Further Study
LangChain RAG Documentation - https://python.langchain.com/docs/use_cases/question_answering/ - Comprehensive tutorials and patterns for building RAG systems with LangChain framework.
Pinecone Learning Center - https://www.pinecone.io/learn/vector-database/ - Deep dives into vector database concepts, similarity search algorithms, and production deployment strategies.
RAG Papers Collection - https://github.com/microsoft/RAG-Survey - Microsoft's curated list of academic papers covering RAG architectures, evaluation methods, and advanced techniques.
Remember: RAG is not a single technique but an architectural pattern. Start simple with basic retrieval, measure performance, then add complexity (hybrid search, re-ranking, compression) only where needed. The best RAG system is one that reliably answers your users' questions with verifiable, accurate information.