Classic RAG Pipeline
Implement the standard retrieve-augment-generate workflow with single-query retrieval and context injection.
Master the fundamentals of Retrieval-Augmented Generation with free flashcards and spaced repetition practice. This lesson covers document ingestion, vector embeddings, similarity search, and context-aware generation: essential concepts for building modern AI search systems that combine the power of retrieval with generative AI.
Welcome to Classic RAG
Retrieval-Augmented Generation (RAG) has revolutionized how we build AI applications that need to access external knowledge. Unlike standalone language models that rely solely on training data, RAG systems dynamically retrieve relevant information and use it to generate more accurate, up-to-date responses.
Think of RAG as giving your AI a reference library. Instead of memorizing everything (which would be impossible for constantly changing information), the AI learns to look up relevant documents first, then generates answers based on what it finds. This approach solves critical problems like hallucinations, outdated information, and lack of domain-specific knowledge.
In this lesson, we'll dissect the classic RAG pipeline step-by-step, understanding each component and how they work together to create intelligent, knowledge-grounded AI systems.
The Five Core Stages
The classic RAG pipeline consists of five interconnected stages:
CLASSIC RAG PIPELINE

Stage 1: Document Ingestion
        ↓
Stage 2: Chunking & Processing
        ↓
Stage 3: Embedding Generation
        ↓
Stage 4: Vector Storage & Indexing
        ↓  (User Query Arrives)
Stage 5: Retrieval & Generation
        ├── Query Embedding
        ├── Similarity Search
        ├── Context Retrieval
        └── LLM Generation
        ↓
Final Answer
Let's explore each stage in detail.
Stage 1: Document Ingestion
Document ingestion is the process of loading raw data into your RAG system. This stage handles diverse data formats and prepares them for downstream processing.
What Gets Ingested?
- Text documents: PDFs, Word files, plain text
- Web content: HTML pages, markdown files
- Structured data: JSON, CSV, database records
- Code repositories: Source files, documentation
- Multimedia metadata: Transcripts, captions, descriptions
Key Considerations
Tip: Always preserve metadata during ingestion (source URL, creation date, author, section headers). This metadata becomes crucial for filtering and citation later.
| Format | Parser Library | Key Challenge |
|---|---|---|
| PDF | PyPDF2, pdfplumber | Layout preservation |
| HTML | BeautifulSoup, Trafilatura | Extracting main content |
| Word | python-docx | Style/format handling |
| Markdown | mistune, markdown-it | Code block parsing |
Watch out: PDFs with scanned images require OCR (Optical Character Recognition) preprocessing. Without it, you'll extract nothing from image-based PDFs!
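To make the ingestion step concrete, here is a minimal sketch using pdfplumber; the file name and metadata fields are placeholders for whatever your corpus actually provides.

```python
# Minimal ingestion sketch; "support_manual.pdf" is a placeholder file name.
import pdfplumber

documents = []
with pdfplumber.open("support_manual.pdf") as pdf:
    for page_number, page in enumerate(pdf.pages, start=1):
        text = page.extract_text() or ""  # image-only pages return None (hence the OCR caveat above)
        documents.append({
            "text": text,
            "metadata": {"source": "support_manual.pdf", "page": page_number},
        })
```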
Stage 2: Chunking & Processing
Chunking divides long documents into smaller, semantically coherent pieces. This is critical because:
- Embedding models have token limits (typically 512-8192 tokens)
- Retrieval precision improves with focused chunks
- Generation context windows need manageable inputs
Chunking Strategies
| Strategy | How It Works | Best For |
|---|---|---|
| Fixed-size | Split every N characters/tokens | Simple, uniform content |
| Sentence-based | Split on sentence boundaries | Natural text flow |
| Paragraph-based | Split on paragraph breaks | Articles, essays |
| Semantic | Split when topic shifts | Long-form documents |
| Document structure | Split on headers, sections | Technical docs, manuals |
Chunk Overlap
Most effective chunking includes overlap between consecutive chunks:
Without Overlap:
[ Chunk 1 ][ Chunk 2 ][ Chunk 3 ]
Information at a boundary may be split across chunks.

With Overlap (Recommended):
[ Chunk 1        ]
            [ Chunk 2        ]
                        [ Chunk 3        ]
Context is preserved across chunk boundaries.
Tip: A typical configuration is 500-1000 token chunks with 100-200 token overlap (20-25% overlap ratio).
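A minimal sketch of fixed-size chunking with overlap, using word counts as a rough stand-in for tokens (production code would count tokens with the embedding model's tokenizer):

```python
# Rough chunker: splits on words instead of tokens for simplicity.
def chunk_text(text, chunk_size=750, overlap=150):
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

chunks = chunk_text(long_document_text)  # long_document_text is whatever Stage 1 produced
```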
Text Cleaning
Before chunking, apply preprocessing:
- Remove excessive whitespace, special characters
- Normalize unicode characters
- Handle code blocks specially (preserve indentation)
- Extract and preserve tables in structured format
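A minimal preprocessing sketch covering the first two bullets (whitespace cleanup and Unicode normalization); handling code blocks and tables depends on your document format and is left out here:

```python
import re
import unicodedata

def clean_text(text):
    text = unicodedata.normalize("NFKC", text)  # normalize Unicode variants
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces/tabs, keep newlines
    text = re.sub(r"\n{3,}", "\n\n", text)      # cap consecutive blank lines
    return text.strip()
```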
Stage 3: Embedding Generation
Embeddings are numerical vector representations of text that capture semantic meaning. Similar concepts have similar vectors, enabling mathematical similarity comparisons.
How Embeddings Work
An embedding model (like OpenAI's text-embedding-ada-002, Cohere's embeddings, or open-source models like sentence-transformers) transforms text into a high-dimensional vector:
Text Input: "How do I reset my password?"
        ↓
Embedding Model
        ↓
Vector: [0.023, -0.891, 0.445, ..., 0.112]
        (1536 dimensions, for example)
Why Embeddings Matter
Embeddings enable semantic search rather than keyword matching:
| Search Type | Query | Matches |
|---|---|---|
| Keyword | "python programming" | Only exact phrase |
| Semantic | "python programming" | "coding in python", "Python tutorials", "Snake scripting language" |
Popular Embedding Models (2026)
| Model | Dimensions | Max Tokens | Best For |
|---|---|---|---|
| OpenAI ada-002 | 1536 | 8191 | General purpose |
| Cohere embed-v3 | 1024 | 512 | Multilingual |
| BGE-large-en | 1024 | 512 | Open-source, high quality |
| E5-mistral-7b | 4096 | 32768 | Long context |
Tip: Use the same embedding model for both document chunks and user queries! Mixing models breaks semantic similarity.
Batch Processing
For efficiency, embed chunks in batches:
# Embed chunks in batches of 100 (sketch: sentence-transformers shown as one concrete choice)
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("BAAI/bge-large-en-v1.5")
batch_size = 100
for i in range(0, len(chunks), batch_size):
    batch = chunks[i:i + batch_size]
    embeddings = embedding_model.encode(batch)
    store_embeddings(embeddings)  # persist vectors to your store (see Stage 4)
Stage 4: Vector Storage & Indexing
Vector databases store embeddings and enable fast similarity search. Unlike traditional databases that query exact matches, vector databases find "nearby" vectors in high-dimensional space.
Vector Database Options
| Database | Type | Best For | Notable Feature |
|---|---|---|---|
| Pinecone | Managed | Production scale | Auto-scaling |
| Weaviate | Open-source | Flexible schemas | GraphQL API |
| Qdrant | Open-source | High performance | Rust-based speed |
| Chroma | Embedded | Development, prototyping | Zero config |
| FAISS | Library | Research, local use | Facebook AI |
Indexing Strategies
Vector databases use specialized index structures for fast search:
VECTOR INDEX TYPES

FLAT (Exact)
- Brute-force comparison against every vector
- 100% accurate
- Slow for large datasets (>100K vectors)

HNSW (Approximate)
- Hierarchical graph structure
- Fast queries (milliseconds)
- ~99% accuracy

IVF (Approximate)
- Clusters vectors into groups
- Searches only relevant clusters
- Good balance of speed and accuracy
Tip: HNSW (Hierarchical Navigable Small World) is the most popular index for RAG applications: it offers an excellent speed-accuracy tradeoff.
What Gets Stored
Each vector database entry typically contains:
- Vector embedding (the numerical representation)
- Original text chunk (for context retrieval)
- Metadata (source, page number, timestamp, etc.)
- Unique ID (for updating/deleting)
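As an illustration, storing one such entry with Chroma (the zero-config option from the table) might look like this; the collection name, vector, and metadata fields are invented for the example:

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("rag_chunks")

collection.add(
    ids=["chunk-001"],                                       # unique ID for updates/deletes
    embeddings=[[0.023, -0.891, 0.445]],                     # the vector (toy 3-dim example)
    documents=["Regular exercise improves heart health."],   # original chunk text
    metadatas=[{"source": "health_guide.pdf", "page": 12}],  # metadata for filtering/citations
)
```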
Stage 5: Retrieval & Generation
This is where the magic happens: combining retrieval with generation to produce accurate, grounded responses.
Step 5A: Query Embedding
When a user asks a question, embed it using the same model used for documents:
User Query: "What are the benefits of exercise?"
        ↓
Embedding Model (same as documents)
        ↓
Query Vector: [0.156, -0.723, 0.891, ..., 0.034]
Step 5B: Similarity Search
The vector database finds the top-k most similar document chunks using distance metrics:
| Metric | Formula | Range | Interpretation |
|---|---|---|---|
| Cosine | cos(θ) = A·B / (||A|| ||B||) | -1 to 1 | 1 = identical direction |
| Euclidean | √Σ(aᵢ - bᵢ)² | 0 to ∞ | 0 = identical points |
| Dot Product | Σ(aᵢ × bᵢ) | -∞ to ∞ | Higher = more similar |
Tip: Cosine similarity is most common for text embeddings because it measures angle (semantic similarity) rather than magnitude.
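The three metrics from the table, written out with NumPy for a pair of toy vectors:

```python
import numpy as np

a = np.array([0.023, -0.891, 0.445])
b = np.array([0.156, -0.723, 0.891])

cosine    = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based similarity
euclidean = np.linalg.norm(a - b)                                   # straight-line distance
dot       = np.dot(a, b)                                            # unnormalized similarity
```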
Step 5C: Context Construction
Retrieved chunks are assembled into a context prompt:
--- Retrieved Context ---
[Chunk 1] Regular exercise improves cardiovascular health...
[Chunk 2] Physical activity reduces stress and anxiety...
[Chunk 3] Exercise strengthens bones and muscles...
--- User Question ---
What are the benefits of exercise?
--- Instructions ---
Answer the question using ONLY the provided context.
If the context doesn't contain the answer, say so.
Step 5D: LLM Generation
The context + question is sent to a large language model (GPT-4, Claude, Llama, etc.) which generates a grounded response:
INPUT: Context + Question
        ↓
LLM (GPT-4, Claude, etc.)
        ↓
OUTPUT: Grounded Answer
"Exercise offers multiple benefits:
 1. Improves heart health
 2. Reduces stress
 3. Strengthens bones and muscles"
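A sketch of steps 5C and 5D together: build the prompt from the retrieved chunks and call a chat model. The OpenAI client and the model name here are one illustrative choice; any chat-capable LLM works the same way.

```python
from openai import OpenAI

def generate_answer(question, retrieved_chunks):
    context = "\n".join(f"[Chunk {i + 1}] {c}" for i, c in enumerate(retrieved_chunks))
    prompt = (
        f"--- Retrieved Context ---\n{context}\n\n"
        f"--- User Question ---\n{question}\n\n"
        "--- Instructions ---\n"
        "Answer the question using ONLY the provided context. "
        "If the context doesn't contain the answer, say so."
    )
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```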
Retrieval Parameters
Top-k: How many chunks to retrieve
- Too few (k=1-2): Might miss relevant information
- Too many (k>10): Noise and cost increase
- Sweet spot: k=3-5 for most applications
Similarity threshold: Minimum score to include
- Filters out irrelevant chunks
- Typical threshold: 0.7-0.8 for cosine similarity
Re-ranking: Optional second-stage scoring
- Use a cross-encoder model to re-score retrieved chunks
- More computationally expensive but more accurate
- Useful when initial retrieval is noisy
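These parameters translate into only a few lines of code. A hedged sketch, where search() stands in for whatever query call your vector database exposes:

```python
TOP_K = 5
MIN_SIMILARITY = 0.75  # typical cosine threshold from the range above

def retrieve(query_vector, search):
    hits = search(query_vector, top_k=TOP_K)  # expected shape: [(chunk_text, score), ...]
    relevant = [text for text, score in hits if score >= MIN_SIMILARITY]
    if not relevant:
        return None  # caller should reply "I don't have information on that"
    return relevant
```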
Example 1: Customer Support RAG
Let's walk through a complete RAG pipeline for a customer support system.
Setup
Documents: 500 support articles (FAQs, troubleshooting guides)
User Query: "My device won't connect to WiFi"
Pipeline Execution
| Stage | Action | Output |
|---|---|---|
| 1. Ingestion | Load all support articles | 500 documents |
| 2. Chunking | Split into 750-token chunks, 150 overlap | 1,200 chunks |
| 3. Embedding | Generate vectors with BGE-large-en | 1,200 vectors (1024-dim) |
| 4. Storage | Store in Qdrant with HNSW index | Indexed database |
| 5a. Query Embed | Embed user question | Query vector (1024-dim) |
| 5b. Search | Find top-5 chunks (cosine similarity) | 5 relevant chunks |
| 5c. Context | Assemble prompt with chunks | Context prompt |
| 5d. Generate | GPT-4 generates response | Step-by-step solution |
Retrieved Chunks (Top 3)
- Chunk #342 (similarity: 0.89): "WiFi connection issues: First, verify WiFi is enabled..."
- Chunk #127 (similarity: 0.85): "If device shows 'Cannot connect', check router settings..."
- Chunk #891 (similarity: 0.82): "Common WiFi problems include incorrect password..."
Generated Response
"To resolve WiFi connection issues: 1) Verify WiFi is enabled on your device, 2) Check if you're entering the correct password, 3) Restart your router if the issue persists..."
Why this works: The system retrieved exactly the right troubleshooting steps without the LLM needing to memorize every support article.
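For readers who want to see the whole walkthrough as code, here is a heavily simplified end-to-end sketch. It assumes a toy in-memory corpus, sentence-transformers for the BGE embedder, Chroma instead of Qdrant for brevity, and an illustrative OpenAI model name; word-based chunking stands in for real token counting.

```python
import chromadb
from sentence_transformers import SentenceTransformer
from openai import OpenAI

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")  # same model for docs and queries
collection = chromadb.Client().create_collection("support_articles")
llm = OpenAI()

def chunk(text, size=750, overlap=150):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size - overlap)]

# Stages 1-4: ingest, chunk, embed, store
articles = ["WiFi connection issues: first, verify WiFi is enabled on the device..."]  # toy corpus
chunks = [piece for doc in articles for piece in chunk(doc)]
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    documents=chunks,
    embeddings=embedder.encode(chunks).tolist(),
)

# Stage 5: embed the query, retrieve the most similar chunks, generate a grounded answer
query = "My device won't connect to WiFi"
hits = collection.query(
    query_embeddings=[embedder.encode(query).tolist()],
    n_results=min(5, len(chunks)),  # top-k, capped by the size of the toy corpus
)
context = "\n".join(hits["documents"][0])
reply = llm.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{
        "role": "user",
        "content": f"Answer using ONLY this context:\n{context}\n\nQuestion: {query}",
    }],
)
print(reply.choices[0].message.content)
```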
Example 2: Code Documentation RAG
RAG excels at helping developers navigate large codebases.
Setup
Documents: Python library documentation (1,000 pages)
User Query: "How do I configure request timeouts?"
Chunking Strategy
For code documentation, use semantic chunking based on:
- Function definitions
- Class boundaries
- Code examples as single units
Retrieved Context
# Chunk 1 (similarity: 0.91)
"""Configure timeouts using the timeout parameter:
import requests
response = requests.get('https://api.example.com',
                        timeout=5)  # 5-second timeout
"""
# Chunk 2 (similarity: 0.87)
"""For separate connect/read timeouts, use tuple:
timeout=(3.0, 10.0)  # 3s connect, 10s read
"""
Generated Response
"To configure request timeouts, pass the timeout parameter: requests.get(url, timeout=5) for a 5-second timeout. For granular control, use a tuple: timeout=(3.0, 10.0) where the first value is connection timeout and second is read timeout."
Important: Code-specific RAG often benefits from hybrid search: combining semantic similarity with keyword matching for function names and technical terms.
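One way to sketch such a hybrid: blend normalized BM25 keyword scores with cosine similarity. The rank_bm25 package and the 50/50 weighting here are illustrative choices, not a prescription.

```python
from rank_bm25 import BM25Okapi
import numpy as np

def hybrid_scores(query, query_vec, chunk_texts, chunk_vecs, alpha=0.5):
    # Keyword side: BM25 over whitespace-tokenized chunks
    bm25 = BM25Okapi([t.split() for t in chunk_texts])
    kw = np.array(bm25.get_scores(query.split()))
    kw = kw / (kw.max() or 1.0)  # normalize to [0, 1]

    # Semantic side: cosine similarity between the query vector and each chunk vector
    chunk_vecs = np.array(chunk_vecs)
    sem = chunk_vecs @ np.array(query_vec) / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return alpha * sem + (1 - alpha) * kw  # higher = better candidate
```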
Example 3: Research Paper RAG
Academic RAG systems help researchers navigate vast scientific literature.
Setup
Documents: 10,000 research papers (PDFs with abstracts, full text)
User Query: "What are recent advances in transformer efficiency?"
Special Considerations
- Metadata filtering: Only search papers from 2024-2026
- Citation preservation: Track which paper each chunk comes from
- Section-aware chunking: Keep abstract, methodology, results separate
Retrieval with Filters
# Pseudocode: `embed` and `vector_db` stand in for your embedding model and vector store.
query_vector = embed("recent advances in transformer efficiency")
results = vector_db.search(
    vector=query_vector,
    top_k=5,
    filter={
        "year": {"$gte": 2024},   # only papers from 2024 onward
        "section": "results"      # restrict the search to results sections
    }
)
Retrieved Papers
- "FlashAttention-3" (2025): "We reduce attention complexity to O(n)..."
- "Sparse Transformers" (2024): "By using local attention patterns..."
- "MoE-Transformers" (2025): "Mixture of experts reduces active parameters..."
Generated Summary
"Recent advances in transformer efficiency include: 1) FlashAttention-3 achieving linear complexity [Smith 2025], 2) Sparse attention patterns reducing computation [Jones 2024], 3) Mixture-of-Experts architectures activating only necessary parameters [Lee 2025]."
Advantage: RAG provides citations and recency that base models lack.
Example 4: Multi-Modal RAG
Modern RAG systems handle more than just text.
Setup
Documents: Product catalog with images, descriptions, specifications
User Query: "Show me red backpacks under $50"
Multi-Modal Components
Image Embeddings   (CLIP, other vision models)
Text Embeddings    (standard text embedding models)
Structured Data    (price filters, category tags)
        ↓
Combined Search
        ↓
Retrieved Products
Hybrid Retrieval
- Semantic search: "red backpacks" → find relevant products
- Metadata filter: price < $50
- Image similarity: If user uploads image, match visually similar products
Result
System returns 5 backpacks that:
- Match semantic description ("red", "backpack")
- Meet price constraint ($35-$48)
- Include product images and specs
The LLM then formats these into a natural response: "I found 5 red backpacks under $50: [Product list with descriptions]..."
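With Chroma, for instance, the combined semantic + structured query described above could look like the sketch below; the collection and field names are invented for the example, and the catalog is assumed to have been indexed with a price metadata field.

```python
import chromadb

client = chromadb.Client()
catalog = client.get_or_create_collection("products")  # assumed to already hold product entries

results = catalog.query(
    query_texts=["red backpack"],   # Chroma embeds the query with the collection's embedder
    n_results=5,
    where={"price": {"$lt": 50}},   # structured filter: price under $50
)
```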
Common Mistakes
1. Using Different Embedding Models
Wrong: Embed documents with model-A, query with model-B
Right: Use the same model for both
Why it fails: Different models create incompatible vector spaces. Similarity scores become meaningless.
2. Ignoring Chunk Size
Wrong: 5,000-token chunks (exceeds most model limits)
Right: 500-1000 token chunks with overlap
Why it fails: Large chunks dilute relevant information; small chunks lose context.
3. No Metadata or Citations
Wrong: Only store chunk text and vector
Right: Store source, page, timestamp, author, section
Why it fails: Users can't verify information or navigate to source documents.
4. Skipping Text Preprocessing
Wrong: Feed raw OCR output with noise directly to embeddings
Right: Clean, normalize, and structure text first
Why it fails: Garbage in, garbage out: poor-quality text produces poor embeddings.
5. Not Testing Retrieval Quality
Wrong: Assume top-k chunks are always relevant
Right: Measure retrieval metrics (recall, precision, MRR)
Why it fails: You won't know if your system retrieves the right information until you measure it.
6. Overloading Context Window
Wrong: Retrieve 20 chunks, paste all into prompt
Right: Retrieve 3-5 most relevant, possibly re-rank
Why it fails: Too much context confuses the LLM and increases cost/latency.
7. No Fallback for Poor Retrieval
Wrong: Always generate an answer, even with irrelevant chunks
Right: Check similarity scores; if too low, respond "I don't have information on that"
Why it fails: Generates hallucinated answers when retrieval fails.
Key Takeaways
Classic RAG Pipeline Quick Reference
| Stage | Key Action | Common Tool |
|---|---|---|
| 1. Ingestion | Load documents, preserve metadata | LangChain loaders |
| 2. Chunking | Split into 500-1000 tokens, 20% overlap | RecursiveCharacterTextSplitter |
| 3. Embedding | Convert chunks to vectors | OpenAI, Cohere, BGE |
| 4. Storage | Index vectors for similarity search | Pinecone, Qdrant, Weaviate |
| 5. Retrieval | Search (top-k=3-5) + generate with LLM | GPT-4, Claude |
Remember:
- Same embedding model for documents and queries
- Chunk overlap preserves context boundaries
- Cosine similarity for semantic search
- Store metadata for filtering and citations
- Measure retrieval quality, not just generation quality
RAG vs. Fine-Tuning
When should you use RAG instead of fine-tuning a model?
| Factor | RAG | Fine-Tuning |
|---|---|---|
| Data Updates | Easy (add new chunks) | Requires retraining |
| Cost | Lower (storage + API) | Higher (GPU training) |
| Latency | Slight overhead (retrieval) | Faster inference |
| Transparency | Shows source chunks | Black-box answers |
| Domain Adaptation | Excellent | Best for style/format |
| Fact Updates | Instant | Slow retraining cycle |
Best practice: Use RAG for knowledge-intensive tasks with changing information. Use fine-tuning for style, format, and reasoning patterns. Often, combining both yields optimal results!
Did You Know?
The term "Retrieval-Augmented Generation" was coined in a 2020 Meta AI paper, but the concept dates back to information retrieval + generation systems from the early 2000s. What changed? Modern embedding models and vector databases made semantic search practical at scale!
Interestingly, RAG systems can reduce hallucinations by 60-80% compared to pure generation, according to 2024 benchmarks. The key is that the LLM is constrained to ground its answers in retrieved context.
Further Study
- Original RAG Paper - Lewis et al., 2020: https://arxiv.org/abs/2005.11401
- LangChain RAG Documentation - Comprehensive implementation guide: https://python.langchain.com/docs/use_cases/question_answering/
- Vector Database Comparison - Benchmarks and feature comparison: https://benchmark.vectorview.ai/
Next Steps: Now that you understand the classic RAG pipeline, explore advanced techniques like hybrid search, re-ranking, query expansion, and multi-hop reasoning to build even more sophisticated RAG systems. Practice implementing each stage with real documents to solidify your understanding!