Classic RAG Pipeline
Implement the standard retrieve-augment-generate workflow with single-query retrieval and context injection.
Master the fundamentals of Retrieval-Augmented Generation with free flashcards and spaced repetition practice. This lesson covers document ingestion, vector embeddings, similarity search, and context-aware generation: essential concepts for building modern AI search systems that combine the power of retrieval with generative AI.
Welcome to Classic RAG
Retrieval-Augmented Generation (RAG) has revolutionized how we build AI applications that need to access external knowledge. Unlike standalone language models that rely solely on training data, RAG systems dynamically retrieve relevant information and use it to generate more accurate, up-to-date responses.
Think of RAG as giving your AI a reference library. Instead of memorizing everything (which would be impossible for constantly changing information), the AI learns to look up relevant documents first, then generates answers based on what it finds. This approach solves critical problems like hallucinations, outdated information, and lack of domain-specific knowledge.
In this lesson, we'll dissect the classic RAG pipeline step-by-step, understanding each component and how they work together to create intelligent, knowledge-grounded AI systems.
The Five Core Stages
The classic RAG pipeline consists of five interconnected stages:
CLASSIC RAG PIPELINE

Stage 1: Document Ingestion
        ↓
Stage 2: Chunking & Processing
        ↓
Stage 3: Embedding Generation
        ↓
Stage 4: Vector Storage & Indexing
        ↓  (User Query Arrives)
Stage 5: Retrieval & Generation
        ├── Query Embedding
        ├── Similarity Search
        ├── Context Retrieval
        └── LLM Generation
        ↓
Final Answer
Let's explore each stage in detail.
Stage 1: Document Ingestion
Document ingestion is the process of loading raw data into your RAG system. This stage handles diverse data formats and prepares them for downstream processing.
What Gets Ingested?
- Text documents: PDFs, Word files, plain text
- Web content: HTML pages, markdown files
- Structured data: JSON, CSV, database records
- Code repositories: Source files, documentation
- Multimedia metadata: Transcripts, captions, descriptions
Key Considerations
Tip: Always preserve metadata during ingestion (source URL, creation date, author, section headers). This metadata becomes crucial for filtering and citation later.
| Format | Parser Library | Key Challenge |
|---|---|---|
| PDF | PyPDF2, pdfplumber | Layout preservation |
| HTML | BeautifulSoup, Trafilatura | Extracting main content |
| Word | python-docx | Style/format handling |
| Markdown | mistune, markdown-it | Code block parsing |
Watch out: PDFs with scanned images require OCR (Optical Character Recognition) preprocessing. Without it, you'll extract nothing from image-based PDFs!
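To make the ingestion step concrete, here is a minimal sketch using pdfplumber; the file name and metadata fields are placeholders for whatever your corpus actually provides.

```python
# Minimal ingestion sketch; "support_manual.pdf" is a placeholder file name.
import pdfplumber

documents = []
with pdfplumber.open("support_manual.pdf") as pdf:
    for page_number, page in enumerate(pdf.pages, start=1):
        text = page.extract_text() or ""  # image-only pages return None (hence the OCR caveat above)
        documents.append({
            "text": text,
            "metadata": {"source": "support_manual.pdf", "page": page_number},
        })
```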
Stage 2: Chunking & Processing
Chunking divides long documents into smaller, semantically coherent pieces. This is critical because:
- Embedding models have token limits (typically 512-8192 tokens)
- Retrieval precision improves with focused chunks
- Generation context windows need manageable inputs
Chunking Strategies
| Strategy | How It Works | Best For |
|---|---|---|
| Fixed-size | Split every N characters/tokens | Simple, uniform content |
| Sentence-based | Split on sentence boundaries | Natural text flow |
| Paragraph-based | Split on paragraph breaks | Articles, essays |
| Semantic | Split when topic shifts | Long-form documents |
| Document structure | Split on headers, sections | Technical docs, manuals |
Chunk Overlap
Most effective chunking includes overlap between consecutive chunks:
Without Overlap:
[ Chunk 1 ][ Chunk 2 ][ Chunk 3 ]
Information at a boundary may be split across chunks.

With Overlap (Recommended):
[ Chunk 1        ]
            [ Chunk 2        ]
                        [ Chunk 3        ]
Context is preserved across chunk boundaries.
Tip: A typical configuration is 500-1000 token chunks with 100-200 token overlap (20-25% overlap ratio).
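A minimal sketch of fixed-size chunking with overlap, using word counts as a rough stand-in for tokens (production code would count tokens with the embedding model's tokenizer):

```python
# Rough chunker: splits on words instead of tokens for simplicity.
def chunk_text(text, chunk_size=750, overlap=150):
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

chunks = chunk_text(long_document_text)  # long_document_text is whatever Stage 1 produced
```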
Text Cleaning
Before chunking, apply preprocessing:
- Remove excessive whitespace, special characters
- Normalize unicode characters
- Handle code blocks specially (preserve indentation)
- Extract and preserve tables in structured format
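A minimal preprocessing sketch covering the first two bullets (whitespace cleanup and Unicode normalization); handling code blocks and tables depends on your document format and is left out here:

```python
import re
import unicodedata

def clean_text(text):
    text = unicodedata.normalize("NFKC", text)  # normalize Unicode variants
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces/tabs, keep newlines
    text = re.sub(r"\n{3,}", "\n\n", text)      # cap consecutive blank lines
    return text.strip()
```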
Stage 3: Embedding Generation
Embeddings are numerical vector representations of text that capture semantic meaning. Similar concepts have similar vectors, enabling mathematical similarity comparisons.
How Embeddings Work
An embedding model (like OpenAI's text-embedding-ada-002, Cohere's embeddings, or open-source models like sentence-transformers) transforms text into a high-dimensional vector:
Text Input: "How do I reset my password?"
        ↓
Embedding Model
        ↓
Vector: [0.023, -0.891, 0.445, ..., 0.112]
        (1536 dimensions, for example)
Why Embeddings Matter
Embeddings enable semantic search rather than keyword matching:
| Search Type | Query | Matches |
|---|---|---|
| Keyword | "python programming" | Only exact phrase |
| Semantic | "python programming" | "coding in python", "Python tutorials", "Snake scripting language" |
Popular Embedding Models (2026)
| Model | Dimensions | Max Tokens | Best For |
|---|---|---|---|
| OpenAI ada-002 | 1536 | 8191 | General purpose |
| Cohere embed-v3 | 1024 | 512 | Multilingual |
| BGE-large-en | 1024 | 512 | Open-source, high quality |
| E5-mistral-7b | 4096 | 32768 | Long context |
Tip: Use the same embedding model for both document chunks and user queries! Mixing models breaks semantic similarity.
Batch Processing
For efficiency, embed chunks in batches:
# Embed chunks in batches of 100 (sketch: sentence-transformers shown as one concrete choice)
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("BAAI/bge-large-en-v1.5")
batch_size = 100
for i in range(0, len(chunks), batch_size):
    batch = chunks[i:i + batch_size]
    embeddings = embedding_model.encode(batch)
    store_embeddings(embeddings)  # persist vectors to your store (see Stage 4)
Stage 4: Vector Storage & Indexing
Vector databases store embeddings and enable fast similarity search. Unlike traditional databases that query exact matches, vector databases find "nearby" vectors in high-dimensional space.
Vector Database Options
| Database | Type | Best For | Notable Feature |
|---|---|---|---|
| Pinecone | Managed | Production scale | Auto-scaling |
| Weaviate | Open-source | Flexible schemas | GraphQL API |
| Qdrant | Open-source | High performance | Rust-based speed |
| Chroma | Embedded | Development, prototyping | Zero config |
| FAISS | Library | Research, local use | Facebook AI |
Indexing Strategies
Vector databases use specialized index structures for fast search:
VECTOR INDEX TYPES

FLAT (Exact)
- Brute-force comparison against every vector
- 100% accurate
- Slow for large datasets (>100K vectors)

HNSW (Approximate)
- Hierarchical graph structure
- Fast queries (milliseconds)
- ~99% accuracy

IVF (Approximate)
- Clusters vectors into groups
- Searches only relevant clusters
- Good balance of speed and accuracy
Tip: HNSW (Hierarchical Navigable Small World) is the most popular index for RAG applications: it offers an excellent speed-accuracy tradeoff.
What Gets Stored
Each vector database entry typically contains:
- Vector embedding (the numerical representation)
- Original text chunk (for context retrieval)
- Metadata (source, page number, timestamp, etc.)
- Unique ID (for updating/deleting)
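As an illustration, storing one such entry with Chroma (the zero-config option from the table) might look like this; the collection name, vector, and metadata fields are invented for the example:

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("rag_chunks")

collection.add(
    ids=["chunk-001"],                                       # unique ID for updates/deletes
    embeddings=[[0.023, -0.891, 0.445]],                     # the vector (toy 3-dim example)
    documents=["Regular exercise improves heart health."],   # original chunk text
    metadatas=[{"source": "health_guide.pdf", "page": 12}],  # metadata for filtering/citations
)
```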
Stage 5: Retrieval & Generation
This is where the magic happens: combining retrieval with generation to produce accurate, grounded responses.
Step 5A: Query Embedding
When a user asks a question, embed it using the same model used for documents:
User Query: "What are the benefits of exercise?"
        ↓
Embedding Model (same as documents)
        ↓
Query Vector: [0.156, -0.723, 0.891, ..., 0.034]
Step 5B: Similarity Search
The vector database finds the top-k most similar document chunks using distance metrics:
| Metric | Formula | Range | Interpretation |
|---|---|---|---|
| Cosine | cos(θ) = A·B / (||A|| ||B||) | -1 to 1 | 1 = identical direction |
| Euclidean | √Σ(aᵢ - bᵢ)² | 0 to ∞ | 0 = identical points |
| Dot Product | Σ(aᵢ × bᵢ) | -∞ to ∞ | Higher = more similar |
Tip: Cosine similarity is most common for text embeddings because it measures angle (semantic similarity) rather than magnitude.
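The three metrics from the table, written out with NumPy for a pair of toy vectors:

```python
import numpy as np

a = np.array([0.023, -0.891, 0.445])
b = np.array([0.156, -0.723, 0.891])

cosine    = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based similarity
euclidean = np.linalg.norm(a - b)                                   # straight-line distance
dot       = np.dot(a, b)                                            # unnormalized similarity
```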
Step 5C: Context Construction
Retrieved chunks are assembled into a context prompt:
--- Retrieved Context ---
[Chunk 1] Regular exercise improves cardiovascular health...
[Chunk 2] Physical activity reduces stress and anxiety...
[Chunk 3] Exercise strengthens bones and muscles...
--- User Question ---
What are the benefits of exercise?
--- Instructions ---
Answer the question using ONLY the provided context.
If the context doesn't contain the answer, say so.
Step 5D: LLM Generation
The context + question is sent to a large language model (GPT-4, Claude, Llama, etc.) which generates a grounded response:
INPUT: Context + Question
        ↓
LLM (GPT-4, Claude, etc.)
        ↓
OUTPUT: Grounded Answer
"Exercise offers multiple benefits:
 1. Improves heart health
 2. Reduces stress
 3. Strengthens bones and muscles"
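A sketch of steps 5C and 5D together: build the prompt from the retrieved chunks and call a chat model. The OpenAI client and the model name here are one illustrative choice; any chat-capable LLM works the same way.

```python
from openai import OpenAI

def generate_answer(question, retrieved_chunks):
    context = "\n".join(f"[Chunk {i + 1}] {c}" for i, c in enumerate(retrieved_chunks))
    prompt = (
        f"--- Retrieved Context ---\n{context}\n\n"
        f"--- User Question ---\n{question}\n\n"
        "--- Instructions ---\n"
        "Answer the question using ONLY the provided context. "
        "If the context doesn't contain the answer, say so."
    )
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```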
Retrieval Parameters
Top-k: How many chunks to retrieve
- Too few (k=1-2): Might miss relevant information
- Too many (k>10): Noise and cost increase
- Sweet spot: k=3-5 for most applications
Similarity threshold: Minimum score to include
- Filters out irrelevant chunks
- Typical threshold: 0.7-0.8 for cosine similarity
Re-ranking: Optional second-stage scoring
- Use a cross-encoder model to re-score retrieved chunks
- More computationally expensive but more accurate
- Useful when initial retrieval is noisy
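These parameters translate into only a few lines of code. A hedged sketch, where search() stands in for whatever query call your vector database exposes:

```python
TOP_K = 5
MIN_SIMILARITY = 0.75  # typical cosine threshold from the range above

def retrieve(query_vector, search):
    hits = search(query_vector, top_k=TOP_K)  # expected shape: [(chunk_text, score), ...]
    relevant = [text for text, score in hits if score >= MIN_SIMILARITY]
    if not relevant:
        return None  # caller should reply "I don't have information on that"
    return relevant
```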
Example 1: Customer Support RAG
Let's walk through a complete RAG pipeline for a customer support system.
Setup
Documents: 500 support articles (FAQs, troubleshooting guides)
User Query: "My device won't connect to WiFi"
Pipeline Execution
| Stage | Action | Output |
|---|---|---|
| 1. Ingestion | Load all support articles | 500 documents |
| 2. Chunking | Split into 750-token chunks, 150 overlap | 1,200 chunks |
| 3. Embedding | Generate vectors with BGE-large-en | 1,200 vectors (1024-dim) |
| 4. Storage | Store in Qdrant with HNSW index | Indexed database |
| 5a. Query Embed | Embed user question | Query vector (1024-dim) |
| 5b. Search | Find top-5 chunks (cosine similarity) | 5 relevant chunks |
| 5c. Context | Assemble prompt with chunks | Context prompt |
| 5d. Generate | GPT-4 generates response | Step-by-step solution |
Retrieved Chunks (Top 3)
- Chunk #342 (similarity: 0.89): "WiFi connection issues: First, verify WiFi is enabled..."
- Chunk #127 (similarity: 0.85): "If device shows 'Cannot connect', check router settings..."
- Chunk #891 (similarity: 0.82): "Common WiFi problems include incorrect password..."
Generated Response
"To resolve WiFi connection issues: 1) Verify WiFi is enabled on your device, 2) Check if you're entering the correct password, 3) Restart your router if the issue persists..."
Why this works: The system retrieved exactly the right troubleshooting steps without the LLM needing to memorize every support article.
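For readers who want to see the whole walkthrough as code, here is a heavily simplified end-to-end sketch. It assumes a toy in-memory corpus, sentence-transformers for the BGE embedder, Chroma instead of Qdrant for brevity, and an illustrative OpenAI model name; word-based chunking stands in for real token counting.

```python
import chromadb
from sentence_transformers import SentenceTransformer
from openai import OpenAI

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")  # same model for docs and queries
collection = chromadb.Client().create_collection("support_articles")
llm = OpenAI()

def chunk(text, size=750, overlap=150):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size - overlap)]

# Stages 1-4: ingest, chunk, embed, store
articles = ["WiFi connection issues: first, verify WiFi is enabled on the device..."]  # toy corpus
chunks = [piece for doc in articles for piece in chunk(doc)]
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    documents=chunks,
    embeddings=embedder.encode(chunks).tolist(),
)

# Stage 5: embed the query, retrieve the most similar chunks, generate a grounded answer
query = "My device won't connect to WiFi"
hits = collection.query(
    query_embeddings=[embedder.encode(query).tolist()],
    n_results=min(5, len(chunks)),  # top-k, capped by the size of the toy corpus
)
context = "\n".join(hits["documents"][0])
reply = llm.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{
        "role": "user",
        "content": f"Answer using ONLY this context:\n{context}\n\nQuestion: {query}",
    }],
)
print(reply.choices[0].message.content)
```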
Example 2: Code Documentation RAG
RAG excels at helping developers navigate large codebases.
Setup
Documents: Python library documentation (1,000 pages)
User Query: "How do I configure request timeouts?"
Chunking Strategy
For code documentation, use semantic chunking based on:
- Function definitions
- Class boundaries
- Code examples as single units
Retrieved Context
# Chunk 1 (similarity: 0.91)
"""Configure timeouts using the timeout parameter:
import requests
response = requests.get('https://api.example.com',
                        timeout=5)  # 5-second timeout
"""
# Chunk 2 (similarity: 0.87)
"""For separate connect/read timeouts, use tuple:
timeout=(3.0, 10.0)  # 3s connect, 10s read
"""
Generated Response
"To configure request timeouts, pass the timeout parameter: requests.get(url, timeout=5) for a 5-second timeout. For granular control, use a tuple: timeout=(3.0, 10.0) where the first value is connection timeout and second is read timeout."
Important: Code-specific RAG often benefits from hybrid search: combining semantic similarity with keyword matching for function names and technical terms.
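One way to sketch such a hybrid: blend normalized BM25 keyword scores with cosine similarity. The rank_bm25 package and the 50/50 weighting here are illustrative choices, not a prescription.

```python
from rank_bm25 import BM25Okapi
import numpy as np

def hybrid_scores(query, query_vec, chunk_texts, chunk_vecs, alpha=0.5):
    # Keyword side: BM25 over whitespace-tokenized chunks
    bm25 = BM25Okapi([t.split() for t in chunk_texts])
    kw = np.array(bm25.get_scores(query.split()))
    kw = kw / (kw.max() or 1.0)  # normalize to [0, 1]

    # Semantic side: cosine similarity between the query vector and each chunk vector
    chunk_vecs = np.array(chunk_vecs)
    sem = chunk_vecs @ np.array(query_vec) / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return alpha * sem + (1 - alpha) * kw  # higher = better candidate
```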
Example 3: Research Paper RAG
Academic RAG systems help researchers navigate vast scientific literature.
Setup
Documents: 10,000 research papers (PDFs with abstracts, full text)
User Query: "What are recent advances in transformer efficiency?"
Special Considerations
- Metadata filtering: Only search papers from 2024-2026
- Citation preservation: Track which paper each chunk comes from
- Section-aware chunking: Keep abstract, methodology, results separate
Retrieval with Filters
# Pseudocode: `embed` and `vector_db` stand in for your embedding model and vector store.
query_vector = embed("recent advances in transformer efficiency")
results = vector_db.search(
    vector=query_vector,
    top_k=5,
    filter={
        "year": {"$gte": 2024},   # only papers from 2024 onward
        "section": "results"      # restrict the search to results sections
    }
)
Retrieved Papers
- "FlashAttention-3" (2025): "We reduce attention complexity to O(n)..."
- "Sparse Transformers" (2024): "By using local attention patterns..."
- "MoE-Transformers" (2025): "Mixture of experts reduces active parameters..."
Generated Summary
"Recent advances in transformer efficiency include: 1) FlashAttention-3 achieving linear complexity [Smith 2025], 2) Sparse attention patterns reducing computation [Jones 2024], 3) Mixture-of-Experts architectures activating only necessary parameters [Lee 2025]."
Advantage: RAG provides citations and recency that base models lack.
Example 4: Multi-Modal RAG
Modern RAG systems handle more than just text.
Setup
Documents: Product catalog with images, descriptions, specifications
User Query: "Show me red backpacks under $50"
Multi-Modal Components
Image Embeddings   (CLIP, other vision models)
Text Embeddings    (standard text embedding models)
Structured Data    (price filters, category tags)
        ↓
Combined Search
        ↓
Retrieved Products
Hybrid Retrieval
- Semantic search: "red backpacks" → find relevant products
- Metadata filter: price < $50
- Image similarity: If user uploads image, match visually similar products
Result
System returns 5 backpacks that:
- Match semantic description ("red", "backpack")
- Meet price constraint ($35-$48)
- Include product images and specs
The LLM then formats these into a natural response: "I found 5 red backpacks under $50: [Product list with descriptions]..."
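With Chroma, for instance, the combined semantic + structured query described above could look like the sketch below; the collection and field names are invented for the example, and the catalog is assumed to have been indexed with a price metadata field.

```python
import chromadb

client = chromadb.Client()
catalog = client.get_or_create_collection("products")  # assumed to already hold product entries

results = catalog.query(
    query_texts=["red backpack"],   # Chroma embeds the query with the collection's embedder
    n_results=5,
    where={"price": {"$lt": 50}},   # structured filter: price under $50
)
```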
Common Mistakes
1. Using Different Embedding Models
Wrong: Embed documents with model-A, query with model-B
Right: Use the same model for both
Why it fails: Different models create incompatible vector spaces. Similarity scores become meaningless.
2. Ignoring Chunk Size
Wrong: 5,000-token chunks (exceeds most model limits)
Right: 500-1000 token chunks with overlap
Why it fails: Large chunks dilute relevant information; small chunks lose context.
3. No Metadata or Citations
Wrong: Only store chunk text and vector
Right: Store source, page, timestamp, author, section
Why it fails: Users can't verify information or navigate to source documents.
4. Skipping Text Preprocessing
Wrong: Feed raw OCR output with noise directly to embeddings
Right: Clean, normalize, and structure text first
Why it fails: Garbage in, garbage out: poor-quality text produces poor embeddings.
5. Not Testing Retrieval Quality
Wrong: Assume top-k chunks are always relevant
Right: Measure retrieval metrics (recall, precision, MRR)
Why it fails: You won't know if your system retrieves the right information until you measure it.
6. Overloading Context Window
Wrong: Retrieve 20 chunks, paste all into prompt
Right: Retrieve 3-5 most relevant, possibly re-rank
Why it fails: Too much context confuses the LLM and increases cost/latency.
7. No Fallback for Poor Retrieval
Wrong: Always generate an answer, even with irrelevant chunks
Right: Check similarity scores; if too low, respond "I don't have information on that"
Why it fails: Generates hallucinated answers when retrieval fails.
Key Takeaways
Classic RAG Pipeline Quick Reference
| Stage | Key Action | Common Tool |
|---|---|---|
| 1. Ingestion | Load documents, preserve metadata | LangChain loaders |
| 2. Chunking | Split into 500-1000 tokens, 20% overlap | RecursiveCharacterTextSplitter |
| 3. Embedding | Convert chunks to vectors | OpenAI, Cohere, BGE |
| 4. Storage | Index vectors for similarity search | Pinecone, Qdrant, Weaviate |
| 5. Retrieval | Search (top-k=3-5) + generate with LLM | GPT-4, Claude |
Remember:
- Same embedding model for documents and queries
- Chunk overlap preserves context boundaries
- Cosine similarity for semantic search
- Store metadata for filtering and citations
- Measure retrieval quality, not just generation quality
RAG vs. Fine-Tuning
When should you use RAG instead of fine-tuning a model?
| Factor | RAG | Fine-Tuning |
|---|---|---|
| Data Updates | Easy (add new chunks) | Requires retraining |
| Cost | Lower (storage + API) | Higher (GPU training) |
| Latency | Slight overhead (retrieval) | Faster inference |
| Transparency | Shows source chunks | Black-box answers |
| Domain Adaptation | Excellent | Best for style/format |
| Fact Updates | Instant | Slow retraining cycle |
Best practice: Use RAG for knowledge-intensive tasks with changing information. Use fine-tuning for style, format, and reasoning patterns. Often, combining both yields optimal results!
Did You Know?
The term "Retrieval-Augmented Generation" was coined in a 2020 Meta AI paper, but the concept dates back to information retrieval + generation systems from the early 2000s. What changed? Modern embedding models and vector databases made semantic search practical at scale!
Interestingly, RAG systems can reduce hallucinations by 60-80% compared to pure generation, according to 2024 benchmarks. The key is that the LLM is constrained to ground its answers in retrieved context.
Further Study
- Original RAG Paper - Lewis et al., 2020: https://arxiv.org/abs/2005.11401
- LangChain RAG Documentation - Comprehensive implementation guide: https://python.langchain.com/docs/use_cases/question_answering/
- Vector Database Comparison - Benchmarks and feature comparison: https://benchmark.vectorview.ai/
Next Steps: Now that you understand the classic RAG pipeline, explore advanced techniques like hybrid search, re-ranking, query expansion, and multi-hop reasoning to build even more sophisticated RAG systems. Practice implementing each stage with real documents to solidify your understanding!