Data Pipeline & Indexing

Create robust ingestion pipelines with smart chunking, embedding generation, and incremental updates.

Data Pipeline & Indexing

Master the fundamentals of data pipeline and indexing systems with free flashcards and spaced repetition practice. This lesson covers data ingestion strategies, transformation techniques, vector embeddings, and indexing architectures—essential concepts for building modern AI search and retrieval-augmented generation (RAG) systems.

Welcome to Data Pipeline & Indexing 🚀

In the world of AI search and RAG systems, your data pipeline and indexing strategy can make or break your application's performance. Think of a data pipeline as a sophisticated assembly line where raw data enters one end and emerges as searchable, semantically-rich information on the other. The indexing component is like creating a hyper-efficient library catalog system—but instead of organizing books by title, you're organizing information by meaning, context, and relevance.

Whether you're building a chatbot that needs to answer questions from thousands of documents, a semantic search engine for research papers, or an AI assistant that pulls information from multiple data sources, understanding how to efficiently move, transform, and index data is crucial.

💡 Did you know? Companies like Google process over 3.5 billion searches per day. Their indexing systems need to handle petabytes of data and return relevant results in milliseconds—a feat made possible through sophisticated pipeline and indexing architectures!

Core Concepts: Understanding the Data Journey 📊

The Data Pipeline Architecture

A data pipeline is an automated workflow that moves data from source systems through various transformation stages to a destination where it can be indexed and queried. Let's break down the key stages:

┌─────────────────────────────────────────────────────────────┐
│              DATA PIPELINE FLOW                             │
└─────────────────────────────────────────────────────────────┘

  📥 INGESTION → 🔄 TRANSFORMATION → 🧮 EMBEDDING → 📇 INDEXING → 🔍 SEARCH
       │              │                  │              │            │
       ↓              ↓                  ↓              ↓            ↓
   Extract        Clean &            Convert to      Store in     Query &
   from          Normalize          Vectors        Vector DB    Retrieve
   Sources       Data                                            Results

1. Data Ingestion Strategies 📥

Data ingestion is the process of collecting raw data from various sources. The strategy you choose depends on your data volume, velocity, and variety requirements.

Batch Ingestion: Processing data in large chunks at scheduled intervals

✅ Best for: Historical data, non-time-sensitive updates, cost optimization
⚠️ Trade-off: Data staleness (freshness can be hours or days old)
🔧 Example use case: Nightly updates of product catalogs

Stream Ingestion: Processing data continuously as it arrives

✅ Best for: Real-time analytics, live chat support, financial trading
⚠️ Trade-off: Higher complexity and infrastructure costs
🔧 Example use case: Live customer support ticket ingestion

Micro-batch: Hybrid approach processing small batches frequently

✅ Best for: Near real-time needs with controlled resource usage
⚠️ Trade-off: Balancing latency vs. throughput
🔧 Example use case: Social media monitoring (process every 5 minutes)

Ingestion Type	Latency	Complexity	Cost	Best For
Batch	Hours-Days	Low	💰 Low	Archives, reports
Micro-batch	Minutes	Medium	💰💰 Medium	News feeds, logs
Stream	Seconds	High	💰💰💰 High	Real-time chat, IoT

2. Data Transformation & Cleaning 🔄

Raw data is rarely ready for indexing. Transformation involves converting, cleaning, and enriching data to make it suitable for semantic search.

Key transformation operations:

Text Normalization

Lowercasing: "Apple" → "apple"
Removing special characters: "Hello!!! World???" → "Hello World"
Whitespace normalization: "too many spaces" → "too many spaces"

Chunking: Breaking large documents into smaller, semantically meaningful pieces

🎯 Why? Embedding models have token limits (typically 512-8192 tokens)
📏 Strategy: Aim for 200-500 tokens per chunk with 10-20% overlap
🧠 Mnemonic: COPS - Context preserved, Overlap included, Paragraph boundaries, Size consistent

Metadata Enrichment

Adding timestamps, authors, source URLs, categories
Extracting entities (people, places, organizations)
Calculating statistics (word count, reading time)

Example transformation pipeline:

RAW DOCUMENT │ ↓ ┌─────────────────────────┐ │ Remove HTML tags │ "

Hello

" → "Hello" └──────────┬──────────────┘ ↓ ┌─────────────────────────┐ │ Normalize text │ "HELLO!!!" → "hello" └──────────┬──────────────┘ ↓ ┌─────────────────────────┐ │ Chunk into sections │ One doc → Multiple chunks └──────────┬──────────────┘ ↓ ┌─────────────────────────┐ │ Add metadata │ + timestamp, source, author └──────────┬──────────────┘ ↓ CLEAN CHUNKS

3. Vector Embeddings: The Heart of Semantic Search 🧮

Embeddings are dense numerical representations of text that capture semantic meaning. Instead of matching exact keywords, embeddings allow you to find conceptually similar content.

🌍 Real-world analogy: Think of embeddings like GPS coordinates. "Coffee shop" and "café" might have different spellings, but their embeddings (coordinates) would be very close together in semantic space—just like two coffee shops on the same street have similar GPS coordinates.

How embeddings work:

TEXT INPUT                    EMBEDDING (simplified 5D)
"machine learning"     →      [0.23, -0.45, 0.89, 0.12, -0.67]
"artificial intelligence" →   [0.25, -0.42, 0.91, 0.15, -0.65]
"cooking recipes"      →      [-0.78, 0.34, -0.12, 0.56, 0.23]

                        ↑ Similar concepts = Similar vectors
                          (measured by cosine similarity)

Popular embedding models in 2026:

Model	Dimensions	Max Tokens	Best Use Case
OpenAI text-embedding-3-large	3072	8191	General purpose, high quality
Cohere embed-v3	1024	512	Multilingual, fast
sentence-transformers	384-768	512	Open source, self-hosted
BGE-large	1024	512	Retrieval-specific, SOTA

Embedding generation strategies:

Document-level embeddings: Embed entire documents
- Pro: Captures overall theme
- Con: Loses granular details
Chunk-level embeddings: Embed each chunk separately (most common)
- Pro: Precise retrieval of relevant sections
- Con: More vectors to store and search
Hybrid embeddings: Combine document + chunk embeddings
- Pro: Best of both worlds
- Con: Increased complexity and storage

💡 Pro tip: Always use the same embedding model for indexing and querying. Mixing models is like trying to use a metric ruler to measure something marked in inches—the numbers won't align!

4. Indexing Architectures 📇

Once you have embeddings, you need to index them for efficient retrieval. An index is a data structure optimized for fast similarity search across millions or billions of vectors.

Vector Index Types:

Flat Index (Brute Force)

Compares query to every vector in database
✅ Perfect accuracy (100% recall)
❌ Slow for large datasets (O(n) complexity)
🎯 Use when: Dataset < 10,000 vectors

HNSW (Hierarchical Navigable Small World)

Graph-based index with multiple layers
✅ Fast queries, good recall (95-99%)
❌ Higher memory usage
🎯 Use when: Need speed with high accuracy

IVF (Inverted File Index)

Clusters vectors, searches nearest clusters first
✅ Memory efficient, scalable
❌ Moderate accuracy (80-95% recall)
🎯 Use when: Very large datasets (millions+ vectors)

Quantization (PQ/SQ)

Compresses vectors to reduce storage
✅ Massive space savings (8-32x compression)
❌ Some accuracy loss
🎯 Use when: Storage costs are critical

INDEX PERFORMANCE COMPARISON

                 SPEED          RECALL         MEMORY
Flat Index       ████           ██████████     ████████
HNSW             ██████████     █████████      ████████████
IVF              █████████      ███████        ██████
IVF+PQ           █████████      ██████         ███

                 └──────────────────────────────────┘
                  Slower ← → Faster

5. Metadata Filtering & Hybrid Search 🔍

Modern RAG systems combine vector search (semantic similarity) with metadata filtering (structured queries) for powerful hybrid retrieval.

Metadata filtering examples:

Date ranges: "documents from last 30 days"
Categories: "only from 'technical' section"
Authors: "written by specific team members"
Access control: "user has permission to view"

Hybrid search strategies:

Strategy	Approach	When to Use
Pre-filtering	Filter metadata first, then vector search	Strict constraints (dates, permissions)
Post-filtering	Vector search first, filter results after	Soft preferences (categories, tags)
Sparse-Dense Hybrid	Combine keyword (BM25) + vector scores	Need exact matches + semantic similarity

Score fusion techniques:

When combining multiple search methods, you need to merge their scores:

Reciprocal Rank Fusion (RRF)
- Formula: score = Σ(1 / (k + rank_i)) where k=60
- ✅ Simple, robust, no tuning needed
- Most popular in practice
Weighted Sum
- Formula: score = α × vector_score + (1-α) × keyword_score
- ⚠️ Requires normalization and tuning α
Relative Score Fusion
- Normalize each score by max score in its result set
- ✅ Handles different score ranges automatically

Practical Examples 🛠️

Example 1: Building a Document Ingestion Pipeline

Let's walk through ingesting PDF documents into a RAG system:

Scenario: You're building a legal document search system that needs to index 10,000 PDF contracts.

Step-by-step pipeline:

## 1. INGESTION: Extract text from PDF
from pypdf import PdfReader

def ingest_pdf(pdf_path):
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text()
    return text

## 2. TRANSFORMATION: Clean and chunk
def transform_text(text, chunk_size=500, overlap=50):
    # Remove extra whitespace
    text = " ".join(text.split())
    
    # Split into chunks with overlap
    chunks = []
    words = text.split()
    
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
    
    return chunks

## 3. EMBEDDING: Generate vectors
from openai import OpenAI

client = OpenAI()

def generate_embeddings(chunks):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=chunks
    )
    return [item.embedding for item in response.data]

## 4. INDEXING: Store in vector database
from qdrant_client import QdrantClient

client = QdrantClient(":memory:")

def index_documents(chunks, embeddings, metadata):
    points = []
    for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
        points.append({
            "id": i,
            "vector": embedding,
            "payload": {
                "text": chunk,
                "source": metadata["source"],
                "page": metadata["page"],
                "date": metadata["date"]
            }
        })
    
    client.upsert(collection_name="legal_docs", points=points)

Key decisions explained:

Chunk size 500 words: Balances context vs. precision for legal text
Overlap 50 words: Prevents important clauses from being split across chunks
Metadata tracking: Source, page number, date enable filtering by document origin

Example 2: Optimizing Index Performance

Scenario: Your vector search is too slow—queries take 2-3 seconds for a database of 1 million documents.

Diagnosis and solutions:

Problem	Solution	Expected Improvement
Using flat index	Switch to HNSW index	10-50x faster queries
No metadata pre-filtering	Add date/category filters before vector search	5-10x faster (smaller search space)
High dimensionality (3072D)	Use smaller embedding model (768D)	2-4x faster, 75% less storage
Returning too many results	Reduce top_k from 100 to 20	2-3x faster

Implementation with HNSW:

## Configure HNSW index parameters
from qdrant_client.models import VectorParams, Distance

client.create_collection(
    collection_name="optimized_docs",
    vectors_config=VectorParams(
        size=768,  # Embedding dimension
        distance=Distance.COSINE,  # Similarity metric
        hnsw_config={
            "m": 16,  # Number of connections per layer
            "ef_construct": 100,  # Construction time vs accuracy
        }
    )
)

## Query with pre-filtering
results = client.search(
    collection_name="optimized_docs",
    query_vector=query_embedding,
    query_filter={
        "must": [
            {"key": "date", "range": {"gte": "2024-01-01"}},
            {"key": "category", "match": {"value": "contracts"}}
        ]
    },
    limit=20,  # Reduced from 100
    search_params={"ef": 64}  # Query time accuracy
)

Parameter tuning guide:

m (connections): Higher = better recall, more memory (typical: 16-32)
ef_construct: Higher = better index quality, slower build (typical: 100-200)
ef (query time): Higher = better recall, slower queries (typical: 32-128)

Example 3: Implementing Hybrid Search

Scenario: Users complain that searching for exact product codes (like "SKU-1234") returns irrelevant semantic matches instead of the exact product.

Solution: Implement sparse-dense hybrid search combining keyword matching (BM25) with vector similarity.

from rank_bm25 import BM25Okapi
import numpy as np

def hybrid_search(query, documents, embeddings, alpha=0.5):
    # 1. SPARSE: BM25 keyword search
    tokenized_docs = [doc.split() for doc in documents]
    bm25 = BM25Okapi(tokenized_docs)
    bm25_scores = bm25.get_scores(query.split())
    
    # 2. DENSE: Vector similarity search
    query_embedding = generate_embedding(query)
    vector_scores = cosine_similarity(query_embedding, embeddings)
    
    # 3. NORMALIZE: Bring scores to same scale
    bm25_normalized = bm25_scores / max(bm25_scores)
    vector_normalized = (vector_scores + 1) / 2  # Cosine [-1,1] to [0,1]
    
    # 4. FUSION: Combine with weighted sum
    final_scores = alpha * vector_normalized + (1 - alpha) * bm25_normalized
    
    # 5. RANK: Sort by final score
    ranked_indices = np.argsort(final_scores)[::-1]
    return ranked_indices

When to tune α (alpha):

α = 0.7-0.9: Prioritize semantic similarity (research papers, knowledge bases)
α = 0.3-0.5: Balance semantic + exact matches (e-commerce, product catalogs)
α = 0.1-0.3: Prioritize exact keyword matches (legal, technical docs with IDs)

Results comparison:

Query	Vector Only	Keyword Only	Hybrid (α=0.4)
"SKU-1234"	❌ Similar product descriptions	✅ Exact SKU match	✅ Exact SKU match
"affordable laptop for students"	✅ Budget laptops with student reviews	❌ Any doc with all 4 words	✅ Budget laptops with student context
"waterproof hiking boots"	✅ Outdoor footwear	⚠️ Any waterproof item + any boot	✅ Waterproof hiking boots specifically

Example 4: Real-Time Pipeline Monitoring

Scenario: Your pipeline processes 10,000 documents per hour. You need to monitor for failures, bottlenecks, and data quality issues.

Key metrics to track:

PIPELINE HEALTH DASHBOARD
┌─────────────────────────────────────────────────┐
│  INGESTION                                      │
│  ████████████████████ 10,234 docs/hr  ✅ Healthy│
│  Failures: 12 (0.1%)                            │
├─────────────────────────────────────────────────┤
│  TRANSFORMATION                                 │
│  █████████████████████ Avg: 0.3s/doc  ⚠️ Slow   │
│  Queue depth: 1,245 (high)                      │
├─────────────────────────────────────────────────┤
│  EMBEDDING                                      │
│  ████████████████████ 9,891 vectors/hr  ✅      │
│  API latency: 45ms (good)                       │
├─────────────────────────────────────────────────┤
│  INDEXING                                       │
│  ████████████████████ 9,889 indexed/hr  ✅      │
│  Index size: 4.2GB, growth: +120MB/hr           │
└─────────────────────────────────────────────────┘

Monitoring code example:

import time
import logging
from prometheus_client import Counter, Histogram, Gauge

## Define metrics
docs_ingested = Counter('docs_ingested_total', 'Total documents ingested')
ingestion_errors = Counter('ingestion_errors_total', 'Ingestion failures')
chunk_processing_time = Histogram('chunk_processing_seconds', 'Time to process chunks')
queue_depth = Gauge('pipeline_queue_depth', 'Documents waiting in queue')

def monitored_pipeline(document):
    try:
        # Track queue depth
        queue_depth.set(get_queue_size())
        
        # Time the transformation step
        start = time.time()
        chunks = transform_text(document)
        chunk_processing_time.observe(time.time() - start)
        
        # Generate embeddings
        embeddings = generate_embeddings(chunks)
        
        # Index
        index_documents(chunks, embeddings, document.metadata)
        
        # Success metric
        docs_ingested.inc()
        
    except Exception as e:
        # Track failures
        ingestion_errors.inc()
        logging.error(f"Pipeline failed for doc {document.id}: {e}")
        # Send to dead letter queue for retry
        send_to_dlq(document)

Alert thresholds:

⚠️ Warning: Error rate > 1%
🚨 Critical: Error rate > 5%
⚠️ Warning: Processing time > 2x baseline
🚨 Critical: Queue depth > 10,000 documents

Common Mistakes & How to Avoid Them ⚠️

Mistake 1: Chunking Without Overlap

Problem: Important context gets split across chunks, reducing retrieval quality.

❌ Wrong approach:

Chunk 1: "...to meet quarterly goals. "
Chunk 2: "The sales team reported..."

The connection between "quarterly goals" and "sales team" is lost!

✅ Correct approach:

Chunk 1: "...to meet quarterly goals. The sales team reported..."
Chunk 2: "quarterly goals. The sales team reported strong performance..."

Overlap preserves context across boundaries.

Rule of thumb: Use 10-20% overlap (50-100 tokens for 500-token chunks).

Mistake 2: Ignoring Embedding Model Token Limits

Problem: Sending 10,000-token documents to a model with 512-token limit results in silent truncation—you lose 95% of your content!

⚠️ Warning signs:

Retrieval quality drops for longer documents
Only first section of documents appears relevant
Inconsistent results across document length

✅ Solution: Always chunk before embedding:

def safe_embed(text, max_tokens=512):
    # Estimate tokens (rough: 1 token ≈ 4 chars)
    estimated_tokens = len(text) // 4
    
    if estimated_tokens > max_tokens:
        raise ValueError(
            f"Text too long: {estimated_tokens} tokens. "
            f"Chunk it first to max {max_tokens} tokens!"
        )
    
    return generate_embedding(text)

Mistake 3: Mixing Embedding Models

Problem: Indexing with model A, querying with model B produces nonsense results.

🌍 Why it fails: It's like measuring your height in meters, then comparing to someone measured in feet—the numbers don't align!

Each embedding model creates a unique "semantic space" with its own coordinate system.

✅ Golden rule:

Store the embedding model name with your index
Always use the same model for queries
If you change models, re-index everything

Mistake 4: No Monitoring or Error Handling

Problem: Pipeline fails silently, and you discover hours later that no documents were indexed.

Real-world disaster scenario:

09:00 - Embedding API key expires
09:01 - Pipeline starts failing silently  
14:00 - Users report search returning no results
14:30 - Team discovers 5 hours of data lost
16:00 - Emergency re-processing begins

✅ Prevention checklist:

☑️ Log every stage (ingestion, transformation, embedding, indexing)
☑️ Set up alerts for error rate spikes
☑️ Implement retry logic with exponential backoff
☑️ Use dead letter queues for failed documents
☑️ Monitor pipeline latency and throughput
☑️ Test error scenarios regularly (API failures, malformed data)

Mistake 5: Over-Indexing vs. Under-Indexing

Over-indexing: Creating too many small chunks

Each chunk lacks sufficient context
Retrieval returns fragmented information
Higher storage and query costs

Under-indexing: Chunks too large

Irrelevant content mixed with relevant passages
Wastes LLM context window on noise
Poor retrieval precision

✅ Optimal chunk sizing guide:

Content Type	Recommended Size	Rationale
Technical docs	300-500 tokens	Code examples need context
Legal contracts	400-600 tokens	Clauses reference each other
News articles	200-400 tokens	Paragraphs are self-contained
Chat messages	100-200 tokens	Short, conversational
Research papers	500-800 tokens	Complex arguments need space

Key Takeaways 🎯

Let's consolidate everything you've learned about data pipeline and indexing:

✅ Data pipelines automate the flow from raw data to searchable indexes through ingestion → transformation → embedding → indexing stages

✅ Choose ingestion strategy based on latency needs: batch (hours), micro-batch (minutes), or stream (seconds)

✅ Chunk intelligently: 200-500 tokens with 10-20% overlap preserves context while respecting token limits

✅ Embeddings convert text to semantic vectors—use the same model consistently for indexing and querying

✅ Index types trade off speed vs. accuracy: Flat (perfect, slow), HNSW (fast, accurate), IVF (scalable, moderate accuracy)

✅ Hybrid search combines vector similarity with keyword matching for best results on diverse queries

✅ Monitor everything: Track throughput, latency, errors, and queue depth to catch issues early

✅ Metadata filtering dramatically speeds up queries by reducing the search space before vector comparison

✅ Avoid silent failures: Implement comprehensive error handling, logging, and alerting

🧠 Memory Devices

Remember pipeline stages with ITEMS:

Ingestion (collect data)
Transformation (clean & chunk)
Embedding (generate vectors)
Metadata (add context)
Storage (index in vector DB)

Remember index types with CHEF:

Cheap & slow: Flat index
High-speed graph: HNSW
Efficient clusters: IVF
Fast & compressed: Quantized

🔧 Try This: Hands-On Exercise

Challenge: Build a mini RAG pipeline for your own documents!

Collect: Grab 5-10 PDF or text files (research papers, reports, articles)
Process: Write code to extract text and chunk it (300-500 words)
Embed: Use OpenAI API or Sentence Transformers to generate embeddings
Index: Store in Qdrant (local) or Pinecone (cloud)
Query: Ask natural language questions and retrieve relevant chunks
Measure: Compare retrieval quality with vs. without overlap in chunks

Bonus: Add metadata filtering (date, source, topic) and test hybrid search!

📋 Quick Reference Card: Pipeline & Indexing Essentials

Ingestion Types	Batch (scheduled), Stream (real-time), Micro-batch (frequent small batches)
Optimal Chunk Size	200-500 tokens with 10-20% overlap
Popular Embeddings	OpenAI text-embedding-3-large (3072D), Cohere embed-v3 (1024D), BGE-large (1024D)
Index for Speed	HNSW (graph-based, 95-99% recall)
Index for Scale	IVF with quantization (clusters + compression)
Hybrid Search	Combine BM25 (keywords) + vector similarity with RRF or weighted fusion
HNSW Parameters	m=16-32 (connections), ef_construct=100-200 (build quality), ef=32-128 (query accuracy)
Monitor These	Throughput (docs/hour), error rate (%), latency (seconds/doc), queue depth
Golden Rules	1) Same embedding model always 2) Chunk before embed 3) Add overlap 4) Monitor failures

📚 Further Study

Ready to dive deeper? Check out these resources:

Pinecone Learning Center - https://www.pinecone.io/learn/ - Comprehensive guides on vector databases, embeddings, and RAG architectures with hands-on tutorials
LangChain Documentation: Text Splitters - https://python.langchain.com/docs/modules/data_connection/document_transformers/ - Deep dive into chunking strategies with code examples for various document types
HNSW Algorithm Explained - https://arxiv.org/abs/1603.09320 - Original research paper on Hierarchical Navigable Small World graphs (for those who want the mathematical foundation)

Congratulations! 🎉 You now understand the complete data pipeline and indexing architecture that powers modern AI search and RAG systems. You're ready to build production-grade retrieval systems that can handle millions of documents and return relevant results in milliseconds!

📝

Ready to practice?

This lesson has 15 questions to help you learn