Data Pipeline & Indexing

Create robust ingestion pipelines with smart chunking, embedding generation, and incremental updates.

Master the fundamentals of data pipeline and indexing systems with free flashcards and spaced repetition practice. This lesson covers data ingestion strategies, transformation techniques, vector embeddings, and indexing architectures: essential concepts for building modern AI search and retrieval-augmented generation (RAG) systems.

Welcome to Data Pipeline & Indexing 🚀

In the world of AI search and RAG systems, your data pipeline and indexing strategy can make or break your application's performance. Think of a data pipeline as a sophisticated assembly line where raw data enters one end and emerges as searchable, semantically rich information on the other. The indexing component is like creating a hyper-efficient library catalog system, but instead of organizing books by title, you're organizing information by meaning, context, and relevance.

Whether you're building a chatbot that needs to answer questions from thousands of documents, a semantic search engine for research papers, or an AI assistant that pulls information from multiple data sources, understanding how to efficiently move, transform, and index data is crucial.

💡 Did you know? Companies like Google process over 3.5 billion searches per day. Their indexing systems need to handle petabytes of data and return relevant results in milliseconds, a feat made possible through sophisticated pipeline and indexing architectures!

Core Concepts: Understanding the Data Journey 📊

The Data Pipeline Architecture

A data pipeline is an automated workflow that moves data from source systems through various transformation stages to a destination where it can be indexed and queried. Let's break down the key stages:

┌─────────────────────────────────────────────────────────────┐
│                     DATA PIPELINE FLOW                       │
└─────────────────────────────────────────────────────────────┘

  📥 INGESTION → 🔄 TRANSFORMATION → 🧮 EMBEDDING → 📇 INDEXING → 🔍 SEARCH
       │              │                  │              │            │
       ↓              ↓                  ↓              ↓            ↓
   Extract        Clean &            Convert to      Store in     Query &
   from           Normalize          Vectors         Vector DB    Retrieve
   Sources        Data                                             Results

1. Data Ingestion Strategies 📥

Data ingestion is the process of collecting raw data from various sources. The strategy you choose depends on your data volume, velocity, and variety requirements.

Batch Ingestion: Processing data in large chunks at scheduled intervals

  • ✅ Best for: Historical data, non-time-sensitive updates, cost optimization
  • ⚠️ Trade-off: Data staleness (content can be hours or days old)
  • 🔧 Example use case: Nightly updates of product catalogs

Stream Ingestion: Processing data continuously as it arrives

  • ✅ Best for: Real-time analytics, live chat support, financial trading
  • ⚠️ Trade-off: Higher complexity and infrastructure costs
  • 🔧 Example use case: Live customer support ticket ingestion

Micro-batch: Hybrid approach processing small batches frequently

  • ✅ Best for: Near real-time needs with controlled resource usage
  • ⚠️ Trade-off: Balancing latency vs. throughput
  • 🔧 Example use case: Social media monitoring (process every 5 minutes)

Ingestion Type    Latency       Complexity   Cost            Best For
Batch             Hours-Days    Low          💰 Low          Archives, reports
Micro-batch       Minutes       Medium       💰💰 Medium      News feeds, logs
Stream            Seconds       High         💰💰💰 High       Real-time chat, IoT
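
To make the micro-batch pattern concrete, here is a minimal polling-loop sketch; fetch_new_documents and process_batch are hypothetical placeholders for your own source reader and downstream processing.

import time

def run_micro_batch(fetch_new_documents, process_batch, interval_seconds=300):
    # Micro-batch loop: poll the source on a fixed interval and process whatever arrived
    while True:
        batch = fetch_new_documents()      # hypothetical: read new records from the source
        if batch:
            process_batch(batch)           # hypothetical: transform -> embed -> index
        time.sleep(interval_seconds)       # 300s = "process every 5 minutes"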

2. Data Transformation & Cleaning 🔄

Raw data is rarely ready for indexing. Transformation involves converting, cleaning, and enriching data to make it suitable for semantic search.

Key transformation operations:

Text Normalization

  • Lowercasing: "Apple" → "apple"
  • Removing special characters: "Hello!!! World???" → "Hello World"
  • Whitespace normalization: "too   many   spaces" → "too many spaces"

Chunking: Breaking large documents into smaller, semantically meaningful pieces

  • 🎯 Why? Embedding models have token limits (typically 512-8192 tokens)
  • 📏 Strategy: Aim for 200-500 tokens per chunk with 10-20% overlap (see the sketch below)
  • 🧠 Mnemonic: COPS - Context preserved, Overlap included, Paragraph boundaries, Size consistent
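
As a sketch of that chunking strategy, the function below splits on token counts rather than words. It assumes the tiktoken tokenizer, but any tokenizer that can encode and decode text will do.

import tiktoken  # assumption: tiktoken is available for token counting

def chunk_by_tokens(text, chunk_size=400, overlap=60):
    # ~400-token chunks with ~15% overlap, decoded back to plain text
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):   # last window reached the end of the document
            break
    return chunks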

Metadata Enrichment

  • Adding timestamps, authors, source URLs, categories
  • Extracting entities (people, places, organizations)
  • Calculating statistics (word count, reading time)

Example transformation pipeline:

RAW DOCUMENT
     │
     ↓
┌─────────────────────────┐
│  Remove HTML tags       │   "<p>Hello</p>" → "Hello"
└────────────┬────────────┘
             ↓
┌─────────────────────────┐
│  Normalize text         │   "HELLO!!!" → "hello"
└────────────┬────────────┘
             ↓
┌─────────────────────────┐
│  Chunk into sections    │   One doc → Multiple chunks
└────────────┬────────────┘
             ↓
┌─────────────────────────┐
│  Add metadata           │   + timestamp, source, author
└────────────┬────────────┘
             ↓
       CLEAN CHUNKS

3. Embedding Generation 🧮

Embeddings are dense numerical representations of text that capture semantic meaning. Instead of matching exact keywords, embeddings allow you to find conceptually similar content.

๐ŸŒ Real-world analogy: Think of embeddings like GPS coordinates. "Coffee shop" and "cafรฉ" might have different spellings, but their embeddings (coordinates) would be very close together in semantic spaceโ€”just like two coffee shops on the same street have similar GPS coordinates.

How embeddings work:

TEXT INPUT                        EMBEDDING (simplified 5D)
"machine learning"          →     [0.23, -0.45, 0.89, 0.12, -0.67]
"artificial intelligence"   →     [0.25, -0.42, 0.91, 0.15, -0.65]
"cooking recipes"           →     [-0.78, 0.34, -0.12, 0.56, 0.23]

                            ↑ Similar concepts = Similar vectors
                              (measured by cosine similarity)
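
A quick sketch of that similarity calculation in plain NumPy, using the illustrative 5-dimensional vectors above:

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 = same direction, 0 = unrelated, negative = opposing
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

ml   = [0.23, -0.45, 0.89, 0.12, -0.67]   # "machine learning"
ai   = [0.25, -0.42, 0.91, 0.15, -0.65]   # "artificial intelligence"
food = [-0.78, 0.34, -0.12, 0.56, 0.23]   # "cooking recipes"

print(cosine_similarity(ml, ai))    # ~0.999: closely related concepts
print(cosine_similarity(ml, food))  # negative: unrelated concepts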

Popular embedding models in 2026:

Model                            Dimensions   Max Tokens   Best Use Case
OpenAI text-embedding-3-large    3072         8191         General purpose, high quality
Cohere embed-v3                  1024         512          Multilingual, fast
sentence-transformers            384-768      512          Open source, self-hosted
BGE-large                        1024         512          Retrieval-specific, SOTA

Embedding generation strategies:

  1. Document-level embeddings: Embed entire documents

    • Pro: Captures overall theme
    • Con: Loses granular details
  2. Chunk-level embeddings: Embed each chunk separately (most common)

    • Pro: Precise retrieval of relevant sections
    • Con: More vectors to store and search
  3. Hybrid embeddings: Combine document + chunk embeddings

    • Pro: Best of both worlds
    • Con: Increased complexity and storage

💡 Pro tip: Always use the same embedding model for indexing and querying. Mixing models is like trying to use a metric ruler to measure something marked in inches; the numbers won't align!

4. Indexing Architectures 📇

Once you have embeddings, you need to index them for efficient retrieval. An index is a data structure optimized for fast similarity search across millions or billions of vectors.

Vector Index Types:

Flat Index (Brute Force)

  • Compares query to every vector in database
  • ✅ Perfect accuracy (100% recall)
  • ❌ Slow for large datasets (O(n) complexity)
  • 🎯 Use when: Dataset < 10,000 vectors

HNSW (Hierarchical Navigable Small World)

  • Graph-based index with multiple layers
  • ✅ Fast queries, good recall (95-99%)
  • ❌ Higher memory usage
  • 🎯 Use when: Need speed with high accuracy

IVF (Inverted File Index)

  • Clusters vectors, searches nearest clusters first
  • ✅ Memory efficient, scalable
  • ❌ Moderate accuracy (80-95% recall)
  • 🎯 Use when: Very large datasets (millions+ vectors)

Quantization (PQ/SQ)

  • Compresses vectors to reduce storage
  • ✅ Massive space savings (8-32x compression)
  • ❌ Some accuracy loss
  • 🎯 Use when: Storage costs are critical
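
For intuition, a flat index really is just a dot product against every stored vector; the NumPy sketch below is the exact search that HNSW, IVF, and quantization approximate at larger scales.

import numpy as np

def flat_search(query_vec, index_matrix, top_k=5):
    # Brute-force cosine search: score the query against every row, O(n) in index size
    q = query_vec / np.linalg.norm(query_vec)
    m = index_matrix / np.linalg.norm(index_matrix, axis=1, keepdims=True)
    scores = m @ q
    top = np.argsort(scores)[::-1][:top_k]     # indices of the best matches
    return top, scores[top]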

INDEX PERFORMANCE COMPARISON

                 SPEED          RECALL         MEMORY
Flat Index       ████           ██████████     ████████
HNSW             ██████████     █████████      █████████████
IVF              █████████      ███████        ██████
IVF+PQ           █████████      ██████         ███

                 (longer bar = faster queries, higher recall, or more memory used)

5. Hybrid Search & Metadata Filtering 🔍

Modern RAG systems combine vector search (semantic similarity) with metadata filtering (structured queries) for powerful hybrid retrieval.

Metadata filtering examples:

  • Date ranges: "documents from last 30 days"
  • Categories: "only from 'technical' section"
  • Authors: "written by specific team members"
  • Access control: "user has permission to view"

Hybrid search strategies:

Strategy              Approach                                     When to Use
Pre-filtering         Filter metadata first, then vector search    Strict constraints (dates, permissions)
Post-filtering        Vector search first, filter results after    Soft preferences (categories, tags)
Sparse-Dense Hybrid   Combine keyword (BM25) + vector scores       Need exact matches + semantic similarity

Score fusion techniques:

When combining multiple search methods, you need to merge their scores:

  1. Reciprocal Rank Fusion (RRF)

    • Formula: score = Σ(1 / (k + rank_i)) where k=60
    • ✅ Simple, robust, no tuning needed
    • Most popular in practice (see the sketch after this list)
  2. Weighted Sum

    • Formula: score = α × vector_score + (1-α) × keyword_score
    • ⚠️ Requires normalization and tuning α
  3. Relative Score Fusion

    • Normalize each score by max score in its result set
    • ✅ Handles different score ranges automatically
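
A minimal RRF sketch over ranked lists of document IDs (the k=60 constant matches the formula above; the example IDs are purely illustrative):

def reciprocal_rank_fusion(rankings, k=60):
    # rankings: a list of ranked result lists, e.g. [bm25_ids, vector_ids]
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a keyword ranking and a vector ranking: "d1" wins because it ranks well in both
fused = reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d9", "d3"]])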

Practical Examples 🛠️

Example 1: Building a Document Ingestion Pipeline

Let's walk through ingesting PDF documents into a RAG system:

Scenario: You're building a legal document search system that needs to index 10,000 PDF contracts.

Step-by-step pipeline:

## 1. INGESTION: Extract text from PDF
from pypdf import PdfReader

def ingest_pdf(pdf_path):
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text()
    return text

## 2. TRANSFORMATION: Clean and chunk
def transform_text(text, chunk_size=500, overlap=50):
    # Remove extra whitespace
    text = " ".join(text.split())
    
    # Split into chunks with overlap
    chunks = []
    words = text.split()
    
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
    
    return chunks

## 3. EMBEDDING: Generate vectors
from openai import OpenAI

openai_client = OpenAI()  # renamed so it doesn't clash with the vector DB client below

def generate_embeddings(chunks):
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=chunks
    )
    return [item.embedding for item in response.data]

## 4. INDEXING: Store in vector database
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

qdrant = QdrantClient(":memory:")  # in-memory instance for local experiments

def index_documents(chunks, embeddings, metadata):
    points = []
    for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
        points.append(PointStruct(
            id=i,
            vector=embedding,
            payload={
                "text": chunk,
                "source": metadata["source"],
                "page": metadata["page"],
                "date": metadata["date"],
            },
        ))

    qdrant.upsert(collection_name="legal_docs", points=points)

Key decisions explained:

  • Chunk size 500 words: Balances context vs. precision for legal text
  • Overlap 50 words: Prevents important clauses from being split across chunks
  • Metadata tracking: Source, page number, date enable filtering by document origin
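
Tying the four steps together, a minimal driver might look like the sketch below, using the Qdrant client (qdrant) created above. The collection setup, the 1536 dimension (the default size of text-embedding-3-small vectors), and the metadata values are illustrative assumptions.

from qdrant_client.models import VectorParams, Distance

# One-time setup: the collection must exist (with the right dimension) before upserting
qdrant.create_collection(
    collection_name="legal_docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

def process_contract(pdf_path):
    text = ingest_pdf(pdf_path)                     # 1. ingestion
    chunks = transform_text(text)                   # 2. transformation
    embeddings = generate_embeddings(chunks)        # 3. embedding
    metadata = {"source": pdf_path, "page": None, "date": "2026-01-01"}  # illustrative metadata
    index_documents(chunks, embeddings, metadata)   # 4. indexing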

Example 2: Optimizing Index Performance

Scenario: Your vector search is too slow; queries take 2-3 seconds against a database of 1 million documents.

Diagnosis and solutions:

Problem                       Solution                                          Expected Improvement
Using flat index              Switch to HNSW index                              10-50x faster queries
No metadata pre-filtering     Add date/category filters before vector search    5-10x faster (smaller search space)
High dimensionality (3072D)   Use smaller embedding model (768D)                2-4x faster, 75% less storage
Returning too many results    Reduce top_k from 100 to 20                       2-3x faster

Implementation with HNSW:

## Configure HNSW index parameters
from qdrant_client.models import (
    VectorParams, Distance, HnswConfigDiff,
    Filter, FieldCondition, MatchValue, DatetimeRange, SearchParams,
)

qdrant.create_collection(
    collection_name="optimized_docs",
    vectors_config=VectorParams(
        size=768,                  # Embedding dimension
        distance=Distance.COSINE,  # Similarity metric
    ),
    hnsw_config=HnswConfigDiff(
        m=16,               # Number of connections per layer
        ef_construct=100,   # Construction time vs accuracy
    ),
)

## Query with pre-filtering
## query_embedding comes from the same embedding model used at indexing time
results = qdrant.search(
    collection_name="optimized_docs",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[
            # Datetime filtering assumes a recent Qdrant version with datetime payload support
            FieldCondition(key="date", range=DatetimeRange(gte="2024-01-01")),
            FieldCondition(key="category", match=MatchValue(value="contracts")),
        ]
    ),
    limit=20,                                # Reduced from 100
    search_params=SearchParams(hnsw_ef=64),  # Query time accuracy
)

Parameter tuning guide:

  • m (connections): Higher = better recall, more memory (typical: 16-32)
  • ef_construct: Higher = better index quality, slower build (typical: 100-200)
  • ef (query time): Higher = better recall, slower queries (typical: 32-128)

Example 3: Implementing Hybrid Search

Scenario: Users complain that searching for exact product codes (like "SKU-1234") returns irrelevant semantic matches instead of the exact product.

Solution: Implement sparse-dense hybrid search combining keyword matching (BM25) with vector similarity.

from rank_bm25 import BM25Okapi
import numpy as np

def hybrid_search(query, documents, embeddings, alpha=0.5):
    # 1. SPARSE: BM25 keyword search
    tokenized_docs = [doc.split() for doc in documents]
    bm25 = BM25Okapi(tokenized_docs)
    bm25_scores = np.array(bm25.get_scores(query.split()))

    # 2. DENSE: Vector similarity (cosine of the query against every document embedding)
    query_embedding = np.array(generate_embedding(query))  # same embedding model as indexing
    doc_matrix = np.array(embeddings)
    vector_scores = doc_matrix @ query_embedding / (
        np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_embedding)
    )

    # 3. NORMALIZE: Bring both score sets onto a 0-1 scale
    bm25_normalized = bm25_scores / max(bm25_scores.max(), 1e-9)
    vector_normalized = (vector_scores + 1) / 2  # Cosine [-1,1] to [0,1]

    # 4. FUSION: Combine with weighted sum
    final_scores = alpha * vector_normalized + (1 - alpha) * bm25_normalized

    # 5. RANK: Sort by final score (best first)
    ranked_indices = np.argsort(final_scores)[::-1]
    return ranked_indices

When to tune α (alpha):

  • α = 0.7-0.9: Prioritize semantic similarity (research papers, knowledge bases)
  • α = 0.3-0.5: Balance semantic + exact matches (e-commerce, product catalogs)
  • α = 0.1-0.3: Prioritize exact keyword matches (legal, technical docs with IDs)

Results comparison:

Query                              Vector Only                              Keyword Only                        Hybrid (α=0.4)
"SKU-1234"                         ❌ Similar product descriptions          ✅ Exact SKU match                  ✅ Exact SKU match
"affordable laptop for students"   ✅ Budget laptops with student reviews   ❌ Any doc with all 4 words         ✅ Budget laptops with student context
"waterproof hiking boots"          ✅ Outdoor footwear                      ⚠️ Any waterproof item + any boot   ✅ Waterproof hiking boots specifically

Example 4: Real-Time Pipeline Monitoring

Scenario: Your pipeline processes 10,000 documents per hour. You need to monitor for failures, bottlenecks, and data quality issues.

Key metrics to track:

PIPELINE HEALTH DASHBOARD
┌─────────────────────────────────────────────────┐
│  INGESTION                                      │
│  ████████████████████ 10,234 docs/hr  ✅ Healthy│
│  Failures: 12 (0.1%)                            │
├─────────────────────────────────────────────────┤
│  TRANSFORMATION                                 │
│  █████████████████████ Avg: 0.3s/doc  ⚠️ Slow   │
│  Queue depth: 1,245 (high)                      │
├─────────────────────────────────────────────────┤
│  EMBEDDING                                      │
│  ████████████████████ 9,891 vectors/hr  ✅      │
│  API latency: 45ms (good)                       │
├─────────────────────────────────────────────────┤
│  INDEXING                                       │
│  ████████████████████ 9,889 indexed/hr  ✅      │
│  Index size: 4.2GB, growth: +120MB/hr           │
└─────────────────────────────────────────────────┘

Monitoring code example:

import time
import logging
from prometheus_client import Counter, Histogram, Gauge

## Define metrics
docs_ingested = Counter('docs_ingested_total', 'Total documents ingested')
ingestion_errors = Counter('ingestion_errors_total', 'Ingestion failures')
chunk_processing_time = Histogram('chunk_processing_seconds', 'Time to process chunks')
queue_depth = Gauge('pipeline_queue_depth', 'Documents waiting in queue')

def monitored_pipeline(document):
    # `document` is assumed to expose .text, .metadata and .id;
    # get_queue_size() and send_to_dlq() are placeholders for your queueing system
    try:
        # Track queue depth
        queue_depth.set(get_queue_size())

        # Time the transformation step
        start = time.time()
        chunks = transform_text(document.text)
        chunk_processing_time.observe(time.time() - start)

        # Generate embeddings
        embeddings = generate_embeddings(chunks)

        # Index
        index_documents(chunks, embeddings, document.metadata)

        # Success metric
        docs_ingested.inc()

    except Exception as e:
        # Track failures
        ingestion_errors.inc()
        logging.error(f"Pipeline failed for doc {document.id}: {e}")
        # Send to dead letter queue for retry
        send_to_dlq(document)

Alert thresholds:

  • โš ๏ธ Warning: Error rate > 1%
  • ๐Ÿšจ Critical: Error rate > 5%
  • โš ๏ธ Warning: Processing time > 2x baseline
  • ๐Ÿšจ Critical: Queue depth > 10,000 documents
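
A small sketch of how those thresholds could be checked in code; the function and its inputs are illustrative, and the numbers mirror the list above.

def check_pipeline_health(ingested, errors, avg_latency, baseline_latency, queue_size):
    # Returns the list of alerts that should fire for the current metrics
    alerts = []
    error_rate = errors / max(ingested, 1)
    if error_rate > 0.05:
        alerts.append("CRITICAL: error rate > 5%")
    elif error_rate > 0.01:
        alerts.append("WARNING: error rate > 1%")
    if avg_latency > 2 * baseline_latency:
        alerts.append("WARNING: processing time > 2x baseline")
    if queue_size > 10_000:
        alerts.append("CRITICAL: queue depth > 10,000 documents")
    return alerts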

Common Mistakes & How to Avoid Them ⚠️

Mistake 1: Chunking Without Overlap

Problem: Important context gets split across chunks, reducing retrieval quality.

โŒ Wrong approach:

Chunk 1: "...to meet quarterly goals. "
Chunk 2: "The sales team reported..."

The connection between "quarterly goals" and "sales team" is lost!

✅ Correct approach:

Chunk 1: "...to meet quarterly goals. The sales team reported..."
Chunk 2: "quarterly goals. The sales team reported strong performance..."

Overlap preserves context across boundaries.

Rule of thumb: Use 10-20% overlap (50-100 tokens for 500-token chunks).

Mistake 2: Ignoring Embedding Model Token Limits

Problem: Sending 10,000-token documents to a model with a 512-token limit results in silent truncation: you lose 95% of your content!

⚠️ Warning signs:

  • Retrieval quality drops for longer documents
  • Only first section of documents appears relevant
  • Inconsistent results across document length

✅ Solution: Always chunk before embedding:

def safe_embed(text, max_tokens=512):
    # Estimate tokens (rough: 1 token ≈ 4 chars)
    estimated_tokens = len(text) // 4
    
    if estimated_tokens > max_tokens:
        raise ValueError(
            f"Text too long: {estimated_tokens} tokens. "
            f"Chunk it first to max {max_tokens} tokens!"
        )
    
    return generate_embedding(text)

Mistake 3: Mixing Embedding Models

Problem: Indexing with model A, querying with model B produces nonsense results.

๐ŸŒ Why it fails: It's like measuring your height in meters, then comparing to someone measured in feetโ€”the numbers don't align!

Each embedding model creates a unique "semantic space" with its own coordinate system.

โœ… Golden rule:

  • Store the embedding model name with your index
  • Always use the same model for queries
  • If you change models, re-index everything
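
One lightweight way to enforce the rule is to record the model name alongside the index and refuse mismatched queries. This is just a sketch: the constant and the generate_embedding helper are assumptions, not a specific library API.

EXPECTED_MODEL = "text-embedding-3-small"   # assumption: stored when the index was built

def embed_query(query, model_name):
    # Refuse to embed queries with a different model than the one used at indexing time
    if model_name != EXPECTED_MODEL:
        raise ValueError(
            f"Index was built with {EXPECTED_MODEL}, got {model_name}. "
            "Re-index everything before switching models."
        )
    return generate_embedding(query)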

Mistake 4: No Monitoring or Error Handling

Problem: Pipeline fails silently, and you discover hours later that no documents were indexed.

Real-world disaster scenario:

09:00 - Embedding API key expires
09:01 - Pipeline starts failing silently  
14:00 - Users report search returning no results
14:30 - Team discovers 5 hours of data lost
16:00 - Emergency re-processing begins

✅ Prevention checklist:

  • ☑️ Log every stage (ingestion, transformation, embedding, indexing)
  • ☑️ Set up alerts for error rate spikes
  • ☑️ Implement retry logic with exponential backoff (see the sketch after this checklist)
  • ☑️ Use dead letter queues for failed documents
  • ☑️ Monitor pipeline latency and throughput
  • ☑️ Test error scenarios regularly (API failures, malformed data)

Mistake 5: Over-Indexing vs. Under-Indexing

Over-indexing: Creating too many small chunks

  • Each chunk lacks sufficient context
  • Retrieval returns fragmented information
  • Higher storage and query costs

Under-indexing: Chunks too large

  • Irrelevant content mixed with relevant passages
  • Wastes LLM context window on noise
  • Poor retrieval precision

✅ Optimal chunk sizing guide:

Content Type      Recommended Size   Rationale
Technical docs    300-500 tokens     Code examples need context
Legal contracts   400-600 tokens     Clauses reference each other
News articles     200-400 tokens     Paragraphs are self-contained
Chat messages     100-200 tokens     Short, conversational
Research papers   500-800 tokens     Complex arguments need space

Key Takeaways 🎯

Let's consolidate everything you've learned about data pipeline and indexing:

✅ Data pipelines automate the flow from raw data to searchable indexes through ingestion → transformation → embedding → indexing stages

✅ Choose ingestion strategy based on latency needs: batch (hours), micro-batch (minutes), or stream (seconds)

✅ Chunk intelligently: 200-500 tokens with 10-20% overlap preserves context while respecting token limits

✅ Embeddings convert text to semantic vectors; use the same model consistently for indexing and querying

✅ Index types trade off speed vs. accuracy: Flat (perfect, slow), HNSW (fast, accurate), IVF (scalable, moderate accuracy)

✅ Hybrid search combines vector similarity with keyword matching for best results on diverse queries

✅ Monitor everything: Track throughput, latency, errors, and queue depth to catch issues early

✅ Metadata filtering dramatically speeds up queries by reducing the search space before vector comparison

✅ Avoid silent failures: Implement comprehensive error handling, logging, and alerting

🧠 Memory Devices

Remember pipeline stages with ITEMS:

  • Ingestion (collect data)
  • Transformation (clean & chunk)
  • Embedding (generate vectors)
  • Metadata (add context)
  • Storage (index in vector DB)

Remember index types with CHEF:

  • Cheap & slow: Flat index
  • High-speed graph: HNSW
  • Efficient clusters: IVF
  • Fast & compressed: Quantized

🔧 Try This: Hands-On Exercise

Challenge: Build a mini RAG pipeline for your own documents!

  1. Collect: Grab 5-10 PDF or text files (research papers, reports, articles)
  2. Process: Write code to extract text and chunk it (300-500 words)
  3. Embed: Use OpenAI API or Sentence Transformers to generate embeddings
  4. Index: Store in Qdrant (local) or Pinecone (cloud)
  5. Query: Ask natural language questions and retrieve relevant chunks
  6. Measure: Compare retrieval quality with vs. without overlap in chunks

Bonus: Add metadata filtering (date, source, topic) and test hybrid search!

📋 Quick Reference Card: Pipeline & Indexing Essentials

Ingestion Types: Batch (scheduled), Stream (real-time), Micro-batch (frequent small batches)
Optimal Chunk Size: 200-500 tokens with 10-20% overlap
Popular Embeddings: OpenAI text-embedding-3-large (3072D), Cohere embed-v3 (1024D), BGE-large (1024D)
Index for Speed: HNSW (graph-based, 95-99% recall)
Index for Scale: IVF with quantization (clusters + compression)
Hybrid Search: Combine BM25 (keywords) + vector similarity with RRF or weighted fusion
HNSW Parameters: m=16-32 (connections), ef_construct=100-200 (build quality), ef=32-128 (query accuracy)
Monitor These: Throughput (docs/hour), error rate (%), latency (seconds/doc), queue depth
Golden Rules: 1) Same embedding model always 2) Chunk before embed 3) Add overlap 4) Monitor failures

📚 Further Study

Ready to dive deeper? Check out these resources:

  1. Pinecone Learning Center - https://www.pinecone.io/learn/ - Comprehensive guides on vector databases, embeddings, and RAG architectures with hands-on tutorials

  2. LangChain Documentation: Text Splitters - https://python.langchain.com/docs/modules/data_connection/document_transformers/ - Deep dive into chunking strategies with code examples for various document types

  3. HNSW Algorithm Explained - https://arxiv.org/abs/1603.09320 - Original research paper on Hierarchical Navigable Small World graphs (for those who want the mathematical foundation)

Congratulations! 🎉 You now understand the complete data pipeline and indexing architecture that powers modern AI search and RAG systems. You're ready to build production-grade retrieval systems that can handle millions of documents and return relevant results in milliseconds!