Data Pipeline & Indexing
Create robust ingestion pipelines with smart chunking, embedding generation, and incremental updates.
Master the fundamentals of data pipeline and indexing systems with free flashcards and spaced repetition practice. This lesson covers data ingestion strategies, transformation techniques, vector embeddings, and indexing architectures: essential concepts for building modern AI search and retrieval-augmented generation (RAG) systems.
Welcome to Data Pipeline & Indexing
In the world of AI search and RAG systems, your data pipeline and indexing strategy can make or break your application's performance. Think of a data pipeline as a sophisticated assembly line where raw data enters one end and emerges as searchable, semantically rich information on the other. The indexing component is like creating a hyper-efficient library catalog system, but instead of organizing books by title, you're organizing information by meaning, context, and relevance.
Whether you're building a chatbot that needs to answer questions from thousands of documents, a semantic search engine for research papers, or an AI assistant that pulls information from multiple data sources, understanding how to efficiently move, transform, and index data is crucial.
💡 Did you know? Companies like Google process over 3.5 billion searches per day. Their indexing systems need to handle petabytes of data and return relevant results in milliseconds, a feat made possible through sophisticated pipeline and indexing architectures!
Core Concepts: Understanding the Data Journey
The Data Pipeline Architecture
A data pipeline is an automated workflow that moves data from source systems through various transformation stages to a destination where it can be indexed and queried. Let's break down the key stages:
DATA PIPELINE FLOW

INGESTION  →  TRANSFORMATION  →  EMBEDDING   →  INDEXING    →  SEARCH
Extract       Clean &            Convert to     Store in       Query &
from          Normalize          Vectors        Vector DB      Retrieve
Sources       Data                                             Results
1. Data Ingestion Strategies
Data ingestion is the process of collecting raw data from various sources. The strategy you choose depends on your data volume, velocity, and variety requirements.
Batch Ingestion: Processing data in large chunks at scheduled intervals
- ✅ Best for: Historical data, non-time-sensitive updates, cost optimization
- ⚠️ Trade-off: Data staleness (freshness can be hours or days old)
- Example use case: Nightly updates of product catalogs
Stream Ingestion: Processing data continuously as it arrives
- ✅ Best for: Real-time analytics, live chat support, financial trading
- ⚠️ Trade-off: Higher complexity and infrastructure costs
- Example use case: Live customer support ticket ingestion
Micro-batch: Hybrid approach processing small batches frequently
- ✅ Best for: Near real-time needs with controlled resource usage
- ⚠️ Trade-off: Balancing latency vs. throughput
- Example use case: Social media monitoring (process every 5 minutes; see the sketch after the table below)
| Ingestion Type | Latency | Complexity | Cost | Best For |
|---|---|---|---|---|
| Batch | Hours-Days | Low | 💰 Low | Archives, reports |
| Micro-batch | Minutes | Medium | 💰💰 Medium | News feeds, logs |
| Stream | Seconds | High | 💰💰💰 High | Real-time chat, IoT |
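To make the micro-batch pattern concrete, here is a minimal polling-loop sketch. The `fetch_documents_since` and `process_batch` callables are placeholders for your own source connector and downstream pipeline, not a specific library API:

```python
import time
from datetime import datetime, timezone

POLL_INTERVAL_SECONDS = 300  # process every 5 minutes (micro-batch)

def run_micro_batch_loop(fetch_documents_since, process_batch):
    # fetch_documents_since(ts) returns documents created after `ts`;
    # process_batch(docs) runs transformation -> embedding -> indexing on a list.
    last_run = datetime.now(timezone.utc)
    while True:
        batch = fetch_documents_since(last_run)
        last_run = datetime.now(timezone.utc)
        if batch:
            process_batch(batch)   # transform, embed, index in one go
        time.sleep(POLL_INTERVAL_SECONDS)
```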
2. Data Transformation & Cleaning
Raw data is rarely ready for indexing. Transformation involves converting, cleaning, and enriching data to make it suitable for semantic search.
Key transformation operations:
Text Normalization
- Lowercasing: "Apple" → "apple"
- Removing special characters: "Hello!!! World???" → "Hello World"
- Whitespace normalization: "too    many    spaces" → "too many spaces"
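A minimal sketch of these three normalization steps using Python's built-in re module (which rules you keep is domain-dependent; stripping punctuation, for instance, can hurt code snippets or legal text):

```python
import re

def normalize_text(text: str) -> str:
    text = text.lower()                       # "Apple" -> "apple"
    text = re.sub(r"[^\w\s]", " ", text)      # drop special characters
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
    return text
```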
Chunking: Breaking large documents into smaller, semantically meaningful pieces
- Why? Embedding models have token limits (typically 512-8192 tokens)
- Strategy: Aim for 200-500 tokens per chunk with 10-20% overlap
- Mnemonic: COPS - Context preserved, Overlap included, Paragraph boundaries, Size consistent
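Because the limits above are measured in tokens rather than words, a token-aware splitter is safer than splitting on whitespace. Here is a sketch using the tiktoken tokenizer; any tokenizer that matches your embedding model works, and the 400/60 values are just one point inside the recommended range:

```python
import tiktoken

def chunk_by_tokens(text: str, chunk_size: int = 400, overlap: int = 60) -> list[str]:
    # 400-token chunks with ~15% overlap, in line with the COPS guidelines above
    enc = tiktoken.get_encoding("cl100k_base")  # assumed to match the embedding model's tokenizer
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
    return chunks
```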
Metadata Enrichment
- Adding timestamps, authors, source URLs, categories
- Extracting entities (people, places, organizations)
- Calculating statistics (word count, reading time)
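A lightweight enrichment helper might look like the sketch below (entity extraction is omitted; it would typically be handled by an NLP library such as spaCy):

```python
from datetime import datetime, timezone

def enrich_chunk(chunk: str, source_url: str, author: str | None = None) -> dict:
    words = chunk.split()
    return {
        "text": chunk,
        "source": source_url,
        "author": author,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "word_count": len(words),
        "reading_time_min": round(len(words) / 200, 1),  # ~200 words per minute
    }
```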
Example transformation pipeline:
RAW DOCUMENT
     ↓
Remove HTML tags       (strip markup, keep the plain text)
     ↓
Normalize text         ("HELLO!!!" → "hello")
     ↓
Chunk into sections    (one doc → multiple chunks)
     ↓
Add metadata           (+ timestamp, source, author)
     ↓
CLEAN CHUNKS
3. Vector Embeddings: The Heart of Semantic Search
Embeddings are dense numerical representations of text that capture semantic meaning. Instead of matching exact keywords, embeddings allow you to find conceptually similar content.
Real-world analogy: Think of embeddings like GPS coordinates. "Coffee shop" and "café" might have different spellings, but their embeddings (coordinates) would be very close together in semantic space, just like two coffee shops on the same street have similar GPS coordinates.
How embeddings work:
TEXT INPUT                       EMBEDDING (simplified 5D)
"machine learning"          →    [0.23, -0.45, 0.89, 0.12, -0.67]
"artificial intelligence"   →    [0.25, -0.42, 0.91, 0.15, -0.65]
"cooking recipes"           →    [-0.78, 0.34, -0.12, 0.56, 0.23]

Similar concepts = similar vectors (measured by cosine similarity)
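You can verify the claim directly by computing cosine similarity for the toy 5-dimensional vectors above, for example with NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

ml      = [0.23, -0.45, 0.89, 0.12, -0.67]
ai      = [0.25, -0.42, 0.91, 0.15, -0.65]
cooking = [-0.78, 0.34, -0.12, 0.56, 0.23]

print(cosine_similarity(ml, ai))       # close to 1.0 -> semantically similar
print(cosine_similarity(ml, cooking))  # much lower -> unrelated concepts
```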
Popular embedding models in 2026:
| Model | Dimensions | Max Tokens | Best Use Case |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8191 | General purpose, high quality |
| Cohere embed-v3 | 1024 | 512 | Multilingual, fast |
| sentence-transformers | 384-768 | 512 | Open source, self-hosted |
| BGE-large | 1024 | 512 | Retrieval-specific, SOTA |
Embedding generation strategies:
Document-level embeddings: Embed entire documents
- Pro: Captures overall theme
- Con: Loses granular details
Chunk-level embeddings: Embed each chunk separately (most common)
- Pro: Precise retrieval of relevant sections
- Con: More vectors to store and search
Hybrid embeddings: Combine document + chunk embeddings
- Pro: Best of both worlds
- Con: Increased complexity and storage
💡 Pro tip: Always use the same embedding model for indexing and querying. Mixing models is like trying to use a metric ruler to measure something marked in inches: the numbers won't align!
4. Indexing Architectures
Once you have embeddings, you need to index them for efficient retrieval. An index is a data structure optimized for fast similarity search across millions or billions of vectors.
Vector Index Types:
Flat Index (Brute Force)
- Compares query to every vector in database
- ✅ Perfect accuracy (100% recall)
- ❌ Slow for large datasets (O(n) complexity)
- Use when: Dataset < 10,000 vectors
HNSW (Hierarchical Navigable Small World)
- Graph-based index with multiple layers
- ✅ Fast queries, good recall (95-99%)
- ❌ Higher memory usage
- Use when: Need speed with high accuracy
IVF (Inverted File Index)
- Clusters vectors, searches nearest clusters first
- ✅ Memory efficient, scalable
- ❌ Moderate accuracy (80-95% recall)
- Use when: Very large datasets (millions+ vectors)
Quantization (PQ/SQ)
- Compresses vectors to reduce storage
- ✅ Massive space savings (8-32x compression)
- ❌ Some accuracy loss
- Use when: Storage costs are critical
INDEX PERFORMANCE COMPARISON

| Index | Query Speed | Recall | Memory |
|---|---|---|---|
| Flat | Slowest | 100% (exact) | High |
| HNSW | Fastest | 95-99% | Highest |
| IVF | Fast | 80-95% | Moderate |
| IVF+PQ | Fast | Lowest (lossy) | Lowest |
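To make the differences tangible in code, here is a small sketch using the FAISS library (assuming the faiss-cpu package and random vectors as stand-ins for real embeddings); managed vector databases expose the same choices as configuration:

```python
import numpy as np
import faiss

d = 768                                                   # embedding dimensionality
vectors = np.random.rand(10_000, d).astype("float32")     # stand-in for real embeddings
query = np.random.rand(1, d).astype("float32")

# Flat (brute force): exact results, O(n) per query
flat = faiss.IndexFlatL2(d)
flat.add(vectors)

# HNSW: layered graph, fast with high recall, more memory
hnsw = faiss.IndexHNSWFlat(d, 32)       # 32 = M, connections per node
hnsw.hnsw.efConstruction = 100          # build-time quality knob
hnsw.add(vectors)
hnsw.hnsw.efSearch = 64                 # query-time accuracy/speed knob

# IVF: cluster first, then search only the nearest clusters
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 256)   # 256 clusters
ivf.train(vectors)                            # IVF must be trained before adding
ivf.add(vectors)
ivf.nprobe = 8                                # clusters to visit per query

for index in (flat, hnsw, ivf):
    distances, ids = index.search(query, 5)   # top-5 nearest neighbours
```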
5. Metadata Filtering & Hybrid Search
Modern RAG systems combine vector search (semantic similarity) with metadata filtering (structured queries) for powerful hybrid retrieval.
Metadata filtering examples:
- Date ranges: "documents from last 30 days"
- Categories: "only from 'technical' section"
- Authors: "written by specific team members"
- Access control: "user has permission to view"
Hybrid search strategies:
| Strategy | Approach | When to Use |
|---|---|---|
| Pre-filtering | Filter metadata first, then vector search | Strict constraints (dates, permissions) |
| Post-filtering | Vector search first, filter results after | Soft preferences (categories, tags) |
| Sparse-Dense Hybrid | Combine keyword (BM25) + vector scores | Need exact matches + semantic similarity |
Score fusion techniques:
When combining multiple search methods, you need to merge their scores:
Reciprocal Rank Fusion (RRF)
- Formula: score = Σ 1 / (k + rank_i), where k = 60
- ✅ Simple, robust, no tuning needed
- Most popular in practice (see the sketch after this list)
Weighted Sum
- Formula: score = α × vector_score + (1 - α) × keyword_score
- ⚠️ Requires normalization and tuning α
Relative Score Fusion
- Normalize each score by the max score in its result set
- ✅ Handles different score ranges automatically
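RRF is short enough to implement directly. The sketch below merges ranked lists of document ids from any number of retrievers, for example BM25 and vector search:

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    # ranked_lists: one list of doc ids per retriever, best match first
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: a doc ranked well by both retrievers rises to the top
keyword_hits = ["doc3", "doc1", "doc7"]
vector_hits  = ["doc1", "doc4", "doc3"]
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))  # doc1 and doc3 come first
```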
Practical Examples
Example 1: Building a Document Ingestion Pipeline
Let's walk through ingesting PDF documents into a RAG system:
Scenario: You're building a legal document search system that needs to index 10,000 PDF contracts.
Step-by-step pipeline:
## 1. INGESTION: Extract text from PDF
from pypdf import PdfReader

def ingest_pdf(pdf_path):
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text() or ""   # extract_text() can return nothing for image-only pages
    return text

## 2. TRANSFORMATION: Clean and chunk
def transform_text(text, chunk_size=500, overlap=50):
    # Remove extra whitespace
    text = " ".join(text.split())
    # Split into word-based chunks with overlap
    chunks = []
    words = text.split()
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

## 3. EMBEDDING: Generate vectors
from openai import OpenAI

openai_client = OpenAI()   # kept separate from the Qdrant client below

def generate_embeddings(chunks):
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=chunks,
    )
    return [item.embedding for item in response.data]

## 4. INDEXING: Store in vector database
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, VectorParams, Distance

qdrant = QdrantClient(":memory:")   # in-memory instance for local experimentation

# The collection must exist before upserting;
# text-embedding-3-small produces 1536-dimensional vectors
qdrant.create_collection(
    collection_name="legal_docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

def index_documents(chunks, embeddings, metadata):
    points = []
    for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
        points.append(
            PointStruct(
                id=i,
                vector=embedding,
                payload={
                    "text": chunk,
                    "source": metadata["source"],
                    "page": metadata["page"],
                    "date": metadata["date"],
                },
            )
        )
    qdrant.upsert(collection_name="legal_docs", points=points)
Key decisions explained:
- Chunk size 500 words: Balances context vs. precision for legal text
- Overlap 50 words: Prevents important clauses from being split across chunks
- Metadata tracking: Source, page number, date enable filtering by document origin
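Putting the four stages together, a minimal driver might look like the sketch below; the file path and metadata values are purely illustrative:

```python
def process_contract(pdf_path: str):
    text = ingest_pdf(pdf_path)                    # 1. ingestion
    chunks = transform_text(text)                  # 2. transformation
    embeddings = generate_embeddings(chunks)       # 3. embedding
    metadata = {                                   # illustrative metadata
        "source": pdf_path,
        "page": None,        # per-page tracking would require chunking page by page
        "date": "2024-06-01",
    }
    index_documents(chunks, embeddings, metadata)  # 4. indexing

process_contract("contracts/example_contract.pdf")
```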
Example 2: Optimizing Index Performance
Scenario: Your vector search is too slow; queries take 2-3 seconds for a database of 1 million documents.
Diagnosis and solutions:
| Problem | Solution | Expected Improvement |
|---|---|---|
| Using flat index | Switch to HNSW index | 10-50x faster queries |
| No metadata pre-filtering | Add date/category filters before vector search | 5-10x faster (smaller search space) |
| High dimensionality (3072D) | Use smaller embedding model (768D) | 2-4x faster, 75% less storage |
| Returning too many results | Reduce top_k from 100 to 20 | 2-3x faster |
Implementation with HNSW:
## Configure HNSW index parameters
from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, Distance, HnswConfigDiff, SearchParams,
    Filter, FieldCondition, MatchValue, DatetimeRange,
)

client = QdrantClient("localhost", port=6333)

client.create_collection(
    collection_name="optimized_docs",
    vectors_config=VectorParams(
        size=768,                    # Embedding dimension
        distance=Distance.COSINE,    # Similarity metric
    ),
    hnsw_config=HnswConfigDiff(
        m=16,                # Number of connections per layer
        ef_construct=100,    # Construction time vs accuracy
    ),
)

## Query with pre-filtering ("date" stored as an RFC 3339 string in the payload)
# query_embedding: the user query encoded with the same 768-dim model used at index time
results = client.search(
    collection_name="optimized_docs",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[
            FieldCondition(key="date", range=DatetimeRange(gte="2024-01-01T00:00:00Z")),
            FieldCondition(key="category", match=MatchValue(value="contracts")),
        ]
    ),
    limit=20,                                # Reduced from 100
    search_params=SearchParams(hnsw_ef=64),  # Query-time accuracy
)
Parameter tuning guide:
- m (connections): Higher = better recall, more memory (typical: 16-32)
- ef_construct: Higher = better index quality, slower build (typical: 100-200)
- ef (query time): Higher = better recall, slower queries (typical: 32-128)
Example 3: Implementing Hybrid Search
Scenario: Users complain that searching for exact product codes (like "SKU-1234") returns irrelevant semantic matches instead of the exact product.
Solution: Implement sparse-dense hybrid search combining keyword matching (BM25) with vector similarity.
from rank_bm25 import BM25Okapi
import numpy as np

def hybrid_search(query, documents, embeddings, alpha=0.5):
    # 1. SPARSE: BM25 keyword search
    tokenized_docs = [doc.split() for doc in documents]
    bm25 = BM25Okapi(tokenized_docs)
    bm25_scores = bm25.get_scores(query.split())

    # 2. DENSE: cosine similarity against every chunk embedding
    query_embedding = np.array(generate_embeddings([query])[0])  # helper from Example 1
    embeddings = np.asarray(embeddings)
    vector_scores = embeddings @ query_embedding / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_embedding)
    )

    # 3. NORMALIZE: Bring scores to the same 0-1 scale
    bm25_normalized = bm25_scores / (bm25_scores.max() or 1.0)
    vector_normalized = (vector_scores + 1) / 2  # Cosine [-1, 1] -> [0, 1]

    # 4. FUSION: Combine with weighted sum
    final_scores = alpha * vector_normalized + (1 - alpha) * bm25_normalized

    # 5. RANK: Sort by final score, best first
    ranked_indices = np.argsort(final_scores)[::-1]
    return ranked_indices
When to tune α (alpha):
- α = 0.7-0.9: Prioritize semantic similarity (research papers, knowledge bases)
- α = 0.3-0.5: Balance semantic + exact matches (e-commerce, product catalogs)
- α = 0.1-0.3: Prioritize exact keyword matches (legal, technical docs with IDs)
Results comparison:
| Query | Vector Only | Keyword Only | Hybrid (α=0.4) |
|---|---|---|---|
| "SKU-1234" | ❌ Similar product descriptions | ✅ Exact SKU match | ✅ Exact SKU match |
| "affordable laptop for students" | ✅ Budget laptops with student reviews | ❌ Any doc with all 4 words | ✅ Budget laptops with student context |
| "waterproof hiking boots" | ✅ Outdoor footwear | ⚠️ Any waterproof item + any boot | ✅ Waterproof hiking boots specifically |
Example 4: Real-Time Pipeline Monitoring
Scenario: Your pipeline processes 10,000 documents per hour. You need to monitor for failures, bottlenecks, and data quality issues.
Key metrics to track:
PIPELINE HEALTH DASHBOARD
INGESTION       10,234 docs/hr     ✅ Healthy   Failures: 12 (0.1%)
TRANSFORMATION  Avg: 0.3s/doc      ⚠️ Slow      Queue depth: 1,245 (high)
EMBEDDING       9,891 vectors/hr   ✅           API latency: 45ms (good)
INDEXING        9,889 indexed/hr   ✅           Index size: 4.2GB, growth: +120MB/hr
Monitoring code example:
import time
import logging
from prometheus_client import Counter, Histogram, Gauge

## Define metrics
docs_ingested = Counter('docs_ingested_total', 'Total documents ingested')
ingestion_errors = Counter('ingestion_errors_total', 'Ingestion failures')
chunk_processing_time = Histogram('chunk_processing_seconds', 'Time to process chunks')
queue_depth = Gauge('pipeline_queue_depth', 'Documents waiting in queue')

def monitored_pipeline(document):
    # `document` is assumed to expose .text, .metadata, and .id;
    # get_queue_size() and send_to_dlq() are pipeline helpers defined elsewhere.
    try:
        # Track queue depth
        queue_depth.set(get_queue_size())

        # Time the transformation step
        start = time.time()
        chunks = transform_text(document.text)
        chunk_processing_time.observe(time.time() - start)

        # Generate embeddings
        embeddings = generate_embeddings(chunks)

        # Index
        index_documents(chunks, embeddings, document.metadata)

        # Success metric
        docs_ingested.inc()
    except Exception as e:
        # Track failures
        ingestion_errors.inc()
        logging.error(f"Pipeline failed for doc {document.id}: {e}")
        # Send to dead letter queue for retry
        send_to_dlq(document)
Alert thresholds:
- ⚠️ Warning: Error rate > 1%
- 🚨 Critical: Error rate > 5%
- ⚠️ Warning: Processing time > 2x baseline
- 🚨 Critical: Queue depth > 10,000 documents
Common Mistakes & How to Avoid Them
Mistake 1: Chunking Without Overlap
Problem: Important context gets split across chunks, reducing retrieval quality.
❌ Wrong approach:
Chunk 1: "...to meet quarterly goals. "
Chunk 2: "The sales team reported..."
The connection between "quarterly goals" and "sales team" is lost!
✅ Correct approach:
Chunk 1: "...to meet quarterly goals. The sales team reported..."
Chunk 2: "quarterly goals. The sales team reported strong performance..."
Overlap preserves context across boundaries.
Rule of thumb: Use 10-20% overlap (50-100 tokens for 500-token chunks).
Mistake 2: Ignoring Embedding Model Token Limits
Problem: Sending 10,000-token documents to a model with a 512-token limit results in silent truncation; you lose 95% of your content!
⚠️ Warning signs:
- Retrieval quality drops for longer documents
- Only first section of documents appears relevant
- Inconsistent results across document length
✅ Solution: Always chunk before embedding:
def safe_embed(text, max_tokens=512):
    # Estimate tokens (rough rule of thumb: 1 token ≈ 4 characters)
    estimated_tokens = len(text) // 4
    if estimated_tokens > max_tokens:
        raise ValueError(
            f"Text too long: ~{estimated_tokens} tokens. "
            f"Chunk it first to max {max_tokens} tokens!"
        )
    return generate_embeddings([text])[0]   # embedding helper from Example 1
Mistake 3: Mixing Embedding Models
Problem: Indexing with model A, querying with model B produces nonsense results.
Why it fails: It's like measuring your height in meters, then comparing to someone measured in feet; the numbers don't align!
Each embedding model creates a unique "semantic space" with its own coordinate system.
✅ Golden rule:
- Store the embedding model name with your index
- Always use the same model for queries
- If you change models, re-index everything
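One way to enforce the golden rule is to store the model name alongside the index and check it before every query. A sketch, assuming index metadata is kept in a simple dict (how you persist it depends on your vector database):

```python
EMBEDDING_MODEL = "text-embedding-3-small"  # must match what was used at index time

def embed_query(query: str, index_metadata: dict) -> list[float]:
    # index_metadata is assumed to be stored alongside the index,
    # e.g. {"embedding_model": "text-embedding-3-small", "dimensions": 1536}
    indexed_with = index_metadata.get("embedding_model")
    if indexed_with != EMBEDDING_MODEL:
        raise RuntimeError(
            f"Index was built with {indexed_with!r} but queries use "
            f"{EMBEDDING_MODEL!r}. Re-index or switch the query model."
        )
    return generate_embeddings([query])[0]  # helper from Example 1
```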
Mistake 4: No Monitoring or Error Handling
Problem: Pipeline fails silently, and you discover hours later that no documents were indexed.
Real-world disaster scenario:
09:00 - Embedding API key expires
09:01 - Pipeline starts failing silently
14:00 - Users report search returning no results
14:30 - Team discovers 5 hours of data lost
16:00 - Emergency re-processing begins
✅ Prevention checklist:
- ✔️ Log every stage (ingestion, transformation, embedding, indexing)
- ✔️ Set up alerts for error rate spikes
- ✔️ Implement retry logic with exponential backoff (see the sketch after this checklist)
- ✔️ Use dead letter queues for failed documents
- ✔️ Monitor pipeline latency and throughput
- ✔️ Test error scenarios regularly (API failures, malformed data)
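As a sketch of the retry and dead-letter-queue items above: `process_document` stands in for any pipeline stage that raises on failure, and `send_to_dlq` is the same assumed helper as in the monitoring example.

```python
import time
import logging

def process_with_retry(document, process_document, max_retries: int = 3):
    # Retry transient failures with exponential backoff: 1s, 2s, 4s, ...
    for attempt in range(max_retries):
        try:
            process_document(document)
            return True
        except Exception as exc:
            wait = 2 ** attempt
            logging.warning("Attempt %d failed: %s. Retrying in %ss", attempt + 1, exc, wait)
            time.sleep(wait)
    send_to_dlq(document)   # park the document for inspection and later replay
    return False
```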
Mistake 5: Over-Indexing vs. Under-Indexing
Over-indexing: Creating too many small chunks
- Each chunk lacks sufficient context
- Retrieval returns fragmented information
- Higher storage and query costs
Under-indexing: Chunks too large
- Irrelevant content mixed with relevant passages
- Wastes LLM context window on noise
- Poor retrieval precision
✅ Optimal chunk sizing guide:
| Content Type | Recommended Size | Rationale |
|---|---|---|
| Technical docs | 300-500 tokens | Code examples need context |
| Legal contracts | 400-600 tokens | Clauses reference each other |
| News articles | 200-400 tokens | Paragraphs are self-contained |
| Chat messages | 100-200 tokens | Short, conversational |
| Research papers | 500-800 tokens | Complex arguments need space |
Key Takeaways
Let's consolidate everything you've learned about data pipeline and indexing:
✅ Data pipelines automate the flow from raw data to searchable indexes through ingestion → transformation → embedding → indexing stages
✅ Choose ingestion strategy based on latency needs: batch (hours), micro-batch (minutes), or stream (seconds)
✅ Chunk intelligently: 200-500 tokens with 10-20% overlap preserves context while respecting token limits
✅ Embeddings convert text to semantic vectors; use the same model consistently for indexing and querying
✅ Index types trade off speed vs. accuracy: Flat (perfect, slow), HNSW (fast, accurate), IVF (scalable, moderate accuracy)
✅ Hybrid search combines vector similarity with keyword matching for best results on diverse queries
✅ Monitor everything: Track throughput, latency, errors, and queue depth to catch issues early
✅ Metadata filtering dramatically speeds up queries by reducing the search space before vector comparison
✅ Avoid silent failures: Implement comprehensive error handling, logging, and alerting
Memory Devices
Remember pipeline stages with ITEMS:
- Ingestion (collect data)
- Transformation (clean & chunk)
- Embedding (generate vectors)
- Metadata (add context)
- Storage (index in vector DB)
Remember index types with CHEF:
- Cheap & slow: Flat index
- High-speed graph: HNSW
- Efficient clusters: IVF
- Fast & compressed: Quantized
Try This: Hands-On Exercise
Challenge: Build a mini RAG pipeline for your own documents!
- Collect: Grab 5-10 PDF or text files (research papers, reports, articles)
- Process: Write code to extract text and chunk it (300-500 words)
- Embed: Use OpenAI API or Sentence Transformers to generate embeddings
- Index: Store in Qdrant (local) or Pinecone (cloud)
- Query: Ask natural language questions and retrieve relevant chunks
- Measure: Compare retrieval quality with vs. without overlap in chunks
Bonus: Add metadata filtering (date, source, topic) and test hybrid search!
Quick Reference Card: Pipeline & Indexing Essentials
| Ingestion Types | Batch (scheduled), Stream (real-time), Micro-batch (frequent small batches) |
| Optimal Chunk Size | 200-500 tokens with 10-20% overlap |
| Popular Embeddings | OpenAI text-embedding-3-large (3072D), Cohere embed-v3 (1024D), BGE-large (1024D) |
| Index for Speed | HNSW (graph-based, 95-99% recall) |
| Index for Scale | IVF with quantization (clusters + compression) |
| Hybrid Search | Combine BM25 (keywords) + vector similarity with RRF or weighted fusion |
| HNSW Parameters | m=16-32 (connections), ef_construct=100-200 (build quality), ef=32-128 (query accuracy) |
| Monitor These | Throughput (docs/hour), error rate (%), latency (seconds/doc), queue depth |
| Golden Rules | 1) Same embedding model always 2) Chunk before embed 3) Add overlap 4) Monitor failures |
Further Study
Ready to dive deeper? Check out these resources:
Pinecone Learning Center - https://www.pinecone.io/learn/ - Comprehensive guides on vector databases, embeddings, and RAG architectures with hands-on tutorials
LangChain Documentation: Text Splitters - https://python.langchain.com/docs/modules/data_connection/document_transformers/ - Deep dive into chunking strategies with code examples for various document types
HNSW Algorithm Explained - https://arxiv.org/abs/1603.09320 - Original research paper on Hierarchical Navigable Small World graphs (for those who want the mathematical foundation)
Congratulations! You now understand the complete data pipeline and indexing architecture that powers modern AI search and RAG systems. You're ready to build production-grade retrieval systems that can handle millions of documents and return relevant results in milliseconds!