Vector Database Architecture
Explore vector DB internals, indexing algorithms (HNSW, IVF), and storage solutions like Milvus, Qdrant, and Pinecone.
Master vector database architecture with free flashcards and interactive examples. This lesson covers vector storage mechanisms, indexing strategies, query optimization techniques, and distributed architecture patterns: essential concepts for building modern AI search and retrieval systems.
Welcome to Vector Database Architecture 🏗️
Vector databases are the backbone of modern AI applications, from semantic search to recommendation engines. Unlike traditional databases that store structured data in rows and columns, vector databases store high-dimensional embeddings and enable similarity searches at scale. Understanding their architecture is crucial for anyone building RAG systems, AI-powered search, or any application leveraging embeddings.
In this lesson, you'll learn how vector databases organize, index, and retrieve embeddings efficiently. We'll explore the core components that make sub-second similarity searches possible across millions or billions of vectors, and understand the trade-offs between accuracy, speed, and memory usage.
Core Concepts: The Foundation of Vector Databases 💡
What Are Vector Databases?
A vector database is a specialized data store optimized for storing and querying high-dimensional vectors (embeddings). These vectors typically represent semantic meaning extracted from text, images, audio, or other data types using machine learning models.
Key characteristics:
- Store vectors as arrays of floating-point numbers (typically 384-4096 dimensions)
- Support similarity search operations (finding nearest neighbors)
- Scale to millions or billions of vectors
- Provide low-latency retrieval (milliseconds, not seconds)
🤔 Did you know? A single 1536-dimensional vector (OpenAI's text-embedding-ada-002 output) contains 1536 floating-point numbers, which is about 6KB of data per embedding (1536 × 4 bytes as float32)!
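To make the core operation concrete, here is a minimal sketch of similarity search using only NumPy, with random vectors standing in for real embeddings: store vectors as rows of a matrix and return the IDs of the rows closest to a query.
import numpy as np

# Toy "vector store": 1,000 random 384-dimensional vectors (stand-ins for real embeddings)
rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 384)).astype("float32")
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # normalize for cosine similarity

query = rng.standard_normal(384).astype("float32")
query /= np.linalg.norm(query)

# Cosine similarity of the query against every stored vector, then take the top 5
scores = vectors @ query
top_ids = np.argsort(-scores)[:5]
print(top_ids, scores[top_ids])
A real vector database does the same thing conceptually, but replaces the exhaustive matrix product with an index structure so it scales to millions of vectors.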
The Three-Layer Architecture
Vector databases typically follow a three-layer architecture:
┌─────────────────────────────────────────┐
│            APPLICATION LAYER            │
│       (Query Interface, API, SDKs)      │
└────────────────────┬────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│               INDEX LAYER               │
│    (HNSW, IVF, PQ, Graph Structures)    │
└────────────────────┬────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│              STORAGE LAYER              │
│ (Vector Storage, Metadata, Persistence) │
└─────────────────────────────────────────┘
1. Application Layer 🎯
- Handles client requests and authentication
- Provides query parsing and validation
- Manages connection pooling
- Exposes REST/gRPC APIs
2. Index Layer 🔍
- Core algorithmic engine for similarity search
- Implements Approximate Nearest Neighbor (ANN) algorithms
- Manages index structures (graphs, trees, clusters)
- Balances speed vs. accuracy trade-offs
3. Storage Layer 💾
- Persists raw vectors and metadata
- Handles disk I/O and caching
- Manages compression and encoding
- Ensures data durability and replication
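As a rough sketch of how a request flows through these layers, here is a toy implementation with hypothetical class names (a brute-force list stands in for a real index; no actual product exposes exactly this API):
from dataclasses import dataclass, field

@dataclass
class StorageLayer:
    """Persists raw vectors and metadata (here: a plain in-memory dict)."""
    rows: dict = field(default_factory=dict)   # vec_id -> (vector, metadata)

    def put(self, vec_id, vector, metadata):
        self.rows[vec_id] = (vector, metadata)

@dataclass
class IndexLayer:
    """Owns the search structure; a brute-force scan stands in for HNSW/IVF here."""
    storage: StorageLayer
    ids: list = field(default_factory=list)

    def add(self, vec_id):
        self.ids.append(vec_id)

    def search(self, query, top_k):
        dist = lambda v: sum((a - b) ** 2 for a, b in zip(v, query))
        scored = [(vid, dist(self.storage.rows[vid][0])) for vid in self.ids]
        return sorted(scored, key=lambda x: x[1])[:top_k]

class ApplicationLayer:
    """Validates requests and exposes the public API, delegating to the layers below."""
    def __init__(self):
        self.storage = StorageLayer()
        self.index = IndexLayer(self.storage)

    def upsert(self, vec_id, vector, metadata=None):
        self.storage.put(vec_id, vector, metadata or {})
        self.index.add(vec_id)

    def query(self, vector, top_k=10):
        return self.index.search(vector, top_k)

db = ApplicationLayer()
db.upsert("doc1", [0.1, 0.9], {"category": "tech"})
db.upsert("doc2", [0.8, 0.2], {"category": "sports"})
print(db.query([0.9, 0.1], top_k=1))  # the closest stored vector ("doc2")
The point of the separation is that each layer can evolve independently: the API stays stable while the index algorithm or the storage backend is swapped out.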
Vector Storage Mechanisms
How vectors are physically stored dramatically impacts performance:
Raw Vector Storage
The simplest approach stores vectors as contiguous arrays:
# Conceptual representation
vector_store = {
    "id_1": [0.23, -0.45, 0.78, ...],  # 1536 dimensions
    "id_2": [0.12, 0.67, -0.34, ...],
    "id_3": [-0.56, 0.89, 0.23, ...]
}
Advantages:
- Simple implementation
- No information loss
- Exact distance calculations
Disadvantages:
- Large memory footprint (6KB per 1536-dim vector)
- Slow for large datasets
- No compression
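In practice, raw storage is usually one contiguous float32 matrix rather than a Python dict. A quick sketch of the memory arithmetic (assuming float32 and one million vectors):
import numpy as np

n_vectors, dim = 1_000_000, 1536
itemsize = np.dtype(np.float32).itemsize           # 4 bytes per value

bytes_per_vector = dim * itemsize                   # 1536 * 4 = 6144 bytes ≈ 6 KB
total_gb = n_vectors * bytes_per_vector / 1024**3   # ≈ 5.7 GB for one million vectors
print(bytes_per_vector, round(total_gb, 2))

# The store itself would simply be a contiguous row-major matrix, one row per vector:
# store = np.empty((n_vectors, dim), dtype=np.float32)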
Quantized Storage
Product Quantization (PQ) compresses vectors by:
- Splitting each vector into sub-vectors
- Clustering each sub-vector space
- Storing only cluster indices
Original vector (1536 dims, 6KB):
[0.23, -0.45, 0.78, ..., 0.12, 0.67]
                 ↓
Split into 8 sub-vectors (192 dims each):
[sub1] [sub2] [sub3] ... [sub8]
                 ↓
Quantize each sub-vector to a cluster ID (1 byte each):
[34] [128] [67] ... [201]
                 ↓
Compressed code (8 bytes total) - 768x smaller!
💡 Tip: Typical PQ settings reduce memory by 8-32x with roughly 5-10% accuracy loss; the aggressive 8-byte encoding above compresses far more, at a much larger accuracy cost. The trade-off is often worth it for billion-scale datasets.
Memory-Mapped Storage
For datasets larger than RAM, vector databases use memory mapping (mmap):
# Vectors stored on disk, accessed like memory
import mmap

with open('vectors.bin', 'r+b') as f:
    # OS handles paging automatically
    mm = mmap.mmap(f.fileno(), 0)
    # Read one vector by byte offset without loading the entire file
    # (offset = vector_index * vector_size, where vector_size = dim * 4 bytes for float32)
    vector = mm[offset:offset + vector_size]
Benefits:
- Handle datasets larger than available RAM
- OS manages caching intelligently
- Shared memory across processes
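NumPy's memmap wraps the same mechanism with array semantics. A small, self-contained sketch (the file name and sizes are illustrative):
import numpy as np

dim, n_vectors = 1536, 10_000  # small numbers so the demo file stays ~60 MB

# Create a disk-backed array; data is written to 'vectors.bin'
store = np.memmap("vectors.bin", dtype=np.float32, mode="w+", shape=(n_vectors, dim))
store[42] = np.random.randn(dim).astype(np.float32)
store.flush()

# Re-open read-only: only the pages you actually touch are paged into RAM
ro = np.memmap("vectors.bin", dtype=np.float32, mode="r", shape=(n_vectors, dim))
vector_42 = np.array(ro[42])  # copies one ~6 KB row out of the mapped file
print(vector_42[:4])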
Indexing Strategies: The Heart of Fast Search
The index structure determines search speed and accuracy. Let's explore the major approaches:
1. HNSW (Hierarchical Navigable Small World)
HNSW builds a multi-layer graph where each node is a vector:
Layer 2 (sparse):   A ─────────────────────── G
                    │                         │
Layer 1 (medium):   A ─────── D ─────── G ─── K
                    │         │         │     │
Layer 0 (dense):    A──B──C──D──E──F──G──H──I──J──K

Search starts at the top layer and zooms down toward the target
How HNSW search works:
def hnsw_search(query_vector, top_k=10):
    # Start at the index's designated entry point in the highest layer
current = entry_point
# Navigate from top layer down
for layer in range(max_layer, -1, -1):
# Greedy search: move to closest neighbor
while True:
neighbors = get_neighbors(current, layer)
closest = min(neighbors,
key=lambda n: distance(query_vector, n))
if distance(query_vector, closest) >= distance(query_vector, current):
break # Local minimum found
current = closest
# Collect top_k results from layer 0
return beam_search(current, query_vector, top_k)
HNSW characteristics:
- Speed: Very fast queries (sub-millisecond for millions of vectors)
- Accuracy: High recall (>95% with proper parameters)
- Memory: Higher overhead (stores graph connections)
- Insertions: Relatively fast dynamic updates
Best for: Real-time applications requiring both speed and accuracy
2. IVF (Inverted File Index)
IVF partitions vectors into clusters using k-means:
Step 1: Cluster vectors with k-means      Step 2: A query probes only nearby clusters

  Cluster 1        Cluster 2                   Query vector ×
   •  •  •          •  •                       Compare only Cluster 2 & 3
    •  •             •  •                      (skip Cluster 1 & 4)

  Cluster 3        Cluster 4
   •  •  •          •  •  •
    •  •             •  •
IVF search algorithm:
def ivf_search(query_vector, top_k=10, n_probe=5):
# Find closest cluster centroids
nearest_clusters = find_nearest_centroids(
query_vector,
n_probe
)
# Search only within those clusters
candidates = []
for cluster_id in nearest_clusters:
cluster_vectors = get_cluster_vectors(cluster_id)
for vector in cluster_vectors:
candidates.append({
'vector': vector,
'distance': distance(query_vector, vector)
})
# Return top_k from candidates
return sorted(candidates, key=lambda x: x['distance'])[:top_k]
IVF characteristics:
- Speed: Fast (skips most vectors)
- Accuracy: Moderate (depends on n_probe parameter)
- Memory: Efficient (simple cluster structure)
- Insertions: Slower (may require re-clustering)
Parameter trade-offs:
- n_clusters: More clusters = faster search but lower recall
- n_probe: More probes = higher recall but slower search
3. Flat Index (Brute Force)
Flat index compares query against every vector:
def flat_search(query_vector, vectors, top_k=10):
# Calculate distance to every vector
distances = [
(id, distance(query_vector, vec))
for id, vec in vectors.items()
]
# Sort and return top_k
return sorted(distances, key=lambda x: x[1])[:top_k]
When to use flat index:
- Small datasets (<10,000 vectors)
- When 100% recall is required
- Baseline for benchmarking other indexes
Distance Metrics: Measuring Similarity
Vector databases support multiple distance functions:
| Metric | Formula | Range | Use Case |
|---|---|---|---|
| Cosine Distance | 1 − (A·B)/(‖A‖·‖B‖) | [0, 2] | Text embeddings; direction matters more than magnitude |
| Euclidean (L2) | √Σ(Aᵢ−Bᵢ)² | [0, ∞) | Image embeddings; absolute distance matters |
| Dot Product | −Σ(Aᵢ·Bᵢ) | (−∞, ∞) | Normalized vectors; fastest computation |
| Manhattan (L1) | Σ\|Aᵢ−Bᵢ\| | [0, ∞) | High-dimensional sparse vectors |
Choosing the right metric:
# Example: same vectors, different metrics
from math import sqrt

vec_a = [1.0, 0.0, 0.0]
vec_b = [0.707, 0.707, 0.0]
dot = lambda a, b: sum(x * y for x, y in zip(a, b))

# Cosine: measures the angle (both vectors point in a similar direction)
cosine_dist = 1 - dot(vec_a, vec_b)  # ≈ 0.29 (similar; both are unit length here)

# Euclidean: measures absolute distance
euclidean_dist = sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))  # ≈ 0.77

# For normalized embeddings, cosine and dot product give identical rankings
# Many text embedding models (e.g., OpenAI's) return pre-normalized vectors
💡 Tip: Always use the same distance metric that your embedding model was trained with. OpenAI models use cosine similarity by default.
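A quick numeric check of the cosine/dot-product equivalence for unit-length vectors:
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(768), rng.standard_normal(768)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)  # normalize to unit length

cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot_product = np.dot(a, b)
print(np.isclose(cosine_sim, dot_product))  # True: identical for normalized vectors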
Query Optimization Techniques
Pre-filtering vs. Post-filtering
When queries include metadata filters ("find similar vectors where category='technology'"):
Post-filtering (simpler but wasteful):
def post_filter_search(query_vec, filter_conditions):
# 1. Find top 1000 nearest vectors
candidates = index.search(query_vec, k=1000)
# 2. Filter by metadata
filtered = [c for c in candidates if matches_filter(c, filter_conditions)]
# 3. Return top 10 after filtering
return filtered[:10]
Pre-filtering (efficient but complex):
def pre_filter_search(query_vec, filter_conditions):
# 1. Get IDs matching filter from metadata index
valid_ids = metadata_index.query(filter_conditions)
# 2. Search only within valid IDs
candidates = index.search(
query_vec,
k=10,
id_filter=valid_ids # Search space reduced upfront
)
return candidates
Pre-filtering advantages:
- More efficient (smaller search space)
- Predictable performance
- Better for highly selective filters
Query Result Caching
Common queries can be cached:
import hashlib
class VectorDBWithCache:
    def __init__(self, index):
        self.index = index   # underlying ANN index queried on a cache miss
        self.cache = {}      # query_hash -> results
def search(self, query_vector, top_k=10):
# Hash the query vector
query_hash = hashlib.md5(
query_vector.tobytes()
).hexdigest()
# Check cache
if query_hash in self.cache:
return self.cache[query_hash]
# Perform search
results = self.index.search(query_vector, top_k)
# Cache results
self.cache[query_hash] = results
return results
⚠️ Watch out: Cache invalidation becomes complex when vectors are frequently updated. Use time-based expiration or versioning.
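A minimal sketch of the time-based expiration idea (a plain dict keyed by query hash, with per-entry timestamps; a production system would more likely use Redis or a similar store):
import time

class TTLCache:
    """Minimal time-based cache: entries expire after ttl_s seconds."""
    def __init__(self, ttl_s=60.0):
        self.ttl_s = ttl_s
        self.entries = {}  # key -> (timestamp, value)

    def get(self, key):
        hit = self.entries.get(key)
        if hit is None:
            return None
        ts, value = hit
        if time.time() - ts > self.ttl_s:
            del self.entries[key]  # stale: results may predate recent vector updates
            return None
        return value

    def put(self, key, value):
        self.entries[key] = (time.time(), value)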
Distributed Architecture Patterns 🌐
Scaling vector databases to billions of vectors requires distribution:
Sharding Strategies
1. Hash-based Sharding:
Vector ID → Hash → Shard Assignment

   Shard 0         Shard 1         Shard 2
 (IDs: 0,3,6)    (IDs: 1,4,7)    (IDs: 2,5,8)
   10M vecs        10M vecs        10M vecs
def get_shard(vector_id, num_shards):
return hash(vector_id) % num_shards
def distributed_insert(vector_id, vector, num_shards=3):
shard_id = get_shard(vector_id, num_shards)
shard = get_shard_connection(shard_id)
shard.insert(vector_id, vector)
2. Range-based Sharding:
Vector IDs partitioned by ranges
Shard 0 Shard 1 Shard 2
IDs: 0-10M IDs: 10M-20M IDs: 20M-30M
3. Feature-based Sharding: Route vectors by metadata (e.g., shard by category, language, or date).
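Routing for the other two strategies follows the same pattern as the hash example above. A hedged sketch (the shard sizes and metadata keys are illustrative, not from any particular system):
def get_range_shard(vector_id: int, shard_size: int = 10_000_000) -> int:
    # Range-based: IDs 0..10M-1 -> shard 0, 10M..20M-1 -> shard 1, ...
    return vector_id // shard_size

def get_feature_shard(metadata: dict, shard_keys: dict) -> int:
    # Feature-based: route by a metadata attribute, e.g. language
    # shard_keys might look like {"en": 0, "es": 1, "fr": 2}
    return shard_keys[metadata["language"]]

print(get_range_shard(25_000_000))                                         # 2
print(get_feature_shard({"language": "es"}, {"en": 0, "es": 1, "fr": 2}))  # 1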
Query Routing
For distributed search:
class DistributedVectorDB:
def search(self, query_vector, top_k=10):
# Fan out query to all shards in parallel
shard_futures = []
for shard in self.shards:
future = shard.search_async(
query_vector,
top_k=top_k * 2 # Over-fetch for merge
)
shard_futures.append(future)
# Gather results from all shards
all_results = []
for future in shard_futures:
all_results.extend(future.result())
# Merge and re-rank globally
merged = sorted(
all_results,
key=lambda x: x['distance']
)[:top_k]
return merged
Performance consideration:
- Latency: Limited by slowest shard (use timeouts)
- Network: Minimize data transfer (send only top candidates)
- Load balancing: Distribute queries evenly
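One way to keep a slow shard from stalling the whole query is a per-shard deadline when gathering results. A sketch using concurrent.futures (the shard objects and their search method are assumed placeholders, not a specific client library):
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def fan_out_search(shards, query_vector, top_k=10, timeout_s=0.05):
    """Query every shard in parallel; drop shards that miss the deadline."""
    pool = ThreadPoolExecutor(max_workers=len(shards))
    futures = [pool.submit(s.search, query_vector, top_k * 2) for s in shards]

    results = []
    for future in futures:
        try:
            results.extend(future.result(timeout=timeout_s))
        except TimeoutError:
            pass  # skip the slow shard: recall drops slightly, latency stays bounded

    pool.shutdown(wait=False)  # don't block on stragglers
    # Merge and re-rank the partial results globally
    return sorted(results, key=lambda r: r["distance"])[:top_k]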
Real-World Examples 🔧
Example 1: Building a Semantic Search System
Let's implement a simple semantic search using a vector database:
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss # Facebook's vector search library
class SemanticSearchEngine:
def __init__(self, dimension=384):
# Initialize embedding model
self.model = SentenceTransformer('all-MiniLM-L6-v2')
self.dimension = dimension
# Create HNSW index
self.index = faiss.IndexHNSWFlat(dimension, 32) # 32 = M parameter
self.index.hnsw.efConstruction = 200 # Build quality
self.index.hnsw.efSearch = 100 # Search quality
# Store documents
self.documents = []
def add_documents(self, docs):
"""Add documents to the search index"""
# Generate embeddings
embeddings = self.model.encode(docs)
# Add to index
self.index.add(embeddings.astype('float32'))
# Store original documents
self.documents.extend(docs)
print(f"Added {len(docs)} documents. Total: {len(self.documents)}")
def search(self, query, top_k=5):
"""Search for similar documents"""
# Encode query
query_vec = self.model.encode([query]).astype('float32')
# Search index
distances, indices = self.index.search(query_vec, top_k)
# Format results
results = []
for dist, idx in zip(distances[0], indices[0]):
if idx < len(self.documents): # Valid index
results.append({
'document': self.documents[idx],
'similarity': 1 - dist, # Convert distance to similarity
'index': int(idx)
})
return results
# Usage example
engine = SemanticSearchEngine()
# Add knowledge base
documents = [
"Vector databases store high-dimensional embeddings",
"HNSW provides fast approximate nearest neighbor search",
"Product quantization compresses vectors to save memory",
"Cosine similarity measures the angle between vectors",
"Distributed architectures scale to billions of vectors"
]
engine.add_documents(documents)
# Search
results = engine.search("How do I compress vectors?", top_k=3)
for r in results:
print(f"Score: {r['similarity']:.3f} - {r['document']}")
Output:
Score: 0.847 - Product quantization compresses vectors to save memory
Score: 0.623 - Vector databases store high-dimensional embeddings
Score: 0.591 - Distributed architectures scale to billions of vectors
📚 Real-world analogy: Think of semantic search like finding similar books in a library. Instead of matching exact keywords ("vector" and "database"), it understands meaning: searching for "compress vectors" returns results about "product quantization" even though the words are different.
Example 2: Metadata Filtering with Pre-filtering
Implementing efficient filtered search:
import faiss
import numpy as np
from typing import List, Dict
class FilteredVectorDB:
def __init__(self, dimension=768):
self.dimension = dimension
self.index = faiss.IndexFlatL2(dimension) # Simple flat index
# Metadata storage (in production, use a proper DB)
self.metadata = [] # [{"category": "tech", "date": "2024-01-01"}, ...]
self.vectors = [] # Store vectors for reconstruction
def insert(self, vector: np.ndarray, metadata: Dict):
"""Insert vector with metadata"""
self.index.add(vector.reshape(1, -1).astype('float32'))
self.vectors.append(vector)
self.metadata.append(metadata)
def search_with_filter(
self,
query_vector: np.ndarray,
filters: Dict,
top_k: int = 10
):
"""Search with pre-filtering"""
# Step 1: Pre-filter by metadata
valid_indices = []
for idx, meta in enumerate(self.metadata):
if self._matches_filter(meta, filters):
valid_indices.append(idx)
print(f"Pre-filter: {len(valid_indices)}/{len(self.metadata)} vectors match")
if not valid_indices:
return []
# Step 2: Create filtered index
filtered_vectors = np.array(
[self.vectors[i] for i in valid_indices]
).astype('float32')
temp_index = faiss.IndexFlatL2(self.dimension)
temp_index.add(filtered_vectors)
# Step 3: Search filtered index
distances, indices = temp_index.search(
query_vector.reshape(1, -1).astype('float32'),
min(top_k, len(valid_indices))
)
# Step 4: Map back to original indices
results = []
for dist, idx in zip(distances[0], indices[0]):
original_idx = valid_indices[idx]
results.append({
'index': original_idx,
'distance': float(dist),
'metadata': self.metadata[original_idx],
'vector': self.vectors[original_idx]
})
return results
def _matches_filter(self, metadata: Dict, filters: Dict) -> bool:
"""Check if metadata matches filter conditions"""
for key, value in filters.items():
if key not in metadata:
return False
if isinstance(value, list):
if metadata[key] not in value:
return False
else:
if metadata[key] != value:
return False
return True
# Usage
db = FilteredVectorDB(dimension=384)
# Insert vectors with metadata
for i in range(100):
vec = np.random.randn(384)
meta = {
"category": np.random.choice(["tech", "sports", "politics"]),
"language": np.random.choice(["en", "es", "fr"]),
"year": np.random.choice([2022, 2023, 2024])
}
db.insert(vec, meta)
# Search with filters
query = np.random.randn(384)
results = db.search_with_filter(
query,
filters={"category": "tech", "language": "en"},
top_k=5
)
print(f"Found {len(results)} results")
for r in results[:3]:
print(f"Distance: {r['distance']:.3f}, Metadata: {r['metadata']}")
Output example:
Pre-filter: 18/100 vectors match
Found 5 results
Distance: 14.234, Metadata: {'category': 'tech', 'language': 'en', 'year': 2023}
Distance: 15.891, Metadata: {'category': 'tech', 'language': 'en', 'year': 2024}
Distance: 16.445, Metadata: {'category': 'tech', 'language': 'en', 'year': 2022}
Example 3: Product Quantization Implementation
Implementing basic PQ compression:
import numpy as np
from sklearn.cluster import KMeans
class ProductQuantizer:
def __init__(self, dimension, num_subvectors=8, num_clusters=256):
"""
Args:
dimension: Original vector dimension (must be divisible by num_subvectors)
num_subvectors: Number of sub-vectors to split into
num_clusters: Clusters per sub-vector (typically 256 for 1 byte)
"""
assert dimension % num_subvectors == 0
self.dimension = dimension
self.num_subvectors = num_subvectors
self.subvector_dim = dimension // num_subvectors
self.num_clusters = num_clusters
# Codebooks: one per sub-vector
self.codebooks = None
def fit(self, vectors):
"""Train quantizer on sample vectors"""
n_samples = len(vectors)
print(f"Training PQ on {n_samples} vectors...")
self.codebooks = []
# Train a codebook for each sub-vector
for i in range(self.num_subvectors):
start_idx = i * self.subvector_dim
end_idx = start_idx + self.subvector_dim
# Extract sub-vectors
sub_vectors = vectors[:, start_idx:end_idx]
# Cluster with k-means
kmeans = KMeans(n_clusters=self.num_clusters, random_state=42)
kmeans.fit(sub_vectors)
# Store codebook (cluster centers)
self.codebooks.append(kmeans.cluster_centers_)
print(f" Codebook {i+1}/{self.num_subvectors} trained")
def encode(self, vectors):
"""Compress vectors to codes"""
n_vectors = len(vectors)
codes = np.zeros((n_vectors, self.num_subvectors), dtype=np.uint8)
for i in range(self.num_subvectors):
start_idx = i * self.subvector_dim
end_idx = start_idx + self.subvector_dim
# Extract sub-vectors
sub_vectors = vectors[:, start_idx:end_idx]
# Find nearest cluster for each sub-vector
codebook = self.codebooks[i]
distances = np.sum(
(sub_vectors[:, None, :] - codebook[None, :, :]) ** 2,
axis=2
)
codes[:, i] = np.argmin(distances, axis=1)
return codes
def decode(self, codes):
"""Decompress codes back to approximate vectors"""
n_vectors = len(codes)
vectors = np.zeros((n_vectors, self.dimension))
for i in range(self.num_subvectors):
start_idx = i * self.subvector_dim
end_idx = start_idx + self.subvector_dim
# Look up cluster centers
vectors[:, start_idx:end_idx] = self.codebooks[i][codes[:, i]]
return vectors
def compute_compression_ratio(self):
"""Calculate compression ratio"""
original_size = self.dimension * 4 # 4 bytes per float32
compressed_size = self.num_subvectors * 1 # 1 byte per code
return original_size / compressed_size
# Example usage
np.random.seed(42)
# Generate sample vectors
train_vectors = np.random.randn(10000, 768).astype('float32')
test_vectors = np.random.randn(100, 768).astype('float32')
# Train quantizer
pq = ProductQuantizer(dimension=768, num_subvectors=8, num_clusters=256)
pq.fit(train_vectors)
# Compress test vectors
codes = pq.encode(test_vectors)
print(f"\nOriginal size: {test_vectors.nbytes / 1024:.2f} KB")
print(f"Compressed size: {codes.nbytes / 1024:.2f} KB")
print(f"Compression ratio: {pq.compute_compression_ratio():.1f}x")
# Decompress and measure error
reconstructed = pq.decode(codes)
error = np.mean((test_vectors - reconstructed) ** 2)
print(f"Mean squared error: {error:.6f}")
# Show one example
idx = 0
print(f"\nOriginal vector (first 8 dims): {test_vectors[idx, :8]}")
print(f"Codes: {codes[idx]}")
print(f"Reconstructed (first 8 dims): {reconstructed[idx, :8]}")
Output:
Training PQ on 10000 vectors...
Codebook 1/8 trained
Codebook 2/8 trained
...
Codebook 8/8 trained
Original size: 300.00 KB
Compressed size: 0.78 KB
Compression ratio: 384.0x
Mean squared error: 0.082341
Original vector (first 8 dims): [ 0.49 -0.14 1.45 -0.32 0.51 1.90 -1.72 0.67]
Codes: [128 45 201 67 134 189 23 156]
Reconstructed (first 8 dims): [ 0.51 -0.12 1.43 -0.35 0.49 1.88 -1.69 0.65]
Example 4: Query Performance Comparison
Benchmarking different index types:
import time
import numpy as np
import faiss
class IndexBenchmark:
def __init__(self, dimension=768, n_vectors=100000):
self.dimension = dimension
self.n_vectors = n_vectors
# Generate test data
print(f"Generating {n_vectors} vectors of dimension {dimension}...")
self.vectors = np.random.randn(n_vectors, dimension).astype('float32')
faiss.normalize_L2(self.vectors) # Normalize for cosine similarity
self.query_vectors = np.random.randn(100, dimension).astype('float32')
faiss.normalize_L2(self.query_vectors)
def benchmark_flat(self, top_k=10):
"""Benchmark flat (exact) index"""
print("\n=== Flat Index (Exact Search) ===")
index = faiss.IndexFlatIP(self.dimension) # Inner product (cosine)
# Build
start = time.time()
index.add(self.vectors)
build_time = time.time() - start
# Search
start = time.time()
distances, indices = index.search(self.query_vectors, top_k)
search_time = time.time() - start
print(f"Build time: {build_time:.3f}s")
print(f"Search time: {search_time*1000:.2f}ms for {len(self.query_vectors)} queries")
print(f"Per-query: {search_time*1000/len(self.query_vectors):.2f}ms")
print(f"Recall: 100% (exact)")
return indices # Ground truth for recall calculation
def benchmark_hnsw(self, top_k=10, m=32, ef_construction=200, ef_search=100):
"""Benchmark HNSW index"""
print(f"\n=== HNSW Index (M={m}, efC={ef_construction}, efS={ef_search}) ===")
index = faiss.IndexHNSWFlat(self.dimension, m)
index.hnsw.efConstruction = ef_construction
index.hnsw.efSearch = ef_search
# Build
start = time.time()
index.add(self.vectors)
build_time = time.time() - start
# Search
start = time.time()
distances, indices = index.search(self.query_vectors, top_k)
search_time = time.time() - start
print(f"Build time: {build_time:.3f}s")
print(f"Search time: {search_time*1000:.2f}ms for {len(self.query_vectors)} queries")
print(f"Per-query: {search_time*1000/len(self.query_vectors):.2f}ms")
return indices
def benchmark_ivf(self, top_k=10, n_list=100, n_probe=10):
"""Benchmark IVF index"""
print(f"\n=== IVF Index (lists={n_list}, probe={n_probe}) ===")
quantizer = faiss.IndexFlatIP(self.dimension)
index = faiss.IndexIVFFlat(quantizer, self.dimension, n_list)
# Train
start = time.time()
index.train(self.vectors)
train_time = time.time() - start
# Build
start = time.time()
index.add(self.vectors)
build_time = time.time() - start
# Search
index.nprobe = n_probe
start = time.time()
distances, indices = index.search(self.query_vectors, top_k)
search_time = time.time() - start
print(f"Train time: {train_time:.3f}s")
print(f"Build time: {build_time:.3f}s")
print(f"Search time: {search_time*1000:.2f}ms for {len(self.query_vectors)} queries")
print(f"Per-query: {search_time*1000/len(self.query_vectors):.2f}ms")
return indices
def calculate_recall(self, ground_truth, predictions, k=10):
"""Calculate recall@k"""
recall_sum = 0
for gt, pred in zip(ground_truth, predictions):
gt_set = set(gt[:k])
pred_set = set(pred[:k])
recall_sum += len(gt_set & pred_set) / k
return recall_sum / len(ground_truth)
# Run benchmark
bench = IndexBenchmark(dimension=768, n_vectors=100000)
# Get ground truth
ground_truth = bench.benchmark_flat(top_k=10)
# Test HNSW
hnsw_results = bench.benchmark_hnsw(top_k=10)
hnsw_recall = bench.calculate_recall(ground_truth, hnsw_results)
print(f"Recall@10: {hnsw_recall*100:.1f}%")
# Test IVF
ivf_results = bench.benchmark_ivf(top_k=10)
ivf_recall = bench.calculate_recall(ground_truth, ivf_results)
print(f"Recall@10: {ivf_recall*100:.1f}%")
Typical output:
=== Flat Index (Exact Search) ===
Build time: 0.012s
Search time: 1834.56ms for 100 queries
Per-query: 18.35ms
Recall: 100% (exact)
=== HNSW Index (M=32, efC=200, efS=100) ===
Build time: 8.234s
Search time: 45.23ms for 100 queries
Per-query: 0.45ms
Recall@10: 98.2%
=== IVF Index (lists=100, probe=10) ===
Train time: 0.456s
Build time: 0.234s
Search time: 123.45ms for 100 queries
Per-query: 1.23ms
Recall@10: 89.7%
🧪 Try this: Experiment with parameters! Increase ef_search in HNSW or n_probe in IVF to trade speed for accuracy, as in the sweep below.
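For example, a small sweep over ef_search using the IndexBenchmark class above (this assumes bench and ground_truth from the previous snippet are still in scope):
# Sweep HNSW search effort and watch the speed/recall trade-off shift
for ef_search in (16, 32, 64, 128, 256):
    results = bench.benchmark_hnsw(top_k=10, ef_search=ef_search)
    recall = bench.calculate_recall(ground_truth, results)
    print(f"efSearch={ef_search:3d} -> Recall@10: {recall*100:.1f}%")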
Common Mistakes to Avoid ⚠️
1. Using the Wrong Distance Metric
❌ Mistake:
# Embedding model uses cosine similarity during training
embeddings = openai_model.encode(texts)
# But you use Euclidean distance for search
index = faiss.IndexFlatL2(dimension)  # L2 = Euclidean!
index.add(embeddings)
✅ Correct:
# Match the distance metric to your embedding model
embeddings = openai_model.encode(texts)
faiss.normalize_L2(embeddings)  # Normalize for cosine
# Use inner product (equivalent to cosine for normalized vectors)
index = faiss.IndexFlatIP(dimension)  # IP = Inner Product
index.add(embeddings)
💡 Why it matters: Using the wrong distance metric can reduce recall by 20-30%!
2. Forgetting to Normalize Vectors
❌ Mistake:
vectors = np.random.randn(1000, 768).astype('float32')
index.add(vectors)  # Unnormalized!
# Cosine similarity will be incorrect
✅ Correct:
vectors = np.random.randn(1000, 768).astype('float32')
faiss.normalize_L2(vectors)  # Normalize to unit length
index.add(vectors)
3. Not Tuning Index Parameters
❌ Mistake:
# Using default parameters for all datasets
index = faiss.IndexHNSWFlat(dimension, 32)  # M=32 default
index.add(vectors)  # May be slow or inaccurate
✅ Correct:
# Tune based on dataset size and requirements
index = faiss.IndexHNSWFlat(dimension, 64)  # Higher M for accuracy
index.hnsw.efConstruction = 400  # More effort during build
index.hnsw.efSearch = 200        # More effort during search
index.add(vectors)
Parameter guidelines:
- Small dataset (<100K): M=32, efC=200, efS=50
- Medium (100K-10M): M=64, efC=400, efS=100
- Large (>10M): M=96, efC=800, efS=200
4. Inefficient Batch Processing
❌ Mistake:
# Inserting vectors one at a time
for vector in vectors:
    index.add(vector.reshape(1, -1))  # Very slow!
✅ Correct:
# Batch insert for 10-100x speedup
index.add(vectors)  # Add all at once

# Or use reasonable batch sizes
for i in range(0, len(vectors), 1000):
    batch = vectors[i:i+1000]
    index.add(batch)
5. Ignoring Memory Constraints
❌ Mistake:
# Loading 10M vectors (768 dims) into RAM
vectors = np.random.randn(10_000_000, 768)  # ~30GB even as float32 (randn returns float64, which doubles that)!
index = faiss.IndexHNSWFlat(768, 32)  # the index keeps its own copy of the vectors plus graph links
index.add(vectors)  # Out of memory!
✅ Correct:
# Use Product Quantization for compression
index = faiss.IndexHNSWPQ(768, 32, 8, 8)  # HNSW graph built over PQ-compressed codes
index.train(training_vectors)
index.add(vectors)  # Uses ~4GB instead of ~60GB
6. Not Pre-filtering Metadata
❌ Mistake:
# Post-filtering wastes computation
results = index.search(query, k=10000)  # Retrieve 10K vectors
filtered = [r for r in results if r['category'] == 'tech'][:10]  # Discard 99%
✅ Correct:
# Pre-filter to search only relevant vectors
tech_ids = metadata_db.query("category = 'tech'")  # Get matching IDs upfront
results = index.search_with_filter(query, k=10, id_filter=tech_ids)
Key Takeaways 🎯
📌 Quick Reference Card
| Concept | Key Points |
|---|---|
| Vector Storage | Raw arrays (simple), PQ (compressed 8-32x), memory mapping (for large datasets) |
| Index Types | Flat (exact, slow), HNSW (fast+accurate), IVF (fast+moderate), PQ (memory-efficient) |
| Distance Metrics | Cosine (text), Euclidean (images), Dot Product (fastest for normalized) |
| HNSW Parameters | M (connections), efConstruction (build quality), efSearch (query quality) |
| IVF Parameters | n_list (clusters), n_probe (clusters searched) |
| Trade-offs | Speed ↔ Accuracy ↔ Memory: optimize for your bottleneck |
| Filtering | Pre-filter (efficient) beats post-filter (wasteful) |
| Distribution | Hash/range/feature sharding, parallel fan-out queries |
🧠 Memory Device: "HIVE" for Index Selection
- HNSW: High-performance (real-time apps)
- IVF: Intermediate (moderate scale)
- Vector PQ: Volume reduction (billions of vectors)
- Exact (Flat): Everything precise (small datasets)
Performance Rules of Thumb:
- < 10K vectors: Use Flat (exact search)
- 10K - 10M vectors: Use HNSW
- 10M - 1B vectors: Use HNSW + PQ or IVF + PQ
- Filtering > 90% data: Pre-filter essential
- Real-time (<10ms): HNSW with high parameters
📚 Further Study
- FAISS Documentation: https://github.com/facebookresearch/faiss/wiki - Comprehensive guide to Facebook's vector search library with tutorials and benchmarks
- Pinecone Learning Center: https://www.pinecone.io/learn/ - In-depth articles on vector database concepts, RAG architectures, and production best practices
- Approximate Nearest Neighbors Benchmarks: https://ann-benchmarks.com/ - Interactive comparison of ANN algorithms across different datasets and metrics
Congratulations! 🎉 You now understand the core architecture of vector databases: from storage mechanisms and indexing strategies to distributed patterns. You're ready to design and optimize vector search systems for modern AI applications. Practice implementing these concepts with real embeddings, experiment with different index types, and always benchmark on your actual data before production deployment!