
Vector Database Architecture

Explore vector DB internals, indexing algorithms (HNSW, IVF), and storage solutions like Milvus, Qdrant, and Pinecone.

Master vector database architecture with free flashcards and interactive examples. This lesson covers vector storage mechanisms, indexing strategies, query optimization techniques, and distributed architecture patterns: essential concepts for building modern AI search and retrieval systems.

Welcome to Vector Database Architecture 🗄️

Vector databases are the backbone of modern AI applications, from semantic search to recommendation engines. Unlike traditional databases that store structured data in rows and columns, vector databases store high-dimensional embeddings and enable similarity searches at scale. Understanding their architecture is crucial for anyone building RAG systems, AI-powered search, or any application leveraging embeddings.

In this lesson, you'll learn how vector databases organize, index, and retrieve embeddings efficiently. We'll explore the core components that make sub-second similarity searches possible across millions or billions of vectors, and understand the trade-offs between accuracy, speed, and memory usage.

Core Concepts: The Foundation of Vector Databases 💡

What Are Vector Databases?

A vector database is a specialized data store optimized for storing and querying high-dimensional vectors (embeddings). These vectors typically represent semantic meaning extracted from text, images, audio, or other data types using machine learning models.

Key characteristics:

  • Store vectors as arrays of floating-point numbers (typically 384-4096 dimensions)
  • Support similarity search operations (finding nearest neighbors)
  • Scale to millions or billions of vectors
  • Provide low-latency retrieval (milliseconds, not seconds)

🤔 Did you know? A single 1536-dimensional vector (OpenAI's text-embedding-ada-002 output) contains 1536 floating-point numbers: that's about 6 KB of data per embedding!
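
The arithmetic is easy to check yourself; a quick sketch (the dimension counts are examples, not limits):

def embedding_size_bytes(dimensions: int, bytes_per_float: int = 4) -> int:
    # float32 is the usual storage type: 4 bytes per dimension
    return dimensions * bytes_per_float

print(embedding_size_bytes(1536))                      # 6144 bytes, about 6 KB per vector
print(embedding_size_bytes(1536) * 1_000_000 / 1e9)    # about 6.1 GB for a million vectors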

The Three-Layer Architecture

Vector databases typically follow a three-layer architecture:

┌─────────────────────────────────────────┐
│         APPLICATION LAYER               │
│  (Query Interface, API, SDKs)           │
└────────────────┬────────────────────────┘
                 │
                 ↓
┌─────────────────────────────────────────┐
│         INDEX LAYER                     │
│  (HNSW, IVF, PQ, Graph Structures)      │
└────────────────┬────────────────────────┘
                 │
                 ↓
┌─────────────────────────────────────────┐
│         STORAGE LAYER                   │
│  (Vector Storage, Metadata, Persistence)│
└─────────────────────────────────────────┘

1. Application Layer 🎯

  • Handles client requests and authentication
  • Provides query parsing and validation
  • Manages connection pooling
  • Exposes REST/gRPC APIs

2. Index Layer 🔍

  • Core algorithmic engine for similarity search
  • Implements Approximate Nearest Neighbor (ANN) algorithms
  • Manages index structures (graphs, trees, clusters)
  • Balances speed vs. accuracy trade-offs

3. Storage Layer 💾

  • Persists raw vectors and metadata
  • Handles disk I/O and caching
  • Manages compression and encoding
  • Ensures data durability and replication
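
To make the separation of concerns concrete, here is a toy sketch of the three layers in plain Python (all class and method names are invented for illustration; real systems are far more involved):

def l2_sq(a, b):
    # Squared Euclidean distance between two equal-length lists
    return sum((x - y) ** 2 for x, y in zip(a, b))

class StorageLayer:
    # Persists vectors; durability, caching, and replication would live here
    def __init__(self):
        self.vectors = {}

    def put(self, vec_id, vector):
        self.vectors[vec_id] = vector

class IndexLayer:
    # A real index (HNSW, IVF, ...) would avoid this exhaustive scan
    def __init__(self, storage):
        self.storage = storage

    def search(self, query, top_k=10):
        scored = sorted((l2_sq(query, v), vid) for vid, v in self.storage.vectors.items())
        return [vid for _, vid in scored[:top_k]]

class ApplicationLayer:
    # Request validation, auth, and the public API would live here
    def __init__(self, index):
        self.index = index

    def handle_query(self, request):
        return self.index.search(request["vector"], request.get("top_k", 10))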

Vector Storage Mechanisms

How vectors are physically stored dramatically impacts performance:

Raw Vector Storage

The simplest approach stores vectors as contiguous arrays:

# Conceptual representation
vector_store = {
    "id_1": [0.23, -0.45, 0.78, ...],  # 1536 dimensions
    "id_2": [0.12, 0.67, -0.34, ...],
    "id_3": [-0.56, 0.89, 0.23, ...]
}

Advantages:

  • Simple implementation
  • No information loss
  • Exact distance calculations

Disadvantages:

  • Large memory footprint (6KB per 1536-dim vector)
  • Slow for large datasets
  • No compression

Quantized Storage

Product Quantization (PQ) compresses vectors by:

  1. Splitting each vector into sub-vectors
  2. Clustering each sub-vector space
  3. Storing only cluster indices

Original Vector (1536 dims, 6KB):
[0.23, -0.45, 0.78, ..., 0.12, 0.67]
         ↓
Split into 8 sub-vectors (192 dims each):
[sub1] [sub2] [sub3] ... [sub8]
         ↓
Quantize to cluster IDs (1 byte each):
[34] [128] [67] ... [201]
         ↓
Compressed (8 bytes total): 768x smaller!

💡 Tip: PQ reduces memory by 8-32x but introduces ~5-10% accuracy loss. This trade-off is often worth it for billion-scale datasets.

Memory-Mapped Storage

For datasets larger than RAM, vector databases use memory mapping (mmap):

# Vectors stored on disk, accessed like memory
import mmap

with open('vectors.bin', 'r+b') as f:
    # OS handles paging automatically
    mm = mmap.mmap(f.fileno(), 0)
    # Access vector at position without loading entire file
    vector = mm[offset:offset+vector_size]

Benefits:

  • Handle datasets larger than available RAM
  • OS manages caching intelligently
  • Shared memory across processes
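
NumPy's memmap wraps the same mechanism in an array interface; a minimal sketch, assuming a file of contiguous float32 vectors (the filename and dimension are illustrative):

import numpy as np

dim = 1536
# vectors.bin is assumed to hold N float32 vectors of dimension 1536, back to back
vectors = np.memmap('vectors.bin', dtype='float32', mode='r').reshape(-1, dim)

# Only the pages containing row 42 are pulled from disk by the OS
print(vectors[42][:5])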

Indexing Strategies

The index structure determines search speed and accuracy. Let's explore the major approaches:

1. HNSW (Hierarchical Navigable Small World)

HNSW builds a multi-layer graph where each node is a vector:

Layer 2 (sparse):     A ←→ G
                      ↕     ↕
Layer 1 (medium):     A ←→ D ←→ G ←→ K
                      ↕     ↕     ↕     ↕
Layer 0 (dense):      A←→B←→C←→D←→E←→F←→G←→H←→I←→J←→K
                      
Search starts at top layer, zooms down to target

How HNSW search works:

def hnsw_search(query_vector, top_k=10):
    # Start at the index's entry point in the highest layer
    current = entry_point
    
    # Navigate from top layer down
    for layer in range(max_layer, -1, -1):
        # Greedy search: move to closest neighbor
        while True:
            neighbors = get_neighbors(current, layer)
            closest = min(neighbors, 
                         key=lambda n: distance(query_vector, n))
            
            if distance(query_vector, closest) >= distance(query_vector, current):
                break  # Local minimum found
            current = closest
    
    # Collect top_k results from layer 0
    return beam_search(current, query_vector, top_k)

HNSW characteristics:

  • Speed: Very fast queries (sub-millisecond for millions of vectors)
  • Accuracy: High recall (>95% with proper parameters)
  • Memory: Higher overhead (stores graph connections)
  • Insertions: Relatively fast dynamic updates

Best for: Real-time applications requiring both speed and accuracy

2. IVF (Inverted File Index)

IVF partitions vectors into clusters using k-means:

Step 1: Cluster vectors          Step 2: Query searches nearby clusters

    Cluster 1                        Query Vector
    •  • •                                ★
   •     •                              ╱   ╲
    •  •                              ╱       ╲
              Cluster 2             ╱  Compare  ╲
              •  •  •             ╱    only      ╲
             •      •           Cluster 2 & 3
              •  •            (skip Cluster 1 & 4)
    Cluster 3
    • •  •                             Cluster 4
   •    •                               • •  •
    • •                                •    •
                                        • •

IVF search algorithm:

def ivf_search(query_vector, top_k=10, n_probe=5):
    # Find closest cluster centroids
    nearest_clusters = find_nearest_centroids(
        query_vector, 
        n_probe
    )
    
    # Search only within those clusters
    candidates = []
    for cluster_id in nearest_clusters:
        cluster_vectors = get_cluster_vectors(cluster_id)
        for vector in cluster_vectors:
            candidates.append({
                'vector': vector,
                'distance': distance(query_vector, vector)
            })
    
    # Return top_k from candidates
    return sorted(candidates, key=lambda x: x['distance'])[:top_k]

IVF characteristics:

  • Speed: Fast (skips most vectors)
  • Accuracy: Moderate (depends on n_probe parameter)
  • Memory: Efficient (simple cluster structure)
  • Insertions: Slower (may require re-clustering)

Parameter trade-offs:

  • n_clusters: More clusters = faster search but lower recall
  • n_probe: More probes = higher recall but slower search
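
To feel these knobs in practice, here is a small FAISS sketch that sweeps n_probe on synthetic data (the dataset sizes, 256 clusters, and probe values are arbitrary choices for illustration):

import numpy as np
import faiss

d = 128
xb = np.random.randn(100_000, d).astype('float32')   # database vectors
xq = np.random.randn(10, d).astype('float32')        # query vectors

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, 256)         # 256 clusters
index.train(xb)
index.add(xb)

for n_probe in (1, 8, 32):
    index.nprobe = n_probe                            # clusters visited per query
    distances, ids = index.search(xq, 10)
    print(n_probe, ids[0][:3])                        # recall improves as n_probe grows
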
3. Flat Index (Brute Force)

Flat index compares query against every vector:

def flat_search(query_vector, vectors, top_k=10):
    # Calculate distance to every vector
    distances = [
        (id, distance(query_vector, vec)) 
        for id, vec in vectors.items()
    ]
    
    # Sort and return top_k
    return sorted(distances, key=lambda x: x[1])[:top_k]

When to use flat index:

  • Small datasets (<10,000 vectors)
  • When 100% recall is required
  • Baseline for benchmarking other indexes

Distance Metrics: Measuring Similarity

Vector databases support multiple distance functions:

Metric              Formula                      Range      Use Case
Cosine Similarity   1 - (A·B)/(||A||·||B||)      [0, 2]     Text embeddings, direction matters more than magnitude
Euclidean (L2)      √Σ(Aᵢ-Bᵢ)²                   [0, ∞)     Image embeddings, absolute distance matters
Dot Product         -Σ(Aᵢ·Bᵢ)                    (-∞, ∞)    Normalized vectors, fastest computation
Manhattan (L1)      Σ|Aᵢ-Bᵢ|                     [0, ∞)     High-dimensional sparse vectors

Choosing the right metric:

# Example: same vectors, different metrics
from math import sqrt
from numpy import dot

vec_a = [1.0, 0.0, 0.0]
vec_b = [0.707, 0.707, 0.0]

# Cosine: measures angle (both point in a similar direction)
cosine_dist = 1 - dot(vec_a, vec_b)  # ≈ 0.29 (similar)

# Euclidean: measures absolute distance
euclidean_dist = sqrt(sum((a - b)**2 for a, b in zip(vec_a, vec_b)))  # ≈ 0.76

# For normalized embeddings, cosine and dot product are equivalent
# Most text embeddings (OpenAI, Cohere) are pre-normalized

💡 Tip: Always use the same distance metric that your embedding model was trained with. OpenAI models use cosine similarity by default.

Query Optimization Techniques

Pre-filtering vs. Post-filtering

When queries include metadata filters ("find similar vectors where category='technology'"):

Post-filtering (simpler but wasteful):

def post_filter_search(query_vec, filter_conditions):
    # 1. Find top 1000 nearest vectors
    candidates = index.search(query_vec, k=1000)
    
    # 2. Filter by metadata
    filtered = [c for c in candidates if matches_filter(c, filter_conditions)]
    
    # 3. Return top 10 after filtering
    return filtered[:10]

Pre-filtering (efficient but complex):

def pre_filter_search(query_vec, filter_conditions):
    # 1. Get IDs matching filter from metadata index
    valid_ids = metadata_index.query(filter_conditions)
    
    # 2. Search only within valid IDs
    candidates = index.search(
        query_vec, 
        k=10,
        id_filter=valid_ids  # Search space reduced upfront
    )
    
    return candidates

Pre-filtering advantages:

  • More efficient (smaller search space)
  • Predictable performance
  • Better for highly selective filters

Query Result Caching

Common queries can be cached:

import hashlib

class VectorDBWithCache:
    def __init__(self):
        self.cache = {}  # query_hash -> results
    
    def search(self, query_vector, top_k=10):
        # Hash the query vector
        query_hash = hashlib.md5(
            query_vector.tobytes()
        ).hexdigest()
        
        # Check cache
        if query_hash in self.cache:
            return self.cache[query_hash]
        
        # Perform search
        results = self.index.search(query_vector, top_k)
        
        # Cache results
        self.cache[query_hash] = results
        return results

⚠️ Watch out: Cache invalidation becomes complex when vectors are frequently updated. Use time-based expiration or versioning.
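
A simple mitigation is to attach a timestamp to each cached entry and treat stale entries as misses; a minimal sketch, assuming a fixed TTL (the 60-second value is arbitrary):

import time

class TTLQueryCache:
    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self.entries = {}  # query_hash -> (timestamp, results)

    def get(self, query_hash):
        entry = self.entries.get(query_hash)
        if entry is None:
            return None
        timestamp, results = entry
        if time.time() - timestamp > self.ttl:
            del self.entries[query_hash]   # expired: force a fresh search
            return None
        return results

    def put(self, query_hash, results):
        self.entries[query_hash] = (time.time(), results)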

Distributed Architecture Patterns 🌐

Scaling vector databases to billions of vectors requires distribution:

Sharding Strategies

1. Hash-based Sharding:

Vector ID β†’ Hash β†’ Shard Assignment

    Shard 0        Shard 1        Shard 2
  (IDs: 0,3,6)   (IDs: 1,4,7)   (IDs: 2,5,8)
     10M vecs       10M vecs       10M vecs

def get_shard(vector_id, num_shards):
    return hash(vector_id) % num_shards

def distributed_insert(vector_id, vector, num_shards=3):
    shard_id = get_shard(vector_id, num_shards)
    shard = get_shard_connection(shard_id)
    shard.insert(vector_id, vector)

2. Range-based Sharding:

Vector IDs partitioned by ranges

    Shard 0           Shard 1           Shard 2
  IDs: 0-10M        IDs: 10M-20M      IDs: 20M-30M

3. Feature-based Sharding: Route vectors by metadata (e.g., shard by category, language, or date).
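
A minimal sketch of the latter two strategies (the function names, the 10M-ID range size, and the metadata keys are assumptions for illustration):

RANGE_SIZE = 10_000_000

def get_shard_by_range(vector_id: int) -> int:
    # IDs 0-10M -> shard 0, 10M-20M -> shard 1, ...
    return vector_id // RANGE_SIZE

def get_shard_by_feature(metadata: dict, shard_map: dict) -> int:
    # Route by a metadata field, e.g. {"en": 0, "es": 1, "fr": 2}
    return shard_map[metadata["language"]]

print(get_shard_by_range(25_000_000))                                          # 2
print(get_shard_by_feature({"language": "es"}, {"en": 0, "es": 1, "fr": 2}))   # 1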

Query Routing

For distributed search:

class DistributedVectorDB:
    def search(self, query_vector, top_k=10):
        # Fan out query to all shards in parallel
        shard_futures = []
        for shard in self.shards:
            future = shard.search_async(
                query_vector, 
                top_k=top_k * 2  # Over-fetch for merge
            )
            shard_futures.append(future)
        
        # Gather results from all shards
        all_results = []
        for future in shard_futures:
            all_results.extend(future.result())
        
        # Merge and re-rank globally
        merged = sorted(
            all_results, 
            key=lambda x: x['distance']
        )[:top_k]
        
        return merged

Performance consideration:

  • Latency: Limited by slowest shard (use timeouts)
  • Network: Minimize data transfer (send only top candidates)
  • Load balancing: Distribute queries evenly
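
For the timeout point, the standard library's concurrent.futures can drop shards that miss the latency budget; a rough sketch (shard.search, the result format, and the 50 ms budget are assumptions carried over from the example above):

from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError

def fan_out_search(shards, query_vector, top_k=10, timeout_s=0.05):
    pool = ThreadPoolExecutor(max_workers=len(shards))
    futures = [pool.submit(shard.search, query_vector, top_k) for shard in shards]
    results = []
    try:
        for future in as_completed(futures, timeout=timeout_s):
            results.extend(future.result())
    except TimeoutError:
        pass  # shards that miss the latency budget are skipped
    pool.shutdown(wait=False, cancel_futures=True)
    # Merge and keep the global top_k across the shards that answered
    return sorted(results, key=lambda r: r['distance'])[:top_k]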

Real-World Examples 🔧

Example 1: Building a Semantic Search System

Let's implement a simple semantic search using a vector database:

import numpy as np
from sentence_transformers import SentenceTransformer
import faiss  # Facebook's vector search library

class SemanticSearchEngine:
    def __init__(self, dimension=384):
        # Initialize embedding model
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.dimension = dimension
        
        # Create HNSW index
        self.index = faiss.IndexHNSWFlat(dimension, 32)  # 32 = M parameter
        self.index.hnsw.efConstruction = 200  # Build quality
        self.index.hnsw.efSearch = 100        # Search quality
        
        # Store documents
        self.documents = []
    
    def add_documents(self, docs):
        """Add documents to the search index"""
        # Generate embeddings
        embeddings = self.model.encode(docs)
        
        # Add to index
        self.index.add(embeddings.astype('float32'))
        
        # Store original documents
        self.documents.extend(docs)
        
        print(f"Added {len(docs)} documents. Total: {len(self.documents)}")
    
    def search(self, query, top_k=5):
        """Search for similar documents"""
        # Encode query
        query_vec = self.model.encode([query]).astype('float32')
        
        # Search index
        distances, indices = self.index.search(query_vec, top_k)
        
        # Format results
        results = []
        for dist, idx in zip(distances[0], indices[0]):
            if idx < len(self.documents):  # Valid index
                results.append({
                    'document': self.documents[idx],
                    'similarity': 1 - dist,  # Rough score: the index returns squared L2 distances (smaller = closer)
                    'index': int(idx)
                })
        
        return results

# Usage example
engine = SemanticSearchEngine()

# Add knowledge base
documents = [
    "Vector databases store high-dimensional embeddings",
    "HNSW provides fast approximate nearest neighbor search",
    "Product quantization compresses vectors to save memory",
    "Cosine similarity measures the angle between vectors",
    "Distributed architectures scale to billions of vectors"
]
engine.add_documents(documents)

# Search
results = engine.search("How do I compress vectors?", top_k=3)
for r in results:
    print(f"Score: {r['similarity']:.3f} - {r['document']}")

Output:

Score: 0.847 - Product quantization compresses vectors to save memory
Score: 0.623 - Vector databases store high-dimensional embeddings
Score: 0.591 - Distributed architectures scale to billions of vectors

🌍 Real-world analogy: Think of semantic search like finding similar books in a library. Instead of matching exact keywords ("vector" and "database"), it understands meaningβ€”searching for "compress vectors" returns results about "product quantization" even though the words are different.

Example 2: Metadata Filtering with Pre-filtering

Implementing efficient filtered search:

import faiss
import numpy as np
from typing import List, Dict

class FilteredVectorDB:
    def __init__(self, dimension=768):
        self.dimension = dimension
        self.index = faiss.IndexFlatL2(dimension)  # Simple flat index
        
        # Metadata storage (in production, use a proper DB)
        self.metadata = []  # [{"category": "tech", "date": "2024-01-01"}, ...]
        self.vectors = []   # Store vectors for reconstruction
    
    def insert(self, vector: np.ndarray, metadata: Dict):
        """Insert vector with metadata"""
        self.index.add(vector.reshape(1, -1).astype('float32'))
        self.vectors.append(vector)
        self.metadata.append(metadata)
    
    def search_with_filter(
        self, 
        query_vector: np.ndarray, 
        filters: Dict,
        top_k: int = 10
    ):
        """Search with pre-filtering"""
        # Step 1: Pre-filter by metadata
        valid_indices = []
        for idx, meta in enumerate(self.metadata):
            if self._matches_filter(meta, filters):
                valid_indices.append(idx)
        
        print(f"Pre-filter: {len(valid_indices)}/{len(self.metadata)} vectors match")
        
        if not valid_indices:
            return []
        
        # Step 2: Create filtered index
        filtered_vectors = np.array(
            [self.vectors[i] for i in valid_indices]
        ).astype('float32')
        
        temp_index = faiss.IndexFlatL2(self.dimension)
        temp_index.add(filtered_vectors)
        
        # Step 3: Search filtered index
        distances, indices = temp_index.search(
            query_vector.reshape(1, -1).astype('float32'),
            min(top_k, len(valid_indices))
        )
        
        # Step 4: Map back to original indices
        results = []
        for dist, idx in zip(distances[0], indices[0]):
            original_idx = valid_indices[idx]
            results.append({
                'index': original_idx,
                'distance': float(dist),
                'metadata': self.metadata[original_idx],
                'vector': self.vectors[original_idx]
            })
        
        return results
    
    def _matches_filter(self, metadata: Dict, filters: Dict) -> bool:
        """Check if metadata matches filter conditions"""
        for key, value in filters.items():
            if key not in metadata:
                return False
            if isinstance(value, list):
                if metadata[key] not in value:
                    return False
            else:
                if metadata[key] != value:
                    return False
        return True

# Usage
db = FilteredVectorDB(dimension=384)

# Insert vectors with metadata
for i in range(100):
    vec = np.random.randn(384)
    meta = {
        "category": np.random.choice(["tech", "sports", "politics"]),
        "language": np.random.choice(["en", "es", "fr"]),
        "year": np.random.choice([2022, 2023, 2024])
    }
    db.insert(vec, meta)

# Search with filters
query = np.random.randn(384)
results = db.search_with_filter(
    query,
    filters={"category": "tech", "language": "en"},
    top_k=5
)

print(f"Found {len(results)} results")
for r in results[:3]:
    print(f"Distance: {r['distance']:.3f}, Metadata: {r['metadata']}")

Output example:

Pre-filter: 18/100 vectors match
Found 5 results
Distance: 14.234, Metadata: {'category': 'tech', 'language': 'en', 'year': 2023}
Distance: 15.891, Metadata: {'category': 'tech', 'language': 'en', 'year': 2024}
Distance: 16.445, Metadata: {'category': 'tech', 'language': 'en', 'year': 2022}

Example 3: Product Quantization Implementation

Implementing basic PQ compression:

import numpy as np
from sklearn.cluster import KMeans

class ProductQuantizer:
    def __init__(self, dimension, num_subvectors=8, num_clusters=256):
        """
        Args:
            dimension: Original vector dimension (must be divisible by num_subvectors)
            num_subvectors: Number of sub-vectors to split into
            num_clusters: Clusters per sub-vector (typically 256 for 1 byte)
        """
        assert dimension % num_subvectors == 0
        
        self.dimension = dimension
        self.num_subvectors = num_subvectors
        self.subvector_dim = dimension // num_subvectors
        self.num_clusters = num_clusters
        
        # Codebooks: one per sub-vector
        self.codebooks = None
    
    def fit(self, vectors):
        """Train quantizer on sample vectors"""
        n_samples = len(vectors)
        print(f"Training PQ on {n_samples} vectors...")
        
        self.codebooks = []
        
        # Train a codebook for each sub-vector
        for i in range(self.num_subvectors):
            start_idx = i * self.subvector_dim
            end_idx = start_idx + self.subvector_dim
            
            # Extract sub-vectors
            sub_vectors = vectors[:, start_idx:end_idx]
            
            # Cluster with k-means
            kmeans = KMeans(n_clusters=self.num_clusters, random_state=42)
            kmeans.fit(sub_vectors)
            
            # Store codebook (cluster centers)
            self.codebooks.append(kmeans.cluster_centers_)
            
            print(f"  Codebook {i+1}/{self.num_subvectors} trained")
    
    def encode(self, vectors):
        """Compress vectors to codes"""
        n_vectors = len(vectors)
        codes = np.zeros((n_vectors, self.num_subvectors), dtype=np.uint8)
        
        for i in range(self.num_subvectors):
            start_idx = i * self.subvector_dim
            end_idx = start_idx + self.subvector_dim
            
            # Extract sub-vectors
            sub_vectors = vectors[:, start_idx:end_idx]
            
            # Find nearest cluster for each sub-vector
            codebook = self.codebooks[i]
            distances = np.sum(
                (sub_vectors[:, None, :] - codebook[None, :, :]) ** 2,
                axis=2
            )
            codes[:, i] = np.argmin(distances, axis=1)
        
        return codes
    
    def decode(self, codes):
        """Decompress codes back to approximate vectors"""
        n_vectors = len(codes)
        vectors = np.zeros((n_vectors, self.dimension))
        
        for i in range(self.num_subvectors):
            start_idx = i * self.subvector_dim
            end_idx = start_idx + self.subvector_dim
            
            # Look up cluster centers
            vectors[:, start_idx:end_idx] = self.codebooks[i][codes[:, i]]
        
        return vectors
    
    def compute_compression_ratio(self):
        """Calculate compression ratio"""
        original_size = self.dimension * 4  # 4 bytes per float32
        compressed_size = self.num_subvectors * 1  # 1 byte per code
        return original_size / compressed_size

# Example usage
np.random.seed(42)

# Generate sample vectors
train_vectors = np.random.randn(10000, 768).astype('float32')
test_vectors = np.random.randn(100, 768).astype('float32')

# Train quantizer
pq = ProductQuantizer(dimension=768, num_subvectors=8, num_clusters=256)
pq.fit(train_vectors)

# Compress test vectors
codes = pq.encode(test_vectors)
print(f"\nOriginal size: {test_vectors.nbytes / 1024:.2f} KB")
print(f"Compressed size: {codes.nbytes / 1024:.2f} KB")
print(f"Compression ratio: {pq.compute_compression_ratio():.1f}x")

# Decompress and measure error
reconstructed = pq.decode(codes)
error = np.mean((test_vectors - reconstructed) ** 2)
print(f"Mean squared error: {error:.6f}")

# Show one example
idx = 0
print(f"\nOriginal vector (first 8 dims): {test_vectors[idx, :8]}")
print(f"Codes: {codes[idx]}")
print(f"Reconstructed (first 8 dims): {reconstructed[idx, :8]}")

Output:

Training PQ on 10000 vectors...
  Codebook 1/8 trained
  Codebook 2/8 trained
  ...
  Codebook 8/8 trained

Original size: 300.00 KB
Compressed size: 0.78 KB
Compression ratio: 384.0x
Mean squared error: 0.082341

Original vector (first 8 dims): [ 0.49  -0.14   1.45  -0.32   0.51   1.90  -1.72   0.67]
Codes: [128  45 201  67 134 189  23 156]
Reconstructed (first 8 dims): [ 0.51  -0.12   1.43  -0.35   0.49   1.88  -1.69   0.65]

Example 4: Query Performance Comparison

Benchmarking different index types:

import time
import numpy as np
import faiss

class IndexBenchmark:
    def __init__(self, dimension=768, n_vectors=100000):
        self.dimension = dimension
        self.n_vectors = n_vectors
        
        # Generate test data
        print(f"Generating {n_vectors} vectors of dimension {dimension}...")
        self.vectors = np.random.randn(n_vectors, dimension).astype('float32')
        faiss.normalize_L2(self.vectors)  # Normalize for cosine similarity
        
        self.query_vectors = np.random.randn(100, dimension).astype('float32')
        faiss.normalize_L2(self.query_vectors)
    
    def benchmark_flat(self, top_k=10):
        """Benchmark flat (exact) index"""
        print("\n=== Flat Index (Exact Search) ===")
        index = faiss.IndexFlatIP(self.dimension)  # Inner product (cosine)
        
        # Build
        start = time.time()
        index.add(self.vectors)
        build_time = time.time() - start
        
        # Search
        start = time.time()
        distances, indices = index.search(self.query_vectors, top_k)
        search_time = time.time() - start
        
        print(f"Build time: {build_time:.3f}s")
        print(f"Search time: {search_time*1000:.2f}ms for {len(self.query_vectors)} queries")
        print(f"Per-query: {search_time*1000/len(self.query_vectors):.2f}ms")
        print(f"Recall: 100% (exact)")
        
        return indices  # Ground truth for recall calculation
    
    def benchmark_hnsw(self, top_k=10, m=32, ef_construction=200, ef_search=100):
        """Benchmark HNSW index"""
        print(f"\n=== HNSW Index (M={m}, efC={ef_construction}, efS={ef_search}) ===")
        index = faiss.IndexHNSWFlat(self.dimension, m)
        index.hnsw.efConstruction = ef_construction
        index.hnsw.efSearch = ef_search
        
        # Build
        start = time.time()
        index.add(self.vectors)
        build_time = time.time() - start
        
        # Search
        start = time.time()
        distances, indices = index.search(self.query_vectors, top_k)
        search_time = time.time() - start
        
        print(f"Build time: {build_time:.3f}s")
        print(f"Search time: {search_time*1000:.2f}ms for {len(self.query_vectors)} queries")
        print(f"Per-query: {search_time*1000/len(self.query_vectors):.2f}ms")
        
        return indices
    
    def benchmark_ivf(self, top_k=10, n_list=100, n_probe=10):
        """Benchmark IVF index"""
        print(f"\n=== IVF Index (lists={n_list}, probe={n_probe}) ===")
        quantizer = faiss.IndexFlatIP(self.dimension)
        index = faiss.IndexIVFFlat(quantizer, self.dimension, n_list)
        
        # Train
        start = time.time()
        index.train(self.vectors)
        train_time = time.time() - start
        
        # Build
        start = time.time()
        index.add(self.vectors)
        build_time = time.time() - start
        
        # Search
        index.nprobe = n_probe
        start = time.time()
        distances, indices = index.search(self.query_vectors, top_k)
        search_time = time.time() - start
        
        print(f"Train time: {train_time:.3f}s")
        print(f"Build time: {build_time:.3f}s")
        print(f"Search time: {search_time*1000:.2f}ms for {len(self.query_vectors)} queries")
        print(f"Per-query: {search_time*1000/len(self.query_vectors):.2f}ms")
        
        return indices
    
    def calculate_recall(self, ground_truth, predictions, k=10):
        """Calculate recall@k"""
        recall_sum = 0
        for gt, pred in zip(ground_truth, predictions):
            gt_set = set(gt[:k])
            pred_set = set(pred[:k])
            recall_sum += len(gt_set & pred_set) / k
        
        return recall_sum / len(ground_truth)

# Run benchmark
bench = IndexBenchmark(dimension=768, n_vectors=100000)

# Get ground truth
ground_truth = bench.benchmark_flat(top_k=10)

# Test HNSW
hnsw_results = bench.benchmark_hnsw(top_k=10)
hnsw_recall = bench.calculate_recall(ground_truth, hnsw_results)
print(f"Recall@10: {hnsw_recall*100:.1f}%")

# Test IVF
ivf_results = bench.benchmark_ivf(top_k=10)
ivf_results = bench.benchmark_ivf(top_k=10)
ivf_recall = bench.calculate_recall(ground_truth, ivf_results)
print(f"Recall@10: {ivf_recall*100:.1f}%")

Typical output:

=== Flat Index (Exact Search) ===
Build time: 0.012s
Search time: 1834.56ms for 100 queries
Per-query: 18.35ms
Recall: 100% (exact)

=== HNSW Index (M=32, efC=200, efS=100) ===
Build time: 8.234s
Search time: 45.23ms for 100 queries
Per-query: 0.45ms
Recall@10: 98.2%

=== IVF Index (lists=100, probe=10) ===
Train time: 0.456s
Build time: 0.234s
Search time: 123.45ms for 100 queries
Per-query: 1.23ms
Recall@10: 89.7%

🔧 Try this: Experiment with parameters! Increase ef_search in HNSW or n_probe in IVF to trade speed for accuracy.

Common Mistakes to Avoid ⚠️

1. Using the Wrong Distance Metric

❌ Mistake:

# Embedding model uses cosine similarity during training
embeddings = openai_model.encode(texts)

# But you use Euclidean distance for search
index = faiss.IndexFlatL2(dimension)  # L2 = Euclidean!
index.add(embeddings)

✅ Correct:

# Match the distance metric to your embedding model
embeddings = openai_model.encode(texts)
faiss.normalize_L2(embeddings)  # Normalize for cosine

# Use inner product (equivalent to cosine for normalized vectors)
index = faiss.IndexFlatIP(dimension)  # IP = Inner Product
index.add(embeddings)

💡 Why it matters: Using the wrong distance metric can reduce recall by 20-30%!

2. Forgetting to Normalize Vectors

❌ Mistake:

vectors = np.random.randn(1000, 768)
index.add(vectors)  # Unnormalized!

# Cosine similarity will be incorrect

✅ Correct:

vectors = np.random.randn(1000, 768)
faiss.normalize_L2(vectors)  # Normalize to unit length
index.add(vectors)

3. Not Tuning Index Parameters

❌ Mistake:

# Using default parameters for all datasets
index = faiss.IndexHNSWFlat(dimension, 32)  # M=32 default
index.add(vectors)  # May be slow or inaccurate

✅ Correct:

# Tune based on dataset size and requirements
index = faiss.IndexHNSWFlat(dimension, 64)  # Higher M for accuracy
index.hnsw.efConstruction = 400  # More effort during build
index.hnsw.efSearch = 200        # More effort during search
index.add(vectors)

Parameter guidelines:

  • Small dataset (<100K): M=32, efC=200, efS=50
  • Medium (100K-10M): M=64, efC=400, efS=100
  • Large (>10M): M=96, efC=800, efS=200

4. Inefficient Batch Processing

❌ Mistake:

# Inserting vectors one at a time
for vector in vectors:
    index.add(vector.reshape(1, -1))  # Very slow!

✅ Correct:

# Batch insert for 10-100x speedup
index.add(vectors)  # Add all at once

# Or use reasonable batch sizes
for i in range(0, len(vectors), 1000):
    batch = vectors[i:i+1000]
    index.add(batch)

5. Ignoring Memory Constraints

❌ Mistake:

# Loading 10M vectors (768 dims) into RAM
vectors = np.random.randn(10_000_000, 768)  # ~30GB!
index = faiss.IndexHNSWFlat(768, 32)  # Another ~30GB for graph
index.add(vectors)  # Out of memory!

✅ Correct:

# Use Product Quantization for compression
index = faiss.IndexHNSWPQ(768, 32, 8, 8)  # 32 PQ sub-quantizers x 8 bits = 32 bytes per vector, M=8
index.train(training_vectors)
index.add(vectors)  # A few GB instead of ~60GB

6. Not Pre-filtering Metadata

❌ Mistake:

# Post-filtering wastes computation
results = index.search(query, k=10000)  # Retrieve 10K vectors
filtered = [r for r in results if r['category'] == 'tech'][:10]  # Discard 99%

✅ Correct:

# Pre-filter to search only relevant vectors
tech_ids = metadata_db.query("category = 'tech'")  # Get IDs upfront
results = index.search_with_filter(query, k=10, id_filter=tech_ids)

Key Takeaways 🎯

📋 Quick Reference Card

Concept            Key Points
Vector Storage     Raw arrays (simple), PQ (compressed 8-32x), memory mapping (for large datasets)
Index Types        Flat (exact, slow), HNSW (fast+accurate), IVF (fast+moderate), PQ (memory-efficient)
Distance Metrics   Cosine (text), Euclidean (images), Dot Product (fastest for normalized)
HNSW Parameters    M (connections), efConstruction (build quality), efSearch (query quality)
IVF Parameters     n_list (clusters), n_probe (clusters searched)
Trade-offs         Speed ↔ Accuracy ↔ Memory: optimize for your bottleneck
Filtering          Pre-filter (efficient) beats post-filter (wasteful)
Distribution       Hash/range/feature sharding, parallel fan-out queries

🧠 Memory Device: "HIVE" for Index Selection

  • HNSW: High-performance (real-time apps)
  • IVF: Intermediate (moderate scale)
  • Vector PQ: Volume reduction (billions of vectors)
  • Exact (Flat): Everything precise (small datasets)

Performance Rules of Thumb:

  • < 10K vectors: Use Flat (exact search)
  • 10K - 10M vectors: Use HNSW
  • 10M - 1B vectors: Use HNSW + PQ or IVF + PQ
  • Filtering > 90% data: Pre-filter essential
  • Real-time (<10ms): HNSW with high parameters
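
As a quick recap, those thresholds can be written down as a tiny helper (purely illustrative; always benchmark on your own data):

def choose_index(num_vectors: int, need_exact: bool = False) -> str:
    # Restates the rules of thumb above
    if need_exact or num_vectors < 10_000:
        return "Flat (exact search)"
    if num_vectors < 10_000_000:
        return "HNSW"
    return "HNSW + PQ or IVF + PQ"

print(choose_index(5_000))        # Flat (exact search)
print(choose_index(2_000_000))    # HNSW
print(choose_index(500_000_000))  # HNSW + PQ or IVF + PQ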

📚 Further Study

  1. FAISS Documentation: https://github.com/facebookresearch/faiss/wiki - Comprehensive guide to Facebook's vector search library with tutorials and benchmarks
  2. Pinecone Learning Center: https://www.pinecone.io/learn/ - In-depth articles on vector database concepts, RAG architectures, and production best practices
  3. Approximate Nearest Neighbors Benchmarks: https://ann-benchmarks.com/ - Interactive comparison of ANN algorithms across different datasets and metrics

Congratulations! 🎉 You now understand the core architecture of vector databases: from storage mechanisms and indexing strategies to distributed patterns. You're ready to design and optimize vector search systems for modern AI applications. Practice implementing these concepts with real embeddings, experiment with different index types, and always benchmark on your actual data before production deployment!