ANN Algorithms
Deep dive into Approximate Nearest Neighbor search methods and their trade-offs between speed and accuracy.
ANN Algorithms for Vector Search
Master Approximate Nearest Neighbor (ANN) algorithms with free flashcards and spaced repetition practice. This lesson covers graph-based methods like HNSW, tree-based structures like Annoy, locality-sensitive hashing, and quantization techniques: essential concepts for building scalable AI search systems and modern RAG applications.
Welcome 🎯
When you're searching through millions or billions of high-dimensional vectors in a production AI system, exact nearest neighbor search becomes impossibly slow. This is where Approximate Nearest Neighbor (ANN) algorithms shine. These clever techniques trade a small amount of accuracy for massive speed improvements, often delivering 100x to 1000x faster searches while maintaining 95%+ recall.
In this lesson, you'll discover the core algorithms powering modern vector databases: the hierarchical graph navigation of HNSW, the binary tree partitioning of Annoy, the hash-bucketing magic of LSH, and the compression brilliance of Product Quantization. By understanding these algorithms, you'll know how to choose the right approach for your specific use case, tune parameters for optimal performance, and debug issues when your vector search isn't performing as expected.
💡 Why This Matters: Every major vector database, from Pinecone to Weaviate to Qdrant, implements one or more of these algorithms under the hood. Understanding them transforms you from a database user into someone who can architect intelligent, high-performance search systems.
Core Concepts: The ANN Algorithm Landscape 🗺️
The Fundamental Trade-off
All ANN algorithms balance three competing factors:
| Factor | Description | Trade-off |
|---|---|---|
| Speed | Query latency (ms) | Faster → less accurate |
| Recall | % of true neighbors found | Higher → slower, more memory |
| Memory | Index size overhead | Lower → less accurate |
Recall is the key metric: if you want the top 10 nearest neighbors, and your ANN algorithm returns 9 that would be in the true top-10, that's 90% recall. Production systems typically target 95-99% recall.
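For example, recall@10 for a single query is just a set intersection. A minimal sketch (the two result lists below are made-up placeholder IDs):

true_top10 = [4, 17, 23, 42, 56, 61, 78, 80, 91, 99]   # exact nearest neighbors
ann_top10 = [4, 17, 23, 42, 56, 61, 78, 80, 91, 123]   # ANN result with one miss

recall_at_10 = len(set(true_top10) & set(ann_top10)) / 10
print(recall_at_10)  # 0.9 -> 90% recall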
Algorithm Categories
ANN algorithms fall into four main families:
ANN ALGORITHM TAXONOMY

- Graph-based: HNSW (Hierarchical Navigable Small World), NSG (Navigating Spreading-out Graph), DiskANN
- Tree-based: Annoy (Approximate Nearest Neighbors), KD-Tree variants, Ball Tree
- Hash-based: LSH (Locality-Sensitive Hashing), Multi-Probe LSH, Cross-Polytope LSH
- Quantization: Product Quantization (PQ), Scalar Quantization (SQ), Binary Quantization
1. HNSW: The Graph Navigation Champion 🕸️
Hierarchical Navigable Small World (HNSW) is currently the most popular ANN algorithm, thanks to its exceptional balance of speed, accuracy, and ease of use.
How HNSW Works
HNSW builds a multi-layer graph where:
- Each vector is a node
- Edges connect similar vectors
- Upper layers are sparse (skip lists for fast navigation)
- Bottom layer is dense (contains all vectors)
HNSW STRUCTURE (3 layers)

Layer 2 (sparse):
(A)-------------------------(F)
 |                           |
Layer 1 (medium):
(A)---(B)-------(D)---(E)---(F)
 |     |         |     |     |
Layer 0 (all vectors):
(A)-(B)-(C)-(D)-(E)-(F)-(G)-(H)
 |   |   |   |   |   |   |   |
(I)-(J)-(K)-(L)-(M)-(N)-(O)-(P)
Search Process:
- Start at the entry point in the top layer
- Navigate to the nearest neighbor using greedy search
- Drop down one layer
- Repeat until reaching the bottom layer
- Perform a local search to find k nearest neighbors
💡 Why It's Fast: By starting at sparse upper layers, HNSW quickly zooms in on the right region, then refines at lower layers. It's like using a map's zoom levels: start zoomed out, then drill down.
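Here is a minimal sketch of that layered greedy descent in plain Python. The graph layout (layers as adjacency dicts, an entry_point node, and a vectors array) is assumed for illustration only; real HNSW implementations such as FAISS or hnswlib are far more optimized:

import heapq
import numpy as np

def hnsw_search(query, layers, entry_point, vectors, ef=50, k=10):
    """layers[0] is the dense bottom layer; layers[-1] is the sparse top layer."""
    dist = lambda node: float(np.linalg.norm(vectors[node] - query))

    # Greedy descent: on each upper layer, move to any closer neighbor, then drop down
    current = entry_point
    for layer in reversed(layers[1:]):
        improved = True
        while improved:
            improved = False
            for neighbor in layer.get(current, []):
                if dist(neighbor) < dist(current):
                    current, improved = neighbor, True

    # Best-first search on the bottom layer, keeping ef candidates
    visited = {current}
    candidates = [(dist(current), current)]   # min-heap of (distance, node)
    best = [(-dist(current), current)]        # max-heap (negated) of the ef best so far
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break                             # nothing left can improve the result set
        for neighbor in layers[0].get(node, []):
            if neighbor in visited:
                continue
            visited.add(neighbor)
            nd = dist(neighbor)
            if len(best) < ef or nd < -best[0][0]:
                heapq.heappush(candidates, (nd, neighbor))
                heapq.heappush(best, (-nd, neighbor))
                if len(best) > ef:
                    heapq.heappop(best)
    return sorted((-d, node) for d, node in best)[:k]   # (distance, node) pairs

Note how ef caps the candidate set on the bottom layer; this is exactly the ef_search knob discussed below.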
HNSW Parameters
| Parameter | Description | Impact |
|---|---|---|
| M | Max edges per node | Higher = better recall, more memory |
| ef_construction | Candidates during build | Higher = better quality, slower build |
| ef_search | Candidates during query | Higher = better recall, slower search |
Typical Production Values:
M = 16 # Good default balance
ef_construction = 200 # Build once, search many times
ef_search = 50 # Tune based on latency requirements
⚠️ Common Mistake: Setting ef_search < k (fewer candidates than results requested). Always ensure ef_search >= k.
2. Annoy: Binary Tree Partitioning 🌳
Annoy (Approximate Nearest Neighbors Oh Yeah) was developed by Spotify for music recommendations. It builds multiple random projection trees to partition the vector space.
How Annoy Works
Tree Construction:
- Pick two random points from your dataset
- Draw a hyperplane halfway between them
- Split all points based on which side they fall on
- Recursively partition each side
- Build multiple trees (typically 10-100) with different random splits
ANNOY RANDOM PROJECTION TREE

                Root
           (Hyperplane 1)
            /         \
   (H-plane 2)       (H-plane 3)
     /     \           /     \
  [A,B]   [C,D]     [E,F]   [G,H]

Leaves contain actual vectors
Search Process:
- Traverse each tree from root to leaf
- Collect all candidates from the reached leaves
- Compute exact distances to all candidates
- Return the k closest
💡 Why Multiple Trees: A single tree might unluckily separate similar vectors. Multiple trees with different random splits make it very likely you'll find neighbors in at least some of the trees.
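To make the construction steps concrete, here is a minimal NumPy sketch of a single random hyperplane split; it illustrates the idea rather than Annoy's actual C++ implementation:

import numpy as np

def random_split(points):
    """Split points by a hyperplane equidistant from two randomly chosen points."""
    i, j = np.random.choice(len(points), size=2, replace=False)
    a, b = points[i], points[j]
    normal = a - b                          # hyperplane normal vector
    midpoint = (a + b) / 2                  # the hyperplane passes through the midpoint
    side = (points - midpoint) @ normal     # sign tells which side of the plane each point is on
    return points[side >= 0], points[side < 0]

points = np.random.randn(1000, 128).astype("float32")
left, right = random_split(points)
print(len(left), len(right))                # roughly balanced split

Annoy applies this split recursively until each leaf holds only a few items, and builds n_trees such trees with different random choices.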
Annoy Parameters
from annoy import AnnoyIndex
import numpy as np

f = 128  # Vector dimensionality
vectors = np.random.randn(1000, f).astype('float32')     # example data
query_vector = np.random.randn(f).astype('float32')

t = AnnoyIndex(f, 'angular')  # or 'euclidean'

# Add vectors
for i, vector in enumerate(vectors):
    t.add_item(i, vector)

# Build with n_trees
t.build(10)  # n_trees=10: more trees = better recall, more memory
t.save('index.ann')

# Search for the 10 nearest neighbors, inspecting search_k candidates
neighbors = t.get_nns_by_vector(query_vector, 10, search_k=100)
| Parameter | Description | Typical Range |
|---|---|---|
| n_trees | Number of trees to build | 10-100 (more for higher recall) |
| search_k | Nodes to inspect during search | n_trees * k to n_trees * k * 10 |
Advantages:
- 📂 Memory-mapped: Can search without loading entire index into RAM
- 📦 Compact: Index size is close to raw vector data
- 🎯 Predictable: Performance scales smoothly with parameters
Disadvantages:
- 🔒 Static: Must rebuild entire index to add vectors
- 📉 Lower recall: Typically doesn't match HNSW at the same speed
3. Locality-Sensitive Hashing (LSH) #️⃣
LSH uses special hash functions that map similar vectors to the same hash buckets with high probability.
The LSH Magic Trick
Normal hash functions (like MD5) aim to spread similar inputs far apart. LSH does the opposite:
REGULAR HASH vs LSH

Regular Hash (MD5):
  Input: "hello" → Hash: a3f8c9...
  Input: "hallo" → Hash: 8d2e1f...  (totally different!)

LSH for vectors:
  Vector A: [0.9, 0.1, 0.2]    → Bucket: 10110
  Vector B: [0.85, 0.15, 0.25] → Bucket: 10110  (same!)
  Vector C: [0.1, 0.9, 0.8]    → Bucket: 01001  (different)
Random Hyperplane LSH
The most common LSH for cosine similarity uses random hyperplanes:
import numpy as np
class LSHIndex:
def __init__(self, dim, n_bits=16, n_tables=10):
self.dim = dim
self.n_bits = n_bits
self.n_tables = n_tables
# Create random hyperplanes for each table
self.hyperplanes = [
np.random.randn(n_bits, dim)
for _ in range(n_tables)
]
self.tables = [{} for _ in range(n_tables)]
def hash_vector(self, vector, table_idx):
# Project onto hyperplanes, check which side
projections = np.dot(self.hyperplanes[table_idx], vector)
# Convert to binary hash
bits = (projections > 0).astype(int)
return tuple(bits)
def insert(self, vector, item_id):
for i in range(self.n_tables):
hash_val = self.hash_vector(vector, i)
if hash_val not in self.tables[i]:
self.tables[i][hash_val] = []
self.tables[i][hash_val].append(item_id)
def query(self, vector, k=10):
candidates = set()
for i in range(self.n_tables):
hash_val = self.hash_vector(vector, i)
if hash_val in self.tables[i]:
candidates.update(self.tables[i][hash_val])
        # Rank candidates by actual distance in a full implementation
        # (see the SimpleLSH class in Example 3); here they are returned unranked
        return list(candidates)[:k]  # Simplified
LSH Parameters
| Parameter | Effect |
|---|---|
| n_bits | Hash length (more bits = finer buckets) |
| n_tables | Independent hash tables (more = better recall) |
💡 The LSH Insight: Similar vectors are very likely to land in the same bucket in at least one table. More tables = more chances to find neighbors.
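You can put numbers on this intuition. For random hyperplane LSH, two vectors separated by angle θ agree on any single hash bit with probability 1 - θ/π, so a quick back-of-the-envelope calculation shows why extra tables matter (the 30° angle is just an example value):

import numpy as np

theta = np.deg2rad(30)                    # angle between two fairly similar vectors
n_bits, n_tables = 16, 10

p_bit = 1 - theta / np.pi                 # P(the two vectors agree on one hash bit)
p_bucket = p_bit ** n_bits                # P(identical 16-bit hash in one table)
p_any = 1 - (1 - p_bucket) ** n_tables    # P(they collide in at least one table)

print(f"P(same bit) = {p_bit:.3f}")               # ~0.83
print(f"P(same bucket, 1 table) = {p_bucket:.3f}") # ~0.05
print(f"P(collide in >=1 of {n_tables} tables) = {p_any:.3f}")  # ~0.43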
Advantages:
- ⚡ Sub-linear search: O(n^ρ) where ρ < 1
- 💾 Memory efficient: Just store hash buckets
- ✅ Supports updates: Easy to add/remove vectors
Disadvantages:
- 🎯 Lower recall: Hard to tune for high recall
- 📉 Dimension dependent: Performance degrades in very high dimensions
4. Product Quantization (PQ) 🗜️
Product Quantization compresses vectors by breaking them into chunks and replacing each chunk with the nearest centroid from a learned codebook. Think of it as "lossy compression for vectors."
How PQ Works
Step 1: Split Vectors into Subvectors
## Original 128-dim vector
vector = [0.1, 0.5, 0.3, ..., 0.8] # 128 floats
## Split into 8 subvectors of 16 dimensions each
subvectors = [
vector[0:16], # Subvector 1
vector[16:32], # Subvector 2
vector[32:48], # Subvector 3
# ... 8 total
]
Step 2: Learn Codebooks
For each subvector position, cluster all training subvectors into k centroids (typically k = 256, so each centroid ID fits in one byte):
CODEBOOK for Subvector 1 (256 centroids)
  ID 0:   [0.12, 0.45, 0.32, ...]
  ID 1:   [0.89, 0.01, 0.56, ...]
  ID 2:   [0.34, 0.67, 0.11, ...]
  ...
  ID 255: [0.55, 0.23, 0.78, ...]
Step 3: Encode Vectors
Replace each subvector with its nearest centroid ID:
# Original: 128 floats × 4 bytes = 512 bytes
vector = [0.1, 0.5, 0.3, ..., 0.8]
# Encoded: 8 bytes (one ID per subvector)
encoded = [42, 17, 201, 88, 156, 3, 99, 211]
# Each ID is 0-255 (1 byte)
Compression Ratio: 512 bytes → 8 bytes = 64x compression!
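Putting the three steps together, here is a minimal NumPy/scikit-learn sketch of learning codebooks and encoding vectors. It mirrors the parameters above (8 subvectors, 256 centroids) but is only an illustration, not FAISS's implementation:

import numpy as np
from sklearn.cluster import KMeans

dim, m, n_centroids = 128, 8, 256
sub_dim = dim // m
train = np.random.randn(5_000, dim).astype("float32")

# Step 2: learn one codebook (256 centroids) per subvector position
codebooks = []
for i in range(m):
    sub = train[:, i * sub_dim:(i + 1) * sub_dim]
    km = KMeans(n_clusters=n_centroids, n_init=4).fit(sub)
    codebooks.append(km.cluster_centers_)        # shape: (256, 16)

# Step 3: encode each vector as m one-byte centroid IDs
def pq_encode(vectors, codebooks):
    codes = np.empty((len(vectors), m), dtype=np.uint8)
    for i, centroids in enumerate(codebooks):
        sub = vectors[:, i * sub_dim:(i + 1) * sub_dim]
        dists = np.linalg.norm(sub[:, None, :] - centroids[None, :, :], axis=2)
        codes[:, i] = dists.argmin(axis=1)       # nearest centroid for every subvector
    return codes

codes = pq_encode(train, codebooks)
print(codes.shape, codes.dtype)                  # (5000, 8) uint8 -> 8 bytes per vector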
PQ Search
To search without decompressing:
import numpy as np

def pq_search(query_vector, encoded_db, codebooks, k=10):
    """codebooks: list of m arrays, each of shape (256, sub_dim)."""
    m = len(codebooks)
    sub_dim = codebooks[0].shape[1]
    # 1. Split query into m subvectors
    query_subs = query_vector.reshape(m, sub_dim)
    # 2. Precompute distances from each query subvector to all 256 centroids
    distance_tables = [
        np.linalg.norm(codebooks[i] - query_subs[i], axis=1) ** 2
        for i in range(m)
    ]
    # 3. Score each database vector using only table lookups
    scores = []
    for encoded_vec in encoded_db:
        score = 0.0
        for i, centroid_id in enumerate(encoded_vec):
            # Just look up the precomputed distance!
            score += distance_tables[i][centroid_id]
        scores.append(score)
    # 4. Return the indices of the k smallest approximate distances
    return np.argsort(scores)[:k]
💡 The PQ Speedup: We compute distances to only 8 × 256 = 2048 centroids once per query, then score every database vector with table lookups and additions; no full distance computation per vector.
PQ Parameters
| Parameter | Typical Value | Trade-off |
|---|---|---|
| m (subvectors) | 8, 16, 32 | More = better accuracy, larger tables |
| k (centroids) | 256 (fits in 1 byte) | More = better accuracy, more memory |
PQ + IVFPQ: Often combined with inverted file index (IVF) for two-stage search:
- Coarse search: Find relevant clusters using IVF
- Fine search: Use PQ within selected clusters
import faiss

# Create IVF with PQ
dim = 768            # vector dimension (example value)
n_clusters = 1024    # IVF clusters
m = 8                # PQ subvectors
n_bits = 8           # 2^8 = 256 centroids per subvector

quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFPQ(quantizer, dim, n_clusters, m, n_bits)
5. Scalar Quantization (SQ) 📏
A simpler compression approach: convert float32 to int8.
import numpy as np

# Original float32 (4 bytes per dimension)
vector = np.array([0.123, 0.876, -0.432, 0.591], dtype=np.float32)
# Memory: 4 dimensions × 4 bytes = 16 bytes

# Find min/max
vmin, vmax = vector.min(), vector.max()

# Quantize to int8 (1 byte per dimension)
quantized = np.round(
    (vector - vmin) / (vmax - vmin) * 255
).astype(np.uint8)
# Memory: 4 dimensions × 1 byte = 4 bytes
# Result: [108, 255, 0, 199]
# Compression: 4x, but some precision lost
Advantages of SQ:
- 🎯 Better recall than PQ (less information loss)
- 💻 SIMD-friendly: Fast with modern CPUs
- 🔧 Simple: "training" is just finding the min/max range per dimension
Disadvantages:
- 📦 Lower compression: Only 4x vs PQ's 64x
- 📉 Dimension-tied: compressed size still grows linearly with the original dimension, unlike PQ, where you choose the code size
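To see the "some precision lost" part concretely, here is a short sketch that dequantizes the codes from the snippet above and measures the reconstruction error:

import numpy as np

vector = np.array([0.123, 0.876, -0.432, 0.591], dtype=np.float32)
vmin, vmax = vector.min(), vector.max()
quantized = np.round((vector - vmin) / (vmax - vmin) * 255).astype(np.uint8)

# Dequantize: map the integer codes back into the original [vmin, vmax] range
reconstructed = quantized.astype(np.float32) / 255 * (vmax - vmin) + vmin

print(reconstructed)                          # close to, but not exactly, the original values
print(np.abs(vector - reconstructed).max())   # worst-case error is about (vmax - vmin) / 255 / 2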
Examples: Choosing and Implementing ANN Algorithms 🔧
Example 1: Building an HNSW Index with FAISS
Scenario: You have 1 million product embeddings (384 dimensions) for semantic search.
import faiss
import numpy as np
## Generate sample data
n_vectors = 1_000_000
dim = 384
vectors = np.random.randn(n_vectors, dim).astype('float32')
## Normalize for cosine similarity
faiss.normalize_L2(vectors)
## Create HNSW index
M = 32 # Number of connections per layer
ef_construction = 200 # Quality during construction
index = faiss.IndexHNSWFlat(dim, M)
index.hnsw.efConstruction = ef_construction
## Build the index
print("Building HNSW index...")
index.add(vectors)
print(f"Index built with {index.ntotal} vectors")
## Search with different ef_search values
query = np.random.randn(1, dim).astype('float32')
faiss.normalize_L2(query)
for ef in [16, 32, 64, 128]:
index.hnsw.efSearch = ef
distances, indices = index.search(query, k=10)
print(f"ef_search={ef}: Found {len(indices[0])} neighbors")
Expected Output:
Building HNSW index...
Index built with 1000000 vectors
ef_search=16: Found 10 neighbors
ef_search=32: Found 10 neighbors # Better recall
ef_search=64: Found 10 neighbors # Even better
ef_search=128: Found 10 neighbors # Highest quality
💡 Tuning Insight: Monitor latency vs recall. Start with ef_search = 50, increase if recall is too low, decrease if latency is too high.
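A sketch of that tuning loop, measuring recall@10 and per-query latency against an exact brute-force baseline. It reuses dim, vectors, and index from the example above; the query count and ef values are arbitrary choices:

import time
import numpy as np
import faiss

# Ground truth from an exact (brute-force) index on a sample of queries
queries = np.random.randn(100, dim).astype('float32')
faiss.normalize_L2(queries)
exact = faiss.IndexFlatL2(dim)
exact.add(vectors)
_, true_ids = exact.search(queries, 10)

for ef in [16, 32, 64, 128]:
    index.hnsw.efSearch = ef
    start = time.perf_counter()
    _, ann_ids = index.search(queries, 10)
    latency_ms = (time.perf_counter() - start) / len(queries) * 1000
    recall = np.mean([
        len(set(true_ids[i]) & set(ann_ids[i])) / 10
        for i in range(len(queries))
    ])
    print(f"ef_search={ef}: recall@10={recall:.1%}, {latency_ms:.2f} ms/query")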
Example 2: Memory-Efficient Search with IVF + PQ
Scenario: You have 10 million vectors but limited RAM. Compress with Product Quantization.
import faiss
import numpy as np
n_vectors = 10_000_000
dim = 768 # Large language model embeddings
vectors = np.random.randn(n_vectors, dim).astype('float32')
## Create IVF + PQ index
n_clusters = 4096 # Number of IVF clusters
m = 64 # Split into 64 subvectors (768/64 = 12 dims each)
bits = 8 # 256 centroids per subvector
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFPQ(quantizer, dim, n_clusters, m, bits)
## Train on sample data (PQ needs training)
print("Training index...")
training_sample = vectors[:100_000] # Use 100k for training
index.train(training_sample)
## Add all vectors
print("Adding vectors...")
index.add(vectors)
print(f"\nMemory comparison:")
print(f"Original: {n_vectors * dim * 4 / 1e9:.2f} GB")
print(f"With PQ: {n_vectors * m / 1e9:.2f} GB")
print(f"Compression: {(dim * 4) / m:.1f}x")
## Search
index.nprobe = 32 # Search top 32 clusters
query = np.random.randn(1, dim).astype('float32')
distances, indices = index.search(query, k=10)
Output:
Training index...
Adding vectors...
Memory comparison:
Original: 30.72 GB
With PQ: 0.64 GB
Compression: 48.0x
💡 Production Tip: Train PQ on representative data. If your distribution changes, retrain the codebooks.
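One way to notice that drift, sketched below, is to track the index's reconstruction error on fresh vectors and compare it to the error on the training sample. This assumes a FAISS build that exposes the standalone codec API (sa_encode / sa_decode) for IVF+PQ indices; the 1.5x threshold is an arbitrary example:

import numpy as np

def avg_reconstruction_error(index, sample):
    """Mean L2 error between vectors and their PQ-reconstructed versions."""
    codes = index.sa_encode(sample)     # compress with the trained codebooks
    recon = index.sa_decode(codes)      # decompress back to float32
    return np.linalg.norm(sample - recon, axis=1).mean()

baseline = avg_reconstruction_error(index, training_sample[:10_000])

# Later, on freshly arriving vectors (placeholder data here):
new_vectors = np.random.randn(10_000, dim).astype('float32')
if avg_reconstruction_error(index, new_vectors) > 1.5 * baseline:
    print("Quantization error has grown noticeably -- consider retraining the codebooks")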
Example 3: Dynamic Updates with LSH
Scenario: Real-time vector search where vectors are constantly added/removed.
import numpy as np
from collections import defaultdict
class SimpleLSH:
def __init__(self, dim, n_bits=16, n_tables=20):
self.dim = dim
self.n_bits = n_bits
self.n_tables = n_tables
# Random hyperplanes for each table
self.planes = [
np.random.randn(n_bits, dim)
for _ in range(n_tables)
]
# Hash tables: {hash -> [vector_ids]}
self.tables = [
defaultdict(list) for _ in range(n_tables)
]
# Store actual vectors
self.vectors = {}
def _hash(self, vector, table_idx):
"""Compute binary hash for a vector"""
projections = self.planes[table_idx] @ vector
bits = tuple((projections > 0).astype(int))
return bits
def insert(self, vector_id, vector):
"""Add a vector to the index"""
self.vectors[vector_id] = vector
for i in range(self.n_tables):
hash_val = self._hash(vector, i)
self.tables[i][hash_val].append(vector_id)
def remove(self, vector_id):
"""Remove a vector from the index"""
if vector_id not in self.vectors:
return
vector = self.vectors[vector_id]
for i in range(self.n_tables):
hash_val = self._hash(vector, i)
self.tables[i][hash_val].remove(vector_id)
del self.vectors[vector_id]
def query(self, query_vector, k=10):
"""Find k nearest neighbors"""
# Collect candidates from all tables
candidates = set()
for i in range(self.n_tables):
hash_val = self._hash(query_vector, i)
candidates.update(self.tables[i][hash_val])
# Rank by actual distance
distances = [
(vid, np.linalg.norm(self.vectors[vid] - query_vector))
for vid in candidates
]
distances.sort(key=lambda x: x[1])
return [vid for vid, _ in distances[:k]]
## Usage
lsh = SimpleLSH(dim=128, n_bits=16, n_tables=20)
## Add vectors dynamically
for i in range(10000):
vec = np.random.randn(128).astype('float32')
lsh.insert(f"vec_{i}", vec)
## Search
query = np.random.randn(128).astype('float32')
results = lsh.query(query, k=10)
print(f"Found neighbors: {results}")
## Remove vector (instant!)
lsh.remove("vec_5000")
Key Advantage: Updates are O(n_tables) = O(1) with respect to database size.
Example 4: Hybrid Approach - HNSW + SQ
Scenario: Balance speed, accuracy, and memory.
import faiss
import numpy as np
n_vectors = 5_000_000
dim = 512
vectors = np.random.randn(n_vectors, dim).astype('float32')
## Create HNSW with Scalar Quantization
M = 32
index = faiss.IndexHNSWSQ(dim, faiss.ScalarQuantizer.QT_8bit, M)
index.hnsw.efConstruction = 200
## Train scalar quantizer (learns min/max per dimension)
print("Training scalar quantizer...")
index.train(vectors[:100_000])
## Add vectors
print("Adding vectors...")
index.add(vectors)
print(f"\nMemory savings:")
print(f"Float32: {n_vectors * dim * 4 / 1e9:.2f} GB")
print(f"Int8 (SQ): {n_vectors * dim * 1 / 1e9:.2f} GB")
print(f"Savings: 4x compression")
## Search with high quality
index.hnsw.efSearch = 100
query = np.random.randn(1, dim).astype('float32')
distances, indices = index.search(query, k=10)
print(f"\nTop-10 neighbors: {indices[0]}")
Why This Works:
- HNSW provides excellent graph navigation
- SQ reduces memory by 4x with minimal recall loss
- Still much faster than exhaustive search
Common Mistakes ⚠️
1. Not Normalizing Vectors for Cosine Similarity
❌ Wrong:
index = faiss.IndexHNSWFlat(dim, 32)
index.add(vectors) # Vectors not normalized
✅ Right:
import faiss
faiss.normalize_L2(vectors) # Normalize first!
index = faiss.IndexHNSWFlat(dim, 32)
index.add(vectors)
Why: HNSW in FAISS uses L2 distance by default. For cosine similarity, you must normalize vectors first; on unit-length vectors, L2 distance and cosine similarity produce the same ranking.
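The underlying identity: for unit-length vectors, ||a - b||² = 2 - 2·cos(a, b), so sorting by L2 distance is the same as sorting by cosine similarity. A quick numerical check:

import numpy as np

a, b = np.random.randn(384), np.random.randn(384)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)   # normalize to unit length

l2_squared = np.sum((a - b) ** 2)
cosine = np.dot(a, b)
print(np.isclose(l2_squared, 2 - 2 * cosine))          # True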
2. Setting ef_search < k in HNSW
❌ Wrong:
index.hnsw.efSearch = 10
results = index.search(query, k=50) # Asking for more than candidates!
✅ Right:
index.hnsw.efSearch = 100 # Always >= k
results = index.search(query, k=50)
3. Not Training PQ/IVF Indices
❌ Wrong:
index = faiss.IndexIVFPQ(quantizer, dim, 1024, 8, 8)
index.add(vectors) # Error: index not trained!
✅ Right:
index = faiss.IndexIVFPQ(quantizer, dim, 1024, 8, 8)
index.train(training_vectors) # Train first
index.add(vectors)
4. Using Too Few nprobe in IVF Search
❌ Wrong:
index.nprobe = 1 # Only search 1 cluster - terrible recall!
✅ Right:
index.nprobe = 32 # Search 32 clusters - much better
💡 Rule of Thumb: Start with nprobe = sqrt(n_clusters), then tune based on recall.
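A sketch of that tuning step for the IVF+PQ index from Example 2, sweeping nprobe and measuring recall against an exact baseline (for 10M vectors the brute-force baseline is slow, so only a small sample of queries is used):

import numpy as np
import faiss

queries = np.random.randn(100, dim).astype('float32')
exact = faiss.IndexFlatL2(dim)
exact.add(vectors)
_, true_ids = exact.search(queries, 10)

start = int(np.sqrt(index.nlist))          # rule-of-thumb starting point (sqrt(4096) = 64)
for nprobe in [1, start, 2 * start, 4 * start]:
    index.nprobe = nprobe
    _, ann_ids = index.search(queries, 10)
    recall = np.mean([
        len(set(true_ids[i]) & set(ann_ids[i])) / 10
        for i in range(len(queries))
    ])
    print(f"nprobe={nprobe}: recall@10={recall:.1%}")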
5. Training IVF/PQ on Too Few Vectors
❌ Wrong:
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFPQ(quantizer, dim, 1024, 8, 8)
index.train(vectors[:1000])  # Fewer training points than clusters
✅ Right:
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFPQ(quantizer, dim, 1024, 8, 8)
index.train(vectors[:100_000])  # Give k-means plenty of points per cluster
Note that index.train() fits both the coarse quantizer's centroids and the PQ codebooks in one call; there is no separate quantizer.train() step.
6. Not Benchmarking Recall
Always measure recall to ensure your approximation is acceptable:
def measure_recall(index, queries, ground_truth, k=10):
"""Measure recall@k against ground truth"""
_, predicted = index.search(queries, k)
recalls = []
for i in range(len(queries)):
true_neighbors = set(ground_truth[i][:k])
pred_neighbors = set(predicted[i])
recall = len(true_neighbors & pred_neighbors) / k
recalls.append(recall)
return np.mean(recalls)
## Get ground truth with exact search
exact_index = faiss.IndexFlatL2(dim)
exact_index.add(vectors)
_, ground_truth = exact_index.search(queries, k=10)
## Measure your ANN index
recall = measure_recall(ann_index, queries, ground_truth, k=10)
print(f"Recall@10: {recall:.2%}")
Key Takeaways 🎯
📊 ANN Algorithm Quick Reference
| Algorithm | Best For | Updates | Recall | Memory |
|---|---|---|---|---|
| HNSW | High recall, static datasets | ⚠️ Slow | ★★★★★ | Medium |
| Annoy | Memory-mapped, read-heavy | ❌ Rebuild | ★★★★ | Low |
| LSH | Dynamic datasets, streaming | ✅ Fast | ★★★ | Low |
| IVF+PQ | Huge datasets, limited RAM | ⚠️ Retrain | ★★★★ | Very Low |
| HNSW+SQ | Balance speed/memory/recall | ⚠️ Slow | ★★★★★ | Low |
🧭 Decision Framework
Choose HNSW when:
- Recall is critical (95%+ required)
- Dataset fits in memory
- Updates are infrequent
- You need predictable performance
Choose Annoy when:
- Dataset exceeds RAM (use memory mapping)
- You can rebuild index offline
- You need compact disk storage
- Spotify-level recall (90-95%) is acceptable
Choose LSH when:
- Vectors arrive in real-time
- You need immediate updates/deletes
- Lower recall (85-90%) is acceptable
- Memory is very limited
Choose IVF+PQ when:
- Dataset is massive (10M+ vectors)
- RAM is constrained
- You can afford retraining periodically
- You need 50-100x compression
Parameter Tuning Priorities:
- Start simple: Default parameters often work well
- Measure baseline: Get recall/latency metrics
- Tune for recall: Increase ef_search, nprobe, search_k
- Optimize latency: Decrease parameters once recall is met
- Monitor memory: Add quantization if needed
📚 Further Study
- FAISS Documentation: https://github.com/facebookresearch/faiss/wiki - Comprehensive guide to Meta's vector search library
- HNSW Paper: https://arxiv.org/abs/1603.09320 - Original research on Hierarchical Navigable Small World graphs
- ANN Benchmarks: https://ann-benchmarks.com/ - Compare algorithm performance across datasets
🧠 Memory Device for Algorithm Selection:
- High recall → HNSW
- Annoy for Archived/static data
- LSH for Live updates
- PQ for Petabyte-scale compression
Ready to test your knowledge? Complete the practice questions below to reinforce your understanding of ANN algorithms! 🚀