Cosine Similarity & Distance Metrics
Master similarity calculations and distance functions (cosine, Euclidean, Manhattan, dot product) for comparing vectors.
Understanding how to measure similarity between vectors is fundamental to modern AI search systems. Master cosine similarity, Euclidean distance, and Manhattan distance with free flashcards and spaced repetition practice. This lesson covers vector similarity calculations, distance metric selection criteria, and practical applications in semantic search: essential concepts for building retrieval-augmented generation (RAG) systems.
Welcome to Vector Similarity
Imagine you're in a library with millions of books, and someone asks you to find books "similar" to their favorite novel. How do you measure "similarity"? In the world of AI search, we face the same challenge with text, images, and other data, but instead of books on shelves, we work with vectors (arrays of numbers) in high-dimensional space.
Distance metrics are mathematical functions that quantify how "close" or "far apart" two vectors are. The choice of metric fundamentally shapes how your search system behaves. Pick the wrong one, and your "similar" results might be wildly off. Pick the right one, and your system feels almost magical.
In this lesson, we'll explore the three most important distance metrics for AI search:
- Cosine Similarity - measures the angle between vectors
- Euclidean Distance - measures straight-line distance
- Manhattan Distance - measures grid-based distance
Why This Matters: Every modern semantic search system, from Google's BERT to OpenAI's embeddings, relies on these metrics to find relevant information. Understanding them isn't optional; it's foundational.
Understanding Vector Representations
Before diving into distance metrics, let's clarify what we're measuring. In AI search, everything gets converted into embeddings: dense numerical vectors that capture semantic meaning.
Example: Text to Vector
The sentence "I love machine learning" might become:
[0.23, -0.41, 0.87, 0.15, -0.62, ...] (hundreds or thousands of dimensions)
Similar concepts cluster together in this vector space:
- "I love machine learning" β [0.23, -0.41, 0.87, ...]
- "Machine learning is amazing" β [0.19, -0.38, 0.91, ...]
- "I hate pizza" β [-0.87, 0.62, -0.15, ...]
The first two sentences are semantically related (both positive about ML), so their vectors point in similar directions. The third is unrelated, pointing elsewhere.
Vector Space Visualization
[Diagram: in a 2-D slice of the vector space, "I love ML" and "ML is amazing" point in nearly the same direction (small angle between them), while "I hate pizza" points in a very different direction.]
Key Insight: Distance metrics measure relationships between these vectors. Different metrics emphasize different aspects of "similarity."
Cosine Similarity: The Direction Matcher
Cosine similarity measures the angle between two vectors, ignoring their magnitude (length). It's the most popular metric for semantic search.
The Formula
For vectors A and B:
cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)
Where:
- A · B = dot product (sum of element-wise products)
- ||A|| = magnitude of A (square root of sum of squared elements)
- ||B|| = magnitude of B
The Range
- +1 = vectors point in exactly the same direction (0° angle)
- 0 = vectors are perpendicular (90° angle, no similarity)
- -1 = vectors point in opposite directions (180° angle)
π‘ Why "Cosine"? The formula literally computes cos(ΞΈ), where ΞΈ is the angle between vectors. Remember trigonometry? cos(0Β°) = 1, cos(90Β°) = 0, cos(180Β°) = -1.
Why Magnitude Doesn't Matter
Consider two document vectors:
- Document A (short): [3, 4]
- Document B (long, but same topic): [6, 8]
Document B is just Document A scaled by 2. They have identical direction (same topic), just different lengths (document size). Cosine similarity correctly returns 1.0 (perfect match), while Euclidean distance would say they're far apart.
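A quick numeric check of this claim (a minimal sketch with NumPy, using the toy vectors above):

```python
import numpy as np

a = np.array([3.0, 4.0])   # short document
b = np.array([6.0, 8.0])   # same topic, twice as long

# Cosine similarity: dot product divided by the product of magnitudes
cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)                 # 1.0 -> identical direction

# Euclidean distance: straight-line separation
print(np.linalg.norm(a - b))   # 5.0 -> "far apart" despite the same topic
```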
Cosine Similarity Visualization
[Diagram: A = [3, 4] and B = [6, 8] lie along the same ray from the origin, so the angle θ between them is 0° and cos(θ) = 1: high similarity despite the different lengths.]
When to Use Cosine Similarity
- Text embeddings (most common use case)
- When document length shouldn't affect similarity
- High-dimensional sparse vectors (like TF-IDF)
- Recommendation systems
- Semantic search and RAG systems
When NOT to Use Cosine Similarity
- When magnitude matters (e.g., comparing temperature readings)
- Low-dimensional data where scale is important
- When vectors can be zero (cosine is undefined)
Euclidean Distance: The Straight Line
Euclidean distance measures the straight-line distance between two points in space. It's the "as the crow flies" metric: what most people intuitively think of as "distance."
The Formula
For vectors A and B with n dimensions:
euclidean_distance(A, B) = √[(A₁-B₁)² + (A₂-B₂)² + ... + (Aₙ-Bₙ)²]
This is just the Pythagorean theorem extended to n dimensions!
The Range
- 0 = vectors are identical (no distance)
- ∞ = vectors can be arbitrarily far apart
- Lower is better (opposite of cosine similarity)
2D Example
Points: A = [1, 2] and B = [4, 6]
| Step | Calculation | Result |
|---|---|---|
| 1 | (4-1)² + (6-2)² | 3² + 4² = 9 + 16 = 25 |
| 2 | √25 | 5 |
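The same arithmetic in code (a quick sketch; math.dist, available in Python 3.8+, computes exactly this):

```python
import math

A = [1, 2]
B = [4, 6]

# Manual Pythagorean form
manual = math.sqrt((4 - 1) ** 2 + (6 - 2) ** 2)

# Built-in equivalent (Python 3.8+)
builtin = math.dist(A, B)

print(manual, builtin)  # 5.0 5.0
```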
Euclidean Distance Visualization
[Diagram: A = [1, 2] and B = [4, 6] form a right triangle with legs of 3 units (x) and 4 units (y); the hypotenuse is the Euclidean distance. Distance = √(3² + 4²) = 5.]
Key Characteristics
- Scale-sensitive: Doubling all values doubles the distance
- Dimensions weighted equally: Each dimension contributes to the total distance
- Curse of dimensionality: In very high dimensions (1000+), distances become less meaningful; all points seem almost equally far apart (see the quick demo below)
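To see the curse of dimensionality concretely, here is a small illustrative experiment (random data, so exact numbers will vary): as dimensionality grows, the nearest and farthest neighbors of a query end up at nearly the same Euclidean distance.

```python
import numpy as np

rng = np.random.default_rng(0)

for dim in (2, 10, 100, 1000):
    points = rng.random((1000, dim))        # 1000 random points in [0, 1]^dim
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    # Contrast between nearest and farthest neighbor shrinks as dim grows
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:5d}  relative contrast={contrast:.2f}")
```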
When to Use Euclidean Distance
- Computer vision (image similarity)
- Physical measurements (locations, sensor data)
- Low-dimensional continuous data
- k-means clustering
- When magnitude and direction both matter
When NOT to Use Euclidean Distance
- High-dimensional sparse vectors (embeddings with 768+ dimensions)
- When scale differences between dimensions are problematic
- Text similarity (cosine is usually better)
Manhattan Distance: The Grid Walker
Manhattan distance (also called L1 distance or taxicab distance) measures distance as if you're navigating a city grid: you can only move along axes, not diagonally.
The Formula
For vectors A and B:
manhattan_distance(A, B) = |A₁-B₁| + |A₂-B₂| + ... + |Aₙ-Bₙ|
Just sum the absolute differences; no squares or square roots!
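For instance, a minimal implementation applied to the points A = [1, 2] and B = [4, 6] used in the comparison below:

```python
def manhattan(a, b):
    # Sum of absolute per-dimension differences
    return sum(abs(x - y) for x, y in zip(a, b))

print(manhattan([1, 2], [4, 6]))  # 7
```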
Visual Intuition
Imagine Manhattan's street grid. To get from point A to point B, you walk along blocks (you can't cut through buildings).
[Diagram: Manhattan distance (L1) from A = [1, 2] to B = [4, 6] follows the street grid: 3 blocks east + 4 blocks north = 7 blocks, versus the Euclidean straight line of 5 units.]
Comparison: A = [1, 2], B = [4, 6]
| Metric | Calculation | Result |
|---|---|---|
| Manhattan | \|4-1\| + \|6-2\| | 3 + 4 = 7 |
| Euclidean | √[(4-1)² + (6-2)²] | √25 = 5 |
Manhattan distance is always ≥ Euclidean distance (equality holds only when movement is along a single axis).
When to Use Manhattan Distance
- Sparse high-dimensional data
- When you want to emphasize differences in individual dimensions
- Computational efficiency (no squares/roots)
- Regression problems (L1 regularization)
- When outliers should have less influence than they would with Euclidean distance
When NOT to Use Manhattan Distance
- When diagonal relationships are important
- Rotation-sensitive applications
- Most semantic search use cases (cosine wins)
Detailed Comparison Example
Let's compute all three metrics for concrete vectors to see how they differ.
Vectors:
- A = [2, 3, 1]
- B = [4, 1, 3]
Step-by-Step Calculations
1. Cosine Similarity
| Step | Calculation | Result |
|---|---|---|
| Dot product | (2×4) + (3×1) + (1×3) | 8 + 3 + 3 = 14 |
| \|\|A\|\| | √(2² + 3² + 1²) | √14 ≈ 3.742 |
| \|\|B\|\| | √(4² + 1² + 3²) | √26 ≈ 5.099 |
| Cosine | 14 / (3.742 × 5.099) | 14 / 19.08 ≈ 0.734 |
2. Euclidean Distance
| Step | Calculation | Result |
|---|---|---|
| Differences | (4-2)², (1-3)², (3-1)² | 4, 4, 4 |
| Sum | 4 + 4 + 4 | 12 |
| Distance | √12 | ≈ 3.464 |
3. Manhattan Distance
| Step | Calculation | Result |
|---|---|---|
| Absolute diffs | \|4-2\| + \|1-3\| + \|3-1\| | 2 + 2 + 2 |
| Distance | Sum | 6 |
Summary Table
| Metric | Value | Interpretation |
|---|---|---|
| Cosine Similarity | 0.734 | Moderately similar direction |
| Euclidean Distance | 3.464 | Moderate spatial separation |
| Manhattan Distance | 6.0 | 6 total "steps" apart |
Notice: The metrics tell different stories! Cosine says "pretty similar" (0.734), while Manhattan says "fairly far" (6). Neither is "wrong"; they measure different things.
Converting Between Similarity and Distance
Notice that cosine similarity is higher when vectors are more similar (max = 1), but Euclidean and Manhattan distances are lower when vectors are more similar (min = 0). This can be confusing!
Cosine Distance
To convert cosine similarity to a distance metric:
cosine_distance = 1 - cosine_similarity
Now:
- 0 = identical vectors
- 1 = perpendicular vectors
- 2 = opposite vectors
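A quick sketch of the conversion using SciPy (assuming scipy is installed; its scipy.spatial.distance.cosine function returns the distance form directly):

```python
from scipy.spatial.distance import cosine

a = [2, 3, 1]
b = [4, 1, 3]

cos_dist = cosine(a, b)   # distance: 0 = identical direction
cos_sim = 1 - cos_dist    # similarity: 1 = identical direction

print(f"distance   = {cos_dist:.3f}")   # ~0.266
print(f"similarity = {cos_sim:.3f}")    # ~0.734
```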
Normalized Euclidean
Euclidean distance can be normalized to the [0, 1] range:
normalized = euclidean_distance / max_possible_distance
Practical Tip: Most vector databases (Pinecone, Weaviate, Milvus) let you choose your metric. They handle score normalization so "higher is better" for all metrics in search results.
Choosing the Right Metric: Decision Framework
DECISION TREE: Which metric to use, by data type?
- Text / embeddings → Cosine
- Image feature vectors → Euclidean or Cosine
- Physical measurements → Euclidean
Quick Reference Guide
Metric Selection Cheat Sheet
| Use Case | Best Metric | Why |
|---|---|---|
| Semantic search (text) | Cosine | Document length doesn't matter |
| RAG retrieval | Cosine | Standard for transformer embeddings |
| Image similarity | Euclidean | Pixel intensities are scale-meaningful |
| Recommendation systems | Cosine | User preference direction > magnitude |
| Clustering | Euclidean | K-means standard |
| Sparse high-D data | Manhattan | Efficient, outlier-resistant |
| Geographic coordinates | Haversine | Accounts for Earth's curvature (see the sketch below) |
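For reference, here is a minimal sketch of the haversine distance mentioned in the table (inputs in decimal degrees; Earth's mean radius taken as roughly 6371 km):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on Earth, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# London to Paris, roughly 343 km
print(haversine_km(51.5074, -0.1278, 48.8566, 2.3522))
```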
Real-World Performance Considerations
Computational Cost Ranking (fastest to slowest):
1. Manhattan - just additions and absolute values
2. Euclidean - requires squares and one square root
3. Cosine - requires a dot product AND magnitude calculations (2 square roots)
For 1000-dimensional vectors:
- Manhattan: ~1000 operations
- Euclidean: ~2000 operations + 1 sqrt
- Cosine: ~3000 operations + 2 sqrts
Optimization Tip: For cosine similarity, pre-normalize vectors to unit length (||v|| = 1). Then cosine similarity becomes just a dot product, as fast as it gets!
If ||A|| = 1 and ||B|| = 1:
cosine_similarity(A, B) = A · B
(no need to divide by magnitudes!)
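A sketch of this optimization with NumPy: normalize every vector once at indexing time, and similarity at query time collapses to a dot product.

```python
import numpy as np

def normalize(v):
    # Scale a vector to unit length (assumes v is non-zero)
    return v / np.linalg.norm(v)

a = normalize(np.array([2.0, 3.0, 1.0]))
b = normalize(np.array([4.0, 1.0, 3.0]))

# With unit-length vectors, the dot product IS the cosine similarity
print(a @ b)  # ~0.734, no divisions or square roots at query time
```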
Practical Code Examples
Let's implement all three metrics in Python:
Implementation from Scratch
import math
def cosine_similarity(a, b):
"""Calculate cosine similarity between vectors a and b"""
dot_product = sum(x * y for x, y in zip(a, b))
magnitude_a = math.sqrt(sum(x**2 for x in a))
magnitude_b = math.sqrt(sum(y**2 for y in b))
return dot_product / (magnitude_a * magnitude_b)
def euclidean_distance(a, b):
"""Calculate Euclidean distance between vectors a and b"""
return math.sqrt(sum((x - y)**2 for x, y in zip(a, b)))
def manhattan_distance(a, b):
"""Calculate Manhattan distance between vectors a and b"""
return sum(abs(x - y) for x, y in zip(a, b))
# Example usage
vec1 = [2, 3, 1]
vec2 = [4, 1, 3]
print(f"Cosine Similarity: {cosine_similarity(vec1, vec2):.3f}")
print(f"Euclidean Distance: {euclidean_distance(vec1, vec2):.3f}")
print(f"Manhattan Distance: {manhattan_distance(vec1, vec2):.1f}")
Output:
Cosine Similarity: 0.734
Euclidean Distance: 3.464
Manhattan Distance: 6.0
Using NumPy (Production Code)
import numpy as np
from scipy.spatial.distance import cosine, euclidean, cityblock
vec1 = np.array([2, 3, 1])
vec2 = np.array([4, 1, 3])
# Note: scipy.spatial.distance.cosine returns DISTANCE (1 - similarity)
cos_sim = 1 - cosine(vec1, vec2)
print(f"Cosine Similarity: {cos_sim:.3f}")
print(f"Euclidean Distance: {euclidean(vec1, vec2):.3f}")
print(f"Manhattan Distance: {cityblock(vec1, vec2):.1f}")
Real Semantic Search Example
import numpy as np
from scipy.spatial.distance import cosine
# Simulated sentence embeddings (in reality, from BERT/OpenAI)
query = np.array([0.2, 0.8, 0.3, 0.1]) # "machine learning"
docs = [
np.array([0.25, 0.75, 0.35, 0.15]), # "AI and ML"
np.array([0.9, 0.1, 0.05, 0.05]), # "cooking recipes"
np.array([0.22, 0.79, 0.28, 0.12]) # "deep learning"
]
for i, doc in enumerate(docs, 1):
sim = 1 - cosine(query, doc)
print(f"Document {i} similarity: {sim:.4f}")
Output:
Document 1 similarity: 0.9936 ← very similar
Document 2 similarity: 0.3490 ← unrelated
Document 3 similarity: 0.9992 ← most similar!
See how it works? Documents 1 and 3 (about ML) score high similarity to the query, while Document 2 (cooking) scores much lower.
Common Mistakes to Avoid
Mistake 1: Using Cosine for Magnitude-Sensitive Data
Wrong:
# Comparing temperatures where magnitude matters!
temp_yesterday = [22, 24, 26, 25]  # °C
temp_today = [44, 48, 52, 50]      # Twice as hot!
sim = cosine_similarity(temp_yesterday, temp_today)
print(sim) # 1.0 - "identical"?! NO!
Right:
# Use Euclidean when absolute values matter
dist = euclidean_distance(temp_yesterday, temp_today)
print(dist)  # 48.59 - correctly shows a big difference
Mistake 2: Forgetting to Normalize
Wrong:
# Comparing vectors with very different scales
user_a = [1000, 5] # Loves action movies, hates romance
user_b = [10, 0.05] # Same preferences, different scale
euc = euclidean_distance(user_a, user_b)
print(euc) # 990 - seems very different!
Right:
# Normalize first, OR use cosine similarity
cos = cosine_similarity(user_a, user_b)
print(cos)  # ~1.0 - correctly identifies the same preference!
Mistake 3: Comparing Across Different Metrics
Wrong:
score_a = cosine_similarity(q, doc_a) # 0.85
score_b = euclidean_distance(q, doc_b) # 3.2
if score_a > score_b: # NONSENSE! Different metrics!
return doc_a
Right:
# Use the SAME metric for all comparisons
scores = [(doc, cosine_similarity(q, doc)) for doc in documents]
best_doc = max(scores, key=lambda x: x[1])
Mistake 4: Zero Vectors
Wrong:
vec_a = [0, 0, 0]
vec_b = [1, 2, 3]
cos = cosine_similarity(vec_a, vec_b) # Division by zero!
Right:
def safe_cosine_similarity(a, b, epsilon=1e-8):
dot_product = sum(x * y for x, y in zip(a, b))
mag_a = math.sqrt(sum(x**2 for x in a)) + epsilon
mag_b = math.sqrt(sum(y**2 for y in b)) + epsilon
return dot_product / (mag_a * mag_b)
Mistake 5: High Dimensionality Issues
Problem: In very high dimensions (1000+), Euclidean distances become less discriminative; most points seem equidistant.
Solutions:
- Prefer cosine similarity for high-dimensional embeddings
- Apply dimensionality reduction (PCA, t-SNE)
- Use approximate nearest neighbor algorithms (HNSW, IVF); see the sketch below
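As a sketch of that last point, assuming the faiss package (pip install faiss-cpu) is available: for unit-normalized vectors, ranking by Euclidean distance is equivalent to ranking by cosine similarity (||a - b||² = 2 - 2·cos θ), so an HNSW index with the default L2 metric can serve cosine-style search.

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim, k = 128, 5
rng = np.random.default_rng(0)
docs = rng.random((10_000, dim), dtype=np.float32)
query = rng.random((1, dim), dtype=np.float32)

# Unit-normalize so that L2 ranking matches cosine ranking
faiss.normalize_L2(docs)
faiss.normalize_L2(query)

index = faiss.IndexHNSWFlat(dim, 32)  # HNSW graph index, default L2 metric
index.add(docs)                       # build the index

distances, ids = index.search(query, k)  # approximate top-k neighbors
print(ids[0], distances[0])
```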
Advanced Considerations
Distance Metric Properties
A proper metric must satisfy these axioms:
- Non-negativity: d(x,y) ≥ 0
- Identity: d(x,y) = 0 ⟺ x = y
- Symmetry: d(x,y) = d(y,x)
- Triangle inequality: d(x,z) ≤ d(x,y) + d(y,z)
Fun Fact: Cosine similarity isn't a metric at all (it's a similarity, not a distance), and even cosine distance (1 - similarity) fails the triangle inequality. The related angular distance, arccos(similarity)/π, does satisfy all four axioms.
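A tiny counterexample (a minimal sketch): three 2-D vectors at 0°, 45°, and 90°, where the direct cosine distance exceeds the sum of the two "legs".

```python
import numpy as np

def cos_dist(a, b):
    # Cosine distance = 1 - cosine similarity
    return 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])   # 45 degrees from both a and c
c = np.array([0.0, 1.0])

print(cos_dist(a, c))                   # 1.0
print(cos_dist(a, b) + cos_dist(b, c))  # ~0.586 -> less than d(a, c)!
```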
Weighted Distance Metrics
Sometimes dimensions aren't equally important:
import math

def weighted_euclidean(a, b, weights):
return math.sqrt(sum(w * (x - y)**2
for x, y, w in zip(a, b, weights)))
# Example: Emphasize first dimension 3x more
vec1 = [2, 3, 1]
vec2 = [4, 1, 3]
weights = [3.0, 1.0, 1.0] # First dim 3x more important
dist = weighted_euclidean(vec1, vec2, weights)
Minkowski Distance (Generalized)
Both Euclidean and Manhattan are special cases of Minkowski distance:
minkowski_distance(x, y, p) = (Σ|xᵢ - yᵢ|ᵖ)^(1/p)
- p = 1: Manhattan distance
- p = 2: Euclidean distance
- p = ∞: Chebyshev distance (maximum difference in any dimension)
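A minimal sketch of the generalized form, confirming that p = 1 and p = 2 reproduce the Manhattan and Euclidean results computed earlier:

```python
def minkowski(a, b, p):
    # (sum of |x_i - y_i|^p) ^ (1/p)
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

A = [2, 3, 1]
B = [4, 1, 3]

print(minkowski(A, B, 1))   # 6.0    -> Manhattan
print(minkowski(A, B, 2))   # ~3.464 -> Euclidean
print(max(abs(x - y) for x, y in zip(A, B)))  # 2 -> Chebyshev (p -> infinity)
```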
Vector Databases in Production
Modern vector databases optimize these operations:
| Database | Supported Metrics | Index Type |
|---|---|---|
| Pinecone | Cosine, Euclidean, Dot Product | Proprietary |
| Weaviate | Cosine, Euclidean, Manhattan, Dot | HNSW |
| Milvus | All + Hamming, Jaccard | IVF, HNSW, Annoy |
| FAISS | Euclidean, Inner Product | IVF, HNSW, PQ |
| Qdrant | Cosine, Euclidean, Dot Product | HNSW |
Production Tip: These databases use Approximate Nearest Neighbor (ANN) algorithms; they trade tiny accuracy losses (<1%) for massive speed gains (100-1000x faster).
Key Takeaways
Cosine similarity measures angle/direction: perfect for text embeddings and semantic search. Ignores magnitude.
Euclidean distance measures straight-line distance: good when scale matters (images, physical measurements).
Manhattan distance measures grid-based distance: efficient for high dimensions, outlier-resistant.
Different metrics, different insights: Choose based on your data type and what "similarity" means in your domain.
Normalize vectors when using Euclidean if scales differ. Cosine handles this automatically.
For RAG and semantic search: Cosine similarity is the standard (used by OpenAI, Cohere, Anthropic embeddings).
Performance matters: Manhattan < Euclidean < Cosine in computational cost. Pre-normalize for cosine to speed it up.
Watch for edge cases: Zero vectors break cosine. High dimensions weaken Euclidean. Always validate!
Quick Reference Card
Distance Metrics at a Glance
| Metric | Formula Intuition | Range | Best For | Speed |
|---|---|---|---|---|
| Cosine Similarity | Angle between vectors | -1 to +1 | Text, embeddings | Slowest |
| Euclidean Distance | Straight-line distance | 0 to ∞ | Images, coordinates | Medium |
| Manhattan Distance | Grid-path distance | 0 to ∞ | Sparse data | Fastest |
Memory Hook:
- Cosine for Context (text/meaning)
- Euclidean for Exact position
- Manhattan for Massive dimensions
Quick Decision:
β "Is it text/language?" β Cosine
β "Is it an image/continuous?" β Euclidean
β "Is it sparse/high-D?" β Manhattan
Further Study
Deepen your understanding with these resources:
"Similarity Measures for Text Document Clustering" - Academic overview of distance metrics in NLP contexts https://www.sciencedirect.com/topics/computer-science/cosine-similarity
Pinecone Learning Center: "Distance Metrics in Vector Search" - Practical guide from a leading vector database https://www.pinecone.io/learn/distance-metrics/
FAISS Documentation (Meta AI) - Deep dive into optimized similarity search implementations https://github.com/facebookresearch/faiss/wiki
Now that you understand how to measure similarity, you're ready to build powerful semantic search systems. Next up: learning about vector embeddings and how to generate them from raw text!