Vector Embeddings

Learn embedding models (Word2Vec, Sentence Transformers, OpenAI embeddings) and how to generate dense representations.

This lesson covers the mathematical foundations of embeddings, dimensionality and distance metrics, and practical applications in modern AI search systems. These are essential concepts for building semantic search and RAG (Retrieval Augmented Generation) solutions.

Welcome 🎯

Welcome to the world of vector embeddings, the mathematical backbone of modern AI search! If you've ever wondered how search engines understand that "puppy" and "dog" are related, or how ChatGPT retrieves relevant context from massive document collections, you're about to discover the answer.

Vector embeddings transform words, sentences, images, and other data into numerical representations that capture meaning. Unlike traditional keyword matching, embeddings enable semantic search—finding information based on conceptual similarity rather than exact text matches. This lesson will guide you through the core principles, mathematical foundations, and real-world applications that make embeddings indispensable in 2026's AI landscape.

What Are Vector Embeddings? 📊

Vector embeddings (or simply "embeddings") are numerical representations of data in multi-dimensional space. They convert discrete objects like words, sentences, or images into continuous vectors of real numbers, where semantically similar items are positioned close together.

The Core Concept

Think of embeddings as coordinates on a map of meaning. Just as GPS coordinates (latitude, longitude) represent physical locations, embeddings represent conceptual locations:

Traditional Keyword Matching:
"dog" ≠ "puppy" ≠ "canine" (exact match only)

Vector Embedding Space:
           puppy •
              ╱  ╲
            ╱     ╲
         dog •    • canine
              ╲   ╱
               ╲ ╱
              • pet

Distance encodes semantic similarity!

Key principle: If two concepts are semantically similar, their embedding vectors will be close together when we measure the distance between them.

From Words to Numbers

Embeddings solve a fundamental problem in AI: computers process numbers, not words. Early approaches used one-hot encoding:

Word	One-Hot Vector
cat	[1, 0, 0, 0, 0]
dog	[0, 1, 0, 0, 0]
puppy	[0, 0, 1, 0, 0]

Dimensionality: equal to the vocabulary size (e.g., 50,000), so the vectors are very sparse!

❌ Problem with one-hot encoding: Every word is equally distant from every other word—"cat" is as similar to "dog" as it is to "telescope".

✅ Solution with embeddings: Dense vectors in lower-dimensional space where semantic relationships are preserved:

Word	Dense Embedding
cat	[0.2, -0.5, 0.8, 0.1]
dog	[0.3, -0.4, 0.7, 0.2]
puppy	[0.4, -0.3, 0.6, 0.3]

Dimensionality: fixed and small (e.g., 384, 768, or 1536 dims).

Notice how "dog" and "puppy" have similar values across dimensions—they're neighbors in embedding space!
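To make the contrast concrete, here is a minimal sketch that measures pairwise distances for the one-hot and dense vectors from the tables above. The dense values are the illustrative numbers from this lesson, not output from a real model, and `pairwise_distances` is a helper name made up for this example.

```python
import math
from itertools import combinations

# One-hot vectors from the first table.
one_hot = {
    "cat":   [1, 0, 0, 0, 0],
    "dog":   [0, 1, 0, 0, 0],
    "puppy": [0, 0, 1, 0, 0],
}

# Dense vectors from the second table (illustrative values only).
dense = {
    "cat":   [0.2, -0.5, 0.8, 0.1],
    "dog":   [0.3, -0.4, 0.7, 0.2],
    "puppy": [0.4, -0.3, 0.6, 0.3],
}

def pairwise_distances(vectors):
    """Euclidean distance for every unordered pair of words."""
    return {
        (a, b): math.dist(vectors[a], vectors[b])
        for a, b in combinations(vectors, 2)
    }

one_hot_distances = pairwise_distances(one_hot)
dense_distances = pairwise_distances(dense)

for pair, distance in one_hot_distances.items():
    print("one-hot", pair, round(distance, 3))
for pair, distance in dense_distances.items():
    print("dense  ", pair, round(distance, 3))
```

Every one-hot pair comes out at the same distance (√2), so the encoding carries no notion of "closer" or "farther". The dense vectors produce distances that differ from pair to pair, which is what lets a search system rank results.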

Mathematical Representation

Formally, an embedding is a function E that maps from discrete space to continuous vector space:

E: X → ℝⁿ

Where:

X = input space (words, sentences, images)
ℝⁿ = n-dimensional real number space
n = embedding dimensionality (typically 128-4096)

Example: The sentence "AI search is powerful" might map to:

E("AI search is powerful") = [0.23, -0.45, 0.67, ..., 0.12] ∈ ℝ³⁸⁴
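As a rough sketch of what "a function from inputs to vectors" can look like in code, the example below implements E as a lookup table and builds a sentence vector by averaging word vectors (a simple baseline, not how transformer models work). The vocabulary, the 4-dimensional size, and the random values are all assumptions for illustration; a trained model learns these numbers.

```python
import random

random.seed(0)  # fixed seed so the toy vectors are reproducible

VOCAB = ["ai", "search", "is", "powerful"]  # hypothetical tiny vocabulary
DIM = 4  # real models use hundreds or thousands of dimensions

# One row of DIM real numbers per vocabulary entry. In a trained model these
# values are learned; here they are random placeholders.
EMBEDDING_TABLE = {
    word: [random.uniform(-1.0, 1.0) for _ in range(DIM)] for word in VOCAB
}

def embed_word(word):
    """E: X -> R^n for a single word, implemented as a table lookup."""
    return EMBEDDING_TABLE[word.lower()]

def embed_sentence(sentence):
    """Average the word vectors to get one fixed-size sentence vector."""
    vectors = [embed_word(w) for w in sentence.split()]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

vector = embed_sentence("AI search is powerful")
print(len(vector), vector)
```

Whatever the input length, the output always has exactly DIM values, which is the property that makes vectors from the same model directly comparable.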

Dimensionality: The Hidden Structure 🎲

Dimensionality refers to the number of values in an embedding vector. It is a design choice of the model, and together the dimensions encode different aspects of meaning.

Understanding Dimensions

While we can't visualize 384-dimensional space, think of each dimension as encoding a particular semantic feature:

Simplified 4D Embedding Space:
Dimension 1: Animal ←→ Object
Dimension 2: Large ←→ Small  
Dimension 3: Domestic ←→ Wild
Dimension 4: Common ←→ Rare

"elephant" → [0.9, 0.9, 0.3, 0.4]
"mouse"    → [0.9, 0.1, 0.5, 0.6]
"car"      → [0.1, 0.7, 0.8, 0.7]

💡 In reality: Modern embeddings don't have interpretable dimensions—each dimension contributes to many semantic features simultaneously through learned patterns.

Common Dimensionality Sizes

Model	Dimensions	Use Case
Word2Vec	50-300	Legacy word embeddings
BERT-base	768	General-purpose text
OpenAI text-embedding-3-small	1536	Modern semantic search
Cohere embed-v3	1024	Multilingual retrieval
Custom domain models	128-384	Speed-optimized apps

The Dimensionality Trade-off

Higher dimensions:

✅ Capture more nuanced semantic information
✅ Better distinguish fine-grained differences
❌ Require more storage (important at scale!)
❌ Slower similarity computations
❌ Risk of overfitting to training data

Lower dimensions:

✅ Faster search and computation
✅ Less storage overhead
❌ May lose subtle semantic distinctions
❌ Lower retrieval accuracy

🔧 Try this: Calculate storage for 1 million embeddings:

768 dimensions × 4 bytes (float32) × 1M vectors = 3.07 GB
1536 dimensions × 4 bytes × 1M vectors = 6.14 GB

At billion-scale, dimensionality choices have major infrastructure implications!
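The same arithmetic as a small helper, in case you want to try other sizes. This is a sketch that counts only the raw vectors (no index structures or metadata) and uses decimal gigabytes; the function names are made up for this example.

```python
def embedding_storage_bytes(dimensions, num_vectors, bytes_per_value=4):
    """Raw storage for the vectors alone (float32 by default)."""
    return dimensions * bytes_per_value * num_vectors

def to_gb(num_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 1e9

for dims in (384, 768, 1536):
    size = embedding_storage_bytes(dims, 1_000_000)
    print(f"{dims} dims x 1M vectors: {to_gb(size):.2f} GB")

# Same vectors at billion scale
print(f"768 dims x 1B vectors: {to_gb(embedding_storage_bytes(768, 10**9)):.0f} GB")
```

Passing `bytes_per_value=2` shows the effect of storing float16 values instead, which halves the footprint.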

Distance Metrics: Measuring Similarity 📏

Distance metrics (or similarity measures) quantify how close or far apart two embeddings are. This is crucial because retrieval in semantic search means finding the nearest neighbors to a query embedding.

Three Essential Metrics

1. Cosine Similarity ⭐ (Most Popular)

Measures the angle between vectors, ignoring magnitude:

Formula:

cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)
                        = cos(θ)

Where:

A · B = dot product
||A|| = magnitude (length) of vector A
θ = angle between vectors

Range: -1 (opposite) to +1 (identical)

1.0 = perfectly similar
0.0 = orthogonal (unrelated)
-1.0 = perfectly opposite

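Why it's popular: Cosine similarity is scale-invariant. Multiplying a vector by any positive constant leaves the score unchanged, so only the direction (meaning) counts, not the length. As a rough intuition, if "dog dog dog dog" produced the same vector as "dog" but four times longer, the two would still score 1.0.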

Visual Intuition (2D):

     A (0.8, 0.6)
      •
     /|
    / |
   /  |θ small → high similarity
  /   |
 •────•────→
 B (0.7, 0.5)

     C (-0.6, 0.8)
      •
      |╲
      | ╲
      |  ╲ θ large → low similarity
      |   ╲
 •────•────╲→
 A (0.8, 0.6)

Example calculation:

A = [0.5, 0.8, 0.3]
B = [0.6, 0.7, 0.4]

A · B = 0.5×0.6 + 0.8×0.7 + 0.3×0.4 = 0.3 + 0.56 + 0.12 = 0.98
||A|| = √(0.5² + 0.8² + 0.3²) = √(0.25 + 0.64 + 0.09) = √0.98 ≈ 0.99
||B|| = √(0.6² + 0.7² + 0.4²) = √(0.36 + 0.49 + 0.16) = √1.01 ≈ 1.00

cosine_similarity = 0.98 / (0.99 × 1.00) ≈ 0.99 (very similar!)
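The worked example above can be checked with a few lines of plain Python. This is a minimal sketch; in practice you would usually reach for a vectorized library rather than loops.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

A = [0.5, 0.8, 0.3]
B = [0.6, 0.7, 0.4]

similarity = cosine_similarity(A, B)
print(round(similarity, 3))  # 0.985
```

Keeping more decimal places gives 0.985, which rounds to the 0.99 shown in the hand calculation.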

2. Euclidean Distance (L2 Distance)

Measures the straight-line distance between points:

Formula:

euclidean_distance(A, B) = ||A - B|| = √(Σ(Aᵢ - Bᵢ)²)

Range: 0 (identical) to ∞ (arbitrarily far apart)

Smaller = more similar
Larger = less similar

Use case: When magnitude matters (e.g., "very happy" vs "slightly happy")

Visual (2D):

  A •────────• B
     ╲      /
      ╲    /  Direct distance
       ╲  /   = √((x₂-x₁)² + (y₂-y₁)²)
        •
      Origin

Example:

A = [1, 2, 3]
B = [4, 5, 6]

Distance = √((4-1)² + (5-2)² + (6-3)²)
         = √(9 + 9 + 9)
         = √27 ≈ 5.20
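The same calculation in code, as a minimal sketch using only the standard library:

```python
import math

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

A = [1, 2, 3]
B = [4, 5, 6]

distance = euclidean_distance(A, B)
print(round(distance, 2))  # 5.2
```

On Python 3.8 and later, `math.dist(A, B)` returns the same value.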

3. Dot Product

Simple multiplication and sum:

Formula:

dot_product(A, B) = Σ(Aᵢ × Bᵢ)

Range: -∞ to +∞

Higher = more similar (when vectors are normalized)

Speed advantage: Fastest to compute—no square roots or divisions!

💡 Pro tip: If your embeddings are normalized (length = 1), then:

Dot product = Cosine similarity
Many modern embedding models output pre-normalized vectors for this reason!
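A quick way to convince yourself of this equivalence is to normalize two vectors and compare the results. The vectors below are arbitrary toy values chosen so the arithmetic is easy to follow.

```python
import math

def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length (L2 norm of 1)."""
    length = math.sqrt(dot(v, v))
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in v]

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]  # length 5
b = [4.0, 3.0]  # length 5

raw_dot = dot(a, b)                         # 24.0, inflated by vector length
cosine = cosine_similarity(a, b)            # 24 / (5 * 5) = 0.96
unit_dot = dot(normalize(a), normalize(b))  # dot product of unit vectors

print(raw_dot, cosine, unit_dot)
```

The raw dot product (24.0) depends on vector length, while the dot product of the normalized vectors matches the cosine similarity up to floating-point rounding.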

Choosing the Right Metric

Metric	When to Use	Speed
Cosine	Most text embeddings, normalized data	Medium
Euclidean	When magnitude matters (intensity, scale)	Medium
Dot Product	Pre-normalized embeddings, speed-critical	Fast ⚡

🎯 In practice: Most RAG systems use cosine similarity or dot product on normalized vectors—they're mathematically equivalent and work excellently for semantic search.

How Embeddings Are Created 🧠

Embeddings don't appear by magic—they're learned by neural networks trained on massive datasets.

The Training Process

1. Architecture: Modern embeddings use transformer models (BERT, GPT, custom encoders)

2. Training objective: Networks learn to:

Place similar items close together
Push dissimilar items apart
Preserve semantic relationships

3. Common training methods:

Method	How It Works	Example
Skip-gram	Predict context from word	Word2Vec
CBOW	Predict word from context	Word2Vec
Masked Language Modeling	Predict masked tokens	BERT
Contrastive Learning	Similar pairs close, dissimilar far	Sentence-BERT

Example: Contrastive Training

Training Data: Sentence Pairs + Labels

Positive pair (similar):
"The cat sleeps"     → [0.2, 0.8, ...]
"A feline rests"     → [0.3, 0.7, ...]  ← Pull together
                         Distance: small ✓

Negative pair (dissimilar):
"The cat sleeps"     → [0.2, 0.8, ...]
"Quantum physics"    → [-0.6, 0.1, ...] ← Push apart
                         Distance: large ✓

Loss function: Rewards the model for:

Making similar pairs have high cosine similarity
Making dissimilar pairs have low cosine similarity

After millions of examples, the model learns rich semantic representations!
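To show how such a loss can be computed for a single pair, here is a sketch of one simple pairwise form (often called a cosine embedding loss). It is an assumption for illustration, not the specific objective used by any model named in this lesson, and the 2-D vectors are toy values based on the first two numbers shown above.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def cosine_pair_loss(a, b, is_similar, margin=0.0):
    """Loss for one labeled pair.

    Similar pairs are penalized for low cosine similarity;
    dissimilar pairs are penalized only when similarity exceeds the margin.
    """
    cos = cosine_similarity(a, b)
    if is_similar:
        return 1.0 - cos
    return max(0.0, cos - margin)

cat_sleeps = [0.2, 0.8]        # toy 2-D stand-ins, not real model output
feline_rests = [0.3, 0.7]
quantum_physics = [-0.6, 0.1]

positive_loss = cosine_pair_loss(cat_sleeps, feline_rests, is_similar=True)
negative_loss = cosine_pair_loss(cat_sleeps, quantum_physics, is_similar=False)
mislabeled_loss = cosine_pair_loss(cat_sleeps, feline_rests, is_similar=False)

print(round(positive_loss, 4), negative_loss, round(mislabeled_loss, 4))
```

Both well-placed pairs produce a loss near zero, while treating the two cat sentences as a negative pair produces a large loss. During training, gradients of this loss are what move the vectors.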

Pre-trained vs Custom Embeddings

Pre-trained models (recommended for most use cases):

✅ Trained on billions of tokens
✅ Strong general-purpose understanding
✅ Ready to use immediately
Examples: OpenAI embeddings, Cohere, Sentence-Transformers

Custom fine-tuned models:

✅ Optimized for domain-specific terminology
✅ Better performance on specialized tasks
❌ Requires labeled training data
❌ Significant compute and expertise needed

🤔 Did you know? OpenAI's text-embedding-3 models accept a dimensions parameter that returns shortened vectors, so you can trade a little accuracy for lower storage and faster search without switching models!

Real-World Applications 🌍

Example 1: Semantic Search Engine

Scenario: A customer searches "how do I reset my password" in a help center.

Traditional keyword search:

Looks for exact matches: "reset", "password"
Misses articles titled "Forgot your login credentials?" or "Account recovery process"

Vector embedding search:

Step 1: Encode query
"how do I reset my password" → [0.23, -0.45, 0.67, ..., 0.12]

Step 2: Compare with document embeddings
Doc 1: "Forgot your login?"        → [0.25, -0.43, 0.65, ...]  similarity: 0.94 ⭐
Doc 2: "Changing profile picture"  → [0.10, 0.72, -0.30, ...] similarity: 0.42
Doc 3: "Account recovery guide"    → [0.24, -0.41, 0.68, ...]  similarity: 0.92 ⭐

Step 3: Return top results
1. "Forgot your login?" (94% match)
2. "Account recovery guide" (92% match)

Result: Users find answers even when wording differs completely! 🎯
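The three steps can be sketched in a few lines. The vectors here are made-up 3-D toy values built from the first three numbers shown above, so the scores will not match the illustrative percentages; only the ranking is meant to carry over. The `search` function is a hypothetical helper, not part of any library.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def search(query_vector, documents, top_k=2):
    """Score every document against the query and return the best top_k."""
    scored = [
        (cosine_similarity(query_vector, vector), title)
        for title, vector in documents.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:top_k]

# Step 2 inputs: pre-computed document embeddings (toy values).
documents = {
    "Forgot your login?":       [0.25, -0.43, 0.65],
    "Changing profile picture": [0.10, 0.72, -0.30],
    "Account recovery guide":   [0.24, -0.41, 0.68],
}

# Step 1: the encoded query (toy values standing in for a model's output).
query = [0.23, -0.45, 0.67]

# Step 3: return the top results.
results = search(query, documents, top_k=2)
for score, title in results:
    print(f"{title}: {score:.3f}")
```

This brute-force loop compares the query with every document, which is fine for small collections and is the baseline that vector indexes speed up.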

Example 2: Retrieval Augmented Generation (RAG)

Scenario: An AI assistant answering questions about company policies.

Without RAG: The model knows only what it was trained on, so answers can be outdated or hallucinated.

With RAG:

User Query: "What's our remote work policy?"
       ↓
  [Embed query]
       ↓
┌──────────────────────────────┐
│  Vector Database Search      │
│                              │
│  Policy Doc 1: [0.21, ...]   │ ← similarity: 0.88
│  Policy Doc 2: [0.52, ...]   │ ← similarity: 0.45
│  Policy Doc 3: [0.19, ...]   │ ← similarity: 0.91 ⭐ Best!
│  Policy Doc 4: [-0.32, ...]  │ ← similarity: 0.33
└──────────────────────────────┘
       ↓
  [Retrieve Doc 3]
       ↓
┌──────────────────────────────┐
│  LLM Generation              │
│                              │
│  Context: [Policy Doc 3]     │
│  Query: "remote work?"       │
│  → Generate answer           │
└──────────────────────────────┘
       ↓
"Employees may work remotely up to 3 days/week..."

Benefits:

✅ Answers grounded in actual documents
✅ Up-to-date information
✅ Reduces hallucinations
✅ Citable sources
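The retrieval half of this flow can be sketched without any external service. In the example below the vectors, the policy texts, and the prompt wording are all hypothetical, and the final LLM call is left out; the point is only to show how retrieved text ends up in the prompt.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def retrieve(query_vector, store, top_k=1):
    """Return the top_k stored chunks most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item["vector"]),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(question, chunks):
    """Place the retrieved text in front of the question for the LLM."""
    context = "\n\n".join(chunk["text"] for chunk in chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Toy vector store: each entry pairs a text chunk with its (made-up) embedding.
store = [
    {"text": "Expense reports are due monthly.", "vector": [0.52, 0.10]},
    {"text": "Employees may work remotely up to 3 days/week.", "vector": [0.19, 0.90]},
]

question = "What's our remote work policy?"
query_vector = [0.21, 0.88]  # hypothetical embedding of the question

top_chunks = retrieve(query_vector, store, top_k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

The resulting prompt string is what you would send to the language model, which keeps the answer tied to the retrieved document.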

Example 3: Recommendation Systems

Scenario: Music streaming service recommending songs.

Traditional approach: Genre tags, artist matching

Embedding approach: Represent songs as vectors capturing:

Musical style (tempo, instrumentation)
Mood and energy
Lyrical themes
Cultural context

User liked:
"Bohemian Rhapsody" → [0.7, 0.3, 0.8, -0.2, ...]

Similar songs by vector distance:
"Stairway to Heaven"  → [0.6, 0.4, 0.7, -0.1, ...] distance: 0.15 ⭐
"Sweet Child O' Mine" → [0.5, 0.5, 0.6, 0.0, ...]  distance: 0.28 ⭐
"Baby Shark"          → [0.1, 0.9, 0.2, 0.8, ...]  distance: 1.42 ✗

The system discovers nuanced similarities that simple genre tags miss!

Example 4: Multi-modal Search

Scenario: Search for images using text descriptions.

Key insight: Embeddings can represent different data types in the same vector space!

      Shared Embedding Space
┌────────────────────────────────────┐
│                                    │
│  "sunset beach" → [0.4, 0.8, ...]  │
│        ↓ small distance            │
│  🌅 (image) → [0.5, 0.7, ...]      │
│                                    │
│  "city traffic" → [-0.3, 0.2, ...] │
│        ↓ small distance            │
│  🚗 (image) → [-0.4, 0.3, ...]     │
│                                    │
└────────────────────────────────────┘

Models like CLIP (OpenAI) learn to embed images and text together, enabling:

Text → image search
Image → text descriptions
Image → similar image search

Common Mistakes ⚠️

Mistake 1: Comparing Embeddings from Different Models

❌ Wrong:

embedding_A = model_1.encode("cat")  # 384 dimensions
embedding_B = model_2.encode("dog")  # 768 dimensions
similarity = cosine_similarity(embedding_A, embedding_B)  # Error or meaningless!

✅ Right: Always use embeddings from the same model for comparison. Each model has its own "semantic coordinate system."

Mistake 2: Ignoring Normalization

❌ Wrong:

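# Using dot product on non-normalized vectors
import numpy as np
embedding_A = np.array([5.2, 3.8, 6.1])  # Large magnitudes (illustrative values)
embedding_B = np.array([4.9, 4.1, 5.8])  # Hypothetical second vector
similarity = np.dot(embedding_A, embedding_B)  # Magnitude dominates!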

✅ Right: Normalize embeddings if using dot product:

normalized = embedding / np.linalg.norm(embedding)  # Length = 1

Mistake 3: Wrong Distance Metric for Your Use Case

❌ Wrong: Using Euclidean distance when you want semantic similarity:

"I love this product" → [0.9, 0.1]
"I love this product product product" → [2.7, 0.3]
Euclidean distance: ≈ 1.81 (treated as different, but the direction, and therefore the meaning, is identical!)

✅ Right: Use cosine similarity for semantic tasks—it ignores magnitude.
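Here is a small check of that claim using the two illustrative vectors from this example, where the second is exactly three times the first:

```python
import math

short_text = [0.9, 0.1]     # "I love this product"
repeated_text = [2.7, 0.3]  # same direction, three times the length

euclidean = math.dist(short_text, repeated_text)

dot = sum(x * y for x, y in zip(short_text, repeated_text))
cosine = dot / (math.hypot(*short_text) * math.hypot(*repeated_text))

print(f"Euclidean distance: {euclidean:.2f}")
print(f"Cosine similarity:  {cosine:.2f}")
```

Euclidean distance reports a gap of about 1.81, while cosine similarity comes out at 1.00 because the two vectors point in the same direction.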

Mistake 4: Not Chunking Long Documents

❌ Wrong: Embedding a 50-page document as one vector:

Most models have token limits (512, 8192, etc.)
Averaging loses fine-grained information
Single vector can't represent multiple topics well

✅ Right: Split into chunks (paragraphs, sections):

Document → [Chunk 1 embedding, Chunk 2 embedding, ...]

Then search at chunk level for precise retrieval.
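A minimal chunking sketch follows. It splits on whitespace and counts words rather than model tokens, and it adds a small overlap between neighboring chunks so sentences cut at a boundary still appear whole somewhere; the overlap and the default sizes are assumptions for illustration, not recommendations from this lesson.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into word-based chunks that overlap by `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

document = " ".join(f"word{i}" for i in range(10))
chunks = chunk_words(document, chunk_size=4, overlap=1)
for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i}: {chunk}")
```

Each chunk would then be embedded separately and stored alongside a reference to its source document, so a search hit can point back to the right passage.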

Mistake 5: Forgetting to Update Embeddings

❌ Wrong: Embedding documents once at system launch, never updating.

✅ Right: Re-embed when:

Documents are edited or added
You switch to a better embedding model
Your domain terminology evolves

💡 Pro tip: Use versioning for embeddings—track which model/version created them!
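One way to apply this tip is to store the model name and version next to every vector and check them before comparing. The sketch below is a hypothetical design; the class, field names, and model identifiers are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredEmbedding:
    doc_id: str
    vector: tuple
    model_name: str     # hypothetical identifier for the embedding model
    model_version: str  # hypothetical version tag

def assert_comparable(a, b):
    """Refuse to compare vectors that came from different models or versions."""
    if (a.model_name, a.model_version) != (b.model_name, b.model_version):
        raise ValueError("embeddings come from different models or versions")
    if len(a.vector) != len(b.vector):
        raise ValueError("embeddings have different dimensionality")

def needs_reembedding(item, current_model, current_version):
    """True when a stored vector was produced by an older model or version."""
    return (item.model_name, item.model_version) != (current_model, current_version)

old = StoredEmbedding("faq-1", (0.1, 0.2, 0.3), "model-a", "v1")
new = StoredEmbedding("faq-2", (0.3, 0.1, 0.2), "model-a", "v2")

print(needs_reembedding(old, "model-a", "v2"))  # True
print(needs_reembedding(new, "model-a", "v2"))  # False
```

Calling `assert_comparable(old, new)` raises a `ValueError`, which also guards against Mistake 1 when a store holds vectors from more than one model.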

Mistake 6: Overlooking Computational Costs at Scale

❌ Wrong: Re-embedding documents at query time and scanning every vector at million-doc scale (only the query itself needs to be embedded per request):

Embedding API calls: $$ + latency
Brute-force similarity: O(n) comparisons per query

✅ Right:

Pre-compute and cache document embeddings
Use vector databases with indexing (HNSW, IVF) for sub-linear search
Batch embedding requests when possible

Key Takeaways 🎓

📋 Quick Reference Card: Vector Embeddings

What They Are	Numerical representations (vectors) of data that capture semantic meaning
Core Purpose	Enable computers to measure similarity between concepts, not just keywords
Typical Dimensions	128-4096 (higher = more nuanced, but slower & bigger)
Best Distance Metric	Cosine similarity (or dot product on normalized vectors)
Created By	Neural networks (transformers) trained on massive text corpora
Key Applications	Semantic search, RAG systems, recommendations, multi-modal search
Critical Rule	Never compare embeddings from different models!
Storage Formula	dimensions × 4 bytes × number_of_vectors
When to Normalize	When using dot product instead of cosine similarity
Chunking Strategy	Split long documents (500-1000 tokens/chunk) for precise retrieval

Essential Formulas

Cosine Similarity:

cos(θ) = (A · B) / (||A|| × ||B||)
Range: [-1, 1]
Higher = more similar

Euclidean Distance:

d = √(Σ(Aᵢ - Bᵢ)²)
Range: [0, ∞)
Lower = more similar

Dot Product:

A · B = Σ(Aᵢ × Bᵢ)
For normalized vectors: equals cosine similarity

Mental Model: The Meaning Map 🗺️

Think of embeddings as GPS coordinates for concepts:

Each dimension = a direction in meaning-space
Distance = semantic difference
Close neighbors = related concepts
Vector math captures relationships ("king" - "man" + "woman" ≈ "queen")
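The "king" - "man" + "woman" idea can be tried on hand-made vectors. The 2-D values below are invented so that one axis loosely means "royalty" and the other "masculinity"; real embeddings are learned and only approximately behave this way.

```python
import math

# Hand-made 2-D toy vectors: (royalty, masculinity). Not real model output.
vectors = {
    "king":     [0.9, 0.9],
    "queen":    [0.9, 0.1],
    "man":      [0.1, 0.9],
    "woman":    [0.1, 0.1],
    "prince":   [0.8, 0.9],
    "princess": [0.8, 0.1],
    "apple":    [0.0, 0.5],
}

def analogy(a, b, c, vocab):
    """Return the word nearest to vocab[a] - vocab[b] + vocab[c], excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vocab[a], vocab[b], vocab[c])]
    candidates = [word for word in vocab if word not in (a, b, c)]
    return min(candidates, key=lambda word: math.dist(vocab[word], target))

result = analogy("king", "man", "woman", vectors)
print(result)  # queen
```

Subtracting "man" removes the masculinity component, adding "woman" puts the opposite value back, and the nearest remaining word to the result is "queen".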

When to Use What

Semantic Search: Use pre-trained embeddings (OpenAI, Cohere) + cosine similarity

RAG Systems: Chunk documents → embed → store in vector DB → retrieve top-k → feed to LLM

Domain-Specific: Consider fine-tuning on your data if general models underperform

Real-Time Applications: Pre-compute embeddings, use fast vector databases (Pinecone, Weaviate, Qdrant)

📚 Further Study

Dive Deeper:

Understanding Vector Embeddings - Pinecone Learning Center - Comprehensive guide with interactive examples and code tutorials
Sentence Transformers Documentation - Open-source library for state-of-the-art embeddings with pre-trained models
OpenAI Embeddings Guide - Official documentation covering best practices, use cases, and API reference

Practice Exercises:

Implement cosine similarity from scratch in Python
Build a simple semantic search over your own documents
Compare different embedding models on a retrieval task
Visualize embeddings in 2D using t-SNE or UMAP

Next Steps: Now that you understand embeddings, explore vector databases (storage & indexing), retrieval strategies (hybrid search, reranking), and evaluation metrics (recall@k, MRR) in the next lessons!

🚀 Congratulations! You've mastered the foundational concept behind modern AI search. Vector embeddings are your gateway to building intelligent, semantic-aware applications in 2026 and beyond!
