
Vector Embeddings

Learn embedding models (Word2Vec, Sentence Transformers, OpenAI embeddings) and how to generate dense representations.


Master vector embeddings with free flashcards and spaced repetition practice. This lesson covers the mathematical foundations of embeddings, dimensionality and distance metrics, and practical applications in modern AI search systems: essential concepts for building semantic search and RAG (Retrieval Augmented Generation) solutions.

Welcome 🎯

Welcome to the world of vector embeddings, the mathematical backbone of modern AI search! If you've ever wondered how search engines understand that "puppy" and "dog" are related, or how ChatGPT retrieves relevant context from massive document collections, you're about to discover the answer.

Vector embeddings transform words, sentences, images, and other data into numerical representations that capture meaning. Unlike traditional keyword matching, embeddings enable semantic search: finding information based on conceptual similarity rather than exact text matches. This lesson will guide you through the core principles, mathematical foundations, and real-world applications that make embeddings indispensable in 2026's AI landscape.

What Are Vector Embeddings? 📊

Vector embeddings (or simply "embeddings") are numerical representations of data in multi-dimensional space. They convert discrete objects like words, sentences, or images into continuous vectors of real numbers, where semantically similar items are positioned close together.

The Core Concept

Think of embeddings as coordinates on a map of meaning. Just as GPS coordinates (latitude, longitude) represent physical locations, embeddings represent conceptual locations:

Traditional Keyword Matching:
"dog" β‰  "puppy" β‰  "canine" (exact match only)

Vector Embedding Space:
           puppy β€’
              β•±  β•²
            β•±     β•²
         dog β€’    β€’ canine
              β•²   β•±
               β•² β•±
              β€’ pet

Distance encodes semantic similarity!

Key principle: If two concepts are semantically similar, their embedding vectors will be close together when we measure the distance between them.

From Words to Numbers

Embeddings solve a fundamental problem in AI: computers process numbers, not words. Early approaches used one-hot encoding:

Word  | One-Hot Vector   | Dimensionality
cat   | [1, 0, 0, 0, 0]  | Vocabulary size
dog   | [0, 1, 0, 0, 0]  | (e.g., 50,000)
puppy | [0, 0, 1, 0, 0]  | Very sparse!

❌ Problem with one-hot encoding: Every word is equally distant from every other word: "cat" is as similar to "dog" as it is to "telescope".

✅ Solution with embeddings: Dense vectors in lower-dimensional space where semantic relationships are preserved:

Word  | Dense Embedding        | Dimensionality
cat   | [0.2, -0.5, 0.8, 0.1]  | Fixed & small
dog   | [0.3, -0.4, 0.7, 0.2]  | (e.g., 384, 768,
puppy | [0.4, -0.3, 0.6, 0.3]  | or 1536 dims)

Notice how "dog" and "puppy" have similar values across dimensions: they're neighbors in embedding space!

Mathematical Representation

Formally, an embedding is a function E that maps from discrete space to continuous vector space:

E: X → ℝⁿ

Where:

  • X = input space (words, sentences, images)
  • ℝⁿ = n-dimensional real number space
  • n = embedding dimensionality (typically 128-4096)

Example: The sentence "AI search is powerful" might map to:

E("AI search is powerful") = [0.23, -0.45, 0.67, ..., 0.12] ∈ ℝ³⁸⁴

Dimensionality: The Hidden Structure 🎲

Dimensionality refers to the number of values in an embedding vector. This isn't arbitraryβ€”each dimension captures different aspects of meaning.

Understanding Dimensions

While we can't visualize 384-dimensional space, think of each dimension as encoding a particular semantic feature:

Simplified 4D Embedding Space:
Dimension 1: Animal ←→ Object
Dimension 2: Large ←→ Small  
Dimension 3: Domestic ←→ Wild
Dimension 4: Common ←→ Rare

"elephant" β†’ [0.9, 0.9, 0.3, 0.4]
"mouse"    β†’ [0.9, 0.1, 0.5, 0.6]
"car"      β†’ [0.1, 0.7, 0.8, 0.7]

πŸ’‘ In reality: Modern embeddings don't have interpretable dimensionsβ€”each dimension contributes to many semantic features simultaneously through learned patterns.

Common Dimensionality Sizes

Model                         | Dimensions | Use Case
Word2Vec                      | 50-300     | Legacy word embeddings
BERT-base                     | 768        | General-purpose text
OpenAI text-embedding-3-small | 1536       | Modern semantic search
Cohere embed-v3               | 1024       | Multilingual retrieval
Custom domain models          | 128-384    | Speed-optimized apps

The Dimensionality Trade-off

Higher dimensions:

  • ✅ Capture more nuanced semantic information
  • ✅ Better distinguish fine-grained differences
  • ❌ Require more storage (important at scale!)
  • ❌ Slower similarity computations
  • ❌ Risk of overfitting to training data

Lower dimensions:

  • ✅ Faster search and computation
  • ✅ Less storage overhead
  • ❌ May lose subtle semantic distinctions
  • ❌ Lower retrieval accuracy

🔧 Try this: Calculate storage for 1 million embeddings:

  • 768 dimensions × 4 bytes (float32) × 1M vectors = 3.07 GB
  • 1536 dimensions × 4 bytes × 1M vectors = 6.14 GB

At billion-scale, dimensionality choices have major infrastructure implications!
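
A quick back-of-the-envelope check in Python reproduces these numbers (using 1 GB = 10⁹ bytes):

def embedding_storage_gb(dims, num_vectors, bytes_per_value=4):
    # float32 stores each dimension in 4 bytes
    return dims * bytes_per_value * num_vectors / 1e9

print(embedding_storage_gb(768, 1_000_000))   # ~3.07 GB
print(embedding_storage_gb(1536, 1_000_000))  # ~6.14 GB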

Distance Metrics: Measuring Similarity 📏

Distance metrics (or similarity measures) quantify how close or far apart two embeddings are. This is crucial because retrieval in semantic search means finding the nearest neighbors to a query embedding.

Three Essential Metrics

1. Cosine Similarity

Measures the angle between vectors, ignoring magnitude:

Formula:

cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)
                        = cos(θ)

Where:

  • A · B = dot product
  • ||A|| = magnitude (length) of vector A
  • θ = angle between vectors

Range: -1 (opposite) to +1 (identical)

  • 1.0 = perfectly similar
  • 0.0 = orthogonal (unrelated)
  • -1.0 = perfectly opposite

Why it's popular: Cosine similarity is scale-invariant. Whether you say "dog" or "dog dog dog dog", the direction (meaning) stays the same.

Visual Intuition (2D):

     A (0.8, 0.6)
      •
     /|
    / |
   /  |θ small → high similarity
  /   |
 •────•────→
 B (0.7, 0.5)

     C (-0.6, 0.8)
      •
      |╲
      | ╲
      |  ╲ θ large → low similarity
      |   ╲
 •────•────╲→
 A (0.8, 0.6)

Example calculation:

A = [0.5, 0.8, 0.3]
B = [0.6, 0.7, 0.4]

A · B = 0.5×0.6 + 0.8×0.7 + 0.3×0.4 = 0.3 + 0.56 + 0.12 = 0.98
||A|| = √(0.5² + 0.8² + 0.3²) = √(0.25 + 0.64 + 0.09) = √0.98 ≈ 0.99
||B|| = √(0.6² + 0.7² + 0.4²) = √(0.36 + 0.49 + 0.16) = √1.01 ≈ 1.00

cosine_similarity = 0.98 / (0.99 × 1.00) ≈ 0.99 (very similar!)
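
The same calculation takes a few lines of NumPy; this is a sketch of the formula itself, independent of any particular embedding model:

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

A = [0.5, 0.8, 0.3]
B = [0.6, 0.7, 0.4]
print(cosine_similarity(A, B))  # ≈ 0.985, matching the hand calculation above
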
2. Euclidean Distance (L2 Distance)

Measures the straight-line distance between points:

Formula:

euclidean_distance(A, B) = ||A - B|| = √(Σ(Aᵢ - Bᵢ)²)

Range: 0 (identical) to ∞ (completely different)

  • Smaller = more similar
  • Larger = less similar

Use case: When magnitude matters (e.g., "very happy" vs "slightly happy")

Visual (2D):

  A •────────• B
     ╲      /
      ╲    /  Direct distance
       ╲  /   = √((x₂-x₁)² + (y₂-y₁)²)
        •
      Origin

Example:

A = [1, 2, 3]
B = [4, 5, 6]

Distance = √((4-1)² + (5-2)² + (6-3)²)
         = √(9 + 9 + 9)
         = √27 ≈ 5.20
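
The same check with NumPy (a minimal sketch of the formula):

import numpy as np

A = np.array([1, 2, 3])
B = np.array([4, 5, 6])
print(np.linalg.norm(A - B))  # ≈ 5.196, i.e. 5.20 when rounded
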
3. Dot Product

Simple multiplication and sum:

Formula:

dot_product(A, B) = Σ(Aᵢ × Bᵢ)

Range: -∞ to +∞

  • Higher = more similar (when vectors are normalized)

Speed advantage: Fastest to compute: no square roots or divisions!

💡 Pro tip: If your embeddings are normalized (length = 1), then:

  • Dot product = Cosine similarity
  • Many modern embedding models output pre-normalized vectors for this reason! (See the quick check below.)
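
A quick NumPy check of that equivalence, using toy vectors rather than real embeddings:

import numpy as np

a = np.array([0.5, 0.8, 0.3])
b = np.array([0.6, 0.7, 0.4])

# Normalize each vector to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot_normalized = np.dot(a_unit, b_unit)

print(np.isclose(cosine, dot_normalized))  # True: same number, cheaper to compute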

Choosing the Right Metric

Metric      | When to Use                                | Speed
Cosine      | Most text embeddings, normalized data      | Medium
Euclidean   | When magnitude matters (intensity, scale)  | Medium
Dot Product | Pre-normalized embeddings, speed-critical  | Fast ⚡

🎯 In practice: Most RAG systems use cosine similarity or dot product on normalized vectors; the two are mathematically equivalent in that case and work very well for semantic search.

How Embeddings Are Created 🧠

Embeddings don't appear by magic; they're learned by neural networks trained on massive datasets.

The Training Process

1. Architecture: Modern embeddings use transformer models (BERT, GPT, custom encoders)

2. Training objective: Networks learn to:

  • Place similar items close together
  • Push dissimilar items apart
  • Preserve semantic relationships

3. Common training methods:

Method                   | How It Works                         | Example
Skip-gram                | Predict context from word            | Word2Vec
CBOW                     | Predict word from context            | Word2Vec
Masked Language Modeling | Predict masked tokens                | BERT
Contrastive Learning     | Similar pairs close, dissimilar far  | Sentence-BERT

Example: Contrastive Training

Training Data: Sentence Pairs + Labels

Positive pair (similar):
"The cat sleeps"     → [0.2, 0.8, ...]
"A feline rests"     → [0.3, 0.7, ...]  ← Push together
                         Distance: small ✓

Negative pair (dissimilar):
"The cat sleeps"     → [0.2, 0.8, ...]
"Quantum physics"    → [-0.6, 0.1, ...] ← Push apart
                         Distance: large ✓

Loss function: Rewards the model for:

  • Making similar pairs have high cosine similarity
  • Making dissimilar pairs have low cosine similarity

After millions of examples, the model learns rich semantic representations!
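
To make the idea concrete, here is a toy margin-based contrastive objective in NumPy. It is a simplified stand-in for the losses real training frameworks use, and the vectors are made up:

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def contrastive_loss(anchor, positive, negative, margin=0.5):
    # Penalize the positive pair for being far apart...
    pos_term = 1.0 - cosine(anchor, positive)
    # ...and the negative pair only if it is closer than the margin allows
    neg_term = max(0.0, cosine(anchor, negative) - margin)
    return pos_term + neg_term

anchor   = np.array([0.2, 0.8, 0.1])   # "The cat sleeps"  (toy values)
positive = np.array([0.3, 0.7, 0.2])   # "A feline rests"
negative = np.array([-0.6, 0.1, 0.9])  # "Quantum physics"

print(contrastive_loss(anchor, positive, negative))  # small loss: these pairs are already well placed

During training, gradients from this kind of loss nudge the encoder's weights so that future positive pairs land even closer together.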

Pre-trained vs Custom Embeddings

Pre-trained models (recommended for most use cases):

  • ✅ Trained on billions of tokens
  • ✅ Strong general-purpose understanding
  • ✅ Ready to use immediately
  • Examples: OpenAI embeddings, Cohere, Sentence-Transformers

Custom fine-tuned models:

  • ✅ Optimized for domain-specific terminology
  • ✅ Better performance on specialized tasks
  • ❌ Requires labeled training data
  • ❌ Significant compute and expertise needed

🤔 Did you know? Modern embedding models such as OpenAI's text-embedding-3-large typically combine large-scale unsupervised pretraining with contrastive fine-tuning on curated paired data, which is what gives them both broad coverage and strong retrieval performance.

Real-World Applications 🌍

Example 1: Semantic Search Engine

Scenario: A customer searches "how do I reset my password" in a help center.

Traditional keyword search:

  • Looks for exact matches: "reset", "password"
  • Misses articles titled "Forgot your login credentials?" or "Account recovery process"

Vector embedding search:

Step 1: Encode query
"how do I reset my password" β†’ [0.23, -0.45, 0.67, ..., 0.12]

Step 2: Compare with document embeddings
Doc 1: "Forgot your login?"        β†’ [0.25, -0.43, 0.65, ...]  similarity: 0.94 ⭐
Doc 2: "Changing profile picture"  β†’ [0.10, 0.72, -0.30, ...] similarity: 0.42
Doc 3: "Account recovery guide"    β†’ [0.24, -0.41, 0.68, ...]  similarity: 0.92 ⭐

Step 3: Return top results
1. "Forgot your login?" (94% match)
2. "Account recovery guide" (92% match)

Result: Users find answers even when wording differs completely! 🎯
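
A compact sketch of this flow, assuming the sentence-transformers library and a few made-up help-center titles:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Forgot your login? Recovering access to your account",
    "Changing your profile picture",
    "Account recovery guide: resetting your credentials",
]

# Pre-compute document embeddings; normalize so dot product equals cosine similarity
doc_vectors = model.encode(docs, normalize_embeddings=True)
query_vector = model.encode("how do I reset my password", normalize_embeddings=True)

scores = doc_vectors @ query_vector
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.2f}  {doc}")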

Example 2: Retrieval Augmented Generation (RAG)

Scenario: An AI assistant answering questions about company policies.

Without RAG: Model only knows what it was trained on (outdated, hallucinates)

With RAG:

User Query: "What's our remote work policy?"
       ↓
  [Embed query]
       ↓
┌──────────────────────────────┐
│  Vector Database Search      │
│                              │
│  Policy Doc 1: [0.21, ...]   │ ← similarity: 0.88
│  Policy Doc 2: [0.52, ...]   │ ← similarity: 0.45
│  Policy Doc 3: [0.19, ...]   │ ← similarity: 0.91 ⭐ Best!
│  Policy Doc 4: [-0.32, ...]  │ ← similarity: 0.33
└──────────────────────────────┘
       ↓
  [Retrieve Doc 3]
       ↓
┌──────────────────────────────┐
│  LLM Generation              │
│                              │
│  Context: [Policy Doc 3]     │
│  Query: "remote work?"       │
│  → Generate answer           │
└──────────────────────────────┘
       ↓
"Employees may work remotely up to 3 days/week..."

Benefits:

  • ✅ Answers grounded in actual documents
  • ✅ Up-to-date information
  • ✅ Reduces hallucinations
  • ✅ Citable sources
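
A minimal sketch of the retrieval half of this pipeline (the model name, policy text, and prompt format are illustrative; the final LLM call is left as a placeholder):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

policy_chunks = [
    "Remote work: employees may work remotely up to 3 days per week with manager approval.",
    "Expenses: submit receipts within 30 days of purchase.",
    "Security: two-factor authentication is required on all company accounts.",
]
chunk_vectors = model.encode(policy_chunks, normalize_embeddings=True)

def retrieve(query, k=1):
    # Embed the query and rank chunks by cosine similarity (dot product on unit vectors)
    q = model.encode(query, normalize_embeddings=True)
    top = np.argsort(chunk_vectors @ q)[::-1][:k]
    return [policy_chunks[i] for i in top]

query = "What's our remote work policy?"
context = "\n".join(retrieve(query))

# The retrieved context is then passed to whatever LLM you use
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)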

Example 3: Recommendation Systems

Scenario: Music streaming service recommending songs.

Traditional approach: Genre tags, artist matching

Embedding approach: Represent songs as vectors capturing:

  • Musical style (tempo, instrumentation)
  • Mood and energy
  • Lyrical themes
  • Cultural context

User liked:
"Bohemian Rhapsody" β†’ [0.7, 0.3, 0.8, -0.2, ...]

Similar songs by vector distance:
"Stairway to Heaven"  β†’ [0.6, 0.4, 0.7, -0.1, ...] distance: 0.15 ⭐
"Sweet Child O' Mine" β†’ [0.5, 0.5, 0.6, 0.0, ...]  distance: 0.28 ⭐
"Baby Shark"          β†’ [0.1, 0.9, 0.2, 0.8, ...]  distance: 1.42 βœ—

The system discovers nuanced similarities that simple genre tags miss!

Example 4: Multi-Modal Search

Scenario: Search for images using text descriptions.

Key insight: Embeddings can represent different data types in the same vector space!

      Shared Embedding Space
┌─────────────────────────────────────┐
│                                     │
│  "sunset beach"  → [0.4, 0.8, ...]  │
│        ↓ small distance             │
│  🌅 (image)      → [0.5, 0.7, ...]  │
│                                     │
│  "city traffic"  → [-0.3, 0.2, ...] │
│        ↓ small distance             │
│  🚗 (image)      → [-0.4, 0.3, ...] │
│                                     │
└─────────────────────────────────────┘

Models like CLIP (OpenAI) learn to embed images and text together, enabling:

  • Text → image search
  • Image → text descriptions
  • Image → similar image search
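
As a sketch, sentence-transformers also ships a CLIP checkpoint that embeds images and text into one space; the image file names below are placeholders:

from PIL import Image
from sentence_transformers import SentenceTransformer

clip = SentenceTransformer("clip-ViT-B-32")  # joint image/text embedding space

image_paths = ["beach_sunset.jpg", "city_traffic.jpg"]  # hypothetical local files
image_vectors = clip.encode([Image.open(p) for p in image_paths], normalize_embeddings=True)

text_vector = clip.encode("sunset beach", normalize_embeddings=True)

scores = image_vectors @ text_vector  # cosine similarity on normalized vectors
best = int(scores.argmax())
print(image_paths[best], round(float(scores[best]), 2))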

Common Mistakes ⚠️

Mistake 1: Comparing Embeddings from Different Models

❌ Wrong:

embedding_A = model_1.encode("cat")  # 384 dimensions
embedding_B = model_2.encode("dog")  # 768 dimensions
similarity = cosine_similarity(embedding_A, embedding_B)  # Error or meaningless!

✅ Right: Always use embeddings from the same model for comparison. Each model has its own "semantic coordinate system."

Mistake 2: Ignoring Normalization

❌ Wrong:

# Using dot product on non-normalized vectors
raw_embedding = [5.2, 3.8, 6.1, ...]  # Large magnitudes
similarity = dot_product(embedding_A, embedding_B)  # Magnitude dominates!

✅ Right: Normalize embeddings if using dot product:

import numpy as np

normalized = embedding / np.linalg.norm(embedding)  # Length = 1

Mistake 3: Wrong Distance Metric for Your Use Case

❌ Wrong: Using Euclidean distance when you want semantic similarity:

  • "I love this product" β†’ [0.9, 0.1]
  • "I love this product product product" β†’ [2.7, 0.3]
  • Euclidean distance: 1.8 (treated as different, but meaning is identical!)

βœ… Right: Use cosine similarity for semantic tasksβ€”it ignores magnitude.

Mistake 4: Not Chunking Long Documents

❌ Wrong: Embedding a 50-page document as one vector:

  • Most models have token limits (512, 8192, etc.)
  • Averaging loses fine-grained information
  • Single vector can't represent multiple topics well

✅ Right: Split into chunks (paragraphs, sections):

Document → [Chunk 1 embedding, Chunk 2 embedding, ...]

Then search at chunk level for precise retrieval.
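
A simple word-based chunker as a sketch (word counts stand in for tokens here; production pipelines usually count tokens with the embedding model's own tokenizer):

def chunk_text(text, max_words=200, overlap=40):
    # Overlapping chunks keep sentences near a boundary visible in both neighbors
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        start += max_words - overlap
    return chunks

document = "very long policy document text " * 500  # placeholder for a real document
chunks = chunk_text(document)
print(len(chunks), "chunks")  # each chunk gets its own embedding and is searched individually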

Mistake 5: Forgetting to Update Embeddings

❌ Wrong: Embedding documents once at system launch, never updating.

✅ Right: Re-embed when:

  • Documents are edited or added
  • You switch to a better embedding model
  • Your domain terminology evolves

💡 Pro tip: Use versioning for embeddings: track which model/version created them!

Mistake 6: Overlooking Computational Costs at Scale

❌ Wrong: Computing embeddings in real-time for every query at million-doc scale:

  • Embedding API calls add cost and latency on every query
  • Brute-force similarity: O(n) comparisons

✅ Right:

  • Pre-compute and cache document embeddings
  • Use vector databases with indexing (HNSW, IVF) for sub-linear search
  • Batch embedding requests when possible
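
A sketch of the pre-compute-and-batch idea in plain NumPy; random vectors stand in for real embeddings, and at larger scale a vector database's HNSW or IVF index replaces the brute-force matrix product:

import numpy as np

# Pre-compute document embeddings once (offline) and keep them in memory or a vector DB
rng = np.random.default_rng(0)
doc_vectors = rng.standard_normal((100_000, 384)).astype(np.float32)
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)

def search(query_vector, k=5):
    scores = doc_vectors @ query_vector          # one vectorized pass over all documents
    top = np.argpartition(scores, -k)[-k:]       # O(n) top-k instead of a full sort
    top = top[np.argsort(scores[top])[::-1]]
    return top, scores[top]

query = rng.standard_normal(384).astype(np.float32)
query /= np.linalg.norm(query)
ids, scores = search(query)
print(ids, scores)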

Key Takeaways 🎓

📋 Quick Reference Card: Vector Embeddings

What They Are        | Numerical representations (vectors) of data that capture semantic meaning
Core Purpose         | Enable computers to measure similarity between concepts, not just keywords
Typical Dimensions   | 128-1536 (higher = more nuanced, but slower & bigger)
Best Distance Metric | Cosine similarity (or dot product on normalized vectors)
Created By           | Neural networks (transformers) trained on massive text corpora
Key Applications     | Semantic search, RAG systems, recommendations, multi-modal search
Critical Rule        | Never compare embeddings from different models!
Storage Formula      | dimensions × 4 bytes × number_of_vectors
When to Normalize    | When using dot product instead of cosine similarity
Chunking Strategy    | Split long documents (500-1000 tokens/chunk) for precise retrieval

Essential Formulas

Cosine Similarity:

cos(θ) = (A · B) / (||A|| × ||B||)
Range: [-1, 1]
Higher = more similar

Euclidean Distance:

d = √(Σ(Aᵢ - Bᵢ)²)
Range: [0, ∞)
Lower = more similar

Dot Product:

A · B = Σ(Aᵢ × Bᵢ)
For normalized vectors: equals cosine similarity

Mental Model: The Meaning Map 🗺️

Think of embeddings as GPS coordinates for concepts:

  • Each dimension = a direction in meaning-space
  • Distance = semantic difference
  • Close neighbors = related concepts
  • Vector math captures relationships ("king" - "man" + "woman" ≈ "queen")

When to Use What

Semantic Search: Use pre-trained embeddings (OpenAI, Cohere) + cosine similarity

RAG Systems: Chunk documents → embed → store in vector DB → retrieve top-k → feed to LLM

Domain-Specific: Consider fine-tuning on your data if general models underperform

Real-Time Applications: Pre-compute embeddings, use fast vector databases (Pinecone, Weaviate, Qdrant)

📚 Further Study

Dive Deeper:

  1. Understanding Vector Embeddings - Pinecone Learning Center - Comprehensive guide with interactive examples and code tutorials
  2. Sentence Transformers Documentation - Open-source library for state-of-the-art embeddings with pre-trained models
  3. OpenAI Embeddings Guide - Official documentation covering best practices, use cases, and API reference

Practice Exercises:

  • Implement cosine similarity from scratch in Python
  • Build a simple semantic search over your own documents
  • Compare different embedding models on a retrieval task
  • Visualize embeddings in 2D using t-SNE or UMAP

Next Steps: Now that you understand embeddings, explore vector databases (storage & indexing), retrieval strategies (hybrid search, reranking), and evaluation metrics (recall@k, MRR) in the next lessons!

🚀 Congratulations! You've mastered the foundational concept behind modern AI search. Vector embeddings are your gateway to building intelligent, semantic-aware applications in 2026 and beyond!