Vector Embeddings
Learn embedding models (Word2Vec, Sentence Transformers, OpenAI embeddings) and how to generate dense representations.
Master vector embeddings with free flashcards and spaced repetition practice. This lesson covers the mathematical foundations of embeddings, dimensionality and distance metrics, and practical applications in modern AI search systems: essential concepts for building semantic search and RAG (Retrieval-Augmented Generation) solutions.
Welcome 🎯
Welcome to the world of vector embeddings, the mathematical backbone of modern AI search! If you've ever wondered how search engines understand that "puppy" and "dog" are related, or how ChatGPT retrieves relevant context from massive document collections, you're about to discover the answer.
Vector embeddings transform words, sentences, images, and other data into numerical representations that capture meaning. Unlike traditional keyword matching, embeddings enable semantic search: finding information based on conceptual similarity rather than exact text matches. This lesson will guide you through the core principles, mathematical foundations, and real-world applications that make embeddings indispensable in 2026's AI landscape.
What Are Vector Embeddings? 🔢
Vector embeddings (or simply "embeddings") are numerical representations of data in multi-dimensional space. They convert discrete objects like words, sentences, or images into continuous vectors of real numbers, where semantically similar items are positioned close together.
The Core Concept
Think of embeddings as coordinates on a map of meaning. Just as GPS coordinates (latitude, longitude) represent physical locations, embeddings represent conceptual locations:
Traditional Keyword Matching:
"dog" β "puppy" β "canine" (exact match only)
Vector Embedding Space:
puppy β’
β± β²
β± β²
dog β’ β’ canine
β² β±
β² β±
β’ pet
Distance encodes semantic similarity!
Key principle: If two concepts are semantically similar, their embedding vectors will be close together when we measure the distance between them.
From Words to Numbers
Embeddings solve a fundamental problem in AI: computers process numbers, not words. Early approaches used one-hot encoding:
| Word | One-Hot Vector |
|---|---|
| cat | [1, 0, 0, 0, 0] |
| dog | [0, 1, 0, 0, 0] |
| puppy | [0, 0, 1, 0, 0] |

Dimensionality equals the vocabulary size (e.g., 50,000), so these vectors are very sparse.
❌ Problem with one-hot encoding: Every word is equally distant from every other word: "cat" is as similar to "dog" as it is to "telescope".
✅ Solution with embeddings: Dense vectors in a lower-dimensional space where semantic relationships are preserved:
| Word | Dense Embedding |
|---|---|
| cat | [0.2, -0.5, 0.8, 0.1] |
| dog | [0.3, -0.4, 0.7, 0.2] |
| puppy | [0.4, -0.3, 0.6, 0.3] |

Dimensionality is fixed and small (e.g., 384, 768, or 1536 dims).
Notice how "dog" and "puppy" have similar values across dimensions; they're neighbors in embedding space!
Mathematical Representation
Formally, an embedding is a function E that maps from discrete space to continuous vector space:
E: X → ℝⁿ
Where:
- X = input space (words, sentences, images)
- ℝⁿ = n-dimensional real number space
- n = embedding dimensionality (typically 128-4096)
Example: The sentence "AI search is powerful" might map to:
E("AI search is powerful") = [0.23, -0.45, 0.67, ..., 0.12] ∈ ℝ³⁸⁴
Dimensionality: The Hidden Structure 📐
Dimensionality refers to the number of values in an embedding vector. This isn't arbitrary: each dimension captures different aspects of meaning.
Understanding Dimensions
While we can't visualize 384-dimensional space, think of each dimension as encoding a particular semantic feature:
Simplified 4D Embedding Space:

  Dimension 1: Animal   ↔ Object
  Dimension 2: Large    ↔ Small
  Dimension 3: Domestic ↔ Wild
  Dimension 4: Common   ↔ Rare

  "elephant" → [0.9, 0.9, 0.3, 0.4]
  "mouse"    → [0.9, 0.1, 0.5, 0.6]
  "car"      → [0.1, 0.7, 0.8, 0.7]
💡 In reality: Modern embeddings don't have interpretable dimensions: each dimension contributes to many semantic features simultaneously through learned patterns.
Common Dimensionality Sizes
| Model | Dimensions | Use Case |
|---|---|---|
| Word2Vec | 50-300 | Legacy word embeddings |
| BERT-base | 768 | General-purpose text |
| OpenAI text-embedding-3-small | 1536 | Modern semantic search |
| Cohere embed-v3 | 1024 | Multilingual retrieval |
| Custom domain models | 128-384 | Speed-optimized apps |
The Dimensionality Trade-off
Higher dimensions:
- ✅ Capture more nuanced semantic information
- ✅ Better distinguish fine-grained differences
- ❌ Require more storage (important at scale!)
- ❌ Slower similarity computations
- ❌ Risk of overfitting to training data
Lower dimensions:
- ✅ Faster search and computation
- ✅ Less storage overhead
- ❌ May lose subtle semantic distinctions
- ❌ Lower retrieval accuracy
🧠 Try this: Calculate storage for 1 million embeddings:
- 768 dimensions × 4 bytes (float32) × 1M vectors = 3.07 GB
- 1536 dimensions × 4 bytes × 1M vectors = 6.14 GB
At billion-scale, dimensionality choices have major infrastructure implications!
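You can verify these figures with a quick calculation (a small sketch assuming float32 values and no index overhead):

```python
# Storage estimate: dimensions x bytes per value (4 for float32) x number of vectors.
def embedding_storage_gb(dims: int, num_vectors: int, bytes_per_value: int = 4) -> float:
    return dims * bytes_per_value * num_vectors / 1e9

for dims in (384, 768, 1536):
    print(f"{dims:>4} dims x 1M vectors = {embedding_storage_gb(dims, 1_000_000):.2f} GB")
#  384 dims -> 1.54 GB
#  768 dims -> 3.07 GB
# 1536 dims -> 6.14 GB
```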
Distance Metrics: Measuring Similarity 📏
Distance metrics (or similarity measures) quantify how close or far apart two embeddings are. This is crucial because retrieval in semantic search means finding the nearest neighbors to a query embedding.
Three Essential Metrics
1. Cosine Similarity ⭐ (Most Popular)
Measures the angle between vectors, ignoring magnitude:
Formula:
cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)
                        = cos(θ)
Where:
- A · B = dot product
- ||A|| = magnitude (length) of vector A
- θ = angle between vectors
Range: -1 (opposite) to +1 (identical)
- 1.0 = perfectly similar
- 0.0 = orthogonal (unrelated)
- -1.0 = perfectly opposite
Why it's popular: Cosine similarity is scale-invariant. Whether you say "dog" or "dog dog dog dog", the direction (meaning) stays the same.
Visual Intuition (2D):
  Small angle → high similarity:

         A (0.8, 0.6)
         •
        /
       /   θ is small → high similarity
      /
     •----------• B (0.7, 0.5)
   Origin

  Large angle → low similarity:

     C (-0.6, 0.8)
         •
         |
         |   θ is large → low similarity
         |
         •----------• A (0.8, 0.6)
       Origin
Example calculation:
A = [0.5, 0.8, 0.3]
B = [0.6, 0.7, 0.4]
A · B = 0.5×0.6 + 0.8×0.7 + 0.3×0.4 = 0.30 + 0.56 + 0.12 = 0.98
||A|| = √(0.5² + 0.8² + 0.3²) = √(0.25 + 0.64 + 0.09) = √0.98 ≈ 0.99
||B|| = √(0.6² + 0.7² + 0.4²) = √(0.36 + 0.49 + 0.16) = √1.01 ≈ 1.00
cosine_similarity = 0.98 / (0.99 × 1.00) ≈ 0.99 (very similar!)
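The same computation in NumPy, as a small sketch you can use to check the arithmetic above:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Angle-based similarity: dot product divided by the product of the vector lengths.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

A = np.array([0.5, 0.8, 0.3])
B = np.array([0.6, 0.7, 0.4])
print(round(cosine_similarity(A, B), 3))  # ≈ 0.985 (the hand calculation rounds to 0.99)
```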
2. Euclidean Distance (L2 Distance)
Measures the straight-line distance between points:
Formula:
euclidean_distance(A, B) = ||A - B|| = √(Σ(Aᵢ - Bᵢ)²)
Range: 0 (identical) to ∞ (no upper bound; larger means less similar)
- Smaller = more similar
- Larger = less similar
Use case: When magnitude matters (e.g., "very happy" vs "slightly happy")
Visual (2D):
    A •---------• B
       \       /
        \     /     Direct distance from A to B
         \   /      = √((x₂-x₁)² + (y₂-y₁)²)
          \ /
           •
         Origin
Example:
A = [1, 2, 3]
B = [4, 5, 6]
Distance = √((4-1)² + (5-2)² + (6-3)²)
         = √(9 + 9 + 9)
         = √27 ≈ 5.20
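In NumPy, the same distance is a single call (a brief sketch):

```python
import numpy as np

A = np.array([1, 2, 3])
B = np.array([4, 5, 6])
print(np.linalg.norm(A - B))  # 5.196..., i.e. √27
```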
3. Dot Product
Simple multiplication and sum:
Formula:
dot_product(A, B) = Σ(Aᵢ × Bᵢ)
Range: -∞ to +∞
- Higher = more similar (when vectors are normalized)
Speed advantage: Fastest to compute: no square roots or divisions!
💡 Pro tip: If your embeddings are normalized (length = 1), then:
- Dot product = Cosine similarity
- Many modern embedding models output pre-normalized vectors for this reason!
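A short sketch demonstrating this equivalence: once both vectors are scaled to unit length, the dot product and cosine similarity return the same number:

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)  # rescale the vector to length 1

A = normalize(np.array([0.5, 0.8, 0.3]))
B = normalize(np.array([0.6, 0.7, 0.4]))

dot = float(np.dot(A, B))
cos = float(np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)))
print(dot, cos)  # the same value, up to floating-point rounding
```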
Choosing the Right Metric
| Metric | When to Use | Speed |
|---|---|---|
| Cosine | Most text embeddings, normalized data | Medium |
| Euclidean | When magnitude matters (intensity, scale) | Medium |
| Dot Product | Pre-normalized embeddings, speed-critical | Fast ⚡ |
🎯 In practice: Most RAG systems use cosine similarity or dot product on normalized vectors; they're mathematically equivalent and work well for semantic search.
How Embeddings Are Created 🧠
Embeddings don't appear by magic; they're learned by neural networks trained on massive datasets.
The Training Process
1. Architecture: Modern embeddings use transformer models (BERT, GPT, custom encoders)
2. Training objective: Networks learn to:
- Place similar items close together
- Push dissimilar items apart
- Preserve semantic relationships
3. Common training methods:
| Method | How It Works | Example |
|---|---|---|
| Skip-gram | Predict context from word | Word2Vec |
| CBOW | Predict word from context | Word2Vec |
| Masked Language Modeling | Predict masked tokens | BERT |
| Contrastive Learning | Similar pairs close, dissimilar far | Sentence-BERT |
Example: Contrastive Training
Training Data: Sentence Pairs + Labels
Positive pair (similar):
"The cat sleeps" β [0.2, 0.8, ...]
"A feline rests" β [0.3, 0.7, ...] β Push together
Distance: small β
Negative pair (dissimilar):
"The cat sleeps" β [0.2, 0.8, ...]
"Quantum physics" β [-0.6, 0.1, ...] β Push apart
Distance: large β
Loss function: Rewards the model for:
- Making similar pairs have high cosine similarity
- Making dissimilar pairs have low cosine similarity
After millions of examples, the model learns rich semantic representations!
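For intuition only, here is a toy, NumPy-only version of a contrastive objective: it rewards high cosine similarity for the positive pair and penalizes similarity above a margin for the negative pair. Real training uses batched losses over millions of pairs (for example, the loss functions shipped with Sentence-Transformers); the vectors below are made-up stand-ins for encoder outputs:

```python
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up embeddings standing in for encoder outputs (illustrative values only).
anchor   = np.array([0.2, 0.8, 0.1])   # "The cat sleeps"
positive = np.array([0.3, 0.7, 0.2])   # "A feline rests"
negative = np.array([-0.6, 0.1, 0.9])  # "Quantum physics"

margin = 0.2
# Low loss when the positive pair is close and the negative pair is far apart;
# during training, gradients of this quantity update the encoder's weights.
loss = (1 - cos(anchor, positive)) + max(0.0, cos(anchor, negative) - margin)
print(round(loss, 3))  # small value: these toy vectors already satisfy the objective
```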
Pre-trained vs Custom Embeddings
Pre-trained models (recommended for most use cases):
- ✅ Trained on billions of tokens
- ✅ Strong general-purpose understanding
- ✅ Ready to use immediately
- Examples: OpenAI embeddings, Cohere, Sentence-Transformers
Custom fine-tuned models:
- ✅ Optimized for domain-specific terminology
- ✅ Better performance on specialized tasks
- ❌ Requires labeled training data
- ❌ Significant compute and expertise needed
🤔 Did you know? OpenAI's text-embedding-3-large was trained on a mixture of unsupervised data and supervised datasets with human feedback, combining broad coverage with task-specific optimization!
Real-World Applications 🌍
Example 1: Semantic Search Engine
Scenario: A customer searches "how do I reset my password" in a help center.
Traditional keyword search:
- Looks for exact matches: "reset", "password"
- Misses articles titled "Forgot your login credentials?" or "Account recovery process"
Vector embedding search:
Step 1: Encode query
  "how do I reset my password" → [0.23, -0.45, 0.67, ..., 0.12]

Step 2: Compare with document embeddings
  Doc 1: "Forgot your login?"       → [0.25, -0.43, 0.65, ...]  similarity: 0.94 ✅
  Doc 2: "Changing profile picture" → [0.10, 0.72, -0.30, ...]  similarity: 0.42
  Doc 3: "Account recovery guide"   → [0.24, -0.41, 0.68, ...]  similarity: 0.92 ✅

Step 3: Return top results
  1. "Forgot your login?" (94% match)
  2. "Account recovery guide" (92% match)
Result: Users find answers even when wording differs completely! 🎯
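A compact sketch of this search flow, again assuming the sentence-transformers library and the all-MiniLM-L6-v2 model; the document titles are the hypothetical help-center articles from the example:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Forgot your login? How to recover your credentials",
    "Changing your profile picture",
    "Account recovery guide",
]

# Normalized embeddings let us use a plain dot product as cosine similarity.
doc_vecs = model.encode(docs, normalize_embeddings=True)   # shape (3, 384)
query_vec = model.encode("how do I reset my password", normalize_embeddings=True)

scores = doc_vecs @ query_vec
for doc, score in sorted(zip(docs, scores), key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {doc}")
```

The password-reset query should rank the recovery-related titles above the profile-picture article even though they share almost no keywords.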
Example 2: Retrieval Augmented Generation (RAG)
Scenario: An AI assistant answering questions about company policies.
Without RAG: Model only knows what it was trained on (outdated, hallucinates)
With RAG:
User Query: "What's our remote work policy?"
        ↓
  [Embed query]
        ↓
┌────────────────────────────────┐
│     Vector Database Search     │
│                                │
│  Policy Doc 1: [0.21, ...]     │ ← similarity: 0.88
│  Policy Doc 2: [0.52, ...]     │ ← similarity: 0.45
│  Policy Doc 3: [0.19, ...]     │ ← similarity: 0.91 ← Best!
│  Policy Doc 4: [-0.32, ...]    │ ← similarity: 0.33
└────────────────────────────────┘
        ↓
  [Retrieve Doc 3]
        ↓
┌────────────────────────────────┐
│         LLM Generation         │
│                                │
│  Context: [Policy Doc 3]       │
│  Query: "remote work?"         │
│  → Generate answer             │
└────────────────────────────────┘
        ↓
"Employees may work remotely up to 3 days/week..."
Benefits:
- ✅ Answers grounded in actual documents
- ✅ Up-to-date information
- ✅ Reduces hallucinations
- ✅ Citable sources
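A minimal sketch of the retrieval step that feeds the LLM, as diagrammed above. The policy chunks are invented placeholders, and generate_answer stands in for whatever LLM client you use:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Invented policy chunks; in practice these come from your chunked documents.
policy_chunks = [
    "Remote work: employees may work remotely up to 3 days per week...",
    "Expenses: receipts are required for purchases over $25...",
    "PTO: vacation requests should be submitted two weeks in advance...",
]
chunk_vecs = model.encode(policy_chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[str]:
    q = model.encode(query, normalize_embeddings=True)
    scores = chunk_vecs @ q                    # cosine similarity on normalized vectors
    top = np.argsort(scores)[::-1][:k]         # indices of the k highest-scoring chunks
    return [policy_chunks[i] for i in top]

query = "What's our remote work policy?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# answer = generate_answer(prompt)  # hypothetical call to your LLM client of choice
print(prompt)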
Example 3: Recommendation Systems
Scenario: Music streaming service recommending songs.
Traditional approach: Genre tags, artist matching
Embedding approach: Represent songs as vectors capturing:
- Musical style (tempo, instrumentation)
- Mood and energy
- Lyrical themes
- Cultural context
User liked: "Bohemian Rhapsody" → [0.7, 0.3, 0.8, -0.2, ...]

Similar songs by vector distance:
  "Stairway to Heaven"  → [0.6, 0.4, 0.7, -0.1, ...]  distance: 0.15 ✅
  "Sweet Child O' Mine" → [0.5, 0.5, 0.6, 0.0, ...]   distance: 0.28 ✅
  "Baby Shark"          → [0.1, 0.9, 0.2, 0.8, ...]   distance: 1.42 ❌
The system discovers nuanced similarities that simple genre tags miss!
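A toy nearest-neighbor recommender over made-up four-dimensional song vectors (real systems use hundreds of dimensions, so the printed distances won't exactly match the illustration above):

```python
import numpy as np

# Hypothetical 4-dimensional song embeddings (truncated for illustration).
songs = {
    "Stairway to Heaven":  np.array([0.6, 0.4, 0.7, -0.1]),
    "Sweet Child O' Mine": np.array([0.5, 0.5, 0.6,  0.0]),
    "Baby Shark":          np.array([0.1, 0.9, 0.2,  0.8]),
}
liked = np.array([0.7, 0.3, 0.8, -0.2])  # "Bohemian Rhapsody"

# Rank candidates by Euclidean distance to the liked song (smaller = more similar).
for title, vec in sorted(songs.items(), key=lambda kv: np.linalg.norm(kv[1] - liked)):
    print(f"{np.linalg.norm(vec - liked):.2f}  {title}")
```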
Example 4: Multi-modal Search
Scenario: Search for images using text descriptions.
Key insight: Embeddings can represent different data types in the same vector space!
        Shared Embedding Space
┌──────────────────────────────────────┐
│                                      │
│  "sunset beach"   → [0.4, 0.8, ...]  │
│        ↕ small distance              │
│  (sunset photo)   → [0.5, 0.7, ...]  │
│                                      │
│  "city traffic"   → [-0.3, 0.2, ...] │
│        ↕ small distance              │
│  (traffic photo)  → [-0.4, 0.3, ...] │
│                                      │
└──────────────────────────────────────┘
Models like CLIP (OpenAI) learn to embed images and text together, enabling:
- Text → image search
- Image → text descriptions
- Image → similar image search
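A brief sketch of text-to-image matching using the CLIP checkpoint distributed through sentence-transformers; the image path is a placeholder you would swap for a real file:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer
import numpy as np

clip = SentenceTransformer("clip-ViT-B-32")  # embeds images and text into one space

image_vec = clip.encode(Image.open("beach_photo.jpg"))   # placeholder image path
text_vecs = clip.encode(["sunset beach", "city traffic"])

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for caption, vec in zip(["sunset beach", "city traffic"], text_vecs):
    print(caption, round(cos(image_vec, vec), 3))  # the matching caption scores higher
```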
Common Mistakes ⚠️
Mistake 1: Comparing Embeddings from Different Models
❌ Wrong:
embedding_A = model_1.encode("cat")  # 384 dimensions
embedding_B = model_2.encode("dog")  # 768 dimensions
similarity = cosine_similarity(embedding_A, embedding_B)  # Error or meaningless!
✅ Right: Always use embeddings from the same model for comparison. Each model has its own "semantic coordinate system."
Mistake 2: Ignoring Normalization
❌ Wrong:
# Using dot product on non-normalized vectors
raw_embedding = [5.2, 3.8, 6.1, ...]  # Large magnitudes
similarity = dot_product(embedding_A, embedding_B)  # Magnitude dominates!
✅ Right: Normalize embeddings if using dot product:
normalized = embedding / np.linalg.norm(embedding) # Length = 1
Mistake 3: Wrong Distance Metric for Your Use Case
❌ Wrong: Using Euclidean distance when you want semantic similarity:
- "I love this product" → [0.9, 0.1]
- "I love this product product product" → [2.7, 0.3]
- Euclidean distance: 1.8 (treated as different, but the meaning is identical!)
✅ Right: Use cosine similarity for semantic tasks: it ignores magnitude.
Mistake 4: Not Chunking Long Documents
❌ Wrong: Embedding a 50-page document as one vector:
- Most models have token limits (512, 8192, etc.)
- Averaging loses fine-grained information
- Single vector can't represent multiple topics well
✅ Right: Split into chunks (paragraphs, sections):
Document → [Chunk 1 embedding, Chunk 2 embedding, ...]
Then search at chunk level for precise retrieval.
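A simple word-window chunker as a sketch; production pipelines usually split on tokens or sentences and keep some overlap so context isn't cut mid-thought:

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping word-window chunks."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + max_words])
        if chunk:
            chunks.append(chunk)
    return chunks

long_document = "..."  # the full text of your 50-page document goes here
for i, chunk in enumerate(chunk_text(long_document)):
    print(i, chunk[:60])  # each chunk gets its own embedding later
```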
Mistake 5: Forgetting to Update Embeddings
❌ Wrong: Embedding documents once at system launch, never updating.
✅ Right: Re-embed when:
- Documents are edited or added
- You switch to a better embedding model
- Your domain terminology evolves
💡 Pro tip: Use versioning for embeddings: track which model/version created them!
Mistake 6: Overlooking Computational Costs at Scale
❌ Wrong: Computing embeddings in real-time for every query at million-doc scale:
- Embedding API calls: $$ + latency
- Brute-force similarity: O(n) comparisons
✅ Right:
- Pre-compute and cache document embeddings
- Use vector databases with indexing (HNSW, IVF) for sub-linear search
- Batch embedding requests when possible
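A sketch of the pre-compute-and-cache pattern, using a plain NumPy matrix as a stand-in for a vector database (a real deployment would use an ANN index such as HNSW inside Pinecone, Weaviate, Qdrant, etc.; the file name and corpus are placeholders):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Offline, once: embed the corpus in batches and cache the result on disk.
corpus = ["first document ...", "second document ...", "third document ..."]
corpus_matrix = model.encode(corpus, batch_size=64, normalize_embeddings=True)
np.save("corpus_embeddings.npy", corpus_matrix)

# Online, per query: load the cached matrix and score with one matrix-vector product.
corpus_matrix = np.load("corpus_embeddings.npy")
query_vec = model.encode("some user query", normalize_embeddings=True)
top_k = np.argsort(corpus_matrix @ query_vec)[::-1][:3]
print([corpus[i] for i in top_k])
```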
Key Takeaways 🎓
📋 Quick Reference Card: Vector Embeddings
| Topic | Summary |
|---|---|
| What They Are | Numerical representations (vectors) of data that capture semantic meaning |
| Core Purpose | Enable computers to measure similarity between concepts, not just keywords |
| Typical Dimensions | 128-1536 (higher = more nuanced, but slower & bigger) |
| Best Distance Metric | Cosine similarity (or dot product on normalized vectors) |
| Created By | Neural networks (transformers) trained on massive text corpora |
| Key Applications | Semantic search, RAG systems, recommendations, multi-modal search |
| Critical Rule | Never compare embeddings from different models! |
| Storage Formula | dimensions × 4 bytes × number_of_vectors |
| When to Normalize | When using dot product instead of cosine similarity |
| Chunking Strategy | Split long documents (500-1000 tokens/chunk) for precise retrieval |
Essential Formulas
Cosine Similarity:
cos(θ) = (A · B) / (||A|| × ||B||)
Range: [-1, 1]
Higher = more similar
Euclidean Distance:
d = √(Σ(Aᵢ - Bᵢ)²)
Range: [0, ∞)
Lower = more similar
Dot Product:
A · B = Σ(Aᵢ × Bᵢ)
For normalized vectors: equals cosine similarity
Mental Model: The Meaning Map 🗺️
Think of embeddings as GPS coordinates for concepts:
- Each dimension = a direction in meaning-space
- Distance = semantic difference
- Close neighbors = related concepts
- Vector math captures relationships ("king" - "man" + "woman" ≈ "queen"); see the sketch below
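If you want to try that classic analogy yourself, the gensim library can load pretrained Word2Vec vectors. A sketch, assuming the sizeable word2vec-google-news-300 download is acceptable:

```python
import gensim.downloader as api

# Downloads the pretrained Google News Word2Vec vectors on first use (large download).
vectors = api.load("word2vec-google-news-300")

# king - man + woman ≈ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```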
When to Use What
Semantic Search: Use pre-trained embeddings (OpenAI, Cohere) + cosine similarity
RAG Systems: Chunk documents → embed → store in vector DB → retrieve top-k → feed to LLM
Domain-Specific: Consider fine-tuning on your data if general models underperform
Real-Time Applications: Pre-compute embeddings, use fast vector databases (Pinecone, Weaviate, Qdrant)
📚 Further Study
Dive Deeper:
- Understanding Vector Embeddings - Pinecone Learning Center - Comprehensive guide with interactive examples and code tutorials
- Sentence Transformers Documentation - Open-source library for state-of-the-art embeddings with pre-trained models
- OpenAI Embeddings Guide - Official documentation covering best practices, use cases, and API reference
Practice Exercises:
- Implement cosine similarity from scratch in Python
- Build a simple semantic search over your own documents
- Compare different embedding models on a retrieval task
- Visualize embeddings in 2D using t-SNE or UMAP
Next Steps: Now that you understand embeddings, explore vector databases (storage & indexing), retrieval strategies (hybrid search, reranking), and evaluation metrics (recall@k, MRR) in the next lessons!
🎉 Congratulations! You've mastered the foundational concept behind modern AI search. Vector embeddings are your gateway to building intelligent, semantic-aware applications in 2026 and beyond!