Reranking Models in Hybrid Retrieval
Master reranking models with free flashcards and spaced repetition practice to solidify your understanding. This lesson covers cross-encoders, scoring mechanisms, reranking architectures, and optimization strategies: essential concepts for building high-performance AI search systems that deliver truly relevant results.
Welcome to Reranking Models 🎯
Welcome to one of the most impactful components in modern retrieval systems! If you've ever wondered how search engines manage to show you the perfect result at the top of the page, reranking models are often the secret sauce. While initial retrieval methods (like BM25 or vector search) cast a wide net to find potentially relevant documents, reranking models act as the final judge, carefully examining each candidate and reordering them to surface the best matches.
Think of it like a talent show: the initial auditions (first-stage retrieval) let many contestants through, but the judges (rerankers) carefully evaluate each performance to determine the final rankings. This two-stage approach combines the speed of simple retrieval with the accuracy of sophisticated neural models.
💡 Why this matters: Reranking can often improve search relevance metrics by 20-40% compared to retrieval alone, and two-stage retrieve-then-rerank has become a standard architecture in production RAG systems at companies like Google, Microsoft, and OpenAI.
Core Concepts: Understanding Reranking Models 📚
What is Reranking?
Reranking is a second-stage refinement process that takes a candidate set of documents (typically 10-100 items) retrieved by a first-stage system and reorders them using a more sophisticated model. The key insight is this: you can't afford to run an expensive model on millions of documents, but you can afford to run it on the top 50-100 candidates.
┌──────────────────────────────────────────────────┐
│            TWO-STAGE RETRIEVAL PIPELINE           │
└──────────────────────────────────────────────────┘
📚 Document Corpus (1M+ docs)
            │
            ▼
┌────────────────────────┐
│  STAGE 1: RETRIEVAL    │   ⚡ Fast & Broad
│  • BM25 / Vector DB    │   • Process entire corpus
│  • Top-k = 100         │   • Simple similarity
└───────────┬────────────┘
            │
            ▼  (100 candidates)
┌────────────────────────┐
│  STAGE 2: RERANKING    │   🎯 Slow & Precise
│  • Cross-encoder       │   • Deep interaction
│  • Top-k = 10          │   • Final ordering
└───────────┬────────────┘
            │
            ▼
🏆 Final Results (10 docs)
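Here is a minimal sketch of this two-stage pipeline using the sentence-transformers library. The checkpoint names (all-MiniLM-L6-v2 for retrieval, cross-encoder/ms-marco-MiniLM-L-6-v2 for reranking) and the tiny in-memory corpus are illustrative stand-ins for your own retriever and data:
from sentence_transformers import SentenceTransformer, CrossEncoder, util

corpus = [
    "Python is a popular programming language.",
    "Ball pythons are a snake species often kept as pets.",
    "Java and Python are widely used for backend development.",
]

# Stage 1: fast bi-encoder retrieval over the whole corpus
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus_embeddings = bi_encoder.encode(corpus, convert_to_tensor=True)

query = "what is python programming"
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]

# Stage 2: precise cross-encoder reranking of the candidate set only
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
scores = reranker.predict(pairs)

for (_, doc), score in sorted(zip(pairs, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {doc}")
In production the corpus would live in a vector database or BM25 index; the key point is that the cross-encoder only ever sees the small candidate set.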
Cross-Encoders: The Heart of Reranking 🧠
The most powerful reranking models are cross-encoders. Unlike bi-encoders (used in vector search) that encode queries and documents separately, cross-encoders process the query and document together as a single input.
Architecture comparison:
| Model Type | Input Processing | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| Bi-encoder | Separate encoding: encode(query) vs encode(doc) | ⚡⚡⚡ Very fast | ⭐⭐⭐ Good | First-stage retrieval |
| Cross-encoder | Joint encoding: encode(query + doc) | 🐢 Slow | ⭐⭐⭐⭐⭐ Excellent | Second-stage reranking |
Why cross-encoders are more accurate:
When you concatenate the query and document together, the transformer's attention mechanism can compute interactions between every query token and every document token. This enables:
- Token-level matching: Understanding that "apple" in the query might match "iPhone" in a document
- Context awareness: Recognizing that "python" + "programming" means the language, not the snake
- Semantic relationships: Capturing subtle relevance signals that simple cosine similarity misses
💡 Technical insight: A bi-encoder computes dot_product(query_vector, doc_vector) with ~768 dimensions of interaction. A cross-encoder computes attention across query_length × doc_length × hidden_dim = potentially millions of interaction points!
How Scoring Works 📊
Reranking models typically output a relevance score for each (query, document) pair. Here's the typical architecture:
INPUT: "[CLS] what is python [SEP] Python is a programming language [SEP]"
          │
          ▼
┌───────────────────┐
│    Transformer    │  (12-24 layers)
│  (BERT, RoBERTa)  │
└─────────┬─────────┘
          │
          ▼
[CLS] embedding (contextual representation)
          │
          ▼
┌───────────────────┐
│  Classification   │  (linear layer)
│       Head        │
└─────────┬─────────┘
          │
          ▼
Relevance Score: 0.87 (0 to 1)
The [CLS] token at the beginning serves as an aggregate representation of the entire input, and a simple linear layer projects it to a relevance score.
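To make the scoring path concrete, here is a rough sketch that runs one (query, document) pair through a Hugging Face sequence-classification model. It assumes a public MS MARCO cross-encoder checkpoint that emits a single relevance logit, which a sigmoid maps into the 0-1 range:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Passing two texts makes the tokenizer build "[CLS] query [SEP] document [SEP]"
inputs = tokenizer(
    "what is python",
    "Python is a programming language",
    return_tensors="pt",
    truncation=True,
)
with torch.no_grad():
    logit = model(**inputs).logits[0, 0]  # single relevance logit from the [CLS] head
score = torch.sigmoid(logit).item()       # squash to (0, 1)
print(f"Relevance score: {score:.2f}")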
Common scoring functions:
- Binary classification: Output probability that document is relevant (0-1)
- Regression: Direct relevance score prediction
- Pointwise ranking: Score each item independently
- Pairwise ranking: Trained on document pairs (which is more relevant?)
- Listwise ranking: Considers the entire ranking at once
Popular Reranking Models 🏆
The reranking landscape has evolved rapidly. Here are the key players:
| Model | Base Architecture | Parameters | Strengths |
|---|---|---|---|
| MS MARCO Cross-Encoder | MiniLM | 33M-110M | Fast, good baseline |
| ColBERT v2 | BERT + late interaction | 110M | Balanced speed/accuracy |
| MonoT5 | T5 | 220M-3B | State-of-the-art accuracy |
| RankT5 | T5 (specialized) | 220M-780M | Optimized for ranking |
| BGE-reranker | XLM-RoBERTa | 560M | Multilingual support |
| Cohere Rerank | Proprietary | Unknown | API-based, high quality |
MonoT5 architecture is particularly interesting: instead of regression, it frames reranking as text generation. Given "Query: X Document: Y Relevant:", it generates "true" or "false" and uses the probability of generating "true" as the relevance score.
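Here is a rough sketch of that generation-based scoring idea, assuming the publicly released castorini/monot5-base-msmarco checkpoint (paired with the standard t5-base tokenizer) and taking the relevance score from a softmax over the "true" and "false" token logits at the first decoding step. Treat it as an illustration rather than a reference implementation:
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("castorini/monot5-base-msmarco")
model.eval()

def monot5_score(query: str, document: str) -> float:
    prompt = f"Query: {query} Document: {document} Relevant:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    # Ask the decoder for its first output token and compare P("true") vs P("false")
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, 0]
    true_id = tokenizer.encode("true")[0]    # id of the "true" word piece
    false_id = tokenizer.encode("false")[0]  # id of the "false" word piece
    probs = torch.softmax(logits[[true_id, false_id]], dim=0)
    return probs[0].item()  # probability mass assigned to "true"

print(monot5_score("what is python", "Python is a programming language."))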
Training Strategies 🎓
1. Pointwise Training
The simplest approach: treat each (query, document, label) triple independently.
Loss = CrossEntropy(predicted_score, true_label)
- Label 1 = relevant
- Label 0 = not relevant
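As a tiny illustration, pointwise training boils down to ordinary binary cross-entropy over independent examples (here the scores are assumed to be the model's raw logits):
import torch
import torch.nn.functional as F

# Raw model logits for three (query, doc) pairs and their relevance labels
scores = torch.tensor([2.1, -0.7, 0.3])
labels = torch.tensor([1.0, 0.0, 1.0])  # 1 = relevant, 0 = not relevant
loss = F.binary_cross_entropy_with_logits(scores, labels)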
2. Pairwise Training
More effective: train on pairs of documents for the same query.
Loss = max(0, margin - (score_relevant - score_non_relevant))
The model learns that relevant documents should score higher than non-relevant ones by at least margin.
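A minimal PyTorch sketch of one pairwise update is below. It assumes model(query, doc) returns a differentiable relevance score; nn.MarginRankingLoss with target = 1 computes exactly the max(0, margin - (score_relevant - score_non_relevant)) hinge shown above:
import torch
import torch.nn as nn

margin_loss = nn.MarginRankingLoss(margin=1.0)

def pairwise_step(model, optimizer, query, relevant_doc, hard_negative_doc):
    pos_score = model(query, relevant_doc)       # assumed shape: (1,)
    neg_score = model(query, hard_negative_doc)  # assumed shape: (1,)
    target = torch.ones_like(pos_score)          # "pos should beat neg by the margin"
    loss = margin_loss(pos_score, neg_score, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()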
3. Listwise Training
Most sophisticated: optimize the quality of the entire ranked list at once, targeting metrics like NDCG. Since NDCG itself isn't differentiable, practical listwise methods (ListNet, ListMLE, LambdaLoss) use smooth surrogate losses that approximate it.
Loss ≈ -NDCG(predicted_ranking, true_ranking)
💡 Pro tip: Pairwise training with hard negatives (documents that are similar but not relevant) produces the best rerankers for RAG systems.
Optimization Techniques ⚡
Reranking can be expensive. Here are strategies to speed it up:
1. Distillation
Train a smaller "student" model to mimic a larger "teacher" reranker:
┌───────────────┐
│    Teacher    │  3B parameters
│    (MonoT5)   │  Scores: [0.9, 0.7, 0.3, ...]
└───────┬───────┘
        │  Distill knowledge
        ▼
┌───────────────┐
│    Student    │  110M parameters
│    (MiniLM)   │  Learns to match teacher scores
└───────────────┘
You can often reach roughly 95% of the teacher's accuracy at around 10x the speed!
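A bare-bones view of the distillation loop, assuming teacher and student are callables that score a batch of (query, doc) pairs and return tensors (data loading and batching are omitted):
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, query, docs):
    pairs = [(query, doc) for doc in docs]
    with torch.no_grad():
        teacher_scores = teacher(pairs)   # e.g. a large MonoT5-style reranker
    student_scores = student(pairs)       # e.g. a small MiniLM cross-encoder
    loss = F.mse_loss(student_scores, teacher_scores)  # match the teacher's scores
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()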
2. Cascading
Use multiple reranking stages with increasing sophistication:
100 candidates → Fast reranker (20ms) → 30 candidates
                                             ↓
 30 candidates → Slow reranker (200ms) → 10 candidates
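A sketch of such a cascade with two cross-encoders of different sizes; the checkpoint names are common public models chosen only as examples, and the cut-off sizes mirror the numbers above:
from sentence_transformers import CrossEncoder

fast_reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-2-v2")   # small and quick
slow_reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")  # larger and slower

def cascade_rerank(query, candidates, keep_first=30, keep_final=10):
    # Stage A: cheap pass over all candidates
    scores = fast_reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    survivors = [doc for doc, _ in ranked[:keep_first]]
    # Stage B: expensive pass over the survivors only
    scores = slow_reranker.predict([(query, doc) for doc in survivors])
    ranked = sorted(zip(survivors, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:keep_final]]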
3. Quantization
Reduce model precision from FP32 → INT8:
- 4x smaller memory footprint
- 2-4x faster inference
- <1% accuracy loss
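For example, PyTorch's dynamic quantization can convert a cross-encoder's linear layers to INT8 in a few lines (a sketch; re-measure accuracy on your own data afterwards):
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "cross-encoder/ms-marco-MiniLM-L-6-v2"  # example checkpoint
)
# Replace Linear layers with dynamically quantized INT8 versions
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)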
4. Dynamic Padding
Instead of padding all inputs to max length (512 tokens), batch similar-length documents together to reduce wasted computation.
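One simple way to get this behavior, assuming a Hugging Face tokenizer: sort the pairs by rough length and let the tokenizer pad each batch only to its longest member instead of the full 512 tokens:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")

def length_sorted_batches(pairs, batch_size=32):
    # Character length is a cheap proxy for token length
    pairs = sorted(pairs, key=lambda p: len(p[0]) + len(p[1]))
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i:i + batch_size]
        yield tokenizer(
            [query for query, _ in batch],
            [doc for _, doc in batch],
            padding=True,          # pad only to the longest item in this batch
            truncation=True,
            return_tensors="pt",
        )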
5. Early Exit
For cross-encoders with many layers (24+), you can sometimes exit after layer 12-18 for obvious cases (very relevant or clearly irrelevant) and only use full depth for ambiguous pairs.
Real-World Examples 🌍
Example 1: E-commerce Search
Scenario: A user searches for "waterproof running shoes size 10"
Stage 1 - Initial Retrieval (Vector Search)
Retrieves 100 products based on embedding similarity:
- Waterproof hiking boots size 10 ✗
- Running shoes size 9 (not waterproof) ~
- Waterproof jacket ~
- Trail running shoes size 10 (water-resistant) ~
- Swimming shoes size 10 ~
Stage 2 - Reranking (Cross-Encoder)
The reranker processes each candidate:
Input: "[CLS] waterproof running shoes size 10 [SEP]
Nike Air Zoom Pegasus 40 GTX - Waterproof running shoe
with Gore-Tex membrane. Available in size 10. [SEP]"
Score: 0.94 → Rank #1 ✅
The reranker understands:
- "GTX" and "Gore-Tex" β waterproof (domain knowledge)
- "running shoe" exactly matches intent
- "size 10" exact match
- "Pegasus" is a running shoe model (not hiking)
Result: Perfect product surfaces at the top, even if its embedding wasn't the closest match.
Example 2: Customer Support RAG System
Query: "My payment failed but I was still charged"
Initial retrieval returns 50 help articles using BM25 keyword matching:
- "How to update payment methods"
- "Understanding failed transactions"
- "Refund policy for duplicate charges"
- "Payment processing times"
- ...
Reranking analysis:
| Article Title | Initial Score | Reranked Score | Why? |
|---|---|---|---|
| Refund policy for duplicate charges | 0.67 (#3) | 0.93 (#1) | Addresses both "charged" AND "failed" |
| Understanding failed transactions | 0.71 (#2) | 0.88 (#2) | Explains failures but not resolution |
| How to update payment methods | 0.75 (#1) | 0.54 (#7) | High keyword match but wrong intent |
The reranker correctly identifies that the user needs resolution of a duplicate charge, not just information about payment failures.
Example 3: Legal Document Discovery
Query: "precedents for breach of contract in software licensing agreements"
Challenge: Legal text has:
- Formal language requiring deep understanding
- Citations and cross-references
- Subtle distinctions between similar cases
Reranker advantages:
# Pseudo-code: reranking legal candidates with a cross-encoder
for doc in candidates:
    # The cross-encoder's joint attention can capture:
    #   1. "breach" ≈ "violation" (synonymy)
    #   2. "contract" ≈ "agreement" (legal equivalence)
    #   3. "software licensing" ≈ "SaaS terms" (domain concept)
    #   4. "precedents" requires case law, not statutes
    doc_text = f"{doc.title} {doc.snippet} {doc.citations}"
    doc.score = cross_encoder.predict([(query, doc_text)])[0]
A case about "violation of SaaS subscription terms" scores highly even though it doesn't use the exact words "breach of software licensing contract".
Example 4: Multi-hop Question Answering
Query: "What college did the inventor of the transistor attend?"
This requires:
- Finding who invented the transistor (multiple people: Bardeen, Brattain, Shockley)
- Finding their educational backgrounds
Initial retrieval might return:
- Document A: "The transistor was invented at Bell Labs in 1947"
- Document B: "William Shockley received his PhD from MIT"
- Document C: "John Bardeen attended University of Wisconsin"
A simple vector search might rank Document A highest (most similar to "transistor inventor"), but it doesn't answer the question!
Reranker reasoning:
┌───────────────────────────────────────────┐
│  Query: "college of transistor inventor"  │
└─────────────────────┬─────────────────────┘
                      │
                      ▼
┌───────────────────────────────────────────┐
│  Cross-Encoder Attention Pattern:         │
│                                           │
│  "inventor"   ↔  "Shockley", "Bardeen"    │
│  "college"    ↔  "MIT", "University"      │
│  "transistor" ↔  "Bell Labs context"      │
└─────────────────────┬─────────────────────┘
                      │
                      ▼
Documents B and C ranked higher
(they contain the answer, even with less query overlap)
Common Mistakes ⚠️
Mistake 1: Using Reranking as First-Stage Retrieval
The error:
# WRONG: Running cross-encoder on entire corpus
for doc in all_10_million_documents:
    score = reranker.predict(query, doc)  # TOO SLOW!
Why it fails: A cross-encoder takes roughly 50ms per document. For 10M documents, that's about 500,000 seconds, nearly six days of sequential computation!
The fix: Always use fast retrieval first:
# RIGHT: Two-stage pipeline
candidates = vector_db.search(query, top_k=100)  # ~10ms
reranked = reranker.rerank(query, candidates)    # ~500ms for 100 pairs (batched)
Mistake 2: Ignoring Document Length Bias
The problem: Cross-encoders often favor longer documents because they have more tokens to match the query, even if they're less relevant.
The fix: Normalize scores or train with length-balanced datasets:
from math import log

# Normalize by document length (a simple dampening heuristic)
adjusted_score = raw_score / log(1 + doc_length)
Mistake 3: Not Using Hard Negatives in Training
Weak training data:
Query: "python programming tutorial"
Positive: "Learn Python Programming" ✅
Negative: "How to bake a cake" ❌ (too easy)
Strong training data:
Query: "python programming tutorial"
Positive: "Learn Python Programming" ✅
Negative: "Python snake species guide" ❌ (hard negative!)
Hard negatives (documents that match keywords but not intent) force the model to learn semantic understanding.
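One common way to mine hard negatives is to retrieve with a bi-encoder and keep highly ranked documents that are not labeled relevant for the query. A rough sketch (the checkpoint name is just an example):
from sentence_transformers import SentenceTransformer, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def mine_hard_negatives(query, corpus, relevant_ids, top_k=10):
    corpus_embeddings = bi_encoder.encode(corpus, convert_to_tensor=True)
    query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)[0]
    # Top-ranked hits that are NOT marked relevant make good hard negatives
    return [hit["corpus_id"] for hit in hits if hit["corpus_id"] not in relevant_ids]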
Mistake 4: Forgetting to Truncate Long Documents
The issue: Rerankers have max input length (usually 512 tokens). Naively truncating can remove the relevant part:
# WRONG: Might cut off the answer
truncated = doc[:512]

# BETTER: Include beginning + end
truncated = doc[:256] + " ... " + doc[-256:]

# BEST: Use passage-level reranking
scores = [reranker.predict(query, passage) for passage in split_into_passages(doc)]
final_score = max(scores)  # or another aggregate (e.g. mean of the top 3)
Mistake 5: Not Caching Reranking Results
Waste: Running the same reranker on the same query repeatedly.
Solution: Cache results with query hash:
import hashlib

rerank_cache = {}

def rerank_with_cache(query, candidates):
    # Key on the query AND the candidate set: the same query with a different
    # candidate list must not reuse a stale result
    key_material = query + "||" + "||".join(sorted(str(c) for c in candidates))
    cache_key = hashlib.md5(key_material.encode()).hexdigest()
    if cache_key in rerank_cache:
        return rerank_cache[cache_key]
    results = reranker.rerank(query, candidates)
    rerank_cache[cache_key] = results
    return results
Mistake 6: Using Inconsistent Score Ranges
The trap: Different rerankers output different score ranges:
- Some output [0, 1]
- Some output [-∞, +∞]
- Some output logits
This matters when combining scores from multiple systems.
The fix: Normalize all scores:
# Min-max normalization across the candidate scores
min_score, max_score = min(raw_scores), max(raw_scores)
normalized = [(s - min_score) / (max_score - min_score) for s in raw_scores]

# Or softmax across candidates
from scipy.special import softmax
normalized_scores = softmax(raw_scores)
Key Takeaways 🎯
📋 Quick Reference Card: Reranking Essentials
| Topic | Summary |
|---|---|
| Core Concept | Two-stage retrieval: fast first pass, accurate reranking second pass |
| Key Architecture | Cross-encoders (joint query+doc encoding) for maximum accuracy |
| Typical Pipeline | Retrieve 100-1000 → Rerank → Return top 10 |
| Speed/Accuracy Tradeoff | Bi-encoders: fast but less accurate; cross-encoders: slow but highly accurate |
| Best Training Strategy | Pairwise learning with hard negatives |
| Popular Models | MS MARCO Cross-Encoder, MonoT5, BGE-reranker, ColBERT v2 |
| Optimization Tactics | Distillation, cascading, quantization, caching |
| Common Pitfall | Using the reranker for first-stage retrieval (too slow!) |
When to use reranking:
- ✅ You need maximum relevance accuracy
- ✅ You have a candidate set <1000 documents
- ✅ You can tolerate 100-500ms additional latency
- ✅ Initial retrieval isn't accurate enough
When to skip reranking:
- ❌ Your initial retrieval is already highly accurate
- ❌ You need sub-50ms response times
- ❌ You're dealing with simple keyword matching tasks
- ❌ Computational resources are severely limited
🧠 Memory device for the pipeline:
R.A.G. = Retrieve, Assess, Generate
- Retrieve: Fast initial search (BM25, vectors)
- Assess: Careful reranking (cross-encoder)
- Generate: LLM produces answer
💡 Final Pro Tip: Start simple! Begin with a lightweight reranker like ms-marco-MiniLM-L-6-v2 (about 23M parameters, very fast) before graduating to heavier models. You can often get 80% of the benefit at 20% of the cost.
📚 Further Study
Sentence-Transformers Documentation on Cross-Encoders: https://www.sbert.net/examples/applications/cross-encoder/README.html - Practical implementation guide with code examples
"ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT": https://arxiv.org/abs/2004.12832 - Seminal paper on late interaction architectures that balance speed and accuracy
Cohere Rerank API Guide: https://docs.cohere.com/docs/reranking - Production-ready reranking service with excellent documentation on best practices
Congratulations! You now understand how reranking models work, why they're critical for modern AI search, and how to implement them effectively. Practice building two-stage pipelines to see the dramatic improvement in search quality! 🎉