
Reranking Models in Hybrid Retrieval

Master reranking models with free flashcards and spaced repetition practice to solidify your understanding. This lesson covers cross-encoders, scoring mechanisms, reranking architectures, and optimization strategies: essential concepts for building high-performance AI search systems that deliver truly relevant results.

Welcome to Reranking Models 🎯

Welcome to one of the most impactful components in modern retrieval systems! If you've ever wondered how search engines manage to show you the perfect result at the top of the page, reranking models are often the secret sauce. While initial retrieval methods (like BM25 or vector search) cast a wide net to find potentially relevant documents, reranking models act as the final judge, carefully examining each candidate and reordering them to surface the best matches.

Think of it like a talent show: the initial auditions (first-stage retrieval) let many contestants through, but the judges (rerankers) carefully evaluate each performance to determine the final rankings. This two-stage approach combines the speed of simple retrieval with the accuracy of sophisticated neural models.

💡 Why this matters: Reranking can improve your search relevance by 20-40% compared to retrieval alone, and it's become the standard architecture in production RAG systems at companies like Google, Microsoft, and OpenAI.

Core Concepts: Understanding Reranking Models 🔍

What is Reranking?

Reranking is a second-stage refinement process that takes a candidate set of documents (typically 10-100 items) retrieved by a first-stage system and reorders them using a more sophisticated model. The key insight is this: you can't afford to run an expensive model on millions of documents, but you can afford to run it on the top 50-100 candidates.

┌─────────────────────────────────────────────────┐
│         TWO-STAGE RETRIEVAL PIPELINE            │
└─────────────────────────────────────────────────┘

📚 Document Corpus (1M+ docs)
           │
           ↓
┌──────────────────────┐
│   STAGE 1: RETRIEVAL │  ⚡ Fast & Broad
│  • BM25 / Vector DB  │  • Process entire corpus
│  • Top-k = 100       │  • Simple similarity
└──────────┬───────────┘
           │
           ↓ (100 candidates)
┌──────────────────────┐
│   STAGE 2: RERANKING │  🎯 Slow & Precise
│  • Cross-encoder     │  • Deep interaction
│  • Top-k = 10        │  • Final ordering
└──────────┬───────────┘
           │
           ↓
    📊 Final Results (10 docs)
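
Here's a minimal sketch of this two-stage pipeline using the sentence-transformers library. The model names and the tiny in-memory corpus are illustrative assumptions, not a prescribed setup:

from sentence_transformers import SentenceTransformer, CrossEncoder, util

corpus = [
    "Python is a popular programming language created by Guido van Rossum.",
    "Pythons are large nonvenomous snakes found in Africa and Asia.",
    "Java is a programming language widely used for enterprise software.",
]
query = "what is the python language"

# Stage 1: fast bi-encoder retrieval over the whole corpus
retriever = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = retriever.encode(corpus, convert_to_tensor=True)
query_emb = retriever.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=100)[0]  # broad candidate set

# Stage 2: precise cross-encoder reranking of the candidates only
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
scores = reranker.predict(pairs)

reranked = sorted(zip(hits, scores), key=lambda x: x[1], reverse=True)[:10]
for hit, score in reranked:
    print(f"{score:.3f}  {corpus[hit['corpus_id']]}")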

Cross-Encoders: The Heart of Reranking 🧠

The most powerful reranking models are cross-encoders. Unlike bi-encoders (used in vector search) that encode queries and documents separately, cross-encoders process the query and document together as a single input.

Architecture comparison:

| Model Type | Input Processing | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| Bi-encoder | Separate encoding: encode(query) vs encode(doc) | ⚡⚡⚡ Very fast | ⭐⭐⭐ Good | First-stage retrieval |
| Cross-encoder | Joint encoding: encode(query + doc) | 🐌 Slow | ⭐⭐⭐⭐⭐ Excellent | Second-stage reranking |

Why cross-encoders are more accurate:

When you concatenate the query and document together, the transformer's attention mechanism can compute interactions between every query token and every document token. This enables:

  • Token-level matching: Understanding that "apple" in the query might match "iPhone" in a document
  • Context awareness: Recognizing that "python" + "programming" means the language, not the snake
  • Semantic relationships: Capturing subtle relevance signals that simple cosine similarity misses

💡 Technical insight: A bi-encoder computes dot_product(query_vector, doc_vector) with ~768 dimensions of interaction. A cross-encoder computes attention across query_length × doc_length × hidden_dim = potentially millions of interaction points!
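
To make the difference concrete, here is a small sketch (using a Hugging Face tokenizer as an assumed tooling choice) of the two input layouts:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "what is python"
doc = "Python is a programming language."

# Bi-encoder style: two independent inputs that only "meet" at the final
# comparison between two fixed-size vectors
q_ids = tok(query)["input_ids"]
d_ids = tok(doc)["input_ids"]

# Cross-encoder style: one joint input, so every attention layer can mix
# query tokens with document tokens
joint_ids = tok(query, doc)["input_ids"]
print(tok.decode(joint_ids))  # shows the [CLS] ... [SEP] ... [SEP] layout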

How Scoring Works 📊

Reranking models output a relevance score for each (query, document) pair. Here's the typical architecture:

    INPUT: "[CLS] what is python [SEP] Python is a programming language [SEP]"
           │
           ↓
    ┌──────────────────┐
    │ Transformer      │  (12-24 layers)
    │ (BERT, RoBERTa)  │
    └────────┬─────────┘
             │
             ↓
      [CLS] embedding  (contextual representation)
             │
             ↓
    ┌────────────────┐
    │ Classification │   (linear layer)
    │ Head           │
    └────────┬───────┘
             │
             ↓
       Relevance Score: 0.87  (0 to 1)

The [CLS] token at the beginning serves as an aggregate representation of the entire input, and a simple linear layer projects it to a relevance score.
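
In code, that scoring head looks roughly like the sketch below. The checkpoint is a public MS MARCO cross-encoder, and the sigmoid at the end is one common convention for mapping the raw logit to a 0-1 score:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

inputs = tokenizer("what is python",
                   "Python is a programming language.",
                   return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logit = model(**inputs).logits[0, 0]  # single relevance logit from the [CLS] head

score = torch.sigmoid(logit).item()       # map to (0, 1)
print(f"relevance ~ {score:.2f}")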

Common scoring functions:

  1. Binary classification: Output probability that document is relevant (0-1)
  2. Regression: Direct relevance score prediction
  3. Pointwise ranking: Score each item independently
  4. Pairwise ranking: Trained on document pairs (which is more relevant?)
  5. Listwise ranking: Considers the entire ranking at once

Popular Reranking Models 🏆

The reranking landscape has evolved rapidly. Here are the key players:

| Model | Base Architecture | Parameters | Strengths |
|---|---|---|---|
| MS MARCO Cross-Encoder | MiniLM | 33M-110M | Fast, good baseline |
| ColBERT v2 | BERT + late interaction | 110M | Balanced speed/accuracy |
| MonoT5 | T5 | 220M-3B | State-of-the-art accuracy |
| RankT5 | T5 (specialized) | 220M-780M | Optimized for ranking |
| BGE-reranker | XLM-RoBERTa | 560M | Multilingual support |
| Cohere Rerank | Proprietary | Unknown | API-based, high quality |

MonoT5 architecture is particularly interesting: instead of regression, it frames reranking as text generation. Given "Query: X Document: Y Relevant:", it generates "true" or "false" and uses the probability of generating "true" as the relevance score.
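
Here is a sketch of that generation-based scoring. It assumes a public MonoT5 checkpoint (castorini/monot5-base-msmarco) and that "true"/"false" each map to a single SentencePiece token, which is how the released models are typically used:

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

name = "castorini/monot5-base-msmarco"  # assumed public checkpoint
tok = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)
model.eval()

query = "what is python"
document = "Python is a programming language."
prompt = f"Query: {query} Document: {document} Relevant:"
inputs = tok(prompt, return_tensors="pt", truncation=True)

# Relevance = probability of generating "true" as the first output token
decoder_start = torch.tensor([[model.config.decoder_start_token_id]])
with torch.no_grad():
    logits = model(**inputs, decoder_input_ids=decoder_start).logits[0, -1]

true_id = tok.encode("true")[0]    # assumption: "true"/"false" are single tokens
false_id = tok.encode("false")[0]
score = torch.softmax(logits[[true_id, false_id]], dim=0)[0].item()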

Training Strategies 📚

1. Pointwise Training

The simplest approach: treat each (query, document, label) triple independently.

Loss = CrossEntropy(predicted_score, true_label)
  • Label 1 = relevant
  • Label 0 = not relevant

2. Pairwise Training

More effective: train on pairs of documents for the same query.

Loss = max(0, margin - (score_relevant - score_non_relevant))

The model learns that relevant documents should score higher than non-relevant ones by at least margin.
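
A minimal PyTorch sketch of that hinge-style pairwise loss; the scores below are toy numbers standing in for cross-encoder outputs:

import torch
import torch.nn.functional as F

def pairwise_margin_loss(pos_scores, neg_scores, margin=1.0):
    # Hinge loss: zero once the relevant doc outscores the negative by `margin`
    return F.relu(margin - (pos_scores - neg_scores)).mean()

# Toy scores for (query, relevant_doc) and (query, hard_negative) pairs
pos_scores = torch.tensor([2.3, 0.8, 1.5])
neg_scores = torch.tensor([1.9, 1.2, -0.4])
print(pairwise_margin_loss(pos_scores, neg_scores))  # tensor(0.6667)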

3. Listwise Training

Most sophisticated: optimize the quality of the entire ranking at once. Metrics like NDCG aren't differentiable, so listwise methods use smooth surrogates (e.g., ListNet, LambdaLoss, ApproxNDCG) that approximate:

Loss ≈ -NDCG(predicted_ranking, true_ranking)

💡 Pro tip: Pairwise training with hard negatives (documents that are similar but not relevant) produces the best rerankers for RAG systems.

Optimization Techniques ⚡

Reranking can be expensive. Here are strategies to speed it up:

1. Distillation

Train a smaller "student" model to mimic a larger "teacher" reranker:

    ┌─────────────┐
    │ Teacher     │  3B parameters
    │ (MonoT5)    │  Scores: [0.9, 0.7, 0.3, ...]
    └──────┬──────┘
           │ Distill knowledge
           ↓
    ┌─────────────┐
    │ Student     │  110M parameters
    │ (MiniLM)    │  Learns to match teacher scores
    └─────────────┘

Distilled students often retain about 95% of the teacher's accuracy at roughly 10x the speed!
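
A sketch of a single distillation step is shown below. It assumes student is a one-logit sequence-classification model, batch_inputs is a tokenized batch of (query, document) pairs, and teacher_scores holds the teacher's precomputed scores for that batch:

import torch.nn.functional as F

def distillation_step(student, optimizer, batch_inputs, teacher_scores):
    # The student regresses onto the teacher's relevance scores (MSE distillation)
    student_scores = student(**batch_inputs).logits.squeeze(-1)
    loss = F.mse_loss(student_scores, teacher_scores)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()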

2. Cascading

Use multiple reranking stages with increasing sophistication:

100 candidates → Fast reranker (20ms)  → 30 candidates
                                          ↓
30 candidates  → Slow reranker (200ms) → 10 candidates
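
A sketch of such a cascade, assuming both rerankers expose a CrossEncoder-style predict() over (query, document) pairs:

def cascade_rerank(query, candidates, fast_reranker, slow_reranker):
    # Stage A: cheap model trims ~100 candidates down to 30
    fast_scores = fast_reranker.predict([(query, doc) for doc in candidates])
    survivors = [doc for _, doc in sorted(zip(fast_scores, candidates),
                                          key=lambda t: t[0], reverse=True)[:30]]
    # Stage B: expensive model produces the final top 10
    slow_scores = slow_reranker.predict([(query, doc) for doc in survivors])
    return [doc for _, doc in sorted(zip(slow_scores, survivors),
                                     key=lambda t: t[0], reverse=True)[:10]]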

3. Quantization

Reduce model precision from FP32 → INT8 (see the sketch after this list):

  • 4x smaller memory footprint
  • 2-4x faster inference
  • <1% accuracy loss

4. Dynamic Padding

Instead of padding all inputs to max length (512 tokens), batch similar-length documents together to reduce wasted computation.
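
A sketch of length-bucketed batching with per-batch padding; candidate_pairs is a hypothetical list of (query, document) string pairs:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")

# candidate_pairs: hypothetical list of (query, document) strings
pairs = sorted(candidate_pairs, key=lambda p: len(p[1]))  # group similar lengths

for i in range(0, len(pairs), 32):
    batch = pairs[i:i + 32]
    enc = tokenizer([q for q, _ in batch],
                    [d for _, d in batch],
                    padding="longest",  # pad only to the longest item in this batch
                    truncation=True, max_length=512,
                    return_tensors="pt")
    # ... feed enc to the reranker ...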

5. Early Exit

For cross-encoders with many layers (24+), you can sometimes exit after layer 12-18 for obvious cases (very relevant or clearly irrelevant) and only use full depth for ambiguous pairs.

Real-World Examples 🌍

Example 1: E-commerce Product Search

Scenario: A user searches for "waterproof running shoes size 10"

Stage 1 - Initial Retrieval (Vector Search)

Retrieves 100 products based on embedding similarity:

  • Waterproof hiking boots size 10 βœ“
  • Running shoes size 9 (not waterproof) ~
  • Waterproof jacket ~
  • Trail running shoes size 10 (water-resistant) ~
  • Swimming shoes size 10 ~

Stage 2 - Reranking (Cross-Encoder)

The reranker processes each candidate:

Input: "[CLS] waterproof running shoes size 10 [SEP] 
        Nike Air Zoom Pegasus 40 GTX - Waterproof running shoe 
        with Gore-Tex membrane. Available in size 10. [SEP]"

Score: 0.94 → Rank #1 ✅

The reranker understands:

  • "GTX" and "Gore-Tex" β†’ waterproof (domain knowledge)
  • "running shoe" exactly matches intent
  • "size 10" exact match
  • "Pegasus" is a running shoe model (not hiking)

Result: Perfect product surfaces at the top, even if its embedding wasn't the closest match.

Example 2: Customer Support RAG System

Query: "My payment failed but I was still charged"

Initial retrieval returns 50 help articles using BM25 keyword matching:

  1. "How to update payment methods"
  2. "Understanding failed transactions"
  3. "Refund policy for duplicate charges"
  4. "Payment processing times"
  5. ...

Reranking analysis:

| Article Title | Initial Score | Reranked Score | Why? |
|---|---|---|---|
| Refund policy for duplicate charges | 0.67 (#3) | 0.93 (#1) | Addresses both "charged" AND "failed" |
| Understanding failed transactions | 0.71 (#2) | 0.88 (#2) | Explains failures but not resolution |
| How to update payment methods | 0.75 (#1) | 0.54 (#7) | High keyword match but wrong intent |

The reranker correctly identifies that the user needs resolution of a duplicate charge, not just information about payment failures.

Query: "precedents for breach of contract in software licensing agreements"

Challenge: Legal text has:

  • Formal language requiring deep understanding
  • Citations and cross-references
  • Subtle distinctions between similar cases

Reranker advantages:

# Pseudo-code for legal reranking
scores = []
for doc in candidates:
    # The cross-encoder's joint attention can capture:
    # 1. "breach" ↔ "violation" (synonymy)
    # 2. "contract" ↔ "agreement" (legal equivalence)
    # 3. "software licensing" ↔ "SaaS terms" (domain concept)
    # 4. "precedents" requires case law, not statutes
    doc_text = " ".join([doc.title, doc.snippet, doc.citations])
    scores.append(cross_encoder.predict(query, doc_text))

A case about "violation of SaaS subscription terms" scores highly even though it doesn't use the exact words "breach of software licensing contract".

Example 4: Multi-hop Question Answering

Query: "What college did the inventor of the transistor attend?"

This requires:

  1. Finding who invented the transistor (multiple people: Bardeen, Brattain, Shockley)
  2. Finding their educational backgrounds

Initial retrieval might return:

  • Document A: "The transistor was invented at Bell Labs in 1947"
  • Document B: "William Shockley received his PhD from MIT"
  • Document C: "John Bardeen attended University of Wisconsin"

A simple vector search might rank Document A highest (most similar to "transistor inventor"), but it doesn't answer the question!

Reranker reasoning:

┌──────────────────────────────────────────┐
│  Query: "college of transistor inventor" │
└──────────────────────────────────────────┘
              │
              ↓
    ┌───────────────────────────────────────┐
    │ Cross-Encoder                         │
    │ Attention Pattern:                    │
    │                                       │
    │ "inventor"   ←→ "Shockley", "Bardeen" │
    │ "college"    ←→ "MIT", "University"   │
    │ "transistor" ←→ "Bell Labs context"   │
    └───────────────────────────────────────┘
              │
              ↓
    Document B & C ranked higher
    (contain answer, even if less query overlap)

Common Mistakes ⚠️

Mistake 1: Using Reranking as First-Stage Retrieval

The error:

# WRONG: Running cross-encoder on entire corpus
for doc in all_10_million_documents:
    score = reranker.predict(query, doc)  # TOO SLOW!

Why it fails: A cross-encoder takes ~50ms per document. For 10M documents, that's roughly 500,000 seconds, nearly six days of sequential computation, for a single query!

The fix: Always use fast retrieval first:

# RIGHT: Two-stage pipeline
candidates = vector_db.search(query, top_k=100)  # ~10ms
reranked = reranker.rerank(query, candidates)    # ~500ms

Mistake 2: Ignoring Document Length Bias

The problem: Cross-encoders often favor longer documents because they have more tokens to match the query, even if they're less relevant.

The fix: Normalize scores or train with length-balanced datasets:

from math import log

# Normalize by document length (a simple heuristic)
adjusted_score = raw_score / log(1 + doc_length)

Mistake 3: Not Using Hard Negatives in Training

Weak training data:

Query: "python programming tutorial"
Positive: "Learn Python Programming" βœ“
Negative: "How to bake a cake" βœ— (too easy)

Strong training data:

Query: "python programming tutorial"
Positive: "Learn Python Programming" βœ“
Negative: "Python snake species guide" βœ— (hard negative!)

Hard negatives (documents that match keywords but not intent) force the model to learn semantic understanding.
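
One common way to obtain hard negatives is to mine them from the bi-encoder itself: its top hits that are not labeled relevant are, by construction, close-but-wrong matches. A sketch, assuming a sentence-transformers retriever and a labeled set of relevant documents:

from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("all-MiniLM-L6-v2")

def mine_hard_negatives(query, corpus, relevant_docs, k=5):
    # Retrieve the closest documents, then keep the top ones that are NOT relevant
    corpus_emb = retriever.encode(corpus, convert_to_tensor=True)
    query_emb = retriever.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=50)[0]
    negatives = [corpus[h["corpus_id"]] for h in hits
                 if corpus[h["corpus_id"]] not in relevant_docs]
    return negatives[:k]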

Mistake 4: Forgetting to Truncate Long Documents

The issue: Rerankers have max input length (usually 512 tokens). Naively truncating can remove the relevant part:

# WRONG: Might cut off the answer
truncated = doc[:512]

# BETTER: Include beginning + end
truncated = doc[:256] + " ... " + doc[-256:]

# BEST: Use passage-level reranking
scores = [reranker.predict(query, passage)
          for passage in split_into_passages(doc)]
final_score = max(scores)  # or another aggregation (e.g. mean of the top 3)

Mistake 5: Not Caching Reranking Results

Waste: Running the same reranker on the same query repeatedly.

Solution: Cache results keyed on the query and candidate set (so a changed candidate list is rescored):

import hashlib

rerank_cache = {}

def rerank_with_cache(query, candidates):
    # Key on the query plus the candidate texts/IDs, not the query alone
    key_material = query + "||" + "||".join(sorted(candidates))
    cache_key = hashlib.md5(key_material.encode()).hexdigest()
    if cache_key in rerank_cache:
        return rerank_cache[cache_key]

    results = reranker.rerank(query, candidates)
    rerank_cache[cache_key] = results
    return results

Mistake 6: Using Inconsistent Score Ranges

The trap: Different rerankers output different score ranges:

  • Some output [0, 1]
  • Some output [-∞, +∞]
  • Some output logits

This matters when combining scores from multiple systems.

The fix: Normalize all scores:

# Min-max normalization
normalized = (score - min_score) / (max_score - min_score)

# Or softmax across candidates
from scipy.special import softmax
normalized_scores = softmax(raw_scores)

Key Takeaways 🎓

📋 Quick Reference Card: Reranking Essentials

| Topic | Summary |
|---|---|
| Core Concept | Two-stage retrieval: fast first pass, accurate reranking second pass |
| Key Architecture | Cross-encoders (joint query+doc encoding) for maximum accuracy |
| Typical Pipeline | Retrieve 100-1000 → Rerank → Return top 10 |
| Speed/Accuracy Tradeoff | Bi-encoders: fast but less accurate; cross-encoders: slow but highly accurate |
| Best Training Strategy | Pairwise learning with hard negatives |
| Popular Models | MS MARCO Cross-Encoder, MonoT5, BGE-reranker, ColBERT v2 |
| Optimization Tactics | Distillation, cascading, quantization, caching |
| Common Pitfall | Using the reranker for first-stage retrieval (too slow!) |

When to use reranking:

  • βœ… You need maximum relevance accuracy
  • βœ… You have a candidate set <1000 documents
  • βœ… You can tolerate 100-500ms additional latency
  • βœ… Initial retrieval isn't accurate enough

When to skip reranking:

  • ❌ Your initial retrieval is already highly accurate
  • ❌ You need sub-50ms response times
  • ❌ You're dealing with simple keyword matching tasks
  • ❌ Computational resources are severely limited

🧠 Memory device for the pipeline:

R.A.G. = Retrieve, Assess, Generate

  • Retrieve: Fast initial search (BM25, vectors)
  • Assess: Careful reranking (cross-encoder)
  • Generate: LLM produces answer

💡 Final Pro Tip: Start simple! Begin with a lightweight reranker like ms-marco-MiniLM-L-6-v2 (~23M parameters, very fast) before graduating to heavier models. You can often get 80% of the benefit at 20% of the cost.

📚 Further Study

  1. Sentence-Transformers Documentation on Cross-Encoders: https://www.sbert.net/examples/applications/cross-encoder/README.html - Practical implementation guide with code examples

  2. "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT": https://arxiv.org/abs/2004.12832 - Seminal paper on late interaction architectures that balance speed and accuracy

  3. Cohere Rerank API Guide: https://docs.cohere.com/docs/reranking - Production-ready reranking service with excellent documentation on best practices

Congratulations! You now understand how reranking models work, why they're critical for modern AI search, and how to implement them effectively. Practice building two-stage pipelines to see the dramatic improvement in search quality! 🚀