
Reranking Models in Hybrid Retrieval

Master reranking models with free flashcards and spaced repetition practice to solidify your understanding. This lesson covers cross-encoders, scoring mechanisms, reranking architectures, and optimization strategies: essential concepts for building high-performance AI search systems that deliver truly relevant results.

Welcome to Reranking Models 🎯

Welcome to one of the most impactful components in modern retrieval systems! If you've ever wondered how search engines manage to show you the perfect result at the top of the page, reranking models are often the secret sauce. While initial retrieval methods (like BM25 or vector search) cast a wide net to find potentially relevant documents, reranking models act as the final judge, carefully examining each candidate and reordering them to surface the best matches.

Think of it like a talent show: the initial auditions (first-stage retrieval) let many contestants through, but the judges (rerankers) carefully evaluate each performance to determine the final rankings. This two-stage approach combines the speed of simple retrieval with the accuracy of sophisticated neural models.

💡 Why this matters: Reranking can improve your search relevance by 20-40% compared to retrieval alone, and it's become the standard architecture in production RAG systems at companies like Google, Microsoft, and OpenAI.

Core Concepts: Understanding Reranking Models 🔍

What is Reranking?

Reranking is a second-stage refinement process that takes a candidate set of documents (typically 10-100 items) retrieved by a first-stage system and reorders them using a more sophisticated model. The key insight is this: you can't afford to run an expensive model on millions of documents, but you can afford to run it on the top 50-100 candidates.

┌─────────────────────────────────────────────────┐
│         TWO-STAGE RETRIEVAL PIPELINE            │
└─────────────────────────────────────────────────┘

📚 Document Corpus (1M+ docs)
           │
           ↓
┌──────────────────────┐
│   STAGE 1: RETRIEVAL │  ⚡ Fast & Broad
│  • BM25 / Vector DB  │  • Process entire corpus
│  • Top-k = 100       │  • Simple similarity
└──────────┬───────────┘
           │
           ↓ (100 candidates)
┌──────────────────────┐
│   STAGE 2: RERANKING │  🎯 Slow & Precise
│  • Cross-encoder     │  • Deep interaction
│  • Top-k = 10        │  • Final ordering
└──────────┬───────────┘
           │
           ↓
    📊 Final Results (10 docs)
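
Here's a minimal sketch of this two-stage pipeline using the sentence-transformers library. The model names and the tiny in-memory corpus are illustrative assumptions, not a prescribed setup:

from sentence_transformers import SentenceTransformer, CrossEncoder, util

corpus = [
    "Python is a popular programming language created by Guido van Rossum.",
    "Pythons are large nonvenomous snakes found in Africa and Asia.",
    "Java is a programming language widely used for enterprise software.",
]
query = "what is the python language"

# Stage 1: fast bi-encoder retrieval over the whole corpus
retriever = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = retriever.encode(corpus, convert_to_tensor=True)
query_emb = retriever.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=100)[0]  # broad candidate set

# Stage 2: precise cross-encoder reranking of the candidates only
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
scores = reranker.predict(pairs)

reranked = sorted(zip(hits, scores), key=lambda x: x[1], reverse=True)[:10]
for hit, score in reranked:
    print(f"{score:.3f}  {corpus[hit['corpus_id']]}")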

Cross-Encoders: The Heart of Reranking 🧠

The most powerful reranking models are cross-encoders. Unlike bi-encoders (used in vector search) that encode queries and documents separately, cross-encoders process the query and document together as a single input.

Architecture comparison:

| Model Type | Input Processing | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| Bi-encoder | Separate encoding: encode(query) vs encode(doc) | ⚡⚡⚡ Very fast | ⭐⭐⭐ Good | First-stage retrieval |
| Cross-encoder | Joint encoding: encode(query + doc) | 🐌 Slow | ⭐⭐⭐⭐⭐ Excellent | Second-stage reranking |

Why cross-encoders are more accurate:

When you concatenate the query and document together, the transformer's attention mechanism can compute interactions between every query token and every document token. This enables:

  • Token-level matching: Understanding that "apple" in the query might match "iPhone" in a document
  • Context awareness: Recognizing that "python" + "programming" means the language, not the snake
  • Semantic relationships: Capturing subtle relevance signals that simple cosine similarity misses

💡 Technical insight: A bi-encoder computes dot_product(query_vector, doc_vector) with ~768 dimensions of interaction. A cross-encoder computes attention across query_length × doc_length × hidden_dim = potentially millions of interaction points!
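
To make the difference concrete, here is a small sketch (using a Hugging Face tokenizer as an assumed tooling choice) of the two input layouts:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "what is python"
doc = "Python is a programming language."

# Bi-encoder style: two independent inputs that only "meet" at the final
# comparison between two fixed-size vectors
q_ids = tok(query)["input_ids"]
d_ids = tok(doc)["input_ids"]

# Cross-encoder style: one joint input, so every attention layer can mix
# query tokens with document tokens
joint_ids = tok(query, doc)["input_ids"]
print(tok.decode(joint_ids))  # shows the [CLS] ... [SEP] ... [SEP] layout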

How Scoring Works 📊

Reranking models output a relevance score for each (query, document) pair. Here's the typical architecture:

    INPUT: "[CLS] what is python [SEP] Python is a programming language [SEP]"
           │
           ↓
    ┌──────────────────┐
    │ Transformer      │  (12-24 layers)
    │ (BERT, RoBERTa)  │
    └────────┬─────────┘
             │
             ↓
      [CLS] embedding  (contextual representation)
             │
             ↓
    ┌────────────────┐
    │ Classification │   (linear layer)
    │ Head           │
    └────────┬───────┘
             │
             ↓
       Relevance Score: 0.87  (0 to 1)

The [CLS] token at the beginning serves as an aggregate representation of the entire input, and a simple linear layer projects it to a relevance score.
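
In code, that scoring head looks roughly like the sketch below. The checkpoint is a public MS MARCO cross-encoder, and the sigmoid at the end is one common convention for mapping the raw logit to a 0-1 score:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

inputs = tokenizer("what is python",
                   "Python is a programming language.",
                   return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logit = model(**inputs).logits[0, 0]  # single relevance logit from the [CLS] head

score = torch.sigmoid(logit).item()       # map to (0, 1)
print(f"relevance ~ {score:.2f}")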

Common scoring functions:

  1. Binary classification: Output probability that document is relevant (0-1)
  2. Regression: Direct relevance score prediction
  3. Pointwise ranking: Score each item independently
  4. Pairwise ranking: Trained on document pairs (which is more relevant?)
  5. Listwise ranking: Considers the entire ranking at once

Popular Reranking Models 🏆

The reranking landscape has evolved rapidly. Here are the key players:

| Model | Base Architecture | Parameters | Strengths |
|---|---|---|---|
| MS MARCO Cross-Encoder | MiniLM | 33M-110M | Fast, good baseline |
| ColBERT v2 | BERT + late interaction | 110M | Balanced speed/accuracy |
| MonoT5 | T5 | 220M-3B | State-of-the-art accuracy |
| RankT5 | T5 (specialized) | 220M-780M | Optimized for ranking |
| BGE-reranker | XLM-RoBERTa | 560M | Multilingual support |
| Cohere Rerank | Proprietary | Unknown | API-based, high quality |

MonoT5 architecture is particularly interesting: instead of regression, it frames reranking as text generation. Given "Query: X Document: Y Relevant:", it generates "true" or "false" and uses the probability of generating "true" as the relevance score.
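
Here is a sketch of that generation-based scoring. It assumes a public MonoT5 checkpoint (castorini/monot5-base-msmarco) and that "true"/"false" each map to a single SentencePiece token, which is how the released models are typically used:

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

name = "castorini/monot5-base-msmarco"  # assumed public checkpoint
tok = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)
model.eval()

query = "what is python"
document = "Python is a programming language."
prompt = f"Query: {query} Document: {document} Relevant:"
inputs = tok(prompt, return_tensors="pt", truncation=True)

# Relevance = probability of generating "true" as the first output token
decoder_start = torch.tensor([[model.config.decoder_start_token_id]])
with torch.no_grad():
    logits = model(**inputs, decoder_input_ids=decoder_start).logits[0, -1]

true_id = tok.encode("true")[0]    # assumption: "true"/"false" are single tokens
false_id = tok.encode("false")[0]
score = torch.softmax(logits[[true_id, false_id]], dim=0)[0].item()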

Training Strategies 📚

1. Pointwise Training

The simplest approach: treat each (query, document, label) triple independently.

Loss = CrossEntropy(predicted_score, true_label)
  • Label 1 = relevant
  • Label 0 = not relevant

2. Pairwise Training

More effective: train on pairs of documents for the same query.

Loss = max(0, margin - (score_relevant - score_non_relevant))

The model learns that relevant documents should score higher than non-relevant ones by at least margin.
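
A minimal PyTorch sketch of that hinge-style pairwise loss; the scores below are toy numbers standing in for cross-encoder outputs:

import torch
import torch.nn.functional as F

def pairwise_margin_loss(pos_scores, neg_scores, margin=1.0):
    # Hinge loss: zero once the relevant doc outscores the negative by `margin`
    return F.relu(margin - (pos_scores - neg_scores)).mean()

# Toy scores for (query, relevant_doc) and (query, hard_negative) pairs
pos_scores = torch.tensor([2.3, 0.8, 1.5])
neg_scores = torch.tensor([1.9, 1.2, -0.4])
print(pairwise_margin_loss(pos_scores, neg_scores))  # tensor(0.6667)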

3. Listwise Training

Most sophisticated: optimize the quality of the entire ranking at once. Metrics like NDCG aren't differentiable, so listwise methods use smooth surrogates (e.g., ListNet, LambdaLoss, ApproxNDCG) that approximate:

Loss ≈ -NDCG(predicted_ranking, true_ranking)

💡 Pro tip: Pairwise training with hard negatives (documents that are similar but not relevant) produces the best rerankers for RAG systems.

Optimization Techniques ⚡

Reranking can be expensive. Here are strategies to speed it up:

1. Distillation

Train a smaller "student" model to mimic a larger "teacher" reranker:

    ┌─────────────┐
    │ Teacher     │  3B parameters
    │ (MonoT5)    │  Scores: [0.9, 0.7, 0.3, ...]
    └──────┬──────┘
           │ Distill knowledge
           ↓
    ┌─────────────┐
    │ Student     │  110M parameters
    │ (MiniLM)    │  Learns to match teacher scores
    └─────────────┘

Distilled students often retain about 95% of the teacher's accuracy at roughly 10x the speed!
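
A sketch of a single distillation step is shown below. It assumes student is a one-logit sequence-classification model, batch_inputs is a tokenized batch of (query, document) pairs, and teacher_scores holds the teacher's precomputed scores for that batch:

import torch.nn.functional as F

def distillation_step(student, optimizer, batch_inputs, teacher_scores):
    # The student regresses onto the teacher's relevance scores (MSE distillation)
    student_scores = student(**batch_inputs).logits.squeeze(-1)
    loss = F.mse_loss(student_scores, teacher_scores)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()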

2. Cascading

Use multiple reranking stages with increasing sophistication:

100 candidates → Fast reranker (20ms)  → 30 candidates
                                          ↓
30 candidates  → Slow reranker (200ms) → 10 candidates
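
A sketch of such a cascade, assuming both rerankers expose a CrossEncoder-style predict() over (query, document) pairs:

def cascade_rerank(query, candidates, fast_reranker, slow_reranker):
    # Stage A: cheap model trims ~100 candidates down to 30
    fast_scores = fast_reranker.predict([(query, doc) for doc in candidates])
    survivors = [doc for _, doc in sorted(zip(fast_scores, candidates),
                                          key=lambda t: t[0], reverse=True)[:30]]
    # Stage B: expensive model produces the final top 10
    slow_scores = slow_reranker.predict([(query, doc) for doc in survivors])
    return [doc for _, doc in sorted(zip(slow_scores, survivors),
                                     key=lambda t: t[0], reverse=True)[:10]]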

3. Quantization

Reduce model precision from FP32 → INT8 (see the sketch after this list):

  • 4x smaller memory footprint
  • 2-4x faster inference
  • <1% accuracy loss

4. Dynamic Padding

Instead of padding all inputs to max length (512 tokens), batch similar-length documents together to reduce wasted computation.
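
A sketch of length-bucketed batching with per-batch padding; candidate_pairs is a hypothetical list of (query, document) string pairs:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")

# candidate_pairs: hypothetical list of (query, document) strings
pairs = sorted(candidate_pairs, key=lambda p: len(p[1]))  # group similar lengths

for i in range(0, len(pairs), 32):
    batch = pairs[i:i + 32]
    enc = tokenizer([q for q, _ in batch],
                    [d for _, d in batch],
                    padding="longest",  # pad only to the longest item in this batch
                    truncation=True, max_length=512,
                    return_tensors="pt")
    # ... feed enc to the reranker ...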

5. Early Exit

For cross-encoders with many layers (24+), you can sometimes exit after layer 12-18 for obvious cases (very relevant or clearly irrelevant) and only use full depth for ambiguous pairs.

Real-World Examples 🌍

Example 1: E-commerce Product Search

Scenario: A user searches for "waterproof running shoes size 10"

Stage 1 - Initial Retrieval (Vector Search)

Retrieves 100 products based on embedding similarity:

  • Waterproof hiking boots size 10 βœ“
  • Running shoes size 9 (not waterproof) ~
  • Waterproof jacket ~
  • Trail running shoes size 10 (water-resistant) ~
  • Swimming shoes size 10 ~

Stage 2 - Reranking (Cross-Encoder)

The reranker processes each candidate:

Input: "[CLS] waterproof running shoes size 10 [SEP] 
        Nike Air Zoom Pegasus 40 GTX - Waterproof running shoe 
        with Gore-Tex membrane. Available in size 10. [SEP]"

Score: 0.94 → Rank #1 ✅

The reranker understands:

  • "GTX" and "Gore-Tex" β†’ waterproof (domain knowledge)
  • "running shoe" exactly matches intent
  • "size 10" exact match
  • "Pegasus" is a running shoe model (not hiking)

Result: Perfect product surfaces at the top, even if its embedding wasn't the closest match.

Example 2: Customer Support RAG System

Query: "My payment failed but I was still charged"

Initial retrieval returns 50 help articles using BM25 keyword matching:

  1. "How to update payment methods"
  2. "Understanding failed transactions"
  3. "Refund policy for duplicate charges"
  4. "Payment processing times"
  5. ...

Reranking analysis:

| Article Title | Initial Score | Reranked Score | Why? |
|---|---|---|---|
| Refund policy for duplicate charges | 0.67 (#3) | 0.93 (#1) | Addresses both "charged" AND "failed" |
| Understanding failed transactions | 0.71 (#2) | 0.88 (#2) | Explains failures but not resolution |
| How to update payment methods | 0.75 (#1) | 0.54 (#7) | High keyword match but wrong intent |

The reranker correctly identifies that the user needs resolution of a duplicate charge, not just information about payment failures.

Query: "precedents for breach of contract in software licensing agreements"

Challenge: Legal text has:

  • Formal language requiring deep understanding
  • Citations and cross-references
  • Subtle distinctions between similar cases

Reranker advantages:

# Pseudo-code for legal reranking
scores = []
for doc in candidates:
    # The cross-encoder's joint attention can capture:
    # 1. "breach" ↔ "violation" (synonymy)
    # 2. "contract" ↔ "agreement" (legal equivalence)
    # 3. "software licensing" ↔ "SaaS terms" (domain concept)
    # 4. "precedents" requires case law, not statutes
    doc_text = " ".join([doc.title, doc.snippet, doc.citations])
    scores.append(cross_encoder.predict(query, doc_text))

A case about "violation of SaaS subscription terms" scores highly even though it doesn't use the exact words "breach of software licensing contract".

Example 4: Multi-hop Question Answering

Query: "What college did the inventor of the transistor attend?"

This requires:

  1. Finding who invented the transistor (multiple people: Bardeen, Brattain, Shockley)
  2. Finding their educational backgrounds

Initial retrieval might return:

  • Document A: "The transistor was invented at Bell Labs in 1947"
  • Document B: "William Shockley received his PhD from MIT"
  • Document C: "John Bardeen attended University of Wisconsin"

A simple vector search might rank Document A highest (most similar to "transistor inventor"), but it doesn't answer the question!

Reranker reasoning:

┌──────────────────────────────────────────┐
│  Query: "college of transistor inventor" │
└──────────────────────────────────────────┘
              │
              ↓
    ┌───────────────────────────────────────┐
    │ Cross-Encoder                         │
    │ Attention Pattern:                    │
    │                                       │
    │ "inventor"   ←→ "Shockley", "Bardeen" │
    │ "college"    ←→ "MIT", "University"   │
    │ "transistor" ←→ "Bell Labs context"   │
    └───────────────────────────────────────┘
              │
              ↓
    Document B & C ranked higher
    (contain answer, even if less query overlap)

Common Mistakes ⚠️

Mistake 1: Using Reranking as First-Stage Retrieval

The error:

# WRONG: Running cross-encoder on entire corpus
for doc in all_10_million_documents:
    score = reranker.predict(query, doc)  # TOO SLOW!

Why it fails: A cross-encoder takes ~50ms per document. For 10M documents, that's roughly 500,000 seconds, nearly six days of sequential computation, for a single query!

The fix: Always use fast retrieval first:

# RIGHT: Two-stage pipeline
candidates = vector_db.search(query, top_k=100)  # ~10ms
reranked = reranker.rerank(query, candidates)    # ~500ms

Mistake 2: Ignoring Document Length Bias

The problem: Cross-encoders often favor longer documents because they have more tokens to match the query, even if they're less relevant.

The fix: Normalize scores or train with length-balanced datasets:

from math import log

# Normalize by document length (a simple heuristic)
adjusted_score = raw_score / log(1 + doc_length)

Mistake 3: Not Using Hard Negatives in Training

Weak training data:

Query: "python programming tutorial"
Positive: "Learn Python Programming" βœ“
Negative: "How to bake a cake" βœ— (too easy)

Strong training data:

Query: "python programming tutorial"
Positive: "Learn Python Programming" βœ“
Negative: "Python snake species guide" βœ— (hard negative!)

Hard negatives (documents that match keywords but not intent) force the model to learn semantic understanding.
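
One common way to obtain hard negatives is to mine them from the bi-encoder itself: its top hits that are not labeled relevant are, by construction, close-but-wrong matches. A sketch, assuming a sentence-transformers retriever and a labeled set of relevant documents:

from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("all-MiniLM-L6-v2")

def mine_hard_negatives(query, corpus, relevant_docs, k=5):
    # Retrieve the closest documents, then keep the top ones that are NOT relevant
    corpus_emb = retriever.encode(corpus, convert_to_tensor=True)
    query_emb = retriever.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=50)[0]
    negatives = [corpus[h["corpus_id"]] for h in hits
                 if corpus[h["corpus_id"]] not in relevant_docs]
    return negatives[:k]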

Mistake 4: Forgetting to Truncate Long Documents

The issue: Rerankers have max input length (usually 512 tokens). Naively truncating can remove the relevant part:

# WRONG: Might cut off the answer
truncated = doc[:512]

# BETTER: Include beginning + end
truncated = doc[:256] + " ... " + doc[-256:]

# BEST: Use passage-level reranking
scores = [reranker.predict(query, passage)
          for passage in split_into_passages(doc)]
final_score = max(scores)  # or another aggregation (e.g. mean of the top 3)

Mistake 5: Not Caching Reranking Results

Waste: Running the same reranker on the same query repeatedly.

Solution: Cache results keyed on the query and candidate set (so a changed candidate list is rescored):

import hashlib

rerank_cache = {}

def rerank_with_cache(query, candidates):
    # Key on the query plus the candidate texts/IDs, not the query alone
    key_material = query + "||" + "||".join(sorted(candidates))
    cache_key = hashlib.md5(key_material.encode()).hexdigest()
    if cache_key in rerank_cache:
        return rerank_cache[cache_key]

    results = reranker.rerank(query, candidates)
    rerank_cache[cache_key] = results
    return results

Mistake 6: Using Inconsistent Score Ranges

The trap: Different rerankers output different score ranges:

  • Some output [0, 1]
  • Some output [-∞, +∞]
  • Some output logits

This matters when combining scores from multiple systems.

The fix: Normalize all scores:

# Min-max normalization
normalized = (score - min_score) / (max_score - min_score)

# Or softmax across candidates
from scipy.special import softmax
normalized_scores = softmax(raw_scores)

Key Takeaways 🎓

📋 Quick Reference Card: Reranking Essentials

| Topic | Summary |
|---|---|
| Core Concept | Two-stage retrieval: fast first pass, accurate reranking second pass |
| Key Architecture | Cross-encoders (joint query+doc encoding) for maximum accuracy |
| Typical Pipeline | Retrieve 100-1000 → Rerank → Return top 10 |
| Speed/Accuracy Tradeoff | Bi-encoders: fast but less accurate; cross-encoders: slow but highly accurate |
| Best Training Strategy | Pairwise learning with hard negatives |
| Popular Models | MS MARCO Cross-Encoder, MonoT5, BGE-reranker, ColBERT v2 |
| Optimization Tactics | Distillation, cascading, quantization, caching |
| Common Pitfall | Using the reranker for first-stage retrieval (too slow!) |

When to use reranking:

  • βœ… You need maximum relevance accuracy
  • βœ… You have a candidate set <1000 documents
  • βœ… You can tolerate 100-500ms additional latency
  • βœ… Initial retrieval isn't accurate enough

When to skip reranking:

  • ❌ Your initial retrieval is already highly accurate
  • ❌ You need sub-50ms response times
  • ❌ You're dealing with simple keyword matching tasks
  • ❌ Computational resources are severely limited

🧠 Memory device for the pipeline:

R.A.G. = Retrieve, Assess, Generate

  • Retrieve: Fast initial search (BM25, vectors)
  • Assess: Careful reranking (cross-encoder)
  • Generate: LLM produces answer

💡 Final Pro Tip: Start simple! Begin with a lightweight reranker like ms-marco-MiniLM-L-6-v2 (~23M parameters, very fast) before graduating to heavier models. You can often get 80% of the benefit at 20% of the cost.

📚 Further Study

  1. Sentence-Transformers Documentation on Cross-Encoders: https://www.sbert.net/examples/applications/cross-encoder/README.html - Practical implementation guide with code examples

  2. "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT": https://arxiv.org/abs/2004.12832 - Seminal paper on late interaction architectures that balance speed and accuracy

  3. Cohere Rerank API Guide: https://docs.cohere.com/docs/reranking - Production-ready reranking service with excellent documentation on best practices

Congratulations! You now understand how reranking models work, why they're critical for modern AI search, and how to implement them effectively. Practice building two-stage pipelines to see the dramatic improvement in search quality! 🚀