Every token in an LLM's context window competes for attention — literally. The quality of what you put in that window determines the quality of what comes out. Rerankers are the tool that curates that context. This article explains how they work, why they matter, and how to implement them.
The problem: why embeddings aren't enough
Most RAG systems use embedding models (bi-encoders) to find relevant documents. You encode your query into a vector, encode your documents into vectors, and find the closest ones by cosine similarity.
The fundamental flaw: a bi-encoder must compress the entire semantic meaning of a document into a single vector — typically 768 or 1024 floating-point numbers.
Imagine summarizing a 500-page novel into a single sentence, then trying to answer specific questions using only that sentence.
That's what embeddings do. You get the gist, but you lose:
- Negation: "python not snake" ≈ "python snake" in embedding space
- Qualification: "companies that failed due to security" matches security success stories
- Temporal context: "before 2020" vs "after 2020" compress identically
- Subtle relationships: "X causes Y" vs "X prevents Y" are nearly identical vectors
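A minimal sketch with an off-the-shelf bi-encoder makes the first failure mode, negation, concrete (the checkpoint below is just an example; any sentence-transformers model behaves similarly):

```python
from sentence_transformers import SentenceTransformer, util

# Any off-the-shelf bi-encoder will do; this checkpoint is just an example.
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "python not snake"
docs = [
    "Python is a popular programming language.",
    "The python is a large non-venomous snake.",
]

# Every text is compressed into one fixed-size vector.
query_vec = model.encode(query)
doc_vecs = model.encode(docs)

# Cosine similarity over single vectors: "not" barely moves the query vector,
# so both documents typically score within a few points of each other.
print(util.cos_sim(query_vec, doc_vecs))
```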
What rerankers actually do
A reranker (cross-encoder) takes a completely different approach. Instead of encoding query and document separately, it processes them together as a single input:
[CLS] query tokens [SEP] document tokens [SEP] → a single relevance score (e.g., 0.94)

The critical difference: every query token can attend to every document token. When the model sees "python not snake" alongside a document about Python programming, the attention mechanism directly weighs the negation "not" against "snake" in the document. It understands the relationship, not just the topic.
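The same behavior is easy to try with an off-the-shelf cross-encoder (the checkpoint name below is one example of a small MS-MARCO-trained reranker):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "python not snake"
docs = [
    "Python is a popular programming language.",
    "The python is a large non-venomous snake.",
]

# Each (query, document) pair is scored in a single joint forward pass,
# so query tokens and document tokens attend to each other directly.
scores = reranker.predict([(query, doc) for doc in docs])
print(scores)  # one relevance score per document
```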
The architecture spectrum
| Approach | Speed | Precision | How it works |
|---|---|---|---|
| Bi-encoder | Very fast | Moderate | Query & doc encoded separately → cosine similarity |
| Cross-encoder | Slow | Very high | Query + doc fed together → full attention → score |
| Late interaction (ColBERT) | Fast | High | Token-level embeddings → MaxSim matching |
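The late-interaction row is easiest to understand as scoring code: keep one embedding per token, let each query token find its best-matching document token, and sum. A rough numpy sketch of MaxSim (the token embeddings are placeholders, not output from a real ColBERT checkpoint):

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style late interaction over L2-normalized token embeddings.

    query_tokens: (num_query_tokens, dim)
    doc_tokens:   (num_doc_tokens, dim)
    """
    # Cosine similarity between every query token and every document token.
    sim = query_tokens @ doc_tokens.T        # shape (q, d)
    # Each query token keeps only its best-matching document token...
    best_match = sim.max(axis=1)             # shape (q,)
    # ...and the document's score is the sum over query tokens.
    return float(best_match.sum())

# Document token embeddings can be precomputed offline, which is why this
# sits between bi-encoders and cross-encoders on the speed axis.
```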
The two-stage pipeline
If cross-encoders are so much better, why not use them for everything?
Latency. A bi-encoder's document vectors are computed once, offline; at query time you encode a single query and run an approximate nearest-neighbor lookup. A cross-encoder needs a full transformer forward pass for every candidate document at query time, which over a large corpus works out to roughly six orders of magnitude more compute per query. You can't cross-encode everything, so the standard answer is a two-stage pipeline that uses both: fast recall over the whole corpus, then precise reranking of the survivors, then generation.
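A back-of-envelope estimate makes the gap concrete; the per-document latencies below are assumptions for illustration, not benchmarks:

```python
# Rough, assumed numbers -- adjust to your own hardware and models.
corpus_size = 10_000_000       # documents in the index
cross_encode_ms = 5.0          # one (query, document) forward pass
ann_lookup_ms = 20.0           # query embedding + approximate NN search

brute_force_ms = corpus_size * cross_encode_ms        # cross-encode everything
two_stage_ms = ann_lookup_ms + 100 * cross_encode_ms  # rerank only 100 candidates

print(f"cross-encode everything: {brute_force_ms / 3.6e6:.1f} hours per query")
print(f"two-stage pipeline:      {two_stage_ms:.0f} ms per query")
```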
The model landscape (2025)
The reranker ecosystem has exploded. Here's what's actually worth using:
Cohere Rerank 3.5
Proprietary. Easiest to deploy: API-based, 100+ languages, handles JSON/tables/code. Best for getting started fast (a minimal API call is sketched after this list of models).
BGE Reranker (BAAI)
Open source. Layerwise variant lets you pick layer 8 (fast) to layer 40 (precise), a continuous speed/quality tradeoff without retraining.
Jina Reranker v3
Open source. Listwise: processes all candidates in one pass for cross-document comparison. 131K context window for long documents.
mxbai-rerank-v2
RL-trained. Trained with GRPO (reinforcement learning) plus contrastive and preference learning. Optimizes ranking quality directly.
ColBERT
Late interaction. Token-level embeddings with MaxSim matching. Document embeddings are precomputable. Best speed/quality tradeoff for production.
RankZephyr (7B)
LLM-based. An LLM fine-tuned for listwise reranking. Part of the RankLLM toolkit (SIGIR 2025). Best absolute quality for offline use.
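Getting started with the API route really is a few lines. A sketch with the Cohere Python SDK (the model id and response fields reflect the SDK at the time of writing; check the current docs before relying on them):

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # better: read the key from an environment variable

docs = [
    "Acme Corp shut down after a ransomware attack exposed customer data.",
    "Acme Corp celebrated ten years without a security incident.",
    "A guide to snake handling for beginners.",
]

response = co.rerank(
    model="rerank-v3.5",
    query="companies that failed due to security breaches",
    documents=docs,
    top_n=2,
)

# Each result carries the original index and a relevance score.
top_docs = [docs[r.index] for r in response.results]
```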
How rerankers connect to LLM "thinking"
Here's the core thesis: rerankers work because they think like LLMs think.
The "Lost in the Middle" paper (Liu et al., 2023) showed that LLMs perform significantly worse when relevant information is buried in the middle of the context window. Attention has positional biases — early and late positions get disproportionate weight.
Implementation playbook
```python
# Stage 1: Hybrid retrieval (cast a wide net)
vector_results = vector_db.search(query_embedding, top_k=50)
keyword_results = bm25_index.search(query, top_k=50)

# Reciprocal Rank Fusion
fused = rrf_merge(vector_results, keyword_results, k=60)
candidates = fused[:100]  # Top 100 candidates

# Stage 2: Cross-encoder reranking (pick the best)
scores = reranker.score(query, candidates)  # ~150ms, one score per candidate
ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
top_docs = [doc for doc, _ in ranked[:5]]

# Stage 3: Feed to LLM
context = format_context(top_docs)
response = llm.generate(system_prompt + context + user_query)
```
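The playbook calls an rrf_merge helper; the name is a placeholder, but Reciprocal Rank Fusion itself fits in a few lines: each document's fused score is the sum of 1/(k + rank) over every result list it appears in.

```python
from collections import defaultdict

def rrf_merge(*result_lists: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion over ranked lists of document ids.

    Every list contributes 1 / (k + rank) for each document it contains;
    k = 60 is the constant from the original RRF paper and rarely needs tuning.
    """
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```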
Key parameters to tune
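As a starting point, the knobs that matter most look roughly like this. The defaults are rules of thumb drawn from the numbers used elsewhere in this article, not universal constants; the score threshold in particular has to be calibrated per model and per domain.

```python
RERANK_CONFIG = {
    # How wide the first stage casts its net. Too narrow and the right
    # document never reaches the reranker (see the pitfalls below).
    "candidate_pool_size": 100,
    # RRF constant for fusing vector and keyword results.
    "rrf_k": 60,
    # How many reranked documents actually reach the LLM. Keep it tight
    # for factual queries; going wide recreates "Lost in the Middle".
    "final_top_k": 5,
    # Drop candidates the reranker itself scores as weak, even if they
    # made the top_k cut. Purely illustrative; calibrate on your data.
    "min_relevance_score": 0.3,
    # Most cross-encoders truncate beyond their context window, so long
    # documents usually need chunking before reranking.
    "max_doc_tokens": 512,
}
```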
When to use what
| Scenario | Recommended | Why |
|---|---|---|
| Quick start / prototyping | Cohere Rerank API | Zero infra, pay per query |
| Production, latency-critical | BGE-base or ColBERT | Self-hosted, sub-100ms |
| Maximum quality | Jina v3 or mxbai-v2 | Accept latency for precision |
| Batch / offline | RankZephyr | Best absolute quality |
Pitfalls in production
- A reranker can only reorder what you give it. If the right document isn't in your top-100 candidates, no reranking will surface it. Fix retrieval first (a diagnostic sketch follows this list).
- Retrieving top-10 then reranking defeats the purpose. You need to retrieve wide (50-100) to give the reranker options.
- A reranker trained on MS-MARCO (web search) may underperform on medical, legal, or code. Always evaluate on your domain.
- Reranking to top-20 and dumping all 20 into the LLM recreates the "Lost in the Middle" problem. Keep it tight: 3-5 for factual queries.
- Reranker metrics (NDCG, MRR) don't always correlate with final answer quality. Measure downstream task performance, not just retrieval metrics.
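Before blaming the reranker, check the first two pitfalls directly. A small diagnostic sketch (the evaluation-set format is an assumption; adapt it to however you store gold labels):

```python
def candidate_recall(eval_set, retrieve, pool_size=100):
    """Fraction of queries whose gold document appears in the candidate pool.

    eval_set: list of (query, gold_doc_id) pairs
    retrieve: function mapping a query to a ranked list of doc ids (first stage only)
    """
    hits = 0
    for query, gold_doc_id in eval_set:
        hits += gold_doc_id in retrieve(query)[:pool_size]
    return hits / len(eval_set)

# If this number is low, no reranker can save you: widen retrieval,
# fix chunking, or add keyword search before touching the reranker.
```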
Where this is heading
Listwise reranking
Models like Jina v3 process all candidates simultaneously, enabling cross-document reasoning: "A has the definition, B has the example — together they answer better."
RL-trained models
GRPO + contrastive learning produces better rerankers at smaller sizes. A distilled 440M model now outperforms a 3B supervised baseline.
Pairwise prompting
Asking "which is more relevant?" instead of "how relevant?" — matched GPT-4 quality with models 50× smaller (Flan-UL2).
Retrieval–generation convergence
Future models may merge retrieval, reranking, and generation into a single forward pass — retrieving from an index as part of generation itself.
The bottom line
If your RAG system retrieves documents and feeds them directly to an LLM, you're leaving significant quality on the table.
A reranking step — even a simple one — will improve output quality more than a better embedding model, more than a larger chunk size, and often more than a bigger LLM.
If you want to get into an LLM's mind, start by speaking its language: attention.