Part IV · Information Retrieval & RAG
Chapter 62 ~11 min read

Reranking with cross-encoders

"Bi-encoder for recall. Cross-encoder for precision. Use both."

In Chapter 9 we introduced the bi-encoder vs cross-encoder distinction. In this chapter we go deeper on the reranking step specifically — what cross-encoder rerankers are, when and why to use them, how to integrate them into a RAG pipeline, and what the empirical impact looks like.

By the end you’ll know how to design a reranking step that actually improves quality instead of just adding latency.

Outline:

  1. The retrieve-then-rerank pattern.
  2. Cross-encoder mechanics.
  3. The latency-quality trade.
  4. Available reranker models.
  5. Integration with the RAG pipeline.
  6. Reranker training data.
  7. Evaluation: does reranking actually help?
  8. Operational considerations.

62.1 The retrieve-then-rerank pattern

The pattern is now standard in production RAG:

  1. Retrieve the top-K candidates with a fast bi-encoder (or hybrid retriever). K is typically 50-200.
  2. Rerank those candidates with a slower but more accurate cross-encoder. The cross-encoder produces a fresh score for each candidate based on the (query, candidate) pair.
  3. Return the top-N reranked candidates (typically N is 5-20) to the LLM.

The reason for the two stages: bi-encoders are fast but imprecise; cross-encoders are slow but precise. Bi-encoders can search a million-document corpus in milliseconds because each document is encoded once and stored as a vector. Cross-encoders are 100× slower per pair because they re-encode the full (query, document) text on every comparison — but they’re 5-10× more accurate.

You can’t use cross-encoders as the primary retriever (too slow). You can’t rely solely on bi-encoders (not precise enough at the top of the ranking). The two-stage pattern uses each tool where it shines: bi-encoder filters the haystack, cross-encoder picks the best from the filtered set.

The math: a 1M-document corpus retrieved in 5 ms with a bi-encoder, then 100 candidates reranked in a single batched ~150 ms pass with a cross-encoder ≈ 155 ms total. Better quality than either alone, with manageable latency.
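The arithmetic is worth making explicit, since it also shows why batching (covered below) matters. A sketch using the illustrative figures from this chapter — these are example numbers, not benchmarks:

```python
# Illustrative latency arithmetic for the two-stage pattern.
# Figures are the example numbers used in this chapter, not measurements.

ann_search_ms = 5     # bi-encoder ANN search over the 1M-doc corpus
per_pair_ms = 15      # one cross-encoder forward pass per (query, doc) pair
candidates = 100      # top-K candidates handed to the reranker

# Sequential reranking: one forward pass per candidate.
sequential_ms = ann_search_ms + candidates * per_pair_ms   # 5 + 1500 = 1505 ms

# Batched reranking: all 100 candidates in one GPU batch (~150 ms in practice).
batched_rerank_ms = 150
batched_ms = ann_search_ms + batched_rerank_ms             # 5 + 150 = 155 ms

print(sequential_ms, batched_ms)
```

Sequential reranking would blow any interactive budget; batching is what makes the two-stage pattern viable.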

graph LR
  CORP["1M documents\n(corpus)"] --> BI["Bi-encoder\nANN index\n5 ms"]
  BI -->|top 100 candidates| CROSS["Cross-encoder\nreranker\n150 ms"]
  CROSS -->|top 10 reranked| LLM["LLM\nwith context"]
  style CROSS fill:var(--fig-accent-soft),stroke:var(--fig-accent)

Two-stage retrieve-then-rerank: the bi-encoder does the fast fan-out across the full corpus; the cross-encoder does the expensive precision work on a small candidate set.
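The control flow of the two stages can be sketched in a few lines. This is a toy: the scorers are stand-ins (a real pipeline uses an ANN index for stage 1 and a cross-encoder model for stage 2), and the token-overlap "cross-encoder" exists only to make the example runnable:

```python
# Toy sketch of retrieve-then-rerank. Both scorers are stand-ins for
# illustration: stage 1 would be an ANN index, stage 2 a real cross-encoder.

def bi_encoder_score(query_vec, doc_vec):
    # Stage 1: dot product of precomputed vectors (fast, imprecise).
    return sum(q * d for q, d in zip(query_vec, doc_vec))

def cross_encoder_score(query, doc):
    # Stage 2 stand-in: token overlap in place of a transformer forward pass.
    q_tokens, d_tokens = set(query.split()), set(doc.split())
    return len(q_tokens & d_tokens) / max(len(q_tokens), 1)

def retrieve_then_rerank(query, query_vec, corpus, k=100, n=10):
    # corpus: list of (doc_text, doc_vec) with vectors precomputed offline.
    candidates = sorted(
        corpus, key=lambda d: bi_encoder_score(query_vec, d[1]), reverse=True
    )[:k]
    reranked = sorted(
        candidates, key=lambda d: cross_encoder_score(query, d[0]), reverse=True
    )
    return [doc for doc, _vec in reranked[:n]]
```

Note the shape of the pattern: the expensive scorer only ever sees `k` documents, never the corpus.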

62.2 Cross-encoder mechanics

A cross-encoder takes a (query, document) pair as a single input and produces a single relevance score:

input = "[CLS] query text [SEP] document text [SEP]"
output = scalar score

The query and document are concatenated with a separator token. The whole concatenated input goes through the transformer’s attention layers. Every query token can attend to every document token and vice versa. The model learns to compare them at the token level.

The output is a single scalar produced by a small classification head on top of the final transformer layer. The score is learned during training to be high for relevant pairs and low for irrelevant pairs.
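The head itself is tiny — essentially one linear layer over the final-layer [CLS] representation. A minimal sketch with invented weights (in a real reranker they are learned during training):

```python
# Minimal sketch of the classification head: a linear map from the
# final-layer [CLS] hidden state to one scalar. All numbers are invented
# for illustration.

def relevance_score(cls_hidden, weights, bias):
    # score = w · h + b  — one number per (query, document) pair
    return sum(w * h for w, h in zip(weights, cls_hidden)) + bias

cls_hidden = [0.2, -0.5, 0.9]   # toy pooled [CLS] vector (dim = 3)
weights = [1.0, 0.5, 2.0]
bias = 0.1
score = relevance_score(cls_hidden, weights, bias)
```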

Compare to a bi-encoder: a bi-encoder encodes the query and document independently, producing two vectors that are compared by dot product. There’s no token-level interaction; each side is summarized into a single vector before any comparison happens. The bi-encoder is much faster (you can pre-compute the document vectors) but loses information.

The cross-encoder’s token-level attention is what makes it more accurate. The model can learn things like “query token A is similar to document token B, so this match is strong” — distinctions that get washed out in the bi-encoder’s pooled vectors.

[Figure: bi-encoder vs cross-encoder. Left: the bi-encoder encodes the query ("fix faucet") and document ("plumbing…") through separate encoders into vectors v_q and v_d, compared by dot product — no cross-attention. Right: the cross-encoder feeds "[CLS] fix faucet [SEP] plumbing repair guide [SEP]" through one transformer with full cross-attention, where query tokens attend to document tokens and vice versa, producing a scalar relevance score.]
The cross-encoder's joint encoding lets every query token attend to every document token — information that is irreversibly lost in the bi-encoder's independent pooling step.

The cost: every (query, document) pair requires a full transformer forward pass. There’s no caching; the document representation depends on the query. For 100 candidates, that’s 100 forward passes per query. Each is fast (5-50 ms) but they add up.

Batching helps: you can pass all 100 candidates as a batch and run them in one forward pass on the GPU. This is the standard production pattern for cross-encoder reranking.

62.3 The latency-quality trade

The numbers for typical cross-encoder reranking on a single GPU:

| Model | Per-pair latency | Quality (nDCG@10 lift over bi-encoder) |
| --- | --- | --- |
| bge-reranker-base (~278M) | ~5 ms | +5-8 points |
| bge-reranker-large (560M) | ~15 ms | +7-10 points |
| bge-reranker-v2-m3 (568M, multilingual) | ~15 ms | +6-9 points |
| mxbai-rerank-large-v1 (~430M) | ~12 ms | +6-9 points |
| Cohere Rerank v3 (API) | ~50 ms | +8-12 points (highest quality) |

Per-batch (100 candidates) latency on a single H100:

  • bge-reranker-base: ~50 ms
  • bge-reranker-large: ~150 ms
  • API-based: ~200 ms (network latency dominates)

For a typical RAG pipeline with 200 ms TTFT budget, bge-reranker-large at ~150 ms is the standard choice. It’s fast enough to fit in the budget and high-quality enough to make the difference noticeable.

For latency-critical applications (50 ms TTFT budget), use the smaller bge-reranker-base or skip reranking entirely. For quality-critical applications, use the API-based options or a larger model.

62.4 Available reranker models

The leading rerankers as of late 2025:

BGE Reranker family (BAAI):

  • bge-reranker-base (~278M, English-focused). Fast, decent quality.
  • bge-reranker-large (~560M, English-focused). The standard.
  • bge-reranker-v2-m3 (~568M, multilingual). The multilingual version.

Mixedbread (mxbai):

  • mxbai-rerank-base-v1 and mxbai-rerank-large-v1. Competitive with BGE.

Cohere Rerank (proprietary API):

  • v3 multilingual. Often the highest quality. API only.

Voyage Rerank (proprietary):

  • Voyage rerank 2. Strong, especially for code and technical content.

Jina Reranker (Jina AI):

  • Jina-reranker-v2. Open and multilingual.

Custom fine-tuned rerankers: for specific domains, fine-tuning a base reranker on domain data can give significant improvements.

For most production RAG, bge-reranker-large or bge-reranker-v2-m3 (for multilingual) is the right default. They’re open, fast, and high-quality.

62.5 Integration with the RAG pipeline

The standard production architecture:

[Query]
   |
   v
[Hybrid retriever (BM25 + dense)]  →  top 100 candidates
   |
   v
[Cross-encoder reranker]  →  top 10 reranked
   |
   v
[LLM with top 10 in context]

The reranker is its own service, typically running on TEI (Chapter 49):

docker run --gpus all -p 8080:80 \
  -e MODEL_ID=BAAI/bge-reranker-large \
  ghcr.io/huggingface/text-embeddings-inference:1.5

The application code calls the reranker’s /rerank endpoint with the query and the candidate texts:

import requests

candidates = retriever.search(query, k=100)
candidate_texts = [c.text for c in candidates]

response = requests.post(
    "http://reranker:80/rerank",
    json={"query": query, "texts": candidate_texts},
    timeout=1.0,
)
response.raise_for_status()  # surface reranker errors instead of parsing garbage
ranked = response.json()  # list of {index, score} sorted by score

top_10 = [candidates[r["index"]] for r in ranked[:10]]

The reranker returns a sorted list of indices and scores. The application picks the top 10 (or whatever N) and passes them to the LLM.

The latency budget split:

  • Retrieve: 5-20 ms.
  • Rerank: 100-200 ms.
  • LLM (TTFT): 200-500 ms.
  • LLM (decode): seconds.

Reranking adds 100-200 ms to TTFT. For most applications, this is acceptable; the quality improvement is worth it.

62.6 Reranker training data

Cross-encoders are trained on the same kind of data as bi-encoders (Chapter 58), but with a different objective:

  • Bi-encoder training: contrastive loss that pulls (query, positive) close and pushes (query, negatives) far in embedding space.
  • Cross-encoder training: pointwise or pairwise loss that learns to score (query, document) pairs accurately.

The pointwise version: train the model to predict a relevance score (e.g., 0 or 1, or 0 to 5) for each pair.

The pairwise version: train the model to score (query, positive) higher than (query, negative).

The pairwise version (using a margin-based loss) is more common because it’s directly aligned with the ranking objective.
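The pairwise objective is usually written as a margin ranking loss. A minimal sketch of the computation:

```python
# Pairwise margin ranking loss: the model is penalized whenever the
# positive document's score fails to beat the negative's by at least
# `margin`. loss = max(0, margin - (s_pos - s_neg))

def margin_ranking_loss(s_pos: float, s_neg: float, margin: float = 1.0) -> float:
    return max(0.0, margin - (s_pos - s_neg))

well_separated = margin_ranking_loss(3.0, 1.0)   # positive far ahead: zero loss
barely_ahead = margin_ranking_loss(1.2, 1.0)     # inside the margin: pushed apart
```

Gradient only flows for pairs still inside the margin, which is why hard negatives — pairs the model hasn’t yet separated — dominate the training signal.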

The training data sources:

  • MS MARCO (the canonical dataset).
  • BEIR (for diverse evaluation).
  • Synthetic data (the modern shift).
  • Mined hard negatives from production retrievers.

Modern strong rerankers (BGE-reranker-large) are trained on all of these, with iterative hard negative mining.

For most teams, you don’t train your own reranker. You use a pre-trained one. Custom training is only worth it for highly specialized domains.

62.7 Evaluation: does reranking actually help?

The honest question: does adding a reranker improve end-to-end RAG quality enough to justify the latency cost?

The empirical answer: yes, almost always.

The improvement on standard RAG benchmarks:

| Configuration | Answer accuracy |
| --- | --- |
| Bi-encoder only (top 5) | 65% |
| Bi-encoder + reranker (top 5 from top 100) | 75% |
| BM25 only (top 5) | 60% |
| BM25 + reranker (top 5 from top 100) | 73% |
| Hybrid + reranker (top 5 from top 100) | 78% |

Reranking adds 8-13 percentage points to answer accuracy. The improvement is consistent across benchmarks.

The reason: bi-encoders and BM25 both have noisy rankings at the top. The “10th most relevant” might not actually be in the true top-10. The reranker re-orders the top-K so the actually-most-relevant documents end up first. The LLM then sees better context.
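This recovery effect can be simulated. The toy below assumes a retriever that sees true relevance plus Gaussian noise and a reranker that (idealized) sees true relevance but only over the retriever’s top-K — reranking can never see documents the retriever missed, but it reliably fixes the ordering of the ones it found:

```python
import random

# Toy simulation: noisy first-stage ranking vs. an (assumed perfect)
# reranker restricted to the retriever's top-100 candidates.

random.seed(0)
true_rel = {f"doc{i}": random.random() for i in range(1000)}

def retriever_rank(noise=0.3):
    # The retriever scores each doc as true relevance plus noise.
    noisy = {d: r + random.gauss(0, noise) for d, r in true_rel.items()}
    return sorted(noisy, key=noisy.get, reverse=True)

def top_n_recall(ranking, n=10):
    # Fraction of the true top-n that appears in the ranking's top-n.
    truth = set(sorted(true_rel, key=true_rel.get, reverse=True)[:n])
    return len(set(ranking[:n]) & truth) / n

ranking = retriever_rank()
reranked = sorted(ranking[:100], key=true_rel.get, reverse=True)
print(top_n_recall(ranking), top_n_recall(reranked))
```

The reranked top-10 recall is never worse than the raw retriever’s, and is usually much better — exactly the "recover the true top-N from the top-K" argument above.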

There are edge cases where reranking doesn’t help:

  • The bi-encoder is already very good (rare).
  • The candidate set is too small (top 10 has nothing to re-order if it’s already correct).
  • The query is so simple that any reasonable retriever finds the right answer.
  • The reranker is wrong about the domain (using a general reranker on a specialized corpus).

For most production RAG, the cost-benefit favors reranking. Add it to your pipeline.

62.8 Operational considerations

The practical points:

(1) Run the reranker as its own service. TEI (Chapter 49) is the right runtime. Separate it from the main LLM and the embedder.

(2) Batch the candidates. Pass all top-K candidates as one batch to the reranker. The GPU handles them efficiently.

(3) Monitor reranker latency separately. It’s its own bottleneck.

(4) Tune the K value. How many candidates do you rerank? Common values: 50-200. More candidates = more reranking compute but more chance to recover the true top-N from the bi-encoder’s ranking.

(5) Tune the N value. How many results do you pass to the LLM after reranking? Common values: 5-20. More context = more LLM cost but more potentially-useful information.

(6) Cache reranker calls. If the same (query, document) pair is asked multiple times, cache the score. Useful for repeated queries.
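A hedged sketch of such a cache — `score_fn` stands in for the actual cross-encoder call, and keys hash the document text so repeated pairs skip the expensive forward pass:

```python
import hashlib

# Score cache for repeated (query, document) pairs. `score_fn` is a
# stand-in for the real cross-encoder call.

_score_cache: dict[tuple[str, str], float] = {}

def cached_score(query: str, doc_text: str, score_fn) -> float:
    key = (query, hashlib.sha256(doc_text.encode()).hexdigest())
    if key not in _score_cache:
        _score_cache[key] = score_fn(query, doc_text)  # runs once per pair
    return _score_cache[key]
```

In production you would put this in Redis or similar with a TTL rather than an in-process dict, but the keying idea is the same.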

(7) Fallback when the reranker is down. If the reranker service fails, fall back to the bi-encoder’s ranking. Don’t fail the request.
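A sketch of the fallback wrapper. Here `rerank_fn` is assumed to wrap the HTTP call to the reranker service and return the `[{index, score}, ...]` list sorted by score; on any failure, the retriever’s own ordering is served:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str

def rerank_with_fallback(query, candidates, rerank_fn, n=10):
    # rerank_fn wraps the reranker HTTP call; assumed to return
    # [{"index": i, "score": s}, ...] sorted by score descending.
    try:
        ranked = rerank_fn(query, [c.text for c in candidates])
        return [candidates[r["index"]] for r in ranked[:n]]
    except Exception:
        # Reranker down or timing out: serve the bi-encoder's order instead.
        return candidates[:n]
```

Pair this with a tight client-side timeout on `rerank_fn` so a slow reranker degrades to the fallback path rather than stalling the whole request.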

(8) Eval your reranker on your data. A reranker that’s great on MS MARCO might not be great on your domain. Run an internal eval before committing.

62.9 The mental model

Eight points to take into Chapter 63:

  1. Cross-encoders are 100× slower and 5-10× more accurate than bi-encoders. Use both.
  2. Retrieve-then-rerank is the standard production pattern.
  3. Cross-encoders compute every (query, document) pair fresh. No caching of document representations.
  4. bge-reranker-large is the standard open default.
  5. Add 100-200 ms to TTFT for reranking. Acceptable for most applications.
  6. Reranking adds 8-13 points of answer accuracy in typical RAG evaluations.
  7. Run the reranker as a separate TEI service. Batch the candidates.
  8. Tune K (number of candidates) and N (number returned) based on your latency/quality budget.

In Chapter 63 we look at query rewriting — improving the query before retrieval, to get better candidates in the first place.


Read it yourself

  • The BGE-reranker model cards on Hugging Face.
  • The Cohere Rerank documentation.
  • The TEI documentation on reranker models.
  • Nogueira & Cho, Passage Re-ranking with BERT (2019). The foundational paper.
  • The MS MARCO leaderboard.

Practice

  1. Why is a cross-encoder more accurate than a bi-encoder for ranking? Explain at the level of attention.
  2. Compute the latency for a 100-candidate reranking with bge-reranker-large at 15 ms per pair. Compare batched vs sequential.
  3. For a RAG pipeline with 200 ms TTFT budget, where can you fit reranking? Plan the latency budget.
  4. Why does reranking add ~10 percentage points to answer accuracy? Argue at the level of “what bi-encoders miss.”
  5. Set up bge-reranker-large with TEI and rerank a small candidate set. Compare to the bi-encoder ranking.
  6. Why is custom reranker training rarely worth it? When would it be?
  7. Stretch: Build a complete retrieve-then-rerank pipeline with hybrid retrieval and bge-reranker-large. Compare end-to-end answer quality vs no reranking on a small QA dataset.