Part IV · Information Retrieval & RAG
Chapter 58 ~18 min read

Dense retrieval: embeddings, contrastive learning, MTEB

"BM25 finds documents that share words with your query. Dense retrieval finds documents that share meaning."

In Chapter 9 we covered embedding models — what they are, the bi-encoder vs cross-encoder split, the basic contrastive training objective. In Chapter 57 we covered the lexical baseline that dense retrieval competes with and complements. This chapter goes deeper on dense retrieval specifically: how the encoders are trained, what makes them good at retrieval, and how to read the MTEB benchmark intelligently.

By the end you’ll know:

  • How dense retrievers are trained from scratch (the modern recipe).
  • What in-batch negatives, hard negatives, and ColBERT-style late interaction are.
  • How to read MTEB and interpret retrieval scores.
  • When dense retrieval is the right tool and when it isn’t.
  • The practical considerations for production retrieval.

Outline:

  1. The dense retrieval problem.
  2. Bi-encoder training, in detail.
  3. In-batch negatives and the negative-mining problem.
  4. Hard negatives.
  5. Late interaction (ColBERT).
  6. The MTEB benchmark.
  7. Choosing an embedding model.
  8. The training-data story: from MS MARCO to E5.
  9. Multilingual and multimodal embedders.
  10. Production considerations.

58.1 The dense retrieval problem

The dense retrieval setup:

  1. Take every document in your corpus and compute its embedding (a fixed-dim vector) using an encoder.
  2. Index the embeddings in a vector database.
  3. At query time, compute the query’s embedding using the same encoder.
  4. Find the documents whose embeddings are nearest to the query embedding (by cosine similarity or dot product).
  5. Return the top-K.
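The five steps above can be sketched in a few lines of NumPy. This is a toy sketch: the `embed` function below is a stand-in for a real trained encoder (BGE or E5 in practice), and the "index" is a plain matrix rather than a vector database, so the rankings are not semantically meaningful.

```python
import zlib
import numpy as np

def embed(texts, dim=8):
    """Stand-in for a trained encoder: maps each text to a
    deterministic L2-normalized vector. A real system would call
    an actual embedding model here."""
    vecs = []
    for t in texts:
        rng = np.random.default_rng(zlib.crc32(t.encode()))
        v = rng.normal(size=dim)
        vecs.append(v / np.linalg.norm(v))
    return np.stack(vecs)

# Steps 1-2: embed the corpus once and "index" it (here: a plain matrix).
corpus = ["how to fix a leaky faucet", "baking sourdough bread",
          "replacing a tap washer", "training a puppy"]
index = embed(corpus)

# Step 3: embed the query with the same encoder.
q = embed(["fix dripping faucet"])[0]

# Steps 4-5: nearest neighbors by dot product (== cosine on unit
# vectors), return the top-K.
scores = index @ q
top_k = np.argsort(-scores)[:2]
print([corpus[i] for i in top_k])
```

The structure is the whole point: the corpus is embedded once offline, and only the query is encoded at query time.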

The retrieval is a nearest-neighbor search in the embedding space, not a string matching operation. The embeddings encode meaning — semantically similar texts have similar embeddings — so the retrieval finds documents that are about what the query is asking, even if they don’t share the query’s words.

The whole game is in the encoder. A good encoder produces embeddings where similar things are close and dissimilar things are far. A bad encoder produces a useless space. The trick is training the encoder to produce a good space.

[Figure: dense retrieval pipeline. The query ("fix faucet") goes through the encoder (a BGE/E5-style bi-encoder) into the embedding space; the top-K documents (d1, d2, d3) are returned by cosine similarity.]
Dense retrieval encodes both query and documents into the same vector space; retrieval is a nearest-neighbor lookup — no shared words required, only shared meaning.

58.2 Bi-encoder training, in detail

We covered the basic contrastive training objective in Chapter 9. Let me go deeper.

The setup:

  • A base model — typically a pretrained encoder-only transformer like BERT, or a decoder-only LLM used as an encoder (e5-mistral).
  • Pairs of (anchor, positive) — semantically related texts. The anchor is typically a query; the positive is a relevant document.
  • Negatives — texts that are not relevant to the anchor.

The training objective is contrastive: pull the anchor close to the positive, push it far from the negatives. The standard loss is InfoNCE (Chapter 4 and 9):

L = -log [ exp(sim(anchor, positive) / τ) / Σ_j exp(sim(anchor, candidate_j) / τ) ]

Where the candidates include the positive and a set of negatives. The model is being asked to assign the highest probability to the correct positive among the candidates.

The training loop:

  1. Sample a batch of (anchor, positive) pairs.
  2. For each anchor, the candidates are the positive plus all the other positives in the batch (in-batch negatives) plus optionally some hard negatives (mined externally).
  3. Compute the InfoNCE loss.
  4. Backpropagate and update the encoder.

The encoder is shared between anchor and positive — both go through the same model. The output is taken from a pooling layer (mean pooling or last-token pooling, see Chapter 9).
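The loss and training step above can be sketched in NumPy. A real trainer would compute this in PyTorch and backpropagate through the encoder; here the embeddings are synthetic stand-ins, so only the loss computation is shown.

```python
import numpy as np

def info_nce(anchor_emb, positive_emb, tau=0.05):
    """In-batch InfoNCE: row i of the similarity matrix scores
    anchor i against every positive in the batch; the diagonal
    entry is the true pair."""
    # Normalize so the dot product equals cosine similarity.
    a = anchor_emb / np.linalg.norm(anchor_emb, axis=1, keepdims=True)
    p = positive_emb / np.linalg.norm(positive_emb, axis=1, keepdims=True)
    sim = a @ p.T / tau  # (N, N) similarity matrix in one matmul
    # Cross-entropy with the diagonal as the target class.
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
N, d = 16, 32
anchors = rng.normal(size=(N, d))
# Synthetic "positives": noisy copies of their anchors.
positives = anchors + 0.1 * rng.normal(size=(N, d))
loss = info_nce(anchors, positives)
print(loss)
```

Note that each anchor gets N-1 negatives (the other rows' positives) without any extra forward passes: everything needed is already in the one (N, N) matmul.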

[Figure: bi-encoder contrastive training. The anchor ("fix leaky faucet"), its positive ("plumbing repair"), and the in-batch negatives ("baking recipes", ...) all pass through a shared encoder (BERT or an LLM); the batch similarity matrix (anchors a₁..a₃ by positives p₁..p₃) feeds InfoNCE, which maximizes the diagonal.]
Contrastive training uses the similarity matrix of a full batch — anchor × positives — to maximize the score of the correct pair (diagonal) over all in-batch negatives simultaneously.

After training, the encoder produces embeddings such that anchors and their positives are close in cosine space, and anchors and unrelated documents are far.

58.3 In-batch negatives and the negative-mining problem

The simplest source of negatives: other positives in the same batch.

If the batch has 1024 (anchor, positive) pairs, then for each anchor i, you use the other 1023 positives as negatives. This gives you 1023 negatives per anchor for free, with no extra forward passes — you compute all the embeddings once and then form the similarity matrix.

In-batch negatives are computationally efficient. The forward pass is O(2N) (the N anchors plus N positives); the loss computation is O(N²) for the similarity matrix, but it’s done in one matmul.

The problem: most of the in-batch negatives are very easy. They’re randomly sampled positives from unrelated queries. The model quickly learns to distinguish them and gets diminishing returns from training on them.

To make the model better, you need harder negatives — texts that are almost the right answer but not quite. These force the model to learn finer-grained distinctions.

58.4 Hard negatives

A hard negative is a candidate that the model would plausibly retrieve but is actually wrong. For a query “What is the capital of France?”, a hard negative might be:

  • “What is the capital of Germany?” (related question, wrong answer)
  • “France is a country in Europe.” (related but doesn’t answer)
  • “The capital of Italy is Rome.” (similar structure, different answer)

These are much harder to distinguish from the positive than random text. Training on them teaches the model to make fine distinctions.

How to find hard negatives? Mine them with a weaker model:

  1. Train a weak retriever on the dataset (using just in-batch negatives).
  2. For each anchor, use the weak retriever to find the top-N most similar texts that are not the actual positive.
  3. These are the hard negatives.
  4. Retrain a stronger retriever using both in-batch negatives and the mined hard negatives.

This is the iterative hard negative mining loop. It’s how all the modern strong retrievers (BGE, E5, GTE) are trained. Multiple rounds of mining and retraining produce progressively better retrievers.
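One round of that mining loop can be sketched as follows. The embeddings here are synthetic; in practice the "weak retriever" would be a model from the previous training round, and the nearest-neighbor search would run over a real index (e.g. FAISS).

```python
import numpy as np

def mine_hard_negatives(query_embs, doc_embs, positive_ids, k=3):
    """For each query, return the k documents the current (weak)
    retriever ranks highest that are NOT the labeled positive.
    All embeddings are assumed L2-normalized."""
    sims = query_embs @ doc_embs.T  # (num_queries, num_docs)
    hard = []
    for i, pos in enumerate(positive_ids):
        order = np.argsort(-sims[i])  # most similar first
        hard.append([int(j) for j in order if j != pos][:k])
    return hard

rng = np.random.default_rng(1)
docs = rng.normal(size=(100, 16))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
# Synthetic setup: query i is a noisy copy of its positive, doc i.
queries = docs[:5] + 0.05 * rng.normal(size=(5, 16))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

hard_negs = mine_hard_negatives(queries, docs, positive_ids=[0, 1, 2, 3, 4])
print(hard_negs[0])  # doc ids to use as hard negatives for query 0
```

The mined ids then go into the next training round's candidate set alongside the in-batch negatives.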

A subtlety: false negatives. Sometimes the “hard negative” you mine is actually a relevant document that just wasn’t labeled as positive in the training data. Training on false negatives hurts. The mitigations:

  • Use multiple positive labels per query (when available).
  • Score hard negatives with a teacher model and discard the top ones (which might be true positives).
  • Use a margin-based loss that’s robust to label noise.

The most sophisticated retrievers use teacher-student distillation (Chapter 18) for the negative mining: a strong teacher model scores candidates, and the student learns to match the teacher’s scores. This avoids the binary label noise.
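The distillation objective can be sketched as a KL divergence over each query's candidate list: the student bi-encoder learns the teacher's soft relevance distribution instead of binary labels. The scores below are made-up numbers for illustration.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def distill_kl(student_scores, teacher_scores, tau_s=1.0, tau_t=1.0):
    """KL(teacher || student) over each query's candidates: the
    student matches the teacher's graded relevance distribution,
    avoiding binary positive/negative label noise."""
    p_t = softmax(teacher_scores / tau_t)
    p_s = softmax(student_scores / tau_s)
    return float(np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)))

teacher = np.array([[9.0, 7.5, 2.0, 1.0]])  # e.g. cross-encoder scores
student = np.array([[0.9, 0.2, 0.1, 0.0]])  # e.g. bi-encoder cosines
print(distill_kl(student, teacher))
```

A candidate the teacher scores almost as highly as the positive contributes a soft target rather than a hard "negative" label, which is exactly what makes this robust to false negatives.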

58.5 Late interaction (ColBERT)

A different angle on dense retrieval: instead of compressing each document to a single vector, keep a vector per token and do the matching at the token level. This is late interaction, and the canonical example is ColBERT (Khattab & Zaharia, 2020).

The mechanism:

  1. Encode each document as a sequence of token embeddings (one vector per token, not pooled).
  2. Encode the query as a sequence of token embeddings.
  3. For each query token, find the most similar document token.
  4. The document’s score is the sum of these per-query-token max similarities:
score(query, doc) = Σ_{q in query} max_{d in doc} sim(q, d)

This is much richer than single-vector cosine because it captures token-level alignment: which query words match which document positions.
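The MaxSim formula above is a few lines of code. The token embeddings here are random stand-ins for encoder output; the only real logic is the max-then-sum.

```python
import numpy as np

rng = np.random.default_rng(2)

def unit_rows(n, d=16):
    """Random L2-normalized token embeddings (stand-ins for a
    real token-level encoder's output)."""
    v = rng.normal(size=(n, d))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def maxsim_score(query_toks, doc_toks):
    """ColBERT late interaction: each query token takes its max
    similarity over all document tokens; the document score is
    the sum over query tokens."""
    sims = query_toks @ doc_toks.T  # (q_len, d_len) token-level sims
    return float(sims.max(axis=1).sum())

query = unit_rows(3)                # e.g. "fix leaky faucet"
doc_off_topic = unit_rows(8)
doc_on_topic = np.vstack([query, unit_rows(5)])  # contains the query's tokens
print(maxsim_score(query, doc_on_topic), maxsim_score(query, doc_off_topic))
```

Because the document containing the query's tokens matches each query token exactly, its score hits the maximum of 1.0 per query token; a document can't compensate for a missing term by being vaguely on-topic elsewhere.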

[Figure: ColBERT late interaction. Each query token ("fix", "leaky", "faucet") independently finds its best-matching document token (fix→repair 0.72, leaky→valve 0.85, faucet→tap 0.91); the per-token MaxSim scores are summed into the document score (2.48).]
ColBERT computes a MaxSim score for each query token independently, then sums — this captures which specific words matched, making it harder to fool with topic-averaging.

The benefits:

  • Higher accuracy than single-vector retrievers, especially on hard queries.
  • More interpretable: you can see which tokens matched.
  • Less prone to “topic drift”: the model can’t pretend a document is relevant by averaging away contradicting parts.

The costs:

  • Storage: a document with 200 tokens has 200 vectors, which is 200× the storage of a single-vector index. Even with compression, this is significant.
  • Compute: the per-query-token max is more expensive than a single dot product.
  • Complexity: vector databases have to support multi-vector documents.

ColBERT-v2 added residual compression that shrinks each token vector to a few dozen bytes, bringing storage down to a few times the single-vector cost. This is the version most production teams would use.

For most production RAG, single-vector dense retrieval is the default, with ColBERT-style late interaction reserved for high-accuracy use cases where the storage cost is acceptable.
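The storage trade-off is easy to put numbers on. The figures below are illustrative assumptions (1M documents, 200 tokens per document, a 1024-dim single-vector embedder, 128-dim ColBERT token vectors, float32, and roughly 32 bytes per compressed token vector), not measurements of any specific system.

```python
docs = 1_000_000
single_dim, tok_dim, tokens = 1024, 128, 200
bytes_f32 = 4

single_vector = docs * single_dim * bytes_f32       # one vector per doc
multi_vector = docs * tokens * tok_dim * bytes_f32  # one vector per token
compressed = docs * tokens * 32                     # ~32 bytes/token (illustrative)

for name, b in [("single-vector", single_vector),
                ("multi-vector fp32", multi_vector),
                ("multi-vector compressed", compressed)]:
    print(f"{name}: {b / 1e9:.1f} GB")
```

Under these assumptions the uncompressed multi-vector index is 25x the single-vector index, and compression brings it back to a small multiple, which is exactly the "few times the single-vector cost" regime.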

58.6 The MTEB benchmark

We covered MTEB briefly in Chapter 9. To recap and add depth:

MTEB (Massive Text Embedding Benchmark) is the standard benchmark for embedding models. It evaluates embedders on dozens of tasks across eight categories: classification, clustering, pair classification, reranking, retrieval, STS (semantic textual similarity), summarization, and bitext mining.

For RAG, the retrieval category is the most relevant. It includes datasets like:

  • MS MARCO (Microsoft Machine Reading Comprehension) — the canonical retrieval dataset, query-document pairs from Bing search.
  • Natural Questions — Google search queries with Wikipedia answers.
  • HotpotQA — multi-hop questions.
  • FEVER — fact verification.
  • TREC-COVID — biomedical retrieval.
  • NFCorpus — biomedical retrieval.
  • DBPedia — entity retrieval.

The MTEB retrieval scores use nDCG@10 as the metric — normalized Discounted Cumulative Gain at 10 retrieved documents. It rewards both recall (was the relevant document in the top 10?) and ranking (was it near the top of the 10?).
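The nDCG@10 computation is worth seeing once. A minimal sketch, using the standard formulation (relevance gain discounted by log2 of rank, normalized by the ideal ranking's DCG):

```python
import numpy as np

def ndcg_at_k(ranked_rels, ideal_rels, k=10):
    """nDCG@k: DCG of the system's ranking divided by the DCG of
    the ideal ranking. rels are graded relevance labels (0 = not
    relevant), in the order the system returned the documents."""
    def dcg(rels):
        rels = np.asarray(rels[:k], dtype=float)
        discounts = 1.0 / np.log2(np.arange(2, len(rels) + 2))
        return float((rels * discounts).sum())
    ideal = dcg(sorted(ideal_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0

# One relevant doc (rel=1), retrieved at rank 3 of 10:
print(ndcg_at_k([0, 0, 1, 0, 0, 0, 0, 0, 0, 0], ideal_rels=[1]))  # → 0.5
# The same doc at rank 1:
print(ndcg_at_k([1, 0, 0, 0, 0, 0, 0, 0, 0, 0], ideal_rels=[1]))  # → 1.0
```

This shows why nDCG rewards ranking, not just recall: the relevant document is in the top 10 in both cases, but putting it at rank 3 instead of rank 1 halves the score.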

How to read the MTEB leaderboard:

  • Look at the retrieval column specifically, not the average. The categories test very different things.
  • Watch for contamination. Some models train on MTEB datasets directly. The MTEB team flags suspected cases.
  • Embedding dimension matters for cost. A 4096-dim embedder is 4× the storage of a 1024-dim embedder. For large corpora this is significant.
  • Model size matters for serving cost. A 7B-parameter embedder is much more expensive to serve than a 110M-parameter one. The quality difference may or may not justify the cost.

For most production RAG, the practical choice is:

  1. Pick a small (110-300M params), high-quality embedder (BGE-large, GTE-large, e5-large) as the default.
  2. Test on your own data. MTEB is a starting point, not a final answer.
  3. Only consider larger models (e5-mistral-7b, etc.) if your evaluation says you need them.

58.7 Choosing an embedding model

The decision tree for picking an embedder:

Q: What’s your domain?

  • General English text → BGE, GTE, E5, Nomic Embed.
  • Multilingual → BGE-M3, multilingual E5, Cohere multilingual.
  • Code → CodeBERT-based or specialized code embedders.
  • Biomedical → SapBERT, BioBERT, PubMedBERT.
  • Domain-specific scientific → SPECTER, SciBERT.

Q: What’s your latency budget?

  • < 5ms per query: small models (<300M params) on GPU.
  • < 50ms: any size up to ~7B.
  • > 50ms: any model.

Q: What’s your throughput requirement?

  • < 100 QPS: any model.
  • 100-1000 QPS: small models (~300M) on GPU, or medium models (~1B) on multiple GPUs.
  • > 1000 QPS: small models with horizontal scaling.

Q: What’s your storage budget?

  • Tight: smaller embedding dimension (768 or 1024). Use Matryoshka models that support truncation.
  • Generous: larger dimension (3072 or 4096) for slightly better quality.
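Matryoshka truncation is simple enough to show inline. A sketch under one big assumption: this only preserves quality for models trained with a Matryoshka (MRL) objective, which orders information so that prefixes of the vector are themselves usable embeddings.

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Matryoshka-style shrinking: keep the first `dim` coordinates
    and renormalize. Only valid for MRL-trained models; truncating
    an ordinary embedding this way destroys its geometry."""
    v = np.asarray(vec, dtype=float)[:dim]
    return v / np.linalg.norm(v)

full = np.random.default_rng(3).normal(size=1024)
small = truncate_embedding(full, 256)  # 4x storage savings
print(small.shape, float(np.linalg.norm(small)))
```

The appeal is operational: you can index at a small dimension now and keep the option of re-truncating from the full vectors later, without retraining anything.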

Q: How important is multilingual?

  • Critical: BGE-M3 or multilingual E5.
  • Nice to have: BGE or GTE (which have some multilingual capability but are English-first).
  • Not important: any English-focused model.

Q: Are you OK with API or self-hosted?

  • Self-hosted: BGE, GTE, E5, Nomic — all open.
  • API is fine: Cohere Embed v3, Voyage AI, OpenAI text-embedding-3.

For most general production RAG, BGE-large-en-v1.5 or its multilingual variant BGE-M3 is a strong default. They’re open, well-supported, and perform near the top of MTEB at modest cost.

58.8 The training-data story

A short tour of the training data behind modern embedders:

MS MARCO (Microsoft, 2018) — the foundation. ~1 million query-passage pairs from Bing search logs. Used to bootstrap many early dense retrievers.

Natural Questions (Google, 2019) — real user questions from Google search, with Wikipedia answers. Smaller but high quality.

BEIR (2021) — a benchmark of 18 retrieval datasets across diverse domains. Used both as a benchmark and as additional training data.

Synthetic data with LLMs (2023+) — modern retrievers use LLMs to generate query-passage pairs synthetically. The LLM is shown a passage and asked to generate plausible queries. This scales much better than collecting human-labeled data.

The current state-of-the-art training recipes (BGE, E5, etc.) use a mix of all of the above: human-labeled data from MS MARCO and Natural Questions, plus synthetic data, plus hard negatives mined iteratively. The synthetic data is the biggest single source by volume.

The E5 family (Microsoft) is an interesting case: the original E5 (2022) was trained on weakly supervised web-scale text pairs, while e5-mistral (2023) is trained primarily on synthetic data generated by GPT-4, so the retriever's training set is itself LLM output. This is the same synthetic-data shift we covered in Chapter 19, applied to retrieval.

For most production teams, you don’t train your own embedder. You pick an existing one. The training story matters because it tells you what kind of data the model was trained on (and therefore what kind of queries it’ll handle well).

58.9 Multilingual and multimodal embedders

Two important specializations:

Multilingual

Models trained to embed text in many languages such that embeddings are aligned across languages. A query in English can retrieve documents in Spanish, French, or Chinese.

The training: contrastive learning across (English text, translation in another language) pairs. The model learns that translations should have nearby embeddings.

Examples:

  • BGE-M3: multilingual, multi-functionality (dense + sparse + multi-vector in one model). Strong default for multilingual RAG.
  • Multilingual E5: a multilingual variant of E5.
  • LaBSE (Google): an older multilingual embedder.
  • Cohere Embed v3 multilingual.

For multilingual RAG, always use a multilingual embedder, even if your queries and documents are mostly in one language. The cost is small and the cross-lingual capability is occasionally useful.

Multimodal

Models that embed text and images (or other modalities) into a shared space. Used for multimodal RAG where you might retrieve images by text query or vice versa.

Examples:

  • CLIP and SigLIP: the foundational multimodal embedders.
  • BGE Visualized: a multimodal extension of BGE.
  • Cohere Embed v3 multimodal.
  • Qwen-VL embedder variants.

Multimodal embedders are an active research area. The quality is improving fast. For production multimodal RAG (search images by text, or text by image), they’re essential.

58.10 Production considerations

The practical points for running dense retrieval in production:

(1) Pre-compute document embeddings. Don’t embed at query time. Embed your corpus once, store the vectors in a database, only embed the query at query time.

(2) Re-embed when the model changes. Embeddings from one model are useless with another. If you upgrade the embedder, re-embed everything. Plan for this.

(3) Monitor embedding latency separately. The encoder is its own service (TEI, Chapter 49). Monitor its latency and throughput separately from the main LLM.

(4) Use the same encoder for indexing and querying. Always. Don’t index with one model and query with another.

(5) Know whether your model outputs normalized vectors. Some do, some don't. If yours does (or if you normalize at index time), store the normalized vectors and use dot product instead of cosine for slightly faster retrieval.
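The equivalence behind point (5) is a one-liner: on unit vectors, dot product and cosine similarity are identical, so you can skip the norm computations at query time.

```python
import numpy as np

rng = np.random.default_rng(4)
a, b = rng.normal(size=8), rng.normal(size=8)

# Normalize once at index time.
an, bn = a / np.linalg.norm(a), b / np.linalg.norm(b)

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dot_normalized = an @ bn  # no norms needed at query time
print(bool(np.isclose(cosine, dot_normalized)))  # → True
```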

(6) Handle long documents. Most embedders have a max input length (often 512 tokens, sometimes 8192). Documents longer than this need to be chunked (Chapter 61) before embedding.

(7) Cache query embeddings. Common queries get repeated. Cache the embeddings to avoid re-encoding.
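A query-embedding cache can be as simple as a memoized function. A sketch using the standard library's `lru_cache`; the body is a stand-in, since in production it would call your embedding service, and the cache key should include the pinned model version (point 8).

```python
from functools import lru_cache
import zlib
import numpy as np

@lru_cache(maxsize=10_000)
def embed_query(text: str):
    """Cached query embedding. Stand-in body; a real version would
    call the embedding service. Callers must not mutate the
    returned array, since it is shared by the cache."""
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    v = rng.normal(size=8)
    return v / np.linalg.norm(v)

embed_query("reset password")
embed_query("reset password")          # served from cache, no re-encoding
print(embed_query.cache_info().hits)   # → 1
```

For multi-process serving you'd swap the in-process cache for a shared one (e.g. Redis), but the shape of the idea is the same.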

(8) Pin model versions. Upgrading the embedder mid-flight breaks the index. Pin to a specific version and only upgrade with a planned re-indexing window.

These are mundane but important. Production retrieval systems live or die by their operational discipline.

58.11 The mental model

Eight points to take into Chapter 59:

  1. Dense retrieval finds documents by semantic similarity in an embedding space.
  2. Bi-encoders are trained with contrastive learning on (query, document) pairs.
  3. In-batch negatives are cheap; hard negatives are quality. Iterative mining produces strong retrievers.
  4. ColBERT-style late interaction trades storage for accuracy. Used for high-quality retrieval.
  5. MTEB is the standard benchmark. Look at the retrieval column, not the average.
  6. BGE and E5 are the modern open defaults. Multilingual variants for multi-language RAG.
  7. Synthetic training data is the modern data source. The teacher LLM generates query-document pairs.
  8. Production discipline: pre-compute, pin versions, monitor separately, never mix encoders.

In Chapter 59 we dive into the data structures that make dense retrieval fast at scale: vector index internals.


Read it yourself

  • Karpukhin et al., Dense Passage Retrieval for Open-Domain Question Answering (2020). The foundational dense retrieval paper.
  • Khattab & Zaharia, ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT (2020).
  • The BGE-M3 paper (2024).
  • The E5 paper (Wang et al., 2022) and its successors.
  • The MTEB paper (Muennighoff et al., 2022) and the leaderboard.
  • The MS MARCO dataset and the BEIR benchmark.

Practice

  1. Why do in-batch negatives give you 1023 negatives per anchor with no extra forward passes? Trace the matmul.
  2. Why are hard negatives important for retriever quality? Construct an example.
  3. What’s the storage cost of a ColBERT-style multi-vector index for 1M documents averaging 200 tokens, vs a single-vector index? Compute both.
  4. Why does MTEB’s average score not tell you how good a retriever is for your task? Argue.
  5. How would you choose between BGE-large-en, e5-large-v2, and gte-large-en for a customer support search system?
  6. Why must you re-embed your entire corpus when you upgrade the embedding model?
  7. Stretch: Train a tiny dense retriever with in-batch negatives on a subset of MS MARCO using sentence-transformers. Evaluate on the validation set.