Embeddings and rerankers: what they are, why they're separate, why they're cheap
"Embeddings turn search into geometry"
Most of this book is about generative LLMs. But the discriminative side of the transformer family — encoder-only models that produce dense vector representations — is just as important in production, and it sits inside a large share of modern AI systems. RAG pipelines (Chapter 65), semantic search, recommendation systems, deduplication, clustering, and classification all run on embedding models. And every modern retrieval pipeline ends with a reranker that re-scores the top candidates with a more expensive cross-encoder.
This chapter is about the encoder side. By the end you will be able to:
- Explain how embedding models are trained and why contrastive learning is the right framing.
- Explain the difference between a bi-encoder (embedding) and a cross-encoder (reranker), and why both exist.
- Read the MTEB leaderboard intelligently and pick a model for a real workload.
- Justify why text embeddings and rerankers are served by their own runtime (TEI) rather than vLLM.
Outline:
- Encoder-only architectures.
- The contrastive training objective.
- In-batch negatives, hard negatives, the triplet loss.
- Bi-encoder vs cross-encoder — the latency-quality tradeoff.
- Pooling strategies: CLS, mean, last-token.
- L2 normalization, cosine vs dot product.
- The MTEB benchmark and how to read it.
- Multi-vector retrieval — ColBERT and late interaction.
- Reranker design — why cross-encoders are 100× slower and 5× better.
- Why embedding and reranker serving is its own thing.
- Forward pointers to the RAG part of the book.
9.1 Encoder-only architectures
In Chapter 7 we built a decoder-only transformer with a causal mask. An encoder-only transformer is the same architecture without the causal mask. Every position attends to every other position bidirectionally. The model produces a contextualized representation per input position and stops there — there’s no autoregressive generation, no language model head, no decoding loop.
The original BERT (Devlin et al., 2018) was an encoder-only model. It was trained on masked language modeling (MLM): mask out a random 15% of input tokens and ask the model to predict them. The MLM objective forces the model to understand context bidirectionally because the masked token might be predicted from tokens on either side. The result is a model that produces high-quality contextual representations — and is useless for generation, because there’s no causal structure.
For embeddings, the architectural details look like this:
- Bidirectional self-attention. No causal mask. Every token sees every other.
- A [CLS] (or similar) token at the start. The contextualized representation of this token is sometimes used as the “summary” of the entire input.
- A pooling step at the end. The model produces representations for every position; the pooling step collapses them into a single vector that represents the entire input.
- A final linear projection (sometimes) to a target embedding dimension.
- Optional L2 normalization of the output, so all embeddings live on the unit sphere.
That’s the whole architecture. Modern embedding models (bge-m3, e5, gte, nomic-embed, and the models in the sentence-transformers family) are all encoder-only transformers with these elements, trained with contrastive losses (next section) on millions or billions of text pairs.
9.2 Contrastive training — the actual objective
Embedding models are trained with a contrastive objective. The setup:
- You have pairs (anchor, positive) of text that are semantically similar (e.g., a question and its answer, a paragraph and its summary, two translations of the same sentence, two sentences from the same document).
- You also have a set of negative examples — text that is semantically unrelated to the anchor.
- You train the model so that the embedding of the anchor is close to the embedding of the positive and far from the embeddings of the negatives.
“Close” and “far” are measured in some distance metric — usually cosine similarity or dot product on the unit sphere.
The most common loss function is InfoNCE, which we previewed in Chapter 4:
L = -log [ exp(sim(anchor, positive) / τ) / Σ_j exp(sim(anchor, candidate_j) / τ) ]
where the sum in the denominator is over the positive plus all the negatives. Read that formula carefully — it’s the same softmax-cross-entropy loss from Chapter 4, applied to “the candidates are the classes.” The model is being asked to assign the highest probability to the correct positive among a slate of candidates. The temperature τ controls how sharp the probability distribution is over candidates.
The metric sim(x, y) is usually x · y (dot product) or x · y / (||x|| ||y||) (cosine similarity, which is the same thing if you’ve L2-normalized).
This is the entire contrastive learning framework. Every modern embedding model is some flavor of “InfoNCE with a creative source of positives and negatives.”
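The “candidates are the classes” framing is easiest to see in code. A minimal NumPy rendering of the loss for one anchor (an illustrative sketch, not a training-ready implementation):

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.05):
    """InfoNCE for one anchor: cross-entropy with the positive as class 0.

    anchor: (d,), positive: (d,), negatives: (k, d).
    All vectors are assumed L2-normalized, so dot product == cosine similarity.
    """
    candidates = np.vstack([positive[None, :], negatives])  # class 0 is the positive
    logits = candidates @ anchor / tau                      # similarity logits
    m = logits.max()
    log_probs = logits - m - np.log(np.exp(logits - m).sum())  # stable log-softmax
    return -log_probs[0]                                    # -log p(positive)
```

When the anchor and positive coincide and the negatives are orthogonal, the loss is near zero; swap the positive for an unrelated text and it grows, exactly as the softmax-cross-entropy reading predicts.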
9.3 In-batch negatives, hard negatives, and how the magic works
Where do the negatives come from? Three answers, in order of sophistication.
In-batch negatives
The simplest and cheapest. You have a batch of (anchor_i, positive_i) pairs. For each anchor i, you use all the other positives in the batch as negatives:
L_i = -log [ exp(sim(anchor_i, positive_i) / τ) / Σ_j exp(sim(anchor_i, positive_j) / τ) ]
So for a batch of 1024 pairs, every anchor has 1 positive and 1023 negatives — for free, with no extra forward passes. This is incredibly efficient because you compute one forward pass per text and get O(N²) similarity scores from one matmul.
The catch: the negatives are random. Most of them are very easy (totally unrelated text), so the model learns to distinguish “obviously different” pairs without learning to make fine distinctions. To get a better model you need harder negatives.
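The one-matmul trick is worth seeing in code. A NumPy sketch of the batched in-batch-negatives loss (illustrative; assumes all embeddings are already L2-normalized):

```python
import numpy as np

def in_batch_info_nce(anchors, positives, tau=0.05):
    """Mean InfoNCE over a batch where every other positive is a negative.

    anchors, positives: (N, d) L2-normalized embeddings. A single matmul
    produces all N*N similarities; row i's correct class is column i.
    """
    logits = anchors @ positives.T / tau       # (N, N) scores from ONE matmul
    m = logits.max(axis=1, keepdims=True)      # stable row-wise log-softmax
    log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))        # diagonal holds the true pairs
```

With matching anchor/positive rows the loss is near zero; with mismatched rows it blows up, because each anchor then assigns high probability to the wrong column.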
Hard negatives
A hard negative is a piece of text that is almost the right answer but not quite. For a question about “best pizza in Chicago,” a hard negative might be “best pizza in New York” — semantically very close but factually wrong. Hard negatives force the model to learn fine-grained semantic distinctions.
Hard negatives are typically mined ahead of time:
- Train a weak embedding model on in-batch negatives.
- For each anchor in your training set, use the weak model to find the top-N most-similar texts that are not the actual positive.
- Use those as hard negatives in the next round of training.
This is the standard “iterative hard negative mining” loop. Modern embedding models do several rounds of it, sometimes with multiple teacher models contributing different hard negatives.
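One mining round can be sketched in a few lines, assuming you have already embedded the anchors and the candidate corpus with the weak model (a hypothetical setup, not a specific library's API):

```python
import numpy as np

def mine_hard_negatives(anchor_embs, corpus_embs, positive_ids, top_n=5):
    """For each anchor, return the top_n most-similar corpus items that are
    NOT its labeled positive — candidates for the next training round.

    anchor_embs: (A, d), corpus_embs: (C, d), both L2-normalized.
    positive_ids: (A,) index of each anchor's true positive in the corpus.
    """
    sims = anchor_embs @ corpus_embs.T                         # (A, C) similarities
    sims[np.arange(len(anchor_embs)), positive_ids] = -np.inf  # mask true positives
    return np.argsort(-sims, axis=1)[:, :top_n]                # hardest first
```

The near-duplicate of the positive, not the random text, is what comes back first — which is precisely what makes the next training round harder.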
Multiple-positive contrastive
Some training datasets have multiple positives per anchor (e.g., a question with multiple correct answers). The loss generalizes: average the InfoNCE over the multiple positives, or use a multi-positive variant directly.
Curriculum and data composition
The largest embedding models (e5-mistral-7b, bge-m3) are trained on dozens of datasets covering every flavor of (anchor, positive) pair imaginable: question-answer, query-document, paraphrase, translation, code-comment, code-code, summary-document, Wikipedia title-paragraph. The art of training a strong embedding model is in the data mixture, not in the architecture.
9.4 Bi-encoder vs cross-encoder — the latency-quality tradeoff
This is the most important architectural distinction in retrieval, and the one that justifies why embeddings and rerankers are separate things.
Bi-encoder (embedding model)
In a bi-encoder, the anchor and the candidate are encoded independently. You compute embed(anchor) and embed(candidate) with two separate forward passes (or one batched pass), and then you compare them with a cheap similarity function (dot product or cosine).
```python
def bi_encoder_score(anchor, candidate, model):
    e_anchor = model.embed(anchor)           # forward pass 1
    e_candidate = model.embed(candidate)     # forward pass 2
    return (e_anchor * e_candidate).sum()    # dot product
```
The killer property of bi-encoders is that the candidate embeddings can be precomputed and indexed.
The cost is that the anchor and candidate never interact during the encoding. Each is encoded in isolation. The model has to compress all the relevant information about a piece of text into a single vector, with no knowledge of what query it might be matched against.
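In code, the precompute-and-index property looks like this (a minimal in-memory sketch; a real system would use an ANN index instead of a brute-force scan):

```python
import numpy as np

def top_k(query_emb, doc_embs, k=5):
    """Score one query against a PRECOMPUTED document matrix.

    doc_embs: (num_docs, d) L2-normalized embeddings, built once at index time.
    At query time only the query needs a model forward pass; scoring the
    whole corpus is a single matrix-vector product.
    """
    scores = doc_embs @ query_emb              # (num_docs,) dot products
    idx = np.argpartition(-scores, k)[:k]      # unordered top-k in O(num_docs)
    return idx[np.argsort(-scores[idx])]       # order just the k winners
```

The expensive part (embedding every document) happens once, offline; the per-query cost is one embedding call plus one matmul.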
Cross-encoder (reranker)
In a cross-encoder, the anchor and the candidate are concatenated into one input and encoded together:
```python
def cross_encoder_score(anchor, candidate, model):
    inputs = f"{anchor} [SEP] {candidate}"
    return model(inputs).logits[0]  # one forward pass produces one score
```
The whole input passes through bidirectional self-attention, so every token in the candidate gets to attend to every token in the anchor. The model can learn very fine-grained interactions: “the second word of the query matches a phrase in the third sentence of the candidate.” This produces much better quality scores than bi-encoders.
The cost is that you cannot precompute anything. Each (query, candidate) pair requires its own forward pass through the entire transformer. For a query against a million documents, that’s a million forward passes — completely impossible at retrieval scale.
The hybrid pattern: retrieve-then-rerank
The production pattern that has emerged from this tension is two-stage retrieval:
- Retrieve the top 100 (or 200, or 1000) candidates from the corpus using a fast bi-encoder. This is O(corpus_size) work but extremely cheap per item.
- Rerank those 100 candidates using a cross-encoder. This is O(100) cross-encoder forward passes — still expensive, but bounded.
- Return the top 5 (or 10, or whatever the application needs) from the reranked list.
The bi-encoder filters the haystack down to a manageable shortlist; the cross-encoder picks the best from the shortlist. Bi-encoder gives you recall over the corpus; cross-encoder gives you precision over the candidates. Both are necessary.
Cross-encoders are typically 100× slower per pair than bi-encoders but 5×–10× more accurate at the top of the ranking. The factor-of-100 latency means you can’t use them as the primary retriever; the factor-of-5 precision means you can’t skip them when quality matters.
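The whole two-stage pattern fits in a dozen lines. A sketch where embed_fn (a bi-encoder) and rerank_fn (a cross-encoder) are hypothetical callables standing in for real models:

```python
import numpy as np

def retrieve_then_rerank(query, docs, doc_embs, embed_fn, rerank_fn,
                         retrieve_k=100, final_k=5):
    """Two-stage retrieval sketch. embed_fn and rerank_fn are placeholders
    for a bi-encoder and a cross-encoder, not a specific library's API.

    Stage 1: cheap dot-product scan over the whole corpus (recall).
    Stage 2: expensive per-pair cross-encoder scoring of the shortlist (precision).
    """
    scores = doc_embs @ embed_fn(query)            # score entire corpus at once
    shortlist = np.argsort(-scores)[:retrieve_k]   # bounded candidate set
    rerank_scores = np.array([rerank_fn(query, docs[i]) for i in shortlist])
    winners = shortlist[np.argsort(-rerank_scores)[:final_k]]
    return [docs[i] for i in winners]
```

Note the asymmetry: the bi-encoder touches every document but only via precomputed vectors, while the cross-encoder runs a full forward pass but only retrieve_k times.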
9.5 Pooling strategies
After the encoder runs, you have a tensor of shape (N, S, d_model) — one contextualized representation per token. To turn this into a single embedding, you need a pooling strategy. The three you’ll see:
- [CLS] pooling. Take the representation of the first token (which is a special [CLS] token added by the tokenizer). The model is trained so that the [CLS] representation acts as a summary. This is what BERT does. It works, but it’s brittle — the model has to learn to use the [CLS] slot as a sink for global information.
- Mean pooling. Take the average of all token representations (with the padding mask applied so padding doesn’t contribute). Robust, simple, and the default in sentence-transformers. Works well in almost all cases.
- Last-token pooling. Take the representation of the last non-padding token. This is the natural choice for decoder-only embedding models (yes, those exist — e5-mistral-7b is one) because the causal mask means the last token has seen the full input. Awkward for encoder-only models because there’s nothing special about the last position.
The choice of pooling matters less than the contrastive training. Most embedding models converge to similar quality regardless of pooling, as long as the training is done properly. Mean pooling is a safe default for encoder models; last-token is the standard for decoder-only embedders.
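Masked mean pooling is simple enough to show in full. A NumPy sketch of the idea (frameworks like sentence-transformers do the equivalent with torch tensors):

```python
import numpy as np

def mean_pool(token_reps, attention_mask):
    """Average token representations, excluding padding positions.

    token_reps: (S, d) per-token encoder outputs.
    attention_mask: (S,) with 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[:, None].astype(token_reps.dtype)  # (S, 1) broadcastable
    return (token_reps * mask).sum(axis=0) / mask.sum()      # masked average
```

The mask matters: without it, long padding runs would drag every embedding toward the padding token's representation.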
9.6 L2 normalization, cosine vs dot product
After pooling, embedding models typically apply L2 normalization: divide each embedding by its L2 norm so that every embedding has unit length and lives on the unit sphere of the embedding space.
e_normalized = e / ||e||_2
After normalization, cosine similarity equals dot product:
cos(a, b) = a · b / (||a|| ||b||) = a · b (if both are unit length)
This is convenient because dot product is much faster to compute than cosine — no division, no square root, just one fused multiply-add per dimension. Vector databases all support dot-product search, which becomes equivalent to cosine search if you’ve normalized your vectors.
A few embedding models (some OpenAI models historically, some others) emit unnormalized vectors. In that case you have a choice: normalize at index time (and use dot product) or compute true cosine at query time (more flexible but slower). Most modern open embedding models normalize internally; you can usually treat the output as unit-length.
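A two-line numeric check makes the equivalence concrete:

```python
import numpy as np

def normalize(v):
    """L2-normalize: project a vector onto the unit sphere."""
    return v / np.linalg.norm(v)

a, b = np.array([3.0, 4.0]), np.array([1.0, 2.0])
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
# After normalizing both sides, the cheap dot product IS the cosine.
assert np.isclose(normalize(a) @ normalize(b), cosine)
```

This is exactly the "normalize at index time, then use dot product" option: pay the division and square root once per vector, never per comparison.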
The embedding dimension also matters. Common dimensions:
| Model | Dimension |
|---|---|
| text-embedding-ada-002 | 1536 |
| text-embedding-3-small | 1536 (truncatable to 256, 512, 1024) |
| text-embedding-3-large | 3072 (truncatable) |
| bge-large-en-v1.5 | 1024 |
| bge-m3 | 1024 |
| e5-mistral-7b-instruct | 4096 |
| nomic-embed-text-v1 | 768 |
Higher dimension = more storage cost in the vector index, more compute per similarity comparison, but slightly better quality. Matryoshka representation learning (Kusupati et al., 2022) trains a model so that you can truncate the embedding to a smaller prefix and still get reasonable quality — the OpenAI v3 models are trained this way, which is why you can ask for a 256-dim embedding from a 3072-dim model.
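Truncation itself is trivial; what Matryoshka training buys you is that the truncated prefix is still a useful embedding. A sketch (only gives sensible similarities for models trained with Matryoshka losses):

```python
import numpy as np

def truncate_embedding(e, dim):
    """Matryoshka-style truncation: keep the first `dim` coordinates and
    re-normalize so the shorter vector is unit-length again.

    For a model NOT trained with Matryoshka losses, the prefix carries no
    special meaning and this would just throw information away.
    """
    prefix = e[:dim]
    return prefix / np.linalg.norm(prefix)
```

The re-normalization step is easy to forget: without it, truncated vectors have norms below 1 and dot-product scores are no longer comparable across documents.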
9.7 The MTEB benchmark, and how to read it
The Massive Text Embedding Benchmark (MTEB) is the de facto leaderboard for embedding models. It evaluates embedding models on dozens of tasks across eight categories: classification, clustering, pair classification, reranking, retrieval, STS (semantic textual similarity), summarization, and bitext mining.
The MTEB leaderboard at huggingface.co/spaces/mteb/leaderboard shows scores for hundreds of models. As of late 2025, the top scores are around 70 average across all tasks, with the strongest open models being various flavors of e5-mistral, bge-m3, gte, nomic-embed, and proprietary models like Cohere Embed v3 and OpenAI text-embedding-3-large.
How to read it:
- Average score is misleading. The categories test very different things. A model that’s great at retrieval but mediocre at clustering might be perfect for your RAG use case but bad for your topic-modeling pipeline.
- Look at the “Retrieval” column if you’re building RAG. This is the task that matters for almost every production embedding use case.
- Watch for benchmark contamination. Some models are trained on the MTEB datasets directly, which inflates their scores. The MTEB team flags suspected contamination, but the signal is noisy.
- Embedding dimension matters for cost. A 4096-dim embedding is 4× the storage of a 1024-dim embedding. For a corpus of 100M documents, that’s the difference between a 200 GB index and an 800 GB index.
- Latency and model size matter for serving. A 7B parameter embedding model gives the best quality but is much slower per query than a 100M-parameter model. For a search system serving thousands of queries per second, the smaller model is often the right choice even if the bigger one wins on the leaderboard.
The practical workflow: pick the smallest model that gets within 5% of the top retrieval score on benchmarks similar to your use case, deploy it, measure end-to-end retrieval quality on your own data, and only consider upgrading if the evaluation says you need to.
9.8 Multi-vector retrieval — ColBERT and late interaction
A halfway point between bi-encoders and cross-encoders: multi-vector or late-interaction retrieval. Instead of pooling to a single vector, you keep the full sequence of token-level representations, and at query time you compute a richer similarity score that uses all of them.
The canonical example is ColBERT (Khattab & Zaharia, 2020). For each query token, you find the most similar candidate token, and the document score is the sum of these per-token max similarities:
score(query, doc) = Σ_{q in query} max_{d in doc} sim(q, d)
This is much richer than a single dot product because it captures token-level alignment, not just sentence-level similarity. It’s also much closer to bi-encoder cost than cross-encoder cost: you can precompute the document token vectors and index them, and query time is just a max-and-sum operation.
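The MaxSim formula above is a one-matmul-plus-reduction operation. A NumPy sketch:

```python
import numpy as np

def maxsim_score(query_toks, doc_toks):
    """ColBERT-style late interaction: for each query token take the max
    similarity over all document tokens, then sum over query tokens.

    query_toks: (Q, d), doc_toks: (D, d), both L2-normalized per token.
    """
    sims = query_toks @ doc_toks.T       # (Q, D) token-level similarities
    return sims.max(axis=1).sum()        # MaxSim per query token, then sum
```

A document that contains a good match for every query token scores high; a document that matches none of them scores near zero, even if its pooled single vector would have landed somewhere in between.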
The cost is storage: instead of one vector per document, you have one vector per token, which can be 100×–500× more storage. Vector databases that support multi-vector retrieval (Vespa, Qdrant with ColBERT-style indexing) handle this. Single-vector ANN indices don’t.
ColBERT v2 added a quantization layer that compresses each token vector to a few dozen bytes, bringing storage down to a few times the single-vector cost while keeping most of the quality. There’s been a research push around late-interaction models in the last two years; expect more.
For your interview prep: know that ColBERT exists, know it’s “late interaction,” know it’s better than bi-encoders and cheaper than cross-encoders, and know the storage cost is the catch. We will not go deeper than this in the book.
9.9 Reranker design
Rerankers are cross-encoders trained specifically for the second stage of retrieval. The standard setup:
- Input: the query and a candidate passage, concatenated with a separator.
- Output: a single scalar score (a logit) indicating how relevant the passage is to the query.
- Training: trained with triplet losses or pairwise losses on (query, positive_doc, negative_doc) triples mined from existing retrieval datasets.
The dominant rerankers are:
- bge-reranker-large, bge-reranker-v2-m3 — the BGE family. Strong open baselines.
- mxbai-rerank — Mixedbread’s open rerankers.
- Cohere Rerank (proprietary, API-only).
- Voyage Rerank (proprietary).
Production reranker latency is 5–50ms per (query, passage) pair on a single GPU, depending on the model size and the passage length. For a top-100 reranking step that’s 0.5–5s per query if done sequentially, or much less if batched. Batching the cross-encoder forward pass over all 100 candidates at once is the key optimization — you do one batched matmul instead of 100 separate ones, and the GPU utilization is much higher.
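The shape of the batching optimization, with score_batch standing in for a cross-encoder that scores a whole list of (query, passage) pairs in one forward pass (a hypothetical callable, not a specific library's API):

```python
import numpy as np

def rerank(query, passages, score_batch, top_k=5):
    """Rerank a shortlist with ONE batched cross-encoder call.

    score_batch: hypothetical callable taking a list of (query, passage)
    pairs and returning one relevance score per pair. The point is the
    single call — one batched forward pass, not len(passages) separate ones.
    """
    scores = np.asarray(score_batch([(query, p) for p in passages]))
    order = np.argsort(-scores)[:top_k]
    return [(passages[i], float(scores[i])) for i in order]
```

The sort is negligible; all the cost sits inside score_batch, which is why pushing the whole candidate list through one padded batch dominates any scheduling cleverness around it.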
9.10 Why embedding and reranker serving is its own thing
If you’ve followed Parts II and III of the book, you know that production LLM serving is built around vLLM and similar runtimes optimized for autoregressive generation. Embedding models and rerankers are not generative, and using vLLM for them is the wrong tool for the job. They have their own runtime: TEI (Text Embeddings Inference).
Why TEI exists as a separate runtime, and not as a vLLM workload type:
- No KV cache. Encoder-only models do one forward pass per input and produce a single output. There’s no KV cache to manage, no prefix sharing, no continuous batching of decode steps.
- Short, fixed-length inputs. Embeddings are usually generated for inputs in the 100–8000 token range, and the input length is known up front (no autoregressive growth). The batching strategy is “wait for a few short inputs, batch them, run one forward pass, return.”
- Different scheduling metric. vLLM scales on vllm:num_requests_running (a measure of in-flight generation work). TEI scales on tei_queue_size (a measure of pending input requests). The two metrics measure completely different things and require completely different autoscaler tuning.
- Model size and quantity. A production retrieval system might run one embedder and one reranker, both small (under 1 GB), serving thousands of QPS. A production chat system runs a few large models (>100 GB), serving tens of QPS. The two profiles call for different runtime defaults.
- Output shape. vLLM streams tokens out as they’re generated. TEI returns a fixed-shape vector (or a single scalar for rerankers) once the forward pass completes. The transport layer is simpler and the latency is single-shot.
In production, embedders and rerankers are typically deployed as their own service, behind their own AI gateway route, with their own autoscaling configuration, and managed by ops as separate workloads from the chat models. We’ll come back to this in Chapter 49 when we cover production TEI deployment.
9.11 Forward pointers to the RAG part of the book
This chapter is the foundation for Part IV — Information Retrieval & RAG (Chapters 57–65). Specifically:
- Chapter 57 covers BM25 and classical IR — the lexical baseline that hybrid search combines with embeddings.
- Chapter 58 is the deeper version of “dense retrieval and contrastive training,” extending §9.2 of this chapter.
- Chapter 59 covers vector index internals (HNSW, IVF, FAISS, ScaNN). This is where the “embedding → searchable index” jump happens in detail.
- Chapter 60 covers hybrid search and fusion (RRF, weighted score fusion).
- Chapter 61 covers chunking strategies — the most underrated lever in RAG quality.
- Chapter 62 is the deeper version of “reranking with cross-encoders.”
- Chapter 63 covers query rewriting, HyDE, and multi-query.
- Chapter 64 covers RAG evaluation — Ragas, LLM-as-judge, golden sets.
- Chapter 65 puts the whole pipeline together end to end.
By the time you reach Chapter 65, the question “what’s a sensible RAG architecture for 10TB of documents?” should have a confident answer.
9.12 The mental model
Eight points to take into Chapter 10:
- Embedding models are encoder-only transformers trained on contrastive losses with InfoNCE.
- In-batch negatives are the cheap workhorse; hard negatives are the quality lever.
- Bi-encoder vs cross-encoder is the most important architectural distinction in retrieval.
- Bi-encoders enable indexed retrieval because they encode each side independently and let you precompute the document side.
- Cross-encoders are 100× slower and 5× better. Use them as the second stage in retrieve-then-rerank.
- Mean pooling and L2 normalization are the safe defaults. Read the model card for the actual choice.
- MTEB is the benchmark. Look at the retrieval column, not the average. Watch for contamination.
- TEI is the production runtime for embedders and rerankers, not vLLM. They have completely different operational profiles.
In Chapter 10 we close out Part I with a different kind of practical skill: how to read a model card and a model lineage chart adversarially.
Read it yourself
- The original BERT paper: Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018) — the encoder-only architecture and MLM training.
- Karpukhin et al., Dense Passage Retrieval for Open-Domain Question Answering (2020) — the canonical paper on bi-encoder retrieval.
- Khattab & Zaharia, ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT (2020) — the late-interaction paper.
- Reimers & Gurevych, Sentence-BERT (2019) — the paper that made bi-encoder embeddings practical and launched the sentence-transformers library.
- The MTEB paper: Muennighoff et al., MTEB: Massive Text Embedding Benchmark (2022).
- The HuggingFace text-embeddings-inference (TEI) GitHub repository — read the README for the production interface.
Practice
- Implement a tiny bi-encoder training loop in PyTorch using in-batch negatives. Use a small base model like bert-base-uncased. Train it on 1000 (question, answer) pairs from any QA dataset. Plot the validation retrieval recall.
- Why do in-batch negatives give you N² similarity scores from only N forward passes? Show the matmul that does it.
- The InfoNCE loss has a temperature τ. Predict what happens when τ is very small (e.g., 0.01) and when τ is very large (e.g., 10). Run a tiny experiment and confirm.
- Why does L2 normalization make cosine similarity equal to dot product? Prove it from the definition.
- You have a corpus of 1M documents and want to do retrieval. Estimate the storage cost for an embedding-only index with 1024-dim fp32 vectors, with 1024-dim fp16 vectors, with 1024-dim int8 quantized vectors, and with a ColBERT-style multi-vector index averaging 200 tokens per document with a 128-dim fp32 vector per token. (Answers: ~4 GB, ~2 GB, ~1 GB, ~100 GB.)
- Why can’t you use a cross-encoder as the primary retriever for a 1M-document corpus? Compute the latency of doing it, assuming 10 ms per (query, document) pair on a single GPU. Then explain why retrieve-then-rerank fixes this.
- Stretch: Take a small embedding model from HuggingFace, build an in-memory FAISS index over a Wikipedia subset (e.g., 10k articles), and write a tiny retrieve-then-rerank script that uses bge-reranker-base as the second stage. Measure the latency of each stage on a single CPU.
Concept check
Four questions to check your understanding:
- 1. A bi-encoder (embedding model) and a cross-encoder (reranker) both score a query against a document. What is the fundamental architectural difference?
- 2. How does contrastive training of an embedding model with in-batch negatives work?
- 3. Why is L2-normalizing embedding vectors before computing similarity almost always the right choice for retrieval?
- 4. How does ColBERT (late interaction) differ from a standard bi-encoder?