Information retrieval primer: TF-IDF, BM25, why they still matter
"Before there were embeddings, there was BM25. Decades later, BM25 is still a hard baseline to beat."
We open Part IV — Information Retrieval & RAG — with the classical IR foundation. Modern LLM systems increasingly depend on retrieval, and “retrieval” doesn’t just mean vector search. The classical methods (TF-IDF, BM25) are still the workhorses of production search, often outperforming pure vector approaches and almost always usefully complementing them.
By the end of this chapter you’ll know:
- Why text retrieval is the foundation of every search system.
- The TF-IDF model and its intuition.
- BM25 — the algorithm that runs most production search.
- Why BM25 is still competitive with dense retrieval in 2025.
- How sparse and dense retrieval complement each other.
- Inverted indices and how they make BM25 fast.
Outline:
- The retrieval problem.
- The bag-of-words model.
- TF-IDF derivation.
- BM25 — the saturation function.
- Why BM25 works so well.
- Inverted indices.
- The hybrid argument.
- Production BM25 tools.
- The limits of lexical search.
57.1 The retrieval problem
The setup: you have a collection of documents (millions, billions, whatever). A user submits a query (a question, a search term, a partial sentence). You need to return the top-K most relevant documents from the collection, fast.
This is the information retrieval problem, and it predates the entire field of machine learning. Search engines (Google, Bing, DuckDuckGo), enterprise search, e-commerce search, code search, and now RAG systems — all of them solve this same problem at different scales.
The challenge is twofold:
- Relevance: how do you decide which documents are relevant to a query? This is a modeling problem.
- Scale: how do you find the relevant documents in milliseconds across billions of documents? This is a systems problem.
Classical IR (1970s-2010s) solved both problems with lexical methods — methods that work on the words themselves. The intuition: a document is relevant to a query if it contains the query’s words, especially if it contains them often and they’re rare in other documents. The TF-IDF and BM25 algorithms formalize this intuition.
Modern LLM-era retrieval added semantic methods (dense embeddings, Chapters 9 and 58). These can capture meaning beyond exact word matches. But they don’t replace lexical methods; they complement them.
57.2 The bag-of-words model
The simplest representation of text for retrieval: a bag of words. Ignore order; just count how many times each word appears.
Document 1: “The quick brown fox jumps over the lazy dog” Bag: {the: 2, quick: 1, brown: 1, fox: 1, jumps: 1, over: 1, lazy: 1, dog: 1}
Document 2: “The lazy dog sleeps under the tree” Bag: {the: 2, lazy: 1, dog: 1, sleeps: 1, under: 1, tree: 1}
Query: “lazy dog” Bag: {lazy: 1, dog: 1}
Both documents contain “lazy” and “dog,” so both are relevant. We need a way to score each document by how relevant it is, so we can rank them.
The bag-of-words model is naive but effective. It throws away word order, syntax, and meaning, but it captures the most basic signal: which words appear. The vast majority of search relevance can be predicted from this signal alone.
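The bag-of-words representation is a couple of lines of Python. A minimal sketch, assuming whitespace tokenization and lowercasing (real tokenizers also strip punctuation and may stem):

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    # Lowercase and split on whitespace; order is discarded, only counts remain.
    return Counter(text.lower().split())

doc1 = bag_of_words("The quick brown fox jumps over the lazy dog")
query = bag_of_words("lazy dog")

# The most basic relevance signal: how many query terms the document contains.
overlap = sum(min(query[t], doc1[t]) for t in query)
print(overlap)  # → 2
```

Both query terms appear in the document, so the overlap is 2. Scoring formulas like TF-IDF and BM25 refine this raw overlap into a ranking signal.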
57.3 TF-IDF derivation
TF-IDF (Term Frequency × Inverse Document Frequency) is the classical scoring formula. Let me derive it from intuition.
Intuition 1: Term Frequency. A document that mentions “lazy” three times is probably more about laziness than one that mentions it once. So the score should grow with the number of times the term appears. The simplest formula: tf(t, d) = count(t, d), the raw count of term t in document d.
Intuition 2: Document Frequency. Common words (like “the”) appear in almost every document. They don’t tell you anything about relevance. Rare words (like “tetragrammaton”) that appear in only a few documents are highly informative. So the score should give more weight to rare words.
The “inverse document frequency” formula: idf(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing term t.
For a corpus of 1 million documents (using the natural logarithm):
- “the” appears in ~900,000 documents → idf("the") = log(10^6 / (9 × 10^5)) ≈ 0.105
- “tetragrammaton” appears in 100 documents → idf("tetragrammaton") = log(10^6 / 100) = log(10^4) ≈ 9.21
The IDF for “tetragrammaton” is nearly 90× higher. Matching “tetragrammaton” between query and document is much more meaningful than matching “the.”
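The arithmetic above can be checked directly (this assumes the natural logarithm, as most IR texts do):

```python
import math

N = 1_000_000                        # corpus size
idf_the = math.log(N / 900_000)      # very common term
idf_rare = math.log(N / 100)         # very rare term

print(round(idf_the, 3))   # → 0.105
print(round(idf_rare, 3))  # → 9.21
```

Production systems often use a smoothed variant (e.g. log((N - df + 0.5) / (df + 0.5))) to avoid zero or negative weights at the extremes, but the rare-vs-common gap is the same.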
The TF-IDF score for a document d given a query q:
score(d, q) = Σ_{t in q} tf(t, d) × idf(t)
Sum over the query terms. For each term, multiply how often it appears in the document by how rare it is in the corpus.
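The whole formula fits in a short function. A minimal sketch, assuming documents are already tokenized into `Counter`s and using raw counts for TF and the plain log(N/df) IDF from above:

```python
import math
from collections import Counter

def tfidf_score(query_terms, doc, corpus):
    """TF-IDF score of one document for a query.

    query_terms: list of tokens; doc: Counter of term counts for the
    document being scored; corpus: list of Counters, one per document.
    """
    N = len(corpus)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)
        if df == 0:
            continue  # term absent from the corpus contributes nothing
        score += doc[t] * math.log(N / df)
    return score

corpus = [
    Counter("the quick brown fox jumps over the lazy dog".split()),
    Counter("the lazy dog sleeps under the tree".split()),
    Counter("a tree grows in the garden".split()),
]
print(round(tfidf_score(["lazy", "dog"], corpus[0], corpus), 3))  # → 0.811
```

Each of “lazy” and “dog” appears in 2 of the 3 documents, so each contributes 1 × log(3/2) ≈ 0.405 to the first document’s score.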
This is TF-IDF in its simplest form. It’s almost too simple — it has problems with long documents (which trivially have high TF for everything) and with diminishing returns (a document with 100 mentions of “lazy” isn’t 10× as relevant as one with 10 mentions). The fix is BM25.
57.4 BM25 — the saturation function
BM25 (Best Match 25, from a series of TREC experiments by Robertson, Spärck Jones, and others, formalized in 1994) is the modern refinement of TF-IDF. It addresses TF-IDF’s two main weaknesses with a more sophisticated formula:
score(d, q) = Σ_{t in q} idf(t) × (tf(t, d) × (k_1 + 1)) / (tf(t, d) + k_1 × (1 - b + b × |d| / avg_dl))
Where:
- k_1 is a tuning parameter (typically 1.2 to 2.0).
- b is another tuning parameter (typically 0.75).
- |d| is the length of document d in tokens.
- avg_dl is the average document length in the corpus.
That looks ugly. Let me decompose it:
The IDF term: same as TF-IDF, with a slight modification (BM25 uses a slightly different IDF formula that handles edge cases better).
The TF saturation: the tf(t, d) × (k_1 + 1) / (tf(t, d) + k_1 × ...) part. This is a saturating function — it grows with tf initially but levels off as tf gets large. Going from 1 mention to 2 mentions of a term boosts the score significantly. Going from 100 to 200 mentions barely changes the score. This is much more realistic than the linear TF in TF-IDF.
The length normalization: the (1 - b + b × |d| / avg_dl) factor. This penalizes long documents. A document that’s 10× the average length naturally has 10× the term counts; without normalization, long documents would always score higher just because they have more words.
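The per-term contribution is easy to compute, and doing so makes the saturation visible. A minimal sketch, assuming k_1 = 1.5, b = 0.75, and a fixed idf of 1.0 so only the TF behavior varies:

```python
import math

def bm25_term(tf, idf, doc_len, avg_dl, k1=1.5, b=0.75):
    # One term's BM25 contribution: saturating TF with length normalization.
    norm = 1 - b + b * doc_len / avg_dl
    return idf * tf * (k1 + 1) / (tf + k1 * norm)

# With doc_len == avg_dl (norm == 1), doubling tf from 1 to 2 helps a lot,
# but doubling it from 100 to 200 barely moves the score.
for tf in (1, 2, 100, 200):
    print(tf, round(bm25_term(tf, idf=1.0, doc_len=100, avg_dl=100), 3))
# → 1 1.0
# → 2 1.429
# → 100 2.463
# → 200 2.481
```

Note the score can never exceed idf × (k_1 + 1) — here 2.5 — no matter how often the term appears. That hard ceiling is exactly the saturation the linear TF in TF-IDF lacks.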
Together, these refinements make BM25 a much more accurate scoring formula than TF-IDF. The empirical improvement is significant — BM25 is the de facto baseline that any new retrieval method has to beat.
57.5 Why BM25 works so well
BM25 is decades old and looks naive next to neural retrieval. So why is it still competitive?
(1) Exact lexical match is a strong signal. Most queries contain at least some specific terms that the user wants matched literally. BM25 scores these exactly.
(2) The saturation function is empirically right. Term frequency really does have diminishing returns; document length really does need normalization.
(3) IDF captures rarity well. Rare terms really are more informative.
(4) It generalizes. BM25 works on any language, any domain, any document type. No training data needed.
(5) It’s interpretable. When a document scores high, you can see exactly why (which terms matched, with which weights).
(6) It’s robust to query length. Both short queries (a few keywords) and long queries (full sentences) work.
(7) It’s fast. With an inverted index, BM25 scoring is O(query terms × matching docs), which is very fast in practice.
The empirical result: on most benchmarks, BM25 is within a few points of the best neural retriever, and often beats neural retrievers on tasks requiring exact matching (product names, code, technical terminology, named entities).
The reason most modern RAG systems use hybrid retrieval (BM25 + dense) is that the two have complementary strengths. We’ll get to that in §57.7.
57.6 Inverted indices
The systems side: how do you compute BM25 scores for a billion-document corpus in 50ms? The answer is the inverted index.
An inverted index maps each term to the list of documents that contain it (the “posting list”), along with the term’s positions or frequencies in each document.
"lazy" → [doc1: tf=1, doc2: tf=1, doc7: tf=3, ...]
"dog" → [doc1: tf=1, doc2: tf=1, doc4: tf=2, ...]
To score a query, you:
- Look up the posting list for each query term.
- Find the documents that appear in (intersection or union of) the posting lists.
- For each candidate document, compute the BM25 score using the precomputed tf and idf values.
- Return the top-K by score.
The lookup is O(1) per term with a hash table (or O(log V) with a B-tree over a vocabulary of V terms). The intersection is O(min(posting list lengths)). The scoring is O(matching documents). For typical queries with a few terms and a few thousand matching documents, the total work is on the order of microseconds to milliseconds.
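The key property is that scoring only ever touches documents that contain at least one query term, never the whole corpus. A toy in-memory sketch, assuming whitespace tokenization and the same k_1 = 1.5, b = 0.75 defaults as above:

```python
import math
from collections import Counter, defaultdict

class InvertedIndex:
    """Toy inverted index with BM25 scoring; postings map term -> {doc_id: tf}."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.postings = defaultdict(dict)
        self.doc_len = []
        for doc_id, text in enumerate(docs):
            counts = Counter(text.lower().split())
            self.doc_len.append(sum(counts.values()))
            for term, tf in counts.items():
                self.postings[term][doc_id] = tf
        self.N = len(docs)
        self.avg_dl = sum(self.doc_len) / self.N

    def search(self, query, top_k=3):
        scores = defaultdict(float)
        for term in query.lower().split():
            plist = self.postings.get(term, {})
            if not plist:
                continue
            idf = math.log(self.N / len(plist))
            for doc_id, tf in plist.items():  # only matching docs are scored
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / self.avg_dl
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

idx = InvertedIndex([
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps under the tree",
    "a tree grows in the garden",
])
print(idx.search("lazy dog"))
```

For the query “lazy dog”, documents 0 and 1 both match both terms, but the shorter document 1 ranks first — length normalization at work. Document 2 is never scored at all.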
The inverted index is the data structure that made web-scale search possible. It’s been refined over decades — modern indices use compression, sharding, distributed indexing, replication, all the production tricks. The fundamental data structure is unchanged.
57.7 The hybrid argument
The most important insight in modern retrieval: lexical and dense retrieval are complementary. Each catches things the other misses.
Lexical (BM25) catches:
- Exact matches on named entities, product names, technical terms.
- Acronyms and abbreviations.
- Code, IDs, structured tokens.
- Rare terms with strong signal (the “tetragrammaton” case).
- Queries where the user wants specific words matched.
Dense retrieval catches:
- Synonyms and paraphrases (“car” matches “automobile”).
- Semantic concepts not literally in the query.
- Cross-lingual matches (with multilingual encoders).
- Conversational queries that don’t share words with the relevant documents.
Neither alone covers all cases. Together, they cover most cases. The standard practice in modern RAG systems is hybrid retrieval: run BM25 and dense in parallel, combine the results.
Combination strategies (covered in Chapter 60):
- Reciprocal Rank Fusion (RRF): combine the rank positions, not the scores.
- Weighted score fusion: a weighted sum of normalized scores.
- Learning to rank: train a model to combine.
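Of these, RRF is the simplest and needs no score normalization at all, since it uses only rank positions. A minimal sketch with the conventional smoothing constant k = 60 and hypothetical doc IDs:

```python
def rrf(rankings, k=60, top_k=5):
    """Reciprocal Rank Fusion: each list contributes 1/(k + rank) per document.

    rankings: list of ranked doc-id lists, e.g. one from BM25, one from dense.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

bm25_hits = ["d3", "d1", "d7", "d2"]   # hypothetical lexical results
dense_hits = ["d1", "d4", "d3", "d9"]  # hypothetical dense results
fused = rrf([bm25_hits, dense_hits])
print(fused)
```

Documents that appear near the top of both lists (d1, d3) dominate the fused ranking, while documents surfaced by only one retriever still get a chance. The incommensurate raw scores of BM25 and cosine similarity never have to be compared.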
The empirical result: hybrid retrieval consistently beats either single approach. The improvement is typically 5-15 percentage points on retrieval benchmarks. For production systems, the operational complexity of running two retrievers is paid back by the quality improvement.
57.8 Production BM25 tools
The BM25 implementations you’ll encounter:
Elasticsearch / OpenSearch. The dominant production search engines. Use Lucene under the hood. BM25 is the default scoring function. Battle-tested at extreme scale (Wikipedia, GitHub, etc. all use Elasticsearch).
Lucene itself. Java library that Elasticsearch is built on. Lower-level than Elasticsearch but more flexible.
Tantivy. Rust-based search engine library. Faster than Lucene for some workloads. Used by Quickwit and other modern search products.
Whoosh. Pure Python search engine. Slower but easy to embed in Python applications.
rank_bm25. A small Python library for BM25 scoring without the full search engine machinery. Useful for prototypes and small corpora.
Vespa. A more general search and ranking platform from Yahoo, with strong support for hybrid retrieval and ML ranking.
Qdrant, Weaviate, Pinecone. Modern vector databases that have added BM25 / sparse search support to enable hybrid retrieval.
For production RAG with hybrid retrieval, the typical stack is Elasticsearch (or OpenSearch) for BM25 + a vector database for dense retrieval. Or, increasingly, a single vector database that supports both (Vespa, Weaviate, Qdrant with their sparse vector support).
57.9 The limits of lexical search
BM25 is great but it has real limits:
(1) No synonym handling. “car” and “automobile” are different terms with no relation in BM25’s view. Manual synonym lists help but don’t generalize.
(2) No semantic understanding. “How do I fix a broken faucet?” doesn’t match “Plumbing repair guide” because they share no terms.
(3) Sensitive to spelling and morphology. “running” and “runs” are different terms. Stemming helps but is imperfect.
(4) Domain-specific tuning needed. BM25’s k_1 and b parameters work well by default but can be tuned for specific corpora. Most teams don’t tune them and accept the default.
(5) No cross-lingual. A query in English doesn’t match a document in Spanish, even if they’re about the same topic.
(6) Long-tail queries fail. Queries that are rare or unusual often don’t match anything because the user’s word choice doesn’t align with the document’s word choice.
These limits are exactly where dense retrieval (Chapter 58) shines. The two approaches address different failure modes, which is why hybrid retrieval works so well.
57.10 The mental model
Eight points to take into Chapter 58:
- Information retrieval is the foundation of search, predating the LLM era by decades.
- Bag of words is the simplest representation, ignoring order but capturing word counts.
- TF-IDF scores documents by term frequency and inverse document frequency.
- BM25 refines TF-IDF with TF saturation and length normalization. Decades-old, still the best lexical method.
- Inverted indices make lexical retrieval fast at any scale.
- BM25 is competitive with neural retrieval on most benchmarks because exact lexical matching is a strong signal.
- Hybrid retrieval (BM25 + dense) consistently beats either alone. Modern RAG systems use both.
- Lexical search has real limits: no synonyms, no semantics, no cross-lingual. Dense retrieval addresses these.
In Chapter 58 we look at the dense retrieval side: how embedding models are trained for retrieval and what makes them effective.
Read it yourself
- Robertson & Zaragoza, The Probabilistic Relevance Framework: BM25 and Beyond (2009). The canonical BM25 reference.
- Introduction to Information Retrieval by Manning, Raghavan, and Schütze. The textbook. Free online.
- The Elasticsearch documentation on BM25 and similarity scoring.
- The Lucene documentation on the inverted index.
- The Pyserini toolkit (Lin et al.) for IR experiments.
Practice
- Compute BM25 scores by hand for a small corpus of 5 documents and a 2-term query. Use k_1 = 1.5, b = 0.75.
- Why is the BM25 saturation function more realistic than linear TF? Plot the score as a function of tf for both.
- Why does BM25 use length normalization? Construct a case where it matters.
- For a corpus of 1 million documents where “the” appears in 900k and “quantum” appears in 1k, compute the IDF for each.
- Why is BM25 still competitive with dense retrieval in 2025? List three reasons.
- Read the Elasticsearch documentation on BM25 parameters. When would you tune k_1 or b?
- Stretch: Implement BM25 in 50 lines of Python. Run it on a small corpus and verify it ranks documents sensibly.