Part IV · Information Retrieval & RAG
Chapter 59 ~19 min read

Vector index internals: HNSW, IVF, ScaNN, FAISS

"You don't search a million vectors by computing a million distances. You build an index, the same way databases have for the last fifty years"

In Chapters 9 and 56 we covered embedding models — how they’re trained, how to choose one, what they produce. This chapter covers what happens after the embeddings are produced: how to index them so you can find nearest neighbors quickly, and how the major index types work under the hood.

By the end you’ll understand HNSW, IVF, ScaNN, and FAISS well enough to pick the right one for any constraint set, tune the parameters intelligently, and avoid the common pitfalls.

Outline:

  1. The nearest-neighbor problem.
  2. Exact vs approximate (the recall-latency-memory frontier).
  3. Brute force.
  4. IVF — partition-based indexing.
  5. Product quantization.
  6. HNSW — graph-based indexing.
  7. ScaNN — Google’s hybrid.
  8. FAISS — the canonical library.
  9. The vector database landscape.
  10. Picking an index.

59.1 The nearest-neighbor problem

You have N vectors of d dimensions in a database. You receive a query vector. You want to return the k vectors in the database that are nearest to the query (by cosine similarity, L2 distance, or dot product).

Naive solution: compute the distance from the query to every database vector, sort, return the top k. This is O(N × d) per query — fine for small N but impossibly slow for large N.

For N = 1 million and d = 1024 in fp32, naive search reads 1M × 1024 × 4 bytes ≈ 4 GB of memory per query, plus a million distance computations. On a CPU with ~30 GB/s of effective memory bandwidth, the scan alone takes over 100 ms per query. Too slow for interactive retrieval.
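A back-of-envelope version of that arithmetic (the 30 GB/s bandwidth figure is an assumption, roughly one CPU socket):

```python
# Back-of-envelope cost of brute-force search over 1M fp32 vectors of dim 1024.
N, d, bytes_per_float = 1_000_000, 1024, 4

traffic_gb = N * d * bytes_per_float / 1e9     # memory read per query
print(f"traffic per query: {traffic_gb:.2f} GB")   # → 4.10 GB

# At an assumed ~30 GB/s of effective memory bandwidth, the scan alone takes:
bandwidth_gb_s = 30
latency_ms = traffic_gb / bandwidth_gb_s * 1000
print(f"lower bound on latency: {latency_ms:.0f} ms")   # → 137 ms
```

This is a lower bound — it counts only memory traffic, not the distance arithmetic or the final sort.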

The fix is approximate nearest neighbor (ANN) indexing. Instead of computing exact distances, build a data structure that lets you find approximate near neighbors in O(log N) or O(√N) time. You sacrifice some recall (the index occasionally misses the true nearest neighbor) for dramatic latency improvement.

Modern ANN algorithms achieve 95-99% recall with query latencies of 1-10 ms on million-scale datasets. Good enough for most production retrieval.

59.2 Exact vs approximate

The trade-off:

Exact search (brute force): 100% recall, slow. Approximate search: 90-99% recall, fast.

For most retrieval applications, 99% recall is fine — you’d rather have the top-10 in 5 ms than the top-10-with-the-true-best-included in 5 seconds. The 1% you miss is usually not the document that would have changed the answer; it’s some marginal candidate that ranked just below the threshold.

For some applications (legal discovery, medical literature search, exact deduplication), recall genuinely matters, and the options are:

  • Use exact search (slow but correct).
  • Use approximate search and accept a slightly worse top-k.
  • Use approximate search but increase the search width until measured recall on your data meets the bar.

The recall-latency trade is configurable in most ANN algorithms. Higher recall = slower search; lower recall = faster search. You pick the operating point.

[Figure: recall-latency Pareto frontier for ANN indexes — recall@10 (80% to 100%) against query latency (1 to 1000 ms, log scale) for IVF-PQ, HNSW, and brute force.]
HNSW achieves a better recall-latency frontier than IVF-PQ — the typical production operating point (HNSW at ef=200) sits at 99% recall and ~5-10 ms, far from the expensive exact-search corner.

59.3 Brute force

The simplest “index”: no index at all. Store the vectors in a flat array; at query time, compute all distances and sort.

Brute force is sometimes the right choice:

  • Small N (under ~10k vectors). The index overhead isn’t worth it.
  • High-dimensional vectors (d > 1000). Curse of dimensionality makes ANN less effective.
  • High recall requirement. Brute force guarantees 100% recall.
  • GPU search. Modern GPUs can brute-force millions of vectors per query. FAISS-GPU has optimized brute force kernels.

For the typical RAG application with 100k-10M documents, brute force is too slow. You need an ANN index.

The libraries that support brute force:

  • FAISS-IndexFlat: the brute-force index in FAISS.
  • PyTorch / NumPy: a few lines of matrix multiplication for brute-force search.
  • Most vector databases support brute force as an option.

For prototyping, start with brute force. It’s the simplest baseline, and you can switch to ANN once you’ve validated the rest of the pipeline.
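A minimal brute-force baseline in NumPy — the array sizes here are illustrative, not from the text. With unit-normalized embeddings, a dot product is cosine similarity, so the whole search is one matrix-vector product plus a partial sort:

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.standard_normal((10_000, 256)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)   # unit norm: dot == cosine

def brute_force_search(query, k=10):
    """Return (indices, scores) of the k most similar database vectors."""
    scores = db @ query                        # one matrix-vector product, (N,)
    top = np.argpartition(-scores, k)[:k]      # O(N) partial selection
    top = top[np.argsort(-scores[top])]        # sort only the k winners
    return top, scores[top]

idx, sc = brute_force_search(db[42], k=5)      # query with a known answer: itself
assert idx[0] == 42                            # exact search never misses
```

`argpartition` avoids the O(N log N) full sort; only the k winners get sorted.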

59.4 IVF — partition-based indexing

IVF (Inverted File) is the simplest ANN approach. The idea: partition the vector space into cells (using k-means or similar), and only search the cells nearest to the query.

The setup:

  1. Train: run k-means on the training vectors to find nlist centroids. Each vector is assigned to its nearest centroid. The cells are the Voronoi regions around each centroid.

  2. Index: store each vector in its cell’s posting list (similar to an inverted index, hence the name).

  3. Query:

    • Compute the query’s distance to each centroid.
    • Pick the nprobe nearest centroids.
    • Brute-force search the vectors in those nprobe cells.
    • Return the top-k.
[Figure: IVF partitions the vector space into Voronoi cells; at query time only the nprobe cells nearest the query are searched. The steps: (1) compute dist(q, all centroids), O(nlist × d); (2) pick the top nprobe centroids, e.g. nprobe=10 out of 1000; (3) brute-force inside those cells, O((N/nlist) × nprobe × d); (4) return the top-k.]
IVF limits the brute-force search to the nprobe cells nearest the query — with nprobe=10 out of 1000 cells, only about 1% of the corpus is touched per query.
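The four query steps can be sketched in plain NumPy. This is a toy IVF, not the FAISS implementation: a real index trains the centroids with k-means, while this sketch just samples them from the data:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, nlist, nprobe = 20_000, 64, 100, 5
db = rng.standard_normal((N, d)).astype(np.float32)

# "Train": pick nlist centroids (a real index would run k-means here).
centroids = db[rng.choice(N, nlist, replace=False)]

# Index: assign every vector to its nearest centroid's posting list.
d2 = (db ** 2).sum(1, keepdims=True) - 2 * db @ centroids.T + (centroids ** 2).sum(1)
assign = d2.argmin(1)
cells = [np.where(assign == c)[0] for c in range(nlist)]

def ivf_search(q, k=10):
    qd = ((centroids - q) ** 2).sum(1)                # 1. distance to all centroids
    probe = np.argsort(qd)[:nprobe]                   # 2. top nprobe cells
    cand = np.concatenate([cells[c] for c in probe])  # gather their posting lists
    dist = ((db[cand] - q) ** 2).sum(1)               # 3. brute force inside cells
    order = np.argsort(dist)[:k]
    return cand[order], dist[order]                   # 4. return top-k

idx, dist = ivf_search(db[7])   # the query's own cell is always probed
```

Only ~nprobe × N/nlist vectors (here ~1000 of 20 000) get a distance computation per query.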

The parameters:

  • nlist: number of cells. Typical: √N (e.g., 1000 cells for 1M vectors).
  • nprobe: number of cells to search at query time. Higher = better recall, slower. Typical: 10-100.

The search complexity is O(d × (nlist + (N/nlist) × nprobe)). With nlist = √N and nprobe = √nlist, this is O(d × N^0.75) — much better than O(N) brute force but not as good as O(log N).

IVF is simple, has predictable behavior, and works well for medium-scale indexes (a few million to a billion vectors). It’s the default index in many vector databases.

Limitations:

  • Edge cells: queries near the boundary between cells may miss the true nearest neighbor (which is in the cell next door, not searched).
  • Dimensionality: in high-d spaces, true neighbors scatter across many cells, so nprobe has to grow to hold recall.
  • Memory: stores the full vectors in the cells.

The fix for the memory issue is product quantization.

59.5 Product quantization

Product quantization (PQ) is a compression technique for vectors. The idea: split each vector into sub-vectors, quantize each sub-vector to a small code, and store the codes instead of the full vector.

The setup:

  1. Split the d-dim vector into m sub-vectors of d/m dimensions each.
  2. For each sub-vector position, train a separate k-means with k centroids (typically k = 256, so each code fits in 8 bits).
  3. To encode a vector: for each sub-vector, find the nearest centroid and store the centroid’s index.
  4. The encoded vector is m bytes (instead of d × 4 bytes for fp32).

For d = 1024, m = 32, k = 256: each sub-vector is 32 dim, encoded as 1 byte. The whole vector becomes 32 bytes — a 128× compression vs fp32.

[Figure: product quantization splits a 1024-dim float32 vector (1024 × 4 = 4 096 bytes) into 32 sub-vectors; each sub-vector position has its own codebook of 256 centroids, and each sub-vector is stored as a 1-byte centroid index — 32 bytes total.]
Product quantization achieves 128× compression by replacing each 32-dimensional float sub-vector with its nearest codebook centroid index — one byte per sub-vector instead of 128 bytes.

To compute distances against PQ codes:

  1. Asymmetric distance: use the full query vector and compare against the centroid distances of the database codes. Slower but more accurate.
  2. Symmetric distance: encode the query too and compare codes. Faster but less accurate.

In practice, asymmetric distance computation is fast enough — you precompute a lookup table of distances from the query to each centroid in each sub-vector position, and then compute the document distance as a sum over the codes.
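A sketch of PQ encoding plus the asymmetric lookup-table trick in NumPy. The codebooks here are sampled from the data rather than trained with k-means, and the sizes (d=128, m=16) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m, ksub = 5_000, 128, 16, 256   # 16 sub-vectors of 8 dims each
dsub = d // m
db = rng.standard_normal((N, d)).astype(np.float32)

# One codebook per sub-vector position, shape (m, ksub, dsub).
# (A real index runs k-means per position; we sample rows instead.)
codebooks = db[rng.choice(N, ksub, replace=False)].reshape(ksub, m, dsub).transpose(1, 0, 2)

def encode(x):
    """Map (n, d) vectors to (n, m) uint8 codes: nearest codeword per sub-vector."""
    subs = x.reshape(len(x), m, dsub)
    codes = np.empty((len(x), m), dtype=np.uint8)
    for j in range(m):
        d2 = ((subs[:, j, None, :] - codebooks[j][None]) ** 2).sum(-1)
        codes[:, j] = d2.argmin(1)
    return codes

codes = encode(db)   # 16 bytes per vector instead of 128 × 4 = 512 (32× compression)

def adc_search(q, k=10):
    """Asymmetric distance: full-precision query vs quantized database."""
    qsubs = q.reshape(m, dsub)
    lut = ((qsubs[:, None, :] - codebooks) ** 2).sum(-1)  # (m, ksub) lookup table
    dist = lut[np.arange(m), codes].sum(1)                # (N,) sums of table lookups
    top = np.argsort(dist)[:k]
    return top, dist[top]

idx, dist = adc_search(db[3])
```

Note the key property: after the (m × ksub) table is built, each document distance costs m table lookups and adds — no floating-point vector arithmetic per document.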

PQ + IVF is IVF-PQ: partition the vectors into cells, then quantize each vector with PQ. The combination gives you both fast search (IVF’s partitioning) and small memory (PQ’s compression). FAISS calls this IndexIVFPQ.

For very large indexes (billions of vectors), IVF-PQ is the standard choice. Storage drops by 100× and search remains fast.

The trade-off: PQ loses some accuracy. The recall at fixed nprobe is lower than uncompressed IVF. For most RAG applications this is fine; the recall is still 95%+.

59.6 HNSW — graph-based indexing

HNSW (Hierarchical Navigable Small World) is the most popular ANN index in 2024-25. The idea: build a multi-layer graph where each node is a vector and edges connect nearby vectors. At query time, navigate the graph greedily toward the query.

The construction:

  1. Start with an empty graph.
  2. For each new vector to index:
    • Pick a random level (using a probability distribution that decreases with level).
    • Insert the vector at all levels from 0 up to the picked level.
    • At each level, find the nearest existing nodes and connect to them (capped at M connections per node).

The result is a multi-layer graph: layer 0 contains all vectors with many connections; higher layers contain progressively fewer vectors with fewer connections. The top layer is small and serves as an “express lane” for navigation.

[Figure: HNSW multi-layer graph — the sparse top layer enables fast navigation, the dense bottom layer (layer 0) enables accurate search; the query descends a layer whenever no local improvement is found, then explores an ef_search-wide neighborhood at layer 0.]
The HNSW entry point is in the sparse top layer; greedy descent drops to each lower layer when no local improvement is found — O(log N) hops to reach the target neighborhood.

The search:

  1. Start at a fixed entry point in the top layer.
  2. Greedy descent: at each step, move to the neighbor closest to the query.
  3. When you can’t improve at the current layer, drop to the next layer down.
  4. At layer 0, do a more thorough search (visiting ef_search nearest neighbors) and return the top-k.
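The layer-0 search can be sketched as a best-first expansion over a neighbor graph with a beam of size ef_search. This stands in for real HNSW: the graph here is an exact k-NN graph rather than one built by HNSW's insertion procedure, and there is only one layer:

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)
N, d, M, ef_search = 2_000, 32, 8, 50
db = rng.standard_normal((N, d)).astype(np.float32)

# Stand-in for HNSW layer 0: link each node to its M nearest neighbors.
sq = (db ** 2).sum(1)
d2 = sq[:, None] - 2 * db @ db.T + sq[None, :]
neighbors = np.argsort(d2, axis=1)[:, 1 : M + 1]   # column 0 is the node itself

def layer0_search(q, entry=0, k=10):
    dist = lambda i: float(((db[i] - q) ** 2).sum())
    visited = {entry}
    frontier = [(dist(entry), entry)]      # min-heap of nodes to expand
    best = [(-dist(entry), entry)]         # max-heap holding up to ef_search results
    while frontier:
        dc, c = heapq.heappop(frontier)
        if dc > -best[0][0]:               # frontier is worse than the worst result
            break                          # -> the beam has converged
        for n in neighbors[c]:
            n = int(n)
            if n in visited:
                continue
            visited.add(n)
            dn = dist(n)
            if len(best) < ef_search or dn < -best[0][0]:
                heapq.heappush(frontier, (dn, n))
                heapq.heappush(best, (-dn, n))
                if len(best) > ef_search:
                    heapq.heappop(best)    # evict the worst result
    return sorted((-nd, i) for nd, i in best)[:k]   # (distance, node), ascending

results = layer0_search(db[123])
# recall is high but not guaranteed -- that's the "approximate" in ANN
```

A wider beam (larger ef_search) visits more nodes before converging, which is exactly the recall-latency knob described above.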

The parameters:

  • M: max connections per node. Typical: 16-64. Higher = better recall, more memory.
  • ef_construction: search width during index building. Typical: 100-400. Higher = better index quality, slower build.
  • ef_search: search width during query. Typical: 50-500. Higher = better recall, slower query.

The properties:

  • Logarithmic search complexity: O(log N) per query.
  • High recall: 99%+ at reasonable ef_search.
  • Memory-heavy: stores the full vectors plus the graph (roughly M neighbor ids × 4 bytes per node per layer).
  • Slow to build: O(N × log N) with significant constants.
  • Static: traditionally hard to add or delete vectors after building. Modern HNSW variants support incremental updates.

HNSW is the default in many vector databases (Qdrant, Weaviate, Milvus, Pinecone) because it gives the best recall-latency trade-off in the medium-to-large scale (1k to 100M vectors). At billions of vectors, IVF-PQ is more memory-efficient.

For most production RAG with under ~10M vectors, HNSW is the right default.

59.7 ScaNN — Google’s hybrid

ScaNN (Guo et al., Google, 2020) is Google’s open-source ANN library. It uses a partition-based approach (similar to IVF) but with a learned scoring function that improves recall significantly.

The trick: ScaNN’s scoring uses a special anisotropic quantization that’s tuned for inner-product (dot product) similarity, which is what most embedding models use.

ScaNN beats most other ANN libraries on the latency-recall trade-off, especially for large-scale workloads. It’s used inside Google for many services.

The downside: it’s less feature-rich than FAISS (fewer index types) and less well-known. For most production teams, FAISS or a vector database is the more common choice.

For very large-scale (billion+) retrieval where you want the absolute fastest inner-product search, ScaNN is worth evaluating.

59.8 FAISS — the canonical library

FAISS (Facebook AI Similarity Search) is the most widely-used vector index library. It’s a C++ library with Python bindings that implements many ANN index types:

  • IndexFlat — brute force.
  • IndexIVFFlat — IVF without compression.
  • IndexIVFPQ — IVF with PQ compression.
  • IndexHNSW — HNSW.
  • IndexLSH — locality-sensitive hashing.
  • And many more.

FAISS supports both CPU and GPU. The GPU version can search billion-scale indexes in milliseconds.

A typical FAISS usage:

import faiss
import numpy as np

# Build the index
d = 1024  # vector dimension
nb = 1000000  # number of vectors
nlist = 1000  # number of IVF cells

vectors = np.random.random((nb, d)).astype('float32')
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(vectors)
index.add(vectors)

# Search
index.nprobe = 10  # number of cells to probe
queries = np.random.random((5, d)).astype('float32')
distances, indices = index.search(queries, k=10)

That’s a complete IVF index in about a dozen lines of Python. FAISS makes the underlying algorithms accessible without you having to implement them.

For most production teams, you don’t use FAISS directly — you use a vector database that uses FAISS (or HNSW or another library) under the hood. But knowing FAISS is useful for prototyping, custom workflows, and understanding what your vector database is doing.

59.9 The vector database landscape

The major vector databases as of late 2025:

Qdrant. Open-source, Rust-based, HNSW + scalar quantization, on-disk indexes, JSON metadata filtering. Fast and developer-friendly. Used by many production teams.

Weaviate. Open-source, Go-based, HNSW, schema-based, supports both vector and lexical search. GraphQL API.

Milvus. Open-source, distributed, supports many index types (IVF, HNSW, etc.). Used at very large scale.

Pinecone. Hosted (managed) vector database. Easy to use, popular for prototyping. Closed-source.

LanceDB. Open-source, embedded, columnar storage. Good for small-to-medium workloads embedded in applications.

Chroma. Open-source, lightweight, easy to use. Popular for prototypes and notebooks.

pgvector. A PostgreSQL extension that adds vector indexing to existing Postgres databases. Useful when you already have Postgres.

Vespa. Yahoo’s open-source search engine. Supports lexical (BM25), dense vector, and ColBERT-style multi-vector retrieval. Powerful but complex.

Redis with vector search. Redis modules add vector search to Redis. Useful for caching scenarios.

The decision matrix:

Use case                                  Recommended
Production RAG, open-source, default      Qdrant
Want lexical + vector in one system       Weaviate or Vespa
Already using Postgres                    pgvector
Hosted / managed                          Pinecone
Embedded in application                   LanceDB or Chroma
Massive scale (billions)                  Milvus or Vespa
Want maximum control                      FAISS directly

For most production RAG in 2025, Qdrant is the modern open-source default. It has good performance, good developer experience, supports filtering, and runs on Kubernetes.

59.10 Picking an index

The decision tree for the ANN index type:

Q: How many vectors?

  • < 10k: brute force. The index overhead isn’t worth it.
  • 10k - 1M: HNSW. Best recall-latency trade-off.
  • 1M - 100M: HNSW with PQ compression, or IVF-PQ.
  • 100M - 10B: IVF-PQ on disk.
  • > 10B: distributed IVF-PQ across many nodes.

Q: What’s your recall requirement?

  • 100% (exact): brute force only.
  • 99%+: HNSW with high ef_search.
  • 95-99%: standard HNSW or IVF.
  • Lower: IVF-PQ with aggressive compression.

Q: What’s your latency budget?

  • < 1 ms: HNSW with low ef_search, on memory.
  • 1-10 ms: standard HNSW or IVF.
  • 10-100 ms: IVF-PQ on disk.

Q: What’s your memory budget?

  • Generous: HNSW (stores full vectors).
  • Tight: IVF-PQ (compressed).

Q: Do you need to update the index?

  • Frequently (real-time inserts): HNSW with incremental support.
  • Rarely (batch reindex): IVF.
  • Never: any index, optimized for read.

For most production RAG with 100k-10M vectors, HNSW with M=16-32, ef_search=100-200 is the right starting point. Tune from there based on your specific recall/latency needs.

59.11 The mental model

Eight points to take into Chapter 60:

  1. Brute force is O(N) and impossibly slow for large corpora.
  2. ANN trades recall for latency. 99% recall in milliseconds vs 100% recall in seconds.
  3. IVF partitions the space and searches only nearby cells. Simple, predictable.
  4. PQ compresses vectors into small codes. Combines with IVF for memory-efficient indexes.
  5. HNSW is a multi-layer graph. Best recall-latency at medium scale. Modern default.
  6. ScaNN is Google’s library. Fastest for inner-product search, less feature-rich.
  7. FAISS is the canonical library. Used directly or under the hood of vector databases.
  8. Qdrant is the modern open-source default vector database. HNSW under the hood.

In Chapter 60 we look at how to combine BM25 and dense retrieval: hybrid search and fusion.


Read it yourself

  • The HNSW paper: Malkov & Yashunin, Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs (2016).
  • The FAISS documentation and tutorials.
  • The Qdrant documentation on indexing and quantization.
  • The ScaNN paper and GitHub repository.
  • The PQ paper: Jégou et al., Product Quantization for Nearest Neighbor Search (2011).
  • The ANN-Benchmarks site for empirical comparisons.

Practice

  1. For 1M vectors of dim 1024 in fp32, compute the storage cost and brute-force latency on a CPU at 30 GB/s memory bandwidth.
  2. For the same 1M vectors with IVF-PQ at m=64, k=256, compute the storage cost.
  3. Why is HNSW O(log N) for search? Trace through a search on a small example.
  4. For nlist=1000, nprobe=10 on 1M vectors, how many vectors does IVF actually search per query?
  5. When would you choose IVF-PQ over HNSW for a 10M-vector index? Argue from memory and recall.
  6. Why does asymmetric distance (full query, PQ codes for documents) work better than symmetric? Trace through the lookup.
  7. Stretch: Build a FAISS HNSW index on a small dataset (e.g., 100k embeddings from a public corpus). Tune ef_search and plot the recall-latency curve.