Part IV · Information Retrieval & RAG
Chapter 61

Chunking strategies

"Chunking is the most underrated lever in RAG quality. Most teams pick a default and never tune it. The teams that do tune it see double-digit improvements."

In Chapter 60 we covered hybrid retrieval — how to find candidate documents fast. This chapter covers a question that comes one step earlier: how do you decide what counts as a “document”?

In a typical RAG corpus, the source data is much larger than what fits in any retriever’s input window. Wikipedia articles can be tens of thousands of words. PDFs can be hundreds of pages. Codebases have millions of lines. You can’t index a 200-page PDF as a single “document” — the embedding would be useless (averaging meaning across hundreds of distinct topics) and the LLM context wouldn’t fit.

The answer is chunking: split the source into smaller pieces, index each piece separately, and retrieve the most relevant pieces for each query. The strategy you use to chunk has an enormous effect on retrieval quality, and it’s the single most underrated lever in RAG.

By the end of this chapter you’ll know all the major chunking strategies, when each is appropriate, and the practical pitfalls.

Outline:

  1. The chunking problem.
  2. Fixed-size chunking.
  3. Sentence-based chunking.
  4. Recursive character-based chunking.
  5. Semantic chunking.
  6. Hierarchical and parent-document chunking.
  7. Propositional chunking.
  8. Code-specific chunking.
  9. Overlap and the boundary problem.
  10. Chunk-level metadata.
  11. The empirical results.
  12. Picking a strategy.

61.1 The chunking problem

The fundamental tension: chunks need to be small enough to embed well but large enough to be self-contained.

If chunks are too small (say, single sentences), each chunk lacks context. A sentence like “He was born in 1856” is meaningless without knowing who “he” is. The retriever might find it but the LLM can’t use it.

If chunks are too large (say, whole documents), the embedding averages meaning across many topics. The retriever can’t distinguish between “this 50-page report has one section about my query” and “this 50-page report is about my query.” Both look the same to the embedder.

The right size is in the middle: large enough to be self-contained, small enough to be focused. Empirically, this is often around 200-1000 tokens per chunk, but the right answer depends on the corpus and the workload.

Beyond size, the chunk boundaries matter. Splitting in the middle of a sentence is bad. Splitting in the middle of a code function is bad. Splitting between paragraphs that depend on each other is bad. Chunking strategies differ in how cleverly they pick the boundaries.

[Figure: retrieval quality vs. chunk size]
Retrieval quality peaks in the 200–600 token range — below that, chunks lose self-contained context ("He was born in 1856" — who?); above that, the embedding averages across unrelated topics and loses focus.

61.2 Fixed-size chunking

The simplest strategy: split into fixed-size chunks, ignoring document structure. Pick a chunk size (in characters or tokens), split the document into pieces of that size, index each piece.

def fixed_size_chunk(text, chunk_size=500):
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

Two lines of Python. Always works. Always available.

The problems:

  • Cuts across sentence and paragraph boundaries randomly.
  • No semantic awareness. Two unrelated topics in the same paragraph end up in the same chunk.
  • Fragments important content. A definition might be cut in half.

Fixed-size chunking is the baseline you start with. It’s quick to set up, easy to reason about, and works well enough for many corpora. But it’s almost always suboptimal compared to smarter strategies.

For prototyping, fixed-size with chunk_size = 500-1000 characters and overlap = 50-100 characters is the standard starting point.

61.3 Sentence-based chunking

A small improvement: split on sentence boundaries. Use a sentence tokenizer (NLTK, spaCy, or a simpler regex) to identify sentence boundaries and group sentences into chunks.

from nltk.tokenize import sent_tokenize  # requires nltk.download('punkt')

def sentence_chunk(text, target_size=500):
    sentences = sent_tokenize(text)
    chunks = []
    current = []
    current_size = 0
    for sent in sentences:
        if current_size + len(sent) > target_size and current:
            chunks.append(' '.join(current))
            current = [sent]
            current_size = len(sent)
        else:
            current.append(sent)
            current_size += len(sent)
    if current:
        chunks.append(' '.join(current))
    return chunks

This produces chunks that are sentence-aligned — no sentence is cut in half. Each chunk is approximately target_size characters but may be slightly larger or smaller depending on sentence boundaries.

Sentence-based chunking is better than fixed-size because:

  • Sentences are semantically coherent units. Don’t split them.
  • Chunks read naturally when included in LLM prompts.

But it still doesn’t account for paragraph or document structure. A paragraph break is a stronger signal than a sentence break, but sentence-based chunking ignores it.

61.4 Recursive character-based chunking

The chunking strategy used by LangChain’s RecursiveCharacterTextSplitter and most production RAG systems. The idea: try to split on the strongest available boundary, falling back to weaker boundaries if needed.

The algorithm:

  1. Try to split on "\n\n" (paragraph breaks). If the resulting chunks are small enough, done.
  2. If a chunk is too large, try splitting it on "\n" (line breaks).
  3. If still too large, try splitting on ". " (sentence endings).
  4. If still too large, split on whitespace.
  5. If still too large, split on individual characters.

Each splitter tries the strongest separator first, recursing into chunks that are still too large. The result: chunks that respect document structure where possible.

For Markdown documents, the separator list might be ["\n## ", "\n### ", "\n\n", "\n", " ", ""] — split first on headers, then on paragraph breaks, then on lines, etc. For code, the separators might include language-specific markers (function definitions, class definitions).
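The fallback cascade can be sketched in a few lines of Python. This is a simplified sketch of the idea, not LangChain's actual implementation:

```python
def recursive_split(text, max_size=500,
                    separators=("\n\n", "\n", ". ", " ", "")):
    """Split on the strongest separator available, recursing into
    pieces that are still too large."""
    if len(text) <= max_size:
        return [text]
    sep, *rest = separators
    if sep == "":
        # last resort: hard character split
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]
    parts = text.split(sep)
    chunks, current = [], ""
    for i, part in enumerate(parts):
        # re-attach the separator except after the final part
        piece = part + (sep if i < len(parts) - 1 else "")
        if len(current) + len(piece) <= max_size:
            current += piece  # greedily merge small parts
        else:
            if current:
                chunks.append(current)
            if len(piece) > max_size:
                # this part alone is too big: fall back to a weaker separator
                chunks.extend(recursive_split(part, max_size, tuple(rest)))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Swapping in a Markdown- or code-specific separator list changes the behavior without changing the algorithm.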

Recursive character-based chunking is the modern default. It’s the simplest strategy that handles document structure intelligently. It’s available in all major RAG libraries (LangChain, LlamaIndex, Haystack).

For most production RAG, this is what you should use unless you have a reason to do something more elaborate.

61.5 Semantic chunking

A more ambitious approach: use an embedding model to find chunk boundaries based on semantic similarity. The idea: walk through the document sentence by sentence, and start a new chunk whenever the topic shifts.

The algorithm:

  1. Split the document into sentences.
  2. Compute the embedding of each sentence.
  3. Compute the cosine similarity between consecutive sentences.
  4. Identify “valleys” in the similarity (places where consecutive sentences are dissimilar — these are topic shifts).
  5. Use the valleys as chunk boundaries.

The result: chunks that align with topic boundaries, so each chunk is about one topic.
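A minimal sketch of the boundary-finding step, assuming sentence embeddings have already been computed (the embedding model itself is elided):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_chunk(sentences, embeddings, threshold=0.5):
    """Start a new chunk wherever consecutive-sentence similarity
    drops below threshold (a 'valley', i.e. a topic shift)."""
    boundaries = [i + 1 for i in range(len(embeddings) - 1)
                  if cosine(embeddings[i], embeddings[i + 1]) < threshold]
    chunks, start = [], 0
    for b in boundaries:
        chunks.append(" ".join(sentences[start:b]))
        start = b
    chunks.append(" ".join(sentences[start:]))
    return chunks
```

A fixed threshold is the simplest choice; production implementations typically use a percentile of the observed similarity distribution instead.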

The catch: it’s much more expensive than recursive chunking (requires running the embedder on every sentence) and the quality improvement is marginal for most corpora. The improvement is biggest on corpora with mixed topics in the same document (e.g., long blog posts that cover multiple subjects).

Semantic chunking is implemented in LangChain’s SemanticChunker and LlamaIndex’s SemanticSplitterNodeParser. For specific use cases (long-form articles, books, documentation), it can be worth it. For most RAG, recursive character-based chunking is good enough.

61.6 Hierarchical and parent-document chunking

A different approach: store multiple chunk sizes for the same document.

The pattern:

  1. Split the document into small chunks (say, 200 tokens) — these are the search chunks.
  2. Split the document into larger chunks (say, 1000 tokens) — these are the context chunks.
  3. Each search chunk has a reference to the context chunk it belongs to.
  4. At retrieval time, embed and search over the small chunks.
  5. When you find a relevant small chunk, return its parent large chunk to the LLM.

The advantage: search precision is high (small chunks are focused, easy to embed) but context for the LLM is large (the parent chunk has enough context to answer the question).

This is the parent-document retriever pattern, available in LangChain and LlamaIndex. It’s particularly effective for question-answering tasks where the relevant fact is small but needs surrounding context to be understood.
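A minimal sketch of the two-level index, using character offsets for simplicity (a real pipeline would chunk on document structure as discussed above):

```python
def build_parent_index(doc_text, search_size=200, parent_size=1000):
    """Split the document twice: large parent chunks for LLM context,
    small search chunks (each tagged with its parent's id) for retrieval."""
    parents = [doc_text[i:i + parent_size]
               for i in range(0, len(doc_text), parent_size)]
    search_chunks = []
    for pid, parent in enumerate(parents):
        for j in range(0, len(parent), search_size):
            search_chunks.append({"text": parent[j:j + search_size],
                                  "parent_id": pid})
    return search_chunks, parents

def retrieve_parent(hit, parents):
    """Given a search-chunk hit, return the parent chunk for the LLM."""
    return parents[hit["parent_id"]]
```

Only the small chunks are embedded and indexed; the parents live in ordinary storage keyed by id.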

[Figure: parent-document retrieval]
Parent-document retrieval indexes small focused chunks but returns the larger parent to the LLM — combining search precision (small) with generation context (large).

A more elaborate version is hierarchical retrieval: split into multiple levels (paragraphs, sections, chapters), embed each level separately, retrieve at the appropriate level for the query. This is the approach used in research systems for retrieving over books or long documents.

For most production RAG, parent-document with two levels (small for search, large for context) is a strong strategy that significantly outperforms flat chunking.

61.7 Propositional chunking

A research-frontier approach: use an LLM to extract atomic propositions from the document, and use those as the chunks.

A “proposition” is a single atomic fact, written as a complete sentence. For example, the document “Marie Curie was born in 1867 in Warsaw. She won the Nobel Prize twice.” might be propositionalized as:

  • Proposition 1: “Marie Curie was born in 1867.”
  • Proposition 2: “Marie Curie was born in Warsaw.”
  • Proposition 3: “Marie Curie won the Nobel Prize twice.”

Each proposition is a self-contained fact. They’re tiny (one sentence each), focused, and easy to embed.

The advantages:

  • Maximum search precision. Each proposition is one atomic claim.
  • No ambiguity. No chunk contains multiple unrelated facts.

The disadvantages:

  • Expensive to build. Requires LLM calls to propositionalize every document.
  • Loses context. Propositions are too small to be self-contained for some questions.
  • Operationally complex. The propositionalization step is its own pipeline.

Propositional chunking is mostly research-stage. The Dense X Retrieval paper (Chen et al., 2023) showed it can outperform standard chunking by several points on QA benchmarks. As LLM costs drop, this approach becomes more practical.

For now, it’s worth knowing about but not the default.

61.8 Code-specific chunking

Code is a special case. You can’t split a function in half — the resulting fragments are useless. You can’t ignore the file structure — a function references variables defined elsewhere.

Code chunking strategies:

Function-level chunking. Each function is one chunk. Use a language parser (tree-sitter) to identify function boundaries. Each chunk is one logical unit.

Class-level chunking. Each class is one chunk. Useful for object-oriented code where the class is the natural unit of meaning.

File-level chunking. Each file is one chunk. Works for small files but doesn’t scale to large files.

AST-based chunking. Use the abstract syntax tree to identify semantic units (functions, classes, modules) and chunk along those boundaries. The most accurate but most complex.

For production code RAG, function-level chunking with tree-sitter is the standard. Tools like langchain-text-splitters have built-in support for many languages.
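For Python source specifically, the standard library's ast module is enough to sketch function-level chunking (tree-sitter generalizes the same idea across languages):

```python
import ast

def function_chunks(source):
    """Split Python source into one chunk per top-level function or class,
    using the AST to find exact boundaries."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact source text of the node
            chunks.append(ast.get_source_segment(source, node))
    return chunks
```

Module-level statements between definitions are dropped in this sketch; a fuller version would collect them into their own chunk.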

61.9 Overlap and the boundary problem

Regardless of which chunking strategy you use, you should add overlap between chunks. The reason: a relevant fact might be near the boundary of a chunk, and overlap ensures it’s captured fully in at least one chunk.

The overlap is typically 10-20% of the chunk size. For 500-character chunks, 50-100 characters of overlap.

def chunk_with_overlap(text, chunk_size=500, overlap=50):
    assert overlap < chunk_size  # otherwise the loop can't make progress
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks

The overlap costs storage (more chunks per document) but improves retrieval quality measurably. The trade-off is favorable.

A more sophisticated version: overlap on sentence boundaries rather than character boundaries. This ensures the overlap is meaningful.
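A sketch of sentence-aligned overlap, carrying the last sentence(s) of each chunk forward into the next:

```python
def sentence_overlap_chunks(sentences, target_size=500, overlap_sents=1):
    """Group sentences into ~target_size chunks; the final overlap_sents
    sentences of each chunk are repeated at the start of the next."""
    chunks = []
    current, size, fresh = [], 0, 0
    for sent in sentences:
        current.append(sent)
        size += len(sent)
        fresh += 1  # sentences added since the last flush
        if size >= target_size:
            chunks.append(" ".join(current))
            current = current[-overlap_sents:]  # carry overlap forward
            size = sum(len(s) for s in current)
            fresh = 0
    if fresh:  # flush a trailing chunk only if it contains new sentences
        chunks.append(" ".join(current))
    return chunks
```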

61.10 Chunk-level metadata

Each chunk should carry metadata that helps with retrieval and presentation:

  • Source document: the original document the chunk came from.
  • Position: where in the document the chunk is (chunk index, character offset).
  • Section/heading: the section header the chunk belongs to (for hierarchical documents).
  • Document type: PDF, HTML, code, etc.
  • Author / creation date: if you want to filter or boost based on these.
  • Tags / categories: for filtered retrieval.

The metadata is stored alongside the chunk in the vector database. At query time, you can:

  • Filter by metadata (e.g., “only return chunks from documents tagged ‘product-docs’”).
  • Boost by metadata (e.g., “prefer recent chunks”).
  • Display metadata to the user (e.g., “this answer came from page 47 of report X”).

Most vector databases (Qdrant, Weaviate, Pinecone) support metadata-based filtering natively. Use it.

A common pattern: add the document title and section header to the chunk’s text before embedding. This gives the chunk additional context that helps the embedder. For example:

Document: "Quarterly Earnings Report Q3 2024"
Section: "Operating Income"

Chunk text: "Operating income for the quarter was $1.2 billion, up 15% year over year..."

The combined text is embedded as one. The result is often better retrieval than embedding just the raw chunk.
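A sketch of this enrichment; format_for_embedding is a hypothetical helper name:

```python
def format_for_embedding(chunk_text, title, section):
    """Prepend document title and section header so the embedder sees
    the chunk's context, not just its raw text."""
    return f"Document: {title}\nSection: {section}\n\n{chunk_text}"
```

Store the raw chunk for display, and embed the enriched text.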

61.11 The empirical results

Concrete numbers for chunking strategy impact on retrieval quality (nDCG@10 on a typical RAG benchmark):

Strategy                                                   nDCG@10
Fixed-size, 500 chars, no overlap                          0.45
Fixed-size, 500 chars, 50 overlap                          0.48
Sentence-based, 500 chars target                           0.51
Recursive character-based, 500 chars                       0.54
Recursive + parent document (200 search / 1000 context)    0.59
Semantic chunking                                          0.56
Propositional chunking                                     0.58

The pattern: smarter chunking gives 5-15 points of improvement over the naive baseline. The biggest wins come from:

  • Respecting document structure (recursive over fixed-size): +5-7 points.
  • Adding overlap: +1-3 points.
  • Parent-document retrieval: +5-8 points on top of recursive.

These are large effects. Tuning the chunking is one of the highest-leverage things you can do for RAG quality.

[Figure: chunking strategy impact on nDCG@10]
Parent-document retrieval (small chunks for search, large parent for context) gives the largest single gain over the fixed-size baseline — 14 points of nDCG@10 versus 3 points from simply adding overlap.

61.12 Picking a strategy

The decision tree:

Q: What kind of documents?

  • Plain text articles → recursive character-based with paragraph priority.
  • Markdown → recursive with header priority.
  • PDFs → recursive after extraction (use a good PDF parser like pdfplumber or unstructured).
  • Code → function-level with tree-sitter.
  • Mixed structured data → use document-type-specific strategies and unify in the index.

Q: How long are the documents?

  • Short (< 1000 tokens): one chunk per document is fine.
  • Medium (1000-10000): recursive chunking with 500-1000 char chunks.
  • Long (> 10000): hierarchical or parent-document approach.

Q: What kind of questions will users ask?

  • Specific facts: smaller chunks (200-500 tokens) for precision.
  • Open-ended questions: larger chunks (1000-2000) for context.
  • Both: parent-document approach.

Q: How much can you spend on the chunking pipeline?

  • Minimal (just split): fixed-size or recursive.
  • Moderate: recursive + parent-document.
  • High (LLM-powered): propositional or semantic chunking.

For most production RAG with mixed text content, recursive character-based chunking with parent-document retrieval, 500-token search chunks, 1500-token context chunks, with metadata enrichment is a strong default. Tune from there based on your specific evaluation.

61.13 The mental model

Eight points to take into Chapter 62:

  1. Chunking is the most underrated lever in RAG quality. Tune it.
  2. Fixed-size chunking is the baseline. Almost always suboptimal but easy to start with.
  3. Recursive character-based chunking respects document structure. Modern default.
  4. Parent-document retrieval uses small chunks for search, large chunks for LLM context. Big quality win.
  5. Semantic chunking uses embeddings to find topic boundaries. Marginal gain, expensive.
  6. Propositional chunking uses an LLM to extract atomic facts. Research-stage but promising.
  7. Code needs language-aware chunking at function level.
  8. Add overlap and metadata. Both improve quality measurably.

In Chapter 62 we look at the cross-encoder reranking step that comes after retrieval.


Read it yourself

  • The LangChain documentation on text splitters.
  • The LlamaIndex documentation on node parsers.
  • Chen et al., Dense X Retrieval: What Retrieval Granularity Should We Use? (2023). The propositional chunking paper.
  • The Unstructured library documentation (for parsing PDFs and other complex formats).
  • Pinecone’s blog series on chunking strategies.

Practice

  1. Implement recursive character-based chunking in 30 lines of Python without using LangChain.
  2. Why does chunk overlap improve retrieval quality? Construct an example where it matters.
  3. For a corpus of code repositories, why is function-level chunking better than fixed-size? Argue.
  4. Read the Dense X Retrieval paper. Identify the propositionalization prompt and consider its cost.
  5. Why does adding the document title and section header to the chunk text before embedding improve retrieval?
  6. For a corpus of Wikipedia articles where users ask both factual questions and open-ended ones, design a chunking strategy.
  7. Stretch: Build a parent-document retrieval pipeline with LangChain on a small corpus. Measure retrieval quality vs flat chunking.