Query rewriting, HyDE, multi-query, query decomposition
"The retriever is only as good as the query you give it. Sometimes the user's query is not the right query."
We’ve covered the retrieval and reranking sides of RAG (Chapters 55-60). In this chapter we look at the query side: techniques that transform the user’s query before it hits the retriever, in order to get better candidates.
The user’s query is often suboptimal for retrieval. It might be:
- Too short to give the retriever signal.
- Too vague to match anything specific.
- In the wrong vocabulary — using different words than the documents.
- Too complex — asking multiple questions at once that need separate retrievals.
- Implicit — referring to context the retriever doesn’t have.
The query rewriting techniques in this chapter address each of these. They use LLMs to transform the query into something better. The cost is an extra LLM call before retrieval; the benefit is significantly better retrieval quality.
By the end you’ll know HyDE, multi-query, query decomposition, and the other rewriting tricks, and when each is appropriate.
Outline:
- The query-document gap.
- Query expansion.
- HyDE — Hypothetical Document Embeddings.
- Multi-query rewriting.
- Query decomposition for multi-hop questions.
- Conversational query rewriting.
- Step-back questions.
- The cost-benefit picture.
- The decision matrix.
63.1 The query-document gap
The fundamental issue: users write queries; documents are not queries. The user might ask “How do I fix a leaky faucet?” but the relevant document is titled “Plumbing repair manual” and contains words like “valve seat,” “washer,” and “compression fitting” — none of which appear in the query.
The retriever has to bridge this gap. Bi-encoders try to do it via semantic similarity (Chapter 58). BM25 doesn’t bridge it at all. Hybrid retrieval (Chapter 60) helps by combining both. But there’s still a gap.
The query rewriting approach: transform the query into something closer to the documents, by using an LLM to do the bridging. This adds compute and complexity but consistently improves retrieval.
The transformations fall into a few categories:
- Expansion: add synonyms or related terms to the original query.
- Generation: have the LLM generate a hypothetical “ideal answer,” then retrieve documents similar to the hypothetical answer.
- Decomposition: split a complex query into multiple simpler queries.
- Reformulation: rewrite the query in a form closer to the documents.
- Contextualization: incorporate conversational context into a standalone query.
Each technique serves a different failure mode.
```mermaid
graph TD
    Q[User query] --> CHECK{Query type?}
    CHECK -->|multi-turn chat| CR[Conversational rewriting]
    CHECK -->|vocabulary gap| HYDE[HyDE — generate hypothetical answer]
    CHECK -->|multi-hop| DECOMP[Query decomposition]
    CHECK -->|ambiguous| MULTI[Multi-query rewriting]
    CHECK -->|simple / latency-critical| PASS[Pass through unchanged]
    CR --> RET[Retrieval]
    HYDE --> RET
    DECOMP --> RET
    MULTI --> RET
    PASS --> RET
    style HYDE fill:var(--fig-accent-soft),stroke:var(--fig-accent)
    style CR fill:var(--fig-accent-soft),stroke:var(--fig-accent)
```
The decision tree for query rewriting: conversational rewriting is near-universal for chat; HyDE addresses vocabulary gap; decomposition handles multi-hop; latency-critical paths skip rewriting.
63.2 Query expansion
The simplest rewriting: add synonyms or related terms to the query, then retrieve with the expanded version.
The classical (pre-LLM) version used WordNet or other thesauri. Modern versions use an LLM:
```
Original query: "How do I fix a leaky faucet?"

LLM prompt: "Expand this query with related terms and synonyms: ..."

Expanded query: "How do I fix a leaky faucet? plumbing repair valve seat
washer compression fitting drip leak faucet maintenance"
```
You then retrieve with the expanded query. The added terms help the retriever (especially BM25) find documents that use different vocabulary.
Query expansion is simple, cheap (one LLM call), and effective for lexical retrieval. It’s less useful for dense retrieval, which already has some semantic understanding.
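As a concrete sketch, query expansion is a one-call transform. Here `llm` is a hypothetical stand-in for whatever completion function you use (e.g. a wrapper around an OpenAI chat call); it is an assumption, not a fixed API:

```python
def expand_query(query: str, llm) -> str:
    """Append LLM-suggested related terms to the original query.

    `llm` is any callable prompt -> completion string (a hypothetical
    stand-in for your model client).
    """
    prompt = (
        "List 5-10 search terms and synonyms related to this query, "
        "separated by spaces, with no explanations:\n" + query
    )
    related_terms = llm(prompt)
    # Keep the original query so exact matches still score highly,
    # and append the expansion terms for BM25 to pick up.
    return query + " " + related_terms
```

Retrieve with the returned string in place of the original query; for dense retrieval the added terms usually matter less.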
63.3 HyDE — Hypothetical Document Embeddings
HyDE (Gao et al., 2022) is a clever rewriting trick. The idea:
- Generate a hypothetical answer to the query using an LLM. The hypothetical answer is what the LLM thinks a relevant document would look like, even though it’s hallucinated.
- Embed the hypothetical answer with the embedder.
- Retrieve documents similar to the hypothetical answer’s embedding, not the original query’s embedding.
For example:
Query: “How do I fix a leaky faucet?”
LLM-generated hypothetical answer: “To fix a leaky faucet, first turn off the water supply under the sink. Remove the faucet handle by unscrewing the set screw. Replace the worn washer or O-ring inside the valve seat. Reassemble and test.”
This hypothetical answer is much closer in vocabulary to a real plumbing manual than the original query. Embedding it gives a vector that’s close to actual relevant documents.
HyDE is surprising in that it works even though the hypothetical answer is potentially wrong. The hypothetical answer might confuse a faucet’s washer with a hose’s washer; the LLM might make up “the Phillips screwdriver method.” None of this matters because we’re not using the answer directly — we’re using it as a retrieval anchor. The LLM only needs to produce something topically similar to a real answer for the retrieval to find good documents.
The cost: one extra LLM call per query (a small model is fine; roughly 50-100 ms of added latency).
The benefit: substantial improvement on queries where the user’s wording is far from document wording. HyDE is particularly effective for out-of-domain retrieval (queries about topics the embedder wasn’t trained on).
HyDE is implemented in LangChain and LlamaIndex. For RAG systems with diverse query/document vocabularies, it’s worth using.
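A minimal HyDE sketch, with hypothetical `llm`, `embed`, and `vector_search` callables standing in for your model client, embedder, and vector store (none of these names come from a specific library):

```python
def hyde_retrieve(query, llm, embed, vector_search, k=10):
    # 1. Generate a hypothetical answer. It may be factually wrong;
    #    it is used only as a retrieval anchor, never shown to the user.
    hypothetical = llm("Write a short passage answering: " + query)
    # 2. Embed the hypothetical answer instead of the query itself.
    anchor = embed(hypothetical)
    # 3. Retrieve documents whose embeddings are near the anchor.
    return vector_search(anchor, k)
```

Some implementations average the query embedding with the hypothetical-answer embedding (or with several hypothetical answers) rather than replacing the query embedding outright.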
63.4 Multi-query rewriting
A more brute-force approach: generate multiple rewritings of the query and retrieve with each, then merge the results.
```
Original query: "How do I fix a leaky faucet?"

LLM-generated rewritings:
1. "Steps to repair a dripping kitchen faucet"
2. "Common causes of faucet leaks and solutions"
3. "DIY plumbing: faucet washer replacement guide"
```
Each rewriting is sent to the retriever independently, and the results are merged (typically with RRF, Chapter 60). The union covers more of the document space than any single query alone.
The cost: one LLM call to generate the rewritings, plus N retrieval calls (one per rewriting).
The benefit: better recall, especially for queries that have multiple valid interpretations. If the user’s query is ambiguous, multi-query covers more interpretations.
The trade-off: more retrieval cost (N times instead of 1). For latency-critical applications, this can be too much.
A simpler version: generate just one rewriting in addition to the original, and retrieve with both. This doubles the retrieval cost but gives most of the benefit.
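The generate-retrieve-merge loop can be sketched as follows, again with hypothetical `llm` and `retrieve` callables, and RRF as the merge (`rrf_k=60` is the conventional default constant, not a tuned value):

```python
def multi_query_retrieve(query, llm, retrieve, n=3, k=10, rrf_k=60):
    """Generate n rewritings, retrieve with each plus the original,
    and merge the ranked lists with Reciprocal Rank Fusion."""
    prompt = ("Rewrite this query " + str(n) +
              " different ways, one per line:\n" + query)
    rewrites = [q.strip() for q in llm(prompt).splitlines() if q.strip()]
    queries = [query] + rewrites[:n]
    scores = {}
    for q in queries:
        # Each ranked list contributes 1 / (rrf_k + rank) per document,
        # so documents that appear in several lists rise to the top.
        for rank, doc_id in enumerate(retrieve(q, k)):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

The "simpler version" above is this with `n=1`.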
63.5 Query decomposition for multi-hop questions
Some queries are actually multiple questions that need to be answered in sequence:
“What’s the population of the country where the inventor of the lightbulb was born?”
This is a multi-hop question. To answer it, you need to:
- Find the inventor of the lightbulb (Edison, born in the US).
- Find the population of the US.
A single retrieval can’t answer this — there’s no single document that mentions both Edison’s birthplace and the US population. You need to decompose the query into sub-questions, retrieve for each, and chain the answers.
The decomposition is done by an LLM:
```
LLM prompt: "Break down this question into sub-questions that can be
answered one at a time: ..."

Sub-questions:
1. Who invented the lightbulb?
2. Where was that person born?
3. What is the population of that country?
```
Each sub-question is retrieved separately. The answers are chained: the answer to question 1 feeds into question 2, the answer to question 2 feeds into question 3.
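The decompose-and-chain loop can be sketched as follows, assuming a hypothetical `retrieve_and_answer(sub_question, context)` helper that runs retrieval plus answer generation for one hop:

```python
def decompose_and_answer(question, llm, retrieve_and_answer):
    """Split a multi-hop question into sub-questions and answer them
    in order, feeding each answer into the next hop's context."""
    raw = llm("Break this question into sub-questions, one per line:\n"
              + question)
    sub_questions = [s.strip() for s in raw.splitlines() if s.strip()]
    context, answer = "", ""
    for sub in sub_questions:
        # Each hop sees the Q/A pairs from earlier hops, so references
        # like "that person" or "that country" can be resolved.
        answer = retrieve_and_answer(sub, context)
        context += "Q: " + sub + "\nA: " + answer + "\n"
    # The final hop's answer answers the original question.
    return answer
```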
This is query decomposition, and it's the foundation of multi-hop question answering. Implementations include LlamaIndex's SubQuestionQueryEngine and various agent-based approaches in frameworks like LangChain.
The cost is high: multiple LLM calls (one per hop) plus multiple retrievals. The benefit is being able to answer questions that single-hop retrieval simply can’t handle.
For most chat applications, multi-hop is rare. For knowledge-base Q&A and research assistants, it’s essential.
63.6 Conversational query rewriting
A specific case: the user’s query refers to context from earlier in the conversation. For example:
```
User: What's the capital of France?
Assistant: Paris.
User: What's its population?
```
The second query, “What’s its population?”, is meaningless without the conversation history. The retriever doesn’t have the history; it just sees “What’s its population?” — which matches nothing.
The fix: rewrite the query to be standalone, incorporating the conversation context:
```
LLM prompt: "Given the conversation history, rewrite the user's latest
question as a standalone question that doesn't require the conversation."

Conversation:
- User: What's the capital of France?
- Assistant: Paris.
- User: What's its population?

Standalone question: "What is the population of Paris?"
```
The standalone question is what gets sent to the retriever. The LLM resolves the “its” reference to “Paris.”
This is conversational query rewriting and it’s essential for any chat-based RAG application. Without it, follow-up questions fail.
The implementation is straightforward: before retrieval, run the query through an LLM with the conversation history and a “rewrite as standalone” prompt. The standalone query is what’s used for retrieval. Most chat-RAG frameworks (LangChain, LlamaIndex) have this built in.
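A sketch of that flow, with a hypothetical `llm` callable; `history` is assumed to be a list of `(role, text)` turns:

```python
def rewrite_standalone(history, latest, llm):
    """Rewrite the latest user turn as a standalone question,
    using the conversation history to resolve references."""
    convo = "\n".join(role + ": " + text for role, text in history)
    prompt = (
        "Given the conversation history, rewrite the user's latest "
        "question as a standalone question that doesn't require "
        "the conversation.\n\n"
        "Conversation:\n" + convo + "\n\n"
        "Latest question: " + latest + "\n\n"
        "Standalone question:"
    )
    return llm(prompt).strip()
```

The returned standalone question, not the raw user turn, is what goes to the retriever.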
The cost: one LLM call per turn (using a small model is fine).
The benefit: multi-turn chat actually works.
63.7 Step-back questions
A creative technique: ask a more abstract version of the query first.
For example, instead of asking “What’s the average GDP of countries that have won the World Cup more than twice?”, first ask the step-back question: “Which countries have won the World Cup more than twice?”. Retrieve and answer that. Then use the answer to inform retrieval for the more specific question.
The step-back is useful for queries where the specific version is hard to retrieve directly but the more abstract version is easier.
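A minimal sketch of the retrieval half of step-back, assuming hypothetical `llm` and `retrieve` callables; here the contexts from both questions are simply pooled for the generator, rather than chained:

```python
def step_back_retrieve(query, llm, retrieve, k=10):
    """Retrieve for a more abstract 'step-back' question as well as
    the original, and pool the two result lists."""
    abstract = llm(
        "Write a more general version of this question whose answer "
        "would provide useful background for answering it:\n" + query
    ).strip()
    # The abstract question's results supply background; the original
    # question's results supply specifics. Both go to the generator.
    return retrieve(abstract, k) + retrieve(query, k)
```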
Step-back prompting is an active research area. The Zheng et al. paper, Take a Step Back (2023), introduced the formal version. It’s implemented in some advanced RAG frameworks.
For most production RAG, step-back is overkill. For highly analytical queries (financial analysis, multi-step reasoning), it can help.
63.8 The cost-benefit picture
Each rewriting technique adds latency (an extra LLM call) and possibly compute (multiple retrievals). The question is whether the quality improvement is worth it.
The empirical numbers (rough):
| Technique | Latency overhead | Quality improvement |
|---|---|---|
| Query expansion | +50 ms | +2-5 points |
| HyDE | +100 ms | +5-10 points |
| Multi-query (3 rewritings) | +100 ms + 3× retrieval | +5-8 points |
| Query decomposition | +200 ms + N× retrieval | +10-20 points (multi-hop only) |
| Conversational rewriting | +50 ms | Critical for multi-turn (otherwise fails) |
| Step-back | +200 ms + 2× retrieval | +5-10 points (specific cases) |
The latency overhead is small relative to typical end-to-end RAG latency (often 10+ seconds). The quality improvement is substantial.
For most production RAG, conversational rewriting + HyDE (or multi-query) is a strong default. They cover the most common failure modes and add ~150 ms to TTFT.
For multi-hop QA, query decomposition is essential. For latency-critical chat, skip multi-query and HyDE in favor of just conversational rewriting.
63.9 The decision matrix
| Use case | Recommended rewriting |
|---|---|
| Single-turn Q&A, default | HyDE |
| Multi-turn chat | Conversational rewriting (essential) |
| Multi-turn chat with diverse vocabulary | Conversational rewriting + HyDE |
| Multi-hop questions | Query decomposition |
| Out-of-domain queries | Query expansion + HyDE |
| Latency-critical (sub-100ms TTFT) | No rewriting |
| Complex analytical queries | Step-back + decomposition |
| Search-engine-style | Query expansion |
For most production chat-RAG: conversational rewriting + optional HyDE. For more sophisticated applications, layer on more techniques.
63.10 The mental model
Eight points to take into Chapter 64:
- The user’s query is often not the right query for retrieval. Rewriting helps.
- Query expansion adds synonyms. Cheap, helps lexical retrieval.
- HyDE generates a hypothetical answer and retrieves on its embedding. Bridges the query-document vocabulary gap.
- Multi-query generates multiple rewritings and merges retrievals. Better recall, more cost.
- Query decomposition splits multi-hop questions into sub-questions. Essential for multi-hop QA.
- Conversational rewriting turns context-dependent queries into standalone ones. Essential for multi-turn chat.
- Step-back questions ask the abstract version first. Useful for analytical queries.
- For most production RAG, conversational rewriting + HyDE is a strong default.
In Chapter 64 we look at the question of how to know if your RAG actually works: RAG evaluation.
Read it yourself
- Gao et al., Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE, 2022).
- Wang et al., Query2doc: Query Expansion with Large Language Models (2023).
- Zheng et al., Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models (2023).
- The LangChain documentation on multi-query retrievers and HyDE.
- The LlamaIndex documentation on sub-question query engines.
Practice
- Implement HyDE in 30 lines of Python with the OpenAI API and a vector store of your choice.
- Why does HyDE work even though the hypothetical answer is potentially hallucinated? Argue at the level of “what we’re using the answer for.”
- For a multi-turn chat about a specific document, design a conversational rewriting prompt.
- Decompose this multi-hop question into sub-questions: “What’s the second-largest country in the largest continent by population?”
- Why is query expansion more useful for lexical retrieval (BM25) than for dense retrieval?
- For a customer support chatbot, which rewriting techniques would you use? Argue.
- Stretch: Build a chat-RAG pipeline with conversational rewriting and HyDE. Test it on a multi-turn conversation and verify both techniques activate when they should.