Part III · Inference Internals & Production Serving
Chapter 23 Core ~17 min read

Batching: static, dynamic, continuous (Orca)

"In a world where decode is memory-bandwidth-bound, the only way to make a GPU earn its keep is to make every weight read serve as many users as possible."

In Chapter 21 we showed that decode is memory-bandwidth-bound — the GPU spends most of its time reading weights from HBM, and the actual arithmetic is small. In Chapter 22 we showed how the KV cache eliminates redundant computation within a single request. This chapter is about the third leg of the stool: how to share a single forward pass across many concurrent requests so the cost of reading the weights is amortized.

This is batching, and the modern variant — continuous batching — is one of the most important inventions in LLM serving. It’s the difference between a GPU running at 5% utilization and one running at 90%, and it’s the technique, introduced by the Orca paper, that made vLLM the dominant serving stack.

Outline:

  1. The amortization argument.
  2. Static batching — the naive baseline.
  3. Dynamic batching — the first improvement.
  4. The padding problem and why naive batching wastes throughput.
  5. Continuous (token-level, iteration-level) batching.
  6. The Orca paper.
  7. The scheduling decisions.
  8. The latency variance trade.
  9. Continuous batching with prefill.
  10. The implementation in vLLM.

23.1 The amortization argument

Recall from Chapter 21 the arithmetic intensity of a decode-step matmul:

AI(N) ≈ N FLOPs/byte

where N is the batch size — the number of concurrent users in the same forward pass. Single-user decode has AI=1, which is hundreds of times below the H100’s compute-vs-memory crossover (~330 FLOPs/byte). Batched decode with N=64 has AI=64, still below the crossover but 64× better.

The amortization argument: the cost of reading the weights from HBM is paid once per forward pass, regardless of how many users are in the batch. If you can fit 64 users into the same forward pass, you pay one weight-read cost and serve 64 users. If you can fit 256 users into the same forward pass, you serve 256 users with the same one weight-read cost.

The throughput gain is nearly linear in batch size, because the marginal cost per user is just the small per-position arithmetic (which is dwarfed by the fixed weight-read cost in the memory-bound regime). A 70B model that can serve 20 tokens/sec/user at batch size 1 can serve roughly the same per-user rate at batch size 64 — but the total throughput is now 64 × 20 = 1280 tokens/sec instead of 20.

This is the entire economic case for batching in LLM serving. Without batching, GPU economics for LLM serving don’t work. With batching, they’re tractable. The question is just how aggressively you can batch and how to schedule the batches.
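To put numbers on the amortization argument, here is a toy roofline model of a decode step. The constants (a 70B-parameter model in FP16, H100-class bandwidth and compute) are illustrative assumptions, not measurements:

```python
# Toy roofline model of one decode step, under assumed hardware numbers.
WEIGHT_BYTES = 70e9 * 2       # 70B params at 2 bytes each (FP16)
BANDWIDTH = 3.35e12           # assumed HBM bandwidth, bytes/sec
PEAK_FLOPS = 990e12           # assumed dense FP16 peak, FLOPs/sec

def decode_step_time(batch_size):
    """Seconds per decode step: max of weight-read time and compute time."""
    flops = 2 * 70e9 * batch_size      # ~2 FLOPs per parameter per token
    return max(WEIGHT_BYTES / BANDWIDTH, flops / PEAK_FLOPS)

for n in (1, 64, 256):
    total_tput = n / decode_step_time(n)   # tokens/sec across all users
    print(f"batch {n:4d}: {total_tput:10.0f} tokens/sec total")
```

All three batch sizes pay the same ~42 ms weight-read cost per step, so total throughput scales almost linearly with N until the compute roofline is reached.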

23.2 Static batching — the naive baseline

The simplest batching strategy: collect a batch of N requests, run them all together, return the results.

def static_batch_serve(model, requests):
    batch = pad_to_max_length(requests)         # right-pad to longest
    outputs = model.generate(batch, max_tokens=256)
    return outputs

This is static batching. It’s the obvious approach if you’ve used a deep learning framework and you want to apply it to LLMs. It also has two killer flaws.

Flaw 1: padding waste. Different users have prompts of different lengths and will generate completions of different lengths. To put them in one tensor, you have to pad the shorter ones to the length of the longest. Every padded position is GPU work that produces no useful output.
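The pad_to_max_length helper in the snippet above is where the damage happens. A minimal sketch (assuming token-ID lists and a pad ID of 0) makes the waste concrete:

```python
# Minimal sketch of right-padding, assuming token-ID lists and pad_id=0.
# Real stacks also return the attention mask so padded positions are
# masked out of the computation's results (but still occupy GPU work).
def pad_to_max_length(token_lists, pad_id=0):
    max_len = max(len(t) for t in token_lists)
    padded = [t + [pad_id] * (max_len - len(t)) for t in token_lists]
    mask = [[1] * len(t) + [0] * (max_len - len(t)) for t in token_lists]
    return padded, mask

batch, mask = pad_to_max_length([[5, 6, 7], [9]])
# batch -> [[5, 6, 7], [9, 0, 0]]; mask -> [[1, 1, 1], [1, 0, 0]]
```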

[Figure: a static batch of 4 requests, each padded to the longest length (400 tokens). Req A: 400 real tokens; Req B: 50 real + 350 padding; Req C: 200 real + 200 padding; Req D: 30 real + 370 padding. Useful slots: 680 / 1600 ≈ 43% utilization, 57% waste.]
Static batching wastes the majority of GPU cycles on padding tokens — the GPU must process every slot in the padded tensor, even if it holds no real content.

For a batch with prompts of length 50, 200, 1000, and 4000 tokens, you have to allocate space for 4 × 4000 = 16,000 token positions, even though the actual work is only 50 + 200 + 1000 + 4000 = 5,250 token positions. The batch is 33% useful work and 67% padding.

Flaw 2: head-of-line blocking. A static batch of N requests can’t return any result until all N requests are done. If 99 requests finish in 10 steps and one runs for 1000 steps, the 99 fast requests have to wait the full 1000 steps before their results are returned. The slowest request determines the latency for everyone in the batch.

These flaws make static batching nearly useless for LLM serving. It works for image classification (where every input has the same shape and produces one output) but not for variable-length, variable-cost generation tasks. Modern LLM serving stacks abandoned it years ago.

23.3 Dynamic batching — the first improvement

Dynamic batching is a small step up: instead of waiting for a fixed batch of N requests, the server collects whatever requests have arrived in a small time window (say 50ms), batches them together, and runs the forward pass.

def dynamic_batch_serve(server):
    while True:
        requests = server.get_requests(max_wait_ms=50, max_batch=64)
        if requests:
            batch = pad_to_max_length(requests)
            outputs = model.generate(batch, max_tokens=256)
            for req, out in zip(requests, outputs):
                req.return_result(out)   # `return` is a Python keyword, so the method needs another name

Dynamic batching helps when the request rate is high enough that you usually have multiple requests in the time window. It still suffers from both flaws of static batching — padding waste within the batch, and head-of-line blocking.

Dynamic batching is the standard for non-LLM model serving (BERT classification, image embedding, etc.), where each request has a fixed compute cost. For LLMs with variable-length generation, it’s not enough.

23.4 The padding problem in detail

Let’s quantify how bad the padding is. For a batch of LLM requests, you have to pad along the sequence dimension, which means every batch entry is allocated space for the longest sequence’s full prefill + decode length.

In a real serving workload:

  • Prompt lengths range from ~10 tokens (short questions) to ~10000 tokens (RAG-heavy queries).
  • Completion lengths range from ~5 tokens (yes/no answers) to ~2000+ tokens (detailed explanations, reasoning, code).

A batch of 32 requests might have:

  • Min total length: 15 tokens
  • Max total length: 8000 tokens
  • Mean total length: 800 tokens

With static batching, every entry is allocated 8000 token slots. Total slots: 32 × 8000 = 256,000. Useful slots: 32 × 800 = 25,600. Padding fraction: 90%.

You’re paying for 256,000 token forward passes and getting 25,600 tokens of useful work. The GPU is spending 90% of its time on padding tokens that get masked out at the end. This is a disaster.
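The same arithmetic as a small helper, reusing the four prompt lengths from the earlier example:

```python
# Padding overhead of a static batch: every entry is padded to the max.
def padding_fraction(lengths):
    slots = len(lengths) * max(lengths)   # allocated token positions
    useful = sum(lengths)                 # positions holding real tokens
    return 1 - useful / slots

print(padding_fraction([50, 200, 1000, 4000]))   # ~0.67 wasted
print(padding_fraction([800] * 32))              # 0.0: uniform lengths waste nothing
```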

The fix is to never materialize padded batches. Instead, batch at the token level, not the request level. This is the essence of continuous batching.

23.5 Continuous (iteration-level) batching

Continuous batching (also called iteration-level batching, or token-level batching) restructures the serving loop so that:

  1. The batch of tokens being processed in a single forward pass is dynamic — composed of whatever tokens from whatever requests are currently in flight.
  2. A request can enter the batch as soon as it arrives, and exit the batch as soon as it emits EOS, without waiting for other requests in the batch.
  3. Different requests in the same batch can be at different positions in their generation. Request A might be decoding token 50; request B might be decoding token 200; request C might be doing prefill of its prompt. They all share the same forward pass.

This is a fundamentally different scheduling model. The batch is no longer a fixed group of requests; it’s a fluid set of in-progress sequences that change every step.

[Figure: continuous batching timeline over five forward-pass iterations. Seq A keeps generating throughout; Seq B hits EOS at iteration 3 and its result is returned immediately; Seq C is admitted at iteration 2 and Seq D at iteration 4. The batch composition changes every iteration, with no idle slots and no padding; each forward pass is exactly as big as the live sequence count demands.]
Continuous batching keeps GPU utilization near 100% by letting sequences enter and exit each iteration independently — finished requests return results immediately rather than waiting for the slowest sequence.

The basic structure:

def continuous_batch_serve(server, model, kv_cache_pool):
    in_flight = []   # list of in-progress sequences
    
    while True:
        # 1. Admit new requests if the batch has room
        while server.has_pending() and len(in_flight) < max_concurrent:
            req = server.next_request()
            in_flight.append(create_sequence(req))
        
        # 2. Run one forward pass for the batch
        # The batch is variable: each sequence contributes its current token
        batch_input = make_batch_input(in_flight)   # shape (N_active, 1) for decode-only
        logits = model(batch_input, kv_cache_for(in_flight))
        
        # 3. Sample next token for each sequence
        finished = []
        for seq, logit in zip(in_flight, logits):
            next_token = sample(logit)
            seq.append(next_token)
            
            # 4. Check for completion
            if next_token == EOS or seq.length == max_tokens:
                seq.complete()
                finished.append(seq)
                seq.return_to_user()
        
        # Remove finished sequences after the loop — mutating in_flight
        # while iterating over it would skip entries.
        for seq in finished:
            in_flight.remove(seq)

The key shifts from static batching:

  • No padding. Each step, the batch contains exactly the tokens that are actively being generated. No padded slots, no wasted FLOPs.
  • No head-of-line blocking. When a sequence finishes (EOS), it leaves the batch immediately and its result is returned to the user. The other sequences keep going.
  • New requests enter the batch on the next forward pass. No waiting for the current batch to finish.

The result is that the GPU is much more consistently utilized. Throughput goes up dramatically (often 10–20×) compared to static batching, and per-request latency goes down (because head-of-line blocking is gone).

23.6 The Orca paper

The seminal paper on continuous batching is Orca (Yu et al., OSDI 2022). The full title is Orca: A Distributed Serving System for Transformer-Based Generative Models.

The paper’s key contributions:

(1) Iteration-level scheduling. The paper introduced the term and the formalism. Each iteration (one forward pass) processes whatever tokens are currently in flight, with new requests entering and finished requests leaving between iterations.

(2) Selective batching. A subtle but important point. Some operations in a transformer (the linear layers, the FFN) can naturally batch across requests — the matmul just gets a bigger batch dimension. But attention can’t naturally batch across requests because each request has a different KV cache. The Orca paper showed how to handle this: batch the linear ops together, but compute attention per-request (or per-group-of-requests-with-the-same-length).

(3) The scheduling problem. The paper formalized the scheduling challenge: given a queue of pending requests and a fixed memory budget, which requests should be in the current batch? Orca proposed a simple FCFS (first-come-first-serve) scheduler with admission control.

(4) Empirical results. Orca showed throughput improvements of 5–37× over static batching on real LLM workloads. This was the moment that “continuous batching” went from “interesting research idea” to “thing every production stack must implement.”

After Orca, vLLM (Kwon et al., 2023) extended the idea with PagedAttention (Chapter 24) for efficient KV cache management, and continuous batching became the de facto standard.
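The data flow of selective batching can be sketched in a few lines of NumPy. The names, shapes, and the single collapsed KV array are illustrative simplifications for exposition, not Orca’s actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d))          # stand-in for one linear layer

# Three in-flight requests, one decode token each, with KV caches of
# different lengths; K and V are collapsed into one array for brevity.
tokens = rng.standard_normal((3, d))     # (total_tokens, d)
kv_caches = [rng.standard_normal((s, d)) for s in (5, 17, 2)]

q = tokens @ W                           # batched: one matmul for all requests
outputs = []
for qi, kv in zip(q, kv_caches):         # attention: one request at a time
    scores = kv @ qi / np.sqrt(d)        # (S,) logits against this cache
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    outputs.append(probs @ kv)           # (d,) weighted sum over the cache
out = np.stack(outputs)                  # (3, d), re-packed for the next layer
```

The linear layer sees one big (total_tokens, d) matmul regardless of how many requests contributed; only the attention loop has to know about per-request cache boundaries.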

23.7 The scheduling decisions

Continuous batching is a scheduler. Like any scheduler, it has to make decisions:

stateDiagram-v2
  [*] --> Waiting : request arrives
  Waiting --> Running : admitted (KV budget available)
  Running --> Running : decode step (token generated)
  Running --> Waiting : preempted (memory pressure)
  Running --> Finished : EOS or max_tokens
  Finished --> [*] : result returned to user
  Waiting --> Finished : rejected (timeout or 503)

A sequence cycles between Waiting and Running until it finishes; under memory pressure the scheduler preempts Running sequences back to Waiting to free KV cache blocks.

(1) Admission. Should we admit a new request to the batch right now, or wait? If the batch is at maximum capacity (KV cache memory limit, or compute saturated), we have to wait. If admitting a long-prompt request would force eviction of an existing request, we have to decide whether the trade is worth it.

(2) Prioritization. Which pending request goes next? FCFS is the simplest. Priority-based scheduling (premium tier > free tier) is common in production. Shortest-job-first minimizes average latency but is hard to predict.

(3) Eviction. When the batch is full and a high-priority request arrives, do we evict an existing request to make room? Evicting means dropping the in-progress KV cache (and either re-running the request later from scratch, or returning an error to the user).

(4) Preemption. Can a request be paused mid-generation to free its KV cache temporarily? vLLM supports drop-and-recompute: free the KV cache for a low-priority request, but remember its token sequence; when it gets re-scheduled, re-prefill the cache from scratch and continue.

(5) Prefill scheduling. When a new request arrives, you have to do its prefill before it can join the decode batch. Prefill is compute-bound and large; if you do it inline with the decode batch, you’ll add latency to the existing decode requests. The trade-off:

  • Inline prefill (vLLM v0.x default): prefill in the same forward pass as decode. Simple but adds latency to decoding requests.
  • Chunked prefill: split the prefill into smaller chunks and interleave with decode steps. Smoother latency profile but more complex.
  • Disaggregated prefill (Chapter 36): run prefill on separate GPUs entirely. Best latency, most operational complexity.

Most production stacks today use chunked prefill (vLLM v0.5+, SGLang) as the default. It’s a good compromise between simplicity and latency smoothness.
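A toy version of decision (1), FCFS admission under a KV budget. The dict-based sequences and token-level accounting are simplifications (real schedulers track cache blocks, priorities, and preemption):

```python
from collections import deque

class ToyScheduler:
    """FCFS admission under a KV-token budget (a simplified sketch)."""
    def __init__(self, kv_budget_tokens):
        self.budget = kv_budget_tokens
        self.waiting = deque()
        self.running = []

    def admit(self):
        used = sum(seq["len"] for seq in self.running)
        # Admit in arrival order while the next request's KV fits.
        while self.waiting and used + self.waiting[0]["len"] <= self.budget:
            seq = self.waiting.popleft()
            used += seq["len"]
            self.running.append(seq)

sched = ToyScheduler(kv_budget_tokens=1000)
sched.waiting.extend([{"id": "A", "len": 400}, {"id": "B", "len": 700}])
sched.admit()
# A is admitted (400 fits); B stays waiting (400 + 700 > 1000).
```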

23.8 The latency-variance trade

Continuous batching massively improves throughput, but it has a subtle cost: latency variance.

In a static batch, every request takes the same wall-clock time (the slowest one’s). In a continuous batch, fast requests finish quickly and slow requests can take much longer. The median latency is much better than static batching, but the tail latency can be worse for a request that gets stuck in a batch with many other long-running requests.

The reason: each decode step’s wall-clock time depends on the KV caches of every sequence in the current batch — attention must read each sequence’s entire cache, so the per-step cost grows with cached length (O(S) per sequence). If your request shares a batch with a sequence that’s already at 8000 tokens, every one of your decode steps takes longer than it would in an unloaded batch.

This is the latency-variance trade. You get more throughput, but individual requests can experience unpredictable per-token latency depending on what else is in the batch.
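A toy cost model makes the variance visible. The constants are made up for illustration; only the shape of the dependence matters:

```python
# Toy per-step decode time: a fixed weight-read floor plus a cost that
# grows with the total cached tokens in the batch. Constants are
# illustrative assumptions, not measurements.
BASE_MS = 5.0            # weight-read floor per step
KV_MS_PER_KTOK = 0.4     # assumed cost per 1K cached tokens in the batch

def step_ms(seq_lens):
    return BASE_MS + KV_MS_PER_KTOK * sum(seq_lens) / 1000

print(step_ms([100, 200]))        # lightly loaded batch
print(step_ms([100, 200, 8000]))  # same two requests plus one long sequence
```

The two short requests do the same work in both cases, but their per-token latency rises when a long sequence joins the batch. That is the variance.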

Mitigations:

  • Limit the maximum batch sequence length in any single forward pass. Long sequences get their own (smaller) batch.
  • Limit batch size. Don’t pack the GPU completely full; leave headroom for shorter requests.
  • Tier requests by SLO. Premium-tier requests run in batches with smaller max sequences; free-tier requests run in the maximally-utilized batches.

In production, the tail latency under continuous batching is about 1.5–2× the median. This is the cost you pay for the throughput improvement.

23.9 Continuous batching with prefill

A subtlety that took the field a while to figure out: how do you handle prefill in a continuous batch?

Prefill processes many tokens at once (the whole prompt). Decode processes one token per step. If you put both in the same forward pass, you have a tensor with mixed shapes — some sequences contribute many tokens (prefill), some contribute one token (decode).

The naive solution: prefill blocks the batch. When a new request arrives, run its prefill alone (or in a separate pass), then add it to the decode batch. The cost: the entire batch has to wait for the prefill to finish, adding latency to all the decoding requests.

The chunked solution (modern default): split the prefill into chunks and process one chunk per forward pass, alongside the decode tokens. The chunk size is chosen so that the total work in each forward pass is roughly constant. A 4000-token prefill is split into 8 chunks of 500 tokens each, and the model runs 8 forward passes — each one processing 500 prefill tokens of the new request plus the decode tokens of all the existing in-flight requests.

This gives a smoother latency profile because no single forward pass is dominated by a huge prefill, and the decode tokens never have to wait for a multi-second prefill to complete.
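The chunk schedule itself is simple arithmetic; a sketch matching the 4000-token example above:

```python
# Split a prompt into fixed-size prefill chunks; each forward pass then
# processes one chunk alongside the in-flight decode tokens.
def prefill_chunks(prompt_len, chunk=500):
    return [min(chunk, prompt_len - start)
            for start in range(0, prompt_len, chunk)]

print(prefill_chunks(4000))   # eight chunks of 500 -> eight forward passes
print(prefill_chunks(1234))   # last chunk is the remainder
```

Production schedulers pick the chunk size dynamically from a per-step token budget, so that prefill chunk + decode tokens stays roughly constant; the fixed chunk here is a simplification.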

The best of all worlds is disaggregation (Chapter 36): run prefill on a separate pool of GPUs entirely, ship the resulting KV cache to the decode pool. This eliminates the trade-off completely at the cost of two GPU pools and an inter-pool transfer.

23.10 The implementation in vLLM

vLLM is the most influential production implementation of continuous batching. Its loop, simplified:

while True:
    # 1. Schedule: pick which sequences to run this step
    scheduled = scheduler.schedule(
        running=in_flight_sequences,
        waiting=pending_queue,
        kv_cache_budget=available_kv_blocks,
    )
    
    # 2. Forward pass: run the model on the scheduled batch
    # The batch is heterogeneous: some sequences are doing prefill chunks,
    # some are doing decode steps.
    outputs = model.execute_model(scheduled)
    
    # 3. Sample: produce next token for each sequence
    for seq_data, output in zip(scheduled, outputs):
        token = sampler.sample(output, seq_data.sampling_params)
        seq_data.append_token(token)
        
        if token == EOS or seq_data.is_finished:
            scheduler.finish(seq_data)
            seq_data.return_to_user()
    
    # 4. Free KV cache for finished sequences, admit new ones
    scheduler.update_after_step()

The interesting bits are in the scheduler. vLLM’s scheduler implements:

  • PagedAttention KV cache (Chapter 24) for efficient memory management.
  • Chunked prefill by default in v0.5+.
  • Priority scheduling with optional preemption.
  • Drop-and-recompute when memory pressure forces eviction.
  • Speculative decoding integration (Chapter 27) where draft tokens fold into the same step.

Reading vLLM’s scheduler.py is one of the best ways to internalize how production LLM serving actually works. It’s a few thousand lines of dense Python with detailed comments. If you want to be an expert at LLM serving, this is the file to read.

23.11 The mental model

Eight points to take into Chapter 24:

  1. Decode is memory-bandwidth-bound. Batching amortizes the weight read over many users.
  2. Static batching wastes 70–90% of GPU time on padding. Don’t use it for LLMs.
  3. Dynamic batching helps but doesn’t fix padding or head-of-line blocking.
  4. Continuous batching (Orca) batches at the token level, not the request level. Each forward pass contains whatever tokens are in flight, regardless of request boundaries.
  5. Throughput improves 5–37× over static batching. This is one of the largest wins in LLM serving.
  6. The cost is latency variance. Tail latency can be 1.5–2× the median.
  7. Chunked prefill is the modern default for handling new requests in a continuous batch.
  8. vLLM’s scheduler is the canonical reference implementation. Read it.

In Chapter 24 we look at the data structure that makes the KV cache memory-efficient enough for continuous batching to work: PagedAttention.


Read it yourself

  • Yu et al., Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022). The continuous batching paper. Read sections 3 and 4.
  • The vLLM blog post on PagedAttention and continuous batching.
  • The vLLM source code, especially vllm/core/scheduler.py and vllm/engine/llm_engine.py.
  • Agrawal et al., SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills (2023). The chunked prefill paper.
  • Patel et al., Splitwise: Efficient Generative LLM Inference Using Phase Splitting (2024). The disaggregation paper, which builds on continuous batching.

Practice

  1. Compute the padding overhead for a static batch of 32 requests with prompt+completion lengths drawn uniformly from [100, 4000]. What fraction is wasted?
  2. Explain why continuous batching has zero padding overhead but can have head-of-line latency for long sequences. Where does the variance come from?
  3. Why does continuous batching help decode but not prefill? Trace the arithmetic intensity in both cases.
  4. Read the Orca paper’s Figure 4 (the throughput comparison). Why is the speedup over static batching 5× for some workloads and 37× for others? What property of the workload predicts the speedup?
  5. Design a scheduling policy that prioritizes premium-tier requests over free-tier requests in vLLM. What are the levers? What are the failure modes?
  6. Why does chunked prefill smooth out latency compared to inline prefill? Plot the per-step compute time for both scheduling strategies.
  7. Stretch: Read vllm/core/scheduler.py end to end. Identify the scheduling policy and write a short summary of how it makes admission, prioritization, and eviction decisions.

Concept check


  1. Why does batching improve LLM serving throughput in a memory-bandwidth-bound decode regime?
  2. What is the padding waste problem in static batching and why does it reduce GPU utilization?
  3. Continuous (iteration-level) batching allows new requests to join an in-flight batch between decode steps. What problem does this solve that dynamic batching cannot?
  4. Mixing a new prefill request into an ongoing continuous decode batch causes a 'prefill bubble.' Why, and how do modern schedulers handle it?