Benchmarking inference: methodology, tools, gotchas
"Most LLM benchmarks are wrong. The ones that aren't wrong are wrong differently"
By Chapter 30 you understood the cost model for inference. By Chapter 31 you understood the latency model. This chapter is about how to actually measure inference performance correctly. Production decisions about hardware, frameworks, and configurations are based on benchmark numbers; if your benchmarks are wrong, your decisions are wrong.
LLM inference benchmarking is harder than you’d expect because the workload has many dimensions (input length, output length, concurrency, sampling, batch composition) and small changes can move the numbers significantly. This chapter is about the methodology that produces actually-useful benchmark numbers.
Outline:
- What to measure.
- Open-loop vs closed-loop benchmarking.
- Warmup and burn-in.
- The percentile traps.
- The throughput-vs-latency Pareto frontier.
- The multi-axis trap (revisited from Chapter 36).
- The benchmark tools: vllm bench, genai-perf, llmperf, custom harnesses.
- The mock vLLM trick.
- Common benchmarking mistakes.
- Reporting results.
55.1 What to measure
The first question of any benchmark: what are you measuring? For LLM inference, the standard metrics are:
Throughput metrics:
- Total tokens per second (across all requests).
- Per-request tokens per second (the average per-user generation speed).
- Requests per second (RPS) at a given concurrency.
Latency metrics:
- Time-to-first-token (TTFT) — wall clock from request submission to first token streamed back.
- Time-per-output-token (TPOT) or inter-token latency — average time between successive output tokens.
- End-to-end latency (e2e) — total wall clock from request submission to last token.
Distribution metrics:
- p50, p90, p99, p999 for each latency above.
- Mean (less useful but commonly reported).
Resource metrics:
- GPU utilization — % of GPU compute being used.
- HBM bandwidth utilization — % of HBM bandwidth being used.
- KV cache utilization — % of the cache pool occupied.
- Power draw — for cost calculations.
You don’t need all of these for every benchmark. The relevant set depends on what you’re trying to learn:
- Comparing two configurations: throughput, p50, p99 for both, on the same workload.
- Capacity planning: throughput at target latency (e.g., “tokens/sec at p99 < 500ms”).
- Cost optimization: $-per-million-tokens at target latency.
- Latency tuning: detailed percentile breakdown.
The mistake is to report only one number (“vLLM does 1000 tokens/sec”) without specifying the workload, the concurrency, the latency tolerated, or the percentile.
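The latency metrics above reduce to simple arithmetic on timestamps. A minimal sketch (the `latency_metrics` helper is illustrative, not from any particular tool):

```python
def latency_metrics(submit_ts: float, token_ts: list[float]) -> dict:
    """Derive TTFT, TPOT, and e2e latency for one streamed request.

    submit_ts: wall-clock time the request was submitted.
    token_ts:  wall-clock arrival time of each output token, in order.
    """
    ttft = token_ts[0] - submit_ts          # time-to-first-token
    e2e = token_ts[-1] - submit_ts          # end-to-end latency
    # TPOT: average gap between successive tokens (undefined for 1 token).
    gaps = [b - a for a, b in zip(token_ts, token_ts[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else None
    return {"ttft": ttft, "tpot": tpot, "e2e": e2e}
```

Collect these per request, then compute percentiles over each field across all requests.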
55.2 Open-loop vs closed-loop benchmarking
The most important methodological distinction in LLM benchmarking: open-loop vs closed-loop.
Closed-loop benchmark. The benchmark client maintains a fixed number of in-flight requests. When one finishes, it immediately submits a new one. The system processes requests as fast as it can, and the throughput is the rate at which requests complete.
# Closed-loop pseudocode
max_in_flight = 100
in_flight = 0
while True:
    if in_flight < max_in_flight:
        submit_new_request()
        in_flight += 1
    if request_finished:
        in_flight -= 1
This measures the maximum sustainable throughput for the given concurrency. It does not measure realistic latency because every new request starts immediately when capacity is available — there’s no queue, no waiting, no real-world traffic pattern.
Open-loop benchmark. The benchmark client submits requests at a fixed rate (e.g., 100 requests per second), regardless of how fast they’re being processed. If the system can’t keep up, requests queue and latency grows.
# Open-loop pseudocode
target_rps = 100
while True:
    submit_request_now()
    sleep(1 / target_rps)
This measures realistic latency under load. If you submit at 100 RPS and the system can only handle 80 RPS, the queue grows and latency goes to infinity. If the system can handle 200 RPS, the queue stays empty and latency reflects only the work itself.
The key insight: open-loop reveals the saturation point. You sweep the request rate and find the rate at which latency starts spiking — that’s the maximum sustainable throughput at acceptable latency. Closed-loop just tells you “this many in-flight requests gives this throughput,” which is useful for capacity planning but doesn’t reveal the latency picture.
For production benchmarking, always use open-loop. The closed-loop number is misleading because production traffic is open-loop (users submit when they want, not when the system is ready).
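The open-loop pseudocode above drifts if sleep() overshoots, since each sleep's error accumulates. A more careful client schedules each submission against an absolute deadline. A minimal asyncio sketch (the `submit` callable is a stand-in for whatever sends one request in your harness):

```python
import asyncio
import time

async def open_loop(submit, target_rps: float, duration_s: float) -> list:
    """Submit requests at a fixed rate, regardless of completion times.

    Scheduling against absolute deadlines (start + i * interval) avoids
    the drift you get from chaining relative sleep() calls.
    """
    interval = 1.0 / target_rps
    start = time.monotonic()
    tasks, i = [], 0
    while (now := time.monotonic()) - start < duration_s:
        deadline = start + i * interval
        if now < deadline:
            await asyncio.sleep(deadline - now)
        # Fire and move on -- never wait for the response before submitting
        # the next request. That waiting is exactly what makes a client
        # closed-loop.
        tasks.append(asyncio.create_task(submit()))
        i += 1
    return await asyncio.gather(*tasks)
```

Note the contrast with closed-loop: the submission schedule here is independent of how fast requests complete.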
55.3 Warmup and burn-in
Every benchmark needs a warmup period before measuring. The reasons:
- Cold caches. The KV cache is empty at the start; the prefix cache has no hits; the CUDA kernels are JIT-compiling. The first few requests are slow.
- JIT compilation. Some kernels compile on first use (Chapter 54). The first request that hits a particular shape is much slower.
- Warm-up of dynamic structures. Memory allocators, schedulers, and other internal state need to reach steady state.
The fix: run a warmup period before the measurement starts. Submit requests for 30-60 seconds, discard the results, then start measuring.
# Warmup: submit at the target rate, discard results
for _ in range(warmup_seconds * target_rps):
    submit_request()
    sleep(1 / target_rps)
    discard_results()

# Real measurement
start = time.time()
while time.time() - start < measurement_duration:
    submit_request()
    sleep(1 / target_rps)
    record_results()
Without warmup, your TTFT p99 will be wildly inflated by the first few cold requests.
For very long benchmarks, also burn in the request stream. Some metrics (cache hit rates, autoscaler state) take minutes to reach steady state. A 5-minute warmup followed by a 30-minute measurement is reasonable for production-style benchmarks.
55.4 The percentile traps
Percentile metrics are the most useful but the most error-prone. Common mistakes:
Trap 1: Reporting averages. “Average TPOT was 80ms” is much less useful than “p50 was 70ms, p99 was 250ms.” Averages hide the tail. Always report percentiles.
Trap 2: Computing percentiles wrong. Some tools compute percentiles per-bucket (with histogram quantization) and others compute exact percentiles. The two can disagree by 10-20%. Use exact computation if you can; if you use histograms, make sure the bucket boundaries are dense enough.
Trap 3: Not enough samples. Computing p99 from 50 samples is meaningless — the p99 is essentially “the worst sample” and is dominated by noise. You need at least 1000 samples for p99 to be stable, and 100,000 for p999. Run benchmarks long enough to collect them.
Trap 4: Aggregating across heterogeneous workloads. If half your requests have 100-token outputs and half have 5000-token outputs, the e2e p99 distribution is bimodal and the single p99 number is misleading. Break down by workload type when reporting.
Trap 5: Streaming vs non-streaming. TTFT is meaningful for streaming responses but doesn’t exist for non-streaming. e2e is meaningful for both but is dominated by output length for non-streaming. Be clear about which mode you’re benchmarking.
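Trap 3 is easy to demonstrate with a simulation: run many repeated "benchmarks" of n samples each, drawn from an assumed lognormal latency distribution, and watch the spread of the p99 estimate shrink as n grows. A sketch (the distribution parameters are arbitrary, chosen only to give a heavy-ish tail):

```python
import random
import statistics

def p99(samples: list[float]) -> float:
    # Exact nearest-rank percentile over the raw samples.
    s = sorted(samples)
    return s[min(len(s) - 1, int(0.99 * len(s)))]

def p99_spread(n: int, trials: int = 200, seed: int = 0) -> float:
    """Std-dev of the p99 estimate across repeated benchmarks of n samples,
    drawn from an assumed lognormal latency distribution."""
    rng = random.Random(seed)
    estimates = [p99([rng.lognormvariate(0, 0.5) for _ in range(n)])
                 for _ in range(trials)]
    return statistics.stdev(estimates)
```

With n = 50 the p99 is literally the worst sample and the estimate swings wildly between runs; with n = 5000 it stabilizes.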
The right way to report percentiles:
Workload: Llama 3 70B, 1000-token prompts, 500-token outputs
Concurrency: open-loop, 50 RPS
Duration: 10 minutes after 1-minute warmup
Samples: 30000

TTFT:
  p50:  300 ms
  p90:  450 ms
  p99:  900 ms
  p999: 1500 ms
TPOT:
  p50:  60 ms
  p90:  90 ms
  p99:  180 ms
  p999: 350 ms
e2e:
  p50: 30.3 s
  p90: 45.5 s
  p99: 90.5 s
This is the format that lets a reader actually understand the system’s behavior.
55.5 The throughput-vs-latency Pareto frontier
A single benchmark number isn’t enough. The relationship between throughput and latency is a curve, not a point. As you increase concurrency, throughput goes up but latency also goes up. The interesting question is “what’s the maximum throughput at acceptable latency?”
The right way to measure: sweep the request rate and plot throughput vs latency p99. Every point on the curve is a (throughput, p99 latency) pair. The Pareto frontier is the upper-left edge of this curve — the points where you can’t get more throughput without paying more latency, or vice versa.
p99 latency
  ^
  |                        x   high concurrency, high throughput, very high latency
  |                       x
  |                      x    "saturation knee"
  |                  x  x
  |           x   x
  |  x                        low concurrency, low throughput, low latency
  +-----------------------------> throughput
The “saturation knee” is the most useful operating point: high throughput with bounded latency. Below the knee, you have spare capacity but you’re not using it. Above the knee, latency spikes and you’ve overcommitted.
For each system you benchmark, find the knee. The number you report is “X tokens/sec at p99 < Y ms” — the saturation point at the latency budget you care about.
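Picking that reporting point from a sweep can be automated. A minimal sketch, assuming you have already collected (throughput, p99) pairs from an open-loop rate sweep (the `knee` helper is illustrative):

```python
def knee(sweep: list[tuple[float, float]], p99_budget_ms: float):
    """Pick the operating point: max throughput whose p99 fits the budget.

    sweep: (tokens_per_sec, p99_ms) pairs from an open-loop rate sweep.
    Returns the best pair, or None if no point meets the budget.
    """
    feasible = [pt for pt in sweep if pt[1] <= p99_budget_ms]
    # Tuples compare by first element, so max() picks highest throughput.
    return max(feasible, default=None)
```

For example, with a sweep of [(100, 200), (200, 400), (300, 900)] and a 500 ms budget, the reported point is "200 tokens/sec at p99 < 500 ms".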
55.6 The multi-axis trap
We saw this in Chapter 36 (disaggregated PD): when comparing two configurations, you have to measure on multiple axes, not just one. The relevant axes for LLM serving:
- Per-GPU throughput — tokens/sec/GPU.
- Total throughput — tokens/sec across the whole fleet.
- p50 latency — typical user experience.
- p99 latency — worst-case user experience.
- At various concurrency levels — the answer changes.
A configuration that “doubles total throughput” might use 2× the GPUs and have unchanged per-GPU throughput. A configuration that “halves p99 latency” might also halve per-GPU throughput. These are different stories.
Always report all the relevant axes for any comparison. The reader should be able to see the trade-off, not just the headline.
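A small helper makes it harder to forget an axis: derive per-GPU throughput alongside the reported totals so both appear in every comparison. A sketch (the field names are illustrative):

```python
def add_derived_axes(configs: dict) -> dict:
    """Attach per-GPU throughput to each configuration's results.

    configs: {name: {"gpus": int, "total_tps": float,
                     "p50_ms": float, "p99_ms": float}}
    """
    return {
        name: {**c, "per_gpu_tps": c["total_tps"] / c["gpus"]}
        for name, c in configs.items()
    }
```

Run on the "doubled total throughput" example above, this immediately exposes that a config with 2x the GPUs and 2x the total throughput has unchanged per-GPU throughput.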
55.7 The benchmark tools
The major benchmarking tools for LLM serving:
vllm bench
vLLM’s built-in benchmarking utility. Runs against a live vLLM server with configurable workload.
vllm serve meta-llama/Llama-3.1-8B-Instruct &
vllm bench serve --model meta-llama/Llama-3.1-8B-Instruct \
    --num-prompts 1000 \
    --request-rate 50 \
    --dataset-name sharegpt \
    --dataset-path sharegpt_v3_unfiltered.json
The tool generates a workload from a dataset (ShareGPT is the standard), submits requests at a target rate, and measures latency. Reports throughput and latency percentiles.
vLLM’s benchmark is the easiest to start with. It supports open-loop submission via the --request-rate flag. Output format is reasonably standardized.
genai-perf (NVIDIA)
NVIDIA’s official LLM benchmarking tool. Part of the Triton Inference Server toolkit.
genai-perf -m llama-3-8b -u localhost:8000 \
--service-kind openai --endpoint-type chat \
--concurrency 1 16 32 64 \
--measurement-interval 10000
Sweeps concurrency levels automatically and reports the throughput-latency curve. Good for hardware vendors and for sweeping configurations.
llmperf (Anyscale)
Open-source LLM load testing tool from Anyscale.
llmperf load-test --model llama-3-8b \
--num-concurrent-workers 32 \
--max-num-completed-requests 1000
Similar in scope to vllm bench. Used for comparing serving providers (Together AI, Fireworks AI, etc.).
vllm-benchmark
A separate community tool that wraps vLLM’s benchmark with more workload types.
Custom harnesses
Most production teams build their own benchmark harness. The reasons:
- Match the specific workload of their application.
- Test specific scenarios (long prompts, multi-turn, function calling).
- Integrate with their own monitoring stack.
A custom harness is typically 200-500 lines of Python that sends requests via the OpenAI client library and computes percentiles. Not glamorous but effective.
For most production teams, start with vllm bench for general benchmarking, build a custom harness for your specific workload.
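The percentile-reporting half of such a harness fits in a few lines. A minimal sketch using exact nearest-rank percentiles (the `summarize` helper is hypothetical, not from any tool above):

```python
def summarize(latencies_ms: list[float],
              percentiles=(50, 90, 99, 99.9)) -> dict:
    """Exact percentile report for one metric, in the style a custom
    harness would print for TTFT, TPOT, and e2e."""
    s = sorted(latencies_ms)
    out = {}
    for p in percentiles:
        # Nearest-rank exact percentile over the raw samples -- no
        # histogram quantization (see Trap 2 above).
        idx = min(len(s) - 1, int(p / 100 * len(s)))
        out[f"p{p:g}".replace(".", "")] = s[idx]
    return out
```

Call this once per metric (TTFT, TPOT, e2e) over the recorded samples, and you have the report format shown in 55.4.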
55.8 The mock vLLM trick
A useful technique for benchmarking the rest of the stack (not the model itself): use a mock vLLM that returns fake responses immediately. This lets you measure the gateway, the network, the scheduler, the autoscaler — everything except the actual model compute — in isolation.
A mock vLLM is just a small Python server that exposes the OpenAI API and returns canned responses:
import asyncio
import json

from fastapi import FastAPI
from sse_starlette.sse import EventSourceResponse

app = FastAPI()

@app.post("/v1/chat/completions")
async def chat_completions(request: dict):
    # Return a fake streaming response with controllable latency
    async def stream():
        await asyncio.sleep(0.1)  # fake TTFT
        for i in range(50):
            chunk = {"choices": [{"delta": {"content": f"token{i}"}}]}
            yield {"data": json.dumps(chunk)}
            await asyncio.sleep(0.05)  # fake TPOT
        yield {"data": "[DONE]"}
    return EventSourceResponse(stream())
This serves an OpenAI-compatible streaming response with controllable latency, no GPU required. You can run it as a stand-in for vLLM in your test environment.
Mock vLLM is useful for:
- Load testing the gateway. Verify it can handle the throughput you expect.
- Testing autoscaling logic. Trigger scale-up by sending lots of requests, verify the autoscaler responds.
- CI testing of the integration. Verify the AI gateway, the auth layer, the metering all work together without needing a real GPU.
- Debugging latency issues. If the mock is fast and the real serving is slow, you’ve narrowed the problem to the model itself.
graph LR
B[Benchmark client] -->|OpenAI API| G[AI Gateway]
G -->|route| M[Mock vLLM]
M -->|canned SSE stream| G
G -->|measure latency| B
style M fill:var(--fig-accent-soft),stroke:var(--fig-accent)
The mock vLLM replaces the real GPU backend in CI and load tests, letting you validate the gateway, auth, and routing layers without provisioning GPUs.
Most production teams have a mock vLLM somewhere in their test infrastructure. It’s a small but valuable tool.
55.9 Common benchmarking mistakes
A non-exhaustive list of mistakes I’ve seen in real benchmark reports:
(1) No warmup. First request was cold; p99 numbers are bogus.
(2) Closed-loop only. Reports throughput at fixed concurrency, doesn’t reveal saturation behavior.
(3) Not enough samples for p99. Computing p99 from 100 samples gives a noise number.
(4) Only reporting one configuration. “vLLM does 1000 tokens/sec” without specifying the model, prompts, concurrency, hardware.
(5) Aggregating across heterogeneous workloads. Single p99 number for a bimodal distribution.
(6) Not reporting per-GPU throughput. Only reporting total throughput hides whether you’re using the GPUs efficiently.
(7) Comparing apples to oranges. vLLM with default settings vs TensorRT-LLM with optimized settings, both running the same model. Each was tuned differently.
(8) Forgetting to disable extra logging. Verbose logging in vLLM adds overhead and skews timing.
(9) Tokenizer cost included or not. Some benchmarks include the time to tokenize the prompt; others don’t. Make sure both compared systems do the same thing.
(10) Not accounting for variance. Run the benchmark multiple times and report the variance, not just one run.
These are all real mistakes I’ve seen in production benchmark reports. Avoid them.
55.10 Reporting results
A good benchmark report includes:
- The hardware (GPU type, count, network, storage).
- The model (name, version, quantization).
- The workload (prompt distribution, output length distribution, number of requests).
- The concurrency / request rate (open or closed loop, level).
- Warmup duration.
- Measurement duration and number of samples.
- All the percentiles (p50, p90, p99, p999) for TTFT, TPOT, and e2e.
- Throughput (per-GPU and total).
- Resource utilization (GPU, HBM bandwidth, KV cache).
- The exact configuration of the system under test (vLLM flags, etc.).
- Variance across runs (multiple runs with stddev).
This is the format that lets someone else reproduce your benchmark and verify your numbers. Without this level of detail, your report is just a marketing claim.
55.11 The mental model
Eight points to take into Chapter 56:
- What you measure determines what you know. Always report throughput AND latency percentiles, not just one.
- Open-loop, not closed-loop. Production traffic is open-loop, so benchmark that way.
- Warmup first. Discard the cold-start measurements.
- Enough samples for the percentiles. p99 needs 1000+ samples; p999 needs 100k+.
- Sweep concurrency. The throughput-latency Pareto frontier is what matters, not a single point.
- Multi-axis when comparing. Per-GPU throughput, total throughput, p50, p99 — all of them.
- Use vllm bench to start, build custom for your workload.
- Mock vLLM is useful for testing the rest of the stack without a GPU.
In Chapter 56 we close out Stage 4 (and Part III) with content safety as inference: guardrails architecture.
Read it yourself
- The vLLM benchmarking documentation.
- The NVIDIA genai-perf documentation.
- The Anyscale llmperf GitHub repository.
- The Pope et al. paper Efficiently Scaling Transformer Inference for the methodology framework.
- The classic Tail at Scale paper (Dean & Barroso, 2013) for the percentile thinking.
- Brendan Gregg’s Systems Performance book on benchmarking methodology.
Practice
- Why is open-loop benchmarking better than closed-loop for production-relevant numbers?
- Run vllm bench against a small open model. Report TTFT and TPOT percentiles for two concurrency levels.
- Why do you need at least 1000 samples for p99 to be stable? Explain with the variance of an order statistic.
- For a Llama 3 70B serving deployment, design a benchmark that measures the throughput-latency Pareto frontier. What concurrency levels would you sweep?
- Why is “vLLM does 1000 tokens/sec” a meaningless statement without context? Identify the missing pieces.
- Build a mock vLLM that returns canned streaming responses. Use it to load-test a gateway.
- Stretch: Run an open-loop benchmark of vLLM with different --max-num-batched-tokens values. Plot the throughput-latency curve for each. Identify the optimal value for your hardware.