vLLM in production: every flag that matters
"The defaults will get you to 60% of vLLM's potential. The flags that matter will get you to 95%. The remaining 5% is research"
This is the most concrete chapter in Stage 4. By the end you will know what every important vLLM serving flag does, why it matters, and how to set it for your workload. This is the operational reference for actually running vLLM in production.
I’ll walk through the flags in roughly the order you’d think about them when configuring a deployment: model loading, parallelism, memory, batching, scheduling, sampling, observability, and stability.
Outline:
- The model and tokenizer flags.
- Parallelism flags.
- Memory flags.
- Batching and scheduling flags.
- Quantization flags.
- Prefix caching flags.
- Speculative decoding flags.
- KV cache and sequence length flags.
- API server flags.
- Observability flags.
- The “what to tune first” workflow.
48.1 The model and tokenizer flags
The most basic configuration: which model to load.
--model (or first positional arg). The model path or HuggingFace ID. Examples:
--model meta-llama/Llama-3.1-70B-Instruct
--model /mnt/models/llama-3-70b
--model TheBloke/Llama-2-70B-Chat-AWQ
--tokenizer (optional). Override the tokenizer. By default vLLM uses the tokenizer that ships with the model. Override only if you have a specific reason (e.g., a custom tokenizer for fine-tuning).
--revision. Pin to a specific git revision of the model on HF Hub. Use this in production to prevent silent model updates.
--trust-remote-code. Allow loading models that have custom Python code in their HF Hub repository. Required for some non-standard architectures (Qwen, Mixtral early versions, etc.). Security note: this executes arbitrary code from the model repo. Only set it for models from trusted sources.
--tokenizer-mode. auto (default) uses the model’s tokenizer; slow forces the slow Python tokenizer (for debugging). Stick with auto.
--dtype. The compute dtype for model inference. Options: auto, half (fp16), bfloat16, float, float16, float32. Default to auto, which picks bf16 on Hopper and Ampere, fp16 on older GPUs. Override only if you have a specific reason.
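Putting the loading flags together, a pinned production load might look like this sketch (the revision SHA is a placeholder you would replace with a real commit from the model's Hub page):

```shell
# Pin the exact model revision so a Hub update can't silently change weights.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --revision <commit-sha> \
    --dtype auto
```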
48.2 Parallelism flags
For multi-GPU serving:
--tensor-parallel-size (or -tp). Number of GPUs for tensor parallelism. The model is split across this many GPUs. Must divide evenly into the model’s attention heads. Defaults to 1.
-tp 1 # single GPU
-tp 2 # 2-way TP
-tp 4 # 4-way TP
-tp 8 # 8-way TP, typical for 70B+ models on H100
The right value depends on:
- The model size (does it fit on one GPU?).
- The number of GPUs available.
- The latency/throughput trade-off (more TP = lower latency, sometimes lower throughput per GPU).
--pipeline-parallel-size (or -pp). Number of pipeline parallelism stages. Each stage holds a chunk of the model’s layers. Defaults to 1.
-pp 1 # no pipeline parallelism
-pp 2 # 2 pipeline stages, used for models that don't fit in one node
For models that fit in a single node (typical), use -tp only. For models that don’t (405B+ in bf16), combine -tp within each node and -pp across nodes.
--worker-use-ray. Use Ray for distributed worker coordination instead of multiprocessing. Required for multi-node deployment.
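Combining the flags above, a cross-node launch might look like this sketch (it assumes a Ray cluster is already running across both nodes):

```shell
# 2 nodes x 8 GPUs: TP within each node, PP across nodes, Ray as the backend.
vllm serve meta-llama/Llama-3.1-405B-Instruct \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2 \
    --worker-use-ray
```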
For a 70B model on a single 8-GPU H100 node: -tp 8. Don’t overthink it.
48.3 Memory flags
Memory tuning is the most fragile part of vLLM configuration. Get it wrong and you OOM or waste capacity.
--gpu-memory-utilization. Fraction of GPU memory to use for vLLM (model + KV cache + activations). Default 0.9 (90%). The remaining 10% is reserved for everything else (CUDA driver, other processes, headroom).
--gpu-memory-utilization 0.95 # aggressive, more KV cache budget
--gpu-memory-utilization 0.8 # conservative, more headroom
Set this as high as you can without OOMing. The KV cache size scales with this — more memory utilization means more concurrent requests.
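To make the trade-off concrete, a back-of-the-envelope sketch (all numbers illustrative, not measured): a 70B bf16 model split 8 ways puts roughly 140 GiB / 8 = 17.5 GiB of weights on each 80 GiB GPU, and whatever the utilization cap admits beyond that, minus activation overhead, becomes KV cache.

```python
# Back-of-the-envelope KV cache budget for one 80 GiB GPU.
# Illustrative numbers: 70B bf16 split 8 ways -> ~17.5 GiB of weights per GPU.

def kv_cache_budget_gib(total_gib, utilization, weights_gib, overhead_gib=2.0):
    """GPU memory left for the KV cache pool after weights and a rough
    allowance for activations/CUDA overhead."""
    return total_gib * utilization - weights_gib - overhead_gib

print(kv_cache_budget_gib(80, 0.90, 17.5))  # 52.5 GiB of KV cache
print(kv_cache_budget_gib(80, 0.95, 17.5))  # 56.5 GiB, ~8% more concurrency
```

The point of the sketch: moving 0.90 to 0.95 buys 4 GiB of KV cache here, which is real but modest; on a smaller GPU with the same model it can be the difference between serving and OOMing.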
--swap-space. Amount of CPU memory (in GiB) to use for KV cache swap when GPU memory is exhausted. Default 4 GiB. Increase if you have CPU memory and want a larger safety net for evictions.
--max-num-batched-tokens. The maximum number of tokens that can be in a single forward pass (sum of prefill tokens + decode tokens across the batch). Default depends on the model, often 8192 or 16384. Increase for higher throughput on workloads with long prompts; decrease for tighter latency.
--max-num-batched-tokens 8192 # tight latency
--max-num-batched-tokens 32768 # high throughput
This is one of the most important throughput knobs. Tune it for your workload.
--max-num-seqs. The maximum number of sequences in a batch. Default 256. Caps the concurrency. Increase for higher throughput; decrease for tighter latency.
48.4 Batching and scheduling flags
--max-model-len. The maximum sequence length the model supports. Defaults to whatever the model config says (e.g., 8192 for Llama 3 base, 131072 for Llama 3 with extended context). Set this lower than the model’s max if you don’t need the full context — it directly determines the KV cache pool size.
--max-model-len 8192 # for chat workloads with shorter prompts
--max-model-len 32768 # standard
--max-model-len 131072 # full Llama 3 context
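Why this flag matters so much: a sequence's KV cost scales linearly with its length, and --max-model-len bounds how long sequences can get. A quick calculation for Llama 3 70B-class dimensions (80 layers, 8 KV heads via GQA, head dim 128, bf16):

```python
# Per-token KV cost = 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes.
# Llama 3 70B-class shape: 80 layers, 8 KV heads (GQA), head_dim 128, bf16.

def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(80, 8, 128)        # 327,680 B = 320 KiB/token
full = per_token * 131_072 / 2**30                # one max-length sequence
short = per_token * 8_192 / 2**30
print(f"{per_token} bytes/token")                 # 327680 bytes/token
print(f"131072-token sequence: {full:.1f} GiB")   # 40.0 GiB
print(f"8192-token sequence: {short:.1f} GiB")    # 2.5 GiB
```

A single full-context sequence at 131072 tokens can consume tens of GiB of KV cache; capping at 8192 makes the same pool hold an order of magnitude more concurrent requests.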
--enable-chunked-prefill. Enable chunked prefill (Chapter 23). Defaults to enabled in vLLM v0.5+ for most models. Splits long prefills into chunks that interleave with decode steps. You almost always want this on.
--max-num-batched-tokens (already covered) interacts with chunked prefill: it sets the chunk size implicitly. With chunked prefill enabled, set --max-num-batched-tokens to a value that gives smooth prefill (e.g., 2048-4096).
--disable-log-stats. Turns off the periodic stats logging, which is on by default. Leave stats logging on in production; it feeds your observability.
--disable-log-requests. Turns off per-request logging, which is on by default. Pass this flag in production at scale to avoid log spam.
48.5 Quantization flags
For quantized model serving:
--quantization. The quantization scheme. Options: awq, gptq, fp8, bitsandbytes, squeezellm, marlin, aqlm, deepspeedfp, experts_int8, compressed-tensors, etc. Defaults to None (bf16/fp16 native).
--quantization awq # AWQ INT4
--quantization gptq # GPTQ INT4
--quantization fp8 # FP8 (Hopper)
--quantization marlin # Marlin INT4 (faster than awq/gptq on supported configs)
The model generally has to be pre-quantized in the matching format: passing --quantization awq or gptq with an unquantized checkpoint fails, because those schemes need calibrated quantized weights produced offline. FP8 is the partial exception: vLLM can apply dynamic FP8 quantization to a bf16/fp16 checkpoint at load time, though a checkpoint pre-quantized with calibrated scales gives better accuracy.
For Hopper deployments of large models, --quantization fp8 is increasingly the right default. For consumer/budget deployments, --quantization awq is the standard.
--kv-cache-dtype. The dtype for the KV cache, independent of the weight dtype. Options: auto, fp8, fp8_e5m2, int8. Default auto (matches the model dtype).
--kv-cache-dtype fp8 # 2x smaller KV cache, very small quality loss
--kv-cache-dtype fp8_e5m2 # alternative FP8 variant
For long-context workloads, KV cache quantization is a big win. Try fp8 first.
48.6 Prefix caching flags
--enable-prefix-caching. Enable prefix caching (Chapter 29). Defaults to enabled in modern vLLM versions for most models. Reuses KV cache across requests with shared prefixes. Almost always a win.
--prefix-caching-hash-algo. The hashing algorithm for prefix cache lookup. Options: builtin, sha256. Default builtin. The choice matters very little; stick with the default.
For workloads with shared prefixes, prefix caching cuts TTFT by 50-95% on cache hits. Always enable it.
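The mechanism can be sketched with a toy version of block hashing (the real implementation lives in vLLM's block manager; this is an illustration, not its code): a full block's cache key covers every token up to the end of that block, so two requests that share a system prompt map their leading full blocks to identical keys.

```python
import hashlib

BLOCK = 16  # tokens per block, mirroring --block-size

def block_hashes(token_ids, block=BLOCK):
    """Toy cache keys: each *full* block is keyed by the entire prefix
    through the end of that block, so keys compose down the sequence."""
    keys = []
    for end in range(block, len(token_ids) + 1, block):
        keys.append(hashlib.sha256(str(token_ids[:end]).encode()).hexdigest())
    return keys

system = list(range(40))                 # shared 40-token system prompt
req_a = system + list(range(100, 120))   # 60 tokens total
req_b = system + list(range(200, 220))   # same prefix, different user turn

a, b = block_hashes(req_a), block_hashes(req_b)
shared = sum(x == y for x, y in zip(a, b))
print(shared, "of", len(a), "blocks reusable")  # 2 of 3 blocks reusable
```

The third block's key differs because its prefix includes the diverging user tokens; everything before the divergence point is served from cache.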
48.7 Speculative decoding flags
For speculative decoding (Chapter 27):
--speculative-model. The draft model to use. None by default.
--speculative-model meta-llama/Llama-3.1-8B-Instruct \
--speculative-draft-tensor-parallel-size 1 \
--num-speculative-tokens 5
--num-speculative-tokens. How many draft tokens to propose per step. Higher = more aggressive, lower = more conservative. Typical: 4-7.
--speculative-draft-tensor-parallel-size. TP for the draft model. Usually 1.
--use-v2-block-manager. Required for some speculative decoding configurations.
Speculative decoding is most useful at low concurrency. Enable it if your workload has spare GPU compute and you want lower latency.
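A quick way to reason about --num-speculative-tokens: under the usual simplifying assumption of an independent per-token acceptance rate p, the expected number of tokens emitted per target-model step with k draft tokens is a geometric series.

```python
def expected_tokens_per_step(p, k):
    """Expected tokens emitted per target-model step with k draft tokens,
    assuming i.i.d. acceptance probability p per draft token (the +1 term
    is the bonus/correction token the target always contributes):
    1 + p + p^2 + ... + p^k = (1 - p**(k + 1)) / (1 - p)."""
    return (1 - p ** (k + 1)) / (1 - p)

for k in (3, 5, 7):
    print(k, round(expected_tokens_per_step(0.8, k), 2))  # 2.95, 3.69, 4.16
```

The diminishing returns are why typical settings sit at 4-7: at p = 0.8, going from 5 to 7 draft tokens buys only about half an extra token per step while the draft model does 40% more work.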
48.8 KV cache and sequence length flags
A few additional KV-related flags:
--block-size. PagedAttention block size in tokens. Options: 8, 16, 32. Default 16. Smaller blocks reduce internal fragmentation; larger blocks reduce metadata overhead. The default is usually right.
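The fragmentation trade-off is simple arithmetic: each sequence wastes only the unused slots in its final, partially filled block, about half a block on average.

```python
def wasted_slots(seq_len, block_size):
    """Internal fragmentation: unused token slots in the last, partially
    filled block of a sequence (averages block_size / 2 across lengths)."""
    return (-seq_len) % block_size

print(wasted_slots(1000, 16))  # 8 wasted token slots
print(wasted_slots(1000, 32))  # 24 wasted token slots
```

At roughly 320 KiB of KV per token for a 70B-class model, 8 wasted slots per sequence is about 2.5 MiB, which is negligible; that is why the default of 16 is usually right.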
--num-gpu-blocks-override. Override the auto-computed number of KV cache blocks. Almost never needed; use only if vLLM’s auto-detection picks a wrong value.
--enable-lora. Enable multi-LoRA serving. Required if you want to serve multiple LoRA adapters on the same base.
--max-lora-rank and --max-cpu-loras. LoRA-specific limits.
48.9 API server flags
--host. Bind address. Default 0.0.0.0. Standard.
--port. API port. Default 8000. Set to whatever your service expects.
--api-key. Optional API key for authentication. Don’t rely on this for production security; put a real auth gateway in front.
--served-model-name. The name vLLM reports in API responses. Default is the value of --model. Override to expose a friendlier name.
--chat-template. Override the chat template. Use only if the tokenizer’s built-in template is wrong.
--max-log-len. Maximum number of tokens to include in logged request previews. Default 0 (no logging). Set to a small number for debugging.
--response-role. The role name for assistant responses. Default assistant.
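A smoke test against the resulting OpenAI-compatible endpoint might look like this (the model name and API key are illustrative values you would have set via --served-model-name and --api-key):

```shell
# Hit the OpenAI-compatible chat endpoint; "llama-3.1-70b" is a placeholder
# for whatever --served-model-name you configured.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d '{
        "model": "llama-3.1-70b",
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 8
      }'
```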
48.10 Observability flags
--otlp-traces-endpoint. Endpoint for OpenTelemetry traces. Set this to your tracing collector (Tempo, Jaeger).
--collect-detailed-traces. Collect more granular traces. Has some overhead; enable for debugging only.
Prometheus metrics are exposed automatically on the same port as the API server, at /metrics. The most important metrics:
- vllm:num_requests_running — number of requests currently being processed (this is the autoscaling signal).
- vllm:num_requests_waiting — number of requests in the queue.
- vllm:gpu_cache_usage_perc — KV cache utilization.
- vllm:cpu_cache_usage_perc — CPU cache (swap) utilization.
- vllm:time_to_first_token_seconds — TTFT histogram.
- vllm:time_per_output_token_seconds — TPOT histogram.
- vllm:e2e_request_latency_seconds — end-to-end latency histogram.
- vllm:prompt_tokens_total — total prompt tokens served.
- vllm:generation_tokens_total — total generated tokens.
These metrics are the foundation of your monitoring. Scrape them with Prometheus and graph in Grafana.
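A few illustrative PromQL queries over these metrics, assuming the histograms expose the standard _bucket series:

```promql
# p95 TTFT over the last 5 minutes
histogram_quantile(0.95,
  sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))

# Queue depth: sustained values above zero mean the deployment is saturated
sum(vllm:num_requests_waiting)

# KV cache pressure: approaching 1.0 means evictions/preemptions are near
avg(vllm:gpu_cache_usage_perc)
```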
48.11 The “what to tune first” workflow
A practical workflow for setting vLLM up for a new deployment:
Step 1: Get it running with defaults.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 8 \
--max-model-len 8192
If this doesn’t work, you have a model loading or hardware issue, not a tuning issue.
Step 2: Set the model length to what you actually need. Don’t leave it at the default (which might be 131072 and waste KV cache budget).
Step 3: Enable quantization if you can. For a 70B on H100, try --quantization fp8 or --quantization awq depending on the model availability.
Step 4: Tune --gpu-memory-utilization. Start at 0.9, decrease to 0.85 if you OOM, increase to 0.95 if you have headroom.
Step 5: Tune --max-num-batched-tokens. This is the biggest throughput knob. Start at 4096 for latency-critical, 16384+ for throughput-critical.
Step 6: Verify prefix caching is enabled. It should be by default. Check with the metrics.
Step 7: Set up observability. Configure Prometheus scraping. Build dashboards for the key metrics.
Step 8: Configure autoscaling. Use KEDA with vllm:num_requests_running (Chapter 51).
Step 9: Run a benchmark on your real workload. Use vLLM’s benchmark scripts or a custom load tester (Chapter 55). Measure throughput, latency, and tail latency.
Step 10: Iterate. Adjust based on the measurements. Don’t tune blind.
This is the workflow that takes you from “vLLM is running” to “vLLM is serving production traffic well.” It’s not glamorous; it’s mostly measurement and iteration.
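Worked end to end, the workflow above might converge on something like this for a 70B chat deployment (every value here is a starting point to measure against, not a recommendation):

```shell
# Illustrative end state of the tuning workflow for a 70B chat service.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 8 \
    --max-model-len 16384 \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.92 \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 256 \
    --enable-chunked-prefill \
    --enable-prefix-caching
```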
48.12 The mental model
Eight points to take into Chapter 49:
- Defaults get you to 60% of optimal. Tuning gets you to 95%.
- --max-model-len should match your actual workload, not the model's max.
- --tensor-parallel-size is set by hardware and latency requirements.
- --gpu-memory-utilization is the safety knob. Start at 0.9.
- --max-num-batched-tokens is the throughput knob. Tune for your workload.
- Quantization and prefix caching are the two biggest cost wins. Enable them.
- Observability metrics are exposed automatically on /metrics. Scrape them.
- The tuning workflow is measure → adjust → measure. Don't tune blind.
In Chapter 49 we look at the other production runtime that matters: TEI for embeddings and rerankers.
Read it yourself
- The vLLM documentation, especially the “Engine Arguments” page.
- The vLLM CLI reference (vllm serve --help).
- The vLLM benchmarking documentation.
- The vLLM source code, especially vllm/engine/arg_utils.py for the canonical list of flags.
- The vLLM GitHub issues and discussions for community tuning advice.
Practice
- Construct a vLLM serve command for Llama 3.1 70B Instruct with tp=8, max_model_len=16384, fp8 quantization, prefix caching enabled, max-num-batched-tokens=8192.
- Why is --max-model-len important to set lower than the model default? What happens if you leave it at 131072?
- Read the vLLM arg_utils.py source. Identify five flags that aren't covered in this chapter and what they do.
- For a workload with very short prompts (~50 tokens) and long completions (~2000 tokens), how would you tune --max-num-batched-tokens and --max-num-seqs?
- The metric vllm:num_requests_running is used as the autoscaling signal. Why is it better than CPU utilization for LLMs?
- Why does --quantization require the model to be pre-quantized? What's the failure mode if you pass awq to an unquantized model?
- Stretch: Run vLLM with a small open model and tune --max-num-batched-tokens while measuring throughput. Find the sweet spot for your hardware.