Warmup, readiness probes, and the model-isn't-ready-yet problem
"The pod is up. The model is loaded. The first request still takes 30 seconds. Why?"
In Chapter 52 we covered the cold start problem at the deployment level — getting model weights onto the node. In this chapter we cover the related but distinct problem at the request level — making sure that when a pod says it’s ready, it’s actually able to serve requests at full speed.
The “model-isn’t-ready-yet” problem is one of the more annoying operational issues with LLM serving. It manifests as: the pod has passed all its health checks, K8s is routing traffic to it, and the first few requests are mysteriously slow or even failing. The root cause is lazy initialization — vLLM (and other runtimes) defer some work until the first request actually arrives, and that work takes seconds to minutes.
This chapter is about diagnosing and fixing the model-isn’t-ready-yet problem.
Outline:
- The lifecycle of a “ready” pod.
- The lazy initialization problem.
- CUDA kernel JIT compilation.
- Memory allocation timing.
- Warmup endpoints and how to use them.
- PostStart lifecycle hooks and their timeout problems.
- Idempotent warmup scripts.
- Readiness probe design.
- The interaction with rolling deployments.
54.1 The lifecycle of a “ready” pod
Let me trace the lifecycle of an LLM pod from container start to “actually serving traffic well”:
T+0: Container starts. The runtime image is loaded. The Python process begins.
T+5s: vLLM imports its dependencies (PyTorch, transformers, custom CUDA libraries).
T+10s: vLLM reads the model config and validates the model files exist on the volume.
T+15s: vLLM initializes the CUDA context and reserves the GPU.
T+30s: vLLM begins loading model weights from disk into HBM. For a 70B model, this takes 30-60 seconds.
T+90s: Model weights are in HBM. vLLM starts the API server.
T+95s: The /health endpoint starts returning 200 OK.
T+95s: Kubernetes’ startup probe sees the 200 OK and passes; the readiness probe marks the pod ready. Traffic begins flowing.
T+96s: The first request arrives. vLLM processes it.
T+126s: The first request completes. It took ~30 seconds, much longer than steady-state.
T+140s: The second request arrives and completes in 200ms.
The mystery: why did the first request take 30 seconds when the pod was supposedly ready at T+95?
The answer: lazy initialization. Several things happened during the first request that hadn’t happened during pod startup:
- CUDA kernel JIT compilation for the specific input shape. vLLM uses dynamic kernel selection, and the first request triggers compilation of any kernels not yet cached.
- Memory pool allocation. Some KV cache blocks weren’t pre-allocated, only allocated on first use.
- Tokenizer initialization for the first batch.
- Backend warmup for FlashAttention or similar kernels that have lazy state.
These add up to seconds of extra latency on the first request. After the first request, everything is cached and the pod is at full speed.
The user sees: the first request through every new pod is slow. For autoscaling, this means every scale-up event causes a latency spike for the unlucky users hitting the new pods. Canary deployments have the same problem, with the added twist that the canary’s slow first requests can skew the very latency metrics the canary is being judged on.
The fix is to trigger all the lazy initialization before declaring the pod ready, by running warmup requests during startup.
54.2 The lazy initialization problem
The deeper reason lazy initialization exists: eager initialization is hard to do correctly. The runtime doesn’t know what shapes and sizes future requests will use. It can guess, but the guesses might be wrong, leading to wasted work.
Lazy initialization is the lazy programmer’s solution: do the work when it’s actually needed. The first request reveals what shapes are used, and the runtime compiles the kernels for those shapes. Subsequent requests with the same shapes hit the cache.
The cost is the latency spike on the first request. For a single-user system, this is annoying. For a production fleet with autoscaling, it’s a real operational issue.
The fix: eager initialization triggered by warmup requests. We pre-emptively run requests through the model that exercise all the code paths, forcing all the lazy initialization to happen during startup. The pod doesn’t pass its readiness probe until warmup is complete.
54.3 CUDA kernel JIT compilation
The biggest source of first-request latency is CUDA kernel JIT compilation. Modern GPU code is often compiled from PTX (NVIDIA’s intermediate representation) to SASS (the actual GPU machine code) at runtime for the specific GPU model, while higher-level compilers such as Triton and torch.compile generate kernels for the specific input shapes they see at runtime.
Some kernels are AOT-compiled and cached, but many are JIT-compiled on first use:
- CUDA Graphs that vLLM uses for low-overhead kernel launches need to be captured for each unique input shape.
- FlashAttention v3 has shape-specific kernel variants that get compiled lazily.
- PyTorch’s torch.compile generates code on first call.
- Triton kernels are compiled on first use.
Each of these adds 1-5 seconds for the first request that triggers them. Combined, the first request can be 20-30 seconds slower than steady-state.
The fix: warm up by running requests with the shapes you expect in production. The warmup request shapes don’t have to match the actual workload exactly; they just have to trigger the kernel compilation.
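A minimal sketch of generating shape-bucketed warmup prompts. The `make_prompt` helper and the bucket sizes are illustrative assumptions, not vLLM conventions; the point is one prompt per shape bucket you expect in production:

```shell
# Sketch: build one filler prompt per expected shape bucket.
# Approximation: "test " repeated N times stands in for an N-token prompt.
make_prompt() {
  local n="$1" out=""
  for _ in $(seq "$n"); do
    out+="test "
  done
  printf '%s' "$out"
}

# Each bucket triggers kernel compilation for a different input shape.
for bucket in 100 500 1000; do
  prompt=$(make_prompt "$bucket")
  echo "bucket $bucket: $(echo "$prompt" | wc -w) words"
done
```

The exact lengths don’t have to match production traffic; they just have to land in the same shape buckets the compiled kernels are keyed on.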
54.4 Memory allocation timing
Another lazy initialization source: memory pools. vLLM allocates the KV cache pool at startup (good), but other allocators (PyTorch’s caching allocator, custom buffers) may grow on demand. The first time the model needs a buffer of a particular size, the allocator goes to the OS for memory.
This is fast in absolute terms (microseconds to milliseconds per allocation), but if many allocations happen during the first request, the cumulative latency is noticeable.
The fix: warm up to allocate all the buffers, so the first real request finds them all already allocated.
54.5 Warmup endpoints
The standard pattern for fixing lazy init: run warmup requests after vLLM starts and before the pod passes the readiness probe.
A simple warmup script:
#!/bin/bash
# MODEL_NAME must be set in the environment to the served model's name.
set -euo pipefail

# Wait for vLLM to be ready
until curl -s http://localhost:8000/health > /dev/null; do
  sleep 1
done

# Run warmup requests at several prompt lengths to exercise the
# kernel shapes we expect in production
for prompt_length in 100 500 1000; do
  curl -s -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{
      \"model\": \"$MODEL_NAME\",
      \"messages\": [{\"role\": \"user\", \"content\": \"$(python -c "print('test ' * $prompt_length)")\"}],
      \"max_tokens\": 50,
      \"temperature\": 0
    }" > /dev/null
done

echo "Warmup complete"
This waits for vLLM to be ready, then runs three warmup requests with different prompt lengths. Each request triggers JIT compilation for that input shape. After warmup, subsequent requests hit cached kernels.
The warmup script is run as part of the container startup, after vLLM but before the readiness probe passes. This ensures the pod doesn’t accept traffic until warmup is done.
How to integrate this with vLLM and KServe:
Option 1: Wrap vLLM in a startup script.
CMD ["bash", "-c", "vllm serve $MODEL & VLLM_PID=$!; /warmup.sh; wait $VLLM_PID"]
The Dockerfile starts vLLM in the background, runs the warmup script, then waits for vLLM. The container appears “running” only after warmup completes, but vLLM is actually already accepting requests in the background.
Option 2: Use a custom readiness probe.
spec:
containers:
- name: vllm
readinessProbe:
exec:
command:
- sh
- -c
- "/warmup.sh && curl -f http://localhost:8000/health"
initialDelaySeconds: 60
periodSeconds: 30
The readiness probe runs the warmup script (which is idempotent and exits quickly after the first run) and then checks /health. The pod is ready only after both succeed.
Option 3: Run warmup as part of vLLM’s CLI.
vLLM has a --enforce-eager flag that disables CUDA Graphs (eliminating one source of JIT but losing some performance). For some deployments, this is the simplest fix.
Most production deployments use Option 1 or Option 2.
54.6 PostStart lifecycle hooks and their timeout problems
K8s has a postStart lifecycle hook that runs after the container starts. You might think this is the right place for warmup:
spec:
containers:
- name: vllm
lifecycle:
postStart:
exec:
command:
- /warmup.sh
This doesn’t work reliably for slow warmup. The postStart hook blocks the container from being reported as running until it completes, and it has an implicit timeout (usually a few minutes). If your warmup takes 5 minutes, the hook might be killed before it completes, and the container is killed and restarted according to its restart policy.
The K8s docs are vague about the exact timeout, but in practice:
- Synchronous postStart hooks should complete in a few minutes max.
- Long-running warmup must be done asynchronously or as part of the main container command.
The workaround: make warmup asynchronous.
lifecycle:
postStart:
exec:
command:
- sh
- -c
- "(/warmup.sh > /tmp/warmup.log 2>&1 &) ; exit 0"
This launches the warmup in the background and returns immediately. The hook succeeds. The warmup runs in parallel with whatever else is happening. The downside: the readiness probe might pass before warmup is done, defeating the point.
A better fix: use Option 1 (wrap vLLM in a startup script) instead of postStart. The startup script can take as long as it needs because it’s part of the main container command, not a hook.
The lesson: don’t use postStart for slow operations. Use the main container command or a separate init container.
54.7 Idempotent warmup scripts
If warmup runs every time a pod starts, and the pod might restart for various reasons (eviction, OOM, manual restart), the warmup script needs to handle being run multiple times. Make warmup idempotent.
A non-idempotent warmup might write a marker file to indicate completion. The next run sees the marker and skips warmup. But if the marker is in ephemeral storage (like an emptyDir or a tmpfs), it doesn’t survive container restart and warmup runs again. If it’s in persistent storage, the marker can become stale (valid for an old model version).
The simplest approach: just run warmup every time. It’s fast (a few seconds to a minute) and there’s no harm in running it multiple times.
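If you do want a marker, key it to the model identity so a marker left by an old version is ignored. A sketch, with illustrative names (`MARKER_DIR`, `run_warmup` are assumptions, not vLLM conventions):

```shell
# Idempotent warmup wrapper: skip only if THIS model already warmed up.
# MARKER_DIR and run_warmup are illustrative names.
MARKER_DIR="${MARKER_DIR:-/tmp/warmup}"

warmup_once() {
  local model="$1"
  local marker="$MARKER_DIR/$model.done"
  if [ -f "$marker" ]; then
    echo "warmup already done for $model, skipping"
    return 0
  fi
  run_warmup "$model"    # the real warmup requests go here
  mkdir -p "$MARKER_DIR"
  touch "$marker"        # mark done only after warmup succeeds
}
```

Because the marker filename includes the model name, deploying a new version naturally invalidates the old marker instead of leaving a stale one behind.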
If warmup is expensive (e.g., several minutes for very large models), you can cache the JIT-compiled artifacts on the host:
volumes:
- name: cuda-cache
hostPath:
path: /var/cache/cuda
type: DirectoryOrCreate
containers:
- name: vllm
env:
- name: CUDA_CACHE_PATH
value: /var/cache/cuda
volumeMounts:
- name: cuda-cache
mountPath: /var/cache/cuda
CUDA caches compiled kernels in CUDA_CACHE_PATH. If the cache is on a hostPath volume, it persists across pod restarts on the same node. The first warmup populates the cache; subsequent restarts read from it.
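A quick way to see whether a pod landed on a node with a warm cache. This is a sketch: the warm/cold heuristic (any file present in the cache directory) is an assumption, and the `/var/cache/cuda` default matches the hostPath above:

```shell
# Report whether the shared CUDA JIT cache already holds compiled kernels.
# Heuristic: a non-empty directory means an earlier pod on this node
# already paid the compilation cost, so warmup should be fast.
cuda_cache_state() {
  local cache="${CUDA_CACHE_PATH:-/var/cache/cuda}"
  if [ -d "$cache" ] && [ -n "$(ls -A "$cache" 2>/dev/null)" ]; then
    echo "warm"
  else
    echo "cold"
  fi
}

echo "CUDA cache is $(cuda_cache_state)"
```

Logging this at startup makes cache hits and misses visible in the pod logs, which helps confirm the hostPath trick is actually working.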
This is one of those small operational tricks that saves real time at scale.
54.8 Readiness probe design
The K8s readiness probe controls when traffic flows to a pod. Get the probe right or you’ll either route traffic before the pod is ready (bad) or wait too long after it’s ready (also bad).
For LLM serving, the probe should:
- Wait long enough for model loading. A 70B model takes 60-90 seconds; the probe needs to give it that time.
- Verify the model is actually loaded (not just that the process is running).
- Verify warmup is complete (not just that the model is loaded).
- Be cheap to run. The probe runs every few seconds; expensive probes waste resources.
A typical configuration:
spec:
containers:
- name: vllm
startupProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 30
periodSeconds: 10
failureThreshold: 30 # 30 × 10s = 5 minutes max
successThreshold: 1
readinessProbe:
httpGet:
path: /health
port: 8000
periodSeconds: 5
failureThreshold: 3
livenessProbe:
httpGet:
path: /health
port: 8000
periodSeconds: 10
failureThreshold: 5
initialDelaySeconds: 600 # Don't kill during startup
Three probes:
- Startup probe: gives the pod up to 5 minutes to start (loading + warmup). Once it passes, the readiness and liveness probes take over.
- Readiness probe: checks every 5 seconds. If it fails, the pod is removed from the service endpoints (no new traffic).
- Liveness probe: checks every 10 seconds. If it fails repeatedly, the pod is restarted.
The initialDelaySeconds: 600 on the liveness probe is belt-and-suspenders: when a startup probe is configured, K8s doesn’t run the liveness probe until startup succeeds, but the long initial delay also protects deployments that omit the startup probe. Without either safeguard, K8s might decide the pod is unhealthy and kill it before it finishes loading.
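As a sanity check on the numbers in the configuration above, the time budget each probe grants before its failure threshold trips is initialDelaySeconds + failureThreshold × periodSeconds. A quick sketch (it ignores the per-attempt timeoutSeconds, which adds a little extra slack):

```shell
# Worst-case time a probe tolerates: the initial delay plus
# failureThreshold consecutive failed periods.
probe_budget() {
  local initial_delay=$1 period=$2 failure_threshold=$3
  echo $(( initial_delay + failure_threshold * period ))
}

echo "startup probe:  $(probe_budget 30 10 30)s"   # 330s, ~5.5 minutes
echo "liveness probe: $(probe_budget 600 10 5)s"   # 650s before restart
```

If your 70B model routinely needs more than the startup budget, bump failureThreshold rather than periodSeconds, so a healthy pod still becomes ready promptly.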
stateDiagram-v2
[*] --> Starting
Starting --> WarmingUp : model loaded, /health 200
WarmingUp --> Ready : warmup requests complete
Ready --> NotReady : readiness probe fails
NotReady --> Ready : readiness probe passes
Ready --> Restarting : liveness probe fails (5×)
Restarting --> Starting
style WarmingUp fill:var(--fig-accent-soft),stroke:var(--fig-accent)
style Ready fill:var(--fig-surface),stroke:var(--fig-border)
The pod must pass through the WarmingUp state — where warmup requests trigger CUDA JIT compilation — before K8s marks it Ready and routes traffic to it.
54.9 The interaction with rolling deployments
When you deploy a new model version, K8s does a rolling update: spin up new pods, wait for them to be ready, then terminate old pods. The interaction with cold start matters.
A rolling deployment of a 70B model:
- K8s creates a new pod.
- The new pod takes 5 minutes to load and warm up.
- After 5 minutes, the new pod is ready.
- K8s terminates one old pod.
- Repeat for each replica.
For 10 replicas, this takes ~50 minutes. During the rollout, you have a mix of old and new pods, and the fleet size temporarily grows (because new pods come up before old pods come down). This consumes extra GPU resources.
The strategies to make rollouts faster:
(1) Surge with limit. Configure the rolling update to allow up to 25% surge (extra pods during rollout). This parallelizes the rollout while bounding the extra resource use.
(2) Pre-warmed nodes. Use LocalModelCache to pre-cache the new model version on the nodes before deploying. This skips the download step in cold start.
(3) Blue-green deployments. Bring up the entire new fleet alongside the old fleet, switch traffic at the gateway, then terminate the old fleet. Faster but uses 2× the resources during the switch.
(4) Canary deployments. Bring up a small number of new pods (10% of fleet), validate, then expand. Slower but lower risk.
Most production deployments use rolling updates with surge for routine deploys, and blue-green for major version changes.
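The trade-offs above can be put in rough numbers. A back-of-envelope sketch: pods are replaced in waves of size maxSurge, and each wave pays one full cold start (this ignores readiness skew and assumes uniform wave times):

```shell
# Estimated rollout time: ceil(replicas / surge) waves, one cold start each.
rollout_minutes() {
  local replicas=$1 surge=$2 cold_start=$3
  local waves=$(( (replicas + surge - 1) / surge ))  # ceiling division
  echo $(( waves * cold_start ))
}

# 10 replicas, 5-minute cold start per pod.
# 25% surge on 10 replicas rounds up to 3 extra pods.
echo "one at a time:      $(rollout_minutes 10 1 5) min"
echo "25% surge (3 pods): $(rollout_minutes 10 3 5) min"
echo "blue-green (all):   $(rollout_minutes 10 10 5) min"
```

This matches the ~50-minute sequential estimate earlier in the section, and shows why even a modest surge cuts rollout time by more than half.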
54.10 The mental model
Eight points to take into Chapter 55:
- A “ready” pod is not always actually ready. Lazy initialization causes first-request latency spikes.
- CUDA kernel JIT compilation is the biggest source of first-request lag. Warmup triggers it.
- Run warmup before the readiness probe passes. The pod shouldn’t accept traffic until warmup is done.
- PostStart hooks have timeouts and shouldn’t be used for slow warmup. Use the main container command or an init container.
- Make warmup idempotent. It runs every time the pod starts.
- Cache CUDA JIT artifacts on the host to speed up subsequent warmups.
- Set the startup probe generously (5+ minutes for large models) and the liveness probe with a long initial delay.
- Rolling deployments interact with cold start. Use surge, pre-cached nodes, or blue-green to speed them up.
In Chapter 55 we look at the methodology of benchmarking inference — how to measure all of this correctly.
Read it yourself
- The Kubernetes documentation on pod lifecycle, probes, and lifecycle hooks.
- The vLLM --enforce-eager flag documentation.
- The PyTorch torch.compile documentation on caching and warmup.
- The CUDA CUDA_CACHE_PATH documentation.
- Examples of warmup scripts in production LLM deployments.
Practice
- Write a warmup script for vLLM that runs three requests with prompt lengths 100, 500, and 2000 tokens. Make it idempotent.
- Why are lifecycle postStart hooks unreliable for slow warmup? Trace what happens with a 5-minute warmup.
- Configure a startup probe for a Llama 3 70B deployment that allows up to 10 minutes of loading time.
- How does CUDA kernel JIT compilation cause first-request lag? Identify three specific kernels that get JIT-compiled.
- Why should you set initialDelaySeconds on the liveness probe for LLM pods? What goes wrong without it?
- For a 10-replica rolling deployment of a 70B model with 5-minute cold start, how long does the rollout take with 25% surge? With blue-green?
- Stretch: Run vLLM on a small model with and without warmup. Measure the first-request latency in each case. Quantify the improvement.