Part III · Inference Internals & Production Serving
Chapter 52 ~14 min read

The model cold-start problem and pre-cached weights

"A new pod that has to download a 140 GB model from S3 takes longer to start than a tepid bath"

In Chapter 51 we saw that cold start is the #1 pain in LLM autoscaling. This chapter is about why cold start is so slow and what to do about it. The single most important technique is pre-caching the model weights on node-local NVMe so new pods load from local disk instead of pulling from object storage. The KServe LocalModelCache CRD is the standard mechanism on K8s.

By the end of this chapter you’ll know how to design a cold-start mitigation strategy for any LLM deployment.

Outline:

  1. Why cold start is slow.
  2. The model loading sequence.
  3. Pre-caching weights on node-local NVMe.
  4. The LocalModelCache CRD.
  5. Init container alternatives.
  6. Daemonset alternatives.
  7. Warmup endpoints.
  8. The timing-coordination gotcha.
  9. Patterns for fast cold start.

52.1 Why cold start is slow

A cold start (a pod starting from nothing) for a large LLM has several slow steps:

  1. Pod scheduling. K8s picks a node with the right resources. Usually fast (~1 second).
  2. Image pull. The runtime container image (vLLM, TensorRT-LLM, etc.) is pulled from a registry. The image is typically 2-5 GB. Pull time depends on the network and image cache; typically 30-60 seconds for a fresh pull, near-instant if cached.
  3. Init container or storage initializer runs. Downloads the model from S3/GCS/etc. This is the slow step. A 70B model in bf16 is 140 GB. At 1 GB/s download (typical for cloud-managed S3 in the same region), that’s 140 seconds. At slower speeds, it’s much worse.
  4. Runtime container starts. vLLM Python process initializes, loads dependencies. ~5-10 seconds.
  5. Model loads into GPU memory. vLLM reads the weights from disk and loads them into HBM. Also slow. For a 70B model from local NVMe, ~30-60 seconds. From slower disk, longer.
  6. CUDA kernel compilation. Some kernels (especially custom ones for the specific model architecture) get JIT-compiled on first run. Adds 10-30 seconds.
  7. Health checks pass. vLLM’s /health endpoint starts returning 200. The startup probe sees this and marks the pod ready.
  8. Traffic flows. The K8s service starts routing requests to the new pod.

Total cold start: 2-10 minutes for a large model. The dominant components are model download (step 3) and model loading into GPU memory (step 5). Everything else is comparatively fast.

The two slow steps are independent. You can mitigate model download with pre-caching on local disk (this chapter). You can mitigate model loading time with faster disk I/O (NVMe instead of network storage).

Without a local cache, every cold start downloads the model from S3 (3-5 minutes) before the ~75 s load into GPU memory; with the LocalModelCache, the download step is skipped on nodes that already hold the weights, and cold start shrinks to roughly the 75 seconds of GPU loading.

52.2 The model loading sequence

Let me trace the model loading sequence in more detail. For a 70B model:

Download phase:

  • The init container or storage initializer starts.
  • It connects to S3 (or other storage) and lists the model files.
  • It downloads each shard (typically 16-20 shards for a 70B model in bf16).
  • The shards are written to a shared volume (an emptyDir, a PVC, or a hostPath).
  • Total: 140 GB at ~1 GB/s = 140 seconds. Realistic with overhead and slower-than-peak network: 3-5 minutes.

Loading phase:

  • The runtime container starts.
  • vLLM reads the model config from the shared volume.
  • For each shard, it reads the safetensors file and loads it into GPU memory via PyTorch.
  • The loading is sequential per shard (some parallelism is possible but not always exploited).
  • The bottleneck is disk read bandwidth. From local NVMe (~7 GB/s), 140 GB = 20 seconds. From network storage (~1 GB/s), 140 seconds.

JIT phase:

  • vLLM’s first forward pass triggers JIT compilation of CUDA kernels for the specific shapes.
  • This is one-time per pod startup.
  • 10-30 seconds for a typical model.

Warmup phase (optional):

  • Some serving stacks run a warmup forward pass to pre-compile and pre-cache things.
  • Adds another 10-30 seconds.

The optimization targets:

  • Eliminate the download phase by pre-caching the model on the node before the pod starts. Saves 3-5 minutes.
  • Speed up the loading phase by using fast local NVMe for the model storage. Saves 2-3 minutes.
  • Reduce JIT time by warming up at startup. Saves 10-30 seconds.

The first two are the big wins. The third is incremental.
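To make the budget concrete, here is a back-of-the-envelope model of the numbers above. It is a sketch: the bandwidths and the 60-second lump for scheduling, image pull, process init, and JIT are illustrative assumptions from this chapter, not measurements.

```python
from typing import Optional

def cold_start_seconds(model_gb: float,
                       download_gbps: Optional[float],
                       disk_read_gbps: float,
                       fixed_overhead_s: float = 60.0) -> float:
    """Estimate cold start: optional remote download, disk-to-HBM load,
    plus a lump sum for scheduling, image pull, process init, and JIT.

    download_gbps=None models a pre-cached model (download skipped).
    """
    download_s = model_gb / download_gbps if download_gbps else 0.0
    load_s = model_gb / disk_read_gbps
    return download_s + load_s + fixed_overhead_s

# No cache: pull 140 GB from S3 at 1 GB/s, then load from the same slow volume.
print(cold_start_seconds(140, download_gbps=1.0, disk_read_gbps=1.0))   # 340.0
# Pre-cached on local NVMe (~7 GB/s): the download term disappears.
print(cold_start_seconds(140, download_gbps=None, disk_read_gbps=7.0))  # 80.0
```

The two calls bracket the chapter's claim: eliminating the download and moving to NVMe turns a multi-minute cold start into roughly a minute, dominated by the fixed overhead.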

52.3 Pre-caching weights on node-local NVMe

The core idea: store the model weights on each GPU node’s local NVMe SSD so new pods can load them without going to remote storage.

The mechanism:

  1. A controller (the LocalModelCache CRD’s controller, or a custom DaemonSet) downloads the model to a known path on each node.
  2. New pods that need the model mount that path via a hostPath volume.
  3. The pod skips the download phase and goes straight to loading from local NVMe.

The savings: 3-5 minutes per cold start, replaced by 0 seconds of download.

The tradeoffs:

  • Disk space. Every GPU node needs enough NVMe to hold the cached models. A 70B model is 140 GB; caching several models multiplies that requirement. Modern GPU nodes have 1-4 TB of NVMe, enough for a handful of models.
  • Cache management. Models change. New versions need to be downloaded; old versions need to be evicted. The cache controller handles this.
  • First download is still slow. The first time a model is pulled to a node, it takes the full download time. Subsequent pods on that node load fast.

For production deployments where pods come and go frequently, the local cache pays for itself within a few hours of operation.

52.4 The LocalModelCache CRD

KServe defines a LocalModelCache CRD specifically for this purpose. The CRD describes a model that should be cached on a set of nodes:

apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelCache
metadata:
  name: llama-3-70b-cache
spec:
  modelName: llama-3-70b
  storageUri: s3://my-bucket/models/llama-3-70b/
  nodeSelector:
    nvidia.com/gpu.product: H100
  storageClass: local-nvme
  capacity: 200Gi

This says: “for every node matching the nodeSelector, download the Llama 3 70B model from S3 and store it on the local-nvme storage class, allocating 200 GB.”

The KServe LocalModelCache controller:

  1. Watches for new nodes matching the selector.
  2. When a new node appears, creates a download Job that copies the model from S3 to the node’s local storage.
  3. Tracks the cache state per node.
  4. When pods reference this cached model, mounts the local path automatically.

The InferenceService spec then references the cached model:

spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      runtime: vllm-runtime
      storageUri: cache://llama-3-70b-cache

The cache:// URI tells KServe “load this from the LocalModelCache, not from remote storage.”

When a new pod is scheduled on a node where the cache is ready, the model is mounted instantly. When scheduled on a node where the cache isn’t ready, the controller falls back to remote download (but the cache for that node starts populating, so future pods will be fast).

graph TD
  A[LocalModelCache controller] -->|watches new nodes| B{cache ready on node?}
  B -->|yes| C[mount hostPath directly]
  B -->|no| D[download Job runs]
  D --> E[writes to node NVMe]
  E --> C
  C --> F[vLLM pod starts from local disk]
  style C fill:var(--fig-accent-soft),stroke:var(--fig-accent)
  style F fill:var(--fig-accent-soft),stroke:var(--fig-accent)

The LocalModelCache controller populates each node once; all subsequent pods on that node skip the download and mount from local NVMe immediately.
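The fallback behavior in the diagram can be sketched as a toy controller. This is a hypothetical class, not KServe's actual implementation: the real download Job is asynchronous, while here populate() completes instantly so the state transition is easy to see. The paths are the ones from the manifests above.

```python
class CacheController:
    """Toy model of per-node cache state and the remote-fallback decision."""

    LOCAL_PATH = "/var/cache/models/llama-3-70b"       # hostPath from the manifest
    REMOTE_URI = "s3://my-bucket/models/llama-3-70b/"  # storageUri from the manifest

    def __init__(self) -> None:
        self.ready_nodes: set[str] = set()

    def populate(self, node: str) -> None:
        # stands in for the download Job writing the shards to node NVMe
        self.ready_nodes.add(node)

    def source_for(self, node: str) -> str:
        if node in self.ready_nodes:
            return self.LOCAL_PATH       # cache hit: mount hostPath directly
        self.populate(node)              # start filling the cache for next time
        return self.REMOTE_URI           # this pod pays the slow remote download
```

The first pod on a fresh node gets the remote URI and pays the download; every later pod on that node gets the local path.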

52.5 Init container alternatives

If you don’t want to use LocalModelCache (or your KServe version doesn’t support it), you can implement the same pattern with init containers:

spec:
  predictor:
    initContainers:
      - name: download-model
        image: amazon/aws-cli:latest
        command:
          - sh
          - -c
          - |
            if [ ! -f /mnt/cache/.complete ]; then
              aws s3 sync s3://my-bucket/models/llama-3-70b/ /mnt/cache/
              touch /mnt/cache/.complete
            fi
        volumeMounts:
          - name: model-cache
            mountPath: /mnt/cache
    volumes:
      - name: model-cache
        hostPath:
          path: /var/cache/models/llama-3-70b
          type: DirectoryOrCreate
    model:
      runtime: vllm-runtime
      storageUri: file:///mnt/cache

The init container checks if the model is already on the host’s /var/cache/models/llama-3-70b. If not, it downloads it. If yes, it’s a no-op. The runtime container then loads from the cached location.

This is the same pattern as LocalModelCache but implemented manually. The downside is you have to handle cache eviction yourself (LocalModelCache does it automatically).
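The check-then-download logic from that init container can be sketched in Python, with the downloader injected so the function can be exercised without S3. ensure_cached is a hypothetical helper mirroring the shell above, not part of any library.

```python
from pathlib import Path
from typing import Callable

def ensure_cached(cache_dir: str, download: Callable[[str], None]) -> bool:
    """Idempotent cache fill. Returns True if a download actually ran.

    Mirrors the init container: a .complete marker written only after a
    full download distinguishes a finished cache from a partial one, so
    a pod killed mid-download doesn't leave a cache that looks complete.
    """
    marker = Path(cache_dir) / ".complete"
    if marker.exists():
        return False           # cache hit: no-op, start loading immediately
    download(cache_dir)        # e.g. aws s3 sync s3://... into cache_dir
    marker.touch()             # mark complete only after the sync succeeds
    return True
```

The marker-file ordering is the important design choice: touch the marker after the sync, never before, or a crash mid-download poisons the cache for every later pod on that node.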

52.6 Daemonset alternatives

Another approach: run a DaemonSet on every GPU node that pre-pulls models on startup. The DaemonSet’s job is to download the model and keep it on local disk; it doesn’t run the inference itself.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: model-prefetcher
spec:
  selector:
    matchLabels:
      app: model-prefetcher
  template:
    metadata:
      labels:
        app: model-prefetcher
    spec:
      nodeSelector:
        nvidia.com/gpu.product: H100
      containers:
        - name: prefetcher
          image: amazon/aws-cli:latest
          command:
            - sh
            - -c
            - |
              aws s3 sync s3://my-bucket/models/llama-3-70b/ /mnt/cache/
              # keep running so the pod stays alive
              sleep infinity
          volumeMounts:
            - name: model-cache
              mountPath: /mnt/cache
      volumes:
        - name: model-cache
          hostPath:
            path: /var/cache/models/llama-3-70b

This DaemonSet pre-pulls the model on every GPU node. When a vLLM pod is scheduled to a node, the model is already there.

The advantage over init containers: no per-pod download overhead. The DaemonSet pulls once per node, and all subsequent pods on that node use the cache.

The disadvantage: more complex management. You need to coordinate which models are cached on which nodes, handle versioning, evict old caches.

For most production deployments, LocalModelCache via KServe is the cleanest option. DaemonSet is the manual alternative.
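The cache-management burden mentioned above is mostly eviction, which amounts to LRU under a disk budget. Here is a toy sketch of what a custom DaemonSet manager might do; the class and its behavior are assumptions for illustration, not how LocalModelCache or any particular controller implements it.

```python
from collections import OrderedDict

class NodeModelCache:
    """Toy LRU model cache for one node's NVMe, under a disk budget in GB."""

    def __init__(self, capacity_gb: float):
        self.capacity_gb = capacity_gb
        self.models = OrderedDict()  # name -> size_gb, least recently used first

    def used_gb(self) -> float:
        return sum(self.models.values())

    def touch(self, name: str) -> None:
        # record a use (a pod loading this model) so hot models survive eviction
        self.models.move_to_end(name)

    def add(self, name: str, size_gb: float) -> None:
        # evict least-recently-used models until the new one fits
        while self.models and self.used_gb() + size_gb > self.capacity_gb:
            self.models.popitem(last=False)
        self.models[name] = size_gb
```

With a 300 GB budget, two 140 GB models fit; pulling a third evicts whichever of the first two no pod has touched most recently.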

52.7 Warmup endpoints

Even after the model is loaded, vLLM may have JIT compilation and kernel caching that happens on the first request. To smooth out the first-request latency, you can run a warmup pass before declaring the pod ready.

The pattern: after vLLM starts and the model is loaded, run a fake request through it to trigger any lazy initialization. Only then mark the pod as ready.

# Inside the runtime container, after vLLM starts:
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "llama-3-70b", "messages": [{"role": "user", "content": "hi"}], "max_tokens": 1}'

After the warmup request returns, the pod is fully ready and the first real request won’t see JIT compilation.

KServe has a warmup hooks feature that runs custom commands after the runtime starts but before the pod is ready. You can use this to run warmup requests.

The catch is that lifecycle hooks have timeout issues for slow warmup. The K8s postStart hook has an implicit timeout (a few minutes). For models that need long warmup (multimodal, very large models), the hook can be killed before warmup completes. The fix is to make warmup idempotent and run it asynchronously:

lifecycle:
  postStart:
    exec:
      command:
        - sh
        - -c
        - |
          (sleep 5 && curl http://localhost:8000/health > /dev/null && \
           curl -s -X POST http://localhost:8000/v1/chat/completions ...) &
          # exit immediately to avoid timeout

This runs the warmup in the background and returns immediately, avoiding the timeout but losing the “wait until warmup is done” guarantee.

A cleaner approach: run warmup in a script inside the runtime container’s command, before starting the API server. This way the pod doesn’t pass its startup probe until warmup is complete.
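That wrapper-script approach can be sketched in Python. warm_up is a hypothetical helper; the health probe and the warmup request are injected as callables so the sketch needs no live server, but in practice they would hit /health and /v1/chat/completions as shown above.

```python
import time
from typing import Callable

def warm_up(probe: Callable[[], bool],
            request: Callable[[], None],
            timeout_s: float = 300.0,
            poll_s: float = 0.5) -> bool:
    """Block until the server is healthy, then fire one warmup request.

    probe() checks /health; request() sends a 1-token completion that
    triggers JIT compilation. Run this before marking the pod ready, so
    the startup probe only passes once the first-request lag is paid.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            request()       # pays the JIT / kernel-cache cost exactly once
            return True
        time.sleep(poll_s)
    return False   # never became healthy: let the startup probe fail the pod
```

Because this runs in the container's main command before the API server is declared ready, it keeps the "wait until warmup is done" guarantee that the backgrounded postStart hook gives up.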

52.8 The timing-coordination gotcha

A specific operational gotcha: the LocalModelCache must be ready before the InferenceService pod starts. If the pod starts on a node where the cache isn’t yet ready, the pod either falls back to slow remote download or fails to start.

KServe’s controller tries to handle this by waiting for the cache to be ready before scheduling pods to a node. But the coordination isn’t always perfect — race conditions can happen, especially during cluster scale-up where new nodes are joining and the cache controller is racing to populate them.

The fix is operational discipline:

  • Pre-warm the cache before deploying the InferenceService. Apply the LocalModelCache first, wait for it to be ready on enough nodes, then apply the InferenceService.
  • Monitor the cache state. kubectl get localmodelcache and verify the model is ready on the expected nodes.
  • Use init containers as a fallback. Even with LocalModelCache, have an init container that checks for the cached model and falls back to download if missing. Belt and suspenders.

Without this discipline, you’ll see intermittent slow cold starts on newly added nodes. The first pod that lands on a fresh node pays the full download cost.

52.9 Patterns for fast cold start

A summary of patterns that produce fast cold start in production:

(1) Pre-cached model on node-local NVMe. The biggest single optimization. Cuts 3-5 minutes off cold start.

(2) Fast container image. Use a slim base image and minimize the image size. Aim for <2 GB images that pull quickly.

(3) Image pre-pull. Pre-pull the runtime image on every GPU node so the image-pull step is instant.

(4) Warmup on startup. Run a fake request before declaring ready, eliminating the first-request JIT lag.

(5) Parallel loading. Modern vLLM loads multi-shard models in parallel where possible. Make sure your version is recent enough.

(6) Surge capacity. Keep extra replicas always running so you don’t depend on cold starts during spikes.

(7) Predictive pre-warming. Scale up before predicted traffic spikes (when traffic is predictable).

(8) Smaller models when possible. A smaller model has a faster cold start. If your workload allows it, use a smaller model and serve it on more replicas.

These patterns combined can take cold start from “10 minutes” to “30 seconds” or less. The first pattern (pre-cached weights) is the biggest single win.

Stacking cold-start mitigations: each technique removes a slice of latency, with pre-cached weights providing the largest single improvement. From a ~10 minute baseline, cached weights cut cold start to ~3.2 minutes (saving 6.8 minutes), adding image pre-pull brings it to ~2.4 minutes, and a warmup script brings it to ~1.8 minutes.
Pre-caching weights eliminates the dominant download phase; stacking image pre-pull and warmup scripts removes the remaining latency, bringing total cold start under two minutes.
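The stacking arithmetic can be reproduced directly; the per-step savings below are the illustrative figures from this section, not benchmarks.

```python
# Cumulative cold-start time as mitigations stack (minutes, illustrative).
baseline_min = 10.0
savings_min = [
    ("cached weights", 6.8),   # download phase eliminated
    ("image pre-pull", 0.8),   # registry pull skipped
    ("warmup script",  0.6),   # JIT paid before readiness
]

remaining = baseline_min
for name, saved in savings_min:
    remaining -= saved
    print(f"+ {name:15s} ~{remaining:.1f} min")
# + cached weights   ~3.2 min
# + image pre-pull   ~2.4 min
# + warmup script    ~1.8 min
```

Note the diminishing returns: the first mitigation saves 6.8 minutes, the next two save well under a minute each, which is why pre-cached weights is the pattern to reach for first.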

52.10 The mental model

Eight points to take into Chapter 53:

  1. Cold start has two slow steps: model download (3-5 min) and model loading (1-2 min).
  2. Pre-caching weights on node-local NVMe eliminates the download step.
  3. LocalModelCache CRD is the KServe-native way to do it. Init containers are the manual alternative.
  4. DaemonSets can pre-pull models on all GPU nodes.
  5. Warmup endpoints smooth out JIT and first-request latency.
  6. Lifecycle hooks have timeouts that can kill long warmup. Run warmup asynchronously or in the main command.
  7. Cache must be ready before pod starts. Operational discipline matters; pre-warm caches and monitor.
  8. Cold start can go from 10 minutes to 30 seconds with the right combination of patterns.

In Chapter 53 we look at the related but distinct problem of sharing KV cache across replicas.


Read it yourself

  • The KServe LocalModelCache documentation.
  • The Kubernetes hostPath and emptyDir volume documentation.
  • The vLLM model loading source code (vllm/model_executor/).
  • The KServe storage initializer documentation.
  • Examples of model caching in production from the KServe community.

Practice

  1. Compute the total cold start time for a 70B model: 140 GB download from S3 at 1 GB/s, plus 60 seconds of GPU loading. With a local cache, what’s the new total?
  2. Why does pre-caching on node-local NVMe save 3-5 minutes per cold start?
  3. Write a LocalModelCache manifest for caching a Llama 3 70B model on H100 nodes.
  4. Why do K8s lifecycle hooks have timeout issues for slow warmup? Trace what happens with a 5-minute warmup.
  5. Explain the timing-coordination gotcha. What goes wrong if the cache isn’t ready before the pod starts?
  6. How would you handle a model upgrade with the local cache pattern? Walk through the steps.
  7. Stretch: Set up a local K8s cluster with a small model. Implement an init container that downloads the model to a hostPath. Verify subsequent pods load from the cache.