Part III · Inference Internals & Production Serving
Chapter 51 · ~14 min read

Autoscaling GPU inference with KEDA

"HPA on CPU works for web servers. For LLMs you need to scale on the right metric, and the right metric is never CPU."

In Chapters 30 and 31 we touched on autoscaling — adding and removing GPU replicas based on load. This chapter is the deep dive on how to actually do it. The Kubernetes-standard solution for LLM autoscaling is KEDA (Kubernetes Event-Driven Autoscaling), which lets you scale on custom metrics like the queue length, the number of in-flight requests, or any Prometheus query.

By the end you’ll know how to design a KEDA-based autoscaler for any LLM serving deployment, what metrics to use, how to set thresholds, and how to avoid the common pitfalls (flapping, slow scale-up, etc.).

Outline:

  1. Why HPA-on-CPU is wrong.
  2. KEDA in one paragraph.
  3. The right metrics for LLM autoscaling.
  4. KEDA ScaledObject configuration.
  5. Scaling on Prometheus queries.
  6. Cooldown and stabilization windows.
  7. Scale-to-zero.
  8. Predictive autoscaling and why it’s hard.
  9. The cold-start problem.
  10. Production patterns.

51.1 Why HPA-on-CPU is wrong

The standard Kubernetes autoscaler is the HorizontalPodAutoscaler (HPA), which scales pods based on CPU or memory utilization. For most web services, “scale up when CPU usage exceeds 70%” is a reasonable rule.

For LLM serving, this is completely wrong. GPU-based inference uses very little CPU. The CPU usage of a vLLM pod serving a 70B model is typically 5-15% — most of the work is on the GPU. If you scale based on CPU, you’ll never scale up, even when the GPU is at 100% and requests are queueing.

Memory utilization is similarly misleading. A vLLM pod allocates a fixed fraction of GPU memory (controlled by --gpu-memory-utilization) at startup, and its host memory footprint is essentially constant regardless of load. Neither metric reflects actual load.

The right metric is a measure of the GPU’s actual workload. Specifically:

  • vllm:num_requests_running — how many requests vLLM is currently processing.
  • vllm:num_requests_waiting — how many are queued.
  • te_queue_size — for TEI, the embedder queue size.
  • Custom metrics like “average TPOT over the last minute” or “p99 latency.”

These metrics directly reflect “is the GPU doing useful work and is it keeping up?” — which is what autoscaling needs to know. The standard HPA can’t read these metrics natively (it only knows CPU and memory). Hence KEDA.

Figure: HPA on CPU never crosses the scale threshold because GPU-bound inference barely moves CPU usage; KEDA on vLLM's own request metrics scales replicas in proportion to real load.

51.2 KEDA in one paragraph

KEDA is a Kubernetes operator that extends HPA with custom metrics from many sources: Prometheus, Redis, Kafka, AWS SQS, GCP PubSub, RabbitMQ, etc. You define a ScaledObject (a CRD) that says “scale this Deployment based on this metric, with these thresholds.” KEDA reads the metric, computes the desired replica count, and updates the Deployment via the standard K8s scaling API.

Under the hood, KEDA creates a regular HPA with KEDA’s own external metric provider. The end result is the same as HPA-on-CPU but with arbitrary metric sources. For LLM serving, the most common source is Prometheus (querying vLLM’s exported metrics).

KEDA is the modern standard for K8s-based custom-metric autoscaling. It's mature, widely deployed, and well-supported, and has been a CNCF graduated project since 2023.

51.3 The right metrics for LLM autoscaling

The metric you scale on determines the scaling behavior. The choices for LLM serving:

vllm:num_requests_running

Number of requests vLLM is currently processing (in the active batch). This is the most common and most appropriate metric for LLM autoscaling.

The intuition: each vLLM replica can handle a certain number of concurrent requests before it’s saturated. When num_requests_running is below that threshold, the replica has spare capacity. When it’s at or above, you need more replicas.

Threshold tuning: typically 30-80 requests per replica, depending on the model size, batch settings, and acceptable latency. Lower threshold = more replicas, lower latency, higher cost. Higher threshold = fewer replicas, higher latency, lower cost.

vllm:num_requests_waiting

Number of requests in the scheduler’s queue, waiting to be admitted to a batch. Better than num_requests_running for tight-latency workloads because it directly measures backpressure.

The threshold: typically 0-5 per replica. Anything above zero queue length means the system is at or beyond capacity.

vllm:gpu_cache_usage_perc

KV cache utilization as a percentage. Useful as a backstop to scale up when memory pressure is high, even if the request count is low.

Threshold: typically 80-90%. Above this, you’re at risk of evictions.

Combination metrics

Most production deployments scale on a combination:

  • Primary: num_requests_running (smooths out short spikes).
  • Secondary: num_requests_waiting (catches sudden bursts).
  • Backstop: gpu_cache_usage_perc (prevents memory exhaustion).

KEDA supports multiple triggers in a single ScaledObject; the desired replica count is the maximum across all triggers. This gives you a safety net.
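The combination above can be sketched as a trigger set. This is a sketch, not a drop-in config: the Prometheus address and the deployment label are placeholders, the thresholds come from the ranges above, and it assumes vLLM reports gpu_cache_usage_perc as a 0-1 fraction.

```yaml
triggers:
  # Primary: total in-flight requests; with KEDA's default AverageValue
  # metric type, the threshold acts as a per-replica target.
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      threshold: "50"
      query: sum(vllm:num_requests_running{deployment="X"})
  # Secondary: queued requests, catches sudden bursts.
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      threshold: "5"
      query: sum(vllm:num_requests_waiting{deployment="X"})
  # Backstop: KV cache utilization on the worst replica. metricType: Value
  # makes this scale proportionally once usage crosses 85%, rather than
  # being treated as a per-replica average.
  - type: prometheus
    metricType: Value
    metadata:
      serverAddress: http://prometheus:9090
      threshold: "0.85"
      query: max(vllm:gpu_cache_usage_perc{deployment="X"})
```

Since KEDA takes the maximum across triggers, the backstop only matters when memory pressure is high while request counts look benign.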

te_queue_size (for TEI)

For embedding services, the queue size is the right metric (Chapter 49).

Application-level metrics

Some teams use application-level metrics like “average request latency over the last 5 minutes” or “tokens generated per second.” These can be useful but are noisier than vLLM’s built-in metrics.

For most LLM deployments, vllm:num_requests_running is the right default. Tune the threshold based on your latency requirements.

51.4 KEDA ScaledObject configuration

A complete KEDA ScaledObject for an LLM deployment:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-3-70b-scaler
  namespace: ai-models
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-3-70b
  pollingInterval: 15      # seconds between metric polls
  cooldownPeriod: 300      # seconds before scaling to zero (only if min is 0)
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: vllm_requests_running
        threshold: "50"
        query: |
          sum(vllm:num_requests_running{deployment="llama-3-70b"})
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: vllm_requests_waiting
        threshold: "5"
        query: |
          sum(vllm:num_requests_waiting{deployment="llama-3-70b"})

Walking through this:

  • scaleTargetRef — the Deployment to scale.
  • pollingInterval — how often KEDA checks the metric (15 seconds is typical).
  • cooldownPeriod — how long KEDA waits after the last trigger stops reporting activity before scaling to zero. It applies only when minReplicaCount is 0 (so it’s inert in this example); ordinary scale-down timing is governed by the HPA stabilization window (51.6).
  • minReplicaCount / maxReplicaCount — bounds on the replica count.
  • triggers — the metrics to scale on. Multiple triggers take the max.

The Prometheus queries are PromQL: total in-flight requests and total queued requests. With KEDA’s default metricType (AverageValue), the threshold is interpreted as a per-replica target and KEDA divides the query result by the replica count itself, so the queries should return totals. If a query does the per-replica division itself, it needs metricType: Value, or the division is applied twice.

The desired replica count is computed by KEDA as:

desired_replicas = ceil(current_metric_value / threshold)

For example, if total in-flight requests are 250 and the threshold is 50, KEDA scales to 5 replicas (ceil(250/50)). If 350 requests come in, KEDA scales to 7.

51.5 Scaling on Prometheus queries

The Prometheus query is the heart of the configuration. A few patterns:

Average per replica:

sum(vllm:num_requests_running{deployment="X"}) / count(up{deployment="X"} == 1)

This averages the metric across replicas. Useful when you want “scale so each replica handles 50 requests on average.” One caveat: KEDA’s default metricType is AverageValue, which already divides the query result by the replica count, so with that default you should feed it the plain sum; if your query does the per-replica division itself, set metricType: Value on the trigger.

Max across replicas:

max(vllm:num_requests_running{deployment="X"})

Useful for catching outlier replicas that are saturated even when others are idle.

Rate over a window:

rate(vllm:request_total{deployment="X"}[5m])

The request rate over the last 5 minutes. Useful for predictive scaling.

Combined:

sum(vllm:num_requests_running{deployment="X"}) / count(up{deployment="X"} == 1)
+ sum(vllm:num_requests_waiting{deployment="X"}) / count(up{deployment="X"} == 1)

The sum of running and waiting requests, averaged per replica. Counts both backlog and active load. As with the per-replica average above, pair a query that divides by the replica count with metricType: Value.

The right query depends on your workload and your latency goals. Start with “average requests running per replica” and refine based on observed behavior.
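One way to wire the combined query into a trigger. A sketch, assuming the same Prometheus address as in 51.4 and KEDA v2.7+, where the per-trigger metricType field is available; the deployment label "X" is a placeholder.

```yaml
triggers:
  - type: prometheus
    metricType: Value        # the query already returns a per-replica average
    metadata:
      serverAddress: http://prometheus:9090
      threshold: "50"
      query: |
        sum(vllm:num_requests_running{deployment="X"}) / count(up{deployment="X"} == 1)
        + sum(vllm:num_requests_waiting{deployment="X"}) / count(up{deployment="X"} == 1)
```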

51.6 Cooldown and stabilization windows

Two parameters that prevent autoscaler flapping:

cooldownPeriod (KEDA): how long to wait after the last trigger stops reporting activity before scaling to zero. Default 300 seconds. It applies only when minReplicaCount is 0; scale-down between nonzero replica counts is governed by the HPA stabilization window instead. Increase it if the workload flaps between zero and one replica.

stabilizationWindowSeconds (HPA, which KEDA uses underneath): the window over which the autoscaler considers metric values when deciding to scale. Default 300 seconds for scale-down, 0 for scale-up.

The asymmetry — fast scale-up, slow scale-down — is intentional. You want to add capacity quickly when load spikes (otherwise users see queues and timeouts). You want to remove capacity slowly because adding it back is expensive (cold starts, model loading) and you don’t want to thrash.

Figure: Replicas jump immediately on a load spike (scale-up stabilization window of 0) and only fall after the full scale-down window, preventing thrashing from brief load dips.

For LLM serving, typical values:

  • pollingInterval: 15-30 seconds.
  • cooldownPeriod: 5-10 minutes.
  • Scale-up stabilization: 0-30 seconds (immediate response to load).
  • Scale-down stabilization: 5-10 minutes (slow to release capacity).

These prevent flapping while keeping latency responsive. Tune based on your traffic patterns.
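These knobs map onto the ScaledObject like this. A sketch: KEDA passes the behavior block through to the underlying HPA, and the exact values are the illustrative ones from the list above.

```yaml
spec:
  pollingInterval: 15
  cooldownPeriod: 600            # scale-to-zero delay; only if minReplicaCount is 0
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 0      # react to spikes immediately
        scaleDown:
          stabilizationWindowSeconds: 600    # 10 min of low load before shrinking
          policies:
            - type: Pods
              value: 1
              periodSeconds: 120             # shed at most 1 replica per 2 min
```

The scaleDown policy adds a rate limit on top of the stabilization window, so even after sustained low load, capacity drains gradually.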

51.7 Scale-to-zero

KEDA supports scale-to-zero — scale a deployment down to zero replicas when there’s no traffic, then scale up when traffic arrives. This is useful for low-traffic models where you don’t want to pay for idle GPUs.

The configuration:

spec:
  minReplicaCount: 0    # KEDA-specific, allows zero replicas

KEDA also supports an optional idleReplicaCount (the only supported value is 0, and it must be lower than minReplicaCount): use it when you want a nonzero minimum while any trigger is active, but zero replicas once every trigger reports no activity.

When the trigger metric reports activity again, KEDA scales up from zero. Note that KEDA only watches metrics; it doesn’t buffer or proxy requests (that’s the separate KEDA HTTP add-on), so the wake-up signal must come from something that still exists at zero replicas, such as a queue or gateway metric. The first request waits for the cold start, which can be 30 seconds to several minutes for a large model.
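Scale-from-zero needs the trigger to distinguish "wake up" from "add capacity." Since KEDA 2.8, triggers support an activationThreshold for the 0-to-1 decision, separate from threshold for 1-to-N. A sketch, where gateway_inflight_requests is a hypothetical metric exported by a gateway sitting in front of the deployment (vLLM's own metrics vanish at zero replicas, so they can't be the wake-up signal):

```yaml
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      threshold: "50"             # 1 -> N scaling target
      activationThreshold: "0"    # any value above 0 wakes the deployment
      query: sum(gateway_inflight_requests{model="X"})   # hypothetical upstream metric
```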

The trade-off:

  • Cost: zero when idle.
  • Latency: huge spikes for the first request after a scale-up.

Scale-to-zero is appropriate for:

  • Low-traffic models (< 1 request per minute).
  • Development / staging environments.
  • Models with predictable traffic patterns where you can pre-warm.

It’s not appropriate for production user-facing chat where any request might come at any time. The cold-start latency is too painful.

For production, scale to a minimum of 1-2 replicas instead of zero. The cost difference is small compared to the user experience improvement.

51.8 Predictive autoscaling

A more sophisticated pattern: predict load and scale ahead of time. Instead of reacting to current load, look at historical patterns and pre-scale before the traffic arrives.

For example, if you know your traffic peaks at 9 AM weekdays, scale up to peak capacity at 8:55 AM. Avoids the cold-start spike at the start of the peak.

Predictive autoscaling is hard for several reasons:

  • Traffic patterns are noisy. What looks like a daily pattern is partly noise.
  • Model loading is slow. Pre-warming takes minutes, and predictions need to be that early.
  • Mistakes are costly. Over-predicting wastes money; under-predicting causes the latency spike you were trying to avoid.

Some platforms implement this (KEDA’s community PredictKube scaler, custom in-house solutions). For most teams, reactive scaling with a 1-2 replica minimum is good enough.

For very predictable workloads (scheduled batch jobs, etc.), you can use a CronJob to scale up before the batch and down after.
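KEDA also ships a cron scaler that expresses this without a separate CronJob. A sketch for the 9 AM weekday peak from earlier; the timezone, window, and replica count are illustrative:

```yaml
triggers:
  - type: cron
    metadata:
      timezone: America/New_York
      start: 55 8 * * 1-5        # pre-warm at 8:55 AM, weekdays
      end: 0 12 * * 1-5          # release after the morning peak
      desiredReplicas: "8"
```

Because KEDA takes the maximum across triggers, this can sit alongside the reactive Prometheus triggers: the cron floor guarantees capacity during the window, and the reactive triggers still handle anything beyond it.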

51.9 The cold-start problem

This is the #1 operational issue with LLM autoscaling: scaling up takes minutes. The sequence:

  1. KEDA decides to scale up.
  2. K8s creates a new pod.
  3. The pod is scheduled to a GPU node.
  4. The init container or storage initializer downloads the model. This is the slow step — minutes for a 70B model from S3, even with high bandwidth.
  5. The runtime container starts.
  6. vLLM loads the model into GPU memory. Also slow — 30-90 seconds for a large model.
  7. The startup probe passes.
  8. The pod becomes ready.
  9. Traffic starts flowing to the pod.

For a 70B model, total cold start time is 3-10 minutes. During this time, the existing replicas are still handling all the load with whatever capacity they have. If the load spike is fast and large, the existing replicas may saturate, latencies spike, and users see the impact.

Figure: Cold-start timeline for a 70B model: image pull, then model download (~3-5 min for 140 GB, the biggest bottleneck), then GPU load (~75 s), for ~3-10 min total before the pod is ready. Eliminating the download with a pre-cached local copy is the single largest latency improvement available.

The mitigations:

  • Pre-loaded model cache (Chapter 52). Skip the download step by pre-pulling the model to node-local NVMe.
  • Faster startup probes that don’t add unnecessary delay.
  • Warmup endpoints that pre-compile the model and warm up the kernels before declaring ready.
  • Surge capacity — keep extra replicas always ready to absorb spikes.
  • Faster instance types — H100 nodes load faster than A100 nodes (because of better PCIe and NVMe).

Even with all of this, cold start remains the biggest operational pain in LLM autoscaling. Plan for it.
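On the probe side, the goal is a startup probe that tolerates minutes of model loading without letting the readiness gate add its own delay afterward. A sketch, assuming vLLM's /health endpoint on its default port 8000; the thresholds are illustrative and should match your measured cold-start time:

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 90     # tolerate up to 15 minutes of model loading
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 5         # once started, mark ready quickly
```

The startup probe suppresses the readiness probe until it first succeeds, so a generous failureThreshold here doesn't slow down the ready transition once the model is loaded.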

51.10 Production patterns

A few patterns that work in real production:

(1) Min replicas equal to expected baseline. If your baseline traffic needs 4 replicas, set minReplicaCount: 4. Don’t try to scale below your steady-state.

(2) Max replicas with cost cap. Set maxReplicaCount based on your budget, not based on what’s “possible.” If you only want to spend up to $X/month, calculate the max replicas accordingly.

(3) Multiple triggers for safety. Combine num_requests_running, num_requests_waiting, and gpu_cache_usage_perc so that any of them can trigger scaling.

(4) Dashboards for the autoscaler. Graph the scaling decisions over time. You should see the replica count tracking the load with appropriate lag.

(5) Manual override for incidents. Have a way to manually pin the replica count during incidents (e.g., kubectl scale --replicas=N). KEDA’s paused annotation can disable scaling temporarily.
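The pause is set as an annotation on the ScaledObject itself; a sketch (the pinned replica count is illustrative):

```yaml
metadata:
  annotations:
    autoscaling.keda.sh/paused-replicas: "6"   # pin at 6 replicas, stop scaling
```

Removing the annotation resumes normal autoscaling.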

(6) Per-model autoscaling. Each model gets its own ScaledObject. Don’t try to share autoscaling across different models — their loads are independent.

(7) Tier-based scaling. Premium-tier models get more aggressive scale-up; free-tier models get more conservative. Different ScaledObjects with different thresholds.

(8) Alerting on scale-up failures. If KEDA wants to scale up but K8s can’t (no GPU nodes available), alert. This is a common operational issue.

These patterns are the difference between an autoscaler that “kind of works” and one that’s reliable in production.

51.11 The mental model

Eight points to take into Chapter 52:

  1. HPA-on-CPU is wrong for LLMs. GPU is the bottleneck; CPU usage doesn’t reflect it.
  2. KEDA lets you scale on custom metrics, including Prometheus queries.
  3. vllm:num_requests_running is the right default metric. Combine with num_requests_waiting and gpu_cache_usage_perc.
  4. Scale up fast, scale down slow. Asymmetric stabilization windows.
  5. Scale-to-zero is for low-traffic dev/staging. Production needs minimum replicas.
  6. Predictive scaling is hard. Reactive is good enough for most.
  7. Cold start is the #1 pain. Mitigate with model caches, warmup, surge capacity.
  8. Alert on scale-up failures. “No GPU available” is a common production issue.

In Chapter 52 we look at the cold-start problem in detail: the model cache pattern.


Read it yourself

  • The KEDA documentation, especially the Prometheus scaler.
  • The Kubernetes HPA documentation for the underlying scaling math.
  • The vLLM metrics documentation for the available metrics.
  • The KEDA ScaledObject CRD reference.
  • The KEDA scale-to-zero documentation.

Practice

  1. Write a KEDA ScaledObject for a Llama 3 8B deployment that scales on num_requests_running and num_requests_waiting, with min=1, max=10.
  2. Why is HPA-on-CPU wrong for LLM serving? Construct a specific scenario where it would fail to scale.
  3. Why is the threshold for num_requests_running typically 30-80? What determines the right value?
  4. Why is scale-up faster than scale-down by design? Argue from the cost of mistakes.
  5. Calculate the cold start time for a 70B model: 140 GB download from S3 at 1 GB/s, plus ~75 seconds of GPU loading. Total?
  6. Why is scale-to-zero usually wrong for production user-facing LLM serving?
  7. Stretch: Set up KEDA on a local K8s cluster with a fake metric source. Scale a Deployment up and down based on the metric. Verify the cooldown behavior.