KServe InferenceService anatomy: runtime, predictor, transformer, autoscaling
"KServe is just YAML over Kubernetes Deployments — except when it isn't"
In Chapter 45 we positioned KServe as the default Kubernetes-native orchestration layer for LLM serving. In this chapter we go deep on what an InferenceService actually is — the parts, the controllers, the runtime adapters, and what each YAML field controls. By the end you’ll be able to read any KServe manifest and understand exactly what gets deployed, configure your own InferenceServices for production, and debug them when things go wrong.
This is a hands-on reference chapter. I’ll walk through the InferenceService anatomy, then the runtimes, then the autoscaling integration.
Outline:
- The InferenceService CRD.
- The predictor.
- The transformer (preprocessing/postprocessing).
- The runtime adapter.
- ServingRuntime CRD and how it composes with InferenceService.
- Autoscaling integration.
- Storage and model loading.
- Health checks and probes.
- Common manifests you’ll see.
47.1 The InferenceService CRD
KServe defines a Kubernetes Custom Resource Definition called InferenceService (often abbreviated ISVC). An InferenceService represents a deployed model that can serve inference requests.
A minimal InferenceService:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-70b
  namespace: ai-models
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      runtime: vllm-runtime
      storageUri: s3://my-bucket/models/llama-3-70b/
That’s the smallest manifest that does something useful. KServe’s controller reads this, picks the vllm-runtime ServingRuntime (a separate CRD, see §47.5), and creates the underlying Kubernetes objects to serve the model.
Behind the scenes, the controller creates:
- A Deployment that runs vLLM with the model.
- A Service that load-balances across the Deployment’s pods.
- A Knative Service instead of a raw Deployment, if serverless (Knative) mode is enabled — Knative then manages the Deployment itself.
- A VirtualService if Istio is integrated.
- Various ConfigMaps and Secrets for configuration.
You don’t write any of these yourself. KServe generates them from the InferenceService spec.
The full schema has many more fields. The key sections are predictor, transformer, explainer, and the various optional pieces. We’ll cover the relevant ones for LLM serving.
47.2 The predictor
The predictor is the core of an InferenceService. It’s what runs the actual model. For LLMs, the predictor section specifies:
- The model format (huggingface, vllm, tensorrt-llm, etc.).
- The runtime (which ServingRuntime to use).
- The storage URI (where to pull the model from).
- Resources (CPU, memory, GPU).
- Autoscaling settings (min/max replicas, target metric).
- Container args (passed to the runtime).
- Environment variables.
A more detailed predictor spec:
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 10
    containerConcurrency: 100
    timeout: 600
    model:
      modelFormat:
        name: huggingface
      runtime: vllm-runtime
      storageUri: s3://my-bucket/models/llama-3-70b/
      resources:
        limits:
          nvidia.com/gpu: "8"
          memory: 320Gi
        requests:
          nvidia.com/gpu: "8"
          memory: 320Gi
      args:
        - --tensor-parallel-size=8
        - --max-model-len=32768
        - --gpu-memory-utilization=0.9
        - --enable-prefix-caching
      env:
        - name: VLLM_ATTENTION_BACKEND
          value: FLASH_ATTN
Read this carefully:
- minReplicas/maxReplicas set the autoscaling bounds.
- containerConcurrency is the maximum concurrent requests per pod (the Knative concurrency target).
- timeout is the request timeout in seconds.
- model.runtime references a ServingRuntime named vllm-runtime (we’ll define it in §47.5).
- model.storageUri is where KServe downloads the model from. Supports S3, GCS, OCI, HTTP, PVC, etc.
- resources.limits declares the GPU and memory requirements. KServe ensures the pod is scheduled on a node that has these resources.
- args are passed to the underlying vLLM container as command-line arguments. This is where you set vLLM’s tuning flags (Chapter 48).
- env sets environment variables in the container.
The predictor is the core. The other parts (transformer, explainer) are optional add-ons.
47.3 The transformer
The transformer is an optional sidecar that does preprocessing or postprocessing around the predictor. The pattern: a request arrives at the transformer, the transformer modifies it (e.g., tokenizes, formats), forwards to the predictor, gets the response, modifies it (e.g., decodes, formats), and returns to the client.
For LLM serving, transformers are sometimes used to:
- Apply chat templates before the request reaches vLLM (Chapter 5).
- Format outputs for downstream consumers.
- Add custom guardrails that the runtime doesn’t support natively.
- Multiplex multiple model APIs into a single endpoint.
A transformer is just another container in the same pod (or a separate pod) that intercepts traffic. KServe’s controller wires the transformer in front of the predictor automatically.
For most modern LLM serving, transformers are not used. vLLM’s OpenAI-compatible API handles most of what a transformer would do. Transformers are more common in classical ML serving (image classification, etc.) where the input/output formats are less standardized.
You’ll see transformers in some KServe deployments. For LLM-specific work, you can usually ignore them.
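If you do need one, the shape is simple. Below is a minimal sketch of a transformer spec — the image name is hypothetical, standing in for any container that implements KServe's custom-transformer protocol (e.g., one that applies a chat template before forwarding to the predictor):

```yaml
spec:
  transformer:
    minReplicas: 1
    containers:
      - name: kserve-container
        # Hypothetical image implementing the KServe transformer protocol;
        # it receives each request, rewrites it, and forwards to the predictor.
        image: my-org/chat-template-transformer:latest
  predictor:
    model:
      modelFormat:
        name: huggingface
      runtime: vllm-runtime
      storageUri: s3://my-bucket/models/llama-3-70b/
```

KServe wires the routing (client → transformer → predictor → transformer → client) for you; the transformer container only has to implement the request/response hooks.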
47.4 The runtime adapter
The runtime adapter is the glue between the InferenceService spec and the actual model server (vLLM, TensorRT-LLM, etc.). It’s the code that:
- Reads the InferenceService spec.
- Loads the model from the storage URI.
- Starts the underlying server with the right arguments.
- Exposes the server’s API (typically on port 8080 or 8000).
- Handles health checks and metrics endpoints.
KServe ships with adapters for many runtimes:
- kserve-tritonserver for Triton Inference Server.
- kserve-mlserver for MLServer (Seldon’s multi-framework server).
- kserve-huggingfaceserver for Hugging Face Transformers-based models.
- kserve-vllm-runtime (or community variants) for vLLM.
- Custom runtimes for TensorRT-LLM, TGI, etc.
Each runtime is defined in a ServingRuntime CRD (next section), which describes how to launch the container, what model formats it supports, and what arguments to pass.
The KServe community has been catching up with vLLM-specific runtimes. As of late 2025, the standard is to either:
- Use a community-maintained vLLM ServingRuntime, or
- Define your own ServingRuntime with the official vLLM image.
Option 2 is more common for production because it gives you full control over the vLLM version and configuration.
47.5 ServingRuntime CRD
A ServingRuntime is a separate CRD that describes a runtime that can be used by InferenceServices. It’s the “template” for how to launch a particular runtime.
A vLLM ServingRuntime looks like:
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-runtime
spec:
  supportedModelFormats:
    - name: huggingface
      version: "1"
      autoSelect: true
    - name: vllm
      version: "1"
      autoSelect: true
  containers:
    - name: kserve-container
      image: vllm/vllm-openai:v0.6.0
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - --model=/mnt/models
        - --port=8080
        - --host=0.0.0.0
      ports:
        - containerPort: 8080
          name: http1
          protocol: TCP
      env:
        - name: HF_HOME
          value: /mnt/cache
      volumeMounts:
        - name: model-storage
          mountPath: /mnt/models
This says: “when an InferenceService specifies runtime: vllm-runtime, launch a container with the vllm/vllm-openai:v0.6.0 image, run the OpenAI-compatible API server, and mount the model at /mnt/models.”
The InferenceService spec then augments this with model-specific configuration: the storage URI (which becomes the model mount), additional args, and resource requests.
The two-CRD design (ServingRuntime + InferenceService) lets you:
- Define a ServingRuntime once per runtime version (vLLM 0.6.0, vLLM 0.6.1, etc.).
- Use it for many InferenceServices, each with a different model.
- Centrally upgrade the runtime version without touching the per-model manifests.
For production, you’ll typically have a few ServingRuntimes (one per runtime version, maybe one per quantization scheme) and many InferenceServices that reference them.
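To make the one-runtime-many-models relationship concrete, here is a sketch of two InferenceServices sharing the vllm-runtime defined above (model names and bucket paths are illustrative):

```yaml
# Both services reference the same ServingRuntime;
# only the model and its storage URI differ per manifest.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-8b
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      runtime: vllm-runtime
      storageUri: s3://my-bucket/models/llama-3-8b/
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: mistral-7b
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      runtime: vllm-runtime
      storageUri: s3://my-bucket/models/mistral-7b/
```

Bumping the image tag in vllm-runtime rolls the new vLLM version out to every InferenceService that references it.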
47.6 Autoscaling integration
KServe supports several autoscaling integrations:
Knative-based autoscaling (default)
If you have Knative installed, KServe uses Knative’s autoscaler by default. It scales based on request concurrency — the number of in-flight requests per pod.
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 10
    containerConcurrency: 50  # target 50 concurrent requests per pod
Knative’s autoscaler measures concurrency and scales replicas to keep each pod near the target. It supports scale-to-zero (minReplicas: 0) for low-traffic models.
The downside: Knative’s concurrency-based scaling is not great for LLMs because LLM requests have variable cost. A pod handling 50 long-context decode requests is much more loaded than one handling 50 short-context requests, but Knative sees them as the same.
For LLM serving, KEDA-based scaling (next section) is usually preferred.
KEDA integration
KServe can integrate with KEDA (Kubernetes Event-Driven Autoscaling, Chapter 51) for scaling on custom metrics. The pattern:
- Define a KEDA ScaledObject that scales the InferenceService’s underlying Deployment based on a custom metric (e.g., vllm:num_requests_running from Prometheus).
- KEDA reads the metric and adjusts the Deployment’s replica count.
- KServe’s InferenceService spec sets minReplicas: 1 and maxReplicas: 1 to disable Knative’s autoscaler, letting KEDA take over.
This is the more LLM-appropriate pattern because the metric (num_requests_running) reflects actual GPU load, not just request count.
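A sketch of the ScaledObject, assuming a Prometheus instance at the address shown and the generated Deployment name below (check the actual name with kubectl; KServe generates it from the InferenceService name):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-3-70b-scaler
spec:
  scaleTargetRef:
    # The Deployment KServe created for the predictor (name is generated).
    name: llama-3-70b-predictor
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        # Scale out when average in-flight requests per pod exceeds 40.
        query: avg(vllm:num_requests_running)
        threshold: "40"
```

The query and threshold are the tuning knobs: pick a metric that tracks GPU saturation and a threshold below the point where latency degrades.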
We cover KEDA in detail in Chapter 51.
HPA fallback
If you don’t use Knative or KEDA, KServe can use the standard Kubernetes HPA (HorizontalPodAutoscaler) with CPU or memory metrics. This is almost always the wrong choice for LLMs because CPU usage doesn’t reflect GPU load. Don’t use HPA-on-CPU for LLM serving.
47.7 Storage and model loading
The storageUri field tells KServe where to pull the model from. KServe supports many storage backends via storage initializers — small init containers that run before the main runtime starts and download the model to a shared volume.
Supported URIs:
- s3://bucket/path/ — AWS S3.
- gs://bucket/path/ — Google Cloud Storage.
- https://example.com/model.tar.gz — HTTP/HTTPS download.
- oci://registry/image:tag — OCI image (model packaged as a container).
- pvc://pvc-name/path — Kubernetes PersistentVolumeClaim.
- hf://owner/model-name — Hugging Face Hub (newer support).
The storage initializer downloads the model to /mnt/models (the conventional location), and the runtime container mounts the same path and loads from it.
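Private buckets need credentials. One common pattern (ServiceAccount name is illustrative) is to attach a ServiceAccount that carries cloud credentials — for example via IRSA on EKS, or a bound Secret — to the predictor:

```yaml
spec:
  predictor:
    # ServiceAccount annotated with the IAM role (or bound to a Secret)
    # that grants read access to the model bucket; the storage initializer
    # runs with these credentials.
    serviceAccountName: models-s3-reader
    model:
      modelFormat:
        name: huggingface
      runtime: vllm-runtime
      storageUri: s3://my-bucket/models/llama-3-70b/
```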
For large models (70B+), the download time is significant. A 70B model in bf16 is 140 GB; even at 1 GB/s download, that’s 140 seconds — and many storage backends are slower. To avoid paying this cost on every pod start, you use a model cache (Chapter 52): pre-pull the model to node-local NVMe so subsequent pods load instantly.
KServe supports the model cache pattern via the LocalModelCache CRD or via direct PVC mounts.
47.8 Health checks and probes
KServe exposes the standard Kubernetes health probes:
- Liveness probe: is the container alive? If not, restart it.
- Readiness probe: is the container ready to serve? If not, drain traffic.
- Startup probe: has the container finished starting up?
For LLM serving, the startup probe is critical because model loading is slow. A 70B model can take 60-180 seconds to load into GPU memory. Without a generous startup probe timeout, K8s will kill the container before it finishes loading.
A typical configuration:
spec:
  predictor:
    model:
      ...
      startupProbe:
        httpGet:
          path: /health
          port: 8080
        failureThreshold: 60
        periodSeconds: 10
        timeoutSeconds: 5
This says: poll /health every 10 seconds, give up after 60 failures (= 600 seconds = 10 minutes). After the startup probe passes, the regular liveness/readiness probes take over.
vLLM exposes /health on its API port; it returns 200 once the server has loaded the model and is ready to serve.
Setting the startup probe correctly is the difference between “deploys cleanly” and “K8s keeps restarting the pod because it doesn’t realize loading takes 3 minutes.” Get it right.
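Once the startup probe passes, the regular probes can be much stricter, since a healthy vLLM answers /health quickly. A sketch reusing the same endpoint (thresholds are reasonable defaults, not mandates):

```yaml
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      runtime: vllm-runtime
      storageUri: s3://my-bucket/models/llama-3-70b/
      # Restart the container if /health fails 3 times in a row.
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        periodSeconds: 10
        failureThreshold: 3
      # Drain traffic from the pod if it stops answering.
      readinessProbe:
        httpGet:
          path: /health
          port: 8080
        periodSeconds: 5
        failureThreshold: 3
```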
47.9 Common manifests you’ll see
A few patterns you’ll encounter in real KServe deployments:
Multi-model serving (MMS)
KServe has a Multi-Model Serving (MMS) feature that lets one InferenceService serve multiple models in the same pod, with dynamic loading and unloading. This is useful for scenarios with many small models (e.g., per-tenant fine-tunes).
For LLMs, MMS is less common than running each model as its own InferenceService, because LLMs are large and loading/unloading is slow. The exception is multi-tenant LoRA serving, where the base model is shared and adapters are loaded on demand — but vLLM handles this internally without needing KServe’s MMS.
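If you take the vLLM-internal route, the adapters are configured through runtime args rather than KServe MMS. A sketch (adapter names and mount paths are placeholders; clients then select an adapter via the "model" field in the OpenAI-style request):

```yaml
spec:
  predictor:
    model:
      runtime: vllm-runtime
      storageUri: s3://bucket/models/llama-3-70b-base/
      args:
        - --enable-lora
        # Each entry is name=path; vLLM serves the adapters alongside
        # the shared base model.
        - --lora-modules
        - tenant-a=/mnt/adapters/tenant-a
        - tenant-b=/mnt/adapters/tenant-b
```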
Canary deployments
KServe supports traffic split between versions:
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      runtime: vllm-runtime
      storageUri: s3://bucket/models/llama-3-70b-v2/
This routes 10% of traffic to the new version while the rest goes to the existing version. Useful for gradual rollouts.
GPU resources and node selectors
For GPU scheduling:
spec:
  predictor:
    nodeSelector:
      nvidia.com/gpu.product: H100
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
    model:
      resources:
        limits:
          nvidia.com/gpu: "8"
This pins the deployment to H100 nodes and requests 8 GPUs per pod. Without the nodeSelector, K8s picks any GPU node (which might be the wrong type).
Init containers
For custom model preparation (e.g., merging LoRA adapters before serving):
spec:
  predictor:
    initContainers:
      - name: merge-lora
        image: my-org/merge-lora:latest
        args:
          - --base=/mnt/base
          - --adapter=/mnt/adapter
          - --output=/mnt/models
        volumeMounts:
          - name: model-storage
            mountPath: /mnt/models
    model:
      runtime: vllm-runtime
      storageUri: s3://bucket/models/llama-3-70b-base/
The init container runs before the main runtime, preparing the model.
47.10 The mental model
Eight points to take into Chapter 48:
- InferenceService is the KServe CRD for a deployed model.
- Predictor is the core: the runtime, the model, the resources.
- ServingRuntime is a separate CRD that describes how to launch a runtime. One ServingRuntime, many InferenceServices.
- Autoscaling can use Knative concurrency, KEDA custom metrics, or HPA. KEDA is the right choice for LLMs.
- Storage initializers pull the model from S3/GCS/etc. into /mnt/models.
- Startup probes must be generous (10+ minutes for 70B models) to allow loading.
- Multi-model serving is uncommon for LLMs; usually each model is its own ISVC.
- GPU node selection is your responsibility — set nodeSelectors correctly.
In Chapter 48 we drill into vLLM’s actual configuration: every flag that matters for production.
Read it yourself
- The KServe documentation, particularly the InferenceService API reference.
- The KServe ServingRuntime documentation.
- The KServe v1beta1 API spec (the YAML schema for InferenceService).
- The community vLLM-runtime examples on GitHub.
- The KServe troubleshooting guide.
Practice
- Write a complete InferenceService manifest for a Llama 3 8B deployment with vLLM, using a storage URI, GPU resources, and a startup probe.
- Define a ServingRuntime for vLLM v0.6.0 that supports the huggingface model format.
- Why is the startup probe critical for LLM serving? What goes wrong if you forget to set it?
- When would you use Knative-based autoscaling vs KEDA-based for an LLM InferenceService?
- Read a real KServe InferenceService manifest from a public repository. Identify the predictor, the runtime reference, and the autoscaling configuration.
- Why do most LLM deployments not use the transformer pattern? Argue from the OpenAI-compatible API design.
- Stretch: Set up KServe on a local K8s cluster and deploy a small open model with vLLM. Verify the InferenceService creates a working endpoint.