KServe InferenceService anatomy: runtime, predictor, transformer, autoscaling
"KServe is just YAML over Kubernetes Deployments — except when it isn't"
In Chapter 45 we positioned KServe as the default Kubernetes-native orchestration layer for LLM serving. In this chapter we go deep on what an InferenceService actually is — the parts, the controllers, the runtime adapters, and what each YAML field controls. By the end you’ll be able to read any KServe manifest and understand exactly what gets deployed, configure your own InferenceServices for production, and debug them when things go wrong.
This is a hands-on reference chapter. I’ll walk through the InferenceService anatomy, then the runtimes, then the autoscaling integration.
Outline:
- The InferenceService CRD.
- The predictor.
- The transformer (preprocessing/postprocessing).
- The runtime adapter.
- ServingRuntime CRD and how it composes with InferenceService.
- Autoscaling integration.
- Storage and model loading.
- Health checks and probes.
- Common manifests you’ll see.
47.1 The InferenceService CRD
KServe defines a Kubernetes Custom Resource Definition called InferenceService (often abbreviated ISVC). An InferenceService represents a deployed model that can serve inference requests.
A minimal InferenceService:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-70b
  namespace: ai-models
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      runtime: vllm-runtime
      storageUri: s3://my-bucket/models/llama-3-70b/
That’s the smallest manifest that does something useful. KServe’s controller reads this, picks the vllm-runtime ServingRuntime (a separate CRD, see §47.5), and creates the underlying Kubernetes objects to serve the model.
Behind the scenes, the controller creates:
- A Deployment that runs vLLM with the model.
- A Service that load-balances across the Deployment’s pods.
- A Knative Service instead of a raw Deployment, if serverless (Knative) mode is enabled — Knative then manages the Deployment itself.
- A VirtualService if Istio is integrated.
- Various ConfigMaps and Secrets for configuration.
You don’t write any of these yourself. KServe generates them from the InferenceService spec.
The full schema has many more fields. The key sections are predictor, transformer, explainer, and the various optional pieces. We’ll cover the relevant ones for LLM serving.
47.2 The predictor
The predictor is the core of an InferenceService. It’s what runs the actual model. For LLMs, the predictor section specifies:
- The model format (huggingface, vllm, tensorrt-llm, etc.).
- The runtime (which ServingRuntime to use).
- The storage URI (where to pull the model from).
- Resources (CPU, memory, GPU).
- Autoscaling settings (min/max replicas, target metric).
- Container args (passed to the runtime).
- Environment variables.
A more detailed predictor spec:
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 10
    containerConcurrency: 100
    timeout: 600
    model:
      modelFormat:
        name: huggingface
      runtime: vllm-runtime
      storageUri: s3://my-bucket/models/llama-3-70b/
      resources:
        limits:
          nvidia.com/gpu: "8"
          memory: 320Gi
        requests:
          nvidia.com/gpu: "8"
          memory: 320Gi
      args:
        - --tensor-parallel-size=8
        - --max-model-len=32768
        - --gpu-memory-utilization=0.9
        - --enable-prefix-caching
      env:
        - name: VLLM_ATTENTION_BACKEND
          value: FLASH_ATTN
Read this carefully:
- minReplicas/maxReplicas set the autoscaling bounds.
- containerConcurrency is the maximum concurrent requests per pod (the Knative concurrency target).
- timeout is the request timeout in seconds.
- model.runtime references a ServingRuntime named vllm-runtime (we’ll define it in §47.5).
- model.storageUri is where KServe downloads the model from. Supports S3, GCS, OCI, HTTP, PVC, etc.
- resources.limits declares the GPU and memory requirements. KServe ensures the pod is scheduled on a node that has these resources.
- args are passed to the underlying vLLM container as command-line arguments. This is where you set vLLM’s tuning flags (Chapter 48).
- env sets environment variables in the container.
The predictor is the core. The other parts (transformer, explainer) are optional add-ons.
47.3 The transformer
The transformer is an optional sidecar that does preprocessing or postprocessing around the predictor. The pattern: a request arrives at the transformer, the transformer modifies it (e.g., tokenizes, formats), forwards to the predictor, gets the response, modifies it (e.g., decodes, formats), and returns to the client.
For LLM serving, transformers are sometimes used to:
- Apply chat templates before the request reaches vLLM (Chapter 5).
- Format outputs for downstream consumers.
- Add custom guardrails that the runtime doesn’t support natively.
- Multiplex multiple model APIs into a single endpoint.
A transformer is just another container in the same pod (or a separate pod) that intercepts traffic. KServe’s controller wires the transformer in front of the predictor automatically.
For most modern LLM serving, transformers are not used. vLLM’s OpenAI-compatible API handles most of what a transformer would do. Transformers are more common in classical ML serving (image classification, etc.) where the input/output formats are less standardized.
You’ll see transformers in some KServe deployments. For LLM-specific work, you can usually ignore them.
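If you do need one, the shape is simple. Below is a minimal sketch of a transformer spec — the image name is hypothetical, standing in for any container that implements KServe's custom-transformer protocol (e.g., one that applies a chat template before forwarding to the predictor):

```yaml
spec:
  transformer:
    minReplicas: 1
    containers:
      - name: kserve-container
        # Hypothetical image implementing the KServe transformer protocol;
        # it receives each request, rewrites it, and forwards to the predictor.
        image: my-org/chat-template-transformer:latest
  predictor:
    model:
      modelFormat:
        name: huggingface
      runtime: vllm-runtime
      storageUri: s3://my-bucket/models/llama-3-70b/
```

KServe wires the routing (client → transformer → predictor → transformer → client) for you; the transformer container only has to implement the request/response hooks.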
47.4 The runtime adapter
The runtime adapter is the glue between the InferenceService spec and the actual model server (vLLM, TensorRT-LLM, etc.). It’s the code that:
- Reads the InferenceService spec.
- Loads the model from the storage URI.
- Starts the underlying server with the right arguments.
- Exposes the server’s API (typically on port 8080 or 8000).
- Handles health checks and metrics endpoints.
KServe ships with adapters for many runtimes:
- kserve-tritonserver for Triton Inference Server.
- kserve-mlserver for MLServer (Seldon’s multi-framework server).
- kserve-huggingfaceserver for Hugging Face Transformers-based models.
- kserve-vllm-runtime (or community variants) for vLLM.
- Custom runtimes for TensorRT-LLM, TGI, etc.
Each runtime is defined in a ServingRuntime CRD (next section), which describes how to launch the container, what model formats it supports, and what arguments to pass.
The KServe community has been catching up with vLLM-specific runtimes. As of late 2025, the standard is to either:
- Use a community-maintained vLLM ServingRuntime, or
- Define your own ServingRuntime with the official vLLM image.
Option 2 is more common for production because it gives you full control over the vLLM version and configuration.
47.5 ServingRuntime CRD
A ServingRuntime is a separate CRD that describes a runtime that can be used by InferenceServices. It’s the “template” for how to launch a particular runtime.
A vLLM ServingRuntime looks like:
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-runtime
spec:
  supportedModelFormats:
    - name: huggingface
      version: "1"
      autoSelect: true
    - name: vllm
      version: "1"
      autoSelect: true
  containers:
    - name: kserve-container
      image: vllm/vllm-openai:v0.6.0
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - --model=/mnt/models
        - --port=8080
        - --host=0.0.0.0
      ports:
        - containerPort: 8080
          name: http1
          protocol: TCP
      env:
        - name: HF_HOME
          value: /mnt/cache
      volumeMounts:
        - name: model-storage
          mountPath: /mnt/models
This says: “when an InferenceService specifies runtime: vllm-runtime, launch a container with the vllm/vllm-openai:v0.6.0 image, run the OpenAI-compatible API server, and mount the model at /mnt/models.”
The InferenceService spec then augments this with model-specific configuration: the storage URI (which becomes the model mount), additional args, and resource requests.
The two-CRD design (ServingRuntime + InferenceService) lets you:
- Define a ServingRuntime once per runtime version (vLLM 0.6.0, vLLM 0.6.1, etc.).
- Use it for many InferenceServices, each with a different model.
- Centrally upgrade the runtime version without touching the per-model manifests.
For production, you’ll typically have a few ServingRuntimes (one per runtime version, maybe one per quantization scheme) and many InferenceServices that reference them.
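To make the one-runtime-many-models relationship concrete, here is a sketch of two InferenceServices sharing the vllm-runtime defined above (model names and bucket paths are illustrative):

```yaml
# Both services reference the same ServingRuntime;
# only the model and its storage URI differ per manifest.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-8b
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      runtime: vllm-runtime
      storageUri: s3://my-bucket/models/llama-3-8b/
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: mistral-7b
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      runtime: vllm-runtime
      storageUri: s3://my-bucket/models/mistral-7b/
```

Bumping the image tag in vllm-runtime rolls the new vLLM version out to every InferenceService that references it.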
47.6 Autoscaling integration
KServe supports several autoscaling integrations:
Knative-based autoscaling (default)
If you have Knative installed, KServe uses Knative’s autoscaler by default. It scales based on request concurrency — the number of in-flight requests per pod.
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 10
    containerConcurrency: 50  # target 50 concurrent requests per pod
Knative’s autoscaler measures concurrency and scales replicas to keep each pod near the target. It supports scale-to-zero (minReplicas: 0) for low-traffic models.
The downside: Knative’s concurrency-based scaling is not great for LLMs because LLM requests have variable cost. A pod handling 50 long-context decode requests is much more loaded than one handling 50 short-context requests, but Knative sees them as the same.
For LLM serving, KEDA-based scaling (next section) is usually preferred.
KEDA integration
KServe can integrate with KEDA (Kubernetes Event-Driven Autoscaling, Chapter 51) for scaling on custom metrics. The pattern:
- Define a KEDA ScaledObject that scales the InferenceService’s underlying Deployment based on a custom metric (e.g., vllm:num_requests_running from Prometheus).
- KEDA reads the metric and adjusts the Deployment’s replica count.
- KServe’s InferenceService spec sets minReplicas: 1 and maxReplicas: 1 to disable Knative’s autoscaler, letting KEDA take over.
This is the more LLM-appropriate pattern because the metric (num_requests_running) reflects actual GPU load, not just request count.
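A sketch of the ScaledObject, assuming a Prometheus instance at the address shown and the generated Deployment name below (check the actual name with kubectl; KServe generates it from the InferenceService name):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-3-70b-scaler
spec:
  scaleTargetRef:
    # The Deployment KServe created for the predictor (name is generated).
    name: llama-3-70b-predictor
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        # Scale out when average in-flight requests per pod exceeds 40.
        query: avg(vllm:num_requests_running)
        threshold: "40"
```

The query and threshold are the tuning knobs: pick a metric that tracks GPU saturation and a threshold below the point where latency degrades.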
We cover KEDA in detail in Chapter 51.
HPA fallback
If you don’t use Knative or KEDA, KServe can use the standard Kubernetes HPA (HorizontalPodAutoscaler) with CPU or memory metrics. This is almost always the wrong choice for LLMs because CPU usage doesn’t reflect GPU load. Don’t use HPA-on-CPU for LLM serving.
47.7 Storage and model loading
The storageUri field tells KServe where to pull the model from. KServe supports many storage backends via storage initializers — small init containers that run before the main runtime starts and download the model to a shared volume.
Supported URIs:
- s3://bucket/path/ — AWS S3.
- gs://bucket/path/ — Google Cloud Storage.
- https://example.com/model.tar.gz — HTTP/HTTPS download.
- oci://registry/image:tag — OCI image (model packaged as a container).
- pvc://pvc-name/path — Kubernetes PersistentVolumeClaim.
- hf://owner/model-name — Hugging Face Hub (newer support).
The storage initializer downloads the model to /mnt/models (the conventional location), and the runtime container mounts the same path and loads from it.
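Private buckets need credentials. One common pattern (ServiceAccount name is illustrative) is to attach a ServiceAccount that carries cloud credentials — for example via IRSA on EKS, or a bound Secret — to the predictor:

```yaml
spec:
  predictor:
    # ServiceAccount annotated with the IAM role (or bound to a Secret)
    # that grants read access to the model bucket; the storage initializer
    # runs with these credentials.
    serviceAccountName: models-s3-reader
    model:
      modelFormat:
        name: huggingface
      runtime: vllm-runtime
      storageUri: s3://my-bucket/models/llama-3-70b/
```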
For large models (70B+), the download time is significant. A 70B model in bf16 is 140 GB; even at 1 GB/s download, that’s 140 seconds — and many storage backends are slower. To avoid paying this cost on every pod start, you use a model cache (Chapter 52): pre-pull the model to node-local NVMe so subsequent pods load instantly.
KServe supports the model cache pattern via the LocalModelCache CRD or via direct PVC mounts.
47.8 Health checks and probes
KServe exposes the standard Kubernetes health probes:
- Liveness probe: is the container alive? If not, restart it.
- Readiness probe: is the container ready to serve? If not, drain traffic.
- Startup probe: has the container finished starting up?
For LLM serving, the startup probe is critical because model loading is slow. A 70B model can take 60-180 seconds to load into GPU memory. Without a generous startup probe timeout, K8s will kill the container before it finishes loading.
A typical configuration:
spec:
  predictor:
    model:
      ...
      startupProbe:
        httpGet:
          path: /health
          port: 8080
        failureThreshold: 60
        periodSeconds: 10
        timeoutSeconds: 5
This says: poll /health every 10 seconds, give up after 60 failures (= 600 seconds = 10 minutes). After the startup probe passes, the regular liveness/readiness probes take over.
vLLM exposes /health on its API port; it returns 200 once the server has loaded the model and is ready to serve.
Setting the startup probe correctly is the difference between “deploys cleanly” and “K8s keeps restarting the pod because it doesn’t realize loading takes 3 minutes.” Get it right.
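Once the startup probe passes, the regular probes can be much stricter, since a healthy vLLM answers /health quickly. A sketch reusing the same endpoint (thresholds are reasonable defaults, not mandates):

```yaml
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      runtime: vllm-runtime
      storageUri: s3://my-bucket/models/llama-3-70b/
      # Restart the container if /health fails 3 times in a row.
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        periodSeconds: 10
        failureThreshold: 3
      # Drain traffic from the pod if it stops answering.
      readinessProbe:
        httpGet:
          path: /health
          port: 8080
        periodSeconds: 5
        failureThreshold: 3
```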
47.9 Common manifests you’ll see
A few patterns you’ll encounter in real KServe deployments:
Multi-model serving (MMS)
KServe has a Multi-Model Serving (MMS) feature that lets one InferenceService serve multiple models in the same pod, with dynamic loading and unloading. This is useful for scenarios with many small models (e.g., per-tenant fine-tunes).
For LLMs, MMS is less common than running each model as its own InferenceService, because LLMs are large and loading/unloading is slow. The exception is multi-tenant LoRA serving, where the base model is shared and adapters are loaded on demand — but vLLM handles this internally without needing KServe’s MMS.
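If you take the vLLM-internal route, the adapters are configured through runtime args rather than KServe MMS. A sketch (adapter names and mount paths are placeholders; clients then select an adapter via the "model" field in the OpenAI-style request):

```yaml
spec:
  predictor:
    model:
      runtime: vllm-runtime
      storageUri: s3://bucket/models/llama-3-70b-base/
      args:
        - --enable-lora
        # Each entry is name=path; vLLM serves the adapters alongside
        # the shared base model.
        - --lora-modules
        - tenant-a=/mnt/adapters/tenant-a
        - tenant-b=/mnt/adapters/tenant-b
```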
Canary deployments
KServe supports traffic split between versions:
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      runtime: vllm-runtime
      storageUri: s3://bucket/models/llama-3-70b-v2/
This routes 10% of traffic to the new version while the rest goes to the existing version. Useful for gradual rollouts.
GPU resources and node selectors
For GPU scheduling:
spec:
  predictor:
    nodeSelector:
      nvidia.com/gpu.product: H100
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
    model:
      resources:
        limits:
          nvidia.com/gpu: "8"
This pins the deployment to H100 nodes and requests 8 GPUs per pod. Without the nodeSelector, K8s picks any GPU node (which might be the wrong type).
Init containers
For custom model preparation (e.g., merging LoRA adapters before serving):
spec:
  predictor:
    initContainers:
      - name: merge-lora
        image: my-org/merge-lora:latest
        args:
          - --base=/mnt/base
          - --adapter=/mnt/adapter
          - --output=/mnt/models
        volumeMounts:
          - name: model-storage
            mountPath: /mnt/models
    model:
      runtime: vllm-runtime
      storageUri: s3://bucket/models/llama-3-70b-base/
The init container runs before the main runtime, preparing the model.
47.10 The mental model
Eight points to take into Chapter 48:
- InferenceService is the KServe CRD for a deployed model.
- Predictor is the core: the runtime, the model, the resources.
- ServingRuntime is a separate CRD that describes how to launch a runtime. One ServingRuntime, many InferenceServices.
- Autoscaling can use Knative concurrency, KEDA custom metrics, or HPA. KEDA is the right choice for LLMs.
- Storage initializers pull the model from S3/GCS/etc. into /mnt/models.
- Startup probes must be generous (10+ minutes for 70B models) to allow loading.
- Multi-model serving is uncommon for LLMs; usually each model is its own ISVC.
- GPU node selection is your responsibility — set nodeSelectors correctly.
In Chapter 48 we drill into vLLM’s actual configuration: every flag that matters for production.
Read it yourself
- The KServe documentation, particularly the InferenceService API reference.
- The KServe ServingRuntime documentation.
- The KServe v1beta1 API spec (the YAML schema for InferenceService).
- The community vLLM-runtime examples on GitHub.
- The KServe troubleshooting guide.
Practice
- Write a complete InferenceService manifest for a Llama 3 8B deployment with vLLM, using a storage URI, GPU resources, and a startup probe.
- Define a ServingRuntime for vLLM v0.6.0 that supports the huggingface model format.
- Why is the startup probe critical for LLM serving? What goes wrong if you forget to set it?
- When would you use Knative-based autoscaling vs KEDA-based for an LLM InferenceService?
- Read a real KServe InferenceService manifest from a public repository. Identify the predictor, the runtime reference, and the autoscaling configuration.
- Why do most LLM deployments not use the transformer pattern? Argue from the OpenAI-compatible API design.
- Stretch: Set up KServe on a local K8s cluster and deploy a small open model with vLLM. Verify the InferenceService creates a working endpoint.