Composing an inference platform: the umbrella pattern
"A serving framework runs a model. An inference platform runs a fleet."
The previous two chapters covered the runtime layer (Chapter 44) and the orchestration layer (Chapter 45) separately. This chapter is about how they compose with everything else — the AI gateway, the autoscaler, the model cache, the KV cache fabric, the observability stack, the metering pipeline — into a single coherent platform.
The architectural pattern that has emerged is the inference platform umbrella: a single deployable unit (often a Helm chart) that wires all the components together with consistent configuration. Examples in the wild include factory-models-style umbrella charts, NVIDIA’s NIM containers, and various proprietary platforms.
This chapter is about why the umbrella pattern exists, what it contains, and how to build one.
Outline:
- The pieces and the wiring problem.
- The umbrella pattern.
- The components of a production platform.
- The Helm chart as platform.
- The NIM-style container.
- Configuration consistency.
- The deployment lifecycle.
- The reference architecture.
46.1 The pieces and the wiring problem
To run a production LLM, you need at minimum:
- A runtime (vLLM, TensorRT-LLM, etc.) — runs the model.
- An orchestrator (KServe, etc.) — manages replicas.
- An AI gateway (Envoy AI Gateway, etc.) — routes requests.
- An autoscaler (KEDA, HPA) — scales replicas based on load.
- A model cache (LocalModelCache, init container) — pre-pulls weights.
- An observability stack — Prometheus metrics, logs, traces.
- A KV cache layer (LMCache, etc.) — optionally, for cross-replica sharing.
These are seven (or more) independent components, each with its own configuration, deployment manifest, and operational concerns. Wiring them together by hand is tedious and error-prone:
- The runtime needs to know about the model storage location.
- The orchestrator needs to know about the runtime’s resource requirements.
- The gateway needs to know about the orchestrator’s service endpoints.
- The autoscaler needs to know about the runtime’s metrics endpoint.
- The model cache needs to know about the runtime’s expected model directory.
- The observability stack needs to know about all of the above.
- The KV cache layer needs to coordinate with the runtime’s allocator.
Without coordination, you end up with seven separately-managed deployments and a mountain of glue YAML. The umbrella pattern solves this by packaging all the components together with consistent configuration.
46.2 The umbrella pattern
The pattern: a single configuration object (a Helm chart values file, a custom resource, a manifest set) describes the whole platform. From it, all the individual components are generated with consistent settings.
For example, a single values file might say:
model:
  name: llama-3-70b
  storageUri: s3://models/llama-3-70b/
runtime:
  image: vllm/vllm-openai:latest
  args:
    - --max-model-len=32768
    - --tensor-parallel-size=8
    - --quantization=fp8
  resources:
    gpu: 8
autoscaling:
  minReplicas: 2
  maxReplicas: 20
  metric: vllm:num_requests_running
  threshold: 50
gateway:
  enabled: true
  routePath: /v1/llama-3-70b
modelCache:
  enabled: true
  storageClass: nvme-local
observability:
  prometheusScrape: true
  grafanaDashboard: true
From this single file, the umbrella chart generates:
- The KServe InferenceService manifest.
- The KEDA ScaledObject for autoscaling.
- The Envoy AI Gateway routing rule.
- The LocalModelCache CRD (if supported).
- The PodMonitor for Prometheus scraping.
- The Grafana dashboard ConfigMap.
The user only sees the values file. The umbrella handles the wiring.
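As a concrete example of one generated manifest, here is a sketch of the InferenceService the chart might render from the values above. The field layout follows KServe's v1beta1 CRD, but the exact output depends on the chart's templates; names and structure here are illustrative:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-70b                      # derived from model.name
spec:
  predictor:
    minReplicas: 2                       # from autoscaling.minReplicas
    maxReplicas: 20                      # from autoscaling.maxReplicas
    containers:
      - name: kserve-container
        image: vllm/vllm-openai:latest   # from runtime.image
        args:
          - --max-model-len=32768
          - --tensor-parallel-size=8
          - --quantization=fp8
        resources:
          limits:
            nvidia.com/gpu: "8"          # from runtime.resources.gpu
        # model.storageUri would be wired in here via KServe's
        # storage initializer, so the weights land where vLLM expects them
```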
This is the same pattern as Helm umbrella charts in the broader Kubernetes world. What’s new is that LLM serving has enough independent components that the pattern is essential, not optional.
46.3 The components of a production platform
Let me enumerate the components in detail, since you’ll see all of them in any production LLM platform.
Runtime layer
- vLLM (or alternative) as the inference engine.
- Model weights loaded from storage.
- GPU resources declared via Kubernetes resource requests.
- Health probes for liveness and readiness.
Orchestration layer
- KServe InferenceService (or equivalent) wrapping the runtime.
- Multiple replicas for throughput and availability.
- Resource limits to prevent runaway memory.
- Pod anti-affinity to spread replicas across nodes.
Autoscaling
- KEDA ScaledObject with a custom metric (e.g., vllm:num_requests_running).
- Min and max replicas to bound the cost.
- Cooldown periods to prevent flapping.
- Optionally scale-to-zero for low-traffic models.
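The autoscaling pieces above map onto a KEDA ScaledObject with a Prometheus trigger. A hedged sketch (the target Deployment name and Prometheus address are illustrative):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-3-70b-scaler
spec:
  scaleTargetRef:
    name: llama-3-70b-predictor     # the Deployment the orchestrator created (name illustrative)
  minReplicaCount: 2
  maxReplicaCount: 20
  cooldownPeriod: 300               # seconds before scaling down, to prevent flapping
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(vllm:num_requests_running)
        threshold: "50"             # scale out when in-flight requests per replica exceed this
```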
Gateway / routing
- AI gateway (Envoy AI Gateway, vLLM router, custom proxy).
- Path-based or header-based routing to map requests to model fleets.
- Request/response transformation if needed (e.g., to adapt non-OpenAI APIs).
- Authentication at the gateway level.
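Envoy AI Gateway builds on the Kubernetes Gateway API, so the routing piece reduces to something like the following plain HTTPRoute. This is a sketch; the gateway and backend Service names are illustrative, and an AI-gateway-specific CRD may add model-aware fields on top:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama-3-70b-route
spec:
  parentRefs:
    - name: ai-gateway                    # the shared Gateway (name illustrative)
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1/llama-3-70b        # path-based routing to this model fleet
      backendRefs:
        - name: llama-3-70b-predictor     # the model's Service (name illustrative)
          port: 80
```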
Model storage and caching
- Model artifact storage (S3, GCS, OCI registry, NFS).
- LocalModelCache or init container to pre-pull weights to node-local storage.
- NVMe-backed PVCs for fast loading.
- Versioning to track which model version is deployed.
Observability
- Prometheus metrics scraped from each replica (vLLM exposes them on /metrics).
- Grafana dashboards for visualization.
- Alerts for SLO violations (high latency, error rate, GPU utilization).
- Distributed tracing for request flow.
- Structured logs shipped to Loki or similar.
KV cache layer (optional)
- LMCache or similar for cross-replica KV sharing.
- Redis backend for the shared cache tier.
- Configuration for tier sizes and eviction policies.
Metering and billing
- Metering sidecar to count tokens and requests per tenant.
- Event stream to a billing aggregator.
- Per-tenant quotas enforced at the gateway.
Security
- TLS termination at the gateway.
- Authentication (JWT, API key, OAuth).
- Authorization per tenant.
- Network policies restricting access between components.
That’s a lot of pieces. The umbrella chart is what makes them deployable as one unit.
46.4 The Helm chart as platform
In the Kubernetes ecosystem, the umbrella pattern is typically implemented as a Helm chart (or a Kustomize overlay, or an Argo CD ApplicationSet). The chart defines:
- The default values for every component.
- The templates that generate the K8s manifests.
- The dependencies on other charts (vLLM, KServe, KEDA, Envoy AI Gateway, etc.).
A single helm install factory-models -f my-values.yaml deploys the entire platform.
The structure of an umbrella chart (typical):
factory-models/
├── Chart.yaml # chart metadata
├── values.yaml # default values
├── templates/
│ ├── inference-service.yaml # KServe ISVC
│ ├── scaled-object.yaml # KEDA scaler
│ ├── gateway-route.yaml # Envoy AI Gateway routing
│ ├── model-cache.yaml # LocalModelCache CRD
│ ├── pod-monitor.yaml # Prometheus scraping
│ ├── prometheus-rules.yaml # alerting rules
│ └── ...
└── charts/ # subcharts (Envoy Gateway, KEDA, etc.)
When you change a value (e.g., model.name), all the templates that reference it regenerate consistently. The single source of truth is the values file.
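A sketch of what one such template might look like. The manifest shape matches KEDA's ScaledObject; the naming conventions are illustrative. The point is that every name and number is derived from .Values, so a change to model.name propagates everywhere:

```yaml
# templates/scaled-object.yaml (illustrative)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: {{ .Values.model.name }}-scaler
spec:
  scaleTargetRef:
    name: {{ .Values.model.name }}-predictor   # same derivation the ISVC template uses
  minReplicaCount: {{ .Values.autoscaling.minReplicas }}
  maxReplicaCount: {{ .Values.autoscaling.maxReplicas }}
  triggers:
    - type: prometheus
      metadata:
        query: sum({{ .Values.autoscaling.metric }})
        threshold: {{ .Values.autoscaling.threshold | quote }}
```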
This is the standard pattern for production LLM platforms in 2025. Most internal platforms at large companies follow this pattern; many open-source examples exist.
46.5 The NIM-style container
NVIDIA’s NIM (NVIDIA Inference Microservices) is an alternative pattern: instead of a Helm chart that deploys multiple components, package the entire serving stack (runtime + orchestration + observability) into a single OCI container that you deploy as a normal K8s Deployment.
A NIM container includes:
- The model weights (or a downloader).
- The runtime (TensorRT-LLM or vLLM).
- An OpenAI-compatible API server.
- Prometheus metrics endpoint.
- Health probes.
- Configuration via environment variables.
The pitch: “you don’t need a Helm chart and seven components — just kubectl run this container and it works.”
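In Kubernetes terms, that pitch amounts to a single ordinary Deployment. A hedged sketch, with the image path, port, and health endpoint as assumptions (consult NVIDIA's NIM documentation for the exact values per model):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-3-70b-nim
spec:
  replicas: 2
  selector:
    matchLabels: { app: llama-3-70b-nim }
  template:
    metadata:
      labels: { app: llama-3-70b-nim }
    spec:
      containers:
        - name: nim
          image: nvcr.io/nim/meta/llama-3-70b-instruct:latest   # illustrative image path
          env:
            - name: NGC_API_KEY            # NIM configures itself via env vars
              valueFrom:
                secretKeyRef: { name: ngc-secret, key: api-key }
          resources:
            limits:
              nvidia.com/gpu: "8"
          readinessProbe:
            httpGet: { path: /v1/health/ready, port: 8000 }     # illustrative endpoint/port
```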
The trade-offs:
NIM advantages:
- Simpler deployment (one container, one Deployment).
- Faster to get started.
- Tested as a unit by NVIDIA.
- Vendor-supported.
NIM disadvantages:
- Less flexible than the umbrella chart (you can’t easily swap components).
- Tied to NVIDIA’s stack.
- Updates require waiting for new container releases.
- Some optimizations (cross-replica caching, custom autoscaling) are harder.
NIM is the right choice for simple deployments where you want NVIDIA to do the integration work. The umbrella chart is the right choice for flexible deployments where you want fine-grained control.
Most large production teams use the umbrella chart approach. Most teams just getting started use NIM (or equivalent vendor-packaged options) because it’s simpler.
46.6 Configuration consistency
The deepest reason for the umbrella pattern is configuration consistency. Without a single source of truth, you end up with subtle inconsistencies:
- The autoscaler scales on vllm:num_requests_running, but the runtime is configured with a different metric port.
- The gateway routes requests to llama-3-70b, but the InferenceService is named llama3-70b (different).
- The model cache pulls from one S3 bucket, but the runtime expects the model in a different path.
- The observability stack scrapes one metrics endpoint, but the runtime exposes them on a different port.
Each of these is a small bug, but together they cause production incidents. The umbrella chart eliminates them by deriving every reference from a single source of truth.
The senior insight: the platform’s complexity is the wiring, not the components. Each component (vLLM, KServe, KEDA, etc.) is well-documented and well-tested in isolation. The complexity is in making them work together correctly. The umbrella pattern is an answer to that complexity.
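One way to catch this class of wiring bug before it ships is a pre-merge check over the rendered manifests. A minimal sketch — the manifest shapes here are simplified dicts, and the check covers only one cross-reference (route backend → service name) of the many a real platform would verify:

```python
def check_wiring(manifests):
    """Cross-check name references between rendered manifests.

    `manifests` is a list of dicts shaped like (simplified) Kubernetes
    objects. Returns a list of human-readable inconsistencies.
    """
    errors = []
    # Index objects by kind -> set of metadata.name
    names = {}
    for m in manifests:
        names.setdefault(m["kind"], set()).add(m["metadata"]["name"])

    for m in manifests:
        # A route's backend must point at an InferenceService that exists
        if m["kind"] == "HTTPRoute":
            for backend in m["spec"]["backendRefs"]:
                if backend["name"] not in names.get("InferenceService", set()):
                    errors.append(
                        f"route {m['metadata']['name']} targets missing "
                        f"service {backend['name']}"
                    )
    return errors
```

Run against manifests where the gateway says llama-3-70b but the service is named llama3-70b, this flags exactly the kind of one-character drift that otherwise surfaces as a production incident.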
46.7 The deployment lifecycle
A typical deployment lifecycle for an umbrella-based platform:
sequenceDiagram
participant Dev as Developer
participant Git as Git repo
participant ArgoCD as Argo CD
participant Helm as Helm chart
participant K8s as Kubernetes
participant vLLM as vLLM pod
Dev->>Git: commit values.yaml
ArgoCD->>Git: detect change
ArgoCD->>Helm: render templates with new values
Helm->>K8s: apply manifests
K8s->>K8s: schedule GPU pods
K8s->>vLLM: start init container - pull model weights
vLLM->>vLLM: load model into GPU memory
vLLM->>K8s: readiness probe passes
K8s->>Dev: deployment complete
GitOps-driven deployment: committing a values.yaml file is the only manual step — the rest of the platform lifecycle is automated.
(1) Author the values file. A team member writes a values.yaml describing the desired model deployment.
(2) Commit to git. The values file is committed to a git repo (the GitOps source of truth).
(3) Argo CD detects the change. A GitOps controller (Argo CD, Flux) sees the new commit and reconciles the cluster.
(4) Helm renders the chart. The umbrella chart’s templates are rendered with the new values.
(5) Kubernetes applies the manifests. The KServe InferenceService, KEDA ScaledObject, gateway route, etc. are created or updated.
(6) KServe creates the Deployment. The InferenceService controller spawns the underlying K8s Deployment.
(7) Pods schedule onto GPU nodes. The K8s scheduler picks GPU nodes that match the resource request.
(8) Init container pulls the model. The LocalModelCache or init container downloads the model weights to local storage.
(9) vLLM starts up. The runtime loads the model into GPU memory.
(10) Health checks pass. The pod becomes ready and starts receiving traffic.
(11) KEDA monitors metrics. The autoscaler watches the configured metric and scales replicas.
(12) Gateway routes traffic. Requests come in via the AI gateway and are load-balanced across the replicas.
(13) Observability captures everything. Metrics, logs, and traces flow to the monitoring stack.
This is the standard flow. Once it’s set up, deploying a new model is as simple as committing a new values file. The platform handles the rest.
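Step (3) is typically driven by an Argo CD Application that points at the umbrella chart. A sketch, with the repo URL, paths, and namespaces as illustrative assumptions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: factory-models
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/platform-config   # illustrative repo
    path: charts/factory-models
    helm:
      valueFiles:
        - my-values.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: inference
  syncPolicy:
    automated:
      prune: true       # delete resources removed from git
      selfHeal: true    # revert out-of-band changes to match git
```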
46.8 The reference architecture
Putting it all together, a reference architecture for a production LLM platform:
┌─────────────────────────────────────────────────────────┐
│ AI Gateway (Envoy AI Gateway) │
│ - OpenAI-compatible API │
│ - Path-based routing │
│ - Authentication │
│ - Per-tenant rate limiting │
└─────────────────┬───────────────────────────────────────┘
│
┌────────┴────────┐
│ │
v v
┌────────────────┐ ┌────────────────┐
│ KServe ISVC │ │ KServe ISVC │
│ llama-3-70b │ │ qwen-2.5-72b │
└────────┬───────┘ └────────┬───────┘
│ │
v v
┌────────────────┐ ┌────────────────┐
│ vLLM replicas │ │ vLLM replicas │
│ (TP=8) │ │ (TP=8) │
└────────┬───────┘ └────────────────┘
│
v
┌────────────────┐ ┌──────────────┐
│ LMCache │<─────>│ Redis (KV │
│ (per replica) │ │ sharing) │
└────────┬───────┘ └──────────────┘
│
v
┌────────────────┐
│ LocalModelCache│
│ (NVMe) │
└────────────────┘
▲
│ (Cross-cutting)
│
┌──┴────────────────────────────────────────┐
│ KEDA ScaledObjects (autoscaling) │
│ Prometheus (metrics) │
│ Grafana (dashboards) │
│ Loki (logs) │
│ Metering sidecar (token counting) │
│ ArgoCD (GitOps deployment) │
└───────────────────────────────────────────┘
This is the reference. Production deployments vary, but the core pieces are always present in some form.
The umbrella chart packages all of this. The values file describes the model and the desired configuration. The chart generates everything else.
46.9 The mental model
Eight points to take into Chapter 47:
- A production LLM platform has many components. Runtime, orchestrator, gateway, autoscaler, cache, observability, security.
- The umbrella pattern packages them as one deployable unit. One values file → many manifests.
- Helm charts are the standard implementation. Other approaches (Kustomize, ApplicationSets) work too.
- NIM-style single-container packaging is the simpler alternative for less flexible deployments.
- Configuration consistency is the deepest reason for the pattern. Wiring is where bugs live.
- GitOps deployment (Argo CD, Flux) is the standard for managing umbrella charts.
- The reference architecture has gateway → orchestrator → runtime → cache layers, with observability cross-cutting.
- Deploying a new model becomes a values file change. This is the operational ideal.
In Chapter 47 we look at one of the most important components in detail: KServe’s InferenceService anatomy.
Read it yourself
- The Helm documentation on subcharts and umbrella patterns.
- The KServe documentation on the InferenceService CRD.
- The NVIDIA NIM documentation.
- Argo CD documentation on Application and ApplicationSet patterns.
- Open-source LLM serving Helm charts on GitHub (search for “vLLM helm chart” or “LLM serving chart”).
Practice
- Sketch the components of a production LLM platform. Identify the configuration consistency requirements between any two components.
- Why does the umbrella pattern exist? Argue at the level of “configuration drift between independent deployments.”
- Write a Helm values file that specifies a Llama 3 70B deployment with vLLM, KEDA autoscaling, and Envoy AI Gateway routing. Identify which fields cross-reference between components.
- Compare the umbrella chart approach to the NIM-style single-container approach. Which is better for which scenario?
- Read an open-source LLM serving Helm chart (any one). Identify the dependencies between subcharts.
- Why is GitOps a natural fit for the umbrella pattern?
- Stretch: Build a minimal umbrella Helm chart that deploys a small vLLM model, a KEDA scaler, and a service. Test it on a local K8s cluster (kind, k3s, etc.).