Part III · Inference Internals & Production Serving
Chapter 45 · ~14 min read

Inference servers and orchestration: KServe, BentoML, Seldon, Ray Serve, Triton Inference Server

"The runtime gives you a model that runs. The orchestrator gives you a fleet that stays up."

In Chapter 44 we covered the runtimes — the engines that actually run a model on a GPU. This chapter covers the orchestration layer above the runtime: how to manage many model replicas across many GPUs across many nodes, with autoscaling, traffic split, model versioning, and operational tooling.

The orchestration layer is what turns “I have a model running on a GPU” into “I have a multi-tenant production fleet that survives outages, scales with demand, and tracks every request.” It’s a different concern from the runtime, and the major orchestration frameworks (KServe, BentoML, Seldon, Ray Serve, Triton Inference Server) compete in this space.

This chapter is the comparative tour. By the end you’ll know which orchestration layer to pick and why.

Outline:

  1. The orchestration layer’s job.
  2. KServe — the Kubernetes-native default.
  3. BentoML — the Pythonic alternative.
  4. Seldon — the older enterprise option.
  5. Ray Serve — the framework-integrated approach.
  6. Triton Inference Server — NVIDIA’s general-purpose option.
  7. Hugging Face Inference Endpoints / managed alternatives.
  8. The decision matrix.

45.1 The orchestration layer’s job

What does an orchestration layer do that a raw runtime (like vLLM) doesn’t?

[Figure: Orchestration layer responsibilities. The orchestration layer sits between the AI gateway (which routes requests, authenticates, and rate-limits) and the runtime replicas, managing replica lifecycle, autoscaling, traffic split/canary, and model versioning across replicas 0..N. Each replica is a vLLM (or other runtime) process holding the full model shard on its GPU allocation.]
The orchestration layer is what turns a single-process vLLM into a production fleet — it manages dozens of replicas, routes traffic, and scales in response to load, without the runtime knowing any of this exists.

(1) Multi-replica management. A single vLLM process serves a single model on a fixed set of GPUs. Production needs many replicas — typically dozens or hundreds — for throughput, availability, and load balancing. The orchestrator manages the lifecycle of these replicas: starting them up, distributing them across nodes, restarting them when they crash.

(2) Autoscaling. The orchestrator monitors load (queue length, GPU utilization, request rate) and adds or removes replicas dynamically. Without autoscaling, you over-provision (waste money) or under-provision (drop requests).

(3) Traffic management. Route requests to replicas based on load, location, model version, or other criteria. Handle canary deployments (route 5% of traffic to a new model version, watch metrics, decide whether to roll out fully). Handle A/B tests.

(4) Model versioning and rollback. Track which model versions are deployed where. Roll back to a previous version when something goes wrong. Manage the lifecycle of model artifacts.

(5) Multi-model serving. Serve dozens or hundreds of models on a shared cluster, with intelligent placement based on resource availability and traffic patterns.

(6) Health checks and recovery. Monitor each replica’s health, restart failing replicas, drain traffic from unhealthy nodes during cordoning.

(7) Observability. Expose metrics (request rate, latency, error rate, GPU utilization) to a monitoring system. Trace requests through the stack.

(8) Configuration management. Apply configuration changes to replicas without manual intervention. GitOps integration.
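
The autoscaling decision in (2) boils down to a control loop. A minimal sketch of the proportional rule that HPA-style autoscalers use — function and parameter names here are illustrative, not any framework's API; the metric could be something like vLLM's per-replica queue depth:

```python
import math

def desired_replicas(current: int, metric: float, target: float,
                     lo: int = 2, hi: int = 10) -> int:
    """HPA-style proportional scaling: desired = ceil(current * metric / target),
    clamped to [lo, hi]. 'metric' is the observed per-replica load (e.g. queue
    depth), 'target' is the load each replica should carry."""
    desired = math.ceil(current * metric / target)
    return max(lo, min(hi, desired))

# 4 replicas each seeing queue depth 20 against a target of 8 -> scale to 10 (capped)
print(desired_replicas(4, 20, 8))
```

The clamp is what the minReplicas/maxReplicas fields in an orchestrator's manifest feed into; the real systems add stabilization windows and cooldowns on top to avoid thrashing.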

These are operational concerns. They’re separate from “how fast is the model” (which is the runtime’s job) and from “how does the API look” (which is the gateway’s job).

The orchestrators differ in how they accomplish these tasks, what abstractions they expose, and how integrated they are with the rest of the cloud-native stack.

45.2 KServe — the Kubernetes-native default

KServe (originally KFServing, part of the Kubeflow project) is a Kubernetes-native model serving framework. It exposes models as Kubernetes custom resources (CRDs) and integrates deeply with the rest of the K8s ecosystem.

The core concept: an InferenceService (CRD). You write a YAML manifest that says “deploy this model with these resources, this autoscaler, this runtime,” and KServe creates and manages the underlying K8s objects (Deployments, Services, HPAs, Istio VirtualServices, etc.).

[Figure: KServe InferenceService lifecycle. One InferenceService YAML manifest is reconciled by the KServe controller into a Deployment, a Service, a KEDA ScaledObject, and an Istio VirtualService, producing a load-balanced serving fleet of N replicas.]
A single KServe InferenceService manifest drives the creation of every underlying Kubernetes object — developers write YAML once and KServe handles the operational wiring.

A minimal InferenceService for an LLM:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-70b
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      runtime: vllm-runtime
      storageUri: s3://models/llama-3-70b/
    minReplicas: 2
    maxReplicas: 10
    containerConcurrency: 100

KServe handles the rest: pulling the model, starting vLLM, exposing it as a service, configuring the autoscaler, registering with Istio (if you use it), exposing metrics to Prometheus, etc.

Strengths of KServe:

  • Kubernetes-native. If you’re already on K8s, KServe fits naturally. It uses the same APIs, the same RBAC, the same observability stack.
  • GitOps friendly. InferenceService manifests can be committed to git and deployed via ArgoCD or Flux.
  • Multi-runtime support. Works with vLLM, TensorRT-LLM, Triton Inference Server, custom runtimes, and many others. Adding a new runtime is straightforward.
  • Autoscaling out of the box. Integrates with KEDA (Chapter 51) for scale-to-zero and custom metrics.
  • Traffic split and canary. Built-in support for percentage-based traffic routing between model versions.
  • Model registry integration. Pull models from S3, GCS, HF Hub, OCI registries.
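
The traffic-split bullet above maps to a single field on the predictor spec. A sketch assuming the v1beta1 API (the storageUri value is a placeholder):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-70b
spec:
  predictor:
    # Send 10% of traffic to this (latest) revision; the previously
    # rolled-out revision keeps the remaining 90%.
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: huggingface
      runtime: vllm-runtime
      storageUri: s3://models/llama-3-70b-v2/
```

Promoting the canary is then just removing the field (or setting it to 100) and re-applying — the GitOps workflow falls out naturally.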

Weaknesses:

  • Operational complexity. KServe has a lot of components (controller, webhook, runtime adapter, etc.). Debugging issues requires K8s expertise.
  • Sometimes flaky. New versions of KServe occasionally have bugs that break existing deployments. Production teams pin to specific versions.
  • Multi-container InferenceService limitations. As noted in Chapter 36, KServe v0.16 panics on multi-container ISVCs, which breaks disaggregated prefill/decode setups. Workarounds exist but they’re awkward.

KServe is the default for K8s-based LLM serving. If you’re running on Kubernetes (which most production teams are), KServe is the natural choice. The operational complexity is real but manageable with the right team.

45.3 BentoML — the Pythonic alternative

BentoML is a model serving framework that emphasizes a Pythonic developer experience. Instead of writing YAML manifests, you write Python classes that describe how to serve your model.

A minimal BentoML service:

import bentoml

@bentoml.service(resources={"gpu": 1})
class LlamaService:
    def __init__(self):
        # load_llama_model() is a placeholder for your model-loading code
        # (e.g., constructing a vLLM engine or a transformers pipeline).
        self.model = load_llama_model()

    @bentoml.api
    def generate(self, prompt: str) -> str:
        return self.model.generate(prompt)

You then run bentoml serve to start the service, and BentoML handles the deployment.

Strengths:

  • Developer-friendly. Python-first, less YAML, easy to iterate locally.
  • Great for prototyping. You can go from “I have a model” to “I have a serving endpoint” in minutes.
  • Cloud deployment. BentoML has cloud offerings (BentoCloud) that handle the K8s details for you.
  • Multi-model support. Easy to compose multiple models into one service.
  • Versioning and registry. Built-in model packaging and versioning.

Weaknesses:

  • Less Kubernetes-native than KServe. BentoML can run on K8s, but it’s a separate abstraction layer.
  • Smaller community for production-grade LLM serving. BentoML is more popular for traditional ML serving.
  • Performance tuning is harder. The Python-first abstraction can hide important runtime knobs.

BentoML is the right choice for:

  • Teams that prefer a Python developer experience over YAML.
  • Mixed-model deployments (LLMs + classifiers + custom models).
  • Prototyping and iteration speed.
  • Cloud-managed deployments via BentoCloud.

For pure K8s-based production LLM serving, KServe is more common. For developer-friendly multi-model serving, BentoML is excellent.

45.4 Seldon — the older enterprise option

Seldon Core is one of the older open-source model serving frameworks, predating KServe by a few years. It was designed for enterprise ML deployments with a focus on traditional ML models (classifiers, regressors, etc.).

Strengths:

  • Mature and stable. Used in many enterprise deployments for years.
  • Rich feature set. Explainability, drift detection, A/B testing, complex graph compositions.
  • Multi-model graphs. Compose multiple models into a pipeline (e.g., a feature processor → a classifier → a postprocessor).

Weaknesses:

  • LLM support is newer. Seldon was designed before the LLM era and is catching up.
  • Less integrated with vLLM/TensorRT-LLM than KServe.
  • Complexity. The feature surface is large; learning curve is steep.

Seldon is the right choice when:

  • You have an existing Seldon deployment for classical ML and want to add LLM serving.
  • You need explainability or drift detection alongside serving.
  • You have complex multi-model pipelines.

For new LLM-focused deployments, KServe is more common. Seldon is still alive but losing market share to KServe in the LLM space.

45.5 Ray Serve — the framework-integrated approach

Ray Serve is a model serving framework built on top of Ray (the distributed computing framework from Anyscale, originally UC Berkeley). It’s tightly integrated with the rest of the Ray ecosystem (Ray Train, Ray Tune, Ray Data).

The pitch: if you’re already using Ray for distributed training, data processing, and orchestration, use Ray Serve for serving too. The same cluster handles everything.

Strengths:

  • Ray ecosystem integration. If you’re using Ray for other things, Serve fits naturally.
  • Python-first. Like BentoML, you write Python classes.
  • Distributed by design. Scaling across multiple nodes is natural.
  • Model composition. Easy to chain multiple models together.
  • Strong actor model. Each model replica is a Ray actor, which gives you per-replica state management.
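
The per-replica state point deserves a concrete picture. Ray's real API wraps classes with @serve.deployment and runs each replica as an actor process; the sketch below strips that machinery away and shows only the pattern — stateful replica objects behind a router. All names here are illustrative, not Ray's API:

```python
import itertools

class Replica:
    """Stands in for a Ray actor: one instance per replica, each with its own state."""
    def __init__(self, replica_id: int):
        self.replica_id = replica_id
        self.requests_served = 0  # per-replica state survives across requests

    def handle(self, prompt: str) -> str:
        self.requests_served += 1
        return f"replica-{self.replica_id}: {prompt}"

class Router:
    """Round-robins requests across replicas, as Serve's proxy does (simplified)."""
    def __init__(self, num_replicas: int):
        self.replicas = [Replica(i) for i in range(num_replicas)]
        self._rr = itertools.cycle(self.replicas)

    def route(self, prompt: str) -> str:
        return next(self._rr).handle(prompt)
```

In real Ray Serve the replicas live in separate processes (possibly on separate nodes) and the routing is asynchronous, but the mental model — a router dispatching to long-lived stateful workers — is the same.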

Weaknesses:

  • Tied to Ray. If you’re not using Ray, the operational overhead of running a Ray cluster is significant.
  • Less LLM-specific tooling. Not as deeply integrated with vLLM/TensorRT-LLM as KServe.
  • Newer in the LLM space. Catching up to KServe and BentoML.

Ray Serve is the right choice when:

  • You already use Ray for training or data processing.
  • You want a unified Python-first stack for end-to-end ML.
  • You have complex multi-model serving requirements.

For LLM-only serving on a fresh K8s cluster, KServe is more common. For Ray-based teams, Ray Serve is the natural choice.

45.6 Triton Inference Server — NVIDIA’s general-purpose option

We touched on Triton Inference Server in Chapter 44. It’s both a runtime (running ONNX, TensorRT, vLLM, etc. backends) and an orchestration-layer-of-sorts (handling multi-model serving, dynamic batching, model versioning).

Strengths:

  • Production-grade. Used by many large NVIDIA customers.
  • Multi-backend. Serves models from many ML frameworks under a unified API.
  • Dynamic batching for non-LLM models. (For LLMs, the runtime backend handles batching.)
  • Model repository pattern. Models are organized in a directory structure with versioning.
  • Metrics and observability. Built-in Prometheus metrics.
  • gRPC and HTTP APIs.
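
The model repository pattern is concrete: one directory per model, a numbered directory per version, and a config.pbtxt describing the backend. A minimal sketch — the model name and ONNX file are placeholders:

```
model_repository/
└── sentiment_classifier/       # one directory per model
    ├── config.pbtxt
    ├── 1/
    │   └── model.onnx          # version 1
    └── 2/
        └── model.onnx          # version 2 (latest, served by default)
```

And a matching config.pbtxt, showing the dynamic-batching knob from the strengths list:

```
name: "sentiment_classifier"
backend: "onnxruntime"
max_batch_size: 32
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```

Triton watches the repository directory, so dropping in a new version directory is how model updates roll out.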

Weaknesses:

  • NVIDIA-centric. Best on NVIDIA hardware; less optimized for AMD/Intel.
  • Less K8s-native than KServe. It’s a server, not a K8s controller. You can run it on K8s, but the integration is not as deep.
  • LLM features are newer. vLLM backend support was added later than vLLM’s standalone capabilities.

Triton Inference Server is the right choice when:

  • You’re heavily invested in NVIDIA’s stack.
  • You have mixed ML workloads (LLMs + classifiers + custom models).
  • You want a battle-tested production server.

For pure LLM serving, KServe or vLLM directly is often simpler. For mixed workloads, Triton Inference Server is competitive.

45.7 Hugging Face Inference Endpoints / managed alternatives

For teams that don’t want to manage the orchestration themselves, managed services are a real option:

  • Hugging Face Inference Endpoints — managed serving for HF Hub models. Pay per hour.
  • Together AI / Fireworks AI / DeepInfra / Anyscale Endpoints — managed LLM serving for popular open models. Pay per token.
  • AWS SageMaker / GCP Vertex AI / Azure ML — cloud-provided ML serving.
  • Replicate — managed inference for any open model.
  • Modal — Python-first serverless ML.

These services handle all the orchestration, scaling, and operational concerns. You just call an API.

The trade-offs:

  • Convenience: very high. No K8s, no GPUs to manage.
  • Cost per token: similar to or slightly higher than self-hosted.
  • Flexibility: lower. You’re constrained to the models and configurations the provider supports.
  • Data privacy: data goes through a third party.
  • Vendor lock-in: yes.

For prototyping, low-volume production, or teams without ops capacity, managed services are often the right choice. For high-volume production with specific requirements (custom models, fine-tunes, data privacy), self-hosted is usually better.

The crossover is similar to the self-host-vs-API decision (Chapter 30): managed wins below a threshold of volume; self-hosted wins above. The threshold for orchestration is similar to the threshold for runtime: roughly 100M-500M tokens/month.
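
That threshold can be sanity-checked with back-of-envelope arithmetic. All prices below are illustrative assumptions, not quotes from any provider:

```python
def crossover_tokens_per_month(gpu_cost_per_hour: float,
                               managed_price_per_million: float) -> float:
    """Monthly token volume at which one always-on GPU costs the same as
    paying a managed per-token price. Below this, managed is cheaper; above
    it (up to the GPU's throughput ceiling), self-hosting wins. Ignores
    engineering cost and idle time."""
    monthly_gpu_cost = gpu_cost_per_hour * 24 * 30
    return monthly_gpu_cost / managed_price_per_million * 1e6

# An illustrative $1.20/hr GPU vs $3 per 1M managed tokens:
# 1.20 * 720 = $864/month, 864 / 3 * 1e6 -> crossover at 288M tokens/month,
# inside the rough 100M-500M range above.
print(crossover_tokens_per_month(1.20, 3.0))
```

The real decision folds in ops headcount, peak-to-average load ratio, and whether you need a fine-tuned or private model at all — but the shape of the curve is this simple.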

45.8 The decision matrix

Use case                              Recommended orchestration
Default for K8s-based LLM serving     KServe
Python-first developer experience     BentoML
Existing Ray ecosystem                Ray Serve
NVIDIA-heavy mixed ML workloads       Triton Inference Server
Existing Seldon deployment            Seldon
Don’t want to manage anything         Managed (HF, Together, etc.)
Multi-cloud deployment                KServe (most portable)
Edge / mobile                         (Not really an orchestration question — use llama.cpp directly)
Disaggregated PD                      KServe with workarounds, or custom

For most production LLM teams in 2025: KServe + vLLM. That’s the modern default stack. The operational complexity is real but the ecosystem support is the best.

graph TD
  Start[Choose orchestration layer] --> Q1{Already on Kubernetes?}
  Q1 -->|Yes| Q2{Already using Ray?}
  Q1 -->|No| Q3{Want zero ops?}
  Q2 -->|Yes| RaySrv[Ray Serve]
  Q2 -->|No| Q4{Python-first DX?}
  Q4 -->|Yes| BML[BentoML]
  Q4 -->|No| KS[KServe]
  Q3 -->|Yes| Mgd[Managed — HF / Together / etc.]
  Q3 -->|No| Q5{NVIDIA-heavy mixed ML?}
  Q5 -->|Yes| Triton[Triton Inference Server]
  Q5 -->|No| KS

KServe is the default for Kubernetes-native teams; managed services are often the right call below ~100M tokens/month.

45.9 The mental model

Eight points to take into Chapter 46:

  1. The orchestration layer is separate from the runtime. It manages replicas, autoscaling, traffic, versioning.
  2. KServe is the Kubernetes-native default. Custom resources, GitOps-friendly, multi-runtime.
  3. BentoML is the Python-first alternative. Better DX, less K8s integration.
  4. Seldon is the older enterprise option. Mature but less LLM-focused.
  5. Ray Serve is for teams already using Ray.
  6. Triton Inference Server is for NVIDIA-heavy mixed workloads.
  7. Managed services are the right choice when ops capacity is limited.
  8. KServe + vLLM is the modern default for production LLM serving on K8s.

In Chapter 46 we look at how the runtime, orchestration, gateway, autoscaler, and observability all compose into a single platform — the umbrella pattern.


Read it yourself

  • The KServe documentation and the InferenceService CRD reference.
  • The BentoML documentation and quickstart.
  • The Seldon Core documentation.
  • The Ray Serve documentation.
  • The Triton Inference Server documentation.
  • The KServe + vLLM integration guide on the KServe docs.

Practice

  1. Write a KServe InferenceService manifest for serving Llama 3 8B with vLLM, with min 1 / max 5 replicas and autoscaling on vllm:num_requests_running.
  2. Why is KServe Kubernetes-native and BentoML not? Compare the core abstractions.
  3. When would you choose Ray Serve over KServe? Construct a specific scenario.
  4. Read the KServe v1beta1 API reference for the InferenceService spec. Identify the fields for autoscaling, traffic split, and storage.
  5. Why are managed services like Together AI and Fireworks AI competitive on price with self-hosting? Trace where their efficiency comes from.
  6. What’s the difference between KServe’s autoscaler and the standard Kubernetes HPA? Why does KServe need a custom one?
  7. Stretch: Set up a local KServe deployment with a small open model and the vLLM runtime. Verify it serves requests via the OpenAI-compatible API.