Design a multi-tenant model serving platform
"A multi-tenant platform isn't harder than a single-tenant one. It's a different problem. The question is who owns the GPU when two tenants both want it at 3 a.m"
Multi-tenant model serving is the meta-question. It ties together Parts I through IX: model internals, inference, retrieval, agents, distributed systems, data, observability, and build/deploy. Every box on the diagram from earlier chapters reappears here, now with the added complexity of isolation, quotas, billing, and the operational problem of running many models for many customers at once. The candidate who answers this well demonstrates the full range of the book.
This is the densest design chapter. It cross-references heavily. The implicit audience is a company building an internal ML platform — or a managed-inference provider like Together, Fireworks, Anyscale, or Replicate — that hosts models for other teams or other companies.
Outline:
- Clarify: what kind of multi-tenancy.
- Estimate: the scale of a platform with 500 tenants.
- High-level architecture — platform components.
- Drill 1: tenancy isolation models.
- Drill 2: GPU sharing strategies.
- Drill 3: model registry, versioning, and rollout.
- Multi-tenant LoRA serving.
- Billing, metering, and quotas.
- Eval, observability, and the platform SLO.
- Tradeoffs volunteered.
- The mental model.
120.1 Clarify: what kind of multi-tenancy
“Multi-tenant” means different things at different scales. The clarify phase:
1. Internal platform for multiple teams, or external product for multiple customers? Interviewer says: external product — a company building a hosted inference platform for third-party customers, like a hosted alternative to OpenAI. This matters because external products have stricter isolation requirements (GDPR, SOC 2) and real billing.
2. What models? A catalog of popular open models (Llama family, Mistral, Qwen, DeepSeek) plus customer-supplied fine-tunes and LoRA adapters.
3. How many tenants and what’s the distribution? ~500 tenants, heavily skewed: 10 tenants produce 80% of traffic, 200 tenants produce 1% each, the rest are trial accounts with near-zero usage.
4. What’s the SLA model? Three tiers: free (best effort, rate-limited), standard (99.5% availability, 2s p95 TTFT), premium (99.9% availability, 1s p95 TTFT, reserved capacity).
5. What’s the isolation requirement? Tenant data must never leak across tenants. Tenant A’s requests, logs, KV cache, and billing data must be invisible to tenant B, even in a breach.
6. What are the compliance requirements? SOC 2, GDPR, HIPAA-eligible (for specific tenants). Data residency in US and EU.
The candidate writes these down. The design that follows is shaped by the skewed traffic distribution, the tiered SLAs, and the hard isolation requirement.
120.2 Estimate: the scale of a platform with 500 tenants
- Total tenants: 500, with a long tail.
- Active tenants per day: ~100.
- Active models in catalog: 20 base models, plus ~2000 customer fine-tunes (mostly LoRA adapters).
- Total inference QPS: ~10k at peak across all tenants.
- Total output tokens/day: ~50B (aggregated across all tenants).
- GPU fleet: At ~1000 tokens/sec per GPU, 50B/day / 86,400 / 1000 ≈ 580 GPUs average. Peak sizing ~2000 GPUs. This is a ~4-MW fleet, modest for a hosted provider.
- Training compute (for fine-tuning): separate from serving, see Chapter 121.
- Model storage: 20 base models × 100 GB average = 2 TB. 2000 LoRA adapters × 200 MB = 400 GB. Plus customer-uploaded full fine-tunes at ~100 GB each × 50 = 5 TB. Total model artifact storage: ~7 TB, fits on object storage comfortably.
- Metering events: 10k QPS × ~2 events per request = 20k events/sec → ~1.7B events/day to the billing pipeline.
Cost: 2000 GPUs at $1.50/hour (reserved bulk pricing) × 720 = $2.16M/month for compute. Plus storage, network, metering, ops: round to ~$2.5M/month total platform cost. Gross margin on a hosted API is typically 40–60%, implying revenue ~$4–6M/month to be healthy.
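The arithmetic above can be checked in a few lines; all inputs are the assumed figures from the estimate, not measurements:

```python
# Back-of-envelope sizing for the platform estimate above.
TOKENS_PER_DAY = 50e9         # aggregate output tokens/day
TOKENS_PER_GPU_SEC = 1000     # assumed per-GPU decode throughput
SECONDS_PER_DAY = 86_400

avg_gpus = TOKENS_PER_DAY / SECONDS_PER_DAY / TOKENS_PER_GPU_SEC
peak_gpus = 2000              # peak-sized fleet from the estimate

gpu_hour_cost = 1.50          # $/GPU-hour, reserved bulk pricing
monthly_compute = peak_gpus * gpu_hour_cost * 720  # ~720 hours/month

print(f"average GPUs: {avg_gpus:.0f}")                  # ~580
print(f"monthly compute: ${monthly_compute / 1e6:.2f}M")  # $2.16M
```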
120.3 High-level architecture — platform components
[ customer (tenant A, B, C, ...) ]
|
v
[ edge / CDN (per region) ]
|
v
[ API gateway + OpenAI-compatible front door ]
- JWT-based tenant identification
- per-tenant rate limiting
- request size limits
- header-based region routing
|
v
[ admission + router ]
- model resolution (tenant X's "gpt-4-mini" -> real model ID)
- quota check against tenant tier
- region pinning + data residency
- fine-tune / LoRA adapter resolution
|
v
[ tenancy + isolation layer ]
- tag request with tenant context
- inject tenant ACL into all downstream calls
- choose serving pool (shared vs reserved)
|
v
+---------------------+---------------------+
| |
v v
[ shared pool ] [ reserved pools ]
- vLLM fleets for base models - premium tenants with
- multi-tenant LoRA hot-swap guaranteed capacity
- continuous batching - dedicated GPU groups
- prefix caching - separate autoscaling
| |
+-------------------+-----------------------+
|
v
[ response stream to gateway ]
|
v
[ metering sidecar ]
- emit per-request event
- include tenant, model, input/output tokens,
latency, region
|
v
[ metering pipeline (Kafka -> warehouse) ]
|
v
[ billing aggregator (monthly) ]
|
v
[ invoicing / Stripe integration ]
Separate:
- Model registry (object store + metadata DB)
- Control plane (Kubernetes API + platform API)
- Observability stack (metrics/logs/traces per tenant)
- Admin UI + customer console
Technologies labeled: Envoy AI Gateway for the front door; vLLM (or TensorRT-LLM for specific models) for serving; KServe or custom K8s operators for orchestration; KEDA for per-pool autoscaling; Kafka for metering; a billing service talking to Stripe for invoicing; ArgoCD for GitOps-driven deploys of platform components.
graph LR
T["Tenant A / B / C"] --> CDN["Edge / CDN"]
CDN --> GW["API Gateway\nEnvoy AI · JWT · rate limit"]
GW --> ADM["Admission + Router\nmodel resolve · quota · region"]
ADM --> ISOL["Tenancy + Isolation Layer\ntag context · ACL inject · pool select"]
ISOL --> SHARED["Shared Pool\nvLLM · LoRA hot-swap · prefix cache"]
ISOL --> RESERVED["Reserved Pools\nPremium: dedicated GPUs"]
SHARED --> STREAM["Response Stream\n→ Gateway → Tenant"]
RESERVED --> STREAM
STREAM --> METER["Metering Sidecar\nKafka billing event"]
METER --> BILLING["Billing Aggregator\n→ Stripe invoice"]
ADM --> REG["Model Registry\nPostgres + S3 artifacts"]
style ISOL fill:var(--fig-accent-soft),stroke:var(--fig-accent)
The isolation layer is the platform’s defining component — it assigns every request to the right pool, propagates the tenant context, and enforces ACLs before any GPU work begins.
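A minimal sketch of the isolation layer's two jobs — tagging the request with tenant context and selecting a pool — assuming a hypothetical TenantContext shape (a real platform would carry more fields):

```python
# Sketch of the isolation layer. Tier names and the TenantContext
# fields are illustrative, not a real platform API.
from dataclasses import dataclass

@dataclass(frozen=True)
class TenantContext:
    tenant_id: str
    tier: str          # "free" | "standard" | "premium"
    region: str        # data-residency region, e.g. "us-east"

def select_pool(ctx: TenantContext) -> str:
    """Premium tenants get their reserved pool; everyone else shares."""
    if ctx.tier == "premium":
        return f"reserved-{ctx.tenant_id}-{ctx.region}"
    return f"shared-{ctx.region}"

def tag_request(request: dict, ctx: TenantContext) -> dict:
    """Propagate tenant context as first-class fields on the request."""
    return {**request, "tenant_id": ctx.tenant_id,
            "pool": select_pool(ctx), "region": ctx.region}

req = tag_request({"model": "llama-3.3-70b-instruct"},
                  TenantContext("t_123", "premium", "eu-west"))
```

Everything downstream (serving, logging, metering) reads the tenant fields from the request rather than re-deriving them.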
The interviewer says “drill into tenancy isolation first.”
120.4 Drill 1: tenancy isolation models
The problem. Two tenants can share the same physical GPU, the same process, the same KV cache pool. How do we guarantee that tenant A never sees tenant B’s data, even under bug conditions?
Three levels of isolation, from strongest to weakest:
1. Physically separate clusters. One Kubernetes cluster per tenant, or even one VPC per tenant. Used for enterprise customers with strict compliance. Pros: strongest guarantee, audit-friendly, noise-free. Cons: operational cost scales linearly with tenant count, resource waste (each cluster has idle capacity).
2. Namespace isolation within a shared cluster. One K8s namespace per tenant, with network policies, RBAC, and resource quotas. Pods are separated at the process level; each tenant has its own vLLM replica pool. Pros: strong isolation without per-tenant cluster cost. Cons: minimum per-tenant resource footprint (~1 GPU), inefficient for small tenants.
3. Shared process with in-process tenant tagging. One vLLM replica serves multiple tenants; each request is tagged with its tenant ID, and the serving stack enforces isolation at the request level. Cheapest but requires trust in the serving process to not leak data (no cross-tenant KV cache sharing unless explicit, no cross-tenant memory peek). This is the approach most hosted providers use for the long tail of small tenants.
The architecture: tiered isolation. Premium tenants get dedicated namespaces with reserved GPUs. Standard tenants share large pools with in-process isolation and strict prefix-cache tagging. Free tenants share the largest pools with aggressive rate limiting.
In-process isolation mechanics. The key pieces:
- Per-request tenant context propagated through gateway → admission → serving → metering. The tenant ID is a first-class field in every log, trace, and metric.
- Prefix cache scoped by tenant. vLLM’s prefix cache is keyed by (tenant_id, prompt_hash). Tenants cannot share prefix cache entries even if their prompts are identical. This costs some efficiency (duplicate work for identical system prompts across tenants) but prevents data leakage via timing side channels.
- KV cache eviction is tenant-aware. When memory pressure forces eviction, the policy considers per-tenant quotas — so a high-QPS free tenant can’t evict a premium tenant’s working set.
- Logging is partitioned. Logs are tagged with tenant_id; the log query layer enforces that a tenant can only see their own logs. The platform’s own operators see everything, with audit trail.
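The tenant-scoped cache key from the list above can be sketched as follows; prefix_cache_key is a hypothetical helper, not vLLM's internal implementation:

```python
# Tenant-scoped prefix caching: the cache key includes the tenant ID,
# so identical prompts from different tenants never share an entry.
import hashlib

def prefix_cache_key(tenant_id: str, prompt: str) -> str:
    prompt_hash = hashlib.sha256(prompt.encode()).hexdigest()
    return f"{tenant_id}:{prompt_hash}"

same_prompt = "You are a helpful assistant."
key_a = prefix_cache_key("tenant_a", same_prompt)
key_b = prefix_cache_key("tenant_b", same_prompt)
assert key_a != key_b   # identical prompts, isolated cache entries
```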
Audit trail. Every administrative action (log access, model update, rate-limit change) is logged to an append-only audit log. Required for SOC 2.
Data residency. Tenants declare their region. Requests are routed to fleets in that region; model artifacts for that tenant live in that region’s object storage. Cross-region access is denied at the admission layer.
The hard question: side-channel leakage. Two requests running on the same GPU simultaneously can leak via timing (one request’s latency depends on another’s KV cache state). Mitigation: at the highest isolation tier, pin requests to dedicated GPUs. Accept that at lower tiers, side-channel leakage is theoretically possible but operationally negligible. This is a tradeoff every hosted provider makes implicitly; senior candidates make it explicit.
120.5 Drill 2: GPU sharing strategies
GPU sharing is the central operational problem of multi-tenant serving. Three mechanisms:
1. Multi-Instance GPU (MIG). NVIDIA hardware partitioning, available on A100, H100, H200. An H100 can be split into 2/3/4/7 “instances,” each with dedicated memory and compute. Strong isolation guarantee. The cost: fixed partition shapes, can’t be changed dynamically, and the partitions are each smaller GPUs (so a 7-way split gives 7 small GPUs, each ~10 GB HBM). Useful for small-model tenants and for strict isolation. Not useful for 70B serving because the partitions are too small.
2. Multi-Process Service (MPS). NVIDIA runtime that lets multiple processes share a GPU at the compute-stream level. Weaker isolation than MIG (no hardware partitioning, and a fatal fault in one client can affect the others) but more flexible — any two processes can share. Useful when running multiple small models on one GPU, or when you need to co-locate a primary serving process and a sidecar (metering, warmup).
3. Time-sharing via serving framework. The simplest approach: one vLLM process per GPU, with continuous batching mixing requests from multiple tenants. Isolation is enforced by the serving process, not the hardware. This is what most multi-tenant LLM serving actually uses, because it’s the most efficient — the batch is always full, and the per-tenant overhead is just a tenant tag on each request.
The architecture choice: time-sharing (option 3) is the default, with MIG as a fallback for tenants who need hardware-level isolation. MPS is used only for co-locating infra processes, not for tenant separation.
Heterogeneous GPU pools. The platform has multiple GPU pools:
- Large pool (H100/H200): for 70B-class serving, shared across standard tenants.
- Medium pool (A100): for 13B/7B-class models.
- Small pool (T4/A10): for embeddings, rerankers, small classifiers.
- Reserved pool (per premium tenant): H100/H200, dedicated, no sharing.
- Fine-tune pool (Chapter 121): a separate fleet for training jobs.
Requests are routed to pools based on model size and tenant tier. A small-model request from a free tenant goes to the small pool; a large-model request from a premium tenant goes to their reserved H100s.
Pool autoscaling. Each pool has its own KEDA scaler on num_requests_running. Pools with reserved capacity have a minimum replica count equal to the reservation. Shared pools scale from a warm base to the ceiling as load demands.
The queueing challenge. When a pool is at capacity, new requests queue. A senior candidate’s move: implement priority queueing where premium tenants’ requests jump ahead of free-tier requests. This requires a separate admission buffer per priority level and a scheduler that pulls from buffers in order. Raw FIFO is unfair; priority queues match the SLA tiers.
120.6 Drill 3: model registry, versioning, and rollout
The model registry is a central service that stores model artifacts and their metadata. Every deployed model has:
- Model ID: canonical name (e.g., llama-3.3-70b-instruct).
- Version: semver or hash-based.
- Artifact URI: object storage path for weights.
- Format: Safetensors, GGUF, TensorRT engine.
- Quantization: bf16, FP8, INT4, etc.
- Runtime: compatible serving frameworks.
- Hardware requirements: min GPU count, HBM size.
- Status: active, canary, deprecated, archived.
- Tenant scope: public (any tenant can use) or private (specific tenant’s fine-tune).
The registry is backed by Postgres for metadata and S3 for artifacts. Models are pulled from S3 to local node NVMe on pod startup (Chapter 52 — the cold-start problem is acute at this scale).
Versioning uses immutable tags. A tenant pins their deployment to llama-3.3-70b-instruct:2024-11-15, and the platform guarantees that version stays available until the tenant upgrades. Deprecation gives 90 days’ notice.
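Immutable tag resolution can be sketched against a toy registry table (the schema and URIs are illustrative):

```python
# Resolve a pinned "model:tag" reference to an artifact URI.
# Tags never move once published; archived versions stop resolving.
REGISTRY = {
    ("llama-3.3-70b-instruct", "2024-11-15"): {
        "artifact_uri": "s3://models/llama-3.3-70b/2024-11-15/",
        "status": "active",
    },
}

def resolve(model_ref: str) -> str:
    model_id, _, version = model_ref.partition(":")
    entry = REGISTRY.get((model_id, version))
    if entry is None or entry["status"] == "archived":
        raise KeyError(f"unknown or archived model: {model_ref}")
    return entry["artifact_uri"]

uri = resolve("llama-3.3-70b-instruct:2024-11-15")
```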
Customer fine-tunes and LoRA adapters. A tenant uploads a LoRA adapter to the registry via an API call. The platform validates the adapter’s base model, its format, and runs a smoke test (a few inference calls). If it passes, the adapter is indexed and made available for that tenant’s serving.
Multi-tenant LoRA serving is the architectural detail that makes hosted LoRA economical. Instead of running a dedicated replica per tenant’s adapter, the platform loads many adapters into a single serving process and swaps them in and out per request. vLLM supports this via the --enable-lora flag and up to dozens of adapters per replica. Cost drops dramatically — a tenant with 100 requests/day can share a replica with 50 other small tenants, paying only for their share of compute.
Rollout. New model versions (including new adapters) go through:
- Upload → registry.
- Smoke test → reject if basic inference fails.
- Shadow deployment → served alongside the old version with 0% traffic, used for comparison.
- Canary → 1% → 10% → 100% over hours to days, with automatic rollback on metric regressions.
- Full deployment → old version is deprecated.
For internal-to-platform rollouts (new base model in the catalog), the process includes a golden-set eval gate (quality regression check) and a performance benchmark (latency and throughput regression check).
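The automatic-rollback gate in the canary step can be sketched as a threshold check; the tolerances here are illustrative, not recommended values:

```python
# Canary gate: promote only if the canary's quality and latency stay
# within tolerance of the currently deployed version's metrics.
def canary_passes(baseline: dict, canary: dict,
                  max_quality_drop: float = 0.01,
                  max_latency_ratio: float = 1.10) -> bool:
    quality_ok = (canary["golden_set_score"]
                  >= baseline["golden_set_score"] - max_quality_drop)
    latency_ok = (canary["p95_ttft_ms"]
                  <= baseline["p95_ttft_ms"] * max_latency_ratio)
    return quality_ok and latency_ok

baseline = {"golden_set_score": 0.86, "p95_ttft_ms": 900}
ok = canary_passes(baseline, {"golden_set_score": 0.855, "p95_ttft_ms": 950})
regressed = canary_passes(baseline, {"golden_set_score": 0.80, "p95_ttft_ms": 950})
```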
120.7 Multi-tenant LoRA serving
This deserves its own section because it’s the distinguishing feature of a modern platform.
The architecture. A single vLLM replica loads the base model once (say Llama 3.3 70B, ~140 GB in BF16). LoRA adapters are tiny (~200 MB each).
graph LR
REQ1["Request\ntenant A · adapter_A"] --> SCHED["vLLM Scheduler\nadapter-aware batching"]
REQ2["Request\ntenant B · adapter_B"] --> SCHED
REQ3["Request\ntenant C · adapter_B"] --> SCHED
BASE["Base Model\nLlama 70B bf16\n~140 GB HBM"] --> FWD["Forward Pass\ngrouped matmul\nper-adapter LoRA weights"]
SCHED --> FWD
HOT["Hot Adapters\n≤32 in GPU memory\n~200 MB each"] --> FWD
COLD["Cold Adapters\nCPU or NVMe\nload on first request +100 ms"] -->|"LRU eviction"| HOT
FWD --> OUT["Streaming tokens\n→ per-tenant response"]
style BASE fill:var(--fig-accent-soft),stroke:var(--fig-accent)
LoRA hot-swap lets dozens of fine-tuned tenants share one base-model replica at a throughput cost of ~5–10% — the economics that make small tenants viable. The serving process keeps ~32 adapters hot in GPU memory and swaps others in from CPU memory on demand. Each request carries an adapter ID in its metadata; the serving loop applies the right adapter for each token in the batch.
The batching implication. Without LoRA, a batch is just a bunch of requests sharing the same base model forward pass. With LoRA, a batch may have requests using different adapters. The serving framework must handle this — either by grouping requests by adapter within a batch (PagedAttention with adapter-aware scheduling) or by applying adapters in parallel via grouped matmul. vLLM and SGLang both support this.
That ~5–10% throughput cost buys a large economic benefit: a tenant with a LoRA fine-tune pays for ~1/32 of a replica instead of a full replica. At scale, LoRA hot-swap is the difference between a viable small-tenant business and a bankrupt one.
Adapter upload flow.
- Tenant runs fine-tuning (Chapter 121), producing a LoRA adapter.
- Tenant pushes adapter to platform registry via API.
- Registry validates format, verifies base model compatibility.
- Registry runs smoke test on a test replica.
- Adapter is indexed and routable.
- On first request, the adapter is pulled from S3 to the serving node’s NVMe, then loaded into GPU memory.
Adapter eviction. When too many adapters are active, cold adapters are evicted from GPU memory back to CPU memory or NVMe. Subsequent requests for evicted adapters incur a small latency hit (~100 ms) to reload. LRU eviction works well; some platforms use LFU for more popular adapter caching.
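The LRU hot-adapter cache can be sketched with an OrderedDict; the load step is a stub standing in for the real S3 → NVMe → GPU transfer (~100 ms per the text):

```python
# Hot-adapter cache: up to `capacity` adapters resident in GPU memory,
# LRU-evicted when a cold adapter is requested.
from collections import OrderedDict

class AdapterCache:
    def __init__(self, capacity: int = 32):
        self.capacity = capacity
        self.hot = OrderedDict()   # adapter_id -> weights (stubbed)

    def get(self, adapter_id: str):
        if adapter_id in self.hot:
            self.hot.move_to_end(adapter_id)           # mark recently used
            return self.hot[adapter_id]
        if len(self.hot) >= self.capacity:
            self.hot.popitem(last=False)               # evict LRU adapter
        self.hot[adapter_id] = f"weights:{adapter_id}"  # stub for ~100 ms reload
        return self.hot[adapter_id]

cache = AdapterCache(capacity=2)
cache.get("a"); cache.get("b"); cache.get("a")
cache.get("c")                       # evicts "b", the least recently used
assert list(cache.hot) == ["a", "c"]
```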
120.8 Billing, metering, and quotas
Metering pipeline (Chapter 83). Every request emits a metering event to Kafka from the metering sidecar:
{
"event_id": "...",
"tenant_id": "...",
"model": "llama-3.3-70b-instruct",
"input_tokens": 1847,
"output_tokens": 432,
"cache_hit_tokens": 800,
"region": "us-east",
"duration_ms": 1523,
"timestamp": "...",
"request_id": "..."
}
The event is idempotent (keyed by request_id) and partitioned by tenant_id for ordering. A downstream aggregator sums per-tenant usage over 1-minute and 1-hour windows, writing to a time-series store (e.g., Timescale) for real-time dashboards and quota enforcement.
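Idempotent aggregation keyed by request_id can be sketched as follows, using the event fields from the example above; the in-memory dedup set stands in for what a real pipeline would keep in a state store:

```python
# Deduplicate metering events by request_id before summing per-tenant
# usage, so Kafka redeliveries don't double-bill.
def aggregate(events: list[dict]) -> dict:
    seen, usage = set(), {}
    for e in events:
        if e["request_id"] in seen:    # duplicate delivery: skip
            continue
        seen.add(e["request_id"])
        t = usage.setdefault(e["tenant_id"],
                             {"input_tokens": 0, "output_tokens": 0})
        t["input_tokens"] += e["input_tokens"]
        t["output_tokens"] += e["output_tokens"]
    return usage

events = [
    {"request_id": "r1", "tenant_id": "t1", "input_tokens": 100, "output_tokens": 40},
    {"request_id": "r1", "tenant_id": "t1", "input_tokens": 100, "output_tokens": 40},  # redelivery
    {"request_id": "r2", "tenant_id": "t1", "input_tokens": 50, "output_tokens": 10},
]
usage = aggregate(events)
```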
Quota enforcement. The admission layer queries the quota service before each request. Quotas are defined as (tenant, period, metric, limit), e.g., “tenant_123, month, output_tokens, 100M.” If the request would exceed quota, admission returns 429.
Soft quotas vs hard quotas. Soft: the system throttles when approaching the limit (e.g., reduces max_tokens). Hard: the system rejects beyond the limit. Customers typically prefer soft quotas with alerts.
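The soft/hard distinction at admission time can be sketched as follows (the 90% soft threshold is illustrative):

```python
# Admission-time quota enforcement: a soft quota throttles by capping
# max_tokens near the limit; a hard quota rejects with HTTP 429.
def admit(used: int, limit: int, requested_max_tokens: int,
          soft_fraction: float = 0.9) -> tuple[int, int]:
    """Return (http_status, allowed_max_tokens)."""
    if used >= limit:
        return 429, 0                   # hard quota exceeded: reject
    if used >= soft_fraction * limit:   # in the soft zone: throttle
        remaining = limit - used
        return 200, min(requested_max_tokens, remaining)
    return 200, requested_max_tokens

# e.g. tenant_123, month, output_tokens, limit 100M:
assert admit(used=100_000_000, limit=100_000_000, requested_max_tokens=512) == (429, 0)
assert admit(used=99_999_900, limit=100_000_000, requested_max_tokens=512) == (200, 100)
assert admit(used=50_000_000, limit=100_000_000, requested_max_tokens=512) == (200, 512)
```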
Billing. Monthly aggregation from the time-series store → invoice line items → Stripe. Idempotent by (tenant, period). Handling: prorations, credits, disputes. A senior candidate flags that billing is as much an accounting system as an engineering system, and the corner cases (a request that spans the billing-period boundary) must be handled explicitly.
Cost allocation. Internal to the platform, the GPU cost per tenant is allocated based on actual GPU-seconds consumed, not just token count. This matters because two tenants with the same token count can have very different GPU costs (long-context vs short-context, prefix-cache hit rate). The internal cost model is richer than the customer-facing price.
120.9 Eval, observability, and the platform SLO
Platform SLOs:
| SLO | Target |
|---|---|
| API gateway availability | 99.99% |
| Premium tier TTFT p95 | 1 s |
| Standard tier TTFT p95 | 2 s |
| Model registry availability | 99.9% |
| Metering event delivery | 99.99% |
| Billing accuracy | 99.999% (financial-grade) |
Observability must be tenant-aware. Every metric has a tenant_id dimension (carefully — cardinality can explode if not controlled). Dashboards show per-tenant TTFT, error rate, token usage, and cost. Alerts fire per-tenant for SLO violations on premium customers.
Per-tenant logs and traces. Tenants can query their own request logs and traces through the customer console. The backend enforces tenant scoping on every query.
Eval pipeline. Every base model in the catalog has a golden set that runs nightly. Regression alerts fire if quality drops. New model versions must pass eval before being added to the catalog. Customer fine-tunes run a lighter eval (smoke test + the tenant’s own eval set if provided).
The noisy-neighbor problem. When one tenant’s usage spikes, other tenants in the same shared pool see TTFT degradation. Detection: per-tenant TTFT percentiles compared to pool-average. Mitigation: dynamic isolation — move the noisy tenant to a separate replica temporarily. A senior candidate flags this as a platform-specific operational problem that does not exist in single-tenant systems.
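The detection signal named above — per-tenant TTFT compared to the pool average — can be sketched as follows; the 1.5x deviation factor and the window shape are illustrative:

```python
# Flag tenants whose p95 TTFT in a shared pool deviates well beyond the
# pool average; a flagged window triggers noisy-neighbor investigation.
import statistics

def degraded_tenants(p95_ttft_by_tenant: dict, factor: float = 1.5) -> list:
    pool_avg = statistics.mean(p95_ttft_by_tenant.values())
    return [t for t, ttft in p95_ttft_by_tenant.items()
            if ttft > factor * pool_avg]

window = {"t1": 800, "t2": 900, "t3": 850, "t4": 3200}  # p95 TTFT, ms
flagged = degraded_tenants(window)
```

Note the deviation identifies the degraded tenants; correlating with per-tenant QPS in the same window identifies which tenant caused the spike and should be moved to a separate replica.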
120.10 Tradeoffs volunteered
- Dedicated pools vs shared pools: dedicated for premium, shared for standard and free.
- MIG vs MPS vs time-sharing: time-sharing by default; MIG for strict isolation tiers.
- Prefix cache scoped vs unscoped: scoped (per-tenant) for isolation; costs some efficiency.
- LoRA hot-swap vs dedicated replicas per adapter: hot-swap for cost; dedicated for premium adapters with low latency SLOs.
- Priority queue vs FIFO: priority queue for SLA tiers; more complex but matches business model.
- Central model registry vs per-runtime caching: central registry as source of truth, per-node caching for cold-start avoidance.
- Per-tenant Kubernetes namespace vs per-tenant cluster: namespace for standard; cluster for enterprise-grade compliance.
- Soft vs hard quotas: soft by default, hard for abuse prevention.
- Billing on tokens vs billing on GPU-seconds: tokens for customer pricing (simpler); GPU-seconds internally for cost allocation.
120.11 The mental model
Eight points to take into Chapter 121:
- Multi-tenancy is a tiered isolation problem. Premium gets dedicated pools; standard gets namespace-level; free gets in-process with tenant tags.
- GPU sharing is done by time-sharing inside the serving process, with MIG as an escape hatch for strict isolation.
- Multi-tenant LoRA hot-swap is the defining feature of a modern platform. Without it, small-tenant economics don’t work.
- The model registry is the source of truth for artifacts and versions. Everything downstream pulls from it.
- Metering is an idempotent event stream partitioned by tenant, with Kafka as the backbone and aggregation into a time-series store.
- Quotas are enforced at admission, not at serving. Rejection should happen before any expensive GPU work.
- Per-tenant SLOs, not aggregate. Observability must preserve tenant context end to end.
- The noisy-neighbor problem is real. Detection is per-tenant TTFT deviation; mitigation is dynamic isolation.
Chapter 121 applies the same multi-tenant framework to a different workload: fine-tuning, where GPUs run training jobs instead of serving inference.
Read it yourself
- The Together AI and Fireworks blog posts on multi-tenant LoRA serving architectures.
- The vLLM LoRA documentation and Punica paper (Chen et al. 2023) for the technical foundations.
- NVIDIA’s documentation on MIG, MPS, and Time-Slicing on Kubernetes.
- KEDA and KServe documentation on multi-tenant deployment patterns.
- AWS’s SageMaker Multi-Model Endpoints documentation for a comparable architecture.
- Google’s “Secure Multi-Tenancy at Scale” SRE talks.
- The GDPR and SOC 2 compliance documentation for the regulatory requirements.
Practice
- A tenant complains their requests are slow. Walk through the debugging steps, from request ID to root cause, using the observability stack.
- A free-tier tenant floods the system with 10x their normal request volume. Design the automated response: detection, rate limiting, shedding, alerts.
- A premium tenant wants strict hardware isolation for compliance reasons. Which of the three isolation models fits, and what does it cost per month for a 70B model?
- The platform adds a new base model. Walk through the full onboarding: eval, quantization, pool setup, catalog entry, announcement.
- A LoRA adapter upload from a tenant fails the smoke test. Design the tenant-facing error reporting and the internal alert.
- The metering pipeline has a bug that lost 6 hours of events for 3 tenants. Design the reconciliation procedure, including customer communication.
- Stretch: Design the migration plan for moving a tenant from the shared pool to a dedicated pool with zero downtime and no billing disruption. What changes in routing, state, and quotas?