Design a multi-tenant fine-tuning service
"Serving is hard because latency is tight. Training is hard because nothing is idempotent and everything takes hours."
The variant interviewers love. A multi-tenant fine-tuning service is the training-side analog of Chapter 120’s serving platform. The workload shape is completely different: jobs run for tens of minutes to days, GPU allocation is batch-oriented rather than request-oriented, and the failure modes include OOMs, bad hyperparameters, and divergent loss curves instead of tail latency. The economic model is also different — customers pay for GPU-hours rather than tokens, and cost accounting is about fair billing for long-running jobs.
This chapter works a canonical version: a hosted fine-tuning service that accepts user datasets, runs LoRA or QLoRA fine-tuning on a shared GPU pool, and delivers adapters that can be served by the multi-tenant serving platform from Chapter 120. The full lifecycle from “user uploads dataset” to “fine-tuned model serving” is in scope.
Outline:
- Clarify: what kinds of fine-tuning jobs.
- Estimate: job scale, GPU pool sizing.
- High-level architecture — from upload to adapter.
- Drill 1: job scheduling on a shared GPU pool.
- Drill 2: GPU quotas, isolation, and cost accounting.
- Dataset validation and preprocessing.
- Training runtime: what actually runs on the GPU.
- Eval automation and quality gates.
- Artifact versioning and handoff to serving.
- Failure modes, idempotency, and recovery.
- Tradeoffs volunteered.
- The mental model.
121.1 Clarify: what kinds of fine-tuning jobs
The clarify phase for a training problem differs from serving.
1. What fine-tuning methods? LoRA, QLoRA, full fine-tuning, SFT, DPO? Interviewer says: primarily LoRA and QLoRA for cost reasons, with full fine-tuning available for enterprise tenants as a premium feature. SFT is the default training objective; DPO available as an advanced option.
2. What base models? Llama, Mistral, Qwen families. Sizes from 1B to 70B. Larger models (70B) are premium-only due to GPU cost.
3. Dataset sizes? Typically 1k–1M examples per job. Median ~10k. Upper bound ~10M for serious customers.
4. Expected job duration? LoRA on 7B with 10k examples: ~30 minutes. LoRA on 70B with 100k examples: ~4 hours. Full fine-tune on 70B with 1M examples: ~24 hours on 32× H100.
5. Job volume? ~500 jobs/day at steady state, ~50 concurrent during peak.
6. SLA model? Free tier (no SLA, best-effort queue), standard (queued, typically starts within 1 hour), premium (reserved capacity, starts immediately).
7. How does the fine-tuned model get served? It becomes an artifact in the registry, consumed by the serving platform from Chapter 120.
8. Compliance? Same as serving — SOC 2, GDPR. Training data is especially sensitive (it often contains customer IP).
121.2 Estimate: job scale, GPU pool sizing
- Jobs/day: 500.
- Distribution: 70% LoRA on small models (~15 min each), 20% LoRA on medium (~2 hours), 9% LoRA on 70B (~4 hours), 1% full fine-tune (~24 hours).
- Average GPU-hours per job:
- Small LoRA: 0.5 GPU-hours.
- Medium LoRA: 4 GPU-hours (2 GPUs × 2 hours).
- 70B LoRA: 16 GPU-hours (4 GPUs × 4 hours).
- Full fine-tune: 32×24 = 768 GPU-hours.
- Total daily GPU-hours: 500 × (0.7×0.5 + 0.2×4 + 0.09×16 + 0.01×768) ≈ 500 × 10.3 ≈ 5150 GPU-hours/day.
- GPUs needed at steady state: 5150 / 24 ≈ 215 GPUs average.
- Peak sizing: ×1.5 = ~325 GPUs, or ~40 H100 nodes (8 GPUs each).
- GPU pool breakdown: 16 nodes (128 GPUs) for LoRA small/medium, 16 nodes (128 GPUs) for 70B LoRA, 8 nodes (64 GPUs) for full fine-tune / premium reserved.
Cost: 325 GPUs × $2/hour × 720 hours/month ≈ $468k/month for training compute. Much less than the serving fleet at the same platform scale, because training jobs can queue and batch; there's no tail-latency tax.
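The arithmetic above can be sketched in a few lines. Note that exact arithmetic lands slightly below the rounded figures in the text (the text rounds 10.27 up to 10.3 before multiplying):

```python
# Back-of-envelope GPU pool sizing for the job mix in this section.
# Inputs come from the estimates above: 500 jobs/day, four job classes
# with their GPU-hour costs, a 1.5x peak factor, 8-GPU nodes.

JOB_MIX = [
    # (share of jobs, GPU-hours per job)
    (0.70, 0.5),    # small LoRA
    (0.20, 4.0),    # medium LoRA
    (0.09, 16.0),   # 70B LoRA (4 GPUs x 4 h)
    (0.01, 768.0),  # full fine-tune (32 GPUs x 24 h)
]
JOBS_PER_DAY = 500
PEAK_FACTOR = 1.5
GPUS_PER_NODE = 8

avg_gpu_hours = sum(share * hours for share, hours in JOB_MIX)
daily_gpu_hours = JOBS_PER_DAY * avg_gpu_hours
steady_gpus = daily_gpu_hours / 24
peak_gpus = steady_gpus * PEAK_FACTOR
nodes = peak_gpus / GPUS_PER_NODE

print(f"avg GPU-hours/job: {avg_gpu_hours:.2f}")    # ~10.27
print(f"daily GPU-hours:   {daily_gpu_hours:.0f}")  # ~5135
print(f"steady GPUs:       {steady_gpus:.0f}")      # ~214
print(f"peak GPUs:         {peak_gpus:.0f}, ~{nodes:.0f} nodes")
```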
Storage:
- Datasets: customers upload datasets averaging 500 MB, totaling ~5 TB hot at any time. Cold datasets (completed jobs) move to cheaper storage.
- Checkpoints: in theory one per step; in practice saved periodically, keeping the last N. ~5 GB per checkpoint for 7B LoRA, ~50 GB for full 70B. Total active checkpoints: ~50 TB.
- Final artifacts: LoRA adapters ~200 MB, full fine-tunes ~100 GB. ~5 TB total.
121.3 High-level architecture — from upload to adapter
[ customer (tenant) ]
|
v
[ API + console ]
- upload dataset
- configure job (base model, method, hyperparams)
- submit
|
v
[ job API (control plane) ]
- validate request
- check quotas
- store job metadata (Postgres)
- return job_id
|
v
[ dataset ingestion ]
- upload to object store (S3)
- format validation (JSONL, Parquet)
- schema check (chat format, SFT format)
- PII / safety scan (optional, opt-in)
- compute statistics (token counts, distribution)
|
v
[ job queue (priority queue) ]
- per-tier priority
- per-tenant quota check
|
v
[ scheduler ]
- assign job to GPU pool
- select GPU count (2, 4, 8, 32)
- reserve nodes via K8s operator
|
v
[ training pod (K8s Job) ]
- pull base model from registry
- pull dataset from S3
- run training (HuggingFace TRL, DeepSpeed, or Axolotl)
- emit metrics (loss, throughput, eval)
- checkpoint to S3
|
v
[ eval runner ]
- automated golden set eval
- metric extraction
- quality gate
|
v
[ artifact registry ]
- tag adapter with (tenant, job_id, base_model, version)
- tenant-visible metadata
|
v
[ handoff to serving platform (Chapter 120) ]
- adapter becomes servable via the multi-tenant
LoRA hot-swap fleet
|
v
[ notification (email, webhook) ]
Separate subsystems:
- Observability (per-job metrics, logs)
- Billing (GPU-hour metering)
- Failure recovery (checkpoint resumption)
graph LR
TENANT["Tenant\nupload dataset + config"] --> API["Job API\nvalidate · quota check · job_id"]
API --> VALID["Dataset Ingestion\nformat · schema · token count\nPII scan (opt-in)"]
VALID --> QUEUE["Priority Job Queue\npremium > standard > free"]
QUEUE --> SCHED["Scheduler\ngang-schedule · locality-aware\nKueue / Volcano"]
SCHED --> POD["Training Pod\nK8s Job · GPU reserved"]
BASE["Model Registry\nbase weights"] --> POD
DATA["Dataset\nS3 tenant prefix"] --> POD
POD --> CKPT["Checkpoints\nS3 · every ~10 min"]
POD --> METRICS["Metrics\nloss · throughput · GPU util\nPrometheus + W&B"]
POD --> EVAL["Eval Runner\ngolden set + tenant eval\nquality gate"]
EVAL --> REG["Artifact Registry\nadapter: tenant · base · version"]
REG --> SERVE["Serving Platform\n(Chapter 120) LoRA hot-swap"]
REG --> NOTIFY["Notification\nemail / webhook"]
style SCHED fill:var(--fig-accent-soft),stroke:var(--fig-accent)
The scheduler is the operational heart of a fine-tuning service — gang scheduling, priority queues, and locality constraints are what make multi-tenant training fair and efficient.
Technologies:
- Job queue: Kubernetes Job API + a custom scheduler, or Ray with its job abstraction, or Temporal for orchestration (Chapter 80).
- Training runtime: HuggingFace TRL for SFT and DPO, PEFT for LoRA, DeepSpeed for ZeRO, FSDP for full fine-tunes (Chapter 12).
- Dataset storage: S3 with encryption at rest, per-tenant prefixes.
- Artifact registry: shared with serving platform (Chapter 120).
- Observability: Prometheus metrics from training pods, TensorBoard or W&B for loss curves.
- Billing: per-job GPU-hour metering emitted on job completion.
The interviewer says “drill into the scheduler.”
121.4 Drill 1: job scheduling on a shared GPU pool
The problem. A pool of ~325 GPUs, hundreds of jobs per day with different sizes (2 GPUs to 32 GPUs), different priorities (free, standard, premium), different durations (minutes to hours). Schedule them to minimize cost and maximize fairness.
Key constraints:
- Gang scheduling: a distributed training job needs all its GPUs simultaneously; a partial allocation wastes capacity or deadlocks.
- Locality: multi-node jobs need interconnect-adjacent nodes (NVLink within a node, InfiniBand across nodes).
- Priority tiers: premium jobs must start quickly; free-tier jobs absorb the slack.
- Fairness: per-tenant quotas keep one tenant from monopolizing the pool.
The scheduling algorithm. A simplified Kubernetes-native approach:
- Admit the job: check tenant quota (GPU-hours per month). Reject if exceeded.
- Enqueue in a priority queue: premium > standard > free, with FIFO within a priority.
- At scheduling tick: find the highest-priority job whose GPU requirement fits in the currently available capacity on a contiguous set of nodes.
- Gang schedule: once found, reserve all its GPUs atomically and launch the K8s Job.
- Backfilling: if a small job can fit in the gap while waiting for a big job, allow it to run if it won’t delay the big job (EASY backfilling).
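Steps 2 through 5 above can be sketched as a toy single-pool scheduler: strict priority with FIFO tiebreaks, gang admission, and EASY backfilling. This is a sketch under simplifying assumptions; real systems (Kueue, Volcano) handle nodes, topology, and preemption, while here capacity is just a GPU count:

```python
from dataclasses import dataclass
import heapq
import itertools

PRIORITY = {"premium": 0, "standard": 1, "free": 2}  # lower = served first
_seq = itertools.count()  # FIFO tiebreak within a priority tier

@dataclass
class Job:
    job_id: str
    tier: str
    gpus: int
    est_hours: float  # runtime estimate, required for backfilling

class Pool:
    def __init__(self, total_gpus):
        self.free = total_gpus
        self.queue = []    # heap of (priority, seq, job)
        self.running = {}  # job_id -> (job, finish_time)
        self.now = 0.0     # scheduler clock, in hours

    def submit(self, job):
        heapq.heappush(self.queue, (PRIORITY[job.tier], next(_seq), job))

    def tick(self):
        """One scheduling pass; returns the job_ids launched."""
        launched = []
        while self.queue:
            _, _, head = self.queue[0]
            if head.gpus <= self.free:  # gang-fit: all GPUs at once
                heapq.heappop(self.queue)
                self._launch(head, launched)
            else:
                # EASY backfill: head gets a reservation at the earliest
                # drain time; a smaller job may jump ahead only if it
                # finishes before that reservation.
                reserve_at = self._earliest_fit(head.gpus)
                for i, (_, _, j) in enumerate(list(self.queue)[1:], 1):
                    if j.gpus <= self.free and self.now + j.est_hours <= reserve_at:
                        self.queue.pop(i)
                        heapq.heapify(self.queue)
                        self._launch(j, launched)
                        break
                break
        return launched

    def _launch(self, job, launched):
        self.free -= job.gpus
        self.running[job.job_id] = (job, self.now + job.est_hours)
        launched.append(job.job_id)

    def _earliest_fit(self, gpus_needed):
        """Time at which gpus_needed become free as running jobs drain."""
        free = self.free
        for job, finish in sorted(self.running.values(), key=lambda jf: jf[1]):
            free += job.gpus
            if free >= gpus_needed:
                return finish
        return float("inf")
```

With an 8-GPU pool, a 6-GPU job launches first; a queued 4-GPU standard job can't fit, but a 2-GPU free-tier job backfills the gap because it finishes before the 4-GPU job's reservation.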
Per-pool specialization. The three GPU pools have different schedulers because the workload shapes differ. The small-LoRA pool runs many small jobs and benefits from dense packing. The 70B-LoRA pool runs fewer medium jobs and needs co-location on single nodes. The full-fine-tune pool runs few large jobs and needs multi-node gang scheduling with InfiniBand.
Volcano and Kueue: the two K8s-native batch schedulers worth naming. Volcano is more mature for HPC-style gang scheduling; Kueue is simpler and newer. Either works. A custom scheduler is sometimes warranted when specific business rules (per-tenant budgets, dynamic priority) matter.
Priority inversion. A classic problem: a low-priority job is running, and a high-priority job arrives that needs the same GPUs. Two options: preempt (lose the low-priority job’s progress) or wait. Preempt if the premium tenant’s SLA requires it; wait if the loss cost is high. Many platforms use a hybrid: preempt only if the low-priority job has been running <N% of its expected duration (so the wasted work is limited).
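The hybrid rule above reduces to a one-line predicate. The 25% threshold is an illustrative choice, not a figure from the text:

```python
# Preempt a low-priority job only if it has run less than a threshold
# fraction of its expected duration, so the wasted work is bounded.
def should_preempt(elapsed_hours, expected_hours, threshold=0.25):
    """True if killing this job wastes an acceptable amount of work."""
    return elapsed_hours < threshold * expected_hours

assert should_preempt(0.5, 4.0)      # 12.5% done: cheap to preempt
assert not should_preempt(3.5, 4.0)  # 87.5% done: let it finish
```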
Queue depth as a metric. Alert on queue depth and queue wait time per tier. A premium tenant whose jobs wait >5 minutes is an SLA violation.
121.5 Drill 2: GPU quotas, isolation, and cost accounting
Per-tenant quotas. Three kinds:
- Soft quota: a monthly GPU-hour allowance. Exceeding it is allowed but triggers overage billing.
- Hard quota: an absolute cap beyond which no jobs can run. Used for free tier and risk management.
- Concurrency quota: the max number of jobs a tenant can have in-flight at once. Prevents a single tenant from flooding the queue.
Quotas are stored in a central service and enforced at job admission. The same service emits quota-usage events to the billing pipeline.
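An admission-time check over the three quota kinds might look like the sketch below. The `Quota` shape and field names are illustrative; a real service would read them from the central quota store:

```python
from dataclasses import dataclass

@dataclass
class Quota:
    soft_gpu_hours: float     # overage billed beyond this
    hard_gpu_hours: float     # absolute monthly cap
    max_concurrent_jobs: int  # in-flight job limit

def admit(quota, used_gpu_hours, running_jobs, job_gpu_hours):
    """Returns (admitted, overage_flag, reason)."""
    if running_jobs >= quota.max_concurrent_jobs:
        return False, False, "concurrency quota exceeded"
    if used_gpu_hours + job_gpu_hours > quota.hard_gpu_hours:
        return False, False, "hard quota exceeded"
    overage = used_gpu_hours + job_gpu_hours > quota.soft_gpu_hours
    return True, overage, "ok"

q = Quota(soft_gpu_hours=100, hard_gpu_hours=200, max_concurrent_jobs=5)
print(admit(q, used_gpu_hours=90, running_jobs=2, job_gpu_hours=20))
# (True, True, 'ok') -- admitted, but flagged for overage billing
```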
Isolation during training. Training jobs run in their own K8s Jobs, one job per pod, one pod per training run. GPU assignment is via the NVIDIA device plugin and K8s scheduling. A tenant’s training pod cannot see other tenants’ data because each pod mounts only the tenant’s own dataset and base model from S3.
Network isolation. If a job uses multi-node training, its nodes are on a dedicated virtual network (CNI plugin) so that NCCL traffic between workers doesn’t leak. For premium tenants, physically dedicated InfiniBand partitions are an option but rarely used — the overhead of standing them up per job is too high.
Cost accounting. The meter runs from pod-start to pod-stop, not from job submission. Queue time is not billed. Total cost = GPU-hours × tier price. For LoRA jobs, this is accurate. For full fine-tunes on reserved nodes, the cost is higher per hour because reserved capacity is priced at a premium.
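The metering rule above, billing pod-start to pod-stop at the tier's hourly GPU price with queue time free, is a small function. The prices here are illustrative, not the platform's real rates:

```python
TIER_PRICE_PER_GPU_HOUR = {"standard": 2.50, "premium": 4.00}  # assumed rates

def job_cost(tier, gpus, pod_start_s, pod_stop_s):
    """Bill only pod runtime; queue time before pod_start is free."""
    gpu_hours = gpus * (pod_stop_s - pod_start_s) / 3600
    return round(gpu_hours * TIER_PRICE_PER_GPU_HOUR[tier], 2)

# 4 GPUs for 4 hours at the standard rate: 16 GPU-hours x $2.50
print(job_cost("standard", 4, 0, 4 * 3600))  # 40.0
```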
Refund policy for failures. If a job fails due to platform error (not user error), the tenant is refunded. Classification is important:
- Platform error: node crashed, network issue, OOM due to scheduler misallocation. Refund full job cost.
- User error: bad dataset, OOM due to user hyperparameters, divergent training. Charge in full.
- Ambiguous: rare OOMs, platform-suggested hyperparameters that cause OOMs. Discretionary refund.
121.6 Dataset validation and preprocessing
Datasets are the highest-variance part of the pipeline. A poorly formatted dataset destroys the job.
Validation steps (run before queueing):
- Format check: JSONL, Parquet, or CSV. Schema validation against the expected training format (e.g., {prompt, completion} for SFT, {chosen, rejected} for DPO).
- Size check: minimum 100 examples (below this, fine-tuning is pointless), maximum per-tier cap.
- Token count: tokenize the dataset with the base model’s tokenizer. Reject if any example exceeds max_seq_len.
- Safety scan (optional): run a classifier over the dataset to detect CSAM, obvious NSFW, or policy violations. Opt-in for most tenants because it adds latency and has false positives.
- PII scan (optional): detect emails, phone numbers, etc. Warn the tenant but don’t block.
- Quality heuristics: detect duplicates, extremely short/long examples, imbalanced classes. Report to the tenant as warnings.
- Split: auto-split into train/val if the dataset doesn’t have a pre-defined split.
The validation result is a report the tenant can read in the console. Most rejected jobs are rejected here, before any GPU time is spent.
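A minimal pre-queue validator covering a few of the checks above might look like this sketch. The record schema and limits are illustrative, and the token count here is a stand-in word count rather than the base model's real tokenizer:

```python
import json

MIN_EXAMPLES, MAX_SEQ_LEN = 100, 4096
SFT_KEYS = {"prompt", "completion"}

def validate_jsonl(lines):
    """Returns (ok, errors, warnings) for an SFT-format JSONL dataset."""
    errors, warnings, seen, records = [], [], set(), []
    for i, line in enumerate(lines):
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            errors.append(f"line {i}: not valid JSON")
            continue
        if set(rec) != SFT_KEYS:  # schema check
            errors.append(f"line {i}: expected keys {sorted(SFT_KEYS)}")
            continue
        # stand-in token count; the real check uses the base tokenizer
        if len((rec["prompt"] + " " + rec["completion"]).split()) > MAX_SEQ_LEN:
            errors.append(f"line {i}: exceeds max_seq_len")
        if line in seen:  # duplicate heuristic: warn, don't block
            warnings.append(f"line {i}: duplicate example")
        seen.add(line)
        records.append(rec)
    if len(records) < MIN_EXAMPLES:
        errors.append(f"only {len(records)} valid examples; minimum {MIN_EXAMPLES}")
    return (not errors), errors, warnings
```

Errors block the job before any GPU time is spent; warnings surface in the tenant's validation report.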
Preprocessing. Tokenization happens once, upfront, and the tokenized dataset is cached to S3. Subsequent runs with the same base model reuse the cache. This avoids the per-job tokenization overhead for repeat experiments.
121.7 Training runtime: what actually runs on the GPU
The training pod is a Kubernetes Job with:
- init container: pull base model from registry to local NVMe
- main container:
- HuggingFace TRL SFTTrainer or DPOTrainer
- PEFT for LoRA adapter
- DeepSpeed ZeRO stage 3 for memory efficiency
- FSDP for full fine-tune variants
- sidecar: metrics exporter (Prometheus)
- sidecar: log shipper (Vector to Loki)
Configuration is auto-selected. Based on (model size, dataset size, number of GPUs, method), the platform chooses sensible defaults:
- Learning rate: 2e-4 for LoRA, 2e-5 for full fine-tune.
- Batch size: determined by GPU memory, target total batch size ~64–256.
- Gradient accumulation: to hit the target batch size with the per-device batch that fits.
- LoRA rank: 16 or 32 depending on model size.
- Optimizer: AdamW or Adafactor (for large models).
- Precision: BF16 default, or fp16 with loss scaling on older GPUs.
Advanced users can override. Defaults work for >90% of cases.
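The auto-selection logic above can be sketched as a function from (method, model size, hardware) to a config. The size cutoffs and the 128 target batch are illustrative choices within the ranges the text gives:

```python
def default_config(method, model_params_b, per_device_batch, n_gpus,
                   target_batch=128):
    """Pick sensible training defaults; advanced users override any field."""
    lr = 2e-4 if method in ("lora", "qlora") else 2e-5
    rank = 32 if model_params_b >= 30 else 16   # larger models, higher rank
    # gradient accumulation to hit the target total batch size
    accum = max(1, target_batch // (per_device_batch * n_gpus))
    return {
        "learning_rate": lr,
        "lora_rank": rank if method in ("lora", "qlora") else None,
        "grad_accum_steps": accum,
        "precision": "bf16",
    }

cfg = default_config("lora", model_params_b=70, per_device_batch=4, n_gpus=4)
print(cfg)  # lr 2e-4, rank 32, accum 8, bf16
```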
Checkpointing. Every N steps (or every M minutes), the model state is saved to S3. Full fine-tunes save full model checkpoints; LoRA saves just the adapter. Checkpointing takes ~1–5 minutes and is the main serialization bottleneck. The checkpoint cadence is tuned so that the platform can recover from a pod failure without losing more than ~10 minutes of work.
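The cadence tradeoff above, recovery budget versus write overhead, can be made explicit with a small calculation:

```python
def checkpoint_tradeoff(interval_min, write_min):
    """Worst-case lost work and write overhead for a given cadence."""
    return {
        # a crash just before a save loses one full interval of work
        "max_lost_work_min": interval_min,
        # fraction of wall-clock time spent writing checkpoints
        "write_overhead": write_min / (interval_min + write_min),
    }

# Every 10 minutes with a ~1-minute adapter save: <=10 min lost, ~9% overhead.
print(checkpoint_tradeoff(10, 1))
```

Shortening the interval shrinks the recovery loss but inflates the overhead; a 5-minute full-model save every 10 minutes would spend a third of the job writing to S3, which is why full fine-tunes checkpoint less often.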
Resumption. If a training job fails partway, it can resume from the last checkpoint. The job API supports resumption with the same job_id. The scheduler re-enqueues the job with the resumption flag.
Monitoring during training. Per-job metrics:
- Loss per step (training and eval).
- Throughput (tokens/sec, steps/sec).
- GPU utilization (SMI metrics via DCGM exporter).
- Memory usage.
- Gradient norms.
Alerts fire on anomalies: loss NaN/inf, throughput dropping below 50% of expected, GPU utilization dropping (signals a data-loading bottleneck). Some alerts are user-facing; others are internal for platform ops.
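Two of those alerts, NaN/inf loss and throughput degradation, fit in a small watchdog sketch. The window size and thresholds are illustrative:

```python
import math
from collections import deque

class TrainingWatchdog:
    def __init__(self, expected_tokens_per_s, window=20):
        self.expected = expected_tokens_per_s
        self.recent = deque(maxlen=window)  # rolling throughput window

    def observe(self, loss, tokens_per_s):
        """Returns a list of alert strings for this step (empty if healthy)."""
        alerts = []
        if math.isnan(loss) or math.isinf(loss):
            alerts.append("ABORT: loss is NaN/inf (divergence)")
        self.recent.append(tokens_per_s)
        avg = sum(self.recent) / len(self.recent)
        # only alert once the window is full, to avoid startup noise
        if len(self.recent) == self.recent.maxlen and avg < 0.5 * self.expected:
            alerts.append("WARN: throughput <50% of expected (data loading?)")
        return alerts
```

The ABORT path is user-facing (the job stops, the tenant is notified); the throughput WARN is typically an internal platform-ops signal first.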
121.8 Eval automation and quality gates
After training completes, the job isn’t done until the adapter passes eval.
Default evaluation: the platform runs a small golden set against the fine-tuned model to verify basic functionality (it still generates coherent text, doesn’t regress drastically from the base model on general capabilities). This is ~50 prompts, ~5 minutes on a serving replica.
Tenant eval: the tenant can provide their own eval set. The platform runs it and reports metrics. Common patterns: accuracy on a held-out classification set, exact-match on a structured task, LLM-as-judge on open-ended generation.
Quality gates: the tenant can configure “only promote the adapter if eval metric X > Y.” Failed gates leave the adapter in a “not-promoted” state where it’s stored but not served.
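The gate check itself is simple: tenant-configured minimums against the eval metrics, and any failure leaves the adapter unpromoted. Metric names below are illustrative:

```python
def apply_gates(eval_metrics, gates):
    """gates: {metric_name: minimum}. Returns (promote, failures)."""
    failures = [
        f"{name}: {eval_metrics.get(name)} < {minimum}"
        for name, minimum in gates.items()
        if eval_metrics.get(name, float("-inf")) < minimum
    ]
    return (not failures), failures

promote, why = apply_gates({"exact_match": 0.81, "coherence": 0.95},
                           {"exact_match": 0.85})
print(promote, why)  # False: exact_match is below the gate
```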
Comparison to base model. Always show the tenant how the fine-tuned model compares to the base on the same eval set. If the fine-tune is worse on general capabilities (catastrophic forgetting), the platform warns the user and recommends adjusting hyperparameters.
Hyperparameter suggestions. If a tenant’s job fails quality gates, the platform can suggest hyperparameter changes: lower learning rate, more epochs, higher LoRA rank, better data curation. This is a user-experience lever, not an engineering one, but it’s what separates a usable fine-tuning service from one that frustrates users.
121.9 Artifact versioning and handoff to serving
When training and eval complete, the adapter is promoted to the shared artifact registry (Chapter 120).
Artifact metadata:
- adapter_id: unique ID.
- tenant_id: owner.
- base_model_id: what base model this LoRA adapts.
- base_model_version: the specific version (prevents base-model drift from invalidating the adapter).
- training_job_id: traceability back to the job.
- eval_metrics: stored alongside.
- status: draft, promoted, deprecated.
- created_at.
Promotion flow: the tenant (or their CI pipeline) calls POST /adapters/{adapter_id}/promote. The platform validates that the adapter is compatible with a currently-served base model and registers it. The serving platform’s LoRA hot-swap fleet (Chapter 120) can now route requests to it.
Immutability. Promoted adapters are immutable. To “update” a fine-tune, the tenant submits a new job and gets a new adapter_id. The tenant can switch their production traffic between versions via their own config.
Deprecation. Adapters can be deprecated (no longer served) and archived (moved to cold storage). Tenant-initiated. 30-day grace period before hard deletion.
121.10 Failure modes, idempotency, and recovery
Failure modes and mitigations:
- OOM during training. Symptom: pod crashes with CUDA OOM. Cause: batch size too large, sequence too long, or LoRA rank too high. Mitigation: the platform auto-retries once with reduced batch size; if it still fails, the job fails with a user-facing error recommending adjustments.
- Node hardware failure. Symptom: pod disappears mid-training. Mitigation: the job is re-queued from the last checkpoint. Invisible to the user except as a delay.
- Divergent training (loss explodes). Symptom: loss becomes NaN. Mitigation: an auto-abort triggers; the user is notified; recommended hyperparameter changes are suggested.
- Dataset corruption. Symptom: training starts but errors out on a malformed example. Mitigation: validation should have caught this pre-queue; if it slipped through, the job aborts and the user is told which example is bad.
- Base model unavailable. Symptom: the base model the tenant specified has been deprecated. Mitigation: validation at submission rejects jobs on deprecated models; ongoing jobs are allowed to finish.
- Slow queue. Symptom: premium tenant’s job waits too long. Mitigation: preempt a low-priority job or scale out the pool manually.
- Checkpoint corruption. Symptom: resumption from checkpoint fails. Mitigation: keep the last N checkpoints, not just the most recent, so resumption can fall back.
- Partial-result leakage. Symptom: a failed job leaves intermediate files in shared storage. Mitigation: cleanup is part of the K8s Job’s finalizer; deleted when the job is deleted.
Idempotency. Submitting the same job twice (with the same idempotency_key) returns the same job_id and does not run the training twice. This is necessary because training jobs are expensive and retries should not cost the user twice.
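A minimal sketch of that submission path, with an in-memory dict standing in for the Postgres job table:

```python
import uuid

class JobAPI:
    def __init__(self):
        self._by_key = {}   # idempotency_key -> job_id
        self.started = []   # jobs actually launched

    def submit(self, idempotency_key, config):
        """Same key returns the same job_id; training starts only once."""
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]  # no second run
        job_id = str(uuid.uuid4())
        self._by_key[idempotency_key] = job_id
        self.started.append(job_id)  # enqueue for training
        return job_id

api = JobAPI()
a = api.submit("tenant1-run-42", {"base": "llama-7b"})
b = api.submit("tenant1-run-42", {"base": "llama-7b"})  # client retry
assert a == b and len(api.started) == 1
```

In a real service the key-to-job mapping lives in the job metadata store and the insert is a transactional upsert, so concurrent retries also collapse to one job.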
At-least-once metering with idempotent billing (Chapter 122 vocabulary): the metering event for a completed job is keyed by job_id, so duplicate events don't double-bill.
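The dedup half of that pattern is a keyed first-write-wins ledger; a sketch with an in-memory dict standing in for the billing store:

```python
class BillingLedger:
    def __init__(self):
        self._charged = {}  # job_id -> gpu_hours

    def record(self, job_id, gpu_hours):
        """First write wins; redelivered metering events are no-ops."""
        self._charged.setdefault(job_id, gpu_hours)

    def total(self):
        return sum(self._charged.values())

ledger = BillingLedger()
ledger.record("job-1", 4.0)
ledger.record("job-1", 4.0)  # duplicate delivery of the same event
assert ledger.total() == 4.0
```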
121.11 Tradeoffs volunteered
- Custom scheduler vs Volcano vs Kueue: Volcano for maturity; custom if business rules demand it.
- Preempt vs wait under priority inversion: wait if low-priority job is close to done, preempt otherwise.
- Default hyperparameters vs forced user config: defaults for usability, override for power users.
- Auto-eval vs user-provided eval: both — auto for baseline sanity, user for task-specific.
- On-prem vs cloud training pool: cloud for elasticity, on-prem for stable-workload cost.
- LoRA rank default: 16 for 7B, 32 for 70B — trade off quality vs adapter size.
- Full fine-tune availability: premium-only due to cost and compliance.
- Checkpoint cadence: every 10 minutes balances recovery cost and write overhead.
- Single-job-per-pod vs multi-job-per-pod: single for isolation; multi would be cheaper but blurs tenant boundaries.
121.12 The mental model
Eight points to take into Chapter 122:
- Training jobs are batch workloads, not request workloads. GPU scheduling is gang-scheduled and locality-aware.
- Dataset validation prevents most failures. Spend aggressively on pre-submission validation to save GPU time.
- LoRA is the default. Full fine-tuning is a premium feature for cost and compliance reasons, not technical ones.
- Scheduling is priority-queued with backfilling. Volcano or Kueue is the operational backbone.
- Checkpoint cadence balances recovery cost against write overhead. ~10 minutes is a sensible default.
- Eval is automated. An adapter isn’t promoted until it passes the baseline sanity eval.
- Idempotent metering and idempotent job submission are non-negotiable for a billed training service.
- Handoff to serving (Chapter 120) uses the shared artifact registry. The same adapter lifecycle that training produces is what serving consumes.
Chapter 122 shifts gears: the vocabulary interviewers respect. Fifty phrases that signal senior-level systems thinking, with examples.
Read it yourself
- Hu et al., LoRA: Low-Rank Adaptation of Large Language Models.
- Dettmers et al., QLoRA: Efficient Finetuning of Quantized LLMs.
- HuggingFace TRL and PEFT documentation.
- Axolotl (github.com/axolotl-ai-cloud/axolotl) as an open-source reference implementation of a fine-tuning platform’s core runtime.
- Volcano (volcano.sh) and Kubernetes Kueue documentation for batch scheduling on K8s.
- DeepSpeed and FSDP documentation for the memory optimization techniques.
- The OpenAI fine-tuning API and Together AI fine-tuning API docs for comparable managed services.
Practice
- Estimate the job queue wait time at steady state if the scheduler is FIFO, there are 100 jobs in the queue, average job duration is 2 hours, and there are 10 concurrent slots available.
- A tenant uploads a 10M-row dataset. Walk through the validation pipeline: what checks run, what resources they use, and how long they take.
- Design the checkpoint strategy for a 4-hour job on 4 H100s. What’s the cadence, what gets saved, and how does recovery work if a node crashes at hour 3?
- A training job’s loss goes NaN at step 500 of 5000. Walk through the detection, user notification, and recommended remediation.
- A premium tenant needs to preempt a free-tier job to get their GPUs. Design the preemption handshake, including how the free-tier job’s partial work is preserved (or not).
- Build the cost model for a LoRA fine-tune on Llama 70B with 50k examples. How many GPU-hours, what’s the customer price, what’s the platform margin?
- Stretch: Extend the design to support DPO (Direct Preference Optimization) with preference data. What changes in dataset validation, training runtime, and eval? What new failure modes emerge?