Part X · ML System Design Interview Playbook
Chapter 126 · ~33 min read

Design: the production infrastructure

"You don't know if the model is good until you measure it, you can't ship the model until you track it, and you can't track it until someone scheduled the job that trained it."

There is a tier of infrastructure below the serving fleet that the interview candidate who only practices “design a chatbot” never sees. This tier is the production-grade machinery that keeps ML systems honest: the eval harness that decides whether a model is allowed to ship, the model registry that knows where every artifact lives and who owns it, and the GPU scheduler that allocates the compute the whole enterprise runs on. Each of these is an entire 45-minute interview question by itself. This chapter covers all three.


§126.1 Design an eval harness

The problem

The interviewer asks: “How would you design an evaluation harness for a team that is shipping new LLM versions every week? The harness needs to catch regressions before they hit production, support LLM-as-judge scoring, maintain a growing golden set, detect drift over time, and gate model deploys in CI.”

What they are testing: does the candidate understand that eval is a pipeline, not a script? Do they know the difference between offline eval (golden set on fixed data) and online eval (real traffic with LLM-judge or behavioral metrics)? Do they understand the eval-as-gating pattern — where the CI pipeline cannot merge a model without a passing eval score? And do they understand the meta-problem: evaluating the evaluator?

Clarifying questions

  • What is the eval signal? Human labels on a golden set, an LLM-as-judge scoring responses, behavioral metrics (correct JSON schema, no hallucinated citations, task completion), or all three?
  • What is the latency budget for an eval run? A run that takes 4 hours is unusable as a CI gate (it blocks every deploy). A run that takes 15 minutes is practical. This constrains how many cases can be in the golden set per run.
  • What constitutes a passing eval? An absolute threshold (score ≥ 0.85), or a relative threshold (score does not drop more than 2% from the previous version)?
  • Who curates the golden set? ML engineers, subject-matter experts, or a mix? Is there a feedback loop from production failures?
  • How many models and tasks? A single model on a single task is trivial. 50 models on 200 tasks requires a registry of eval configs and a parallel execution strategy.
  • What is the drift detection requirement? Weekly comparisons against a baseline? Continuous anomaly detection on production outputs?

Back-of-envelope

Eval run cost: assume 500 golden-set cases, each requiring one model call (input ~800 tokens, output ~300 tokens) plus one LLM-judge call (~1500 tokens in, ~100 tokens out). Total tokens per eval run: 500 × (1100 + 1600) = 1.35M tokens. At self-hosted 70B cost of $0.45/M tokens: $0.60 per eval run. At $10/M tokens (GPT-4o): $13.50 per run. Running 10 eval runs per day (every commit to main): self-hosted costs $6/day, GPT-4o costs $135/day. Self-hosted eval infrastructure is clearly the right call at any reasonable volume.

Eval run latency: 500 cases in parallel at 20 RPS sustained throughput on the eval cluster: 500/20 = 25 seconds for model calls, plus 25 seconds for judge calls = ~1 minute total. At 10 RPS (conservative): ~1.5 minutes. A 15-minute CI gate budget is achievable with a dedicated eval cluster of ~2× H100s.

Golden set storage: 500 cases × 10 versions of expected outputs (each output ~2 KB) = 10 MB. Trivial. With 10 models × 100 tasks × 500 cases each = 500k cases × 2 KB = 1 GB for a large-scale deployment. Still trivial.
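The arithmetic above can be sanity-checked in a few lines. All prices and throughputs are the chapter's assumptions, not measured values:

```python
# Back-of-envelope eval-run cost and latency (figures from the text).
CASES = 500
MODEL_TOKENS = 800 + 300    # input + output per model-under-test call
JUDGE_TOKENS = 1500 + 100   # input + output per LLM-judge call

total_tokens = CASES * (MODEL_TOKENS + JUDGE_TOKENS)  # 1.35M per run

def run_cost(price_per_million: float) -> float:
    """Dollar cost of one eval run at a given $/M-token price."""
    return total_tokens / 1e6 * price_per_million

def run_latency_s(rps: float) -> float:
    """Model calls and judge calls each take CASES/rps seconds when
    fully parallelized at sustained rps; the two phases run back-to-back."""
    return 2 * CASES / rps

print(f"self-hosted: ${run_cost(0.45):.2f}/run, GPT-4o: ${run_cost(10):.2f}/run")
print(f"latency at 20 RPS: {run_latency_s(20):.0f}s")
```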

Architecture

graph LR
  CI["CI System\n(GitHub Actions / Buildkite)"] --> TRIGGER["Eval Trigger\nmodel artifact + eval config"]
  TRIGGER --> QUEUE["Eval Job Queue\nRay cluster dispatcher"]
  QUEUE --> WORKERS["Eval Workers\nRay Tasks: parallel case execution"]
  WORKERS --> MODELA["Model Under Test\nself-hosted vLLM · temp=0"]
  WORKERS --> JUDGE["LLM Judge\ndedicated vLLM replica\nstronger model than eval target"]
  WORKERS --> RUBRIC["Rubric Engine\nJSON schema check · regex · exact-match"]
  WORKERS --> RESULTS["Results Aggregator\nper-case scores → aggregate metrics"]
  RESULTS --> EVALDB["Eval DB\nPostgres: run × model × case × score"]
  EVALDB --> GATE["Eval Gate\npass/fail signal → CI"]
  EVALDB --> DRIFT["Drift Detector\nPyGlue / custom EWMA on score time series"]
  DRIFT --> ALERT["Alert\nSlack · PagerDuty on drift spike"]
  WORKERS --> GOLDENSET["Golden Set DB\nPostgres + S3: cases · expected · metadata"]
  GOLDENSET -.->|"production failure feedback"| GOLDENSET
  style WORKERS fill:var(--fig-accent-soft),stroke:var(--fig-accent)

Eval workers are the bottleneck and the only thing worth scaling: parallelizing across Ray Tasks allows a 500-case eval to complete in under 2 minutes on modest GPU resources.

The eval trigger is a job that fires from CI after a model artifact is registered in the model registry (§126.2). It reads the eval config (which golden set, which tasks, which judge model, what scoring rubric) and submits a Ray job. The eval config is versioned and stored in the registry alongside the model.

Eval workers run as Ray Tasks — one task per golden-set case, fully parallelized. Each task does three things: (a) call the model under test with the golden-set input (deterministic: temperature 0, fixed random seed), (b) apply rubric checks (JSON schema validation, regex patterns, exact-match for classification tasks), and (c) call the LLM judge with a structured prompt asking for a score on a 1–5 rubric. The judge call is the expensive one; it can be skipped for cases with deterministic expected outputs.
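A minimal sketch of the per-case worker logic. The Ray decorator and real model/judge clients are omitted; `call_model`, `check_rubric`, and `call_judge` are illustrative stand-ins, not a real API:

```python
def eval_case(case, call_model, check_rubric, call_judge):
    """One golden-set case: (a) deterministic model call, (b) rubric
    checks, (c) judge call only when there is no deterministic answer."""
    # (a) deterministic generation: temperature 0, fixed seed
    output = call_model(case["input"], temperature=0, seed=0)
    # (b) cheap deterministic checks run first
    rubric_ok = check_rubric(case, output)
    # (c) cases with a deterministic expected output skip the judge
    if case.get("expected") is not None:
        score = 5 if output == case["expected"] else 1
    else:
        score = call_judge(case, output) if rubric_ok else 1
    return {"case_id": case["id"], "rubric_ok": rubric_ok, "score": score}
```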

The LLM judge is a separate vLLM replica running a model that is stronger than (or at minimum different from) the model under evaluation. Evaluating a model with itself is methodologically questionable — the model will not catch its own systematic biases. Typical choices: evaluate a fine-tuned 8B with a 70B judge, evaluate a 70B with a 405B or GPT-4o judge. The judge receives a structured prompt:

You are evaluating an AI assistant's response.
Task: [task description]
Input: [golden-set input]
Expected behavior: [rubric]
Response to evaluate: [model output]
Score from 1-5. Explain. Output JSON: {"score": N, "reason": "..."}

The judge must output valid JSON; a parsing failure defaults to score 1 (worst) to avoid rewarding model failures that confuse the judge.
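The default-to-worst rule can be sketched as follows (the defaulting behavior is the point; field names follow the prompt above):

```python
import json

def parse_judge(raw: str) -> dict:
    """Parse the judge's JSON output; any parse failure or out-of-range
    score defaults to 1 (worst) so a confused judge never rewards the model."""
    try:
        obj = json.loads(raw)
        score = int(obj["score"])
        if not 1 <= score <= 5:
            raise ValueError("score outside 1-5 rubric")
        return {"score": score, "reason": str(obj.get("reason", ""))}
    except (ValueError, KeyError, TypeError):
        return {"score": 1, "reason": "judge output unparsable"}
```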

Rubric engine applies deterministic checks before the LLM judge. Deterministic checks are cheaper and more reliable: JSON schema validation (does the output parse as valid JSON matching the schema?), format checks (does it start with the required header?), exact-match for classification (is the output exactly “positive” or “negative”?). Only cases that pass deterministic checks proceed to LLM-judge scoring. This cuts judge costs by 30–60% for structured-output tasks.
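The dispatch over these three check types might look like the following sketch. A real implementation would do full JSON-Schema validation (e.g., via the `jsonschema` library); this version only checks parseability, and the case fields are illustrative:

```python
import json
import re

def rubric_checks(case: dict, output: str) -> bool:
    """Deterministic checks, run before any LLM-judge call."""
    kind = case["rubric"]
    if kind == "json":    # must parse as a JSON object (schema check omitted)
        try:
            return isinstance(json.loads(output), dict)
        except json.JSONDecodeError:
            return False
    if kind == "regex":   # e.g. output must start with a required header
        return re.match(case["pattern"], output) is not None
    if kind == "exact":   # classification tasks: exact label match
        return output.strip() == case["expected"]
    return True           # no deterministic rubric: pass through to judge
```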

Eval gate reads the aggregate score from the database and returns a pass/fail signal to CI. The pass condition is configurable:

  • Absolute threshold: aggregate judge score ≥ 0.85 on the relevant task suite.
  • Relative regression threshold: aggregate score is no more than 2% lower than the baseline (last prod version).
  • Per-task gate: some tasks have higher thresholds (safety tasks: ≥ 0.95; creative tasks: ≥ 0.75).

The gate can be configured as “blocking” (CI fails) or “advisory” (CI warns but passes). Safety tasks are always blocking.
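The gate logic can be sketched as follows (task names, default thresholds, and the 2% regression bound mirror the text; the function shape is illustrative):

```python
def gate(scores: dict, baseline: dict, thresholds: dict,
         max_regression: float = 0.02):
    """Per-task gate combining an absolute threshold (default 0.85) and a
    relative regression bound vs the last prod baseline. Returns
    (passed, reasons); an empty reasons list means the model may ship."""
    reasons = []
    for task, score in scores.items():
        if score < thresholds.get(task, 0.85):
            reasons.append(f"{task}: {score:.2f} below absolute threshold")
        base = baseline.get(task)
        if base is not None and score < base * (1 - max_regression):
            reasons.append(f"{task}: regressed >{max_regression:.0%} vs baseline")
    return (not reasons, reasons)
```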

Eval score time series for three tasks (general, format, safety) across versions v10–v15 against the 0.85 threshold: a regression in the safety task on version 14 is caught before production promotion.
A safety score regression on v14 drops below the 0.85 threshold and blocks the CI gate — the general and format tasks are unaffected, showing that regression detection must be per-task, not just aggregate.

Drift detection runs asynchronously on production traffic. Every N requests (configurable; default 1,000), a sample is sent to the judge with the same rubric used in offline eval. The result is written to the eval DB alongside the offline scores. A separate drift detector (EWMA-based anomaly detection on the score time series) alerts when the production score diverges from the offline baseline by more than a configurable sigma. This catches model drift not visible in offline eval: distribution shift in inputs, prompt template changes, latency-induced incomplete outputs.
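A minimal EWMA detector in the spirit described. `alpha`, `k`, and the warm-up count are illustrative tuning choices, not values from the text:

```python
class EwmaDrift:
    """EWMA anomaly detection on the judge-score time series: alert when a
    new score deviates from the smoothed mean by more than k sigma."""
    def __init__(self, alpha: float = 0.1, k: float = 4.0, warmup: int = 5):
        self.alpha, self.k, self.warmup = alpha, k, warmup
        self.mean = None   # EWMA of scores
        self.var = 0.0     # EWMA of squared deviations
        self.n = 0
    def update(self, x: float) -> bool:
        """Feed one score; returns True if it should trigger an alert."""
        self.n += 1
        if self.mean is None:
            self.mean = x
            return False
        dev = x - self.mean
        alert = (self.n > self.warmup and self.var > 0
                 and abs(dev) > self.k * self.var ** 0.5)
        # standard EWMA mean/variance updates
        self.mean += self.alpha * dev
        self.var = (1 - self.alpha) * (self.var + self.alpha * dev * dev)
        return alert
```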

Key trade-offs

  • Fixed golden set vs growing set. A fixed golden set is reproducible but stale. A growing set (fed by production failures) improves coverage but makes version-over-version comparisons harder (different eval cases). Solution: maintain a “locked” subset (used for regression comparison) and a “live” subset (growing; used for absolute scores). Compare locked-subset scores across model versions; use live-subset for absolute coverage.
  • LLM-judge vs human eval. LLM-judge is cheap (< $1/run) and fast. Human eval is 100–10,000× more expensive and slow (days, not minutes). Use LLM-judge for regression gating; reserve human eval for calibration (a periodic human-eval run that validates the judge’s agreement with human labels, monthly or quarterly).
  • Synchronous CI gate vs async. Synchronous gate (eval runs as part of the PR check) is the cleanest model. Async (eval runs in the background and the PR merges immediately with a pending flag) is faster but allows bad models to get registered before the gate fires. Synchronous is strongly preferred for safety-gating; async is acceptable for non-critical model updates.
  • Judge model version pinning. If the judge model is updated (e.g., to a better LLM), all historical scores become incomparable to new scores. Solution: pin the judge model version in the eval config; upgrading the judge is a versioned change that triggers a re-evaluation of the golden set at the new baseline.
  • Eval latency vs eval coverage. More cases = better coverage but longer CI feedback cycles. The sweet spot for a blocking CI gate is 500 cases completable in under 5 minutes with a dedicated eval cluster. Larger suites (5,000 cases) are acceptable as nightly jobs, not blocking gates.

Failure modes and their mitigations

  • Judge model disagrees with itself. LLM judges have high variance on ambiguous cases. Mitigation: run each case through the judge with 3 samples and take the median score; this reduces variance significantly at 3× cost.
  • Golden set overfitting. Teams optimize prompts specifically for the golden cases they can inspect. Mitigation: keep a held-out test set that is never shown to prompt engineers; run it only for the final production gate, not during development iteration.
  • Eval cluster contention. If multiple teams trigger eval jobs simultaneously, the eval cluster is saturated and CI feedback is delayed. Mitigation: separate eval cluster pools by tier (blocking CI gate vs nightly runs); the blocking-gate pool always has reserved capacity.
  • Judge API cost spike. If the eval config accidentally runs 50,000 cases through GPT-4o instead of the self-hosted judge, the cost spikes unexpectedly. Mitigation: per-eval-run dollar cap (configurable; default $5 for standard runs). Alert and pause if the cap is breached.
  • Eval DB write failures. If the results aggregator fails to write to Postgres, the eval run is lost. Mitigation: idempotent writes keyed by (eval-run-id, case-id); re-run is always safe.

What a senior candidate says that a mid-level candidate misses

The mid-level candidate describes a script that calls the model and checks outputs. The senior candidate describes eval as a continuous pipeline with distinct offline, online, and drift-detection legs — and the relationship between them. The key insight is that offline eval and online eval measure different things: offline eval catches pre-deployment regressions in controlled conditions, online eval catches post-deployment drift on real traffic. Neither is sufficient alone.

The second differentiator is the meta-problem of evaluating the evaluator. If the judge model has a systematic bias (it always gives higher scores to longer responses, or it fails to penalize confident hallucination), the eval harness is not measuring what you think. The calibration loop — periodic human-eval runs that check judge agreement — is the governance mechanism that keeps the eval system honest. Most candidates never raise this.

Follow-up drills

  1. “The team ships a new fine-tuned model that scores 0.93 on the general task and 0.89 on the safety task (above both thresholds), but the product team notices users are complaining more. What metric might be missing from the eval harness?”
  2. “Your CI gate adds 5 minutes to every PR. The eng team is complaining. How do you reduce eval latency without reducing eval quality?”
  3. “The judge model is GPT-4o. A new GPT-4o version is released with different behavior. How do you handle the transition without invalidating your historical score comparisons?”

§126.2 Design a model registry

The problem

The interviewer asks: “Design a model registry like Hugging Face Hub — multi-tenant, with artifact storage, access control, version lineage, tag-based discovery, deduplication across forks, and fast download acceleration.”

What they are testing: this is a data management interview as much as an ML interview. The interviewer wants to see storage tier strategy (what goes in S3 vs what goes in a CDN vs what stays warm on NVMe), access control models, lineage graph design, and the deduplication problem (a fine-tuned model is mostly the same weights as its base — how do you avoid storing 140 GB twice for every fine-tune?).

Clarifying questions

  • Internal-only or internet-facing? An internal registry for one company vs a public hub (Hugging Face–style) differ enormously in CDN requirements, abuse prevention, and access control granularity.
  • Artifact types? Model weights only, or also adapters (LoRA), tokenizers, configs, eval results, datasets, and code? The schema and storage strategy differ by artifact type.
  • Download scale? A popular public model (LLaMA 2 on HuggingFace) gets millions of downloads per month. An internal registry might serve 100 downloads/day. This determines CDN strategy.
  • Lineage depth? Do you need to track fine-tune-of-fine-tune chains (adapter stacks)? This affects the lineage graph data model.
  • Quota and billing? Is storage metered per tenant? Is download bandwidth metered?
  • Compliance? Model weights can contain memorized training data; is there a data-provenance requirement?

Back-of-envelope

Model weight sizes: 7B fp16 ≈ 14 GB, 70B fp16 ≈ 140 GB, 70B fp8 ≈ 70 GB, 405B fp8 ≈ 200 GB. A registry with 1,000 model versions × average 40 GB each = 40 TB of storage at rest. At S3 pricing ($0.023/GB/month), that is $920/month for the storage tier — effectively negligible. The cost is in egress.

Download acceleration math: a 70B fp16 model is 140 GB. Over a single S3 connection (100 MB/s): 1400 seconds = 23 minutes. Over 64 parallel connections: 22 seconds. Serving from a CDN with a 10 Gbps edge: ~115 seconds (single connection), or ~2 seconds via torrent-style parallel chunk download. For a public registry with millions of downloads, CDN offload is the primary cost lever.

Deduplication: a fine-tuned 70B model shares ~99% of its weights with the base model. Without deduplication, storing 100 fine-tunes of the same base costs 100 × 140 GB = 14 TB. With block-level deduplication (content-addressed chunks of 64 MB), the marginal cost of each fine-tune is only the changed adapter weights: typically 1–4 GB for LoRA. Deduplication reduces that 14 TB to ~140 GB base + 100 × 2 GB adapters = 340 GB. This is the most important design decision in a model registry at any reasonable scale.

Architecture

graph LR
  AUTHORS["Authors\n(push / upload)"] --> API["Registry API\nREST · auth · quota check"]
  API --> METADB["Metadata DB\nPostgres: models · versions · tags · ACL · lineage"]
  API --> CHUNKSTORE["Chunk Store\nS3: content-addressed 64 MB chunks\nchecksum-keyed"]
  CHUNKSTORE --> DEDUP["Dedup Index\nRedis: SHA-256 → chunk ID\nbefore write check"]
  CHUNKSTORE --> CDN["CDN\nCloudFront / Fastly\nedge caches for hot models"]
  CLIENTS["Clients\n(download / pull)"] --> CDN
  CDN -.->|"miss"| CHUNKSTORE
  API --> LINEAGE["Lineage Graph\nNeo4j or Postgres adjacency list\nparent → child version edges"]
  API --> SEARCH["Tag + Text Search\nElasticsearch: model cards · tags · metrics"]
  API --> SCANQUEUE["Safety Scan Queue\nKafka: new artifact → scan worker"]
  SCANQUEUE --> SCANNER["Artifact Scanner\nmalware scan · license check\n· weight fingerprinting"]
  SCANNER --> METADB
  style CHUNKSTORE fill:var(--fig-accent-soft),stroke:var(--fig-accent)

The chunk store is the design pivot: content-addressed storage with a deduplication index means a fine-tuned model only stores the changed weights, reducing storage cost by an order of magnitude for fine-tune-heavy workloads.

Content-addressed chunk store is the core innovation. Every model artifact is split into 64 MB chunks. Each chunk is identified by its SHA-256 hash. Before writing a chunk, the upload pipeline checks the dedup index (Redis hash table: SHA-256 → S3 key). If the chunk already exists, only a pointer is stored; the bytes are not re-uploaded. This is the same technique used by Git’s object store and by ZFS’s block-level deduplication. The 64 MB chunk size is chosen to balance dedup efficiency (smaller chunks = more matches but more index entries) against write overhead (one S3 PUT per chunk).

A model artifact’s manifest is a JSON file listing all chunk hashes in order. Downloading a model means: fetch the manifest, resolve each chunk hash to an S3 key, download chunks in parallel. A client library (think huggingface_hub or a custom CLI) handles the parallel download automatically.

Metadata DB (Postgres) stores:

  • models table: model ID, owner tenant, name, visibility (public/private), license, created_at.
  • versions table: version ID, model ID, version tag (e.g., main, v1.2, fp8), manifest S3 key, eval scores (JSON), created_at, author.
  • access_control table: model ID, principal (user/team/org), permission (read/write/admin).
  • tags table: model ID, tag key, tag value (e.g., task: text-classification, language: en, base_model: llama-3.1-70b).

Lineage graph is a directed acyclic graph where an edge (parent_version_id, child_version_id) means “this model was fine-tuned from that one.” Stored as an adjacency list in Postgres (or Neo4j for large multi-hop traversals). Lineage queries: “what are all models that derive from LLaMA 3.1 70B?” — this is a recursive CTE in Postgres or a BFS in Neo4j. Lineage is non-optional for compliance: when a base model is found to have a data-poisoning issue, you need to enumerate all derived fine-tunes.

Download acceleration uses two layers. Layer 1: CloudFront CDN with a 30-day TTL for popular artifacts. Hot models (top 100 by download count in the last 7 days) are pre-warmed to CDN edges in all regions. Layer 2: parallel multi-part downloads via a client library that fetches chunks concurrently from S3 presigned URLs, saturating the client’s NIC. A 10 Gbps client can pull a 70B fp8 model (70 GB) in ~60 seconds with 64 parallel chunk downloads.

Storage tiers are determined by access frequency:

  • Hot tier (NVMe on registry cache nodes): last 7 days of download requests. ~200 GB per cache node, 4 nodes = 800 GB hot. Cached models serve from local NVMe (~3 GB/s), bypassing S3 entirely.
  • Warm tier (S3 Standard): all models with at least one download in the last 90 days.
  • Cold tier (S3 Glacier Instant Retrieval): no downloads in 90+ days. Retrieval latency is still < 1 second; storage is cheaper ($0.004/GB/month vs $0.023/GB for Standard), but each retrieval incurs a per-GB fee, so it only pays off for rarely-accessed artifacts.

Content-addressed storage with deduplication: three fine-tuned models share most chunk hashes with the base model, and only the changed blocks are written to S3.
Content-addressed deduplication reduces marginal storage cost of each fine-tune to only the changed weight blocks — typically 1–3 GB instead of 140 GB — because unchanged chunks are stored once and referenced by pointer.

Artifact scanner runs asynchronously after every upload. It checks: (a) malware scan (YARA rules for known malicious payloads), (b) license compatibility check (SPDX license from model card vs org policy), (c) weight fingerprinting (check whether the weights match any known blacklisted models). The scanner result is written back to the metadata DB; a model with a failed scan is quarantined and its visibility is set to restricted pending human review.

Quota enforcement tracks per-tenant storage used (sum of unique chunks × chunk size) and per-tenant download bandwidth (rolling 30-day sum). Quotas are checked at upload time (storage) and at download initiation (bandwidth). Overages trigger a soft warning at 80% and a hard block at 100%.

Key trade-offs

  • Content-addressed vs version-named storage. Content-addressed is self-deduplicating and immutable. Version-named is human-friendly but wastes storage for fine-tunes. Hybrid: content-addressed storage with human-readable tags as metadata. This is what Git does (SHA-addressed objects, branches as named pointers).
  • Monorepo-style registry vs federated. A monorepo registry (one central service, all tenants) is simpler to operate. A federated registry (each org hosts its own, with a global discovery layer) is more resilient and enables air-gapped deployments. Enterprise customers typically require federation.
  • CDN for all artifacts vs selective. CDN costs money even for models nobody downloads. Pre-warming only the top-100 models and pulling the rest from S3 directly is the standard cost control.
  • NVMe hot cache vs pure S3. NVMe cache nodes (~$10k each, 30 TB) serve hot models at 3 GB/s vs S3 at 100 MB/s per connection. The cache pays off if popular models are downloaded > ~10 times/month (roughly 1.5 TB/month of egress per model at 140 GB). For an internal registry, most models are downloaded rarely; skip the NVMe cache and use parallel S3.
  • Lineage in Postgres vs Neo4j. For lineage depths < 10 and model counts < 100,000, Postgres recursive CTEs are sufficient. At HuggingFace scale (1M+ models, deep fine-tune chains), a graph database (Neo4j) or a dedicated lineage service is needed. Start with Postgres.

Failure modes and their mitigations

  • Partial upload corruption. An upload dies after 60% of chunks are written. Mitigation: the upload protocol is resumable — the client tracks which chunks have been acknowledged. A manifest is only written after all chunks are confirmed. Incomplete uploads are garbage-collected after 24 hours.
  • Chunk store hotspot. The most popular model’s chunks are hit millions of times per day, saturating the S3 prefix (S3 has per-prefix request-rate limits of ~5,500 GET/s). Mitigation: spread chunks across many key prefixes, e.g., use the first 4 hex characters of the SHA-256 as the key prefix. S3’s request rate scales linearly with prefix diversity, so hash-derived prefixes distribute load.
  • Dedup index miss. A SHA-256 collision is cryptographically infeasible, but a Redis eviction policy that drops a dedup entry is real. Mitigation: the dedup index TTL is never set below 90 days; a missed dedup is a storage-cost issue, not a correctness issue.
  • Download token expiry. Presigned S3 URLs expire; a slow client that takes > 1 hour to download a 140 GB model will see URLs expire mid-download. Mitigation: the client library fetches fresh presigned URLs for each chunk just before download, not all at once.
  • Quarantined model with active users. An artifact scanner catches a license violation on a model that is already in production use. Mitigation: a quarantine does not immediately delete the artifact; it sets visibility to restricted and notifies the owner. Existing users with explicit access continue to work; new downloads are blocked until resolved.

What a senior candidate says that a mid-level candidate misses

The mid-level candidate talks about “store models in S3 with version IDs.” The senior candidate immediately raises the deduplication problem as the central design challenge. Without dedup, every fine-tune wastes 99% of its storage on unchanged base weights. This is not a theoretical concern — a team of 50 ML engineers each creating 10 fine-tunes per week at 140 GB each would fill 70 TB per week. Content-addressed storage with a chunk-level dedup index solves this with essentially zero extra engineering complexity once the upload protocol is designed correctly.

The second differentiator is lineage tracking as a compliance requirement. When a base model is found to have a data problem — memorized PII, poisoned training data, a licensing violation — the org needs to know every derived model that was ever trained from it. Without a lineage graph, this requires scanning every model’s training config, which is fragile and slow. With a lineage graph, it is a 100 ms graph traversal.

Follow-up drills

  1. “A user uploads a 405B model and claims they fine-tuned it from LLaMA, but your dedup index shows zero chunk overlap with any known LLaMA version. What does that tell you and what do you do?”
  2. “You need to support model weight encryption for models that contain proprietary data. How does content-addressed storage interact with encryption? What breaks?”
  3. “Your registry has 1M models. Describe the tag-based discovery query path from the client’s search query to the list of results. What indexes do you need?”

§126.3 Design a GPU scheduler

The problem

The interviewer asks: “Design a GPU scheduler for an ML platform that serves a mix of training jobs, batch inference jobs, and interactive notebook sessions. You need to handle multi-GPU gang scheduling, preemption, fairness across tenants, topology-aware placement, and mixed spot+on-demand instance pools.”

What they are testing: this is the hardest infrastructure interview question in the ML space. The candidate needs to understand queuing theory (not just FIFO), the gang scheduling problem (a 64-GPU job is all-or-nothing), topology awareness (NVLink vs Ethernet matters for distributed training), preemption (who gets kicked and how are they compensated), and the cost model for spot instances. The interviewer is looking for someone who has operated a large GPU cluster, not someone who has read the Slurm docs.

Clarifying questions

  • Workload mix? What fraction of the cluster is training (hours–days), batch inference (minutes–hours), and interactive (seconds–minutes)? This determines the preemption policy and the priority model.
  • GPU heterogeneity? A homogeneous H100 cluster vs a mixed H100/A100/H200 fleet. Heterogeneous requires topology-aware matching; homogeneous is simpler.
  • Multi-tenancy model? Fixed quota per team vs fair-share vs market-based (teams bid for GPUs)? This is a policy choice that shapes the entire scheduler design.
  • Spot instance policy? Are tenants willing to have jobs checkpointed and restarted when spot instances are reclaimed? Training jobs can checkpoint; serving jobs generally cannot.
  • Target utilization? A cluster at 95% utilization has no slack for bursts. A cluster at 75% has breathing room. The scheduler’s job is partly to maintain a target utilization range.
  • Job types’ GPU connectivity requirements? Tensor-parallel training requires NVLink; gradient-only all-reduce can use InfiniBand; inference serving jobs don’t care about connectivity.

Back-of-envelope

A 1,000-GPU cluster (250 nodes × 4× H100 each) at $2/GPU-hr costs $2,000/hr or $1.44M/month at 100% utilization. At 75% target utilization: $1.08M/month. The efficiency loss from poor scheduling (gang scheduling stalls, fragmentation) is typically 5–15% of total capacity. A 10% efficiency improvement on a $1M/month cluster saves $100k/month — worth significant engineering investment.

Scheduler decision rate: at 1,000 GPUs with an average job duration of 30 minutes and jobs averaging ~60 GPUs (so ~16 jobs run concurrently), the arrival rate is ~33 jobs/hour — roughly one job every 2 minutes. The scheduler does not need to be a low-latency system; scheduling decisions can take 100–500 ms without impacting throughput. The hot path is preemption detection (a spot instance is reclaimed with 2 minutes of notice by the cloud provider) — this needs to be handled in < 30 seconds.

Queue depth at target utilization: a cluster at 75% utilization with 1% monthly demand growth will have a queue depth of 0 for most of the month, with occasional spikes during batch submission windows (end of business day, Monday morning). The scheduler needs to handle queue depths of 50–200 concurrent waiting jobs without performance degradation.

Architecture

graph TD
  SUBMIT["Job Submit\n(CLI / SDK / notebook)"] --> JOBQ["Job Queue\nPriority queue per tenant tier\n(FIFO within tier)"]
  JOBQ --> SCHED["Scheduler Core\ngang-scheduling · topology-aware\npreemption · fairness"]
  SCHED --> TOPOMAP["Topology Map\nNVLink fabric · InfiniBand switches\n(refreshed every 60s)"]
  SCHED --> QUOTADB["Quota DB\nPostgres: team → fair-share allocation\naccumulated usage · burst credit"]
  SCHED --> NODEPOOL["Node Pool\non-demand nodes · spot nodes\nper-GPU state: free / allocated / draining"]
  NODEPOOL --> PREEMPT["Preemption Controller\nspot reclaim handler\ncheckpoint trigger · job reschedule"]
  SCHED --> DISPATCH["Job Dispatcher\nKubernetes Job + GPU resource claims\nor Slurm sbatch"]
  DISPATCH --> WORKERS["GPU Worker Nodes\nH100 / A100 / H200"]
  WORKERS --> MONITOR["Metrics Collector\nGPU utilization · job progress\ncheckpoint status"]
  MONITOR --> SCHED
  MONITOR --> COST["Cost Accounting\nper-job GPU-hours × $/hr\nKafka metering"]
  style SCHED fill:var(--fig-accent-soft),stroke:var(--fig-accent)

The scheduler core is the only stateful component that must serialize decisions; everything else (dispatch, monitoring, cost accounting) is embarrassingly parallel and can scale horizontally.

Job queue is a priority queue with multiple tiers. Tiers, from highest to lowest: (1) Production inference (never preemptible), (2) Premium training (preemptible only for spot reclaim), (3) Standard training (preemptible by higher-priority jobs), (4) Batch inference (preemptible), (5) Interactive/notebook (preemptible, lowest priority, high fairness weight). Within each tier, FIFO by submission time. Inter-tier scheduling uses a weighted fair-share policy (see below).

Gang scheduling is the hard problem. A 64-GPU training job cannot start until 64 GPUs are free simultaneously. Naive FIFO can cause head-of-line blocking: a 64-GPU job at the head of the queue prevents 100 small 2-GPU jobs behind it from running, even though the cluster has plenty of 2-GPU slots. The solution is backfill scheduling: when the head-of-queue job cannot be scheduled (insufficient GPUs), the scheduler looks ahead in the queue and fills available slots with smaller jobs, provided those smaller jobs are guaranteed to complete before the head job’s estimated start time. This is the core of IBM Platform LSF and Slurm’s backfill algorithm.

The gang scheduling decision: a job requesting N GPUs with connectivity requirement C is schedulable if and only if there exist N free GPUs that all satisfy C. For NVLink-required jobs, that means all N GPUs must be in the same NVLink domain (an 8-GPU node or a DGX H100 16-GPU pod). For InfiniBand-required jobs (distributed training across nodes), they must be on nodes connected via the same InfiniBand switch fabric. The topology map (refreshed every 60 seconds from the cluster management plane) provides the connectivity graph.

Topology-aware placement assigns jobs to the best-connected available GPUs. The scoring function is:

placement_score = NVLink_bonus + cross_node_IB_bandwidth - fragmentation_penalty

NVLink bonus: +1000 for placing all GPUs in the same NVLink domain. IB bandwidth: proportional to available IB switch bandwidth on the chosen path. Fragmentation penalty: -100 for each GPU slot stranded (a job that uses 6 of 8 GPUs on a node, leaving 2 orphaned, pays a fragmentation penalty). The scheduler selects the placement with the highest score.
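A sketch of that scoring function, using the constants given above (+1000 NVLink bonus, -100 per stranded slot) and treating the bandwidth term as available IB Gbps on the chosen path:

```python
NVLINK_BONUS = 1000
STRANDED_PENALTY = 100

def placement_score(same_nvlink_domain: bool, ib_bandwidth_gbps: float,
                    stranded_slots: int) -> float:
    score = NVLINK_BONUS if same_nvlink_domain else 0
    score += ib_bandwidth_gbps            # proportional bandwidth term
    score -= STRANDED_PENALTY * stranded_slots
    return score

# 6-of-8 GPUs on one node: NVLink bonus, but two slots stranded.
print(placement_score(True, 0, 2))     # 800
# 16 GPUs across two nodes on the same IB leaf: no NVLink bonus, full IB.
print(placement_score(False, 400, 0))  # 400
```

The relative magnitudes matter more than the exact constants: the NVLink bonus must dominate any plausible IB bandwidth term so that single-domain placements always win when available.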

Topology-aware placement: a 16-GPU job prefers two 8-GPU NVLink nodes connected via the same InfiniBand leaf switch over two nodes on different switches.

[Figure: preferred placement puts Node 1 and Node 2 (each 8× H100 with an internal NVLink fabric) under the same IB leaf switch A with 400 Gbps IB between them; the avoided placement puts Node 3 and Node 4 under different switches A and B, where the cross-switch path adds an extra hop and lowers bandwidth.]
For a 16-GPU all-reduce training job, same-switch placement gives full 400 Gbps InfiniBand bandwidth; cross-switch placement adds a spine-switch hop that halves effective bandwidth and can double training step time.

Fairness policy: weighted fair-share (WFS). Each tenant has a “fair share” allocation defined as a fraction of total cluster capacity (e.g., Team A has 20%, Team B has 15%). Fairness is measured over a rolling 30-day window. A tenant that has used more than their fair share pays a scheduling penalty (their jobs are scheduled after under-share tenants at the same priority tier). A tenant that has used less than their share earns a burst credit (their jobs are prioritized). This is the dominant scheduler policy for academic clusters (SLURM FairShare) and cloud ML platforms.
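One way to realize the penalty/credit mechanic is as a sort key, a hypothetical sketch (the share and usage numbers are illustrative): within a tier, tenants over their 30-day share sort after under-share tenants, with FIFO breaking ties.

```python
share = {"team-a": 0.20, "team-b": 0.15}       # fair-share fractions
usage_30d = {"team-a": 0.28, "team-b": 0.10}   # observed 30-day usage

def fair_share_key(tenant, tier, submit_time):
    over_use = usage_30d[tenant] - share[tenant]   # > 0 means over share
    return (tier, over_use, submit_time)           # lower key runs first

jobs = [("team-a", 2, 100), ("team-b", 2, 200)]    # (tenant, tier, t_submit)
jobs.sort(key=lambda j: fair_share_key(*j))
print(jobs[0][0])   # team-b runs first despite submitting later
```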

Spot instance preemption handling: cloud spot instances can be reclaimed with a 2-minute warning. The preemption controller watches for reclaim signals (AWS instance interruption notice, GCP preemption notice) and takes ordered actions:

  1. Mark the spot node as draining: stop accepting new jobs.
  2. Trigger checkpoint for all preemptible jobs on the node (write optimizer state + weights to S3 or NFS).
  3. If checkpoint completes before the 2-minute deadline: job state is checkpointed, job is re-queued.
  4. If checkpoint fails (not enough time): job is marked failed and re-queued from scratch.
  5. The node is returned to the cloud provider.
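The ordered actions above can be sketched as a drain handler; the `Node` and `TrainJob` classes here are stand-ins (a real controller would talk to the cluster manager and the cloud provider's interruption-notice API):

```python
RECLAIM_DEADLINE_S = 120   # cloud spot interruption warning window

class Node:
    def __init__(self):
        self.state = "ready"
    def mark_draining(self):
        self.state = "draining"        # step 1: stop accepting new jobs
    def release_to_provider(self):
        self.state = "released"        # step 5

class TrainJob:
    def __init__(self, job_id, checkpoint_time_s):
        self.job_id, self.checkpoint_time_s = job_id, checkpoint_time_s
        self.status = "running"
    def checkpoint(self, budget_s):
        # Simulated write of weights + optimizer state (step 2).
        return self.checkpoint_time_s <= budget_s
    def requeue(self, from_checkpoint):
        # Step 3 on success; step 4 (restart from scratch) on failure.
        self.status = "requeued" if from_checkpoint else "requeued-from-scratch"

def handle_spot_reclaim(node, jobs):
    node.mark_draining()
    for job in jobs:
        job.requeue(from_checkpoint=job.checkpoint(RECLAIM_DEADLINE_S))
    node.release_to_provider()

node, jobs = Node(), [TrainJob("fast", 60), TrainJob("slow", 300)]
handle_spot_reclaim(node, jobs)
print(jobs[0].status, jobs[1].status, node.state)
```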

Checkpoint time for a 70B model: optimizer state is ~560 GB for bf16 Adam training (the fp32 m and v tensors cost 8 bytes per parameter). At 5 GB/s write to NFS: ~112 seconds, which leaves essentially no margin inside a 2-minute spot reclaim window once the interrupt has been detected and the drain started. Solutions: (a) use async streaming checkpoints that write incrementally between training steps (PyTorch’s torchsnapshot); (b) store optimizer state in DRAM on a separate fault-tolerant RAM disk that survives the GPU reclaim; (c) increase checkpoint frequency so the loss on reclaim is < 1 hour of compute.
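The arithmetic, spelled out (assuming 8 bytes per parameter of fp32 optimizer state and a 5 GB/s NFS write path):

```python
params = 70e9
optimizer_gb = params * 8 / 1e9       # fp32 m + fp32 v tensors = 560 GB
nfs_write_gb_per_s = 5
checkpoint_s = optimizer_gb / nfs_write_gb_per_s
print(checkpoint_s)                   # 112.0 s against a 120 s reclaim window
```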

Cost accounting is per-job: GPU count × GPU type × $/hr × duration = job cost. Written to Kafka at job completion. A Spark batch job aggregates costs by team, project, and model per day. The finance dashboard shows monthly spend by team with breakdown by job type (training vs batch inference vs interactive). Alerts fire when a team’s daily spend exceeds 130% of their 30-day daily average.
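An illustrative per-job cost record (the field names and $/hr rates are assumptions, not a real schema or price list):

```python
GPU_RATES_USD_HR = {"H100": 4.00, "A100": 2.50}   # assumed rates

def job_cost_usd(gpu_count, gpu_type, duration_hours):
    # gpu_count x gpu_type rate x duration, per the formula in the text
    return gpu_count * GPU_RATES_USD_HR[gpu_type] * duration_hours

record = {                      # emitted to Kafka at job completion
    "job_id": "train-7f2a",
    "team": "team-a",
    "job_type": "training",
    "gpu_hours": 64 * 12,
    "cost_usd": job_cost_usd(64, "H100", 12),
}
print(record["cost_usd"])   # 64 GPUs x $4.00/hr x 12 h = $3072
```

The Spark aggregation then only needs to group these records by team, project, and job_type per day.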

Key trade-offs

  • FIFO vs backfill vs priority-preemption. FIFO is fair but inefficient (head-of-line blocking). Backfill improves throughput but requires job duration estimation (inaccurate estimates break the guarantee). Priority-preemption is efficient but complex and raises team friction. Production clusters typically use backfill + fair-share, with preemption only for spot reclaim and emergency cases.
  • Gang scheduling vs elastic training. Gang scheduling requires all GPUs to be free at once, which stalls large jobs. Elastic training (PyTorch Elastic) allows a job to start with fewer GPUs and scale up as more become available. Elastic training is the right answer for training workloads with flexible resource requirements; gang scheduling remains necessary for workloads where tensor parallelism requires a fixed world size (most model-parallel training).
  • Spot vs on-demand mix. Spot instances can be 70% cheaper but introduce preemption risk. The right default for training jobs is spot + checkpoint; for inference serving, on-demand (latency SLAs preclude preemption). A 50/50 split is typical for training clusters.
  • Heterogeneous GPU support. Adding H200s to an H100 cluster is tempting (more memory, better for large models). But mixing GPU types complicates topology-aware placement, creates scheduling fragmentation, and requires separate CUDA kernel binaries. The cost of operational complexity often exceeds the throughput gain. Name this tradeoff explicitly.
  • Kubernetes-native vs custom scheduler. Kubernetes with device-plugins and a custom scheduler plugin (volcano, yunikorn, kueue) is the most common production path and leverages existing cluster management tooling. A custom Slurm-style scheduler gives finer-grained control and better backfill performance but requires more operational investment. For a new cluster, start with Kubernetes + kueue; migrate to a custom scheduler only if kueue’s backfill semantics are insufficient.

Failure modes and their mitigations

  • Scheduler single-point-of-failure. The scheduler serializes all placement decisions; if it crashes, jobs stop being scheduled (but running jobs continue). Mitigation: scheduler state (pending jobs, node assignments) is persisted to Postgres; on restart, the scheduler reconciles in-cluster state and resumes. Hot standby via leader election (etcd-based) keeps failover under 30 seconds.
  • Topology map stale. If a node fails and the topology map isn’t updated, the scheduler places a job on the failed node. Mitigation: heartbeat-based health checks every 10 seconds; mark nodes as unknown if two heartbeats are missed; the scheduler never places new jobs on unknown nodes.
  • Gang scheduling deadlock. Two large jobs each hold half the GPUs needed and are waiting for the other to release. This is the classic distributed resource deadlock. Mitigation: the scheduler tracks “holding” jobs and applies a timeout: a job that has been waiting-for-resources while holding a partial allocation for > 15 minutes is preempted and re-queued.
  • Spot reclaim cascade. A region-level spot market event reclaims 20% of the cluster simultaneously. Every affected training job triggers checkpoints at once, saturating the NFS checkpoint storage. Mitigation: checkpoint bandwidth is throttled per job (max 1 GB/s); jobs checkpoint in staggered batches; the checkpoint store has dedicated NFS paths for the spot reclaim path, separate from normal checkpoint paths.
  • Quota exhaustion with runaway jobs. A bug in a user’s training script causes the job to loop indefinitely, consuming 64 GPUs for days. Mitigation: every job has a configurable max_wall_time (default: 24h for training, 2h for batch inference). Jobs that exceed max_wall_time are automatically terminated. Alert the submitter.

What a senior candidate says that a mid-level candidate misses

The mid-level candidate describes a FIFO queue with GPU resource requests. The senior candidate names backfill scheduling as the key throughput optimization and immediately identifies the head-of-line blocking problem as the reason simple FIFO is insufficient. They then identify the challenge with backfill: it requires accurate job duration estimates, which users systematically over-estimate (they request 24 hours and finish in 4). The solution — using historical job durations for similar job configurations to calibrate estimates — is a real production technique that only someone who has operated a cluster at scale would know.

The second differentiator is checkpoint design as a first-class concern for spot instances. A candidate who says “use spot instances to save money” without talking about incremental async checkpointing, the write bandwidth requirements, and the checkpoint store architecture has never been on-call when a region-wide spot reclaim event corrupts 30 checkpoints simultaneously. The details of checkpoint bandwidth throttling and staggered batch checkpointing are the signal that separates the senior candidate from the one who memorized the pitch for spot instances.

Follow-up drills

  1. “A team submits 100 jobs all requesting 8 GPUs simultaneously. The cluster has 64 GPUs free. Describe the scheduler’s behavior over the next 30 minutes.”
  2. “You’re told to add support for fractional GPU allocation (a job requesting 0.5 GPU on a shared node). What scheduler changes are required? What are the isolation concerns?”
  3. “The cluster is at 95% utilization and the backlog of pending jobs is growing. A VP asks for a cost and capacity analysis. What metrics do you pull and what do you recommend?”

Read it yourself

  • Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023) — the paper that shaped how production LLM serving infrastructure thinks about memory.
  • The Slurm Workload Manager documentation (slurm.schedmd.com), specifically the “Backfill Scheduling” and “Fairshare” sections — the canonical reference for cluster scheduler design.
  • The Volcano scheduler documentation (volcano.sh) — the Kubernetes-native gang scheduler; shows how to bolt advanced scheduling policies onto a Kubernetes cluster.
  • The Kueue documentation (kueue.sigs.k8s.io) — the emerging standard Kubernetes workload queuing system, covering cohorts, fair-share, and preemption.
  • The AWS Spot Instance best practices guide — the authoritative source on interruption notices, rebalance signals, and checkpoint design for spot-aware workloads.
  • Sculley et al., “Hidden Technical Debt in Machine Learning Systems” (NIPS 2015) — the original paper on technical debt in ML systems; relevant to §126.1 golden set stagnation.
  • The HuggingFace Hub Python client source code — the best documentation for the content-addressed chunk download protocol described in §126.2.
  • The MLflow Model Registry documentation — a production model registry at smaller scale, useful for understanding the API design space before designing at Hugging Face scale.

Practice

  1. Design the golden-set schema for an eval harness that supports three task types: (a) JSON extraction, (b) open-ended QA, and (c) safety classification. What fields does each case record need?
  2. A company has 500 models, each fine-tuned from one of three base models. Sketch the Postgres schema that represents the lineage graph, and write the SQL query that returns all models derived from base model ID=1.
  3. For the GPU scheduler, describe the backfill algorithm in pseudocode. Handle: (a) a job that needs 64 GPUs, (b) only 48 are free, (c) there are 20 smaller jobs waiting. When does backfill run, and what does it schedule?
  4. Estimate the checkpoint storage bandwidth requirement for a cluster of 100 nodes × 8 H100 each, running 70B parameter training jobs, if all spot instances are reclaimed simultaneously (assume each job has 560 GB of optimizer state).
  5. Stretch: Design the metering pipeline for a GPU scheduler: from per-job GPU-seconds to a monthly team bill, including the Kafka topic schema, the aggregation job, and the billing portal query. What are the SLA requirements for cost accuracy?