Design: the upstream platforms
"The LLM is not the system. The system is the infrastructure that keeps the LLM honest, fast, and billed correctly"
Most interview prep treats the LLM serving fleet as the whole design. It isn’t. The fleet is sandwiched between four infrastructure layers that are each large enough to be a standalone senior interview question: the embedding service, the AI gateway, the prompt management system, and the agent tool sandbox. Every production ML platform has all four. Most interviews will eventually hand you one.
This chapter works each of the four as its own design exercise, using the framework from Chapter 114. The goal is not to produce four separate reference architectures — it is to build a cross-cutting intuition for the classes of problems they share: throughput math, multi-tenancy, versioning, and safe execution.
§125.1 Design an embedding service
The problem
The interviewer asks: “We have a growing corpus — call it 500 million documents — and a query-time embedding requirement of 10,000 QPS for a retrieval system. How would you design the embedding service?”
What they are testing: do you know that embeddings have different compute characteristics from autoregressive decode? Do you understand batching math? Do you know about multi-model serving, caching, and the multi-tenant isolation problem that emerges when multiple teams share one fleet?
Clarifying questions
- Which models? One model, or a heterogeneous fleet (general-purpose sentence transformer, domain-fine-tuned, multilingual, large vision encoder)? This determines whether we need model routing.
- Latency target? p99 < 50 ms for query-time embedding is common. Ingestion-time (batch) can tolerate minutes. They use the same hardware; do they share a fleet?
- Payload size? Average document length, max document length. 512 tokens is typical for a sentence transformer; 8192 for a long-context model like E5-mistral.
- Multi-tenancy? How many teams? Do they share a quota? Is there a noisy-neighbor requirement?
- Output format? fp32, fp16, int8, or binary? Binary embeddings (1 bit per dimension) are a real production choice for very large indices — they cut storage 32× at a recall cost.
- Deduplication? Can we avoid re-embedding identical documents? This matters enormously at ingestion time if the corpus has duplicates.
Back-of-envelope
The dominant bottleneck is matrix multiply, not memory bandwidth. An embedding model like BGE-large-en-v1.5 (335M parameters) runs at roughly 20,000 input tokens per second per A10G (24 GB VRAM) with a batch of 512 sequences. Larger models like E5-mistral-7b (7B, fp8 on H100) achieve roughly 6,000 tokens/second but generate 4096-dim vectors instead of 1024-dim, so storage cost is 4× higher.
For query-time 10k QPS at an average of 64 tokens per query: 10,000 × 64 = 640,000 tokens/sec. BGE-large on A10G at 20k tokens/sec needs 32 A10Gs at peak. With 30% headroom: 42 A10Gs, or roughly 11 nodes of 4× A10G each. At ~$1.50/hr per A10G, that is ~$45k/month for the query path alone. The ingestion path (batch) runs separately on a separate pool of 8–16 A10Gs that amortize overnight jobs.
For storage: 500M documents × 1024 dims × 4 bytes = 2 TB of fp32 vectors. At fp16, 1 TB. With int8 per-component quantization, 500 GB. This is the raw vector footprint; add 20–40% for the HNSW graph index overhead (see Chapter 115 §115.9 Chain 3).
Architecture
graph LR
Clients["API Clients\n(query & ingest)"] --> GW["Embedding Gateway\nrouting · auth · rate limit"]
GW --> ROUTER["Model Router\nname→replicas"]
ROUTER --> M1["BGE-large pool\nTEI / A10G × 20"]
ROUTER --> M2["E5-mistral pool\nTEI / H100 × 8"]
ROUTER --> M3["CLIP pool\nTEI / A10G × 4"]
M1 --> CACHE["Semantic Cache\nRedis (hash→vector)"]
M2 --> CACHE
M3 --> CACHE
BATCH["Batch Ingest\nKafka topic"] --> BWORKER["Batch Workers\nvectorized doc processor"]
BWORKER --> ROUTER
CACHE --> VDB["Vector Store\nQdrant / Weaviate"]
VDB --> CLIENTS2["Downstream\nRetrieval Services"]
style M1 fill:var(--fig-accent-soft),stroke:var(--fig-accent)
The embedding gateway is the single ingress point for both online query and offline ingest, routing by model name and maintaining per-tenant rate limits before requests ever touch GPU.
The serving runtime is Text Embeddings Inference (TEI) from Hugging Face. TEI is not a generic inference server — it is built specifically for encoder-only transformer models. It does automatic dynamic batching (assembling queries up to a token budget rather than a request count), FlashAttention 2 kernels, and it supports BERT-style models as well as causal LLMs used as encoders. An alternative is vLLM’s embedding mode, which is better if you are already running vLLM for generation and want a unified fleet.
The model router reads a config that maps model names to replica pools. It is a simple HTTP proxy with sticky routing by model name. Each TEI pool is horizontally scaled and sits behind a Kubernetes Service; KEDA autoscales each pool independently on Prometheus metric tei:queue_length.
The semantic cache is a Redis instance that maps a hash of the raw input string (normalized: lowercased, trimmed) to its embedding vector. Cache hit rate for query workloads is often 15–40% for production search applications because users repeat the same common queries. A cache hit returns in < 1 ms vs 5–15 ms for a model call.
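The cache key logic can be sketched in a few lines. This is an assumption-laden illustration: a plain dict stands in for Redis, and the exact normalization (here lowercase plus whitespace collapse) is a policy choice, not a fixed standard.

```python
import hashlib

class SemanticCache:
    """Exact-match embedding cache keyed on a hash of the normalized input.
    A dict stands in for Redis here; swap in a Redis client in production."""
    def __init__(self):
        self._store = {}

    @staticmethod
    def key(model: str, text: str) -> str:
        norm = " ".join(text.lower().split())  # lowercase, trim, collapse whitespace
        return hashlib.sha256(f"{model}:{norm}".encode()).hexdigest()

    def get(self, model, text):
        return self._store.get(self.key(model, text))

    def put(self, model, text, vector):
        self._store[self.key(model, text)] = vector

cache = SemanticCache()
cache.put("bge-large", "What Is The Weather?", [0.1, 0.2])
# Normalization means trivially different spellings of the same query still hit
assert cache.get("bge-large", "  what is the weather? ") == [0.1, 0.2]
```

Note that the model name is part of the key: the same text embedded by two different models must never share a cache entry.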
The batch ingest path is entirely decoupled from query-time. Documents arrive on a Kafka topic. Batch workers consume from Kafka, chunk documents into token windows (with 10% overlap), deduplicate by SHA-256 hash of the chunk text, and emit only unseen chunks to the embedding fleet. Deduplication at ingest is the biggest cost lever: a typical news or blog corpus has 30–50% near-duplicate content.
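The chunk-then-dedup step can be sketched as follows; the window and overlap values come from the text, while the token list representation and function names are illustrative.

```python
import hashlib

def chunk_tokens(tokens, window=512, overlap_frac=0.10):
    """Split a token sequence into fixed windows with ~10% overlap."""
    step = max(1, int(window * (1 - overlap_frac)))
    return [tokens[i:i + window] for i in range(0, len(tokens), step)
            if tokens[i:i + window]]

def dedup_chunks(chunks, seen):
    """Return only chunks whose SHA-256 hash is unseen; `seen` is the
    persistent dedup set shared across the whole ingest run."""
    fresh = []
    for chunk in chunks:
        h = hashlib.sha256(" ".join(map(str, chunk)).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            fresh.append(chunk)
    return fresh

seen = set()
doc = list(range(1000))                    # stand-in for a tokenized document
chunks = chunk_tokens(doc)                 # windows at offsets 0, 460, 920
first_pass = dedup_chunks(chunks, seen)    # all new: embedded
second_pass = dedup_chunks(chunks, seen)   # duplicate document: nothing embedded
```

The second pass returning nothing is the cost lever: a re-submitted or near-duplicate document generates zero GPU work.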
Key trade-offs
- TEI vs vLLM embedding mode. TEI is faster for small encoder models; vLLM embedding mode is better if the model is a 7B decoder-as-encoder. Choose based on model architecture.
- Separate query and ingest fleets vs shared. Shared is cheaper; separate is safer. Ingestion can consume 100% of fleet capacity during a bulk re-indexing job, starving query traffic. Separate pools eliminate that blast radius.
- fp32 vs fp16 vs int8 embeddings. fp16 halves storage with no measurable quality loss. int8 per-component with calibration loses ~0.5–1% recall on MTEB benchmarks but halves storage and bandwidth again. Binary embeddings (used by Cohere and others) lose 3–8% recall but enable 32× storage compression and fast Hamming-distance hardware acceleration.
- Multi-model fleet vs dedicated. Shared routing amortizes head-nodes and gateway costs; dedicated pools give cleaner SLOs and easier autoscaling. Shared is fine for 2–3 models; dedicated is better past 5.
- Caching before vs after chunking. Cache at the chunk level (after chunking), not the document level. Re-chunking the same document differently would otherwise generate different embeddings from the same source text.
Failure modes and their mitigations
- Oversized input OOMs a pod. A TEI pod OOMs if someone sends a document with 50k characters to a model with a 512-token max. Mitigation: validate token length at the gateway layer before routing; return HTTP 400, not 500.
- Kafka consumer lag spike. If a large batch job dumps 50M documents at once, the ingest topic lag explodes. Mitigation: set Kafka consumer group rate limits per batch job; the consumer is KEDA-autoscaled on consumer lag, but with an upper cap on worker count to prevent fleet saturation.
- Redis OOM on cache. If the semantic cache grows without eviction, it OOMs. Mitigation: LRU eviction policy, max 4 GB of cache per model, monitor redis:used_memory and alert at 80%.
- Stale embeddings after model update. When you retrain the embedding model, all existing vectors in the index are misaligned with the new model. Mitigation: a model version is baked into every vector record; retrieval checks the version before using a cached vector, and re-embedding is triggered for stale records on read (lazy) or via a background job (eager).
- Hot partition in Kafka. If ingest documents are partitioned by source and one source dominates (e.g., a large data dump), one partition gets all the load. Mitigation: partition by SHA-256 of the document ID, not by source.
What a senior candidate says that a mid-level candidate misses
The mid-level candidate talks about “call the API and store the result.” The senior candidate talks about the two separate service levels — query path and ingest path — and why they have fundamentally different latency contracts, failure modes, and scaling triggers. Query-time embedding sits on the critical path of a user’s request; ingestion is a background pipeline. Mixing them means a bulk re-index job can degrade search for real users. That is the first thing to separate in the architecture.
The second thing the senior candidate raises is model versioning as a first-class concern. Embedding models are not static. You retrain them, upgrade them, fine-tune them for a new domain. When you do, every cached vector in your index is from a different model than the one now running. The “version token” approach — embed the model name+version as a prefix in the Redis cache key and in every row of the vector store — is the standard mitigation. The mid-level candidate has never thought about this because they have never been on-call for the panic when a model is swapped mid-night and search recall tanks 20%.
Follow-up drills
- “You’ve described batching at the TEI level. What happens to your p99 latency as batch size increases? Draw the tradeoff curve and tell me where you’d set the batch size.”
- “The team wants to add a new 7B-parameter multilingual embedding model. Walk me through the capacity planning and deployment process, and specifically how you handle the rollout so you don’t disrupt existing model pools.”
- “Half of your ingest documents are in Japanese. Does the architecture change?”
§125.2 Design an AI gateway
The problem
The interviewer asks: “Design an AI gateway — a shared entry point for all LLM traffic in an organization. Multiple teams are calling multiple models (GPT-4o, Claude, Gemini, self-hosted Llama). How do you build this?”
What they are testing: do you understand that the gateway is not just a reverse proxy? It is the control plane for cost, compliance, safety, and observability across the entire LLM surface of a company. The interviewer wants to see that you have thought about auth, rate limiting, quota, audit logs, cost attribution, and PII redaction — not just routing.
Clarifying questions
- How many teams and models? A 10-team company with 3 models is different from a 1,000-team platform with 50 models. This determines whether the gateway config is static YAML or a live database.
- Which protocols? OpenAI-compatible HTTP, Anthropic’s API, or raw gRPC to internal models? A gateway that understands only one protocol can normalize early.
- Streaming? SSE (server-sent events) from the model must flow through the gateway without buffering, or TTFT blows up.
- PII requirements? Does PII need to be scrubbed from prompts before they leave the org? This requires inline redaction logic.
- Cost attribution model? Per team? Per user? Per project? The answer determines the metering granularity.
- Audit retention? Full prompt+response retention is expensive and creates compliance exposure. Most orgs log metadata only; some log full payloads with a retention policy.
Back-of-envelope
A gateway at 10,000 QPS with ~2 KB average request payloads sees 20 MB/s inbound. A single Envoy proxy can handle 50,000 QPS at this payload size on a 4-core container, so two Envoy replicas run well below their ceiling. The gateway itself is not the bottleneck; the bottleneck is the upstream model APIs.
Audit log storage: 10,000 QPS × 86,400 s/day × 1 KB per metadata record = 864 GB/day. Over 90 days with metadata-only logging: 78 TB. At S3 prices ($0.023/GB/month), that’s ~$1,800/month for 90-day metadata retention. Full prompt+response logs (avg 4 KB) cost 4× more: ~$7,200/month. At volume, this is a meaningful line item; a smart candidate recommends tiered retention: 7 days full, 90 days metadata, forever for flagged/anomalous traffic.
Rate-limit state: each team needs a token bucket. At 1,000 teams × 2 integers (tokens, timestamp) × 8 bytes each = trivial. The state fits easily in Redis with room to spare.
Architecture
graph LR
Clients["Clients\n(apps / notebooks / agents)"] --> ENVOY["Envoy AI Gateway\nHTTP/2 · OpenAI-compat · SSE-aware"]
ENVOY --> AUTH["Auth + Quota\nJWT verify · token bucket\nper-team · per-model"]
AUTH --> PII["PII Redaction\nNER-based scrub\n(Presidio / regex)"]
PII --> ROUTER["Model Router\ncanary weights · cost-aware\nmodel alias → backend"]
ROUTER --> EXT["External Models\nGPT-4o · Claude · Gemini"]
ROUTER --> SELF["Self-hosted Models\nvLLM fleet · TP=2"]
ROUTER --> CACHE["Prompt Cache\nRedis\ncontent-hash→response"]
ENVOY --> AUDIT["Audit Sidecar\nKafka topic\nmeta or full payload"]
AUDIT --> STORE["Audit Store\nS3 + Athena"]
AUTH --> QUOTA["Quota DB\nRedis token buckets\nPostgres for hard limits"]
ROUTER --> METER["Cost Attribution\nper-request token count\n→ Kafka metering topic"]
style ENVOY fill:var(--fig-accent-soft),stroke:var(--fig-accent)
The Envoy layer is stateless; all stateful concerns (quota, prompt cache, audit) are pushed to purpose-built stores so the gateway can scale horizontally without coordination.
The gateway is built on Envoy AI Gateway, which ships as a distribution of Envoy with an LLM-specific filter chain. It natively understands the OpenAI ChatCompletions API, parses streaming SSE responses, and can inject headers and mutate requests at the protocol level. The alternative is Kong Gateway with the AI plugin or a custom FastAPI proxy — Envoy is preferred here because its filter model is composable and the HTTP/3 path is native.
Auth and quota runs as an Envoy external authz filter (gRPC to a small Go service). JWT tokens carry team ID, model allow-list, and tier. The Go service checks the JWT, then calls Redis to check and decrement the token bucket. Typical token-bucket window: per-minute for burst control, per-day for spend control. Hard monthly limits are stored in Postgres and checked asynchronously (not on the hot path).
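The token-bucket check can be sketched as below. In production the (tokens, timestamp) pair lives in Redis and is updated atomically (typically via a Lua script); here an in-process object stands in, and the class and parameter names are illustrative.

```python
import time

class TokenBucket:
    """Per-team token bucket: refill continuously, admit while tokens remain."""
    def __init__(self, rate_per_min: float, burst: int):
        self.rate = rate_per_min / 60.0     # refill rate, tokens per second
        self.burst = burst                  # max accumulated tokens
        self.tokens = float(burst)
        self.ts = time.monotonic()

    def allow(self, cost: int = 1) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.ts) * self.rate)
        self.ts = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate_per_min=60, burst=2)
assert bucket.allow() and bucket.allow()   # burst of 2 admitted immediately
assert not bucket.allow()                  # third immediate request rejected
```

The per-minute bucket handles burst control; the per-day spend bucket is the same structure with a different rate and a cost equal to the request's token count.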
PII redaction uses Microsoft Presidio for named-entity recognition: SSN, credit card numbers, email addresses, phone numbers, person names. It runs as a filter on the request body before the prompt leaves the organization. False-positive rate for person names is ~5%; the filter is configurable to skip named entities where the risk is lower. A second pass on the response is optional — most orgs do not inspect model outputs for PII unless they are in regulated industries.
Model router reads a live config from a database (etcd or Postgres) that maps model aliases to backends with weights. A canary looks like: gpt-4o → { production: 95%, canary-v2: 5% }. The router also does cost-aware routing: if the self-hosted Llama fleet has spare capacity, cheaper traffic (non-time-sensitive) is redirected from external APIs to the self-hosted fleet. This “self-hosted spill” logic is the main cost-reduction lever beyond rate limiting.
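The weighted canary resolution can be sketched deterministically: hash the request ID into a 0–99 bucket and walk cumulative weights. This is one common implementation pattern, not necessarily how any specific router does it; the function name and config shape are assumptions.

```python
import hashlib

def resolve_backend(alias_cfg: dict, request_id: str) -> str:
    """Deterministic weighted routing: same request ID always resolves to the
    same backend for a given config. Weights must sum to 100."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    cum = 0
    for backend, weight in alias_cfg.items():
        cum += weight
        if bucket < cum:
            return backend
    raise ValueError("canary weights must sum to 100")

cfg = {"production": 95, "canary-v2": 5}   # the gpt-4o example from the text
backend = resolve_backend(cfg, "req-123")
```

Determinism matters here: hashing the request (or session) ID means retries of the same request hit the same backend, which keeps canary metrics clean.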
Prompt cache works on deterministic, low-temperature requests (temperature 0, or a request header X-Prefer-Cached: true). The cache key is a SHA-256 of model + messages + temperature. Hit rate is 5–30% depending on the application; highest for developer/internal tools that repeatedly call the same prompts.
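The cache key described above can be sketched as a SHA-256 over a canonical JSON encoding; canonicalization (sorted keys, fixed separators) is the detail that makes semantically identical requests collide, and is an implementation choice here.

```python
import hashlib
import json

def prompt_cache_key(model: str, messages: list, temperature: float) -> str:
    """Content-addressed cache key: SHA-256 over canonical JSON of the request."""
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True, separators=(",", ":"),   # canonical form: stable key order
    )
    return hashlib.sha256(payload.encode()).hexdigest()

msgs = [{"role": "user", "content": "Summarize this doc."}]
k1 = prompt_cache_key("gpt-4o", msgs, 0.0)
assert k1 == prompt_cache_key("gpt-4o", msgs, 0.0)      # identical requests share a key
assert k1 != prompt_cache_key("gpt-4o", msgs, 0.7)      # temperature is part of the key
```

Including temperature in the key is what makes the cache safe: a temperature-0 hit is deterministic, while a temperature-0.7 request never collides with it.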
Audit sidecar writes one record per request to Kafka. The record is: timestamp, team-ID, model, input token count, output token count, latency, whether it was a cache hit, and — if full logging is enabled for this team — the full prompt and completion. Downstream, S3 + Athena makes the audit log queryable without a dedicated log cluster.
Key trade-offs
- Inline PII scrubbing vs async. Inline scrubbing adds 5–20 ms latency per request (NER is not free). For chat applications where TTFT matters, the tradeoff is real. Alternative: async scrubbing of the audit log only, with an opt-in real-time path for regulated workloads.
- Centralized gateway vs per-team sidecar. A centralized gateway is easier to operate and gives consistent policy enforcement. Per-team sidecars give each team autonomy and avoid a single failure domain. Centralized wins when compliance is a top concern; sidecar wins when team velocity is paramount.
- Prompt caching at the gateway vs at the model. Gateway-level caching is exact-match only; model-level prefix caching (in vLLM or the API provider) handles partial matches. The gateway cache is a cheap first filter; model-level caching is the deeper lever.
- Full-payload audit vs metadata-only. Full payload is expensive (4×) and creates legal exposure (do you want to store every user’s health question for 90 days?). Metadata-only is safe but blind. Tiered retention — full logs for flagged traffic, metadata for normal traffic — is the right default.
- Model alias abstraction vs direct model names. Aliases decouple application code from specific model versions. An alias gpt-4o can be transparently remapped to gpt-4o-2025-04 without app changes. This is non-negotiable at scale; without it, model version migrations require coordinating across dozens of teams.
Failure modes and their mitigations
- Redis quota service down. If the auth service can’t check quotas, fail open (let traffic through) with a flag that disables quota checks — but emit an alert immediately. Fail-closed would block all traffic, which is catastrophic. The risk of a few minutes of unmetered traffic is acceptable.
- PII filter adds too much latency. If NER inference is slow (> 30 ms), degrade gracefully: skip PII filtering and flag the request for async audit. Alert on skip rate; if > 1% of traffic is being skipped, the PII filter fleet needs more capacity.
- External model API rate-limited. If OpenAI 429s at high traffic, the gateway should queue the request for retry with exponential backoff, not immediately surface the error to the client. The queue budget: 5 seconds max, then surface the 429. Never retry indefinitely.
- Canary config poisoned. If a bad canary config routes 100% of traffic to a non-existent backend, everything breaks. Mitigation: validate configs before applying, use a config version system (etcd watch with rollback), and keep the previous config hot.
- Audit Kafka topic backpressure. If the Kafka brokers are slow, the sidecar must not block the request path. The sidecar writes to an in-process buffer (bounded, 10k records) and drops records if the buffer fills. Alert on audit:drop_rate; the data is not fully durable on the hot path by design.
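The 429 retry policy from the failure-mode list can be sketched as a budgeted backoff loop. This is one reasonable shape for it; the function signature and the jitter scheme are assumptions, and a real gateway would implement this in the proxy filter, not application Python.

```python
import random
import time

def call_with_retry(send, budget_s=5.0, base=0.25):
    """Retry 429s with jittered exponential backoff inside a fixed time budget;
    once the budget is spent, surface the 429 to the caller. Never retry forever."""
    deadline = time.monotonic() + budget_s
    attempt = 0
    while True:
        status, body = send()
        if status != 429:
            return status, body
        delay = min(base * (2 ** attempt), deadline - time.monotonic())
        if delay <= 0:
            return status, body                     # budget exhausted
        time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids retry stampedes
        attempt += 1

responses = iter([(429, ""), (429, ""), (200, "ok")])
status, body = call_with_retry(lambda: next(responses), budget_s=2.0, base=0.01)
assert (status, body) == (200, "ok")
```

The jitter is not optional decoration: without it, every client that got rate-limited at the same moment retries at the same moment, re-creating the spike.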
What a senior candidate says that a mid-level candidate misses
The mid-level candidate builds a routing proxy with auth. The senior candidate realizes that the gateway is the single point where the organization can impose cost controls and compliance at scale, and therefore designs it with that as the first principle. The cost-attribution pipeline — logging input/output token counts per team per model per day — is what makes a monthly AI spend review possible. Without it, the finance team has one vendor invoice and no idea which team is responsible for 40% of it.
The senior candidate also raises prompt caching with a content-addressed cache and the X-Prefer-Cached header as a first-class API feature. This lets application developers explicitly opt in to cached responses, which is the right model: some applications (temperature=0 tools, document extraction) absolutely want caching; others (creative writing, temperature=1 chat) never want stale responses. The gateway cannot know which without a hint from the caller.
Follow-up drills
- “Describe how you would do a canary rollout of a new model version through this gateway. Who sets the canary weights? How do you monitor whether the canary is healthy?”
- “A team says their prompts contain internal code and they don’t want the gateway to log them, even in metadata form. How do you handle this without giving them the ability to bypass all audit?”
- “GPT-4o raises their prices by 30%. Describe how you would use the gateway to shift spend to the self-hosted fleet without any application code changes.”
§125.3 Design a prompt management system
The problem
The interviewer asks: “Teams are managing their prompts in Google Docs, in code comments, in Notion. It’s a mess. Design a prompt management system that handles versioning, staging/production promotion, A/B testing, and regression testing.”
What they are testing: do you understand that prompts are software artifacts, not free-form strings? They have versions, they can regress, they can contain secrets, and changing them in production without a gate is how you get a 2 a.m. incident. The interviewer is looking for the candidate who treats prompt engineering as a proper engineering discipline.
Clarifying questions
- Who are the authors? ML engineers only, or non-technical product managers writing prompts in a UI? This determines whether the interface is an API or a web editor.
- What is “production”? A single prod environment, or multiple (staging, shadow, prod-US, prod-EU)?
- How are prompts delivered to the serving fleet? Does the serving fleet call the prompt service at request time, or does it bake the prompt in at deploy time?
- What are the A/B testing semantics? Per-user or per-request randomization? Sticky (same user always sees variant A) or not?
- Are there secrets in prompts? E.g., API keys for tool calls baked into the system prompt? How are they injected?
- What is the regression test set? A golden set of prompt+expected-output pairs, or an LLM-judge eval? Who owns it?
Back-of-envelope
The system is low-QPS for writes (engineers update prompts maybe 100 times/day across all teams). Reads are high-QPS because every inference request resolves the current prompt version: 10,000 QPS for the serving fleet, each making one prompt lookup per request. At ~100 bytes per prompt record (just the version pointer), that is 1 MB/s of read traffic — trivially cacheable.
Storage for prompt artifacts: a prompt template is typically 500–5,000 tokens. At 4 bytes/token, 5,000 tokens = 20 KB per version. With 500 prompts, 100 versions each: 500 × 100 × 20 KB = 1 GB — trivially small. The entire prompt corpus fits in DRAM.
The expensive part is running the regression suite. 500 prompts × 100 test cases × 1,000 output tokens per judged response = 50M tokens of model calls per regression run. At $0.50/M tokens (self-hosted 70B), that is $25 per run: affordable for a daily CI job, too expensive to run on every commit.
Architecture
graph TD
AUTHORS["Authors\n(engineers · PMs · evals)"] --> UI["Prompt Editor UI\nversion diff · template vars"]
UI --> API["Prompt Service API\nREST · versioning · staging rules"]
API --> DB["Prompt Store\nPostgres + S3 for large templates"]
API --> CACHE["Read Cache\nRedis: version pointer hot path"]
API --> CIGATE["CI Eval Gate\ntrigger on merge to staging"]
CIGATE --> EVALSUITE["Regression Suite Runner\nLLM-judge on golden set\n(Ray Tasks)"]
EVALSUITE --> EVALSTORE["Eval Results Store\nPostgres: score × version × prompt"]
API --> AB["A/B Config Store\nexperiment → {v_a: 50%, v_b: 50%}\nRedis + experiment UI"]
FLEET["Serving Fleet\n(vLLM)"] --> CACHE
CACHE --> DB
API --> SECRETS["Secret Injector\nVault: inject keys at render time\nnot stored in prompt body"]
style API fill:var(--fig-accent-soft),stroke:var(--fig-accent)
The prompt service sits between authors and the serving fleet: authors write to it, the fleet reads from it, and the CI eval gate sits on every path from staging to production.
The prompt store is Postgres with a row per version and a separate S3 key for large templates. Each row has: prompt ID, version number (monotone integer), template body (or S3 reference), author, tags, staging state (draft, staging, production, archived), parent version (for lineage), and a checksum. Versioning is immutable: once a version is committed, it never changes. A “rollback” is a forward action: promote the old version to production, creating a new “current” pointer.
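The append-only store and rollback-as-forward semantics can be sketched in miniature. The class and field names are illustrative; the real store is Postgres rows plus a Redis pointer, as described above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)          # frozen: a committed version never mutates
class PromptVersion:
    prompt_id: str
    version: int
    body: str
    state: str                   # draft | staging | production | archived

class PromptStore:
    """Append-only version store. A 'rollback' is a forward action: promote
    the old version, moving the production pointer; nothing is rewritten."""
    def __init__(self):
        self.versions = {}       # (prompt_id, version) -> PromptVersion
        self.production = {}     # prompt_id -> current production version number
        self._next = {}

    def commit(self, prompt_id: str, body: str) -> int:
        v = self._next.get(prompt_id, 0) + 1
        self._next[prompt_id] = v
        self.versions[(prompt_id, v)] = PromptVersion(prompt_id, v, body, "draft")
        return v

    def promote(self, prompt_id: str, version: int):
        self.production[prompt_id] = version   # atomic pointer swap (Redis in prod)

store = PromptStore()
v1 = store.commit("summarizer", "Summarize: {{doc}}")
v2 = store.commit("summarizer", "Summarize briefly: {{doc}}")
store.promote("summarizer", v2)
store.promote("summarizer", v1)   # "rollback" = re-promoting the old version
```

Because every version is immutable, the audit trail of what was in production at any time is just the history of pointer swaps.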
Version promotion flow:
- Author commits a draft to the prompt service.
- The system auto-runs a fast eval (< 5 min): 20 golden-set cases, LLM-as-judge score ≥ threshold.
- If eval passes, the version can be promoted to staging. A Slack notification is sent.
- The full regression suite runs on staging traffic (shadow mode: the new prompt runs in parallel with the production prompt, neither affecting real users).
- An engineer clicks “promote to production” in the UI. The version pointer in Redis is atomically updated.
- 10 minutes of production monitoring; if error rate or quality metrics degrade, auto-rollback to the previous version.
The A/B config store holds experiment configs: which prompts are being tested, what the traffic weights are (50/50, 90/10), which user cohort (hashed by user ID for sticky randomization). The serving fleet resolves the prompt version at request time: look up the user’s bucket, look up the active experiment for this prompt slot, resolve to a specific version ID, fetch from cache. Total resolution time: 2–3 ms including Redis round-trip.
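The sticky bucketing can be sketched as a hash of (experiment, user) into a 0–99 bucket, walked against cumulative weights. Hashing the experiment name in alongside the user ID is a deliberate choice so the same user lands in different buckets across different experiments; the function name is an assumption.

```python
import hashlib

def ab_variant(user_id: str, experiment: str, weights: dict) -> str:
    """Sticky assignment: the same (experiment, user) pair always resolves to
    the same variant for the life of the experiment."""
    h = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(h, 16) % 100
    cum = 0
    for variant, pct in sorted(weights.items()):   # fixed iteration order
        cum += pct
        if bucket < cum:
            return variant
    return sorted(weights)[0]                      # guard against rounding gaps

w = {"v_a": 50, "v_b": 50}
first = ab_variant("user-42", "exp-summarizer-tone", w)
assert all(ab_variant("user-42", "exp-summarizer-tone", w) == first for _ in range(5))
```

This resolution is pure computation over the config fetched from Redis, which is why the 2–3 ms budget quoted above is dominated by the Redis round-trip, not the bucketing.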
Secret injection is the critical piece most candidates miss. Prompts sometimes contain tool call schemas or API keys as part of the system prompt. Storing API keys in the prompt body is a security disaster: they end up in the prompt store, in every audit log, and in every export. The correct pattern is templated injection via Vault: the prompt template has a variable like {{TOOL_API_KEY}}, and the serving fleet (or a secret-injector sidecar) renders the template with secrets fetched from Vault at render time. The secret never touches the prompt store.
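The render-time injection can be sketched with a simple template substitution; a dict stands in for the Vault client, and the {{VAR}} syntax follows the example in the text.

```python
import re

def render_prompt(template: str, secret_lookup) -> str:
    """Substitute {{VARS}} at render time. Secrets come from a lookup
    (a Vault client in production) and never enter the prompt store,
    the audit log, or any export of the template."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: secret_lookup(m.group(1)), template)

fake_vault = {"TOOL_API_KEY": "sk-live-xyz"}   # stand-in for a Vault client
stored_template = "Call the tool with key {{TOOL_API_KEY}}."
rendered = render_prompt(stored_template, fake_vault.__getitem__)
assert "sk-live-xyz" in rendered and "{{" not in rendered
```

The important invariant is that only `stored_template` is ever persisted; `rendered` exists only in the request path.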
Offline regression gating runs before promotion to production. The golden set is a curated set of (prompt, expected behavior) pairs, usually 100–500 entries. The eval runner (Ray Tasks for parallelism) sends each golden case to a self-hosted judge model, which scores the response on a rubric. Typical rubrics: faithfulness (did the output match the expected intent?), format compliance (did it output the requested JSON schema?), safety (did it avoid the blocked topics?). The score threshold is configurable per prompt; some prompts are safety-critical and require score > 0.95; others are creative and accept > 0.80.
Key trade-offs
- Runtime resolution vs baked-in prompts. Runtime resolution (serving fleet calls the prompt service per request) gives instant rollout and instant rollback. Baked-in (prompts in the container image) requires a redeploy but has zero runtime dependency. Runtime resolution is the right choice for any system where prompts change more than once a week.
- Sticky A/B vs random-per-request. Sticky randomization (hash user ID) is better for user experience — a user doesn’t see different personalities in the same session. Random-per-request is simpler to implement. Use sticky for chat; random is acceptable for single-turn extraction tasks.
- Full regression on every commit vs nightly. Full regression runs are expensive and slow. Daily nightly runs on main, fast partial runs on every commit, and full runs on staging-to-prod promotions is the right schedule.
- Prompt versioning in app code vs external. If prompts are versioned in the prompt service, engineers can change a prompt without shipping a new code artifact. This is powerful but it also means a non-engineer can change production behavior with a few clicks. Access control (who can promote to production) is the gate.
- Secret injection via Vault vs env var. Vault is the correct answer for any secret that is not already in the environment. An API key baked into a prompt template and stored in Postgres is a disaster waiting for a security audit.
Failure modes and their mitigations
- Regression suite false-negative. A bad prompt passes the golden set but degrades production behavior on out-of-distribution inputs. Mitigation: maintain a production error log — when a user flags a bad response, that case is added to the golden set automatically. The set grows over time.
- Prompt service unavailable at serving time. If the serving fleet can’t resolve the prompt version, it should fall back to a hard-coded default version embedded in the container image. Never fail open with an empty prompt.
- A/B experiment stuck. An engineer launches an A/B with 50/50 traffic and forgets to analyze results. After 30 days, the experiment is still running. Mitigation: experiments have an expiry date (default: 14 days). On expiry, the system alerts the owner and, if no action is taken, reverts to the control (A) variant.
- Secrets in prompt body. An engineer pastes an API key directly into the prompt body instead of using a template variable. Mitigation: the prompt service runs a secrets-detection scan on every commit (using detect-secrets or truffleHog) and rejects commits that look like secrets.
What a senior candidate says that a mid-level candidate misses
The mid-level candidate talks about “storing prompts in a database with version numbers.” The senior candidate talks about the eval gate as the real design challenge: the CI gating is only valuable if the golden set is good, and golden sets tend to be stale and narrow unless there is an automated pipeline that continuously grows them from production failures. The second key insight is secret injection: keeping API keys and other secrets out of the prompt body, out of the prompt store, and out of every audit log downstream. This is not obvious unless you’ve watched someone accidentally commit a live API key to a prompt store that replicates to ten different systems.
Follow-up drills
- “An engineer wants to test a prompt change on 1% of production traffic without going through the full staging cycle. How would you support this while maintaining audit compliance?”
- “The LLM-as-judge used for regression scoring is itself an LLM. What happens when the judge model is upgraded? How do you prevent evaluation drift?”
- “Your system supports 50 teams. How do you prevent one team from forking another team’s prompt and making unauthorized changes to production systems they don’t own?”
§125.4 Design an agent platform with a tool sandbox
The problem
The interviewer asks: “We want to build a platform that lets AI agents call external tools — web search, code execution, database queries, third-party APIs. Design the agent platform and the tool sandbox, with a focus on multi-tenant isolation and security.”
What they are testing: this is the hardest system in this chapter. The interviewer is looking for candidates who understand that arbitrary code execution in a multi-tenant environment is a security problem first and a performance problem second. They want to see sandboxing primitives named, blast-radius analysis, resource limits, and audit trails.
Clarifying questions
- What kinds of tools? Read-only tools (web search, vector retrieval) vs stateful tools (database write, file creation) vs arbitrary code execution (Python interpreter)? Arbitrary code execution is the hardest and most important to design for.
- Who are the tenants? Internal teams only, or external untrusted users writing their own tools? External users writing custom tools requires stronger isolation.
- What are the resource limits per tool call? Time limit (1 second? 30 seconds?), memory limit (256 MB? 4 GB?), CPU limit, network access (allowed or blocked?).
- Synchronous vs asynchronous tool calls? Most tool calls should be synchronous (sub-second); long-running tools (scraping 100 pages) need an async model with polling or webhook callbacks.
- What is the audit requirement? Log tool inputs and outputs? Log the full execution environment state?
- How does the agent orchestrator interact with the platform? OpenAI function-calling API style, or a richer protocol?
Back-of-envelope
Tool call volume scales with agent request volume. If the serving fleet handles 10,000 agent requests per second and each request makes an average of 3 tool calls, the platform receives 30,000 tool calls per second. However, most tool calls are fast read-only calls (vector lookup: 5 ms, web search: 200 ms). Code execution calls are slower but rarer.
For code execution: assume 1% of tool calls (300/sec) are Python sandbox executions. Each sandbox needs to spin up a container or micro-VM, execute code, and return. A gVisor container cold-starts in ~50–100 ms, so if executions averaged only ~100 ms, Little's law would give 300/sec × 0.1 s = 30 concurrent workers. But with a 30-second max execution time, the pool needs far more capacity for long-running tasks: size it at 300/sec × 30 s × 1.2 headroom ≈ 10,800 slots. This is the expensive part: at 2 vCPU + 512 MB per slot on a c5.2xlarge ($0.34/hr for 8 vCPUs), each instance holds 4 slots, so 10,800 / 4 = 2,700 instances × $0.34 = ~$918/hr, or roughly $660k/month. This cost motivates aggressively short time limits: 10-second default, 30-second premium.
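The sizing chain above can be written down as a few lines of arithmetic. All inputs are the assumptions stated in the text (1% code-execution fraction, 30 s max execution, 1.2× headroom, c5.2xlarge pricing), not measured values:

```python
# Illustrative sizing math for the code-execution sandbox pool.
def sandbox_pool_sizing(tool_calls_per_sec=30_000,
                        code_exec_fraction=0.01,
                        max_exec_seconds=30,
                        headroom=1.2,
                        vcpus_per_slot=2,
                        vcpus_per_instance=8,
                        instance_dollars_per_hour=0.34):
    code_exec_rate = tool_calls_per_sec * code_exec_fraction        # 300/sec
    # Little's law: concurrency = arrival rate x residency time
    slots = code_exec_rate * max_exec_seconds * headroom            # 10,800 slots
    slots_per_instance = vcpus_per_instance // vcpus_per_slot       # 4 slots
    instances = slots / slots_per_instance                          # 2,700 instances
    dollars_per_hour = instances * instance_dollars_per_hour        # ~$918/hr
    return slots, instances, dollars_per_hour

slots, instances, cost = sandbox_pool_sizing()
print(slots, instances, round(cost))   # 10800.0 2700.0 918
```

Dropping the max execution time from 30 s to 10 s cuts the slot count (and the bill) by two thirds, which is exactly why the short default time limit matters.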
Architecture
graph LR
AGENT["Agent Orchestrator\n(ReAct loop / Temporal)"] --> TOOLAPI["Tool Registry API\nlist · invoke · schema"]
TOOLAPI --> DISPATCH["Tool Dispatcher\nroute by tool type · auth check"]
DISPATCH --> READONLY["Read-only Tools\nweb search · vector lookup\nDB read · HTTP fetch"]
DISPATCH --> SANDBOX["Code Sandbox Pool\ngVisor containers\nFirecracker micro-VMs for untrusted"]
DISPATCH --> STATEFUL["Stateful Tools\nDB write · file I/O · email send"]
SANDBOX --> NETPOL["Network Policy\nblock all except allowlist\nDNS sinkhole for malicious"]
SANDBOX --> RLIMITS["Resource Limits\ncgroups: 512 MB RAM · 1 CPU\nseccomp: allow-list syscalls"]
TOOLAPI --> AUDITLOG["Audit Log\nKafka: tool · tenant · input hash · output hash · duration"]
DISPATCH --> QUOTA["Per-tenant Quota\ntoken bucket: tool calls/min\nhard cap: $/day on expensive tools"]
STATEFUL --> GUARDRAILS["Stateful Guardrails\nconfirmation required · dry-run mode"]
style SANDBOX fill:var(--fig-accent-soft),stroke:var(--fig-accent)
The code sandbox pool is the security-critical component: it executes arbitrary tenant code inside gVisor containers with cgroup resource limits and seccomp allow-lists, completely isolated from the host and from other tenants.
Tool registry is a catalog of registered tools with their OpenAPI schema, metadata (owner team, tier, SLA), and routing key. Tools are versioned: web-search@v2, python-interpreter@v3. The serving fleet (and agent orchestrators) discover available tools via the registry API, which returns the JSON schema for function-calling. The registry is read-heavy and served from a Redis cache in front of Postgres.
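A registry entry ties together the pieces the text lists: schema, metadata, routing key, version. The field names below are illustrative assumptions, not a published schema; the `schema` field is what the registry API would return verbatim for function-calling:

```python
# Illustrative shape of one tool registry entry (field names are assumptions).
WEB_SEARCH_V2 = {
    "name": "web-search",
    "version": "v2",                     # versioned: web-search@v2
    "owner_team": "search-infra",
    "tier": "standard",
    "sla_p99_ms": 500,
    "routing_key": "readonly",           # the dispatcher routes on this
    "cacheable": True,                   # idempotent read-only tool
    "cost_usd_per_call": 0.002,          # feeds the per-tenant dollar cap
    "schema": {                          # OpenAI function-calling style
        "name": "web_search",
        "description": "Search the web and return top results.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

def registry_key(entry):
    """Cache key for the Redis layer in front of Postgres."""
    return f"tool:{entry['name']}@{entry['version']}"

print(registry_key(WEB_SEARCH_V2))   # tool:web-search@v2
```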
Tool dispatcher validates the calling agent’s credentials, checks per-tenant quota, selects the right handler, and routes the call. For read-only tools, it calls the external service directly and returns. For code execution, it routes to the sandbox pool. For stateful tools, it applies guardrails: confirmation-required mode (the agent must present the tool call for human approval before execution) and dry-run mode (describe what would happen without doing it).
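The dispatcher's control flow is short enough to sketch end to end. This is a minimal illustration under assumed names (the `ToolCall` shape and handler signatures are invented for the example), showing the quota check before routing and the dry-run guardrail for stateful tools:

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    tenant: str
    tool_type: str          # "readonly" | "sandbox" | "stateful"
    dry_run: bool = False

def dispatch(call, quota_ok, handlers):
    # Quota is checked before any handler runs.
    if not quota_ok(call.tenant):
        return ("429", None)
    # Stateful guardrail: describe the effect without executing it.
    if call.tool_type == "stateful" and call.dry_run:
        return ("ok", "dry-run: would execute, no side effect performed")
    return ("ok", handlers[call.tool_type](call))

handlers = {
    "readonly": lambda c: "external service result",
    "sandbox":  lambda c: "routed to gVisor pool",
    "stateful": lambda c: "held for confirmation",
}
print(dispatch(ToolCall("acme", "sandbox"), lambda t: True, handlers))
```

Confirmation-required mode would slot in the same way as dry-run: the dispatcher parks the call and returns a pending status instead of invoking the handler.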
Code sandbox architecture is the critical design choice. Two options:
- gVisor (runsc): a userspace kernel that intercepts syscalls in Go, providing strong isolation without a full VM. Cold start: 50–100 ms. Memory overhead: ~20 MB per container. Best for Python execution of trusted-but-isolated code.
- Firecracker micro-VMs: a full lightweight VM (5 MB kernel, 125 ms boot). Each tenant gets a VM. Best for untrusted external user code. The isolation is hardware-level, not userspace.
The production pattern for a B2B platform is gVisor for internal tools, Firecracker for external/untrusted code. A pool of warm gVisor containers (pre-started, waiting for assignments) eliminates cold start for the common case. Firecracker VMs are started on demand for untrusted execution.
Inside the sandbox: cgroups enforce memory limit (512 MB default) and CPU quota (1 vCPU). seccomp allow-list permits only the ~30 syscalls needed for Python execution (read, write, mmap, futex, clock_gettime, etc.) and explicitly blocks fork, exec, mount, network operations outside the allow-list. The network policy blocks all external connections except for explicitly allowed endpoints (e.g., the agent’s own database connection); a DNS sinkhole catches any attempt to resolve arbitrary external hostnames.
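The allow-list is a policy artifact worth writing down. In production it is enforced by the kernel as a seccomp-BPF filter; the sketch below only shows the default-deny shape of the policy, with a small subset of the text's "~30 syscalls" (the exact set is an assumption):

```python
# Illustrative seccomp-style policy: default-deny with an explicit allow-list.
PYTHON_EXEC_ALLOWLIST = {
    "read", "write", "mmap", "munmap", "brk",
    "futex", "clock_gettime", "exit_group", "rt_sigaction",
}

# Listed for documentation; anything off the allow-list is denied anyway.
BLOCKED_EXPLICITLY = {"fork", "execve", "mount", "socket", "connect"}

def syscall_allowed(name):
    # Default-deny: a syscall is permitted only if explicitly allow-listed.
    return name in PYTHON_EXEC_ALLOWLIST and name not in BLOCKED_EXPLICITLY

print(syscall_allowed("read"), syscall_allowed("execve"))   # True False
```

The important property is the direction of the default: a new syscall added to the kernel is blocked until someone deliberately allow-lists it, not the other way around.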
Multi-tenant quota covers two dimensions: calls per minute (token bucket, enforced before dispatch) and dollar spend per day (accumulator, enforced at the end of each execution). Expensive tools (web scraping, code execution) have explicit per-call cost estimates. If a tenant’s daily dollar accumulator hits the hard cap, tool calls return 429. This protects against runaway agent loops.
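The two quota dimensions compose into one admission check. A minimal in-memory sketch (in production both counters live in Redis; class and method names are illustrative):

```python
import time

class TenantQuota:
    """Token bucket for calls/min plus a daily dollar accumulator."""
    def __init__(self, calls_per_min, daily_usd_cap):
        self.capacity = calls_per_min
        self.tokens = float(calls_per_min)
        self.refill_per_sec = calls_per_min / 60.0
        self.last = time.monotonic()
        self.daily_usd_cap = daily_usd_cap
        self.spent_usd = 0.0

    def try_call(self):
        # Refill the bucket, then check both dimensions before dispatch.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens < 1 or self.spent_usd >= self.daily_usd_cap:
            return False          # dispatcher returns 429
        self.tokens -= 1
        return True

    def record_cost(self, usd):
        # Accumulator, applied at the end of each execution.
        self.spent_usd += usd

q = TenantQuota(calls_per_min=60, daily_usd_cap=1.00)
q.record_cost(1.00)               # runaway loop burns the daily budget
print(q.try_call())               # False: 429 even though tokens remain
```

Note the asymmetry the text describes: the rate limit is enforced before dispatch, the dollar cap only after each call completes, so a single expensive call can overshoot the cap slightly before the next call is rejected.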
Audit trail: every tool call is logged to Kafka with: tenant ID, agent session ID, tool name + version, input SHA-256 (not raw input, for PII), output SHA-256, duration, exit code, resource usage (peak RAM, CPU time). Full input/output logs are optional and stored in S3 with the same tiered retention policy as the gateway audit log (7 days full, 90 days metadata). The audit trail is queryable via Athena.
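The record shape follows directly from the field list above. A sketch of the builder, hashing inputs and outputs so the always-on Kafka stream never carries raw (possibly PII-bearing) payloads:

```python
import hashlib

def audit_record(tenant_id, session_id, tool, version,
                 raw_input, raw_output, duration_ms, exit_code,
                 peak_ram_mb, cpu_ms):
    # Hashes, not raw payloads: the full logs go to S3 under tiered retention.
    h = lambda s: hashlib.sha256(s.encode("utf-8")).hexdigest()
    return {
        "tenant_id": tenant_id,
        "session_id": session_id,
        "tool": f"{tool}@{version}",
        "input_sha256": h(raw_input),
        "output_sha256": h(raw_output),
        "duration_ms": duration_ms,
        "exit_code": exit_code,
        "peak_ram_mb": peak_ram_mb,
        "cpu_ms": cpu_ms,
    }

rec = audit_record("acme", "sess-42", "python-interpreter", "v3",
                   "print(1+1)", "2", 87, 0, 44, 61)
print(rec["tool"])   # python-interpreter@v3
```

Hashing still lets an investigator confirm "this exact input was seen N times across tenants" via Athena without ever decrypting the S3 copy.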
Memory isolation between agent sessions is explicit: each agent session gets an isolated in-memory scratchpad (a Redis hash keyed by session ID). Tool calls can write to the scratchpad but cannot read from other sessions’ scratchpads. The agent orchestrator (typically Temporal or a lightweight ReAct loop) is responsible for passing context to tools explicitly; there is no global shared state.
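The isolation property here is structural: the tool-facing API only ever accepts the calling session's own ID, so there is no code path that names another session's state. A dict-backed sketch standing in for the Redis hash (HSET/HGET keyed by session ID):

```python
class ScratchpadStore:
    """Per-session scratchpad; a dict stands in for a Redis hash per session."""
    def __init__(self):
        self._pads = {}                  # session_id -> {key: value}

    def write(self, session_id, key, value):
        self._pads.setdefault(session_id, {})[key] = value

    def read(self, session_id, key):
        # Cross-session reads are impossible by construction: a caller can
        # only present its own session ID, not enumerate other pads.
        return self._pads.get(session_id, {}).get(key)

store = ScratchpadStore()
store.write("sess-a", "plan", "step 1: search")
print(store.read("sess-a", "plan"))   # step 1: search
print(store.read("sess-b", "plan"))   # None: isolated
```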
Key trade-offs
- gVisor vs Firecracker. gVisor: lower overhead (20 MB, 100 ms cold start), weaker isolation (userspace kernel, still has attack surface). Firecracker: higher overhead (5 MB kernel + 125 ms boot), hardware isolation, no shared kernel. Use gVisor for internal code, Firecracker for external untrusted code.
- Synchronous vs async tool calls. Synchronous is simpler but limits tool call time to the agent’s latency budget. Async allows long-running tools but requires the agent orchestrator to handle polling/callbacks — much more complex. Default: synchronous with a 10-second timeout; async as an explicit opt-in with a callback URL.
- Tool call caching. Idempotent read-only tool calls (web search for the same query) can be cached. Stateful tools must not be cached. A cacheable: true flag in the tool schema allows the dispatcher to cache read-only results.
- Network access in sandboxes. Allowing network access from the sandbox opens data exfiltration risks. Default: no network access. Tools that need external access (web scraping) get a separate network-enabled pool with a forward proxy that enforces an allow-list.
- Short time limits vs user experience. A 10-second time limit on code execution kills any data processing workload. For batch tasks, the right model is asynchronous with a 10-minute limit and a webhook callback. Synchronous code execution should be hard-capped at 10 seconds.
Failure modes and their mitigations
- Agent loop calling a tool 10,000 times. A runaway ReAct loop can issue tool calls until the quota is exhausted or the budget cap hits. Mitigation: per-session tool call counter (hard cap at 200 calls/session); the orchestrator layer enforces max iterations.
- Sandbox escape attempt. A tenant sends code designed to exploit a gVisor vulnerability. Mitigation: defense in depth — even if gVisor is bypassed, the cgroup limits and network policy still apply; the outer Kubernetes NetworkPolicy blocks outbound exfiltration. Monitor sandbox:seccomp_violation_rate and alert immediately on spikes.
- Sandbox cold start at peak. If tool call volume spikes suddenly, the warm pool is exhausted and cold starts add 100–200 ms latency to every new tool call. Mitigation: KEDA autoscales the pool on sandbox:queue_depth; maintain a minimum pool of 100 warm gVisor containers per tenant tier.
- Stateful tool partial failure. A stateful tool (database write) succeeds but the agent crashes before recording the result. The agent retries and the tool is called twice. Mitigation: stateful tools must be idempotent or implement client-side deduplication keys; the dispatcher passes a tool-call ID that the tool uses for idempotency.
- External API rate-limit cascade. If a web search tool uses an external API (SerpAPI, Bing) and that API is rate-limited, all web search tool calls fail at once. Mitigation: circuit breaker (Chapter 122) at the tool dispatcher; fail fast with a cached fallback result.
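The partial-failure mitigation above is worth making concrete: the dispatcher assigns a tool-call ID, and the stateful tool replays a recorded result on a duplicate ID instead of re-running the side effect. A sketch under assumed names (in production the result map lives in Redis or the tool's own database):

```python
class IdempotentToolRunner:
    def __init__(self):
        self._results = {}      # tool_call_id -> result
        self.executions = 0     # counts real side effects, for the demo

    def run(self, tool_call_id, effect):
        # Duplicate ID: replay the recorded result, skip the side effect.
        if tool_call_id in self._results:
            return self._results[tool_call_id]
        result = effect()
        self.executions += 1
        self._results[tool_call_id] = result
        return result

runner = IdempotentToolRunner()
write = lambda: "row inserted"
runner.run("call-123", write)
runner.run("call-123", write)   # agent crashed mid-loop and retried
print(runner.executions)        # 1: the database write happened exactly once
```

The subtlety is that the result must be recorded atomically with the side effect (same transaction, or a write-then-record with recovery), otherwise a crash between the two reopens the duplicate window.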
What a senior candidate says that a mid-level candidate misses
The mid-level candidate focuses on the happy path: tool is registered, agent calls it, result is returned. The senior candidate immediately goes to the blast radius of a compromised sandbox: what does a malicious tenant’s code do if gVisor is bypassed? This is the right frame. The answer is: even if gVisor’s userspace kernel is exploited, the cgroups (resource limits) still apply to the process, the Kubernetes NetworkPolicy still blocks outbound connections, and the audit trail has a record. The question is not whether the sandbox is perfect — it is whether the layers of defense limit the damage to one tenant’s session.
The second differentiator is agent loop termination as a first-class design requirement. Any platform that enables agents will eventually see a runaway agent that issues 50,000 tool calls in a loop because the LLM got confused about its termination condition. Per-session tool call budgets, dollar caps, and max-iteration limits in the orchestrator are not afterthoughts — they are the primary defense against the most common production incident in agent platforms.
Follow-up drills
- “The Python sandbox needs to call your internal vector database (which is on a private IP). How do you allow this while keeping the general no-network rule?”
- “A tenant’s code in the sandbox reads /etc/passwd. Does that violate isolation? Why or why not?”
- “Describe how you would implement tool call idempotency so that a stateful tool called twice with the same inputs has the same effect as calling it once.”
Read it yourself
- The Hugging Face Text Embeddings Inference repository — architecture and benchmarks for TEI, the production-grade encoder serving runtime.
- The Envoy AI Gateway documentation — the authoritative source for the OpenAI-compatible proxy filter chain.
- The gVisor security design document (gvisor.dev) — explains the user-kernel model, attack surface analysis, and the performance overhead of syscall interception.
- The Firecracker design paper (Agache et al., 2020, NSDI) — the micro-VM architecture behind AWS Lambda; the canonical production reference for lightweight VM isolation.
- The Microsoft Presidio library — the standard open-source PII detection and anonymization toolkit.
- Andrej Karpathy’s “Software 2.0” post — the conceptual frame for treating prompts as software artifacts.
- The LangSmith and Weave (W&B) documentation — commercial prompt management and eval tracking products; reading their API design reveals the tradeoffs in the systems space.
Practice
- Work the embedding service sizing for a 10 TB corpus of legal documents, where each document averages 30,000 tokens. Estimate the number of chunks, the vector index size at 1536 dims (fp16), and the TEI fleet size for 5,000 QPS query traffic.
- Design the token-bucket rate limiting logic for the AI gateway. How many Redis operations does each request require? What happens if Redis is slow (100 ms latency)? What is the fallback?
- The prompt management system needs to support 500 concurrent A/B experiments across 200 teams. Sketch the data model (tables and their columns) that supports this.
- A tool sandbox needs to run Python code that processes a 1 GB NumPy array. Walk through the resource limits you would set and the cold-start implications.
- Stretch: Design the schema for the tool registry — every field that a tool definition needs to carry, and the API endpoints for registering, discovering, and deprecating tools.