Agent SWE interview preparation: the questions they actually ask
"The best agent engineer I ever hired wrote zero agents in her take-home. She wrote a sandbox, a retry loop, and an eval harness. The agent was four lines"
If you have worked through Part V (Chapters 66—72, tool use, orchestration, MCP, safety) and Part XI (Chapters 131—140, building agents and agent infrastructure end-to-end), you understand how agent systems work. This chapter is about proving it under pressure. It is the interview capstone for agent SWE roles --- the fastest-growing hiring category in ML engineering since 2025 --- and it is written to be read the night before.
The agent SWE interview is not an ML interview with agent-themed questions bolted on. It is a systems interview that happens to involve non-deterministic components. The companies hiring for these roles care about sandboxing, observability, cost control, eval, and failure recovery at least as much as they care about prompting. If you walk in and talk only about prompt engineering, you will not pass. If you walk in and talk about the full lifecycle --- tool schema design, context window management, checkpoint-and-resume, trace-based eval, and progressive autonomy --- you will stand out.
This chapter gives you the questions they actually ask, the code they actually expect, and the vocabulary that signals you have built these systems for real.
Outline:
- The agent SWE role landscape
- The interview format
- System design questions and how to answer them
- Coding interview questions for agent roles
- Deep-dive technical questions
- The “agent failure” behavioral questions
- The vocabulary that signals agent expertise
- Mock interview transcript: “Design a coding agent”
- Mock interview transcript: “Implement an agent loop”
- Company-specific notes
- Common mistakes and recovery moves
- The 48-hour prep plan
- Mental model
141.1 The agent SWE role landscape
The title on the job posting varies --- Agent Engineer, Agent Infrastructure Engineer, AI Systems Engineer, Agent Platform Engineer --- but the role breaks into three distinct categories, and you need to know which one you are interviewing for before you walk in.
Agent infrastructure at frontier labs (Anthropic, OpenAI, Google DeepMind). You are building the platform that agents run on. Sandboxing, tool execution, streaming, context management, evaluation infrastructure. The interview skews heavily toward systems design and low-level implementation. You will be asked about process isolation, resource limits, checkpoint serialization, and how to run untrusted code safely at scale. ML knowledge matters but takes second seat to systems thinking.
Agent product at startups (Cognition/Devin, Cursor, Windsurf, Factory, Cosine). You are building a specific agent that ships to users. The interview tests your ability to design an end-to-end agent experience: the loop, the tools, the UX, the eval, the cost model. You will write code that looks like a real agent. Startups care about velocity and pragmatism --- over-engineering is a red flag.
Agent platform at enterprises (Salesforce, ServiceNow, Bloomberg, large banks). You are building the infrastructure that lets internal teams deploy agents safely. Multi-tenancy, access control, audit logging, compliance, human-in-the-loop. The interview tests your ability to design for safety, governance, and scale. You will be asked about RBAC, tool registries, approval workflows, and how to prevent an agent from sending $10M to the wrong account.
What “agent engineer” actually means: it is not prompt engineering. The job is 20% LLM interaction design, 30% systems infrastructure, 20% tooling and sandboxing, 15% evaluation and observability, and 15% safety and security. If you can only write prompts, you are a prompt engineer. If you can build the platform that runs the agent, manages its tools, evaluates its output, and recovers from its failures --- you are an agent engineer.
141.2 The interview format
A typical agent SWE loop at a senior level runs four to five rounds over one or two days. Here is the canonical structure:
| Round | Duration | Format | What they test |
|---|---|---|---|
| System design | 45—60 min | Whiteboard / shared doc | Can you architect an agent platform end-to-end? |
| Coding | 45 min | Live coding (CoderPad / laptop) | Can you implement an agent loop, tool registry, or sandbox? |
| ML fundamentals | 30—45 min | Discussion with code snippets | Do you understand how tool calling, context windows, and structured generation work under the hood? |
| Behavioral | 30—45 min | Conversation | Have you shipped agents, dealt with failures, made trade-off decisions? |
| Hiring manager / team fit | 30 min | Conversation | Will you thrive on this team? |
How this differs from traditional ML systems interviews (Part X, Chapters 112—130):
-
Non-determinism is the default. In a traditional ML systems interview, the non-deterministic component is the model, and everything around it is deterministic. In an agent interview, the agent’s behavior is non-deterministic at every step --- which tool it calls, how many iterations it runs, whether it gets stuck. You must design for this.
-
Security is a first-class concern. Traditional ML systems serve predictions; agents take actions. The interview will probe whether you think about what an agent should not be allowed to do.
-
Cost is a design constraint. A single agent run can cost $0.50—$50 depending on the task. Interviewers expect you to reason about cost per task, not just throughput.
-
Eval is harder and they know it. You cannot A/B test an agent the way you A/B test a recommendation model. The interviewer wants to hear your eval strategy, not just “we’ll use metrics.”
141.3 System design questions and how to answer them
The system design round is where agent SWE interviews diverge most sharply from traditional ML interviews. You will be given one of a handful of canonical prompts. Here are the four most common, with the 45-minute answer structure for each.
141.3.1 “Design a coding agent” (Devin / Claude Code / Codex)
This is the most popular question in 2025—2026 agent interviews. Expect it.
Minutes 0—5: Clarify scope. Ask: Is this an autonomous agent or an interactive assistant? Does it operate on a single repo or arbitrary repos? What is the latency target --- is the user watching in real time or does it run in CI? What is the cost budget per task?
Minutes 5—15: High-level architecture.
Draw five components: (1) the agent loop (ReAct with tool calls), (2) the sandbox (isolated environment for code execution), (3) the tool layer (file read/write, terminal, search, git), (4) the context manager (what fits in the window), (5) the eval and observability layer.
Minutes 15—30: Deep dives.
The interviewer will pick two or three areas. Be ready for:
-
Sandbox design. How do you isolate the agent’s code execution? Options: Docker containers (fast but shared kernel), microVMs (Firecracker --- strong isolation, ~125ms boot), subprocess with seccomp (lightweight but weaker). Discuss the trade-off between startup latency and security boundary. Mention network restrictions (the agent should not be able to exfiltrate data), filesystem snapshots (for rollback), and resource limits (CPU, memory, wall-clock timeout).
-
Agent loop and termination. The loop is simple --- call LLM, parse tool use, execute tool, feed result back. The hard part is termination. You need: a max-iteration cap (prevent infinite loops), a no-progress detector (if the last N iterations produced no meaningful change, stop), a cost cap (halt if token spend exceeds budget), and a user-interrupt mechanism (the user hits Ctrl-C and the agent checkpoints gracefully).
-
Context window management. A coding agent on a large repo will blow through 200K tokens in minutes. Strategies: sliding window over conversation history, summarization checkpoints (every N turns, summarize and compact), tool output truncation (grep returns 10,000 lines --- truncate to the first 200 with a note), and on-demand file reading (do not dump the entire file into context --- read only the relevant lines).
-
Streaming UX. The user needs to see what the agent is doing in real time. Stream the agent’s reasoning (thinking tokens), tool calls (show the command being run), and tool output (show terminal output as it appears). This requires an event-driven architecture with SSE or WebSocket streaming.
Minutes 30—40: Eval and cost.
This is where you signal seniority. Junior candidates forget eval entirely. Senior candidates say: “We need three eval layers. First, a suite of golden tasks --- 50—100 curated coding tasks with known-good solutions, run nightly in CI. Second, LLM-as-judge scoring on open-ended tasks where there is no single right answer. Third, cost and latency tracking per task category, with alerts when median cost drifts above threshold.”
Minutes 40—45: Ops and failure modes.
What happens when the LLM hallucinates a tool that does not exist? What happens when the sandbox runs out of disk? What happens when the agent gets stuck in a loop editing the same file? How do you debug a failure that only reproduces 10% of the time? Have answers for each.
What signals seniority: Discussing the security model unprompted. Mentioning eval before the interviewer asks. Reasoning about cost per task. Acknowledging that the agent will fail and designing for graceful degradation.
141.3.2 “Design a multi-tenant agent platform”
The framing: “You are building a platform that lets internal teams at a large company deploy their own agents. Each team defines their agent’s tools, prompts, and policies. The platform handles execution, billing, and safety.”
Key components to draw:
- Tool registry --- a catalog where teams register tools with JSON Schema definitions, auth requirements, and rate limits. Think of it as a service mesh for agent tools.
- Session state store --- each agent session has a conversation history, checkpoint state, and metadata. Must support resume-after-crash.
- Execution engine --- the agent loop runs in an isolated environment per tenant. Discuss multi-tenancy: shared-process (cheap, noisy-neighbor risk) vs per-tenant containers (expensive, isolated).
- Billing and metering --- track token usage, tool invocations, and wall-clock time per tenant. Aggregate for chargeback.
- Observability --- distributed tracing across agent loop, tool calls, and LLM requests. Every agent run produces a trace that can be replayed.
- Human-in-the-loop gateway --- some tools require approval before execution. The platform must support synchronous approval flows (pause the agent, notify the human, resume on approval).
Key trade-offs the interviewer will probe:
- Shared LLM pool vs per-tenant API keys (cost vs isolation)
- Synchronous vs asynchronous tool execution (latency vs safety)
- How to prevent one tenant’s agent from accessing another tenant’s tools
- How to handle a tool that takes 30 seconds to return (timeout? background execution? callback?)
141.3.3 “Design an agent evaluation system”
The framing: “Your team ships a coding agent. It is improving fast, but you have no way to know if a new prompt change helped or hurt. Design the eval system.”
The answer structure:
-
Golden task suite. Curate 100+ tasks across difficulty levels and categories (bug fix, feature addition, refactoring, test writing). Each task has a repo snapshot, a task description, and acceptance criteria (tests that must pass, files that must be modified, code that must not be deleted).
-
Trace capture. Every agent run in production and eval records a full trace: the conversation history, every tool call and result, the final outcome, the token count, the wall-clock time, the cost. Store traces as structured JSON in an append-only log.
-
Scoring pipeline. Three scoring layers: (a) deterministic checks (did the tests pass? did the agent stay within the cost budget?), (b) LLM-as-judge (give a separate LLM the task description and the agent’s output, ask it to score on correctness, code quality, and efficiency), (c) human review (sample 10% of eval runs for manual grading).
-
Regression CI. On every prompt or tool change, run the golden task suite. Compare pass rates, cost distributions, and LLM-as-judge scores against the baseline. Block the merge if pass rate drops by more than 2% or median cost increases by more than 20%.
-
Cost tracking dashboard. P50/P90/P99 cost per task category. Alerting on drift.
What signals seniority: Separating deterministic checks from LLM-based scoring. Discussing the variance in LLM-as-judge scores and how to reduce it (multiple judges, explicit rubrics). Mentioning that eval tasks must be versioned and that the golden suite needs ongoing curation.
141.3.4 “Design a customer support agent”
The framing: “Design an AI agent that handles customer support tickets for an e-commerce company. It can look up orders, process refunds, update shipping addresses, and escalate to a human.”
The key insight: This is a hybrid workflow-agent system. 95% of support tickets follow predictable patterns (order status, return request, address change). These should be workflows --- deterministic state machines with fixed tool call sequences. The remaining 5% are edge cases that require agent reasoning --- the LLM decides what to do based on the conversation. Saying “I’d just build an agent that handles everything” is the wrong answer.
Components:
- Intent classifier --- route the ticket to a workflow or the general agent. This can be a fine-tuned classifier or an LLM with a structured output schema.
- Workflow engine --- for common intents, execute a predefined sequence of tool calls. No LLM needed for most steps. Fast, cheap, predictable.
- Agent fallback --- for unrecognized intents or when the workflow gets stuck, hand off to the full agent loop.
- Tool layer --- order lookup, refund processing, address update, shipping status, CRM note creation.
- Safety rails --- the agent cannot issue a refund above $500 without human approval. The agent cannot access customer data from a different customer’s ticket. All tool calls are logged.
- Escalation --- if the agent cannot resolve the ticket in N turns, or if the customer asks for a human, escalate. Pass the full conversation and agent reasoning to the human agent.
What signals seniority: Proposing the hybrid workflow-agent pattern. Discussing cost (running GPT-4 on every “where is my order” ticket is burning money). Mentioning that safety rails must be enforced at the tool layer, not just in the prompt.
141.4 Coding interview questions for agent roles
The coding round tests whether you can implement the primitives that agent systems are built on. These are not LeetCode problems. They are systems-programming problems with an LLM twist.
141.4.1 “Implement a basic agent loop with tool calling”
This is the most common coding question. You have 20 minutes. The interviewer wants to see a clean ReAct loop with proper termination.
"""Basic agent loop with tool calling.
The candidate should produce something close to this in ~20 minutes.
Key things the interviewer watches for:
- Clean separation between LLM call and tool dispatch
- Explicit termination conditions (max iterations, no tool call)
- Error handling in tool execution
- Message history management
"""
import json
from typing import Any, Callable
# -- Simulated LLM client (in real code: anthropic.Anthropic() or openai.OpenAI()) --
def call_llm(messages: list[dict], tools: list[dict]) -> dict:
"""Placeholder for LLM API call. Returns an assistant message."""
# In production: client.messages.create(model=..., messages=..., tools=...)
raise NotImplementedError("Replace with real LLM call")
# -- Tool registry --
ToolFn = Callable[..., str]
class ToolRegistry:
def __init__(self):
self._tools: dict[str, ToolFn] = {}
self._schemas: dict[str, dict] = {}
def register(self, name: str, fn: ToolFn, schema: dict):
self._tools[name] = fn
self._schemas[name] = schema
def execute(self, name: str, arguments: dict[str, Any]) -> str:
if name not in self._tools:
return json.dumps({"error": f"Unknown tool: {name}"})
try:
result = self._tools[name](**arguments)
return result if isinstance(result, str) else json.dumps(result)
except Exception as e:
return json.dumps({"error": f"{type(e).__name__}: {e}"})
def schemas(self) -> list[dict]:
return [
{"name": name, "input_schema": schema}
for name, schema in self._schemas.items()
]
# -- Agent loop --
def agent_loop(
user_message: str,
registry: ToolRegistry,
max_iterations: int = 20,
max_tool_errors: int = 3,
) -> str:
"""Run a ReAct agent loop until the LLM produces a final text response
or a termination condition is hit."""
messages: list[dict] = [
{"role": "user", "content": user_message},
]
tool_error_count = 0
for i in range(max_iterations):
# 1. Call the LLM
response = call_llm(messages, tools=registry.schemas())
# 2. Check if the LLM wants to use a tool
if response.get("stop_reason") == "tool_use":
# Extract tool call(s) from the response
tool_calls = [
block for block in response.get("content", [])
if block.get("type") == "tool_use"
]
# Add the assistant message (with tool_use blocks) to history
messages.append({"role": "assistant", "content": response["content"]})
# 3. Execute each tool and collect results
tool_results = []
for tc in tool_calls:
result_str = registry.execute(tc["name"], tc.get("input", {}))
if '"error"' in result_str:
tool_error_count += 1
tool_results.append({
"type": "tool_result",
"tool_use_id": tc["id"],
"content": result_str,
})
messages.append({"role": "user", "content": tool_results})
# 4. Check consecutive-error termination
if tool_error_count >= max_tool_errors:
return (
f"Agent stopped: {tool_error_count} tool errors. "
f"Last error: {tool_results[-1]['content']}"
)
else:
# 5. The LLM produced a final text response -- we are done
text_blocks = [
block["text"]
for block in response.get("content", [])
if block.get("type") == "text"
]
return "\n".join(text_blocks)
return "Agent stopped: reached maximum iteration limit."
Discussion points the interviewer will raise:
- Why
max_iterations? Without it, a confused LLM can loop forever. Real systems use 20—100 depending on task complexity. - How would you add a cost cap? Track cumulative input + output tokens after each LLM call. Break if cost exceeds budget.
- How would you add no-progress detection? Hash the last N tool call + result pairs. If the same hash repeats, the agent is stuck.
- Parallel tool calls? Some LLMs emit multiple tool_use blocks. You can execute them concurrently with
asyncio.gather--- but mention that tool execution order may matter (you cannot write a file and read it in parallel).
141.4.2 “Implement a tool registry with schema validation”
"""Tool registry with JSON Schema validation.
The interviewer wants to see:
- Dynamic registration of tools
- Schema validation before dispatch (fail fast, not at tool execution)
- Clean error messages for missing/invalid arguments
"""
import json
from typing import Any, Callable
from jsonschema import validate, ValidationError # pip install jsonschema
class ValidatingToolRegistry:
"""Registry that validates tool inputs against JSON Schema before dispatch."""
def __init__(self):
self._tools: dict[str, Callable] = {}
self._schemas: dict[str, dict] = {}
self._descriptions: dict[str, str] = {}
def register(
self,
name: str,
fn: Callable,
parameters_schema: dict,
description: str = "",
):
"""Register a tool with its JSON Schema for input validation."""
self._tools[name] = fn
self._schemas[name] = parameters_schema
self._descriptions[name] = description
def list_tools(self) -> list[dict]:
"""Return tool definitions in the format LLMs expect."""
return [
{
"name": name,
"description": self._descriptions.get(name, ""),
"input_schema": self._schemas[name],
}
for name in self._tools
]
def call(self, name: str, arguments: dict[str, Any]) -> dict:
"""Validate inputs, dispatch to the tool, return structured result."""
# Check tool exists
if name not in self._tools:
available = ", ".join(sorted(self._tools.keys()))
return {
"is_error": True,
"content": f"Unknown tool '{name}'. Available: {available}",
}
# Validate against schema
schema = self._schemas[name]
try:
validate(instance=arguments, schema=schema)
except ValidationError as e:
return {
"is_error": True,
"content": f"Validation error for '{name}': {e.message}",
}
# Execute
try:
result = self._tools[name](**arguments)
return {"is_error": False, "content": str(result)}
except Exception as e:
return {
"is_error": True,
"content": f"Tool '{name}' raised {type(e).__name__}: {e}",
}
# -- Example usage --
def read_file(path: str, offset: int = 0, limit: int = 200) -> str:
"""Read lines from a file."""
with open(path) as f:
lines = f.readlines()[offset : offset + limit]
return "".join(f"{offset + i + 1:>4} | {line}" for i, line in enumerate(lines))
registry = ValidatingToolRegistry()
registry.register(
name="read_file",
fn=read_file,
parameters_schema={
"type": "object",
"properties": {
"path": {"type": "string", "description": "Absolute file path"},
"offset": {"type": "integer", "minimum": 0, "default": 0},
"limit": {"type": "integer", "minimum": 1, "maximum": 1000, "default": 200},
},
"required": ["path"],
},
description="Read lines from a file with optional offset and limit.",
)
# Valid call
print(registry.call("read_file", {"path": "/etc/hostname"}))
# Invalid call -- triggers schema validation error
print(registry.call("read_file", {"path": 123}))
# -> {'is_error': True, 'content': "Validation error for 'read_file': 123 is not of type 'string'"}
# Unknown tool
print(registry.call("delete_everything", {}))
# -> {'is_error': True, 'content': "Unknown tool 'delete_everything'. Available: read_file"}
141.4.3 “Implement context window management with summarization”
"""Context window manager with summarization checkpoints.
The interviewer wants to see:
- Token counting (approximate is fine)
- Truncation strategy that preserves system prompt and recent turns
- Summarization as a compression mechanism
- Awareness of what information is lost
"""
import tiktoken # pip install tiktoken
class ContextWindowManager:
"""Manages conversation history to stay within a token budget."""
def __init__(
self,
max_tokens: int = 128_000,
reserve_for_response: int = 4_096,
summarize_fn=None, # async fn(messages) -> str
):
self.max_tokens = max_tokens
self.reserve = reserve_for_response
self.budget = max_tokens - reserve_for_response
self.summarize_fn = summarize_fn
self._encoder = tiktoken.get_encoding("cl100k_base")
def count_tokens(self, text: str) -> int:
return len(self._encoder.encode(text))
def count_message_tokens(self, messages: list[dict]) -> int:
total = 0
for msg in messages:
content = msg.get("content", "")
if isinstance(content, str):
total += self.count_tokens(content)
elif isinstance(content, list):
for block in content:
if isinstance(block, dict):
total += self.count_tokens(
block.get("text", "") + block.get("content", "")
)
total += 4 # per-message overhead
return total
def trim_to_budget(self, messages: list[dict]) -> list[dict]:
"""Drop oldest non-system messages until we fit in the budget.
Preserves: system message (index 0), last user message, last 4 turns.
Drops from the front of the middle section.
"""
total = self.count_message_tokens(messages)
if total <= self.budget:
return messages
# Never drop the system prompt or the last 4 messages
system = [messages[0]] if messages[0].get("role") == "system" else []
protected_tail = messages[-4:]
middle = messages[len(system) : -4] if len(messages) > 4 + len(system) else []
# Drop oldest middle messages one at a time
while middle and self.count_message_tokens(system + middle + protected_tail) > self.budget:
middle.pop(0)
return system + middle + protected_tail
async def summarize_and_compact(self, messages: list[dict]) -> list[dict]:
"""Replace old turns with a summary to free token budget.
Strategy:
1. Keep system prompt and last 6 turns intact.
2. Summarize everything in between into a single assistant message.
3. Prepend the summary as context.
"""
if not self.summarize_fn:
return self.trim_to_budget(messages)
system = [messages[0]] if messages[0].get("role") == "system" else []
recent = messages[-6:]
old = messages[len(system) : -6]
if not old:
return self.trim_to_budget(messages)
summary_text = await self.summarize_fn(old)
summary_msg = {
"role": "user",
"content": (
f"[Context summary of {len(old)} earlier messages]\n{summary_text}"
),
}
compacted = system + [summary_msg] + recent
return self.trim_to_budget(compacted)
def truncate_tool_output(output: str, max_lines: int = 200, max_chars: int = 20_000) -> str:
"""Truncate long tool outputs to prevent context blowup.
A grep that returns 50,000 lines should not fill the entire context window.
"""
lines = output.split("\n")
if len(lines) <= max_lines and len(output) <= max_chars:
return output
kept = lines[:max_lines]
truncated = "\n".join(kept)
if len(truncated) > max_chars:
truncated = truncated[:max_chars]
omitted = len(lines) - max_lines
return f"{truncated}\n\n[... {omitted} more lines truncated ...]"
141.4.4 “Implement a retry-with-backoff wrapper for flaky tool calls”
"""Retry wrapper with exponential backoff and jitter.
The interviewer is testing:
- Awareness that external tools fail (APIs, sandboxes, file systems)
- Correct backoff math (exponential, not linear)
- Jitter to prevent thundering herd
- Distinguishing retryable from non-retryable errors
"""
import asyncio
import random
import time
from typing import TypeVar, Callable, Awaitable
from functools import wraps
T = TypeVar("T")
class RetryableError(Exception):
"""Mark an error as safe to retry."""
pass
class NonRetryableError(Exception):
"""Mark an error as not safe to retry (e.g., invalid input)."""
pass
async def retry_with_backoff(
fn: Callable[..., Awaitable[T]],
*args,
max_retries: int = 3,
base_delay: float = 1.0,
max_delay: float = 30.0,
jitter: float = 0.5,
retryable_exceptions: tuple = (RetryableError, TimeoutError, ConnectionError),
**kwargs,
) -> T:
"""Call an async function with exponential backoff on retryable failures.
Delay schedule: base_delay * 2^attempt + random jitter
Gives up after max_retries and re-raises the last exception.
"""
last_exception = None
for attempt in range(max_retries + 1):
try:
return await fn(*args, **kwargs)
except retryable_exceptions as e:
last_exception = e
if attempt == max_retries:
break
delay = min(base_delay * (2 ** attempt), max_delay)
delay += random.uniform(0, jitter * delay)
await asyncio.sleep(delay)
except Exception:
# Non-retryable -- fail immediately
raise
raise last_exception # type: ignore[misc]
# -- Decorator variant (convenient for tool definitions) --
def retryable(
max_retries: int = 3,
base_delay: float = 1.0,
retryable_exceptions: tuple = (RetryableError, TimeoutError),
):
def decorator(fn: Callable[..., Awaitable[T]]) -> Callable[..., Awaitable[T]]:
@wraps(fn)
async def wrapper(*args, **kwargs) -> T:
return await retry_with_backoff(
fn, *args,
max_retries=max_retries,
base_delay=base_delay,
retryable_exceptions=retryable_exceptions,
**kwargs,
)
return wrapper
return decorator
# -- Example usage --
@retryable(max_retries=3, base_delay=0.5)
async def call_external_api(url: str) -> dict:
"""Simulated flaky API call."""
import httpx
async with httpx.AsyncClient(timeout=10) as client:
resp = await client.get(url)
if resp.status_code == 429:
raise RetryableError(f"Rate limited: {resp.status_code}")
if resp.status_code >= 500:
raise RetryableError(f"Server error: {resp.status_code}")
if resp.status_code >= 400:
raise NonRetryableError(f"Client error: {resp.status_code}")
return resp.json()
141.4.5 “Implement a simple sandbox that runs user code safely”
"""Simple sandbox for executing untrusted code via subprocess.
The interviewer wants to see:
- Process isolation (subprocess, not exec())
- Timeout enforcement
- Resource limits (memory, CPU)
- Output capture and truncation
- Awareness of what this does NOT protect against
"""
import subprocess
import tempfile
import os
import resource
from dataclasses import dataclass
from typing import Optional
@dataclass
class SandboxResult:
stdout: str
stderr: str
exit_code: int
timed_out: bool
wall_time_seconds: float
def set_resource_limits():
"""Called in the child process before exec. Sets memory and CPU limits."""
# 512 MB memory limit
mem_limit = 512 * 1024 * 1024
resource.setrlimit(resource.RLIMIT_AS, (mem_limit, mem_limit))
# 30 seconds CPU time
resource.setrlimit(resource.RLIMIT_CPU, (30, 30))
# No new child processes
resource.setrlimit(resource.RLIMIT_NPROC, (0, 0))
def run_in_sandbox(
code: str,
timeout_seconds: float = 30.0,
max_output_bytes: int = 100_000,
working_dir: Optional[str] = None,
) -> SandboxResult:
"""Execute Python code in a subprocess sandbox.
Security notes (state these in the interview):
- This is a MINIMUM viable sandbox for interview purposes.
- Production systems use containers (Docker) or microVMs (Firecracker).
- This does NOT prevent filesystem access outside working_dir.
- This does NOT prevent network access.
- For real isolation, use seccomp-bpf, namespaces, or a VM.
"""
import time
# Write code to a temp file
work_dir = working_dir or tempfile.mkdtemp(prefix="sandbox_")
code_path = os.path.join(work_dir, "_sandbox_script.py")
with open(code_path, "w") as f:
f.write(code)
start = time.monotonic()
timed_out = False
try:
proc = subprocess.run(
["python3", "-u", code_path],
capture_output=True,
timeout=timeout_seconds,
cwd=work_dir,
preexec_fn=set_resource_limits,
env={
"PATH": "/usr/bin:/bin",
"HOME": work_dir,
"PYTHONDONTWRITEBYTECODE": "1",
# Restrict env -- no API keys, no cloud credentials
},
)
stdout = proc.stdout.decode("utf-8", errors="replace")[:max_output_bytes]
stderr = proc.stderr.decode("utf-8", errors="replace")[:max_output_bytes]
exit_code = proc.returncode
except subprocess.TimeoutExpired:
timed_out = True
stdout = ""
stderr = f"Execution timed out after {timeout_seconds}s"
exit_code = -1
wall_time = time.monotonic() - start
return SandboxResult(
stdout=stdout,
stderr=stderr,
exit_code=exit_code,
timed_out=timed_out,
wall_time_seconds=round(wall_time, 3),
)
# -- Example --
if __name__ == "__main__":
result = run_in_sandbox("print('hello from sandbox')")
print(f"stdout: {result.stdout}")
print(f"exit_code: {result.exit_code}")
print(f"wall_time: {result.wall_time_seconds}s")
# This should timeout
result = run_in_sandbox("import time; time.sleep(999)", timeout_seconds=2.0)
print(f"timed_out: {result.timed_out}")
Edge cases the interviewer will ask about:
- What if the code forks a subprocess that outlives the parent? The
RLIMIT_NPROC=0prevents this, but mention that process groups (os.killpg) are a more robust solution. - What if the code writes 10 GB to disk? Add
RLIMIT_FSIZEor use a tmpfs with a size limit. - What about network access? In production, use network namespaces or firewall rules. In this sandbox, acknowledge the gap.
141.5 Deep-dive technical questions
These are the “tell me how X works” questions that test whether you actually understand the systems you build or just copy-paste from tutorials.
”How does tool calling work end-to-end?”
The answer the interviewer wants:
-
System prompt injection. The tools are serialized as JSON Schema and injected into the system prompt (or a dedicated
toolsparameter). The LLM sees a description of each tool --- its name, its parameters, and what it does. -
Structured generation. During decoding, when the model decides to call a tool, it emits a structured block (e.g.,
{"type": "tool_use", "name": "read_file", "input": {"path": "/foo.py"}}). Some providers use constrained decoding to guarantee valid JSON; others rely on instruction-following and parse hopefully. -
Stop reason. The model signals “I want to use a tool” via a stop reason (
tool_usein Anthropic’s API,tool_callsin OpenAI’s). The client must check this to know whether to dispatch tools or display the response. -
Client-side dispatch. The client code (your agent loop) extracts the tool name and arguments, validates them against the schema, and calls the corresponding function.
-
Result injection. The tool result is formatted as a message (typically
role: toolorrole: userwith atool_resultblock) and appended to the conversation. The LLM sees it on the next turn and can reason about it. -
Loop continuation. Steps 2—5 repeat until the model produces a final text response (stop reason
end_turn) or a termination condition is hit.
”What’s the difference between MCP and A2A?”
MCP (Model Context Protocol) connects an agent to tools and resources --- deterministic functions with defined inputs and outputs. It is the USB port. The interaction is single-turn: call a function, get a result.
A2A (Agent2Agent) connects an agent to another agent --- an opaque reasoning system that may negotiate, ask clarifying questions, stream partial results, or refuse. It is the diplomat-to-diplomat channel. The interaction is multi-turn with a task lifecycle (submitted, working, input-needed, completed, failed).
Use MCP when the remote capability is a database query, an API call, or a file operation. Use A2A when the remote capability is itself an LLM-based reasoning system that needs to think, negotiate, or produce artifacts over multiple turns.
”How do you prevent prompt injection in an agent that reads untrusted documents?”
Layered defense:
- Input isolation. Mark untrusted content explicitly: wrap it in delimiters (
<untrusted_document>...</untrusted_document>) and instruct the LLM to treat it as data, not instructions. - Tool-level access control. Even if the prompt injection succeeds and the LLM tries to call a dangerous tool, the tool layer enforces permissions. The agent reading a document should not have access to the
send_emailtool. - Output filtering. Before executing any tool call that resulted from processing untrusted input, check whether the tool call matches the expected pattern. A
read_fileagent suddenly callingdelete_fileis suspicious. - Separate context. Process untrusted documents in a separate LLM call with a restricted tool set, then pass only the extracted information (not the raw document) to the main agent.
”How do you handle context window exhaustion in a long-running agent?”
Three strategies, in order of preference:
- Truncation with preservation. Drop old middle turns, keep the system prompt and the last N turns. Cheap and fast, but loses information.
- Summarization checkpoints. Every K turns, summarize the conversation so far into a paragraph and replace the old turns with the summary. More expensive (requires an LLM call) but preserves key information.
- External memory. Write important findings to a scratchpad file or key-value store. The agent can read them back later without relying on the conversation history. This is what production coding agents actually do.
”Workflow vs agent: when do you use each?”
Workflow: The task has a known sequence of steps. The inputs and outputs are predictable. The error modes are enumerable. Example: process a refund (check order status -> verify return window -> issue refund -> send email). Use a workflow.
Agent: The task requires reasoning about ambiguous input, dynamic tool selection, or multi-step problem solving where the steps are not known in advance. Example: “fix this bug” (requires reading code, understanding the error, reasoning about the fix, testing, iterating). Use an agent.
The hybrid pattern: Most production systems use both. Route predictable tasks to workflows and unpredictable tasks to agents. The workflow is faster, cheaper, and more predictable; the agent handles the long tail.
”How would you implement human-in-the-loop for an agent that can delete files?”
- Classify tools by risk level. Read-only tools (file_read, search) are low-risk: auto-approve. Mutating tools (file_write) are medium-risk: log and allow. Destructive tools (file_delete, git_push) are high-risk: require approval.
- Pause-and-resume. When the agent calls a high-risk tool, pause the agent loop, serialize the current state (conversation history, pending tool call), send an approval request to the human (via Slack, email, or UI), and wait. On approval, resume. On rejection, feed a “tool call rejected” result back to the LLM and let it adapt.
- Progressive autonomy. Start with all mutating tools requiring approval. As confidence grows (based on eval results and audit logs), downgrade specific tools to auto-approve. This is the “earn trust” model.
”How do you evaluate an agent?”
Five metrics that matter:
- Task success rate. Did the agent complete the task correctly? Measured by deterministic checks (tests pass, expected output produced) and LLM-as-judge.
- Cost per task. Total token cost (input + output) plus tool execution costs. Track P50/P90/P99.
- Latency. Wall-clock time from task submission to completion.
- Tool call efficiency. How many tool calls did the agent need to complete the task? Fewer is better (less cost, less latency).
- Safety violations. Did the agent attempt any tool calls it should not have? Did it leak any sensitive information? This is a binary metric: any violation is a regression.
”How do you debug a non-deterministic agent failure?”
- Capture the full trace. Every production agent run must log the complete conversation history, every tool call and result, and every LLM response including stop reasons and token counts.
- Replay with the same inputs. Feed the exact same conversation history (up to the failure point) to the LLM and see if it reproduces. If it does, it is a prompt problem. If it does not, it is a non-determinism problem (temperature > 0).
- Binary search on the trace. Find the first tool call or LLM response that diverged from the expected path. That is your root cause point.
- Temperature zero for debugging. Set temperature to 0 during replay to eliminate random variation. If the bug reproduces at temperature 0, the root cause is in the prompt or tool results, not sampling randomness.
- Add assertions. Once you find the root cause, add a check to the agent loop or tool layer that catches this class of failure early.
141.6 The “agent failure” behavioral questions
Behavioral questions in agent interviews follow the same STAR structure (Situation, Task, Action, Result) as any behavioral interview, but the content is different. Interviewers are specifically probing whether you have dealt with the unique failure modes of non-deterministic systems.
”Tell me about an agent that failed in production”
Framework for answering:
- Situation: Describe the agent, its purpose, and the scale it was operating at.
- Root cause: Be specific. “The agent got stuck in a loop calling the same search query because the results were always slightly different, so the no-progress detector did not trigger” is a good root cause. “The agent did not work well” is not.
- Fix: What did you change? Ideally both the immediate fix (patched the no-progress detector to hash the tool name and arguments rather than the full result) and the systemic fix (added a cost cap so stuck agents cannot spend more than $5 per task).
- Prevention: What did you add to prevent this class of failure? (Eval task that tests loop detection, monitoring dashboard for per-task cost, alert when any agent run exceeds 50 iterations.)
”How do you decide between building an agent vs a workflow?”
The answer they want: “I start with a workflow. I only reach for an agent when the task requires dynamic reasoning about ambiguous input. The decision framework is: Can I enumerate the steps in advance? If yes, workflow. Does the task require reading and interpreting free-form data to decide the next step? If yes, agent. For most production systems, I use the hybrid pattern --- a workflow handles the 95% case, and an agent handles the 5% that falls outside the workflow’s coverage."
"How do you handle the cost of agents at scale?”
Talk about: per-task cost budgets (hard cap, not just monitoring), model tiering (use a cheaper model for simple routing decisions, expensive model for complex reasoning), caching (identical tool calls return cached results), context window management (summarization reduces input tokens), and workflow-agent hybrid (the workflow path costs near-zero per task).
”Tell me about a time you had to debug a non-deterministic system”
The interviewer is testing whether you have a systematic approach or just re-run and hope. The right answer involves: trace capture, deterministic replay, binary search on the trace, and adding assertions to prevent recurrence. Mention temperature-zero replay as a debugging technique.
Adapting STAR for agent work
Standard STAR asks “what did you do?” Agent STAR adds two extra dimensions:
- Non-determinism acknowledgment. “The failure only reproduced ~30% of the time, so I could not just re-run the test case. I captured full traces from 100 runs, filtered for the failing cases, and diff’d the traces against the passing cases to find the divergence point.”
- Systemic prevention. “Beyond fixing the immediate bug, I added an eval task that specifically tests this failure mode, a cost cap that prevents runaway agent runs, and monitoring that alerts when any agent run exceeds 2x the median cost for its task category.”
141.7 The vocabulary that signals agent expertise
There are phrases that, when used precisely and at the right moment, make an interviewer think “this person has built this for real.” Here are the phrases, organized by topic, with guidance on when to use each.
Agent loop and control flow:
| Phrase | When to use it |
|---|---|
| ReAct loop with max-iteration cap | Describing any agent loop architecture. Shows you know the pattern has a name and that unbounded loops are dangerous. |
| No-progress detection | Discussing termination. The agent hashes recent actions and halts if it is repeating itself. |
| Checkpoint-and-resume | Designing long-running agents. Serialize state so the agent can survive crashes, timeouts, or human-in-the-loop pauses. |
| Cost cap as a circuit breaker | Discussing cost control. A per-run token budget that halts the agent before it bankrupts you. |
| Graceful degradation on tool failure | Error handling. The agent should not crash when a tool returns an error --- it should reason about the failure and try an alternative. |
| Progressive autonomy | Human-in-the-loop design. Start supervised, earn trust, gradually increase the agent’s permission set. |
| The hybrid workflow-agent pattern | System design for production systems. Route the 95% predictable case to a workflow, the 5% long tail to an agent. |
| Fan-out with budget subdivision | Multi-agent design. When an orchestrator delegates to sub-agents, it divides its remaining budget among them. |
Tool design and security:
| Phrase | When to use it |
|---|---|
| Tool schema as the contract surface | Discussing tool design. The JSON Schema is the API contract between the LLM and your code. |
| Capability-based security with least-privilege tool access | Security discussions. Each agent session gets only the tools it needs, not all tools in the registry. |
| Tool-level access control | Prompt injection defense. Even if the LLM is confused, the tool layer enforces what is allowed. |
| Idempotent tool design | Retry safety. Tools that can be safely re-called produce the same effect. |
| Tool output truncation | Context window management. A grep that returns 50,000 lines must be truncated before entering the context. |
| Schema validation before dispatch | Fail fast. Validate tool inputs against JSON Schema before calling the function. |
| Side-effect classification | Human-in-the-loop design. Classify tools as read-only, mutating, or destructive to determine approval requirements. |
Context and memory:
| Phrase | When to use it |
|---|---|
| Context window as working memory | Explaining why agents have limited “attention.” The context window is the agent’s RAM --- it forgets what falls off the end. |
| Summarization checkpoints | Context management. Periodically compress old turns into a summary to free token budget. |
| External scratchpad for persistent memory | Long-running agents. Write findings to a file or KV store so they survive context truncation. |
| Sliding window with system prompt anchoring | Token budget management. The system prompt is always preserved; middle turns get dropped. |
| Token budget allocation | System design. Reserve X tokens for system prompt, Y for tool schemas, Z for conversation, and W for the response. |
Evaluation and observability:
| Phrase | When to use it |
|---|---|
| Trace-based eval with LLM-as-judge scoring | Eval design. Capture full traces, score with a separate LLM, aggregate. |
| Golden task suite with regression CI | Eval infrastructure. A curated set of tasks run on every change. |
| Deterministic replay for debugging | Investigating failures. Feed the exact conversation history to the LLM and see if it reproduces. |
| Cost-per-task distribution monitoring | Ops. Track P50/P90/P99 cost, alert on drift. |
| Tool call efficiency metric | Eval. Fewer tool calls for the same result = better agent. |
| Temperature-zero bisection | Debugging non-determinism. Set temperature to 0 and binary search the trace for the divergence point. |
Architecture and infrastructure:
| Phrase | When to use it |
|---|---|
| Sandbox with network isolation | Coding agent design. The agent’s code execution environment cannot reach the internet. |
| MicroVM for strong tenant isolation | Multi-tenant platforms. Firecracker VMs give better isolation than containers. |
| Event-sourced agent state | Checkpoint design. Store the sequence of events (tool calls, results) rather than snapshots of state. |
| Tool registry with dynamic dispatch | Platform design. Tools are registered at runtime and dispatched by name. |
| Streaming UX with thinking tokens | User experience. Show the agent’s reasoning and actions in real time. |
| Rate limiting per-tenant per-model | Multi-tenant cost control. Prevent one tenant from consuming all LLM capacity. |
| Approval gateway for high-risk actions | Safety. A middleware that pauses the agent and routes destructive actions to a human. |
| Observability-first design | General principle. Every agent action emits a structured trace event before you even think about the happy path. |
141.8 Mock interview transcript: “Design a coding agent”
45-minute system design round. Annotated with [SIGNAL] and [TESTING] notes.
Interviewer: Let’s say we want to build a coding agent --- something like Claude Code or Devin. The user gives it a task in natural language, and it writes and modifies code to accomplish it. How would you design this?
Candidate: Before I dive in, let me clarify a few things. Is this an interactive agent where the user watches in real time, or a batch agent that runs in the background? Is it operating on an existing repository, or can it create projects from scratch? And what is the target environment --- local developer machine, cloud, or both?
[GOOD SIGNAL: Clarifies before designing. Shows this is not their first system design.] [TESTING: Can they scope a problem?]
Interviewer: Interactive, real-time. It works on an existing repo. Cloud-hosted, but the user sees the agent’s actions streamed to their terminal.
Candidate: Got it. Let me sketch the high-level architecture and then we can drill into whichever areas are most interesting.
I see five core components. First, the agent loop --- a ReAct loop that calls an LLM, gets back either a text response or a tool call, executes the tool, and feeds the result back. It has a max-iteration cap and a cost cap as circuit breakers.
Second, the tool layer. For a coding agent, the essential tools are: file read, file write (or edit), terminal (run shell commands), search (grep/glob across the repo), and git operations. Each tool has a JSON Schema defining its inputs. I would use a tool registry with schema validation before dispatch --- fail fast if the LLM generates invalid arguments.
Third, the sandbox. The terminal tool runs user-generated commands, so it must be isolated. For a cloud deployment, I would use a container per session with network restrictions, resource limits (CPU, memory, wall-clock timeout), and a filesystem snapshot for rollback. For stronger isolation, Firecracker microVMs --- ~125ms boot time, proper kernel boundary.
Fourth, the context manager. A coding agent on a large repo burns through tokens fast. My strategy: never dump an entire file into context --- read only the relevant line range. Truncate tool outputs to a configurable limit (200 lines by default for terminal output). Add summarization checkpoints every N turns --- compress old conversation into a paragraph to free token budget. Keep the system prompt and the last several turns always in the window.
Fifth, the eval and observability layer. Every agent run records a full trace: the conversation history, every tool call and its result, token counts, cost, and wall-clock time. We run a golden task suite nightly --- 100+ coding tasks with known-good solutions --- and track pass rate and cost regression over time.
[GOOD SIGNAL: Five components, named cleanly, with concrete details. Mentions cost cap, schema validation, and eval unprompted.] [TESTING: Can they see the full system, not just the LLM part?]
Interviewer: Good overview. Let’s drill into the sandbox. What happens if the agent runs a command that forks a background process?
Candidate: That is a real problem. If the agent runs something like nohup python server.py &, the process outlives the command and keeps consuming resources. Three defenses: first, run the command inside a process group and kill the entire group when the command returns or times out --- os.killpg(pgid, signal.SIGKILL). Second, set RLIMIT_NPROC to limit the number of child processes. Third, in the container, set a total memory limit via cgroups so even if orphan processes survive, they cannot consume unbounded memory.
For the sandbox lifecycle: when the agent session starts, we spin up a fresh container with the repo cloned into it. When the session ends, we snapshot the container state (in case we need to debug or resume), then destroy it. If the session is idle for 10 minutes, we checkpoint and hibernate.
[GOOD SIGNAL: Specific Unix knowledge (process groups, RLIMIT_NPROC, cgroups). Not hand-waving “just use Docker.”] [TESTING: Do they understand OS-level isolation, or just high-level concepts?]
Interviewer: How would you handle the case where the agent gets stuck in a loop --- editing the same file over and over without making progress?
Candidate: This is one of the most common failure modes in coding agents. I would implement a no-progress detector with two checks. First, hash the tool name and arguments of the last N tool calls (say N = 5). If we see the same hash appear three or more times, the agent is probably stuck. Second, after each edit, run the relevant tests. If the tests keep failing with the same error after three consecutive edits, inject a meta-message into the conversation: “You have attempted to fix this issue three times with similar approaches. Consider a different strategy or ask the user for guidance.”
I would also have a hard max-iteration cap --- say 100 iterations for a single task. And the cost cap as a backstop: if the agent has spent $5 and has not resolved the task, halt and report.
The key insight is that stuck detection must be at the tool layer, not in the prompt. Telling the LLM “don’t get stuck” in the system prompt does not work reliably. You need a programmatic check.
[GOOD SIGNAL: “Stuck detection must be at the tool layer, not in the prompt.” This is the kind of insight that comes from production experience.] [TESTING: Do they understand that LLMs are unreliable and you need programmatic safeguards?]
Interviewer: Let’s talk about eval. How do you know if a new prompt change made the agent better or worse?
Candidate: I would set up a three-layer eval system. The first layer is deterministic checks --- for each golden task, we have acceptance criteria: specific tests that must pass, specific files that must be modified, specific code patterns that must appear. These are binary: pass or fail.
The second layer is LLM-as-judge. For tasks where there is no single right answer --- like “refactor this function for readability” --- we use a separate LLM to score the agent’s output on a rubric. The rubric has specific dimensions: correctness (does the code work?), quality (is it clean?), efficiency (did the agent use a reasonable number of tool calls?). I would use multiple judges and average the scores to reduce variance.
The third layer is cost and latency tracking. A prompt change that increases pass rate by 5% but doubles the cost per task is not necessarily a win. We track P50/P90/P99 cost and latency distributions and surface them alongside pass rates.
For the CI pipeline: on every prompt or tool change, run the golden task suite (or a fast subset of 20 tasks for PRs, full suite nightly). Block the merge if pass rate drops by more than 2% or median cost increases by more than 20%.
[GOOD SIGNAL: Three-layer eval, specific thresholds for CI blocking, awareness of cost-accuracy trade-off.] [TESTING: Do they have a real eval strategy, or just “we’ll add metrics later”?]
Interviewer: What about the streaming UX? The user is watching in real time. What do they see?
Candidate: The user sees three types of events streamed to their terminal. First, the agent’s reasoning --- the text the LLM generates before deciding on a tool call. In the Anthropic API this is the thinking block; in other APIs you can extract the text content that precedes the tool call. This helps the user understand why the agent is taking an action.
Second, the tool call itself --- the user sees something like “Reading file: src/main.py lines 42-60” or “Running: pytest tests/test_auth.py”. This shows what the agent is doing.
Third, the tool result --- the output of the command, the content of the file, the test results. Truncated if too long, with a note saying “output truncated, showing first 50 lines.”
The architecture is event-driven: the agent loop emits events (thinking, tool_call, tool_result, final_answer) onto a stream. The client consumes the stream via SSE or WebSocket. This decouples the agent loop from the presentation layer --- the same stream can feed a terminal UI, a web UI, or a log file.
[GOOD SIGNAL: Concrete event types, SSE/WebSocket mention, decoupling of loop from presentation.]
Interviewer: One last question. If you had to ship a V1 in two weeks, what would you cut?
Candidate: I would ship with: a basic ReAct loop with a max-iteration cap, five core tools (file read, file write, terminal, search, git status), a container-based sandbox with resource limits, and simple token truncation (no summarization --- just drop old turns if the window fills up). No eval CI, no LLM-as-judge, no streaming UX --- just a synchronous loop that prints the final answer.
The eval suite and streaming UX come in week three. Summarization-based context management comes when we see users hitting context limits on real tasks. The key is to ship the simplest agent that works end-to-end and instrument it to learn what breaks.
[GOOD SIGNAL: Pragmatic prioritization. Ships the loop, sandbox, and core tools first. Defers complexity.] [TESTING: Can they scope a V1? Over-engineering is a red flag for agent SWE roles.]
141.9 Mock interview transcript: “Implement an agent loop”
30-minute coding round. The candidate shares their screen and codes in a real editor.
Interviewer: I would like you to implement a basic agent loop. The agent takes a user message, calls an LLM, and can use tools. Don’t worry about actually calling a real LLM --- you can stub that out. Focus on the loop structure, tool dispatch, and termination conditions. You have 30 minutes.
Candidate: Alright. Let me start with the types and work my way up to the loop.
[Types first. Good.]
# The candidate starts typing...
from dataclasses import dataclass
from typing import Any, Callable
import json
@dataclass
class ToolCall:
id: str
name: str
arguments: dict[str, Any]
@dataclass
class ToolResult:
tool_call_id: str
content: str
is_error: bool = False
@dataclass
class LLMResponse:
"""Represents what comes back from the LLM API."""
text: str | None = None # Final text response (if no tool call)
tool_calls: list[ToolCall] | None = None # Tool calls (if any)
input_tokens: int = 0
output_tokens: int = 0
Interviewer: Good. Keep going.
class AgentLoop:
def __init__(
self,
llm_fn: Callable, # (messages, tool_schemas) -> LLMResponse
tools: dict[str, Callable], # name -> function
tool_schemas: list[dict], # JSON Schema definitions
max_iterations: int = 30,
max_cost_usd: float = 5.0,
cost_per_input_token: float = 3e-6, # ~$3/M input tokens
cost_per_output_token: float = 15e-6, # ~$15/M output tokens
):
self.llm_fn = llm_fn
self.tools = tools
self.tool_schemas = tool_schemas
self.max_iterations = max_iterations
self.max_cost_usd = max_cost_usd
self.cost_per_input_token = cost_per_input_token
self.cost_per_output_token = cost_per_output_token
def run(self, user_message: str) -> str:
messages = [{"role": "user", "content": user_message}]
total_cost = 0.0
recent_tool_hashes: list[str] = []
for iteration in range(self.max_iterations):
# 1. Call LLM
response: LLMResponse = self.llm_fn(messages, self.tool_schemas)
# 2. Track cost
total_cost += (
response.input_tokens * self.cost_per_input_token
+ response.output_tokens * self.cost_per_output_token
)
if total_cost > self.max_cost_usd:
return f"[Agent stopped: cost cap exceeded (${total_cost:.2f})]"
# 3. If the LLM returned text only, we are done
if response.text and not response.tool_calls:
return response.text
# 4. Execute tool calls
if not response.tool_calls:
return response.text or "[Agent produced no output]"
results: list[ToolResult] = []
for tc in response.tool_calls:
result = self._execute_tool(tc)
results.append(result)
# Track for loop detection
call_hash = f"{tc.name}:{json.dumps(tc.arguments, sort_keys=True)}"
recent_tool_hashes.append(call_hash)
# 5. Loop detection: same call 3+ times in last 6 calls
if len(recent_tool_hashes) >= 6:
window = recent_tool_hashes[-6:]
for h in set(window):
if window.count(h) >= 3:
return "[Agent stopped: stuck in a loop]"
# 6. Append to message history
# Assistant message with tool calls
messages.append({
"role": "assistant",
"tool_calls": [
{"id": tc.id, "name": tc.name, "arguments": tc.arguments}
for tc in response.tool_calls
],
})
# Tool results
for r in results:
messages.append({
"role": "tool",
"tool_call_id": r.tool_call_id,
"content": r.content,
})
return f"[Agent stopped: max iterations ({self.max_iterations})]"
def _execute_tool(self, tc: ToolCall) -> ToolResult:
fn = self.tools.get(tc.name)
if fn is None:
return ToolResult(
tool_call_id=tc.id,
content=f"Error: unknown tool '{tc.name}'",
is_error=True,
)
try:
output = fn(**tc.arguments)
return ToolResult(
tool_call_id=tc.id,
content=str(output),
)
except Exception as e:
return ToolResult(
tool_call_id=tc.id,
content=f"Error: {type(e).__name__}: {e}",
is_error=True,
)
Candidate: That is the core loop. Let me walk through the termination conditions: we stop if (1) the LLM returns text without tool calls --- that is the normal exit, (2) we hit the max iteration cap, (3) the cumulative cost exceeds the budget, or (4) the loop detector finds the agent is repeating itself.
[GOOD SIGNAL: Four termination conditions, explained clearly. Cost tracking built in from the start.] [TESTING: Does the candidate think about termination, or just the happy path?]
Interviewer: Nice. What if the LLM returns multiple tool calls and they should be executed in parallel?
Candidate: Good question. Right now I am executing them sequentially. For parallel execution, I would wrap _execute_tool as an async method and use asyncio.gather:
import asyncio
async def _execute_tools_parallel(self, tool_calls: list[ToolCall]) -> list[ToolResult]:
tasks = [self._execute_tool_async(tc) for tc in tool_calls]
return await asyncio.gather(*tasks)
But there is a nuance: some tool calls have dependencies. If the LLM emits write_file("foo.py", ...) and run_command("python foo.py") in the same turn, running them in parallel would fail because the file does not exist yet when the command runs. In practice, I would execute them in the order the LLM emitted them unless we have explicit dependency information. Sequential is the safe default; parallel is an optimization for independent tools like multiple file reads.
[GOOD SIGNAL: Identifies the dependency problem with parallel tool calls. “Sequential is the safe default” is a mature answer.] [TESTING: Does the candidate understand the subtleties of concurrent tool execution?]
Interviewer: How would you add logging so you can debug failures later?
Candidate: I would add a trace recorder. Every event in the loop --- LLM call, tool call, tool result, termination --- emits a structured event to a list, and at the end we serialize it.
@dataclass
class TraceEvent:
event_type: str # "llm_call", "tool_call", "tool_result", "terminate"
iteration: int
data: dict
timestamp: float
class AgentLoop:
def __init__(self, ...):
...
self.trace: list[TraceEvent] = []
def _record(self, event_type: str, iteration: int, data: dict):
self.trace.append(TraceEvent(
event_type=event_type,
iteration=iteration,
data=data,
timestamp=time.time(),
))
Then after every LLM call, I record the response (or at least the tool calls and text). After every tool execution, I record the input and output. On termination, I record the reason. The full trace can be serialized to JSON for debugging, eval, or replay.
[GOOD SIGNAL: Structured tracing built into the loop. Mentions replay.]
Interviewer: Last thing. How would you test this without calling a real LLM?
Candidate: The llm_fn parameter is already a function I inject, so I can pass a mock:
def mock_llm(messages, tool_schemas) -> LLMResponse:
"""Deterministic mock: first call returns a tool call, second returns text."""
if len(messages) <= 2:
return LLMResponse(
tool_calls=[ToolCall(id="1", name="read_file", arguments={"path": "test.py"})],
input_tokens=100,
output_tokens=50,
)
return LLMResponse(text="Done! The file contains a hello world program.", input_tokens=200, output_tokens=30)
# Test
agent = AgentLoop(
llm_fn=mock_llm,
tools={"read_file": lambda path: "print('hello world')"},
tool_schemas=[{"name": "read_file", "input_schema": {"type": "object", "properties": {"path": {"type": "string"}}}}],
)
result = agent.run("What does test.py do?")
assert "hello world" in result
assert len(agent.trace) > 0
The mock lets me test all four termination paths without spending money on API calls. I would also write a mock that triggers the loop detector and one that triggers the cost cap.
[GOOD SIGNAL: Testability was designed in from the start (dependency injection of llm_fn). Writes the test unprompted.] [TESTING: Do they think about testability?]
141.10 Company-specific notes
Anthropic
The interview emphasizes safety, tool execution infrastructure, and evaluation. Anthropic builds Claude Code and the agent infrastructure that powers it. Expect deep questions about sandboxing (they use real sandboxes), context window management, and eval methodology. The cultural bar: you should care about whether the agent does the right thing, not just whether it does the thing at all.
What to prepare: Sandbox design (containers, microVMs, resource limits), eval infrastructure (golden tasks, trace-based scoring), and safety (prompt injection defense, tool-level access control). Read Chapters 70—72 (safety, evaluation, human-in-the-loop) thoroughly.
What they look for: Systems engineers who also understand ML well enough to reason about model behavior. Comfort with ambiguity --- agent engineering is a new field and the best practices are still forming.
OpenAI
The interview has a strong coding bar, more so than most agent roles. OpenAI builds the Codex agent and the agent platform (tool use, function calling). Expect a coding round that is genuinely difficult --- not just “implement an agent loop” but “implement it, then extend it to handle streaming, then add parallel tool execution, then optimize the context management.” Systems design questions will focus on platform infrastructure: multi-tenant tool execution, API design, and scale.
What to prepare: Strong Python coding (you will code live), API and platform design (Chapter 104 --- API contract design), and multi-tenant infrastructure. Know how function calling works at the API level inside and out.
What they look for: Strong engineers who can build reliable infrastructure at scale. Move fast but do not break production.
Google DeepMind
The interview balances ML depth with systems thinking. DeepMind builds Gemini-based agents and developed the A2A protocol. Expect questions about multi-agent architectures, agent-to-agent communication, and protocol design. The ML fundamentals round will go deeper than at other companies --- how does tool calling actually work in the model’s forward pass? How do you fine-tune a model for better tool use?
What to prepare: A2A protocol (Chapter 138), multi-agent topologies (Chapter 68), and ML fundamentals (structured generation, tool calling training). Be prepared for the “how does this work inside the model” questions.
What they look for: Researchers who can also build systems. They want you to understand why the agent behaves the way it does, not just how to work around it.
Meta
The interview focuses on internal developer tooling and open-source alignment. Meta builds agents for internal code migration, testing, and infrastructure management. Expect questions about operating agents at massive scale (millions of repos, thousands of internal users) and about building agent infrastructure that can be open-sourced.
What to prepare: Scale-oriented system design, internal developer tools (Chapter 101 --- monorepo build systems), and open-source infrastructure patterns. Meta values simplicity and reproducibility.
What they look for: Pragmatic engineers who can build infrastructure that works at Meta scale and can be open-sourced for the community.
Startups (Cognition / Devin, Cursor, Windsurf, Factory)
The interview is fast and practical. Startups care about whether you can ship an agent product, not whether you can design the perfect platform. Expect a coding round that is close to real product code (“implement this feature for our agent”) and a system design round focused on product trade-offs (“we have two weeks --- what do we build?”).
What to prepare: End-to-end agent implementation (the full loop from user input to output), UX design for agents (streaming, interruption, undo), and practical cost management. Have an opinion about what makes a good agent product.
What they look for: Builders. People who have shipped things. If you can demo an agent you built, bring it. Side projects matter more than credentials.
141.11 Common mistakes and recovery moves
Mistake 1: Over-engineering the agent
The symptom: You propose a multi-agent architecture with a supervisor, three specialist agents, a shared memory store, and a consensus protocol --- for a task that a single agent with five tools could handle.
Why it fails the interview: Over-engineering signals that you have read papers but not built systems. In practice, single-agent with good tools beats multi-agent almost always. The interviewer is testing whether you reach for complexity because it is interesting or because it is necessary.
The recovery: “Actually, let me simplify. A single agent with a well-designed tool set would handle this. Multi-agent adds coordination overhead that is not justified for this use case. I would only reach for multi-agent if I had agents that need different permission sets or if the task naturally decomposes into independent subtasks that can run in parallel.”
Mistake 2: Forgetting the security model
The symptom: You design a coding agent with terminal access and never mention sandboxing, network isolation, or what the agent should not be allowed to do.
Why it fails the interview: An agent with terminal access can curl your credentials to an external server. The interviewer is waiting for you to mention this. If you do not, they will ask, and the fact that you did not think of it is a negative signal.
The recovery: As soon as you realize you skipped it: “I should address the security model. The terminal tool runs in a sandboxed container with no network access to the internet, resource limits on CPU and memory, and a filesystem limited to the working directory. High-risk operations like git push require explicit user approval.”
Mistake 3: Not discussing cost
The symptom: You design an agent that runs a 200K-token conversation per task and never mention what it costs.
Why it fails the interview: Agent costs are real and material. A single agent run can cost $0.50—$50. At scale, this is a P&L line item.
The recovery: “Let me talk about cost. Each agent run at these context lengths costs roughly $X. To manage this, I would set a per-task cost cap, use a cheaper model for simple routing decisions, cache repeated tool calls, and implement summarization to reduce input token count over long conversations.”
Mistake 4: Not discussing eval
The symptom: You design the agent architecture but never mention how you know if it works.
Why it fails the interview: Agent eval is the hardest unsolved problem in the space. The interviewer wants to hear your strategy, not “we’ll add metrics.”
The recovery: “I want to address eval before we run out of time. I would build a golden task suite of 100+ curated tasks, run them on every prompt or tool change, and track pass rate plus cost regression. For subjective quality, I would use LLM-as-judge with explicit rubrics.”
Mistake 5: Treating it as a pure ML problem
The symptom: You focus the entire design discussion on prompt engineering, model selection, and fine-tuning, and barely mention the infrastructure.
Why it fails the interview: Agent engineering is 80% systems, 20% ML. The model is a black box that you call via API. The hard part is everything around it: tool execution, sandboxing, state management, observability, eval, cost, and safety.
The recovery: “The model is one component. Let me spend more time on the infrastructure: how we execute tools, how we manage state, how we observe and evaluate the system.”
Mistake 6: Not knowing the answer
The “I don’t know” move: It is better to say “I don’t know, but here is how I would investigate” than to make something up. Example: “I haven’t implemented constrained decoding myself, but I know it works by restricting the token sampling to only tokens that would produce valid JSON according to the schema. If I needed to build it, I would start by looking at the Outlines library, which implements grammar-guided generation.”
This move signals intellectual honesty and problem-solving instinct, both of which interviewers value more than encyclopedic knowledge.
141.12 The 48-hour prep plan
You have two days before the interview. Here is what to study and in what order.
Day 1 (8 hours)
Morning (3 hours): Re-read core agent chapters.
- Chapter 66 --- Tool use and function calling. Understand the full end-to-end flow: schema definition, structured generation, dispatch, result injection. (45 min)
- Chapter 67 --- Agent loop patterns. ReAct, plan-and-execute, and the hybrid. Know why ReAct won. (30 min)
- Chapter 68 --- Multi-agent orchestration. Supervisor, pipeline, debate topologies. Know when multi-agent is and is not justified. (30 min)
- Chapter 69 --- MCP. Tool registration, server/client, schema design. (30 min)
- Chapter 70 --- Agent safety. Prompt injection, capability-based security, the defense-in-depth model. (45 min)
Afternoon (3 hours): Practice coding.
- Implement an agent loop from scratch without looking at the book. Time yourself: 25 minutes. (45 min including review)
- Implement a tool registry with schema validation. 20 minutes. (40 min)
- Implement context window management with truncation. 15 minutes. (30 min)
- Implement retry with exponential backoff. 10 minutes. (20 min)
- Review your solutions against the code in Section 141.4. Note what you missed. (45 min)
Evening (2 hours): System design practice.
- Set a 45-minute timer and design a coding agent out loud (or in writing). Cover all five components from Section 141.3.1. (45 min)
- Review the mock transcript in Section 141.8. Note the signals and what the interviewer was testing. (30 min)
- Prepare two stories for behavioral questions: one “agent failure” story and one “debugging a non-deterministic system” story. Use the STAR framework from Section 141.6. (45 min)
Day 2 (6 hours)
Morning (2.5 hours): Advanced topics.
- Chapter 71 --- Agent evaluation. Trace-based eval, LLM-as-judge, golden task suites. (45 min)
- Chapter 72 --- Human-in-the-loop. Approval workflows, progressive autonomy. (30 min)
- Chapter 138 --- A2A protocol. MCP vs A2A, Agent Cards, task lifecycle. (45 min)
- Section 141.5 --- Deep-dive questions. Read each Q&A and make sure you can answer from memory. (30 min)
Afternoon (2 hours): Mock interviews.
- Have a friend (or a timer) give you the “Design a coding agent” prompt. Do it in 45 minutes. (60 min with debrief)
- Do the “Implement an agent loop” coding problem from scratch in 30 minutes. (45 min with review)
- Answer three behavioral questions out loud. Record yourself if possible. (15 min)
Evening (1.5 hours): Polish.
- Review the vocabulary table in Section 141.7. Practice using these phrases in sentences --- they should feel natural, not forced. (30 min)
- Re-read the company-specific notes in Section 141.10 for your target company. (15 min)
- Re-read the common mistakes in Section 141.11. Mentally rehearse the recovery moves. (15 min)
- Skim Section 141.3 system design prompts one more time. For each, make sure you can list the five key components from memory. (30 min)
141.13 Mental model
-
Agent engineering is systems engineering. The model is one API call. The hard part is everything around it: tool execution, sandboxing, state management, observability, eval, cost, and safety. Design the system, not the prompt.
-
Every agent needs four termination conditions. Max iterations, cost cap, no-progress detection, and user interrupt. If your design has fewer than four, you have a bug.
-
Security is not optional. An agent that can run code can exfiltrate data. An agent that can send emails can phish. Discuss the security model before the interviewer has to ask.
-
Eval is the hardest problem. You cannot A/B test an agent like a recommendation model. Golden task suites, LLM-as-judge, trace-based scoring, and regression CI are the state of the art. Have a strategy.
-
Start with a workflow, add an agent. The hybrid pattern --- workflow for the 95% case, agent for the 5% long tail --- is the right default for production systems. Saying “I’d just build an agent” is the wrong answer for most system design prompts.
-
Cost is a design constraint, not an afterthought. A single agent run can cost $0.50—$50. Reason about cost per task early in the design. Set caps. Use cheap models for simple decisions. Cache tool results.
-
Non-determinism requires observability. Every agent run must produce a trace. Without traces, you cannot debug, evaluate, or improve. Build tracing into the loop from day one.
-
Show pragmatism, not perfection. The interviewer prefers a candidate who ships a simple agent with good instrumentation over a candidate who designs an elegant multi-agent architecture that would take six months to build. Scope the V1. Ship it. Iterate.
Read it yourself
| Resource | Why it matters |
|---|---|
| Chapters 66—72 (Part V: Tool use, agents, MCP, safety, eval, HITL) | The foundation. If you have not read these, nothing in this chapter will make sense. |
| Chapters 131—140 (Part XI: Building agents and agent infra) | The production-depth treatment. Covers everything from sandbox design to A2A. |
| Chapter 126 --- The coding interview: twenty ML systems algorithms | The coding round prep for ML fundamentals that complement agent-specific coding. |
| Chapters 112—121 --- ML systems interview preparation | System design frameworks and behavioral prep that apply to agent interviews too. |
| Anthropic Claude Code documentation | Study how a real coding agent works --- its tools, its UX, its limitations. |
| OpenAI Codex documentation | The competing approach to coding agents. Compare and contrast with Claude Code. |
| Model Context Protocol specification | The tool protocol most agent systems use. Know it cold. |
| Google A2A specification | The agent-to-agent protocol. Know when to use it and when not to. |
Practice
-
Implement a complete agent loop from scratch in 25 minutes. Include all four termination conditions (max iterations, cost cap, no-progress detection, max tool errors). Test it with a mock LLM function that exercises each termination path.
-
Design a multi-tenant agent platform in a 45-minute timed exercise. Draw the architecture, list the components, discuss the trade-offs between shared-process and per-tenant isolation, and explain how you would handle billing and observability. Write up your answer and compare it against Section 141.3.2.
-
Implement a tool registry with JSON Schema validation, dynamic registration, and error handling. Then add a permission system: each agent session has a set of allowed tools, and the registry rejects calls to tools the session is not authorized to use.
-
Write three behavioral answers using the agent-adapted STAR framework from Section 141.6. One about an agent failure, one about debugging a non-deterministic system, and one about a cost-vs-quality trade-off decision. Practice delivering each in under 3 minutes.
-
Build a context window manager that implements both truncation and summarization. Write a test that creates a conversation with 500 messages, runs it through the manager, and verifies that the output fits within a 128K token budget while preserving the system prompt and the last 6 turns.
-
Prepare a 2-minute explanation of how tool calling works end-to-end (from system prompt to tool dispatch to result injection). Record yourself delivering it. Listen back and cut any filler words or unnecessary detail. The interviewer wants clarity and precision, not length.
-
Stretch: Build a minimal coding agent that can actually solve simple programming tasks. Give it four tools (file_read, file_write, terminal, search), connect it to a real LLM API, and run it against three tasks: “create a Python function that sorts a list,” “fix the bug in this file” (provide a file with a deliberate bug), and “add tests for this function.” Capture the full trace for each run and evaluate: Did it complete the task? How many tool calls did it use? What did it cost? What went wrong?