Code-executing agents: sandboxes, REPLs, and the code-as-tool pattern
"Most agent frameworks give a model a fixed menu of tools—search, calculator, API call. The moment you hand the model a *code interpreter* instead, the menu becomes infinite: any computation expressible in a programming language is now a single tool call away. This chapter maps the architecture, security boundaries, and production patterns that make code-executing agents safe and useful"
133.1 — Why code execution changes everything
A conventional tool-using agent couples each capability to a hand-written function: get_weather(city), query_database(sql), send_email(to, body). Scaling this approach to hundreds of tasks means writing hundreds of tools. Worse, each new task requires a deployment cycle.
Code execution collapses the combinatorial explosion into one meta-tool: run this code. A model that can emit Python (or JavaScript, or shell) can compute Fibonacci numbers, parse CSVs, render charts, install packages, and call APIs—all through a single tool endpoint.
Three properties make code execution uniquely powerful:
- Composability. The model chains library calls, control flow, and data transformations in a single action rather than orchestrating dozens of atomic tool invocations.
- Expressiveness. Anything a human developer can do in a REPL, the agent can attempt—statistical tests, image manipulation, web scraping, file format conversion.
- Self-correction. When code throws an exception, the traceback is itself a rich observation. The model reads the error, edits the code, and retries—a tight feedback loop that purely declarative tools cannot match.
The cost is risk: arbitrary code execution on shared infrastructure opens the door to data exfiltration, resource abuse, and lateral movement. The rest of this chapter is largely about managing that risk while preserving the expressiveness.
133.2 — Code-as-tool pattern: LLM generates code, runtime executes, result goes back
The code-as-tool pattern is a three-party contract:
| Party | Role |
|---|---|
| Orchestrator | Sends the user query plus system prompt to the LLM; receives a tool-call requesting code execution |
| LLM | Generates a code snippet (with optional natural-language explanation) as the tool-call payload |
| Sandbox runtime | Executes the snippet in an isolated environment; returns stdout, stderr, generated files, and exit code |
The orchestrator feeds the execution result back to the LLM as the tool response, and the loop continues until the model emits a final answer.
# Minimal code-as-tool orchestrator (conceptual)
import openai, subprocess, tempfile, json
SYSTEM = """You have a tool called `execute_python`.
Call it with {"code": "<python code>"} whenever you need computation.
You will receive {"stdout": "...", "stderr": "...", "exit_code": 0}."""
def execute_python(code: str) -> dict:
"""Run code in a subprocess and capture output."""
with tempfile.NamedTemporaryFile(suffix=".py", mode="w", delete=False) as f:
f.write(code)
f.flush()
result = subprocess.run(
["python3", f.name],
capture_output=True, text=True, timeout=30
)
return {
"stdout": result.stdout[-4000:], # truncate for context window
"stderr": result.stderr[-2000:],
"exit_code": result.returncode,
}
tools = [{
"type": "function",
"function": {
"name": "execute_python",
"description": "Execute Python code and return stdout/stderr.",
"parameters": {
"type": "object",
"properties": {"code": {"type": "string"}},
"required": ["code"],
},
},
}]
def agent_loop(user_query: str, max_turns: int = 10):
messages = [
{"role": "system", "content": SYSTEM},
{"role": "user", "content": user_query},
]
for _ in range(max_turns):
resp = openai.chat.completions.create(
model="gpt-4o", messages=messages, tools=tools
)
msg = resp.choices[0].message
messages.append(msg)
if msg.tool_calls:
for tc in msg.tool_calls:
args = json.loads(tc.function.arguments)
result = execute_python(args["code"])
messages.append({
"role": "tool",
"tool_call_id": tc.id,
"content": json.dumps(result),
})
else:
return msg.content # final answer
return "Max turns reached."
This skeleton already works—but subprocess.run on the host machine is not safe for production. Sections 133.4–133.6 replace that subprocess with a hardened sandbox.
133.3 — The REPL agent loop (generate → execute → observe → iterate)
The acronym REPL (Read-Eval-Print-Loop) comes from interactive interpreters, but the agent version extends the concept:
- Generate. The LLM writes a code cell—typically 5–50 lines of Python.
- Execute. The sandbox runs the cell. State persists between cells (variables, imports, files on disk).
- Observe. The orchestrator returns stdout, stderr, rendered images, and any file artifacts to the LLM.
- Iterate. The LLM decides whether to write another cell (fix a bug, refine a plot, run a follow-up query) or emit a final answer.
State persistence is the critical difference between a REPL agent and a one-shot code tool. A REPL sandbox keeps the Python interpreter alive across cells so that a DataFrame loaded in cell 1 is available in cell 5. This mirrors how a human works in a Jupyter notebook.
# REPL agent: stateful IPython kernel via Jupyter client
from jupyter_client import KernelManager
import queue
class REPLSandbox:
"""Wraps a live IPython kernel for multi-turn code execution."""
def __init__(self, timeout: int = 30):
self.km = KernelManager(kernel_name="python3")
self.km.start_kernel()
self.kc = self.km.client()
self.kc.start_channels()
self.kc.wait_for_ready(timeout=15)
self.timeout = timeout
def execute(self, code: str) -> dict:
msg_id = self.kc.execute(code)
outputs: list[str] = []
errors: list[str] = []
images: list[bytes] = []
while True:
try:
msg = self.kc.get_iopub_msg(timeout=self.timeout)
except queue.Empty:
break
content = msg["content"]
msg_type = msg["msg_type"]
if msg_type == "stream":
outputs.append(content["text"])
elif msg_type == "error":
errors.append("\n".join(content["traceback"]))
elif msg_type in ("display_data", "execute_result"):
if "image/png" in content.get("data", {}):
import base64
images.append(base64.b64decode(content["data"]["image/png"]))
if "text/plain" in content.get("data", {}):
outputs.append(content["data"]["text/plain"])
elif msg_type == "status" and content["execution_state"] == "idle":
break
return {
"stdout": "".join(outputs)[-4000:],
"stderr": "".join(errors)[-2000:],
"images": images, # pass to multimodal model or save as artifacts
}
def shutdown(self):
self.kc.stop_channels()
self.km.shutdown_kernel(now=True)
Design decision: stateful vs. stateless. Stateful sandboxes are natural for exploratory data analysis. Stateless (fresh process per call) sandboxes are simpler to reason about for security and scaling. Many production systems offer both modes.
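A short usage sketch of the class above: the DataFrame created in the first cell is still in scope for the second, which is exactly the property a one-shot code tool lacks.

# Multi-cell session against the stateful kernel
sbx = REPLSandbox()
sbx.execute("import pandas as pd\ndf = pd.DataFrame({'x': range(100)})")
print(sbx.execute("print(df['x'].sum())")["stdout"])  # 4950; state survived the first cell
sbx.shutdown()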
133.4 — Sandbox architectures: process isolation, container (Docker/gVisor), microVM (Firecracker), cloud (E2B, Modal)
Not all sandboxes are equal. The spectrum trades isolation strength against startup latency and operational complexity.
| Level | Mechanism | Startup | Isolation | Example |
|---|---|---|---|---|
| Process | subprocess + ulimit | ~5 ms | Weak (shared kernel, shared FS) | Local dev REPL |
| Container | Docker + seccomp/AppArmor | 200–800 ms | Medium (shared kernel, namespaced FS/net) | Self-hosted sandbox |
| Container + gVisor | Docker with runsc runtime | 300–1000 ms | Strong (user-space kernel intercept) | GCP Cloud Run |
| microVM | Firecracker, Cloud Hypervisor | 125–300 ms | Very strong (separate guest kernel) | AWS Lambda, Fly Machines |
| Cloud sandbox | E2B, Modal, CodeSandbox SDK | 300–2000 ms (cold) | Very strong (vendor-managed VM/container) | SaaS agents |
Process-level isolation is only appropriate during local development. A single os.system("rm -rf /") or a reverse shell demonstrates why.
Containers with a hardened seccomp profile and dropped capabilities are the most common self-hosted option. Adding gVisor (runsc) interposes a user-space kernel that intercepts every syscall, drastically reducing the host kernel attack surface.
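Opting into gVisor from docker-py is a one-line change, assuming the runsc runtime is installed and registered with the Docker daemon:

import docker
client = docker.from_env()
# Run a sandbox container under gVisor's user-space kernel instead of the default runc
container = client.containers.run(
    "python:3.12-slim",
    command="sleep infinity",
    detach=True,
    runtime="runsc",        # requires gVisor to be installed on the host
    network_mode="none",
)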
MicroVMs (Firecracker) boot a minimal Linux guest in ~125 ms. Each execution gets its own kernel, memory space, and virtual network interface. AWS Lambda and Fly Machines both use this model.
Cloud sandbox services (E2B, Modal) abstract the VM/container lifecycle behind an API. You POST code; you GET results. Billing is per-second or per-execution.
133.5 — E2B in depth: API, sandbox lifecycle, file system, networking
E2B (short for “Environment to Build”) provides cloud sandboxes purpose-built for AI code execution. Each sandbox is a Firecracker microVM with its own filesystem, network stack, and process tree.
Lifecycle
create → running (idle timeout: 5 min default) → destroy
↑ │
└────────── keep-alive ping ─────────────┘
You create a sandbox, run one or more code cells, then let it auto-destroy on timeout (or destroy explicitly).
Core API
from e2b_code_interpreter import Sandbox
# 1. Create a sandbox (cold start ~1-2s, warm ~300ms)
sbx = Sandbox(api_key="e2b_...", timeout=300) # 5-min idle timeout
# 2. Execute code — state persists across calls
result = sbx.run_code("import pandas as pd; df = pd.DataFrame({'a': [1,2,3]})")
print(result.text) # stdout
print(result.error) # stderr / traceback (None if clean)
# 3. Multi-turn: variables survive
result2 = sbx.run_code("print(df.describe())")
print(result2.text)
# 4. Install packages at runtime
sbx.run_code("!pip install seaborn -q")
# 5. File operations
sbx.files.write("/home/user/data.csv", b"a,b\n1,2\n3,4")
csv_bytes = sbx.files.read("/home/user/data.csv")
# 6. Download generated artifacts (charts, reports)
sbx.run_code("""
import matplotlib.pyplot as plt
plt.plot([1,2,3],[4,5,6])
plt.savefig('/home/user/chart.png')
""")
png_bytes = sbx.files.read("/home/user/chart.png")
# 7. Networking: sandbox can make outbound HTTP requests
sbx.run_code("""
import requests
resp = requests.get('https://api.github.com')
print(resp.status_code)
""")
# 8. Tear down
sbx.close()
Custom templates
E2B lets you define a sandbox template — a Dockerfile that pre-bakes packages, datasets, and config files into the VM image. This eliminates per-session pip install latency.
# e2b.Dockerfile
FROM e2b/base:latest
RUN pip install pandas numpy scikit-learn matplotlib seaborn
COPY seed_data/ /home/user/data/
e2b template build --name "ds-sandbox"
Now Sandbox(template="ds-sandbox") starts with all libraries pre-installed.
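With the template built, a session simply references it by name; a minimal sketch (the template name matches the build command above):

from e2b_code_interpreter import Sandbox
# Boots from the pre-baked image, so the heavy libraries need no runtime pip install
sbx = Sandbox(template="ds-sandbox")
print(sbx.run_code("import sklearn; sklearn.__version__").text)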
Key limits (as of 2025)
| Dimension | Default | Configurable |
|---|---|---|
| vCPUs | 2 | Up to 8 |
| RAM | 512 MB | Up to 8 GB |
| Disk | 1 GB | Up to 10 GB |
| Idle timeout | 5 min | Up to 24 h |
| Max execution time | 300 s per cell | Adjustable |
| Outbound network | Allowed | Can be restricted |
133.6 — Docker-based sandboxes: building your own, resource limits, escape risks
Not every team wants a cloud dependency. A self-hosted Docker sandbox gives you full control—at the cost of owning the security posture.
Minimal implementation
import docker
import uuid
import tarfile
import io
class DockerSandbox:
"""Ephemeral Docker container for code execution."""
IMAGE = "python:3.12-slim"
MEM_LIMIT = "256m"
CPU_PERIOD = 100_000
CPU_QUOTA = 50_000 # 50% of one core
TIMEOUT = 30
def __init__(self):
self.client = docker.from_env()
self.container = self.client.containers.run(
self.IMAGE,
command="sleep infinity",
detach=True,
mem_limit=self.MEM_LIMIT,
cpu_period=self.CPU_PERIOD,
cpu_quota=self.CPU_QUOTA,
network_mode="none", # no network access
security_opt=["no-new-privileges"],
            read_only=False,  # relaxed for convenience; the hardening checklist below recommends read_only=True
tmpfs={"/tmp": "size=64m"},
name=f"sandbox-{uuid.uuid4().hex[:8]}",
)
def execute(self, code: str) -> dict:
"""Write code to file, exec inside container, capture output."""
# Inject code as a tar archive
code_bytes = code.encode()
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w") as tar:
info = tarfile.TarInfo(name="run.py")
info.size = len(code_bytes)
tar.addfile(info, io.BytesIO(code_bytes))
tar_buf.seek(0)
self.container.put_archive("/tmp", tar_buf)
        # Execute. docker-py's exec_run has no timeout parameter, so enforce the
        # wall-clock limit with the coreutils timeout command inside the container
        # (it returns exit code 124 when the limit is hit).
        exit_code, output = self.container.exec_run(
            ["timeout", str(self.TIMEOUT), "python3", "/tmp/run.py"],
            demux=True,
        )
stdout = (output[0] or b"").decode(errors="replace")[-4000:]
stderr = (output[1] or b"").decode(errors="replace")[-2000:]
return {"stdout": stdout, "stderr": stderr, "exit_code": exit_code}
def upload_file(self, container_path: str, data: bytes):
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w") as tar:
info = tarfile.TarInfo(name=container_path.split("/")[-1])
info.size = len(data)
tar.addfile(info, io.BytesIO(data))
tar_buf.seek(0)
directory = "/".join(container_path.split("/")[:-1]) or "/"
self.container.put_archive(directory, tar_buf)
def download_file(self, container_path: str) -> bytes:
bits, _ = self.container.get_archive(container_path)
tar_buf = io.BytesIO(b"".join(bits))
with tarfile.open(fileobj=tar_buf) as tar:
member = tar.getmembers()[0]
return tar.extractfile(member).read()
def destroy(self):
self.container.remove(force=True)
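Typical lifecycle of the class above: create the container, run one or more snippets, then destroy it.

# One-shot use of the Docker sandbox
sandbox = DockerSandbox()
try:
    print(sandbox.execute("print(sum(range(10)))"))  # {'stdout': '45\n', 'stderr': '', 'exit_code': 0}
finally:
    sandbox.destroy()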
Hardening checklist
| Control | Why |
|---|---|
| network_mode="none" | Prevents data exfiltration and reverse shells |
| mem_limit / cpu_quota | Prevents fork bombs and crypto-mining |
| no-new-privileges | Blocks setuid escalation |
| read_only=True + tmpfs on /tmp | Limits persistent filesystem writes |
| Drop all capabilities: cap_drop=["ALL"] | Minimizes kernel attack surface |
| gVisor runtime (runtime="runsc") | Intercepts syscalls in user space |
| Seccomp profile | Allowlist of ~60 safe syscalls |
| AppArmor/SELinux | Mandatory access control on file paths |
Escape risks
Even with all of the above, container escapes are published regularly (e.g., CVE-2024-21626 in runc). Defenses in depth:
- Run the Docker daemon on a dedicated VM that has no access to production databases or secrets.
- Use gVisor or Firecracker if the threat model includes sophisticated attackers.
- Rotate sandbox VMs frequently; treat them as cattle, not pets.
133.7 — File system access: mounting volumes, read-write boundary, artifact extraction
Code-executing agents frequently need to read input files (CSVs, images, PDFs) and produce output artifacts (charts, reports, transformed data). File system design is where convenience and security collide.
Pattern 1: Upload-before-execute
The orchestrator uploads files into the sandbox before the first code cell runs. The sandbox filesystem is the only shared surface.
# Orchestrator side
sandbox.upload_file("/home/user/input.csv", user_csv_bytes)
result = sandbox.execute("import pandas as pd; df = pd.read_csv('/home/user/input.csv'); print(df.shape)")
Pattern 2: Mounted read-only volume
For Docker sandboxes, mount a host directory as read-only so the agent can access reference data without the ability to modify it.
container = client.containers.run(
image,
volumes={"/data/reference": {"bind": "/mnt/reference", "mode": "ro"}},
...
)
Pattern 3: Artifact extraction
After execution, the orchestrator pulls files from a designated output directory.
# Convention: agent writes outputs to /home/user/output/
artifacts = {}
for filename in sandbox.list_files("/home/user/output/"):
artifacts[filename] = sandbox.download_file(f"/home/user/output/{filename}")
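The list_files call above is not a method of the DockerSandbox class from Section 133.6; a minimal version of that helper (a sketch, assuming that class) can shell out to ls inside the container:

# Hypothetical DockerSandbox helper: enumerate filenames in an output directory
def list_files(self, directory: str) -> list[str]:
    exit_code, output = self.container.exec_run(["ls", "-1", directory])
    if exit_code != 0:
        return []  # directory missing or unreadable
    return [name for name in output.decode().splitlines() if name]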
The read-write boundary rule
Never mount a directory read-write unless the agent genuinely needs to persist data back to the host. Default to read-only mounts plus a writable `/tmp` or `/home/user` inside the container. Extract artifacts explicitly rather than sharing a writable volume.
This principle limits the blast radius of a compromised sandbox. Even if the agent writes malicious files, they exist only inside the ephemeral container and are destroyed on teardown.
133.8 — Language runtimes: Python, Node.js, shell, polyglot
Python dominates code-executing agents because of the data-science ecosystem (pandas, numpy, matplotlib, scikit-learn), but production systems increasingly support multiple runtimes.
Python
The default. Use IPython kernels for REPL-style state persistence, or plain python3 for stateless one-shot execution. Pre-install heavy packages in the sandbox image.
Node.js / TypeScript
Useful when the agent manipulates JSON APIs, generates front-end code, or runs Puppeteer for browser automation.
result = sandbox.execute(
code="const resp = await fetch('https://api.example.com/data'); console.log(await resp.json());",
language="javascript",
)
Shell (bash)
Agents that orchestrate CLI tools—ffmpeg, imagemagick, curl, jq—benefit from a shell runtime. Shell execution is also the natural escape hatch for installing OS packages.
result = sandbox.execute(
code="apt-get update && apt-get install -y ffmpeg && ffmpeg -i input.mp4 -vn output.mp3",
language="bash",
)
Polyglot sandboxes
Some frameworks let the model choose the runtime per cell. The sandbox image includes Python, Node, and bash; the tool schema adds a language field:
{
"name": "execute_code",
"parameters": {
"type": "object",
"properties": {
"language": {"type": "string", "enum": ["python", "javascript", "bash"]},
"code": {"type": "string"}
},
"required": ["language", "code"]
}
}
Trade-off: polyglot sandboxes are larger images (~1–2 GB) with more installed attack surface. Restrict to the runtimes your use case actually requires.
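On the runtime side, dispatching a polyglot execute_code call reduces to mapping the language field onto an interpreter invocation. A sketch, assuming a stateless DockerSandbox-style executor as in Section 133.6:

# Map the tool call's language field to an interpreter command
INTERPRETERS = {
    "python": ["python3", "-c"],
    "javascript": ["node", "-e"],
    "bash": ["bash", "-c"],
}
def build_command(language: str, code: str) -> list[str]:
    if language not in INTERPRETERS:
        raise ValueError(f"unsupported language: {language}")
    return INTERPRETERS[language] + [code]  # e.g. ["bash", "-c", "ffmpeg -i ..."]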
133.9 — Security model: permission boundary, network isolation, credential management, time limits
Security for code-executing agents operates on four axes.
1. Permission boundary
The principle of least privilege applied to sandboxes:
- Run as a non-root user inside the container.
- Drop all Linux capabilities (`cap_drop=["ALL"]`).
- Enable seccomp to allowlist only necessary syscalls.
- Use read-only filesystems except for designated scratch directories.
container = client.containers.run(
image,
user="1000:1000",
cap_drop=["ALL"],
security_opt=[
"no-new-privileges",
"seccomp=sandbox-seccomp.json",
],
read_only=True,
tmpfs={"/tmp": "size=64m,noexec"},
...
)
2. Network isolation
By default, disable all networking (network_mode="none"). If the agent legitimately needs outbound HTTP (e.g., to call an API), use an allowlist proxy:
# docker-compose.yml for sandbox with egress proxy
services:
sandbox:
image: sandbox:latest
    network_mode: "service:egress-proxy"   # share the egress proxy's network namespace
egress-proxy:
image: envoyproxy/envoy:v1.30
volumes:
- ./envoy-allowlist.yaml:/etc/envoy/envoy.yaml
# envoy-allowlist.yaml permits only *.api.example.com
3. Credential management
Never inject secrets into the sandbox. If the agent needs to call an authenticated API:
- The orchestrator calls the API on behalf of the agent (the sandbox returns a request spec, the orchestrator executes it).
- Or use short-lived, narrowly-scoped tokens injected as environment variables with automatic expiry.
# Anti-pattern: injecting long-lived credentials
sandbox.execute("import os; requests.get(url, headers={'Authorization': os.environ['API_KEY']})")
# Better: orchestrator-mediated API call
result = sandbox.execute("print(json.dumps({'url': '...', 'method': 'GET'}))")
api_request = json.loads(result["stdout"])  # the sandbox only emits a request spec, never the secret
# Orchestrator validates the URL against an allowlist, adds auth, makes the call
response = requests.get(api_request["url"], headers={"Authorization": f"Bearer {short_lived_token}"})
sandbox.inject_result(response.text)
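The orchestrator-side allowlist check is only a few lines; a sketch with a hypothetical ALLOWED_HOSTS set:

# Validate the sandbox-proposed URL before the orchestrator makes the real call
from urllib.parse import urlparse
ALLOWED_HOSTS = {"api.github.com"}  # illustrative allowlist
def is_allowed(url: str) -> bool:
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in ALLOWED_HOSTS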
4. Time and resource limits
| Limit | Recommended default | Purpose |
|---|---|---|
| Wall-clock timeout | 30 s per cell | Prevents infinite loops |
| CPU quota | 0.5–1 core | Prevents mining / DoS |
| Memory | 256 MB–1 GB | Prevents OOM on host |
| Disk | 100 MB writable | Prevents disk fill |
| Max cells per session | 20–50 | Bounds total compute |
| Max sandbox lifetime | 10–30 min | Prevents resource leaks |
133.10 — Production patterns: data analysis agents, SWE-bench style, notebook agents
Pattern A: Data analysis agent
The most common deployment. User uploads a CSV; the agent explores it, computes statistics, generates charts, and narrates findings.
class DataAnalysisAgent:
"""Production data-analysis agent with E2B sandbox."""
SYSTEM_PROMPT = """You are a data analyst. The user has uploaded a file to
/home/user/data.csv. Use the execute_python tool to explore, analyze, and
visualize the data. Always print results so you can observe them.
Save charts to /home/user/output/. When done, summarize your findings."""
def __init__(self, llm_client, sandbox: Sandbox):
self.llm = llm_client
self.sbx = sandbox
def run(self, user_query: str, csv_bytes: bytes) -> dict:
# Upload user data
self.sbx.files.write("/home/user/data.csv", csv_bytes)
messages = [
{"role": "system", "content": self.SYSTEM_PROMPT},
{"role": "user", "content": user_query},
]
for _ in range(15): # max turns
response = self.llm.chat(messages=messages, tools=self.tools)
msg = response.message
messages.append(msg)
if not msg.tool_calls:
# Collect artifacts
artifacts = self._collect_artifacts()
return {"answer": msg.content, "artifacts": artifacts}
for tc in msg.tool_calls:
result = self.sbx.run_code(tc.arguments["code"])
messages.append(self._tool_response(tc.id, result))
return {"answer": "Analysis incomplete — turn limit reached.", "artifacts": {}}
def _collect_artifacts(self) -> dict:
artifacts = {}
try:
for entry in self.sbx.files.list("/home/user/output/"):
artifacts[entry.name] = self.sbx.files.read(
f"/home/user/output/{entry.name}"
)
except Exception:
pass
return artifacts
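The class above references two pieces of plumbing it does not define: a tools schema and a _tool_response helper. A minimal sketch of both (illustrative names; it reuses the execute_python schema from Section 133.2 and assumes json is imported at module level), to be placed inside the class body:

    # Tool schema the LLM sees (same shape as in Section 133.2)
    tools = [{
        "type": "function",
        "function": {
            "name": "execute_python",
            "description": "Execute Python code in the sandbox.",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    }]
    def _tool_response(self, tool_call_id: str, result) -> dict:
        # Convert a sandbox execution result into a tool-role message for the LLM
        return {
            "role": "tool",
            "tool_call_id": tool_call_id,
            "content": json.dumps({
                "stdout": (result.text or "")[-4000:],
                "error": str(result.error) if result.error else None,
            }),
        }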
Pattern B: SWE-bench style (software engineering agent)
The agent clones a repo, reads the issue, writes a patch, runs tests, and iterates until tests pass. The sandbox needs git, a full development environment, and network access (for pip install).
# Orchestrator sets up the repo inside the sandbox
sandbox.run_code("""
!git clone https://github.com/org/repo.git /home/user/repo
!cd /home/user/repo && pip install -e '.[dev]' -q
""")
# Agent loop: read issue → locate files → write patch → run tests
# The LLM uses shell commands (cat, grep, sed) alongside Python
Key differences from data analysis: longer sessions (10–30 min), larger disk (cloned repo + dependencies), network access required, and the success metric is test pass rate rather than narrative quality.
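Inside the loop, a test run is just another code cell; printing the runner's exit code turns the pass/fail signal into an explicit observation the model can branch on. A sketch, using the E2B-style sandbox from Section 133.5:

# Agent-visible test run: the printed tail and exit code become the next observation
result = sandbox.run_code("""
import subprocess
proc = subprocess.run(["python", "-m", "pytest", "-q"], cwd="/home/user/repo",
                      capture_output=True, text=True)
print(proc.stdout[-2000:])
print("EXIT CODE:", proc.returncode)  # 0 means every test passed
""")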
Pattern C: Notebook agent
The agent works inside a Jupyter notebook, reading and writing cells. This is natural for data science workflows where the output is a notebook.
import nbformat
def notebook_agent(sbx, llm, user_query: str) -> nbformat.NotebookNode:
nb = nbformat.v4.new_notebook()
messages = [{"role": "user", "content": user_query}]
for _ in range(20):
resp = llm.chat(messages=messages, tools=tools)
if not resp.message.tool_calls:
# Add markdown cell with final summary
nb.cells.append(nbformat.v4.new_markdown_cell(resp.message.content))
break
for tc in resp.message.tool_calls:
code = tc.arguments["code"]
result = sbx.run_code(code)
# Record in notebook
cell = nbformat.v4.new_code_cell(code)
cell.outputs = [nbformat.v4.new_output(
output_type="stream", name="stdout", text=result.text or ""
)]
nb.cells.append(cell)
messages.append(tool_response(tc.id, result))
return nb
133.11 — Cost and latency: cold starts, warm pools, per-execution billing
Cold start budget
| Component | Typical latency |
|---|---|
| VM / container creation | 200–2000 ms |
| Language runtime init (Python) | 50–200 ms |
| Package imports (pandas, numpy) | 300–800 ms |
| Total cold start | 0.5–3 s |
For interactive agents, a 3-second pause before the first code result is noticeable. Mitigation strategies:
Warm pools. Pre-create N sandboxes and keep them idle. When a request arrives, assign one instantly. Replenish the pool asynchronously.
import asyncio
from collections import deque
class SandboxPool:
"""Pre-warmed pool of sandbox instances."""
def __init__(self, pool_size: int = 5, factory=None):
self.factory = factory or (lambda: Sandbox())
self.pool: deque = deque()
self.pool_size = pool_size
async def initialize(self):
tasks = [asyncio.to_thread(self.factory) for _ in range(self.pool_size)]
instances = await asyncio.gather(*tasks)
self.pool.extend(instances)
async def acquire(self) -> Sandbox:
if self.pool:
sbx = self.pool.popleft()
# Replenish in background
asyncio.create_task(self._replenish())
return sbx
# Pool exhausted — create on demand (cold start)
return await asyncio.to_thread(self.factory)
async def release(self, sbx: Sandbox):
# Destroy used sandbox — never reuse between users
await asyncio.to_thread(sbx.close)
async def _replenish(self):
sbx = await asyncio.to_thread(self.factory)
self.pool.append(sbx)
Snapshot / clone. Some platforms (Firecracker with memory snapshots, E2B templates) let you snapshot a VM after packages are imported. Restoring from snapshot is faster than booting + importing.
Cost model
| Provider | Billing unit | Approximate cost (2025) |
|---|---|---|
| E2B | Per-second of sandbox uptime | ~$0.00005/s ($0.18/hr) |
| Modal | Per-second of compute (CPU+RAM) | ~$0.0001/s for 1 CPU / 1 GB |
| Self-hosted (AWS) | EC2 instance + overhead | ~$0.05–0.10/hr per sandbox host |
For a typical data-analysis session (10 code cells, 60 s total sandbox time), the sandbox cost is $0.003–$0.006 — usually dwarfed by the LLM inference cost (~$0.02–0.10 per session with GPT-4o).
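That session figure is just the per-second rate multiplied by uptime; a quick back-of-envelope using the rates in the table:

# Sandbox cost for a 60-second session (rates from the table above, illustrative)
uptime_s = 60
print(uptime_s * 0.00005)  # E2B:   $0.003
print(uptime_s * 0.0001)   # Modal: $0.006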
Optimization levers
- Batch cells. If the model's plan has several steps that do not require observing intermediate output, execute them in one cell to reduce round-trips.
- Truncate output. Cap stdout at ~4 KB to avoid bloating the context window (and thus LLM cost).
- Reuse sandboxes within a session but never across users (security).
- Set aggressive idle timeouts (2–5 min) to avoid paying for idle VMs.
- Pre-bake images. Every `pip install` inside a live sandbox is wasted cold-start time and compute cost. Bake dependencies into the template.
133.12 — Mental model (8 takeaway points)
- Code execution is the universal tool. One meta-tool (run code) replaces an ever-growing registry of narrow tools. Design your agent with code-as-tool as the primary capability and add dedicated tools only when code is insufficient (e.g., human-in-the-loop approvals).
- Stateful REPLs beat stateless execution for exploratory work. Keeping the interpreter alive across cells lets the agent build on previous results, just like a human in a Jupyter notebook. Use stateless mode for one-shot computations where isolation is paramount.
- Isolation is non-negotiable in production. Process-level sandboxing is a development convenience, not a production strategy. At minimum, use hardened Docker containers; prefer gVisor or Firecracker microVMs for untrusted code.
- Defense in depth, not a single wall. Combine container isolation + network-mode-none + seccomp + dropped capabilities + non-root user + resource limits. Any single layer may have vulnerabilities; the combination raises the bar dramatically.
- Never inject secrets into the sandbox. Mediate API calls through the orchestrator, or use short-lived tokens with narrow scopes. Treat the sandbox as adversary-controlled at all times.
- Cold starts are the latency bottleneck. Warm pools and pre-baked images are the primary mitigations. Budget 0.5–3 seconds for the first execution; subsequent cells in a stateful REPL add only execution time.
- Sandbox cost is noise next to LLM cost. A 60-second sandbox session costs a fraction of a cent. Optimize LLM token usage (truncate outputs, limit turns) before worrying about sandbox compute.
- The file system is the interface. Upload inputs, execute code, download outputs. Design clear conventions for input and output directories, enforce read-only mounts for reference data, and extract artifacts explicitly.
Read it yourself
- E2B documentation — e2b.dev/docs. The canonical reference for cloud sandbox API, templates, and lifecycle management.
- Firecracker design docs — github.com/firecracker-microvm/firecracker. The microVM that powers AWS Lambda; explains the security and performance model.
- gVisor documentation — gvisor.dev. How the user-space kernel intercept works; configuration for Docker and Kubernetes.
- OpenAI Code Interpreter internals — Simon Willison’s analysis. Reverse-engineering of the sandbox environment, package list, and execution model.
- SWE-bench — swebench.com. The benchmark for software engineering agents; studies in multi-turn code execution on real repositories.
- Modal documentation — modal.com/docs. Alternative cloud sandbox with strong Python-native APIs and GPU support.
- Docker security best practices — docs.docker.com/engine/security. Seccomp profiles, AppArmor, capability management.
Practice
- Implement a minimal code-as-tool agent using the `subprocess`-based skeleton from Section 133.2. Give it a user query like “What is the 50th Fibonacci number?” and verify it generates and executes correct Python. Then ask it to read a CSV file you provide — observe how it handles file paths.
- Harden the subprocess sandbox by replacing `subprocess.run` with a Docker container (Section 133.6). Add `network_mode="none"`, memory limits, and a timeout. Test by asking the agent to run `import socket; socket.create_connection(("8.8.8.8", 53))` and confirm it fails.
- Build a stateful REPL sandbox using the Jupyter kernel approach from Section 133.3. Execute three cells in sequence: (a) `import pandas as pd; df = pd.DataFrame({"x": range(100)})`, (b) `df["y"] = df["x"] ** 2`, (c) `print(df.describe())`. Verify that state persists across cells.
- Create an E2B sandbox template (Section 133.5) that pre-installs pandas, matplotlib, and scikit-learn. Measure the cold-start time with and without the template. How much time does pre-baking save?
- Implement an artifact extraction pipeline (Section 133.7). Have the agent generate a matplotlib chart, save it to `/home/user/output/chart.png`, then extract the PNG bytes from the sandbox. Display the chart locally using `PIL` or save it to disk.
- Design a network allowlist proxy (Section 133.9). Configure an Envoy or nginx reverse proxy that allows the sandbox to reach only `api.github.com` and blocks all other destinations. Verify with `curl` inside the sandbox.
- Stretch: Build a warm-pool manager (Section 133.11) that maintains 3 pre-created Docker sandboxes. Measure the latency difference between acquiring a warm sandbox vs. creating one cold. Then add a “poisoned sandbox” test: after an agent session finishes, verify that the sandbox is destroyed (not returned to the pool) so the next user cannot access leftover state.