Part XI · Building Agents and Agent Infrastructure
Chapter 133 ~21 min read

Code-executing agents: sandboxes, REPLs, and the code-as-tool pattern

"Most agent frameworks give a model a fixed menu of tools—search, calculator, API call. The moment you hand the model a *code interpreter* instead, the menu becomes infinite: any computation expressible in a programming language is now a single tool call away. This chapter maps the architecture, security boundaries, and production patterns that make code-executing agents safe and useful."

133.1 — Why code execution changes everything

A conventional tool-using agent couples each capability to a hand-written function: get_weather(city), query_database(sql), send_email(to, body). Scaling this approach to hundreds of tasks means writing hundreds of tools. Worse, each new task requires a deployment cycle.

Code execution collapses the combinatorial explosion into one meta-tool: run this code. A model that can emit Python (or JavaScript, or shell) can compute Fibonacci numbers, parse CSVs, render charts, install packages, and call APIs—all through a single tool endpoint.

Three properties make code execution uniquely powerful:

  1. Composability. The model chains library calls, control flow, and data transformations in a single action rather than orchestrating dozens of atomic tool invocations.
  2. Expressiveness. Anything a human developer can do in a REPL, the agent can attempt—statistical tests, image manipulation, web scraping, file format conversion.
  3. Self-correction. When code throws an exception, the traceback is itself a rich observation. The model reads the error, edits the code, and retries—a tight feedback loop that purely declarative tools cannot match.
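The third property is worth pausing on: a traceback is structured feedback the model can act on. A minimal sketch of treating the traceback as an observation (run_cell is an illustrative helper, not part of any framework):

```python
import traceback

def run_cell(code: str) -> dict:
    """Execute one cell; on failure, return the traceback as the observation."""
    try:
        exec(compile(code, "<cell>", "exec"), {})
        return {"ok": True, "observation": ""}
    except Exception:
        return {"ok": False, "observation": traceback.format_exc()}

first = run_cell("print(undefined_name)")   # NameError -> rich traceback
# A real agent feeds first["observation"] back to the model, which emits a fix:
second = run_cell("value = 42\nprint(value)")
```

The observation string names the exception class, the offending line, and the call stack, which is usually enough for the model to produce a corrected cell on the next turn.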

The cost is risk: arbitrary code execution on shared infrastructure opens the door to data exfiltration, resource abuse, and lateral movement. The rest of this chapter is largely about managing that risk while preserving the expressiveness.


133.2 — Code-as-tool pattern: LLM generates code, runtime executes, result goes back

The code-as-tool pattern is a three-party contract:

Party             Role
Orchestrator      Sends the user query plus system prompt to the LLM; receives a tool-call requesting code execution
LLM               Generates a code snippet (with optional natural-language explanation) as the tool-call payload
Sandbox runtime   Executes the snippet in an isolated environment; returns stdout, stderr, generated files, and exit code

The orchestrator feeds the execution result back to the LLM as the tool response, and the loop continues until the model emits a final answer.

# Minimal code-as-tool orchestrator (conceptual)
import openai, subprocess, tempfile, json

SYSTEM = """You have a tool called `execute_python`.
Call it with {"code": "<python code>"} whenever you need computation.
You will receive {"stdout": "...", "stderr": "...", "exit_code": 0}."""

def execute_python(code: str) -> dict:
    """Run code in a subprocess and capture output (conceptual sketch)."""
    with tempfile.NamedTemporaryFile(suffix=".py", mode="w", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            ["python3", path],
            capture_output=True, text=True, timeout=30
        )
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": "Execution timed out after 30 s", "exit_code": -1}
    return {
        "stdout": result.stdout[-4000:],   # truncate for context window
        "stderr": result.stderr[-2000:],
        "exit_code": result.returncode,
    }

tools = [{
    "type": "function",
    "function": {
        "name": "execute_python",
        "description": "Execute Python code and return stdout/stderr.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
}]

def agent_loop(user_query: str, max_turns: int = 10):
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": user_query},
    ]
    for _ in range(max_turns):
        resp = openai.chat.completions.create(
            model="gpt-4o", messages=messages, tools=tools
        )
        msg = resp.choices[0].message
        messages.append(msg)

        if msg.tool_calls:
            for tc in msg.tool_calls:
                args = json.loads(tc.function.arguments)
                result = execute_python(args["code"])
                messages.append({
                    "role": "tool",
                    "tool_call_id": tc.id,
                    "content": json.dumps(result),
                })
        else:
            return msg.content  # final answer
    return "Max turns reached."

This skeleton already works—but subprocess.run on the host machine is not safe for production. Sections 133.4–133.6 replace that subprocess with a hardened sandbox.

Figure 133.1 — The code-as-tool pattern. The orchestrator mediates between the LLM (which generates code) and the sandbox (which executes it). Results feed back until the model produces a final answer.

133.3 — The REPL agent loop (generate → execute → observe → iterate)

The acronym REPL (Read-Eval-Print-Loop) comes from interactive interpreters, but the agent version extends the concept:

  1. Generate. The LLM writes a code cell—typically 5–50 lines of Python.
  2. Execute. The sandbox runs the cell. State persists between cells (variables, imports, files on disk).
  3. Observe. The orchestrator returns stdout, stderr, rendered images, and any file artifacts to the LLM.
  4. Iterate. The LLM decides whether to write another cell (fix a bug, refine a plot, run a follow-up query) or emit a final answer.

State persistence is the critical difference between a REPL agent and a one-shot code tool. A REPL sandbox keeps the Python interpreter alive across cells so that a DataFrame loaded in cell 1 is available in cell 5. This mirrors how a human works in a Jupyter notebook.

# REPL agent: stateful IPython kernel via Jupyter client
from jupyter_client import KernelManager
import queue

class REPLSandbox:
    """Wraps a live IPython kernel for multi-turn code execution."""

    def __init__(self, timeout: int = 30):
        self.km = KernelManager(kernel_name="python3")
        self.km.start_kernel()
        self.kc = self.km.client()
        self.kc.start_channels()
        self.kc.wait_for_ready(timeout=15)
        self.timeout = timeout

    def execute(self, code: str) -> dict:
        msg_id = self.kc.execute(code)
        outputs: list[str] = []
        errors: list[str] = []
        images: list[bytes] = []

        while True:
            try:
                msg = self.kc.get_iopub_msg(timeout=self.timeout)
            except queue.Empty:
                break
            content = msg["content"]
            msg_type = msg["msg_type"]

            if msg_type == "stream":
                outputs.append(content["text"])
            elif msg_type == "error":
                errors.append("\n".join(content["traceback"]))
            elif msg_type in ("display_data", "execute_result"):
                if "image/png" in content.get("data", {}):
                    import base64
                    images.append(base64.b64decode(content["data"]["image/png"]))
                if "text/plain" in content.get("data", {}):
                    outputs.append(content["data"]["text/plain"])
            elif msg_type == "status" and content["execution_state"] == "idle":
                break

        return {
            "stdout": "".join(outputs)[-4000:],
            "stderr": "".join(errors)[-2000:],
            "images": images,  # pass to multimodal model or save as artifacts
        }

    def shutdown(self):
        self.kc.stop_channels()
        self.km.shutdown_kernel(now=True)

Design decision: stateful vs. stateless. Stateful sandboxes are natural for exploratory data analysis. Stateless (fresh process per call) sandboxes are simpler to reason about for security and scaling. Many production systems offer both modes.
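The stateless end of that trade-off can be sketched with a fresh interpreter per call (illustrative only; a production version would run inside a sandboxed runtime rather than a bare subprocess):

```python
import subprocess
import sys

def stateless_execute(code: str) -> dict:
    """One fresh interpreter per call: nothing survives between cells."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=30,
    )
    return {"stdout": proc.stdout, "stderr": proc.stderr, "exit_code": proc.returncode}

stateless_execute("x = 41")                  # cell 1 defines x...
result = stateless_execute("print(x + 1)")   # ...cell 2 cannot see it
print("NameError" in result["stderr"])       # True: state did not persist
```

The failure in cell 2 is exactly the point: each call gets a clean interpreter, which is what makes stateless mode easy to reason about for security and scaling.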


133.4 — Sandbox architectures: process isolation, container (Docker/gVisor), microVM (Firecracker), cloud (E2B, Modal)

Not all sandboxes are equal. The spectrum trades isolation strength against startup latency and operational complexity.

Level               Mechanism                      Startup             Isolation                                    Example
Process             subprocess + ulimit            ~5 ms               Weak (shared kernel, shared FS)              Local dev REPL
Container           Docker + seccomp/AppArmor      200–800 ms          Medium (shared kernel, namespaced FS/net)    Self-hosted sandbox
Container + gVisor  Docker with runsc runtime      300–1000 ms         Strong (user-space kernel intercept)         GCP Cloud Run
microVM             Firecracker, Cloud Hypervisor  125–300 ms          Very strong (separate guest kernel)          AWS Lambda, Fly Machines
Cloud sandbox       E2B, Modal, CodeSandbox SDK    300–2000 ms (cold)  Very strong (vendor-managed VM/container)    SaaS agents

Process-level isolation is only appropriate during local development. A single os.system("rm -rf /") or a reverse shell demonstrates why.
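Even for local development, the bare subprocess can be made somewhat less dangerous with POSIX resource limits applied in the child before exec. A sketch (Unix only; this is still weak isolation, not a sandbox):

```python
import resource
import subprocess
import sys

def limited(cpu_seconds: int = 5, mem_bytes: int = 512 * 1024**2):
    """Return a preexec_fn that applies rlimits in the child before exec."""
    def apply():
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
    return apply

result = subprocess.run(
    [sys.executable, "-c", "print(sum(range(10**6)))"],
    capture_output=True, text=True, timeout=10,
    preexec_fn=limited(),   # CPU and address-space caps — not isolation
)
print(result.stdout.strip())
```

Resource limits stop runaway loops and memory bombs; they do nothing against filesystem access or network egress, which is why the container and microVM tiers exist.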

Containers with a hardened seccomp profile and dropped capabilities are the most common self-hosted option. Adding gVisor (runsc) interposes a user-space kernel that intercepts every syscall, drastically reducing the host kernel attack surface.

MicroVMs (Firecracker) boot a minimal Linux guest in ~125 ms. Each execution gets its own kernel, memory space, and virtual network interface. AWS Lambda and Fly Machines both use this model.

Cloud sandbox services (E2B, Modal) abstract the VM/container lifecycle behind an API. You POST code; you GET results. Billing is per-second or per-execution.

[Figure: sandbox tiers plotted on an isolation (weak → strong) vs. startup-latency axis — Process ~5 ms, Docker ~500 ms, gVisor ~700 ms, microVM ~200 ms, cloud SaaS ~1 s cold; circle size ≈ operational complexity]

Figure 133.2 — Sandbox isolation vs. startup latency. MicroVMs achieve strong isolation with surprisingly low cold-start times. Cloud SaaS sandboxes shift operational complexity to the vendor.

133.5 — E2B in depth: API, sandbox lifecycle, file system, networking

E2B (short for “Environment to Build”) provides cloud sandboxes purpose-built for AI code execution. Each sandbox is a Firecracker microVM with its own filesystem, network stack, and process tree.

Lifecycle

create  →  running (idle timeout: 5 min default)  →  destroy
                ↑                                        │
                └────────── keep-alive ping ─────────────┘

You create a sandbox, run one or more code cells, then let it auto-destroy on timeout (or destroy explicitly).

Core API

from e2b_code_interpreter import Sandbox

# 1. Create a sandbox (cold start ~1-2s, warm ~300ms)
sbx = Sandbox(api_key="e2b_...", timeout=300)  # 5-min idle timeout

# 2. Execute code — state persists across calls
result = sbx.run_code("import pandas as pd; df = pd.DataFrame({'a': [1,2,3]})")
print(result.text)   # stdout
print(result.error)  # stderr / traceback (None if clean)

# 3. Multi-turn: variables survive
result2 = sbx.run_code("print(df.describe())")
print(result2.text)

# 4. Install packages at runtime
sbx.run_code("!pip install seaborn -q")

# 5. File operations
sbx.files.write("/home/user/data.csv", b"a,b\n1,2\n3,4")
csv_bytes = sbx.files.read("/home/user/data.csv")

# 6. Download generated artifacts (charts, reports)
sbx.run_code("""
import matplotlib.pyplot as plt
plt.plot([1,2,3],[4,5,6])
plt.savefig('/home/user/chart.png')
""")
png_bytes = sbx.files.read("/home/user/chart.png")

# 7. Networking: sandbox can make outbound HTTP requests
sbx.run_code("""
import requests
resp = requests.get('https://api.github.com')
print(resp.status_code)
""")

# 8. Tear down
sbx.close()

Custom templates

E2B lets you define a sandbox template — a Dockerfile that pre-bakes packages, datasets, and config files into the VM image. This eliminates per-session pip install latency.

# e2b.Dockerfile
FROM e2b/base:latest
RUN pip install pandas numpy scikit-learn matplotlib seaborn
COPY seed_data/ /home/user/data/

Then build the template from the directory containing the Dockerfile:

e2b template build --name "ds-sandbox"

Now Sandbox(template="ds-sandbox") starts with all libraries pre-installed.

Key limits (as of 2025)

Dimension           Default          Configurable
vCPUs               2                Up to 8
RAM                 512 MB           Up to 8 GB
Disk                1 GB             Up to 10 GB
Idle timeout        5 min            Up to 24 h
Max execution time  300 s per cell   Adjustable
Outbound network    Allowed          Can be restricted

133.6 — Docker-based sandboxes: building your own, resource limits, escape risks

Not every team wants a cloud dependency. A self-hosted Docker sandbox gives you full control—at the cost of owning the security posture.

Minimal implementation

import docker
import uuid
import tarfile
import io

class DockerSandbox:
    """Ephemeral Docker container for code execution."""

    IMAGE = "python:3.12-slim"
    MEM_LIMIT = "256m"
    CPU_PERIOD = 100_000
    CPU_QUOTA = 50_000   # 50% of one core
    TIMEOUT = 30

    def __init__(self):
        self.client = docker.from_env()
        self.container = self.client.containers.run(
            self.IMAGE,
            command="sleep infinity",
            detach=True,
            mem_limit=self.MEM_LIMIT,
            cpu_period=self.CPU_PERIOD,
            cpu_quota=self.CPU_QUOTA,
            network_mode="none",            # no network access
            security_opt=["no-new-privileges"],
            read_only=False,                # whole FS writable; harden per checklist below
            tmpfs={"/tmp": "size=64m"},     # size-capped scratch space
            name=f"sandbox-{uuid.uuid4().hex[:8]}",
        )

    def execute(self, code: str) -> dict:
        """Write code to file, exec inside container, capture output."""
        # Inject code as a tar archive
        code_bytes = code.encode()
        tar_buf = io.BytesIO()
        with tarfile.open(fileobj=tar_buf, mode="w") as tar:
            info = tarfile.TarInfo(name="run.py")
            info.size = len(code_bytes)
            tar.addfile(info, io.BytesIO(code_bytes))
        tar_buf.seek(0)
        self.container.put_archive("/tmp", tar_buf)

        # Execute. docker-py's exec_run has no timeout parameter, so the
        # wall-clock limit is enforced with coreutils `timeout` inside the
        # container (exit code 124 on timeout).
        exit_code, output = self.container.exec_run(
            ["timeout", str(self.TIMEOUT), "python3", "/tmp/run.py"],
            demux=True,
        )
        stdout = (output[0] or b"").decode(errors="replace")[-4000:]
        stderr = (output[1] or b"").decode(errors="replace")[-2000:]
        return {"stdout": stdout, "stderr": stderr, "exit_code": exit_code}

    def upload_file(self, container_path: str, data: bytes):
        tar_buf = io.BytesIO()
        with tarfile.open(fileobj=tar_buf, mode="w") as tar:
            info = tarfile.TarInfo(name=container_path.split("/")[-1])
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
        tar_buf.seek(0)
        directory = "/".join(container_path.split("/")[:-1]) or "/"
        self.container.put_archive(directory, tar_buf)

    def download_file(self, container_path: str) -> bytes:
        bits, _ = self.container.get_archive(container_path)
        tar_buf = io.BytesIO(b"".join(bits))
        with tarfile.open(fileobj=tar_buf) as tar:
            member = tar.getmembers()[0]
            return tar.extractfile(member).read()

    def destroy(self):
        self.container.remove(force=True)

Hardening checklist

Control                                Why
network_mode="none"                    Prevents data exfiltration and reverse shells
mem_limit / cpu_quota                  Prevents fork bombs and crypto-mining
no-new-privileges                      Blocks setuid escalation
read_only=True + tmpfs on /tmp         Limits persistent filesystem writes
Drop all capabilities: cap_drop=["ALL"]  Minimizes kernel attack surface
gVisor runtime (runtime="runsc")       Intercepts syscalls in user space
Seccomp profile                        Allowlist of ~60 safe syscalls
AppArmor/SELinux                       Mandatory access control on file paths

Escape risks

Even with all of the above, container escapes are published regularly (e.g., CVE-2024-21626 in runc). Defenses in depth:

  • Run the Docker daemon on a dedicated VM that has no access to production databases or secrets.
  • Use gVisor or Firecracker if the threat model includes sophisticated attackers.
  • Rotate sandbox VMs frequently; treat them as cattle, not pets.

133.7 — File system access: mounting volumes, read-write boundary, artifact extraction

Code-executing agents frequently need to read input files (CSVs, images, PDFs) and produce output artifacts (charts, reports, transformed data). File system design is where convenience and security collide.

Pattern 1: Upload-before-execute

The orchestrator uploads files into the sandbox before the first code cell runs. The sandbox filesystem is the only shared surface.

# Orchestrator side
sandbox.upload_file("/home/user/input.csv", user_csv_bytes)
result = sandbox.execute("import pandas as pd; df = pd.read_csv('/home/user/input.csv'); print(df.shape)")

Pattern 2: Mounted read-only volume

For Docker sandboxes, mount a host directory as read-only so the agent can access reference data without the ability to modify it.

container = client.containers.run(
    image,
    volumes={"/data/reference": {"bind": "/mnt/reference", "mode": "ro"}},
    ...
)

Pattern 3: Artifact extraction

After execution, the orchestrator pulls files from a designated output directory.

# Convention: agent writes outputs to /home/user/output/
artifacts = {}
for filename in sandbox.list_files("/home/user/output/"):
    artifacts[filename] = sandbox.download_file(f"/home/user/output/{filename}")

The read-write boundary rule

Never mount a directory read-write unless the agent genuinely needs to persist data back to the host. Default to read-only mounts plus a writable /tmp or /home/user inside the container. Extract artifacts explicitly rather than sharing a writable volume.

This principle limits the blast radius of a compromised sandbox. Even if the agent writes malicious files, they exist only inside the ephemeral container and are destroyed on teardown.


133.8 — Language runtimes: Python, Node.js, shell, polyglot

Python dominates code-executing agents because of the data-science ecosystem (pandas, numpy, matplotlib, scikit-learn), but production systems increasingly support multiple runtimes.

Python

The default. Use IPython kernels for REPL-style state persistence, or plain python3 for stateless one-shot execution. Pre-install heavy packages in the sandbox image.

Node.js / TypeScript

Useful when the agent manipulates JSON APIs, generates front-end code, or runs Puppeteer for browser automation.

result = sandbox.execute(
    code="const resp = await fetch('https://api.example.com/data'); console.log(await resp.json());",
    language="javascript",
)

Shell (bash)

Agents that orchestrate CLI tools—ffmpeg, imagemagick, curl, jq—benefit from a shell runtime. Shell execution is also the natural escape hatch for installing OS packages.

result = sandbox.execute(
    code="apt-get update && apt-get install -y ffmpeg && ffmpeg -i input.mp4 -vn output.mp3",
    language="bash",
)

Polyglot sandboxes

Some frameworks let the model choose the runtime per cell. The sandbox image includes Python, Node, and bash; the tool schema adds a language field:

{
  "name": "execute_code",
  "parameters": {
    "type": "object",
    "properties": {
      "language": {"type": "string", "enum": ["python", "javascript", "bash"]},
      "code": {"type": "string"}
    },
    "required": ["language", "code"]
  }
}

Trade-off: polyglot sandboxes are larger images (~1–2 GB) with more installed attack surface. Restrict to the runtimes your use case actually requires.
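A minimal dispatcher for that schema maps the language field to an interpreter command. This is a local sketch — in production the dispatch runs inside the sandbox, and the INTERPRETERS map is an illustration, not a framework API:

```python
import subprocess

# Hypothetical language -> interpreter-command mapping for a polyglot cell runner.
INTERPRETERS = {
    "python": ["python3", "-c"],
    "javascript": ["node", "-e"],
    "bash": ["bash", "-c"],
}

def run_cell(language: str, code: str, timeout: int = 30) -> dict:
    if language not in INTERPRETERS:
        raise ValueError(f"unsupported language: {language}")
    proc = subprocess.run(
        INTERPRETERS[language] + [code],
        capture_output=True, text=True, timeout=timeout,
    )
    return {"stdout": proc.stdout, "stderr": proc.stderr, "exit_code": proc.returncode}

print(run_cell("python", "print(6 * 7)")["stdout"].strip())   # 42
```

Rejecting unknown languages up front, rather than passing arbitrary strings to a shell, is itself a small security control.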


133.9 — Security model: permission boundary, network isolation, credential management, time limits

Security for code-executing agents operates on four axes.

1. Permission boundary

The principle of least privilege applied to sandboxes:

  • Run as a non-root user inside the container.
  • Drop all Linux capabilities (cap_drop: ["ALL"]).
  • Enable seccomp to allowlist only necessary syscalls.
  • Use read-only filesystems except for designated scratch directories.
container = client.containers.run(
    image,
    user="1000:1000",
    cap_drop=["ALL"],
    security_opt=[
        "no-new-privileges",
        "seccomp=sandbox-seccomp.json",
    ],
    read_only=True,
    tmpfs={"/tmp": "size=64m,noexec"},
    ...
)

2. Network isolation

By default, disable all networking (network_mode="none"). If the agent legitimately needs outbound HTTP (e.g., to call an API), use an allowlist proxy:

# docker-compose.yml for sandbox with egress proxy
services:
  sandbox:
    image: sandbox:latest
    network_mode: "container:egress-proxy"

  egress-proxy:
    image: envoyproxy/envoy:v1.30
    volumes:
      - ./envoy-allowlist.yaml:/etc/envoy/envoy.yaml
    # envoy-allowlist.yaml permits only *.api.example.com

3. Credential management

Never inject secrets into the sandbox. If the agent needs to call an authenticated API:

  • The orchestrator calls the API on behalf of the agent (the sandbox returns a request spec, the orchestrator executes it).
  • Or use short-lived, narrowly-scoped tokens injected as environment variables with automatic expiry.
# Anti-pattern: injecting long-lived credentials into the sandbox
sandbox.execute("import os, requests; requests.get(url, headers={'Authorization': os.environ['API_KEY']})")

# Better: orchestrator-mediated API call; the sandbox emits only a request spec
spec = json.loads(sandbox.execute("print(json.dumps({'url': '...', 'method': 'GET'}))")["stdout"])
# Orchestrator validates the URL against an allowlist, adds auth, makes the call
response = requests.get(spec["url"], headers={"Authorization": f"Bearer {short_lived_token}"})
sandbox.inject_result(response.text)

4. Time and resource limits

Limit                  Recommended default   Purpose
Wall-clock timeout     30 s per cell         Prevents infinite loops
CPU quota              0.5–1 core            Prevents mining / DoS
Memory                 256 MB–1 GB           Prevents OOM on host
Disk                   100 MB writable       Prevents disk fill
Max cells per session  20–50                 Bounds total compute
Max sandbox lifetime   10–30 min             Prevents resource leaks
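The per-cell limits are enforced by the sandbox itself; the session-level budgets (max cells, max lifetime) live in the orchestrator. A sketch, where SessionLimits is an illustrative helper:

```python
import time

class SessionLimits:
    """Orchestrator-side budgets: cap total cells and total wall-clock lifetime."""

    def __init__(self, max_cells: int = 20, max_lifetime_s: float = 600.0):
        self.max_cells = max_cells
        self.deadline = time.monotonic() + max_lifetime_s
        self.cells_run = 0

    def charge_cell(self):
        """Call before each execution; raises when a budget is exhausted."""
        if self.cells_run >= self.max_cells:
            raise RuntimeError("cell budget exhausted")
        if time.monotonic() > self.deadline:
            raise RuntimeError("sandbox lifetime exceeded")
        self.cells_run += 1

limits = SessionLimits(max_cells=2)
limits.charge_cell()          # cell 1
limits.charge_cell()          # cell 2
try:
    limits.charge_cell()      # cell 3 exceeds the budget
except RuntimeError as e:
    print(e)                  # cell budget exhausted
```

Raising inside the orchestrator (rather than trusting the sandbox to stop itself) keeps the budget enforcement outside adversary-controlled code.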

133.10 — Production patterns: data analysis agents, SWE-bench style, notebook agents

Pattern A: Data analysis agent

The most common deployment. User uploads a CSV; the agent explores it, computes statistics, generates charts, and narrates findings.

class DataAnalysisAgent:
    """Production data-analysis agent with E2B sandbox."""

    SYSTEM_PROMPT = """You are a data analyst. The user has uploaded a file to
    /home/user/data.csv. Use the execute_python tool to explore, analyze, and
    visualize the data. Always print results so you can observe them.
    Save charts to /home/user/output/. When done, summarize your findings."""

    def __init__(self, llm_client, sandbox: Sandbox):
        self.llm = llm_client
        self.sbx = sandbox

    def run(self, user_query: str, csv_bytes: bytes) -> dict:
        # Upload user data
        self.sbx.files.write("/home/user/data.csv", csv_bytes)

        messages = [
            {"role": "system", "content": self.SYSTEM_PROMPT},
            {"role": "user", "content": user_query},
        ]

        for _ in range(15):  # max turns
            response = self.llm.chat(messages=messages, tools=self.tools)
            msg = response.message
            messages.append(msg)

            if not msg.tool_calls:
                # Collect artifacts
                artifacts = self._collect_artifacts()
                return {"answer": msg.content, "artifacts": artifacts}

            for tc in msg.tool_calls:
                result = self.sbx.run_code(tc.arguments["code"])
                messages.append(self._tool_response(tc.id, result))

        return {"answer": "Analysis incomplete — turn limit reached.", "artifacts": {}}

    def _collect_artifacts(self) -> dict:
        artifacts = {}
        try:
            for entry in self.sbx.files.list("/home/user/output/"):
                artifacts[entry.name] = self.sbx.files.read(
                    f"/home/user/output/{entry.name}"
                )
        except Exception:
            pass
        return artifacts

Pattern B: SWE-bench style (software engineering agent)

The agent clones a repo, reads the issue, writes a patch, runs tests, and iterates until tests pass. The sandbox needs git, a full development environment, and network access (for pip install).

# Orchestrator sets up the repo inside the sandbox
sandbox.run_code("""
!git clone https://github.com/org/repo.git /home/user/repo
!cd /home/user/repo && pip install -e '.[dev]' -q
""")

# Agent loop: read issue → locate files → write patch → run tests
# The LLM uses shell commands (cat, grep, sed) alongside Python

Key differences from data analysis: longer sessions (10–30 min), larger disk (cloned repo + dependencies), network access required, and the success metric is test pass rate rather than narrative quality.

Pattern C: Notebook agent

The agent works inside a Jupyter notebook, reading and writing cells. This is natural for data science workflows where the output is a notebook.

import nbformat

def notebook_agent(sbx, llm, user_query: str) -> nbformat.NotebookNode:
    nb = nbformat.v4.new_notebook()

    messages = [{"role": "user", "content": user_query}]
    for _ in range(20):
        resp = llm.chat(messages=messages, tools=tools)
        if not resp.message.tool_calls:
            # Add markdown cell with final summary
            nb.cells.append(nbformat.v4.new_markdown_cell(resp.message.content))
            break
        for tc in resp.message.tool_calls:
            code = tc.arguments["code"]
            result = sbx.run_code(code)
            # Record in notebook
            cell = nbformat.v4.new_code_cell(code)
            cell.outputs = [nbformat.v4.new_output(
                output_type="stream", name="stdout", text=result.text or ""
            )]
            nb.cells.append(cell)
            messages.append(tool_response(tc.id, result))
    return nb

133.11 — Cost and latency: cold starts, warm pools, per-execution billing

Cold start budget

Component                        Typical latency
VM / container creation          200–2000 ms
Language runtime init (Python)   50–200 ms
Package imports (pandas, numpy)  300–800 ms
Total cold start                 0.5–3 s

For interactive agents, a 3-second pause before the first code result is noticeable. Mitigation strategies:

Warm pools. Pre-create N sandboxes and keep them idle. When a request arrives, assign one instantly. Replenish the pool asynchronously.

import asyncio
from collections import deque

class SandboxPool:
    """Pre-warmed pool of sandbox instances."""

    def __init__(self, pool_size: int = 5, factory=None):
        self.factory = factory or (lambda: Sandbox())
        self.pool: deque = deque()
        self.pool_size = pool_size

    async def initialize(self):
        tasks = [asyncio.to_thread(self.factory) for _ in range(self.pool_size)]
        instances = await asyncio.gather(*tasks)
        self.pool.extend(instances)

    async def acquire(self) -> Sandbox:
        if self.pool:
            sbx = self.pool.popleft()
            # Replenish in background
            asyncio.create_task(self._replenish())
            return sbx
        # Pool exhausted — create on demand (cold start)
        return await asyncio.to_thread(self.factory)

    async def release(self, sbx: Sandbox):
        # Destroy used sandbox — never reuse between users
        await asyncio.to_thread(sbx.close)

    async def _replenish(self):
        sbx = await asyncio.to_thread(self.factory)
        self.pool.append(sbx)

Snapshot / clone. Some platforms (Firecracker with memory snapshots, E2B templates) let you snapshot a VM after packages are imported. Restoring from snapshot is faster than booting + importing.

Cost model

Provider           Billing unit                     Approximate cost (2025)
E2B                Per-second of sandbox uptime     ~$0.00005/s ($0.18/hr)
Modal              Per-second of compute (CPU+RAM)  ~$0.0001/s for 1 CPU / 1 GB
Self-hosted (AWS)  EC2 instance + overhead          ~$0.05–0.10/hr per sandbox host

For a typical data-analysis session (10 code cells, 60 s total sandbox time), the sandbox cost is $0.003–$0.006 — usually dwarfed by the LLM inference cost (~$0.02–0.10 per session with GPT-4o).
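The arithmetic behind that comparison, using the table's approximate rates (the LLM figure is the midpoint of the estimate above):

```python
# Back-of-envelope session cost at the table's approximate 2025 rates.
E2B_RATE_PER_S = 0.00005        # $/s of sandbox uptime
LLM_COST_PER_SESSION = 0.05     # midpoint of the $0.02-0.10 GPT-4o estimate

sandbox_seconds = 60            # 10 cells, ~6 s each
sandbox_cost = sandbox_seconds * E2B_RATE_PER_S
print(f"sandbox ${sandbox_cost:.4f} vs LLM ${LLM_COST_PER_SESSION:.2f} "
      f"(~{LLM_COST_PER_SESSION / sandbox_cost:.0f}x)")
```

At roughly a 17x gap, token-side savings (shorter outputs, fewer turns) pay back far faster than sandbox-side tuning.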

Optimization levers

  1. Batch cells. If the model generates code that clearly has multiple independent steps, execute them in one cell to reduce round-trips.
  2. Truncate output. Cap stdout at ~4 KB to avoid bloating the context window (and thus LLM cost).
  3. Reuse sandboxes within a session but never across users (security).
  4. Set aggressive idle timeouts (2–5 min) to avoid paying for idle VMs.
  5. Pre-bake images. Every pip install inside a live sandbox is wasted cold-start time and compute cost. Bake dependencies into the template.
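Lever 2 deserves one refinement: truncate from the front, not the back, because tracebacks and error messages appear at the end of output. A small helper (clip is illustrative):

```python
def clip(text: str, limit: int = 4000) -> str:
    """Keep the tail of long output; errors usually appear at the end."""
    if len(text) <= limit:
        return text
    dropped = len(text) - limit
    return f"...[{dropped} chars truncated]...\n" + text[-limit:]

long_output = "x" * 5000 + "\nValueError: bad input"
clipped = clip(long_output)
print("ValueError" in clipped)   # True: the traceback tail survives
```

The explicit truncation marker also tells the model that output was cut, so it can re-run with tighter printing rather than reasoning over silently incomplete data.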

133.12 — Mental model (8 takeaway points)

  1. Code execution is the universal tool. One meta-tool (run code) replaces an ever-growing registry of narrow tools. Design your agent with code-as-tool as the primary capability and add dedicated tools only when code is insufficient (e.g., human-in-the-loop approvals).

  2. Stateful REPLs beat stateless execution for exploratory work. Keeping the interpreter alive across cells lets the agent build on previous results, just like a human in a Jupyter notebook. Use stateless mode for one-shot computations where isolation is paramount.

  3. Isolation is non-negotiable in production. Process-level sandboxing is a development convenience, not a production strategy. At minimum, use hardened Docker containers; prefer gVisor or Firecracker microVMs for untrusted code.

  4. Defense in depth, not a single wall. Combine container isolation + network-mode-none + seccomp + dropped capabilities + non-root user + resource limits. Any single layer may have vulnerabilities; the combination raises the bar dramatically.

  5. Never inject secrets into the sandbox. Mediate API calls through the orchestrator, or use short-lived tokens with narrow scopes. Treat the sandbox as adversary-controlled at all times.

  6. Cold starts are the latency bottleneck. Warm pools and pre-baked images are the primary mitigations. Budget 0.5–3 seconds for the first execution; subsequent cells in a stateful REPL add only execution time.

  7. Sandbox cost is noise next to LLM cost. A 60-second sandbox session costs a fraction of a cent. Optimize LLM token usage (truncate outputs, limit turns) before worrying about sandbox compute.

  8. The file system is the interface. Upload inputs, execute code, download outputs. Design clear conventions for input and output directories, enforce read-only mounts for reference data, and extract artifacts explicitly.


Read it yourself

  • E2B documentation (e2b.dev/docs). The canonical reference for cloud sandbox API, templates, and lifecycle management.
  • Firecracker design docs (github.com/firecracker-microvm/firecracker). The microVM that powers AWS Lambda; explains the security and performance model.
  • gVisor documentation (gvisor.dev). How the user-space kernel intercept works; configuration for Docker and Kubernetes.
  • OpenAI Code Interpreter internals (Simon Willison’s analysis). Reverse-engineering of the sandbox environment, package list, and execution model.
  • SWE-bench (swebench.com). The benchmark for software engineering agents; studies in multi-turn code execution on real repositories.
  • Modal documentation (modal.com/docs). Alternative cloud sandbox with strong Python-native APIs and GPU support.
  • Docker security best practices (docs.docker.com/engine/security). Seccomp profiles, AppArmor, capability management.

Practice

  1. Implement a minimal code-as-tool agent using the subprocess-based skeleton from Section 133.2. Give it a user query like “What is the 50th Fibonacci number?” and verify it generates and executes correct Python. Then ask it to read a CSV file you provide — observe how it handles file paths.

  2. Harden the subprocess sandbox by replacing subprocess.run with a Docker container (Section 133.6). Add network_mode="none", memory limits, and a timeout. Test by asking the agent to run import socket; socket.create_connection(("8.8.8.8", 53)) and confirm it fails.

  3. Build a stateful REPL sandbox using the Jupyter kernel approach from Section 133.3. Execute three cells in sequence: (a) import pandas as pd; df = pd.DataFrame({"x": range(100)}), (b) df["y"] = df["x"] ** 2, (c) print(df.describe()). Verify that state persists across cells.

  4. Create an E2B sandbox template (Section 133.5) that pre-installs pandas, matplotlib, and scikit-learn. Measure the cold-start time with and without the template. How much time does pre-baking save?

  5. Implement an artifact extraction pipeline (Section 133.7). Have the agent generate a matplotlib chart, save it to /home/user/output/chart.png, then extract the PNG bytes from the sandbox. Display the chart locally using PIL or save it to disk.

  6. Design a network allowlist proxy (Section 133.9). Configure an Envoy or nginx reverse proxy that allows the sandbox to reach only api.github.com and blocks all other destinations. Verify with curl inside the sandbox.

  7. Stretch: Build a warm-pool manager (Section 133.11) that maintains 3 pre-created Docker sandboxes. Measure the latency difference between acquiring a warm sandbox vs. creating one cold. Then add a “poisoned sandbox” test: after an agent session finishes, verify that the sandbox is destroyed (not returned to the pool) so the next user cannot access leftover state.