Part V · Agents, Tool Use, Workflow Orchestration
Chapter 72 ~13 min read

Designing an agent orchestration layer

"An agent in production is not a Python script. It's a service with state, observability, retries, billing, and a UI. The orchestration layer is what makes it that."

This is the closing chapter of Part V. We’ve covered tool calling, agent loops, multi-agent patterns, MCP, the workflow vs agent decision, and the failure modes. This chapter is about the production wrapper: the orchestration layer that turns an agent loop into a deployable system.

By the end you’ll know how to design an agent orchestration layer with state management, observability, billing, and the operational features production demands.

Outline:

  1. The problem an orchestration layer solves.
  2. The components of an agent platform.
  3. Session and state management.
  4. Tool registry.
  5. Execution backbone.
  6. Observability for agents.
  7. Billing per step.
  8. The human-in-the-loop pattern.
  9. Multi-tenant agent serving.
  10. The Part V capstone.

72.1 The problem

A toy agent — the 100-line Python script from Chapter 67 — works great for a demo. It takes a query, runs the loop, returns an answer. Perfect for a Jupyter notebook.

A production agent has many more concerns:

  • Many concurrent users, each with their own session.
  • State that survives crashes (the agent in progress shouldn’t be lost if a server reboots).
  • Long-running execution (agents that take minutes to hours).
  • Tool availability that varies per tenant or per user.
  • Billing that tracks how much each user spent on LLM calls and tool calls.
  • Observability for debugging and monitoring.
  • Human-in-the-loop for approving sensitive actions.
  • Cancellation when the user changes their mind.
  • Retries when something transient fails.
  • A web UI that streams the agent’s progress to the user.

None of these are in the 100-line Python script. They’re what an orchestration layer provides.

The orchestration layer is the difference between “I built an agent” and “I shipped an agent product.” Most teams underinvest in this layer and discover they need it after the first production incident.

72.2 The components of an agent platform

A complete agent platform has these components:

Agent platform component map: the Web UI and API sit atop the Agent Service core (session manager, execution, tool router, billing capture, audit log, human-in-the-loop), which fans out to six downstream services: the Tool Registry (MCP servers), the LLM Service (vLLM or an external API), the State Store (Postgres/Redis), the Workflow Engine (Temporal), Observability (Prometheus/Loki), and the Billing Pipeline (Stripe or Amberflo), which receives per-call events from the Agent Service.

The production agent platform is six downstream services behind one Agent Service core; the toy 100-line agent script has none of these. Bridging that gap is what makes agents shippable.

Each component has its own responsibilities. Let me walk through them.

72.3 Session and state management

A session is a user’s interaction with the agent — typically a single goal the agent is pursuing. A session has:

  • Session ID: a unique identifier.
  • User ID: who owns the session.
  • Tenant ID: which org/customer.
  • Status: pending, running, paused, completed, failed, canceled.
  • Goal: the original user query.
  • Message history: the LLM messages so far.
  • Tool calls: the tool calls made and their results.
  • Final result: the answer (if completed).
  • Metadata: created_at, updated_at, model_version, etc.

The state has to be persisted so the session can survive crashes, be resumed by a different worker, and be queried later for debugging or billing.

stateDiagram-v2
  [*] --> pending : session created
  pending --> running : worker picks up
  running --> paused : awaiting human approval
  paused --> running : approved
  paused --> failed : rejected / timeout
  running --> completed : final answer produced
  running --> failed : max iterations / error
  running --> canceled : user cancels
  completed --> [*]
  failed --> [*]
  canceled --> [*]

A session’s status drives how workers, the UI, and the billing system treat it — only completed sessions are billed in full; failed sessions are billed only for work done.
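The state machine above is worth enforcing in code rather than trusting callers to set statuses correctly. A minimal sketch (the transition table mirrors the diagram; the function name is illustrative):

```python
# Allowed session-status transitions, mirroring the state diagram above.
ALLOWED_TRANSITIONS = {
    "pending":   {"running"},
    "running":   {"paused", "completed", "failed", "canceled"},
    "paused":    {"running", "failed"},
    "completed": set(),   # terminal
    "failed":    set(),   # terminal
    "canceled":  set(),   # terminal
}

def transition(current: str, new: str) -> str:
    """Return the new status, or raise if the transition is illegal."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {new}")
    return new
```

Putting this check in one place means a bug elsewhere (a worker marking a completed session as running, say) fails loudly instead of corrupting billing.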

The store: a relational database (Postgres) for the structured data, plus an object store (S3) for large fields like message histories and tool results.

A simple schema:

CREATE TABLE sessions (
    id UUID PRIMARY KEY,
    user_id UUID,
    tenant_id UUID,
    status VARCHAR,
    goal TEXT,
    created_at TIMESTAMP,
    updated_at TIMESTAMP,
    completed_at TIMESTAMP,
    final_result TEXT,
    metadata JSONB
);

CREATE TABLE session_messages (
    id UUID PRIMARY KEY,
    session_id UUID REFERENCES sessions,
    sequence INT,
    role VARCHAR,  -- user, assistant, tool
    content TEXT,
    tool_calls JSONB,
    created_at TIMESTAMP
);

CREATE TABLE session_tool_calls (
    id UUID PRIMARY KEY,
    session_id UUID REFERENCES sessions,
    tool_name VARCHAR,
    arguments JSONB,
    result JSONB,
    error TEXT,
    duration_ms INT,
    cost_cents INT,
    created_at TIMESTAMP
);

Every interaction with the agent updates the database. The agent can be resumed by reading the latest state and continuing from where it left off.
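Resuming is then just a read against that schema. A sketch, where `db` and its `fetch_messages` method are hypothetical stand-ins for whatever data-access layer sits over the tables above:

```python
# Sketch: resuming a crashed session from persisted state.
# `db.fetch_messages` is a hypothetical accessor over session_messages.

def resume_session(db, session_id):
    """Rebuild the in-memory message list from session_messages,
    ordered by sequence, and hand it back to the agent loop."""
    rows = db.fetch_messages(session_id)
    messages = [
        {"role": r["role"], "content": r["content"]}
        for r in sorted(rows, key=lambda r: r["sequence"])
    ]
    return messages  # the agent loop continues from the last step
```

The key property: nothing the loop needs lives only in a worker's memory, so any worker can pick the session up.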

72.4 Tool registry

The orchestration layer needs to know which tools are available. The tool registry is the component that:

  • Lists tools available to a particular agent run.
  • Filters tools by tenant, user, or session context.
  • Routes tool calls to the right backend (MCP server, internal API, external service).
  • Tracks tool usage for billing and rate limiting.

A typical tool registry maps each tool to:

  • The tool’s schema (for the LLM).
  • The backend that handles it (URL, MCP server reference, code function).
  • The tenants allowed to use it.
  • The cost per call (for billing).
  • The rate limit per tenant.

When an LLM emits a tool call, the orchestrator looks up the tool in the registry, checks permissions and rate limits, routes to the backend, gets the result, and returns it to the LLM.

The registry can be:

  • A database table, with tools registered manually or via API.
  • A YAML/JSON config file, loaded at startup.
  • Dynamic discovery, by querying connected MCP servers at startup.

Most production systems use a combination: dynamic discovery for MCP servers, plus a database for custom tools.
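The registry's core lookup can be sketched in a few lines. This is a minimal in-memory version, with illustrative field names rather than any standard API:

```python
# Sketch: a minimal in-memory tool registry with per-tenant permissions.
from dataclasses import dataclass, field

@dataclass
class ToolEntry:
    name: str
    schema: dict                 # JSON schema shown to the LLM
    backend: str                 # e.g. "mcp://search" or an internal URL
    allowed_tenants: set = field(default_factory=set)
    cost_cents: int = 0
    rate_limit_per_min: int = 60

class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, entry: ToolEntry):
        self._tools[entry.name] = entry

    def resolve(self, name: str, tenant_id: str) -> ToolEntry:
        """Look up a tool and enforce tenant permissions before routing."""
        entry = self._tools.get(name)
        if entry is None:
            raise KeyError(f"unknown tool: {name}")
        if tenant_id not in entry.allowed_tenants:
            raise PermissionError(f"tenant {tenant_id} cannot use {name}")
        return entry
```

The real version backs `register` with the database or MCP discovery, but the resolve-then-check shape stays the same.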

72.5 Execution backbone

The actual running of the agent loop. Two main approaches:

In-process execution

The agent loop runs as a Python (or other language) function inside the agent service. State is held in memory for the duration of the run, then persisted at the end.

Pros: simple, low-latency. Cons: doesn’t survive crashes; one-shot execution model.

Good for short agent runs (< 30 seconds) where crashes are rare.

Workflow-engine execution

The agent loop runs as a durable workflow in a workflow engine like Temporal. Each step (LLM call, tool call) is a separate “activity” that’s persisted on completion. If the worker crashes, another worker picks up the workflow from the last completed activity.

Pros: durable, survives crashes, can run for hours/days, has built-in retries and timeouts. Cons: more complex, more latency per step (the workflow engine adds overhead).

Good for long-running agents and high-reliability use cases.

For most production agent platforms, Temporal-based execution is the right choice. The reliability gains are significant.

The Temporal workflow looks like:

import asyncio
import json
from datetime import timedelta

from temporalio import workflow

MAX_ITERATIONS = 20  # hard cap on agent loop steps

@workflow.defn
class AgentWorkflow:
    @workflow.run
    async def run(self, goal: str) -> str:
        messages = [{"role": "user", "content": goal}]

        for i in range(MAX_ITERATIONS):
            # LLM call as a durable activity (llm_generate is an
            # activity function registered with the worker)
            response = await workflow.execute_activity(
                llm_generate,
                messages,
                start_to_close_timeout=timedelta(minutes=5),
            )

            messages.append(response.message)

            if response.tool_calls:
                # Run tool calls in parallel, each as its own activity
                tool_results = await asyncio.gather(*[
                    workflow.execute_activity(
                        execute_tool,
                        tool_call,
                        start_to_close_timeout=timedelta(seconds=30),
                    )
                    for tool_call in response.tool_calls
                ])
                for tc, result in zip(response.tool_calls, tool_results):
                    messages.append({
                        "role": "tool",
                        "tool_call_id": tc.id,
                        "content": json.dumps(result),
                    })
            else:
                # No tool calls: the assistant message is the final answer
                return response.message.content

        raise RuntimeError("Max iterations exceeded")

Each execute_activity call is durable. The workflow can crash and resume. Retries are automatic. Timeouts are enforced.

This is the production pattern. Variations exist (LangGraph with persistence, custom state machines), but Temporal-based execution is the most mature.

72.6 Observability for agents

Agents are hard to debug. The orchestration layer should make debugging tractable through extensive observability.

The data to capture:

Per-session metrics:

  • Total LLM calls.
  • Total tool calls (and per-tool counts).
  • Total tokens (input and output).
  • Total wall-clock time.
  • Final status (success, failure, timeout).
  • Cost (computed from token counts and tool prices).
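The per-session rollup is a straightforward aggregation over the recorded steps. A sketch, with record shapes loosely mirroring the session_tool_calls schema from earlier (field names are illustrative):

```python
# Sketch: rolling per-session metrics up from recorded steps.

def session_metrics(llm_calls: list, tool_calls: list) -> dict:
    """Aggregate per-step records into the per-session metrics above."""
    return {
        "llm_calls": len(llm_calls),
        "tool_calls": len(tool_calls),
        "input_tokens": sum(c["input_tokens"] for c in llm_calls),
        "output_tokens": sum(c["output_tokens"] for c in llm_calls),
        "cost_cents": sum(c["cost_cents"] for c in llm_calls)
                      + sum(t["cost_cents"] for t in tool_calls),
    }
```

In practice this runs as a query over the session tables or as a metrics export, but the shape is the same.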

Per-step traces:

  • For each step, capture the LLM input, the LLM output, any tool calls, and the tool results.
  • Use OpenTelemetry traces with one span per step.
  • Link spans into a single trace per session.

Error details:

  • When something fails, capture the full context: messages, tool state, error.
  • Include stack traces for crashes.

Audit log:

  • A separate, append-only log of every action the agent took.
  • Used for compliance and security review.

The observability stack is typically:

  • Prometheus for metrics.
  • Tempo / Jaeger for traces.
  • Loki for logs.
  • A custom session viewer UI for browsing past sessions interactively.

A good session viewer is one of the most valuable internal tools for an agent platform. Engineers should be able to look up any past session and see exactly what happened, step by step. This is how you debug production agent issues.

72.7 Billing per step

LLM calls cost money. Tool calls cost money. The orchestration layer should track costs per session and feed them into a billing system.

Per-call cost capture:

def llm_generate(session_id, messages, model):
    response = llm.call(messages, model=model)
    cost = compute_llm_cost(
        response.usage.input_tokens, response.usage.output_tokens, model
    )
    record_billing_event(session_id, "llm_call", cost)
    return response

def execute_tool(session_id, tool_call):
    result = tool_router.call(tool_call)
    cost = TOOL_PRICES.get(tool_call.name, 0)  # flat per-call price; 0 if unpriced
    record_billing_event(
        session_id, "tool_call", cost, metadata={"tool": tool_call.name}
    )
    return result

The billing events go to a central billing pipeline (Chapter 83), which aggregates by user/tenant and produces invoices.

The granularity matters: per-call billing is needed for accurate pricing of agent products. Aggregated daily billing is simpler but less accurate.

For most agent platforms, per-call billing with periodic aggregation is the right approach. Capture every event, aggregate them in a billing service, expose to users via a dashboard.
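A cost function like the `compute_llm_cost` used above is just a per-token price table. A sketch; the per-million-token prices here are made up for illustration, not real model pricing:

```python
# Sketch: token-based LLM cost in cents. Prices are hypothetical.
PRICES_PER_M = {
    # model: (input cents, output cents) per million tokens
    "llama-3-70b": (60, 80),
}

def compute_llm_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    """Cost of one LLM call in cents, given token counts and model."""
    price_in, price_out = PRICES_PER_M[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000
```

Keeping the table in one place makes repricing a config change rather than a code hunt.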

72.8 The human-in-the-loop pattern

For sensitive actions (sending emails, making payments, modifying production systems), the agent should not act autonomously. Instead, it should request human approval.

The pattern:

  1. The agent decides it wants to call a sensitive tool.
  2. The orchestrator intercepts the call before executing it.
  3. The orchestrator sends a notification to a human reviewer with the context (what the agent wants to do, why).
  4. The reviewer approves or rejects.
  5. The orchestrator sends the result back to the agent (success or “user denied”).
  6. The agent proceeds.

sequenceDiagram
  participant A as Agent loop
  participant O as Orchestrator
  participant H as Human reviewer

  A->>O: wants to call send_email()
  O->>H: approval request (context + proposed action)
  Note over H: reviews in web UI
  H-->>O: approve / reject
  O-->>A: result signal
  A->>A: continue (or abort if rejected)

Human-in-the-loop is implemented as a workflow signal: the agent pauses at the sensitive action and waits up to 24 hours for a reviewer’s decision before proceeding.

In Temporal-based execution, this is implemented with a signal: the workflow pauses at the sensitive action (notifying the orchestrator via an activity) and waits for an approval signal before proceeding.

import asyncio
from datetime import timedelta

from temporalio import workflow

@workflow.defn
class AgentWorkflow:
    def __init__(self) -> None:
        self.approval = None  # set by the signal handler

    @workflow.signal
    def submit_approval(self, approved: bool) -> None:
        self.approval = approved

    @workflow.run
    async def run(self, goal: str) -> str:
        # ... agent loop
        if tool_call.is_sensitive:
            try:
                # Block until the reviewer's signal arrives, up to 24 h
                await workflow.wait_condition(
                    lambda: self.approval is not None,
                    timeout=timedelta(hours=24),
                )
            except asyncio.TimeoutError:
                return "Approval timed out"
            if not self.approval:
                return "User denied the action"
        # ... continue

The human reviewer interacts via a web UI that shows pending approvals and lets them click approve/reject.

Human-in-the-loop is essential for high-stakes agent applications. It’s the only way to give users confidence that the agent won’t do something irreversible without their consent.

72.9 Multi-tenant agent serving

A production agent platform serves many tenants. The orchestration layer needs to handle:

  • Per-tenant tool access: tenant A can use tools X, Y; tenant B can use Y, Z.
  • Per-tenant rate limits: prevent one tenant from monopolizing the system.
  • Per-tenant model selection: tenant A uses Llama 3 70B, tenant B uses GPT-4o.
  • Per-tenant billing: track costs separately.
  • Per-tenant data isolation: tenant A’s session data is not visible to tenant B.

The implementation: every operation includes a tenant ID. The tool registry checks tenant ID for permissions. The rate limiter checks tenant ID for quotas. The state store stores tenant ID with every record. The billing system aggregates by tenant ID.
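The per-tenant rate limiter in that list can be sketched as a sliding window keyed by tenant ID (window and quota values here are illustrative):

```python
# Sketch: a per-tenant sliding-window rate limiter.
import time
from collections import defaultdict, deque

class TenantRateLimiter:
    def __init__(self, limit: int, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self._calls = defaultdict(deque)  # tenant_id -> call timestamps

    def allow(self, tenant_id: str, now=None) -> bool:
        """True if this tenant may make another call right now."""
        now = time.monotonic() if now is None else now
        q = self._calls[tenant_id]
        while q and now - q[0] > self.window_s:
            q.popleft()               # drop calls outside the window
        if len(q) >= self.limit:
            return False              # quota exhausted for this tenant
        q.append(now)
        return True
```

Because the deques are keyed by tenant, one tenant hitting its quota never blocks another; a production version would back this with Redis so all workers share the counts.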

Multi-tenancy is a cross-cutting concern that has to be designed in from the start. Adding it later is painful.

72.10 The Part V capstone

This closes Part V — Agents, Tool Use, and Workflow Orchestration. You now have:

  • Chapter 66: tool calling and the wire protocols.
  • Chapter 67: the agent loop and the patterns (ReAct, plan-and-execute, reflection).
  • Chapter 68: multi-agent patterns and when they’re worth it.
  • Chapter 69: MCP in depth.
  • Chapter 70: workflow vs agent — the most important decision.
  • Chapter 71: production agent failure modes.
  • Chapter 72 (this chapter): the orchestration layer that makes agents production-ready.

You should be able to:

  • Design an agent system for any task.
  • Pick between workflow and agent based on the requirements.
  • Implement the basic ReAct loop.
  • Integrate MCP for tool access.
  • Defend against the common failure modes.
  • Build a production agent platform with state, observability, billing.

In Part VI we shift from agents to distributed systems and the request lifecycle — the broader systems engineering layer below the LLM application.

72.11 The mental model

Eight points to take into Part VI:

  1. A production agent is a service, not a script. It needs state, observability, billing, retries.
  2. Sessions are first-class. Persistent state in a database, long-running execution.
  3. Tool registry centralizes which tools exist, who can call them, and how to route calls.
  4. Workflow-engine execution (Temporal) is the right backbone for durable agents.
  5. Observability is critical. Per-step traces, audit logs, session viewer UI.
  6. Per-call billing captures the cost of every LLM call and tool call.
  7. Human-in-the-loop for sensitive actions. Implemented as workflow signals.
  8. Multi-tenancy is a cross-cutting concern designed in from the start.

In Part VI we look at the broader distributed systems engineering layer.


Read it yourself

  • The Temporal documentation on workflows, activities, signals, queries.
  • The LangGraph documentation on persistent execution.
  • Production blog posts from companies building agent platforms (search “agent platform architecture”).
  • The Anthropic / OpenAI documentation on building production agent systems.

Practice

  1. Sketch the architecture for an agent platform serving multiple tenants. Identify each component.
  2. Why is workflow-engine execution better than in-process for production agents? List three reasons.
  3. Design a session schema for an agent platform. What fields would you include?
  4. Implement a Temporal workflow for a simple agent loop. Use signals for human-in-the-loop.
  5. Why is per-call billing important for agent products? Compute the cost of a single agent run.
  6. How would you handle a sensitive action (e.g., sending an email) in your agent platform?
  7. Stretch: Build a minimal agent platform with state persistence (SQLite), observability (logging), and billing (in-memory ledger). Run a few agent sessions and inspect the state.