Part XI · Building Agents and Agent Infrastructure
Chapter 140 ~23 min read

Designing an agent product: UX, streaming, and the human-in-the-loop contract

"You can build the most capable agent in the world, but if users cannot predict what it will do, cannot stop it when it goes wrong, and cannot understand what it costs, they will turn it off after one session. Agent design is product design first, model design second"

The previous chapters gave us tool use, planning loops, memory, and evaluation. This capstone chapter zooms out from the loop internals and asks the product question: how should a human experience an agent? We will cover the UX spectrum from invisible background automation to full collaborative partnership, the streaming primitives that make agents feel responsive, the human-in-the-loop contract that keeps users in control, and the business model decisions that determine whether the product survives contact with a credit-card statement.


140.1 — When to ship agent vs workflow vs form

Not every problem needs an agent. The first design decision is choosing the right level of autonomy.

| Dimension | Static form / wizard | Deterministic workflow | Agent |
| --- | --- | --- | --- |
| Input space | Fixed, enumerable | Fixed triggers, parameterized | Open-ended natural language |
| Steps | Hard-coded | DAG of known steps | Dynamic plan, tool selection |
| Failure mode | Validation error | Step retry / dead-letter | Unbounded cost, hallucination |
| When to pick | Compliance, speed | ETL, CI/CD, notifications | Ambiguous goals, exploration |

Rule of thumb: if you can enumerate every path a user might take, use a workflow. If the user’s intent requires interpretation and the action space is large, an agent earns its complexity. Many successful products start as workflows and graduate individual features to agent mode once they have usage data.

# decision_matrix.py — programmatic triage
from enum import Enum

class ExecutionMode(Enum):
    FORM = "form"
    WORKFLOW = "workflow"
    AGENT = "agent"

def choose_mode(
    input_is_structured: bool,
    steps_enumerable: bool,
    avg_actions_per_task: int,
    user_trust_level: float,   # 0-1
) -> ExecutionMode:
    """Heuristic triage — not a substitute for product judgement."""
    if input_is_structured and steps_enumerable:
        return ExecutionMode.FORM
    if steps_enumerable and avg_actions_per_task < 8:
        return ExecutionMode.WORKFLOW
    if avg_actions_per_task >= 8 or not steps_enumerable:
        if user_trust_level < 0.3:
            return ExecutionMode.WORKFLOW   # agent not yet trusted
        return ExecutionMode.AGENT
    return ExecutionMode.WORKFLOW

The function above is intentionally naive. The point is to force the conversation: what does user_trust_level mean for our users today, and what would move it?


140.2 — Agent UX spectrum: invisible → assistive → collaborative → autonomous

Agents sit on a spectrum, and the right position depends on task risk and user expertise.

[Figure: the agent autonomy spectrum — invisible (spam filters, auto-tagging) → assistive (autocomplete, suggested replies) → collaborative (Claude Code, Cursor chat) → autonomous (scheduled bots, Operator). User control is highest on the left; autonomy increases toward the right.]

Invisible agents operate behind the scenes — the user never sees a chat interface. Think email spam filtering or auto-categorization of support tickets. The UX is simply absence of noise.

Assistive agents suggest but never act. GitHub Copilot’s inline completions are assistive: you press Tab to accept. The cost of a bad suggestion is one keystroke to dismiss.

Collaborative agents negotiate with the user. They propose plans, ask clarifying questions, and execute steps with periodic check-ins. Claude Code operates here — it reads your codebase, proposes edits, and waits for approval on file writes.

Autonomous agents run end-to-end with minimal intervention. They are appropriate only when: (a) the blast radius of a mistake is small, (b) rollback is cheap, or (c) the user has explicitly granted blanket trust.

Key design question: Where on this spectrum does your first release sit? Almost always, start further left than you think. You can always grant more autonomy; revoking it feels like a downgrade.


140.3 — Streaming output: SSE for steps, progressive disclosure, artifact pattern

Agents are slow. A single task can involve 5-30 LLM calls, each taking 1-10 seconds. Without streaming, the user stares at a spinner for minutes. Streaming is not a nice-to-have — it is a core UX requirement.

Server-Sent Events (SSE) for step-level streaming

SSE is the simplest protocol for server-to-client push over HTTP. Each event carries a typed payload.

# agent_sse_server.py — FastAPI streaming endpoint
import json
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from typing import AsyncGenerator

app = FastAPI()

# Event types the client understands
EVENT_THINKING  = "thinking"
EVENT_TOOL_CALL = "tool_call"
EVENT_TOOL_RESULT = "tool_result"
EVENT_TEXT_DELTA = "text_delta"
EVENT_ARTIFACT  = "artifact"
EVENT_DONE      = "done"
EVENT_ERROR     = "error"
EVENT_APPROVAL  = "approval_required"

async def format_sse(event: str, data: dict) -> str:
    """Format a single SSE frame."""
    payload = json.dumps(data, ensure_ascii=False)
    return f"event: {event}\ndata: {payload}\n\n"

async def run_agent_stream(user_message: str) -> AsyncGenerator[str, None]:
    """Execute agent loop, yielding SSE frames at each step."""
    yield await format_sse(EVENT_THINKING, {
        "summary": "Planning approach..."
    })
    await asyncio.sleep(0.3)  # simulate LLM latency

    # Step 1: tool call
    yield await format_sse(EVENT_TOOL_CALL, {
        "tool": "search_codebase",
        "args": {"query": user_message},
        "step": 1,
    })
    await asyncio.sleep(1.2)

    yield await format_sse(EVENT_TOOL_RESULT, {
        "tool": "search_codebase",
        "result_summary": "Found 3 relevant files",
        "step": 1,
    })

    # Step 2: approval gate
    yield await format_sse(EVENT_APPROVAL, {
        "action": "edit_file",
        "file": "src/auth.py",
        "description": "Add rate-limiting decorator to login endpoint",
        "step": 2,
    })
    # Client must POST /approve/{run_id} to continue
    # ... remainder of loop omitted for brevity

    yield await format_sse(EVENT_DONE, {
        "total_steps": 4,
        "total_tokens": 12_340,
        "cost_usd": 0.037,
    })

@app.get("/agent/stream")
async def stream_agent(q: str):
    return StreamingResponse(
        run_agent_stream(q),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # disable nginx buffering
        },
    )

Progressive disclosure

Not every SSE event deserves screen real estate. Apply progressive disclosure:

  1. Default collapsed: tool calls show a one-line summary (“Searched codebase — 3 files found”).
  2. Expandable: clicking reveals arguments, raw results, latency.
  3. Pinned: approval gates and errors are always visible, never auto-collapsed.
  4. Artifacts: large outputs (code files, tables, diagrams) render in a dedicated pane, not inline in the chat scroll.

The artifact pattern — pioneered by Claude’s artifact panel — separates conversation from work product. The conversation is ephemeral context; the artifact is the durable deliverable. When your agent produces code, documents, or data, render them as first-class objects the user can copy, version, and fork.
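
A sketch of what an artifact event payload might carry so the client can treat the deliverable as a first-class object; the Artifact fields and the artifact_event helper are assumptions about one possible schema, not a fixed format.

# artifact_event.py: sketch of an artifact as a first-class, versionable object
from dataclasses import dataclass, asdict
from typing import Optional
import uuid

@dataclass
class Artifact:
    artifact_id: str
    kind: str                        # e.g. "code", "document", "table"
    title: str
    content: str
    version: int = 1
    parent_id: Optional[str] = None  # set when the user forks an artifact

def artifact_event(artifact: Artifact) -> dict:
    """Payload for an EVENT_ARTIFACT frame in the SSE stream above."""
    return asdict(artifact)

# A new version of the same deliverable reuses artifact_id; a fork gets a
# fresh id with parent_id pointing at the original.
draft = Artifact(artifact_id=str(uuid.uuid4()), kind="code",
                 title="auth.py (proposed)", content="# ...file contents...")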


140.4 — Human-in-the-loop contract: what needs approval, trust calibration

The human-in-the-loop (HITL) contract is an explicit agreement between the product and the user about which actions require approval. It is the single most important design artifact for an agent product.

Defining the contract

Every tool the agent can call falls into one of three tiers:

| Tier | Approval | Examples |
| --- | --- | --- |
| Read-only | Never | Search, list files, fetch docs |
| Reversible write | Configurable | Edit file (with undo), create branch, draft email |
| Irreversible / high-cost | Always | Send email, deploy to prod, delete resource, spend > $X |

# hitl_contract.py — declarative approval policy
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class ApprovalTier(Enum):
    NEVER = "never"           # read-only, always allowed
    CONFIGURABLE = "config"   # user sets per-tool or per-session
    ALWAYS = "always"         # mandatory confirmation

@dataclass
class ToolPolicy:
    name: str
    tier: ApprovalTier
    description: str
    cost_estimate_usd: Optional[float] = None
    blast_radius: str = "low"   # low | medium | high

@dataclass
class HITLContract:
    """The approval contract for one agent product."""
    policies: dict[str, ToolPolicy] = field(default_factory=dict)
    session_budget_usd: float = 1.00
    auto_approve_after_n: int = 0  # 0 = never auto-approve

    def requires_approval(self, tool_name: str, context: dict) -> bool:
        policy = self.policies.get(tool_name)
        if policy is None:
            return True  # unknown tools always require approval
        if policy.tier == ApprovalTier.NEVER:
            return False
        if policy.tier == ApprovalTier.ALWAYS:
            return True
        # CONFIGURABLE: check session-level overrides
        if context.get("blanket_approve", False):
            return False
        if (policy.cost_estimate_usd or 0) > context.get("remaining_budget", 0):
            return True  # over budget → force approval
        return context.get("require_approval_default", True)

# Example contract for a code-editing agent
code_agent_contract = HITLContract(
    policies={
        "read_file":       ToolPolicy("read_file", ApprovalTier.NEVER,
                                      "Read file contents"),
        "search_codebase": ToolPolicy("search_codebase", ApprovalTier.NEVER,
                                      "Grep / semantic search"),
        "edit_file":       ToolPolicy("edit_file", ApprovalTier.CONFIGURABLE,
                                      "Modify existing file", blast_radius="medium"),
        "create_file":     ToolPolicy("create_file", ApprovalTier.CONFIGURABLE,
                                      "Create new file", blast_radius="medium"),
        "run_command":     ToolPolicy("run_command", ApprovalTier.ALWAYS,
                                      "Execute shell command", blast_radius="high"),
        "git_push":        ToolPolicy("git_push", ApprovalTier.ALWAYS,
                                      "Push to remote", blast_radius="high"),
    },
    session_budget_usd=2.00,
)

Trust calibration

Users arrive with different priors about AI reliability. A senior engineer who has used Copilot for two years is comfortable auto-approving file edits. A first-time user is not. The product must calibrate trust rather than impose a single policy.

Three mechanisms:

  1. Onboarding wizard — ask the user to set their comfort level per tool category.
  2. Inline adjustment — after every approval, offer “Always allow this” / “Always ask.”
  3. Trust decay — if the agent makes an error that the user reverts, automatically tighten approval for that tool category for the next N interactions.
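
A minimal sketch of the third mechanism, trust decay, assuming a hypothetical per-category approval setting that the product tightens after a reverted action:

# trust_decay.py: sketch of tightening approval after a user revert
from dataclasses import dataclass

@dataclass
class CategoryTrust:
    auto_approve: bool = False
    forced_approvals_left: int = 0   # > 0 means "always ask", counting down

    def on_revert(self, penalty: int = 5):
        """User reverted an agent action: require approval for the next N calls."""
        self.auto_approve = False
        self.forced_approvals_left = max(self.forced_approvals_left, penalty)

    def on_approved_success(self):
        if self.forced_approvals_left > 0:
            self.forced_approvals_left -= 1

    def requires_approval(self) -> bool:
        return self.forced_approvals_left > 0 or not self.auto_approve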

140.5 — Progressive autonomy: start supervised, earn trust

Progressive autonomy is the principle that an agent should begin in a restricted mode and unlock capabilities as it demonstrates reliability — both within a session and across sessions.

# progressive_autonomy.py — trust scoring
from dataclasses import dataclass
import math

@dataclass
class TrustScore:
    """Per-user, per-tool trust accumulator."""
    successes: int = 0
    failures: int = 0
    reverts: int = 0

    @property
    def score(self) -> float:
        """Wilson score lower bound — conservative success rate."""
        n = self.successes + self.failures + self.reverts
        if n == 0:
            return 0.0
        p_hat = self.successes / n
        z = 1.96  # 95% confidence
        denominator = 1 + z**2 / n
        centre = p_hat + z**2 / (2 * n)
        spread = z * math.sqrt((p_hat * (1 - p_hat) + z**2 / (4 * n)) / n)
        return (centre - spread) / denominator

    @property
    def auto_approve(self) -> bool:
        """Auto-approve when trust is high and sample is sufficient."""
        return self.score > 0.85 and (self.successes + self.failures) >= 10

    def record(self, outcome: str):
        if outcome == "success":
            self.successes += 1
        elif outcome == "failure":
            self.failures += 1
        elif outcome == "revert":
            self.reverts += 2  # reverts weigh double

The Wilson score lower bound is the right statistic here because it is conservative with small sample sizes. The agent must earn its way past the approval gate through demonstrated competence, not simply through the passage of time.
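
A quick sanity check of that conservatism, using the TrustScore class above: a perfect three-for-three record still scores well below the 0.85 auto-approve threshold.

# Usage sketch: small samples stay below the auto-approve bar
ts = TrustScore()
for _ in range(3):
    ts.record("success")
print(round(ts.score, 2))   # about 0.44 with n=3, despite a 100% success rate
print(ts.auto_approve)      # False: not enough evidence yet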

Session-level progression

Within a single session, the pattern looks like this:

  1. First tool call → always show plan and ask approval.
  2. After 3 consecutive approved calls of the same type → offer “auto-approve for this session.”
  3. If user accepts → suppress approval for that tool type until session ends or an error occurs.
  4. On error → revoke auto-approve, return to step 1.
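
A minimal sketch of that four-step ladder as per-session, per-tool state; the class name and method shape are assumptions, only the three-consecutive-approvals rule comes from the list above.

# session_autonomy.py: sketch of the per-session auto-approve ladder
from dataclasses import dataclass

@dataclass
class SessionToolState:
    consecutive_approvals: int = 0
    auto_approve_offered: bool = False
    auto_approve_granted: bool = False

    def on_user_approved(self) -> bool:
        """Record an approval; return True when the UI should offer auto-approve."""
        self.consecutive_approvals += 1
        if self.consecutive_approvals >= 3 and not self.auto_approve_offered:
            self.auto_approve_offered = True
            return True
        return False

    def on_auto_approve_accepted(self):
        self.auto_approve_granted = True

    def on_error(self):
        """Any error revokes auto-approve and restarts the ladder (step 4)."""
        self.consecutive_approvals = 0
        self.auto_approve_offered = False
        self.auto_approve_granted = False

    def needs_approval(self) -> bool:
        return not self.auto_approve_granted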

140.6 — Error UX: “I tried X, didn’t work because Y, should I try Z?”

Agent errors are fundamentally different from traditional software errors. A conventional 500 error points to one failing request; an agent failure is a narrative — a sequence of reasoning steps that went wrong somewhere. The UX must communicate that narrative.

The error template

Every agent error message should follow this structure:

  What I tried: Edited src/auth.py to add rate limiting using the slowapi library.
  What went wrong: The test suite failed — test_login_rate_limit raised ImportError: No module named 'slowapi'.
  What I recommend: Install slowapi via pip install slowapi and re-run.
  Should I proceed?

This pattern — action → failure → proposed recovery — respects the user’s time by frontloading the diagnosis. Compare this to a raw traceback, which forces the user to do the diagnosis themselves.

# error_ux.py — structured agent error
from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentError:
    action_taken: str
    failure_reason: str
    proposed_recovery: str
    requires_approval: bool = True
    raw_output: Optional[str] = None   # collapsible in UI

    def to_user_message(self) -> str:
        msg = (
            f"**What I tried:** {self.action_taken}\n\n"
            f"**What went wrong:** {self.failure_reason}\n\n"
            f"**What I recommend:** {self.proposed_recovery}"
        )
        if self.requires_approval:
            msg += "\n\nShould I proceed with this recovery?"
        return msg

    def to_sse_event(self) -> dict:
        return {
            "type": "agent_error",
            "action": self.action_taken,
            "reason": self.failure_reason,
            "recovery": self.proposed_recovery,
            "needs_approval": self.requires_approval,
            "raw": self.raw_output,
        }

Anti-patterns to avoid

  • Silent retry loops. The agent retries 5 times, burns tokens, and the user sees only “Sorry, I couldn’t complete the task.” Always surface intermediate failures.
  • Blame-the-model phrasing. “The AI hallucinated” means nothing to a user. Say what went wrong in domain terms.
  • No escape hatch. Every error screen must have a “Stop and let me handle this” button. Never trap the user in a recovery loop.

140.7 — Cost transparency: cost/run, budget controls

LLM-powered agents have variable, per-invocation costs that can surprise users. A simple coding task might cost $0.02; a complex refactor with 40 tool calls might cost $2.00. Without transparency, users either under-use the product (fear of bills) or over-use it (surprise bills). Both are churn vectors.

What to show

  1. Per-session running total — displayed in the UI footer, updated after each LLM call.
  2. Per-step cost — in the expandable detail of each step.
  3. Projected cost — before executing a plan, estimate total cost and show it.
  4. Budget gate — when projected cost exceeds the user’s configured limit, pause and ask.
# cost_tracker.py — real-time cost accounting
from dataclasses import dataclass, field

# Approximate pricing per 1M tokens (update as models change)
PRICING = {
    "claude-sonnet": {"input": 3.00, "output": 15.00},
    "claude-opus":   {"input": 15.00, "output": 75.00},
    "gpt-4o":        {"input": 2.50, "output": 10.00},
}

@dataclass
class CostTracker:
    model: str
    budget_usd: float = 1.00
    steps: list = field(default_factory=list)

    @property
    def total_usd(self) -> float:
        return sum(s["cost"] for s in self.steps)

    @property
    def remaining_usd(self) -> float:
        return max(0.0, self.budget_usd - self.total_usd)

    def record_step(self, input_tokens: int, output_tokens: int, label: str):
        prices = PRICING.get(self.model, PRICING["claude-sonnet"])
        cost = (
            input_tokens * prices["input"] / 1_000_000
            + output_tokens * prices["output"] / 1_000_000
        )
        self.steps.append({"label": label, "cost": round(cost, 6),
                           "input_tokens": input_tokens,
                           "output_tokens": output_tokens})
        return cost

    def check_budget(self, projected_tokens: int) -> bool:
        """Return True if the projected next step fits within budget."""
        prices = PRICING.get(self.model, PRICING["claude-sonnet"])
        # Price all projected tokens at the output rate (the more expensive of
        # the two) so the estimate errs on the conservative side.
        projected_cost = projected_tokens * prices["output"] / 1_000_000
        return (self.total_usd + projected_cost) <= self.budget_usd

    def summary(self) -> dict:
        return {
            "total_cost_usd": round(self.total_usd, 4),
            "remaining_usd": round(self.remaining_usd, 4),
            "steps": len(self.steps),
            "total_input_tokens": sum(s["input_tokens"] for s in self.steps),
            "total_output_tokens": sum(s["output_tokens"] for s in self.steps),
        }
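
A usage sketch of the budget gate from the list above, wired to CostTracker; the token counts are made up for illustration.

# Usage sketch: gate an expensive step behind check_budget (numbers illustrative)
tracker = CostTracker(model="claude-sonnet", budget_usd=0.25)
tracker.record_step(input_tokens=4_000, output_tokens=1_200, label="plan")   # ~$0.03

projected_output_tokens = 20_000   # rough estimate for the next, larger step
if tracker.check_budget(projected_output_tokens):
    tracker.record_step(input_tokens=6_000, output_tokens=18_000, label="refactor")
else:
    # Projected cost would exceed the session budget: pause and ask (see 140.4)
    print("Budget gate hit: asking the user before continuing.")

print(tracker.summary())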

Design principle: cost transparency builds trust. Users who see costs are more willing to use the product, not less, because they feel in control. Hide costs and you create anxiety.


140.8 — Conversation design: system prompts, persona, clarifying questions

The agent’s personality is a product surface, not an implementation detail. Three axes to design:

System prompt architecture

A production system prompt has layers:

┌─────────────────────────────────────┐
│  Identity & constraints             │  ← Who are you? What can't you do?
├─────────────────────────────────────┤
│  Tool descriptions & policies       │  ← HITL contract, tool docs
├─────────────────────────────────────┤
│  User context (injected per-session)│  ← Preferences, history, project info
├─────────────────────────────────────┤
│  Task-specific instructions         │  ← Injected per-turn if needed
└─────────────────────────────────────┘
# system_prompt_builder.py
def build_system_prompt(
    identity: str,
    tools: list[dict],
    user_context: dict,
    task_instructions: str = "",
) -> str:
    sections = [
        f"# Identity\n{identity}",
        "# Available tools\n" + "\n".join(
            f"- **{t['name']}**: {t['description']} "
            f"[approval: {t['approval_tier']}]"
            for t in tools
        ),
        f"# User context\n"
        f"- Preferred language: {user_context.get('language', 'English')}\n"
        f"- Experience level: {user_context.get('level', 'intermediate')}\n"
        f"- Project: {user_context.get('project', 'unknown')}",
    ]
    if task_instructions:
        sections.append(f"# Task instructions\n{task_instructions}")
    # Critical: constraints go last for recency bias
    sections.append(
        "# Constraints\n"
        "- Always explain before acting.\n"
        "- Never run destructive commands without approval.\n"
        "- If uncertain, ask a clarifying question instead of guessing.\n"
        "- Show cost estimates before expensive operations."
    )
    return "\n\n".join(sections)

Clarifying questions

A good agent asks before acting when the request is ambiguous. But asking too many questions is equally bad — it signals lack of capability. The heuristic:

  • If ambiguity could lead to an irreversible wrong action → ask.
  • If ambiguity only affects style/preference → pick a reasonable default, state your assumption, proceed.
  • If the user has answered a similar question before → reuse that preference silently.
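
The heuristic fits in a small decision function; the inputs assume you can already classify the ambiguity, which is the hard part in practice.

# clarify_or_assume.py: sketch of the ask-vs-assume heuristic
from typing import Optional, Tuple

def resolve_ambiguity(
    affects_irreversible_action: bool,
    known_user_preference: Optional[str],
    reasonable_default: str,
) -> Tuple[str, str]:
    """Return (decision, message); decision is 'reuse', 'ask', or 'assume'."""
    if known_user_preference is not None:
        return "reuse", known_user_preference                      # apply the prior answer silently
    if affects_irreversible_action:
        return "ask", "Before I act, which option do you want?"    # wording illustrative
    # Style or preference only: state the assumption and keep going
    return "assume", f"Assuming {reasonable_default}; tell me if you'd prefer otherwise."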

Persona

Keep the persona utilitarian. Users of productivity agents want competence, not personality. Avoid filler phrases (“Great question!”), excessive hedging (“I’m just an AI…”), and performative enthusiasm. State what you will do, do it, report the result.


140.9 — Multi-turn: context, undo, conversation branching

Single-turn Q&A is simple. Multi-turn agent sessions introduce three hard problems.

Context management

Each turn adds tokens. A 20-turn coding session can easily exceed 100K tokens of context. Strategies:

  1. Sliding window with summaries — after N turns, summarize older turns into a compressed block (see the sketch after this list).
  2. Tool-result truncation — large tool outputs (file contents, search results) are summarized immediately; raw data is stored server-side and retrievable on demand.
  3. Explicit context reset — give users a “Start fresh” button that clears accumulated context without losing the conversation history (which remains viewable but not in the LLM window).
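
A minimal sketch of the first strategy; summarize_turns stands in for an LLM summarization call and is an assumption, not a real API.

# context_window.py: sketch of a sliding window with a summary block
from typing import Callable

def compact_history(
    messages: list[dict],
    keep_recent: int = 8,
    summarize_turns: Callable[[list[dict]], str] = lambda turns: "(summary unavailable)",
) -> list[dict]:
    """Replace all but the last keep_recent turns with a single summary message."""
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize_turns(old)   # in practice an LLM call that condenses old turns
    return [{"role": "user", "content": f"[Summary of earlier conversation]\n{summary}"}] + recent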

Undo

Every write action the agent performs should be undoable. Implementation varies by domain:

  • File edits: store diffs; apply reverse patch.
  • API calls: store request/response; call compensating API if available.
  • Irreversible actions: this is why they require approval — undo is “don’t do it.”
# undo_stack.py — reversible action log
from dataclasses import dataclass, field
from typing import Callable, Any

@dataclass
class ReversibleAction:
    description: str
    forward_result: Any
    reverse_fn: Callable[[], Any]
    step_index: int

@dataclass
class UndoStack:
    actions: list[ReversibleAction] = field(default_factory=list)

    def push(self, action: ReversibleAction):
        self.actions.append(action)

    def undo_last(self) -> str:
        if not self.actions:
            return "Nothing to undo."
        action = self.actions.pop()
        action.reverse_fn()
        return f"Undone: {action.description}"

    def undo_to(self, step_index: int) -> list[str]:
        """Undo all actions back to (and including) the given step."""
        undone = []
        while self.actions and self.actions[-1].step_index >= step_index:
            undone.append(self.undo_last())
        return undone

Conversation branching

Sometimes the user wants to say “go back to step 3 and try a different approach.” This is conversation branching — a tree, not a list. The UI shows a timeline; clicking any node forks the conversation from that point. The underlying implementation truncates the message list to the branch point and appends the new user message.
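
A minimal sketch of that fork operation, keeping the abandoned tail around so the timeline view can still show it; the message format is assumed.

# branching.py: sketch of forking the active branch at an earlier message
from dataclasses import dataclass, field

@dataclass
class BranchableConversation:
    messages: list[dict] = field(default_factory=list)              # active branch, in order
    abandoned_branches: list[list[dict]] = field(default_factory=list)

    def fork_at(self, index: int, new_user_message: str):
        """Keep messages[:index], archive the tail, and start the new branch."""
        self.abandoned_branches.append(self.messages[index:])
        self.messages = self.messages[:index]
        self.messages.append({"role": "user", "content": new_user_message})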


140.10 — Shipping incrementally: workflow → agent flexibility → measure → iterate

[Figure: the phased rollout — Phase 1: deterministic workflow → Phase 2: LLM-assisted steps → Phase 3: agent with HITL gates → Phase 4: progressive autonomy. Measure task completion rate, cost per task, reverts, and NPS throughout; each phase gate requires metric thresholds before advancing.]

The phased rollout is not just engineering caution — it is a data collection strategy. Each phase generates the usage data you need to justify the next.

Phase 1: Deterministic workflow. Hard-code every step. Measure completion rates and time-to-complete. This is your baseline.

Phase 2: LLM-assisted steps. Replace the hardest-to-enumerate steps with LLM calls (e.g., “classify this ticket” or “draft this response”). Keep the overall flow deterministic. Measure quality delta and cost.

Phase 3: Agent with HITL gates. Let the LLM choose which tools to call and in what order, but require approval at every write step. Measure autonomy-to-quality tradeoff.

Phase 4: Progressive autonomy. Use the trust-scoring system from Section 140.5 to auto-approve low-risk actions for high-trust users. Measure revert rates.

Phase gate criteria (example):

| Metric | Phase 2 → 3 | Phase 3 → 4 |
| --- | --- | --- |
| Task completion rate | > 85% | > 90% |
| User revert rate | < 15% | < 5% |
| Cost per task | < $0.50 | < $0.50 |
| NPS (agent feature) | > 30 | > 50 |

140.11 — Pricing models: per-task, per-step, per-token, subscription+caps

Pricing an agent product is harder than pricing a SaaS feature because cost is usage-correlated and variable. The four models:

Per-token passthrough — charge the user what the LLM charges you, plus a margin. Transparent but unpredictable for users. Works for developer tools where users understand tokens (e.g., API platforms).

Per-step — charge per tool invocation. More legible than per-token but creates a perverse incentive: the agent should minimize steps, which may conflict with quality. Rarely used in practice.

Per-task — charge a flat fee per completed task. User-friendly but requires you to absorb variance. Works when tasks are well-defined (e.g., “generate one blog post” = $0.10). Fails when task scope varies wildly.

Subscription + caps — monthly fee with a usage budget (e.g., “$20/month, includes 500 agent runs, $0.04/run after that”). This is the dominant model for consumer and prosumer products. The cap must be generous enough that 80%+ of users never hit it — the cap is a safety valve, not a revenue driver.

# pricing_engine.py — multi-model pricing
from dataclasses import dataclass
from enum import Enum

class PricingModel(Enum):
    PER_TOKEN = "per_token"
    PER_TASK = "per_task"
    SUBSCRIPTION = "subscription"

@dataclass
class PricingConfig:
    model: PricingModel
    # Per-token
    markup_pct: float = 0.30        # 30% margin on LLM cost
    # Per-task
    flat_rate_usd: float = 0.05
    # Subscription
    monthly_usd: float = 20.00
    included_runs: int = 500
    overage_per_run: float = 0.04

def compute_charge(
    config: PricingConfig,
    llm_cost_usd: float,
    runs_used_this_month: int,
) -> float:
    if config.model == PricingModel.PER_TOKEN:
        return llm_cost_usd * (1 + config.markup_pct)
    elif config.model == PricingModel.PER_TASK:
        return config.flat_rate_usd
    elif config.model == PricingModel.SUBSCRIPTION:
        if runs_used_this_month < config.included_runs:
            return 0.0  # covered by subscription
        return config.overage_per_run
    return 0.0
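
A quick comparison of what the same run would cost the user under each model; the numbers are illustrative.

# Usage sketch: one run priced under each model (illustrative numbers)
run_llm_cost = 0.12          # what the run cost us in LLM fees
runs_this_month = 520        # the user is 20 runs past the included 500

for cfg in (
    PricingConfig(PricingModel.PER_TOKEN),
    PricingConfig(PricingModel.PER_TASK),
    PricingConfig(PricingModel.SUBSCRIPTION),
):
    charge = compute_charge(cfg, llm_cost_usd=run_llm_cost,
                            runs_used_this_month=runs_this_month)
    print(f"{cfg.model.value:>12}: ${charge:.3f}")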

Practical guidance: most teams should start with subscription + generous caps. It aligns incentives (you want users to use the product), smooths revenue, and avoids the “meter anxiety” that per-token pricing creates.


140.12 — Case studies: Claude Code, ChatGPT plugins → GPTs → Operator, Cursor, Linear/Notion AI

Claude Code

Claude Code is a collaborative agent (Section 140.2) for software engineering. Key design decisions:

  • HITL contract: Read-only tools (search, read file) are auto-approved. File writes show a diff and require confirmation. Shell commands always require approval. This tiered model matches developer expectations — reading is safe, writing needs review.
  • Streaming: Step-by-step output with inline diffs. The user sees what the agent is thinking before it acts.
  • Progressive autonomy: Users can grant blanket approval for file edits within a session, escalating to autonomous mode for trusted operations.
  • Cost transparency: Token counts and cost visible per session.
  • Lesson: Developers tolerate agent latency if they can see progress. The streaming UX is what makes a 30-second tool call feel manageable.

ChatGPT: plugins → GPTs → Operator

OpenAI’s agent product evolved through three distinct phases, each illustrating a different point on the UX spectrum:

  • Plugins (2023): Third-party tools called by ChatGPT. Assistive mode — the model decided when to call a plugin, but the user had to enable them. Failed because: discovery was poor, quality was uneven, and users didn’t understand what plugins could do.
  • GPTs (2024): Custom ChatGPT configurations with specific tools and personas. Still assistive/collaborative. Succeeded better because the creator defined the tool set, reducing user confusion.
  • Operator (2025): Autonomous browser agent. The highest-autonomy product in the lineup. Requires explicit task delegation (“Book me a flight”) and shows a live browser view so the user can monitor. Key innovation: the live viewport — the user sees exactly what the agent sees, restoring a sense of control even in autonomous mode.
  • Lesson: The market was not ready for plugins (too much user burden), was ready for GPTs (curated experiences), and is cautiously ready for Operator (high autonomy, high visibility).

Cursor

Cursor is a code editor with deeply integrated LLM assistance. It occupies the assistive-to-collaborative zone:

  • Tab completion is invisible/assistive — accept or dismiss with a keystroke.
  • Cmd+K inline edits are collaborative — the user describes what to change, sees a diff, accepts or rejects.
  • Chat panel is fully collaborative — multi-turn conversation with codebase context.
  • Agent mode — plans and executes multi-file changes with approval gates.
  • Lesson: Offering multiple interaction modes at different autonomy levels within the same product lets users self-select their comfort zone. Power users use agent mode; cautious users use Tab completion. Both are happy.

Linear AI / Notion AI

These are examples of domain-embedded agents — AI capabilities woven into an existing product rather than presented as a standalone agent.

  • Linear AI: Auto-triages issues, suggests labels, drafts descriptions. Mostly invisible/assistive. The agent is not a separate mode — it is part of the existing workflow.
  • Notion AI: Document-level assistance — summarize, translate, brainstorm. Collaborative within a single document. No multi-step tool use; the “agent” is scoped to text transformation.
  • Lesson: Not every AI feature needs to be an agent. Embedding narrow LLM capabilities into existing workflows (invisible/assistive) often delivers more value per engineering-hour than building a general-purpose agent.

140.13 — Mental model

Eight principles for designing agent products:

  1. Start with the workflow, not the agent. Build the deterministic version first. Graduate to agent mode only for steps that genuinely require interpretation or flexible planning.

  2. Place yourself on the autonomy spectrum deliberately. Invisible, assistive, collaborative, and autonomous are all valid — but choosing the wrong level for your user’s trust and your agent’s reliability is the most common failure mode.

  3. Stream everything. Agents are slow. SSE with progressive disclosure turns “waiting” into “watching.” The user’s perception of speed is determined by information flow, not wall-clock time.

  4. Make the HITL contract explicit. Every tool call is read-only, reversible-write, or irreversible. The approval policy for each tier should be documented, configurable, and visible to the user.

  5. Earn autonomy through demonstrated reliability. Progressive autonomy, backed by statistical trust scores, is safer and more user-friendly than a binary “supervised / unsupervised” switch.

  6. Errors are conversations, not crashes. “I tried X, it failed because Y, should I try Z?” respects the user’s time and maintains the collaborative frame.

  7. Show the meter. Cost per session, cost per step, remaining budget. Transparency builds trust; hidden costs build churn.

  8. Price for the 80th percentile. Subscription + caps works because most users stay under the cap. The cap protects you from outliers; the subscription smooths revenue. Per-token pricing is for APIs, not products.


Read it yourself

  • Anthropic, Building effective agents — architecture patterns and when to use each one.
  • Lilian Weng, LLM Powered Autonomous Agents (2023) — comprehensive survey of planning, memory, and tool-use patterns.
  • Nielsen Norman Group, AI UX Guidelines — interaction design principles for AI-powered interfaces.
  • OpenAI, Operator system card (2025) — transparency report on autonomous agent safety decisions.
  • Karpathy, State of GPT (2023) — practical mental models for LLM product design.
  • Designing Data-Intensive Applications (Kleppmann, 2017) — Chapter 12 on “The Future of Data Systems” is relevant to agent-as-workflow-orchestrator thinking.

Practice

  1. Classify five AI products you use daily on the autonomy spectrum (invisible → assistive → collaborative → autonomous). For each, argue whether it should move left or right on the spectrum, and what would need to change.

  2. Design a HITL contract for an agent that manages a user’s calendar: booking meetings, rescheduling conflicts, declining invitations. Define tool tiers, approval policies, and at least one trust-calibration mechanism.

  3. Implement a streaming agent UI using the SSE pattern from Section 140.3. Build a minimal frontend (HTML + JavaScript EventSource) that renders thinking, tool_call, tool_result, and approval_required events with progressive disclosure (collapsed by default, expandable on click).

  4. Add budget controls to the CostTracker class: implement a project_cost(plan: list[dict]) method that estimates total cost before execution and returns a boolean indicating whether the plan fits within the remaining budget.

  5. Build a trust-scoring dashboard. Extend the TrustScore class to track per-tool, per-user scores over time. Write a function that generates a summary report: which tools have earned auto-approve status, which have been demoted, and what the overall agent reliability score is.

  6. Design a conversation branching data structure. Implement a tree-based conversation history where each node contains a message and its children represent branches. Write methods for: (a) forking at a given node, (b) serializing the active branch for the LLM context window, and (c) listing all branches with their lengths.

  7. Stretch: Build an end-to-end agent product prototype that combines SSE streaming, a three-tier HITL contract, progressive autonomy with Wilson-score trust, cost tracking with budget gates, and the error UX template. The agent should perform a simple multi-step task (e.g., search a codebase, propose an edit, apply it after approval). Deploy it locally and run three sessions, recording the trust score progression and cost per session. Write a one-page analysis of where on the autonomy spectrum your prototype sits and what metric thresholds you would need to advance it to the next phase.