Sync, async, SSE, WebSocket, batch: five execution modes
"The protocol is the product. Pick wrong and no amount of latency tuning will save you."
Every request to an ML platform runs in one of five execution modes. Which one you pick is a contract with the client and a constraint on the server. Pick sync when the work should have been async; your tail latency ruins SLAs and your connections pile up in front of the load balancer. Pick SSE when the client expected WebSocket; you lose bidirectional control. Pick batch when the client needed a dashboard number; you ship a 12-hour pipeline for a 200ms query. The mode selection is the first engineering decision on any new endpoint and it is usually made badly.
After this chapter the five modes are second nature — when each is correct, what the wire format looks like, where backpressure lives in each, and how to reason about the trade between simplicity (sync, one request-response) and capability (async, batch, streaming). This is also the chapter that explains why every serious LLM product ends up with SSE for token streaming even when WebSocket would be “cleaner” on paper.
Outline:
- The five modes and the decision tree.
- Mode 1: Synchronous request-response.
- Mode 2: Asynchronous with poll or callback.
- Mode 3: Server-Sent Events (SSE).
- Mode 4: WebSocket.
- Mode 5: Batch.
- Streaming LLM responses via SSE in depth.
- Where backpressure lives in each mode.
- Pros and cons table.
- The mental model.
79.1 The five modes and the decision tree
There are really only five ways a client can talk to a server for work that takes non-trivial time.
| # | Mode | Shape | Latency budget | State on server |
|---|---|---|---|---|
| 1 | Sync | request → response | < 30 s, usually much less | none beyond the request |
| 2 | Async | submit → job id → poll / callback | seconds to hours | operation record + work queue |
| 3 | SSE | request → open stream → server pushes events → server closes | seconds to minutes | one live connection per client |
| 4 | WebSocket | open socket → bidirectional frames | seconds to hours | one live connection, protocol state |
| 5 | Batch | file submit → hours later output file | minutes to days | input/output files, job state |
The decision tree for picking one:
is the work interactive (user waiting in the UI)?
├── yes:
│ is the work < 1 second with 99% reliability?
│ ├── yes → SYNC
│ └── no:
│ does the response arrive incrementally (tokens, partial results)?
│ ├── yes → SSE (or WebSocket if bidirectional)
│ └── no → ASYNC with polling
└── no:
is the work cheap per-item and amortizes well over many items?
├── yes → BATCH
└── no → ASYNC with callback
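The tree transcribes directly into code. A hypothetical helper (not any library's API) that answers the same questions in the same order:

```python
from enum import Enum

class Mode(Enum):
    SYNC = "sync"
    ASYNC_POLL = "async+poll"
    ASYNC_CALLBACK = "async+callback"
    SSE = "sse"
    WEBSOCKET = "websocket"
    BATCH = "batch"

def pick_mode(interactive: bool, fast_and_reliable: bool = False,
              incremental: bool = False, bidirectional: bool = False,
              amortizes_over_items: bool = False) -> Mode:
    """Direct transcription of the decision tree above."""
    if interactive:
        if fast_and_reliable:            # < 1 s with 99% reliability
            return Mode.SYNC
        if incremental:                  # tokens, partial results
            return Mode.WEBSOCKET if bidirectional else Mode.SSE
        return Mode.ASYNC_POLL
    if amortizes_over_items:             # cheap per-item, high volume
        return Mode.BATCH
    return Mode.ASYNC_CALLBACK

# LLM chat completion: interactive, slow, token-by-token, one-way
assert pick_mode(interactive=True, incremental=True) is Mode.SSE
# Nightly re-embedding of a catalog: offline, amortizes well
assert pick_mode(interactive=False, amortizes_over_items=True) is Mode.BATCH
```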
Most senior engineers violate this tree at least twice a week. The most common violation is picking sync for an LLM completion because sync is “simpler,” which is true until the p99 crosses 30 seconds and the proxy starts dropping connections. The second most common is picking WebSocket for token streaming because it sounds more modern, which is true until the corporate proxy in front of the client strips the upgrade header and nothing works.
79.2 Mode 1: Synchronous request-response
Sync is the default for good reason: it is what HTTP was born for. One TCP connection, one request, one response, one status code, one body. The client blocks until the server finishes. The server blocks a worker thread (or holds an async task) until it can reply. There is no job, no state, no polling. On failure, the client retries the whole thing.
Sync is correct when the work is short and bounded. “Short” means below a few seconds reliably. “Bounded” means you can state an upper bound that is still below a few seconds. Embedding a single 512-token passage on a warm TEI replica is sync. Classifying a safety moderation input is sync. A structured-output call against a small model with a 200-token budget is sync. A search-and-rerank over a 10k-candidate index is sync.
The sharp edges on sync mode are all about timeouts and connection lifetime. Every layer between client and server has its own idle timeout — the browser (~300 s for fetch, 30–60 s for some frameworks), the CDN (typically 60 s default), the load balancer (60 s AWS ALB default, 30 s GCP, 60 s nginx), the API gateway, the reverse proxy, the service mesh. The request can only be as long as the shortest timeout in the chain. If any of them cuts the connection mid-flight the client sees a 502 / 504 and the server sees a dead socket.
For LLM workloads this matters enormously. A 32k-token prompt at 50 tokens/sec is 640 seconds of decode. There is no reasonable HTTP stack that will hold a sync connection that long. You either stream (SSE) or go async. The naive team writes the first version of the chat endpoint as sync, deploys it, and discovers three days later that every prompt over 2k tokens is cutting off at exactly 60 seconds because that is the default ALB idle timeout. This bug gets rediscovered in every company.
The wire format for sync is trivial: status line, headers, body. The body is usually JSON. There is no ambiguity about who owns the lifecycle — it is the request.
POST /v1/embeddings HTTP/1.1
Content-Type: application/json
{"input": "the cat sat on the mat", "model": "bge-large-en-v1.5"}
HTTP/1.1 200 OK
Content-Type: application/json
{"data": [{"embedding": [0.123, -0.456, ...]}], "usage": {"tokens": 7}}
Sync backpressure lives at the server’s accept queue. If the server is saturated, new connections wait in the TCP accept queue until the server can pick them up; past the queue’s bound the kernel starts rejecting SYNs and the client sees connection refused. The practical lever is the worker concurrency setting (or the async task limiter). Every sync server needs a max-concurrent-requests bound enforced at admission (see Chapter 77). Without it, one slow upstream pins every worker, queued requests time out, clients retry, and the retry storm turns a slowdown into an outage.
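That admission bound can be sketched in a few lines, assuming an asyncio server where each request runs as a task (the names, limits, and status mapping here are illustrative, not a framework API):

```python
import asyncio

async def handle_with_admission(slots: asyncio.Semaphore, work):
    """Reject at admission instead of queueing unboundedly: 503 means retry later."""
    try:
        # Wait only a few milliseconds for a slot; do not queue behind slow work.
        await asyncio.wait_for(slots.acquire(), timeout=0.005)
    except asyncio.TimeoutError:
        return 503, {"error": "overloaded, retry with backoff"}
    try:
        return 200, await work()
    finally:
        slots.release()

async def demo():
    slots = asyncio.Semaphore(1)          # max-concurrent-requests = 1 for the demo

    async def slow():
        await asyncio.sleep(0.05)
        return {"ok": True}

    first = asyncio.create_task(handle_with_admission(slots, slow))
    await asyncio.sleep(0.01)             # let the first request take the slot
    second = await handle_with_admission(slots, slow)  # rejected at admission
    return (await first)[0], second[0]

assert asyncio.run(demo()) == (200, 503)
```

The point of the tiny acquire timeout is that rejection must be cheap: a saturated server that answers 503 in a millisecond is recoverable; one that queues requests until every layer above it times out is not.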
79.3 Mode 2: Asynchronous with poll or callback
Async mode splits one logical operation into three physical calls: submit, check, result. The client POSTs the work, the server enqueues it and returns a job id immediately (HTTP 202 Accepted), the client then polls GET /jobs/{id} or waits for a callback to a registered URL.
This mode is the correct answer whenever work takes longer than the shortest timeout in the stack and the client does not need incremental output. Fine-tuning a model, running a large batch inference, transcribing a 2-hour audio file, reindexing a collection, running a deep research pipeline — all async. The test is simple: if the work might take more than a minute, and the user can accept “come back later,” it is async.
The wire format has three calls:
POST /v1/transcriptions HTTP/1.1
Content-Type: multipart/form-data
...
HTTP/1.1 202 Accepted
Location: /v1/transcriptions/op_01HX8...
{"id": "op_01HX8...", "status": "PENDING"}
GET /v1/transcriptions/op_01HX8... HTTP/1.1
HTTP/1.1 200 OK
{"id": "op_01HX8...", "status": "RUNNING", "progress": 0.42}
GET /v1/transcriptions/op_01HX8... HTTP/1.1
HTTP/1.1 200 OK
{"id": "op_01HX8...", "status": "SUCCEEDED", "result": {...}}
This is the Google Long-Running Operations model, which Chapter 82 goes deep on. It is also the AWS Step Functions execution model, the Kubernetes Job model, and roughly every cloud API that does real work. The key insight is that the “operation” is its own first-class thing — an entity with an id, a state machine, and a durable lifecycle that outlives any single HTTP connection.
Polling vs callback is the second sub-choice. Polling is universal and trivial on the client (it is just a loop). The downside is wasted requests: if you poll every second for a 10-minute job, that is 600 unnecessary calls. The mitigation is exponential backoff (1 s, 2 s, 4 s, 8 s, capped at 30 s) plus a long-poll variant where the server holds the connection open for up to N seconds and returns when state changes. Long-poll cuts request count dramatically at the cost of holding more server connections.
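A sketch of that capped-exponential poll loop (1 s, 2 s, 4 s, ..., capped at 30 s); fetch_status stands in for the GET /jobs/{id} call, and sleep is injectable so the demo runs instantly:

```python
import itertools
import time

def backoff_delays(base: float = 1.0, cap: float = 30.0):
    """Yield 1, 2, 4, 8, 16, 30, 30, ... seconds."""
    for attempt in itertools.count():
        yield min(base * 2 ** attempt, cap)

def poll_until_done(fetch_status, sleep=time.sleep, max_polls=1000):
    """fetch_status() -> dict with a 'status' key; terminal states end the loop."""
    for delay, _ in zip(backoff_delays(), range(max_polls)):
        op = fetch_status()
        if op["status"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return op
        sleep(delay)
    raise TimeoutError("operation did not reach a terminal state")

# Simulated job: RUNNING twice, then SUCCEEDED. sleep is stubbed out.
states = iter([{"status": "RUNNING"}, {"status": "RUNNING"},
               {"status": "SUCCEEDED", "result": 42}])
done = poll_until_done(lambda: next(states), sleep=lambda s: None)
assert done["result"] == 42
```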
Callbacks (webhooks) flip the control: the client registers a URL and the server POSTs to it when the job completes. Callbacks eliminate polling overhead entirely but require the client to be reachable — which rules out browser clients, most mobile apps, and anything behind a NAT. They also add the full “does the client have a reliable webhook receiver” problem: retries, idempotency (Chapter 78), signature verification. Most async APIs support both and let the caller pick.
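Signature verification on the receiver side is typically an HMAC of the raw request body under a shared secret. A minimal sketch; the header name and exact scheme vary by provider, so treat the details as illustrative:

```python
import hashlib
import hmac

def sign_webhook(secret: bytes, body: bytes) -> str:
    """What the sender attaches, e.g. in an X-Signature header (name illustrative)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify_webhook(secret: bytes, body: bytes, signature: str) -> bool:
    """Constant-time comparison: never compare signatures with ==."""
    expected = sign_webhook(secret, body)
    return hmac.compare_digest(expected, signature)

secret = b"shared-secret"
body = b'{"id": "op_123", "status": "SUCCEEDED"}'
sig = sign_webhook(secret, body)
assert verify_webhook(secret, body, sig)
assert not verify_webhook(secret, b'{"tampered": true}', sig)
```

Verify against the raw bytes as received, before any JSON parsing: re-serializing the parsed body almost never reproduces the exact bytes the sender signed.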
Backpressure in async lives in the work queue. The enqueue call is fast and cheap; the work happens later, drained by workers. If the queue grows without bound, you reject new submissions at admission (HTTP 429 / 503) or the queue depth turns into latency and eventually into TTL-expired jobs.
79.4 Mode 3: Server-Sent Events (SSE)
SSE is a standard HTTP/1.1 streaming response where the server keeps the connection open and writes one text event at a time, each prefixed with data: and terminated by a double newline. The response never closes until the server is done. The client parses events as they arrive. It is one-way — server-to-client only — and lives entirely within normal HTTP semantics, which is why it punches through corporate firewalls that block WebSockets.
The wire format:
POST /v1/chat/completions HTTP/1.1
Content-Type: application/json
Accept: text/event-stream
{"model": "gpt-4o-mini", "stream": true, "messages": [...]}
HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
data: {"choices":[{"delta":{"content":"The "}}]}
data: {"choices":[{"delta":{"content":"cat "}}]}
data: {"choices":[{"delta":{"content":"sat"}}]}
data: [DONE]
That is the OpenAI streaming format and by now it is the de facto standard for every LLM token stream. The data: prefix, the double newline between events, the final [DONE] sentinel — all of it traces back to the SSE spec and OpenAI’s choice to use it.
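Parsing that stream on the client side is a few lines: split on blank lines, strip the data: prefix, stop at [DONE]. A minimal sketch (a full client would also handle event:, id:, and multi-line data fields from the SSE spec):

```python
import json

def parse_sse(raw: str):
    """Yield decoded JSON payloads from a text/event-stream body."""
    for block in raw.split("\n\n"):
        for line in block.splitlines():
            if line.startswith(":"):          # comment / heartbeat: ignore
                continue
            if line.startswith("data: "):
                payload = line[len("data: "):]
                if payload == "[DONE]":       # OpenAI-style end sentinel
                    return
                yield json.loads(payload)

stream = (
    'data: {"choices":[{"delta":{"content":"The "}}]}\n\n'
    'data: {"choices":[{"delta":{"content":"cat"}}]}\n\n'
    "data: [DONE]\n\n"
)
chunks = list(parse_sse(stream))
assert [c["choices"][0]["delta"]["content"] for c in chunks] == ["The ", "cat"]
```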
SSE is correct for any response that arrives incrementally and flows one way. LLM token streaming is the obvious case. Streaming search results as the ranker scores them is another. Progress updates during a long workflow (each stage emits an event) is a third. Logs pushed to a browser tab is a fourth. Anything one-way and incremental.
The virtues of SSE over WebSocket for these cases:
- It is plain HTTP. No upgrade handshake, no subprotocol negotiation.
- It works through every HTTP proxy, gateway, and CDN without special configuration (other than disabling response buffering).
- It reuses your existing auth, rate limiting, observability, and tracing. An SSE stream is just a POST that stays open.
- It auto-reconnects in the browser via the EventSource API if the connection drops.
The sharp edges:
- Buffering. If any proxy between client and server buffers the response, events arrive in a lump at the end and the stream looks frozen. The fix is X-Accel-Buffering: no (nginx), disabling CloudFront buffering, using the text/event-stream content type, and flushing after every event on the server side.
- Connection holding. Every in-flight SSE stream pins a server worker or async task. If your chat endpoint gets 10k concurrent users each streaming for 30 seconds, you need 10k concurrent worker slots.
- One-way only. If the client needs to cancel, it closes the TCP connection and the server notices (the write fails). That is the only client-to-server signal.
Backpressure in SSE is subtle. The TCP socket’s send buffer fills up if the client reads slowly, which eventually blocks the server’s write call. The server should honor this — if the write blocks, the event generator should pause producing tokens. In practice, with an LLM, token generation is slow enough that this almost never triggers.
79.5 Mode 4: WebSocket
WebSocket is a bidirectional, full-duplex channel over a single TCP connection, negotiated via an HTTP/1.1 upgrade. Once upgraded, both sides send frames (text or binary) at will. The protocol is message-oriented, not stream-oriented — each frame has a clear boundary.
WebSocket is the correct answer when you need bidirectional streaming — the client and server both push messages asynchronously. Voice chat where the client sends audio chunks and the server sends back partial transcripts and partial TTS. A live collaborative editor. A stateful agent that pushes thinking steps and accepts user interruptions mid-flight. A realtime multiplayer dashboard. Anything where the client’s message is not just “hello, begin.”
Wire format, simplified:
client → GET /ws HTTP/1.1
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: ...
Sec-WebSocket-Version: 13
server → HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: ...
[from here, frames flow in both directions until either side sends a CLOSE frame]
The sharp edges:
- Proxies and firewalls. Many corporate networks, CDNs, and even some cloud load balancers don’t forward the upgrade properly or silently strip it. The first thing you debug in production WebSocket deployments is “why does it work on my laptop and not on our customer’s VPN.”
- Auth is awkward. You cannot send a bearer token in a header on the upgrade easily from a browser — the browser’s WebSocket API does not let you set arbitrary headers. Solutions: put the token in a subprotocol, put it in a cookie, put it in a query string (worst — it ends up in logs), or do a two-step auth where the client first gets a short-lived ticket from a normal HTTP endpoint and uses that as the subprotocol.
- Stateful servers. Every WebSocket pins a server instance — if the server restarts, every connection dies. This makes blue/green deploys painful; you have to bleed connections over time.
- Scaling is harder. Sticky sessions at the load balancer are mandatory, and horizontal scaling means the “which instance owns this user” problem that sync HTTP happily avoids.
Backpressure in WebSocket is more explicit than SSE. Each direction has its own send buffer; if the receiver is slow, the sender’s buffer fills and send either blocks, returns an error, or drops. Most libraries expose a “bufferedAmount” property that lets you check before sending another frame.
When to prefer WebSocket over SSE: only when you genuinely need the client-to-server channel to be live. For LLM token streaming where the client just waits and consumes, SSE is strictly simpler and more robust. For a voice agent where the client sends audio mid-generation, WebSocket is mandatory.
79.6 Mode 5: Batch
Batch mode is: the client uploads a file of inputs, the server processes them offline on a cheap throughput-optimized pipeline, and the client downloads an output file hours later. The OpenAI Batch API is the prototypical shape — you submit a JSONL file of chat completions with a 24-hour SLA and a 50% discount. Anthropic has an equivalent. Google’s Vertex batch prediction has another. Every major ML platform offers batch.
The economics are the whole point. Batch mode trades latency for throughput. The provider can schedule the work during low-demand periods, pack it into high-throughput configurations (larger batch sizes, more aggressive continuous batching, lower priority scheduling behind interactive traffic), and charge less because the SLA is looser. For the client, if you have 10 million documents to embed and you do not need the answer today, batch is usually 30–70% cheaper than sync.
Batch is correct when: the work is high-volume, not interactive, and can accept hours of latency. Nightly ETL enrichment (“classify every support ticket from yesterday”). Backfill (“re-embed the entire product catalog with the new model”). Scheduled reports (“generate the weekly executive summary”). Bulk evaluation (“run the new model on our 100k-item regression set”).
The shape of the API is async-with-files:
POST /v1/files → {file_id}
POST /v1/batches {input_file_id, endpoint, ...} → {batch_id, status: "validating"}
GET /v1/batches/{batch_id} → {status: "in_progress", ...}
GET /v1/batches/{batch_id} → {status: "completed", output_file_id, error_file_id}
GET /v1/files/{output_file_id}/content → JSONL results
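The input file itself is one JSON request per line, each carrying a caller-chosen custom_id so that results, which may return out of order, can be joined back to their inputs. A sketch of building one in the OpenAI Batch API's request shape:

```python
import json

def build_batch_jsonl(prompts: dict, model: str) -> str:
    """prompts maps custom_id -> user prompt; returns JSONL text for upload."""
    lines = []
    for custom_id, prompt in prompts.items():
        lines.append(json.dumps({
            "custom_id": custom_id,            # join key for out-of-order results
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }))
    return "\n".join(lines) + "\n"

jsonl = build_batch_jsonl({"doc-1": "Summarize ticket 1",
                           "doc-2": "Summarize ticket 2"}, "gpt-4o-mini")
rows = [json.loads(line) for line in jsonl.strip().splitlines()]
assert rows[0]["custom_id"] == "doc-1"
assert rows[1]["body"]["model"] == "gpt-4o-mini"
```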
Internally, a batch is an async operation (Mode 2) built on top of a very-high-throughput executor that is decoupled from the interactive serving fleet. The queue is a literal queue; the workers are batch-optimized replicas; the scheduler interleaves batch workloads with spare capacity on the interactive fleet or uses separate spot-instance pools.
Backpressure in batch lives in the scheduler and the SLA. If the backlog grows, the scheduler spills into slower pools or pushes back the completion estimate. The caller sees this as “your batch will complete in 14 hours instead of 4” but not as a rejection.
79.7 Streaming LLM responses via SSE in depth
LLM token streaming is the canonical SSE use case and it deserves its own section because it has become a protocol convention more than an API.
The server architecture:
sequenceDiagram
participant Client
participant Gateway
participant vLLM
Client->>Gateway: POST /v1/chat/completions {stream:true}
Gateway->>vLLM: forward (disable buffering)
loop per generated token
vLLM-->>Gateway: token bytes
Gateway-->>Client: data: {"delta":{"content":"The "}}
end
vLLM-->>Gateway: finish_reason: stop
Gateway-->>Client: data: [DONE]
Each token becomes one SSE event flushed immediately; buffering anywhere in the chain destroys the streaming UX.
The inference server generates one token at a time (Chapter 22). As each token is produced, the server serializes a small JSON chunk and writes it to the open HTTP response with a data: prefix. The gateway — if there is one — forwards bytes through without buffering. The client reads the stream, parses each event, and appends the content delta to its UI.
The event schema, modeled on OpenAI Chat Completions:
{"id": "chatcmpl-01HX...", "object": "chat.completion.chunk",
"created": 1712345678, "model": "llama-3-70b-instruct",
"choices": [{"index": 0, "delta": {"content": "The "}, "finish_reason": null}]}
The last event before [DONE] has finish_reason: "stop" (or "length" or "tool_calls" or "content_filter"). This tells the client why generation ended.
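On the client side the deltas are concatenated until a chunk carries a finish_reason. A minimal accumulator sketch:

```python
def accumulate(chunks):
    """Fold chat.completion.chunk events into (full_text, finish_reason)."""
    text, finish = [], None
    for chunk in chunks:
        choice = chunk["choices"][0]
        content = choice["delta"].get("content")
        if content:
            text.append(content)              # append the delta to the running output
        if choice.get("finish_reason"):
            finish = choice["finish_reason"]  # why generation ended
    return "".join(text), finish

chunks = [
    {"choices": [{"delta": {"content": "The "}, "finish_reason": None}]},
    {"choices": [{"delta": {"content": "cat"}, "finish_reason": None}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]
assert accumulate(chunks) == ("The cat", "stop")
```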
A few implementation notes every senior engineer should know:
Flushing. Every event must reach the socket as soon as it is produced, and many stacks buffer by default: a WSGI server, a compression middleware, anything that wraps the response body. For a 5-token/sec output stream that buffering means the client sees nothing for seconds. The fix is framework-specific. Stream the body as a generator and emit one complete event per chunk (WSGI servers generally flush each yielded chunk); in Node, call res.write() per event and keep compression middleware off the route; in aiohttp, use await response.write(chunk), which waits for the transport to drain.
Cancellation. If the client closes the connection mid-stream (user hit stop), the server’s next write raises an exception. This needs to propagate back to the inference server so it kills the generation request and frees the KV cache (Chapter 22). Without cancellation propagation, the model keeps decoding 2000 more tokens for an output nobody will read — pure waste.
Heartbeats. If the server has nothing to send for a while (e.g., during a long tool call), send a comment line : heartbeat\n\n every 15 seconds. This keeps intermediate proxies from timing out the idle connection. SSE comments (lines starting with :) are ignored by EventSource so they are invisible to the client.
Gzip and compression. Don’t enable response compression on SSE. Most compressors buffer the response looking for a big enough block to compress, which defeats streaming. Content-Encoding: identity on the SSE route.
Retries are nuanced. SSE supports id: events and retry: fields so the client can reconnect and resume from a last-seen id. For LLM token streams this is almost never used — you can’t resume an LLM generation from the middle — but it’s part of the spec and sometimes people ask about it in interviews.
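The framing rules in these notes reduce to a few tiny helpers: one per event, one for the heartbeat comment, one for the sentinel. A sketch; a real server writes each of these byte strings to the open response and flushes immediately:

```python
import json

def sse_event(payload: dict) -> bytes:
    """One SSE event: a data: line with the JSON chunk, terminated by a blank line."""
    return b"data: " + json.dumps(payload).encode() + b"\n\n"

def sse_heartbeat() -> bytes:
    """SSE comment line: EventSource ignores it, but proxies see traffic."""
    return b": heartbeat\n\n"

def sse_done() -> bytes:
    """OpenAI-style end-of-stream sentinel."""
    return b"data: [DONE]\n\n"

assert sse_event({"delta": {"content": "Hi"}}) == b'data: {"delta": {"content": "Hi"}}\n\n'
assert sse_heartbeat().startswith(b":")
assert sse_done() == b"data: [DONE]\n\n"
```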
Why SSE and not WebSocket for this case: because token streaming is one-way. The client has nothing to say to the server after “go.” And SSE inherits all of the normal HTTP ecosystem — auth, rate limiting, observability, gateways, CDNs — for free. The only time LLM products use WebSocket instead is when the product is actually bidirectional: voice agents, interruptible streams, live collaborative sessions.
79.8 Where backpressure lives in each mode
Backpressure (Chapter 77) is the signal that flows “upstream” when a downstream component can’t keep up. In each mode it lives in a different place.
| Mode | Backpressure signal | Location |
|---|---|---|
| Sync | accept queue, 503/429 | server admission |
| Async (poll) | 429 on submit, queue depth metrics | work queue bound |
| Async (webhook) | 429 on submit, webhook delivery queue | both sides |
| SSE | TCP send buffer, slow writes | socket + worker slot |
| WebSocket | bufferedAmount, frame drop | each direction’s buffer |
| Batch | scheduler pushback, extended SLA | batch scheduler |
The practical implication: you cannot design a robust endpoint until you know where its backpressure lives. If sync has no admission limit, a slow downstream turns into worker exhaustion (a thread-pool explosion). If async has no queue bound, memory grows without limit. If SSE has no per-user concurrency cap, one abusive user can pin every worker slot. If WebSocket has no per-connection message rate limit, one bad client floods the server. If batch has no backlog bound, the scheduler ETAs become fiction.
In every mode, rate limiting (Chapter 76) goes at admission and flow control lives somewhere inside. Neither one is optional.
79.9 Pros and cons table
| Mode | Pros | Cons | Use when |
|---|---|---|---|
| Sync | Simplest. All tooling works. No state. | Timeout ceiling. Blocks workers. Bad tail latency. | Work is bounded and < a few seconds. |
| Async + poll | Universal. Survives any client. No connection state. | Polling overhead. Three calls instead of one. Operation SoR required. | Long jobs, any client, unknown duration. |
| Async + webhook | No polling. Clean for server-to-server. | Client must be reachable. Webhook reliability is its own problem. | B2B, service-to-service, predictable client. |
| SSE | Incremental output. HTTP-compatible. Works through proxies. | One-way. Holds worker. Needs flushing discipline. | LLM token streaming, progress events, live logs. |
| WebSocket | Bidirectional. Low overhead per frame. | Auth awkward. Sticky sessions. Proxy-hostile. | Voice agents, live collab, truly bidirectional UX. |
| Batch | Cheapest. Throughput-optimal. Flexible scheduling. | Not interactive. Needs file pipeline. | ETL, backfill, non-interactive bulk work. |
Most real platforms expose three: sync, async-with-polling, and SSE (for LLM streaming). Batch is a separate add-on for the bulk use case. WebSocket is added only for the narrow set of products that genuinely need bidirectional streaming. The shape of a well-designed ML platform API is usually these three or four primary modes, with everything else built on top.
79.10 The mental model
Eight points to take into Chapter 80:
- Sync is the default, but only for work bounded below a few seconds. Above that, your timeout ceiling eats you.
- Async + operation id is the correct shape for long work. Three calls — submit, poll, result — and the operation is a first-class entity (Chapter 82).
- SSE is the protocol standard for LLM token streaming. Plain HTTP, flushing, cancellation propagation, heartbeats, no gzip.
- WebSocket is for truly bidirectional work only. Voice, live collab, interruptible agents. Not for “streaming that sounds cooler.”
- Batch trades latency for cost. 24-hour SLA, 50% discount, separate executor pool. Correct for high-volume non-interactive work.
- Each mode has its own backpressure location. Know where it lives or the endpoint breaks under load.
- The wire format of each mode is not negotiable. SSE events have a specific shape. Async has 202 + Location. WebSocket has a specific upgrade handshake.
- Most platforms need sync + async + SSE. Everything else is an add-on. Design this axis deliberately.
In Chapter 80, workflow orchestration: how Temporal, Airflow, Step Functions, and Cadence power the async mode under the hood.
Read it yourself
- The HTML Living Standard, Server-Sent Events section (the canonical SSE spec).
- RFC 6455, The WebSocket Protocol.
- Google Cloud Long-Running Operations guide (google.longrunning.Operation) — the canonical async-with-operation API shape.
- The OpenAI API reference, specifically the stream: true chat completions format and the Batch API reference.
- Fielding’s REST dissertation, chapters 5 and 6, for why HTTP semantics matter and why sync is the default.
- The AWS ALB timeout documentation, for the real-world numbers on sync connection ceilings.
Practice
- A client wants to transcribe 3-hour audio files. Which mode and why? Draw the full sequence of HTTP calls.
- Implement a minimal SSE endpoint in Python that streams the numbers 1–10 with a 200ms delay between each. Verify the client sees them one at a time, not all at once. What is the minimum flushing discipline required?
- A team picks WebSocket for LLM token streaming because “it streams better.” Write two specific production failures this choice will cause.
- Design the state machine for an async image-generation operation: what states, what transitions, what is stored where? (Preview of Chapter 82.)
- Compute the backpressure failure for a naive sync endpoint with no admission limit: 100 workers, upstream latency goes from 100 ms to 10 s. How many in-flight requests pile up before everything breaks?
- For each of the five modes, name the default timeout or buffer limit that bites teams in production and how you’d detect it.
- Stretch: Build a minimal Temporal-free async job pipeline with a Redis queue, an operation store in Postgres, and a webhook-or-poll client API. Implement cancellation: how does a cancel request propagate to an in-flight worker?