Part VI · Distributed Systems & Request Lifecycle
Chapter 82

The operations service pattern: why job lifecycle is its own SoR

"The workflow engine runs the work. The operations service owns the story the client tells about the work. These are not the same thing, and conflating them is the mistake."

Every long-running API has an “operation” — a durable handle that the client polls, the UI displays, the audit log references, and the billing system attaches usage to. In the naive design, the operation is just a view over the workflow engine’s execution state. In every mature design, the operation is its own service of record, with its own database, its own state machine, and its own API, loosely coupled to the executor underneath. This split is not academic. It is load-bearing for any platform that wants to expose long-running work as a stable product contract.

This chapter lays out the operations service pattern: why it exists, what it stores, how it stays consistent with the underlying executor, and how Google Cloud LROs, AWS Step Functions executions, and Kubernetes Jobs all instantiate the same design. Along the way: polling and callback patterns, idempotency at the operation layer, and the gap between “what the workflow engine knows” and “what the client needs to see.”

This chapter builds directly on Chapter 80 (workflow orchestration) — you should have the durable-execution vocabulary in hand. It also connects to Chapter 78 (idempotency), because the operation is the natural idempotency key for async work, and to Chapter 79 (execution modes), because the operation is the protocol shape of the async mode.

Outline:

  1. The shape of an operation.
  2. Why the operation is its own SoR.
  3. The operation state machine.
  4. What the operation stores vs what the executor stores.
  5. Consistency with the underlying executor.
  6. Polling vs callbacks at the operation layer.
  7. Idempotency and operation naming.
  8. Real-world instantiations: GCP LRO, Step Functions, K8s Jobs.
  9. Operation service API design.
  10. The mental model.

82.1 The shape of an operation

An operation is a durable record of a request for long-running work. It has an id, a state, an input snapshot, optionally a result, optionally an error, timestamps, and metadata. It is created when the client submits the work and exists until it is explicitly deleted or ages out.

The minimum shape:

Operation {
  id              : string (opaque, globally unique)
  kind            : string ("fine-tune", "batch-embed", "transcribe", ...)
  tenant_id       : string
  user_id         : string
  state           : enum { PENDING, RUNNING, SUCCEEDED, FAILED, CANCELED }
  created_at      : timestamp
  started_at      : timestamp | null
  completed_at    : timestamp | null
  input           : json (request body or reference)
  result          : json | null
  error           : { code, message, details } | null
  metadata        : json (progress, logs-ref, etc.)
  etag            : string (optimistic concurrency)
  executor_ref    : opaque handle to the workflow engine (workflow id, execution arn, job uid)
}

This is the record the client sees when they poll GET /v1/operations/op_01HX.... It is NOT the record the workflow engine stores for the same work. The workflow engine stores its own thing — an event history, activity tasks, retry counts, internal identifiers — and the operation is a translation layer that presents a stable, product-level view of that internal state.

The key phrase is “stable, product-level view.” Clients do not want to know about Temporal workflow ids and event histories. They want to know “is my job done, did it succeed, what is the result, how much did it cost, how do I cancel it.” The operations service provides exactly that.

82.2 Why the operation is its own SoR

“Why not just query the workflow engine directly?” is the first question every junior designer asks. The answer has five parts.

(1) The product contract outlives the executor. Workflow engines change. You start on Airflow, migrate to Temporal, maybe end up on a custom executor for one workload. The client-visible operation API must not change across those migrations. If clients are polling the Temporal workflow id directly, migrating off Temporal breaks them. If they are polling an opaque operation id that the operations service translates, the migration is invisible.

(2) Cross-executor work. Some operations involve more than one executor. A “fine-tune” operation might involve a Temporal workflow that triggers a Kubeflow training job that waits for a signed-off evaluation that kicks off another Temporal workflow. The client operation is one logical thing; the underlying work spans multiple systems. Only a dedicated operations service can present this as a single coherent entity.

(3) State the executor doesn’t model. The client needs things like: “the operation is queued but the upstream billing check hasn’t completed,” “the operation is waiting for human approval,” “the operation succeeded but the result is being packaged,” “the operation failed but we are refunding the usage.” These are product-layer states that don’t exist in the workflow engine’s vocabulary.

(4) Auth and tenancy. The operation needs per-tenant authorization — “user U from tenant T can only see operations created by tenant T.” Enforcing this over the workflow engine’s native API is awkward at best; the workflow engine’s model is “workflow id is the identifier, if you have it you can query it.” The operations service owns the per-tenant access control.

(5) Cost accounting and billing hooks. Metering (Chapter 83) attaches usage to operations, not to workflow executions. The operation is the unit of billing. Billing needs its own durable record of “did this operation succeed, how much work did it do, what is the bill.”

Put differently: the operation lives in the product domain; the execution lives in the infrastructure domain. They are separate concerns and should live in separate services.

82.3 The operation state machine

The operation’s state machine is deliberately simpler than the executor’s. The executor has dozens of internal states (queued, scheduled, running activity X, waiting on retry, blocked on signal, suspended, continuing-as-new, and so on). The operation exposes roughly five.

The operation state machine: five states (PENDING → RUNNING → one of SUCCEEDED, FAILED, CANCELED), monotonic forward transitions, terminal states stable and immutable. Terminal states never regress; a SUCCEEDED operation stays SUCCEEDED even if the executor is replaced, because the operation is its own source of truth.
  • PENDING: the operation has been created and durably stored but work has not started. The executor may or may not have accepted it yet.
  • RUNNING: work has started in the executor.
  • SUCCEEDED: terminal; result is populated.
  • FAILED: terminal; error is populated.
  • CANCELED: terminal; the client (or admin) explicitly canceled.

Some designs add an ABORTING transient state between RUNNING and CANCELED for the window where cancellation has been requested but the executor has not yet acknowledged.

The guarantees clients can rely on:

  • State transitions are monotonic in the forward direction. Once an operation is SUCCEEDED, it stays SUCCEEDED — it does not become FAILED later.
  • Terminal states are stable. After reaching a terminal state, the operation record does not change (except for metadata/TTL updates).
  • Every operation either reaches a terminal state or ages out. There are no forever-stuck operations from the client’s perspective.
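The monotonicity and terminal-stability guarantees fit in a few lines. A sketch, using the state names from above; the ALLOWED table and the apply_transition helper are illustrative, not a real library API:

```python
# Hypothetical sketch of the five-state machine. Terminal states have no
# outgoing transitions, so they can never regress.
PENDING, RUNNING, SUCCEEDED, FAILED, CANCELED = (
    "PENDING", "RUNNING", "SUCCEEDED", "FAILED", "CANCELED")

ALLOWED = {
    PENDING:   {RUNNING, FAILED, CANCELED},  # may fail or cancel before starting
    RUNNING:   {SUCCEEDED, FAILED, CANCELED},
    SUCCEEDED: set(),                        # terminal: no transitions out
    FAILED:    set(),
    CANCELED:  set(),
}

def apply_transition(current: str, target: str) -> str:
    """Return the new state, or raise if the transition would regress."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```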

The last guarantee is the hard one. It means the operations service must have a reaper that catches operations stuck in non-terminal states for too long and force-transitions them to FAILED with a “timed out” error. Otherwise bad executor bookkeeping leaks into the API.
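A minimal reaper sketch, with sqlite3 standing in for the operations database; the table shape, error string, and timeout policy are assumptions:

```python
# Force-fail operations stuck in non-terminal states past a deadline, so bad
# executor bookkeeping never leaks a forever-RUNNING operation to clients.
import sqlite3

def reap_stuck(db: sqlite3.Connection, max_age_s: float, now: float) -> int:
    cutoff = now - max_age_s
    cur = db.execute(
        """UPDATE operations
           SET state = 'FAILED',
               error = 'DEADLINE_EXCEEDED: operation timed out',
               completed_at = ?
           WHERE state IN ('PENDING', 'RUNNING') AND created_at < ?""",
        (now, cutoff))
    db.commit()
    return cur.rowcount  # number of operations force-failed this sweep
```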

82.4 What the operation stores vs what the executor stores

The split is load-bearing. The operation stores:

  • Everything the client is allowed to see: id, state, input snapshot, result, error, timestamps.
  • Everything the billing pipeline needs: usage counters or a reference to a usage record, cost attribution.
  • Everything the auth layer needs: tenant, user, resource references.
  • The executor reference: an opaque handle that only the operations service uses to find the corresponding execution.

The executor stores:

  • The workflow history (event log, activity records, retry state).
  • Intermediate values, checkpoints, heartbeats.
  • Everything needed to resume the work after a crash.
  • Internal identifiers that are NOT stable contracts.

The separation rule is: if a client would ever see it, it lives in the operation. If only the infrastructure cares about it, it lives in the executor. The result field on the operation is the canonical product-level answer; the workflow history is the debugging tool.

A practical consequence: the executor’s data model is owned by the workflow engine team (or by the engine itself, if it’s Temporal). The operation’s data model is owned by the platform team and changes slowly, with API versioning, deprecation, and backwards compatibility discipline. These are different rates of change and different ownership; keeping them in the same schema is a recipe for friction.

Another consequence: the operation’s storage can be a plain Postgres table with nothing fancy. The operation record is a row. Updates are simple UPDATEs. Polling is a simple SELECT. The whole pattern works at scale without exotic infrastructure. By contrast, the executor typically needs its own purpose-built storage — Temporal’s history, Airflow’s metadata DB, Step Functions’ managed log. Don’t collapse these.
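A sketch of that plain table, again with sqlite3 standing in for Postgres (where input/result/error would be JSONB and the timestamps TIMESTAMPTZ); all column and index names are illustrative:

```python
# The operations table needs nothing exotic: one row per operation, indexes
# for the list endpoint and for the reaper/reconciler sweeps.
import sqlite3

DDL = """
CREATE TABLE operations (
    id           TEXT PRIMARY KEY,     -- opaque, e.g. 'op_01HX...'
    tenant_id    TEXT NOT NULL,
    kind         TEXT NOT NULL,
    state        TEXT NOT NULL DEFAULT 'PENDING',
    input        TEXT,                 -- JSON snapshot or reference
    result       TEXT,
    error        TEXT,
    executor_ref TEXT,                 -- opaque handle, never client-visible
    etag         TEXT NOT NULL,
    created_at   REAL NOT NULL,
    completed_at REAL
);
-- list endpoint: filter by tenant + state, newest first
CREATE INDEX ops_by_tenant_state ON operations (tenant_id, state, created_at);
-- reaper/reconciler: sweep non-terminal operations by age
CREATE INDEX ops_by_state_created ON operations (state, created_at);
"""

db = sqlite3.connect(":memory:")
db.executescript(DDL)
```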

82.5 Consistency with the underlying executor

The operation and the executor must stay consistent. If the executor says a workflow completed but the operation is still RUNNING, clients see stale state. If the operation is SUCCEEDED but the executor is still running, work happens invisibly. The consistency model is the implementation detail with the most failure modes.

Four reasonable consistency models:

The operations service and the executor are loosely coupled: the operations service (a Postgres operations table tracking PENDING → RUNNING → terminal) ① starts the workflow and records the executor_ref; the executor (e.g. Temporal, owning the event log and activities) runs the actual work and ② calls back on completion, the fast path; a reconciler loop (every ~30 s) is the backstop that catches callbacks lost to restarts or network hiccups. Neither path is the sole source of truth; any operation that misses its callback is resolved within one reconciler cycle.

(1) Eager write, then best-effort sync. On create: write operation PENDING, then call executor to start the workflow, then update operation RUNNING with executor ref. On executor completion: executor calls back to operations service, which flips the operation terminal. The executor callback is the source of truth for state transitions.

The failure mode: if the callback is lost (network hiccup, operations service restart at the wrong moment), the operation stays RUNNING forever. Mitigations: retries on the callback, reconciliation loop that sweeps non-terminal operations and queries the executor for their state.

(2) Polling reconciliation. A reconciler runs periodically (every few seconds), takes the list of operations in PENDING/RUNNING, queries the executor for each, and updates the operation state. This is robust — it catches any drift — but it has a latency floor (the poll interval) and costs executor calls proportional to in-flight operations.

(3) Event stream from executor. The executor publishes state-change events to a stream (Kafka, PubSub, etc.). The operations service consumes the stream and updates operations accordingly. Low latency, cleanly decoupled. Requires the executor to expose such a stream, which Temporal and Step Functions both do, Airflow doesn’t natively.

(4) Hybrid. Use callbacks for the happy path (fast, low cost), polling reconciliation as the backstop (catches drift), and/or an event stream for high-volume workloads. This is what mature systems run.

Temporal supports pattern 1 cleanly via workflow completion callbacks or the GetWorkflowExecutionHistory API. Step Functions supports pattern 3 via EventBridge integration. Airflow forces pattern 2 because its native API is polling-oriented.

The edge cases that bite:

  • Operation created but executor call fails. The operation is stuck PENDING with no workflow. Mitigation: a background worker retries the executor call or transitions the operation to FAILED after a timeout.
  • Executor runs the workflow but the operation record was never persisted (crash between “start workflow” and “commit operation”). Mitigation: deduplication key on the executor side so the call is idempotent (Chapter 78).
  • Executor completes but the operations service is down. Mitigation: executor callback retries with exponential backoff, plus reconciliation catches it on recovery.
  • Cancellation requested but the workflow is in a non-interruptible phase. Mitigation: operations service flags the operation as ABORTING, workflow checks for the flag at activity boundaries.

All of these are handled by mature implementations and all of them are bugs in the first version.
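The backstop path is a single reconciler sweep. In this sketch, fetch_executor_state stands in for whatever the engine exposes (DescribeExecution, a history query); the state mapping and the store callbacks are assumptions:

```python
# One reconciliation pass: for every non-terminal operation, ask the executor
# for its state and converge the operation record. Flips are forward-only.
TERMINAL = {"SUCCEEDED", "FAILED", "CANCELED"}

# executor vocabulary -> operation vocabulary (illustrative mapping)
EXECUTOR_TO_OP = {
    "COMPLETED":  "SUCCEEDED",
    "FAILED":     "FAILED",
    "TIMED_OUT":  "FAILED",
    "TERMINATED": "CANCELED",
}

def reconcile_once(list_nonterminal, fetch_executor_state, update_state):
    """Sweep non-terminal operations; return how many were converged."""
    fixed = 0
    for op_id, executor_ref in list_nonterminal():
        exec_state = fetch_executor_state(executor_ref)
        op_state = EXECUTOR_TO_OP.get(exec_state)
        if op_state in TERMINAL:
            update_state(op_id, op_state)  # monotonic: only forward flips
            fixed += 1
    return fixed
```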

82.6 Polling vs callbacks at the operation layer

The operations service offers clients two ways to discover state changes: polling (GET the operation) or callbacks (webhook to a registered URL). This is parallel to but distinct from the operation/executor consistency patterns — it is the client-facing view.

graph LR
  A[Create operation] --> B[PENDING]
  B --> C{executor started?}
  C -->|yes| D[RUNNING]
  D --> E{result?}
  E -->|success| F[SUCCEEDED]
  E -->|error| G[FAILED]
  D --> H[CANCELED]
  B --> I[retry if executor start fails]

Client polling with exponential backoff (1→2→4→8→30 s cap) is the universal default; webhooks are the optimization for server-to-server callers that want to avoid polling entirely.

Polling is the default because it works for every client type, including browsers behind NAT. Clients typically poll with exponential backoff: 1s, 2s, 4s, 8s, 16s, 30s, 30s, … For long operations this is wasteful and the mitigations are (a) long-polling (the GET endpoint holds the connection open for N seconds and returns early on state change) or (b) HTTP/2 server push or (c) just accepting the cost because it is small in aggregate.
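The backoff loop itself is a few lines. A client-side sketch with the dependencies injected so it runs without real HTTP or real time; get_operation stands in for GET /v1/operations/{id}:

```python
# Poll until the operation reaches a terminal state, doubling the delay
# between polls up to a cap (1 -> 2 -> 4 -> 8 -> ... -> 30 s).
TERMINAL = {"SUCCEEDED", "FAILED", "CANCELED"}

def poll_until_done(get_operation, sleep, cap_s=30.0):
    delay = 1.0
    while True:
        op = get_operation()           # stand-in for GET /v1/operations/{id}
        if op["state"] in TERMINAL:
            return op
        sleep(delay)
        delay = min(delay * 2, cap_s)  # exponential backoff, capped
```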

Callbacks (webhooks) push state changes to a client-registered URL. The operations service POSTs a JSON payload:

{
  "operation_id": "op_01HX...",
  "state": "SUCCEEDED",
  "completed_at": "2026-04-10T12:34:56Z",
  "result_url": "https://api.example.com/v1/operations/op_01HX.../result"
}

Webhooks require reliability machinery: retries on non-2xx responses, exponential backoff, dead-letter queue after N failures, HMAC signature for authenticity verification, idempotent delivery (the client should handle duplicates). Real webhook systems are their own little service.

For the operations service specifically, webhooks often point at the client’s own backend (server-to-server) rather than at end users. A typical shape: a user kicks off a fine-tune from the UI; the UI polls the operation briefly to show “pending / running”; the operation is too long for the UI to wait on, so the user backgrounds the tab; the user’s backend receives a webhook on completion and emails them.
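A sketch of the delivery core, signature plus bounded retries; the header name, signing scheme, and injected post function are assumptions, not a standard:

```python
# Sign the payload with HMAC-SHA256 so the receiver can verify authenticity,
# then deliver with bounded retries and exponential backoff. On exhaustion
# the caller dead-letters the event.
import hashlib
import hmac
import json

def sign(secret: bytes, body: bytes) -> str:
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def deliver(post, sleep, secret: bytes, payload: dict, max_attempts=5) -> bool:
    body = json.dumps(payload, sort_keys=True).encode()
    headers = {"X-Signature-SHA256": sign(secret, body)}
    delay = 1.0
    for _ in range(max_attempts):
        if 200 <= post(body, headers) < 300:
            return True
        sleep(delay)
        delay *= 2
    return False  # give up; dead-letter the event
```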

82.7 Idempotency and operation naming

Operation creation is the canonical idempotency problem (Chapter 78). If the client submits a job, the network hiccups, and the client retries, you do not want two copies of the job — you want the second submission to return the same operation as the first.

The clean solution is client-supplied operation names. Google Cloud calls this a “resource name” and the pattern is:

POST /v1/fine-tunes HTTP/1.1
Content-Type: application/json

{
  "operation_id": "client-generated-uuid-or-idempotency-key",
  "model": "llama-3-8b",
  "dataset": "gs://...",
  ...
}

The operations service uses operation_id as a unique key. If it already exists, return the existing operation (possibly with 201 Created on the first call and 200 OK on duplicates). The client can safely retry the create call as many times as it wants.
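A sketch of that create path, with the unique key doing the work; sqlite3 stands in for Postgres (where the same shape is INSERT ... ON CONFLICT), and the table columns are illustrative. A concurrent duplicate loses the INSERT race and reads the winner's row:

```python
# Idempotent create: the client-supplied operation_id is the primary key.
# First call inserts (201 Created); retries hit the unique constraint and
# return the existing row (200 OK).
import sqlite3

def create_operation(db, op_id: str, kind: str, input_json: str):
    try:
        db.execute(
            "INSERT INTO operations (id, kind, state, input) "
            "VALUES (?, ?, 'PENDING', ?)",
            (op_id, kind, input_json))
        db.commit()
        created = True    # first call
    except sqlite3.IntegrityError:
        created = False   # duplicate/retry: return the existing operation
    row = db.execute(
        "SELECT id, state, input FROM operations WHERE id = ?",
        (op_id,)).fetchone()
    return created, row
```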

Server-generated ids are the other option but require a separate Idempotency-Key header to deduplicate retries. Both work. Client-generated ids are cleaner because the id is stable from the client’s perspective before the server responds.

Operation ids should be:

  • Globally unique (no collisions across tenants).
  • Opaque (client should treat as a string, not parse).
  • Short enough to fit in URLs without encoding pain (≤ 50 characters).
  • Lexically sortable by creation time if you can swing it (ULIDs, KSUIDs) — this makes listing and pagination much nicer.

The typical format is something like op_01HX8ZKJA3P5T4YNCNDW34Q9VX (ULID with an op_ prefix). Tenant-scoped variants prepend the tenant: acme/op_01HX....
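This is not a ULID implementation; it is a sketch of the one property that matters here, that a fixed-width millisecond-timestamp prefix makes ids lexically sortable by creation time. The format is made up for illustration:

```python
# A sortable-id sketch: 12 hex chars of millisecond timestamp, then 16 hex
# chars of randomness. Earlier creation -> lexically smaller id.
import secrets
import time

def new_op_id(now_ms=None) -> str:
    ts = int(time.time() * 1000) if now_ms is None else now_ms
    return f"op_{ts:012x}{secrets.token_hex(8)}"  # 3 + 12 + 16 = 31 chars
```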

82.8 Real-world instantiations

Google Cloud Long-Running Operations

Google Cloud’s google.longrunning.Operation is the reference design for this pattern. Every GCP service that does async work returns an Operation resource:

message Operation {
  string name = 1;        // "operations/..." or "projects/.../operations/..."
  google.protobuf.Any metadata = 2;   // service-specific progress info
  bool done = 3;
  oneof result {
    google.rpc.Status error = 4;
    google.protobuf.Any response = 5;
  }
}

Clients poll operations.get(name) until done: true, then read response (on success) or error (on failure). The same API is used across every GCP service — Vertex AI training, Speech-to-Text long audio, Compute Engine instance creation — so clients learn it once and reuse it.

AWS Step Functions Executions

Step Functions represents each workflow invocation as an Execution with an ARN. The API is almost exactly the operation-service pattern:

  • StartExecution → returns execution ARN.
  • DescribeExecution(arn) → returns state (RUNNING, SUCCEEDED, FAILED, TIMED_OUT, ABORTED), input, output.
  • StopExecution(arn) → cancels.
  • EventBridge (formerly CloudWatch Events) fires on state changes for callback-style consumers.

The execution is the operation in this model. Step Functions collapses the operations service and the executor into one product, which works because it’s all managed together.

Kubernetes Jobs

A Kubernetes Job resource is the same pattern in K8s vocabulary:

apiVersion: batch/v1
kind: Job
metadata:
  name: process-doc-abc123
status:
  conditions:
    - type: Complete
      status: "True"

Clients (or controllers) watch the Job’s status to know when it’s done. The Job resource IS the operation record — it’s durable in etcd, has a name, has a state machine, and is queried by name. Kubernetes collapses the operation and executor in the same way Step Functions does.

OpenAI Batches

OpenAI’s Batch API explicitly uses the operation pattern at the application layer:

POST /v1/batches → {id, status: "validating", ...}
GET  /v1/batches/{id} → {id, status, output_file_id, ...}
POST /v1/batches/{id}/cancel

Status values: validating, failed, in_progress, finalizing, completed, expired, cancelling, cancelled. A richer client-visible state machine than the five-state core, but the same move: product-level states on top, executor details hidden below.

Each of these instances makes the same trade: product-stable state machine on top, opaque executor details below. Each looks slightly different on the wire but the pattern is identical.

82.9 Operation service API design

A reasonable operations service API:

POST /v1/operations                        create (with idempotency_key)
GET  /v1/operations/{id}                   read
GET  /v1/operations                        list (filter by tenant, kind, state)
POST /v1/operations/{id}:cancel            cancel
DELETE /v1/operations/{id}                 purge (terminal only)
GET  /v1/operations/{id}/events            event stream / long-poll

Creation is usually not a generic endpoint — it’s specialized per operation kind (POST /v1/fine-tunes, POST /v1/transcriptions, etc.) because the input shape differs. Read, list, cancel, and delete are generic.

The read endpoint should support:

  • ETags for optimistic concurrency and for long-polling support.
  • Field masks so clients can request minimal payloads (don’t return 50MB of metadata on every poll).
  • Includes/expansions for related data (linked result file, linked usage record).

The list endpoint should support:

  • Filtering by state, kind, tenant, time range.
  • Pagination (cursor-based, not offset-based, because operations churn).
  • Limit parameter with a sane default (50) and a hard max (500).

Cancel should be:

  • Idempotent (cancel of an already-canceled or terminal operation is a no-op, returns 200).
  • Fast to acknowledge but not synchronous on the executor (returns ABORTING state immediately; the executor takes some time to actually stop).
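A sketch of the idempotent, fast-acknowledge cancel; the dict-shaped operation record and the injected stop function are illustrative:

```python
# Cancel acknowledges immediately (RUNNING -> ABORTING) and asks the executor
# to stop asynchronously; a later callback flips ABORTING -> CANCELED.
# Cancel of a terminal or already-aborting operation is a no-op.
TERMINAL = {"SUCCEEDED", "FAILED", "CANCELED"}

def cancel(op: dict, request_executor_stop) -> dict:
    if op["state"] in TERMINAL or op["state"] == "ABORTING":
        return op                              # idempotent no-op, 200
    op["state"] = "ABORTING"                   # acknowledged immediately
    request_executor_stop(op["executor_ref"])  # async; executor stops later
    return op
```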

Delete should:

  • Only work on terminal operations (reject on running).
  • Actually free storage (or schedule background GC).
  • Be auditable.

The events endpoint (optional but nice) exposes a sub-resource for progress events, logs, intermediate outputs. This is how fine-tune APIs expose training metrics mid-run; how batch APIs expose partial progress; how transcription APIs stream partial transcripts. The events endpoint is usually a long-polled or SSE stream (Chapter 79).

82.10 The mental model

Eight points to take into Chapter 83:

  1. The operation is a product-domain entity. It is what the client sees and the API contract. It is not the workflow engine’s execution record.
  2. The operation is its own system of record. Separate storage, separate API, separate state machine, loosely coupled to the executor.
  3. The operation’s state machine is deliberately simpler than the executor’s. Five states, monotonic, terminal states are stable.
  4. Consistency with the executor is the hardest part and is usually a hybrid of callbacks plus reconciliation.
  5. Client-supplied operation ids are the clean idempotency solution for creation. ULIDs with an op_ prefix are the de facto shape.
  6. Polling is the default client pattern because it works for every client. Webhooks are an optimization for server-to-server callers.
  7. The operation is the unit of billing. Metering attaches usage to operations, not to executor internals (Chapter 83).
  8. Google Cloud LROs, Step Functions executions, K8s Jobs, and OpenAI Batches all instantiate the same pattern. Learn one, recognize the rest.

Chapter 83 picks up from here: how metering and billing pipelines attach usage to operations and turn data-plane events into invoices.


Read it yourself

  • Google Cloud’s “Long-running operations” design guide: google.longrunning.Operation in the googleapis repo.
  • The AWS Step Functions “Execution” API reference and the EventBridge integration docs.
  • The Kubernetes Job controller source (pkg/controller/job in kubernetes/kubernetes) for a reference reconciler.
  • The OpenAI Batch API reference — a clean small instance of the pattern.
  • Google API Design Guide, “Standard Methods” and “Long-running Operations” sections.
  • Martin Kleppmann, Designing Data-Intensive Applications, Chapter 5 for the consistency framing that underlies the executor/operation sync problem.

Practice

  1. Design the state machine for a “fine-tune a model” operation. What are the non-terminal states, what are the terminal states, and what causes each transition?
  2. A naive team uses the Temporal workflow id directly as the public operation id. List three concrete problems they will hit.
  3. Write the schema for an operations Postgres table. What are the indexes? What is the primary key?
  4. Implement the “create is idempotent” property: client POSTs with operation_id=X, operations service either creates or returns the existing. What happens if the create is concurrent with itself?
  5. Design the reconciliation loop that catches operations stuck in RUNNING when the executor has actually completed. How often does it run? What does it query?
  6. Write the cancel flow: client calls POST /v1/operations/{id}:cancel. What happens in the operations service, what happens in the executor, what state transitions happen in which order?
  7. Stretch: Implement a minimal operations service in ~300 lines: Postgres-backed store, REST API, in-memory executor (a thread pool), webhook notification on terminal transitions, idempotent create, cancel endpoint. Prove the “operation survives executor restart” property.