Workflow orchestration: Temporal vs Airflow vs Step Functions vs Cadence
"Durable execution is the one-line pitch: your code runs even when the process crashes, the VM reboots, and the cluster fails over. Everything else in this chapter is the how."
Workflow orchestration is the answer to a simple question: how does work get done reliably when the work is longer than a single process’s lifetime? The naive answer — “write a Python script, put it in cron, pray” — fails as soon as you have retries, long-running jobs, fan-out, human approvals, or partial failures that must not leave the system inconsistent. The mature answer is a workflow engine. Temporal, Airflow, Step Functions, and Cadence are the four major systems, and choosing between them is one of the recurring design questions in senior ML-systems interviews.
After this chapter, the mental model for "when do I need a workflow engine, and which one?" should be sharp: what durable execution actually means and how it is implemented; what event sourcing under the hood looks like; why the workflow/activity split exists and why determinism is the enabling constraint; a comparative table of the four systems; and a clear answer to when each wins.
This chapter pairs with Chapter 70 (workflow vs agent), which sits on top of the vocabulary introduced here — Chapter 70 argues that many “agents” should be workflows, and this chapter explains what a workflow actually is at the infrastructure level.
Outline:
- Durable execution from first principles.
- Event sourcing under the hood.
- The workflow vs activity split.
- Deterministic replay.
- Temporal in depth.
- Airflow in depth.
- AWS Step Functions.
- Cadence and the lineage.
- Comparative table.
- When to use each.
- The mental model.
80.1 Durable execution from first principles
Durable execution is the property that your program makes progress even when the underlying compute fails. The process can crash mid-function, the VM can be evicted, the cluster can fail over to a different region — and when a worker comes back, the program resumes exactly where it left off, with all local variables restored and the next step executed as if nothing had happened.
This is a strong guarantee and it is not what a normal program has. A normal Python script holds its state in the process’s memory. If the process dies, the state is gone. Any work in flight is lost. Re-running the script re-runs it from the top, which is either wasteful (you redo everything) or dangerous (you send two payment requests because the second run doesn’t know the first half of the work succeeded).
The workflow engine’s job is to lift the program’s state out of the process memory and into durable storage, continuously. Every time the workflow reaches a meaningful event — “start this activity,” “activity returned X,” “timer fired,” “signal received” — the engine writes that event to a persistent log. When the process dies and a new worker picks up the workflow, it reads the log and rebuilds the state.
The consequence is counterintuitive. The workflow code looks like normal imperative code — “call A, wait for the result, call B with it, if B failed retry, then call C” — but underneath it is being materialized out of an event log every time it resumes. The engineer writes sequential code; the runtime transforms it into a crash-safe state machine.
The four systems provide this property to varying degrees, implemented differently:
- Temporal / Cadence: imperative code in any SDK language; the runtime replays the code against the event log to reconstruct state.
- Airflow: DAG definition in Python; the scheduler runs tasks in topological order and records each task’s outcome.
- Step Functions: JSON state machine definition (Amazon States Language); the service executes the state machine and records each state’s output.
The axis they all sit on is “how much code vs how much configuration.” Temporal is the most code-like. Step Functions is the most config-like. Airflow is the DAG-of-tasks mid-point. The trade is expressiveness against deployability.
80.2 Event sourcing under the hood
Every major workflow engine is implemented as event sourcing. The persistent state is not a snapshot of “where the workflow is now” but an append-only log of events. Rebuilding the current state means replaying the log.
For a Temporal workflow that calls two activities sequentially, the event log looks like:
1. WorkflowExecutionStarted { input: {...} }
2. WorkflowTaskScheduled
3. WorkflowTaskStarted
4. WorkflowTaskCompleted
5. ActivityTaskScheduled { activity: "downloadFile", input: "s3://..." }
6. ActivityTaskStarted
7. ActivityTaskCompleted { result: "/tmp/abc" }
8. WorkflowTaskScheduled
9. WorkflowTaskStarted
10. WorkflowTaskCompleted
11. ActivityTaskScheduled { activity: "processFile", input: "/tmp/abc" }
12. ActivityTaskStarted
13. ActivityTaskCompleted { result: "done" }
14. WorkflowExecutionCompleted { result: "done" }
Every event is written durably before the next step is dispatched. If the worker crashes after event 7 is written but before event 8, the workflow’s state is unambiguously “activity 1 returned /tmp/abc, next task has been scheduled.” A new worker picks up the workflow, sees the log, and continues from exactly that point.
The log is the source of truth. The in-memory state of the workflow is derived and ephemeral. This is the core idea of event sourcing (Martin Fowler’s writeup is the canonical reference) and workflow engines are the most successful commercial application of the pattern.
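A toy fold over such a log makes the "state is derived" point concrete. This is an illustrative sketch in plain Python, not Temporal's real data model; the event names mirror the history above:

```python
# Sketch: derive "where is this workflow?" purely from an append-only event log.
# Event names mirror Temporal's history events; the logic is illustrative only.

def derive_state(log):
    """Fold over the event log; return (completed_results, pending_activity)."""
    results = []   # results of completed activities, in order
    pending = None # activity scheduled but not yet completed
    for event in log:
        kind = event["type"]
        if kind == "ActivityTaskScheduled":
            pending = event["activity"]
        elif kind == "ActivityTaskCompleted":
            results.append(event["result"])
            pending = None
        elif kind == "WorkflowExecutionCompleted":
            pending = None
    return results, pending

# A worker that picks up the workflow mid-flight sees a log like this:
log = [
    {"type": "WorkflowExecutionStarted"},
    {"type": "ActivityTaskScheduled", "activity": "downloadFile"},
    {"type": "ActivityTaskCompleted", "result": "/tmp/abc"},
    {"type": "ActivityTaskScheduled", "activity": "processFile"},
]
results, pending = derive_state(log)
print(results, pending)  # ['/tmp/abc'] processFile -> resume by running processFile
```

No snapshot of "current state" is ever written; the fold over the log *is* the state.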
The practical implications:
Logs must be monotonic and durable. Before any event is “observed” by the workflow, it must be committed to durable storage. Temporal uses Cassandra, MySQL, or PostgreSQL as the event store. Step Functions uses a managed distributed log. Airflow uses its metadata DB (a single Postgres/MySQL). Durability is the hard constraint.
Workflows have a maximum history size. The event log for a single workflow cannot grow unbounded without hurting replay performance. Temporal has a default limit of roughly 50k events and pushes you toward "continue-as-new" as a workflow approaches it — the workflow starts a fresh execution carrying forward its current state. Airflow sidesteps this by keeping DAG runs short. Step Functions limits execution history to 25k events.
Replay is the killer feature and the constraint. To rebuild the workflow state, the engine replays the code (in Temporal/Cadence) or the state definition (in Step Functions) against the event log. This means the code must be deterministic — more on this in §80.4.
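The continue-as-new escape hatch can be sketched schematically (plain Python; names and the tiny limit are illustrative, not SDK API):

```python
# Sketch of continue-as-new: restart the workflow with carried-over state once
# the event log nears its limit. Names and the limit value are illustrative.
MAX_EVENTS = 10  # stand-in for Temporal's ~50k-event threshold

def process_items(items, cursor=0):
    """One 'execution': process items until the history budget is spent,
    then hand the remaining work to a fresh execution (continue-as-new)."""
    history = []
    while cursor < len(items):
        if len(history) >= MAX_EVENTS:
            return {"continue_as_new": True, "cursor": cursor, "events": len(history)}
        history.append(("ActivityTaskCompleted", items[cursor]))
        cursor += 1
    return {"continue_as_new": False, "cursor": cursor, "events": len(history)}

cursor, runs = 0, 0
while True:
    out = process_items(list(range(25)), cursor)
    runs += 1
    cursor = out["cursor"]
    if not out["continue_as_new"]:
        break
print(runs)  # 3 executions: 10 + 10 + 5 items, each with a bounded history
```

Each "execution" keeps its own short history; only the cursor crosses the boundary, which is exactly why long-running workflows stay replayable.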
80.3 The workflow vs activity split
Every workflow engine splits work into two categories: the workflow code (deterministic, replayable, runs on workers but never talks to the outside world directly) and the activity code (the actual side-effecting work, runs on activity workers, can do anything but must report its result back).
The workflow code is like a conductor. It says “do X, wait for the result, do Y with the result, if Y takes more than an hour, retry, after three retries fall back to Z.” It does not make network calls, does not touch the clock directly, does not read random numbers, and does not produce non-deterministic output. Every side-effect goes through an activity.
The activity code is where the real work lives. Downloading a file, calling an external API, running a model, writing to a database, sending an email — all activities. Activities run on their own workers, can be retried independently, have their own timeouts, and their results are recorded in the event log so the workflow can read them back on replay.
A typical Temporal workflow in Python:
```python
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

# The activities (download_document, extract_text, summarize_with_llm,
# store_summary) are defined elsewhere with @activity.defn.


@workflow.defn
class ProcessDocument:
    @workflow.run
    async def run(self, doc_id: str) -> str:
        file_path = await workflow.execute_activity(
            download_document,
            doc_id,
            start_to_close_timeout=timedelta(minutes=5),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )
        extracted = await workflow.execute_activity(
            extract_text,
            file_path,
            start_to_close_timeout=timedelta(minutes=2),
        )
        summary = await workflow.execute_activity(
            summarize_with_llm,
            extracted,
            start_to_close_timeout=timedelta(minutes=10),
            retry_policy=RetryPolicy(maximum_attempts=5),
        )
        await workflow.execute_activity(
            store_summary,
            {"doc_id": doc_id, "summary": summary},
            start_to_close_timeout=timedelta(seconds=30),
        )
        return summary
```
The workflow is a sequence of activity calls with timeouts and retry policies. Each execute_activity call translates into an ActivityTaskScheduled event. When the activity completes, the result is stored as an ActivityTaskCompleted event, and the workflow’s code resumes with the result in hand. The workflow code itself does nothing stateful — all state is in the event log.
The split is enforced by the SDK. If you try to call requests.get() from inside a Temporal workflow, the sandbox will raise an error (or at least warn you) because it is non-deterministic network I/O. Instead of datetime.now(), the SDK provides workflow.now(), which returns a deterministic, replay-safe value.
Why the split? Because workflow code must be replayable (§80.4) and activities need not be. Activities can do anything as long as they report their final result to the log. Workflows are deterministic state machines whose only job is to orchestrate activities.
80.4 Deterministic replay
The whole architecture rests on this property: given the same event log, the workflow code must produce the same sequence of actions.
Why: when a worker picks up a running workflow, it replays the workflow code from the top against the existing event log. Every time the code calls execute_activity, the runtime checks the log. If the corresponding event is already there (“activity X returned Y”), it returns Y immediately. If not, it actually schedules the activity. The code executes the same path it executed before, re-deriving its local state as it goes.
This only works if the code is deterministic. If the code reads the system clock and branches on it, two different replays will take different paths, and the replay will fail (Temporal will raise NonDeterminismException). Same for random numbers, network calls, reading files, anything non-deterministic.
Temporal’s enforcement is rigorous. The SDK provides deterministic wrappers for anything that would be non-deterministic: workflow.now(), workflow.random(), workflow.sleep(), workflow.execute_activity(), workflow.wait_condition(). Anything else should go through an activity.
The restrictions feel annoying at first but are exactly what enables the durability guarantee. You trade the ability to “just call an HTTP endpoint from inside the workflow” for the guarantee that “your workflow resumes exactly where it left off no matter what happened to the machine running it.” It is a good trade.
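A toy replayer makes the mechanism concrete. This is a sketch under heavy simplification — nothing like the real SDK internals — but it shows the core trick: activity calls go through a context that consults the recorded log before doing any real work:

```python
# Toy deterministic replay: the workflow function is re-run from the top on
# every resume; recorded activity results are served from the log, and only
# the first un-recorded activity actually runs. Illustrative only.

class Replayer:
    def __init__(self, log):
        self.log = log  # durable event log: list of (activity_name, result)
        self.pos = 0    # replay cursor

    def execute_activity(self, fn, *args):
        if self.pos < len(self.log):             # already in history: replay
            name, result = self.log[self.pos]
            assert name == fn.__name__, "non-determinism: code diverged from log"
            self.pos += 1
            return result
        result = fn(*args)                       # new work: run and record
        self.log.append((fn.__name__, result))   # committed before proceeding
        return result

def download(doc_id): return f"/tmp/{doc_id}"
def extract(path):    return f"text({path})"

def workflow(ctx, doc_id):
    path = ctx.execute_activity(download, doc_id)
    return ctx.execute_activity(extract, path)

# First run "crashes" after the download is recorded:
log = []
Replayer(log).execute_activity(download, "abc")  # worker dies here; log survives

# A new worker replays the whole workflow against the surviving log:
result = workflow(Replayer(log), "abc")
print(result)  # text(/tmp/abc) — download was served from the log, not re-run
```

The assert inside execute_activity is the toy version of a NonDeterminismException: if the code takes a different path than the log recorded, replay cannot continue.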
Airflow sidesteps the determinism problem by not replaying. Airflow’s DAG defines the task graph, and each task’s result is recorded — but the DAG code itself only runs at scheduling time. The scheduler is a single process that queries the DB and decides what to run next. This is simpler but much more limited: you can’t express anything that isn’t a DAG (no conditionals, loops, signals in the full sense), and the scheduler is a single point of failure.
Step Functions also sidesteps it by making the workflow definition declarative JSON. There is no code to replay — the definition is data, and the engine executes it directly. This is the simplest model and also the most limited.
80.5 Temporal in depth
Temporal is the modern standard for durable execution. It is an open-source fork of Cadence (Uber’s original system), rewritten and extended by the team that built Cadence. It exposes SDKs in Go, Java, Python, TypeScript, .NET, PHP, and Ruby. The server is written in Go.
Architecture:
- Frontend service: API gateway for SDK clients. Receives StartWorkflowExecution, SignalWorkflowExecution, QueryWorkflow.
- History service: owns the event log for each workflow. Uses Cassandra, MySQL, or PostgreSQL for persistence.
- Matching service: task queue dispatcher. Workers poll for work on named task queues; matching service routes tasks to pollers.
- Worker service: runs internal system workflows (like archival).
- Visibility store: search over workflows. Typically Elasticsearch.
The mental model: workflows live in the History service. Workers (your code) poll the Matching service for tasks. When a worker picks up a workflow task, it calls the History service to read the event log, runs the workflow code against it, and sends back the “next actions” (new activities to schedule, completion, etc.). History service writes them to the log.
Strengths:
- Arbitrary code. You write workflows as normal imperative programs in your favorite language. Loops, conditionals, recursion, complex control flow — all fine.
- Signals, queries, timers, child workflows, continue-as-new, cron — the vocabulary is rich.
- Horizontal scalability. Each service is stateless and scales by adding replicas.
- Battle-tested. Uber, Stripe, Snap, Netflix, Datadog, Coinbase, HashiCorp all run Temporal at serious scale.
Weaknesses:
- Operational complexity. Running a self-hosted Temporal cluster is a real commitment: Cassandra (or the supported SQL variants), Elasticsearch, multiple services, careful backup strategy. Most teams use Temporal Cloud to avoid this.
- Learning curve. The determinism model confuses new users constantly. The first time you hit a NonDeterminismException you spend an afternoon debugging.
- Per-workflow history limits. The ~50k-event soft limit forces patterns like continue-as-new for long-running workflows.
80.6 Airflow in depth
Airflow is the oldest and most widely deployed of the four. It was born at Airbnb in 2014 as a data pipeline scheduler and is now an Apache top-level project. It is not really a general workflow engine — it is a DAG scheduler — but people use it for everything.
Architecture:
- Scheduler: reads DAG files, decides which tasks to run, queues them.
- Executor: runs the tasks (LocalExecutor, CeleryExecutor, KubernetesExecutor, etc.).
- Workers: pick up queued tasks and run them.
- Metadata database: Postgres/MySQL, stores DAG runs, task instances, state.
- Web UI: the famous Airflow DAG graph view.
You define a DAG in Python:
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# do_extract, do_transform, do_load are plain Python callables defined elsewhere.

with DAG(
    dag_id="daily_etl",
    start_date=datetime(2026, 1, 1),
    schedule="0 2 * * *",  # every night at 02:00
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=do_extract)
    transform = PythonOperator(task_id="transform", python_callable=do_transform)
    load = PythonOperator(task_id="load", python_callable=do_load)

    extract >> transform >> load
```
The DAG file is loaded by the scheduler, the graph is computed, and the scheduler starts tasks in topological order. Each task runs as a subprocess (or pod, depending on executor) and reports its result back to the metadata DB. If a task fails, Airflow retries it according to the retries parameter.
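The scheduler's core loop — run tasks in dependency order, record each outcome — can be sketched in a few lines of plain Python (illustrative only; the real scheduler also handles retries, pools, and timetables):

```python
# Sketch: execute a DAG of tasks in dependency order, recording each outcome
# in a stand-in "metadata DB". Illustrative of Airflow's model only.
from graphlib import TopologicalSorter

def run_dag(tasks, deps):
    """tasks: name -> callable; deps: name -> set of upstream task names."""
    db = {}  # metadata-DB stand-in: task name -> terminal state
    for name in TopologicalSorter(deps).static_order():
        if any(db.get(up) != "success" for up in deps.get(name, ())):
            db[name] = "upstream_failed"
            continue
        try:
            tasks[name]()
            db[name] = "success"
        except Exception:
            db[name] = "failed"  # the real scheduler would retry per `retries`
    return db

order = []
tasks = {
    "extract": lambda: order.append("extract"),
    "transform": lambda: order.append("transform"),
    "load": lambda: order.append("load"),
}
deps = {"transform": {"extract"}, "load": {"transform"}}
print(run_dag(tasks, deps))  # every task 'success'
print(order)                 # ['extract', 'transform', 'load']
```

Note what is absent: no replay, no event log. The outcome table in the DB is the only durable state, which is why Airflow tasks must be coarse-grained and restartable on their own.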
Strengths:
- Mature, huge community, huge operator library. There is probably an Airflow operator for whatever external system you need to talk to.
- Cron scheduling is first-class. If your use case is “run this pipeline every night at 2 AM,” Airflow is the right default.
- The DAG UI is genuinely useful. Operators click around in it and understand what is happening.
- Git-ops friendly. DAGs are just Python files; deploying them is rsync.
Weaknesses:
- Not really durable execution. If the scheduler crashes mid-run, the tasks it was about to schedule don’t get scheduled until the scheduler restarts (and the scheduler has historically been a single point of failure; Airflow 2 added scheduler HA, but it is a comparatively recent addition).
- The DAG model is limiting. Conditional branches exist (BranchPythonOperator) but are awkward. Fan-out over a runtime-determined number of items exists (“dynamic task mapping” via expand() in Airflow 2.3+) but is limited. Recursion does not exist.
- DAG parse time matters. The scheduler re-parses DAG files regularly; if your DAGs are slow to import, the scheduler lags.
- Task instances are not workflows — each task is a separate subprocess with no shared state. Passing data between tasks goes through XCom (small values only) or external storage (S3, etc.).
Airflow wins for scheduled data pipelines and loses for “general durable execution.” If what you have is “run this SQL, then this Spark job, then this notebook, every night,” Airflow is the right answer. If what you have is “for each incoming document, run a 47-step workflow with retries and human approvals that might take weeks,” use Temporal.
80.7 AWS Step Functions
Step Functions is Amazon’s managed workflow engine. The workflow is defined in JSON using the Amazon States Language — a declarative state machine description with states like Task, Choice, Parallel, Map, Wait, Succeed, Fail.
A small example:
```json
{
  "Comment": "Process a document",
  "StartAt": "Download",
  "States": {
    "Download": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "download-doc" },
      "Retry": [{ "ErrorEquals": ["States.ALL"], "MaxAttempts": 3 }],
      "Next": "Extract"
    },
    "Extract": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "extract-text" },
      "Next": "Summarize"
    },
    "Summarize": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "summarize-llm" },
      "End": true
    }
  }
}
```
The “activities” are AWS service calls — Lambda functions, ECS tasks, SageMaker jobs, DynamoDB operations, SNS publishes, anything with an ARN. Step Functions executes the state machine, calls into the services, records state transitions, and produces an execution history you can view in the AWS console.
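For fan-out, the Map state iterates a sub-state-machine over an input array — the declarative counterpart to a loop in Temporal. A hedged fragment (the state and function names are illustrative):

```json
"SummarizeEach": {
  "Type": "Map",
  "ItemsPath": "$.documents",
  "MaxConcurrency": 10,
  "ItemProcessor": {
    "StartAt": "Summarize",
    "States": {
      "Summarize": {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": { "FunctionName": "summarize-llm" },
        "End": true
      }
    }
  },
  "End": true
}
```

Every iteration’s state transitions are billed and recorded, which is one reason large fan-outs push teams toward Express mode or a code-first engine.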
Strengths:
- Fully managed. No cluster to run, no database to back up.
- Tight AWS integration. If your workload is “call 10 AWS services in sequence with retries,” Step Functions is the least-friction answer in the cloud.
- Two modes: Standard (long-running, up to a year, full history) and Express (short-running, higher throughput, no full history). Pick one per workflow.
- The console’s execution graph is beautiful and operationally useful.
Weaknesses:
- Lock-in. You are all-in on AWS. Not a concern if you are already all-in on AWS, but it is the main trade.
- JSON DSL is verbose and awkward for complex logic. Amazon has a higher-level abstraction (CDK constructs, Step Functions Workflow Studio) but it is still a state machine definition, not code.
- Per-state cost. Standard workflows charge per state transition, which gets expensive at high volume. Express mode is cheaper per transition but loses full history.
- No signals. Step Functions supports SendTaskSuccess/SendTaskFailure with task tokens for human-in-the-loop, but the general “signal a running workflow from outside” pattern is clumsy.
Step Functions wins for small-to-medium orchestration entirely inside AWS. It loses for “I want to write the workflow in my own language” or “I need arbitrary code, not a state machine.”
80.8 Cadence and the lineage
Cadence is Temporal’s predecessor. It was built at Uber starting in 2015 and open-sourced in 2017. The Temporal team left Uber and forked Cadence in 2019, rewriting and improving it while keeping the same durable-execution model. Today both exist — Cadence is still maintained by Uber and used at massive scale internally (millions of workflows per day) — but outside of Uber, the center of gravity has moved to Temporal.
Technically, Cadence and Temporal are very close. Same architecture (frontend, history, matching, worker services), same event-sourcing model, same workflow/activity split, same determinism constraint. The SDKs have diverged slightly — Temporal’s are generally considered friendlier — and Temporal has added features (schedules, global namespaces, nicer visibility API). For a new project in 2026, Temporal is the right default. Cadence is mostly relevant if you are joining a team that already runs it.
The lineage matters because interview questions sometimes probe it. “What’s the difference between Temporal and Cadence?” is a gimme: Temporal is the fork, Temporal has better SDKs, both implement the same core model.
80.9 Comparative table
| Dimension | Temporal | Airflow | Step Functions | Cadence |
|---|---|---|---|---|
| Model | Imperative code | DAG of tasks | JSON state machine | Imperative code |
| Durability | Event log, full replay | Task DB, no replay | Managed state log | Event log, full replay |
| Long-running | Days, weeks, forever | Minutes to hours typical | Up to 1 year (Standard) | Days, weeks, forever |
| Signals / human-in-loop | First-class | Via sensors, awkward | Via Task tokens, ok | First-class |
| Language | Multi-SDK | Python (for DAGs) | JSON / CDK | Multi-SDK |
| Self-host | Yes (complex) | Yes (moderate) | No | Yes (complex) |
| Managed option | Temporal Cloud | Astronomer, MWAA, Cloud Composer | Native AWS | No public cloud |
| Scale (workflows/day) | Millions | Tens of thousands | Millions | Millions |
| Dev loop | Fast (run locally) | Slow (DB + scheduler) | Slow (deploy-test) | Fast |
| Dynamic fan-out | Natural (loops) | Dynamic task mapping (2.3+) | Map state | Natural |
| Cost model | Infra + Cloud tier | Infra | Per-state transition | Infra |
| Typical use | General durable execution | Scheduled data pipelines | Small AWS-native orchestration | Same as Temporal |
The table is necessarily a summary — every row has qualifications. But the overall shape is accurate and maps to the “when to use each” decision in the next section.
80.10 When to use each
The decision is usually clear if you ask the right questions.
```mermaid
graph TD
    A{Work is a scheduled DAG?} -->|yes| B[Airflow]
    A -->|no| C{All-in on AWS?}
    C -->|yes| D[Step Functions]
    C -->|no| E{Need arbitrary code / signals / loops?}
    E -->|yes| F[Temporal]
    E -->|no, already on Cadence| G[Cadence]
    style F fill:var(--fig-accent-soft),stroke:var(--fig-accent)
    style B fill:var(--fig-surface),stroke:var(--fig-border)
```
When in doubt between Temporal and “just a Redis queue,” ask whether you need crash-safe state, retries across process death, or signals — if yes, Temporal pays rent.
Use Temporal when you need arbitrary code, long-running workflows, signals, human-in-the-loop, and you can run infrastructure (or pay for Temporal Cloud). This is the general case for most ML-platform async work: “when a document arrives, run a 10-step pipeline that might take hours and must retry at each step.” ML inference pipelines, document processing, fine-tuning jobs, evaluation runs, deep research workflows.
Use Airflow when your work is “a DAG of tasks on a cron schedule.” Classic ETL. Nightly model training. Weekly report generation. Scheduled evaluation runs. The DAG model is a perfect fit and the operator ecosystem saves months of work. Don’t use Airflow for “durable execution of arbitrary code” — it will fight you.
Use Step Functions when you are all-in on AWS, your workflow is small-to-medium (<100 states), you want zero infrastructure, and the definition is easy to express as a state machine. A Lambda-backed data enrichment pipeline is the canonical fit. A five-step orchestration across DynamoDB, SNS, and SQS is the canonical fit. Resist it for anything over a couple hundred states or for logic that wants to be code, not configuration.
Use Cadence when you already run Cadence. Otherwise, Temporal.
Use none of them for work that can be sync (Chapter 79). Not every long-running thing needs a workflow engine — sometimes a Redis queue and a worker is enough. The workflow engine pays rent when you need durability, retries across process crashes, signals, or complex state. If you don’t need those, don’t pay the complexity cost.
The anti-pattern worth naming: using Airflow as a general workflow engine for interactive work. Airflow is built for scheduled pipelines. “User submits a request, Airflow runs a DAG, user gets the result” is a shape that Airflow technically supports but suffers with — the scheduler polling interval is measured in seconds, DAG parse time adds latency, the whole model assumes “time-boxed scheduled work.” Teams that try to build user-facing async APIs on Airflow usually regret it within six months and migrate to Temporal.
The opposite anti-pattern: using Temporal for nightly ETL. You can, but you lose the Airflow operator ecosystem (hundreds of integrations with every data tool) and you write more code to get the same result. The DAG of tasks is a better abstraction for ETL, and Airflow happens to be the best implementation of it.
80.11 The mental model
Eight points to take into Chapter 81:
- Durable execution means your code makes progress through process death. The engine lifts state out of memory and into a persistent log.
- The event-sourced log is the source of truth. State is derived, not stored. Replaying the log rebuilds state.
- Workflow/activity split is mandatory. Workflows are deterministic and replayable; activities are the side-effecting work.
- Determinism is the constraint that makes replay work. The SDK provides replay-safe wrappers for time, random, sleep, I/O.
- Temporal is the modern default for general durable execution. Fork of Cadence, multi-SDK, rich vocabulary.
- Airflow is the default for scheduled data pipelines. DAG model, cron, huge operator ecosystem. Not a general workflow engine.
- Step Functions is the default for small AWS-native orchestration. JSON state machine, managed, tight AWS integration, per-transition cost.
- Most ML platforms end up running both Temporal and Airflow. Temporal for interactive async work, Airflow for scheduled pipelines. They are complementary.
In Chapter 81, inter-service trust: mTLS, signed URLs, JWT chains, and the defense-in-depth layers that make a workflow engine’s activities safe to call.
Read it yourself
- Temporal’s “Workflow Execution” documentation and the “Determinism” guide on docs.temporal.io.
- Martin Fowler, “Event Sourcing” (martinfowler.com/eaaDev/EventSourcing.html) — the canonical writeup of the underlying pattern.
- The Airflow documentation, especially the “Concepts” and “Executor” pages.
- The AWS Step Functions Developer Guide and the Amazon States Language specification.
- The Cadence GitHub repo and the Uber engineering blog post “Cadence: The Only Workflow Platform You’ll Ever Need” for the origin story.
- Maxime Beauchemin, “The Rise of the Data Engineer” — the essay by Airflow’s creator that framed the DAG-as-pipeline worldview.
Practice
- For a new “process a user’s uploaded video through a 6-step ML pipeline with retries” feature, which workflow engine would you pick and why? Write a one-page design doc.
- Explain why a workflow must be deterministic. Construct a workflow that violates determinism and describe the failure mode on replay.
- What is the workflow/activity split and why does it exist? For a workflow that calls two external APIs, identify exactly what goes in the workflow code and what goes in activities.
- Write an Airflow DAG and a Temporal workflow that both do the same thing: download a file, compute an embedding, store the result. Compare the lines of code, the failure behavior, and the dev loop.
- A team wants to use Airflow as the backing engine for a user-facing async API (“user submits a job, Airflow runs it, user polls for result”). Write three specific production problems they will hit.
- Step Functions charges per state transition. Compute the cost of a 50-state Standard workflow running 1M times per month at current AWS pricing. When does Express mode win on cost?
- Stretch: Implement a minimal durable-execution engine in ~200 lines. Event log in SQLite, one worker process, deterministic replay of a sequential activity chain. Prove the “kill the worker mid-run” property by killing it and restarting.