Part IX · Build, Deploy, Operate
Chapter 109 · ~20 min read

Multi-cluster, multi-region, multi-cell architecture

"A single cluster is a single blast radius. The only question is whether you have noticed yet"

Every serving platform starts as one cluster in one region. It works well until it doesn’t — a bad config rollout breaks the ingress controller and takes down every service in the cluster, or a regional cloud provider outage takes everything offline, or a customer demands data residency in Frankfurt and the platform has nowhere to run their traffic. The answer to all three problems is the same shape: stop running everything in one place. The shape is called cells, and understanding it is what separates architectures that survive their first real outage from ones that don’t.

The core idea is blast-radius reasoning. Any failure mode — a bad deploy, a misconfigured IAM policy, a corrupt etcd, a rogue pod exhausting node memory — has a radius: the set of users and requests affected when it happens. Single-cluster architectures have a radius equal to the entire user base. Multi-cell architectures deliberately partition the system so the radius is a fraction of the user base, ideally a small fraction. The cost is complexity, and the cost is real. But at a certain scale and a certain reliability bar, there is no other way to run.

This chapter is about the architecture, not the tools. The tools (Helm, Kustomize, GitOps, autoscaling) have been covered elsewhere. The question here is how those tools get arranged across multiple clusters, regions, and cells, and when each arrangement is appropriate.

Outline:

  1. Blast radius as the organizing principle.
  2. The failure taxonomy.
  3. Single-cluster, multi-cluster, multi-region, multi-cell.
  4. The cell pattern in detail.
  5. Traffic steering — DNS, anycast, global load balancers.
  6. Active-active vs active-passive.
  7. Data sovereignty and residency.
  8. State, replication, and consistency.
  9. When to go multi-region.
  10. Operating a multi-cell platform.
  11. The mental model.

109.1 Blast radius as the organizing principle

Reliability engineering has a single organizing question: when something breaks, how many users see it? The answer is the blast radius. A single-node failure should affect zero users if the system is horizontally scaled. A single-cluster failure should affect a bounded fraction of users if the system is multi-cluster. A single-region failure should affect a bounded fraction if the system is multi-region. Cascading the isolation all the way down is the only way to build something that can sustain an SLO of “three nines or better” under realistic failure rates.
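The arithmetic behind those availability bars is worth keeping at hand. A minimal sketch (the blast-radius weighting in budget_burn is the point of this chapter: an outage confined to one cell only burns the fraction of the global budget that the cell's users represent):

```python
# Annual downtime budget implied by an availability target, and the
# user-weighted cost of an outage with a bounded blast radius.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960

def downtime_budget_minutes(availability: float) -> float:
    """Minutes of allowed downtime per year at a given availability target."""
    return MINUTES_PER_YEAR * (1.0 - availability)

def budget_burn(outage_minutes: float, blast_radius: float) -> float:
    """User-weighted downtime charged against a global SLO: a 60-minute
    outage of a cell serving 20% of users burns 12 user-weighted minutes."""
    return outage_minutes * blast_radius

for a in (0.999, 0.9995, 0.9999, 0.99999):
    # 0.999 -> ~526 min/year; 0.99999 -> ~5 min/year
    print(f"{a}: {downtime_budget_minutes(a):.1f} min/year")
```

At 99.99%, a single unbounded hour-long incident blows the whole year's budget; the same hour confined to a 10% cell burns only six user-weighted minutes.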

The adversarial version of the question: what is the worst thing a single engineer, a single config push, a single process, or a single component can take down? Call this the single point of deployment failure. In a naïve setup, a broken Helm chart that fails to render on apply can take down every service that depends on that chart, across every environment, simultaneously. The blast radius is the whole fleet. In a properly partitioned setup, the blast radius is bounded by the cell boundaries — the bad chart breaks one cell, and the other cells continue serving.

Blast-radius reasoning generalizes beyond Kubernetes. It applies to database schemas (a bad migration should only touch one shard), to IAM changes (a bad policy should only affect one account), to feature flags (a bad flag should only hit a cohort), to CI/CD pipelines (a bad pipeline should only deploy to a subset). Everywhere the system has a failure mode, the architecture should bound the radius.

Cells are the cleanest implementation of this principle for serving platforms. A cell is a fully independent stack — its own cluster (or set of clusters), its own load balancer, its own databases, its own config, its own deployment pipeline — that serves a subset of users end-to-end. Cells do not share state across boundaries. A complete failure of one cell leaves all other cells healthy.

Single-cluster blast radius vs multi-cell blast radius: one failure in a single cluster affects all users; in a multi-cell design it affects only the users homed to that cell. [Figure: a single cluster puts all users in one 100% blast radius (one failure, everyone is down); in a five-cell design, a failed Cell A affects 20% of users while Cells B, C, and D stay healthy, so five cells cap the blast radius at 20% per incident.]
Cells reduce blast radius from 100% to 1/N — a complete cell failure leaves the rest of the fleet healthy, and progressive rollouts limit even that to "canary cell only" for bad deploys.

109.2 The failure taxonomy

To reason about cell boundaries, enumerate the failures that a cell is supposed to contain. A useful taxonomy:

Node-level failures. A single node crashes, its kubelet wedges, its disk fails. Containment: Kubernetes node drain + pod rescheduling. No cell work needed.

AZ-level failures. An entire availability zone in a cloud region goes down — power, networking, cooling. These happen a few times a year per region across major providers. Containment: multi-AZ clusters. Still no cell work needed, just correct PodDisruptionBudgets (PDBs) and topology spread constraints.

Cluster-level failures. Something breaks the control plane or a core component. Examples: a Kubernetes upgrade goes wrong, a custom controller writes bad state to etcd, an admission webhook starts rejecting all pods. These are the argument for multi-cluster. A single control plane is a single point of failure for every workload it runs.

Region-level failures. An entire cloud region has a major outage. Rare (maybe once per year for a major provider, maybe less) but catastrophic when it happens. Containment: multi-region.

Config-level failures. A bad Terraform plan or Helm chart is applied fleet-wide. Containment: progressive rollout across cells with halt-on-failure.

Data-level failures. A database corruption or a bad migration takes down dependent services. Containment: per-cell databases so a corruption only touches one cell’s data.

Dependency failures. A shared upstream (authentication service, billing, message broker) fails. Containment: per-cell replicas of critical dependencies, or at least per-cell failover paths.

Noisy-neighbor failures. One tenant’s workload consumes all of some shared resource (CPU, network bandwidth, IOPS, global rate limit). Containment: cells per tenant tier, or at least quota enforcement at cell boundaries.

Each row in this taxonomy pushes toward a different architectural response. Cells are the synthesis — they contain cluster-level, config-level, data-level, and noisy-neighbor failures all at once, at the cost of N-way duplication of everything that runs inside a cell.

109.3 Single-cluster, multi-cluster, multi-region, multi-cell

Four stages of maturity, in order.

Single-cluster. One cluster, usually multi-AZ. Everything runs here. Simple, cheap, fast to operate. Acceptable until you have either a reliability bar above ~99.9% uptime or a clear blast-radius concern. Most platforms start here and probably should.

Multi-cluster, single-region. Two or more clusters in the same region, behind one load balancer. The motivation is almost always blast-radius on cluster-level failures: upgrades, admission webhooks, CRDs, control plane incidents. A classic setup: a “blue” and a “green” cluster, with traffic split 50/50, and upgrades rolled to one at a time. If the upgrade breaks the blue cluster, green takes all traffic and the team debugs blue in peace.

Multi-region, single-cell-per-region. Two or more regions, each with one cluster. The motivation is regional-outage survival or latency for globally-distributed users. Traffic routing is now geographic — users in Europe hit the Frankfurt region, users in the US hit Virginia. Failover is region-to-region, which is much more expensive to operate than cluster-to-cluster (data replication, DNS TTLs, warm standbys).

Multi-cell. Multiple cells per region, multiple regions. Each cell is an independent stack. Traffic is routed by user identity (or tenant, or shard key) to a specific home cell. Cells do not talk to each other for serving traffic. This is the endgame for reliability-first platforms. It is also the most expensive to build and operate.

The leap from “multi-cluster in one region” to “multi-region” is usually the biggest jump. Data replication becomes a real problem (see §109.8). Operational complexity roughly doubles. Cost goes up because egress across regions is expensive. The motivation has to be concrete: either a regional-residency requirement, a latency SLA that no single region can meet, or a reliability bar that requires surviving a full region loss.

The leap from “multi-region” to “multi-cell” is smaller in principle but larger in practice. It requires a sharding key (user ID, tenant ID, geographic region) that can be resolved at the edge before traffic is routed. It requires per-cell observability so you can attribute incidents correctly. It requires a control plane that can deploy to N cells with different rollout stages. Most teams doing cells have a dedicated platform team whose main job is this.

109.4 The cell pattern in detail

A cell is an isolated stack. What “isolated” means, concretely:

Independent deploy surface. Each cell has its own deployment pipeline slot. A deploy to cell A cannot touch cell B. A bad chart is applied to one cell, the health checks fail, and the rollout halts — other cells are untouched.

Independent data. Each cell has its own databases, caches, and message queues. User data is sharded by cell. Cross-cell reads are either disallowed or handled by a deliberate, throttled replication pipeline for analytics.

Independent dependencies. Each cell runs its own copies of critical infrastructure: ingress controller, cert-manager, service mesh, autoscaler. Shared dependencies (cloud IAM, global DNS, the registry) are treated as external and have their own reliability story.

Independent observability. Each cell has a Prometheus, each cell has its own dashboards, each cell emits its own SLO burn alerts. A global view exists but is derived from per-cell views, not primary.

Independent identity and authorization. Different cells have different IAM credentials and different blast radii for permissions. A cell’s service account cannot touch another cell’s resources.

Sticky routing. A request for a given user always goes to the same cell (within a region) unless that cell is unavailable. This is essential because the user’s data lives in that cell. Routing is typically by user ID hash, tenant ID, or a lookup table.

A cell is conceptually one small company’s worth of infrastructure. If you bought the cell and moved it to a different cloud account, it would still work.

Cell sizing is a judgment call. Smaller cells mean smaller blast radius per incident but more cells to operate. Larger cells mean fewer cells to operate but more users affected per incident. A reasonable starting point is cells sized so that a full cell outage affects no more than 10-20% of users, which means at least 5-10 cells. At larger scale the acceptable fraction shrinks and per-cell capacity becomes the binding constraint: at 10 million users, that's more like 50-100 cells.
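The sizing arithmetic can be made explicit. A sketch, assuming equal-sized cells and an operational cap on users per cell (both parameters here are illustrative, not recommendations):

```python
import math

def cell_count(users: int, max_blast_fraction: float, max_users_per_cell: int) -> int:
    """Smallest number of equal-sized cells that both caps the blast radius
    of a full cell outage and respects a per-cell capacity limit."""
    by_blast = math.ceil(1 / max_blast_fraction)
    by_capacity = math.ceil(users / max_users_per_cell)
    return max(by_blast, by_capacity)

cell_count(100_000, 0.20, 50_000)      # 5: the blast-radius cap dominates
cell_count(10_000_000, 0.20, 200_000)  # 50: per-cell capacity dominates
```

At small scale the blast-radius cap sets the cell count; at large scale the capacity cap takes over, which is why the cell count keeps growing with the user base.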

The biggest mistake in cell design is letting cells accrete dependencies on each other. “Just this one RPC across cells” becomes “the cells share a database” over time, and suddenly the cell abstraction leaks and the blast radius is the whole fleet again. Architectural discipline is the only defense — explicit cell boundaries in the system diagram, explicit rules about what a cell can and cannot depend on, and code review that enforces it.

109.5 Traffic steering

Given N cells, how does a request find its home cell? Three main approaches.

DNS-based routing. The edge resolves a hostname to a cell-specific IP. user-123.api.example.com resolves to the load balancer of the user’s home cell. Simple, works with any cell backend. Downside: DNS TTLs are long (even 60s TTLs take minutes to fully propagate), so failover is slow, and DNS-based sharding requires the client to know which hostname to query. Works well for API clients that can look up the right hostname once per session.

Anycast + edge routing. All cells advertise the same IP via BGP anycast. Traffic lands at the nearest edge PoP, where an edge load balancer (Envoy, HAProxy, a cloud LB) looks up the user’s home cell and forwards. This is how the big CDNs and global LBs (Cloudflare, Google Cloud LB, AWS Global Accelerator) work. The edge layer itself is a shared dependency, but it’s typically run by a provider with a very high reliability bar, and its radius is different from application failures.

Global load balancer + backend lookup. A managed global LB (GCLB, AWS Global Accelerator, Cloudflare Load Balancing) with health checks against each cell’s regional LB. Traffic is steered by region (latency-based or geo-based) and by health. The LB doesn’t do per-user sharding on its own; that requires explicit configuration via headers or cookies.

For cells to work, you need per-user stickiness, not just per-region routing. The common arrangement:

sequenceDiagram
  participant Client
  participant GlobalLB as Global LB (anycast)
  participant EdgeProxy as Edge proxy (region)
  participant Lookup as Home-cell lookup
  participant Cell as User's home cell

  Client->>GlobalLB: request
  GlobalLB->>EdgeProxy: route to nearest region
  EdgeProxy->>Lookup: which cell for user_id=X?
  Lookup-->>EdgeProxy: cell-03 (cached in JWT)
  EdgeProxy->>Cell: forward request
  Cell-->>Client: response

Traffic steering requires a fast per-user home-cell lookup on the critical path — most implementations cache the result in a signed JWT or edge cookie to avoid the lookup on every request.

  1. Global load balancer with anycast IP and latency-based routing. This gets requests to the closest region.
  2. Edge proxy at the region boundary that reads an identity header (or a cookie set on first request) and forwards to the user’s home cell within that region.
  3. Home-cell lookup service that answers “which cell serves user X?” — this is often a tiny, heavily-cached lookup API backed by a Cassandra or DynamoDB table.
  4. Fallback logic for the case where a user’s home cell is unhealthy — either fail the request (if data is tied to the cell) or serve a degraded response (if the data can be reconstructed).

The home-cell lookup is the critical path. It must be fast (p99 < 5ms), it must be highly available, and it must be correct — routing a user to the wrong cell means their data isn’t there. Many platforms cache the mapping in edge-side cookies or JWTs so the lookup only happens once per session.
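A sketch of the lookup's shape, with hypothetical cell names. Real implementations back the explicit table with a replicated store and cache the answer in a signed token; the hash fallback here just keeps routing deterministic for users with no explicit entry:

```python
import hashlib

CELLS = ["cell-01", "cell-02", "cell-03", "cell-04", "cell-05"]

# Explicit assignments win: users pinned for residency, migrated tenants, etc.
# In production this is the heavily-cached lookup table, not a process-local dict.
ASSIGNMENTS: dict[str, str] = {}

def home_cell(user_id: str) -> str:
    """Resolve a user's home cell: explicit assignment first, stable hash otherwise."""
    assigned = ASSIGNMENTS.get(user_id)
    if assigned is not None:
        return assigned
    # SHA-256 is stable across processes and hosts, unlike Python's built-in hash().
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return CELLS[int.from_bytes(digest[:8], "big") % len(CELLS)]
```

Note the weakness of the modulo mapping: adding a cell remaps most users. Platforms that migrate users between cells rely on the explicit table (or consistent hashing) for exactly this reason.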

109.6 Active-active vs active-passive

Within a multi-region deployment, the replication model matters.

Active-active. All regions serve traffic simultaneously. User data is either sharded (users belong to one region) or replicated (users can be served from any region with eventual consistency). Active-active is more expensive (you pay for full capacity in every region) but has zero failover time — when a region goes down, traffic shifts and the remaining regions continue serving.

Active-passive. One region serves all traffic. A standby region is running but idle, receiving data replication, and ready to take over on failover. Cheaper (the passive region can run at reduced capacity) but failover has real latency — you have to detect the failure, flip DNS or the LB, let traffic drain, warm up caches, and bring the passive region up to serving capacity. Realistic active-passive failover is minutes to tens of minutes, not seconds.

Pilot-light. A weaker form of active-passive where the standby region has minimal running infrastructure (enough to validate data replication, not enough to serve traffic). On failover, infrastructure scales up. Failover time is even longer, in the tens of minutes to hours. Appropriate for disaster recovery from rare catastrophic events, not for “normal” region outages.

Most modern serving platforms go active-active because the failover stories for active-passive are painful and the capacity overhead is manageable at scale. The cost delta between “two regions at 50% each” and “one region at 100% plus a standby” isn’t as large as it looks — you need headroom for spikes in both models.
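The capacity comparison can be made concrete. A rough sketch, assuming every configuration must survive the loss of one full region at peak load, and ignoring spike headroom (which both models need on top):

```python
def active_active_capacity(peak: float, regions: int) -> float:
    """Total provisioned capacity so the surviving regions absorb a full
    region loss: each region must carry peak / (regions - 1)."""
    return peak * regions / (regions - 1)

def active_passive_capacity(peak: float, standby_fraction: float) -> float:
    """One region at full peak plus a standby at reduced capacity."""
    return peak * (1 + standby_fraction)

active_active_capacity(100, 2)     # 200.0: two regions, each sized for all traffic
active_active_capacity(100, 3)     # 150.0: the overhead drops as regions are added
active_passive_capacity(100, 0.5)  # 150.0: a full active region plus a half-size standby
```

At two regions, active-active that survives a region loss costs more than active-passive with a half-size standby; at three or more regions the overhead converges, which is part of why the cost delta is smaller than it first looks.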

The decision hinges on data. If your data model is naturally shardable (per-user, per-tenant, per-geography), active-active is straightforward. If your data model is highly connected and requires strong consistency across regions (an ad auction, a global rate limiter, a shared inventory), active-active is painful and active-passive might be the honest answer.

109.7 Data sovereignty and residency

Regulation is the other driver of multi-region architecture. GDPR, Schrems II, China’s PIPL, India’s DPDP, and a growing list of similar laws impose data residency requirements: certain categories of user data must be stored, processed, and served from within specific geographic boundaries. You cannot route an EU user’s request to a US region, even briefly, if the law says their data has to stay in the EU.

Residency interacts with cells cleanly: cells have geographic homes. A user’s cell is determined (or at least bounded) by their residency requirement. An EU user’s cell lives in an EU region; a US user’s cell lives in a US region. Cross-region traffic is then explicitly prohibited by the routing layer, not just discouraged.

Residency interacts with active-active poorly when the replication model would cross boundaries. You cannot blindly replicate user data to every region for availability — the law says you can’t. So active-active plus residency usually means per-region sharding, not per-region replication, with the downside that a region outage for an EU user cannot fail over to a US region even if the US region is healthy.

The practical upshot: if you have residency requirements, design cells around them from day one. Retrofitting residency into an existing active-active replicated system is a migration project that can take a year.

A related concern is control plane residency. Even if the data is in-region, a control plane (Kubernetes API, CI/CD, monitoring) that lives in another region can violate the spirit (and sometimes letter) of residency law. Fully-isolated in-region control planes are sometimes required, which means every cell has its own ArgoCD, its own Prometheus, its own deploy pipeline, and its own key material.

109.8 State, replication, and consistency

The hardest part of multi-region is data. A stateless service is easy to replicate — just run copies in every region. A stateful service requires explicit choices about where data lives and how consistent it is across replicas.

Per-cell data. Each cell owns its data. No replication across cells for serving. Cross-cell analytics runs on a separate warehouse fed by a replication pipeline. This is the simplest model and the one cells are designed around.

Global data with per-region read replicas. A primary database in one region, async read replicas in others. Writes go cross-region (high latency, weak consistency). Reads are local (fast). Works for read-heavy workloads where strong consistency on writes is acceptable. Failover on primary loss is painful.

Multi-master with conflict resolution. Databases that accept writes in every region and resolve conflicts (CRDTs, last-write-wins, vector clocks). Examples: Cassandra, DynamoDB global tables, CockroachDB, Yugabyte. These work for workloads that tolerate eventual consistency and can define conflict resolution semantics. They do not work for workloads that need strong transactional guarantees across regions.

Synchronous multi-region. A consensus protocol (Paxos, Raft) stretched across regions. Spanner, CockroachDB, FoundationDB. You get strong consistency at the cost of write latency (every write waits for cross-region round-trips). For most workloads, the latency cost is unacceptable — a p99 write of 100ms+ is a hard pill to swallow.

Serving platforms typically use a mix. Per-cell sharded data for the main user data. Global data for small, slow-changing configuration (feature flags, tenant metadata). Eventually-consistent replication for anything else.

The key question for any piece of state: if this region disappears right now, what’s the acceptable outcome? “Some writes are lost” is often acceptable for analytics and audit logs. “The user sees a stale version of their profile for 30 seconds” is often acceptable. “The user’s financial transaction is lost” is almost never acceptable. The replication model follows from these answers, not from a uniform policy.

109.9 When to go multi-region

Not every platform should go multi-region. It’s expensive, complex, and slow to operate. The honest triggers:

  1. Hard SLA requirement. A customer contract says 99.99%, and the math on a single region cannot get you there given realistic cloud-region outage rates.
  2. Data residency. Regulation requires in-region processing.
  3. Latency SLA for global users. p99 request latency from Singapore to us-east-1 is 250ms just on the network. If your target is 100ms, you need a presence in Asia.
  4. Disaster recovery posture. The business literally cannot tolerate a multi-day region outage, and a DR story involving another region is cheaper than the alternative.
  5. Scale beyond a single region’s capacity. Rare but real — some platforms outgrow what a single region can provide.

If none of these apply, don’t go multi-region yet. Run a well-architected multi-cluster single-region setup with multi-AZ and strong backups. It will be significantly simpler to operate and will serve you fine until one of the triggers above fires.

The reverse trap is also real. Teams go multi-region because it sounds like the mature thing to do, then spend two years building the replication pipeline, and then have a region outage and discover their failover doesn’t work because they never tested it under load. Untested DR is worse than no DR — it’s a false sense of security. If you go multi-region, you commit to regular failover drills. If you won’t do the drills, don’t take on the complexity.

109.10 Operating a multi-cell platform

A few patterns that are non-negotiable at scale.

Progressive rollouts across cells. Never deploy to every cell at once. Roll to one cell (the “canary cell”), watch for 10-30 minutes, roll to a few more, watch again, then roll to the rest. A bad change is caught in the canary and never reaches the fleet. This is a pipeline feature, not a prayer — your CD system should enforce the cell-by-cell rollout automatically.
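A sketch of the enforcement logic a CD system would run, with hypothetical deploy and health-check hooks standing in for the real primitives (the soak timer is elided):

```python
# Waves go canary-first; a failed health check halts the rollout and
# reports exactly which cells received the new version.
ROLLOUT_WAVES = [
    ["cell-canary"],
    ["cell-01", "cell-02"],
    ["cell-03", "cell-04", "cell-05"],
]

def progressive_rollout(version, deploy, healthy_after_soak):
    deployed = []
    for wave in ROLLOUT_WAVES:
        for cell in wave:
            deploy(cell, version)
            deployed.append(cell)
        # Soak the whole wave before touching the next one.
        for cell in wave:
            if not healthy_after_soak(cell):
                return {"status": "halted", "failed_cell": cell, "deployed": deployed}
    return {"status": "complete", "deployed": deployed}
```

The returned inventory of touched cells matters as much as the halt itself: the incident tooling needs to know which cells are running the suspect version.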

Per-cell health and SLO burn. Each cell has its own SLO dashboard and error-budget alerts. A cell-level incident is visible immediately as “cell X is burning its budget” without being drowned out by healthy cells in the global view.

Automated traffic shedding. When a cell is unhealthy, the routing layer should automatically drain traffic from it. Manual intervention at 3 AM is too slow. The automation needs careful thresholds to avoid false positives cascading into overloaded neighbors.
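The thresholding deserves care. A minimal sketch of hysteresis, with illustrative numbers: drain a cell when its error rate crosses one bar, but restore it only once the rate falls well below a lower bar, so a cell hovering at the boundary does not flap in and out of rotation:

```python
DRAIN_ABOVE = 0.05    # drain a serving cell past 5% errors (illustrative)
RESTORE_BELOW = 0.01  # restore a drained cell only under 1% errors

def should_serve(currently_serving: bool, error_rate: float) -> bool:
    """Hysteresis: the band between the two thresholds preserves the current state."""
    if currently_serving and error_rate > DRAIN_ABOVE:
        return False
    if not currently_serving and error_rate < RESTORE_BELOW:
        return True
    return currently_serving
```

A real implementation also caps how many cells may be drained at once, so a correlated false signal (a broken metrics pipeline, say) cannot shed the entire fleet onto its neighbors.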

Tested DR plans. Run a failover drill at least quarterly. Actually fail over. Do it during business hours. Find the broken pieces while people are watching. The first time you fail over should not be during a real outage.

Cell inventory as first-class data. The list of cells, their regions, their capacities, their tenants, their versions — all in a database or config-as-code that the tooling reads. Not in a wiki. Not in someone’s head. The rollout pipeline, the monitoring, the incident tooling all read from this inventory.
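What "inventory as first-class data" looks like in miniature, with hypothetical fields; the point is that rollout waves, dashboards, and drills all derive from one machine-readable record rather than tribal knowledge:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    name: str           # stable, human-readable: "eu-west-1-cell-03"
    region: str
    capacity_users: int
    rollout_stage: int  # 0 = canary; higher stages deploy later

INVENTORY = [
    Cell("eu-west-1-cell-01", "eu-west-1", 200_000, 0),
    Cell("eu-west-1-cell-02", "eu-west-1", 200_000, 1),
    Cell("us-east-1-cell-01", "us-east-1", 200_000, 1),
]

def rollout_order(inventory):
    """Deployment order the CD pipeline reads: canary stage first."""
    return sorted(inventory, key=lambda c: c.rollout_stage)
```

The same record feeds monitoring (every alert label comes from a cell name in the inventory) and capacity planning (summing capacity_users per region).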

Clear naming and labeling. Cells get stable, human-readable names (eu-west-1-cell-03, not prod-cluster-4). Every metric, log line, and alert is labeled with the cell name. Operators should always know which cell they’re looking at.

The operational cost of a multi-cell platform is real. Budget for a platform team whose job is keeping the cell abstraction clean and keeping the rollout pipeline trustworthy. A cell abstraction that leaks is worse than no cells.

109.11 The mental model

Eight points to take into Chapter 110:

  1. Blast radius is the organizing principle. Any failure has a radius; the architecture bounds it.
  2. Cells are the cleanest implementation: independent stacks serving a subset of users end-to-end with no shared state across boundaries.
  3. Four stages: single-cluster → multi-cluster → multi-region → multi-cell. Each stage is motivated by a specific class of failure.
  4. Traffic steering needs per-user stickiness, not just per-region routing. A fast home-cell lookup is on the critical path.
  5. Active-active is the modern default if data sharding allows it. Active-passive has minutes-to-tens-of-minutes failover time.
  6. Residency requirements force cells to have geographic homes and prohibit cross-region replication.
  7. State replication is the hardest piece. Pick per-cell data where possible; use global data only for small, slow-changing things.
  8. Untested DR is worse than no DR. Run failover drills on a schedule or don’t claim the capability.

In Chapter 110, everything below the cell boundary — the cloud infrastructure itself — gets defined in code. Terraform, Pulumi, CDK.


Read it yourself

  • Site Reliability Engineering (Beyer, Jones, Petoff, Murphy, eds., Google/O’Reilly, 2016), especially the chapters on load balancing and on managing critical state.
  • The AWS Well-Architected Framework’s Reliability Pillar whitepaper.
  • Colm MacCárthaigh’s talk Amazon’s Approach to High-Availability Deployment (AWS re:Invent, various years).
  • Werner Vogels, Cell-based architecture (AWS Builders’ Library article).
  • Release It! (Nygard, 2nd ed., Pragmatic Bookshelf, 2018), on stability patterns and blast-radius containment.
  • The CockroachDB and Google Spanner papers for the consensus-across-regions story.

Practice

  1. List the failure taxonomy from §109.2 and, for each row, identify which stage of maturity (single-cluster → multi-cell) is the minimum required containment.
  2. Sketch a cell-based architecture for a platform with 1M users, 99.95% uptime SLA, and a GDPR requirement. How many cells, where, with what replication model?
  3. Explain why DNS-based routing alone is insufficient for per-user stickiness in a multi-cell platform.
  4. Compute the annual expected downtime for 99.9%, 99.95%, 99.99%, 99.999%. Which of these realistically requires multi-region?
  5. Write a PromQL query that alerts when a specific cell is burning its error budget at >2× the rate of other cells.
  6. Argue the case against going multi-region for a startup with 10K users and a 99.9% SLA.
  7. Stretch: Design the home-cell-lookup service. Latency budget p99 < 5ms, read QPS 100K, correctness on writes. Sketch the data model, the caching strategy, the failover story, and the consistency guarantees.