Canary patterns: traffic-shifted, statistical, ML model canaries
"A canary is a guess you haven't had to commit to yet"
A canary is a gradual rollout. You deploy a new version to a small slice of traffic, watch it, and either continue the rollout if it looks healthy or roll back if it doesn’t. The principle is old — miners carried canaries into coal mines for exactly this reason. The engineering discipline is making the “watch it” part rigorous enough that you catch real regressions before they hit all users.
This chapter covers the full taxonomy of canary patterns: traffic-shifted canaries (route a fraction of requests to the new version), statistical canaries (Kayenta-style automated pass/fail), shadow traffic (mirror without affecting users), ML model canaries (which are different animals because of stochastic outputs), holdout sets, and scheduled synthetic traffic. Chapter 97 established the error budget as a resource you protect; this chapter is how you protect it during the most dangerous moment in a service’s life: deployment.
Outline:
- Why canaries exist.
- Traffic-shifted canaries.
- Statistical canaries: the Kayenta pattern.
- Shadow traffic canaries.
- Dark launches and feature flags.
- ML model canaries are different.
- Holdout sets.
- Scheduled synthetic canaries.
- Rollback automation.
- Production patterns.
- The mental model.
98.1 Why canaries exist
The baseline alternative is “deploy everywhere at once and see what happens.” This works for small, low-traffic, low-risk changes. It stops working the moment a bug in the new version affects more users than you can apologize to. The math:
- All-or-nothing deployment: every bug affects 100% of users for the time it takes to detect and roll back.
- 5% canary: a bug affects 5% of users for the detection window, then 0% after rollback.
A bug that takes 10 minutes to detect and rolls back immediately affects:
- Without canary: 100% × 10 min = 10 minutes of full-scale failure.
- With 5% canary: 5% × 10 min = 30 seconds of full-scale-equivalent impact.
That’s a 20× reduction in user pain. And the canary pays for itself on the very first real regression it catches.
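The arithmetic generalizes: user impact scales with the traffic fraction times the detection window. A toy helper (illustrative only, not tied to any tool) makes the comparison explicit:

```python
def full_scale_equivalent_minutes(traffic_fraction: float,
                                  detection_minutes: float) -> float:
    """User impact, normalized to minutes of full-fleet outage."""
    return traffic_fraction * detection_minutes

# All-or-nothing deploy: the bug hits everyone for the whole window.
assert full_scale_equivalent_minutes(1.00, 10) == 10.0
# 5% canary: same bug, same window, 1/20th of the pain (30 seconds).
assert full_scale_equivalent_minutes(0.05, 10) == 0.5
```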
Beyond the math, canaries are how you make rollouts reversible. A full deploy is a one-way door unless you have a rollback path; a canary leaves the old version running, so rolling back is instant. The old version is the backstop, always.
The question is how to do a canary — specifically, how to decide whether the canary is healthy without a human staring at dashboards for the full rollout window. The patterns below are different answers to that question.
98.2 Traffic-shifted canaries
The most common pattern. A load balancer or service mesh is configured to route a small fraction of traffic to the new version:
phase 0: 100% -> v1 (old)
phase 1: 95% -> v1, 5% -> v2 (canary)
phase 2: 80% -> v1, 20% -> v2
phase 3: 50% -> v1, 50% -> v2
phase 4: 0% -> v1, 100% -> v2
At each phase, you wait some “bake time” and verify that the canary is healthy. If yes, advance. If no, roll back (shift all traffic back to v1).
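Per-request weighted routing can be sketched in a few lines (a toy in-process router, not any particular mesh's API):

```python
import random

# (v1 weight, v2 weight) per phase, mirroring the schedule above.
PHASES = [(100, 0), (95, 5), (80, 20), (50, 50), (0, 100)]

def route(phase: int, rng: random.Random) -> str:
    """Pick a backend for one request under the given rollout phase."""
    w_v1, _ = PHASES[phase]
    return "v1" if rng.randrange(100) < w_v1 else "v2"

# Phase 1 sends roughly 5% of requests to the canary.
rng = random.Random(42)
hits = sum(route(1, rng) == "v2" for _ in range(10_000))
assert 400 < hits < 600
```

A real mesh makes the same decision per request at the proxy layer, which is why users can bounce between versions unless routing is sticky.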
The traffic split is usually implemented in:
- Kubernetes Ingress / Istio / Linkerd: header-based or weighted routing at the service mesh layer.
- Argo Rollouts / Flagger: CRD-driven rollout controllers that automate the phased shift.
- LB config: for non-K8s deployments, Envoy/HAProxy config or cloud LB rules.
The key parameters:
- Traffic percentages per phase — 5%, 20%, 50%, 100% is a common pattern. More conservative: 1%, 5%, 25%, 50%, 100%.
- Bake time per phase — 5 to 60 minutes, depending on how fast signals are expected to materialize. For latency regressions, 5 minutes might be enough. For rare correctness bugs, hours.
- Gating metric — what defines “healthy.” Usually: error rate, p99 latency, saturation.
- Rollback trigger — automatic (based on metric thresholds) or manual.
- Traffic routing strategy — random per-request, sticky per-user (hash on user ID), or header-based (only certain clients get the canary).
Sticky routing matters for stateful workloads. If a user’s first request goes to v1 and the second goes to v2, they may get inconsistent behavior or even errors if the two versions have different session formats. Hash-based sticky routing (route by hash(user_id) % 100 < canary_pct) keeps each user on one version for the duration of the canary.
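A sketch of that hash-based assignment (stdlib only; Python's built-in `hash()` is salted per process, so a deterministic digest is used instead):

```python
import hashlib

def sticky_bucket(user_id: str) -> int:
    """Stable bucket in [0, 100) per user, identical across processes."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100

def assign(user_id: str, canary_pct: int) -> str:
    return "v2" if sticky_bucket(user_id) < canary_pct else "v1"

# A user's version is stable for a given split, and widening the split
# never moves a canary user back to v1.
assert assign("user-123", 5) == assign("user-123", 5)
assert all(assign(u, 20) == "v2"
           for u in ("a", "b", "c") if assign(u, 5) == "v2")
```

The monotonicity property is the quiet benefit of bucketing: ramping from 5% to 20% only adds users to the canary, it never flips existing canary users back.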
The trap with traffic-shifted canaries is small-sample noise. At 5% of total traffic, your canary metrics have 1/20th the data of the full fleet’s metrics, so the confidence intervals are wide. A small latency bump in the canary at 5% might be real (regression) or just noise (small sample). This is the problem statistical canaries solve.
98.3 Statistical canaries: the Kayenta pattern
Kayenta (Netflix’s open-source canary analysis tool, 2017) formalized statistical canary analysis. Instead of “is the canary metric above threshold X,” the question becomes “is the canary metric distribution significantly different from the baseline metric distribution?”
The key idea: compare baseline vs canary side by side on a set of metrics, using statistical tests, and return a pass/fail score based on the confidence that they differ.
The baseline isn’t the existing production pool running the old version. Instead, both versions run side by side on equal-sized slices of traffic (so the sample sizes match), and you compare them:
baseline pool: 5% of traffic → v1 (same version as production)
canary pool: 5% of traffic → v2 (new version)
stable pool: 90% of traffic → v1 (main production)
Why run a second v1 pool? Because you want to compare v2 to a v1 running under identical conditions — same traffic volume, same hardware, same neighbors. If you compare canary (v2) to stable (v1 at 90%), sample sizes differ and load-induced differences creep in.
graph TD
LB[Load balancer] -->|90%| S[Stable v1 pool]
LB -->|5%| B[Baseline v1 pool]
LB -->|5%| C[Canary v2 pool]
B -->|metrics| K{Kayenta<br/>statistical test}
C -->|metrics| K
K -->|pass| P[Promote canary]
K -->|fail| R[Rollback canary]
style C fill:var(--fig-accent-soft),stroke:var(--fig-accent)
style K fill:var(--fig-surface),stroke:var(--fig-border)
For each metric in the canary config (error rate, p99 latency, CPU, memory, …), Kayenta runs a nonparametric statistical test (Mann-Whitney U, usually) on the baseline and canary distributions and returns a per-metric pass/fail. Because the two pools receive equal traffic, the comparison is free of the load-induced bias that would appear if v2 were compared directly to the larger stable pool. The overall canary verdict is “pass” if enough metrics pass, “fail” if not.
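The per-metric test can be sketched with a stdlib-only Mann-Whitney U using the normal approximation (a teaching toy; Kayenta and scipy handle small samples and tie corrections more carefully):

```python
import math

def ranks(values):
    """Average ranks (1-based) for a pooled sample, tie-aware."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def canary_metric_passes(baseline, canary, alpha=0.01):
    """Pass if the two distributions are not significantly different."""
    n1, n2 = len(baseline), len(canary)
    r = ranks(list(baseline) + list(canary))
    u1 = sum(r[:n1]) - n1 * (n1 + 1) / 2       # Mann-Whitney U statistic
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))       # two-sided, normal approx
    return p >= alpha

same = [float(x) for x in range(100)]          # e.g. p99 latency samples
slow = [x + 50 for x in same]                  # canary is uniformly slower
assert canary_metric_passes(same, same)        # identical -> pass
assert not canary_metric_passes(same, slow)    # shifted -> fail
```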
The metrics are grouped with weights:
metrics:
  - name: errors_per_second
    group: errors
    weight: high
  - name: p99_latency
    group: latency
    weight: high
  - name: cpu_utilization
    group: resource
    weight: medium
Common regressions show up as “one metric in one group crossing its threshold” rather than “all metrics across all groups crossing.” The weighted-groups scheme lets you be more sensitive to latency regressions than to CPU regressions, say.
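One way to fold weighted per-metric results into an overall verdict (hypothetical weights and pass bar, not Kayenta's actual scoring formula):

```python
# Hypothetical weight values; real tools define their own scoring.
WEIGHTS = {"high": 3, "medium": 2, "low": 1}

def canary_verdict(results, pass_fraction=0.95):
    """results: list of (group, weight, passed) tuples, one per metric.
    Pass only if the weighted share of passing metrics clears the bar."""
    total = sum(WEIGHTS[w] for _, w, _ in results)
    passed = sum(WEIGHTS[w] for _, w, ok in results if ok)
    return "pass" if passed / total >= pass_fraction else "fail"

checks = [
    ("errors", "high", True),
    ("latency", "high", False),    # p99 regressed
    ("resource", "medium", True),
]
# A failed high-weight metric sinks the weighted score below the bar.
assert canary_verdict(checks) == "fail"
```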
The Kayenta approach scales to more metrics and more sensitivity than threshold-based canaries. The cost: configuration complexity. You have to define the metrics, the thresholds, the groups, and the decision rules. Teams often start with simple threshold canaries and graduate to statistical canaries only when the threshold approach is too noisy.
Modern alternatives: Argo Rollouts with Prometheus-based analysis templates, Flagger with its built-in canary analysis, Spinnaker’s Kayenta integration. Functionally similar, differing mostly in UI and integration story.
98.4 Shadow traffic canaries
The next level of safety. Shadow traffic mirrors production requests to the new version without affecting users. The user’s response comes from the old version; the new version processes the same request in parallel and its output is recorded but discarded.
request -> LB -> v1 -> response to user
           |
           +-- (mirror) -> v2 -> discarded output (logged for analysis)
Shadow traffic is the safest form of canary because users never see the new version’s output. A bug in v2 can’t cause a user-visible failure. You compare v2’s behavior to v1’s offline and only promote v2 if the comparison is clean.
The cost:
- Extra capacity. Every mirrored request runs twice, once on each version. At 5% shadow, costs go up 5%; at 100% shadow, costs double.
- Side effects are hard. A request that writes to a database can’t be mirrored naively — you’d double-write. Shadow traffic only works cleanly for read-only or side-effect-free requests. For write paths, you need request mocking or a separate shadow database.
- Comparison semantics. For a deterministic service, “same response” is a clean check. For a stochastic service (an LLM), “same response” is ill-defined. See §98.6 for the ML case.
Shadow traffic shines for:
- Read-heavy services — search, recommendation, read APIs, inference. Easy to mirror.
- High-risk changes — a new indexing algorithm, a new model architecture. You want to see it under real load before committing.
- Load testing in production — a shadow pool lets you test new scaling without affecting users.
Common implementation: Envoy’s request_mirror_policies, Istio’s VirtualService with mirror, or a middleware layer that fans out the request. The mirror traffic is usually tagged so it doesn’t pollute the primary version’s metrics.
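In-process, the fan-out looks roughly like this (a toy sketch; a real deployment mirrors at the proxy layer rather than in application code, and would reuse a thread pool instead of creating one per request):

```python
import concurrent.futures

def shadow_handler(request, primary, shadow, log):
    """Serve the user from primary; run shadow on the same request in
    parallel and record its output for offline comparison."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(shadow, request)   # fire the mirror copy
        response = primary(request)             # user waits only on v1
        try:
            log.append(("shadow", future.result(timeout=1.0)))
        except Exception as exc:                # shadow bugs never surface
            log.append(("shadow_error", repr(exc)))
    return response

log = []
resp = shadow_handler(
    {"q": "hello"},
    primary=lambda r: "v1:" + r["q"],
    shadow=lambda r: 1 / 0,       # v2 crashes on this input
    log=log,
)
assert resp == "v1:hello"         # the user still gets v1's answer
assert log[0][0] == "shadow_error"
```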
98.5 Dark launches and feature flags
A dark launch is a code path that’s deployed but not reachable by users. The code runs in production (e.g., as a background evaluation on real requests) but its output isn’t returned. The goal is to verify the new code doesn’t crash under real load before exposing it.
A feature flag is a runtime switch that enables or disables a code path per user, per tenant, or globally. Feature flags are not strictly canaries, but they enable canary-like rollout patterns: flip the flag on for 5% of users, watch metrics, expand if healthy.
The distinction between a traffic-shifted canary and a feature-flag rollout:
- Traffic-shifted canary: two versions of the binary are running. Routing decides which one handles the request.
- Feature-flag rollout: one version of the binary is running. A flag inside the binary decides which code path executes.
The feature-flag approach is more flexible (you can roll out multiple features independently in the same binary) but couples the rollout state to the running code. A buggy feature flag implementation can itself be a failure mode. Most mature teams use both: feature flags for application-level feature rollouts, traffic-shifted canaries for binary rollouts.
The feature-flag platforms (LaunchDarkly, Unleash, Flagsmith, OpenFeature standard) give you per-user, per-segment, per-rollout-percentage control with a consistent SDK across languages. For ML-system-sized teams, having a flag system is the difference between shipping safely and shipping scared.
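The core of a percentage rollout is per-user, per-flag bucketing; salting the hash with the flag name keeps rollouts independent across flags (toy flag store with hypothetical names; real platforms layer segments and targeting rules on top):

```python
import hashlib

FLAGS = {"new-ranker": 5}   # flag name -> rollout percentage (toy store)

def flag_enabled(flag: str, user_id: str) -> bool:
    """Deterministic per-user bucketing, independent per flag because
    the flag name is part of the hashed key."""
    key = f"{flag}:{user_id}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 100
    return bucket < FLAGS.get(flag, 0)

# The decision is stable per user, and unknown flags default to off.
assert flag_enabled("new-ranker", "u1") == flag_enabled("new-ranker", "u1")
assert flag_enabled("missing-flag", "u1") is False
```

Because each flag has its own hash salt, the 5% of users who see `new-ranker` are not the same 5% who would see some other flag at 5%, so experiments don't systematically overlap.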
98.6 ML model canaries are different
An LLM canary is not a traditional canary because the new model doesn’t produce the same output as the old model. You can’t just compare v1 output to v2 output and check equality — the two will differ on almost every request even when both are “correct.”
The consequences:
- Shadow traffic doesn’t give you a clean signal. “v2’s output differs from v1’s” is a tautology, not a regression signal.
- Latency and resource metrics still work. You can still compare TTFT, TPOT, error rate, GPU memory between v1 and v2.
- Quality needs a different evaluation. You need a scoring mechanism — human rating, LLM-as-judge, task-specific automated evaluation, downstream metric (click-through rate, user retention).
The ML canary pattern that works in practice:
(1) Offline eval gate first. Before any production canary, v2 must pass an offline eval suite on known tasks. This catches most regressions before real traffic sees the model.
(2) Shadow traffic with quality scoring. Run v2 in shadow at low volume, score both v1 and v2 outputs with a judge (either an LLM judge or a task-specific metric), compare distributions. Any significant drop in v2’s score distribution is a red flag.
(3) Small live traffic slice with online metrics. Route 1-5% of real traffic to v2. Track the observable proxy metrics: thumbs-down rate, retry rate, session abandonment, response length. Movement in the wrong direction on any of these is a regression even if the offline eval looked fine.
(4) Gradual ramp-up with continuous monitoring. Ramp from 5% to 100% over hours to days, not minutes. The longer the ramp, the more confidence you have that slow-burning regressions (user engagement, retention) have time to show up.
(5) Fast rollback path. The router must be able to swap back to v1 instantly. Pin the v1 deployment alive for the full ramp window.
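Step (2) can be sketched as a comparison of judge-score distributions on the same shadow prompts (toy scores and threshold; a real gate would use an LLM judge or task metric and would test the full distribution, not just the mean):

```python
from statistics import mean

def shadow_quality_gate(v1_scores, v2_scores, max_drop=0.02):
    """Fail the canary if v2's mean judge score drops more than
    max_drop below v1's on the same shadow prompts (toy rule)."""
    drop = mean(v1_scores) - mean(v2_scores)
    return {"v1": mean(v1_scores), "v2": mean(v2_scores),
            "verdict": "pass" if drop <= max_drop else "fail"}

# Judge scores in [0, 1] for the same five shadow prompts.
result = shadow_quality_gate([0.82, 0.90, 0.78, 0.85, 0.88],
                             [0.70, 0.75, 0.68, 0.72, 0.74])
assert result["verdict"] == "fail"   # a ~13-point drop is a red flag
```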
The hardest part of ML canaries is quality regression that passes the proxy metrics. Users are often tolerant of small quality dips, so thumbs-down rates don’t move until the regression is significant. By then, you’ve shipped the regression to half your users. Offline eval suites are your friend here — catch as much as possible before production sees it.
One more wrinkle: A/B testing is not the same as a canary. An A/B test is a controlled experiment to measure the effect of a change, usually running for days to weeks to collect statistical power. A canary is a progressive rollout to catch regressions. An A/B test can serve as a canary in the first few minutes of its rollout (regressions show up as immediate metric changes), but the two serve different purposes and have different durations.
98.7 Holdout sets
A holdout set is a fixed slice of users or requests that never receives new versions. It stays on the old version indefinitely (or for a long period). The purpose is to provide a clean baseline to compare against, especially for ML systems where concept drift and seasonality confound simple before/after comparisons.
traffic: 95% -> current version (changes over time)
5% -> holdout (v0, frozen)
The holdout group is the control. Over weeks and months, you can compare aggregate metrics (engagement, retention, task success rate) between holdout and non-holdout to measure the net effect of the version changes. Without a holdout, you can’t tell whether a 3% retention improvement came from your new model or from a seasonal effect.
Holdouts are particularly important for ML systems because an ML “improvement” that moves the optimized metric is often not a net improvement once long-tail effects are included. A holdout is the only honest way to measure that.
The downside: the holdout group gets worse service (they never see improvements). This is a real cost and needs to be priced against the measurement value. Most teams rotate holdouts or limit their size to 1-5% of total traffic.
98.8 Scheduled synthetic canaries
Real traffic varies. A canary that runs during a quiet hour sees different load than one during peak. To test under controlled conditions, teams run synthetic canaries: a fixed, known set of requests issued on a schedule.
every 5 min: send 50 requests to canary, measure response, alert on failure
Useful for:
- Baseline health checks. A synthetic canary that hits every critical endpoint every minute catches outages that may not be visible in real traffic (low-traffic endpoints, for example).
- Rollout verification. After a deploy, the synthetic canary runs first and has to pass before real traffic is shifted.
- SLA compliance testing. Contract SLAs often require a certain response rate on a defined synthetic workload.
Synthetic canaries don’t replace real-traffic canaries because they don’t exercise the full request distribution. Real users send weird inputs that your synthetic set will never cover. But they catch the obvious failures fast, which is valuable as a first line of defense.
Tools: Grafana Synthetic Monitoring, Datadog Synthetic, bespoke cron jobs hitting the service. For ML systems, synthetic canaries often include a fixed prompt set with expected output ranges — the simplest form of continuous correctness monitoring.
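A minimal synthetic probe is a fixed request set plus a success-rate gate (`send` is a stand-in for a hypothetical HTTP or model call; the fixed-prompt ML case fits the same shape):

```python
def run_synthetic_probe(probes, send, min_success=0.98):
    """Issue a fixed probe set; alert if the success rate dips."""
    ok = sum(1 for p in probes if send(p))
    rate = ok / len(probes)
    return {"success_rate": rate, "alert": rate < min_success}

# A fixed prompt set with known-good answers, as in the ML case.
probes = [{"prompt": "2+2=", "expect": "4"},
          {"prompt": "capital of France?", "expect": "Paris"}]
fake_model = {"2+2=": "4", "capital of France?": "Berlin"}  # regressed!
report = run_synthetic_probe(
    probes, send=lambda p: fake_model[p["prompt"]] == p["expect"])
assert report == {"success_rate": 0.5, "alert": True}
```

In production the loop runs on a schedule (cron, a synthetic-monitoring product) and the alert feeds the same paging path as real-traffic alerts.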
98.9 Rollback automation
A canary without automatic rollback is a canary with a human in the loop who might be asleep. The failure modes:
- Slow rollback: human notices canary regression in 15 minutes, initiates rollback. Total impact: 15 minutes × canary %.
- Fast rollback: automated rollback triggers on metric breach within 2 minutes. Total impact: 2 minutes × canary %.
Automation trades off false positives (rollbacks triggered by noise) against detection time. The threshold tuning is the interesting engineering problem: strict enough to catch real regressions fast, loose enough to avoid rolling back on every blip.
Typical automation:
canary:
  analysis:
    interval: 60s
    threshold: 5       # number of failed checks before rollback
    maxWeight: 50      # max traffic % during canary
    stepWeight: 5      # increment per successful check
    metrics:
      - name: request-success-rate
        thresholdRange: { min: 99 }
        interval: 1m
      - name: request-duration-p99
        thresholdRange: { max: 500 }
        interval: 1m
A check runs every minute. If 5 consecutive checks fail (meaning 5 minutes of sustained regression), roll back. If checks pass, promote by stepWeight up to maxWeight.
The threshold: 5 is the anti-noise buffer. One flaky check won’t trigger. Five in a row means something is actually wrong.
Modern rollout controllers (Argo Rollouts, Flagger) implement this pattern natively. Configure once per deployment, and every rollout follows the same automated canary pattern with rollback on breach.
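The sustained-failure rule can be replayed as a small state machine (a toy loop; real controllers like Flagger differ in accounting details):

```python
def run_rollout(check_results, threshold=5, step=5, max_weight=50):
    """Roll back after `threshold` consecutive failed checks,
    otherwise step the canary weight up toward max_weight."""
    weight, failures = step, 0
    for ok in check_results:
        if ok:
            failures = 0
            weight = min(weight + step, max_weight)
        else:
            failures += 1
            if failures >= threshold:
                return "rolled_back", 0
    return "promoted", weight

# One flaky check is absorbed; five failures in a row trigger rollback.
assert run_rollout([True, False, True, True] + [True] * 10)[0] == "promoted"
assert run_rollout([True] + [False] * 5) == ("rolled_back", 0)
```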
98.10 Production patterns
(1) Canary everything. Even low-risk services deserve canary rollouts. The marginal cost of a canary is small, and it catches the one-in-a-hundred regression that would otherwise cost hours of incident response.
(2) Use sticky routing. Hash on user ID or session ID so users see a consistent version throughout the rollout.
(3) Separate canary metrics. Canary traffic should emit its own set of metrics tagged with version=canary. Gate dashboards and alerts on the tagged metrics so the canary is visible in isolation.
(4) Short bake time for small canaries, long bake time for big ones. 5% canary for 5 min, 25% canary for 15 min, 100% promotion after 30 min of stability. Early phases need less time because the blast radius is small.
(5) Automate rollback on error-rate and latency regressions. Gate on both metrics; a breach of either triggers rollback.
(6) Hold the old version alive for a full window after promotion. Don’t tear down v1 the moment v2 reaches 100%. Keep v1 scaled down but available for 1-24 hours in case a slow-burning issue appears and you need to roll back.
(7) Canary the infra, not just the binary. A new Kubernetes version, a new node pool, a new GPU driver — all of these benefit from canary rollouts via cordoning and gradual migration.
(8) For ML models, gate on both performance and quality. Automated latency/error gates are necessary but not sufficient. Add at least one quality metric (thumbs-down rate, retry rate, automated eval score) before promoting.
98.11 The mental model
Eight points to take into Chapter 99:
- A canary turns full-scale deploy risk into percentage-scale deploy risk. 20× reduction on the first real regression.
- Traffic-shifted canaries are the baseline: route 5% to v2, bake, promote or roll back.
- Statistical canaries (Kayenta) compare baseline and canary distributions side by side with statistical tests. Better signal, more config.
- Shadow traffic mirrors requests to v2 without user-visible impact: safest for read-heavy services, workable for write paths only with mocking or a shadow store.
- Dark launches and feature flags enable fine-grained rollout without binary-level canary.
- ML model canaries are different because outputs aren’t comparable. Add offline eval gates and quality proxy metrics.
- Holdout sets give you an honest long-term baseline for measuring the net effect of changes.
- Automate rollback with sustained-failure thresholds to catch regressions faster than humans can.
In Chapter 99, we look at what happens when the canary doesn’t catch the regression and the pager goes off.
Read it yourself
- The Kayenta GitHub repository and its design docs at Netflix Tech Blog (Michael Graff and Taylor Kline, “Automated Canary Analysis at Netflix with Kayenta”).
- Argo Rollouts documentation on progressive delivery patterns.
- Flagger documentation on canary analysis with Prometheus.
- Release It! by Michael Nygard (2nd ed, 2018) — chapter on deployment patterns.
- Google’s “Canary Analysis Service” paper and the Spinnaker documentation.
- Feature flag management books and LaunchDarkly’s whitepapers on progressive delivery.
Practice
- Compute the user-impact savings for a 5-minute regression detected during a 10% canary vs a full deploy. Assume 1M users/day and uniform distribution.
- Why is sticky routing important for canaries? Construct a failure scenario that breaks without it.
- For a stateless read-only API, design a shadow-traffic canary. What’s the extra cost?
- Why is shadow traffic problematic for a write-heavy API? What’s the workaround?
- Contrast a canary and an A/B test. When would you use each?
- Design an ML model canary for a new LLM version. What offline gates, what shadow signal, what live rollout criteria?
- Stretch: Set up Argo Rollouts on a local cluster with a simple web service. Configure a 5-step canary with Prometheus-based analysis. Deliberately introduce a latency regression in v2 and watch the automated rollback trigger.