ORR: what production-readiness actually means
"A service is ready for production when every question on the ORR is answered with 'yes, and here's where.'"
The Operational Readiness Review (ORR) is the checklist a service walks through before it’s allowed to take production traffic. Sometimes it’s a formal gate, sometimes a conversation with a senior engineer, sometimes a Google Doc. The format varies; the content does not. The ORR is the organizational memory of every mistake that’s ever been made — every outage, every page, every postmortem — distilled into a list of questions that, if you can answer them all, you’re much less likely to repeat those mistakes.
This chapter walks through the ORR both as a cultural artifact and as a concrete checklist. Each item on the list exists because someone, somewhere, shipped a service without it and had a bad time. The chapter closes Part VIII by recapping how the preceding chapters map onto the questions an ORR asks. Observability, SLOs, canaries, postmortems — they’re all here, but now in the context of “do you have these things for this specific service before it ships?”
Outline:
- What an ORR is and why it exists.
- The cultural function.
- Capacity planning.
- Observability.
- Runbooks.
- Backups and data durability.
- Security review.
- On-call coverage.
- Dependency mapping.
- Failure mode analysis.
- The Part VIII recap.
- The mental model.
100.1 What an ORR is and why it exists
An Operational Readiness Review is a pre-launch checklist and review process that a service has to pass before it is considered “production-ready.” The term comes from Amazon and Google, both of which formalized the process in the early 2000s as they scaled past the point where “senior engineer vouches for the service” was a reliable mechanism.
The ORR answers the question: is this service safe to put in front of users?
Without an ORR, the implicit answer to that question is “the team pinky-swears it is.” That works when the team is small and the senior engineers have personally reviewed every line. It fails when the team has dozens of services, each built by different sub-teams, each with different levels of maturity. The ORR is the institutional memory that survives team turnover.
The ORR is not a bureaucratic gate. It’s a structured conversation. A typical ORR has 30-80 questions organized into categories, and the review takes 30-90 minutes. The reviewer (usually a senior engineer or SRE) walks the service owner through the questions. For each question, the answer is one of:
- Yes, here’s where. A link to the dashboard, the runbook, the test, the monitoring, the capacity doc.
- Partially, here’s the plan to finish. A link to the in-progress work.
- No, and here’s why it’s OK for this service. An explicit waiver.
The output of the ORR is a document: answers for every question, explicit waivers for the skipped items, and a clear go/no-go decision. Often the ORR is iterative — the first pass surfaces gaps, the team goes off and fills them, a follow-up pass signs off.
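The answer categories and the go/no-go rule can be sketched as data. This is a minimal illustration, not a real ORR tool; the names (`ORRItem`, `Answer`, `go_no_go`) are hypothetical.

```python
# Sketch of an ORR document as data: every item must carry evidence --
# a link for "yes", a plan for "partially", a written reason for a waiver.
from dataclasses import dataclass
from enum import Enum

class Answer(Enum):
    YES = "yes, here's where"
    PARTIAL = "partially, here's the plan"
    WAIVED = "no, and here's why it's OK"

@dataclass
class ORRItem:
    question: str
    answer: Answer
    evidence: str = ""   # link to dashboard, runbook, test, plan, or waiver reason

def go_no_go(items: list[ORRItem]) -> bool:
    """Go only if every item carries evidence. Implicit skips block launch."""
    return all(item.evidence for item in items)
```

An explicit waiver still passes: `ORRItem("Backups restored?", Answer.WAIVED, "stateless; config lives in git")` counts as answered, while the same item with an empty evidence field blocks the launch.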
The reason the ORR format beats ad-hoc review: it’s a repeated question set that catches the things you forget to check. A senior engineer reviewing a service in an unstructured conversation might remember to ask about observability and capacity but forget to ask about backups and dependency mapping. The checklist doesn’t forget.
```mermaid
graph LR
  T[Team self-review<br/>fill ORR doc] --> R[Reviewer walk-through<br/>30–90 min]
  R --> G{All items answered?}
  G -->|Yes + no blockers| S[Go / production-ready]
  G -->|Gaps found| W[Team addresses gaps]
  W --> R2[Follow-up review]
  R2 --> S
  style S fill:var(--fig-accent-soft),stroke:var(--fig-accent)
```
ORR is iterative by design — surfacing gaps in the first review and filling them before the follow-up is the intended flow, not a failure.
100.2 The cultural function
The ORR is cultural before it’s technical. Its most important output is not the checklist being filled in; it’s the conversation the service team has with themselves while filling it in.
When a team reads the ORR and realizes they can’t answer “what happens if the database goes down?” — they pause the launch and go think about it. The ORR didn’t tell them the answer; it asked them a question they didn’t have one for. The pause is the value.
The cultural rules that make ORRs work:
- ORR is mandatory. Every new service, every major change, every revamp. No exceptions for “it’s just a small service.” The small services are where the mistakes happen.
- Waivers are explicit. If you’re skipping an item, write down why in the ORR doc. Future readers (and auditors) need to know this was a deliberate choice.
- Senior engineers do the reviews. The reviewer’s job is to ask hard questions, not to rubber-stamp. Junior reviewers haven’t seen enough bad things to know what to worry about.
- Failing the ORR isn’t a performance issue. A service that fails the ORR goes back and fixes the gaps. The team is not punished for failing; they’re punished for hiding failure.
- The ORR evolves. Every major postmortem should add a question to the ORR: “have you planned for the thing that just broke for us?” The ORR is a living document.
Teams that treat the ORR as a formality (“we checked all the boxes, ship it”) lose the benefit immediately. Teams that treat it as a genuine conversation catch real problems pre-launch.
100.3 Capacity planning
The question: how much load can this service handle, and what happens when it runs out?
The ORR expects:
- Peak capacity estimate. How many RPS (or tokens/sec) can one replica handle at target latency? How many replicas at peak? What’s the headroom multiplier?
- Growth projection. What’s the expected traffic in 3 months, 12 months? Can the current architecture handle it?
- Bottleneck identification. Which resource saturates first — CPU, GPU, memory, KV cache, DB connections? You should know before launch.
- Load test results. Has the service actually been load-tested at 1.5× peak? Not a simulated number — an actual test.
- Autoscaling configuration. HPA/KEDA rules, min/max replicas, scale-up latency (Chapter 51).
- Overload behavior. When saturation hits, what does the service do? 429s? Queueing? Graceful degradation?
The default answer is usually “we’ve guessed based on benchmarks.” That’s not enough. The ORR forces the team to actually run the load test before launch. Most services that fail the ORR on capacity fail because nobody did the load test.
For ML serving specifically, capacity planning is unusually hard because the resource (GPU + KV cache) is expensive and the scale-up path is slow (cold start — Chapter 51). The ORR should ask:
- What’s the KV cache budget per replica and how many concurrent users does that imply?
- What’s the p99 cold start time?
- What’s the headroom above the autoscaler’s max replicas?
- What happens when the GPU node pool is exhausted (no GPUs available for autoscaling)? The answer is usually “we page,” and the ORR surfaces that the team has thought about it.
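The back-of-envelope math behind the capacity questions can be made concrete. All figures below (RPS per replica, KV cache bytes per token, GPU sizes) are hypothetical placeholders; substitute your own load-test and benchmark results.

```python
# Capacity arithmetic an ORR asks for: replicas at peak with headroom,
# and the KV cache budget translated into concurrent sequences.
import math

def replicas_needed(peak_rps: float, rps_per_replica: float, headroom: float = 1.5) -> int:
    """Replicas required at peak, with a headroom multiplier for spikes."""
    return math.ceil(peak_rps * headroom / rps_per_replica)

def kv_cache_concurrency(gpu_mem_gb: float, weights_gb: float,
                         kv_bytes_per_token: int, max_seq_len: int) -> int:
    """How many full-length sequences fit in the per-replica KV cache budget."""
    budget_bytes = (gpu_mem_gb - weights_gb) * 1024**3
    return int(budget_bytes // (kv_bytes_per_token * max_seq_len))

# Example: 300 RPS peak at 25 RPS/replica; 80 GB GPU, 40 GB of weights,
# ~160 KiB of KV cache per token, 4k max context.
print(replicas_needed(300, 25))                         # 18 replicas at 1.5x headroom
print(kv_cache_concurrency(80, 40, 160 * 1024, 4096))   # 64 concurrent sequences
```

If the load test says one replica sustains fewer RPS than the benchmark suggested, these numbers move — which is exactly why the ORR insists on the test.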
100.4 Observability
The question: if this service breaks, how will we know?
This is where Chapters 92-97 become ORR items:
- Golden signals dashboards. Latency, traffic, errors, saturation graphed and visible to on-call (Chapter 92).
- Prometheus metrics exposed. Standard metric names, label conventions, cardinality under control (Chapter 93).
- Structured logs with trace IDs. JSON logs, consistent schema, shipper configured (Chapter 94).
- Distributed tracing instrumented. OTel SDK wired up, trace propagation tested across boundaries (Chapter 95).
- Continuous profiling running. eBPF profiler covers the service, queryable by label (Chapter 96).
- SLOs defined with burn-rate alerts. Not just SLIs; actual SLOs with error budgets (Chapter 97).
- Alerts route to on-call correctly. A test page should be issued pre-launch to verify the alert → PagerDuty/Rootly path works.
- Dashboards linked from runbooks. On-call shouldn’t have to hunt for the right dashboard during an incident.
An ORR that skips observability is essentially “trust us, it’ll be fine.” The observability checklist catches the services that hadn’t thought through any of it and were one incident away from disaster.
The easiest ORR failure mode: a dashboard exists, but nobody has ever looked at it. The review should include an open-the-dashboard-and-show-me step. “What does healthy look like on this graph? Show me where an error spike would appear.”
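The burn-rate item above is easy to sanity-check with Chapter 97's arithmetic. A minimal sketch, with hypothetical numbers:

```python
# Burn rate = observed error fraction / error budget fraction. A rate of 1.0
# spends the whole budget over exactly the SLO window; a common fast-burn
# alert threshold is ~14.4, which exhausts a 30-day budget in about two days.

def burn_rate(observed_error_fraction: float, slo: float) -> float:
    return observed_error_fraction / (1.0 - slo)

def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    return window_days * 24 / rate

r = burn_rate(0.0144, slo=0.999)   # 1.44% of requests failing against a 99.9% SLO
print(round(r, 1), round(hours_to_exhaustion(r), 1))   # 14.4 50.0
```

An ORR reviewer can ask for exactly this: "at what error rate does your fast-burn alert fire, and how long until the budget is gone at that rate?"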
100.5 Runbooks
The question: when an on-call engineer gets paged at 3 AM, what are they supposed to do?
A runbook is a document that says, for a given alert or symptom, what to check and what to try. The ORR expects:
- One runbook per alert. Every pageable alert has a linked runbook. The alert’s notification should include the link.
- Runbooks are current. Runbooks go stale fast. Review at least quarterly, or after every incident.
- Runbooks are executable. Commands are copy-pasteable. No placeholders, no “TBD.”
- Common failures covered. The top 5-10 known failure modes have their own runbook entries.
- Escalation path documented. If the runbook doesn’t fix it, who to call next.
A runbook is not a replacement for skill. An experienced on-call engineer will often diagnose the problem from the dashboards alone. But runbooks turn rare events (a crashloop you haven’t seen in six months) from “page the senior engineer” to “follow the runbook.” And they’re the on-ramp for new on-call engineers — the playbook they use while they’re still building intuition.
Bad runbook patterns:
- “Call Alice.” Alice leaves the company, the runbook is broken.
- “Check if the database is up.” How? With what command? Against which host?
- “Restart the service.” Which command? What should you check before restarting? What should you check after?
Good runbooks have specific commands, specific queries, specific expected outputs. A junior engineer following the runbook should be able to make progress without prior context.
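The "one runbook per alert, no placeholders" items can be checked mechanically. A sketch, assuming alerts are dicts with a `runbook_url` annotation (a shape loosely modeled on Prometheus rule annotations; the field names are assumptions):

```python
# Lint check: every pageable alert links a runbook, the link resolves,
# and the runbook body contains no placeholder text.

PLACEHOLDERS = ("TBD", "TODO", "call alice", "ask the team")

def lint_alerts(alerts: list[dict], runbooks: dict[str, str]) -> list[str]:
    problems = []
    for alert in alerts:
        url = alert.get("annotations", {}).get("runbook_url")
        if not url:
            problems.append(f"{alert['name']}: no runbook_url annotation")
            continue
        body = runbooks.get(url, "")
        if not body:
            problems.append(f"{alert['name']}: runbook_url points at nothing")
        elif any(p.lower() in body.lower() for p in PLACEHOLDERS):
            problems.append(f"{alert['name']}: runbook contains placeholder text")
    return problems
```

Running a check like this in CI turns "runbooks go stale fast" from a quarterly review item into a merge-time failure.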
100.6 Backups and data durability
The question: if the data is lost, can we get it back?
This is the category most often underestimated. Most teams think “we don’t own any data, so backups don’t apply to us.” Usually wrong. Even stateless services have some state:
- Configuration. The set of feature flags, tenant configs, rate limits.
- Secrets. API keys, credentials. Can you rotate them if the secret store dies?
- Model weights. For ML serving, the model is data. If the registry disappears, can you restore?
- Indexes. Vector store, search index, feature store. These are data even if they’re derivable.
- User state. Session data, cached auth tokens.
The ORR expects:
- What state does the service own? Enumerated explicitly.
- What’s the recovery point objective (RPO)? How much data loss is acceptable? For some services, zero; for many, a few minutes.
- What’s the recovery time objective (RTO)? How fast can you restore? Minutes, hours, days?
- Backup cadence and retention. How often, kept for how long, stored where (a different region/account is ideal).
- Restore test. Has the backup actually been restored, as a drill? Not in theory — has someone actually done it?
The “restore test” is the item that catches the most issues. Backups that haven’t been restored in a year have a way of failing silently. The tape is corrupted, the schema has drifted, the restore path has a bug. Drilling the restore surfaces these before you need them.
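The shape of a restore drill can be shown in miniature. A real drill restores a database snapshot into a scratch environment; this sketch only demonstrates that the verification step actually runs, rather than stopping at "the copy succeeded."

```python
# Minimal restore drill: take a backup, restore it to a fresh location,
# and verify by checksum recorded at backup time.
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def restore_drill(source: Path, backup_dir: Path, restore_dir: Path) -> bool:
    backup = backup_dir / source.name
    shutil.copy2(source, backup)            # "take the backup"
    checksum = sha256(backup)               # recorded at backup time
    restored = restore_dir / source.name
    shutil.copy2(backup, restored)          # "run the restore"
    return sha256(restored) == checksum     # the drill fails loudly on mismatch

with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "backups").mkdir()
    (root / "restore").mkdir()
    state = root / "state.db"
    state.write_bytes(b"tenant configs, flags, rate limits")
    print(restore_drill(state, root / "backups", root / "restore"))   # True
```

The drill's return value is the ORR answer: not "we have backups" but "we restored one on this date and the checksum matched."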
100.7 Security review
The question: is this service exposed to attack, and have we mitigated the obvious classes?
The ORR isn’t a full security audit, but it covers the basics:
- Authentication and authorization. Who can call the service? How is identity verified? Is it mutual-TLS, JWT, session cookie, API key, something else? Does the authorization check happen per-request?
- Secret management. Are secrets stored in a proper secret manager (Vault, AWS Secrets Manager, SOPS) or hard-coded? Can they be rotated?
- Input validation. Does the service accept user input? Is it validated? Are there bounds on sizes? For LLMs: prompt injection, jailbreak attempts, PII handling.
- Network exposure. Is the service internal-only, or is it exposed to the internet? If internet-facing, is there a WAF in front?
- TLS. Encrypted in transit? With current cipher suites?
- Logging of security events. Failed auth, rate limit hits, suspicious patterns logged and retained?
- Dependencies scanned. Have the container images been scanned for CVEs recently?
- Privilege minimization. Does the service run as root? Does it have more permissions than it needs?
For ML systems specifically:
- Model weight security. Is the model file protected? Who can write to the registry?
- Prompt injection mitigation. For LLMs exposing tools/functions, is there a defense against prompt injection?
- PII in logs. Are user prompts and outputs logged? If so, is PII scrubbed? Are retention policies appropriate?
- Rate limiting. To prevent abuse and cost blow-up, are there per-user rate limits?
A security specialist usually joins the ORR for services with nontrivial exposure. The ORR isn’t trying to replace a security team; it’s trying to make sure the service didn’t ship with a glaring hole.
100.8 On-call coverage
The question: who gets paged when this service breaks, and do they have the knowledge to respond?
An ORR expects:
- Named on-call rotation. A rotation with at least 3-5 people. Solo on-call is an ORR failure; one person can’t sustain 24/7.
- Rotation tool configured. PagerDuty, Rootly, whatever — alerts route correctly.
- On-call training. New rotation members have shadowed, reviewed runbooks, been through an incident (or a drill).
- Handoff process. Weekly (or daily) handoffs with shared notes.
- Escalation paths. When the primary on-call can’t resolve, who’s next? Secondary on-call, SME on-call, manager, company-wide escalation.
- Compensation and time-off policy. On-call is work; it has to be compensated (time off, pay, or both). Burnt-out on-call engineers are a reliability risk.
The hardest part is coverage quality, not coverage existence. A rotation of 5 engineers, where 4 are new and don’t know the service, is functionally a rotation of 1. The ORR should check that the majority of the rotation has actually responded to alerts for this service before.
On the tooling side: the on-call system should be tested with a test page before launch. “We configured the rotation” is not enough; “we issued a test page and the right person’s phone rang” is.
100.9 Dependency mapping
The question: what does this service depend on, and what happens when those dependencies break?
The ORR expects a list of every dependency, with:
- The dependency’s SLO. Do they commit to reliability? What number?
- The impact of failure. Does your service fail, degrade, or continue normally when this dependency fails?
- Timeouts. Is there a timeout on every call to the dependency? Unbounded waits are an ORR failure.
- Retries. Is there a retry policy on transient failures? Are retries bounded?
- Circuit breakers. Is there a mechanism to stop calling a broken dependency to avoid cascading failure?
- Fallback behavior. If the dependency is down, do you return cached data? An error? A degraded response?
This category forces the team to actually think about the failure modes of their downstream. The default is “we assume the dependency is up.” That’s not good enough; every dependency fails eventually.
For ML serving, the dependency chain is usually long: auth, routing, model registry, feature store, embedding service, vector index, retrieval API, external LLM provider, rate limiter, billing. Each one is a potential failure. The ORR forces you to enumerate them and say what happens when each fails.
A fun exercise: use the dependency map to compute the theoretical availability ceiling via Chapter 97’s math. If your dependencies multiply out to 99.5%, you cannot promise 99.9% without fallbacks. The ORR surfaces that gap.
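The exercise above is one line of arithmetic. With no fallbacks, a request that must traverse every dependency succeeds only if all of them do, so the ceiling is the product of their availabilities:

```python
# Serial availability ceiling from a dependency map (Chapter 97's math).
import math

def availability_ceiling(dependency_availabilities: list[float]) -> float:
    return math.prod(dependency_availabilities)

ceiling = availability_ceiling([0.999, 0.999, 0.999])
print(f"{ceiling:.4%}")   # 99.7003% -- below a 99.9% SLO before your own failures count
```

Three three-nines dependencies already put the ceiling under three nines; the gap between that number and the promised SLO is the size of the fallback work the ORR is asking about.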
100.10 Failure mode analysis
The question: what are the ways this service can fail, and have we thought about each of them?
This is the open-ended category. The reviewer asks the team to brainstorm failure modes and walk through each:
- What if a pod crash-loops?
- What if the GPU node is preempted?
- What if the load balancer drops traffic to half the pods?
- What if the database is in a split-brain state?
- What if a dependency responds correctly but slowly?
- What if the model weights fail to load?
- What if we get a traffic spike 10× baseline?
- What if we get a traffic drop of 90% and then it comes back?
For each scenario: what does the service do, what does the user see, what page fires, what’s the runbook?
The pattern this exercise catches: failures the team had never considered. “What if the pod OOMs during model load?” is a real scenario and if nobody has thought about it, the answer is usually “we don’t know what happens.” Then the service is one OOM away from a surprise. The ORR surfaces it so the team can think about it before it happens.
FMEA (Failure Mode and Effects Analysis) is the formal version of this exercise, imported from aerospace and manufacturing. In software it’s usually less rigorous, but the spirit is the same: enumerate failure modes, assess impact, pre-plan responses.
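The FMEA scoring convention is a risk priority number: severity × likelihood × detectability, each on a 1-10 scale where higher is worse. A sketch with illustrative entries and made-up scores:

```python
# FMEA-style prioritization: highest risk priority number (RPN) first.
# Scores and failure modes are illustrative, not a recommendation.

def rpn(severity: int, likelihood: int, detectability: int) -> int:
    return severity * likelihood * detectability

failure_modes = [
    ("pod OOMs during model load",      8, 5, 7),   # hard to detect: looks like a slow start
    ("GPU node pool exhausted",         7, 4, 3),
    ("dependency slow but not failing", 6, 6, 8),
]

# The top of this list gets runbooks and chaos tests before launch.
for name, s, l, d in sorted(failure_modes, key=lambda m: -rpn(*m[1:])):
    print(f"{rpn(s, l, d):4d}  {name}")
```

The ranking matters more than the absolute numbers: the exercise forces the team to argue about which failure is worst, and that argument is where the unconsidered failure modes surface.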
Another useful framing: chaos engineering. Don’t just think about the failure modes; inject them in a controlled way and verify the service actually handles them. Many ORRs now include “have you done at least one chaos test?” as a question.
100.11 The Part VIII recap
Part VIII has built the observability and reliability story from bottom to top, and the ORR is where it all gets applied in one place.
Chapter 92 gave the frameworks — golden signals, USE, RED, LETS — that compress thousands of metrics into a handful of first-class signals for different layers of the stack.
Chapter 93 explained Prometheus’s pull-based architecture, the PromQL mental model, and the cardinality math that kills most installations. Every ORR question about “do you have metrics” maps to here.
Chapter 94 covered logs — structured JSON as the non-negotiable baseline, Loki’s label-indexed design, LogQL, and the cost economics of retention. “Do you have logs?” is Chapter 94.
Chapter 95 built distributed tracing from first principles: spans, context propagation, head vs tail sampling, and the OTel collector. “Can you trace a slow request end-to-end?” is Chapter 95.
Chapter 96 added the fourth pillar: continuous profiling via eBPF and flame graphs. “Can you see why a function is slow?” is Chapter 96.
Chapter 97 introduced SLIs, SLOs, SLAs, and error budgets — the framework for turning measurements into commitments and for deciding when to stop shipping. “What does healthy mean and how do you know you’re there?” is Chapter 97.
Chapter 98 covered canary patterns — traffic-shifted, statistical, shadow, ML-specific. “How do you deploy safely?” is Chapter 98.
Chapter 99 walked incident management: roles, postmortems, blameless culture, causal chains. “What happens when the canary misses and something breaks?” is Chapter 99.
Chapter 100 (this chapter) is the capstone: the ORR that asks all of the above, plus capacity, runbooks, backups, security, on-call, dependencies, and failure modes — as a checklist applied to every new service.
The thread: observability is the mechanism by which the reliability story becomes testable. Every layer of Part VIII is a piece of machinery that, when applied together, lets a team ship fast and stay up. Skip any layer and the system gets worse. Apply all of them and the system becomes one of the rare production stacks that’s boring to run.
The ORR is the ritual where you check whether you actually did the work.
100.12 The mental model
Eight points to take into Chapter 101:
- The ORR is a checklist of every lesson the organization has learned from every outage. Its content grows as incidents teach it.
- The cultural function is the conversation, not the document. Pausing to think about a question is the value.
- Capacity, observability, runbooks, backups, security, on-call, dependencies, failure modes are the eight canonical categories.
- Waivers are explicit. Every skipped item gets a written reason.
- Restore tests and test pages matter. Theoretical backups and theoretical alerts don’t count.
- The dependency map bounds your achievable SLO. Know your chain.
- Failure mode analysis is where you catch the failures you hadn’t imagined. Brainstorm, pre-plan, chaos test.
- The ORR is the capstone of Part VIII — observability, SLOs, canaries, and postmortems all show up as checklist items you have to satisfy before you ship.
Part IX starts next, with Chapter 101: the build-deploy-operate pipeline that sits under everything in Parts VIII and earlier. You now know how to measure a system and how to decide it’s reliable; next we look at how you actually get it to production.
Read it yourself
- Beyer et al. (eds.), The Site Reliability Workbook (Google, 2018). Chapter 18 is the canonical ORR chapter.
- Michael Nygard, Release It! Design and Deploy Production-Ready Software, 2nd ed (2018). The “Production-Ready” chapters.
- Charity Majors, “The Pillars of Operational Maturity,” Honeycomb blog.
- Amazon’s principal engineer writeups on operational excellence (various public talks).
- Richard Cook, “How Complex Systems Fail.” Short and canonical.
- Chaos Engineering: System Resiliency in Practice by Rosenthal, Jones, Blohowiak et al. (O’Reilly, 2020) — for the FMEA-plus-injection approach.
Practice
- Write a one-paragraph answer to each of the eight ORR categories (§100.3–100.10) for a hypothetical new LLM serving deployment.
- A team’s ORR answer to “do you have a runbook?” is “yes, here’s the link.” The link goes to a doc with one sentence: “Check the dashboards.” What’s your follow-up as reviewer?
- Why is the “restore test” more valuable than “the backup exists”?
- For a service with three critical dependencies each committing to 99.9% availability, what’s the achievable availability? Does your service’s SLO exceed it? What do you do about it?
- List five failure modes that a typical new ML serving service should FMEA-analyze before launch.
- Design the on-call rotation for a new service: how many engineers, how long is a shift, what’s the escalation path, and what’s the training plan for new rotators?
- Stretch: Take a real service you know well (open-source or otherwise) and write an ORR for it as if you were launching it today. Note the gaps that would need to be filled before launch.