Incident management: postmortems, blameless culture
"Every incident is a free education the system is willing to give you exactly once. Don't waste it."
Eventually, something breaks. The canary from Chapter 98 missed it, or the change looked safe and wasn’t, or a dependency three hops down failed in a way nobody anticipated. An alert fires, a pager goes off, and someone has to coordinate the response while users experience the impact. This chapter is about how that process works in a mature engineering organization, and why the postmortem — the document written afterwards — is the most valuable artifact the whole machinery produces.
The content: incident command roles, the mechanics of incident response, the postmortem template, five whys vs causal chains, what makes a good postmortem, and the cultural function of the incident review. Three generic case studies at the end to show the pattern in action. Chapter 97 gave the error budget as the accounting mechanism for incidents; this chapter is how you actually run one and learn from it.
Outline:
- What an incident is.
- The incident commander role.
- The response team roles.
- Incident tooling: Rootly, PagerDuty, and successors.
- The postmortem as an artifact.
- Blameless culture.
- Five whys vs causal chains.
- What makes a good postmortem.
- Three case studies.
- The review meeting.
- The mental model.
99.1 What an incident is
An incident is a situation where the service is degraded or down and active human attention is required to resolve it. The word is deliberately broad. Key properties:
- Active. If it’s a latent bug nobody has noticed, it’s not an incident. It’s an incident when it starts affecting users or threatens to.
- Requires coordination. A single engineer fixing a bug in 5 minutes with no impact is not an incident, it’s a task. Incidents have multiple responders, scattered context, and time pressure.
- Has a blast radius. Incidents are sized by their impact — how many users, how severe, for how long.
Organizations classify incidents by severity. A common scheme:
- SEV1: company-wide outage, major customer impact, all hands. Page everyone relevant.
- SEV2: partial outage, significant impact on a product area. Page the on-call rotation.
- SEV3: degradation, some impact, not breaking. Ticket, no page.
- SEV4: minor issue, no user impact. Ticket, handled in normal flow.
The severity dictates the response. SEV1 gets an incident commander, a dedicated chat channel, a status page update, and executive notifications. SEV3 gets a Jira ticket. The rules for “what severity is this” should be clear and written down; arguing severity in the middle of an incident wastes time.
The other key classification: incident or not-incident. Not every alert is an incident. A flaky metric that fires once and recovers is not an incident. A scheduled maintenance window with expected impact is not an incident. A single pod crashloop that the deployment system is already handling is not an incident. The point is to reserve the incident designation for situations where coordinated human action actually changes the outcome.
```mermaid
graph TD
A[Alert fires] --> B{Severity?}
B -->|SEV1 — company outage| C[Page IC + all relevant on-calls<br/>status page update<br/>exec brief]
B -->|SEV2 — partial outage| D[Page on-call rotation<br/>open incident channel]
B -->|SEV3 — degradation| E[Ticket, no page<br/>handle in normal flow]
B -->|SEV4 — minor| F[Ticket only]
C --> G[Incident response process]
D --> G
style C fill:var(--fig-accent-soft),stroke:var(--fig-accent)
```
Severity triage should take under 60 seconds — if the team debates SEV1 vs SEV2 during the incident, the severity rubric needs to be written down before the next one.
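One way to make triage take under 60 seconds is to write the rubric down as code, so the call is mechanical. This is a minimal sketch; the field names and the 50% threshold are illustrative assumptions, not a standard — encode your own team's written rubric the same way.

```python
# Hypothetical severity rubric. The inputs and thresholds are illustrative;
# the point is that the rubric is written down, not debated mid-incident.

def triage(user_visible: bool, affected_fraction: float, degraded_only: bool) -> str:
    """Map observed impact to a severity level per the rubric above."""
    if not user_visible:
        return "SEV4"            # minor issue, ticket only
    if degraded_only:
        return "SEV3"            # degradation, ticket, no page
    if affected_fraction >= 0.5:
        return "SEV1"            # company-wide outage, page everyone relevant
    return "SEV2"                # partial outage, page the on-call rotation
```

A function like this can live in the alerting pipeline and pre-fill the severity on the incident declaration, leaving the IC to override it rather than derive it.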
99.2 The incident commander role
During an incident, one person is the Incident Commander (IC). The IC does not fix the bug. The IC coordinates the people who are fixing the bug. This separation is the single most important cultural technique in incident response, and it’s the hardest to instill in engineering teams who want to dive straight into the code.
The IC’s responsibilities:
- Declare the incident and its severity. Make the call.
- Assemble the response team. Page the right on-calls, pull in subject-matter experts.
- Track state. What’s broken, what’s tried, what’s pending, what’s known, what’s unknown.
- Coordinate comms. Who’s talking to customers, who’s updating the status page, who’s briefing executives.
- Make decisions. When the team can’t agree, the IC decides.
- Declare the incident over when the service is restored.
The IC does not touch the system. The IC runs the meeting. This is backwards for most engineers, who want to help fix the problem directly. The discipline is hard: the most senior engineer in the room is often the best IC precisely because they don’t need to be the one typing commands. Their role is to keep the response coherent.
Rotating ICs matters. A team where one person is always the IC has a bus factor problem. Every senior engineer should take the IC role during their on-call rotation, trained via shadow-IC on low-severity incidents first. IC is a skill that can be taught; it’s mostly about calm prioritization and clear communication.
99.3 The response team roles
Besides the IC, a mature incident response has:
Subject-Matter Responder(s) — the engineers actually debugging and fixing. Usually the on-call for the affected service. They read dashboards, tail logs, run commands, form hypotheses, test fixes. They report to the IC.
Communications Lead — updates the status page, drafts customer-facing messages, fields questions from the incident channel. Often the IC for smaller incidents; separate role for larger ones.
Scribe — writes down everything that happens in chronological order. Timestamps, commands run, observations, decisions. This becomes the timeline for the postmortem. Without a scribe, the timeline is reconstructed from chat logs after the fact, which always loses information.
Customer Liaison — for externally visible incidents, talks to customers directly. Account managers, support leads, whoever has the relationships.
Executive Liaison — briefs leadership so they know what’s happening without interrupting the response team. This is important precisely because it keeps executives out of the way.
These are roles, not titles. A single person can hold multiple roles on a small incident. A large incident might have each role staffed by a separate person. The point is to name them so everybody knows who does what, and nobody is trying to do everything.
Common anti-patterns:
- Everyone debugging. Five engineers all tailing logs, all running commands, no one coordinating. Result: contradictions, wasted effort, fixes that conflict.
- IC also fixing. The IC is head-down in vi while the chat channel fills with unanswered questions. Result: loss of coordination, missed updates, bad decisions.
- No scribe. Nobody writes the timeline as it happens. Result: a postmortem that misses half the story.
99.4 Incident tooling: Rootly, PagerDuty, and successors
Historically, incident tooling meant PagerDuty (2009): an on-call schedule + paging system. It does one thing well — “page the right person” — and for years that was the whole market.
Modern incident tooling extends far beyond paging. Rootly, FireHydrant, incident.io, Grafana OnCall — all offer:
- Schedule management — rotations, overrides, handoffs.
- Paging — via phone, SMS, Slack, email, mobile app.
- Incident lifecycle — declare an incident, auto-create a Slack channel, auto-create a Zoom call, auto-invite responders, auto-update status pages.
- Runbook integration — linked playbooks based on the alert.
- Postmortem generation — timeline reconstruction from the Slack channel, template-driven postmortem docs.
- Metrics and analytics — MTTR, incident frequency, which services page most.
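The MTTR figure these tools report is just the mean of resolved-minus-declared over closed incidents. A sketch, assuming a simple record shape (the field names are illustrative, not any vendor's schema):

```python
from datetime import timedelta

def mttr(incidents: list[dict]) -> timedelta:
    """Mean time to recovery over resolved incidents.

    Each incident dict carries 'declared' and 'resolved' datetimes;
    unresolved incidents (resolved=None) are excluded.
    """
    closed = [i for i in incidents if i.get("resolved")]
    if not closed:
        return timedelta(0)
    total = sum((i["resolved"] - i["declared"] for i in closed), timedelta(0))
    return total / len(closed)
```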
The new pattern is Slack-first incident response. /incident declare in Slack creates a channel, pings responders, starts a timeline, and drops a runbook. Engineers stay in Slack; the tooling handles the meta-work. This is a much better workflow than the PagerDuty-of-2015 pattern where you got a phone call, opened a separate UI, and tried to coordinate from there.
For ML-serving teams specifically, the incident tooling should:
- Auto-attach dashboards. The declare command links to the service’s Grafana dashboards automatically.
- Pre-fill likely responders. The on-call for the affected service, plus the ML team if the model is suspected.
- Integrate with the deploy system. An incident declaration should surface the last N deploys to the affected service automatically — deploys are the most common root cause.
- Trigger runbooks. The runbook for “vLLM pod crash-looping” should be one click away from the incident channel.
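The list above is mostly glue, and the glue is simple. Here is a sketch of what a declare handler assembles before calling the chat platform — every name in it (field names, URL schemes, the deploy-log shape) is hypothetical, not any vendor's API:

```python
# Hypothetical incident-declaration payload builder. Shows the shape of the
# meta-work: channel naming, responder selection, dashboard and deploy context.

def build_declaration(service: str, severity: str, recent_deploys: list[str],
                      max_deploys: int = 5) -> dict:
    """Assemble the channel name, responders, and context links for an incident."""
    return {
        "channel": f"inc-{severity.lower()}-{service}",
        "responders": [f"oncall-{service}"],       # page the service on-call
        "dashboards": [f"https://grafana.example.com/d/{service}"],
        # Deploys are the most common root cause: surface the last N up front.
        "recent_deploys": recent_deploys[-max_deploys:],
        "runbook": f"https://runbooks.example.com/{service}",
    }
```

The payoff is that every incident channel opens with the same context attached, so responders start from dashboards and recent deploys instead of hunting for them.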
Which tool to pick: whatever integrates best with your existing Slack and on-call system. Rootly, incident.io, and FireHydrant are all within a stone’s throw of each other on features. PagerDuty’s classic product still works if you’re used to it, but the Slack-first tools are a step change in ergonomics.
99.5 The postmortem as an artifact
After the incident is resolved, the team writes a postmortem — a structured document that describes what happened, why, what the impact was, and what to change. The postmortem is the single most valuable output of the whole incident-response machinery, because it’s what turns the incident from “a bad day” into “a lesson the organization learned.”
The Google SRE template (Chapter 15 of the SRE book) has these sections:
- Summary — one paragraph explaining the incident to someone who doesn’t know.
- Impact — quantitative: how many users affected, for how long, what SLI was breached, revenue impact if known.
- Timeline — chronological, with timestamps. What happened, what we observed, what we tried, what we learned. Every major event.
- Root cause — what actually caused the incident. Often a chain, not a single cause (see §99.7).
- Detection — how was it detected? How long did detection take? Was the detection adequate?
- Response — what did the response team do? What worked, what didn’t?
- Recovery — how was the service restored? Was the recovery path clean?
- Action items — specific, owned, prioritized. Not aspirational.
- Lessons learned — what was surprising? What assumptions were wrong? What will we do differently?
Each section has a specific purpose. The summary is for the executives who will never read the full doc. The impact is for the accountants and for calibrating severity later. The timeline is for the engineers who will read this in six months when a similar incident happens. The action items are the actionable output.
A postmortem without action items is a file nobody reads. The action items have to be real: tracked in your ticketing system, owned by a specific person, prioritized against other work, and reviewed in the postmortem meeting. If action items aren’t getting done, the postmortem process has failed, and you’re accumulating the same incidents over and over.
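"Real" action items can be enforced mechanically: refuse to publish a postmortem whose items lack an owner, a priority, or a tracking ticket. A sketch of that check, with illustrative field names:

```python
from dataclasses import dataclass

@dataclass
class ActionItem:
    description: str
    owner: str      # a specific person, not a team
    priority: str   # e.g. "P0".."P2"
    ticket: str     # id in your ticketing system

def validate(items: list[ActionItem]) -> list[str]:
    """Return the problems that should block publishing the postmortem."""
    problems = []
    if not items:
        problems.append("postmortem has no action items")
    for item in items:
        for field in ("owner", "priority", "ticket"):
            if not getattr(item, field):
                problems.append(f"{item.description!r} is missing {field}")
    return problems
```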
99.6 Blameless culture
The single most important property of a postmortem process is blamelessness. The idea: the postmortem focuses on what failed in the system, not who failed. Individuals make mistakes, and the mistakes are often the trigger for an incident, but the mistake was possible because the system allowed it — and the system is what you can fix.
“Alice pushed a bad config” is a blame statement. “The config validation didn’t catch the missing field, and the rollout didn’t detect the breakage until production” is a system statement. Both are true; only the second points to a fix.
The reason blamelessness matters isn’t hippie politeness. It’s that engineers who fear blame hide information. They won’t admit they didn’t run the validation script. They won’t admit they bypassed the deploy gate. They won’t tell the postmortem author about the shortcut they took that seemed reasonable at the time. And without that information, the postmortem can’t identify the real failure mode, and the organization doesn’t learn.
The rule of thumb from the SRE book: if a human error caused the incident, the system that allowed the human error is the root cause. Every “human error” is an opportunity to improve tooling, process, or training. If Alice’s push destroyed production, the question is not “why did Alice push?” but “why did we have a system where one push could destroy production without checks?”
Blamelessness is easy to say and hard to practice. It requires:
- Leadership modeling it. Engineers take their cues from senior leaders. If leadership asks “who broke it?” in postmortem meetings, the blameless culture is gone.
- Language discipline. “Alice pushed a bad config” is blame; “a bad config was pushed” is passive evasion; “the config validator didn’t catch the missing field” is the right framing. Practice the third.
- Separation of consequences. Postmortems are not performance reviews. An engineer who caused a major incident might still get a good performance review if they handled it well, learned from it, and contributed to the fix. Conflating the two kills the culture.
Blameless does not mean consequence-free. A pattern of deliberate negligence is a different matter and is handled separately, in the normal HR process. The postmortem itself is blameless; the performance process is its own thing.
99.7 Five whys vs causal chains
Two methods for finding root causes. Five whys is the older and simpler: ask “why?” five times in a row.
The site went down. Why? The load balancer stopped routing. Why? The health check started failing. Why? The service returned 500s. Why? The database connection pool was exhausted. Why? A bad query held connections.
Five whys gives you a root cause, but it tends to find a single linear chain and miss the actual failure mode, which is usually multi-factor. Real incidents are almost never one cause; they’re a combination of several things happening together that individually would have been fine.
Causal chains (or “cause-effect diagrams,” or in safety literature, “Swiss cheese models”) represent the failure as a set of contributing factors that lined up. A failure happens when multiple holes in multiple layers of defense align.
For the same example:
- Proximate cause: a bad query held connections.
- Why was the query bad: a recent deploy introduced an N+1 pattern that hadn’t been caught in review.
- Why didn’t review catch it: the reviewer was unfamiliar with the code area.
- Why didn’t tests catch it: the unit tests didn’t exercise the production query pattern.
- Why didn’t the connection pool have headroom: the pool size was set years ago for an older traffic level.
- Why didn’t the health check catch the degradation earlier: the health check was a simple HTTP 200 on a cached response.
- Why didn’t the canary catch it: the canary didn’t run long enough to see pool exhaustion.
None of these alone would have caused the outage. All of them combined did. The action items address each factor: tighten review, expand tests, resize the pool with a formula, deeper health checks, longer canary bake times for database-related changes. Fix five layers, not one.
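The structural difference between the two methods is visible if you write them down as data: five whys is one linear chain that terminates in one fix, while the causal analysis is a set of factors, each carrying its own action item. A sketch using the example above (the strings are shorthand for the factors in the text):

```python
# Five whys: a single linear chain. It terminates in one root cause,
# which tends to produce one action item.
five_whys = [
    "site down", "LB stopped routing", "health check failing",
    "service returned 500s", "connection pool exhausted",
    "bad query held connections",
]

# Causal chain: a set of contributing factors, each mapped to its own
# action item. Fixing any one of them would have prevented this outage.
causal_factors = {
    "bad query held connections":      "tighten review for query patterns",
    "tests missed the N+1 pattern":    "add tests exercising production queries",
    "pool sized for old traffic":      "resize the pool from a traffic formula",
    "shallow health check":            "check a real query path, not a cached 200",
    "canary too short for exhaustion": "longer bake time for DB-related changes",
}
```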
The five-whys method would have landed on “the query was bad” and suggested “review queries more carefully.” Weak. The causal chain method surfaces the compound failure and lets you invest in systemic improvements.
The modern incident-response community has moved toward the causal-chain view, influenced by the Learning from Incidents (LFI) community and writers like John Allspaw; Richard Cook's "How Complex Systems Fail" is the seminal paper. Read it.
99.8 What makes a good postmortem
Six qualities:
(1) Honest. Doesn’t hide the embarrassing details. The shortcuts people took, the assumptions that were wrong, the alerts that were silenced. Everything.
(2) Specific. “Improve monitoring” is not an action item; it’s a prayer. “Add alert on connection-pool utilization > 80%” is an action item.
(3) Quantitative. Impact in minutes, users, revenue, SLI budget. Detection time. Response time. Recovery time. Hard numbers.
(4) Structural. Root cause is a system failure, not a person failure. Every action item addresses a layer in the failure chain.
(5) Readable. Someone who wasn’t in the incident can read the postmortem and understand what happened and why. Minimize jargon, define terms, link to dashboards.
(6) Followed up. Action items are tracked, owned, and reviewed. A postmortem that ends when the doc is published is wasted.
Common failure modes in postmortems:
- Hand-wavy action items. “We should consider improving X” — nobody will ever do this. Make it “Alice will add Prometheus alert X by Friday.”
- Skipping the timeline. “The service went down, we fixed it.” What did you observe? What did you try? What did you learn during the incident? The timeline is where the lessons live.
- Single root cause. As covered in §99.7, almost no incident has a single cause. Force yourself to find at least three contributing factors.
- Postmortem as performance review. “Alice caused this by pushing a bad config.” See §99.6.
- Action items never prioritized. A postmortem that closes with 15 action items and no priorities means none of them will get done.
- Postmortem culture of “we already know the answer.” The responder knows what went wrong at minute 5 and writes the postmortem in 10 minutes, missing the systemic factors. Slow down; write it after sleeping.
The SRE Workbook (chapter 10) has specific templates and examples. Read them. The worked examples are more instructive than the template itself.
99.9 Three case studies
Three brief, generic case studies — the kind of thing that shows up in a real-world incident postmortem. Presented as teaching examples.
Case 1: Kubernetes admission controller OOMing
Summary: A cluster-wide outage where all new pod creations started failing because the admission webhook was OOM-killing under load.
Timeline: A mid-afternoon deploy to an internal service triggered a large pod churn. Pod creation requests accumulated at the API server, each one going through an admission webhook that performed policy checks. The webhook’s memory grew linearly with in-flight requests and the pod had no memory request or limit. When the webhook OOMed, the API server started rejecting new pods because its admission path was broken.
Contributing factors:
- The admission webhook had no memory limit and no horizontal scaling.
- The deploy system created many pods quickly instead of rolling them slowly.
- The webhook’s code path allocated a full YAML representation for every request, instead of streaming.
- The failure mode (admission webhook failure) was not in any runbook because it had never been exercised.
Lessons:
- Every webhook pod should have memory requests/limits.
- Admission webhooks should be horizontally scaled and rate-limited.
- Runbooks should cover admission failures as a first-class scenario.
- Deploys should cap pod churn rate, not just max concurrent deploys.
Action items: Add memory limits to the webhook deployment. Add HPA based on in-flight requests. Refactor the webhook to stream instead of materialize. Add a runbook entry for admission failures. Cap pod churn in the deploy pipeline.
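The first of those action items would look something like this in the webhook's Deployment manifest — the numbers are placeholders to size from observed usage, not recommendations:

```yaml
# Illustrative resource settings for the admission webhook container.
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"   # OOM-kill this pod at a known bound, instead of
                      # letting it grow until it takes down admission
```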
Case 2: GitOps controller fighting an operator
Summary: A Redis custom resource was continuously reverting its own configuration back-and-forth for hours, causing intermittent connection failures.
Timeline: A GitOps controller applied a Redis CR from the Git repository, which set a specific maxmemory value. A Redis operator installed on the cluster also managed the CR, and it recomputed maxmemory based on the node’s available memory and wrote it back. The GitOps controller saw the change, detected drift from Git, and applied the original value. Loop. Each reconciliation caused a brief Redis restart.
Contributing factors:
- Two controllers both owned the same field of the same CR.
- Neither controller had a way to say “I own this field, the other should ignore it.”
- The reconciliation logs didn’t clearly show the loop; it looked like normal reconciliation activity.
- Monitoring didn’t alert on high reconciliation rates.
Lessons:
- Multi-controller ownership of CRs requires explicit coordination (e.g., managed-fields in Kubernetes Server-Side Apply).
- High reconciliation rates are an alertable signal.
- GitOps controllers should not manage fields that are operator-computed.
Action items: Switch the CR to Server-Side Apply with explicit field ownership. Alert on reconciliation rate > N per minute. Document which fields are owned by which controller.
Case 3: Ingress proxy buffer truncating long responses
Summary: A subset of API responses returning long bodies were being silently truncated, causing downstream parsing errors.
Timeline: A user reported JSON parse errors on certain requests. Investigation showed the response body was cut off at exactly 64 KB. The ingress proxy had a default buffer size of 64 KB for response bodies; larger bodies were being truncated instead of streamed. The feature had existed for years but was only triggered now because a new API endpoint returned larger payloads.
Contributing factors:
- The proxy’s default configuration did not match the application’s requirements.
- There was no integration test for large response bodies.
- The truncation happened silently at the proxy level, without logging an error.
- The application team did not know the proxy had a response body limit at all.
Lessons:
- Infra components with default limits should be documented per-service.
- Silent truncation is a worst-case failure mode; configure proxies to error instead.
- Integration tests should exercise realistic payload sizes.
Action items: Raise the proxy’s buffer limit and configure it to error on overflow. Add integration tests with large payloads. Document the proxy’s limits in the service’s README. Audit other services for similar exposure.
Each case shows the same pattern: a proximate cause, multiple contributing factors, and systemic action items across layers. None of them blame a person. Each one, if the postmortem is done well, makes the system incrementally more robust against a class of failure.
99.10 The review meeting
The postmortem review meeting is where the learning is amplified. A week or so after the incident, the team (and often representatives from related teams) meets to read the postmortem together and discuss.
Agenda:
- Walk the timeline. The author reads it aloud or the group reads silently. Questions welcome.
- Discuss the contributing factors. Did we miss any? Are they the right ones?
- Discuss the action items. Are they the right ones? Are they owned and prioritized?
- Identify patterns. Does this incident look like other recent incidents? Is there a class of issue the organization keeps hitting?
- Share the write-up. Post the postmortem to a company-wide channel or mailing list for visibility.
The meeting is deliberately low-pressure. Nobody is on trial. The goal is collective learning, and the tone is “let’s understand this together.” Senior engineers should participate and model the blameless discussion style.
The meeting’s outputs:
- A postmortem that’s been reviewed and updated.
- Action items in the ticket tracker.
- Shared understanding across the team.
- Occasionally, a recognition that several recent incidents share a common theme — which becomes a larger initiative.
Teams that skip the review meeting end up with postmortem documents that sit unread. The meeting is the forcing function that makes the document matter.
99.11 The mental model
Eight points to take into Chapter 100:
- An incident is an active situation requiring coordinated human response. Classify by severity and staff accordingly.
- The Incident Commander coordinates but does not fix. Separate the roles.
- Name the response roles — IC, responder, scribe, comms. Everyone knows what they’re doing.
- Modern incident tools (Rootly, incident.io) are Slack-first and radically better than legacy PagerDuty-style workflows.
- The postmortem is the payoff. Summary, timeline, impact, root cause, action items, lessons.
- Blameless culture is non-negotiable. Human errors trace to systems that allowed them. Leadership has to model it.
- Causal chains beat five whys. Real incidents have multiple contributing factors.
- Action items are owned, prioritized, tracked, and reviewed. A postmortem without follow-through is waste.
In Chapter 100, we close Part VIII with the ORR — the checklist that tries to prevent the incident in the first place.
Read it yourself
- Beyer et al. (eds.), Site Reliability Engineering (Google, 2016). Chapters 14-15 on managing incidents and postmortems.
- Beyer et al. (eds.), The Site Reliability Workbook (Google, 2018). Chapter 9 (Incident Response) and 10 (Postmortem Culture).
- Richard Cook, “How Complex Systems Fail” (1998). Short, famous, essential.
- John Allspaw’s work at Adaptive Capacity Labs, especially “Blameless Post-Mortems and a Just Culture.”
- The Field Guide to Understanding Human Error, Sidney Dekker. The foundational text on the new view of human error.
- The Learning from Incidents community (learningfromincidents.io).
Practice
- Write down the roles required for a SEV1 incident response and a one-sentence responsibility for each.
- You’re running an incident as IC. A senior engineer is head-down debugging and not responding to your coordination messages. What do you do?
- Rewrite this blame statement in blameless form: “Alice rolled out a bad config on a Friday afternoon and broke production.”
- Apply five-whys and a causal-chain analysis to one of the three case studies in §99.9. Which method surfaces more action items?
- Write a one-page postmortem summary for a hypothetical 30-minute outage caused by a bad database migration. Include impact, timeline, root cause, and three action items.
- A postmortem has 15 action items and no priorities. What’s going to happen to them? What do you do about it?
- Stretch: Run a tabletop exercise with your team: pick a hypothetical incident (GPU node fails during peak traffic) and walk through the full response process, assigning roles and writing a timeline as you go.