SLI, SLO, SLA, error budgets
"100% is the wrong reliability target for everything. Including the pacemaker."
SLI, SLO, SLA, and error budgets are the vocabulary Google SRE gave the rest of the industry for thinking about reliability as an engineering constraint instead of a vibe. They aren’t academic jargon — they’re the mechanism by which a team decides whether to ship the next feature or pause and fix the last incident. Without them, “reliability vs velocity” is a political argument. With them, it’s arithmetic.
This chapter walks the full chain: what makes a good SLI (service level indicator), how to choose an SLO target (service level objective), what an SLA (service level agreement) actually is and why you should hate it, the error budget as a forcing function, and the math that makes it all work. Chapters 90-94 taught you how to measure; this chapter is how to turn measurements into commitments.
Outline:
- The vocabulary: SLI, SLO, SLA.
- Defining SLIs that matter.
- Choosing an SLO target.
- The error budget.
- Burn rate alerts.
- The 99.9% conversation.
- SLOs for ML systems specifically.
- When to stop shipping.
- Composite SLOs and dependency math.
- Common anti-patterns.
- The mental model.
97.1 The vocabulary: SLI, SLO, SLA
SLI — Service Level Indicator. A measurement of some aspect of service behavior. “The fraction of requests served successfully in the last 30 days.” “The 99th percentile TTFT over the last 5 minutes.” An SLI is a number that comes out of a metrics or trace system.
SLO — Service Level Objective. A target for an SLI, internal to the team. “99.9% of requests over 30 days will succeed.” “p99 TTFT ≤ 500 ms over a 5-minute window.” An SLO is a promise the team makes to itself about the SLI.
SLA — Service Level Agreement. A contractual commitment to an external party, usually with financial penalties for violation. “99.95% uptime or we credit 10% of the monthly bill.” An SLA is almost always looser than the internal SLO (the team wants the internal SLO to be tighter so they have a buffer before the SLA bites).
The distinction matters because they serve different purposes. The SLI is the fact. The SLO is the engineering target. The SLA is the lawyer’s concern.
Teams get these confused all the time. “Our SLA is 99.9%” usually means “our SLO is 99.9% and we don’t actually have an SLA at all.” “We’re violating our SLO” usually means “the SLI number dropped below the target.” Internalize the three-word taxonomy and you’ll be clearer than most.
97.2 Defining SLIs that matter
Most teams’ first SLI attempt is “uptime.” This is wrong. “Uptime” is not a user-visible concept — a service that responds but returns 500s is “up” by any ping check. The right SLI is user-visible: did the request succeed, and was it fast enough?
The SRE book formalizes this with the concept of “good events vs total events”:
SLI = count(good events) / count(valid events)
For a request-oriented service, a typical SLI:
success_rate = count(requests with status=2xx and duration < threshold)
               / count(valid requests)
The “valid events” denominator excludes things that weren’t the service’s fault — requests canceled by the client before reply, requests the load balancer rejected for malformed input, health checks, synthetic probes. You want the denominator to reflect “real user requests that this service was supposed to handle.”
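As a minimal sketch of the good-events/valid-events computation, assuming hypothetical per-request records (the field names `source`, `status`, `duration_ms`, and `client_canceled` are illustrative, not from any particular system):

```python
def availability_sli(requests, latency_threshold_ms=500):
    """SLI = good events / valid events for a request-oriented service."""
    # Valid: real user requests this service was supposed to handle.
    valid = [r for r in requests
             if r["source"] == "user"           # drop probes / health checks
             and not r.get("client_canceled")]  # drop client-side cancels
    # Good: succeeded AND was fast enough.
    good = [r for r in valid
            if 200 <= r["status"] < 300
            and r["duration_ms"] < latency_threshold_ms]
    return len(good) / len(valid) if valid else 1.0

requests = [
    {"source": "user",  "status": 200, "duration_ms": 120},
    {"source": "user",  "status": 500, "duration_ms": 30},
    {"source": "probe", "status": 200, "duration_ms": 5},
]
print(availability_sli(requests))  # 1 good / 2 valid = 0.5
```

Note that the synthetic probe never enters the denominator, and the fast-but-failed request never enters the numerator.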
A good SLI has four properties:
- Measurable at the user-facing boundary. Measured at the load balancer, the API gateway, or the CDN — not at the pod. A request that times out at the gateway but “succeeds” on the pod is a failure from the user’s perspective.
- Observable continuously. Derived from logged events, not from synthetic probes alone. Synthetic probes are useful for coverage, but the primary SLI should be real traffic.
- Aligned with user experience. If users care about latency, measure latency. If they care about answer correctness, measure that (which is hard, but not impossible — see §97.7).
- Stable when nothing is wrong. An SLI that fluctuates 5% when the system is healthy is too noisy to set a target on.
The usual shortlist of SLIs for a service:
- Availability — fraction of requests returning a success status code.
- Latency — fraction of requests returning in under a threshold (e.g., <500 ms). Typically measured as “count of fast successful requests / count of all valid requests.”
- Throughput — for systems where “serving all the load” matters more than “serving each request fast.”
- Correctness — for ML or data systems, fraction of requests with the right answer. Usually a sampled eval.
- Durability — for storage systems, fraction of data preserved.
- Freshness — for pipelines, fraction of data up to date within a window.
Pick 2-4 SLIs per service. More than that and nobody looks at them.
97.3 Choosing an SLO target
This is the most contested part of the whole framework. What number do you pick?
The wrong answer: “as close to 100% as possible.” 100% is infeasible and pursuing it costs infinite engineering effort for diminishing returns. The SRE book states this bluntly: 100% is the wrong reliability target for everything. Even pacemakers (which are designed to be extremely reliable) do not target 100%; they target 99.99% or so, engineered with battery monitoring and failover modes to handle the residual failures.
The right answer: pick a target no tighter than what the path between you and your users can deliver. If your users are on flaky mobile networks that drop 0.5% of requests on average, spending months to improve from 99.9% to 99.95% gains them nothing — they can’t distinguish your reliability from the network’s. Match the user’s actual experience.
The classic target ladder:
| SLO | Allowed downtime per 30 days | Allowed downtime per year |
|---|---|---|
| 99% | 7h 12m | 87h 36m |
| 99.5% | 3h 36m | 43h 48m |
| 99.9% (“three nines”) | 43m 12s | 8h 45m |
| 99.95% | 21m 36s | 4h 22m |
| 99.99% (“four nines”) | 4m 19s | 52m 34s |
| 99.999% (“five nines”) | 26s | 5m 15s |
Each additional nine costs roughly 10× more engineering effort. The jump from 99% to 99.9% is painful; from 99.9% to 99.99% is excruciating; from 99.99% to 99.999% is nearly impossible for a typical software team and usually reserved for specialized infrastructure (DNS, network backbones).
The rule of thumb: start at 99% or 99.5% for new services, raise as the service matures, and don’t go above 99.95% unless you have a specific reason. Most application services target 99.9% for user-facing paths and 99.5% for internal dependencies.
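The table rows above are just arithmetic on 1 − SLO. A small helper, as a sketch, makes the conversion explicit for any target and window:

```python
def allowed_downtime(slo: float, window_days: float) -> str:
    """Convert an availability SLO into allowed downtime over a window."""
    seconds = (1 - slo) * window_days * 24 * 3600
    h, rem = divmod(round(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s"

for slo in (0.99, 0.999, 0.9999):
    print(slo, allowed_downtime(slo, 30), allowed_downtime(slo, 365))
# 0.999 over 30 days -> 0h 43m 12s, matching the "three nines" row
```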
97.4 The error budget
Given an SLO, the error budget is the amount of unreliability the team is allowed to produce before the SLO is violated. It’s 1 - SLO, expressed as an allowance over the SLO window:
error_budget = (1 - SLO) × total_events_in_window
For an SLO of 99.9% over 30 days at 1M requests/day, the budget is:
budget = 0.001 × 30M = 30,000 requests of failure
The team can “spend” those 30,000 failures however they want: one big outage, many small ones, planned maintenance, risky deployments. When the budget is gone, the SLO is violated and the team has to stop risky work until the budget replenishes.
This is the forcing function. Error budget says: reliability is a fixed resource, not an unlimited good. If you blew your budget on a bad deploy, you don’t get to also ship the risky feature this week. You fix the bad deploy, let the budget regenerate, and then ship.
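The arithmetic from the example above, as a sketch (the 12,000-failure figure is an illustrative assumption, not from the text):

```python
# 99.9% SLO, 30-day window, 1M requests/day -- the example above.
slo = 0.999
requests_per_day = 1_000_000
window_days = 30

total = requests_per_day * window_days
budget = (1 - slo) * total
print(round(budget))  # 30000 failed requests allowed in the window

# Fraction of budget remaining after some observed failures (hypothetical):
failures_so_far = 12_000
remaining = 1 - failures_so_far / budget
print(f"{remaining:.0%} of the budget left")  # 60% of the budget left
```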
The practical policy implications:
- Budget-healthy state (plenty of budget left): normal shipping, normal risk.
- Budget-low state (< 25% remaining): slow down, extra review, freeze non-critical work.
- Budget-exhausted state (budget gone): full freeze on risky changes, focus on reliability work until budget replenishes.
This policy is documented, automated where possible (CI gates, deployment pauses), and agreed upon before any incident happens. Agreeing in the middle of an incident is too late — emotions are high and the argument is political.
97.5 Burn rate alerts
A naive alert on SLO: “fire when the SLI drops below the target.” This is wrong — it only fires when you’ve already violated the SLO, which is too late. The right alert fires when you’re on a trajectory that will violate the SLO.
Burn rate is the ratio at which the error budget is being consumed compared to the allowed rate. A burn rate of 1 means you’re using budget at exactly the rate that keeps you at the SLO. A burn rate of 10 means you’re using it 10× faster — if you sustain that, you’ll burn through a month’s budget in 3 days.
Google’s multi-window multi-burn-rate alerting, from the SRE Workbook:
- Fast burn: burn rate > 14.4 for 1-hour window → alert immediately (pages on-call). Consumes 2% of a monthly budget in an hour.
- Slow burn: burn rate > 6 for 6-hour window → ticket (doesn’t page). Consumes 5% of a monthly budget in 6 hours — significant but not immediately paging.
The specific numbers are chosen so that fast burn triggers in time to save most of the budget, while slow burn catches slow regressions that would otherwise quietly eat the budget.
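Burn rate is just the observed error rate divided by the budgeted error rate (1 − SLO). A sketch of the arithmetic behind the 14.4 threshold:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Ratio of observed error rate to the budgeted rate (1 - SLO)."""
    return error_rate / (1 - slo)

# With a 99.9% SLO, a sustained 1.44% error rate is a burn rate of 14.4,
# the SRE Workbook fast-burn threshold:
print(round(burn_rate(0.0144, 0.999), 1))  # 14.4

# Sustaining burn rate 14.4 for 1 hour of a 30-day (720-hour) window
# consumes 14.4/720 of the budget:
print(round(14.4 / (30 * 24), 4))  # 0.02 -> 2% of the monthly budget
```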
Prometheus rules for burn rate alerts look like:
- alert: HighErrorBudgetBurnRate
  expr: |
    (
      sum(rate(http_requests_total{job="api",status=~"5.."}[1h]))
      / sum(rate(http_requests_total{job="api"}[1h]))
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
      / sum(rate(http_requests_total{job="api"}[5m]))
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: page
The "and" across two windows (1h and 5m) prevents spurious alerts from brief blips — the burn has to be sustained at both scales.
The mental model: alert on burn rate, not on the SLI itself. Rate tells you the trajectory. The SLI tells you the position. Trajectory is what matters for response time.
97.6 The 99.9% conversation
Teams negotiate SLO targets. The conversations are usually the same and they’re usually wrong in predictable ways.
Wrong: “Let’s target 99.99%.”
Usually proposed by someone who thinks higher is better and hasn’t priced it. 99.99% means 4m 19s of allowed failure per month. A single deploy that causes a 5-minute blip violates the SLO. A single minor bug that takes 10 minutes to roll back violates it. To actually meet 99.99% you need blue/green deploys, automated rollback on error-rate regression, pre-production canaries that match production load, and a culture of not breaking things. That’s a multi-quarter investment and it trades off against feature velocity.
Wrong: “Let’s target whatever our SLA is.”
If your SLA is 99.9%, your internal SLO should be tighter — 99.95% or so — so you have room to miss and still not trigger the SLA’s penalties. Making them equal is giving yourself zero buffer against legal exposure.
Wrong: “We’ll just target 100% aspirationally.”
Aspirational SLOs don’t work. People either ignore them (because they’re impossible) or they burn out trying to meet them. Pick a number that is achievable with reasonable effort, and take it seriously.
Right: “What number reflects actual user experience without us over-investing?”
Look at the SLI history, look at user complaint data, look at what your upstream dependencies commit to. Pick a number roughly one “nine” looser than what your dependencies commit to (so their occasional flakiness doesn’t automatically violate you) and below the level where extra effort would go unappreciated by users.
For most application services, that number lands around 99.9%. For user-facing chat with inference, 99.5% - 99.9% for availability; latency SLOs are often more like “p99 TTFT < 500 ms 95% of the time.”
97.7 SLOs for ML systems specifically
ML systems break the classical SLI/SLO model in three ways:
(1) Correctness is not binary. A traditional service returns 200 or 500. An LLM returns text, and the question “was that text correct?” is squishy. You can’t put it on a Prometheus histogram by default.
The workarounds:
- Proxy metrics — refusal rate, response length, retry rate, thumbs-down rate. None of them measure correctness directly but all correlate.
- Online evals — run a small fraction of requests through an eval model that scores the output. Alert when the score distribution drops.
- Shadow tests — periodically run a known-answer set through production and check that the answers match expectations.
None of these are perfect. Most production ML systems have explicit SLOs for availability and latency, and softer “correctness” monitoring that’s more like a dashboard than a hard SLO.
(2) Latency is two numbers, not one. TTFT (time to first token) and TPOT (time per output token) are separately observable and separately tunable. A single “request duration” SLO hides important behavior. Write separate SLOs:
SLO: p95 TTFT ≤ 800 ms
SLO: p95 TPOT ≤ 50 ms
SLO: availability (2xx requests that start streaming) ≥ 99.5%
(3) “Success” depends on request type. A short factoid request has different latency expectations than a 50k-token document summary. If you mix both, your p95 latency is meaningless. Partition the SLI by request type (short/long, streaming/non-streaming) and set different SLOs for each bucket.
The ML-serving SLO pattern that works: separate SLOs for availability, TTFT, TPOT, partitioned by model and request type, plus a soft eval-based quality metric tracked as a dashboard but not a hard gate. It’s more complex than a single “99.9% uptime” number, but it’s honest about what the system actually does.
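As a sketch of the partitioned latency SLIs, assuming hypothetical per-request records with `ttft_ms`, `tpot_ms`, and a `request_type` bucket (a naive nearest-rank percentile stands in for a real histogram):

```python
from collections import defaultdict

def percentile(values, p):
    """Naive nearest-rank percentile; a metrics backend would do this."""
    s = sorted(values)
    idx = min(len(s) - 1, int(p / 100 * len(s)))
    return s[idx]

def latency_slis(requests):
    """Per-request-type p95 TTFT and TPOT, so short and long requests
    never share one meaningless latency number."""
    by_type = defaultdict(list)
    for r in requests:
        by_type[r["request_type"]].append(r)
    return {
        rtype: {
            "p95_ttft_ms": percentile([r["ttft_ms"] for r in rs], 95),
            "p95_tpot_ms": percentile([r["tpot_ms"] for r in rs], 95),
        }
        for rtype, rs in by_type.items()
    }
```

Each bucket’s p95 values would then be checked against its own SLO target (e.g., 800 ms TTFT for short requests, a looser target for long-document requests).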
97.8 When to stop shipping
The error budget’s most important function is to tell a team when to stop. The policy is usually:
- Budget > 50%: ship normally.
- Budget 25-50%: reduce risky work, extra review on risky PRs.
- Budget 10-25%: freeze non-critical risky changes. Focus on reliability work.
- Budget < 10%: full feature freeze. Only reliability fixes, security patches, and approved exceptions.
- Budget exhausted: declare an incident, run a full retro, block any risky work until the budget replenishes through the rolling window.
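The tiered policy above reduces to a lookup on the fraction of budget remaining. A sketch, so it can live in a CI gate or deploy pipeline rather than an argument:

```python
def shipping_policy(budget_remaining: float) -> str:
    """Map remaining error-budget fraction to the pre-agreed policy tier."""
    if budget_remaining <= 0:
        return "incident: full freeze until budget replenishes"
    if budget_remaining < 0.10:
        return "feature freeze: reliability fixes and security patches only"
    if budget_remaining < 0.25:
        return "freeze non-critical risky changes, focus on reliability"
    if budget_remaining < 0.50:
        return "reduce risky work, extra review on risky PRs"
    return "ship normally"

print(shipping_policy(0.60))  # ship normally
print(shipping_policy(0.05))  # feature freeze: ...
```

The point of encoding it is that the freeze fires as a rule, agreed before any incident, not as a negotiation during one.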
The tricky part is that this cuts against feature pressure. A PM who promised a feature by end of quarter doesn’t want to hear “we burned the error budget so we’re freezing.” The team has to pre-agree on the policy, with management buy-in, so that when the freeze hits, it’s a rule rather than a negotiation.
When the policy works, it creates the right incentives. Engineers who would have cut corners on reliability start investing in it because they know budget exhaustion means freeze. PMs who would have pushed for speed start appreciating stability because they’ve seen a freeze eat a quarter. Management starts valuing reliability work because they can quantify the cost of not doing it.
When the policy doesn’t work, it’s usually because management overrides it during budget exhaustion (“we have to ship this feature anyway”). One override is fine; a pattern of overrides makes the SLO meaningless. The team learns the policy is fake and stops enforcing it, the reliability work stops, and the next major incident is worse. Don’t override the freeze. That’s the entire point.
97.9 Composite SLOs and dependency math
A service depends on other services. Its achievable reliability is bounded by theirs. The math:
If service A calls services B and C sequentially, and both must succeed, the availability of A is at most:
avail(A) ≤ avail(B) × avail(C)
Two dependencies at 99.9% each give a 99.8% ceiling on the dependent service. Three dependencies at 99.9% give 99.7%. A service with 10 sequential critical dependencies each at 99.9% can’t exceed 99.0% — the compounded unreliability eats it.
The implications:
- You cannot commit to a tighter SLO than your dependencies allow, unless you can route around them (retries, circuit breakers, graceful degradation).
- You should know the SLOs of your critical dependencies. If you don’t know, you’re flying blind — and your dependencies may have no SLO at all, in which case you’re trusting their luck.
- Reliability patterns that hide failures (retries, caches, fallback data, graceful degradation) let you exceed the naive ceiling. A service that retries once against an idempotent dependency fails only when both attempts fail, which squares the dependency’s error rate:
1 - (1 - avail)² = 1 - error², which for a 99.9% dependency gives 99.9999%. This is why retries are so valuable.
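Both directions of the dependency math, as a sketch:

```python
import math

def ceiling(dep_availabilities):
    """Availability ceiling with sequential hard dependencies:
    every dependency must succeed, so availabilities multiply."""
    return math.prod(dep_availabilities)

def with_retry(avail):
    """One retry against an idempotent dependency: the call fails
    only if both attempts fail, so the error rate is squared."""
    return 1 - (1 - avail) ** 2

print(ceiling([0.999, 0.999]))  # ~0.998  -> two deps cap you at 99.8%
print(ceiling([0.999] * 10))    # ~0.990  -> ten deps cap you at ~99.0%
print(with_retry(0.999))        # ~0.999999 -> six nines from one retry
```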
For ML systems specifically, the dependency chain usually includes: API gateway, auth, routing, serving, GPU hardware, model cache, external LLM provider (if proxying), databases, feature store. Each has its own availability. The product-facing availability is bounded by the compound. Work out the chain and you’ll often find one link that’s the weak point — usually a link you didn’t think was critical.
97.10 Common anti-patterns
(1) SLOs on the wrong boundary. Measuring availability at the pod instead of the edge. A pod can be fine while users get 500s from the LB. Always measure at the user-facing boundary.
(2) Error budget as a punishment, not a tool. Treating budget exhaustion as “the reliability team’s fault” instead of “a shared signal that the team has to slow down.” The point of the budget is to align incentives, not to blame.
(3) SLOs that never burn. If your SLO is 99% but your SLI is consistently 99.99%, the SLO is too loose and gives you no signal. Tighten until it occasionally bites.
(4) SLOs that always burn. If the SLO is 99.9% but you consistently deliver 99.5%, you’re either going to violate it every month (meaningless) or you need to lower the SLO. Pick a number you can actually hit.
(5) Alerting on the SLI directly. As covered in §97.5, alert on burn rate, not the raw SLI value.
(6) Ignoring error budget exhaustion. “We blew the budget but we have to ship anyway” — a few times is fine, a pattern destroys the framework.
(7) Measuring only successful request latency. Failed requests that return fast pull the latency numbers down. Measure “fast successful requests” specifically.
(8) Using SLA as SLO. Your contractual SLA should be looser than your internal SLO. Making them equal gives you zero buffer against legal exposure.
97.11 The mental model
Eight points to take into Chapter 98:
- SLI is the measurement, SLO is the target, SLA is the contract. Know which is which.
- Good SLIs are measured at the user boundary, exclude invalid events, and reflect what users actually experience.
- Choose SLO targets based on user experience, not aspiration. 99.9% is a common default; higher costs exponentially more.
- Error budget = (1 - SLO) × total events. It’s a finite resource the team spends on risk.
- Alert on burn rate, not on raw SLI. Multi-window (fast + slow burn) to catch both spikes and regressions.
- ML systems need split SLOs — availability, TTFT, TPOT, optionally a soft quality metric, all partitioned by request type.
- Stop shipping when the budget is exhausted. The freeze is the point.
- Dependency math compounds. Your achievable SLO is bounded by your dependencies’ product.
In Chapter 98, we look at the concrete mechanism for using error budgets responsibly during rollouts: canary patterns.
Read it yourself
- Beyer, Jones, Petoff, Murphy (eds.), Site Reliability Engineering (Google, 2016). Chapter 4 is the canonical SLI/SLO/SLA chapter.
- Beyer, Murphy, Rensin, Kawahara, Thorne (eds.), The Site Reliability Workbook (Google, 2018). Chapters 2-5 are the practical guide. The burn-rate alerting math is in chapter 5.
- Alex Hidalgo, Implementing Service Level Objectives (O’Reilly, 2020). The book-length treatment of SLO implementation in real teams.
- The OpenSLO spec (github.com/OpenSLO) — a YAML schema for machine-readable SLO definitions.
- Niall Murphy’s blog posts on SLO math, especially composite SLOs and error budget policies.
- “The Calculus of Service Availability” (ACM Queue) — for the dependency chain math.
Practice
- For a chat API that must return a first token within 1 second, write an SLI and a candidate SLO. What’s the good-event definition?
- Convert an SLO of 99.95% into allowed downtime per month and per year.
- A service burned 25% of its monthly error budget in the first 3 days. What’s the burn rate? Is this a problem?
- Write a Prometheus alert expression for a multi-window burn rate on http_requests_total, with a fast burn window of 5m and a slow burn window of 1h.
- For a service whose two critical dependencies each have a 99.9% availability SLO, what’s the achievable availability of the service (without retries)? With a single retry on an idempotent path?
- Argue why setting an SLO of 99.99% for a new service is almost always wrong.
- Stretch: Define a three-SLI SLO for a vLLM deployment: availability, TTFT, TPOT. Pick targets, write Prometheus expressions, and write burn-rate alerts for each.