The four golden signals, RED, USE, LETS
"If you can't pick four numbers that tell you the system is healthy, you don't understand the system"
Observability frameworks exist because the space of things you can measure is infinite and most of them are useless. A running production system emits thousands of metrics per replica per minute — CPU counters, kernel stats, JVM internals, allocator histograms, per-endpoint latencies, queue depths, cache hit rates. Staring at all of them is worse than staring at none; the signal drowns in the noise. Frameworks like the four golden signals, RED, USE, and LETS exist to compress “how is this thing doing?” into a handful of numbers you can actually look at.
This chapter walks the four major frameworks side by side. What each one optimizes for, where each one fails, and — more importantly — which perspective (system, service, request) each one speaks to. After reading this, the rest of Part VIII has the right vocabulary: when Chapter 93 talks about Prometheus metrics, Chapter 97 talks about SLIs, and Chapter 100 talks about ORR reviews, the framework boundaries should already be clear in your head.
The punchline upfront: these frameworks are not competing answers to the same question. They are different questions asked from different vantage points. A mature team uses all of them, on different parts of the stack, at the same time.
Outline:
- Why observability frameworks exist.
- Google SRE’s four golden signals.
- Brendan Gregg’s USE method.
- Tom Wilkie’s RED method.
- The LETS framework.
- System vs service vs request perspectives.
- How the frameworks compose on a real stack.
- Applying frameworks to ML serving specifically.
- Pitfalls — when frameworks mislead.
- The mental model.
92.1 Why observability frameworks exist
A production system has too many possible metrics. A single vLLM pod exports ~200 metrics out of the box. A Kubernetes node exports several hundred via node_exporter. A GPU node adds another ~50 via dcgm_exporter. A service mesh sidecar adds ~80. An API gateway adds ~60. Multiply by replica count and you are looking at tens of thousands of distinct time series per cluster.
The problem is not “we don’t have enough data.” The problem is “we have too much data and can’t tell which ten numbers matter right now.” Dashboards with 200 panels are not dashboards; they are wallpaper. Nobody reads them. When an incident fires, the on-call engineer has five minutes to find the broken thing, and “scroll through 200 graphs” is not a plan.
The observability frameworks are decision rules for which metrics to promote to first-class status. They say: “out of all the metrics you could graph, graph these four (or three, or five). Put them at the top. Alert on them. Review them on every postmortem. Everything else is diagnostic detail you only drill into after the top-level signals tell you where to look.”
There is no single right framework because different layers of the stack have different natural decompositions. A CPU has a natural framework (how busy, how saturated, how many errors). A request-oriented service has a different one (how many, how fast, how many failed). A queue has a third (how deep, how fast it drains, how long items wait). The frameworks we are about to cover each pick one of these natural decompositions and name it.
92.2 Google SRE’s four golden signals
The Google SRE book (Beyer et al., 2016) popularized the four golden signals as “the four things to monitor for any user-facing service”:
- Latency — the time it takes to service a request. Measure the distribution, not the mean: p50, p95, p99, p99.9. Separately measure successful request latency and failed request latency; a 5xx that returns in 10 ms can poison your average if you mix them.
- Traffic — a measure of how much demand is placed on the system. For a web service, usually HTTP requests per second. For a streaming pipeline, messages per second. For an LLM serving stack, it’s requests per second, prompt tokens per second, or output tokens per second — more on this in §92.8.
- Errors — the rate of requests that fail. Explicit failures (5xx, exceptions) and implicit failures (200 OK with a wrong answer, slow responses that exceed the client’s timeout). The implicit category is the one that bites you.
- Saturation — how “full” the service is. For a CPU-bound service, maybe CPU utilization. For a queue-driven service, queue depth. For an LLM serving stack, KV cache utilization and num_requests_waiting. Saturation is the leading indicator — by the time latency and error rate show degradation, saturation has been rising for a while.
The framework is deliberately service-oriented, not host-oriented. It assumes the thing you are monitoring is a service that accepts requests and returns responses. It does not say anything about disk I/O or memory pressure on a specific node; those are diagnostic details, not top-level signals.
Why these four and not others? Because together they answer three questions any on-call engineer needs to answer in the first thirty seconds: are we serving traffic, are we serving it correctly, and are we about to fall over? Latency and errors cover “correctly.” Traffic covers “are we serving.” Saturation covers “about to fall over.” You can argue for swapping saturation with something else (more on this in §92.5, LETS), but the shape is right.
The most common mistake with golden signals is to measure only the averages. Latency is a distribution, not a number. A service with a p50 of 50 ms and a p99.9 of 30 seconds is a broken service even though the average might look fine. Always report percentiles, and pick the percentile that matches your SLO (Chapter 97).
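A sketch of the successful-vs-failed latency split in PromQL, assuming the duration histogram carries a status-code label (the same labeling convention the RED queries in §92.4 use):

```promql
# p99 of successful requests only
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{status=~"2.."}[5m])) by (le))

# p99 of failed requests, kept separate so fast 5xx responses
# can't flatter the success-side picture (or slow 504s poison it)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{status=~"5.."}[5m])) by (le))
```

Two panels side by side, same axis, and the “our errors are fast, our successes are slow” failure mode becomes visible at a glance.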
92.3 Brendan Gregg’s USE method
Brendan Gregg’s USE method (2012) is the counterpart to the golden signals for hosts, resources, and low-level system performance. It predates the golden signals in spirit and comes from the Solaris/Linux performance engineering world. The three letters:
- Utilization — the fraction of time the resource was busy servicing work. A CPU at 80% utilization is busy 80% of the time. A disk at 95% utilization spends 95% of its time servicing I/O.
- Saturation — the amount of work the resource cannot service, usually queued. For a CPU, the run queue length. For a disk, the I/O queue depth. For a network interface, buffer overruns.
- Errors — the count of error events. For a disk, I/O errors. For a network interface, dropped packets, CRC errors. For memory, ECC errors.
USE is resource-oriented. You apply it to every hardware resource on the host: CPU, memory, disk, network, GPU, and so on. For each, you ask: utilization, saturation, errors. If all three are clean across all resources, the host is not the bottleneck — go look elsewhere.
The USE method shines in two places. First, when investigating a host that looks unhealthy but you don’t know where. Second, when building capacity plans: USE numbers let you project when a resource will become the bottleneck under load growth.
One subtle point: utilization does not imply saturation. A resource can be at 100% utilization without being saturated (if there is no queued work behind it). And it can be saturated without being at 100% utilization (if the scheduler hasn’t yet pushed enough work through — think bursty workloads). The two numbers tell different stories and you need both.
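The utilization/saturation distinction shows up directly in how you query a CPU. A sketch with node_exporter defaults (CPUs have no standard error counter, so the E column lives in kernel logs and MCE counters instead):

```promql
# Utilization: fraction of time the node's CPUs were not idle
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Saturation: 1-minute load average relative to core count;
# a ratio above 1 means runnable work is queuing behind the CPUs
node_load1 / count by (instance) (node_cpu_seconds_total{mode="idle"})
```

The first query can sit at 1.0 while the second stays below 1 (busy but not saturated), and vice versa under bursty load.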
For GPUs specifically (Chapter 93 covers the exporter), the USE view looks like:
- Utilization: DCGM_FI_DEV_GPU_UTIL — fraction of time the SMs were busy doing work.
- Saturation: DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL for memory saturation; there is no clean signal for compute saturation (the closest thing is num_requests_waiting from the serving stack).
- Errors: DCGM_FI_DEV_XID_ERRORS — the most recent XID error reported by the Nvidia driver. Non-zero here is always a serious signal.
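As PromQL, a minimal GPU USE row might look like the following (DCGM field names as exported by dcgm_exporter; which fields are exported depends on your exporter config, so verify against your deployment):

```promql
# Utilization: average SM busy-ness per GPU
avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)

# Saturation: framebuffer fullness per GPU
DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL

# Errors: any XID error is page-worthy
DCGM_FI_DEV_XID_ERRORS != 0
```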
92.4 Tom Wilkie’s RED method
Tom Wilkie (a longtime Prometheus contributor who later became CTO of Grafana Labs) proposed RED as a simpler framing for microservices; the widely cited write-up is a 2018 Grafana Labs blog post:
- Rate — requests per second.
- Errors — rate of failed requests.
- Duration — distribution of request latency.
Notice what’s missing: saturation. RED is three letters, not four. The rationale: at the microservice level, in a horizontally scaled world, saturation is a property of the underlying runtime (pod, node, GPU) that is usually managed by a separate system (autoscaler, scheduler). If your scaling works correctly, you shouldn’t need to think about saturation at the service level — the autoscaler thinks about it for you.
This is a deliberate simplification. It trades off “complete” for “usable.” Three metrics per service, applied uniformly across every service in the fleet, gives you a dashboard where every row looks the same: rate, errors, duration, one row per service. It’s visually digestible and it maps directly onto distributed tracing (Chapter 95) — each span has a rate (how often the operation runs), an error count, and a duration.
RED is the right framework when you have many small services and want a uniform way to look at them. It’s less useful for a handful of monolithic services, where the details inside each one matter more than the cross-service consistency.
The canonical implementation is the http_requests_total counter plus the http_request_duration_seconds histogram, both labeled by service, endpoint, and status code. From those two metric families alone you can compute R, E, and D with PromQL:
```promql
# Rate
sum(rate(http_requests_total{service="checkout"}[5m])) by (endpoint)

# Errors (as a ratio of total)
  sum(rate(http_requests_total{service="checkout",status=~"5.."}[5m])) by (endpoint)
/ sum(rate(http_requests_total{service="checkout"}[5m])) by (endpoint)

# Duration (p99)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le, endpoint))
```
Three queries, one dashboard row per service, done. This uniformity is what makes RED stick.
92.5 The LETS framework
LETS is a newer (circa 2020) framework that reorders and slightly relabels the golden signals:
- Latency — distribution of response times.
- Errors — rate of failing requests.
- Traffic — rate of requests.
- Saturation — fullness of the service.
It’s effectively the golden signals with a pronounceable acronym. The reason it exists: “golden signals” is not searchable in a dashboard, and “LETS” gives you a one-word name for the same concept. Also, some people argue LETS is easier to teach to new engineers than “four golden signals” because it’s a word.
The content is identical to the golden signals. Don’t get into framework wars about LETS vs four-golden-signals; they are the same thing wearing different hats.
Where LETS is useful: naming conventions in dashboards. A dashboard folder called “LETS” or a row labeled “LETS: checkout” is a clear signal to anyone reading it. Compare with “Overview” or “Key Metrics,” which tell you nothing about what’s on the dashboard.
92.6 System vs service vs request perspectives
Here’s the unifying lens. The four frameworks correspond to three different perspectives on a running system:
| Perspective | Question | Framework | Granularity |
|---|---|---|---|
| System / resource | ”Is the hardware happy?” | USE | Per CPU, per disk, per GPU, per NIC |
| Service / endpoint | ”Is the service happy?” | RED, golden signals, LETS | Per service, per endpoint |
| Request / trace | ”Is this specific request happy?” | Spans, errors, latency (trace-level RED) | Per request, per span |
A production-grade observability setup has all three perspectives running at the same time:
- USE on every host via node_exporter, every GPU via dcgm_exporter, every container via cadvisor. Alerts fire when a host resource approaches saturation.
- RED / golden signals on every service via app-level instrumentation (Prometheus histograms, OTel auto-instrumentation). Alerts fire on SLO breaches (Chapter 97).
- Per-request tracing via OpenTelemetry (Chapter 95). Used for drilling into specific slow or failed requests.
The frameworks are not in tension. They operate on different things and they answer different questions. The golden signals tell you that the checkout service is slow. USE tells you that the GPU it’s running on is saturated. Tracing tells you that inside a specific slow request, the bottleneck was the embedding step. Each framework contributes one piece of the diagnosis.
A common junior mistake is to pick one framework and apply it to everything. “We use RED everywhere” is not a complete observability strategy — you still need USE on the hosts. “We use the four golden signals at the service level” is good, but if you don’t also have USE at the resource level, you will be blind to disk saturation until the disk is completely full.
92.7 How the frameworks compose on a real stack
Walk through a concrete example. An LLM serving stack looks something like:
client -> CDN -> API gateway -> auth service -> router -> [vLLM pod 1 ... vLLM pod N] -> GPU
Observability layers:
- CDN: RED at the edge (requests/sec, error rate, p99 latency). Saturation is the CDN vendor’s problem.
- API gateway: golden signals (latency, traffic, errors, concurrent connections as saturation).
- Auth service: RED (it’s stateless and horizontally scaled).
- Router: golden signals (latency, traffic, errors, queue depth as saturation).
- vLLM pod: golden signals, because saturation matters intensely here (KV cache, num_requests_waiting — see Chapter 51). Latency is measured as TTFT (time to first token) and TPOT (time per output token), not one “request duration” number.
- GPU: USE (utilization, memory saturation, XID errors).
- Node: USE across CPU, memory, disk, NIC.
That’s five layers, three frameworks, and maybe fifty actual metrics being graphed. It sounds like a lot, but it decomposes neatly because each layer is speaking one framework at a time. On-call engineers know “if latency is bad at the edge, walk down the stack until you find where it enters; then switch to USE on the host for that layer.”
This is the pattern. Frameworks are signposts along a diagnostic path, not ends in themselves.
```mermaid
graph LR
    E[Edge / CDN<br/>RED] --> G[API gateway<br/>Golden signals]
    G --> A[Auth service<br/>RED]
    A --> R[Router<br/>Golden signals]
    R --> V1[vLLM pod 1<br/>Golden signals]
    R --> V2[vLLM pod N<br/>Golden signals]
    V1 --> GPU1[GPU<br/>USE]
    V2 --> GPU2[GPU<br/>USE]
    style V1 fill:var(--fig-accent-soft),stroke:var(--fig-accent)
    style V2 fill:var(--fig-accent-soft),stroke:var(--fig-accent)
```
Each layer in the LLM serving stack speaks a different framework: CDN and stateless services use RED, stateful nodes with saturation concerns use the full golden signals, and raw hardware uses USE — making the diagnostic path “walk down the stack until you find the broken layer, then switch to USE.”
92.8 Applying frameworks to ML serving specifically
ML serving has quirks that break naive application of the standard frameworks. A few:
Latency is not one number. For an LLM serving stack, “request latency” decomposes into TTFT (time to first token) and TPOT (time per output token). Users experience both, but they have very different causes (prefill cost vs decode batch fullness). If you measure only end-to-end latency, you’ll miss that TTFT degraded while TPOT stayed fine. Always split LLM latency.
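In PromQL, the split is straightforward if the server exports per-phase histograms. The queries below assume vLLM's built-in latency histograms; the metric names vary by version, so check what your build actually exports:

```promql
# TTFT p99 — prefill-side health
histogram_quantile(0.99,
  sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))

# TPOT p99 — decode-side health
histogram_quantile(0.99,
  sum(rate(vllm:time_per_output_token_seconds_bucket[5m])) by (le))
```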
Traffic is not one number. Is it requests per second, prompt tokens per second, or output tokens per second? All three matter. A single 100k-token request does more work than 1,000 single-token requests, but RPS treats them identically. For capacity planning, tokens per second is the right traffic metric; for SLO framing, RPS is often more actionable because users care about requests, not tokens.
Errors include “correct but wrong” answers. A model that returns a coherent English response to every query with 100% HTTP 200s can still be broken — if the answers are hallucinated or irrelevant. This is the eval problem (covered elsewhere in the book). From an observability perspective, you either need online eval signals (thumbs up/down, retry rate, answer refusal rate) or you need to accept that your error signal is incomplete.
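One way to make the implicit error signal concrete is to treat user feedback as an error rate. A sketch, assuming you emit a counter of feedback events labeled by kind — both the metric name and the label here are illustrative, not a standard exporter's:

```promql
# Hypothetical metric: feedback_events_total with a `kind` label.
# Fraction of feedback events that indicate a bad answer.
  sum(rate(feedback_events_total{kind=~"thumbs_down|retry"}[1h]))
/ sum(rate(feedback_events_total[1h]))
```

It is a lagging, sparse, biased signal — but it is the only error signal that catches “HTTP 200 with a wrong answer.”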
Saturation is multi-dimensional. KV cache utilization, number of requests waiting, GPU SM utilization, GPU memory utilization, HBM bandwidth — all are saturation signals and none is sufficient alone. A request queue that’s full with a cold GPU means the router is the bottleneck; a busy GPU with an empty queue means the model is the bottleneck. You need both.
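The queue-vs-GPU comparison can be graphed directly as three panels in one row. Metric names below follow current vLLM and dcgm_exporter conventions; verify against your deployment:

```promql
# Requests admitted but not yet scheduled — a full queue plus an
# idle GPU points at the router/scheduler, not the model
sum(vllm:num_requests_waiting)

# KV cache fullness (a 0-1 gauge, despite the _perc suffix)
avg(vllm:gpu_cache_usage_perc)

# GPU compute busy-ness, from DCGM
avg(DCGM_FI_DEV_GPU_UTIL)
```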
So for an LLM serving pod, the golden signals look more like:
- Latency: TTFT p50/p99, TPOT p50/p99.
- Traffic: requests/sec, prompt tokens/sec, output tokens/sec.
- Errors: HTTP 5xx rate, timeout rate, refusal rate, optional online-eval failure rate.
- Saturation: KV cache %, num_requests_waiting, GPU util, HBM util.
That’s a lot of numbers, but they cluster naturally under the four golden-signal headings. The framework still works; you just have to accept that each “signal” may be a small handful of sub-signals for the special case of a stochastic, token-streaming workload.
92.9 Pitfalls — when frameworks mislead
A framework is a tool, not a guarantee. Common ways each one goes wrong:
Averaging latency. If you report the mean request duration, a tiny number of very slow requests disappear into the average. Always use histograms and percentiles. Prometheus histogram_quantile exists for this reason.
Ignoring slow errors. Errors that return fast skew average error latency downward; errors that return slow (e.g., a 504 after 30 seconds of waiting) look like successful high-latency requests unless you split by status code. Separate successful and failed latency distributions.
USE on cloud VMs. Cloud providers oversubscribe hosts, so “CPU utilization” reported inside a VM doesn’t mean what it means on bare metal. A VM showing 40% CPU may actually be CPU-starved because the noisy-neighbor VM is stealing cycles (see steal in top). USE still works, but you have to understand what each metric is measuring on your infrastructure.
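A quick check for the noisy-neighbor case, using node_exporter's steal-mode CPU counter:

```promql
# Fraction of CPU time stolen by the hypervisor; sustained values
# above a few percent mean in-VM "utilization" understates pressure
avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[5m]))
```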
Saturation without a threshold. Saturation is only useful if you know what “full” means. 80% KV cache? Fine if steady-state, near-panic if rising fast. Always pair saturation with a rate (how fast it’s approaching the limit).
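predict_linear is the standard PromQL tool for pairing a saturation level with its trend — instead of asking “how full?”, it asks “when will it be full?”. A sketch, assuming a vLLM-style KV cache gauge (metric name may differ by version):

```promql
# Alert if, extrapolating the last 15 minutes, the KV cache
# will exceed 95% within the next 10 minutes (600 s)
predict_linear(avg(vllm:gpu_cache_usage_perc)[15m:1m], 600) > 0.95
```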
RED on internal services with fan-out. A service that calls five backends has five times the internal rate of its external rate. If you graph internal rate alongside external rate on the same chart, you’ll confuse yourself. Label by direction (inbound vs outbound) and keep them separate.
Metrics lying because of cardinality explosion. If you label a counter with a high-cardinality dimension (request ID, user ID), the time series count explodes and your Prometheus falls over. The metric is there, but you can’t query it. We cover this in depth in Chapter 93.
Forgetting the dependent stack. A service’s golden signals can look fine while its downstream dependency is broken, if the service handles the downstream failure “gracefully” (returning cached stale data, say). The only way to catch this is to also monitor the downstream as its own service. End-to-end tracing (Chapter 95) helps here.
92.10 The mental model
Eight points to take into Chapter 93:
- Observability frameworks exist to compress thousands of metrics into a small first-class set that humans can actually look at.
- Google SRE’s four golden signals — latency, traffic, errors, saturation — are the standard for user-facing services.
- USE (utilization, saturation, errors) is the standard for hardware resources. Apply it to every CPU, disk, GPU, and NIC.
- RED (rate, errors, duration) is a three-letter simplification for microservices that trades saturation for uniformity.
- LETS is the same as the four golden signals, just with a pronounceable acronym. Don’t waste effort arguing LETS vs four-golden-signals.
- Perspectives matter. USE for system/resource, RED/golden signals for service/endpoint, tracing for request/span. A complete setup uses all three.
- LLM serving breaks naive framework application — latency is TTFT+TPOT, traffic is tokens/sec, saturation is multi-dimensional (KV cache, queue, GPU). Adapt accordingly.
- Frameworks are signposts, not guarantees. They tell you where to look, not what’s wrong. You still need the underlying metrics to do the actual diagnosis.
In Chapter 93, we go one layer down: how Prometheus actually stores and queries the metrics that make the golden signals real.
Read it yourself
- Beyer, Jones, Petoff, Murphy (eds.), Site Reliability Engineering (Google, 2016). Chapter 6 introduces the four golden signals.
- Brendan Gregg, “The USE Method,” brendangregg.com. The original write-up with per-OS cheat sheets.
- Tom Wilkie, “The RED Method: Key metrics for microservices architecture,” Grafana Labs blog (2018).
- Cindy Sridharan, Distributed Systems Observability (O’Reilly, 2018). A short, dense tour of modern observability thinking.
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O’Reilly, 2022). The Honeycomb view — more traces-first than metrics-first.
- The Google SRE Workbook (free online). Practical applications of the four golden signals in case studies.
Practice
- For a stateless image resizing service behind an HTTP gateway, write out the four golden signals with concrete Prometheus metric names.
- Apply USE to the GPUs on a node running vLLM. Which dcgm_exporter metrics correspond to U, S, and E?
- RED deliberately omits saturation. In what kind of deployment is that trade-off wrong?
- Why does splitting latency by status code matter? Construct a scenario where the unsplit latency graph is misleading.
- For an LLM serving pod, list every sub-metric that rolls up under “latency,” “traffic,” “errors,” and “saturation.” Group them.
- A CDN in front of your API shows p99 latency of 200 ms. Your origin shows p99 latency of 50 ms. Where might the extra 150 ms be coming from?
- Stretch: Pick a real open-source service (e.g., a vLLM deployment or a Redis instance). Write a Grafana dashboard JSON that presents exactly the four golden signals — no more, no less — with correct PromQL queries.