Distributed tracing: OpenTelemetry, Jaeger, Tempo
"Logs tell you what happened in one service. Metrics tell you what happened in all of them. Traces tell you what happened to one request."
Metrics and logs are time-series data and text streams. Traces are graphs. A request that touches 12 services and 40 downstream calls produces a tree of causally linked operations, and distributed tracing is the discipline of capturing, propagating, and querying that tree.
This chapter covers distributed tracing from first principles: the span/trace data model, how context propagation works across network hops, why instrumentation is harder than it looks (especially at async and message-queue boundaries), the sampling strategies that make traces economically viable, and the OpenTelemetry + collector architecture that has become the default. Chapter 94 showed how to correlate logs with a request via trace IDs; this chapter makes that trace ID mean something.
Outline:
- Why metrics and logs aren’t enough.
- Span, trace, context.
- Propagation headers: W3C Trace Context and B3.
- Head-based vs tail-based sampling.
- The OTel collector pipeline.
- Jaeger vs Tempo vs Zipkin.
- Where instrumentation breaks.
- Trace-to-log and trace-to-metric correlation.
- Trace data volume and retention.
- Production patterns.
- The mental model.
95.1 Why metrics and logs aren’t enough
A slow request story. User files a ticket: “chat latency was 12 seconds at 14:03.” The on-call engineer looks at metrics — p99 latency looks fine across the serving fleet, no spike at 14:03. Looks at logs — a few hundred thousand info lines in that minute, nothing obviously wrong. What happened to that specific request?
Metrics are aggregated; they don’t carry per-request identity. Logs are per-line; they don’t show causal relationships or timing between services. Neither answers “what did this one request actually do, and where did its 12 seconds go?”
Traces are the third pillar because they answer that question. A trace is a single request’s journey through the system, recorded as a tree of operations (spans) with their start times, durations, and relationships. Given a trace ID, you can reconstruct the full path, see which service added 11 of the 12 seconds, and jump from there into that service’s logs and metrics.
Two claims follow:
- Tracing is non-negotiable for a distributed system. A monolith doesn’t need traces because a single stack trace tells the story. Five services and one message queue already defeat stack traces.
- Tracing is the glue between metrics and logs. A trace ID in a log line lets you jump from a log entry to the full trace. A span with a high-latency tag lets you jump from a trace to the logs emitted inside that span. Metrics aggregated over trace properties (exemplars) let you jump from a metric anomaly to a representative trace.
This chapter is about how to build that glue correctly.
95.2 Span, trace, context
The data model, in three definitions:
- Span: a single unit of work. Has a name (“GET /api/chat”, “db.query”, “vllm.generate”), a start time, a duration, a set of key/value tags (attributes), a set of events (timestamped logs attached to the span), and references to other spans (parent, follows-from, links).
- Trace: a set of spans with a common trace ID, connected into a tree or DAG by parent-child references. A trace represents one request’s worth of work across the whole system.
- Context: the runtime object that carries the current trace ID and current span ID through function calls and across network boundaries. The span is the record of an operation; the context is the thing that tells you “the current operation has this parent.”
Concretely, a span looks like:
```json
{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "parent_span_id": "5f5c8e9d00000000",
  "name": "vllm.generate",
  "start_time": "2026-04-10T14:03:01.234Z",
  "end_time": "2026-04-10T14:03:12.891Z",
  "attributes": {
    "model": "llama-3-70b",
    "prompt_tokens": 3421,
    "output_tokens": 512,
    "ttft_ms": 450
  },
  "events": [
    {"ts": "2026-04-10T14:03:01.450Z", "name": "prefill_complete"},
    {"ts": "2026-04-10T14:03:12.891Z", "name": "eos"}
  ]
}
```
A trace ID is 128 bits (usually rendered as 32 hex characters). A span ID is 64 bits (16 hex characters). Both are generated randomly, which makes them globally unique without any coordination. And because every span carries the trace ID, a single span is enough to find its entire trace — you just query “find all spans with this trace_id.”
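Generating IDs of those sizes needs nothing more than a random source; a sketch (not any particular SDK's implementation):

```python
import secrets

def new_trace_id() -> str:
    # 128 random bits, rendered as 32 lowercase hex characters
    return secrets.token_hex(16)

def new_span_id() -> str:
    # 64 random bits, rendered as 16 lowercase hex characters
    return secrets.token_hex(8)
```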
95.3 Propagation headers: W3C Trace Context and B3
For a trace to span services, the trace ID and parent span ID must travel with the request as it crosses network hops. This is context propagation. The mechanism: HTTP headers (or equivalent for non-HTTP transports).
The standard is W3C Trace Context, a W3C Recommendation as of 2020. Two headers:
```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: vendor1=state1,vendor2=state2
```
The traceparent header is version-trace_id-parent_id-flags. The flags byte says things like “this trace is sampled” (bit 0). The tracestate header is optional vendor-specific state.
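Parsing that format is a one-line split; a sketch that pulls out the four fields and the sampled bit:

```python
def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four dash-separated fields."""
    version, trace_id, parent_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,                     # 32 hex chars = 128 bits
        "parent_id": parent_id,                   # 16 hex chars = 64 bits
        "sampled": bool(int(flags, 16) & 0x01),   # bit 0 of the flags byte
    }
```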
Before W3C Trace Context, every tracing system had its own headers. The most common legacy format is B3 (Zipkin’s), with headers like X-B3-TraceId, X-B3-SpanId, X-B3-Sampled. B3 is still widely deployed because Istio’s Envoy proxies default to B3. W3C Trace Context is the future; B3 is the present you have to interoperate with.
Most tracing libraries support both. OTel’s propagator defaults to W3C but can be configured to emit B3 as well. Istio can be configured to propagate either or both.
The subtle thing: propagation is a per-hop contract. When service A calls service B via HTTP, A must inject the headers on the outgoing request, and B must extract them from the incoming request. If either end doesn’t know about the headers, the trace is broken. Every client library, every proxy, every message broker in the path has to participate. This is why auto-instrumentation matters so much — manually adding propagation everywhere is error-prone and forgettable.
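The inject/extract contract can be sketched with plain dicts standing in for HTTP headers (real SDKs wrap this in propagator objects; the functions here are illustrative):

```python
import secrets

def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True):
    """Caller side: write the current context into outgoing headers."""
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def extract(headers: dict):
    """Callee side: read the context from incoming headers, if present."""
    tp = headers.get("traceparent")
    if tp is None:
        return None  # broken hop: the callee's span becomes an orphan
    _, trace_id, parent_id, flags = tp.split("-")
    return {"trace_id": trace_id, "parent_id": parent_id, "sampled": flags == "01"}

# Service A makes an outgoing call...
trace_id, span_a = secrets.token_hex(16), secrets.token_hex(8)
outgoing_headers = {}
inject(outgoing_headers, trace_id, span_a)

# ...service B receives the request and parents its span under span_a.
ctx = extract(outgoing_headers)
```

If either `inject` or `extract` is skipped on a hop, `ctx` is `None` and the trace splits in two — exactly the per-hop failure mode described above.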
95.4 Head-based vs tail-based sampling
Storing every span of every request is infeasible. A high-throughput API with 10k RPS producing 50 spans per request is 500k spans per second, which is petabytes per month. You have to sample.
Two strategies, fundamentally different:
Head-based sampling makes the keep/drop decision at the start of the trace, at the first service that sees the request. The decision propagates in the sampled flag in the traceparent header. If the flag says “keep,” every downstream service records its span; if “drop,” they skip recording.
- Pro: Cheap. No buffering required. Each service decides independently.
- Con: Random. You keep 1% of traces uniformly. You might miss every slow or failing trace if your sample fraction is low.
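One detail worth making concrete: the head decision is usually derived deterministically from the trace ID rather than a per-service coin flip, so every service that sees the same trace reaches the same verdict without coordination. A sketch, assuming the low bits of a random trace ID are uniformly distributed:

```python
import secrets

def head_sampled(trace_id: str, fraction: float) -> bool:
    """Keep/drop decision derived from the trace ID itself, so every
    service computing it independently gets the same answer."""
    # Treat the low 64 bits of the trace ID as a uniform random value.
    value = int(trace_id[-16:], 16)
    return value < int(fraction * 2**64)
```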
Tail-based sampling buffers every span of every trace for a short time (seconds to minutes), then makes the keep/drop decision once the trace is complete. The decision can be based on any property of the trace: “was it slow?”, “did any span fail?”, “did it touch the billing service?”, “is the user a VIP?”.
- Pro: You can keep 100% of the interesting traces (errors, slow, rare paths) and throw away the boring ones.
- Con: Expensive. The collector has to buffer all spans of all traces until the trace is “done” (usually inferred by a timeout after the last span arrives). Memory requirements are significant; cross-collector coordination is required if different services send to different collectors.
The hybrid strategy that works in production:
- Head-based sample at a high rate (e.g., 100%) by default. Every service records.
- Tail-based sample at the collector. The collector sees every span, buffers per trace, then keeps:
- 100% of traces with any error span.
- 100% of traces where root duration > p99.
- 1% of everything else (for baseline).
- Forward only the kept traces to the backend.
This gets you all the interesting traces at a fraction of the cost of keeping everything. OpenTelemetry Collector supports tail-based sampling via the tail_sampling processor; Grafana Tempo and Jaeger both consume this output.
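The tail decision itself is simple once the spans are buffered; a sketch of the hybrid policy above (field names are illustrative, not the collector's internal model):

```python
def keep_trace(spans, slow_ms=1000, baseline=0.01) -> bool:
    """Tail-based decision over a complete, buffered trace: keep every
    trace with an error, every slow trace, and a ~1% baseline."""
    if any(s.get("status") == "ERROR" for s in spans):
        return True
    root = next(s for s in spans if s["parent_span_id"] is None)
    if root["duration_ms"] > slow_ms:
        return True
    # Baseline: derive from the trace ID so the decision is reproducible.
    return int(root["trace_id"][-8:], 16) < baseline * 2**32
```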
95.5 The OTel collector pipeline
OpenTelemetry Collector is the de-facto ingest layer for traces (and increasingly metrics and logs too). It runs as a sidecar, DaemonSet, or gateway. Its job is to receive telemetry from applications, process it, and forward it to backends.
The config is a pipeline: receivers → processors → exporters.
```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
  zipkin:
  jaeger:

processors:
  batch:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }
  resource:
    attributes:
      - key: cluster
        value: prod-us-east
        action: upsert

exporters:
  otlp/tempo:
    endpoint: tempo:4317
  logging:

service:
  pipelines:
    traces:
      receivers: [otlp, jaeger, zipkin]
      processors: [memory_limiter, tail_sampling, resource, batch]
      exporters: [otlp/tempo]
```
Walking through:
- Receivers accept incoming telemetry in various protocols (OTLP gRPC and HTTP, plus legacy Jaeger and Zipkin for interop).
- Processors run in order: `memory_limiter` first to prevent OOMs under load, then `tail_sampling` to buffer and decide, `resource` to add cluster-level attributes, and `batch` to batch before export for efficiency.
- Exporters forward to the backend (Tempo in this example).
The mental model: the collector is the plumbing. Applications emit raw telemetry; the collector transforms, samples, enriches, and routes it. Keep your application code simple and put all the tricky logic in the collector config.
95.6 Jaeger vs Tempo vs Zipkin
The three notable trace backends:
Zipkin (2012) — Twitter’s original. Simple data model, UI is functional but dated, ingest is moderate. Mostly a legacy choice now; stick with it if you’re already on it.
Jaeger (2016) — Uber’s, now CNCF graduated. Richer UI, better search, supports multiple storage backends (Cassandra, Elasticsearch, ScyllaDB). Good for moderate scale. Heavier operationally than Tempo because of the dependency on Cassandra/Elasticsearch.
Tempo (2020) — Grafana Labs’. Object-storage-backed, no search index. Looks up traces by trace ID only. The trade-off: drastically cheaper storage, but you can’t search “show me all traces that called /foo” directly — you find the trace ID via another source (a log query, a metric exemplar) and then fetch the trace by ID.
The Tempo pitch is analogous to Loki’s: don’t build a heavyweight index for something you can find via a label filter or a log query. If your workflow is “metric shows anomaly → click exemplar → get trace ID → view trace,” Tempo is perfect and 10× cheaper than Jaeger. If your workflow is “browse traces by arbitrary property,” you need an indexed backend.
Tempo added a limited search capability in later versions, but the core bet is still ID-based lookup. For most production observability workflows, that’s the right bet.
Which to pick? Tempo if you’re on Grafana and comfortable with the “find trace ID elsewhere” workflow. Jaeger if you need rich search and are willing to pay for it. Zipkin only for legacy.
95.7 Where instrumentation breaks
In theory, OTel auto-instrumentation covers everything. In practice, it breaks at specific boundaries. The failures, in order of severity:
(1) Async boundaries. When code goes from a synchronous call stack to an async callback (a thread pool, a future, a coroutine), the context — which is stored in thread-local or async-local storage — can be lost. The downstream span ends up with no parent, or the wrong parent. Symptoms: trace trees with orphaned branches.
The fix: explicit context propagation at every async handoff. Most runtimes have helpers (contextvars in Python, context.Context in Go, AsyncLocalStorage in Node). Auto-instrumentation handles the common cases (asyncio, Spring WebFlux) but breaks on custom thread pools or queue-based designs.
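The failure and the fix are easy to reproduce with Python's contextvars, the same mechanism the OTel Python SDK builds on (the `ContextVar` here is a stand-in for the SDK's real context object):

```python
import contextvars
import threading

# Stand-in for the tracing context: a ContextVar holding the current trace ID.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)

def run_handler():
    """Simulate a request handler that fans work out to another thread."""
    current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
    results = {}

    def work(key):
        results[key] = current_trace_id.get()

    # Naive handoff: a new thread starts with a fresh context,
    # so the trace ID set above is invisible there -> orphan span.
    t = threading.Thread(target=work, args=("naive",))
    t.start(); t.join()

    # Fix: snapshot the caller's context and run the work inside it.
    ctx = contextvars.copy_context()
    t = threading.Thread(target=ctx.run, args=(work, "fixed"))
    t.start(); t.join()
    return results

results = run_handler()
```

The "naive" branch is exactly the `loop.run_in_executor()` failure: the executor thread never sees the caller's context unless something captures and re-enters it.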
(2) Message queues. HTTP headers propagate cleanly. Message brokers (Kafka, RabbitMQ, SQS) are harder. You need to inject the traceparent into the message headers, and the consumer needs to extract it. OTel has propagators for this but they have to be wired up explicitly. Most shops get this wrong on at least one queue in their stack.
Even when wired up, the model is different: a message in a queue is a “follows-from” relationship rather than a child, because the producer returns before the consumer starts. The tracing system has to represent this (spans with links instead of parent_span_id) and many visualizations don’t show it well.
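A sketch of queue propagation with a link instead of a parent (toy message and queue structures, not a real Kafka client):

```python
# Producer injects the traceparent into message headers; the consumer
# extracts it and records a *link* to the producer span rather than a
# parent, since the producer span has already ended.

def produce(queue: list, payload: bytes, trace_id: str, span_id: str):
    queue.append({
        "headers": {"traceparent": f"00-{trace_id}-{span_id}-01"},
        "payload": payload,
    })

def consume(queue: list, new_span_id: str) -> dict:
    msg = queue.pop(0)
    tp = msg["headers"].get("traceparent")
    links = []
    if tp:
        _, trace_id, producer_span, _ = tp.split("-")
        links.append({"trace_id": trace_id, "span_id": producer_span})
    # The consumer span stands on its own, linked back to the producer.
    return {"span_id": new_span_id, "parent_span_id": None, "links": links}

q = []
produce(q, b"job-42", "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
consumer_span = consume(q, "a1b2c3d4e5f60718")
```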
(3) Third-party libraries that don’t speak OTel. If you’re calling a service through a bespoke gRPC stub or an old HTTP client, auto-instrumentation may not wrap it. The span appears to end at the client call, with no downstream span, even though the downstream service is producing one (orphaned). The fix is to manually wrap the client call in an explicit span.
(4) Sampling mismatches. If service A samples at 10% and service B samples at 100%, you get lots of B spans without A spans. Traces look incomplete. Fix: use the same sampling decision at the root and propagate it. Tail-based sampling at the collector sidesteps this entirely.
(5) Clock skew. Spans from different hosts have timestamps from different clocks. If clocks are skewed by a few milliseconds, child spans can appear to start before their parents. Mostly cosmetic but confusing. Use NTP or PTP to keep clocks tight.
(6) High-cardinality span attributes. Same as Prometheus and Loki: don’t put user_id, request_id, or email into span attributes unless the backend can handle it. Some backends will cost you for every unique attribute value.
Instrumentation is the 80% of distributed tracing that isn’t about the backend. Plan to spend time on it.
95.8 Trace-to-log and trace-to-metric correlation
The glue claim from §95.1 becomes real via three correlation patterns:
Trace → log. Every log line emitted during a span should carry the trace ID and span ID. In Grafana, clicking a span pulls up the associated logs via a LogQL query {trace_id="abc..."}. Requires the logging library to be wired into the tracing context (standard OTel SDK helpers do this).
Log → trace. A log line with a trace ID can be clicked to show the full trace. Same setup, different direction.
Metric → trace (exemplars). A Prometheus histogram sample can carry an “exemplar” — a trace ID that is a representative example of the bucket. Click the slow bucket in a latency histogram, get a trace ID, jump to the trace. This is what makes Tempo’s “ID-only lookup” workflow viable: exemplars give you the IDs.
Exemplars require:
- The application to emit histogram samples with exemplars (supported by OTel and Prometheus client libraries).
- The backend to store exemplars (Prometheus since 2.26, Mimir, Grafana Cloud all support it).
- The UI to surface them (Grafana does).
With all three wired up, an on-call engineer’s workflow becomes: see a latency spike in a dashboard → click into a slow exemplar → view the trace → jump into the logs of the slowest span → diagnose. Each step is one click. This is what “observability” is supposed to mean.
```mermaid
sequenceDiagram
    participant E as On-call engineer
    participant G as Grafana dashboard
    participant P as Prometheus (exemplar)
    participant T as Tempo (trace)
    participant L as Loki (logs)
    E->>G: sees p99 latency spike
    G->>P: click slow exemplar point
    P-->>G: trace_id = 4bf92f...
    G->>T: fetch trace by ID
    T-->>G: span tree (decode span = 11 s)
    G->>L: {trace_id="4bf92f..."} LogQL
    L-->>G: log lines from slow span
    G-->>E: root cause: KV cache eviction at 14:03
```
Every link in the correlation chain — exemplar on the metric, trace_id in the log line, span attributes linking back — must be wired up before this one-click workflow exists; missing any one link breaks the chain.
95.9 Trace data volume and retention
Traces are heavy. A single span in OTLP proto is around 500 bytes to 2 KB depending on attributes. A trace of 50 spans is 25-100 KB. At 10k RPS with 50 spans each and 100% sampling, that’s 250-1000 MB/s of trace data, or 21-86 TB/day.
Sampling brings this down dramatically. At 1% random sampling, 210-860 GB/day. Tail-based sampling with 100% errors + 100% slow + 1% baseline is usually 2-5% total, giving 420 GB - 4.3 TB/day.
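The arithmetic above, checked in a few lines:

```python
# Back-of-envelope trace volume for 10k RPS at 50 spans per request,
# using the 500 B - 2 KB per-span range from the text.
rps, spans_per_req = 10_000, 50
spans_per_sec = rps * spans_per_req                     # 500,000 spans/s
low_bytes, high_bytes = 500, 2_000                      # bytes per span

tb_per_day_low = spans_per_sec * low_bytes * 86_400 / 1e12
tb_per_day_high = spans_per_sec * high_bytes * 86_400 / 1e12

# 1% random sampling; the text rounds 216-864 GB/day to 210-860.
gb_per_day_sampled_low = tb_per_day_low * 0.01 * 1000
```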
For storage: Tempo on S3 with aggressive compression gets to roughly $0.01 per million spans/month stored. A daily volume of 10 billion spans retained for 30 days is around $3000/month. Not cheap, but it’s the right order of magnitude for a mid-sized production platform.
Retention policies usually look like:
- Full traces: 7-30 days. Longer gets expensive fast.
- Aggregate metrics derived from traces: 1+ year. Computed from traces at ingest (via the OTel collector `spanmetrics` processor) and stored as Prometheus metrics, which are much cheaper.
- Specific high-value traces: indefinite. Post-mortem investigations, compliance, audit.
The cost optimization lever: span metrics. The OTel collector can derive Prometheus metrics from spans (rate, error rate, duration per service, per operation). These metrics give you the “RED for every service in the trace graph” view automatically, and the metrics are cheap. Then you drop most raw traces and keep only the interesting ones.
This is the pattern: traces as the diagnostic primitive, metrics derived from traces for the aggregate view.
95.10 Production patterns
(1) Auto-instrument first, manual instrument second. Use the OTel auto-instrumentation SDKs for every supported language and framework. Only add manual spans where the auto-instrumentation misses something important (e.g., inside a hot loop in your own business logic).
(2) Propagate via service mesh when possible. An Envoy-based mesh can inject and extract traceparent headers automatically for HTTP/gRPC traffic. Takes the manual-propagation burden off application code for the common case.
(3) Tail-based sample at the gateway collector. Not at each pod’s sidecar, because a tail sampler needs to see all spans of a trace. Run a smaller number of “gateway” collectors that receive from everything and make decisions.
(4) Always include trace ID in logs. Either via the logging library’s OTel integration or a middleware that pulls from context and injects into the log line.
(5) Span metrics for every service. Configure the collector’s spanmetrics processor to emit RED-style metrics from every trace automatically. Frees you from having to manually instrument metrics in every service.
(6) Bound attribute cardinality. Disallow free-form user content in span attributes. Create an allowlist.
(7) Monitor the collector. The collector is a critical path for observability. Its own metrics (queue depth, drops, export failures) must themselves be monitored — by a separate Prometheus, or you have a chicken-and-egg problem when the collector fails.
(8) Test trace propagation in CI. When you ship a new service, add a test that verifies trace context propagates through it. This is the only way to prevent silent regressions.
These patterns are the difference between “tracing works in theory” and “tracing is useful on call.”
95.11 The mental model
Eight points to take into Chapter 96:
- Traces are per-request graphs. Metrics aggregate, logs are per-line, traces show causal structure.
- Span = one unit of work with attributes and events; trace = a set of spans with a common trace ID; context = the propagating identity.
- W3C Trace Context is the standard (`traceparent` and `tracestate` headers). B3 is legacy-but-live.
- Head-based sampling is cheap but random; tail-based keeps the interesting stuff at higher cost. Hybrid (head 100% + tail at collector) is the production default.
- The OTel Collector is the plumbing. Receivers → processors → exporters, with tail sampling and span metrics in the middle.
- Tempo for cheap ID-lookup storage; Jaeger for rich search. Pick based on workflow.
- Instrumentation breaks at async boundaries, message queues, and third-party libraries. Plan for manual patching.
- Correlation is the payoff: metric exemplar → trace ID → span → logs. Every link must be wired up for the workflow to exist.
In Chapter 96, we add the fourth pillar: continuous profiling.
Read it yourself
- The OpenTelemetry specification, especially the data model and context propagation sections.
- W3C Trace Context specification (w3.org/TR/trace-context/).
- Distributed Tracing in Practice, Austin Parker et al. (O’Reilly, 2020). The textbook.
- Ben Sigelman et al., Dapper, a Large-Scale Distributed Systems Tracing Infrastructure (Google, 2010). The original paper.
- Grafana Tempo documentation on the ID-lookup design and exemplar workflow.
- The OTel Collector’s `tail_sampling` processor documentation.
Practice
- Given a request through `gateway → auth → router → vllm → kvcache`, sketch the span tree with parent-child relationships.
- Decode a traceparent header: `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01`. What are the fields?
- Design a tail-based sampling policy that keeps every error trace, every trace with root latency > 2s, and 1% of the rest. Write the OTel collector config.
- A span in Python fails to propagate its context when the code does `loop.run_in_executor()`. Why? How do you fix it?
- A Kafka consumer starts a span with no parent even though the producer had a trace context. What’s missing?
- For a 100k RPS service with 30 spans per request and 1 KB per span, calculate the daily raw trace volume. What does 5% sampling leave?
- Stretch: Instrument a two-service Python app (HTTP client + HTTP server) with OTel, run a local OTel Collector, and view the resulting trace in Jaeger. Add a deliberate async boundary and watch the trace break; then fix it.