Continuous profiling: Pyroscope, Parca
"Metrics tell you how long. Traces tell you where. Profiling tells you why."
Metrics, logs, and traces are the canonical three pillars of observability. For most of the 2010s, people said “that’s enough.” It wasn’t. There is a gap between “this function was on the critical path of a slow request” (which a trace tells you) and “this function was slow because it spent 40% of its time in a specific line doing a specific thing” (which only a profiler tells you). Traces show you the skeleton; profiling shows you the muscle.
This chapter is about continuous profiling — the practice of running profilers in production all the time, cheap enough to leave on, and aggregating the samples into a searchable store. The breakthrough technology is eBPF-based profilers (Parca, Pyroscope, Polar Signals, Grafana Agent profiles) that sample every process on the host from the kernel side without instrumenting the application code. Flame graphs become the new dashboard. And for ML systems, where hot loops in serving stacks hide behind framework abstractions, the ability to profile production at low overhead is transformative.
Outline:
- The gap metrics, logs, and traces leave.
- What profiling actually measures.
- Why sampling profilers work.
- Traditional profiling vs continuous profiling.
- eBPF as the enabling technology.
- Flame graphs as the canonical visualization.
- Pyroscope, Parca, and Grafana Agent Profiles.
- Profiling ML systems specifically.
- Cost and overhead considerations.
- Production patterns.
- The mental model.
96.1 The gap metrics, logs, and traces leave
Walk through a real investigation. The p99 TTFT on a vLLM deployment jumps from 400 ms to 900 ms. Metrics show the spike is real and steady. Logs show nothing anomalous — no errors, no warnings, nothing interesting. Traces show the TTFT span is consistently 900 ms, and within it, the “prefill” span is the culprit, but the prefill span is a single unit with no sub-spans (you can’t trace into CUDA kernels).
You know where the latency lives (prefill) and that it’s 500 ms slower than expected, but not why. The possibilities: a new memory-layout bug causing more HBM traffic, a kernel selection regression after a CUDA upgrade, a Python-side tokenizer hotspot, a GPU driver change. Without a profile, you’re guessing.
This is the gap profiling fills. A profile samples the call stack many times per second and aggregates “how much time was spent in each function.” In seconds you can see: “70% of prefill time is in the tokenizer, not the GPU” or “40% of decode time is in a Python dict lookup that used to be a C extension.” With that data, the fix is obvious.
Traces tell you which operation is slow. Profiles tell you which line of code inside that operation is slow. They are complementary. A complete observability stack has both.
96.2 What profiling actually measures
A profile is a histogram of call stacks. For each sample, the profiler captures the current stack (the sequence of function calls from main down to the currently executing function) and increments a counter. Aggregated over many samples, you get a picture of how the program spends its time: which functions appear on the stack most often, and which are on the top (currently executing).
Different profiler types measure different things:
- On-CPU profiler: samples only when a thread is actually running on a CPU. Shows where compute is spent.
- Off-CPU profiler: samples when a thread is blocked (I/O wait, lock contention, sleeping). Shows where latency is spent waiting.
- Wall-clock profiler: samples regardless of CPU/off-CPU state. The union of the above.
- Memory allocation profiler: samples allocation events. Shows who allocates most, where leaks come from.
- Lock contention profiler: samples lock acquisition/waiting events. Shows mutex hotspots.
- GPU profiler (Nsight, CUPTI-based): samples kernel launches and SM activity on the GPU. Different beast, same idea.
For the canonical “my service is slow, where is the CPU going” question, on-CPU profiling is the default. For “my service is slow but CPU is low,” switch to off-CPU to see where it’s blocking.
The fundamental measurement is always call stack × count × duration. Everything else (flame graphs, function top-N lists, differential profiles) is a visualization of that one data structure.
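That one data structure fits in a few lines of Python. This is a toy sketch, not any profiler's API: a stack is a tuple of frames from root to leaf, the profile is a counter over stacks, and the output is Brendan Gregg's "folded" format that flamegraph.pl and most profile stores accept.

```python
from collections import Counter

# A profile is a counter keyed by the full call stack (root -> leaf).
profile = Counter()

# Toy samples, as a sampling profiler would collect them at ~100 Hz.
samples = [
    ("main", "handle_request", "tokenize"),
    ("main", "handle_request", "tokenize"),
    ("main", "handle_request", "model_forward"),
    ("main", "gc_collect"),
]
for stack in samples:
    profile[stack] += 1

# Render in "folded" format: frames joined by ';', then the count.
for stack, count in profile.most_common():
    print(";".join(stack), count)
```

Every visualization in this chapter is a different projection of exactly this counter: a flame graph draws the keys as nested bars, a top-N list sums over leaf frames, a diff subtracts two counters.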
A frame like tokenizers.Encoding.__init__ appearing in 30% of all samples is a tokenizer hotspot invisible to metrics or traces; fixing it means swapping the Python tokenizer for its Rust-backed counterpart.
96.3 Why sampling profilers work
You might think “to know where time is spent, measure every function entry and exit.” That’s instrumented profiling — and it’s expensive. Every function call incurs overhead, and you miss anything below the instrumentation granularity (inlined functions, intrinsics).
The insight is statistical sampling. If you sample the stack 100 times per second and aggregate over a minute, you have 6,000 samples. The fraction of samples in which function f appears on the stack is an unbiased estimate of the fraction of time spent in f. The standard error shrinks as 1/sqrt(N); 6,000 samples give enough resolution to identify the top 5-10 hotspots reliably.
Sampling is cheap because the overhead is bounded by the sample rate, not by the application’s activity. A 100 Hz profiler takes 100 stack walks per second per CPU regardless of whether the app is doing one thing or a million. On modern hardware, a stack walk is ~microseconds, so 100 Hz overhead is roughly 0.1% of CPU — low enough to leave on in production.
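The statistical argument is easy to check numerically. A minimal simulation, not tied to any real profiler: a hypothetical function truly occupies 30% of runtime, each sample independently lands in it with that probability, and the observed fraction converges with standard error sqrt(p(1-p)/N).

```python
import random

random.seed(42)

def sample_profile(true_fraction, n_samples):
    """Simulate n_samples stack samples; each lands in the hot
    function with probability equal to its true time share."""
    hits = sum(random.random() < true_fraction for _ in range(n_samples))
    return hits / n_samples

p = 0.30                           # function truly uses 30% of CPU time
n = 6000                           # one minute of samples at 100 Hz
est = sample_profile(p, n)
stderr = (p * (1 - p) / n) ** 0.5  # ~0.0059, i.e. +/- 0.6 points
print(f"estimate={est:.3f}, stderr={stderr:.4f}")
```

With an uncertainty of well under one percentage point, a 30%-of-CPU hotspot cannot hide; a 0.5% function can, which is exactly the rare-event caveat below.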
The downsides of sampling:
- Rare events can be missed. A function that runs only occasionally might not appear in the sample set, even if it’s slow when it runs.
- Sampling introduces noise — a flame graph from 10 seconds of data is noisier than one from 10 minutes.
- Stack walking can be hard — in some languages (Python pre-3.12, old JVMs), walking the stack is not cheap and may require cooperation from the runtime.
For steady-state production workloads, sampling is the right choice. For debugging a specific short-lived event, on-demand tools (a high-frequency perf record, py-spy, gdb) or an instrumenting profiler may be more appropriate.
96.4 Traditional profiling vs continuous profiling
Traditional profiling: you suspect a performance problem, you SSH into a host, run perf record for 30 seconds, download the output, and look at the flame graph on your laptop. One-off, reactive, human-driven.
Continuous profiling: a profiling agent runs on every host permanently. It samples stacks for every process constantly (~100 Hz), compresses the samples, and ships them to a profile storage service. You query the store by time range, service, and label — just like metrics.
The differences:
| | Traditional | Continuous |
|---|---|---|
| When it runs | On demand | Always |
| Data scope | One host, one process, short window | Entire fleet, all processes, long window |
| Investigation | Reactive (after the problem is noticed) | Retroactive (query any past window) |
| Overhead | Short bursts of 5-20% | Steady ~0.5-1% |
| Tooling | perf, py-spy, pprof | Pyroscope, Parca, Grafana Agent |
| Storage | Local files | Object store |
The productivity leap from continuous profiling is enormous. When a latency spike happens at 3 AM, you don’t have to ask “can someone reproduce it and run a profile?” You query the profile store for that specific time window, that specific service, and you have the flame graph in seconds. No reproduction needed.
The comparison that matters most: diff between time windows. Continuous profiling lets you ask “what did the CPU profile look like 10 minutes before the latency spike vs during it?” A differential flame graph shows exactly which functions got heavier. This is not possible with traditional profiling.
```mermaid
graph LR
    A[Every host: eBPF agent<br/>DaemonSet, 99 Hz] -->|compressed samples| B[Profile store<br/>Pyroscope / Parca]
    B --> C[Query by service + label + time]
    C --> D[Flame graph viewer]
    D --> E[Diff: before vs after spike]
    style A fill:var(--fig-surface),stroke:var(--fig-border)
    style E fill:var(--fig-accent-soft),stroke:var(--fig-accent)
```
Traditional profiling is reactive — SSH in after the incident. Continuous profiling is retroactive — query the stored flame graph for any past time window on any service, without reproduction.
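A differential profile is just a subtraction of two stack histograms, normalized to fractions so that differing total sample counts don't skew the comparison. A toy sketch with illustrative folded-format stacks:

```python
from collections import Counter

before = Counter({"main;prefill;gpu_kernel": 700, "main;prefill;tokenize": 100,
                  "main;decode": 200})
during = Counter({"main;prefill;gpu_kernel": 700, "main;prefill;tokenize": 550,
                  "main;decode": 200})

def normalize(profile):
    """Convert raw sample counts to fractions of total time."""
    total = sum(profile.values())
    return {stack: count / total for stack, count in profile.items()}

b, d = normalize(before), normalize(during)
diff = {stack: d.get(stack, 0) - b.get(stack, 0) for stack in set(b) | set(d)}

# Positive delta = got wider ("red"), negative = narrower ("blue").
for stack, delta in sorted(diff.items(), key=lambda kv: -abs(kv[1])):
    print(f"{delta:+.1%}  {stack}")
```

Here the tokenizer stack jumps by roughly 28 points of total time while the GPU kernel's share shrinks only because the denominator grew: the regression is in tokenization, which is the conclusion the colored diff flame graph hands you at a glance.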
96.5 eBPF as the enabling technology
The historical reason continuous profiling didn’t exist is that profilers were expensive, language-specific, and hard to deploy. You needed a Python profiler for Python, a Java profiler for Java, a Go profiler for Go. Each had to attach to each process individually, each had its own stack-walking quirks, each had its own format.
eBPF (extended Berkeley Packet Filter) changes this. eBPF is a mechanism for loading tiny sandboxed programs into the Linux kernel that can be attached to kernel events — including a perf event that fires at a fixed frequency. An eBPF profiler:
- Attaches to a perf_event firing at 99 Hz (99 rather than 100 so samples don't run in lockstep with 100 Hz timer ticks, avoiding aliasing).
- On each fire, walks the kernel stack and the user-space stack.
- Emits the stack to a ring buffer that a user-space agent reads.
- Aggregates, compresses, ships.
This is per-CPU, not per-process. It samples whatever process is running on that CPU at that instant. It works for every process on the host — Python, Go, Rust, C++, Java — simultaneously, from a single agent.
The hard part is stack walking in user space for managed languages. eBPF has limited kernel resources and can’t run a Python interpreter to walk Python frames. The workarounds:
- C/C++/Rust/Go: have frame pointers (if compiled with -fno-omit-frame-pointer) or DWARF unwind info; eBPF walks the native stack directly. Go has some quirks because of goroutines; modern profilers handle them.
- Python: the profiler reads the PyFrameObject chain from the interpreter’s memory using the Python runtime’s known offsets. Requires matching the Python version. Pyroscope and Parca both support this.
- JVM: uses perf-map files emitted by the JVM, or async-profiler’s JVM agent for richer stack info.
- V8/Node: similar to Python — read the V8 frame pointer from known offsets.
The result: a single eBPF-based agent profiles every process on the host with native stack info, at ~0.5-1% overhead, with zero application-side changes. This is the breakthrough. Before eBPF, continuous profiling was infeasible; after eBPF, it’s operationally cheap.
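eBPF walks stacks from the kernel side, but the agent's core loop — sample, walk, aggregate — can be sketched in-process with Python's own sys._current_frames(). This is an illustrative analogue only (roughly what py-spy does from outside the process, and what an eBPF agent does from the kernel), not an eBPF program:

```python
import sys
import threading
import time
from collections import Counter

profile = Counter()

def sampler(interval=0.01, duration=0.5):
    """Sample every other thread's stack at ~100 Hz and aggregate."""
    deadline = time.monotonic() + duration
    me = threading.get_ident()
    while time.monotonic() < deadline:
        for tid, frame in sys._current_frames().items():
            if tid == me:
                continue  # don't profile the profiler
            stack = []
            while frame is not None:
                stack.append(frame.f_code.co_name)
                frame = frame.f_back
            profile[tuple(reversed(stack))] += 1  # store root -> leaf
        time.sleep(interval)

def busy_loop():
    """Deliberate hotspot that should dominate the samples."""
    end = time.monotonic() + 0.6
    while time.monotonic() < end:
        sum(i * i for i in range(1000))

worker = threading.Thread(target=busy_loop)
worker.start()
sampler()
worker.join()

top_stack, count = profile.most_common(1)[0]
print(top_stack, count)  # busy_loop appears in the hottest stack
```

The real agents differ in where the walk happens (kernel vs. user space) and in cost, but the aggregation model is identical to this loop.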
96.6 Flame graphs as the canonical visualization
Brendan Gregg’s flame graphs (2011) became the universal visualization for profile data because they compress enormous amounts of stack information into one scannable chart. The axes:
- X-axis: stack frames, sorted alphabetically (not time). Width of a bar = fraction of samples in that function.
- Y-axis: stack depth. Root (main) at the bottom, leaf (currently executing) at the top.
Reading a flame graph:
- Look for wide bars near the top. A wide bar at the top is a function that is frequently the currently executing one — a hotspot.
- Look for wide towers. A tall column of wide bars is a call path that dominates runtime. Every frame under it is “paying for” the top.
- Compare widths at the same level. The widest function at the root level is where most of the time goes first.
- Look for unexpected functions. If lock_acquire is wide, you have contention. If json.loads is wide, you’re deserializing too much. If malloc is wide, you’re allocating too much.
A differential flame graph (two profiles subtracted) colors bars by change: red for “got wider,” blue for “got narrower.” This is how you see “the regression between yesterday and today is in this specific function.”
Flame graphs as dashboards: modern observability stacks embed them into the same UI as metrics and traces. Click a slow trace, see the flame graph for the service during that window. Click a metric spike, see the flame graph for the host. They become the last link in the diagnostic chain.
The subtle thing about flame graphs: they show where time is spent, not why. A wide __kernel_vsyscall bar means “we spent a lot of time in a syscall.” It doesn’t tell you which syscall without drilling in. A wide np.dot bar means “we spent a lot of time in numpy matmul.” It doesn’t tell you whether that’s because the matrix is huge or because it runs a million times. The flame graph is the start of the investigation, not the end.
96.7 Pyroscope, Parca, and Grafana Agent Profiles
The three open-source contenders:
Pyroscope (2020, acquired by Grafana 2023) — full-stack profiling system: agent, storage, query, UI. Language-specific agents (plus eBPF agent for everything). Integrates with Grafana. The “most polished” option.
Parca (2021, Polar Signals) — eBPF-first, storage + UI + query, similar architecture to Prometheus. Designed to work without any application changes. More focused than Pyroscope.
Grafana Agent Profiles / Grafana Alloy Profiles (2023+) — Grafana’s move to consolidate Pyroscope’s agent into the broader Grafana Agent. This is the direction for Grafana Cloud users.
The storage model (roughly identical across all three):
- Agent ships profile samples to a backend.
- Backend stores in object storage, indexed by (service, label, time).
- Query fetches profiles by label filter and time range.
- UI renders flame graphs.
The mental model is the same as metrics or traces: a label-indexed store over append-only data, with time-range queries. Label cardinality pitfalls apply — don’t label profiles by user_id.
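That mental model fits in a few lines. A toy in-memory sketch — the class and method names are illustrative, not Pyroscope's or Parca's actual API:

```python
from collections import Counter

class ProfileStore:
    """Append-only store of (timestamp, labels, stack-counter) entries."""

    def __init__(self):
        self.entries = []  # [(ts, labels_dict, Counter), ...]

    def ingest(self, ts, labels, profile):
        self.entries.append((ts, labels, profile))

    def query(self, label_filter, start, end):
        """Merge all profiles matching every label in the filter
        within [start, end) -- what the flame-graph viewer renders."""
        merged = Counter()
        for ts, labels, profile in self.entries:
            if start <= ts < end and all(
                labels.get(k) == v for k, v in label_filter.items()
            ):
                merged.update(profile)
        return merged

store = ProfileStore()
store.ingest(100, {"service": "vllm", "env": "prod"}, Counter({"main;prefill": 80}))
store.ingest(115, {"service": "vllm", "env": "prod"}, Counter({"main;prefill": 70, "main;decode": 30}))
store.ingest(115, {"service": "api", "env": "prod"}, Counter({"main;route": 50}))

result = store.query({"service": "vllm"}, start=90, end=120)
print(result)  # prefill samples from both vllm windows merged
```

The cardinality pitfall is visible here too: every distinct label set is another index entry, so a user_id label would explode the store exactly as it does in Prometheus.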
Which to pick? For Grafana shops, Pyroscope (soon Grafana Cloud Profiles) is the default because of UI integration. For Kubernetes-first shops that want minimal instrumentation, Parca is slightly simpler. Functionally, they overlap heavily and the choice is less important than “are you running continuous profiling at all.”
96.8 Profiling ML systems specifically
Continuous profiling is transformative for ML systems because ML code is full of hot loops hidden behind framework abstractions. A few examples of what profiling catches:
Tokenizer hotspots in serving. A vLLM prefill that looks GPU-bound is sometimes CPU-bound because the tokenizer is running in Python, not C. A profile shows tokenizers.Encoding.__init__ at the top with 30% of time, and the fix is to replace the slow-path tokenizer with the Rust-backed tokenizers library. Savings: 20-30% of prefill latency.
Framework overhead in inference. PyTorch dispatch overhead, autograd checks that should be disabled, unnecessary tensor clones. Profiles reveal these as wide frames from torch/autograd/function.py or c10::impl::OperatorEntry::call. The fix: torch.inference_mode(), torch.compile, or manually removing clones.
Data loader bottlenecks in training. The classic symptom: GPU utilization is 60%, the engineer thinks “we need a bigger GPU.” A wall-clock profile shows the training loop is spending 40% of time waiting for the data loader, which is CPU-bound on image decoding. The fix: more workers, or move decoding to GPU with DALI. No bigger GPU needed.
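The data-loader diagnosis can be reproduced without a GPU. A toy training loop whose "wall-clock profile" is just two timers, showing the wait-vs-compute split that an off-CPU profile would reveal; the function names and sleep durations are illustrative:

```python
import time

def slow_loader():
    """Stand-in for a CPU-bound data loader (e.g. image decoding)."""
    time.sleep(0.004)            # 4 ms to produce a batch
    return [0.0] * 32

def train_step(batch):
    """Stand-in for the accelerator-side work per step."""
    time.sleep(0.006)            # 6 ms of "GPU" work

wait_s = compute_s = 0.0
for _ in range(50):
    t0 = time.perf_counter()
    batch = slow_loader()        # the loop blocks here: off-CPU time
    t1 = time.perf_counter()
    train_step(batch)
    t2 = time.perf_counter()
    wait_s += t1 - t0
    compute_s += t2 - t1

frac = wait_s / (wait_s + compute_s)
print(f"time waiting on loader: {frac:.0%}")  # roughly 40%
```

An on-CPU profile of this loop would look innocuous — the process is mostly asleep — which is exactly why the "GPU at 60%" symptom needs a wall-clock or off-CPU view.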
Memory allocator pressure. A serving stack with frequent small allocations shows jemalloc or malloc at the top. The fix: pre-allocate buffers, use memory pools, use PagedAttention (Chapter 24) for KV cache.
GPU-side hotspots. Kernel-level profiling is a separate tool (Nsight Compute, Nsight Systems, CUPTI-based Nvidia Profiling Tools). eBPF profilers don’t see into the GPU. The usual workflow: get a high-level flame graph from Pyroscope/Parca to see which CUDA call is hot, then switch to Nsight to see why that kernel is slow.
The rule of thumb for ML systems: if you think you know where your time is going, run a profile anyway. You are usually wrong about one of the top three hotspots. The framework abstractions obscure the truth, and only profiling cuts through.
96.9 Cost and overhead considerations
Continuous profiling adds three costs:
(1) CPU overhead on every host. A well-implemented eBPF agent costs around 0.5-1% of CPU. For a fleet of 1000 hosts, that’s 5-10 host-equivalents of pure overhead. Acceptable but not free.
(2) Network and storage for the samples. A single raw sample is small (tens of bytes) but at 99 Hz × 100 CPUs × 1000 hosts, the volume adds up. Aggressive aggregation at the agent (roll up samples within a window before shipping) keeps it manageable. Typical volumes are 10-100 MB/host/day compressed.
(3) Backend storage and query. Profile data is smaller than trace data but larger than metric data. Plan for 1-10 TB/month on a medium fleet, at object-storage prices. Reasonable retention is 7-30 days for raw profiles; longer retention via aggregated roll-ups.
The cost math is favorable because the information density is enormous. A single week of continuous profiling on a production ML fleet can catch several performance regressions that would have taken weeks to diagnose traditionally. The engineering hours saved dwarf the infra cost.
The common configuration knobs to tune:
- Sample rate (99 Hz default; can be lowered, e.g. to 19 Hz, where overhead or data volume is a concern).
- Aggregation window (15s default — longer is cheaper, shorter is finer-grained).
- Enabled profile types (on-CPU only is cheapest; adding off-CPU, memory, and lock profiles each roughly doubles the overhead).
- Label allowlist (same cardinality concerns as everywhere else).
Start with on-CPU at 99 Hz and expand from there.
96.10 Production patterns
(1) Run one agent per node as a DaemonSet. Not one per pod. eBPF agents need host-level access (usually via privileged pod with hostPID: true).
(2) Label profiles by service, not by pod. You want to see “profile for the vllm service over the last hour,” not “profile for this specific pod that doesn’t exist anymore.”
(3) Link profiles to traces via labels. At minimum, the service label. Better: shared trace_id metadata if the profiler supports it (some do, some don’t).
(4) Alert on regression. A slowly expanding flame graph — same service, getting wider over weeks — indicates a gradual regression. Detect via automated diff-by-week comparisons.
(5) Run differential profiles in CI. A performance test that runs before/after on a PR should include a profile diff. Catches regressions before they land.
(6) Profile during load tests. Not just “look at latency during the load test” but “capture a profile during peak load.” The hotspots under load are often different from steady-state.
(7) Profile during incidents. Part of the incident response runbook: pull the profile for the affected service over the last 30 minutes. Jumps you past the guessing phase.
(8) Don’t profile secrets. Stack strings can include function arguments in some languages. Don’t include argument values in the profile payload, or scrub them.
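A minimal scrubber sketch for point (8), assuming frames may carry argument values in parentheses — the frame format here is hypothetical, so check what your profiler actually emits before relying on this pattern:

```python
import re

# Hypothetical frame format "func(arg=value)": strip everything inside
# the parentheses so values like tokens never leave the host.
ARG_RE = re.compile(r"\([^()]*\)")

def scrub_stack(folded):
    """Remove argument values from a folded-format stack string."""
    return ARG_RE.sub("()", folded)

print(scrub_stack("main;auth(token=sk-abc123);handle"))
# -> main;auth();handle
```

Scrubbing belongs in the agent, before samples are shipped: once a secret is in the profile store, retention policies make it hard to claw back.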
96.11 The mental model
Eight points to take into Chapter 97:
- Profiling fills the gap between “which span is slow” and “which line of code is slow.”
- Sampling profilers count call stacks, not function calls — statistical, cheap, unbiased.
- Continuous profiling runs always, ships to a store, answers retroactive queries.
- eBPF is the enabling tech. One agent, every process on the host, ~0.5-1% overhead.
- Flame graphs are the canonical visualization. Width = samples, height = stack depth.
- Differential flame graphs show regressions by coloring wider/narrower frames.
- ML systems have hidden hotspots in tokenizers, framework dispatch, data loaders. Always profile.
- Profile is the last mile of diagnosis. Metrics → traces → logs → profile is the full chain.
In Chapter 97, we pivot from “how to measure” to “how to set goals on what we measure” — SLIs, SLOs, SLAs, and error budgets.
Read it yourself
- Brendan Gregg, Systems Performance: Enterprise and the Cloud, 2nd ed (Pearson, 2020). The profiling bible.
- Brendan Gregg, “Flame Graphs” original blog post and subsequent extensions.
- The Pyroscope and Parca documentation, particularly their eBPF-profiler internals pages.
- BPF Performance Tools, Brendan Gregg (Addison-Wesley, 2019).
- The Polar Signals blog — some of the best writing on eBPF profiling in practice.
- Ren et al., “Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers” (IEEE Micro, 2010) — the original continuous profiling system at Google that inspired the modern OSS tools.
Practice
- Why can a sampling profiler achieve low overhead while still finding the top hotspots accurately? Explain the statistical argument.
- What’s the difference between on-CPU and off-CPU profiling? When do you need each?
- Describe what a wide bar near the top of a flame graph tells you, and what a wide tower from bottom to top tells you.
- For a Python service with a known hot path in a C extension, would eBPF profiling show the C function or the Python wrapper? Why?
- Compare differential flame graphs to a side-by-side comparison. Why is the differential easier to read?
- For an ML serving stack at 60% GPU utilization, list three continuous-profiling experiments you’d run to find where the missing 40% is going.
- Stretch: Run Parca or Pyroscope on your laptop profiling a local Python app. Introduce a deliberate hotspot (e.g., a busy loop). Capture a flame graph and verify the hotspot is visible. Then profile a PyTorch inference workload and find at least one non-obvious hotspot.