Part VIII · Observability, Reliability, Incidents
Chapter 93 · ~22 min read

Metrics: Prometheus internals

"Every metric is a time series; every label is a new time series; every high-cardinality label is a new bankruptcy."

Prometheus won the metrics war. Most production ML platforms run Prometheus (or something wire-compatible: Thanos, Mimir, Cortex, VictoriaMetrics), and the interview expectation for any senior ML-systems role includes a working mental model of how it actually operates under the hood. This chapter is that model.

What’s covered: why pull-based scraping was a counterintuitive choice that turned out right, how scrape configs and service discovery fit together, how to reason about PromQL, the label-cardinality math that kills most Prometheus installations, recording rules and federation as scale levers, and the multi-node successors (Cortex, Thanos, Mimir) that Prometheus eventually needed. Chapter 92 laid out the frameworks. This chapter is how those frameworks become queryable data.

Outline:

  1. Pull vs push, and why pull won.
  2. The data model: time series, samples, labels.
  3. Scrape configs and service discovery.
  4. PromQL mental model.
  5. The four metric types.
  6. Histograms: the one tricky metric.
  7. Label cardinality math and the explosion problem.
  8. Recording rules.
  9. Federation, remote write, and the scaling wall.
  10. Multi-node Prometheus: Cortex, Thanos, Mimir.
  11. The mental model.

93.1 Pull vs push, and why pull won

In 2012, most metrics systems were push-based: the application periodically sends a metric (“I handled 42 requests this second”) to a collector (StatsD, Graphite, Datadog agent). Prometheus went the other way. An application exposes a /metrics HTTP endpoint, and a central Prometheus server periodically scrapes that endpoint. The application is passive; the server pulls.

This looked wrong at the time. Push is the obvious architecture — the app knows what it’s doing, so it tells the collector. Why make the collector reach out?

The pull model won for several reasons:

  • The scraper has the authoritative list of targets. If a target disappears, it’s obvious — scrapes fail. In a push model, if a target disappears silently, you don’t know whether it’s broken or just quiet.
  • No application-side buffering problem. A push client has to decide what to do when the collector is unreachable: drop metrics, buffer, fail. A pull client just exposes the current state and doesn’t care when it’s read.
  • Service discovery falls out naturally. Prometheus queries a discovery backend (Kubernetes API, Consul, EC2, file SD), gets a list of targets, and scrapes them. Targets don’t need to register themselves.
  • Uniform scrape interval. Every target is scraped on the same cadence, so downsampling and rate calculations are clean.
  • Health as a side effect. A target that fails to be scraped is a target that’s unhealthy. The up metric is a free health signal.

The one thing pull is bad at: short-lived jobs. A batch job that runs for 30 seconds can’t be reliably scraped at a 60-second interval — you’ll miss it entirely. Prometheus’s answer is the Pushgateway, a separate component that accepts pushes from short-lived jobs and exposes the accumulated state for scraping. It’s a deliberate special case, and the Prometheus docs warn you not to use it for anything else.

For the long-running services that make up 99% of a production ML stack, pull is the right model. Every vLLM pod, every TEI pod, every Kubernetes node runs a /metrics endpoint and waits to be scraped.

Pull model: Prometheus scrapes /metrics endpoints on a schedule. Push model: apps push samples to a collector. Pull wins because target disappearance is immediately visible as a scrape failure.
In the pull model a disappearing target immediately shows up as a scrape failure — the up metric drops to 0 and an alert fires; in the push model a silent target is invisible until someone notices the gap.
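What a scrape actually fetches is plain text. Here is a minimal stdlib-only sketch of the exposition format — metric names and values are hypothetical, and real applications use the prometheus_client library rather than hand-rolling this:

```python
# Sketch of the Prometheus text exposition format a /metrics endpoint
# returns. Metric names and values here are made up for illustration.

def render_metrics(metrics):
    """Render {(name, label_pairs): value} as Prometheus text format."""
    lines = []
    for (name, labels), value in sorted(metrics.items()):
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

metrics = {
    ("http_requests_total", (("method", "POST"), ("status", "200"))): 1027,
    ("process_resident_memory_bytes", ()): 734003200,
}
print(render_metrics(metrics))
```

A real app serves this string over HTTP and does nothing else — the "passive target" half of the pull model is exactly this cheap.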

93.2 The data model: time series, samples, labels

A time series in Prometheus is the unique combination of a metric name and a set of label key/value pairs:

http_requests_total{method="POST", handler="/api/chat", status="200", pod="llama-abc123"}

That whole thing — name plus labels — identifies one series. Each series is a stream of (timestamp, float64 value) pairs called samples.

Two series with the same name but different labels are different series:

http_requests_total{method="POST", status="200"}   # series A
http_requests_total{method="POST", status="500"}   # series B

This is the crucial fact about Prometheus: labels are not metadata, they are part of the identity of the series. Every unique label combination creates a new series. This is what makes PromQL expressive and it is also what makes cardinality blow up (§93.7).
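The "labels are identity" fact can be made concrete with a toy in-memory store — a sketch under the simplifying assumption that a series is nothing more than a dict key:

```python
# Toy TSDB sketch: a series is keyed by (metric name, frozenset of
# label pairs). Every distinct label combination gets its own sample list.

class ToyTSDB:
    def __init__(self):
        self.series = {}  # (name, frozenset(labels)) -> [(ts, value), ...]

    def append(self, name, labels, ts, value):
        key = (name, frozenset(labels.items()))
        self.series.setdefault(key, []).append((ts, value))

db = ToyTSDB()
db.append("http_requests_total", {"method": "POST", "status": "200"}, 1000, 41)
db.append("http_requests_total", {"method": "POST", "status": "200"}, 1015, 42)
db.append("http_requests_total", {"method": "POST", "status": "500"}, 1015, 3)

# Same name, different status label => a separate series.
print(len(db.series))  # → 2
```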

Internally, Prometheus stores each series as a compressed chunk stream. The chunks live in an in-memory “head” for the most recent ~2 hours, then get flushed to disk as immutable block files. Each block covers a fixed time range (two hours by default). Blocks are compacted together over time to reduce the number of files. The result is a log-structured merge-tree on disk — a single-node TSDB optimized for write-heavy, append-only ingest with range queries.

The on-disk format is small. Thanks to Gorilla-style delta-of-delta timestamp encoding and XOR-based float compression, a single sample costs about 1.3 bytes on disk on average, versus the 16 bytes a raw (int64 timestamp, float64 value) pair would occupy. This is what makes Prometheus’s single-node model viable: you can store tens of millions of samples per hour on a ~100 GB SSD without thinking too hard about storage.
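A simplified illustration of why the timestamp half of that compression works — this shows only the delta-of-delta idea, not the actual bit-level Gorilla codec:

```python
# With a steady scrape interval, deltas between consecutive timestamps
# are constant, so the delta-of-delta is almost always zero and can be
# encoded in a single bit. Simplified illustration, not the real codec.

def delta_of_deltas(timestamps):
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]

# A 15-second scrape interval with one slightly late scrape at t=61.
ts = [0, 15, 30, 45, 61, 76, 91]
dod = delta_of_deltas(ts)
print(dod)  # → [0, 0, 1, -1, 0]: mostly zeros, tiny values otherwise
```

Values stay compressible even under scheduling jitter — the occasional late scrape produces a small nonzero delta-of-delta, not a large one.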

Prometheus TSDB storage: scrapes land in an in-memory head block for 2 hours, then flush to immutable disk blocks which are compacted over time into larger blocks for efficient range queries.
Recent samples live in a compressed in-memory head block; after two hours they flush to immutable disk blocks that are compacted over time — the log-structured layout keeps writes cheap while range queries scan only the relevant blocks.

93.3 Scrape configs and service discovery

The configuration Prometheus cares about most is the scrape_configs section of prometheus.yml. A minimal example for a Kubernetes cluster:

scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod

Walking through: kubernetes_sd_configs with role: pod tells Prometheus to query the Kubernetes API for the list of all pods in the cluster. For each pod, it generates a set of __meta_kubernetes_pod_* meta-labels. relabel_configs is a small rule language that transforms those meta-labels into the actual scrape target and the labels that get attached to every sample from that target. The first rule filters to only pods with a prometheus.io/scrape: "true" annotation. The rest rewrite the target address and attach namespace and pod labels.

This relabel pipeline is the most confusing part of Prometheus for newcomers. The mental model: every target goes through a pipeline of transformations before Prometheus decides where to scrape it and what labels to attach. The pipeline runs once per discovery cycle, not per sample. Mistakes here are easy and they cause “why is this pod not being scraped” mysteries that can eat an afternoon.

The standard port split convention — 8080 for the app, 9090 for metrics — is not a Prometheus requirement, it’s a cultural convention that lets you run a metrics server on a different port from the app server. In Kubernetes, you expose both via the pod spec and annotate with prometheus.io/port: "9090". vLLM, TEI, and most modern serving stacks follow this split by default. Prometheus itself runs on 9090, which is how the convention got started.

93.4 PromQL mental model

PromQL is a query language over time series. Every query returns one of four types:

  • Instant vector — a set of time series, one sample per series, all at the same timestamp.
  • Range vector — a set of time series, each with a window of samples.
  • Scalar — a single float.
  • String — rarely used.

The critical distinction is instant vs range. Most PromQL functions and operators work on instant vectors. Rate-based functions (rate, irate, increase) require a range vector.

The canonical rate query:

rate(http_requests_total[5m])

Reads as: “for each series in http_requests_total, look at the last 5 minutes of samples, compute the average per-second rate of increase.” Output is an instant vector with one value per series. Multiply by 60 to get “per minute,” sum across a label to aggregate, and so on.

Aggregation operators collapse series along labels:

sum by (handler) (rate(http_requests_total[5m]))

“Compute the rate per series, then sum within each handler label, producing one series per handler.”

without is the inverse: aggregate away the listed labels, keep all others.

A few rules that trip people up:

  • rate always requires a counter (a monotonically non-decreasing metric). On a gauge, use the raw value or deriv. rate(cpu_usage[5m]) on a gauge is nonsense.
  • Use rate, not irate, for alerts. irate looks at just the last two samples and is too noisy for alerting. rate averages over the window and is stable.
  • The window must contain at least two samples. rate(metric[15s]) with a 30-second scrape interval is broken — you need a window ≥ 2× the scrape interval. The standard rule of thumb is [5m] for anything with a default 15-30 s scrape interval.
  • Counters can reset (process restart). Prometheus handles this — rate detects the reset and ignores the drop. You don’t have to worry about it.
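What rate computes, including the reset handling from the last bullet, can be sketched directly — simplified in that real Prometheus also extrapolates to the window boundaries:

```python
# Sketch of rate(counter[window]): average per-second rate of increase
# over the window, treating any drop in value as a counter reset
# (process restart) that restarts counting from zero. Real Prometheus
# additionally extrapolates to the edges of the window.

def prom_rate(samples):
    """samples: [(timestamp_seconds, counter_value), ...] in the window."""
    if len(samples) < 2:
        return None  # need >= 2 samples, hence window >= 2x scrape interval
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur if cur < prev else cur - prev  # reset: count from 0
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# 15 s scrapes; the counter resets at t=45 after a process restart.
window = [(0, 100), (15, 160), (30, 220), (45, 40), (60, 120)]
print(prom_rate(window))  # → 4.0 requests/second
```

The reset at t=45 (220 → 40) contributes 40 to the increase instead of a huge negative number — this is why counter restarts never show up as negative rates.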

The mental model to hold: PromQL is a functional language over vectors of time series. Every operation is “transform this vector into another vector.” Compose the transforms until you get the number you want.

93.5 The four metric types

Prometheus has four metric types, and they behave differently:

  • Counter — monotonically non-decreasing integer or float. Use for “things that only go up”: requests served, bytes sent, errors observed. Always query with rate or increase, never the raw value.
  • Gauge — a value that can go up or down. Use for “current state”: memory in use, queue depth, temperature. Query with the raw value, avg_over_time, max_over_time, etc.
  • Histogram — a collection of counters representing a distribution. Covered in detail in §93.6.
  • Summary — a client-side computed percentile. Mostly obsolete; use histograms instead. The reason: summaries compute percentiles on the client, so you can’t aggregate them across instances. Histograms expose the bucket counts and let you aggregate server-side.

The interview question “should this metric be a counter or a gauge” almost always has a clear answer: if it represents a count of events, make it a counter. If it represents a level, make it a gauge. Don’t mix them — a metric like active_connections that you occasionally “reset” is not a counter; it’s a gauge. Trying to use rate() on it gives wrong answers.

93.6 Histograms: the one tricky metric

A histogram is actually multiple series under the hood:

  • foo_bucket{le="0.005"} — count of observations ≤ 5 ms
  • foo_bucket{le="0.01"} — count of observations ≤ 10 ms
  • foo_bucket{le="0.025"} — count of observations ≤ 25 ms
  • … (more buckets)
  • foo_bucket{le="+Inf"} — count of all observations
  • foo_sum — sum of all observations
  • foo_count — count of observations (same as foo_bucket{le="+Inf"})

Each bucket is a cumulative counter. The le label says “less than or equal to this upper bound.” To compute a percentile, use histogram_quantile:

histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{handler="/api/chat"}[5m])))

The inner expression computes the per-second rate of each bucket, sums across instances (preserving the le label), and histogram_quantile linearly interpolates inside the bucket containing the 99th percentile.

Two crucial things:

(1) The percentile is only as accurate as your buckets. If your buckets are [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10] and 99% of your traffic lands between 10 ms and 25 ms, histogram_quantile(0.99, ...) is interpolating across a big gap and will be inaccurate. Pick bucket boundaries to match your workload. For LLM serving where TTFT lives between 50 ms and 5 seconds, your buckets should be dense in that range.

(2) Summing histograms across instances only works if bucket layouts match. If pod A has different bucket boundaries than pod B, you can’t sum by (le) them. Always define histogram buckets once, centrally, and use the same definition everywhere.
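The interpolation itself is simple enough to sketch. The bucket counts below are hypothetical, and real Prometheus feeds per-second bucket rates rather than raw counts into this logic, but the arithmetic is the same:

```python
# Sketch of histogram_quantile: find the cumulative bucket containing
# the requested rank, then linearly interpolate inside it. Bucket
# bounds and counts here are illustrative, not from a real system.

def histogram_quantile(q, buckets):
    """buckets: [(le_upper_bound, cumulative_count), ...] sorted by bound,
    ending with (float('inf'), total_count)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # can't interpolate into an open bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count

buckets = [(0.05, 0), (0.1, 600), (0.25, 900), (0.5, 980),
           (1.0, 1000), (float("inf"), 1000)]
print(histogram_quantile(0.99, buckets))  # → 0.75, inside the (0.5, 1.0] bucket
```

The rank 990 falls in the (0.5, 1.0] bucket, which holds observations 980 through 1000; interpolating halfway through it yields 0.75. Coarse buckets mean coarse interpolation — the p99 here could really be anywhere between 0.5 and 1.0, which is exactly the accuracy caveat above.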

The newer native histograms (Prometheus 2.40+) solve both problems by storing the distribution with log-based auto-scaling buckets. They use less storage and give accurate percentiles without hand-tuning. As of 2026 they are stable but not yet universally adopted; most production installations still use classic histograms.

93.7 Label cardinality math and the explosion problem

This is the most important section in this chapter. Every Prometheus outage traces back to cardinality.

A time series is uniquely identified by its (name, labels) tuple. Every unique combination is a separate series. Prometheus’s head block holds an index entry and chunk data for every active series, so memory usage grows linearly with the number of active series.

The math: if your metric has k labels and each label has c_i distinct values, the total number of series is:

series = product of c_i for i in 1..k

So a metric http_requests_total with labels {method, handler, status, pod} where:

  • method has 5 values (GET, POST, ...)
  • handler has 20 values
  • status has 10 values
  • pod has 50 values

Label cardinality explosion: each additional label multiplies the total series count. Adding a single unbounded label (user_id, request_id, trace_id) multiplies total series by millions — a cardinality explosion that OOMs Prometheus within minutes of deployment.

has 5 × 20 × 10 × 50 = 50,000 series. Each series costs around 3-4 KB of RAM in the head block (index + chunks). That’s 150-200 MB for one metric. Fine.

Now add a user_id label with 1 million distinct values. You just created 50 billion series, of which maybe 5 million are active at once. That’s 15-20 GB of RAM for a metric that used to cost 200 MB. Prometheus will OOM.
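The arithmetic above, as a sketch — the per-series byte cost is the rough 3-4 KB head-block estimate from the text, not a measured constant:

```python
# Cardinality budget math: series count is the product of per-label
# cardinalities; head-block RAM scales linearly with active series.
# 3500 bytes/series is the rough estimate from the text, not measured.
from math import prod

def series_count(cardinalities):
    return prod(cardinalities)

def head_ram_gb(n_series, bytes_per_series=3500):
    return n_series * bytes_per_series / 1e9

safe = series_count([5, 20, 10, 50])  # method, handler, status, pod
print(safe, f"{head_ram_gb(safe):.3f} GB")  # 50000 series, ~0.175 GB: fine

exploded = series_count([5, 20, 10, 50, 1_000_000])  # add user_id
print(f"{exploded:,} series")  # 50,000,000,000: game over
```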

The rule: never label a metric with anything high-cardinality. Specifically:

  • Never label with user_id, request_id, session_id, trace_id, email, or any other identifier that grows with users. Those belong in logs (Chapter 94) or traces (Chapter 95).
  • Be careful with pod labels. In a system with autoscaling, pod names change frequently; old series stay around for the retention window. A deployment that churns 20 pods an hour with a high-cardinality metric will leak series until memory explodes.
  • Avoid labels that encode free text — error messages, SQL statements, URLs with embedded IDs (/users/1234/orders/5678). Either normalize or drop.
  • Cap the number of labels per metric at ~5-8. More labels multiply the combinations.

The cardinality budget for a healthy single-node Prometheus is on the order of 2-5 million active series. Above that, query latency degrades and memory pressure becomes serious. You can push further with more RAM (Prometheus scales linearly), but the right fix is usually to drop the offending label.

Tools to keep you honest:

# Top 10 metrics by series count
topk(10, count by (__name__)({__name__=~".+"}))

# Cardinality of a specific label across all metrics
count(count by (pod) ({__name__=~".+"}))

Run these monthly. The top-series-by-name query almost always surfaces a metric you didn’t mean to label that aggressively.

93.8 Recording rules

A recording rule pre-computes a PromQL expression at scrape time and stores the result as a new metric. Example:

groups:
  - name: http.rules
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_request_duration:p99
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))

Every 30 seconds, Prometheus evaluates these expressions and stores the result as new time series. Subsequent queries against job:http_requests:rate5m are trivial — they’re just reading a precomputed metric.

Why this matters:

  • Dashboards get fast. A Grafana dashboard querying a recording rule responds in milliseconds instead of evaluating an expensive expression against raw data.
  • Alerts get reliable. An alert rule that runs every minute and re-evaluates a 30-second histogram quantile across millions of series can miss evaluations or be inconsistent. A recording rule computes it once, deterministically.
  • Naming convention matters. The job: prefix is a convention: level:metric:operation. It tells you at a glance that this is an aggregated metric at the job level, and the original metric is http_requests. Use it.

The rule of thumb: any PromQL expression that appears in more than one dashboard or alert should be a recording rule.

93.9 Federation, remote write, and the scaling wall

A single Prometheus scales to a few million active series on a reasonably beefy machine. For a fleet of hundreds of clusters with billions of series, you need something else. The escape hatches, in order of complexity:

Federation — a “top-level” Prometheus scrapes aggregated metrics from many “leaf” Prometheuses. Each leaf runs on its own cluster with its own full-resolution data; the top-level only pulls a small set of aggregated recording rules. Good for dashboards that span clusters; bad if you need full-resolution data at the top.

Remote write — Prometheus forwards samples to a remote storage backend as they are scraped. The remote backend (Thanos Receiver, Mimir, Cortex, VictoriaMetrics, Grafana Cloud) does long-term storage, global query, and high availability. Local Prometheus becomes a write-ahead buffer; queries go to the remote backend.

Remote write is the dominant pattern today. Every production ML platform with more than a few clusters runs some form of remote write into a central store. The local Prometheus keeps ~2 hours of data for fast queries and forwarding; the central store keeps months or years.

The cost of remote write: every sample gets sent twice (to local disk and over the network). At high ingest rates this becomes bandwidth-significant. The remote_write config has queueing, batching, and sharding parameters you will eventually have to tune.
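A sketch of what that tuning looks like. The queue_config field names are real prometheus.yml parameters, but the values are illustrative starting points rather than a recipe, and the Mimir URL is hypothetical:

```yaml
# Illustrative remote_write tuning sketch — field names are real
# prometheus.yml parameters; the values are starting points to tune.
remote_write:
  - url: https://mimir.example.internal/api/v1/push
    queue_config:
      capacity: 10000             # samples buffered per shard
      max_shards: 50              # upper bound on parallel send shards
      max_samples_per_send: 2000  # batch size per outbound request
      batch_send_deadline: 5s     # flush even if the batch isn't full
```

Shards scale up automatically under backlog; the knobs trade memory and network burstiness against how far behind the remote store is allowed to fall.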

93.10 Multi-node Prometheus: Cortex, Thanos, Mimir

Three open-source projects emerged to solve the “more than one Prometheus” problem:

Thanos (2018) — sidecar-based. Each Prometheus runs a Thanos sidecar that uploads blocks to an object store (S3, GCS). A Thanos Querier at the front answers queries by fanning out to the sidecars (for recent data) and to a Thanos Store (for historical data in the object store). Deduplication is done at query time. Strengths: no ingest bottleneck, natural HA via running two identical Prometheuses and deduping. Weaknesses: query latency can be high because of fanout, the component set is large.

Cortex (2016) — the original multi-tenant Prometheus-as-a-service, built at Weaveworks and later maintained at Grafana Labs. Receives samples via remote_write, writes to a blob store for long-term data, and serves PromQL from a distributed query layer. Strengths: true multi-tenancy, horizontal ingest scaling. Weaknesses: architectural complexity; effectively superseded by Mimir.

Mimir (2022) — Grafana’s fork and rewrite of Cortex. Simpler operationally, higher ingest capacity per core, same wire protocol. The modern default if you’re building a central metrics store. Scales to billions of active series per cluster.

VictoriaMetrics (2018) — not strictly a Prometheus derivative, but wire-compatible. Claims higher write performance and lower storage cost than any of the above. Used in production at large scale by many teams. Single-binary deployment is a genuine operational win.

Which to pick? For small deployments, plain Prometheus with remote_write to Mimir or Grafana Cloud is the path of least resistance. For on-prem at scale, VictoriaMetrics or Mimir. For teams already on Thanos, don’t migrate unless you have a concrete problem with it.

The uniform property: all of them speak PromQL. Dashboards and alerts are portable. The multi-node architecture is an ingest and storage decision, not a query decision.

93.11 The mental model

Eight points to take into Chapter 94:

  1. Pull beat push because it unifies discovery, health, and uniform scrape intervals under one model.
  2. A time series is (name, labels). Every unique label combination is a separate series. Labels are identity, not metadata.
  3. Scrape configs + relabeling are how Prometheus decides what to scrape and what labels to attach. The pipeline is the most error-prone part of the system.
  4. PromQL operates over vectors of time series. Think functionally: each step transforms a vector. rate(counter[window]) is the atom of almost every query.
  5. Histograms are multiple counters under the hood. histogram_quantile interpolates inside buckets. Bucket boundaries determine accuracy.
  6. Cardinality is the killer. Never label with user IDs, request IDs, or anything unbounded. Cap labels per metric at ~5-8. Audit top series regularly.
  7. Recording rules pre-compute expensive PromQL so dashboards and alerts are fast and deterministic. Use level:metric:operation naming.
  8. Multi-node options (Thanos, Cortex, Mimir, VictoriaMetrics) all speak PromQL. Start with plain Prometheus + remote_write; graduate when the single node starts to hurt.

In Chapter 94, the same cardinality lessons reappear — this time for logs.


Read it yourself

  • Prometheus: Up & Running, 2nd edition, Brian Brazil (O’Reilly, 2023). The canonical reference.
  • The official Prometheus docs, especially the sections on storage and remote_write.
  • Tobias Schmidt, “Writing Exporters” — the Prometheus docs page that explains how to instrument an application correctly.
  • The Gorilla paper: Pelkonen et al., Gorilla: A Fast, Scalable, In-Memory Time Series Database (VLDB 2015). The compression scheme Prometheus uses.
  • The Grafana Mimir and Thanos docs for the scale-out story.
  • Björn Rabenstein’s KubeCon talks on Prometheus internals — best source for the TSDB mental model.

Practice

  1. Why does pull beat push for a long-lived microservice but lose for a 30-second batch job? What does Prometheus do about the batch job?
  2. Compute the number of active series for a metric db_queries_total with labels {db_name (10), table (500), operation (5), pod (30)}. Is this safe?
  3. A team adds a trace_id label to their latency histogram to “correlate slow requests.” Within an hour Prometheus OOMs. Explain exactly why.
  4. Write a recording rule that computes p99 TTFT per model per 5-minute window for an LLM serving stack, assuming the base metric is vllm_ttft_seconds_bucket with labels {model, pod, le}.
  5. A new engineer uses rate(current_memory_bytes[5m]) on a gauge metric. What do they see and why is it wrong?
  6. Sketch the data flow for a multi-region deployment using Prometheus + remote_write + Mimir. Where is HA? Where is long-term storage?
  7. Stretch: Run a local Prometheus, expose a custom metric from a tiny Python app, and deliberately create a label with 100k unique values. Watch the memory footprint. Then kill the high-cardinality label and verify recovery.