Part VII · The Data Plane
Chapter 87

Time-series databases: TimescaleDB, Prometheus TSDB, InfluxDB

"A time-series database is a regular database that gave up on half of its problems in exchange for being ten times better at the other half."

Time-series data is the data shape that eats ML systems alive. Every GPU exporter metric, every inference latency histogram, every token-throughput sample, every queue depth, every business KPI — all of it is a stream of (timestamp, series identity, value) triples, and all of it needs to be stored, queried, and downsampled for months to years. A general-purpose relational database will take this workload and fall over. It will fall over for very specific reasons: the B-tree indexes it uses for SELECT * WHERE time > X get slower as the table grows; it has no notion of chunked storage by time, so deleting old data is a full scan; it has no notion of compression tuned for monotonic time and slowly-changing values; and it has no notion of dropping the oldest chunks by simply unlinking files.

Time-series databases are databases that have specialized every design choice around the fact that the time dimension is special. Data arrives in time order. Old data is rarely updated. Old data is often deleted. Queries almost always have a time-range predicate. Aggregations over time windows are the common case. Writes vastly outnumber reads. Once you accept those constraints, you can build a database that is 10× faster and cheaper than a general-purpose one for this shape, and that is what TimescaleDB, Prometheus’s TSDB, and InfluxDB all are. They make different tradeoffs, and knowing when to pick which is the core skill this chapter teaches.

Outline:

  1. Why time-series workloads need a different storage shape.
  2. TimescaleDB: hypertables and chunks.
  3. Prometheus TSDB: the blocks, the WAL, the head.
  4. InfluxDB and the TSM engine.
  5. Compression: delta, delta-of-delta, Gorilla.
  6. Retention and downsampling.
  7. Query patterns and query languages.
  8. Cardinality: the silent killer.
  9. Picking one for an ML platform.
  10. The mental model.

87.1 Why time-series workloads need a different storage shape

A time-series workload has four defining properties. The first is write-heavy, append-only: a metric source writes one sample per series per scrape interval, indefinitely, and almost never updates a past sample. The second is predictable access patterns: almost every query is “give me this metric for this label set between time A and time B.” The third is huge raw volumes: a medium-sized Kubernetes cluster produces millions of unique time series with samples every 15 seconds, which is billions of samples per day. The fourth is time-bounded value: last week’s metrics matter, last year’s metrics are a historical curiosity, and the engine should let you expire them cheaply.
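To make the "billions of samples per day" claim concrete, here is a back-of-envelope calculation; the series count and scrape interval are illustrative assumptions, not measurements:

```python
# Back-of-envelope ingestion math: assumed (illustrative) figures of
# 2 million active series, one sample per series every 15 seconds.
series = 2_000_000
scrape_interval_s = 15

samples_per_second = series / scrape_interval_s   # ~133k samples/s
samples_per_day = samples_per_second * 86_400     # ~11.5 billion samples/day

print(f"{samples_per_second:,.0f} samples/s")
print(f"{samples_per_day:,.0f} samples/day")
```

Even a modest cluster lands in the tens of billions of samples per day, which is why per-sample storage cost dominates the design.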

A general-purpose database does badly on each of these. A Postgres table with (ts, series_id, value) and a B-tree index on (series_id, ts) ingests fine at first, then slows as the index grows past RAM. Range deletes are expensive because they leave behind dead tuples and require vacuum. Compression, where it exists at all, is row-oriented and weak for this shape. Cardinality explosion (many unique series) bloats the index. The workload is theoretically supported but practically doomed.

A time-series database fixes each of these:

  • Partitioning by time. Data is split into time-bounded chunks. The chunk containing “last hour” is hot; the chunk containing “two years ago” is cold and compressed. Deletions are “drop this chunk,” which is a file unlink.
  • Column orientation within chunks. Values for the same series across consecutive timestamps are stored together, enabling delta compression and vectorized scan.
  • Specialized compression. Timestamps are monotonic and predictable (good for delta-of-delta encoding). Float values often change slowly (good for XOR/Gorilla encoding). Both compress to ~1-2 bytes per sample instead of 16.
  • Tag indexing. Series are identified by label sets. The DB maintains an inverted index from (label name, label value) to series id, which makes label-based queries fast.

The result is that a good time-series engine ingests ~1 million samples per second per CPU core on commodity hardware and serves range queries over billions of samples in milliseconds. The cost is that it is bad at everything that is not time-series: joins, updates, transactions. That is the tradeoff.

[Figure: time-partitioned chunk layout — five weekly chunks along the time axis (Apr 1–7 through Apr 29–now); the four older chunks are compressed, the newest is hot in RAM. A "last 24 h" query touches only the hot chunk, and an old chunk is dropped by file unlink with no row scan.]
Time-partitioned chunks let the engine skip past chunks entirely via predicate pushdown, and old chunks are dropped with a single file unlink rather than an expensive row-by-row delete.

87.2 TimescaleDB: hypertables and chunks

TimescaleDB is an extension for PostgreSQL that turns a regular Postgres table into a time-series-aware one. The killer idea is that a time-series table is structurally a partitioned table, and Postgres already has decent partitioning; TimescaleDB just hides the partitioning behind a nicer interface and adds the specialized operations.

The core abstraction is a hypertable. From the application’s perspective, a hypertable is an ordinary Postgres table: you INSERT into it, you SELECT from it, you use regular SQL. Under the hood, TimescaleDB partitions it into chunks, where each chunk is a standard Postgres table covering a bounded time range (e.g., one day or one hour). New data is routed to the chunk for the current time range. Old chunks are immutable and can be compressed, moved to cheaper storage, or dropped.

Creating a hypertable:

CREATE TABLE metrics (
  ts        TIMESTAMPTZ NOT NULL,
  series_id BIGINT      NOT NULL,
  value     DOUBLE PRECISION
);
SELECT create_hypertable('metrics', 'ts', chunk_time_interval => INTERVAL '1 day');

From now on, inserts to metrics are transparently routed to daily chunks. Queries with a time predicate are planned against only the relevant chunks (chunk exclusion), so a query over the last hour touches one chunk, not the whole table.
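Chunk exclusion can be sketched in a few lines; this is an assumed simplified model of the idea, not TimescaleDB's actual planner:

```python
# Simplified model of chunk exclusion: each chunk covers a half-open
# time range, and a query scans only the chunks its predicate overlaps.
from dataclasses import dataclass

@dataclass
class Chunk:
    start: int  # inclusive, unix seconds
    end: int    # exclusive

def chunks_for_query(chunks, q_start, q_end):
    """Return only the chunks whose range overlaps [q_start, q_end)."""
    return [c for c in chunks if c.start < q_end and q_start < c.end]

DAY = 86_400
chunks = [Chunk(i * DAY, (i + 1) * DAY) for i in range(30)]  # 30 daily chunks

# A "last hour" query in the middle of day 29 touches exactly one chunk.
now = 29 * DAY + 12 * 3600
assert len(chunks_for_query(chunks, now - 3600, now)) == 1
```

The same overlap test is also why dropping old data is cheap: retention just removes whole chunks from the list.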

TimescaleDB adds a native columnar compression policy. Compressed chunks are stored in a column-oriented format with delta-of-delta for timestamps and dictionary encoding for labels. Compression ratios of 10-25× are typical. The cost is that compressed chunks become read-only and slower to update; you compress chunks once they’re past the write window (e.g., “compress any chunk older than 7 days”).

Continuous aggregates are TimescaleDB’s materialized-view mechanism for downsampling. You define a view that says “average value per 1-minute window, grouped by series_id,” and the engine maintains it incrementally as new data arrives. Queries over the aggregate are orders of magnitude faster than over the raw data because they’re reading 1/60th the rows.

Retention policies are SQL-level jobs that drop chunks older than a threshold. SELECT add_retention_policy('metrics', INTERVAL '90 days') installs a job that, once a day, finds chunks entirely older than 90 days and drops them — each chunk is its own table, so this is a fast DROP TABLE and file unlink, not a row-by-row delete. This is the single biggest operational improvement over plain Postgres for this workload.

The reason to pick Timescale is that it’s still Postgres. Your existing SQL, your ORMs, your backup tooling, your monitoring, your auth — all of it works. You can join time-series data to your regular relational tables. You can run analytical queries in the same engine. For teams where the entire data stack is already Postgres, Timescale is a drop-in upgrade.

87.3 Prometheus TSDB: the blocks, the WAL, the head

Prometheus is the de-facto metrics store for Kubernetes, and its storage engine is a purpose-built TSDB that differs from Timescale in almost every way. It is embedded in the Prometheus server process, not a separate database. It has no SQL. It has no updates, and deletes are rare (handled via tombstones). It is aggressively optimized for one workload — scraping targets every 15 seconds and serving PromQL range queries over 15-30 days of data.

The storage layout has three components:

The Write-Ahead Log (WAL). Every incoming sample is appended to a WAL on disk before being stored in the head. The WAL survives crashes: on startup, Prometheus replays the WAL to rebuild the in-memory head. The WAL is rotated in segments of 128 MB and the old segments are deleted as blocks are persisted.

The head block. The most recent ~2 hours of data is held in memory, in a mutable structure that supports inserts and queries. This is where the engine is fastest and where label indexes are densest. Samples are appended to per-series chunks of up to 120 samples (~30 minutes at 15s scrape interval).

Persistent blocks. Every 2 hours, the head is flushed to an immutable on-disk block. A block is a directory containing:

  • index: the inverted index from (label, value) to series id and from series id to the locations of its chunks.
  • chunks/: the actual sample data, one file per chunk.
  • meta.json: metadata about the block’s time range and stats.
  • tombstones: for the rare deletion.

Blocks get compacted over time: two consecutive 2-hour blocks merge into a 4-hour block, which merges with its neighbor into an 8-hour block, and so on. Compaction reduces the number of blocks the query engine has to consult and rebuilds the index for better compression.
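The level-by-level merge can be sketched as follows; this is an assumed simplification (real compaction also rewrites indexes and chunk files), with illustrative block spans:

```python
# Sketch of level-based block compaction: merge adjacent, contiguous
# blocks as long as the merged span fits within the level's maximum.
def compact(blocks, max_span):
    """blocks: sorted list of (start, end) time ranges in seconds."""
    out = []
    for start, end in blocks:
        if out and out[-1][1] == start and end - out[-1][0] <= max_span:
            out[-1] = (out[-1][0], end)   # extend the previous block
        else:
            out.append((start, end))
    return out

H = 3600
blocks = [(i * 2 * H, (i + 1) * 2 * H) for i in range(4)]  # four 2h blocks
level1 = compact(blocks, 4 * H)   # pairs of 2h blocks -> two 4h blocks
level2 = compact(level1, 8 * H)   # pairs of 4h blocks -> one 8h block
assert level1 == [(0, 4 * H), (4 * H, 8 * H)]
assert level2 == [(0, 8 * H)]
```

Each pass halves the block count for a time range, which is what keeps the query engine's per-query block fan-out bounded.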

The label index is the heart of Prometheus’s query performance. When a query says rate(http_requests_total{job="api", status="5xx"}[5m]), Prometheus looks up the set of series with job=api and the set of series with status=5xx, intersects them, and reads the samples for each matching series over the 5-minute window. The inverted-index design means label intersection is nearly free, and the chunk design means the range scan is sequential on disk.
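A toy version of the inverted label index makes the intersection step concrete; this is an assumed simplification, not Prometheus's actual index format:

```python
# Toy inverted label index: map (label name, label value) -> set of
# series ids, and answer a selector by intersecting posting sets.
from collections import defaultdict

index = defaultdict(set)
series = {
    1: {"job": "api", "status": "500"},
    2: {"job": "api", "status": "200"},
    3: {"job": "worker", "status": "500"},
}
for sid, labels in series.items():
    for name, value in labels.items():
        index[(name, value)].add(sid)

def select(**matchers):
    """Intersect the posting sets for each (label, value) matcher."""
    postings = [index[(k, v)] for k, v in matchers.items()]
    return set.intersection(*postings) if postings else set()

assert select(job="api", status="500") == {1}
```

Set intersection over sorted posting lists is cheap, which is why label matching is rarely the bottleneck — reading the matched series' samples is.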

The catch with Prometheus TSDB is that it is not a long-term store. Default retention is 15 days. You can extend to 30-60 days on beefy hardware, but past that, the block count grows, compaction stalls, and memory pressure mounts. For long-term metrics, Prometheus pushes to a remote-write endpoint, and you use Thanos, Cortex, Mimir, or VictoriaMetrics to store years of data. The local TSDB stays hot and the remote store is the archive. That separation is intentional and a key part of how production Prometheus deployments actually work.

87.4 InfluxDB and the TSM engine

InfluxDB was for a long time the best-known standalone time-series database. It is a purpose-built engine with its own query language (originally InfluxQL, then Flux, and recent versions have re-added SQL). InfluxDB's storage engine has been rewritten more than once — early LevelDB- and BoltDB-based engines gave way to TSM in 1.x — and the latest (InfluxDB 3.x, "IOx") is built on Apache Arrow and Parquet and effectively turns the database into a columnar lakehouse over object storage.

The classic TSM (Time-Structured Merge tree) engine is worth understanding because it shaped the design of many other TSDBs. It is an LSM-tree adapted for time-series:

  • WAL for durability. Writes go to a WAL first.
  • In-memory cache. Recent writes live in a mutable in-memory structure.
  • TSM files on disk. The cache is periodically flushed to immutable TSM files, column-oriented, with compression per column.
  • Compaction. TSM files are compacted together over time to reduce file count and improve query efficiency.
  • Shards by time range. Data is partitioned into shards by time range, similar to TimescaleDB chunks.

The design is good for time-series ingestion but has a well-known weakness: high-cardinality tags cause memory and compaction blowups. InfluxDB's original design held the entire series index in memory, and every unique label set was a separate series. Customers who tagged metrics with user IDs or request IDs ran out of memory. This is the single most common InfluxDB failure mode and the reason cardinality is discussed in every time-series DB doc since.

InfluxDB 3 / IOx addresses this by shifting storage to Parquet on object storage with a DataFusion-based query engine. It becomes, effectively, a time-series lakehouse: data lives cheaply in object storage, queries are distributed over it, and cardinality limits relax. The tradeoff is operational complexity — you are running a distributed query engine, not a single-process database.

For most ML platforms today, InfluxDB is a reasonable choice if you want a standalone time-series store with a rich query language, but Prometheus + remote-write has more mindshare in the cloud-native world, and Timescale has more mindshare in teams already on Postgres. InfluxDB 3 is interesting but newer and less proven.

87.5 Compression: delta, delta-of-delta, Gorilla

The compression techniques that make time-series databases cheap are worth understanding because they are specific and clever.

Delta encoding stores the difference between consecutive values instead of the absolute values. For a timestamp series where samples are 15 seconds apart, the deltas are all 15,000 (ms) — a single value that compresses trivially. Even with jitter, the deltas are small numbers that pack into variable-length integers.

Delta-of-delta goes one level further: store the difference between consecutive deltas. For perfectly periodic timestamps, every delta-of-delta is 0, and long runs of zeros compress to almost nothing. Facebook's Gorilla paper reported that 96% of timestamps in their monitoring data compressed to a single bit — against 8 bytes for a raw 64-bit timestamp, a reduction of well over an order of magnitude.

XOR encoding for float values is the Gorilla technique for values. Many metrics change slowly: GPU memory utilization is 73.2%, then 73.4%, then 73.1%. The bit patterns of those floats differ only in the low-order mantissa bits. XOR the new value with the previous one: most of the result bits are zero. Store the number of leading and trailing zeros plus the meaningful middle chunk. The result is ~1-2 bytes per float for slowly-changing values and more for fast-changing ones, averaging well under 2 bytes per sample on a realistic workload.

Combined: Gorilla-style compression takes a raw 16-byte sample (8-byte timestamp plus 8-byte float) down to an average of 1.37 bytes on the Gorilla paper's workload — roughly a 12× reduction, without losing any precision. This is why modern TSDBs can ingest and store billions of samples per day on a single node.
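Both ideas can be demonstrated in a few lines of Python; this is an illustrative sketch of the principle, not the paper's exact bit-level encoding:

```python
# Sketch of the two Gorilla ideas: delta-of-delta for timestamps and
# XOR of IEEE-754 bit patterns for consecutive float values.
import struct

def delta_of_delta(timestamps):
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]

def xor_prev(prev, curr):
    """XOR the raw 64-bit patterns of two consecutive float samples."""
    a, = struct.unpack(">Q", struct.pack(">d", prev))
    b, = struct.unpack(">Q", struct.pack(">d", curr))
    return a ^ b

# Perfectly periodic 15s timestamps: every delta-of-delta is 0.
ts = list(range(0, 10 * 15_000, 15_000))  # milliseconds
assert all(dod == 0 for dod in delta_of_delta(ts))

# Slowly-changing floats: the XOR has a long run of leading zero bits,
# so only a short "meaningful" middle section needs to be stored.
x = xor_prev(73.2, 73.4)
leading_zeros = 64 - x.bit_length()
assert leading_zeros >= 12  # sign, exponent, and high mantissa bits cancel
```

The real encoders then pack the zero-run lengths and the nonzero payload into a variable-length bitstream; the sketch stops at showing why the payload is short.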

[Figure: Gorilla compression chain — timestamps flow through delta then delta-of-delta encoding (mostly zeros, packed as tiny varints); float values are XORed with their predecessor so the high bits cancel, leaving a leading/trailing-zero count plus a short payload.]
Gorilla-style compression exploits the fact that timestamps are near-periodic and floats change slowly — delta-of-delta shrinks timestamps to about a bit per sample and XOR encoding shrinks values to a byte or two, together cutting a 16-byte raw sample to an average of 1.37 bytes, roughly 12×.

Chunk-level compression layers on top of this. TimescaleDB's columnar compression adds dictionary encoding for repeated strings and run-length encoding for constant columns. Prometheus's blocks add compaction but no additional compression beyond the chunk format. InfluxDB's TSM applies type-specific encodings per column, with Snappy for string data.

The lesson for a senior engineer: the compression is effectively free at query time (decompression is cheap; the real cost sits on the ingest and compaction paths), and it is the whole reason these engines are affordable at scale. If you build a home-grown TSDB and skip compression, your storage bill will be 10× what it should be.

87.6 Retention and downsampling

The lifecycle of a metric is roughly: arrive at high resolution, live at that resolution for days, be downsampled to lower resolution for weeks, be downsampled again for months, be archived or deleted after a year. This shape exists because nobody cares about the exact value of a 15-second sample from 18 months ago, but everyone cares about the hourly average over that period.

Downsampling is the operation that converts high-resolution data into lower-resolution data by aggregating over time windows. The aggregations that matter: avg, sum, count, min, max, and various percentiles (p50, p90, p99). For histograms, you preserve the bucket counts and downsample the buckets.
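The core downsampling operation is easy to sketch; this is a minimal time_bucket-style aggregator under assumed simplifications (fixed windows, in-memory data):

```python
# Minimal downsampler: group (ts, value) samples into fixed windows
# and keep (avg, max, count) per window.
from collections import defaultdict

def downsample(samples, window_s):
    """samples: iterable of (unix_ts, value) pairs.
    Returns {bucket_start: (avg, max, count)}."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % window_s].append(value)
    return {
        start: (sum(vs) / len(vs), max(vs), len(vs))
        for start, vs in sorted(buckets.items())
    }

raw = [(t, float(t % 60)) for t in range(0, 120, 15)]  # 15s samples
agg = downsample(raw, 60)  # 1-minute buckets
assert set(agg) == {0, 60}
assert agg[0][2] == 4  # four 15s samples per 1-minute bucket
```

Note that percentiles cannot be downsampled this way from already-aggregated data — they must be computed from raw samples or histogram buckets, which is why histogram bucket counts are preserved through downsampling.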

Each engine has its own downsampling mechanism:

  • TimescaleDB continuous aggregates. Define a continuous aggregate view that groups by time_bucket('1 hour', ts) and stores the bucketed result. The view is incrementally maintained. Query the view directly or union it with the raw table for seamless time ranges.
  • Prometheus recording rules. A recording rule is a PromQL expression that is evaluated on a schedule and the result is stored as a new time series. Typical usage: define a recording rule for rate(http_requests_total[5m]) and point dashboards at the pre-computed rate instead of computing it on every query. Not quite downsampling in the “reduce storage” sense, but the same idea applied to query cost.
  • Thanos/Mimir compaction. Long-term Prometheus stores apply 5-minute and 1-hour downsampling as background jobs, storing the aggregated result in separate blocks. Queries over long ranges automatically use the downsampled blocks, which hold roughly 20× and 240× fewer samples than raw 15-second data.
  • InfluxDB continuous queries (legacy) or Flux tasks (modern) implement the same pattern.

Retention is simpler: a policy that drops data older than X. Every engine supports it, and the mechanism is always some variant of “drop the old chunks from the filesystem.” In production, typical retention tiers:

  • Raw, 15s resolution: 15-30 days.
  • 1-minute downsampled: 90 days.
  • 5-minute downsampled: 1 year.
  • 1-hour downsampled: 2-5 years.
  • Anything older: cold archive or deleted.

The storage cost of retention is dominated by the hottest tier. A 1-year retention of raw 15s samples for 1 million series is roughly 4 TB at 2 bytes per compressed sample. A 1-year retention of 1-hour downsampled samples for the same series is under 20 GB. The downsampling pays for itself.
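The tier sizes quoted above can be checked with simple arithmetic; 2 bytes per compressed sample and 1 million series are the stated assumptions:

```python
# Rough storage estimate for the retention tiers above, assuming
# 2 bytes per compressed sample and 1 million series.
SERIES = 1_000_000
BYTES_PER_SAMPLE = 2
YEAR_S = 365 * 86_400

def tier_bytes(resolution_s, retention_s=YEAR_S):
    samples_per_series = retention_s / resolution_s
    return SERIES * samples_per_series * BYTES_PER_SAMPLE

raw_tb = tier_bytes(15) / 1e12      # raw 15s tier kept a full year
hourly_gb = tier_bytes(3600) / 1e9  # 1-hour tier kept a full year

print(f"raw 15s, 1 year: {raw_tb:.1f} TB")     # ~4.2 TB
print(f"hourly, 1 year:  {hourly_gb:.1f} GB")  # ~17.5 GB
```

In practice the raw tier is kept for only 15-30 days rather than a year, which shrinks its footprint another ~12-24× and is the whole point of tiered retention.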

graph LR
  Raw["Raw 15s samples<br/>retain 30 days"]
  M1["1-min downsample<br/>retain 90 days"]
  M5["5-min downsample<br/>retain 1 year"]
  H1["1-hr downsample<br/>retain 5 years"]
  Raw -->|"aggregate avg/max/p99"| M1
  M1 -->|"aggregate"| M5
  M5 -->|"aggregate"| H1

Each downsampling tier is 4-12× smaller than the one above it — the storage cost of long-term metrics retention is dominated by the raw tier, which is why its retention window is kept short.

87.7 Query patterns and query languages

The three engines have three different query languages, and the differences are real.

TimescaleDB: SQL (PostgreSQL dialect) plus Timescale-specific functions.

SELECT
  time_bucket('1 minute', ts) AS minute,
  series_id,
  avg(value) AS avg_value
FROM metrics
WHERE series_id IN (42, 43, 44)
  AND ts > NOW() - INTERVAL '1 hour'
GROUP BY minute, series_id
ORDER BY minute;

The time_bucket function is the idiomatic Timescale extension for windowing. Beyond that, it’s standard SQL. Joins with regular tables, CTEs, window functions, all available.

Prometheus: PromQL.

rate(http_requests_total{job="api", status=~"5.."}[5m])

This is a range query over http_requests_total, matched by label, computing the per-second rate of the counter over a 5-minute window. PromQL is a functional query language designed for monitoring. It is extremely terse and extremely good at the monitoring workload. It is bad at anything else — joins across metrics are awkward, arbitrary analytics are painful, and with no persistent state and only limited subquery support, complex queries are hard to express.

InfluxDB: Flux (or the return of SQL in 3.x).

from(bucket: "metrics")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "http_requests")
  |> filter(fn: (r) => r.status =~ /^5/)
  |> aggregateWindow(every: 1m, fn: mean)

Flux is a dataflow-style language. It’s more expressive than PromQL but also more verbose and has a smaller community. InfluxDB 3’s SQL support is a hedge on this.

For an ML platform, the query-language choice is dictated by what your observability stack already speaks. If you’re on Prometheus/Grafana, PromQL is non-negotiable. If you’re building analytics dashboards that mix metrics with business data, SQL (Timescale) is the path of least resistance.

87.8 Cardinality: the silent killer

Every time-series database has a cardinality limit, and crossing it causes catastrophic failures.

Cardinality is the number of unique (metric_name, label_set) combinations. Every unique combination is a separate time series that the engine must track. If you have a metric http_requests_total with labels {job, instance, status, method, path}, and you have 2 jobs, 50 instances, 5 statuses, 5 methods, and 100 paths, that’s 2 × 50 × 5 × 5 × 100 = 250,000 series for that one metric. Add another metric with similar cardinality and you’re at half a million. Still fine.
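The series count is just the product of per-label value counts, which makes cardinality easy to audit before adding a label; the numbers below are the ones from the example above:

```python
# Worst-case series cardinality is the product of per-label value
# counts (actual cardinality is the combinations that really occur).
from math import prod

label_values = {"job": 2, "instance": 50, "status": 5, "method": 5, "path": 100}
series_count = prod(label_values.values())
assert series_count == 250_000

# One unbounded label (e.g. a hypothetical user_id with ~1e6 values)
# multiplies the worst case by a million.
assert series_count * 1_000_000 == 250_000_000_000
```

Running this kind of product over a proposed label set is a cheap pre-merge check that catches cardinality bombs before they reach production.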

Now someone adds a user_id label with a million distinct values. Every metric with that label explodes by 1,000,000×. The engine now has a billion series to track. The inverted index grows, the in-memory structures swell, query planning slows, and eventually the process OOMs or grinds to a halt.

This is not a hypothetical. It is the single most common cause of Prometheus outages. The fix is always the same: never put unbounded cardinality into labels. user_id, request_id, trace_id, session_id — these belong in logs and traces, not metrics. Labels should be bounded values: job name, environment, region, status code, HTTP method, endpoint name (curated, not the raw path).

TimescaleDB handles cardinality better because its design is not inverted-index-first — series identity is just a column in a row, and high cardinality just means more rows, which a Postgres table handles fine until you hit storage limits. But the query patterns that make time-series fast (predicate pushdown on series, chunk scan) assume modest cardinality too. A million distinct series per hypertable still works; a hundred million strains everything.

InfluxDB’s cardinality story has been its Achilles’ heel for a decade. Watch it closely.

The senior-engineer discipline: audit label cardinality before adding a label. If the cardinality is bounded and useful, add it. If it’s unbounded or could be, don’t. Move it to logs or traces.

87.9 Picking one for an ML platform

For an ML platform, the realistic choices map to three deployment patterns:

Pattern 1: Prometheus-centric metrics. Use Prometheus for all infrastructure and application metrics, scraped from Kubernetes via ServiceMonitor CRDs. Use Thanos or Mimir for long-term storage. Grafana in front for dashboards. PromQL is the query language. This is the default for cloud-native ML platforms and the right choice if you’re already on Kubernetes and want the operator ecosystem. Alertmanager integrates cleanly. Every vLLM, TEI, and Kubernetes exporter you deploy will already emit Prometheus metrics.

Pattern 2: TimescaleDB for structured time-series plus Prometheus for runtime metrics. Use Prometheus for the “system health” layer and TimescaleDB for application-level time-series that need SQL joins with business data: per-tenant token usage over time, per-model latency percentiles over time, cost-per-request over time. The split is “infra goes to Prometheus, billing and business time-series go to Timescale.” This is a common production setup.

Pattern 3: Everything in a single store. Some teams push everything to a single backend, typically VictoriaMetrics (a high-performance Prometheus-compatible store) or a managed service like Datadog or Grafana Cloud. This is simpler operationally but less flexible and usually more expensive at scale.

For new ML platforms, the default recommendation is Prometheus + Thanos for runtime metrics and Postgres + TimescaleDB for billing, usage, and long-term business time-series. Skip InfluxDB unless you have a specific reason to bring it in.

87.10 The mental model

Eight points to take into Chapter 88:

  1. Time-series workloads are defined by append-only writes, time-range reads, and retention. General-purpose databases are wrong for this shape.
  2. Partition by time. Chunks/blocks/shards bounded by time interval. Old chunks are dropped by unlink.
  3. TimescaleDB = hypertables on Postgres. You get SQL, joins, continuous aggregates, retention policies, columnar compression.
  4. Prometheus TSDB = WAL + in-memory head + 2-hour persistent blocks. Local retention is days; long-term needs remote-write (Thanos/Mimir).
  5. Gorilla-style compression applies delta-of-delta to timestamps and XOR encoding to float values, averaging well under 2 bytes per sample.
  6. Downsampling reduces storage for older data by aggregating into coarser time windows. Every engine has a mechanism.
  7. Cardinality is the silent killer. Never put unbounded labels (user_id, request_id) on metrics.
  8. Default ML platform stack: Prometheus + Thanos for runtime, Timescale for business time-series.

In Chapter 88 the focus turns from stored time-series to data in motion: Kafka and the streaming log.


Read it yourself

  • Pelkonen et al., Gorilla: A Fast, Scalable, In-Memory Time Series Database (VLDB 2015). The canonical paper on delta-of-delta and XOR compression for time-series.
  • The Prometheus documentation on the TSDB format, especially the “TSDB format” section in the developer docs.
  • Mike Freedman and the TimescaleDB team’s blog posts on hypertables, continuous aggregates, and columnar compression.
  • The Thanos and Mimir docs on long-term Prometheus storage and downsampling.
  • InfluxDB’s blog series on the IOx engine, for the Parquet-lakehouse direction.
  • The Cortex paper (from Weaveworks / Grafana Labs) on horizontally scalable Prometheus-compatible storage.

Practice

  1. A Kubernetes cluster with 200 nodes, 50 pods per node, each exporting ~50 metrics at 15s intervals. Compute the samples-per-second ingestion rate. At 2 bytes per compressed sample, how much storage does 30 days take?
  2. Design a retention strategy for ML inference latency metrics: raw at 15s, downsampled at 1m, 5m, 1h. What are reasonable retention windows for each tier? Estimate storage for 1 million series.
  3. Write a PromQL query that computes p99 latency for a http_request_duration_seconds_bucket histogram over 5-minute windows.
  4. Write the equivalent query in TimescaleDB SQL. Assume the bucket counts are stored as rows in a metrics table.
  5. Explain how Gorilla compression handles a value series that jumps wildly (e.g., a queue depth that oscillates between 0 and 10000). What’s the compression ratio?
  6. A new metric is added with a request_id label. Why is this a disaster and how would you catch it before it hits production?
  7. Stretch: Stand up TimescaleDB locally, create a hypertable with 30 days of synthetic data (1 million rows), enable compression, measure query latency for a “show me the hourly average for the last 24 hours” query before and after compression.