Part VIII · Observability, Reliability, Incidents
Chapter 94 · ~18 min read

Logs: Loki, structured logging, sampling, retention economics

"Logs are the cheapest thing to produce and the most expensive thing to keep."

If metrics tell you that something broke, logs tell you what happened around the break. They are the human-readable trail that lets you reconstruct the story of a request or a pod five minutes ago. Unfortunately, they are also the single largest line item in many observability bills, and the part of the stack where “just log more” runs headfirst into the economics of object storage.

This chapter is about how logs fit into a modern observability stack without bankrupting you. Structured logging as a non-negotiable baseline, the LogQL mental model that Loki inherits from Prometheus, the cardinality pitfalls that also come with that inheritance, the shipping layer (Vector vs Fluent Bit), sampling strategies, and the cost math for retention. If Chapter 93 was “metrics are cheap because of compression,” this chapter is “logs are expensive because they aren’t compressible the same way.”

Outline:

  1. Why structured logs are non-negotiable.
  2. The log as a queryable record.
  3. LogQL mental model.
  4. Loki’s “index by labels, store full text” design.
  5. Label cardinality pitfalls (again).
  6. Shippers: Vector vs Fluent Bit vs Promtail.
  7. Sampling strategies.
  8. Retention cost math.
  9. Loki vs Elasticsearch — the architectural divide.
  10. Production patterns.
  11. The mental model.

94.1 Why structured logs are non-negotiable

A log line is a string. The question is what kind of string.

Unstructured log:

2026-04-10 15:23:11 ERROR Failed to load model llama-3-70b from s3://bucket/path: connection reset by peer

Structured log (JSON):

{"ts":"2026-04-10T15:23:11Z","level":"error","msg":"failed to load model",
 "model":"llama-3-70b","source":"s3://bucket/path","error":"connection reset by peer",
 "attempt":3,"trace_id":"abc123","pod":"vllm-5f4c8"}

Both carry the same information. The second one is queryable and the first one is not. If you want “show me all failed model loads with attempt ≥ 3 in the last hour,” the second one is a one-line LogQL query; the first is a regex against tens of millions of lines that will take forever and probably get it wrong.

The rule: logs that a machine needs to read should be JSON. Every production system, every language, every framework should emit JSON by default. Python’s structlog, Go’s slog, Rust’s tracing with JSON output, Node’s pino — all do the right thing. The legacy printf("hello %s", name) style is fine for scripts and early development, but in production it creates tech debt that someone will have to pay.
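As a dependency-free sketch of what those libraries do for you (structlog, slog, and pino all ship this out of the box), a JSON formatter for Python's stdlib logging might look like this — the field names follow the schema used throughout this chapter:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed key schema."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Structured fields passed via logging's `extra=` argument.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("vllm")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.error("failed to load model",
          extra={"fields": {"model": "llama-3-70b", "attempt": 3,
                            "trace_id": "abc123"}})
```

The point is that every keyword argument becomes a queryable field; a dedicated library adds the same thing with less ceremony.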

Beyond JSON, three additional conventions matter:

  • Use a consistent key schema. Pick names for level, timestamp, message, error, and trace_id, and use the same names everywhere. When a team standardizes on level: "error" and another on severity: "ERROR", the resulting queries have to special-case both forever.
  • Include the trace ID. Every log line emitted while handling a request should carry the current trace ID (Chapter 95). This is what lets you jump from a trace to its associated logs, and vice versa.
  • Don’t log secrets. Ever. Token values, API keys, raw auth headers — scrub at the logging library level, not at the query level. Someone will paste a log line into a chat channel.
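"Scrub at the logging library level" can be a small processor that runs on every event before serialization. A minimal sketch, assuming a dict-shaped event and an illustrative list of secret-looking key names:

```python
import re

# Key names and patterns here are illustrative — extend for your stack.
SECRET_KEYS = {"authorization", "api_key", "token", "password", "secret"}
BEARER_RE = re.compile(r"(?i)bearer\s+\S+")

def scrub(event: dict) -> dict:
    """Redact secret-looking fields before the event is serialized."""
    clean = {}
    for key, value in event.items():
        if key.lower() in SECRET_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, str):
            # Catch tokens embedded inside string values, e.g. auth headers.
            clean[key] = BEARER_RE.sub("Bearer [REDACTED]", value)
        else:
            clean[key] = value
    return clean

scrub({"msg": "auth failed", "api_key": "sk-123", "hdr": "Bearer eyJ0..."})
# api_key and the bearer token come back as [REDACTED]
```

Registering this as a processor in structlog (or a logging Filter in stdlib logging) means no call site can forget to scrub.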

The return on structured logging is massive and the cost is small. It’s the easiest observability lift a team can make, and the most often delayed.

[Figure: the same event as an unstructured line and as a structured JSON line. The unstructured version requires fragile regex, and attempt count, trace_id, and pod are missing entirely; the structured version answers {namespace="inference"} | json | attempt >= 3 in one LogQL query, no regex.]
Structured JSON logs turn every field into a queryable dimension at little extra storage cost — repeated keys compress well — and the switch is usually a small, localized change in every mainstream runtime.

94.2 The log as a queryable record

Think of a log stream as an append-only table with three columns:

timestamp (ns) | labels (small set of key=value) | line (opaque string or JSON blob)

The labels are identifiers: job name, pod name, namespace, cluster, log level. They are small in count and relatively stable. The line is everything else — the actual content, which could be megabytes.

The query pattern you want:

  1. Filter by labels to find the right streams (e.g., all logs from pods in the inference namespace with level=error).
  2. Filter inside the line by regex or JSON field to find the specific event.
  3. Return the matching lines, optionally aggregated or counted.

This is what LogQL does. It’s also what Elasticsearch does, except Elasticsearch indexes every word of the line rather than just the labels. That architectural divide is covered in §94.9.

94.3 LogQL mental model

Loki’s query language is LogQL, and it is deliberately modeled after PromQL. Every LogQL query has two parts:

{label_filter} |= "line filter" | logfmt | json | ...

The label filter in curly braces picks which streams to read. The pipeline of transformations after the pipe does line-level filtering, parsing, and projection.

Concrete examples:

# All error logs from a specific deployment
{namespace="inference", app="vllm"} |= "error"

# Errors containing a specific model name, parsed as JSON
{namespace="inference"} |= "error" | json | model="llama-3-70b"

# Rate of error logs per minute, by pod
sum by (pod) (rate({namespace="inference"} |= "error" [1m]))

The last query is the thing to internalize: LogQL can produce metrics from logs. rate({stream} |= "error" [1m]) computes “error-lines per second” as a Prometheus-compatible metric. You can then alert on it, graph it, combine it with actual Prometheus metrics.

This is the “metrics from logs” capability that Splunk and ELK have always had. The difference is that Loki makes it cheap because the label-based index means the rate query only has to scan the error stream, not every log line in the system.

Operator reference:

  • |= — line contains substring
  • != — line does not contain
  • |~ — line matches regex
  • !~ — line does not match regex
  • | json — parse line as JSON, expose fields as labels for the rest of the pipeline
  • | logfmt — parse key=value logfmt
  • | line_format "{{.field}}" — reformat output lines
  • | label_format new=old — rename labels

The pipeline is evaluated per line, left to right. The label filter at the start is what determines the cost — it limits which streams need to be scanned. Every stage after the pipe still runs over each line in those streams, so put cheap substring filters (|=) before expensive parsers (| json) to keep the per-line work small.
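To make "evaluated per line, left to right" concrete, here is a toy model of a LogQL pipeline in Python — not Loki's implementation, just the shape of it: each stage either passes the record on (possibly enriched) or drops it.

```python
import json as jsonlib

def contains(sub):            # models: |= "sub"
    return lambda rec: rec if sub in rec["line"] else None

def parse_json():             # models: | json — expose fields to later stages
    def stage(rec):
        try:
            rec["fields"].update(jsonlib.loads(rec["line"]))
            return rec
        except ValueError:
            return None       # unparseable lines are filtered out
    return stage

def field_eq(key, value):     # models: | key="value"
    return lambda rec: rec if rec["fields"].get(key) == value else None

def run_pipeline(lines, stages):
    for line in lines:
        rec = {"line": line, "fields": {}}
        for stage in stages:
            rec = stage(rec)
            if rec is None:
                break         # short-circuit: later stages never run
        if rec is not None:
            yield rec["line"]

lines = ['{"level":"error","model":"llama-3-70b"}',
         '{"level":"error","model":"mistral-7b"}',
         '{"level":"info","model":"llama-3-70b"}']
matched = list(run_pipeline(lines, [contains("error"),
                                    parse_json(),
                                    field_eq("model", "llama-3-70b")]))
# matched contains only the first line
```

The short-circuit in run_pipeline is why stage ordering matters: the cheap substring check drops the info line before the JSON parser ever runs.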

94.4 Loki’s “index by labels, store full text” design

Here is the architectural pitch for Loki, in one sentence: index only the labels, store the log lines as opaque compressed chunks in object storage. Labels go into a small index. Lines go into S3 or GCS.

Why this works:

  • Object storage is cheap. S3 Standard is on the order of $0.023/GB/month, S3 Glacier is ~$0.004/GB/month. Compared to Elasticsearch’s SSD requirement ($0.10/GB/month or more), this is a 5-25× cost advantage.
  • Compression is excellent. Gzipped log chunks compress 10-20×. A GB of raw logs becomes 50-100 MB on S3. Multiply the cost advantage again.
  • Most queries only need a few streams. If you filter by labels to a narrow set of streams, you only decompress the chunks for those streams. The total amount of data scanned per query is small.

What this costs:

  • Full-text search is not indexed. A query like “find the stack trace containing this substring” scans every byte of every stream you named in the label filter. If your label filter matches a lot of streams, the query is slow and possibly expensive.
  • Line-level filters are O(bytes scanned). No inverted index on words.
  • The label filter does all the work. If you can’t filter by labels, you can’t scope the query down.

This is the trade. Loki says “your index should only contain what you already know how to filter on” (service, pod, level, namespace) and leaves full-text search as a brute-force scan over compressed chunks. For most observability use cases where you know which service is broken before you query the logs, this is the right trade.

The internal architecture:

  • Distributors receive pushes from shippers, hash streams by label set, forward to ingesters.
  • Ingesters hold recent data in memory, flush to object storage in time- or size-based chunks.
  • Queriers read from both ingesters (for recent) and object storage (for historical), apply the pipeline, stream results back.
  • Compactor rewrites small chunks into larger ones over time, reducing the number of S3 objects.

Each component is horizontally scalable. A single Loki deployment can handle terabytes per day if the label cardinality is controlled.
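"Hash streams by label set" means the sorted label set is the stream's identity, and the hash of that identity picks an ingester. A simplified sketch — real Loki uses a consistent-hash ring with tokens rather than plain modulo, and the ingester names here are placeholders:

```python
import hashlib

INGESTERS = ["ingester-0", "ingester-1", "ingester-2"]

def stream_key(labels: dict) -> str:
    # Sort so {app=vllm, level=error} and {level=error, app=vllm}
    # identify the same stream.
    return ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))

def route(labels: dict) -> str:
    digest = hashlib.sha256(stream_key(labels).encode()).digest()
    return INGESTERS[int.from_bytes(digest[:8], "big") % len(INGESTERS)]
```

This also makes the cardinality problem tangible: every distinct label combination is a new key, a new route, and a new set of chunks.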

graph LR
  A[App pod] -->|JSON log lines| S[Vector DaemonSet<br/>parse · enrich · filter]
  S -->|push API| D[Loki Distributor]
  D -->|hash stream| I[Loki Ingester<br/>in-memory chunks]
  I -->|flush| OBJ[(S3 / GCS<br/>compressed chunks)]
  Q[Loki Querier] -->|recent| I
  Q -->|historical| OBJ
  G[Grafana / LogQL] --> Q
  style I fill:var(--fig-accent-soft),stroke:var(--fig-accent)

Log lines flow from pods through a per-node shipper (Vector), into Loki’s ingest path — only the label set is indexed; the line content lands as compressed chunks in object storage, which is 5-25× cheaper than Elasticsearch’s SSD-backed inverted index.

94.5 Label cardinality pitfalls (again)

Everything said about Prometheus cardinality in Chapter 93 applies here, doubly. A Loki label is also a stream identifier, and every unique label combination creates a new stream. Streams are stored and queried individually; too many streams means too many index entries, too many small chunks in object storage, and slow queries.

The rules:

  • Never put trace_id, request_id, user_id in labels. Put them in the log line itself as JSON fields. You can still query them via | json | trace_id="abc" without paying the cardinality cost.
  • Labels should be service, namespace, pod, level, cluster, and maybe a handful of others. That’s it. Stop before you add a fifth.
  • Pod churn still hurts. A deployment rolling every 10 minutes with 100 pods creates a new set of streams every 10 minutes. Over a day, that’s 14,400 pod-streams. If each pod has ~5 log levels, you have 72,000 streams. This is manageable, but it’s the ceiling to watch.
  • Kubernetes labels can leak into log labels. If you auto-inject every pod label as a log label, and someone adds a high-cardinality pod label (git_sha, release_tag), your Loki cardinality explodes. Allowlist the labels you promote, don’t auto-promote everything.

A healthy Loki deployment has on the order of 100K-1M active streams. Above that, you pay in query latency and storage overhead. The controlled-labels discipline is what keeps you below the ceiling.
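The pod-churn arithmetic above is worth having as a reusable estimate. A rough upper bound, using the numbers from the text (this ignores chunk retention details; it is a back-of-envelope model, not Loki's accounting):

```python
def active_streams(pods, rollouts_per_day, levels, extra_labels=1):
    """Rough upper bound on daily stream count from pod churn.

    Each rollout mints a fresh set of pod names, and every unique
    (pod, level, ...) label combination is its own stream.
    """
    pod_streams = pods * rollouts_per_day
    return pod_streams * levels * extra_labels

# Worked example from the text: 100 pods rolling every 10 minutes.
rollouts = 24 * 60 // 10                        # 144 rollouts per day
print(active_streams(100, rollouts, levels=5))  # 72000
```

Multiply in one more label with 10 values (extra_labels=10) and you are at 720,000 streams — at the top of the healthy range from a single deployment.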

94.6 Shippers: Vector vs Fluent Bit vs Promtail

A log shipper is the agent that runs on each node, reads log files (or stdout streams), and ships them to Loki or another backend. The three mainstream options:

Fluent Bit — C, very small footprint (~10 MB memory), mature, the default in most Kubernetes clusters. Configuration is INI-style with a plugin pipeline. Rich ecosystem of input/output plugins. The “if in doubt, run Fluent Bit” choice.

Vector — Rust, larger footprint (~30-100 MB), very fast, configured via TOML or YAML with a DSL called VRL (Vector Remap Language) for transformations. VRL is expressive enough to do JSON parsing, field extraction, enrichment, and reshaping in one file. A Vector pipeline can replace both a log shipper and a small Logstash deployment. My recommendation for new deployments.

Promtail — Grafana’s Loki-specific agent. Simple, lightweight, zero-config for Loki. Being gradually deprecated in favor of Grafana Alloy (which is effectively a rebranded merger of Promtail + other Grafana agents).

Key patterns any shipper needs to get right:

  • Parse at the edge. If the logs are JSON, parse them into structured fields at the shipper, not at query time. Loki accepts structured logs and will use them.
  • Add pod/namespace labels from Kubernetes metadata. The shipper should attach the right labels automatically by querying the kubelet.
  • Rate-limit per stream. A runaway container logging 100 MB/s will DOS the backend if the shipper doesn’t throttle it. Most shippers support per-stream rate limits.
  • Handle backpressure. If Loki is slow or unreachable, the shipper must buffer (to disk ideally) and retry, not drop. Both Vector and Fluent Bit support disk buffers.
  • Drop noise early. If you know certain log streams are useless (health-check logs, for example), drop them at the shipper. Every byte that reaches Loki costs money.

For a production ML platform, a common setup is Vector as a DaemonSet on every node, parsing and enriching logs, then forwarding to Loki via the Loki push API. Vector’s VRL lets you do scrubbing (strip auth tokens), filtering (drop health checks), and routing (send certain streams to a separate bucket) in one place.
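A sketch of what that DaemonSet config could look like. Treat this as illustrative: the source/transform/sink names, the Loki endpoint, and the exact Kubernetes field paths are assumptions to adapt; the VRL functions (parse_json, merge, del, contains) are standard.

```toml
[sources.k8s]
type = "kubernetes_logs"

[transforms.clean]
type = "remap"
inputs = ["k8s"]
source = '''
  # Parse at the edge; abort (and drop the event) if it is not JSON.
  . = merge!(., parse_json!(string!(.message)))
  del(.api_key)                     # scrub secrets before they leave the node
'''

[transforms.drop_noise]
type = "filter"
inputs = ["clean"]
condition = '!contains(string(.msg) ?? "", "/healthz")'

[sinks.loki]
type = "loki"
inputs = ["drop_noise"]
endpoint = "http://loki:3100"       # placeholder address
encoding.codec = "json"
labels.namespace = "{{ kubernetes.pod_namespace }}"
labels.pod = "{{ kubernetes.pod_name }}"
```

Note the allowlist discipline from §94.5 in the sink: only namespace and pod are promoted to labels; everything else stays in the line.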

94.7 Sampling strategies

Not every log line is worth keeping. Sampling reduces volume — and thus cost — by keeping a subset of lines that are statistically representative or semantically important.

[Figure: the log sampling pyramid, ordered by volume produced. DEBUG/TRACE → drop entirely; INFO → sample 1-10%; WARN → keep 100%; ERROR → keep 100% — never sample errors.]
Sampling debug and info logs 90–99% while keeping all errors costs almost nothing in diagnostic coverage but cuts log volume — and storage bills — by 80–95%.

Strategies:

(1) Level-based filtering. The simplest: drop debug and trace in production. Keep info and above. This alone cuts volume by 80-95% for a verbose service.

(2) Random sampling. Keep 1 in N lines for a given stream. Useful for high-volume info logs where you just need “a sample of what’s happening.” Keep 100% of errors and warnings; sample infos.

(3) Dynamic rate-limited sampling. Keep the first K events per second, then drop until the next second. Prevents a single noisy loop from dominating.

(4) Tail-based sampling. Like tail-based trace sampling (Chapter 95). Buffer logs for a request in memory; if the request ultimately failed or was slow, keep all its logs; otherwise drop them. Hard to implement correctly and requires the shipper to know about request boundaries.

(5) Per-tenant sampling. Different tenants get different retention and sampling policies. Free-tier tenants get 1% sampling, paid tenants get 100%. Requires tenant identification at the shipper.

For ML serving specifically: the pattern that works best is “keep all errors, sample info logs at 1-10%, drop debug.” With structured logging, you can always reconstruct the context of a slow request from the errors + the sampled infos around it. If you need more, you turn up sampling temporarily during an incident.
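That "keep all errors, sample infos, drop debug" policy is a few lines wherever the shipper (or the logging library) lets you run a predicate per event. A minimal sketch — the rates are the illustrative ones from the text:

```python
import random

# Probability of keeping a line, by level (illustrative rates).
SAMPLE_RATES = {"error": 1.0, "warn": 1.0, "info": 0.1, "debug": 0.0}

def keep(record: dict, rng=random.random) -> bool:
    """Level-based sampling: all errors/warns, 10% of infos, no debug.

    Unknown levels default to keep — fail open, never silently drop.
    """
    rate = SAMPLE_RATES.get(record.get("level", "info"), 1.0)
    return rng() < rate
```

During an incident, raising SAMPLE_RATES["info"] to 1.0 is the "turn up sampling temporarily" knob.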

One warning: sampling hides low-rate events. A bug that affects 1 in 10,000 requests will disappear if you sample at 1%. Error-level and warning-level logs should never be sampled. That 1-in-10,000 bug needs to show up in the logs.

94.8 Retention cost math

Logs are expensive because retention is expensive. Do the math.

Assume:

  • A 1000-pod cluster.
  • Each pod produces 10 KB/s of logs (realistic for a structured-logging LLM serving stack).
  • Raw ingest: 10 MB/s = 864 GB/day = 26 TB/month.

At S3 Standard pricing ($0.023/GB/month), with 10× compression, 26 TB uncompressed becomes 2.6 TB compressed. One month’s retention costs:

2.6 TB × $0.023/GB = $60/month

Not bad. Now add 12 months of retention and you’re storing 31 TB compressed at any time, for $713/month. Still modest. Now add Loki’s indexing overhead (another ~5-10%) and the cost of the querier/ingester compute, and you’re maybe at $1-2k/month for logs from a 1000-pod cluster. Elasticsearch doing the same job would cost 10× that.
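The arithmetic above, as a function you can rerun with your own numbers. The prices and the 10× compression ratio come from the text; the small difference from the $713 figure is just rounding (31.2 TB vs 31 TB):

```python
S3_STANDARD = 0.023   # $/GB/month, S3 Standard (figure from the text)
COMPRESSION = 10      # structured logs typically compress ~10x

def monthly_storage_cost(raw_gb_per_month, retention_months):
    """Steady-state monthly S3 bill for a given ingest and retention."""
    stored_gb = raw_gb_per_month * retention_months / COMPRESSION
    return stored_gb * S3_STANDARD

# 26 TB/month raw ingest, as in the worked example.
print(round(monthly_storage_cost(26_000, 1)))    # ~60
print(round(monthly_storage_cost(26_000, 12)))   # ~718
```

Drop COMPRESSION to 1 and the one-month figure jumps to ~$600 — the "no compression" failure mode below, quantified.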

Where it gets expensive:

  • No compression. Ingest raw, poorly compressible text and the 10× multiplier disappears — and your bill grows 10×.
  • Long retention on hot storage. S3 Standard for a year of logs you query once is wasteful. Use S3 Intelligent-Tiering or a lifecycle policy that moves chunks older than 30 days to Glacier. Loki supports this.
  • High cardinality. High label cardinality means lots of tiny chunks, which means lots of S3 GET requests during query. S3 GET costs ($0.0004/1000) add up when your query hits 100k chunks.
  • Expensive queries. A query that filters only by namespace and scans every line of every pod in that namespace over a week is reading terabytes of data. Loki’s query cost is proportional to bytes scanned. Run these rarely.

The retention policy discussion: most organizations keep hot logs for 14-30 days and archive anything older to Glacier or delete it. The legal hold and compliance requirements are what push retention longer, not debugging needs. Engineering almost never needs logs older than two weeks; compliance sometimes needs 7 years. Split the retention: hot 30 days, archive the rest at a tenth the cost.

94.9 Loki vs Elasticsearch — the architectural divide

The two dominant approaches to logs:

Elasticsearch (ELK stack): index every word of every log line in an inverted index. Any full-text query is fast because the index answers it directly. Storage is on hot SSD because index structures need random access.

  • Pro: arbitrary full-text search is fast.
  • Con: ingest is expensive (indexing is CPU-heavy). Storage is expensive (SSDs, plus the index is often 2-10× the raw data size). Cluster management is operationally heavy.

Loki: index only the labels. Store lines as compressed chunks in object storage. Full-text queries scan compressed chunks.

  • Pro: ingest is cheap. Storage is 5-25× cheaper. Operationally simpler.
  • Con: full-text queries are slower, especially over wide time ranges or broad label filters.

When to pick which:

  • Elasticsearch if you have analysts or SREs who run ad-hoc full-text queries constantly, if you need sub-second full-text response, or if you’re doing log analytics (aggregations, faceting) as a primary use case.
  • Loki if you know which service you’re investigating before you query, and if cost matters. Almost every SRE workflow fits this pattern.

A common hybrid: Loki for everything, with a small Elasticsearch (or OpenSearch) for security logs that need full-text search. Keeps the bulk of the data cheap while still giving the security team their tool.

Note: Loki is not the only store in this space. VictoriaLogs, Quickwit, and a few others occupy similar territory — object-storage-friendly log stores with far lighter indexing than Elasticsearch — differing in how much they index, compression, query speed, and operational complexity. Loki is the default because it integrates cleanly with Grafana and Prometheus.

94.10 Production patterns

A few patterns from real deployments:

(1) Standardize on a single JSON schema cluster-wide. Field names: ts, level, msg, service, pod, trace_id, span_id, error. Enforce at the language library level.

(2) Inject trace IDs automatically. Use OTel instrumentation (Chapter 95) to propagate the trace context into the logging library so every log line for a request carries the trace ID without manual effort.

(3) Drop health checks and debug at the shipper. Not at the backend. Every byte you drop early is a byte you don’t pay to store.

(4) Rate-limit per pod at the shipper. A panicking pod that logs 1 GB/s can take out your Loki if the shipper doesn’t throttle.

(5) Query with label filters first, line filters second. A LogQL query like {namespace="x"} |= "error" | json | model="y" is efficient because the label filter scopes it before the line filter runs.

(6) Short retention on hot, long on cold. 14-30 days in S3 Standard, the rest in Glacier or deleted.

(7) Alert on log rate, not individual lines. rate({level="error"}[5m]) > 10 is a better alert than “fire on every error line.” Rate-based alerts don’t spam during a real outage.

(8) One log per event, not per line of code. A single JSON blob with all the context beats 10 lines that need to be correlated.

These patterns are what separate “logging works” from “logging costs a fortune and nobody can query it.”

94.11 The mental model

Eight points to take into Chapter 95:

  1. Structured JSON logs are non-negotiable. Machine-queryable is the whole point.
  2. A log stream is (labels, lines). Filter by labels to scope, filter by lines to find.
  3. LogQL mirrors PromQL and can produce metric-shaped outputs from log data.
  4. Loki indexes labels, not words. Cheap storage in object stores in exchange for slower full-text queries.
  5. Cardinality is still the killer. Never put IDs in labels. Put them in the JSON line.
  6. Vector at the shipper, Loki at the backend is a strong default combination for new deployments.
  7. Sample infos, keep all errors. Level-based plus random sampling cuts cost without hiding real issues.
  8. Retention cost is linear in volume, multiplied by time. Short hot retention, long cold retention, aggressive compression.

In Chapter 95, we move from “what happened in this service” to “what happened across services” — distributed tracing.


Read it yourself

  • Grafana Loki documentation, especially the LogQL and cardinality sections.
  • Observability Engineering (Majors, Fong-Jones, Miranda), chapters on structured data and events.
  • Vector documentation and VRL (Vector Remap Language) reference.
  • The Elastic blog post series on log indexing internals — useful for understanding what Loki deliberately doesn’t do.
  • Google SRE Workbook, chapter on “Non-Abstract Large System Design,” for the cost discussion around log retention.
  • Honeycomb’s blog on structured events — a related (events-first) view on the same problem.

Practice

  1. Convert an unstructured log line to JSON with a consistent schema. Include trace ID, level, service, and error fields.
  2. Write a LogQL query that counts error rate per pod per minute over the last hour for all pods in the inference namespace.
  3. A team adds user_id as a Loki label. Describe exactly what breaks and why.
  4. Calculate the monthly cost of 26 TB/month uncompressed logs stored in S3 Standard for 30 days with 10× compression. Now repeat for 12 months.
  5. For a vLLM pod logging 100 info lines per request at 50 RPS, what is the line rate per pod? Should you sample? At what ratio?
  6. Compare the architectural trade-offs of Loki vs Elasticsearch for an on-call engineer’s workflow of “find the errors around this broken request.” Which wins?
  7. Stretch: Stand up a local Loki with Vector as the shipper. Point it at a small app that emits JSON logs. Query it with LogQL. Deliberately add a high-cardinality label and watch query latency degrade.