Part VI · Distributed Systems & Request Lifecycle
Chapter 84 · ~22 min read

Telemetry firehoses: Kafka as the backbone

"Kafka is not a message queue. It is a replicated, ordered, append-only log. Everything that makes it good at telemetry — and everything that makes it confusing — follows from that sentence."

Every high-volume ML platform has a telemetry firehose — a stream carrying every request, every usage event, every log line, every trace span, every metric emission from the data plane to the places that aggregate, store, and analyze them. The firehose is typically Kafka. Some shops use Kinesis, Pulsar, Redpanda, or PubSub, but the semantics are nearly identical and the design considerations transfer. “Kafka” has become the generic name for “the log at the center of the data plane.”

After this chapter, the firehose pattern is clear end to end. Why a log is the right abstraction for telemetry. How to pick partition keys and what happens when you pick wrong. Consumer groups and the parallelism model. At-least-once vs “exactly-once” semantics and the honest version of each. Dead-letter queues. Schema evolution with Avro, Protobuf, and JSON Schema. Why Kafka won. And because this chapter closes Part VI, the final paragraph recaps the whole request lifecycle from Chapters 73–83.

Outline:

  1. Why a log is the right abstraction.
  2. Kafka in one section.
  3. Topics, partitions, offsets.
  4. Partition keys and how to pick them.
  5. Consumer groups and parallelism.
  6. Delivery semantics — at-least-once, at-most-once, exactly-once.
  7. Dead-letter queues and poison pills.
  8. Schema evolution — Avro, Protobuf, JSON Schema.
  9. Operational realities.
  10. Why Kafka won.
  11. The mental model and Part VI recap.

84.1 Why a log is the right abstraction

A log is an append-only sequence of records. New records go on the end; existing records are immutable; consumers read from a position in the log and advance their position as they process records. That is the entire data model.

It sounds trivial and it is load-bearing for telemetry. Every useful property of a telemetry firehose falls out of “append-only log”:

  • Multiple consumers read the same data. Real-time dashboards, batch aggregators, audit archivers, anomaly detectors, ML training pipelines — all consume the same stream independently. Each has its own read position. The producer does not know or care who is reading.
  • Ordering is preserved (within a partition). Events come out in the same order they went in, which matters for any consumer that cares about sequence — state reconstruction, time-windowed aggregation, change data capture.
  • Replay is cheap. Need to re-aggregate the last 24 hours because you found a bug? Rewind the consumer and replay. The log is still there (subject to retention).
  • Decoupling in time. The producer writes at its own rate; the consumer reads at its own rate; a temporary mismatch is absorbed by the log buffer. A slow consumer doesn’t block the producer; a brief producer spike doesn’t lose data if the consumer catches up.
  • Decoupling in space. The producer and consumer don’t know about each other. New consumers show up without the producer changing. Old consumers disappear without the producer noticing.
[Figure: Kafka topics as shared logs — one producer appends records to the topic (offsets 0, 1, 2, 3, 4, 5, …); the consumer groups billing-aggregator, trace-archiver, and anomaly-detector each maintain their own offset, never block each other, and can each replay independently.]
Multiple independent consumers read the same events without interfering; a slow billing aggregator does not affect the trace archiver, and either can replay from any offset.

Compare to a traditional message queue (RabbitMQ, SQS): messages are consumed and deleted; fan-out requires duplicating messages; replay requires re-injection. The log abstraction is strictly more general — you can simulate a queue on a log (one consumer, commit offsets as you go) but you can’t simulate a log on a queue.
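The log-versus-queue distinction can be sketched in a few lines of plain Python. This is an in-memory toy, not a Kafka client — the class and method names are illustrative — but it shows the properties that matter: immutable appends, per-consumer read positions, and replay as a cursor move.

```python
class Log:
    """A minimal append-only log: records are immutable, readers own their positions."""
    def __init__(self):
        self.records = []

    def append(self, record) -> int:
        self.records.append(record)
        return len(self.records) - 1              # offset of the new record

    def read_from(self, offset: int, max_records: int = 10):
        return self.records[offset:offset + max_records]

class Consumer:
    """Each consumer owns its offset; consumers never affect each other."""
    def __init__(self, log: Log):
        self.log, self.offset = log, 0

    def poll(self):
        batch = self.log.read_from(self.offset)
        self.offset += len(batch)                 # advance the position after reading
        return batch

    def rewind(self, offset: int = 0):
        self.offset = offset                      # replay is just moving the cursor

log = Log()
for event in ["req-1", "req-2", "req-3"]:
    log.append(event)

dashboard, archiver = Consumer(log), Consumer(log)
assert dashboard.poll() == ["req-1", "req-2", "req-3"]
assert archiver.poll() == ["req-1", "req-2", "req-3"]   # same data, independent cursor
dashboard.rewind()
assert dashboard.poll() == ["req-1", "req-2", "req-3"]  # replay is cheap
```

Simulating a queue on this log is one consumer committing its offset as it goes; simulating this log on a delete-on-consume queue is not possible, which is the asymmetry the paragraph above describes.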

Jay Kreps’ essay “The Log: What every software engineer should know about real-time data’s unifying abstraction” (2013) is the canonical statement of this argument. It is worth reading at least once for the way it reframes data pipelines as “the log plus whatever derives state from it.”

84.2 Kafka in one section

Kafka is an open-source distributed log system originally built at LinkedIn and now an Apache project. The key parts:

  • Broker: a Kafka server that stores partitions and serves producers/consumers. Brokers are clustered; a cluster typically has 3-9+ brokers.
  • Producer: a client that writes records to topics.
  • Consumer: a client that reads records from topics.
  • Topic: a named log. Think “usage events,” “trace spans,” “audit logs.”
  • Partition: a topic is split into N partitions for parallelism. Each partition is an ordered log; a topic is a collection of parallel logs.
  • Offset: the position of a record within a partition. Monotonically increasing integer.
  • Replication: each partition is replicated N-way (typically 3x) across brokers. One replica is the leader; the others are followers.
  • Consumer group: a set of consumers that share the work of reading a topic. Kafka assigns partitions to group members so each partition is read by exactly one consumer in the group.

Kafka’s coordination was originally done by ZooKeeper; KIP-500’s “KRaft” mode replaces it with a built-in Raft quorum, shipped as early access in Kafka 2.8 and marked production-ready in 3.3. New deployments should use KRaft. Old deployments still run ZooKeeper.

The wire protocol is binary over TCP. Producers batch records, compress them, and send them to the partition leader. Consumers poll the leader for new records starting from an offset. All of this happens over long-lived connections with sane defaults for batching, compression (Snappy or LZ4 are typical), and backpressure.

Kafka’s durability story hinges on the producer’s acks setting: acks=0 is fire-and-forget, acks=1 waits for the write to the leader’s log, and acks=all waits until every in-sync replica has the record. For telemetry, acks=all is standard even though it costs a few ms of latency.
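A durability-first producer configuration might look like the sketch below. The keys follow librdkafka naming (as accepted by confluent-kafka’s Producer); the broker addresses and exact batch numbers are placeholders, not recommendations from this text.

```python
# Durability-first telemetry producer settings (librdkafka-style config keys).
# Broker addresses and batch sizes are illustrative placeholders.
producer_config = {
    "bootstrap.servers": "kafka-1:9092,kafka-2:9092,kafka-3:9092",
    "acks": "all",               # ack only after all in-sync replicas have the record
    "enable.idempotence": True,  # same-producer retries don't duplicate within Kafka
    "compression.type": "lz4",   # batch-level compression; 3-5x on JSON telemetry
    "linger.ms": 20,             # wait up to 20 ms to fill a batch before sending
    "batch.size": 256 * 1024,    # larger batches amortize per-request overhead
}
```

The latency cost of acks=all is paid once per batch, not once per record, which is why batching (linger.ms, batch.size) and acks=all go together.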

84.3 Topics, partitions, offsets

A topic is the unit of organization. You pick one topic per logical data stream: usage_events, request_logs, model_outputs, tokenization_errors. Topics are cheap; do not be stingy with them.

A partition is the unit of parallelism. Within a consumer group, each partition is read by exactly one consumer at a time; producers can write concurrently, but each record lands in exactly one partition. Multiple partitions allow multiple concurrent writers and readers. Parallelism in Kafka is bounded by partition count: if your topic has 12 partitions, you can have at most 12 consumers in a group reading from it simultaneously. Adding a 13th gives you one idle consumer.

Partition count is a design decision that is hard to change after the fact (you can add partitions but not remove them, and adding changes the key-to-partition mapping, breaking any guarantee that depended on the old assignment). Rules of thumb:

  • Minimum 6-12 partitions per topic for production-scale traffic, even if you only have 2 consumers today, so you have room to scale.
  • Don’t blow past ~1000 partitions per broker because each partition has overhead (file handles, metadata).
  • Partition count should be at least as high as your target parallelism.

Offsets are the position within a partition. They are monotonically increasing integers, never reused, and are unique per (topic, partition). A consumer tracks its position as (topic, partition, offset) — when you say “consumer X is at 47382 in partition 3 of topic usage_events,” you’ve fully specified its state.

Offsets are stored by the consumer group in a special Kafka topic called __consumer_offsets. The consumer commits its offset periodically or on specific events (processing complete, batch flushed). On restart, the consumer reads the committed offset and resumes from there.

Retention is time-based, size-based, or both. A topic might be configured to keep records for 7 days or 1 TB, whichever comes first. Old records are deleted by segment (partitions are stored as fixed-size files on disk, and expiring one segment is a file delete). For telemetry that feeds many consumers, retention should be longer than the slowest consumer’s worst-case lag plus some buffer — typically 3-7 days.
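Retention sizing is back-of-envelope arithmetic. The numbers below are assumptions for illustration (50k events/sec at ~500 bytes, 7-day retention, 3x replication, ~4x compression), not figures from the text:

```python
# Back-of-envelope broker disk sizing for a telemetry topic.
# All inputs are illustrative assumptions.
events_per_sec = 50_000
bytes_per_event = 500
retention_days = 7
replication = 3          # each record stored on 3 brokers
compression_ratio = 4    # typical for repetitive JSON with LZ4/Snappy

raw_per_day = events_per_sec * bytes_per_event * 86_400     # uncompressed bytes/day
stored = raw_per_day * retention_days * replication / compression_ratio

print(f"{raw_per_day / 1e12:.2f} TB/day raw, {stored / 1e12:.2f} TB on disk")
# -> 2.16 TB/day raw, 11.34 TB on disk across the cluster
```

Runs like this are worth redoing whenever retention, replication, or compression settings change — disks fill faster than intuition suggests (see 84.9).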

84.4 Partition keys and how to pick them

Each record produced to a topic can carry a key. The key is hashed (by default, with MurmurHash2) to pick a partition. All records with the same key go to the same partition — the critical property.

Why it matters: ordering is guaranteed within a partition, not across partitions. If you want all events for a given tenant (or user, or operation, or device) to be processed in order, they must land on the same partition, which means they must share a key.

How to pick the key:

Use the natural unit of ordering. For usage events, the tenant id is usually right — all events for one tenant go to the same partition, so per-tenant aggregation is trivial. For trace spans, the trace id — all spans of a trace are colocated. For user-scoped events, the user id.

Watch out for hot partitions. If one tenant generates 10% of your traffic, that tenant’s partition handles 10% of all load while other partitions are idle. This is the “hot key” or “celebrity problem” — your biggest customer becomes your biggest scalability headache. Mitigations:

  • Use a compound key (tenant_id + user_id) that spreads within a tenant.
  • Rotate a key prefix periodically so a hot key’s traffic migrates across partitions over time.
  • For truly hot keys, use “sub-keying” — the producer appends a random salt for the hottest tenants.

Use null keys when ordering doesn’t matter. A null key distributes records round-robin (or sticky-partitioned, in newer clients) across partitions. This maximizes parallelism and load balance. Use for logs, metrics, or events where per-key ordering is not needed.

Don’t use sequential keys. If your key is a monotonically increasing integer, the hash still distributes, but the distribution can be skewed and you lose the locality benefit of keying. Coarse time-based keys (the event timestamp truncated to a minute or hour) are particularly bad — every event in a window shares a key, so they all go to one partition at a time.

The wrong partition key choice is the most common Kafka mistake. You deploy with key = user_id, a viral user shows up, one partition saturates, consumer lag goes to the moon, pagers fire. The fix (re-keying) requires downtime or a careful migration. Pick the key carefully up front.
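The key-to-partition mechanics and the salting mitigation can be sketched in a few lines. Kafka really hashes keys with murmur2; md5 stands in here so the sketch is stdlib-only and stable across runs, and all names (partition_for, the tenant ids) are illustrative.

```python
import hashlib

NUM_PARTITIONS = 12

def partition_for(key: str) -> int:
    """Deterministic key -> partition mapping. Kafka uses murmur2; md5 is a
    stand-in so this sketch needs only the stdlib and is stable across runs."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Same key always lands on the same partition -> per-key ordering holds.
assert partition_for("tenant-42") == partition_for("tenant-42")

# Sub-keying a celebrity tenant: append a small random salt so its traffic
# spreads over several partitions (at the cost of per-tenant ordering).
salted = {partition_for(f"tenant-42#{salt}") for salt in range(8)}
assert len(salted) > 1   # the hot tenant now spans multiple partitions
```

The trade is explicit in the code: once a tenant is salted across partitions, any consumer that needs that tenant’s events in order must re-sort them downstream.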

[Figure: the hot partition problem across the 4 partitions of topic usage_events — P0 (tenant-A) carries 60% of all traffic and saturates, while P1 (tenant-B, 15%), P2 (tenant-C, 14%), and P3 (others, 11%) sit mostly idle. Fix: a compound tenant_id+user_id key, or a random-salt sub-key for the hottest tenants.]
A hot partition saturates one broker and one consumer thread while others sit idle; compound keys or random salting for celebrity tenants prevents this imbalance.

84.5 Consumer groups and parallelism

A consumer group is a set of consumers that cooperate to read a topic. Each partition in the topic is assigned to exactly one consumer in the group. If you have 12 partitions and 3 consumers, each consumer reads 4 partitions. If you have 12 partitions and 12 consumers, each reads 1. If you have 12 partitions and 20 consumers, 8 are idle.
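The assignment arithmetic above can be sketched directly. Real Kafka ships several assignors (range, round-robin, cooperative sticky); this toy round-robin version, with illustrative names, shows the bound that partition count puts on parallelism:

```python
def assign(partitions: int, consumers: list) -> dict:
    """Toy round-robin partition assignment, as a sketch of what the group
    coordinator computes (real assignors: range, round-robin, sticky)."""
    assignment = {c: [] for c in consumers}
    for p in range(partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

three = assign(12, [f"c{i}" for i in range(3)])
assert all(len(parts) == 4 for parts in three.values())   # 12 / 3 = 4 each

twenty = assign(12, [f"c{i}" for i in range(20)])
idle = [c for c, parts in twenty.items() if not parts]
assert len(idle) == 8                                     # 8 consumers sit idle
```

Every scale-up or scale-down recomputes this mapping — that recomputation is the rebalance described next.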

When a consumer joins or leaves the group (scale up, scale down, crash, deploy), Kafka rebalances: it reassigns partitions among the current members. The rebalance is a brief stop-the-world for the group — all consumers pause, the assignment is recomputed, then consumption resumes. Modern Kafka (with incremental cooperative rebalancing) minimizes the stop, but it’s still real. For latency-sensitive consumers, frequent rebalances are painful.

Different consumer groups are independent. Topic usage_events can be consumed by group billing-aggregator for billing, group trace-archiver for audit, and group anomaly-detector for monitoring. Each group has its own offsets, its own parallelism, its own lag. They don’t interfere with each other. This is how you get “N independent consumers of the same stream” — distinct group ids.

Two patterns of consumer group usage:

Worker pool: scale consumers horizontally for throughput. 12 partitions, 12 consumer pods, each handling its share. This is the telemetry pipeline default.

Exclusive singleton: 1 partition, 1 consumer in the group. Used when you need one process to see every record in order — e.g., a state machine that reacts to a sequence of events. The downside is zero horizontal scalability; the partition is a bottleneck.

Consumer lag — the difference between the latest offset and the consumer’s committed offset — is the key operational metric for a Kafka-based pipeline. A growing lag means the consumer is falling behind; a shrinking lag means it’s catching up. Dashboards for every consumer group’s lag are mandatory. Alerts on “lag growing faster than processing rate” catch problems before they become outages.
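Lag and the “growing faster than processing” alert are simple arithmetic over per-partition offsets. A minimal sketch, with made-up offset numbers:

```python
def lag(latest: dict, committed: dict) -> int:
    """Total consumer lag: sum over partitions of (latest offset - committed offset)."""
    return sum(latest[p] - committed.get(p, 0) for p in latest)

# Illustrative per-partition offsets for one consumer group.
latest    = {0: 50_000, 1: 48_000, 2: 52_000}
committed = {0: 49_500, 1: 41_000, 2: 51_900}
assert lag(latest, committed) == 7_600       # partition 1 is the laggard

# Alert on the trend, not the absolute number: monotonically growing lag
# means the consumer is losing ground even if today's value looks small.
samples = [7_600, 9_100, 11_400]             # lag sampled a minute apart
growing = all(b > a for a, b in zip(samples, samples[1:]))
assert growing                                # this should page someone
```

Per-partition lag matters as much as the total: one saturated partition (a hot key) can hide behind a healthy-looking aggregate.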

84.6 Delivery semantics

Kafka offers three delivery semantics; the honest version of each is:

At-most-once: records might be lost, but never duplicated. Achieved by committing offsets before processing. If the consumer crashes after committing but before processing, the record is skipped. Rarely used for telemetry because you don’t want to lose events.

At-least-once: records are never lost, but might be duplicated. Achieved by committing offsets after processing. If the consumer crashes after processing but before committing, the next run reprocesses. This is the default for telemetry and is correct as long as the consumer is idempotent (downstream dedup, upsert semantics, event_id-based dedup).

Exactly-once: records are processed exactly once. Kafka Transactions plus the Idempotent Producer get you “exactly once between a Kafka topic and another Kafka topic (or a transactional sink) within the same Kafka cluster.” The moment an external system is involved — your Postgres billing table, your S3 archive, your Redis cache — exactly-once becomes “effectively-once if you code it carefully,” which means at-least-once plus end-to-end idempotency.

The mistake to avoid: believing Kafka’s exactly-once flag means exactly-once end-to-end. It does not. It means Kafka’s producer-to-broker path is deduplicated and multi-topic writes are atomic. That is useful but it is not the same as “my aggregator cannot over-count.” If your aggregator writes to Postgres and the connection drops mid-transaction, Kafka still sees the write as complete; you might have double-counted in Postgres. Application-level idempotency is mandatory.

For a telemetry firehose, the correct model is:

  • Producer side: at-least-once with the idempotent producer enabled (so same-producer retries don’t duplicate within Kafka).
  • Consumer side: at-least-once, commit offsets after the downstream write is durable.
  • Downstream: idempotent via event_id (Chapter 83) or natural upsert keys.

This is simpler and more reliable than fighting exactly-once.
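The at-least-once-plus-idempotency model reduces to a dedup guard in the consumer. A minimal sketch — the event shape, the in-memory set standing in for a durable dedup store, and the billing counter are all illustrative:

```python
# Sketch: at-least-once consumption made safe by event_id dedup downstream.
# A crash between "process" and "commit offset" causes a redelivery; the
# dedup guard makes the redelivery a no-op. All names are illustrative,
# and a real system would keep seen_event_ids in a durable store.
seen_event_ids = set()
billing_total = 0

def process(event: dict) -> None:
    global billing_total
    if event["event_id"] in seen_event_ids:   # idempotency guard
        return                                # duplicate delivery: no-op
    billing_total += event["tokens"]
    seen_event_ids.add(event["event_id"])

events = [
    {"event_id": "e1", "tokens": 100},
    {"event_id": "e2", "tokens": 250},
]
for e in events:
    process(e)
process(events[1])           # redelivery after a crash-before-commit
assert billing_total == 350  # counted once despite the duplicate delivery
```

Without the guard, the same replay double-counts e2 — which is exactly the over-billing failure mode that Kafka’s exactly-once flag does not prevent for external sinks.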

84.7 Dead-letter queues and poison pills

A “poison pill” is a record that a consumer cannot process — malformed schema, invalid data, a bug that panics on it. If the consumer just fails and retries forever, the partition is stuck: it can never advance past the bad record, and every record behind it starves.

stateDiagram-v2
  [*] --> Processing: poll record
  Processing --> Commit: success
  Commit --> [*]
  Processing --> TransientRetry: network / rate-limit error
  TransientRetry --> Processing: backoff + retry
  TransientRetry --> DLQ: max retries exhausted
  Processing --> DLQ: permanent error (schema / parse)
  DLQ --> [*]: publish to dlq topic + advance offset

The dead-letter queue (DLQ) pattern is the standard fix. When a consumer hits a record it cannot process after N retries, it publishes the record (with error metadata) to a separate DLQ topic and advances past it. The main pipeline continues; the bad record is saved for human investigation. The DLQ is not a discard bin — it is an action queue: every DLQ message should trigger an alert so the team investigates before the backlog grows.

import time

MAX_RETRIES = 5

def consume(record):
    for attempt in range(MAX_RETRIES):
        try:
            process(record)
            return
        except TransientError:
            time.sleep(2 ** attempt)                 # exponential backoff, then retry
        except PermanentError as e:
            publish_to_dlq(record, error=str(e))     # schema/parse errors: straight to DLQ
            return
    # If we exhausted retries on transient errors, treat the record as poison
    publish_to_dlq(record, error="max retries exhausted")

Distinguishing transient from permanent errors is the hard part. Network timeouts, downstream rate limits, temporary DB unavailability — transient, worth retrying. Schema validation failures, missing required fields, serialization errors — permanent, go straight to DLQ.

The DLQ itself needs to be processed. A human reviews it (for low-volume bugs) or a separate tool categorizes failures and either fixes and replays, escalates, or archives. DLQs that nobody reads are a smell — it means real errors are piling up invisibly.
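The “fix and replay” step can itself be a small tool. A sketch, under the assumption that DLQ records carry their original payload plus error metadata (the record shape and function names here are illustrative):

```python
def replay_dlq(dlq_records: list, handler) -> tuple:
    """Re-attempt DLQ records after a fix has shipped.
    Returns (replayed_count, still_failing) so the still-broken
    records can be escalated or archived rather than silently dropped."""
    replayed, still_failing = 0, []
    for rec in dlq_records:
        try:
            handler(rec["payload"])
            replayed += 1
        except Exception as exc:              # still broken: keep with fresh error info
            rec["last_error"] = str(exc)
            still_failing.append(rec)
    return replayed, still_failing

# Illustrative DLQ contents: one record a shipped fix now handles, one still bad.
dlq = [{"payload": {"tokens": 10}}, {"payload": {"tokens": None}}]
ok, bad = replay_dlq(dlq, lambda p: p["tokens"] + 0)   # fails on tokens=None
assert ok == 1 and len(bad) == 1
```

The point of returning still_failing explicitly is the smell called out above: a DLQ workflow should never end in silence.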

For high-volume telemetry, the DLQ is also a useful monitoring signal. Sudden DLQ growth almost always means either a producer bug (someone shipped a bad schema), a consumer bug, or an incident upstream.

84.8 Schema evolution — Avro, Protobuf, JSON Schema

Telemetry events are serialized with some schema. Over time the schema changes — fields are added, defaults updated, enums extended. The challenge is that producers and consumers deploy at different times, so at any moment you have producers on schema v4 and consumers on schema v3 (or vice versa) all talking to the same topics. Schema evolution is the discipline of evolving types so old and new code stay compatible.

Avro is the historical Kafka default. Schemas are JSON documents; records are serialized as compact binary. Avro requires the writer’s schema at deserialization time, so Confluent’s Schema Registry is the standard setup: producers register schemas, get back an id, and prefix each record with it (a magic byte, the 4-byte schema id, then the Avro bytes). Consumers fetch the schema by id and deserialize.

Avro’s forward/backward compatibility rules:

  • Add a field → compatible only if the new field has a default value.
  • Remove a field → compatible only if the removed field had a default.
  • Rename a field → use aliases.
  • Change a type → usually broken unless it’s a documented promotion (int → long is ok; string → int is not).

Protobuf is increasingly popular for Kafka telemetry, especially shops that already use Protobuf for gRPC. Confluent’s Schema Registry supports Protobuf in the same way as Avro. Protobuf’s compatibility rules:

  • Add a field → compatible, as long as you don’t reuse field numbers.
  • Remove a field → compatible, mark the field number as reserved.
  • Never reuse a field number; never change a field’s type.
  • Wire format is smaller and faster to parse than Avro for most workloads.

JSON Schema is the third option, and is increasingly used when the telemetry consumers are polyglot and ease of human reading matters. JSON Schema with the Schema Registry gives you validation but not compact serialization — the records are still JSON bytes, with all the overhead. For low-volume, internal, human-readable streams, it’s fine. For high-volume, it burns bandwidth.

The rule is: pick one format and stick with it; mixing formats in the same topic is a footgun. The Schema Registry enforces compatibility: when a producer registers a new schema version, the registry checks it against the compatibility level for the topic (BACKWARD, FORWARD, FULL, NONE) and rejects incompatible changes. This is the one tool that makes schema evolution manageable — without it, you rely on discipline, and discipline fails.
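What a BACKWARD check actually tests can be sketched in miniature. This toy models a schema as a dict of field name to {"type", "default"} and checks only the add/remove/type rules listed above — a real registry does far more (promotions, aliases, unions, transitive checks):

```python
def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Toy BACKWARD check: can a consumer on the new schema read data
    written with the old one? Models only the add-field-needs-default
    and no-type-change rules; real registries cover much more."""
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            return False   # added field with no default: old data is unreadable
        if name in old_fields and old_fields[name]["type"] != spec["type"]:
            return False   # type change: unreadable without a promotion rule
    return True

# Illustrative schema versions for a usage event.
v3 = {"user_id": {"type": "string"}, "tokens": {"type": "long"}}
v4_ok  = {**v3, "region": {"type": "string", "default": "us"}}
v4_bad = {**v3, "region": {"type": "string"}}   # new field, no default

assert backward_compatible(v3, v4_ok)
assert not backward_compatible(v3, v4_bad)
```

Registering v4_bad against a BACKWARD-configured topic is exactly the change a real registry rejects at produce-registration time, before any consumer breaks.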

For the metering pipeline (Chapter 83), Protobuf with the Schema Registry is the modern choice. Small wire size, good tooling, strong compatibility discipline.

84.9 Operational realities

A few realities that only show up in production:

Brokers fill up. Disks fill faster than you expect because retention is generous and compression is imperfect. Alert on broker disk utilization at 70% and rebalance before you hit 85%.

Rebalancing is expensive. A rolling broker restart in a 12-broker cluster triggers partition leader elections and can pause consumer groups for seconds at a time. For latency-sensitive consumers, schedule rolling restarts at off-hours.

Network partitions stall progress. Kafka’s quorum model (under KRaft) requires majority availability to make progress — which prevents split brain, but means a network partition that isolates the leader from the followers will stall writes until healed. For cross-region telemetry, local clusters plus cross-region replication (MirrorMaker 2 or Confluent Replicator) are more reliable than a stretched cluster.

Consumer-side slowness kills throughput. If the consumer’s processing time per record grows, lag grows. Profile the hot path. Batch writes to downstream systems (Postgres, ClickHouse) to amortize fixed costs. Parallelize within a partition using worker threads (while keeping offsets ordered).

Offsets sometimes reset. A consumer bug or misconfiguration can cause auto.offset.reset=earliest to replay an entire topic. Protect against this with downstream idempotency (again) and with alerts that fire when offsets go backwards.

Schema Registry becomes load-bearing. Every producer and consumer depends on it to serialize/deserialize. If it goes down, the whole pipeline stalls. Run it in HA mode.

Exactly-once is not worth it for telemetry. We covered this in 84.6, but it bears repeating. Use at-least-once plus end-to-end idempotency; don’t enable Kafka Transactions for telemetry unless you have a specific need.

Compression matters. With LZ4 or Snappy, typical JSON telemetry compresses 3-5x. With default no-compression, your broker network and disk costs are 3-5x higher. Always compress.
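The ratio is easy to verify on synthetic data. The sketch below uses stdlib zlib as a stand-in for LZ4/Snappy (the point is how repetitive JSON telemetry compresses, not the codec), and the event shape is made up:

```python
import json
import zlib

# 1,000 structurally similar telemetry events -- illustrative shape only.
events = [
    {"event_id": f"e{i}", "tenant": f"t{i % 20}", "op": "chat.completion",
     "tokens_in": 512, "tokens_out": 128, "status": "ok"}
    for i in range(1_000)
]
raw = json.dumps(events).encode()
compressed = zlib.compress(raw)           # zlib stands in for LZ4/Snappy here
ratio = len(raw) / len(compressed)
assert ratio > 3                          # repetitive JSON compresses well
```

Kafka compresses whole producer batches, not individual records, which is why the ratio on real telemetry tracks this batch-level experiment rather than single-record compression.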

84.10 Why Kafka won

The alternatives exist. Apache Pulsar has a similar model with arguably better multi-tenancy. Amazon Kinesis is the AWS-managed equivalent. Google Cloud Pub/Sub is the GCP-managed equivalent. Redpanda is a Kafka-wire-compatible reimplementation in C++ with lower latency and smaller operational burden. Apache RocketMQ is dominant in China. Yet Kafka has won the mindshare for “telemetry firehose” in most of the world.

The reasons:

  • First-mover advantage and ecosystem. Connectors, clients, stream processors, operators, docs, Stack Overflow answers — Kafka has more of everything.
  • The log abstraction was right. Every alternative is some variation on “log with extras.” The core idea holds up.
  • Operational knowledge is widespread. Hiring someone who has run Kafka is easier than hiring someone who has run Pulsar.
  • Confluent. A well-funded company pushing Kafka forward aggressively, including the Schema Registry, KSQL, Connect, and Cloud.
  • Good enough everywhere. Kafka is not the fastest (Redpanda), not the most elegant (Pulsar), not the most managed (Kinesis). It is solid enough in every dimension that picking it is rarely wrong.

For new projects in 2026, Kafka is the default and Redpanda is the drop-in replacement if you want the same API with less operational overhead. Pulsar is worth considering for specific multi-tenant use cases. Kinesis or PubSub are worth considering if you are all-in on AWS or GCP and don’t want to run a cluster. Every one of these has the same conceptual model — topic, partition, offset, consumer group, delivery semantics — so the vocabulary transfers.

84.11 The mental model and Part VI recap

Eight points to take into Chapter 85 (Part VII: The Data Plane):

  1. A log is the right abstraction for telemetry. Append-only, ordered within partitions, multi-consumer, replayable, time- and space-decoupled.
  2. Topics organize; partitions parallelize; offsets track position. Pick partition count for future parallelism, not just current.
  3. Partition keys decide locality and hot spots. Tenant/user/trace id are common; watch out for celebrity keys.
  4. Consumer groups are the parallelism unit. Partitions to consumers is 1-to-1; extra consumers go idle; rebalances are brief stops.
  5. At-least-once plus end-to-end idempotency is the honest design. Kafka’s “exactly-once” is intra-Kafka only.
  6. Dead-letter queues handle poison pills. Distinguish transient from permanent errors; alert on DLQ growth.
  7. Schema evolution with a registry is mandatory for any non-trivial pipeline. Protobuf or Avro, with BACKWARD compatibility as the default.
  8. Kafka won because the log abstraction is right and the ecosystem is deep. The alternatives have the same concepts — learn the vocabulary once.

Part VI recap — Chapters 73 through 84. The request lifecycle of a modern ML platform runs through a stack of patterns. Chapter 73 put a unified gateway in front of many backends. Chapter 74 split authn from authz and explained why interviewers grill the distinction. Chapter 75 propagated identity across service boundaries so every downstream call knows who it’s for. Chapter 76 limited how fast any caller can push, with token bucket, leaky bucket, sliding window, and GCRA. Chapter 77 applied backpressure and queue theory to keep the system from tipping over under load. Chapter 78 made idempotency keys the contract that allows safe retries and avoids double-billing. Chapter 79 mapped the five execution modes — sync, async, SSE, WebSocket, batch — and showed when each is correct. Chapter 80 went deep on workflow orchestration, with Temporal vs Airflow vs Step Functions vs Cadence and the event-sourced, deterministic-replay model underneath all of them. Chapter 81 layered inter-service trust — mTLS, signed URLs, JWT chains, gateways, NetworkPolicies — into a defense-in-depth architecture. Chapter 82 made the operations service its own system of record, independent of the workflow engine it runs on. Chapter 83 plumbed metering and billing from the sidecar to the invoice, with per-token LLM billing as the specific case. And this chapter, 84, showed that the whole telemetry and billing pipeline rides on Kafka — the log abstraction at the center of the data plane, the firehose that everything else subscribes to. Put together, the twelve chapters are the full picture of how a request flows from client to model and how the byproducts — audit, billing, telemetry — flow the other direction. Part VII now turns to the data plane itself: the storage systems that hold models, objects, documents, time-series, and streaming state.


Read it yourself

  • Jay Kreps, The Log: What every software engineer should know about real-time data’s unifying abstraction (LinkedIn engineering blog, 2013). The founding essay.
  • Neha Narkhede, Gwen Shapira, and Todd Palino, Kafka: The Definitive Guide (2nd edition, O’Reilly). The canonical book.
  • The Confluent Kafka documentation, especially the pages on “Idempotent Producer” and “Transactions.”
  • Tyler Akidau et al., Streaming Systems (O’Reilly) — the broader theory of windowed stream processing that Kafka underpins.
  • The Apache Avro specification and the Confluent Schema Registry documentation.
  • Redpanda’s blog series on Kafka internals — the C++ reimplementation team has excellent writeups of why certain design choices exist.

Practice

  1. A telemetry topic has 12 partitions and is keyed by user_id. One user generates 40% of traffic. Diagnose the problem and propose two fixes.
  2. Design a Kafka-backed usage event pipeline for a per-token LLM product doing 50k requests/sec. How many partitions, what key, what compression, what retention?
  3. A consumer group’s lag is growing at 5k records/sec. The processing rate is 40k/sec and the incoming rate is 45k/sec. What would you check, in order?
  4. Explain why Kafka’s “exactly-once” is not actually exactly-once when the consumer writes to Postgres. Walk through the failure mode.
  5. Write a DLQ handler for a consumer that deserializes protobuf usage events. What counts as a transient error, what counts as permanent?
  6. A schema change renames user_id to caller_id. What’s the safe rollout sequence for producers and consumers? How do you avoid breaking either side?
  7. Stretch: Run a local Kafka (or Redpanda) cluster with 3 brokers. Produce 1M records across 6 partitions with a keyed tenant_id distribution, consume with 3 consumers in a group, and measure per-partition throughput. Induce a rebalance by killing one consumer and observe the pause.