Streaming data: Kafka, Strimzi, the broader log model
"The log is the simplest possible data structure: an ordered, immutable sequence of records. Everything else in distributed systems is either built on a log or trying to be one."
Kafka is the most important data infrastructure project of the last fifteen years. It is also frequently misunderstood, over-used, under-configured, and operated by teams who have never read the docs. For ML systems it shows up everywhere: training event pipelines, inference request logging, telemetry fan-out, feature freshness bridges, change-data-capture from transactional stores, and anywhere you need to decouple producers from consumers without a synchronous RPC tying them together. The reason Kafka keeps appearing is not that it’s trendy — it’s that the abstraction it exposes, the distributed append-only log, is the right primitive for a surprisingly broad class of problems, and everything competing with it is either a more-complicated version of the same idea or a compromise on durability.
This chapter goes from first principles to operational reality: what a Kafka topic actually is, how partitions and consumer groups work, what replication and ISR mean, the progression from at-least-once to exactly-once semantics, how Strimzi runs Kafka on Kubernetes, and when to pick Kafka versus Pulsar, Redpanda, or Kinesis. The goal is to give you enough depth that you can justify the architecture decisions in an interview and not miss any traps in production.
Outline:
- The log abstraction from first principles.
- Topics, partitions, offsets.
- Producers, consumers, consumer groups.
- Replication and the in-sync replica set.
- Delivery semantics: at-most, at-least, exactly-once.
- Transactions and idempotent producers.
- Strimzi: Kafka on Kubernetes.
- Storage, retention, and compaction.
- When Kafka is right (and when it’s not).
- The mental model.
88.1 The log abstraction from first principles
A log is an ordered, immutable, append-only sequence of records. Writers append. Readers scan from a position. Nothing ever gets modified. Nothing ever gets reordered. Time is baked into the structure: the position of a record is its identity.
That shape is what makes a log the right primitive for distributed systems. If two systems agree on the order of records in a log, they agree on the order of events, which is the entire game in a distributed system. A log is how Raft, Paxos, and every real consensus algorithm work under the hood. A log is how databases implement WAL-based replication. A log is how event sourcing, CQRS, and change-data-capture all communicate. The log isn’t just a Kafka thing — it’s the theoretical substrate, and Kafka is just a very good distributed, partitioned implementation of it that you can rent or host.
The properties of a log that make it different from a queue:
- Durable and replayable. A consumer can start at the beginning and re-read everything. Queues typically delete messages after delivery.
- Multi-subscriber without copies. Many independent consumer groups can read the same log at their own rates. Queues fan out by copying messages per consumer.
- Ordered within a partition. The sequence of records in a single partition is fixed and ordered.
- Positional, not acknowledgement-based. A consumer tracks its position (offset) in the log rather than ACKing individual messages. Positional semantics are simpler and more fault-tolerant.
Put differently, a queue is about delivery; a log is about history. A queue answers "what should I process next and then forget?" A log answers "what has happened, in what order, and who has read how far?"
88.2 Topics, partitions, offsets
In Kafka terminology:
- A topic is a named log, logically speaking. Producers write to a topic. Consumers read from a topic.
- A topic is split into partitions. Each partition is an independent, ordered log. Records within a partition are totally ordered by their offset. Records across partitions have no global order.
- An offset is a 64-bit integer that identifies a record’s position within its partition. Offsets are monotonically increasing, start at 0, and never go backward.
- A record is a byte blob with an optional key, a value, a timestamp, and headers.
Partitioning is the scale mechanism. A topic with 100 partitions can absorb 100× the write throughput of a topic with 1 partition, distributed across the cluster. The partition a record lands in is determined by the partition key — typically a hash of the record key. Records with the same key always go to the same partition, which means they’re always ordered relative to each other. This is how you get “per-user ordering” without global ordering: partition by user id.
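The key-to-partition mapping can be sketched in a few lines of Python. This is a simplified model (Kafka's default partitioner actually applies murmur2 to the key bytes, and uses a sticky partitioner for keyless records), but the invariant it shows is the real one: equal keys always land in the same partition.

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Simplified stand-in for Kafka's default partitioner.

    Kafka really uses murmur2 over the key bytes; crc32 is used here
    only to keep the sketch dependency-free. The invariant is identical:
    the same key always maps to the same partition.
    """
    return (zlib.crc32(key) & 0x7FFFFFFF) % num_partitions

# All events for one user hash to one partition, so they stay ordered.
p1 = partition_for(b"user-42", 12)
p2 = partition_for(b"user-42", 12)
assert p1 == p2
```

The corollary also falls out of the sketch: the mapping depends on the partition count, so adding partitions to a keyed topic changes where existing keys route and breaks per-key ordering across the boundary.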
The choice of partition count is a capacity planning decision. A partition is the unit of parallelism for consumers. You cannot have more consumers per consumer group than partitions in the topic. You can, however, have fewer: multiple partitions per consumer. A rough heuristic: partitions = max expected consumer count × 2-4. Kafka supports thousands of partitions per topic, but each partition has per-partition overhead (memory, file descriptors, leadership management), so “millions of partitions” is bad.
Records in a partition are stored as segment files on disk. Each segment covers a range of offsets, and old segments are deleted as retention expires. Segment files are append-only: producers append records, writes land in the OS page cache and are flushed to disk asynchronously, and sequential reads served straight from the page cache are a large part of why Kafka is fast.
88.3 Producers, consumers, consumer groups
Producers write records. The producer API accepts a (topic, key, value) tuple, routes it to a partition (via the key hash or an explicit partitioner), and sends it to the broker that leads that partition. Producers buffer records in memory and send them in batches, which is critical for throughput — an unbatched producer is 10-100× slower than a batched one. The batch size is tunable via batch.size and linger.ms.
The producer API surfaces three key config knobs that every engineer gets wrong at least once:
- acks: how many in-sync replicas must acknowledge a write before the producer considers it successful. acks=0 means fire-and-forget (fastest, least durable). acks=1 means the leader has it (fast, can lose data on leader failure). acks=all means all in-sync replicas have it (slowest, most durable). The right default is acks=all.
- enable.idempotence: ensures that retries don't produce duplicates within a single producer session. Always true in production.
- compression.type: none, gzip, snappy, lz4, or zstd. zstd or lz4 are typical. Compression happens at the batch level, and compressed batches are stored compressed on disk, which saves storage and network bandwidth dramatically.
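Put together, a durability-first producer configuration looks like the sketch below. The keys are the standard Kafka client config names (the dict shape matches clients such as confluent-kafka); the bootstrap address and the exact batch numbers are placeholder values for illustration, not recommendations.

```python
# Production-leaning producer settings (a sketch; tune for your workload).
producer_config = {
    "bootstrap.servers": "kafka:9092",  # placeholder address
    "acks": "all",                # wait for all in-sync replicas
    "enable.idempotence": True,   # broker dedupes retried batches
    "compression.type": "zstd",   # compression applied per batch
    "linger.ms": 5,               # wait up to 5 ms to fill a batch
    "batch.size": 131072,         # 128 KiB batch buffer per partition
}
```

With a real client this dict would be passed to the producer constructor; the point here is that durability (acks, idempotence) and throughput (compression, batching) are configured independently.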
Consumers read records. A consumer polls for new records from an assigned set of partitions and processes them. The consumer tracks its offset — the position in each partition it has read up to — and periodically commits the offset back to Kafka (to a special internal topic called __consumer_offsets). On restart, the consumer resumes from the last committed offset.
Consumer groups are how Kafka implements parallelism and load balancing. A consumer group is a set of consumers that share the work of reading a topic. Each partition is assigned to exactly one consumer in the group at any time. If you have 12 partitions and 4 consumers in a group, each consumer gets 3 partitions. If a consumer dies, its partitions are reassigned to the survivors. If a new consumer joins, partitions are rebalanced. This rebalance is orchestrated by the group coordinator, a broker that holds the group’s metadata.
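The assignment invariant is easy to model. The sketch below uses simple modulo round-robin; Kafka's real assignors (range, round-robin, sticky, cooperative-sticky) distribute differently, but all preserve the rule that each partition has exactly one owner within a group.

```python
def assign(partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Toy partition assignment: every partition gets exactly one owner."""
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for p in range(partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

# 12 partitions over 4 consumers: 3 each.
a = assign(12, ["c0", "c1", "c2", "c3"])
assert all(len(parts) == 3 for parts in a.values())

# 7 partitions over 10 consumers: 3 consumers sit idle.
b = assign(7, [f"c{i}" for i in range(10)])
assert sum(len(parts) == 0 for parts in b.values()) == 3
```

The second case is why over-provisioning consumers beyond the partition count buys nothing: the extras own no partitions and do no work.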
Multiple consumer groups can read the same topic independently. Each group has its own committed offsets. Group A can be processing live data while Group B is replaying history from 3 days ago. This is the killer feature of a log versus a queue: multiple independent readers.
Rebalances are operationally painful. During a rebalance, no partition is assigned, so no consumer in the group is making progress. Cooperative rebalancing (since Kafka 2.4) makes this better by reassigning only the partitions that need to move, but rebalances still happen and still cause latency spikes. Tune session.timeout.ms and heartbeat.interval.ms to balance detection speed versus spurious rebalances.
88.4 Replication and the in-sync replica set
Every partition has a configurable replication factor (RF). For RF=3, each partition has three copies: one leader and two followers. The leader handles all reads and writes for the partition. Followers fetch records from the leader and catch up asynchronously.
The in-sync replica set (ISR) is the set of replicas that are “caught up” with the leader, where “caught up” means the follower has fetched all records up to some recent point and is not lagging by more than replica.lag.time.max.ms. Only ISR replicas are eligible to be elected leader if the current leader fails.
The ISR mechanism is how Kafka trades off availability and durability. Two knobs control the tradeoff:
- min.insync.replicas: the minimum number of ISR replicas required for a write with acks=all to succeed. If the ISR shrinks below this, writes fail. For RF=3, min.insync.replicas=2 is the standard setting — it tolerates one replica failure and refuses writes when two have failed (because you'd be down to a single point of failure for the data).
- unclean.leader.election.enable: if true, a replica outside the ISR can be elected leader when no ISR replica is available. This preserves availability at the cost of data loss (the elected replica may be missing the most recent writes). Default is false. Flip it to true only if you accept data loss and must have availability.
With acks=all and min.insync.replicas=2 on RF=3, Kafka offers strong durability: a successfully acknowledged write is on at least two replicas and cannot be lost as long as one of them survives. This is the production-standard configuration.
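The acceptance rule for writes reduces to a single comparison, sketched here to make the failure modes concrete:

```python
def write_accepted(isr_size: int, min_insync_replicas: int = 2) -> bool:
    """An acks=all write succeeds iff enough in-sync replicas remain."""
    return isr_size >= min_insync_replicas

# RF=3, min.insync.replicas=2: survives the loss of one replica...
assert write_accepted(isr_size=3)
assert write_accepted(isr_size=2)
# ...but refuses writes rather than risk a single remaining copy.
assert not write_accepted(isr_size=1)
```

This is the availability-versus-durability trade made explicit: when the ISR shrinks to one, Kafka chooses to fail writes instead of accepting data that exists on only one disk.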
Leader election happens via the controller, one broker that acts as the cluster's metadata manager. In older Kafka, the controller coordinated through ZooKeeper. KIP-500 replaced ZooKeeper with Kafka's own internal consensus protocol, KRaft (Kafka Raft): introduced as early access in Kafka 2.8, production-ready since 3.3, and mandatory in Kafka 4.x, where ZooKeeper support is removed entirely. If you are standing up a new Kafka cluster in 2026, it is KRaft.
88.5 Delivery semantics: at-most, at-least, exactly-once
The three classic delivery guarantees in messaging systems, and where Kafka falls on each:
At-most-once. Each record is delivered zero or one times. Duplicates impossible, losses possible. Kafka provides this trivially: consumer reads, commits the offset, then processes. If processing fails, the offset is already committed and the record is lost. Not useful for most ML workloads.
At-least-once. Each record is delivered one or more times. Losses impossible, duplicates possible. Kafka provides this by default: consumer reads, processes, then commits the offset. If processing succeeds but the commit fails, the record is reprocessed on restart. This is the default and the right mental model for most consumer code.
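The entire difference between at-most-once and at-least-once is whether the commit happens before or after processing. The toy model below simulates a crash between processing and committing; the names are invented for illustration, not client-library APIs.

```python
def run_consumer(records, committed_offset, process, crash_before_commit=False):
    """At-least-once: process first, commit after.

    A crash between processing and committing means the record is
    processed again on restart: duplicates are possible, losses are not.
    """
    offset = committed_offset
    for record in records[offset:]:
        process(record)
        if crash_before_commit:
            return offset          # crashed: the commit never happened
        offset += 1                # commit the new position
    return offset

seen = []
# First run crashes after processing "a" but before committing.
committed = run_consumer(["a", "b"], 0, seen.append, crash_before_commit=True)
# Restart resumes from the old committed offset and reprocesses "a".
committed = run_consumer(["a", "b"], committed, seen.append)
assert seen == ["a", "a", "b"]    # a duplicate, but nothing lost
```

Flipping the order (commit, then process) turns the same loop into at-most-once: the crash would lose "a" instead of duplicating it.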
Exactly-once. Each record is delivered exactly once. Kafka provides this via transactions (KIP-98, Kafka 0.11+) combined with idempotent producers. The mechanism:
- The producer is idempotent: each (producer id, sequence number) is unique, and the broker deduplicates retries.
- Producers can begin a transaction spanning multiple sends across multiple partitions.
- Consumers can be configured to see only committed transactional writes (isolation.level=read_committed).
- A consumer-then-producer pattern (the "read-process-write" pipeline) can commit its offset and its output writes in a single transaction, so either both happen or neither does.
Exactly-once in Kafka is real and well-engineered, but it has two important caveats:
- It only works within Kafka. If your “process” step writes to a Postgres database or calls an external API, the transaction does not span those. You still have to deal with duplicates on the external side. This is why idempotent consumers matter regardless.
- It has a latency cost. Transactions involve extra coordinator round trips and a two-phase commit. Throughput with transactions is typically 15-30% lower than without.
For most ML workloads, at-least-once with idempotent consumers is the right design. Use a unique request id to dedupe on the consumer side, and accept that Kafka’s exactly-once is useful only for internal stream-processing pipelines, not end-to-end.
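An idempotent consumer for this design is little more than a dedup check keyed on the request id. A minimal sketch, assuming the deduplication state fits in memory; a production version would keep the seen-set in Redis or the sink database with a TTL.

```python
class IdempotentHandler:
    """Drops redundant deliveries by request id, making at-least-once
    delivery safe for non-idempotent side effects."""

    def __init__(self, side_effect):
        self.side_effect = side_effect
        self.seen = set()          # in production: Redis / DB with a TTL

    def handle(self, request_id: str, payload) -> bool:
        if request_id in self.seen:
            return False           # duplicate delivery: skip the side effect
        self.side_effect(payload)
        self.seen.add(request_id)
        return True

calls = []
h = IdempotentHandler(calls.append)
assert h.handle("req-1", "x") is True
assert h.handle("req-1", "x") is False   # redelivery is a no-op
assert calls == ["x"]
```

Note the remaining gap even here: if the process crashes between the side effect and the `seen.add`, the redelivered record fires the side effect again, which is why the dedup state belongs in the same store as the side effect when true once-only behavior matters.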
88.6 Transactions and idempotent producers
A deeper look, because this is interview catnip.
An idempotent producer is enabled by enable.idempotence=true. The broker assigns the producer a Producer ID (PID) on first connect. Each batch the producer sends includes the PID and a per-partition monotonic sequence number. If the broker has already seen a batch with the same PID and sequence, it ACKs without storing. This eliminates duplicates caused by retries inside a single producer session.
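The broker-side mechanism is simple to model: remember the highest sequence number seen per producer id, and acknowledge without appending anything at or below it. A toy sketch (real brokers track this per partition and keep a small window of recent batch sequences):

```python
class PartitionLog:
    """Models broker-side idempotent-producer dedup for one partition."""

    def __init__(self):
        self.records = []
        self.last_seq = {}         # producer id -> highest sequence appended

    def append(self, pid: int, seq: int, value) -> str:
        if self.last_seq.get(pid, -1) >= seq:
            return "duplicate"     # retry of a batch we already have: ACK only
        self.records.append(value)
        self.last_seq[pid] = seq
        return "appended"

log = PartitionLog()
assert log.append(pid=7, seq=0, value="a") == "appended"
assert log.append(pid=7, seq=0, value="a") == "duplicate"   # network retry
assert log.append(pid=7, seq=1, value="b") == "appended"
assert log.records == ["a", "b"]
```

The "single producer session" caveat is visible in the model: the dedup state is keyed by PID, so a restarted producer that gets a fresh PID could re-append old data unless it uses a transactional.id to recover its identity.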
Idempotence alone doesn’t give you cross-partition atomicity. For that you need transactions, enabled by setting a transactional.id on the producer. The producer calls:
producer.initTransactions();   // registers the transactional.id with the coordinator
producer.beginTransaction();
producer.send(record1);
producer.send(record2);
producer.sendOffsetsToTransaction(consumedOffsets, consumerGroupId);
producer.commitTransaction();  // records and offsets become visible atomically
The commitTransaction call writes a transaction marker to each involved partition, atomically making the records visible to read_committed consumers. If anything fails, abortTransaction writes abort markers instead and the records are filtered out.
The sendOffsetsToTransaction call is the magic: it lets you atomically commit consumer offsets as part of a producer transaction. This is how you get end-to-end exactly-once for a Kafka-to-Kafka pipeline. The read offset and the write records commit together. Either the entire step happens, or none of it does.
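The atomicity this buys can be modeled in a few lines: output records are staged, and the committed offset advances only when the staged output is published, so a failure rolls back both. A toy model of the pattern, not the client API:

```python
def read_process_write(input_records, committed_offset, output_log, transform,
                       fail_mid_transaction=False):
    """Models transactional read-process-write: the new offset and the
    output records become visible together, or not at all."""
    staged, offset = [], committed_offset
    for record in input_records[offset:]:
        staged.append(transform(record))
        offset += 1
    if fail_mid_transaction:
        return committed_offset    # abort: staged output is discarded
    output_log.extend(staged)      # commit: output and offset move together
    return offset

out = []
# A failed attempt leaves neither the offset nor the output advanced.
off = read_process_write(["a", "b"], 0, out, str.upper, fail_mid_transaction=True)
assert (off, out) == (0, [])
# The retry processes everything exactly once from the reader's view.
off = read_process_write(["a", "b"], off, out, str.upper)
assert (off, out) == (2, ["A", "B"])
```

Because the abort discards the staged output, a downstream read_committed consumer never sees partial results: each input record contributes to the output exactly once.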
For ML feature pipelines that transform an input stream and produce an output stream, this is genuinely useful. For inference request logging, it’s overkill — you just want at-least-once with a dedup key.
88.7 Strimzi: Kafka on Kubernetes
Strimzi is the Kubernetes operator for running Kafka. It is CNCF-incubating, actively maintained, and the de facto way to run Kafka on K8s in 2026. If you are adding Kafka to a new ML platform, Strimzi is the answer.
The Strimzi model:
- Kafka CRD: you declare a Kafka resource describing the cluster (broker count, storage, version, config). The operator reconciles the actual StatefulSets and services to match.
- KafkaTopic CRD: you declare topics as K8s resources. The operator creates them in Kafka and keeps them in sync.
- KafkaUser CRD: you declare users and their ACLs. The operator manages SASL/SCRAM or mTLS credentials as Kubernetes Secrets.
- KafkaConnect CRD: managed Kafka Connect clusters for running connectors (source, sink, transform).
- KafkaMirrorMaker2 CRD: managed cross-cluster replication.
A minimal cluster declaration:
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: events
  namespace: kafka
spec:
  kafka:
    version: 3.7.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    config:
      default.replication.factor: 3
      min.insync.replicas: 2
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
    storage:
      type: persistent-claim
      size: 500Gi
      class: gp3
  entityOperator:
    topicOperator: {}
    userOperator: {}
Strimzi generates a StatefulSet with three broker pods, each with a 500 GB PVC. It handles upgrades, scaling, TLS cert rotation, broker configuration, and user management. It supports KRaft mode (KafkaNodePools in newer versions) and ZooKeeper mode (deprecated).
The operational patterns to know:
- Rolling updates. Broker upgrades require rolling each broker one at a time, waiting for the new broker to catch up before moving to the next. Strimzi handles this automatically.
- Rack awareness. Configure spec.kafka.rack.topologyKey=topology.kubernetes.io/zone so that Strimzi spreads brokers across AZs and Kafka's rack-aware replica assignment ensures replicas are on different AZs.
- Storage class choice. Kafka is very sensitive to disk throughput. On EKS, prefer gp3 with provisioned IOPS (say, 10,000 IOPS and 1000 MB/s throughput per broker) over default gp2.
- Resource limits. Brokers should have a tuned JVM heap (typically 4-8 GB), and the rest of the pod memory should go to the page cache. Do not set memory limits so tight that the page cache is starved.
Strimzi plus KRaft gives you a self-managed Kafka on Kubernetes with declarative configuration, automated ops, and no ZooKeeper. It is the serious option for teams that want to own their streaming infrastructure.
88.8 Storage, retention, and compaction
Kafka retention policies control how long records live:
- Time-based retention: records older than retention.ms are deleted. Typical values range from 1 day (for high-volume logs) to 7-30 days (for streams you want to replay).
- Size-based retention: records beyond retention.bytes per partition are deleted. Useful as a safety valve.
- Both: either limit triggers deletion. The safer option for predictable storage cost.
Deletion happens at the segment level, not the record level: a segment is deleted when all its records are past retention. This is why segment size matters — smaller segments mean finer-grained retention but more files; larger segments mean less granular deletion but less overhead.
Log compaction is a different retention mode. Instead of deleting old records by time, compacted topics keep the latest record per key. All older records for the same key are garbage-collected. The result is a topic that acts like a changelog of the current state: if you replay it from the beginning, you get the current value of every key, nothing more.
Compaction is the foundation of Kafka’s use as a state store. The offsets topic (__consumer_offsets) is compacted — it keeps the latest committed offset per (group, topic, partition). Kafka Streams materializes state stores as compacted topics. Debezium CDC streams are typically compacted so they can be replayed as “current state of the source DB.” For ML, a feature store’s hot path can be backed by a compacted topic that holds the current feature values for each entity.
The tradeoff of compaction is that records in a compacted topic can be arbitrarily old — there is no retention by time. A key that is rarely updated stays in the topic forever. This is the intended behavior for a state store but unexpected for an ops team that thinks of Kafka topics as time-bounded. Set min.cleanable.dirty.ratio and monitor compaction latency.
Compaction and time-based retention can be combined (cleanup.policy=compact,delete) so you get both.
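Replaying a compacted topic is equivalent to folding it into a map of latest value per key, which a few lines make concrete:

```python
def compact(log):
    """Keep only the latest record per key, as log compaction does.

    `log` is a sequence of (key, value) pairs in offset order;
    later records for the same key win.
    """
    latest = {}
    for key, value in log:
        latest[key] = value
    return list(latest.items())

changelog = [("user-1", "v1"), ("user-2", "v1"), ("user-1", "v2")]
# Replaying the compacted topic yields the current value of every key.
assert compact(changelog) == [("user-1", "v2"), ("user-2", "v1")]
```

This is the changelog-as-state idea in miniature: the compacted topic is a durable, replayable materialization of the dict, which is exactly how Kafka Streams state stores and the __consumer_offsets topic use it.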
88.9 When Kafka is right (and when it’s not)
Kafka is the right choice when:
- You need durable, replayable, multi-subscriber event streams.
- You have multiple consumers reading the same data for different purposes (real-time dashboards, batch ETL, CDC sinks, ML feature pipelines).
- You need high write throughput (100k+ messages/sec) with strong ordering per key.
- You need the data itself to be the integration point between services, not a synchronous RPC.
- You want a back-pressure-friendly buffer between fast producers and slow consumers.
Kafka is not the right choice when:
- You need single-message delivery with ack, not positional offsets. Use SQS or RabbitMQ.
- You need extremely low latency per message (sub-millisecond). Kafka’s batching adds latency.
- You need a queue with priority, delayed delivery, or per-message TTL. These are not Kafka features. Use RabbitMQ, or Temporal for workflow use cases.
- Your message volume is low (< 1000 msg/sec) and you don’t need replay. Kafka’s operational cost overwhelms the value.
- You need transactional consistency between the log and an external store. Kafka transactions don’t help; you need the outbox pattern.
Alternatives to Kafka worth knowing:
Apache Pulsar. Log-based messaging from Yahoo, now an Apache project. Two-layer architecture with brokers stateless and BookKeeper as the storage layer. Supports tiered storage (older segments on S3). Multi-tenant by design. More operationally complex than Kafka but cleaner for multi-tenant deployments.
Redpanda. A Kafka-compatible rewrite in C++ with the Seastar framework. Single binary, no JVM, no ZooKeeper, claimed lower latency and better throughput per core. API-compatible with Kafka, so clients work unchanged. Good choice when you want Kafka semantics with less operational overhead. Licensed under BSL / SSPL hybrid.
Amazon Kinesis. AWS's managed streaming service. Simpler than running Kafka. Throughput is provisioned per shard; shards can be split/merged for scaling. Lower maximum throughput than Kafka, bounded retention (configurable up to 365 days), and more expensive at scale. The right choice if you're on AWS, your volume is modest, and you don't want to run Kafka yourself.
Google Pub/Sub. GCP’s managed pub/sub. At-least-once delivery, no ordering by default (ordering keys are opt-in), horizontal scale, consumer acks. A queue, not a log. Good for event fan-out, not for replayable history.
For ML platforms, the typical decision is Kafka (via Strimzi if self-hosted) for the main event bus, plus a managed alternative for workloads where operational overhead is prohibitive. Redpanda is the modern challenger and worth a serious look for new deployments.
88.10 The mental model
Eight points to take into Chapter 89:
- A log is an ordered, immutable, append-only sequence of records. It is the right primitive for distributed systems.
- Topics are split into partitions. Partitions are the unit of ordering, parallelism, and scale. Records with the same key go to the same partition.
- Consumer groups are how multiple consumers share work. Each partition is owned by exactly one consumer in the group. Multiple groups can read the same topic independently.
- Replication uses an in-sync replica set (ISR). acks=all + min.insync.replicas=2 on RF=3 is the production-standard durability setting.
- Delivery semantics: at-least-once is the default and right choice; exactly-once is real but only inside Kafka, and it has a throughput cost.
- KRaft replaces ZooKeeper. New clusters should be KRaft-only.
- Strimzi is the Kubernetes operator for Kafka. Declarative, robust, and production-ready.
- Log compaction makes a topic into a key-latest-value state store, which is the foundation of Kafka-as-database patterns.
In Chapter 89 the focus shifts from the durable streaming layer to the volatile caching layer: Redis and the classic cache patterns.
Read it yourself
- Jay Kreps, The Log: What every software engineer should know about real-time data’s unifying abstraction (2013). The essay that shaped how the industry thinks about logs.
- Neha Narkhede, Gwen Shapira, Todd Palino, Kafka: The Definitive Guide, 2nd edition. The standard Kafka reference.
- Apache Kafka documentation, especially the sections on producer semantics, consumer groups, and KRaft.
- Strimzi documentation, including the KafkaNodePools and KRaft migration guides.
- Martin Kleppmann, Designing Data-Intensive Applications, Chapter 11 on stream processing.
- The KIP-98 (transactions) and KIP-500 (KRaft) proposals on the Kafka wiki.
Practice
- Why is partition count a capacity planning decision? What goes wrong with too few? With too many?
- Compute the minimum durability guarantee for acks=all, min.insync.replicas=2, RF=3. How many brokers can fail without losing data?
- A consumer group has 10 consumers reading a topic with 7 partitions. What happens? Is adding an 11th consumer useful?
- Explain the read-process-write exactly-once pattern step by step, including what sendOffsetsToTransaction does.
- When is a compacted topic the right choice, and when is it wrong? Give three examples of each.
- You’re migrating from ZooKeeper-based Kafka to KRaft. What’s the high-level procedure and what are the risks?
- Stretch: Stand up a 3-broker Strimzi cluster in a local kind cluster (with KRaft), create a topic with RF=3 and min.insync.replicas=2, write a Python producer and consumer, and verify that killing one broker does not cause data loss. Measure throughput with and without enable.idempotence.