StaffSignal
Technology Guide

Apache Kafka

Distributed event streaming platform for high-throughput, fault-tolerant pub/sub and stream processing. Core to every messaging and pipeline question.

Design with Apache Kafka — Staff-Level Technology Guide

The 60-Second Pitch

Apache Kafka is a distributed event streaming platform — a partitioned, replicated, append-only log that handles millions of events per second with single-digit millisecond latency. In system design interviews, Kafka is the default choice for durable messaging, event sourcing, stream processing, change data capture, and async service decoupling. When you need messages that survive broker failures, that can be replayed from any point in time, and that maintain ordering within a partition, Kafka is the answer.

The Staff-level insight: Kafka is not a message queue — it is a distributed commit log. Messages are not "consumed" and removed; they are read at offsets and retained by policy. This means Kafka supports multiple independent consumers reading the same data at different speeds, replay from any historical offset, and both real-time streaming and batch processing from a single source of truth. The tradeoff: Kafka is operationally complex, requires careful partition planning upfront, and ordering is only guaranteed within a single partition — not across a topic.


Architecture & Internals

Brokers, Topics, and Partitions

A Kafka cluster is a set of brokers — servers that store and serve data. Data is organized into topics (logical categories — user-events, order-updates, payment-completed), and each topic is divided into partitions. A partition is a totally ordered, immutable, append-only log. Every message written to a partition gets a monotonically increasing offset (0, 1, 2, ...) that serves as its unique identifier within that partition.

The partition is the fundamental unit of both parallelism and ordering. Within a single partition, messages are strictly ordered — message at offset 42 was written before message at offset 43, always. Across partitions, there is no ordering guarantee. This is the single most important Kafka concept for interviews: if you need messages for entity X to be processed in order, all messages for entity X must go to the same partition. The mechanism for this is key-based partitioning.
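Key-based partitioning can be sketched in a few lines. Kafka's default partitioner actually uses a murmur2 hash; md5 below is an illustrative stand-in, and the key names are hypothetical:

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition deterministically.
    Kafka's default partitioner uses murmur2; md5 is a stand-in here."""
    digest = int.from_bytes(hashlib.md5(key.encode()).digest(), "big")
    return digest % num_partitions

# All events for the same entity land on the same partition,
# so per-entity ordering is preserved within that partition.
events = [("user:123", "login"), ("user:456", "click"), ("user:123", "logout")]
placements = [partition_for(k, 6) for k, _ in events]
```

Because the mapping is a pure function of the key, `user:123`'s login and logout always share a partition, while different users may be spread across the topic.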

Each partition has exactly one leader broker that handles all reads and writes, and zero or more follower brokers that replicate the data. Clients interact only with the leader. If the leader fails, one of the in-sync followers is elected as the new leader by the controller. In this leader-follower model, partition count determines the maximum consumer parallelism, while the replication factor determines how many brokers can fail simultaneously without data loss.

A typical production cluster runs 3-12 brokers with replication factor 3, meaning every partition has its data on 3 brokers. With acks=all and min.insync.replicas=2, a write succeeds only when at least 2 of the 3 replicas have acknowledged — tolerating exactly 1 broker failure without data loss. This configuration is the Staff default for any system where message loss is unacceptable: payment events, order state changes, audit logs.
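As a sketch, the loss-intolerant default above maps to a handful of settings. Property names follow the Java client's conventions; note that replication factor is a topic-creation parameter rather than a dynamic config:

```python
# Hedged sketch of the "Staff default" durability configuration.
# Producer-side settings (Java-client property names):
producer_config = {
    "acks": "all",               # wait for all in-sync replicas to acknowledge
    "enable.idempotence": True,  # broker-side deduplication of retries
    "retries": 5,
    "retry.backoff.ms": 100,
}

# Topic-level settings, supplied when the topic is created:
topic_config = {
    "replication.factor": 3,     # data lives on 3 brokers
    "min.insync.replicas": 2,    # writes fail if fewer than 2 replicas are in sync
}
```

With this combination, a write succeeds only after 2 of 3 replicas have it, tolerating one broker failure with no message loss.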


Replication and ISR

Every partition maintains an In-Sync Replica (ISR) set — the set of replicas that are fully caught up with the leader within a configurable lag threshold (replica.lag.time.max.ms, default 30 seconds). When a follower falls behind — due to network congestion, disk I/O saturation, or GC pauses — it is removed from the ISR. When it catches up, it is added back. The ISR is the set of replicas that are safe targets for leader election.

The durability guarantee depends on the acks producer setting and min.insync.replicas broker setting. acks=0 means fire-and-forget (fastest, no durability). acks=1 means the leader acknowledged (fast, survives follower failures but not leader failure before replication). acks=all means all ISR members acknowledged (slowest, survives any failure except total ISR loss). With min.insync.replicas=2 and acks=all, writes fail if fewer than 2 replicas are in sync — trading availability for durability.

The controller (historically a ZooKeeper-dependent component, now KRaft-based in modern Kafka) manages broker liveness, partition leadership, and topic metadata. KRaft (Kafka Raft, GA since Kafka 3.3) replaces ZooKeeper with an internal Raft-based consensus protocol, eliminating the ZooKeeper dependency. In interviews, mention KRaft to signal awareness of modern Kafka — but the ISR mechanics are unchanged regardless of the metadata management layer.

Log Compaction vs. Time-Based Retention

Kafka supports two retention strategies that serve fundamentally different use cases. Time-based retention (default) deletes messages older than retention.ms (typically 7-30 days). This is correct for event streams where historical events lose relevance over time — user click streams, metrics, logs.

Log compaction retains the latest value for each key indefinitely, deleting only superseded values. A compacted topic with key user:123 and three historical values (v1, v2, v3) will eventually compact to just v3. This makes compacted topics function as a distributed, replicated key-value store — a materialized view of the latest state. Use cases: configuration propagation, CDC changelogs, latest-state snapshots for stream processing state stores.
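The compaction behavior described above can be simulated as a fold over the log, keeping only the latest value per key, with a null value acting as a tombstone (a minimal sketch, not Kafka's actual segment-based cleaner):

```python
from typing import Optional

def compact(log: list[tuple[str, Optional[str]]]) -> dict[str, str]:
    """Simulate log compaction: retain only the latest value per key.
    A None value is a tombstone that deletes the key."""
    state: dict[str, str] = {}
    for key, value in log:
        if value is None:
            state.pop(key, None)   # tombstone removes the key entirely
        else:
            state[key] = value     # later values supersede earlier ones
    return state

log = [("user:123", "v1"), ("user:456", "a1"), ("user:123", "v2"),
       ("user:123", "v3"), ("user:456", None)]
# compact(log) -> {"user:123": "v3"}  (user:456 deleted by its tombstone)
```

The result is exactly the "materialized view of latest state" the text describes: a key-value snapshot derivable from the log at any time.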

The Staff insight is knowing when to use each. Time-based retention for events ("user clicked button at T"). Compaction for state ("user's current address is X"). A common interview pattern: use a regular topic for the event stream and a compacted topic for the derived materialized view.


Core Concepts for Interviews

Producers

The producer decides which partition a message goes to. Three strategies: key-based (default when a key is set — a murmur2 hash of the key modulo partition count, guaranteeing ordering per key), sticky/round-robin (default when the key is null — since Kafka 2.4 the sticky partitioner fills a batch for one partition before rotating to the next, distributing load evenly with no ordering guarantee), and custom partitioners (application logic — route by region, priority, or tenant).

Idempotent producer (enable.idempotence=true, default since Kafka 3.0): the broker deduplicates retries using a producer ID and sequence number. Without idempotency, a network timeout after a successful write causes the producer to retry, creating a duplicate message. With idempotency, the broker detects the duplicate and discards it. This is free — no performance penalty — and should always be enabled.
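The broker-side deduplication can be sketched as a last-sequence-number check per producer ID (a simplification of the real protocol, which tracks sequence numbers per partition and allows a window of in-flight batches):

```python
def broker_append(log: list, last_seq: dict, producer_id: int, seq: int, msg: str) -> bool:
    """Sketch of idempotent-producer dedup: the broker tracks the last
    sequence number seen per producer and discards retries it already has."""
    if seq <= last_seq.get(producer_id, -1):
        return False               # duplicate retry: acknowledged, not re-appended
    last_seq[producer_id] = seq
    log.append(msg)
    return True

log, last_seq = [], {}
broker_append(log, last_seq, producer_id=7, seq=0, msg="order-created")
broker_append(log, last_seq, producer_id=7, seq=0, msg="order-created")  # timeout retry
broker_append(log, last_seq, producer_id=7, seq=1, msg="order-paid")
# log == ["order-created", "order-paid"]: the retry was silently deduplicated
```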

acks levels control the durability-latency tradeoff:

| acks | Latency | Durability | When to Use |
|------|---------|------------|-------------|
| 0 | Lowest (~0.5ms) | None — fire and forget | Metrics, logs where loss is acceptable |
| 1 | Low (~2ms) | Survives follower failure | Low-latency event ingestion |
| all | Medium (~5ms) | Survives any single failure | Payment events, order state, audit logs |

Consumers and Offset Management

A consumer reads from one or more partitions at tracked offsets. Consumer groups provide the scaling model: each partition within a topic is assigned to exactly one consumer within the group. If you have 6 partitions and 3 consumers, each consumer reads 2 partitions. If you add a 4th consumer, one consumer gives up a partition. If you have 7 consumers for 6 partitions, one consumer sits idle. This is the fundamental scaling constraint: maximum consumer parallelism = partition count.
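The assignment arithmetic above can be sketched with a simple round-robin distribution (a simplification of Kafka's range/sticky assignors, but the parallelism constraint is identical):

```python
def assign(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Round-robin partition assignment within one consumer group.
    Each partition goes to exactly one consumer in the group."""
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 6 partitions, 3 consumers: each consumer reads 2 partitions.
three = assign(list(range(6)), ["c1", "c2", "c3"])
# 6 partitions, 7 consumers: one consumer is left idle.
seven = assign(list(range(6)), [f"c{i}" for i in range(7)])
```

However consumers are named, the ceiling is the same: no group can usefully contain more consumers than the topic has partitions.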

Offset management determines delivery semantics. enable.auto.commit=true (default) commits offsets periodically (every 5 seconds by default). If the consumer crashes after processing a message but before the next auto-commit, the message is reprocessed — at-least-once delivery. Note that auto-commit can also skip messages when offsets are committed before processing actually completes, so loss-sensitive consumers should commit manually. Manual commit (commitSync() or commitAsync() after processing) gives finer control but is still at-least-once unless combined with transactional processing.

The critical mental model: Kafka guarantees at-least-once delivery natively. Exactly-once requires the consumer to be idempotent (can safely process the same message twice) or to use Kafka's transactional API (consume-process-produce atomically within Kafka). For downstream systems outside Kafka (databases, APIs), exactly-once requires idempotency at the consumer level — there is no way around this.

Consumer Rebalancing

When consumers join or leave a group, Kafka reassigns partitions — a rebalance. The traditional eager rebalance stops all consumers, revokes all partition assignments, and reassigns from scratch. During this window (seconds to minutes), no messages are processed. This is the "rebalance storm" problem: a single consumer restart causes a cluster-wide processing pause.

Cooperative rebalancing (available since Kafka 2.4 via the CooperativeStickyAssignor, and included in the default assignor list since 3.0) is incremental: only the affected partitions are revoked and reassigned. Other consumers continue processing their existing partitions without interruption. The result: rebalances affect milliseconds of processing on affected partitions rather than seconds of cluster-wide pause.

Static group membership (group.instance.id) prevents rebalancing on transient disconnects. Without static membership, a consumer that temporarily loses connectivity triggers a rebalance; when it reconnects, it triggers another rebalance. With static membership, the broker holds the consumer's partition assignment for session.timeout.ms, eliminating rebalance storms from rolling deployments and brief network hiccups.

Exactly-Once Semantics

Exactly-once in Kafka has three layers, each solving a different problem:

  1. Idempotent producer — deduplicates retries from producer to broker. Solves: duplicate messages from network retries. Free, always enable.
  2. Transactional producer — atomic writes across multiple partitions and topics. Solves: partial failure during produce-to-multiple-topics. Cost: ~10% throughput overhead. Use for consume-transform-produce pipelines.
  3. Exactly-once consumer — read + process + commit offset as an atomic transaction. Only works when the output goes back to Kafka (Kafka Streams). For external sinks (database, API), you need consumer-side idempotency.

The honest interview answer: "Kafka provides exactly-once within the Kafka ecosystem — producer to broker to Kafka Streams consumer. For external systems, we need idempotent consumers with deduplication keys. True end-to-end exactly-once across heterogeneous systems is an application-level concern, not a Kafka feature."
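The "idempotent consumer with deduplication keys" approach for external sinks can be sketched as an upsert keyed on a stable event ID (an in-memory stand-in for a real database with a unique constraint on event_id):

```python
def idempotent_sink(messages: list[tuple[str, str]], store: dict, seen: set) -> None:
    """At-least-once delivery into an external store made effectively
    exactly-once by deduplicating on a stable event_id."""
    for event_id, payload in messages:
        if event_id in seen:
            continue               # duplicate delivery: skip, state unchanged
        store[event_id] = payload  # idempotent upsert keyed by event_id
        seen.add(event_id)

store, seen = {}, set()
batch = [("e1", "created"), ("e2", "paid")]
idempotent_sink(batch, store, seen)
idempotent_sink(batch, store, seen)   # redelivery after a crash or rebalance
# store == {"e1": "created", "e2": "paid"}: the duplicate batch had no effect
```

In production the `seen` set is typically the sink itself (INSERT ... ON CONFLICT DO NOTHING, or a unique index), so the dedup state survives consumer restarts.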

Compacted Topics

A compacted topic retains the latest value for each message key indefinitely. Old values for the same key are garbage-collected in the background. A message with a null value (a "tombstone") marks a key for deletion.

Use cases that come up in interviews:

  • CDC changelog: database changes streamed as key=row_id, value=row_state. The compacted topic is a snapshot of the current database state.
  • Configuration propagation: key=service_name, value=config_json. Every consumer gets the latest configuration on startup, with updates streamed in real time.
  • Stream processing state store: Kafka Streams uses compacted topics as the backing store for KTable state. When a stream processor restarts, it rebuilds its state by replaying the compacted changelog.

The materialized view pattern: produce events to a regular topic (event log), consume them into a compacted topic (latest state). The event log is the source of truth; the compacted topic is the queryable derived view.


Data Modeling with Kafka

Topic Design

Two schools: one topic per event type (user-created, user-updated, user-deleted) or one topic per domain aggregate (user-events containing all user lifecycle events). The Staff recommendation: one topic per domain aggregate. Reasons: (1) ordering across event types for the same entity is preserved (all events for user-123 in one partition), (2) fewer topics to manage (100 aggregates × 3 event types = 300 topics vs 100), (3) consumers that need the full lifecycle subscribe to one topic.

The exception: high-throughput event types that would dominate a shared topic. Click-stream events at 100K/sec should not share a topic with user-profile-updated at 10/sec — the click events would consume all partition bandwidth and delay profile updates.

Key Selection

The message key determines partition assignment and therefore ordering scope. Choose the key based on the entity that requires ordered processing:

  • Order events: key = order_id — all events for one order processed in sequence
  • User events: key = user_id — all events for one user processed in sequence
  • Payment events: key = payment_id — idempotent processing per payment
  • Multi-tenant: key = tenant_id — but beware hot partitions if one tenant generates 90% of traffic

A null key means round-robin distribution — maximum throughput, zero ordering. Use null keys only for events where ordering is irrelevant: metrics, logs, analytics events where independent processing is safe.

Schema Evolution

Production Kafka deployments use Apache Avro with Confluent Schema Registry (or equivalent). The schema registry stores versioned schemas for each topic and enforces compatibility rules:

| Compatibility | Rule | Use Case |
|---------------|------|----------|
| Backward | New schema can read old data | Default — safe for consumer upgrades |
| Forward | Old schema can read new data | Producer upgrades before consumers |
| Full | Both backward and forward | Maximum safety, most restrictive |
| None | No compatibility check | Breaking changes (requires coordinated deployment) |

The Staff default is backward compatibility: add optional fields with defaults, never remove or rename required fields. This allows consumers to upgrade independently of producers. Schema Registry is a single point of failure — in an interview, mention this and note that it should be deployed with HA (multi-node, persistent storage).
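The backward-compatibility rule — new fields must be optional with defaults — can be sketched as Avro-style resolution, where the reader schema fills in defaults for fields the old writer never produced (the field names are hypothetical):

```python
def read_with_schema(record: dict, reader_defaults: dict) -> dict:
    """Backward-compatibility sketch: the new reader schema supplies
    defaults for fields absent from data written with the old schema."""
    return {**reader_defaults, **record}

# A record written before "email" was added to the schema:
old_record = {"user_id": 123, "name": "Ada"}
# New reader schema: "email" is optional with a default of None.
reader_defaults = {"user_id": None, "name": "", "email": None}
# -> {"user_id": 123, "name": "Ada", "email": None}
upgraded = read_with_schema(old_record, reader_defaults)
```

Removing or renaming a required field breaks this resolution, which is exactly why backward mode forbids it.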


Scaling Kafka

Partition Scaling

You can add partitions to a topic at any time — but you can never reduce them. Adding partitions changes the key-to-partition hash mapping. If you had key user:123 on partition 2 with 6 partitions, it may land on partition 7 after expanding to 12 partitions. This breaks ordering guarantees for in-flight consumers: events for user:123 are now split across two partitions with no cross-partition ordering.
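The remapping problem is easy to demonstrate: hash the same keys modulo the old and new partition counts and see how many move (md5 as an illustrative stand-in for Kafka's murmur2 partitioner):

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    # md5 as a deterministic stand-in for Kafka's murmur2 partitioner
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big") % num_partitions

keys = [f"user:{i}" for i in range(100)]
moved = [k for k in keys if partition_for(k, 6) != partition_for(k, 12)]
# Roughly half the keys land on a different partition after expanding
# 6 -> 12, splitting each moved key's event history across two partitions.
```

Any consumer that relied on "all events for user X are on one partition" now sees that guarantee silently broken for the moved keys.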

The safe expansion pattern: (1) create a new topic with the desired partition count, (2) dual-write to both topics during migration, (3) drain consumers from the old topic, (4) switch consumers to the new topic, (5) delete the old topic. This is operationally complex — which is why partition count is a planning-phase decision, not a runtime adjustment.

Rule of thumb for initial partition count: max(expected_throughput_MB/s ÷ 10MB/s_per_partition, expected_max_consumers). For a topic expecting 60MB/s throughput with up to 12 consumers, start with 12 partitions. Overprovisioning partitions has diminishing costs (metadata overhead, file handles) but underprovisioning requires the painful migration above.
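The rule of thumb above is just a max of two constraints, which is worth writing down because interviewers often ask for the arithmetic (the 10MB/s per-partition figure is the text's planning assumption, not a hard limit):

```python
import math

def initial_partition_count(throughput_mb_s: float, max_consumers: int,
                            per_partition_mb_s: float = 10.0) -> int:
    """Rule of thumb from the text: enough partitions for projected
    throughput AND for the widest consumer group, whichever is larger."""
    throughput_driven = math.ceil(throughput_mb_s / per_partition_mb_s)
    return max(throughput_driven, max_consumers)

# 60 MB/s with up to 12 consumers -> 12 partitions (consumer-bound)
count = initial_partition_count(60, 12)
# 200 MB/s with up to 12 consumers -> 20 partitions (throughput-bound)
```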

Consumer Group Scaling

Maximum consumer parallelism equals partition count. With 6 partitions, adding a 7th consumer to a group wastes resources — it sits idle. To increase parallelism beyond current partition count, you must add partitions (with the migration caveats above).

Within the partition limit, scaling is seamless: add a consumer to the group, cooperative rebalancing assigns it partitions from existing consumers, processing parallelism increases. Remove a consumer, its partitions redistribute. No data loss, no message duplication (assuming proper offset management).

Consumer scaling anti-pattern: one consumer group per consumer instance. This makes every instance receive every message — fan-out, not load distribution. Use a single consumer group for parallel processing, multiple groups for independent consumers (e.g., one group for real-time processing, another for analytics).

When Kafka Becomes the Bottleneck

Kafka bottlenecks are almost always one of four things:

  1. Disk I/O: Kafka is disk-bound, not CPU-bound. Sequential write throughput determines maximum ingestion rate. SSDs with high sequential write speed are the standard production choice. Monitor disk_write_bytes and io_wait.
  2. Network saturation: Replication factor 3 means every byte written is transmitted 3x over the network. A 100MB/s ingestion rate generates 300MB/s of internal network traffic. Monitor per-broker network utilization.
  3. Too many partitions per broker: Each partition is a sequence of log segments, and each segment carries a data file, an offset index, and a time index. A broker hosting 10,000 partitions can hold tens of thousands of open file handles, and the controller takes longer to perform leader elections. Rule of thumb: keep partitions per broker under 4,000.
  4. Consumer lag: Not a Kafka bottleneck per se, but the symptom of consumers falling behind. Root cause is usually slow processing logic, not Kafka throughput limits.
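The replication amplification in point 2 is worth quantifying, since it is the arithmetic that catches people out when sizing network capacity (a sketch that ignores consumer fetch traffic, which adds more on top):

```python
def cluster_write_bandwidth_mb_s(ingest_mb_s: float, replication_factor: int) -> float:
    """Every ingested byte is written over the network replication_factor
    times (once to the leader, RF-1 times to followers)."""
    return ingest_mb_s * replication_factor

# 100 MB/s of producer traffic with RF=3 -> 300 MB/s of write traffic
# across the cluster, of which 200 MB/s is broker-to-broker replication.
total = cluster_write_bandwidth_mb_s(100, 3)
replication_only = total - 100
```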

Failure Modes

Consumer Lag

Symptom: Derived data is stale. Search index is minutes behind the primary database. Notification delivery is delayed. Dashboard metrics are outdated.

Detection: kafka_consumer_lag per consumer group per partition. Alert when lag exceeds N seconds of wall-clock time (not just offset count — message rate varies). Burrow or Kafka's built-in kafka-consumer-groups.sh --describe for lag monitoring.

Business Impact: Users see stale data. Search results miss recent items. Notifications arrive late. In payment systems, delayed processing means orders stuck in "processing" state.

Staff Response: L5 says "the consumer is slow, we need more instances." L6 says: "First, identify whether lag is from slow processing or insufficient parallelism. If processing time per message exceeds max.poll.interval.ms, the consumer is too slow — optimize the processing logic, batch database writes, or move to async processing with back-pressure. If processing is fast but parallelism is insufficient, scale consumers up to partition count. If at partition limit, consider a new topic with more partitions."

Rebalancing Storm

Symptom: Repeated consumer group rebalances every few minutes. Processing throughput drops to zero during each rebalance (eager) or stutters (cooperative). Consumer lag grows steadily.

Detection: kafka_consumer_rebalances_total counter increasing rapidly. Consumer logs showing frequent onPartitionsRevoked / onPartitionsAssigned callbacks.

Business Impact: Intermittent processing pauses cause downstream delays. During each rebalance, committed offsets may cause message skipping or reprocessing depending on timing.

Staff Response: L5 says "something keeps disconnecting." L6 says: "Switch to cooperative rebalancing if still using eager. Enable static group membership with group.instance.id to prevent rebalances during rolling deployments. Increase session.timeout.ms to 45s to tolerate transient network hiccups. Check for consumers with processing time exceeding max.poll.interval.ms — that's the most common trigger for involuntary rebalances."

Broker Failure

Symptom: Partitions led by the failed broker become temporarily unavailable. Producers receive NOT_LEADER_FOR_PARTITION errors. Consumers lose their connection to those partitions.

Detection: Under-replicated partitions (kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions). Controller event log showing leader elections. Broker liveness monitoring (health checks, JMX metrics).

Business Impact: Brief write unavailability for affected partitions (seconds during leader election). With acks=all and min.insync.replicas=2, writes to partitions with only 1 remaining ISR member will fail until the replacement catches up.

Staff Response: L5 says "Kafka has replication, it handles failures." L6 says: "Leader election is automatic but not instant — 5-15 seconds typically. During this window, producers with acks=all get errors and should retry with backoff. The real risk is cascading failure: if the replacement broker is already overloaded and falls behind on replication, we lose the ISR safety net. Monitor under-replicated partitions as a Sev-1 alert — it means your durability guarantee is degraded."

Disk Full

Symptom: Broker stops accepting writes and takes the affected log directory offline (halting entirely if it is the only one). Producers see request timeouts or not-enough-replicas errors. New log segments cannot be created.

Detection: disk_utilization per broker — alert at 75%, page at 85%. Monitor kafka.log:type=Log,name=Size for unexpected growth.

Business Impact: Complete write outage for all partitions on the affected broker. If multiple brokers fill simultaneously (common with uniform retention policies), the entire cluster goes read-only.

Staff Response: "Immediate mitigation: reduce retention on non-critical topics (kafka-configs.sh --alter --entity-type topics --entity-name logs --add-config retention.ms=86400000). Enable compression if disabled. For structural fix: add brokers and redistribute partitions, or move to tiered storage (Kafka 3.6+) which offloads old segments to S3."

Poison Pill Messages

Symptom: Consumer crashes on a specific message, restarts, reads the same message, crashes again — an infinite crash loop. The partition is effectively stuck; consumer lag grows indefinitely.

Detection: Consumer restart frequency spikes. Consumer lag for a specific partition grows while other partitions are healthy. Application error logs show the same exception repeatedly.

Business Impact: One bad message blocks processing of all subsequent messages on that partition. Downstream systems see stale data for entities whose events are on the stuck partition.

Staff Response: L5 says "fix the bug in the consumer." L6 says: "Implement a dead letter queue (DLQ) pattern. After N processing failures for the same message, publish it to a DLQ topic and advance the offset. This unblocks the partition while preserving the problematic message for investigation. The DLQ should have its own alerting — a growing DLQ is a data quality signal."
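The DLQ pattern described above can be sketched as retry-then-divert, with the key property that the consumer's position always advances past the poison message (an in-memory sketch; in production the DLQ is itself a Kafka topic and the offset commit happens after the divert):

```python
def consume_with_dlq(messages: list[str], process, max_retries: int = 3):
    """Retry each message up to max_retries, then divert it to a
    dead-letter list and advance past it, keeping the partition unblocked."""
    processed, dlq = [], []
    for msg in messages:
        for attempt in range(max_retries):
            try:
                process(msg)
                processed.append(msg)
                break
            except Exception:
                if attempt == max_retries - 1:
                    dlq.append(msg)   # park the poison pill for investigation

    return processed, dlq

def process(msg: str) -> None:
    if msg == "poison":
        raise ValueError("cannot deserialize")

processed, dlq = consume_with_dlq(["ok-1", "poison", "ok-2"], process)
# processed == ["ok-1", "ok-2"], dlq == ["poison"]: the partition keeps moving
```

Without the divert, the consumer would crash-loop on "poison" and "ok-2" would never be processed.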


When to Use vs. Alternatives

| Use Case | Kafka | SQS | Redis Streams | RabbitMQ |
|----------|-------|-----|---------------|----------|
| Durable messaging (event sourcing) | ✅ Best | ⚠️ No replay | ❌ Limited durability | ⚠️ No replay |
| High throughput (>100K msg/s) | ✅ Best | ⚠️ Scales but costly | ⚠️ Memory-bound | ❌ Not designed for |
| Strict ordering | ✅ Per partition | ❌ Best-effort FIFO | ✅ Per stream | ⚠️ Per queue |
| Replay / reprocessing | ✅ Native | ❌ Not supported | ⚠️ Limited | ❌ Not supported |
| Simple task queue | ❌ Overkill | ✅ Best | ✅ Good | ✅ Good |
| Fan-out to N consumers | ✅ Consumer groups | ⚠️ SNS+SQS pattern | ⚠️ Consumer groups | ✅ Exchange routing |
| Ops complexity | ❌ High | ✅ Managed | ✅ Part of Redis | ⚠️ Moderate |
| Latency (<10ms end-to-end) | ⚠️ 5-50ms typical | ⚠️ 10-100ms | ✅ Sub-ms | ✅ Sub-ms |

Decision rule: Use Kafka when you need (1) replay, (2) high throughput, (3) multiple independent consumers, or (4) event sourcing. Use SQS when you need a simple managed queue with no replay requirement. Use Redis Streams when messages are ephemeral and you already have Redis. Use RabbitMQ when you need complex routing patterns (exchange-based) and sub-ms latency matters more than throughput.


Deployment Topologies


Multi-AZ is the Staff default: spreading replicas across 3 availability zones with min.insync.replicas=2 ensures the cluster survives any single AZ failure. The cost is cross-AZ replication latency (~1-2ms) and cross-AZ data transfer charges.

Multi-region uses MirrorMaker 2 (or Confluent Cluster Linking) to replicate topics between clusters. This is async replication — RPO is the replication lag between regions (typically seconds to minutes). Use for disaster recovery (active-passive) or geo-local consumption (active-active with conflict resolution).


Staff-Level Operational Concerns

Under-replicated partitions is the most important Kafka metric. A non-zero value means at least one partition has fewer in-sync replicas than the configured replication factor. This directly degrades your durability guarantee — if the leader fails while replicas are out of sync, you lose data. Alert immediately, investigate root cause (slow disk, network, GC pauses on follower), and resolve before another failure compounds the risk.

Consumer lag monitoring must be per consumer group, per partition, not just aggregate. Aggregate lag of 1,000 messages hides the reality that 999 of those messages are on one stuck partition while the other 11 partitions are at zero lag. Use Burrow, Confluent Control Center, or custom metrics that report per-partition lag in both offset count and time (seconds behind).

Partition count planning affects the cluster for the lifetime of the topic. Rule of thumb: 6-12 partitions per topic for moderate throughput (<50MB/s), 20-50 for high throughput (50-200MB/s), 100+ only for exceptional cases (>200MB/s with high parallelism requirements). Keep total partitions per broker under 4,000 to avoid metadata overhead and slow leader elections. Monitor kafka.controller:type=KafkaController,name=ActiveControllerCount — if this fluctuates, the controller is unstable.

Retention vs. cost is a business decision. 7-day retention at 100MB/s ingestion is ~60TB of raw data — roughly 180TB of disk across the cluster with replication factor 3. 30-day retention is ~260TB raw (~780TB replicated). Tiered storage (Kafka 3.6+, Confluent Tiered Storage) moves cold segments to object storage (S3), reducing local disk requirements by 80-90% while maintaining full replay capability. This is the Staff answer to "how do you handle long retention without unbounded disk costs?"
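The retention math is a one-liner worth having ready, remembering that replication multiplies the raw retention footprint by the replication factor (decimal TB for simplicity):

```python
def retention_tb(ingest_mb_s: float, days: int, replication_factor: int = 3) -> float:
    """Disk footprint of time-based retention in decimal terabytes:
    ingest rate * seconds retained * replication factor."""
    total_mb = ingest_mb_s * 86_400 * days * replication_factor
    return total_mb / 1_000_000

# 100 MB/s: ~60 TB raw over 7 days, ~181 TB on disk with RF=3.
raw_7d = retention_tb(100, 7, replication_factor=1)
replicated_7d = retention_tb(100, 7)
```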

Schema Registry availability: Schema Registry is a single point of failure for producers and consumers that use schema validation. If it is down, producers cannot serialize messages and consumers cannot deserialize them. Deploy Schema Registry with HA (multiple instances behind a load balancer, backed by a Kafka compacted topic for schema storage). Cache schemas in producers and consumers so that transient Registry outages do not block processing of already-known schemas.


Interview Application

Which Playbooks Use Kafka

| Playbook | How Kafka Is Used | Key Pattern |
|----------|-------------------|-------------|
| Message Queue | Primary message broker | Partitioned consumption with consumer groups |
| Stream Processing | Event ingestion and stream source | Kafka Streams or Flink consuming from topics |
| Notification Systems | Event-driven notification dispatch | Topic per notification channel, consumer per delivery type |
| Web Crawling | URL frontier as a Kafka topic | Partitioned by domain for politeness |
| Order Matching | Order event stream for audit trail | Compacted topic for latest order state |
| Feed Generation | Fan-out event distribution | User activity events consumed by feed builder |

How to Introduce Kafka in an Interview

Use this template: "I'd use Kafka here for [specific capability] because [specific reason]. The alternative would be [simpler option] but it lacks [critical feature we need]."

Example: "I'd use Kafka for the order event stream because we need replay capability — when we deploy a new analytics consumer, it needs to reprocess the last 30 days of order events. SQS would be simpler operationally but doesn't support replay, which means we'd need a separate backfill mechanism."

What L5 Says vs. What L6 Says

| Topic | L5 Says | L6 Says |
|-------|---------|---------|
| Choice | "We'll use Kafka for messaging" | "We'll use Kafka because we need multi-consumer replay — three independent services consume the same order stream at different rates. SQS would require three separate queues with duplicated publishing." |
| Ordering | "Kafka preserves message order" | "Kafka preserves order within a partition. We key messages by order_id so all events for one order are sequenced. Cross-order ordering is not required and would limit us to a single partition." |
| Durability | "Kafka is durable" | "With acks=all and min.insync.replicas=2 on a 3-broker cluster, we tolerate one broker failure with zero message loss. The cost is ~3ms additional write latency, acceptable for our order pipeline." |
| Scaling | "We'll add more partitions" | "We start with 12 partitions — enough for our projected 6 consumers with room to double. Adding partitions later breaks key-based ordering, so we overprovision upfront." |
| Failure | "Kafka handles failures automatically" | "Leader election takes 5-15 seconds. During this window, producers with acks=all receive errors and must retry with exponential backoff. We set retries=5 and retry.backoff.ms=100. The dead letter queue handles poison pills that crash consumers." |
| Exactly-once | "Kafka supports exactly-once" | "Kafka provides exactly-once within Kafka-to-Kafka pipelines via the transactional API. For our database sink, we use at-least-once delivery with idempotent upserts keyed on event_id. True end-to-end exactly-once requires idempotency at the consumer." |

Common Interview Mistakes

| Mistake | Why It's Wrong | What to Say Instead |
|---------|----------------|---------------------|
| "Kafka for everything" | Kafka is operationally expensive for simple task queues | "Kafka for event streaming with replay. SQS or Redis for simple task dispatch." |
| "Kafka guarantees message ordering" | Only within a single partition, not across a topic | "Ordering is per-partition. We key by entity ID to guarantee per-entity ordering." |
| "Just add more partitions" | Cannot reduce partitions; adding breaks key routing | "Partition count is planned upfront. 12 partitions with room to grow to 24." |
| "Kafka has exactly-once" | Only within the Kafka ecosystem, not to external sinks | "Exactly-once for Kafka-to-Kafka. Idempotent consumers for external systems." |
| "We'll use Kafka as a database" | Kafka is a log, not a query engine. No indexes, no joins, limited point lookups | "Kafka as the event log, PostgreSQL or Elasticsearch as the queryable derived view." |
| "Consumer lag is not a big deal" | Lag = stale data downstream. Search index 10 min behind = missing results | "Consumer lag SLA: <30s for real-time features, <5m for analytics. Alert on violation." |

Staff Insight