StaffSignal
Technology Guide

Apache Kafka

Distributed event streaming platform for high-throughput, fault-tolerant pub/sub and stream processing. Core to every messaging and pipeline question.

Design with Apache Kafka — Staff-Level Technology Guide

The 60-Second Pitch

Apache Kafka is a distributed event streaming platform — a partitioned, replicated, append-only log that handles millions of events per second with single-digit millisecond latency. In system design interviews, Kafka is the default choice for durable messaging, event sourcing, stream processing, change data capture, and async service decoupling. When you need messages that survive broker failures, that can be replayed from any point in time, and that maintain ordering within a partition, Kafka is the answer.

The Staff-level insight: Kafka is not a message queue — it is a distributed commit log. Messages are not "consumed" and removed; they are read at offsets and retained by policy. This means Kafka supports multiple independent consumers reading the same data at different speeds, replay from any historical offset, and both real-time streaming and batch processing from a single source of truth. The tradeoff: Kafka is operationally complex, requires careful partition planning upfront, and ordering is only guaranteed within a single partition — not across a topic.


Architecture & Internals

Brokers, Topics, and Partitions

A Kafka cluster is a set of brokers — servers that store and serve data. Data is organized into topics (logical categories — user-events, order-updates, payment-completed), and each topic is divided into partitions. A partition is a totally ordered, immutable, append-only log. Every message written to a partition gets a monotonically increasing offset (0, 1, 2, ...) that serves as its unique identifier within that partition.

The partition is the fundamental unit of both parallelism and ordering. Within a single partition, messages are strictly ordered — message at offset 42 was written before message at offset 43, always. Across partitions, there is no ordering guarantee. This is the single most important Kafka concept for interviews: if you need messages for entity X to be processed in order, all messages for entity X must go to the same partition. The mechanism for this is key-based partitioning.
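Key-based partitioning can be sketched in a few lines. Kafka's default partitioner actually uses a murmur2 hash; md5 below is an illustrative stand-in, and the key names are hypothetical:

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition deterministically.
    Kafka's default partitioner uses murmur2; md5 is a stand-in here."""
    digest = int.from_bytes(hashlib.md5(key.encode()).digest(), "big")
    return digest % num_partitions

# All events for the same entity land on the same partition,
# so per-entity ordering is preserved within that partition.
events = [("user:123", "login"), ("user:456", "click"), ("user:123", "logout")]
placements = [partition_for(k, 6) for k, _ in events]
```

Because the mapping is a pure function of the key, `user:123`'s login and logout always share a partition, while different users may be spread across the topic.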

Each partition has exactly one leader broker that handles all reads and writes, and zero or more follower brokers that replicate the data. Clients interact only with the leader. If the leader fails, one of the in-sync followers is elected as the new leader by the controller. In this leader-follower model, partition count determines the maximum consumer parallelism, while the replication factor determines how many brokers can fail simultaneously without data loss.

A typical production cluster runs 3-12 brokers with replication factor 3, meaning every partition has its data on 3 brokers. With acks=all and min.insync.replicas=2, a write succeeds only when at least 2 of the 3 replicas have acknowledged — tolerating exactly 1 broker failure without data loss. This configuration is the Staff default for any system where message loss is unacceptable: payment events, order state changes, audit logs.
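As a sketch, the loss-intolerant default above maps to a handful of settings. Property names follow the Java client's conventions; note that replication factor is a topic-creation parameter rather than a dynamic config:

```python
# Hedged sketch of the "Staff default" durability configuration.
# Producer-side settings (Java-client property names):
producer_config = {
    "acks": "all",               # wait for all in-sync replicas to acknowledge
    "enable.idempotence": True,  # broker-side deduplication of retries
    "retries": 5,
    "retry.backoff.ms": 100,
}

# Topic-level settings, supplied when the topic is created:
topic_config = {
    "replication.factor": 3,     # data lives on 3 brokers
    "min.insync.replicas": 2,    # writes fail if fewer than 2 replicas are in sync
}
```

With this combination, a write succeeds only after 2 of 3 replicas have it, tolerating one broker failure with no message loss.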


Replication and ISR

Every partition maintains an In-Sync Replica (ISR) set — the set of replicas that are fully caught up with the leader within a configurable lag threshold (replica.lag.time.max.ms, default 30 seconds). When a follower falls behind — due to network congestion, disk I/O saturation, or GC pauses — it is removed from the ISR. When it catches up, it is added back. The ISR is the set of replicas that are safe targets for leader election.

The durability guarantee depends on the acks producer setting and min.insync.replicas broker setting. acks=0 means fire-and-forget (fastest, no durability). acks=1 means the leader acknowledged (fast, survives follower failures but not leader failure before replication). acks=all means all ISR members acknowledged (slowest, survives any failure except total ISR loss). With min.insync.replicas=2 and acks=all, writes fail if fewer than 2 replicas are in sync — trading availability for durability.

The controller (historically a ZooKeeper-dependent component, now KRaft-based in modern Kafka) manages broker liveness, partition leadership, and topic metadata. KRaft (Kafka Raft, GA since Kafka 3.3) replaces ZooKeeper with an internal Raft-based consensus protocol, eliminating the ZooKeeper dependency. In interviews, mention KRaft to signal awareness of modern Kafka — but the ISR mechanics are unchanged regardless of the metadata management layer.

Log Compaction vs. Time-Based Retention

Kafka supports two retention strategies that serve fundamentally different use cases. Time-based retention (default) deletes messages older than retention.ms (typically 7-30 days). This is correct for event streams where historical events lose relevance over time — user click streams, metrics, logs.

Log compaction retains the latest value for each key indefinitely, deleting only superseded values. A compacted topic with key user:123 and three historical values (v1, v2, v3) will eventually compact to just v3. This makes compacted topics function as a distributed, replicated key-value store — a materialized view of the latest state. Use cases: configuration propagation, CDC changelogs, latest-state snapshots for stream processing state stores.
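The compaction behavior described above can be simulated as a fold over the log, keeping only the latest value per key, with a null value acting as a tombstone (a minimal sketch, not Kafka's actual segment-based cleaner):

```python
from typing import Optional

def compact(log: list[tuple[str, Optional[str]]]) -> dict[str, str]:
    """Simulate log compaction: retain only the latest value per key.
    A None value is a tombstone that deletes the key."""
    state: dict[str, str] = {}
    for key, value in log:
        if value is None:
            state.pop(key, None)   # tombstone removes the key entirely
        else:
            state[key] = value     # later values supersede earlier ones
    return state

log = [("user:123", "v1"), ("user:456", "a1"), ("user:123", "v2"),
       ("user:123", "v3"), ("user:456", None)]
# compact(log) -> {"user:123": "v3"}  (user:456 deleted by its tombstone)
```

The result is exactly the "materialized view of latest state" the text describes: a key-value snapshot derivable from the log at any time.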

The Staff insight is knowing when to use each. Time-based retention for events ("user clicked button at T"). Compaction for state ("user's current address is X"). A common interview pattern: use a regular topic for the event stream and a compacted topic for the derived materialized view.


Core Concepts for Interviews

Producers

The producer decides which partition a message goes to. Three strategies: key-based (default when a key is set — a murmur2 hash of the key modulo partition count, guaranteeing ordering per key), sticky/round-robin (default when the key is null — since Kafka 2.4 the sticky partitioner fills a batch for one partition before rotating to the next, distributing load evenly with no ordering guarantee), and custom partitioners (application logic — route by region, priority, or tenant).

Idempotent producer (enable.idempotence=true, default since Kafka 3.0): the broker deduplicates retries using a producer ID and sequence number. Without idempotency, a network timeout after a successful write causes the producer to retry, creating a duplicate message. With idempotency, the broker detects the duplicate and discards it. This is free — no performance penalty — and should always be enabled.
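The broker-side deduplication can be sketched as a last-sequence-number check per producer ID (a simplification of the real protocol, which tracks sequence numbers per partition and allows a window of in-flight batches):

```python
def broker_append(log: list, last_seq: dict, producer_id: int, seq: int, msg: str) -> bool:
    """Sketch of idempotent-producer dedup: the broker tracks the last
    sequence number seen per producer and discards retries it already has."""
    if seq <= last_seq.get(producer_id, -1):
        return False               # duplicate retry: acknowledged, not re-appended
    last_seq[producer_id] = seq
    log.append(msg)
    return True

log, last_seq = [], {}
broker_append(log, last_seq, producer_id=7, seq=0, msg="order-created")
broker_append(log, last_seq, producer_id=7, seq=0, msg="order-created")  # timeout retry
broker_append(log, last_seq, producer_id=7, seq=1, msg="order-paid")
# log == ["order-created", "order-paid"]: the retry was silently deduplicated
```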

acks levels control the durability-latency tradeoff:

| acks | Latency | Durability | When to Use |
|------|---------|------------|-------------|
| 0 | Lowest (~0.5ms) | None — fire and forget | Metrics, logs where loss is acceptable |
| 1 | Low (~2ms) | Survives follower failure | Low-latency event ingestion |
| all | Medium (~5ms) | Survives any single failure | Payment events, order state, audit logs |

Consumers and Offset Management

A consumer reads from one or more partitions at tracked offsets. Consumer groups provide the scaling model: each partition within a topic is assigned to exactly one consumer within the group. If you have 6 partitions and 3 consumers, each consumer reads 2 partitions. If you add a 4th consumer, one consumer gives up a partition. If you have 7 consumers for 6 partitions, one consumer sits idle. This is the fundamental scaling constraint: maximum consumer parallelism = partition count.
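The assignment arithmetic above can be sketched with a simple round-robin distribution (a simplification of Kafka's range/sticky assignors, but the parallelism constraint is identical):

```python
def assign(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Round-robin partition assignment within one consumer group.
    Each partition goes to exactly one consumer in the group."""
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 6 partitions, 3 consumers: each consumer reads 2 partitions.
three = assign(list(range(6)), ["c1", "c2", "c3"])
# 6 partitions, 7 consumers: one consumer is left idle.
seven = assign(list(range(6)), [f"c{i}" for i in range(7)])
```

However consumers are named, the ceiling is the same: no group can usefully contain more consumers than the topic has partitions.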

Offset management determines delivery semantics. enable.auto.commit=true (default) commits offsets periodically (every 5 seconds by default). If the consumer crashes after processing a message but before the next auto-commit, the message is reprocessed — at-least-once delivery. Note that auto-commit can also skip messages when offsets are committed before processing actually completes, so loss-sensitive consumers should commit manually. Manual commit (commitSync() or commitAsync() after processing) gives finer control but is still at-least-once unless combined with transactional processing.

The critical mental model: Kafka guarantees at-least-once delivery natively. Exactly-once requires the consumer to be idempotent (can safely process the same message twice) or to use Kafka's transactional API (consume-process-produce atomically within Kafka). For downstream systems outside Kafka (databases, APIs), exactly-once requires idempotency at the consumer level — there is no way around this.

Consumer Rebalancing

When consumers join or leave a group, Kafka reassigns partitions — a rebalance. The traditional eager rebalance stops all consumers, revokes all partition assignments, and reassigns from scratch. During this window (seconds to minutes), no messages are processed. This is the "rebalance storm" problem: a single consumer restart causes a cluster-wide processing pause.

Cooperative rebalancing (available since Kafka 2.4 via the CooperativeStickyAssignor, and included in the default assignor list since 3.0) is incremental: only the affected partitions are revoked and reassigned. Other consumers continue processing their existing partitions without interruption. The result: rebalances affect milliseconds of processing on affected partitions rather than seconds of cluster-wide pause.

Static group membership (group.instance.id) prevents rebalancing on transient disconnects. Without static membership, a consumer that temporarily loses connectivity triggers a rebalance; when it reconnects, it triggers another rebalance. With static membership, the broker holds the consumer's partition assignment for session.timeout.ms, eliminating rebalance storms from rolling deployments and brief network hiccups.

Exactly-Once Semantics

Exactly-once in Kafka has three layers, each solving a different problem:

  1. Idempotent producer — deduplicates retries from producer to broker. Solves: duplicate messages from network retries. Free, always enable.
  2. Transactional producer — atomic writes across multiple partitions and topics. Solves: partial failure during produce-to-multiple-topics. Cost: ~10% throughput overhead. Use for consume-transform-produce pipelines.
  3. Exactly-once consumer — read + process + commit offset as an atomic transaction. Only works when the output goes back to Kafka (Kafka Streams). For external sinks (database, API), you need consumer-side idempotency.

The honest interview answer: "Kafka provides exactly-once within the Kafka ecosystem — producer to broker to Kafka Streams consumer. For external systems, we need idempotent consumers with deduplication keys. True end-to-end exactly-once across heterogeneous systems is an application-level concern, not a Kafka feature."
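The "idempotent consumer with deduplication keys" approach for external sinks can be sketched as an upsert keyed on a stable event ID (an in-memory stand-in for a real database with a unique constraint on event_id):

```python
def idempotent_sink(messages: list[tuple[str, str]], store: dict, seen: set) -> None:
    """At-least-once delivery into an external store made effectively
    exactly-once by deduplicating on a stable event_id."""
    for event_id, payload in messages:
        if event_id in seen:
            continue               # duplicate delivery: skip, state unchanged
        store[event_id] = payload  # idempotent upsert keyed by event_id
        seen.add(event_id)

store, seen = {}, set()
batch = [("e1", "created"), ("e2", "paid")]
idempotent_sink(batch, store, seen)
idempotent_sink(batch, store, seen)   # redelivery after a crash or rebalance
# store == {"e1": "created", "e2": "paid"}: the duplicate batch had no effect
```

In production the `seen` set is typically the sink itself (INSERT ... ON CONFLICT DO NOTHING, or a unique index), so the dedup state survives consumer restarts.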

Compacted Topics

A compacted topic retains the latest value for each message key indefinitely. Old values for the same key are garbage-collected in the background. A message with a null value (a "tombstone") marks a key for deletion.

Use cases that come up in interviews:

  • CDC changelog: database changes streamed as key=row_id, value=row_state. The compacted topic is a snapshot of the current database state.
  • Configuration propagation: key=service_name, value=config_json. Every consumer gets the latest configuration on startup, with updates streamed in real time.
  • Stream processing state store: Kafka Streams uses compacted topics as the backing store for KTable state. When a stream processor restarts, it rebuilds its state by replaying the compacted changelog.

The materialized view pattern: produce events to a regular topic (event log), consume them into a compacted topic (latest state). The event log is the source of truth; the compacted topic is the queryable derived view.


Data Modeling with Kafka

Topic Design

Two schools: one topic per event type (user-created, user-updated, user-deleted) or one topic per domain aggregate (user-events containing all user lifecycle events). The Staff recommendation: one topic per domain aggregate. Reasons: (1) ordering across event types for the same entity is preserved (all events for user-123 in one partition), (2) fewer topics to manage (100 aggregates × 3 event types = 300 topics vs 100), (3) consumers that need the full lifecycle subscribe to one topic.

The exception: high-throughput event types that would dominate a shared topic. Click-stream events at 100K/sec should not share a topic with user-profile-updated at 10/sec — the click events would consume all partition bandwidth and delay profile updates.

Key Selection

The message key determines partition assignment and therefore ordering scope. Choose the key based on the entity that requires ordered processing:

  • Order events: key = order_id — all events for one order processed in sequence
  • User events: key = user_id — all events for one user processed in sequence
  • Payment events: key = payment_id — idempotent processing per payment
  • Multi-tenant: key = tenant_id — but beware hot partitions if one tenant generates 90% of traffic

A null key means round-robin distribution — maximum throughput, zero ordering. Use null keys only for events where ordering is irrelevant: metrics, logs, analytics events where independent processing is safe.

Schema Evolution

Production Kafka deployments use Apache Avro with Confluent Schema Registry (or equivalent). The schema registry stores versioned schemas for each topic and enforces compatibility rules:

| Compatibility | Rule | Use Case |
|---------------|------|----------|
| Backward | New schema can read old data | Default — safe for consumer upgrades |
| Forward | Old schema can read new data | Producer upgrades before consumers |
| Full | Both backward and forward | Maximum safety, most restrictive |
| None | No compatibility check | Breaking changes (requires coordinated deployment) |

The Staff default is backward compatibility: add optional fields with defaults, never remove or rename required fields. This allows consumers to upgrade independently of producers. Schema Registry is a single point of failure — in an interview, mention this and note that it should be deployed with HA (multi-node, persistent storage).
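The backward-compatibility rule — new fields must be optional with defaults — can be sketched as Avro-style resolution, where the reader schema fills in defaults for fields the old writer never produced (the field names are hypothetical):

```python
def read_with_schema(record: dict, reader_defaults: dict) -> dict:
    """Backward-compatibility sketch: the new reader schema supplies
    defaults for fields absent from data written with the old schema."""
    return {**reader_defaults, **record}

# A record written before "email" was added to the schema:
old_record = {"user_id": 123, "name": "Ada"}
# New reader schema: "email" is optional with a default of None.
reader_defaults = {"user_id": None, "name": "", "email": None}
# -> {"user_id": 123, "name": "Ada", "email": None}
upgraded = read_with_schema(old_record, reader_defaults)
```

Removing or renaming a required field breaks this resolution, which is exactly why backward mode forbids it.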


Scaling Kafka

Partition Scaling

You can add partitions to a topic at any time — but you can never reduce them. Adding partitions changes the key-to-partition hash mapping. If you had key user:123 on partition 2 with 6 partitions, it may land on partition 7 after expanding to 12 partitions. This breaks ordering guarantees for in-flight consumers: events for user:123 are now split across two partitions with no cross-partition ordering.
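The remapping problem is easy to demonstrate: hash the same keys modulo the old and new partition counts and see how many move (md5 as an illustrative stand-in for Kafka's murmur2 partitioner):

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    # md5 as a deterministic stand-in for Kafka's murmur2 partitioner
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big") % num_partitions

keys = [f"user:{i}" for i in range(100)]
moved = [k for k in keys if partition_for(k, 6) != partition_for(k, 12)]
# Roughly half the keys land on a different partition after expanding
# 6 -> 12, splitting each moved key's event history across two partitions.
```

Any consumer that relied on "all events for user X are on one partition" now sees that guarantee silently broken for the moved keys.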

The safe expansion pattern: (1) create a new topic with the desired partition count, (2) dual-write to both topics during migration, (3) drain consumers from the old topic, (4) switch consumers to the new topic, (5) delete the old topic. This is operationally complex — which is why partition count is a planning-phase decision, not a runtime adjustment.

Rule of thumb for initial partition count: max(expected_throughput_MB/s ÷ 10MB/s_per_partition, expected_max_consumers). For a topic expecting 60MB/s throughput with up to 12 consumers, start with 12 partitions. Overprovisioning partitions has diminishing costs (metadata overhead, file handles) but underprovisioning requires the painful migration above.
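The rule of thumb above is just a max of two constraints, which is worth writing down because interviewers often ask for the arithmetic (the 10MB/s per-partition figure is the text's planning assumption, not a hard limit):

```python
import math

def initial_partition_count(throughput_mb_s: float, max_consumers: int,
                            per_partition_mb_s: float = 10.0) -> int:
    """Rule of thumb from the text: enough partitions for projected
    throughput AND for the widest consumer group, whichever is larger."""
    throughput_driven = math.ceil(throughput_mb_s / per_partition_mb_s)
    return max(throughput_driven, max_consumers)

# 60 MB/s with up to 12 consumers -> 12 partitions (consumer-bound)
count = initial_partition_count(60, 12)
# 200 MB/s with up to 12 consumers -> 20 partitions (throughput-bound)
```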

Consumer Group Scaling

Maximum consumer parallelism equals partition count. With 6 partitions, adding a 7th consumer to a group wastes resources — it sits idle. To increase parallelism beyond current partition count, you must add partitions (with the migration caveats above).

Within the partition limit, scaling is seamless: add a consumer to the group, cooperative rebalancing assigns it partitions from existing consumers, processing parallelism increases. Remove a consumer, its partitions redistribute. No data loss, no message duplication (assuming proper offset management).

Consumer scaling anti-pattern: one consumer group per consumer instance. This makes every instance receive every message — fan-out, not load distribution. Use a single consumer group for parallel processing, multiple groups for independent consumers (e.g., one group for real-time processing, another for analytics).

When Kafka Becomes the Bottleneck

Kafka bottlenecks are almost always one of four things:

  1. Disk I/O: Kafka is disk-bound, not CPU-bound. Sequential write throughput determines maximum ingestion rate. SSDs with high sequential write speed are the standard production choice. Monitor disk_write_bytes and io_wait.
  2. Network saturation: Replication factor 3 means every byte written is transmitted 3x over the network. A 100MB/s ingestion rate generates 300MB/s of internal network traffic. Monitor per-broker network utilization.
  3. Too many partitions per broker: Each partition is a sequence of log segments, and each segment carries a data file, an offset index, and a time index. A broker hosting 10,000 partitions can hold tens of thousands of open file handles, and the controller takes longer to perform leader elections. Rule of thumb: keep partitions per broker under 4,000.
  4. Consumer lag: Not a Kafka bottleneck per se, but the symptom of consumers falling behind. Root cause is usually slow processing logic, not Kafka throughput limits.
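The replication amplification in point 2 is worth quantifying, since it is the arithmetic that catches people out when sizing network capacity (a sketch that ignores consumer fetch traffic, which adds more on top):

```python
def cluster_write_bandwidth_mb_s(ingest_mb_s: float, replication_factor: int) -> float:
    """Every ingested byte is written over the network replication_factor
    times (once to the leader, RF-1 times to followers)."""
    return ingest_mb_s * replication_factor

# 100 MB/s of producer traffic with RF=3 -> 300 MB/s of write traffic
# across the cluster, of which 200 MB/s is broker-to-broker replication.
total = cluster_write_bandwidth_mb_s(100, 3)
replication_only = total - 100
```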

Failure Modes

Consumer Lag

Symptom: Derived data is stale. Search index is minutes behind the primary database. Notification delivery is delayed. Dashboard metrics are outdated.

Detection: kafka_consumer_lag per consumer group per partition. Alert when lag exceeds N seconds of wall-clock time (not just offset count — message rate varies). Burrow or Kafka's built-in kafka-consumer-groups.sh --describe for lag monitoring.

Business Impact: Users see stale data. Search results miss recent items. Notifications arrive late. In payment systems, delayed processing means orders stuck in "processing" state.

Staff Response: L5 says "the consumer is slow, we need more instances." L6 says: "First, identify whether lag is from slow processing or insufficient parallelism. If processing time per message exceeds max.poll.interval.ms, the consumer is too slow — optimize the processing logic, batch database writes, or move to async processing with back-pressure. If processing is fast but parallelism is insufficient, scale consumers up to partition count. If at partition limit, consider a new topic with more partitions."

Rebalancing Storm

Symptom: Repeated consumer group rebalances every few minutes. Processing throughput drops to zero during each rebalance (eager) or stutters (cooperative). Consumer lag grows steadily.

Detection: kafka_consumer_rebalances_total counter increasing rapidly. Consumer logs showing frequent onPartitionsRevoked / onPartitionsAssigned callbacks.

Business Impact: Intermittent processing pauses cause downstream delays. During each rebalance, committed offsets may cause message skipping or reprocessing depending on timing.

Staff Response: L5 says "something keeps disconnecting." L6 says: "Switch to cooperative rebalancing if still using eager. Enable static group membership with group.instance.id to prevent rebalances during rolling deployments. Increase session.timeout.ms to 45s to tolerate transient network hiccups. Check for consumers with processing time exceeding max.poll.interval.ms — that's the most common trigger for involuntary rebalances."

Broker Failure

Symptom: Partitions led by the failed broker become temporarily unavailable. Producers receive NOT_LEADER_FOR_PARTITION errors. Consumers lose their connection to those partitions.

Detection: Under-replicated partitions (kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions). Controller event log showing leader elections. Broker liveness monitoring (health checks, JMX metrics).

Business Impact: Brief write unavailability for affected partitions (seconds during leader election). With acks=all and min.insync.replicas=2, writes to partitions with only 1 remaining ISR member will fail until the replacement catches up.

Staff Response: L5 says "Kafka has replication, it handles failures." L6 says: "Leader election is automatic but not instant — 5-15 seconds typically. During this window, producers with acks=all get errors and should retry with backoff. The real risk is cascading failure: if the replacement broker is already overloaded and falls behind on replication, we lose the ISR safety net. Monitor under-replicated partitions as a Sev-1 alert — it means your durability guarantee is degraded."

Disk Full

Symptom: Broker stops accepting writes and takes the affected log directory offline (halting entirely if it is the only one). Producers see request timeouts or not-enough-replicas errors. New log segments cannot be created.

Detection: disk_utilization per broker — alert at 75%, page at 85%. Monitor kafka.log:type=Log,name=Size for unexpected growth.

Business Impact: Complete write outage for all partitions on the affected broker. If multiple brokers fill simultaneously (common with uniform retention policies), the entire cluster goes read-only.

Staff Response: "Immediate mitigation: reduce retention on non-critical topics (kafka-configs.sh --alter --entity-type topics --entity-name logs --add-config retention.ms=86400000). Enable compression if disabled. For structural fix: add brokers and redistribute partitions, or move to tiered storage (Kafka 3.6+) which offloads old segments to S3."

Poison Pill Messages

Symptom: Consumer crashes on a specific message, restarts, reads the same message, crashes again — an infinite crash loop. The partition is effectively stuck; consumer lag grows indefinitely.

Detection: Consumer restart frequency spikes. Consumer lag for a specific partition grows while other partitions are healthy. Application error logs show the same exception repeatedly.

Business Impact: One bad message blocks processing of all subsequent messages on that partition. Downstream systems see stale data for entities whose events are on the stuck partition.

Staff Response: L5 says "fix the bug in the consumer." L6 says: "Implement a dead letter queue (DLQ) pattern. After N processing failures for the same message, publish it to a DLQ topic and advance the offset. This unblocks the partition while preserving the problematic message for investigation. The DLQ should have its own alerting — a growing DLQ is a data quality signal."
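The DLQ pattern described above can be sketched as retry-then-divert, with the key property that the consumer's position always advances past the poison message (an in-memory sketch; in production the DLQ is itself a Kafka topic and the offset commit happens after the divert):

```python
def consume_with_dlq(messages: list[str], process, max_retries: int = 3):
    """Retry each message up to max_retries, then divert it to a
    dead-letter list and advance past it, keeping the partition unblocked."""
    processed, dlq = [], []
    for msg in messages:
        for attempt in range(max_retries):
            try:
                process(msg)
                processed.append(msg)
                break
            except Exception:
                if attempt == max_retries - 1:
                    dlq.append(msg)   # park the poison pill for investigation

    return processed, dlq

def process(msg: str) -> None:
    if msg == "poison":
        raise ValueError("cannot deserialize")

processed, dlq = consume_with_dlq(["ok-1", "poison", "ok-2"], process)
# processed == ["ok-1", "ok-2"], dlq == ["poison"]: the partition keeps moving
```

Without the divert, the consumer would crash-loop on "poison" and "ok-2" would never be processed.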


When to Use vs. Alternatives

| Use Case | Kafka | SQS | Redis Streams | RabbitMQ |
|----------|-------|-----|---------------|----------|
| Durable messaging (event sourcing) | ✅ Best | ⚠️ No replay | ❌ Limited durability | ⚠️ No replay |
| High throughput (>100K msg/s) | ✅ Best | ⚠️ Scales but costly | ⚠️ Memory-bound | ❌ Not designed for |
| Strict ordering | ✅ Per partition | ❌ Best-effort FIFO | ✅ Per stream | ⚠️ Per queue |
| Replay / reprocessing | ✅ Native | ❌ Not supported | ⚠️ Limited | ❌ Not supported |
| Simple task queue | ❌ Overkill | ✅ Best | ✅ Good | ✅ Good |
| Fan-out to N consumers | ✅ Consumer groups | ⚠️ SNS+SQS pattern | ⚠️ Consumer groups | ✅ Exchange routing |
| Ops complexity | ❌ High | ✅ Managed | ✅ Part of Redis | ⚠️ Moderate |
| Latency (<10ms end-to-end) | ⚠️ 5-50ms typical | ⚠️ 10-100ms | ✅ Sub-ms | ✅ Sub-ms |

Decision rule: Use Kafka when you need (1) replay, (2) high throughput, (3) multiple independent consumers, or (4) event sourcing. Use SQS when you need a simple managed queue with no replay requirement. Use Redis Streams when messages are ephemeral and you already have Redis. Use RabbitMQ when you need complex routing patterns (exchange-based) and sub-ms latency matters more than throughput.


Deployment Topologies


Multi-AZ is the Staff default: spreading replicas across 3 availability zones with min.insync.replicas=2 ensures the cluster survives any single AZ failure. The cost is cross-AZ replication latency (~1-2ms) and cross-AZ data transfer charges.

Multi-region uses MirrorMaker 2 (or Confluent Cluster Linking) to replicate topics between clusters. This is async replication — RPO is the replication lag between regions (typically seconds to minutes). Use for disaster recovery (active-passive) or geo-local consumption (active-active with conflict resolution).


Staff-Level Operational Concerns

Under-replicated partitions is the most important Kafka metric. A non-zero value means at least one partition has fewer in-sync replicas than the configured replication factor. This directly degrades your durability guarantee — if the leader fails while replicas are out of sync, you lose data. Alert immediately, investigate root cause (slow disk, network, GC pauses on follower), and resolve before another failure compounds the risk.

Consumer lag monitoring must be per consumer group, per partition, not just aggregate. Aggregate lag of 1,000 messages hides the reality that 999 of those messages are on one stuck partition while the other 11 partitions are at zero lag. Use Burrow, Confluent Control Center, or custom metrics that report per-partition lag in both offset count and time (seconds behind).

Partition count planning affects the cluster for the lifetime of the topic. Rule of thumb: 6-12 partitions per topic for moderate throughput (<50MB/s), 20-50 for high throughput (50-200MB/s), 100+ only for exceptional cases (>200MB/s with high parallelism requirements). Keep total partitions per broker under 4,000 to avoid metadata overhead and slow leader elections. Monitor kafka.controller:type=KafkaController,name=ActiveControllerCount — if this fluctuates, the controller is unstable.

Retention vs. cost is a business decision. 7-day retention at 100MB/s ingestion is ~60TB of raw data — roughly 180TB of disk across the cluster with replication factor 3. 30-day retention is ~260TB raw (~780TB replicated). Tiered storage (Kafka 3.6+, Confluent Tiered Storage) moves cold segments to object storage (S3), reducing local disk requirements by 80-90% while maintaining full replay capability. This is the Staff answer to "how do you handle long retention without unbounded disk costs?"
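The retention math is a one-liner worth having ready, remembering that replication multiplies the raw retention footprint by the replication factor (decimal TB for simplicity):

```python
def retention_tb(ingest_mb_s: float, days: int, replication_factor: int = 3) -> float:
    """Disk footprint of time-based retention in decimal terabytes:
    ingest rate * seconds retained * replication factor."""
    total_mb = ingest_mb_s * 86_400 * days * replication_factor
    return total_mb / 1_000_000

# 100 MB/s: ~60 TB raw over 7 days, ~181 TB on disk with RF=3.
raw_7d = retention_tb(100, 7, replication_factor=1)
replicated_7d = retention_tb(100, 7)
```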

Schema Registry availability: Schema Registry is a single point of failure for producers and consumers that use schema validation. If it is down, producers cannot serialize messages and consumers cannot deserialize them. Deploy Schema Registry with HA (multiple instances behind a load balancer, backed by a Kafka compacted topic for schema storage). Cache schemas in producers and consumers so that transient Registry outages do not block processing of already-known schemas.


Interview Application

Which Playbooks Use Kafka

| Playbook | How Kafka Is Used | Key Pattern |
|----------|-------------------|-------------|
| Message Queue | Primary message broker | Partitioned consumption with consumer groups |
| Stream Processing | Event ingestion and stream source | Kafka Streams or Flink consuming from topics |
| Notification Systems | Event-driven notification dispatch | Topic per notification channel, consumer per delivery type |
| Web Crawling | URL frontier as a Kafka topic | Partitioned by domain for politeness |
| Order Matching | Order event stream for audit trail | Compacted topic for latest order state |
| Feed Generation | Fan-out event distribution | User activity events consumed by feed builder |

How to Introduce Kafka in an Interview

Use this template: "I'd use Kafka here for [specific capability] because [specific reason]. The alternative would be [simpler option] but it lacks [critical feature we need]."

Example: "I'd use Kafka for the order event stream because we need replay capability — when we deploy a new analytics consumer, it needs to reprocess the last 30 days of order events. SQS would be simpler operationally but doesn't support replay, which means we'd need a separate backfill mechanism."

What L5 Says vs. What L6 Says

| Topic | L5 Says | L6 Says |
|-------|---------|---------|
| Choice | "We'll use Kafka for messaging" | "We'll use Kafka because we need multi-consumer replay — three independent services consume the same order stream at different rates. SQS would require three separate queues with duplicated publishing." |
| Ordering | "Kafka preserves message order" | "Kafka preserves order within a partition. We key messages by order_id so all events for one order are sequenced. Cross-order ordering is not required and would limit us to a single partition." |
| Durability | "Kafka is durable" | "With acks=all and min.insync.replicas=2 on a 3-broker cluster, we tolerate one broker failure with zero message loss. The cost is ~3ms additional write latency, acceptable for our order pipeline." |
| Scaling | "We'll add more partitions" | "We start with 12 partitions — enough for our projected 6 consumers with room to double. Adding partitions later breaks key-based ordering, so we overprovision upfront." |
| Failure | "Kafka handles failures automatically" | "Leader election takes 5-15 seconds. During this window, producers with acks=all receive errors and must retry with exponential backoff. We set retries=5 and retry.backoff.ms=100. The dead letter queue handles poison pills that crash consumers." |
| Exactly-once | "Kafka supports exactly-once" | "Kafka provides exactly-once within Kafka-to-Kafka pipelines via the transactional API. For our database sink, we use at-least-once delivery with idempotent upserts keyed on event_id. True end-to-end exactly-once requires idempotency at the consumer." |

Common Interview Mistakes

| Mistake | Why It's Wrong | What to Say Instead |
|---------|----------------|---------------------|
| "Kafka for everything" | Kafka is operationally expensive for simple task queues | "Kafka for event streaming with replay. SQS or Redis for simple task dispatch." |
| "Kafka guarantees message ordering" | Only within a single partition, not across a topic | "Ordering is per-partition. We key by entity ID to guarantee per-entity ordering." |
| "Just add more partitions" | Cannot reduce partitions; adding breaks key routing | "Partition count is planned upfront. 12 partitions with room to grow to 24." |
| "Kafka has exactly-once" | Only within the Kafka ecosystem, not to external sinks | "Exactly-once for Kafka-to-Kafka. Idempotent consumers for external systems." |
| "We'll use Kafka as a database" | Kafka is a log, not a query engine. No indexes, no joins, limited point lookups | "Kafka as the event log, PostgreSQL or Elasticsearch as the queryable derived view." |
| "Consumer lag is not a big deal" | Lag = stale data downstream. Search index 10 min behind = missing results | "Consumer lag SLA: <30s for real-time features, <5m for analytics. Alert on violation." |

Staff Insight