Design with ZooKeeper & etcd — Staff-Level Technology Guide
The 60-Second Pitch
ZooKeeper and etcd are distributed coordination services — small, highly reliable key-value stores that provide linearizable writes, watches (event notifications), and primitives for building distributed systems: leader election, distributed locks, service discovery, and configuration management. They are not general-purpose databases. They store kilobytes of metadata — which node is the leader, what the cluster configuration is, which services are alive — and they make that metadata consistent across every client, every time, with sub-100ms latency.
The Staff-level insight: you almost never interact with ZooKeeper or etcd directly. They are infrastructure primitives that other systems build on. Kafka uses ZooKeeper (or KRaft) for controller election and topic metadata. Kubernetes uses etcd for all cluster state. Hadoop, HBase, and Solr depend on ZooKeeper for coordination. When you mention ZooKeeper or etcd in an interview, you are not proposing a database — you are naming the consensus layer beneath your distributed system. The Staff move is to say: "We need a coordination service for leader election — ZooKeeper or etcd — so that exactly one instance of the matching engine is active at any time, with automatic failover in under 5 seconds."
Why Coordination Services Exist
The Coordination Problem
Distributed systems face a fundamental problem: multiple nodes must agree on shared state. Which node is the leader? What is the current schema version? Is service B alive? Which shard owns key range 1000-2000? These questions require consensus — all nodes must see the same answer at the same time, even during network partitions and node failures.
You cannot solve this with a regular database. A replicated PostgreSQL instance can answer "who is the leader," but what happens when the PostgreSQL primary itself fails? You need leader election for the thing that does leader election. This is the infinite regress problem, and it is why dedicated coordination services exist: they implement consensus protocols (ZAB, Raft) at the lowest level, with no external dependencies, so that every other system in your infrastructure can build on their guarantees.
What coordination services provide:
- Linearizable writes — every write is seen by all clients in the same order. If client A writes `leader = node-3` and client B reads `leader`, it will always get `node-3` (or a later value), never a stale value.
- Watch semantics — clients can subscribe to changes on a key. When the value changes, the coordination service pushes a notification. This enables reactive systems: "when the leader changes, all followers reconfigure immediately."
- Ephemeral state — state that automatically disappears when the client disconnects. This is the foundation of leader election and service discovery: a node registers itself as leader with an ephemeral node/lease, and if it dies, the registration vanishes, triggering a new election.
- Sequential ordering — the ability to create monotonically increasing keys, enabling fair distributed locks and queue implementations.
ZooKeeper — Architecture & Internals
Ensemble & ZAB Protocol
A ZooKeeper deployment is called an ensemble — typically 3 or 5 servers (always an odd number for majority quorum). One server is elected leader; the rest are followers. All write requests are forwarded to the leader, which proposes the write to followers using the ZooKeeper Atomic Broadcast (ZAB) protocol. When a majority (quorum) of servers acknowledge the write, it is committed and the client receives a success response.
ZAB is a consensus protocol similar to Paxos but optimized for the primary-backup model. It has two phases:
- Leader election (recovery phase) — when the ensemble starts or the current leader fails, servers exchange proposals to elect the server with the most up-to-date transaction log. The new leader synchronizes all followers to its state before accepting new writes.
- Atomic broadcast (normal operation) — the leader assigns a monotonically increasing transaction ID (zxid) to each write, broadcasts it to followers, and commits when a quorum acknowledges.
The critical guarantee: ZAB ensures that all committed transactions are delivered in the same order to all servers, and that no committed transaction is lost during leader election. This is what makes ZooKeeper linearizable — every client sees the same sequence of state changes.
Read path: By default, reads can be served by any server (leader or follower), which means reads may return slightly stale data if a follower has not yet applied the latest committed transaction. For strong consistency, clients use the sync command to force the follower to catch up to the leader before reading. In practice, the staleness window is typically single-digit milliseconds — but for leader election and distributed locking, always use sync or read from the leader directly.
Znodes — The Data Model
ZooKeeper stores data in a hierarchical namespace (like a filesystem) of nodes called znodes. Each znode can hold up to 1MB of data (but the convention is <1KB — coordination metadata, not application data) and can have children. The namespace looks like:
/
├── /kafka
│ ├── /kafka/brokers
│ │ ├── /kafka/brokers/ids
│ │ │ ├── /kafka/brokers/ids/0 → {"host":"broker-0","port":9092}
│ │ │ ├── /kafka/brokers/ids/1 → {"host":"broker-1","port":9092}
│ │ │ └── /kafka/brokers/ids/2 → {"host":"broker-2","port":9092}
│ │ └── /kafka/brokers/topics
│ │ └── /kafka/brokers/topics/orders → {partitions: 12, rf: 3}
│ └── /kafka/controller → {"broker_id": 0}
├── /services
│ ├── /services/payment-svc
│ │ ├── /services/payment-svc/instance-001 (EPHEMERAL)
│ │ └── /services/payment-svc/instance-002 (EPHEMERAL)
│ └── /services/order-svc
│ └── /services/order-svc/instance-001 (EPHEMERAL)
└── /config
└── /config/feature-flags → {"dark_mode": true}
Znode types:
| Type | Behavior | Use Case |
|---|---|---|
| Persistent | Exists until explicitly deleted | Configuration, metadata, schema versions |
| Ephemeral | Deleted when the creating client's session ends | Service registration, leader election, locks |
| Persistent Sequential | Persistent + auto-incrementing suffix (e.g., lock-0000000001) | Distributed queues, fair locks |
| Ephemeral Sequential | Ephemeral + auto-incrementing suffix | Leader election with ordering, fair locks with failover |
Watches — Event-Driven Coordination
A watch is a one-time trigger set on a znode. When the znode is created, deleted, or its data changes, ZooKeeper sends a notification to the watching client. Critically, watches are one-time: after firing, the client must re-register the watch to receive subsequent notifications. This design keeps server-side watch state cheap and coalesces rapid successive changes into a single notification, but it requires careful client-side logic to avoid missing events between the watch firing and re-registration — the standard pattern is to re-register the watch as part of the read that follows the notification.
Watch guarantees:
- A client will see the watch event before seeing the new data on a subsequent read. This ordering guarantee is essential for correctness.
- Watch events are delivered in order — if znode A changes before znode B, the client receives the watch for A before the watch for B.
- A watch that fires means "something changed" — the client must read the new value to determine what changed.
The ZooKeeper client library (Curator in Java) wraps raw watches with persistent watchers and tree caches that automatically re-register, handle reconnections, and maintain a local cache of the znode tree. In production, always use Curator — never raw ZooKeeper APIs.
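The re-register-then-read loop is easy to get subtly wrong. The following is a minimal in-memory sketch of the pattern — a toy store, not the real ZooKeeper client API — showing why the watch must be re-registered atomically with (or before) the read, so that no change can slip between the two:

```python
# Toy in-memory model of ZooKeeper's one-shot watch pattern (illustration
# only — real clients register watches server-side as part of a read).
class OneShotWatchStore:
    def __init__(self):
        self.data = {}
        self.watchers = {}  # path -> callbacks; drained when they fire

    def set(self, path, value):
        self.data[path] = value
        for cb in self.watchers.pop(path, []):  # one-shot: fire and forget
            cb(path)

    def get_and_watch(self, path, callback):
        # Register the watch before returning the read, so no change
        # between the read and the registration can be missed.
        self.watchers.setdefault(path, []).append(callback)
        return self.data.get(path)

store = OneShotWatchStore()
seen = []

def on_change(path):
    # The classic loop: re-register as part of the read, then process.
    value = store.get_and_watch(path, on_change)
    seen.append(value)

initial = store.get_and_watch("/config", on_change)  # None — not yet set
store.set("/config", "v1")
store.set("/config", "v2")
# seen == ["v1", "v2"]: each change fired exactly once
```

Curator's tree caches implement exactly this loop (plus reconnection handling) so application code never has to.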
Sessions & Ephemeral Nodes
A ZooKeeper session is a TCP connection between a client and any server in the ensemble, maintained by periodic heartbeats. The session timeout is negotiated between client and server; by default the server accepts timeouts between 2x and 20x its tickTime (4-40 seconds with the default 2-second tickTime). If the server does not receive a heartbeat within the timeout, the session expires and all ephemeral znodes created by that client are deleted.
This session-expiry-triggers-deletion mechanism is the foundation of:
- Leader election: The leader creates an ephemeral node `/election/leader`. If the leader crashes, the session expires, the node is deleted, and watchers are notified to trigger a new election.
- Service discovery: Each service instance creates an ephemeral node under `/services/{service-name}/`. If the instance crashes, it disappears from the registry automatically.
- Distributed locks: The lock holder creates an ephemeral sequential node. If the holder crashes, the lock is released automatically.
etcd — Architecture & Internals
Raft Consensus
etcd uses the Raft consensus protocol — a consensus algorithm designed to be understandable (unlike Paxos). Like ZooKeeper's ZAB, Raft elects a leader that handles all writes and replicates to followers. The key differences are in the protocol mechanics:
Raft leader election: Each server starts as a follower with a randomized election timeout (150-300ms). If a follower does not receive a heartbeat from the leader within its timeout, it becomes a candidate, increments the term counter, and requests votes from other servers. The first candidate to receive a majority of votes becomes leader. The randomized timeout ensures that elections resolve quickly (typically in one round) without the competing-candidates problem of simpler protocols.
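The effect of the randomized timeout can be seen in a few lines. This is an illustrative sketch (function name and structure are my own, not Raft pseudocode): whichever follower draws the shortest timeout becomes a candidate first, and with high probability collects a majority before any rival times out.

```python
import random

# Illustrative model of Raft's randomized election timeout (150-300 ms):
# the follower whose timer fires first becomes the candidate for the next
# term and usually wins the election in a single round.
def first_candidate(node_ids, seed=None):
    rng = random.Random(seed)
    timeouts = {node: rng.uniform(150, 300) for node in node_ids}  # ms
    candidate = min(timeouts, key=timeouts.get)  # shortest timer fires first
    return candidate, timeouts

candidate, timeouts = first_candidate(["n1", "n2", "n3"], seed=42)
# `candidate` drew the shortest timeout and requests votes before the others
```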
Raft log replication: The leader appends each client write to its log, sends AppendEntries RPCs to followers, and commits when a majority acknowledge. Committed entries are applied to the state machine (etcd's key-value store). If a follower's log diverges from the leader's (due to a crash or partition), the leader identifies the divergence point and overwrites the follower's log from that point forward.
Key-Value Store & MVCC
etcd stores data as a flat key-value namespace (no hierarchy, unlike ZooKeeper's znode tree). Keys are arbitrary byte strings, values are limited to ~1.5MB per key (but convention is <1KB). Internally, etcd uses a B+ tree (bbolt) for persistent storage and maintains a multi-version concurrency control (MVCC) model — every key modification creates a new revision, and clients can read historical values at any revision.
The MVCC model enables powerful features:
- Watch from a specific revision — a client can say "notify me of all changes to keys matching `/services/*` starting from revision 42." If the client disconnects and reconnects, it can resume watching from the last revision it processed, guaranteeing no missed events.
- Transactional reads at a point in time — a client can read multiple keys at the same revision, getting a consistent snapshot even if writes are occurring concurrently.
- Compaction — old revisions are periodically compacted to reclaim storage. etcd retains a configurable number of revisions (or time-based retention). After compaction, watches and reads at compacted revisions fail with a `compacted` error.
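The revision mechanics can be sketched with a toy store (illustration only — real etcd persists this in bbolt and exposes it over gRPC): every put bumps a global revision, historical reads walk the history up to a revision, and a watch can replay everything from a given revision onward.

```python
# Toy MVCC store illustrating etcd-style revisions (not the etcd API).
class MVCCStore:
    def __init__(self):
        self.revision = 0
        self.history = []  # append-only log of (revision, key, value)

    def put(self, key, value):
        self.revision += 1
        self.history.append((self.revision, key, value))
        return self.revision

    def get(self, key, revision=None):
        # Read the key as of `revision` (default: latest).
        rev = revision if revision is not None else self.revision
        value = None
        for r, k, v in self.history:
            if r > rev:
                break
            if k == key:
                value = v
        return value

    def watch_from(self, prefix, start_revision):
        # Replay every matching event at or after start_revision —
        # this is what makes etcd watches resumable with no missed events.
        return [(r, k, v) for r, k, v in self.history
                if r >= start_revision and k.startswith(prefix)]

s = MVCCStore()
r1 = s.put("/services/a", "addr-1")   # revision 1
r2 = s.put("/services/a", "addr-2")   # revision 2
old = s.get("/services/a", revision=r1)   # "addr-1" — historical read
replay = s.watch_from("/services/", r2)   # [(2, "/services/a", "addr-2")]
```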
Watch API — Persistent & Multiplexed
etcd's watch API is a significant improvement over ZooKeeper's one-shot watches. etcd watches are persistent — once registered, they continuously deliver events until the client cancels. Watches support prefix matching (/services/ matches all keys under that prefix), range queries, and resumption from a specific revision.
Under the hood, etcd multiplexes all watches for a client over a single gRPC stream. A client watching 1,000 keys uses one TCP connection, not 1,000. This makes etcd watches far more efficient at scale than ZooKeeper watches, which require re-registration after each event and create per-watch overhead.
Watch event structure: Each event contains the key, the new value (for PUT) or the previous value (for DELETE), the revision number, and the event type. The client processes events in revision order, maintaining a consistent view of the key-value store.
Leases — etcd's Ephemeral Mechanism
etcd's equivalent of ZooKeeper's ephemeral nodes is the lease mechanism. A client creates a lease with a TTL (e.g., 10 seconds), attaches keys to the lease, and periodically sends keep-alive requests to refresh the lease. If the client fails to send a keep-alive before the TTL expires, the lease is revoked and all attached keys are deleted.
Lease vs. ZooKeeper session: A ZooKeeper session is tied to a TCP connection — if the connection drops, a session timeout begins. An etcd lease is application-managed — the client explicitly sends keep-alive RPCs. This gives etcd clients more control: a client can let a lease expire intentionally (e.g., during graceful shutdown) without closing the connection. The tradeoff is more application logic — the client must manage keep-alive goroutines/threads and handle lease expiry gracefully.
# etcd lease-based service registration
$ etcdctl lease grant 10 # Create 10s lease
# lease 694d7550e9f211c1 granted with TTL(10s)
$ etcdctl put /services/api/instance-1 \
'{"host":"10.0.1.5","port":8080}' \
--lease=694d7550e9f211c1 # Attach key to lease
$ etcdctl lease keep-alive 694d7550e9f211c1 # Heartbeat loop
# (every ~3s, prevents expiry)
# If client crashes → lease expires → key deleted → watchers notified
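The lease lifecycle shown above can be modeled in a few lines. This is an in-memory sketch of the server-side bookkeeping (class and method names are my own, not the etcd client API): a keep-alive pushes the expiry forward; a missed keep-alive lets the lease expire and deletes every attached key.

```python
# Toy lease table modeling etcd's TTL + keep-alive semantics.
# Timestamps are passed explicitly so the example is deterministic.
class LeaseTable:
    def __init__(self):
        self.leases = {}  # lease_id -> expiry timestamp
        self.keys = {}    # key -> lease_id

    def grant(self, lease_id, ttl, now):
        self.leases[lease_id] = now + ttl

    def attach(self, key, lease_id):
        self.keys[key] = lease_id

    def keep_alive(self, lease_id, ttl, now):
        self.leases[lease_id] = now + ttl  # refresh the expiry

    def tick(self, now):
        # Revoke expired leases and delete every key attached to them.
        expired = [lid for lid, exp in self.leases.items() if exp <= now]
        for lid in expired:
            del self.leases[lid]
            self.keys = {k: v for k, v in self.keys.items() if v != lid}

t = LeaseTable()
t.grant("lease-1", ttl=10, now=0)
t.attach("/services/api/instance-1", "lease-1")
t.keep_alive("lease-1", ttl=10, now=8)   # heartbeat at t=8 → expires at 18
t.tick(now=12)
alive_after_heartbeat = "/services/api/instance-1" in t.keys   # True
t.tick(now=20)                            # no more heartbeats → revoked
gone_after_expiry = "/services/api/instance-1" not in t.keys   # True
```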
ZooKeeper vs etcd — Head-to-Head
| Dimension | ZooKeeper | etcd |
|---|---|---|
| Consensus protocol | ZAB (ZooKeeper Atomic Broadcast) | Raft |
| Data model | Hierarchical znode tree (like filesystem) | Flat key-value with prefix matching |
| Watch semantics | One-shot (must re-register) | Persistent (continuous stream) |
| Ephemeral mechanism | Session-based (connection-tied) | Lease-based (TTL + keep-alive) |
| Language | Java (JVM) | Go (single binary) |
| Max data per node/key | 1MB per znode | 1.5MB per key |
| MVCC / revision history | No (only current state) | Yes (read at any revision, watch from revision) |
| Client libraries | Curator (Java), kazoo (Python) | Official clients in Go, Java, Python, etc. |
| Primary ecosystem | Hadoop, Kafka (pre-KRaft), HBase, Solr | Kubernetes, CoreDNS, Vitess, CockroachDB |
| Operational complexity | Higher (JVM tuning, myid files, purging) | Lower (single Go binary, simpler config) |
| Performance (writes/s) | ~10,000-20,000 | ~10,000-20,000 |
| Performance (reads/s) | ~100,000+ (any server) | ~100,000+ (any server) |
When to choose ZooKeeper: Your ecosystem already depends on it (Kafka pre-KRaft, Hadoop, HBase). You need the hierarchical znode model for complex namespace management. You are in a Java-heavy organization with JVM operational expertise.
When to choose etcd: You are on Kubernetes (etcd is already there). You want simpler operations (single binary, no JVM). You need revision-based watches for reliable event processing. You are starting fresh with no legacy ZooKeeper dependency.
The trend: etcd is winning. Kafka replaced ZooKeeper with KRaft. New distributed systems overwhelmingly choose etcd or embed Raft directly. ZooKeeper is stable and battle-tested but is entering maintenance mode for many organizations.
Common Use Cases
Leader Election
The most common use case. One process must be the active leader (processing writes, running the scheduler, owning the matching engine) while others stand by as followers ready to take over.
ZooKeeper pattern:
- Each candidate creates an ephemeral sequential node: `/election/candidate-0000000001`, `/election/candidate-0000000002`, etc.
- The candidate with the lowest sequence number is the leader.
- Each non-leader watches the node with the sequence number just below its own (not the leader's node — this prevents the "thundering herd" on leader failure).
- If the watched node is deleted (predecessor fails or gives up leadership), the watcher checks if it is now the lowest — if so, it becomes leader.
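The watch-your-predecessor structure is the subtle part of this recipe. A small sketch (helper name is my own; a real implementation would use Curator's LeaderLatch) that, given the current election nodes, computes who leads and whom each candidate should watch:

```python
# Illustrative helper for the ZooKeeper election recipe: lowest sequence
# number leads; everyone else watches only their immediate predecessor,
# so a node's failure wakes exactly one watcher (no thundering herd).
def election_roles(nodes):
    ordered = sorted(nodes)  # zero-padded suffixes sort lexicographically
    roles = {}
    for i, node in enumerate(ordered):
        if i == 0:
            roles[node] = ("leader", None)
        else:
            roles[node] = ("follower", ordered[i - 1])
    return roles

nodes = ["candidate-0000000003", "candidate-0000000001", "candidate-0000000002"]
roles = election_roles(nodes)
# ...0001 leads; ...0002 watches ...0001; ...0003 watches ...0002
```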
etcd pattern:
- Each candidate creates a key with a lease: `/election/leader` with value `node-3` and a 10-second lease.
- Only the first `put` succeeds (using a transaction with `create_revision == 0` as the condition).
- The leader maintains the lease with keep-alives. Followers watch the `/election/leader` key.
- If the leader fails, the lease expires, the key is deleted, and all watchers are notified. The first follower to successfully create the key becomes leader.
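The `create_revision == 0` transaction is a create-if-absent compare-and-swap. A toy model of the server-side check (illustration only, not the etcd client API):

```python
# Toy model of etcd's election transaction: create /election/leader only
# if the key has never been created (create_revision == 0).
class TxnStore:
    def __init__(self):
        self.revision = 0
        self.kv = {}  # key -> (value, create_revision)

    def create_if_absent(self, key, value):
        self.revision += 1
        if key in self.kv:          # create_revision != 0 → txn ELSE branch
            return False
        self.kv[key] = (value, self.revision)
        return True

store = TxnStore()
won = store.create_if_absent("/election/leader", "node-3")   # True — wins
lost = store.create_if_absent("/election/leader", "node-7")  # False — taken
# When node-3's lease expires, the key is deleted and the race reruns.
```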
Distributed Locks
Similar to leader election, but for short-lived critical sections rather than long-lived leadership.
The recipe (ZooKeeper):
- Create an ephemeral sequential node under `/locks/{resource}/lock-`.
- Get all children of `/locks/{resource}/`, sorted by sequence number.
- If your node has the lowest sequence number, you hold the lock.
- Otherwise, watch the node immediately before yours.
- When that node is deleted, re-check if you are now lowest.
The critical concern: distributed locks in coordination services are advisory — there is no enforcement at the resource level. If the lock holder's session expires (due to a long GC pause or network partition) while it is still performing the critical operation, the lock is released and a second process acquires it. Both processes now believe they hold the lock. This is the "fencing token" problem.
The fencing token solution: When a client acquires a lock, it receives a monotonically increasing token (ZooKeeper's zxid or sequence number, etcd's revision). The client includes this token in every operation on the protected resource. The resource validates that the token is the latest — if a stale token arrives (from a client whose lock expired), the operation is rejected. This requires the protected resource to support token validation, which is why fencing works with databases (compare-and-swap on a version column) but not with external APIs or third-party services.
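The resource-side check is the whole point of fencing. A minimal sketch (class and method names are my own) of a protected resource that tracks the highest token seen and refuses anything older:

```python
# Sketch of fencing-token validation on the protected-resource side.
# The token would be a zxid/sequence number (ZooKeeper) or revision (etcd).
class FencedResource:
    def __init__(self):
        self.highest_token = 0
        self.writes = []

    def write(self, token, data):
        # Reject any token older than the newest one we have seen.
        if token < self.highest_token:
            raise PermissionError(f"stale fencing token {token}")
        self.highest_token = token
        self.writes.append(data)

res = FencedResource()
res.write(33, "from A")    # A acquired the lock with token 33
res.write(34, "from B")    # A's session expired; B acquired with token 34
try:
    res.write(33, "late write from A")  # A resumes after a long GC pause
    rejected = False
except PermissionError:
    rejected = True        # stale token → write refused, no corruption
```

In a relational database the same check is a compare-and-swap on a version column: `UPDATE ... WHERE token <= :my_token` style guards.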
Service Discovery
Each service instance registers itself by creating an ephemeral node (ZooKeeper) or a leased key (etcd). Clients watch the service's parent path and maintain a local cache of available instances. When an instance crashes, its registration disappears automatically, and watchers update their routing tables.
ZooKeeper service discovery:
/services/payment-svc/instance-001 → {"host":"10.0.1.5","port":8080} (EPHEMERAL)
/services/payment-svc/instance-002 → {"host":"10.0.1.6","port":8080} (EPHEMERAL)
etcd service discovery:
/services/payment-svc/instance-001 → {"host":"10.0.1.5","port":8080} (lease: 10s)
/services/payment-svc/instance-002 → {"host":"10.0.1.6","port":8080} (lease: 10s)
In practice, dedicated service meshes (Consul, Istio) have largely replaced raw ZooKeeper/etcd for service discovery. But the underlying mechanism — ephemeral registrations with watch-based notification — is the same. Know the primitive even if you propose a higher-level tool.
Health checking nuance: Ephemeral nodes/leases only detect process death (no heartbeat = process is gone). They do not detect application-level health (process is alive but not serving requests correctly). For production service discovery, combine ephemeral registration with application-level health checks — the service registers itself on startup, but a separate health-check mechanism (HTTP /health endpoint, TCP probe) determines whether the service is actually healthy. Kubernetes implements this with liveness probes (is the process alive?) and readiness probes (is the process ready to serve traffic?) as separate concepts — both stored in etcd but evaluated differently.
Configuration Management
Store cluster-wide configuration in ZooKeeper/etcd. All nodes watch the configuration key and hot-reload when it changes. This is simpler than pushing configuration changes to every node individually, and the coordination service guarantees that all nodes see the same configuration in the same order.
Examples: Feature flags, rate limit thresholds, database connection strings, shard mapping tables, circuit breaker configurations. Kubernetes ConfigMaps and Secrets are stored in etcd — every kubectl apply updates etcd, and the kubelet watches for changes.
Atomic configuration updates: Both ZooKeeper (multi-op) and etcd (transactions) support atomic updates across multiple keys. This is critical for configurations that must change together — you cannot update the database host without also updating the port. In etcd, a transaction looks like:
txn:
IF: mod_revision("/config/version") == 15
THEN: put("/config/db-host", "new-host"),
put("/config/db-port", "5433"),
put("/config/version", "16")
ELSE: fail (concurrent modification)
This compare-and-swap pattern ensures that configuration updates are atomic and that concurrent updates are detected and rejected. It is the same pattern used for distributed locks and optimistic concurrency control.
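A toy model of that transaction's semantics (illustration only — real etcd evaluates the compare and the puts atomically inside one Raft entry): the guard compares the version key's `mod_revision`; if it has moved, the whole batch is rejected.

```python
# Toy transactional store modeling the mod_revision compare-and-swap
# shown in the txn pseudocode above (not the etcd client API).
class ConfigStore:
    def __init__(self):
        self.revision = 0
        self.kv = {}
        self.mod_revision = {}

    def put(self, key, value):
        self.revision += 1
        self.kv[key] = value
        self.mod_revision[key] = self.revision

    def txn(self, guard_key, expected_rev, puts):
        # IF mod_revision(guard_key) == expected_rev THEN apply all puts.
        if self.mod_revision.get(guard_key, 0) != expected_rev:
            return False  # concurrent modification detected
        for k, v in puts:
            self.put(k, v)
        return True

store = ConfigStore()
store.put("/config/version", "15")            # mod_revision == 1
ok = store.txn("/config/version", 1,
               [("/config/db-host", "new-host"),
                ("/config/db-port", "5433"),
                ("/config/version", "16")])    # succeeds atomically
stale = store.txn("/config/version", 1,
                  [("/config/db-host", "other")])  # guard moved → rejected
```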
Scaling & Performance
Read Scaling
Both ZooKeeper and etcd can serve reads from any server (leader or follower), making reads horizontally scalable by adding servers.
ZooKeeper observer nodes: An observer participates in replication (receives all committed transactions) but does not vote in quorum. Adding observers scales read throughput without increasing quorum latency. A 3-voter + 2-observer ensemble serves 5x the reads of a standalone server while maintaining the same write latency as a 3-server ensemble.
etcd learner nodes: Similar to ZooKeeper observers. Learners replicate the leader's log but do not vote, so adding them scales read capacity without widening the quorum. Linearizable reads still confirm the leader's commit index via the Raft ReadIndex mechanism; serializable reads can be served locally and may be slightly stale. etcd 3.4+ supports learner nodes as a first-class feature.
Write Limits
Writes are not horizontally scalable — every write goes through the leader and requires quorum acknowledgment. Both ZooKeeper and etcd sustain approximately 10,000-20,000 writes per second under ideal conditions (small values, low-latency network). This is by design: coordination services prioritize consistency and correctness over throughput.
Why the write ceiling is acceptable: Coordination data changes infrequently. Leader elections happen during failures (rare). Configuration updates happen during deployments (minutes to hours apart). Service registrations happen at startup/shutdown. If you are generating 50,000 coordination writes per second, you are misusing the coordination service — that data belongs in a database.
Cluster Sizing
| Cluster Size | Quorum | Fault Tolerance | Latency Impact |
|---|---|---|---|
| 1 node | 1 | None — testing only | Lowest |
| 3 nodes | 2 | 1 node failure | Baseline |
| 5 nodes | 3 | 2 node failures | ~10-20% higher write latency |
| 7 nodes | 4 | 3 node failures | ~30-50% higher write latency |
3 nodes is the production minimum. 5 nodes is standard for critical coordination (Kubernetes control plane, Kafka metadata). 7 nodes is rare — the added fault tolerance rarely justifies the latency cost. Beyond 7, use observer/learner nodes for read scaling instead.
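The quorum column in the table is simple majority arithmetic, which also explains why even-sized clusters are pointless (4 nodes tolerate the same 1 failure as 3, with more replication cost):

```python
def quorum(n):
    # Majority quorum: strictly more than half of the voting members.
    return n // 2 + 1

def fault_tolerance(n):
    # How many members can fail while a quorum still exists.
    return n - quorum(n)

sizing = {n: (quorum(n), fault_tolerance(n)) for n in (1, 3, 4, 5, 7)}
# 4 nodes: quorum 3, tolerance 1 — no better than a 3-node cluster
```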
Failure Modes & Recovery
1. Session Expiry Storm (ZooKeeper)
Symptoms: Hundreds of ephemeral nodes deleted simultaneously, all services re-registering, thundering herd on ZooKeeper reads, downstream systems experiencing cascading failovers.
Root cause: A network partition or ZooKeeper leader election causes many client sessions to expire simultaneously. Each session expiry deletes all ephemeral nodes for that client, triggering watches on every deletion, causing all watching clients to re-read and re-register.
Fix: Stagger session timeouts across clients (don't use the same timeout for every service). Implement exponential backoff on re-registration. Use a local cache of service discovery data with a grace period — don't immediately remove a service from the routing table when its ephemeral node disappears; wait for confirmation.
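The grace-period cache can be sketched in a few lines. This is an illustrative design (class and method names are my own): when an ephemeral registration vanishes, start a timer instead of evicting immediately, and cancel the timer if the instance re-registers.

```python
# Sketch of a routing cache with a removal grace period, so a session
# expiry storm does not instantly empty the routing table.
class GracefulRegistry:
    def __init__(self, grace=5.0):
        self.instances = {}  # name -> address
        self.pending = {}    # name -> eviction deadline
        self.grace = grace

    def registered(self, name, addr):
        self.instances[name] = addr
        self.pending.pop(name, None)  # re-registration cancels eviction

    def node_deleted(self, name, now):
        # Ephemeral node vanished: start the grace timer, don't evict yet.
        self.pending[name] = now + self.grace

    def live(self, now):
        # Evict only entries whose grace period has fully elapsed.
        for name, deadline in list(self.pending.items()):
            if deadline <= now:
                self.instances.pop(name, None)
                del self.pending[name]
        return dict(self.instances)

reg = GracefulRegistry(grace=5.0)
reg.registered("instance-1", "10.0.1.5:8080")
reg.node_deleted("instance-1", now=10.0)          # watch fired: node gone
still_routed = "instance-1" in reg.live(now=12.0)  # within grace → True
evicted = "instance-1" not in reg.live(now=16.0)   # grace over → True
```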
2. Watch Overhead & Fan-Out
Symptoms: ZooKeeper leader CPU at 100%, increasing latency on all operations, large number of pending watch notifications in the pipeline.
Root cause: Too many clients watching the same znode (e.g., 10,000 service instances watching /config/feature-flags). When the znode changes, ZooKeeper must deliver 10,000 notifications from the leader. In ZooKeeper, each watch notification is processed and sent individually.
Fix: Use a tiered watch architecture — a small number of "relay" nodes watch ZooKeeper and propagate changes to the application fleet via a separate mechanism (pub/sub, polling with ETag). In etcd, this is less severe because watch events are multiplexed over gRPC streams, but very high fan-out (>10,000 watchers) still causes leader CPU saturation.
3. Split Brain During Leader Election
Symptoms: Two nodes both believe they are the leader, conflicting writes, data corruption.
Root cause: This should never happen with correct ZAB/Raft implementations — both protocols guarantee that at most one leader exists per term/epoch. In practice, split brain occurs due to: long GC pauses that cause a node to miss its own demotion, clock skew affecting timeout calculations, or bugs in custom leader election implementations built on top of ZooKeeper/etcd.
Fix: Always use well-tested leader election libraries (Curator's LeaderLatch for ZooKeeper, etcd's concurrency package in Go). Never implement leader election from scratch. Use fencing tokens — every operation the leader performs includes its epoch number, and downstream systems reject operations from old epochs.
4. Slow Followers & Degraded Quorum
Symptoms: Write latency increasing gradually, one follower consistently behind, leader log growing unboundedly.
Root cause: A follower node with degraded disk I/O, CPU throttling (in containerized environments), or network issues cannot keep up with the leader's replication rate. The leader must retain log entries until the slow follower acknowledges them, consuming memory and disk.
Fix: Monitor follower lag metrics. In ZooKeeper, check lastProcessedZxid lag between leader and followers. In etcd, check etcd_server_slow_apply_total and raft_apply_duration_seconds. If a follower is consistently >1 second behind, investigate infrastructure issues. As a last resort, remove the slow follower and add a fresh one — it will catch up from a snapshot.
5. Database Size Growth (etcd)
Symptoms: etcd alerts on database size exceeding quota (default 2GB), writes rejected with mvcc: database space exceeded, compaction taking longer.
Root cause: etcd's MVCC model retains all revisions until compaction. If compaction is not running frequently enough, or if the application writes high-volume data that does not belong in etcd, the database grows beyond the quota.
Fix: Enable automatic compaction (--auto-compaction-retention=1 for hourly). Defragment periodically (etcdctl defrag). Increase the quota if necessary (max recommended: 8GB). Most importantly, audit what is being stored in etcd — if any key has >10KB of data or there are >100,000 keys, that data probably belongs in a database.
When to Use — And When Not To
When Coordination Services Are the Right Answer
- Leader election for stateful services — matching engines, single-writer databases, scheduler instances, stream processing coordinators. Exactly one active instance with automatic failover.
- Distributed locks with fencing — exclusive access to a shared resource where correctness (not performance) is the priority.
- Cluster metadata management — which nodes are alive, what their roles are, what version of the schema they are running.
- Configuration distribution — cluster-wide settings that must be consistent and hot-reloadable.
- Service discovery (when a service mesh is overkill) — small-to-medium deployments where Consul or Istio adds unnecessary complexity.
When Coordination Services Are the Wrong Answer
- High-throughput data storage — >1,000 writes/sec sustained or >8GB total data. Use a database.
- Message queuing — use Kafka, RabbitMQ, or SQS. ZooKeeper/etcd are not message brokers.
- Session storage — use Redis or DynamoDB. Coordination services are not caches.
- Metrics and time-series data — use Prometheus, InfluxDB, or TimescaleDB.
- Anything where "eventually consistent" is acceptable — coordination services pay a consistency premium that is wasted if you do not need linearizability.
Interview Application — Staff-Level Plays
Which Playbooks Use ZooKeeper & etcd
| Playbook | How ZooKeeper/etcd Is Used | Key Pattern |
|---|---|---|
| Distributed Consensus | Core consensus primitives for leader election | Ephemeral keys/leases with fencing tokens for split-brain prevention |
| Service Discovery | Service registration and health-aware routing | Watch/lease-based registration with real-time updates to consumers |
| Load Balancer | Backend membership and health state coordination | Watches notify load balancers of topology changes |
| Database Sharding | Shard-to-node assignment and rebalancing coordination | Metadata store for shard map with atomic reassignment |
L5 vs L6 Responses
| Scenario | L5 Answer | L6/Staff Answer |
|---|---|---|
| "How do you ensure one active leader?" | "Use ZooKeeper for leader election" | "Use etcd with a 5-second lease for the matching engine leader. Each candidate attempts to create /election/leader with a transaction (create_revision == 0). The winner maintains keep-alives. Followers watch the key and attempt creation on deletion. Failover completes in <10 seconds. Every write includes the lease revision as a fencing token." |
| "How do you handle configuration changes?" | "Store config in ZooKeeper" | "Store config in etcd with a watch on /config/. When a config key changes, each service receives the event, validates the new value, and hot-reloads. Use etcd transactions to update multiple config keys atomically. The MVCC revision serves as a version number — services can detect if they missed a config update." |
| "ZooKeeper or etcd?" | "They're both for coordination" | "etcd for new systems — simpler operations (single binary), persistent watches (no re-registration), MVCC for reliable event replay. ZooKeeper if the ecosystem requires it (pre-KRaft Kafka, HBase). Both provide linearizable writes at ~10K/s and read scaling via observer/learner nodes." |
| "How do you handle the leader dying?" | "ZooKeeper detects it and elects a new one" | "The leader's etcd lease expires after 5 seconds with no keep-alive. The key is deleted, triggering watch events on all followers. The fastest follower creates the key with a new lease. Total failover: lease TTL (5s) + election (sub-second) + initialization (application-dependent). We design the application to be idempotent during the transition window." |
The Staff Coordination Checklist
When proposing a coordination service in an interview:
- Justify the need: "This component requires exactly-one semantics — only one instance can own the write path at a time. That requires leader election with automatic failover."
- Choose the tool: "etcd for new systems, ZooKeeper if the ecosystem already has it."
- Specify the mechanism: "Leader election via ephemeral key with 5-second lease. Fencing tokens to prevent split-brain writes."
- State the failure mode: "If the leader dies, failover takes <10 seconds. During the transition, writes are paused — the system is consistent but briefly unavailable (CP tradeoff)."
- Scope it narrowly: "We only use etcd for leader election metadata — 3 keys total. Application data goes in PostgreSQL. Cache goes in Redis. The coordination service handles coordination only."
Quick Reference Card
ZooKeeper:
Protocol: ZAB (ZooKeeper Atomic Broadcast)
Data model: Hierarchical znode tree, 1MB/znode max
Watches: One-shot (must re-register), use Curator library
Ephemeral: Session-based (tied to TCP connection, ~30s timeout)
Ecosystem: Kafka (pre-KRaft), Hadoop, HBase, Solr
Cluster: 3 or 5 nodes, observers for read scaling
etcd:
Protocol: Raft consensus
Data model: Flat key-value, 1.5MB/key max, MVCC revisions
Watches: Persistent (continuous gRPC stream), prefix matching
Ephemeral: Lease-based (TTL + keep-alive, configurable)
Ecosystem: Kubernetes, CoreDNS, Vitess, CockroachDB
Cluster: 3 or 5 nodes, learners for read scaling
Shared limits:
Writes: ~10K-20K/s (leader bottleneck)
Reads: ~100K/s (any node)
Data size: Keep under 8GB total, <1KB per key
Use for: Leader election, locks, service discovery, config
Do NOT use: Data storage, messaging, caching, metrics