StaffSignal

Design WhatsApp

Staff-Level Playbook

Technologies referenced in this playbook: Redis · Cassandra

How to Use This Playbook

This playbook supports three reading modes:

Mode | Time | What to Read
Quick Review | 15 min | Executive Summary → Interview Walkthrough → Fault Lines (§3) → Drills (§7)
Targeted Study | 1-2 hrs | Interview Walkthrough → Core Flow, expand appendices where you're weak
Deep Dive | 3+ hrs | Everything, including all appendices
What is Chat & Messaging Infrastructure? — Why interviewers pick this topic

The Problem

Chat infrastructure delivers messages between users in real time with ordering guarantees, delivery confirmation, and offline support. The challenge is not sending a message — it is guaranteeing that every message arrives exactly once, in the correct order, across unreliable networks, intermittent connectivity, and server fleets spanning multiple regions. WhatsApp delivers 100B+ messages per day. Slack manages millions of persistent connections with per-workspace message ordering. Telegram handles groups with up to 200K members. The engineering is in the guarantees, not the transport.

Common Use Cases

  • 1-to-1 Chat: Direct messaging with delivery/read receipts (WhatsApp, iMessage)
  • Group Messaging: Fan-out to N recipients with consistent ordering (Slack channels, Telegram groups)
  • Ephemeral Messaging: Disappearing messages with server-side TTL enforcement (Snapchat, Signal)
  • Enterprise Collaboration: Threaded conversations, search, compliance archival (Slack, Teams)

Why Interviewers Ask About This

Chat infrastructure exposes the Staff-level skill of reasoning about delivery guarantees under distributed failure. Everyone can build a chat demo with Socket.io. The Staff question is: "What happens when User A sends a message, their phone loses connectivity, and User B is on a different continent with an offline device?" This surfaces tradeoffs in ordering, consistency, fan-out, encryption, and presence that have no single correct answer — only positions you must defend.

Mechanics Refresher: Message Delivery & Connection Protocols — WebSocket lifecycle, delivery semantics, sequence numbering
Protocol Comparison
Protocol | Direction | Latency | Connection Cost | Best For
WebSocket | Bidirectional | ~50ms | High (stateful TCP) | Chat, typing indicators, real-time sync
MQTT | Bidirectional (pub/sub) | ~100ms | Low (lightweight binary) | Mobile chat with constrained bandwidth
SSE | Server to client | ~100ms | Medium (HTTP keep-alive) | Read-only feeds, notification streams
Long-polling | Simulated bidirectional | 500ms-5s | Low (stateless HTTP) | Fallback when WebSocket is unavailable

Delivery semantics:

  • At-most-once: Fire and forget. Message may be lost. Never duplicated. Suitable for typing indicators.
  • At-least-once: Retry until ACKed. Message may be duplicated. Requires client-side dedup. Default for chat.
  • Exactly-once: At-least-once + idempotency key + server-side dedup. Expensive. Required for payment-related messages.

Sequence numbering: Every message gets a monotonically increasing sequence number scoped to the conversation. Clients use this to detect gaps (missed messages) and reorder out-of-order arrivals. The sequence is assigned server-side at write time — never trust client-generated ordering.
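The gap detection and reordering described above can be sketched as client-side bookkeeping. This is a minimal sketch, assuming server-assigned per-conversation sequence numbers starting at 1; the names (`SequenceTracker`, `accept`, `gaps`) are illustrative, not from any real client library.

```python
# Sketch: client-side sequence bookkeeping for one conversation.
# Handles at-least-once duplicates, out-of-order arrivals, and gap detection.

class SequenceTracker:
    def __init__(self):
        self.next_expected = 1   # next sequence number deliverable in order
        self.buffer = {}         # out-of-order arrivals, keyed by sequence
        self.seen = set()        # sequences already delivered (dedup)

    def accept(self, seq, message):
        """Return the list of messages now deliverable in order."""
        if seq in self.seen or seq < self.next_expected:
            return []            # duplicate from an at-least-once retry
        self.buffer[seq] = message
        deliverable = []
        while self.next_expected in self.buffer:
            deliverable.append(self.buffer.pop(self.next_expected))
            self.seen.add(self.next_expected)
            self.next_expected += 1
        return deliverable

    def gaps(self):
        """Sequences still missing below the highest buffered one."""
        if not self.buffer:
            return []
        return [s for s in range(self.next_expected, max(self.buffer))
                if s not in self.buffer]
```

A gap in `gaps()` is what triggers the backfill fetch (`GET ...?after_seq={n}` style) on reconnect.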

Mechanics Refresher: End-to-End Encryption Concepts — Signal Protocol basics, key exchange, ratcheting

Signal Protocol (used by WhatsApp, Signal, Facebook Messenger):

  • Key pairs: Each device has a long-term identity key and a set of one-time pre-keys uploaded to the server.
  • X3DH key exchange: Initiator combines their identity key with recipient's pre-key bundle to derive a shared secret. No server can read the message.
  • Double Ratchet: After initial exchange, each message advances the ratchet — generating a new encryption key. Compromising one key does not reveal past messages (forward secrecy) or future messages (break-in recovery).
  • Server role: Store and forward encrypted blobs. The server never has plaintext access. This means server-side search, spam detection, and content moderation operate on metadata only.
  • Group E2E: Sender encrypts once with a shared group key (Sender Key protocol). Key rotates when members join or leave. Trade-off: simpler than N pairwise encryptions but requires group key management.
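The group-key management trade-off above can be made concrete with a bookkeeping sketch: rotate the shared key on every membership change. This is illustrative only — `secrets.token_bytes` stands in for real key derivation, and `GroupKeyManager` is a hypothetical name, not the Signal Protocol API.

```python
# Sketch: Sender-Key-style bookkeeping — rotate the shared group key
# whenever membership changes, so leavers can't read future messages
# and joiners can't read past ones.
import secrets

class GroupKeyManager:
    def __init__(self, members):
        self.members = set(members)
        self.key_epoch = 0
        self._rotate()

    def _rotate(self):
        self.group_key = secrets.token_bytes(32)  # new symmetric key
        self.key_epoch += 1

    def remove(self, member):
        self.members.discard(member)
        self._rotate()            # forward secrecy: leaver keeps only the old key
        return len(self.members)  # each remaining member needs the new key

    def add(self, member):
        self.members.add(member)
        self._rotate()            # joiner must not decrypt past messages
```

The return value of `remove` is the number of key distribution messages the rotation triggers — the cost quantified in the key-rotation-storm discussion later.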

Operational cost of E2E: No server-side message search, no server-side spam filtering on content, key rotation on device change generates re-encryption storms, and lost devices mean lost message history unless backup keys exist.

What This Interview Actually Tests

Chat infrastructure is not a "use WebSockets and a database" question. Everyone can build a chat demo.

This is a delivery guarantee and ordering question that tests:

  • Whether you separate the connection layer from the message persistence layer
  • Whether you reason about message ordering as a per-conversation constraint, not a global constraint
  • Whether you understand the fan-out cost difference between 1-to-1 chat and group chat at scale
  • Whether you can articulate the operational trade-offs of end-to-end encryption

The key insight: The hardest problem in chat is not sending messages — it is guaranteeing delivery order when the sender, recipient, and network are all unreliable simultaneously. Staff engineers design for the failure case first, then optimize the happy path.

The L5 vs L6 Contrast (Memorize This)

Level Calibration
Behavior | L5 (Senior) | L6 (Staff)
First move | "WebSocket connection, publish message to recipient" | "What are the delivery guarantees? Is this 1-to-1, small groups, or large broadcast channels? Is E2E encryption required?"
Ordering | "Timestamp each message" | "Per-conversation sequence number assigned server-side. Client-side reordering buffer for out-of-order arrivals. Causal ordering for replies."
Delivery | "Send and hope. Maybe add a retry." | "At-least-once delivery with server ACK. Client-side dedup by message ID. Offline queue with TTL. Server receipt (single tick), delivery receipt (double tick), read receipt (blue ticks)."
Failure handling | "Client reconnects" | "On reconnect, client sends last_sequence_number. Server replays missed messages from persistent store. Exponential backoff with jitter to prevent thundering herd."
Ownership | "Build the chat feature" | "Who owns the connection gateway? Who owns the message store? Who owns the offline delivery queue? Who operates the encryption key infrastructure?"
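The reconnect behavior above — exponential backoff with jitter — is small enough to sketch. A minimal version, with full jitter; the base and cap values are illustrative assumptions, not from any particular client.

```python
# Sketch: reconnect delay with exponential backoff and full jitter.
# Full jitter draws uniformly from [0, capped_exponential], which spreads
# a thundering herd of reconnecting clients across the whole window.
import random

def reconnect_delay(attempt, base=1.0, cap=60.0):
    """Delay in seconds before reconnect attempt `attempt` (0-indexed)."""
    exp = min(cap, base * (2 ** attempt))   # exponential growth, capped
    return random.uniform(0, exp)
```

Without the jitter, every client disconnected by the same gateway failure retries on the same schedule and the herd re-forms.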

The Three Messaging Intents (Pick One and Commit)

Intent | Constraint | Strategy | Correctness Bar
Consumer Chat (WhatsApp/Telegram) | Delivery guarantee + E2E encryption + offline support | Per-device message queue, Signal Protocol, store-and-forward | Every message delivered exactly once, in order, encrypted end-to-end
Enterprise Collaboration (Slack/Teams) | Searchability + compliance + threading | Server-side indexing, retention policies, workspace-scoped ordering | Messages searchable within 5 seconds, 99.99% delivery within 500ms for online users
Large-Scale Broadcast (Telegram channels, Discord) | Fan-out efficiency for 10K-200K recipients | Write-time fan-out for small groups, read-time fan-out for large channels | 95% of recipients receive within 2 seconds, acceptable ordering relaxation

Staff Move: "Before I design anything, I need to know: is this consumer-grade E2E encrypted chat, enterprise collaboration with search and compliance, or large-scale broadcast? The delivery guarantees, storage model, and fan-out strategy change completely. I'll assume consumer chat with E2E encryption for 1-to-1 and groups up to 256 members, with offline delivery support."

The Four Fault Lines (Preview)

  1. Delivery Guarantee — At-most-once (fast, lossy) vs at-least-once with dedup (reliable, complex). Who absorbs duplicate handling?
  2. Message Ordering — Total order (expensive, doesn't scale) vs per-conversation order (practical) vs causal order (for replies and threads). What breaks when ordering is relaxed?
  3. Group Fan-Out — Write-time fan-out (fast reads, expensive writes) vs read-time fan-out (cheap writes, slow reads). The 256-member threshold where everything changes.
  4. Presence Accuracy — Real-time heartbeat (accurate, expensive) vs lazy presence (cheap, stale). Whether "last seen" is a feature or a liability.

Quick Reference: What Interviewers Probe

After You Say... | They Will Ask...
"Use WebSocket for real-time delivery" | "User B is offline. What happens to the message? How long do you store it?"
"Timestamp for ordering" | "Two messages sent 1ms apart from different devices. Which comes first? What if clocks are skewed?"
"Fan-out to group members" | "The group has 50K members. Do you write 50K copies or maintain a shared timeline?"
"End-to-end encryption" | "A user gets a new phone. How do they read old messages? What happens to the group key?"
"Redis pub/sub for routing" | "Redis is a single point of failure. What happens when it goes down for 30 seconds?"

Jump to Practice

-> Active Drills (§7) -- 8 practice prompts with expected answer shapes

System Architecture Overview

(Architecture diagram omitted. Components: Client ↔ Connection Gateway → Message Router → Message Store, with Pub/Sub for routing and the Inbox queue for offline delivery.)

Interview Walkthrough: How to Present This in 45 Minutes

Most interview prep covers the basics — step-by-step architecture walkthroughs at tutorial pace. This section is different. Senior candidates spend 25 minutes on the basics and run out of time before reaching anything interesting. Staff candidates speed through the baseline in 10-12 minutes — fast enough to spend the remaining 30+ minutes on the fault lines, failure modes, and ownership questions that actually determine your level.

The six phases below add up to 45 minutes. The ratios matter: phases 1-4 are deliberately compressed so phase 5 gets the lion's share of time. If you're spending more than 12 minutes before the transition to depth, you're pacing like an L5.

Phase 1: Requirements & Framing (2-3 minutes)

State functional requirements in 30 seconds — don't enumerate, state the category:

  • "Users send and receive messages in 1-to-1 and group conversations. Messages persist across devices, sync offline, and support media attachments, delivery receipts, and typing indicators."

That's it. Don't list every feature. The interviewer knows what chat does.

Invest time on non-functional requirements (this is the Staff move):

  • "The hard constraint is the delivery guarantee: at-least-once with client-side dedup. We can never lose a message, but we can tolerate brief duplicates."
  • Clarify: per-conversation ordering with server-assigned sequence numbers
  • "End-to-end encryption for consumer chat — the server never sees plaintext. Online delivery under 200ms; offline queue with configurable TTL."

Phase 2: Core Entities & API (1-2 minutes)

State entities quickly (30 seconds):

  • Conversation: participant list, per-conversation sequence counter, metadata
  • Message: sender, content (encrypted blob), sequence number, delivery status
  • Connection: WebSocket session tied to a specific gateway server, authenticated
  • Inbox: per-device offline queue holding undelivered messages

API (1 minute) — the key interaction is send-with-acknowledgment:

WS SEND  { conversation_id, content, client_msg_id } → server ACK { sequence_num }
WS RECV  { conversation_id, message, sequence_num }   → client ACK { sequence_num }
GET /conversations/{id}/messages?after_seq={n}         → missed message backfill

Phase 3: High-Level Architecture (5-7 minutes)


Walk through the flow:

  1. Send → Client sends message over WebSocket to Connection Gateway; gateway forwards to Message Router
  2. Persist + sequence → Router writes to Message Store, assigns per-conversation sequence number, ACKs sender
  3. Route → Router publishes to Pub/Sub; recipient's gateway pushes to recipient if online
  4. Offline → If recipient is offline, message queued in Inbox; delivered on reconnect with sequence-based backfill

Key points to hit on the whiteboard:

  1. Connection Gateway — stateful WebSocket servers with Redis-backed connection registry for routing
  2. Message Router — the decision point: persist first (durability), then route (availability)
  3. Inbox Queue — not an afterthought; offline delivery is the common case for global chat
  4. Pub/Sub — Redis Pub/Sub or Kafka depending on scale; decouples send from deliver

Phase 4: Transition to Depth (1 minute)

At this point you've spent ~12 minutes. Now pivot:

"The basic architecture is straightforward — gateway, router, message store. What makes this a Staff-level problem is the failure mode reasoning. Let me dive into three areas: (1) delivery guarantees when the sender's connection drops mid-send and the recipient is on another continent, (2) group fan-out economics at scale — write amplification vs read amplification, (3) E2E encryption key management when a group member is removed."

Then offer the interviewer a choice:

"I can go deep on any of these. Which is most interesting to you?"

If the interviewer doesn't have a preference, lead with delivery guarantees — it's the most impressive and the most universally applicable.

Phase 5: Deep Dives (25-30 minutes)

The interviewer will steer, but be prepared to go deep on any of these. For each, follow the Staff pattern: state the tradeoff → pick a position → quantify the cost → explain who absorbs that cost.

Fault Line 1: Delivery guarantees under failure (7-10 min)

Open with the tradeoff framing:

"The hard case isn't 'Alice sends Bob a message.' The hard case is: Alice hits send, her connection drops before she receives the server ACK, and Bob is offline on another continent. What happened? Did the server receive it? Did it persist? Is it sitting in Bob's inbox or was it lost?"

Walk through the bidirectional ACK protocol:

  1. Client sends message with client_msg_id (UUID generated locally)
  2. Server persists message → assigns sequence_num → ACKs to sender with { client_msg_id, sequence_num }
  3. If the sender doesn't receive the ACK → client retries on reconnect with the SAME client_msg_id
  4. Server deduplicates on client_msg_id → returns existing sequence_num (idempotent)
  5. Recipient's client ACKs delivery → server marks delivered; if no ACK → stays in Inbox for redelivery
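The server side of steps 2-4 is an idempotent persist keyed on client_msg_id. A minimal sketch with in-memory stand-ins for the message store and per-conversation sequence counters; the names (`MessageStore`, `persist`, `after`) are illustrative.

```python
# Sketch: idempotent persist + sequence assignment + backfill,
# matching the bidirectional ACK protocol above.

class MessageStore:
    def __init__(self):
        self.by_client_id = {}  # (conversation_id, client_msg_id) -> sequence_num
        self.counters = {}      # conversation_id -> last assigned sequence
        self.log = {}           # conversation_id -> [(seq, content), ...]

    def persist(self, conversation_id, client_msg_id, content):
        """Persist once; a retry with the same client_msg_id gets the same seq."""
        key = (conversation_id, client_msg_id)
        if key in self.by_client_id:              # duplicate retry: idempotent ACK
            return self.by_client_id[key]
        seq = self.counters.get(conversation_id, 0) + 1
        self.counters[conversation_id] = seq
        self.log.setdefault(conversation_id, []).append((seq, content))
        self.by_client_id[key] = seq
        return seq

    def after(self, conversation_id, last_seq):
        """Backfill: everything a reconnecting client missed."""
        return [(s, c) for s, c in self.log.get(conversation_id, [])
                if s > last_seq]
```

The dedup map is what makes the client's "retry on reconnect with the SAME client_msg_id" safe: the retry is absorbed and the original sequence number is re-ACKed.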

The key insight: "The client_msg_id makes retries idempotent. The server ACK tells the sender 'I have it.' The client ACK tells the server 'they received it.' Without both sides, you have a gap."

Then go deeper on ordering: "Per-conversation sequence numbers give total order within a conversation. But for replies and threads, we also need causal ordering — a reply must never appear before the message it references. We handle this with a reply_to_seq field and client-side buffering: if a reply arrives before its parent, the client buffers it until the parent arrives."
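The reply-buffering rule above — hold a reply until its parent has been displayed — can be sketched in a few lines. Assumptions: messages are dicts with `seq` and `reply_to_seq` fields as described; `causal_deliver` is a hypothetical helper name.

```python
# Sketch: client-side causal buffering for replies.
# displayed: set of seqs already shown; pending: held replies keyed by parent seq.

def causal_deliver(displayed, pending, message):
    """Return the messages that can now be shown, parent before reply."""
    out = []
    parent = message.get("reply_to_seq")
    if parent is not None and parent not in displayed:
        pending.setdefault(parent, []).append(message)  # hold until parent shows
        return out
    out.append(message)
    displayed.add(message["seq"])
    # showing a message may unblock replies that were waiting on it
    for held in pending.pop(message["seq"], []):
        out.extend(causal_deliver(displayed, pending, held))
    return out
```

Note this is a separate, weaker constraint than sequence order: a reply arriving before its parent is buffered even if its sequence number is otherwise deliverable.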

Fault Line 2: Group fan-out economics (5-7 min)

Frame with concrete numbers:

"A 100-person group where every message is fan-out-on-write means 100 inbox writes per message. If the group sends 200 messages/day, that's 20,000 write operations daily per group. At 10,000 active groups, that's 200 million write operations/day just for inbox fan-out."
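The arithmetic in that framing is worth being able to reproduce on the spot. As a sanity check (the variable names are just for the worked example):

```python
# Fan-out-on-write cost model from the framing above.
members, msgs_per_day, groups = 100, 200, 10_000

inbox_writes_per_group = members * msgs_per_day        # 100 * 200 = 20,000/day
total_daily_writes = inbox_writes_per_group * groups   # 20,000 * 10,000 = 200M/day
```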

Present the tradeoff:

  • Write amplification (fan-out-on-write): Write to every recipient's inbox at send time. Read is simple — each user reads their own inbox. Cost: O(N) writes per message.
  • Read amplification (fan-out-on-read): Write once to the conversation store. Each read pulls from the shared store and filters. Cost: O(1) write, but every read hits the shared store.

Pick a position: "I'd use fan-out-on-write for groups under 500 members and fan-out-on-read for large groups like broadcast channels. The threshold is where write amplification exceeds the read query cost — for groups with a 10:1 read-to-message ratio, that's roughly 500 members."

Then address the hot group problem: "A viral group where 500 members are all typing creates a thundering herd on the fan-out pipeline. Mitigation: batch fan-out with micro-batching (100ms windows), and separate the typing indicator pipeline from the message pipeline entirely — typing indicators are fire-and-forget, messages are durable."

Fault Line 3: E2E encryption key management (5-7 min)

"End-to-end encryption sounds simple: encrypt with a shared key, only group members can decrypt. The hard part is key rotation — what happens when someone leaves a group?"

Walk through the group key rotation (Sender Key) approach:

  1. Group key: A symmetric key shared among all group members, used to encrypt messages
  2. Member removal → ALL remaining members must receive a NEW group key. The departing member still has the old key (forward secrecy requires key rotation)
  3. Key rotation storm: In a 100-person group, removing one member triggers 99 key distribution messages. Remove 5 members one at a time and you've sent 485 key messages (99+98+97+96+95) before a single chat message
  4. Multi-device: Each user may have 3+ devices, each needing its own key exchange. 100 members × 3 devices = 300 key distribution events per rotation
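The rotation-storm numbers above reduce to a one-line model, which is worth internalizing because it compounds with the multi-device multiplier. A sketch (the function name is illustrative):

```python
# Key distribution events for one group key rotation:
# every remaining member's devices must receive the new key.

def rotation_events(remaining_members, devices_per_member=1):
    return remaining_members * devices_per_member
```

So one removal from a 100-person group costs 99 distribution messages; with the approximate 3-devices-per-user multiplier from the text, a rotation is on the order of 300 events.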

The architectural implication: "Key management traffic can exceed actual message traffic. We need a separate key management channel with its own queue and delivery guarantees. Key distribution is 'at-least-once with dedup' — a missed key update means a member can't decrypt future messages until they re-request."

Message ordering across regions (3-5 min)

"In a global system with servers in US, EU, and APAC, two members of the same conversation may be connected to different regions. Server-assigned sequence numbers require a single sequencer per conversation, which means cross-region round trips. The options: (1) pin each conversation to a home region and accept higher latency for remote members, (2) use logical clocks (Lamport or vector clocks) for partial ordering, or (3) use a globally consistent ID generator like Snowflake with timestamp-based ordering."

Pick a position: "I'd pin conversations to a home region (where the creator is located) and use async replication. Remote members see ~100-200ms additional latency, which is imperceptible in chat. The alternative — distributed consensus per message — adds 200-400ms for everyone."
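One of the alternatives named above — logical clocks — is compact enough to sketch from memory. A minimal Lamport clock; real systems additionally break ties by node ID, which this illustration omits.

```python
# Sketch: Lamport clock for partial ordering across regions.
# Guarantees: if event A causally precedes B, then timestamp(A) < timestamp(B).
# (The converse does not hold — concurrent events can still interleave.)

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event, e.g. a message send in this region."""
        self.time += 1
        return self.time

    def observe(self, remote_time):
        """Merge a timestamp arriving from another region."""
        self.time = max(self.time, remote_time) + 1
        return self.time
```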

Phase 6: Wrap-Up (2-3 minutes)

Summarize the key insight — don't just restate your architecture:

"Chat messaging is deceptively simple. Every engineer has built a toy chat app with WebSocket and a database. What makes it Staff-level is three things: (1) the delivery guarantee contract — at-least-once with bidirectional ACKs makes the failure modes tractable, (2) the fan-out economics — write amplification forces an architectural boundary at ~500 members, and (3) encryption isn't a feature you bolt on — the key management infrastructure can be larger than the messaging infrastructure."

If time permits, add the organizational insight:

"The hardest operational problem isn't message delivery — it's abuse. Who decides what constitutes spam? Who reviews reports? Who handles government takedown requests for E2E encrypted content? These are organizational design problems that shape the technical architecture."

Common Timing Mistakes

Level Calibration
Mistake | L5 Does This | L6 Does This
10 min on requirements | Lists every message type, attachment format, emoji reaction | States delivery guarantee in 1 min, moves to what's hard
15 min on WebSocket setup | Deep dive into connection handshake, heartbeats, reconnection | "WebSocket with exponential backoff reconnect. Moving on."
No delivery guarantee discussion | Assumes messages just arrive | Volunteers bidirectional ACK protocol proactively
No fan-out economics | Draws one message flow for all group sizes | Quantifies write amplification, names the threshold where strategy changes
Spreads thin | Touches 6 topics at surface level | Goes deep on 2-3 fault lines with numbers
No E2E key management | "We encrypt messages" | Names key rotation storm, quantifies distribution overhead

Reading the Interviewer

Interviewer Signal | What They Care About | Where to Go Deep
Asks about message delivery failures | Reliability engineering | Bidirectional ACK protocol, offline delivery (§3.1)
Asks about group chat scaling | Distributed systems | Fan-out economics, write vs read amplification (§3.3)
Asks about encryption | Security architecture | E2E key management, rotation storms (§3.4)
Asks about multi-region | Global architecture | Message ordering, conversation pinning, replication lag
Asks "what about abuse?" | Product/trust & safety | Content moderation pipeline, E2E encryption vs abuse detection tension
Pushes back on your architecture | Wants to see you defend or adapt | State your reasoning, acknowledge alternatives, explain your tradeoff

What to Deliberately Skip

Level Calibration
Topic | Why L5 Goes Here | What L6 Says Instead
WebSocket protocol details | Feels like foundational knowledge | "WebSocket over TLS, heartbeat every 30s. Standard. Moving on."
Database schema for messages | Feels productive to draw tables | "Cassandra partitioned by conversation_id, clustered by sequence. Schema is trivial."
Typing indicator implementation | Easy feature to describe | "Ephemeral pub/sub, fire-and-forget, separate from message pipeline."
Read receipts state machine | Detailed but not differentiating | "Sent/delivered/read status tracked per recipient. Standard. Moving on."
Emoji reactions | Common feature but shallow | "Append-only reaction events. Not interesting for this interview."


Core sections
  • 1. The Staff Lens
  • 2. Problem Framing & Intent
  • 3. The Fault Lines
  • 4. Failure Modes & Degradation
  • 5. Evaluation Rubric
  • 6. Interview Flow & Pivots
  • 7. Active Drills
  • 8. Deep Dive Scenarios
  • 9. Level Expectations Summary
  • 10. Staff Insiders: Controversial Opinions