Real-time Updates — Cross-Cutting Pattern
The Problem
Users expect live data. But every persistent connection is operational state — memory per socket, file descriptors, health checking, reconnection logic, and graceful drain during deploys. The transport you choose determines your operational ceiling long before it determines your latency floor. Pick wrong and you spend more time managing connections than building features.
Playbooks That Use This Pattern
- Real-Time & WebSocket Systems — Connection lifecycle, presence
- Chat & Messaging — Message delivery, typing indicators
- Feed Generation — Live feed updates, new post notifications
- Leaderboard & Counting — Live score updates
- Collaborative Editing — Real-time document sync
The Core Tradeoff
| Strategy | What Works | What Breaks | Who Pays |
|---|---|---|---|
| WebSockets | True bidirectional, low latency, efficient for high-frequency updates | Sticky sessions, connection draining on deploy, load balancer complexity, no auto-reconnect | Infra team — every deploy is a connection migration event |
| Server-Sent Events (SSE) | Server-push with auto-reconnect built in, works through HTTP proxies, simple ops | Unidirectional only, limited browser connection cap (~6 per domain on HTTP/1.1) | Nobody, if server-push is all you need |
| Long Polling | Works everywhere, no special infra, simple client logic | Connection churn, thundering herd on reconnect, wasted server threads holding idle connections | Backend team — thread/connection pool pressure |
| Short Polling | Operationally trivial, stateless, cacheable | Latency proportional to interval, wasted requests when nothing changes | CDN/API layer — cost scales with poll frequency, not user activity |
Staff Default Position
"Every persistent connection is operational state." Staff default: SSE for server-to-client push (simpler ops, built-in reconnect, works through every proxy and load balancer). WebSockets only when bidirectional communication is genuinely required — collaborative editing, gaming, interactive whiteboards. Short polling for low-frequency updates (<1/min) because the operational simplicity outweighs the latency cost.
Before reaching for WebSockets, multiply: connection_count x memory_per_connection x deploy_frequency. If the answer makes you uncomfortable, SSE or polling is the right call.
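The multiplication is worth actually doing. A quick back-of-envelope sketch, using illustrative numbers (not figures from any specific system):

```python
# Back-of-envelope cost of a WebSocket fleet. All inputs are assumptions.
connections = 500_000          # concurrent connections
memory_per_conn_kb = 24        # ~24KB per WebSocket (TLS + framing + buffers)
deploys_per_week = 20          # every deploy drains and migrates every connection

memory_gb = connections * memory_per_conn_kb / 1_000_000
migrations_per_week = connections * deploys_per_week

print(f"{memory_gb:.0f} GB of resident connection state")
print(f"{migrations_per_week:,} connection migrations per week")
```

Twelve gigabytes of state is manageable; ten million weekly connection migrations is the number that should make you pause.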
Fan-out strategy matters more than transport. Fan-out-on-write (push to all subscribers at write time) gives low read latency but amplifies write cost. Fan-out-on-read (pull on demand) is cheaper to write but shifts cost to every reader. Most systems need a hybrid — fan-out-on-write for active users, fan-out-on-read for the rest.
When to Deviate
- Bidirectional data flow is real, not speculative. Collaborative editing, multiplayer state sync, and interactive drawing all require the client to push structured data upstream continuously. SSE cannot do this.
- Sub-50ms latency is a hard product requirement. Financial tickers, live auctions, competitive gaming. Short polling and SSE add latency floors that matter here.
- You already operate sticky-session infrastructure. If your load balancers and deploy pipeline handle connection draining, the operational tax of WebSockets is already paid.
- Clients are not browsers. Mobile apps and backend services don't share the browser's 6-connection-per-domain limit, making SSE's main drawback irrelevant.
Common Interview Mistakes
| What Candidates Say | What Interviewers Hear | What Staff Engineers Say |
|---|---|---|
| "We'll use WebSockets for everything" | "I haven't considered operational cost" | "SSE for push, WebSockets only where bidirectional is required" |
| "WebSockets are faster than polling" | "I'm comparing transports without considering fan-out" | "Fan-out strategy determines perceived latency more than transport" |
| "We'll add a reconnection layer" | "I don't know SSE has this built in" | "SSE gives us auto-reconnect and Last-Event-ID replay for free" |
| "Long polling as a fallback" | "I haven't sized the connection pool impact" | "Short polling at a reasonable interval is cheaper than holding idle long-poll connections" |
| "We need real-time for all updates" | "I haven't triaged what actually needs sub-second delivery" | "Scores need sub-second push. Profile updates can poll every 60s. Different paths for different SLAs." |
Implementation Deep Dive
1. WebSocket Connection Management — Redis Connection Registry
Every WebSocket server in the fleet maintains local connections, but the system needs to know which user is connected to which server. A Redis-backed connection registry solves this.
Connection Registry Pattern
# When a client connects to gateway server
function onConnect(userId, serverId, connectionId):
# Register connection with TTL (heartbeat-refreshed)
redis.HSET("conn:" + userId,
"server", serverId,
"connId", connectionId,
"connectedAt", now())
redis.EXPIRE("conn:" + userId, 90) # 90s TTL, refreshed by heartbeat
# Add to server's connection set (for drain enumeration)
redis.SADD("server:" + serverId + ":connections", userId)
# Publish presence event
redis.PUBLISH("presence", serialize({ userId, status: "online" }))
# Heartbeat every 30 seconds
function onHeartbeat(userId):
redis.EXPIRE("conn:" + userId, 90) # Refresh TTL
# When a client disconnects
function onDisconnect(userId, serverId):
redis.DEL("conn:" + userId)
redis.SREM("server:" + serverId + ":connections", userId)
redis.PUBLISH("presence", serialize({ userId, status: "offline" }))
Why Redis HASH over plain SET: The hash stores connection metadata (server ID, connection time) alongside registration. When routing a message to a user, you look up the server and send directly — no broadcast to the entire fleet.
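The registry above can be sketched as runnable code. This version swaps the real Redis client for a minimal in-memory stand-in (the `FakeRedis` class and all key names are illustrative) so the data shapes are concrete:

```python
import time

class FakeRedis:
    """Minimal in-memory stand-in for the Redis commands used by the registry."""
    def __init__(self):
        self.hashes, self.sets, self.ttls = {}, {}, {}
    def hset(self, key, mapping):
        self.hashes.setdefault(key, {}).update(mapping)
    def hgetall(self, key):
        return dict(self.hashes.get(key, {}))
    def expire(self, key, seconds):
        self.ttls[key] = time.time() + seconds
    def delete(self, key):
        self.hashes.pop(key, None); self.ttls.pop(key, None)
    def sadd(self, key, member):
        self.sets.setdefault(key, set()).add(member)
    def srem(self, key, member):
        self.sets.get(key, set()).discard(member)

redis = FakeRedis()

def on_connect(user_id, server_id, conn_id):
    # Hash stores routing metadata; TTL is refreshed by heartbeats.
    redis.hset(f"conn:{user_id}", {"server": server_id, "connId": conn_id,
                                   "connectedAt": time.time()})
    redis.expire(f"conn:{user_id}", 90)
    redis.sadd(f"server:{server_id}:connections", user_id)

def on_disconnect(user_id, server_id):
    redis.delete(f"conn:{user_id}")
    redis.srem(f"server:{server_id}:connections", user_id)

on_connect("u1", "gw-3", "c-42")
print(redis.hgetall("conn:u1")["server"])  # gw-3 — one lookup routes the message
```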
Message Routing with Connection Registry
function sendToUser(targetUserId, message):
conn = redis.HGETALL("conn:" + targetUserId)
if conn is empty:
# User offline — queue for later delivery or drop
messageQueue.enqueue(targetUserId, message)
return
targetServer = conn["server"]
if targetServer == THIS_SERVER:
# Local delivery — find local WebSocket and send
localSocket = localConnections.get(targetUserId)
if localSocket:
localSocket.send(serialize(message))
else:
# Remote delivery — publish to server-specific channel
redis.PUBLISH("server:" + targetServer + ":inbox", serialize({
targetUserId, message
}))
Connection Draining During Deploy
function gracefulDrain(serverId, drainDurationSec=60):
# Step 1 — Stop accepting new connections
loadBalancer.removeServer(serverId)
# Step 2 — Notify all local clients to reconnect elsewhere
clients = redis.SMEMBERS("server:" + serverId + ":connections")
for userId in clients:
socket = localConnections.get(userId)
if socket:
socket.send(serialize({ type: "reconnect",
reason: "server_drain",
delay: random(0, drainDurationSec * 1000) }))
# Step 3 — Wait for clients to reconnect with jitter
sleep(drainDurationSec)
# Step 4 — Force-close any remaining connections
for userId in localConnections.keys():
localConnections.get(userId).close(1001, "server_shutdown")
redis.DEL("conn:" + userId)
redis.DEL("server:" + serverId + ":connections")
Why randomized delay: Without jitter, all clients reconnect to remaining servers simultaneously — a thundering herd that can overwhelm the fleet. Spreading reconnections over 60 seconds keeps the load manageable. For a 200-server fleet with 100K connections each, draining one server means 100K reconnections spread over 60 seconds = ~1,700/sec, which is routine.
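The arithmetic in that paragraph checks out directly, and the jitter itself is one line per client:

```python
import random

connections_on_drained_server = 100_000
drain_window_sec = 60

# With uniform jitter, reconnections spread roughly evenly across the window.
reconnects_per_sec = connections_on_drained_server / drain_window_sec
print(round(reconnects_per_sec))  # ≈1,667/sec — the "routine" rate cited above

# Each client independently picks its delay inside the drain window:
delays = [random.uniform(0, drain_window_sec) for _ in range(1000)]
assert all(0 <= d <= drain_window_sec for d in delays)
```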
2. Server-Sent Events (SSE) — The Simpler Default
SSE is HTTP-based, unidirectional (server-to-client), and has auto-reconnect built into the browser specification. For server-push use cases (notifications, feed updates, live scores), SSE is operationally simpler than WebSockets.
SSE Implementation
# Server (Node.js / Express style)
function handleSSE(request, response):
response.setHeader("Content-Type", "text/event-stream")
response.setHeader("Cache-Control", "no-cache")
response.setHeader("Connection", "keep-alive")
userId = authenticate(request)
# Send initial connection event
    response.write("event: connected\ndata: {\"connected\":true}\n\n")
# Subscribe to user's event channel
subscription = eventBus.subscribe("user:" + userId, (event) =>
# Last-Event-ID enables replay on reconnect
response.write("id: " + event.id + "\n")
response.write("event: " + event.type + "\n")
response.write("data: " + serialize(event.payload) + "\n\n")
)
# Handle disconnect
request.on("close", () =>
subscription.unsubscribe()
eventBus.publish("presence", { userId, status: "offline" })
)
Auto-reconnect: When the connection drops, the browser automatically reconnects and sends the Last-Event-ID header. The server uses this to replay any missed events from a bounded buffer. No client-side reconnection logic needed.
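The wire format and the replay logic are simple enough to sketch. The buffer contents and helper names below are illustrative:

```python
def format_sse(event_id, event_type, data):
    """Serialize one event in the text/event-stream wire format."""
    return f"id: {event_id}\nevent: {event_type}\ndata: {data}\n\n"

# Bounded replay buffer, keyed by a monotonically increasing event id.
buffer = [(1, "score", '{"a":1}'), (2, "score", '{"a":2}'), (3, "score", '{"a":3}')]

def replay_since(last_event_id):
    """On reconnect, resend everything after the client's Last-Event-ID."""
    return "".join(format_sse(i, t, d) for i, t, d in buffer if i > last_event_id)

print(replay_since(1))  # replays events 2 and 3, skips what the client has seen
```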
SSE vs WebSocket: Operational Comparison
| Concern | SSE | WebSocket |
|---|---|---|
| Load balancer | Standard HTTP LB (ALB, nginx) | Requires WebSocket-aware LB or sticky sessions |
| Auto-reconnect | Browser-native | Must implement manually |
| Message replay | Last-Event-ID header | Custom cursor protocol |
| Bidirectional | No — use POST for client→server | Yes — native |
| Connection limit (HTTP/1.1) | 6 per domain per browser | No browser limit |
| Connection limit (HTTP/2) | 100+ multiplexed streams | N/A |
| Deploy drain | Close response → auto-reconnect | Custom drain protocol (see above) |
| Memory per connection | ~8KB (no upgrade overhead) | ~24KB (TLS + WebSocket framing + buffers) |
3. Fan-Out Strategies — Small vs Large Groups
The fan-out strategy determines system cost more than the transport choice. A message sent to a 5-person group chat is fundamentally different from a message sent to 50K live stream viewers.
Small Group Fan-Out (Direct Push)
# For groups with < 500 members
function fanOutSmallGroup(groupId, message, senderId):
members = db.query("SELECT user_id FROM group_members WHERE group_id = ?", groupId)
for memberId in members:
if memberId == senderId:
continue # Don't echo to sender
conn = redis.HGETALL("conn:" + memberId)
if conn:
routeToServer(conn["server"], memberId, message)
else:
# Offline: persist for later delivery
redis.LPUSH("offline:" + memberId, serialize(message))
redis.LTRIM("offline:" + memberId, 0, 999) # Bounded queue
Why direct push for small groups: Querying 500 members and routing individually is O(N) but N is small. No pub/sub overhead, no channel management, no subscription lifecycle. Simple and predictable.
Large Group Fan-Out (Pub/Sub with Subscription Channels)
# For groups with > 500 members (live streams, public channels)
function fanOutLargeGroup(channelId, message):
# Publish once — all gateway servers with subscribed users receive it
redis.PUBLISH("channel:" + channelId, serialize(message))
# Each gateway server subscribes on behalf of its local clients
function onClientJoinChannel(userId, channelId, serverId):
# Track subscription locally
localSubscriptions.add(userId, channelId)
# Subscribe to Redis channel if this is the first local subscriber
if localSubscriptions.countForChannel(channelId) == 1:
redis.SUBSCRIBE("channel:" + channelId)
function onRedisMessage(channelId, message):
# Fan out to all local subscribers for this channel
for userId in localSubscriptions.getUsersForChannel(channelId):
socket = localConnections.get(userId)
if socket:
socket.send(message)
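The pseudocode above shows the subscribe side; the cleanup side (unsubscribe when the last local member leaves) is where leaks creep in. A refcount sketch, with illustrative names and a set standing in for the actual Redis SUBSCRIBE/UNSUBSCRIBE calls:

```python
from collections import defaultdict

local_subs = defaultdict(set)   # channel_id -> set of local user_ids
redis_subscribed = set()        # channels this gateway currently subscribes to

def on_join(user_id, channel_id):
    local_subs[channel_id].add(user_id)
    if len(local_subs[channel_id]) == 1:      # first local subscriber
        redis_subscribed.add(channel_id)      # stands in for redis.SUBSCRIBE

def on_leave(user_id, channel_id):
    local_subs[channel_id].discard(user_id)
    if not local_subs[channel_id]:            # last local subscriber left
        redis_subscribed.discard(channel_id)  # stands in for redis.UNSUBSCRIBE
        del local_subs[channel_id]

on_join("u1", "ch1"); on_join("u2", "ch1")
on_leave("u1", "ch1")
print("ch1" in redis_subscribed)  # True — u2 still needs the channel
```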
Fan-Out Decision Matrix
| Group Size | Strategy | Publish Cost | Read Cost | Operational Complexity |
|---|---|---|---|---|
| 1-1 (DM) | Direct push | O(1) lookup + route | O(1) | Minimal |
| 2-500 (group chat) | Direct push with member list | O(N) lookups | O(1) per member | Low |
| 500-50K (large channel) | Redis pub/sub | O(1) publish | O(N) local fan-out per server | Medium — manage subscriptions |
| 50K+ (broadcast) | Tiered pub/sub with edge servers | O(1) publish to backbone | O(N) tiered fan-out | High — edge server fleet |
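The matrix collapses into a small routing helper. Thresholds are the ones from the table; how to break the tie exactly at 500 members is a judgment call, and the strategy names are illustrative:

```python
def fanout_strategy(members):
    """Map group size (member count) to the fan-out strategy in the matrix."""
    if members <= 2:
        return "direct push"                   # 1-1 DM: O(1) lookup + route
    if members <= 500:
        return "direct push with member list"  # O(N) lookups, N is small
    if members <= 50_000:
        return "redis pub/sub"                 # O(1) publish, per-server fan-out
    return "tiered pub/sub"                    # edge-server fleet for broadcast

print(fanout_strategy(50_000_000))  # tiered pub/sub
```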
4. Presence Tracking — Redis SET with TTL and Heartbeat
"Online" status is one of the most expensive features in real-time systems. Naive implementations create O(N) fan-out on every status change.
Presence Implementation
# Presence data in Redis
function setOnline(userId):
redis.SET("presence:" + userId, "online", EX=45) # 45s TTL
# Publish to contacts who are currently online
contacts = getOnlineContacts(userId)
for contactId in contacts:
sendToUser(contactId, { type: "presence", userId, status: "online" })
function heartbeat(userId):
redis.SET("presence:" + userId, "online", EX=45) # Refresh TTL
# No presence publish on heartbeat — only on state change
function setOffline(userId):
redis.DEL("presence:" + userId)
contacts = getOnlineContacts(userId)
for contactId in contacts:
sendToUser(contactId, { type: "presence", userId, status: "offline" })
function isOnline(userId):
return redis.EXISTS("presence:" + userId)
function getOnlineContacts(userId):
contactIds = db.query("SELECT contact_id FROM contacts WHERE user_id = ?", userId)
pipeline = redis.pipeline()
for cid in contactIds:
pipeline.EXISTS("presence:" + cid)
results = pipeline.execute()
return [contactIds[i] for i, r in enumerate(results) if r == 1]
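The TTL semantics can be modeled without Redis at all: an entry is "online" iff its expiry timestamp is in the future. A self-contained sketch (the dict stands in for `SET ... EX`, and the injectable `now` makes it testable):

```python
import time

TTL = 45
_presence = {}  # user_id -> expiry timestamp; stands in for Redis SET key EX 45

def heartbeat(user_id, now=None):
    _presence[user_id] = (now or time.time()) + TTL  # SET with refreshed TTL

def is_online(user_id, now=None):
    expiry = _presence.get(user_id)
    return expiry is not None and expiry > (now or time.time())

def online_contacts(contact_ids, now=None):
    # Equivalent of the pipelined EXISTS loop above.
    return [c for c in contact_ids if is_online(c, now)]

t0 = 1000.0
heartbeat("alice", now=t0)
print(is_online("alice", now=t0 + 30))  # True  — within the 45s TTL
print(is_online("alice", now=t0 + 60))  # False — TTL expired, no heartbeat
```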
Presence Accuracy vs Cost
| Approach | Accuracy | Fan-Out Cost | Memory | Use Case |
|---|---|---|---|---|
| Heartbeat + TTL (45s) | "Online within last 45s" | Per state change | O(online users) | Chat apps — good enough for most |
| Last-seen timestamp | "Last seen 3 min ago" | Zero fan-out | O(all users) | WhatsApp-style — cheapest |
| Real-time presence | Instant online/offline | O(contacts) per change | O(online users) | Slack-style — expensive |
| Lazy presence | On-demand check | Zero proactive fan-out | O(online users) | LinkedIn-style — minimal cost |
Architecture Diagram
Data flow: Clients connect to gateway servers via WebSocket or SSE. Gateways register connections in Redis. When the API tier needs to push a message, it looks up the target user's gateway via the connection registry and publishes to that server's channel. The gateway delivers locally. For large groups, a single Redis PUBLISH fans out to all gateways with subscribers.
Failure Scenarios
1. Gateway Server Crash — 25K Connections Lost
Timeline: Gateway server 3 crashes at 14:00:00. 25K WebSocket connections are severed instantly. All 25K clients begin reconnection simultaneously. Remaining 49 gateways receive a thundering herd of 25K new connections within 2 seconds.
Blast radius: The 25K affected users lose ~5-30 seconds of messages (depending on reconnection speed). Remaining gateway servers experience a CPU spike from connection setup overhead (TLS handshake, authentication, subscription setup). If servers are near capacity, the spike can cascade.
Detection: Gateway health check fails. Redis connection registry for the crashed server shows stale entries (TTL has not yet expired). Connected-user count drops by 25K.
Recovery:
- Client-side exponential backoff with jitter spreads reconnections over 30 seconds instead of 2 seconds
- Redis connection-registry TTL (90s) automatically cleans up stale connection entries — no manual intervention
- Missed messages are replayed via Last-Event-ID (SSE) or cursor-based recovery (WebSocket) from the message buffer
- N+2 fleet sizing ensures remaining servers have capacity to absorb the reconnections
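Client-side backoff is typically "full jitter" exponential backoff: pick a uniform delay between zero and an exponentially growing cap. A minimal sketch (base and cap values are illustrative):

```python
import random

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Full-jitter backoff: uniform in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Each of the 25K reconnecting clients computes its own delay, so arrivals
# spread across tens of seconds instead of landing in the same 2-second window.
for attempt in range(6):
    d = backoff_delay(attempt)
    assert 0 <= d <= min(30.0, 0.5 * 2 ** attempt)
```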
2. Redis Pub/Sub Backpressure — Message Delivery Stall
Timeline: A popular live stream channel has 50K viewers. The streamer generates 20 messages/sec (chat + reactions). Each message fans out to 50 gateway servers. Redis pub/sub throughput is sufficient, but gateway server 7 has a slow consumer — its local fan-out to 5K connections takes 200ms per message. Redis pub/sub has no backpressure — messages are dropped for slow subscribers.
Blast radius: Users connected to gateway 7 miss messages silently. No error, no retry — Redis pub/sub is fire-and-forget. The users see gaps in the chat stream.
Detection: Per-gateway message delivery rate monitoring. If gateway 7 delivers 15 messages/sec while others deliver 20, it is dropping messages. Client-side sequence number gap detection.
Recovery:
- Switch from Redis pub/sub to Redis Streams (XREADGROUP) for durable delivery with consumer acknowledgment
- Buffer messages in a per-gateway queue (Redis list) with a consumer that processes at its own pace
- If a gateway falls too far behind (>5 seconds of lag), disconnect its clients and let them reconnect to healthier servers
- Rate-limit the source — cap chat messages at 10/sec per channel to bound fan-out load
3. Presence Storm During Morning Login Wave
Timeline: 9:00 AM — 500K users log in within a 15-minute window. Each login triggers a presence fan-out to online contacts (average 50 online contacts per user). Total presence messages: 500K users x 50 contacts = 25M presence notifications in 15 minutes = ~28K messages/sec.
Blast radius: Presence messages compete with actual content messages (chat, notifications) for gateway bandwidth. Users experience delayed message delivery because the pipeline is saturated with presence updates.
Detection: Message delivery latency increases. Gateway queue depth grows. Presence message volume spikes relative to content message volume.
Recovery:
- Batch presence updates — instead of sending individual "user X is online" events, send a batch update every 5 seconds: "these 200 contacts came online"
- Deprioritize presence — route presence through a separate, lower-priority channel so content messages are never delayed
- Lazy presence — stop proactively pushing presence; let the client fetch contact status when the user opens their contact list
- Throttle login-wave presence — during detected login spikes, temporarily disable proactive presence fan-out and switch to lazy mode
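The batching recovery is mostly a coalescing map: record every flip, but only emit the latest state per user at each flush. A sketch (class and method names are illustrative):

```python
class PresenceBatcher:
    """Coalesce presence flips into one batch per flush interval,
    instead of one fan-out message per individual state change."""
    def __init__(self):
        self.pending = {}  # user_id -> latest status

    def record(self, user_id, status):
        self.pending[user_id] = status  # later flips overwrite earlier ones

    def flush(self):
        """Called every N seconds by a timer; returns one batched update."""
        batch, self.pending = self.pending, {}
        return batch

b = PresenceBatcher()
b.record("u1", "online"); b.record("u2", "online"); b.record("u1", "offline")
print(b.flush())  # {'u1': 'offline', 'u2': 'online'} — one message, latest state wins
```

A user who flaps online/offline inside one interval costs a single entry, which is exactly what defuses the login-wave storm.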
Staff Interview Application
How to Introduce This Pattern
Lead with the transport decision framework, not a technology choice. Then immediately address fan-out: "The transport is the easy part. The harder question is fan-out strategy — how do we route a message to the right user on the right server without broadcasting to the entire fleet."
When NOT to Use This Pattern
- Update frequency < 1/minute: Short polling is operationally simpler and the latency difference is invisible to users. A 30-second poll interval means average staleness of 15 seconds — acceptable for dashboards, leaderboards, and profile updates.
- No live audience: If users are not actively watching (email summaries, batch reports), push infrastructure is wasted. Queue the updates and deliver on next session.
- Single-server system: If you have one application server with <1K concurrent users, in-process event emitters suffice. No Redis, no pub/sub, no connection registry. Don't build fleet infrastructure for a single-node system.
- Data is cacheable and shared: If all users see the same data (stock ticker, weather, sports scores), a CDN with short TTL and client-side polling is cheaper than maintaining 100K persistent connections.
Follow-Up Questions to Anticipate
| Interviewer Asks | What They Are Testing | How to Respond |
|---|---|---|
| "Why not WebSockets for everything?" | Operational cost awareness | "Every WebSocket connection is operational state — memory, file descriptors, drain on deploy. SSE gives us server-push with auto-reconnect and standard HTTP infrastructure. I use WebSockets only when bidirectional flow is genuinely required." |
| "How do you handle message ordering?" | Distributed systems fundamentals | "Per-user ordering via the connection registry — messages to a user route to one gateway. Cross-user ordering (group chat) uses a sequence number assigned at the application tier before fan-out." |
| "What about mobile clients on flaky networks?" | Reliability engineering | "Cursor-based recovery: each message has a monotonic ID. On reconnect, the client sends its last-seen ID and the server replays from there. This handles both network flaps and server-side connection migrations." |
| "How do you deploy without dropping messages?" | Operational maturity | "Graceful drain: stop accepting new connections, send reconnect-with-jitter to existing clients, wait for drain window, then shut down. Clients reconnect to other servers and replay missed messages via cursor." |
| "How do you scale to 10M connections?" | Architecture scaling | "Horizontal gateway fleet. At 50K connections/server, that is 200 servers. The connection registry in Redis handles routing. The bottleneck shifts to fan-out — large channels need tiered pub/sub with regional edge servers." |