StaffSignal
Cross-Cutting Framework

Real-time Updates

WebSockets vs SSE vs polling and fan-out strategies. Every persistent connection is operational state.

Real-time Updates — Cross-Cutting Pattern

The Problem

Users expect live data. But every persistent connection is operational state — memory per socket, file descriptors, health checking, reconnection logic, and graceful drain during deploys. The transport you choose determines your operational ceiling long before it determines your latency floor. Pick wrong and you spend more time managing connections than building features.

The Core Tradeoff

| Strategy | What Works | What Breaks | Who Pays |
| --- | --- | --- | --- |
| WebSockets | True bidirectional, low latency, efficient for high-frequency updates | Sticky sessions, connection draining on deploy, load balancer complexity, no auto-reconnect | Infra team — every deploy is a connection migration event |
| Server-Sent Events (SSE) | Server-push with auto-reconnect built in, works through HTTP proxies, simple ops | Unidirectional only, limited browser connection cap (~6 per domain on HTTP/1.1) | Nobody, if server-push is all you need |
| Long Polling | Works everywhere, no special infra, simple client logic | Connection churn, thundering herd on reconnect, wasted server threads holding idle connections | Backend team — thread/connection pool pressure |
| Short Polling | Operationally trivial, stateless, cacheable | Latency proportional to interval, wasted requests when nothing changes | CDN/API layer — cost scales with poll frequency, not user activity |

Staff Default Position

"Every persistent connection is operational state." Staff default: SSE for server-to-client push (simpler ops, built-in reconnect, works through every proxy and load balancer). WebSockets only when bidirectional communication is genuinely required — collaborative editing, gaming, interactive whiteboards. Short polling for low-frequency updates (<1/min) because the operational simplicity outweighs the latency cost.

Before reaching for WebSockets, multiply: connection_count x memory_per_connection x deploy_frequency. If the answer makes you uncomfortable, SSE or polling is the right call.
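That multiplication can be run as a quick sketch. The per-connection memory figure (~24KB, the WebSocket number used elsewhere in this piece) and the deploy cadence are illustrative assumptions; substitute your own measurements.

```python
def websocket_ops_cost(connections, memory_per_conn_kb=24, deploys_per_week=10):
    """Back-of-envelope WebSocket fleet cost.

    memory_per_conn_kb (~24KB with TLS + framing buffers) and
    deploys_per_week are illustrative assumptions, not measurements.
    """
    memory_gb = connections * memory_per_conn_kb / 1024 / 1024
    # Every deploy drains and migrates every live connection.
    migrations_per_week = connections * deploys_per_week
    return {
        "memory_gb": round(memory_gb, 1),
        "connection_migrations_per_week": migrations_per_week,
    }
```

At 500K connections and 10 deploys a week, the fleet holds ~11GB of pure connection state and performs 5M connection migrations weekly, which is exactly the kind of number the heuristic is probing for.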

Fan-out strategy matters more than transport. Fan-out-on-write (push to all subscribers at write time) gives low read latency but amplifies write cost. Fan-out-on-read (pull on demand) is cheaper to write but shifts cost to every reader. Most systems need a hybrid — fan-out-on-write for active users, fan-out-on-read for the rest.
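One way to sketch the hybrid: partition subscribers by recent activity at write time. The 15-minute activity window and the function names here are assumptions for illustration, not a prescribed API.

```python
import time

ACTIVE_WINDOW_SEC = 15 * 60  # "active" = seen within 15 minutes (assumed threshold)

def partition_followers(followers, now=None):
    """Split followers into fan-out-on-write vs fan-out-on-read groups.

    followers: dict of user_id -> last_seen unix timestamp.
    Active users get the message pushed at write time; everyone
    else pulls it on their next read.
    """
    now = time.time() if now is None else now
    push_now = [u for u, seen in followers.items() if now - seen <= ACTIVE_WINDOW_SEC]
    pull_later = [u for u, seen in followers.items() if now - seen > ACTIVE_WINDOW_SEC]
    return push_now, pull_later
```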


When to Deviate

  • Bidirectional data flow is real, not speculative. Collaborative editing, multiplayer state sync, and interactive drawing all require the client to push structured data upstream continuously. SSE cannot do this.
  • Sub-50ms latency is a hard product requirement. Financial tickers, live auctions, competitive gaming. Short polling and SSE add latency floors that matter here.
  • You already operate sticky-session infrastructure. If your load balancers and deploy pipeline handle connection draining, the operational tax of WebSockets is already paid.
  • Clients are not browsers. Mobile apps and backend services don't share the browser's 6-connection-per-domain limit, making SSE's main drawback irrelevant.

Common Interview Mistakes

| What Candidates Say | What Interviewers Hear | What Staff Engineers Say |
| --- | --- | --- |
| "We'll use WebSockets for everything" | "I haven't considered operational cost" | "SSE for push, WebSockets only where bidirectional is required" |
| "WebSockets are faster than polling" | "I'm comparing transports without considering fan-out" | "Fan-out strategy determines perceived latency more than transport" |
| "We'll add a reconnection layer" | "I don't know SSE has this built in" | "SSE gives us auto-reconnect and Last-Event-ID replay for free" |
| "Long polling as a fallback" | "I haven't sized the connection pool impact" | "Short polling at a reasonable interval is cheaper than holding idle long-poll connections" |
| "We need real-time for all updates" | "I haven't triaged what actually needs sub-second delivery" | "Scores need sub-second push. Profile updates can poll every 60s. Different paths for different SLAs." |


Implementation Deep Dive

1. WebSocket Connection Management — Redis Connection Registry

Every WebSocket server in the fleet maintains local connections, but the system needs to know which user is connected to which server. A Redis-backed connection registry solves this.

Connection Registry Pattern

# When a client connects to gateway server
function onConnect(userId, serverId, connectionId):
    # Register connection with TTL (heartbeat-refreshed)
    redis.HSET("conn:" + userId,
        "server", serverId,
        "connId", connectionId,
        "connectedAt", now())
    redis.EXPIRE("conn:" + userId, 90)       # 90s TTL, refreshed by heartbeat

    # Add to server's connection set (for drain enumeration)
    redis.SADD("server:" + serverId + ":connections", userId)

    # Publish presence event
    redis.PUBLISH("presence", serialize({ userId, status: "online" }))

# Heartbeat every 30 seconds
function onHeartbeat(userId):
    redis.EXPIRE("conn:" + userId, 90)        # Refresh TTL

# When a client disconnects
function onDisconnect(userId, serverId):
    redis.DEL("conn:" + userId)
    redis.SREM("server:" + serverId + ":connections", userId)
    redis.PUBLISH("presence", serialize({ userId, status: "offline" }))

Why Redis HASH over plain SET: The hash stores connection metadata (server ID, connection time) alongside registration. When routing a message to a user, you look up the server and send directly — no broadcast to the entire fleet.
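The hash semantics above can be mirrored with a plain in-memory dict for local testing. This is a stand-in sketch with invented class and method names, not the Redis-backed implementation; it keeps the same behavior of per-user metadata plus a heartbeat-refreshed TTL.

```python
import time

class ConnectionRegistry:
    """In-memory stand-in for the Redis hash registry (for local tests)."""

    def __init__(self, ttl_sec=90):
        self.ttl = ttl_sec
        self._conns = {}  # userId -> {"server", "connId", "expires"}

    def on_connect(self, user_id, server_id, conn_id):
        self._conns[user_id] = {
            "server": server_id,
            "connId": conn_id,
            "expires": time.monotonic() + self.ttl,
        }

    def on_heartbeat(self, user_id):
        # Equivalent of EXPIRE refresh on the Redis key.
        if user_id in self._conns:
            self._conns[user_id]["expires"] = time.monotonic() + self.ttl

    def lookup(self, user_id):
        conn = self._conns.get(user_id)
        if conn is None or conn["expires"] < time.monotonic():
            self._conns.pop(user_id, None)  # lazily expire, like Redis TTL
            return None
        return {"server": conn["server"], "connId": conn["connId"]}
```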

Message Routing with Connection Registry

function sendToUser(targetUserId, message):
    conn = redis.HGETALL("conn:" + targetUserId)

    if conn is empty:
        # User offline — queue for later delivery or drop
        messageQueue.enqueue(targetUserId, message)
        return

    targetServer = conn["server"]

    if targetServer == THIS_SERVER:
        # Local delivery — find local WebSocket and send
        localSocket = localConnections.get(targetUserId)
        if localSocket:
            localSocket.send(serialize(message))
    else:
        # Remote delivery — publish to server-specific channel
        redis.PUBLISH("server:" + targetServer + ":inbox", serialize({
            targetUserId, message
        }))

Connection Draining During Deploy

function gracefulDrain(serverId, drainDurationSec=60):
    # Step 1 — Stop accepting new connections
    loadBalancer.removeServer(serverId)

    # Step 2 — Notify all local clients to reconnect elsewhere
    clients = redis.SMEMBERS("server:" + serverId + ":connections")
    for userId in clients:
        socket = localConnections.get(userId)
        if socket:
            socket.send(serialize({ type: "reconnect",
                                    reason: "server_drain",
                                    delay: random(0, drainDurationSec * 1000) }))

    # Step 3 — Wait for clients to reconnect with jitter
    sleep(drainDurationSec)

    # Step 4 — Force-close any remaining connections
    for userId in localConnections.keys():
        localConnections.get(userId).close(1001, "server_shutdown")
        redis.DEL("conn:" + userId)

    redis.DEL("server:" + serverId + ":connections")

Why randomized delay: Without jitter, all clients reconnect to remaining servers simultaneously — a thundering herd that can overwhelm the fleet. Spreading reconnections over 60 seconds keeps the load manageable. For a 200-server fleet with 100K connections each, draining one server means 100K reconnections spread over 60 seconds = ~1,700/sec, which is routine.

2. Server-Sent Events (SSE) — The Simpler Default

SSE is HTTP-based, unidirectional (server-to-client), and has auto-reconnect built into the browser specification. For server-push use cases (notifications, feed updates, live scores), SSE is operationally simpler than WebSockets.

SSE Implementation

# Server (Node.js / Express style)
function handleSSE(request, response):
    response.setHeader("Content-Type", "text/event-stream")
    response.setHeader("Cache-Control", "no-cache")
    response.setHeader("Connection", "keep-alive")

    userId = authenticate(request)

    # Send initial connection event
    response.write("event: connected\ndata: {\"connected\":true}\n\n")

    # Subscribe to user's event channel
    subscription = eventBus.subscribe("user:" + userId, (event) =>
        # Last-Event-ID enables replay on reconnect
        response.write("id: " + event.id + "\n")
        response.write("event: " + event.type + "\n")
        response.write("data: " + serialize(event.payload) + "\n\n")
    )

    # Handle disconnect
    request.on("close", () =>
        subscription.unsubscribe()
        eventBus.publish("presence", { userId, status: "offline" })
    )

Auto-reconnect: When the connection drops, the browser automatically reconnects and sends the Last-Event-ID header. The server uses this to replay any missed events from a bounded buffer. No client-side reconnection logic needed.
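A minimal sketch of the bounded buffer that backs Last-Event-ID replay, assuming monotonically increasing integer event IDs (class and method names are invented here):

```python
from collections import deque

class ReplayBuffer:
    """Bounded per-channel event buffer for SSE Last-Event-ID replay.

    Sketch only: production systems typically bound by age as well as count.
    """

    def __init__(self, max_events=1000):
        self._events = deque(maxlen=max_events)  # (event_id, payload), ids monotonic
        self._next_id = 1

    def append(self, payload):
        event_id = self._next_id
        self._next_id += 1
        self._events.append((event_id, payload))
        return event_id

    def replay_after(self, last_event_id):
        """Events the client missed; empty if it is already caught up."""
        return [(i, p) for i, p in self._events if i > last_event_id]
```

On reconnect, the server reads the Last-Event-ID header and writes `replay_after(id)` to the stream before resuming live events. Events older than the buffer are gone, which is why the buffer is described as bounded: replay is best-effort, not a durable log.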

SSE vs WebSocket: Operational Comparison

| Concern | SSE | WebSocket |
| --- | --- | --- |
| Load balancer | Standard HTTP LB (ALB, nginx) | Requires WebSocket-aware LB or sticky sessions |
| Auto-reconnect | Browser-native | Must implement manually |
| Message replay | Last-Event-ID header | Custom cursor protocol |
| Bidirectional | No — use POST for client→server | Yes — native |
| Connection limit (HTTP/1.1) | 6 per domain per browser | No browser limit |
| Connection limit (HTTP/2) | 100+ multiplexed streams | N/A |
| Deploy drain | Close response → auto-reconnect | Custom drain protocol (see above) |
| Memory per connection | ~8KB (no upgrade overhead) | ~24KB (TLS + WebSocket framing + buffers) |

3. Fan-Out Strategies — Small vs Large Groups

The fan-out strategy determines system cost more than the transport choice. A message sent to a 5-person group chat is fundamentally different from a message sent to 50K live stream viewers.

Small Group Fan-Out (Direct Push)

# For groups with < 500 members
function fanOutSmallGroup(groupId, message, senderId):
    members = db.query("SELECT user_id FROM group_members WHERE group_id = ?", groupId)

    for memberId in members:
        if memberId == senderId:
            continue                          # Don't echo to sender

        conn = redis.HGETALL("conn:" + memberId)
        if conn:
            routeToServer(conn["server"], memberId, message)
        else:
            # Offline: persist for later delivery
            redis.LPUSH("offline:" + memberId, serialize(message))
            redis.LTRIM("offline:" + memberId, 0, 999)   # Bounded queue

Why direct push for small groups: Querying 500 members and routing individually is O(N) but N is small. No pub/sub overhead, no channel management, no subscription lifecycle. Simple and predictable.

Large Group Fan-Out (Pub/Sub with Subscription Channels)

# For groups with > 500 members (live streams, public channels)
function fanOutLargeGroup(channelId, message):
    # Publish once — all gateway servers with subscribed users receive it
    redis.PUBLISH("channel:" + channelId, serialize(message))

# Each gateway server subscribes on behalf of its local clients
function onClientJoinChannel(userId, channelId, serverId):
    # Track subscription locally
    localSubscriptions.add(userId, channelId)

    # Subscribe to Redis channel if this is the first local subscriber
    if localSubscriptions.countForChannel(channelId) == 1:
        redis.SUBSCRIBE("channel:" + channelId)

function onRedisMessage(channelId, message):
    # Fan out to all local subscribers for this channel
    for userId in localSubscriptions.getUsersForChannel(channelId):
        socket = localConnections.get(userId)
        if socket:
            socket.send(message)

Fan-Out Decision Matrix

| Group Size | Strategy | Publish Cost | Read Cost | Operational Complexity |
| --- | --- | --- | --- | --- |
| 1-1 (DM) | Direct push | O(1) lookup + route | O(1) | Minimal |
| 2-500 (group chat) | Direct push with member list | O(N) lookups | O(1) per member | Low |
| 500-50K (large channel) | Redis pub/sub | O(1) publish | O(N) local fan-out per server | Medium — manage subscriptions |
| 50K+ (broadcast) | Tiered pub/sub with edge servers | O(1) publish to backbone | O(N) tiered fan-out | High — edge server fleet |
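The matrix reduces to a small routing helper. Thresholds mirror the table above and should be tuned per workload; the function name and strategy labels are illustrative.

```python
def choose_fanout(recipients):
    """Pick a fan-out strategy from recipient count (thresholds from the matrix)."""
    if recipients == 1:
        return "direct-push"          # DM: O(1) registry lookup + route
    if recipients <= 500:
        return "direct-push-members"  # enumerate the member list, push each
    if recipients <= 50_000:
        return "pubsub"               # one publish, per-gateway local fan-out
    return "tiered-pubsub"            # backbone publish + edge-server tiers
```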

4. Presence Tracking — Redis SET with TTL and Heartbeat

"Online" status is one of the most expensive features in real-time systems. Naive implementations create O(N) fan-out on every status change.

Presence Implementation

# Presence data in Redis
function setOnline(userId):
    redis.SET("presence:" + userId, "online", EX=45)   # 45s TTL
    # Publish to contacts who are currently online
    contacts = getOnlineContacts(userId)
    for contactId in contacts:
        sendToUser(contactId, { type: "presence", userId, status: "online" })

function heartbeat(userId):
    redis.SET("presence:" + userId, "online", EX=45)    # Refresh TTL
    # No presence publish on heartbeat — only on state change

function setOffline(userId):
    redis.DEL("presence:" + userId)
    contacts = getOnlineContacts(userId)
    for contactId in contacts:
        sendToUser(contactId, { type: "presence", userId, status: "offline" })

function isOnline(userId):
    return redis.EXISTS("presence:" + userId)

function getOnlineContacts(userId):
    contactIds = db.query("SELECT contact_id FROM contacts WHERE user_id = ?", userId)
    pipeline = redis.pipeline()
    for cid in contactIds:
        pipeline.EXISTS("presence:" + cid)
    results = pipeline.execute()
    return [contactIds[i] for i, r in enumerate(results) if r == 1]

Presence Accuracy vs Cost

| Approach | Accuracy | Fan-Out Cost | Memory | Use Case |
| --- | --- | --- | --- | --- |
| Heartbeat + TTL (45s) | "Online within last 45s" | Per state change | O(online users) | Chat apps — good enough for most |
| Last-seen timestamp | "Last seen 3 min ago" | Zero fan-out | O(all users) | WhatsApp-style — cheapest |
| Real-time presence | Instant online/offline | O(contacts) per change | O(online users) | Slack-style — expensive |
| Lazy presence | On-demand check | Zero proactive fan-out | O(online users) | LinkedIn-style — minimal cost |

Architecture Diagram


Data flow: Clients connect to gateway servers via WebSocket or SSE. Gateways register connections in Redis. When the API tier needs to push a message, it looks up the target user's gateway via the connection registry and publishes to that server's channel. The gateway delivers locally. For large groups, a single Redis PUBLISH fans out to all gateways with subscribers.


Failure Scenarios

1. Gateway Server Crash — 25K Connections Lost

Timeline: Gateway server 3 crashes at 14:00:00. 25K WebSocket connections are severed instantly. All 25K clients begin reconnection simultaneously. Remaining 49 gateways receive a thundering herd of 25K new connections within 2 seconds.

Blast radius: The 25K affected users lose ~5-30 seconds of messages (depending on reconnection speed). Remaining gateway servers experience a CPU spike from connection setup overhead (TLS handshake, authentication, subscription setup). If servers are near capacity, the spike can cascade.

Detection: Gateway health check fails. Redis connection registry for the crashed server shows stale entries (TTL has not yet expired). Connected-user count drops by 25K.

Recovery:

  1. Client-side exponential backoff with jitter spreads reconnections over 30 seconds instead of 2 seconds
  2. The connection-registry TTL (90s) automatically expires stale Redis entries for the dead server — no manual intervention
  3. Missed messages are replayed via Last-Event-ID (SSE) or cursor-based recovery (WebSocket) from the message buffer
  4. N+2 fleet sizing ensures remaining servers have capacity to absorb the reconnections
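Step 1 above is commonly implemented as full-jitter exponential backoff; a minimal sketch, with base and cap values that are assumptions to tune per fleet:

```python
import random

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Full-jitter exponential backoff: a uniform random delay in
    [0, min(cap, base * 2^attempt)]. base/cap values are illustrative."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Because every client draws its own random delay, 25K reconnections spread across the whole window instead of landing in the same two seconds; the cap keeps long-suffering clients from waiting minutes between retries.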

2. Redis Pub/Sub Backpressure — Message Delivery Stall

Timeline: A popular live stream channel has 50K viewers. The streamer generates 20 messages/sec (chat + reactions). Each message fans out to 50 gateway servers. Redis pub/sub throughput is sufficient, but gateway server 7 has a slow consumer — its local fan-out to 5K connections takes 200ms per message. Redis pub/sub has no backpressure — messages are dropped for slow subscribers.

Blast radius: Users connected to gateway 7 miss messages silently. No error, no retry — Redis pub/sub is fire-and-forget. The users see gaps in the chat stream.

Detection: Per-gateway message delivery rate monitoring. If gateway 7 delivers 15 messages/sec while others deliver 20, it is dropping messages. Client-side sequence number gap detection.

Recovery:

  1. Switch from Redis pub/sub to Redis Streams (XREADGROUP) for durable delivery with consumer acknowledgment
  2. Buffer messages in a per-gateway queue (Redis list) with a consumer that processes at its own pace
  3. If a gateway falls too far behind (>5 seconds of lag), disconnect its clients and let them reconnect to healthier servers
  4. Rate-limit the source — cap chat messages at 10/sec per channel to bound fan-out load
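The client-side sequence-gap detection mentioned under Detection can be sketched as follows, assuming per-channel monotonically increasing sequence numbers (names invented here):

```python
class GapDetector:
    """Detect dropped pub/sub messages via per-channel sequence numbers."""

    def __init__(self):
        self._last_seq = {}  # channel -> highest sequence number seen

    def observe(self, channel, seq):
        """Returns the list of missed sequence numbers, empty if none."""
        last = self._last_seq.get(channel)
        self._last_seq[channel] = max(seq, last or 0)
        if last is None or seq <= last:
            return []                      # first message, duplicate, or reorder
        return list(range(last + 1, seq))  # anything skipped in between
```

A non-empty result is the client's cue to fetch the missed range over a request/response path, since the pub/sub channel itself will never redeliver.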

3. Presence Storm During Morning Login Wave

Timeline: 9:00 AM — 500K users log in within a 15-minute window. Each login triggers a presence fan-out to online contacts (average 50 online contacts per user). Total presence messages: 500K users x 50 contacts = 25M presence notifications in 15 minutes = ~28K messages/sec.

Blast radius: Presence messages compete with actual content messages (chat, notifications) for gateway bandwidth. Users experience delayed message delivery because the pipeline is saturated with presence updates.

Detection: Message delivery latency increases. Gateway queue depth grows. Presence message volume spikes relative to content message volume.

Recovery:

  1. Batch presence updates — instead of sending individual "user X is online" events, send a batch update every 5 seconds: "these 200 contacts came online"
  2. Deprioritize presence — route presence through a separate, lower-priority channel so content messages are never delayed
  3. Lazy presence — stop proactively pushing presence; let the client fetch contact status when the user opens their contact list
  4. Throttle login-wave presence — during detected login spikes, temporarily disable proactive presence fan-out and switch to lazy mode
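Recovery step 1 (batching) amounts to coalescing flips per user within each flush window; a minimal sketch, where the class name and the 5-second flush timer are assumptions:

```python
class PresenceBatcher:
    """Coalesce per-user presence flips into one batch per flush interval.

    Only the latest status per user survives within a window, so a user
    who flaps online/offline generates one event, not many.
    """

    def __init__(self):
        self._pending = {}  # userId -> latest status

    def record(self, user_id, status):
        self._pending[user_id] = status  # later flips overwrite earlier ones

    def flush(self):
        """Called by a timer (e.g. every 5s); returns one batch instead of N events."""
        batch, self._pending = self._pending, {}
        return batch
```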

Staff Interview Application

How to Introduce This Pattern

Lead with the transport decision framework, not a technology choice. Then immediately address fan-out: "The transport is the easy part. The harder question is fan-out strategy — how do we route a message to the right user on the right server without broadcasting to the entire fleet."

When NOT to Use This Pattern

  • Update frequency < 1/minute: Short polling is operationally simpler and the latency difference is invisible to users. A 30-second poll interval means average staleness of 15 seconds — acceptable for dashboards, leaderboards, and profile updates.
  • No live audience: If users are not actively watching (email summaries, batch reports), push infrastructure is wasted. Queue the updates and deliver on next session.
  • Single-server system: If you have one application server with <1K concurrent users, in-process event emitters suffice. No Redis, no pub/sub, no connection registry. Don't build fleet infrastructure for a single-node system.
  • Data is cacheable and shared: If all users see the same data (stock ticker, weather, sports scores), a CDN with short TTL and client-side polling is cheaper than maintaining 100K persistent connections.

Follow-Up Questions to Anticipate

| Interviewer Asks | What They Are Testing | How to Respond |
| --- | --- | --- |
| "Why not WebSockets for everything?" | Operational cost awareness | "Every WebSocket connection is operational state — memory, file descriptors, drain on deploy. SSE gives us server-push with auto-reconnect and standard HTTP infrastructure. I use WebSockets only when bidirectional flow is genuinely required." |
| "How do you handle message ordering?" | Distributed systems fundamentals | "Per-user ordering via the connection registry — messages to a user route to one gateway. Cross-user ordering (group chat) uses a sequence number assigned at the application tier before fan-out." |
| "What about mobile clients on flaky networks?" | Reliability engineering | "Cursor-based recovery: each message has a monotonic ID. On reconnect, the client sends its last-seen ID and the server replays from there. This handles both network flaps and server-side connection migrations." |
| "How do you deploy without dropping messages?" | Operational maturity | "Graceful drain: stop accepting new connections, send reconnect-with-jitter to existing clients, wait for drain window, then shut down. Clients reconnect to other servers and replay missed messages via cursor." |
| "How do you scale to 10M connections?" | Architecture scaling | "Horizontal gateway fleet. At 50K connections/server, that is 200 servers. The connection registry in Redis handles routing. The bottleneck shifts to fan-out — large channels need tiered pub/sub with regional edge servers." |