Technologies referenced in this playbook: nginx, HAProxy, Redis, AWS ALB/NLB & GCP Cloud Load Balancing
How to Use This Playbook
If you have 2 hours before your interview, read the Interview Walkthrough and §3 (Fault Lines). Everything else is depth you can pull on where you're weak. Appendices are collapsed — expand them for targeted review.
What is a Load Balancer? — Why interviewers pick this topic
The Problem
A load balancer distributes incoming network traffic across multiple backend servers to prevent any single server from becoming overwhelmed. Without it, one server handles all traffic until it crashes, while others sit idle. It's the traffic cop that keeps your system responsive and resilient.
Common Use Cases
- Horizontal Scaling: Spread requests across a fleet of servers to handle more traffic than one machine could
- High Availability: Route around failed servers automatically so users don't notice outages
- Zero-Downtime Deployments: Drain traffic from servers during updates, then add them back
- Geographic Distribution: Route users to the nearest datacenter for lower latency
- SSL Termination: Offload encryption/decryption from backend servers
Why Interviewers Ask About This
Load balancing seems simple—just round-robin, right? But Staff-level interviews probe the hidden complexity: When do you need L7 vs L4? What happens when the load balancer itself fails? How do you handle sticky sessions without killing scalability? This topic reveals whether you understand the operational realities of running distributed systems at scale, not just the happy-path architecture diagrams.
Executive Summary
What This Interview Actually Tests
Load balancing is not a "just add nginx" question. Everyone knows round-robin.
This is a distributed systems ownership question that tests:
- Whether you understand the L4 vs L7 tradeoff and when each matters
- Whether you reason about health checking failure modes proactively
- Whether you recognize session affinity as a scalability anti-pattern
- Whether you can design for load balancer failure itself
The key insight: Load balancing is a single point of failure disguised as a reliability feature. Staff engineers reason about what happens when the "reliable" component fails.
The L5 vs L6 Contrast (Memorize This)
| Behavior | L5 (Senior) | L6 (Staff) |
|---|---|---|
| First move | "We'll add an nginx load balancer" | Asks "What's our latency budget? Do we need application-layer routing?" |
| Algorithm | "Round-robin is fine" | Identifies when round-robin fails: long-lived connections, heterogeneous backends, stateful requests |
| Health checks | "We'll ping every 5 seconds" | Asks "What's the blast radius of a false positive? What's our detection-to-removal latency budget?" |
| Session affinity | "We'll use sticky sessions" | Warns that sticky sessions break horizontal scaling and asks "Can we make the backend stateless instead?" |
| Failure | Assumes LB is reliable | Designs for LB failure: redundant LBs, DNS failover, client-side fallback |
| Ownership | "DevOps handles load balancing" | Defines SLOs for routing latency, health check accuracy, and failover time |
Default Staff Positions
These are your opening stances. Adjust based on requirements.
| Dimension | Default Position | Rationale |
|---|---|---|
| L4 vs L7 | Start with L4, upgrade to L7 only if needed | L4 is faster and simpler; L7 adds latency but enables content-based routing |
| Algorithm | Least connections for most workloads | Round-robin fails with variable request durations; least-connections adapts |
| Health checks | Active + passive, tuned for workload | Active catches silent failures; passive reduces detection latency |
| Session affinity | Avoid if possible; use external session store | Sticky sessions are a scalability trap; externalize state to Redis/DB |
| Redundancy | Active-passive LB pair minimum | Single LB is a SPOF; active-active adds complexity but improves capacity |
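The least-connections default above can be sketched in a few lines. A toy in-memory selector — names and structure are illustrative, not any particular load balancer's API:

```python
import threading

class BackendPool:
    """Toy least-connections selector: route each new request to the
    backend with the fewest in-flight connections."""

    def __init__(self, backends):
        self._lock = threading.Lock()
        self._active = {b: 0 for b in backends}   # backend -> in-flight count

    def acquire(self):
        # Pick the backend with the fewest open connections, then count it.
        with self._lock:
            backend = min(self._active, key=self._active.get)
            self._active[backend] += 1
            return backend

    def release(self, backend):
        # Called when the request completes.
        with self._lock:
            self._active[backend] -= 1

pool = BackendPool(["app-1", "app-2", "app-3"])
first = pool.acquire()    # all idle: any backend
second = pool.acquire()   # skips the now-busy one
pool.release(first)
```

This is why least-connections adapts where round-robin doesn't: a backend stuck on a slow request keeps a high in-flight count and naturally stops attracting new traffic.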
System Architecture Overview
Interview Walkthrough
The six phases below are compressed for a deep-dive format. Phases 1-3 fit in 2-3 minutes. If the interviewer keeps probing, you expand into Phase 5's detailed fault lines. Most interviews don't go beyond the first probe — know when to stop talking.
Phase 1: Requirements & Framing (30 seconds)
Name the intent before drawing a single box:
- "Load balancing serves three purposes: distribute traffic for horizontal scaling, detect and route around unhealthy instances, and enable zero-downtime deployments."
Then immediately frame the four decisions:
- "Four things matter: L4 vs L7, algorithm selection, health checking strategy, and LB redundancy. Let me walk through each."
Phase 2: Core Entities & API (30 seconds)
State the components (not entities — this is infrastructure):
- VIP (Virtual IP): the stable endpoint clients connect to; maps to the load balancer
- Backend pool: set of healthy server instances, each with weight and health status
- Health check: active probe (HTTP GET /health) + passive monitoring (error rate tracking)
- Connection drain: graceful removal of a backend — finish in-flight requests before cutting traffic
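The four components above map to a small data model. A hedged sketch — field names and the `Health` states are my own shorthand, not a specific LB's schema:

```python
from dataclasses import dataclass, field
from enum import Enum

class Health(Enum):
    HEALTHY = "healthy"
    DRAINING = "draining"    # finishing in-flight requests; gets no new traffic
    UNHEALTHY = "unhealthy"

@dataclass
class Backend:
    address: str
    weight: int = 100        # relative traffic share; reduced while ramping or draining
    health: Health = Health.HEALTHY
    in_flight: int = 0       # passive signal: currently open connections

@dataclass
class VirtualIP:
    vip: str                 # the stable endpoint clients connect to
    pool: list = field(default_factory=list)

    def routable(self):
        # Only healthy backends with nonzero weight receive NEW requests;
        # a draining backend still finishes its in-flight requests.
        return [b for b in self.pool if b.health is Health.HEALTHY and b.weight > 0]
```

Note that marking a backend `DRAINING` removes it from `routable()` without touching its open connections — which is exactly what connection drain means.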
Phase 3: The 2-Minute Architecture (2 minutes)
Deliver this in ~90 seconds. Hit the four key decisions and move on:
- "Clients hit a stable VIP fronting a redundant LB pair — a single load balancer is a SPOF, so active-passive is the minimum."
- "I'd start at L4 for speed and simplicity, moving to L7 only if we need content-based routing, since L7 costs latency on every request."
- "Least-connections as the default algorithm — round-robin breaks down with variable request durations."
- "Active health checks to catch silent failures plus passive monitoring, with blast-radius protection so a bad check can't drain the fleet. Session state goes in Redis so any backend can serve any request."
Then stop. Let the interviewer steer.
Phase 4: Transition to Depth (15 seconds)
If the interviewer wants more, offer choices:
"I can go deeper on any of these four. The most interesting tradeoffs are: health check tuning and cascading failure, sticky sessions as a scalability trap, or what happens to in-flight requests when the LB itself fails."
Phase 5: Deep Dives (5-15 minutes if probed)
Probe 1: "What happens when a health check is wrong?" (3-5 min)
Walk through the fix:
- Blast radius protection: Never remove more than 20% of the fleet at once. If >20% are failing health checks, assume the problem is the health check, not the servers. "Panic mode: if half your fleet fails health checks simultaneously, the health check is lying."
- Gradual drain, not instant removal: When a server fails health checks, reduce its weight over 30 seconds before removing. This gives time for false positives to self-correct.
- Health check circuit breaker: If the health check endpoint itself is slow (because the server is under load), don't fail the check — that creates a death spiral. Separate the "is the server alive?" check from the "is the server healthy?" check.
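The blast-radius rule is small enough to show directly. A minimal sketch assuming a simple list-based pool (the function name and 20% threshold mirror the rule above; everything else is illustrative):

```python
def apply_health_results(pool, failing, max_removal_fraction=0.20):
    """Blast-radius-protected removal: never drop more than 20% of the
    fleet at once. If more than that fail simultaneously, assume the
    health check itself is broken ("panic mode") and remove nothing."""
    if len(failing) / len(pool) > max_removal_fraction:
        return set()          # panic mode: the check is lying, keep serving
    return set(failing)      # within budget: safe to remove

pool = ["app-1", "app-2", "app-3", "app-4", "app-5"]
apply_health_results(pool, ["app-3"])              # 1/5 = 20%: removed
apply_health_results(pool, ["app-1", "app-2"])     # 40% failing: panic, keep all
```

The real version would feed into the gradual-drain weight reduction rather than a hard removal, but the decision boundary is the part interviewers probe.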
Probe 2: "What about sticky sessions?" (3-5 min)
Walk through the alternatives:
- Best answer: Externalize session state to Redis. Any server handles any request. Sticky sessions become unnecessary.
- If sticky sessions are unavoidable (legacy app, WebSocket connections): Use consistent hashing so adding/removing servers only remaps ~1/N sessions. "With 10 servers, adding one remaps ~10% of sessions. Round-robin remaps 100%."
- WebSocket sticky sessions: These are legitimate — the connection IS the session. Use a connection registry (Redis hash: connection_id → server_id) so other services know where to route messages for a specific connection.
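To make the remapping claim concrete, here is a minimal consistent-hash ring. The virtual-node count and helper names are illustrative, not a production implementation:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash; md5 is fine here since this is placement, not security.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Each server owns many points ("virtual nodes") on a hash ring;
    a session maps to the first server point clockwise from its hash,
    so adding a server steals only ~1/N of the sessions."""

    def __init__(self, servers, vnodes=100):
        self.ring = sorted((_hash(f"{s}#{i}"), s)
                           for s in servers for i in range(vnodes))
        self._keys = [h for h, _ in self.ring]

    def server_for(self, session_id: str) -> str:
        idx = bisect.bisect(self._keys, _hash(session_id)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["srv-0", "srv-1", "srv-2"])
owner = ring.server_for("sess-abc")   # same session always maps to the same server
```

Adding an 11th server to a 10-server ring moves only the sessions whose hash falls on the new server's arcs — roughly 1/11 of them — while a modulo or round-robin scheme would reshuffle nearly everything.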
Probe 3: "What if the load balancer itself fails?" (3-5 min)
- Active-passive pair: Two LB instances, one active. VRRP failover in 1-5 seconds. During failover, active TCP connections are reset — clients must reconnect.
- Active-active with ECMP: Multiple LB instances sharing the same VIP via BGP/ECMP. The network distributes packets across all instances. No failover — just capacity reduction when one fails.
- Cloud-managed LB: AWS ALB/NLB, GCP Cloud LB. Already multi-AZ redundant. "The LB is someone else's problem — but you still need to reason about cross-region failure."
The deeper question: "What happens to in-flight requests? L4 LBs reset TCP connections on failover — the client sees a connection timeout and must retry. L7 LBs can retry transparently for idempotent requests. Clients need retry logic regardless."
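The client-side half of this — retry connection-level failures, but only for idempotent methods — can be sketched like so (function names are hypothetical; real clients would also cap total retry time):

```python
import time

# Safe to repeat per HTTP semantics; POST needs an idempotency key.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE"}

def call_with_retry(send, method, attempts=3, base_delay=0.2):
    """Retry connection-level failures such as the TCP resets an L4
    failover produces. Non-idempotent methods are not retried blindly --
    the caller needs an idempotency key or explicit error handling."""
    for attempt in range(attempts):
        try:
            return send()
        except ConnectionError:
            last_try = attempt == attempts - 1
            if method not in IDEMPOTENT_METHODS or last_try:
                raise
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
```

The distinction matters because during a 1-5 second VRRP failover every in-flight connection resets at once: blind POST retries can double-charge, while GET retries are free.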
Probe 4: "How do you handle zero-downtime deployments?" (3-5 min)
"Connection draining is the key. When removing a server for deployment: (1) stop sending NEW requests, (2) let existing requests complete (drain timeout = p99 request duration × 2), (3) once all connections are closed or timeout expires, shut down the server."
"The reverse — bringing a new server online — uses slow-start. Don't immediately give it full traffic weight. Ramp from 10% to 100% over 30 seconds. This lets the new server warm up (JIT compilation, cache population) before taking full load."
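Both rules of thumb — the drain timeout and the slow-start ramp — are simple enough to express directly. A sketch of the arithmetic, not any specific load balancer's config:

```python
def drain_timeout(p99_request_seconds: float) -> float:
    # Rule of thumb: p99 request duration x 2.
    # Too short drops in-flight requests; too long stalls deployments.
    return p99_request_seconds * 2

def slow_start_weight(seconds_since_join: float,
                      ramp_seconds: float = 30.0,
                      floor: float = 0.10) -> float:
    """Linear ramp from 10% to 100% traffic weight over the ramp window,
    so a cold server (empty caches, un-JITted code) isn't hit at full load."""
    if seconds_since_join >= ramp_seconds:
        return 1.0
    return floor + (1.0 - floor) * (seconds_since_join / ramp_seconds)

drain_timeout(1.5)        # 1.5s p99 -> 3.0s drain window
slow_start_weight(15)     # halfway through the ramp -> ~55% weight
```

A linear ramp is the simplest choice; some LBs use exponential or step ramps, but the interview point is the same: new capacity is not full capacity on second one.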
Phase 6: Wrap-Up
If you've gone deep, close with the organizational insight:
"Load balancing is the most over-engineered and under-monitored part of most architectures. Teams spend weeks choosing between nginx and HAProxy, then never set up alerting for connection drain failures or health check false positives. The Staff question isn't 'which load balancer' — it's 'what are the failure modes, who gets paged, and what does the runbook say.'"
Quick-Reference: The 30-Second Cheat Sheet
| Topic | The L5 Answer | The L6 Answer (say this) |
|---|---|---|
| Layer choice | "L7 because we might need it" | "L4 unless we need content-based routing — L7 adds 5-10ms per request" |
| Algorithm | "Round-robin" | "Least-connections — round-robin fails with variable request durations" |
| Health checks | "Ping every 5 seconds" | "Active + passive combined, blast radius protected — never remove >20% of fleet" |
| Session state | "Sticky sessions" | "Externalize to Redis — sticky sessions are a scalability trap" |
| LB failure | "Use a managed LB" | "Active-passive minimum — and plan for in-flight TCP resets during failover" |
| Deployment | "Zero-downtime deploy" | "Connection draining with timeout + slow-start on recovery" |
Key Numbers Worth Memorizing
| Metric | Value | Why It Matters |
|---|---|---|
| L4 vs L7 latency | 5-10ms per request difference | L7 tax on every request |
| Health check cascade | Remove 1 of 5 = 25% load spike | Why blast radius protection matters |
| Active-passive failover | 1-5 seconds (VRRP) | Sets expectation for connection resets |
| Connection drain timeout | p99 request duration × 2 | Too short = dropped requests |
| Slow-start ramp | 10% → 100% over 30 seconds | Prevents cold-cache overload |