Technologies referenced in this playbook: nginx, HAProxy, Redis, AWS ALB/NLB & GCP Cloud Load Balancing
How to Use This Playbook
If you have 2 hours before your interview, read the Interview Walkthrough and §3 (Fault Lines). Everything else is depth you can pull on where you're weak. Appendices are collapsed — expand them for targeted review.
What is a Load Balancer? — Why interviewers pick this topic
The Problem
A load balancer distributes incoming network traffic across multiple backend servers to prevent any single server from becoming overwhelmed. Without it, one server handles all traffic until it crashes, while others sit idle. It's the traffic cop that keeps your system responsive and resilient.
Common Use Cases
- Horizontal Scaling: Spread requests across a fleet of servers to handle more traffic than one machine could
- High Availability: Route around failed servers automatically so users don't notice outages
- Zero-Downtime Deployments: Drain traffic from servers during updates, then add them back
- Geographic Distribution: Route users to the nearest datacenter for lower latency
- SSL Termination: Offload encryption/decryption from backend servers
Why Interviewers Ask About This
Load balancing seems simple—just round-robin, right? But Staff-level interviews probe the hidden complexity: When do you need L7 vs L4? What happens when the load balancer itself fails? How do you handle sticky sessions without killing scalability? This topic reveals whether you understand the operational realities of running distributed systems at scale, not just the happy-path architecture diagrams.
Executive Summary
What This Interview Actually Tests
Load balancing is not a "just add nginx" question. Everyone knows round-robin.
This is a distributed systems ownership question that tests:
- Whether you understand the L4 vs L7 tradeoff and when each matters
- Whether you reason about health checking failure modes proactively
- Whether you recognize session affinity as a scalability anti-pattern
- Whether you can design for load balancer failure itself
The key insight: Load balancing is a single point of failure disguised as a reliability feature. Staff engineers reason about what happens when the "reliable" component fails.
The L5 vs L6 Contrast (Memorize This)
| Behavior | L5 (Senior) | L6 (Staff) |
|---|---|---|
| First move | "We'll add an nginx load balancer" | Asks "What's our latency budget? Do we need application-layer routing?" |
| Algorithm | "Round-robin is fine" | Identifies when round-robin fails: long-lived connections, heterogeneous backends, stateful requests |
| Health checks | "We'll ping every 5 seconds" | Asks "What's the blast radius of a false positive? What's our detection-to-removal latency budget?" |
| Session affinity | "We'll use sticky sessions" | Warns that sticky sessions break horizontal scaling and asks "Can we make the backend stateless instead?" |
| Failure | Assumes LB is reliable | Designs for LB failure: redundant LBs, DNS failover, client-side fallback |
| Ownership | "DevOps handles load balancing" | Defines SLOs for routing latency, health check accuracy, and failover time |
Default Staff Positions
These are your opening stances. Adjust based on requirements.
| Dimension | Default Position | Rationale |
|---|---|---|
| L4 vs L7 | Start with L4, upgrade to L7 only if needed | L4 is faster and simpler; L7 adds latency but enables content-based routing |
| Algorithm | Least connections for most workloads | Round-robin fails with variable request durations; least-connections adapts |
| Health checks | Active + passive, tuned for workload | Active catches silent failures; passive reduces detection latency |
| Session affinity | Avoid if possible; use external session store | Sticky sessions are a scalability trap; externalize state to Redis/DB |
| Redundancy | Active-passive LB pair minimum | Single LB is a SPOF; active-active adds complexity but improves capacity |
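The least-connections default above can be sketched in a few lines. A toy in-memory selector — names and structure are illustrative, not any particular load balancer's API:

```python
import threading

class BackendPool:
    """Toy least-connections selector: route each new request to the
    backend with the fewest in-flight connections."""

    def __init__(self, backends):
        self._lock = threading.Lock()
        self._active = {b: 0 for b in backends}   # backend -> in-flight count

    def acquire(self):
        # Pick the backend with the fewest open connections, then count it.
        with self._lock:
            backend = min(self._active, key=self._active.get)
            self._active[backend] += 1
            return backend

    def release(self, backend):
        # Called when the request completes.
        with self._lock:
            self._active[backend] -= 1

pool = BackendPool(["app-1", "app-2", "app-3"])
first = pool.acquire()    # all idle: any backend
second = pool.acquire()   # skips the now-busy one
pool.release(first)
```

This is why least-connections adapts where round-robin doesn't: a backend stuck on a slow request keeps a high in-flight count and naturally stops attracting new traffic.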
System Architecture Overview
Interview Walkthrough
The six phases below are compressed for a deep-dive format. Phases 1-3 fit in 2-3 minutes. If the interviewer keeps probing, you expand into Phase 5's detailed fault lines. Most interviews don't go beyond the first probe — know when to stop talking.
Phase 1: Requirements & Framing (30 seconds)
Name the intent before drawing a single box:
- "Load balancing serves three purposes: distribute traffic for horizontal scaling, detect and route around unhealthy instances, and enable zero-downtime deployments."
Then immediately frame the four decisions:
- "Four things matter: L4 vs L7, algorithm selection, health checking strategy, and LB redundancy. Let me walk through each."
Phase 2: Core Entities & API (30 seconds)
State the components (not entities — this is infrastructure):
- VIP (Virtual IP): the stable endpoint clients connect to; maps to the load balancer
- Backend pool: set of healthy server instances, each with weight and health status
- Health check: active probe (HTTP GET /health) + passive monitoring (error rate tracking)
- Connection drain: graceful removal of a backend — finish in-flight requests before cutting traffic
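The four components above map to a small data model. A hedged sketch — field names and the `Health` states are my own shorthand, not a specific LB's schema:

```python
from dataclasses import dataclass, field
from enum import Enum

class Health(Enum):
    HEALTHY = "healthy"
    DRAINING = "draining"    # finishing in-flight requests; gets no new traffic
    UNHEALTHY = "unhealthy"

@dataclass
class Backend:
    address: str
    weight: int = 100        # relative traffic share; reduced while ramping or draining
    health: Health = Health.HEALTHY
    in_flight: int = 0       # passive signal: currently open connections

@dataclass
class VirtualIP:
    vip: str                 # the stable endpoint clients connect to
    pool: list = field(default_factory=list)

    def routable(self):
        # Only healthy backends with nonzero weight receive NEW requests;
        # a draining backend still finishes its in-flight requests.
        return [b for b in self.pool if b.health is Health.HEALTHY and b.weight > 0]
```

Note that marking a backend `DRAINING` removes it from `routable()` without touching its open connections — which is exactly what connection drain means.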
Phase 3: The 2-Minute Architecture (2 minutes)
Deliver this in ~90 seconds. Hit the four key decisions and move on:
- "Clients hit a stable VIP fronting a redundant LB pair — a single load balancer is a SPOF, so active-passive is the minimum."
- "I'd start at L4 for speed and simplicity, moving to L7 only if we need content-based routing, since L7 costs latency on every request."
- "Least-connections as the default algorithm — round-robin breaks down with variable request durations."
- "Active health checks to catch silent failures plus passive monitoring, with blast-radius protection so a bad check can't drain the fleet. Session state goes in Redis so any backend can serve any request."
Then stop. Let the interviewer steer.
Phase 4: Transition to Depth (15 seconds)
If the interviewer wants more, offer choices:
"I can go deeper on any of these four. The most interesting tradeoffs are: health check tuning and cascading failure, sticky sessions as a scalability trap, or what happens to in-flight requests when the LB itself fails."
Phase 5: Deep Dives (5-15 minutes if probed)
Probe 1: "What happens when a health check is wrong?" (3-5 min)
Walk through the fix:
- Blast radius protection: Never remove more than 20% of the fleet at once. If >20% are failing health checks, assume the problem is the health check, not the servers. "Panic mode: if half your fleet fails health checks simultaneously, the health check is lying."
- Gradual drain, not instant removal: When a server fails health checks, reduce its weight over 30 seconds before removing. This gives time for false positives to self-correct.
- Health check circuit breaker: If the health check endpoint itself is slow (because the server is under load), don't fail the check — that creates a death spiral. Separate the "is the server alive?" check from the "is the server healthy?" check.
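The blast-radius rule is small enough to show directly. A minimal sketch assuming a simple list-based pool (the function name and 20% threshold mirror the rule above; everything else is illustrative):

```python
def apply_health_results(pool, failing, max_removal_fraction=0.20):
    """Blast-radius-protected removal: never drop more than 20% of the
    fleet at once. If more than that fail simultaneously, assume the
    health check itself is broken ("panic mode") and remove nothing."""
    if len(failing) / len(pool) > max_removal_fraction:
        return set()          # panic mode: the check is lying, keep serving
    return set(failing)      # within budget: safe to remove

pool = ["app-1", "app-2", "app-3", "app-4", "app-5"]
apply_health_results(pool, ["app-3"])              # 1/5 = 20%: removed
apply_health_results(pool, ["app-1", "app-2"])     # 40% failing: panic, keep all
```

The real version would feed into the gradual-drain weight reduction rather than a hard removal, but the decision boundary is the part interviewers probe.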
Probe 2: "What about sticky sessions?" (3-5 min)
Walk through the alternatives:
- Best answer: Externalize session state to Redis. Any server handles any request. Sticky sessions become unnecessary.
- If sticky sessions are unavoidable (legacy app, WebSocket connections): Use consistent hashing so adding/removing servers only remaps ~1/N sessions. "With 10 servers, adding one remaps ~10% of sessions. Round-robin remaps 100%."
- WebSocket sticky sessions: These are legitimate — the connection IS the session. Use a connection registry (Redis hash: connection_id → server_id) so other services know where to route messages for a specific connection.
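To make the remapping claim concrete, here is a minimal consistent-hash ring. The virtual-node count and helper names are illustrative, not a production implementation:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash; md5 is fine here since this is placement, not security.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Each server owns many points ("virtual nodes") on a hash ring;
    a session maps to the first server point clockwise from its hash,
    so adding a server steals only ~1/N of the sessions."""

    def __init__(self, servers, vnodes=100):
        self.ring = sorted((_hash(f"{s}#{i}"), s)
                           for s in servers for i in range(vnodes))
        self._keys = [h for h, _ in self.ring]

    def server_for(self, session_id: str) -> str:
        idx = bisect.bisect(self._keys, _hash(session_id)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["srv-0", "srv-1", "srv-2"])
owner = ring.server_for("sess-abc")   # same session always maps to the same server
```

Adding an 11th server to a 10-server ring moves only the sessions whose hash falls on the new server's arcs — roughly 1/11 of them — while a modulo or round-robin scheme would reshuffle nearly everything.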
Probe 3: "What if the load balancer itself fails?" (3-5 min)
- Active-passive pair: Two LB instances, one active. VRRP failover in 1-5 seconds. During failover, active TCP connections are reset — clients must reconnect.
- Active-active with ECMP: Multiple LB instances sharing the same VIP via BGP/ECMP. The network distributes packets across all instances. No failover — just capacity reduction when one fails.
- Cloud-managed LB: AWS ALB/NLB, GCP Cloud LB. Already multi-AZ redundant. "The LB is someone else's problem — but you still need to reason about cross-region failure."
The deeper question: "What happens to in-flight requests? L4 LBs reset TCP connections on failover — the client sees a connection timeout and must retry. L7 LBs can retry transparently for idempotent requests. Clients need retry logic regardless."
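The client-side half of this — retry connection-level failures, but only for idempotent methods — can be sketched like so (function names are hypothetical; real clients would also cap total retry time):

```python
import time

# Safe to repeat per HTTP semantics; POST needs an idempotency key.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE"}

def call_with_retry(send, method, attempts=3, base_delay=0.2):
    """Retry connection-level failures such as the TCP resets an L4
    failover produces. Non-idempotent methods are not retried blindly --
    the caller needs an idempotency key or explicit error handling."""
    for attempt in range(attempts):
        try:
            return send()
        except ConnectionError:
            last_try = attempt == attempts - 1
            if method not in IDEMPOTENT_METHODS or last_try:
                raise
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
```

The distinction matters because during a 1-5 second VRRP failover every in-flight connection resets at once: blind POST retries can double-charge, while GET retries are free.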
Probe 4: "How do you handle zero-downtime deployments?" (3-5 min)
"Connection draining is the key. When removing a server for deployment: (1) stop sending NEW requests, (2) let existing requests complete (drain timeout = p99 request duration × 2), (3) once all connections are closed or timeout expires, shut down the server."
"The reverse — bringing a new server online — uses slow-start. Don't immediately give it full traffic weight. Ramp from 10% to 100% over 30 seconds. This lets the new server warm up (JIT compilation, cache population) before taking full load."
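Both rules of thumb — the drain timeout and the slow-start ramp — are simple enough to express directly. A sketch of the arithmetic, not any specific load balancer's config:

```python
def drain_timeout(p99_request_seconds: float) -> float:
    # Rule of thumb: p99 request duration x 2.
    # Too short drops in-flight requests; too long stalls deployments.
    return p99_request_seconds * 2

def slow_start_weight(seconds_since_join: float,
                      ramp_seconds: float = 30.0,
                      floor: float = 0.10) -> float:
    """Linear ramp from 10% to 100% traffic weight over the ramp window,
    so a cold server (empty caches, un-JITted code) isn't hit at full load."""
    if seconds_since_join >= ramp_seconds:
        return 1.0
    return floor + (1.0 - floor) * (seconds_since_join / ramp_seconds)

drain_timeout(1.5)        # 1.5s p99 -> 3.0s drain window
slow_start_weight(15)     # halfway through the ramp -> ~55% weight
```

A linear ramp is the simplest choice; some LBs use exponential or step ramps, but the interview point is the same: new capacity is not full capacity on second one.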
Phase 6: Wrap-Up
If you've gone deep, close with the organizational insight:
"Load balancing is the most over-engineered and under-monitored part of most architectures. Teams spend weeks choosing between nginx and HAProxy, then never set up alerting for connection drain failures or health check false positives. The Staff question isn't 'which load balancer' — it's 'what are the failure modes, who gets paged, and what does the runbook say.'"
Quick-Reference: The 30-Second Cheat Sheet
| Topic | The L5 Answer | The L6 Answer (say this) |
|---|---|---|
| Layer choice | "L7 because we might need it" | "L4 unless we need content-based routing — L7 adds 5-10ms per request" |
| Algorithm | "Round-robin" | "Least-connections — round-robin fails with variable request durations" |
| Health checks | "Ping every 5 seconds" | "Active + passive combined, blast radius protected — never remove >20% of fleet" |
| Session state | "Sticky sessions" | "Externalize to Redis — sticky sessions are a scalability trap" |
| LB failure | "Use a managed LB" | "Active-passive minimum — and plan for in-flight TCP resets during failover" |
| Deployment | "Zero-downtime deploy" | "Connection draining with timeout + slow-start on recovery" |
Key Numbers Worth Memorizing
| Metric | Value | Why It Matters |
|---|---|---|
| L4 vs L7 latency | 5-10ms per request difference | L7 tax on every request |
| Health check cascade | Remove 1 of 5 = 25% load spike | Why blast radius protection matters |
| Active-passive failover | 1-5 seconds (VRRP) | Sets expectation for connection resets |
| Connection drain timeout | p99 request duration × 2 | Too short = dropped requests |
| Slow-start ramp | 10% → 100% over 30 seconds | Prevents cold-cache overload |