How to Use This Playbook
This playbook supports three reading modes:
| Mode | Time | What to Read |
|---|---|---|
| Quick Review | 15 min | Executive Summary → Interview Walkthrough → Fault Lines (§3) → Drills (§7) |
| Targeted Study | 1-2 hrs | Interview Walkthrough → Core Flow, expand appendices where you're weak |
| Deep Dive | 3+ hrs | Everything, including all appendices |
What are Circuit Breakers & Resilience Patterns? — Why interviewers pick this topic
The Problem
In a microservice architecture, services call other services. When a downstream service fails or slows down, the calling service can get stuck waiting — consuming threads, connections, and memory. Without protection, one slow service can cascade failures across the entire system. Circuit breakers detect when a dependency is unhealthy and "trip" — returning fast failures instead of waiting. Combined with retries, timeouts, bulkheads, and fallbacks, they form the resilience layer that prevents localized failures from becoming system-wide outages.
Common Use Cases
- Service-to-Service Calls: Protect callers when a downstream microservice is degraded or unreachable
- Database Connections: Prevent connection pool exhaustion when the database is slow
- External API Calls: Handle third-party API failures gracefully (payment providers, email services)
- Resource Protection: Prevent resource exhaustion (thread pools, connection pools, memory) during failures
- Graceful Degradation: Serve cached or default content when a real-time dependency fails
Why Interviewers Ask About This
Resilience patterns test whether you understand failure domain isolation and cascading failure prevention — the core operational challenges of distributed systems. Everyone can describe the circuit breaker pattern from a blog post. Staff engineers reason about: Where do you place circuit breakers? What are the thresholds? Who owns the fallback behavior? What's the recovery strategy? This separates "I've read about Hystrix" from "I've been paged at 3 AM when a cascading failure took down production."
What This Interview Actually Tests
Circuit breakers are not a "wrap calls in try/catch with a state machine" question. Everyone can draw the three states (closed, open, half-open).
This is a failure domain ownership question that tests:
- Whether you reason about cascading failures as organizational problems
- Whether you can size failure budgets and set thresholds based on SLOs
- Whether you understand the interaction between retries, timeouts, and circuit breakers
- Whether you design for recovery, not just failure detection
The key insight: Resilience is not a library you install — it's an organizational discipline. The hardest part isn't implementing circuit breakers; it's deciding what the fallback behavior should be, who owns the SLO, and what constitutes "healthy enough" to close the circuit.
The L5 vs L6 Contrast (Memorize This)
| Behavior | L5 (Senior) | L6 (Staff) |
|---|---|---|
| First move | "Add a circuit breaker library" | "Which failure domain are we isolating? What's the blast radius if this dependency fails?" |
| Thresholds | "Trip after 5 errors" | "Thresholds derive from the SLO. If our error budget is 0.1%, the circuit trips when the dependency's error rate would burn the budget in <1 hour" |
| Retry | "Retry 3 times with backoff" | "Retries amplify load on a struggling service. We need a retry budget shared across all callers to prevent retry storms" |
| Fallback | "Return cached data" | "The fallback IS the product decision. Does the user see stale data, degraded UX, or an error? Product owns this decision." |
| Recovery | "The circuit auto-recovers" | "Half-open probes must be throttled. If 1000 instances all probe simultaneously, the recovering service gets hammered again." |
Default Staff Positions (Unless Proven Otherwise)
| Position | Rationale |
|---|---|
| Timeouts before circuit breakers | A circuit breaker without aggressive timeouts is useless — threads still block |
| Retry budgets over per-call retries | Per-call retries compound across callers; shared budgets cap total retry load |
| Bulkheads for critical dependencies | Isolate thread/connection pools per dependency so one failure can't exhaust all resources |
| Fallback behavior is a product decision | Engineering decides how to detect failure; product decides what users see |
| Client-side circuit breakers, not server-side | The caller decides when to stop calling; the callee doesn't know it's unhealthy |
| Resilience as organizational discipline | Patterns work only if every team implements them; one unprotected caller path can cascade |
The Three Intents (Pick One and Commit)
| Intent | Constraint | Strategy | Correctness Bar |
|---|---|---|---|
| Cascading Failure Prevention | Protect the caller from a slow/failing dependency | Circuit breaker + timeout + fallback | Caller never blocks on a failing dependency; degrades gracefully |
| Resource Exhaustion Protection | Prevent one bad dependency from starving others | Bulkhead (isolated pools) + circuit breaker | Each dependency has bounded resource allocation; failures don't cross boundaries |
| System-Wide Resilience | Prevent any single failure from becoming a total outage | Retry budgets + load shedding + graceful degradation | System stays within SLO even during partial failures |
🎯 Staff Insight: "I'll focus on cascading failure prevention first — it's where circuit breakers, timeouts, and retries interact in the most complex ways. Bulkheads and load shedding are complementary patterns we layer on top."
System Architecture Overview
Interview Walkthrough
The six phases below are compressed for a deep-dive format. Phases 1-3 deliver the crisp answer in 2-3 minutes. If probed, Phase 5 has depth for 15+ minutes. Know when to stop.
Phase 1: Requirements & Framing (30 seconds)
Name the problem before naming the solution:
- "When a downstream service is failing, the worst thing we can do is keep sending it traffic. Failed requests pile up, consuming threads and connections. The calling service slows down, exhausts its own resources, and the failure cascades upstream. Circuit breakers stop this cascade by cutting off traffic to a failing service and returning a fast fallback."
Phase 2: Core Entities & API (30 seconds)
State the three states:
- Closed (normal): Requests flow through. The circuit breaker tracks error rate and latency.
- Open (tripped): All requests are immediately rejected with a fallback response. No traffic reaches the downstream service. Timer running.
- Half-Open (testing): After a timeout, the circuit breaker lets a small number of test requests through. If they succeed, transition to Closed. If they fail, back to Open.
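The three states map to a small state machine. A minimal sketch in Python (class and parameter names are my assumptions, not any library's API; real implementations like Resilience4j add sliding windows, metrics, and thread safety):

```python
import time

CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

class CircuitBreaker:
    """Minimal three-state circuit breaker. The clock is injectable
    so the Open-duration timer can be tested deterministically."""

    def __init__(self, failure_threshold=5, open_seconds=30.0, clock=time.monotonic):
        self.state = CLOSED
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.opened_at = 0.0
        self.clock = clock

    def allow_request(self):
        if self.state == OPEN:
            if self.clock() - self.opened_at >= self.open_seconds:
                self.state = HALF_OPEN   # timer expired: admit one probe
                return True
            return False                 # fail fast, no downstream call
        return True                      # CLOSED traffic, or the HALF_OPEN probe

    def record_success(self):
        self.failures = 0
        self.state = CLOSED              # probe (or normal call) succeeded

    def record_failure(self):
        if self.state == HALF_OPEN:
            self._trip()                 # probe failed: straight back to Open
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = OPEN
        self.opened_at = self.clock()
```

The caller's loop is: `if cb.allow_request():` make the call and report the outcome; `else:` serve the fallback.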
Phase 3: The 2-Minute Architecture (2 minutes)
Phase 4: Transition to Depth (15 seconds)
"The state machine is simple. The hard problems are: choosing the trip threshold, handling the half-open state correctly, and designing fallback strategies. Want me to go deeper on any of these?"
Phase 5: Deep Dives (5-15 minutes if probed)
Probe 1: "How do you set the trip threshold?" (3-5 min)
"The threshold has three components: error rate, latency percentile, and the measurement window."
Walk through the calibration:
- Error rate threshold: Trip at 50% error rate over a 10-second sliding window. "Not 10% — services have natural error baselines (timeouts, bad requests). Tripping at 10% would cause false positives during normal traffic spikes."
- Latency threshold: Trip when p99 latency exceeds 5x the normal p99. "If the normal p99 is 100ms and it jumps to 500ms, the downstream service is degraded even if it's not returning errors. Slow responses are worse than failures — they hold threads."
- Minimum request volume: Don't trip with fewer than 20 requests in the window. "With only 5 requests, 1 error is a 20% error rate — that's noise, not a real failure. The minimum volume filter prevents premature tripping."
The calibration problem: "These thresholds are different for every service-to-service call. A payment service with 0.1% natural error rate should trip at 5%. A recommendation service with 2% natural error rate should trip at 20%. There is no universal threshold."
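The three calibration components above combine into a single trip decision. A hedged sketch (the function name and default values are illustrative, taken from the numbers in this probe, not from any library):

```python
def should_trip(errors, total, p99_ms, baseline_p99_ms,
                error_rate_threshold=0.5, latency_multiplier=5.0,
                min_volume=20):
    """Trip decision over one sliding window: minimum-volume filter
    first, then error rate, then the latency check."""
    if total < min_volume:
        return False  # too few samples: noise, not a real failure
    if errors / total >= error_rate_threshold:
        return True   # error-rate trip
    if p99_ms >= latency_multiplier * baseline_p99_ms:
        return True   # latency trip: slow responses hold threads
    return False
```

Note the ordering: the volume filter runs first precisely so that 1 error in 5 requests never registers as a 20% error rate.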
Probe 2: "What happens in Half-Open?" (3-5 min)
"Half-Open is the most dangerous state. The circuit breaker lets a small number of test requests through. If they succeed, we close the circuit. But if we let too many through and the service is still failing, we've just sent a burst of traffic to a recovering service and potentially killed it again."
Walk through the protocol:
- Single request probe: Let exactly 1 request through. If it succeeds, close. If it fails, re-open with an exponentially increasing timeout (10s → 20s → 40s → cap at 5 min).
- Gradual traffic ramp: Instead of all-or-nothing, use a percentage-based ramp. Half-Open allows 5% of traffic, then 10%, 25%, 50%, 100%. Each step requires a success rate above 95% to proceed.
- Health check probe: Instead of sending real traffic, send a synthetic health check. If the health check succeeds, ramp real traffic gradually. "This avoids putting user requests at risk during the testing phase."
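Two pieces of the protocol above are easy to sketch: the exponentially increasing re-open timeout (10s → 20s → 40s, capped at 5 minutes) and the success-gated traffic ramp. Illustrative Python, with names and the step list assumed from the bullets above:

```python
RAMP_STEPS = [0.05, 0.10, 0.25, 0.50, 1.00]

def reopen_timeout(consecutive_trips, base=10.0, cap=300.0):
    """Open duration doubles after each failed probe: 10s -> 20s -> 40s,
    capped at 5 minutes (the single-probe protocol above)."""
    return min(base * (2 ** consecutive_trips), cap)

def next_ramp_step(current_fraction, success_rate, required=0.95):
    """Advance the half-open traffic ramp one step, but only if the
    current step's success rate clears the 95% bar; otherwise re-open."""
    if success_rate < required:
        return 0.0  # failed the gate: back to Open, admit no traffic
    higher = [s for s in RAMP_STEPS if s > current_fraction]
    return higher[0] if higher else 1.0
```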
Probe 3: "What's the fallback strategy?" (3-5 min)
"When the circuit is open, we need a fallback. The fallback depends on the criticality of the call:"
- Cached response: For read-heavy, slowly-changing data (product catalog, user preferences). Serve the last known good response from cache. "Users see slightly stale data. They don't see errors."
- Degraded response: Return a reduced response (e.g., product page without recommendations, feed without personalized ranking). "Core functionality works. Non-critical features are absent."
- Queue for later: For non-critical writes (analytics events, notifications). Buffer the request locally and replay when the circuit closes. "The user doesn't wait. The data arrives eventually."
- Fail fast with user-facing message: For critical calls where no fallback exists (payment processing). Return a clear error immediately: "Payment temporarily unavailable. Please try again in a few minutes."
Probe 4: "How do you prevent thundering herd on recovery?" (3-5 min)
"The downstream service recovers. All circuit breakers transition from Open to Half-Open at the same moment and probe simultaneously. The probes succeed, so 100 callers close their circuits at once and release the demand that backed up while they were open. The recovering service, still cold and possibly running at reduced capacity, absorbs several times its normal load in the first seconds — and crashes again."
Mitigations:
- Jittered half-open timing: Each circuit breaker adds random jitter (±30%) to its Open→Half-Open transition timeout. Not all breakers probe at the same time.
- Coordinated recovery via health check: A centralized health check (not per-caller) determines when the service is ready. Callers watch the health status instead of independently probing.
- Token bucket for recovery traffic: The recovering service advertises a capacity limit. Callers collectively respect it, dividing the available capacity proportionally.
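The jittered-timing mitigation is nearly a one-liner. A sketch assuming the ±30% figure above, with the random source injectable for testing:

```python
import random

def jittered_open_duration(base_seconds, jitter_frac=0.30, rng=random.random):
    """Open duration with +/-30% jitter so a fleet of breakers does not
    transition to Half-Open and probe in lockstep."""
    # rng() in [0, 1) -> factor in [1 - jitter_frac, 1 + jitter_frac)
    factor = 1.0 + jitter_frac * (2.0 * rng() - 1.0)
    return base_seconds * factor
```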
Phase 6: Wrap-Up
"Circuit breakers are the immune system of a distributed architecture. The state machine is trivial — three states, two thresholds. What makes it Staff-level is threshold calibration (per-call, not global), fallback classification (cached, degraded, queued, or fail-fast), and recovery coordination (preventing the thundering herd). The technology choice (Resilience4j vs Envoy) matters less than the operational discipline of calibrating and monitoring every circuit breaker in the system."
Quick-Reference: The 30-Second Cheat Sheet
| Topic | The L5 Answer | The L6 Answer (say this) |
|---|---|---|
| Purpose | "Stops calling a failed service" | "Protects the CALLER from resource exhaustion — the failed service is already dead" |
| States | "Open, closed, half-open" | "Three states + the transition thresholds are the real design decisions" |
| Implementation | "Use Hystrix/Resilience4j" | "Sidecar-level for consistency — library requires every team to configure correctly" |
| Threshold | "50% error rate" | "Per-call calibration: error rate + latency + minimum volume, based on historical baselines" |
| Fallback | "Return an error" | "Classified by criticality: cached, degraded, queued, or fail-fast" |
| Recovery | "Close the circuit when service recovers" | "Gradual ramp (5% → 100%) with jitter to prevent thundering herd" |
§1 The Staff Lens
Why Resilience Separates L5 from L6
Every microservice architecture eventually experiences cascading failures. The difference between a 5-minute blip and a 4-hour outage is whether the system has resilience patterns — and whether those patterns are tuned correctly.
Staff Signal: The interviewer is testing whether you understand resilience as failure domain isolation — a systems thinking skill — not just a design pattern implementation.
The Three Dimensions Interviewers Probe
- Failure Domain Identification — Can you identify the blast radius of each dependency failure? Do you know which failures cascade and which are contained?
- Threshold Reasoning — How do you set circuit breaker thresholds, timeout values, and retry limits? Are they arbitrary or derived from SLOs?
- Recovery Strategy — What happens after the circuit trips? How does the system recover? Who decides when it's safe to restore traffic?
What L5s Get Wrong
L5s implement resilience patterns as defense mechanisms — add a circuit breaker, add retries, add a timeout. They treat each pattern independently.
Staff engineers implement resilience as a system property — understanding how patterns interact (retries can defeat circuit breakers, timeouts without backpressure just move the queue), and designing fallback behaviors that are product-aware, not just technically safe.
§2 Problem Framing & Intent
The Three Core Intents
Intent 1: Cascading Failure Prevention
Constraint: A failing dependency must not degrade the caller beyond its own SLO.
- Circuit breaker detects failure, short-circuits calls, returns fallback
- Timeout prevents thread blocking on slow responses
- The combination prevents one slow service from consuming all caller resources
Intent 2: Resource Exhaustion Protection
Constraint: Each dependency gets bounded resources; one failure can't starve others.
- Bulkhead pattern: separate thread pools or connection pools per dependency
- Even if Dependency A consumes all its allocated threads, Dependencies B and C still have theirs
- Without bulkheads, a single slow dependency can exhaust the shared thread pool
Intent 3: System-Wide Load Management
Constraint: During partial failures, the system must stay within SLO by shedding load gracefully.
- Load shedding: reject excess requests before they consume resources
- Graceful degradation: disable non-critical features to preserve critical path
- Retry budget: cap total system-wide retries to prevent amplification
Staff Signal: Most interview questions target Intent 1 (cascading failure prevention) because it surfaces the hardest tradeoffs: timeout tuning, retry interaction, fallback design, and recovery strategy.
Mechanics Refresher: Circuit Breaker State Machine
The Three States:
- Closed (normal): Requests pass through. Failures are counted. When failure count exceeds threshold within a window, transition to Open.
- Open (tripped): All requests are rejected immediately (fail-fast). After a configured timeout, transition to Half-Open.
- Half-Open (probing): A limited number of requests are allowed through as probes. If probes succeed, transition to Closed. If probes fail, transition back to Open.
Key parameters:
- Failure threshold: Number or percentage of failures to trip (e.g., 50% error rate over 10-second window)
- Open duration: How long to stay open before probing (e.g., 30 seconds)
- Probe count: How many requests to allow in half-open state (e.g., 3)
- Success threshold: How many probe successes needed to close (e.g., 3/3)
- Sliding window: Time window for counting failures (e.g., 10 seconds)
What counts as failure:
- HTTP 5xx responses, connection timeouts, connection refused, socket errors
- Not failures: HTTP 4xx (client errors), business logic errors, rate limit responses (429)
- Counting 4xx as failures will trip the circuit when it shouldn't — this is a common misconfiguration.
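The classification rules above as a sketch (the function name is an assumption; real clients also need to map their HTTP library's exception types onto the transport-error case):

```python
def counts_as_failure(status_code=None, exc=None):
    """Failure classification for the breaker's error counter:
    transport errors and 5xx count; client errors do not."""
    if exc is not None:
        return True                # timeout, connection refused, socket error
    if status_code is None:
        return False
    if status_code == 429:
        return False               # rate limited: back off, don't trip
    if 400 <= status_code < 500:
        return False               # client error: the caller's fault
    return status_code >= 500      # 5xx: genuine dependency failure
```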
§3 Fault Lines
Fault Line 1: Threshold Tuning — Sensitivity vs. Stability
The fundamental tension: aggressive thresholds trip quickly (fast protection) but cause false positives during brief transient errors. Conservative thresholds are stable but allow cascading failures to develop.
The Tradeoff Matrix
| Approach | Trip Speed | False Positive Rate | Cascade Risk | Complexity |
|---|---|---|---|---|
| Count-based (5 failures → trip) | Fast on burst | High during transients | Low | Low |
| Rate-based (>50% error rate) | Medium | Low (tolerates some errors) | Medium | Low |
| SLO-derived (error budget burn rate) | Adaptive | Lowest | Lowest | Medium |
| ML-based anomaly detection | Adaptive | Very low | Very low | Very high |
Who Pays
| Who Pays | Aggressive (fast trip) | Conservative (slow trip) | SLO-Derived |
|---|---|---|---|
| Users | More false-positive degradation | Exposed to cascading failures longer | Balanced — degrades only when SLO is at risk |
| Engineering | Frequent circuit trips to investigate | Longer outages when cascade hits | Must define and maintain SLOs |
| Ops | Alert fatigue from circuit trips | Fewer alerts but worse outages | Actionable alerts tied to business impact |
| Product | More frequent fallback UX | Less fallback but worse failures | Fallback proportional to actual risk |
Staff Signal: SLO-derived thresholds are the Staff answer. Instead of "trip after 5 errors," the threshold is: "trip when the dependency's error rate would burn our error budget within 1 hour at the current rate." This makes the circuit breaker adaptive — it tolerates brief spikes but trips fast during sustained failures.
Bar-Raiser Question
"Your service has a 99.9% SLO. A dependency starts returning 5% errors. Should the circuit trip?"
L5 answer: "Yes, 5% is above normal."
L6 answer: "It depends on traffic volume and time. At our current traffic, 5% errors means we're burning error budget at 50x the sustainable rate — our monthly budget would be gone in 14 hours. The circuit should trip. But if traffic is low (say, off-peak), 5% might only be 10 actual errors per minute — not worth tripping. The threshold must be rate-aware, not just percentage-aware."
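The arithmetic in the L6 answer generalizes to a burn-rate calculation. A sketch, assuming a ~720-hour (30-day) budget window:

```python
def burn_rate(error_rate, slo):
    """How many times faster than sustainable the error budget is burning.
    A 99.9% SLO leaves a 0.1% budget; a 5% error rate burns it at 50x."""
    budget = 1.0 - slo
    return error_rate / budget

def hours_until_budget_exhausted(error_rate, slo, window_hours=720.0):
    """Time to exhaust a ~monthly budget at the current error rate."""
    return window_hours / burn_rate(error_rate, slo)
```

An SLO-derived trip rule then reads: trip when `hours_until_budget_exhausted(...)` drops below the tolerance (e.g. 1 hour, per the Staff Signal above).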
Fault Line 2: Retries — Protection vs. Amplification
Retries are essential for handling transient failures but dangerous in aggregate. If every caller retries 3 times, a slow service receiving 1000 req/s gets 3000 req/s during failure — amplifying the problem.
The Tradeoff Matrix
| Strategy | Recovery from Transients | Load Amplification | Complexity | Coordination |
|---|---|---|---|---|
| Per-call retries (3x with backoff) | Good | 3x amplification worst case | Low | None |
| Retry budget (10% of traffic) | Good | Bounded to 110% of normal load | Medium | Shared state across callers |
| Selective retry (only idempotent ops) | Partial | Low (only safe operations retry) | Low | Per-endpoint configuration |
| No retries (fail-fast + circuit breaker) | None | Zero amplification | Lowest | None |
Who Pays
| Who Pays | Per-Call Retries | Retry Budget | No Retries |
|---|---|---|---|
| Users | Transparent recovery from transients | Same recovery, bounded impact | More visible errors |
| Engineering | Simple implementation | Budget tracking infrastructure | Simplest |
| Downstream | 3x load during failure (dangerous) | Bounded extra load | No extra load |
| System | Retry storm risk | Controlled amplification | No amplification |
Staff Signal: Retry budgets are the Staff answer. A shared budget (e.g., "10% of requests can be retries") caps the total retry load across all callers. When the budget is exhausted, new retries are rejected — callers get fast failures. This prevents the deadly spiral where retries overwhelm a recovering service.
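A per-instance retry budget can be sketched with two counters (illustrative only; production implementations typically use token buckets or decaying windows rather than raw lifetime counts):

```python
class RetryBudget:
    """Retries capped at a fraction of observed request volume.
    Enforced per instance, which bounds aggregate retry load fleet-wide
    without any cross-instance coordination."""

    def __init__(self, ratio=0.10, min_requests=10):
        self.ratio = ratio
        self.min_requests = min_requests
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        if self.requests < self.min_requests:
            return False  # not enough traffic to justify spending budget
        return self.retries < self.ratio * self.requests

    def record_retry(self):
        self.retries += 1
```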
Bar-Raiser Question
"Your service retries 3 times with exponential backoff. There are 50 instances of your service calling the same dependency. The dependency is slow. What happens?"
L5 answer: "Each instance retries 3 times, so the dependency gets 4x traffic."
L6 answer: "The aggregate amplification is 4x: normal load is 50,000 req/s across the fleet, and with 3 retries each the dependency sees up to 200,000 req/s. But the timing makes it worse than a smooth 4x. With plain exponential backoff, the retries from all 50 instances cluster at the same intervals (1s, 2s, 4s), hitting the dependency in synchronized waves. This is a retry storm. The fix is: (1) add jitter to backoff, (2) implement a shared retry budget across instances, (3) make the circuit breaker trip before all retries are exhausted."
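Fix (1) from the answer above is sometimes called "full jitter": spread each delay uniformly over the whole backoff interval instead of sleeping exactly 1s/2s/4s. A sketch:

```python
import random

def backoff_with_jitter(attempt, base=1.0, cap=30.0, rng=random.uniform):
    """Full-jitter exponential backoff: the delay for attempt N is drawn
    uniformly from [0, min(cap, base * 2^N)], so 50 instances spread
    their retries out instead of arriving in synchronized waves."""
    return rng(0.0, min(cap, base * (2 ** attempt)))
```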
Fault Line 3: Fallback Design — Technical vs. Product Decision
When the circuit trips, what does the user see? This is where resilience becomes a product problem.
The Tradeoff Matrix
| Fallback Strategy | User Experience | Correctness | Complexity | Product Involvement |
|---|---|---|---|---|
| Error page ("Service unavailable") | Bad | Honest | None | None needed |
| Cached/stale data | Good (if stale is acceptable) | Approximate | Medium (cache management) | Must define staleness tolerance |
| Default values | Acceptable | Approximate | Low | Must define acceptable defaults |
| Degraded functionality | Reduced but functional | Partial | High | Must define degradation tiers |
| Queue for later | Deferred but complete | Eventually correct | Medium | Must design retry UX |
Who Pays
| Who Pays | Error Page | Cached Data | Degraded Functionality |
|---|---|---|---|
| Users | Frustrating, honest | Seamless but possibly stale | Reduced but usable |
| Engineering | Zero effort | Cache infrastructure + invalidation | Multiple code paths |
| Product | No decisions needed | Define staleness tolerance | Define degradation tiers |
| Support | High ticket volume | Low tickets (users don't notice) | Moderate (confusion about missing features) |
Staff Signal: Fallback design is a product decision, not an engineering decision. Engineering decides how to detect failure and how to implement the fallback. Product decides what users see. The circuit breaker configuration includes the fallback behavior, and changing it requires a product review, not just a code change.
Bar-Raiser Question
"The recommendation service is down. Your product page normally shows 'You might also like...' How do you handle it?"
L5 answer: "Show a cached version of recommendations."
L6 answer: "Three options, escalating in effort: (1) Hide the section entirely — users don't miss what they never saw. Lowest effort, no stale data risk. (2) Show popular items globally instead of personalized recommendations. Generic but relevant. (3) Show cached personalized recommendations with a staleness limit (e.g., max 24 hours). Each option has different product implications — I'd propose all three to product and let them decide based on the conversion impact data."
Fault Line 4: Recovery — Thundering Herd on Close
When a circuit transitions from open to half-open and probes succeed, it closes — and all backed-up traffic floods the recovering dependency. This thundering herd can immediately re-trigger the failure.
The Tradeoff Matrix
| Strategy | Recovery Speed | Re-Failure Risk | Complexity | Predictability |
|---|---|---|---|---|
| Immediate close (all traffic) | Instant | Very high (thundering herd) | None | Unpredictable |
| Gradual ramp (10% → 25% → 50% → 100%) | Minutes | Low | Medium | Predictable |
| Canary probing (single instance tests) | Slow | Very low | Medium | Controlled |
| Adaptive (close rate based on probe success rate) | Variable | Low | High | Self-tuning |
Who Pays
| Who Pays | Immediate Close | Gradual Ramp |
|---|---|---|
| Users | Instant recovery but risk of re-failure | Slower recovery but stable |
| Engineering | Simple but fragile | Traffic shaping logic needed |
| Downstream | Full traffic hit immediately | Controlled traffic increase |
| Reliability | Oscillation risk (trip → close → trip) | Smooth recovery |
Staff Signal: Gradual ramp is the Staff default. When the circuit closes, route 10% of traffic to the recovered dependency, monitor error rate, ramp to 25%, and so on. If errors increase at any step, pause or reopen. This prevents the "recovery oscillation" where the circuit trips, recovers, gets hammered, trips again, in an infinite loop.
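The gradual-ramp close can be sketched as a small controller (the step values come from the Staff Signal above; the 5% error gate and class name are illustrative assumptions):

```python
class RecoveryRamp:
    """Gradual close: step traffic 10% -> 25% -> 50% -> 100%,
    reopening the circuit if any step's error rate regresses."""

    STEPS = [0.10, 0.25, 0.50, 1.00]

    def __init__(self, max_error_rate=0.05):
        self.step = 0
        self.max_error_rate = max_error_rate
        self.reopened = False

    @property
    def fraction(self):
        """Share of traffic currently routed to the recovered dependency."""
        return 0.0 if self.reopened else self.STEPS[self.step]

    def report(self, errors, total):
        """Feed each step's observed results; ramp up, or go back to Open."""
        if total == 0:
            return
        if errors / total > self.max_error_rate:
            self.reopened = True          # regression: back to Open
        elif self.step < len(self.STEPS) - 1:
            self.step += 1                # step cleared: ramp up
```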
§4 Failure Modes & Degradation
Failure Mode 1: Cascading Failure (Unprotected Dependency Chain)
What happens: Service A calls Service B, which calls Service C. C slows down. B's threads block waiting for C. A's threads block waiting for B. All three services fail.
Blast radius: Total — every service in the call chain fails. Users see full outage.
Mitigation:
- Timeouts at every call boundary (aggressive — 2x the p99 latency, not 30 seconds)
- Circuit breakers at every call boundary
- Bulkheads isolating each dependency's resource allocation
- The first service to detect the problem should shed load, not propagate it
Who owns this: Each team owns their outgoing circuit breakers. The platform team provides the library and defaults.
Failure Mode 2: Retry Storm
What happens: A dependency slows down. Every caller retries 3 times. The dependency receives 4x normal traffic while already struggling. It fails completely.
Blast radius: The dependency and everything that depends on it.
Mitigation:
- Shared retry budget across all callers (e.g., 10% of requests)
- Circuit breakers trip before all retries are exhausted
- Jitter on retry timing to prevent synchronized bursts
- Server-side: respond with 503 + Retry-After header to explicitly control retry timing
Who owns this: Caller teams for retry configuration. Platform team for retry budget infrastructure.
Failure Mode 3: Circuit Breaker Oscillation
What happens: A dependency partially recovers. The circuit closes, full traffic hits, dependency fails again, circuit opens. Repeat.
Blast radius: The dependency never fully recovers. Users experience intermittent failures.
Mitigation:
- Gradual ramp on circuit close (10% → 25% → 50% → 100%)
- Longer open duration after repeated trips (exponential backoff on the circuit itself)
- Canary probing: test with a single instance before closing for all instances
Who owns this: Platform team for circuit breaker library. SRE for monitoring oscillation patterns.
Failure Mode 4: Timeout Misconfiguration
What happens: Timeouts are set too high (30 seconds). When a dependency slows down, threads block for 30 seconds each. Thread pool exhausts in minutes.
Blast radius: Thread pool exhaustion causes the caller to reject all requests — even to healthy dependencies.
Mitigation:
- Timeouts must be aggressive: 2x the dependency's p99 latency, not an arbitrary large value
- Connection timeout separate from request timeout (connect: 1s, read: 3s)
- Timeout values derived from production latency data, reviewed quarterly
- Bulkheads prevent thread exhaustion from affecting other dependencies
Who owns this: Each team owns their timeout configuration. Platform team provides sensible defaults.
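The derivation rule above (read timeout at 2x observed p99, connect timeout separate and tighter) as a sketch; the function name and return shape are assumptions:

```python
def derive_timeouts(observed_p99_ms, connect_ms=1000):
    """Timeouts from production latency data, in milliseconds:
    read timeout = 2x the dependency's p99, connect timeout kept
    separate so a dead host fails in ~1s instead of the full read budget."""
    return {"connect_ms": connect_ms, "read_ms": 2 * observed_p99_ms}
```

With an HTTP client that separates the two phases (for example, Python's `requests` accepts a `timeout=(connect_seconds, read_seconds)` tuple), these values map directly onto the client configuration.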
§5 Evaluation Rubric
Signal 1: Failure Domain Thinking
| Score | Behavior |
|---|---|
| Weak | "Add a circuit breaker" without identifying what it protects |
| Adequate | Identifies which dependency needs protection |
| Strong | Maps all failure domains, identifies blast radius per dependency, prioritizes protection by business impact |
Signal 2: Threshold Reasoning
| Score | Behavior |
|---|---|
| Weak | "Trip after 5 errors" (arbitrary) |
| Adequate | Percentage-based threshold with time window |
| Strong | SLO-derived thresholds, explains how traffic volume affects threshold behavior, considers transient vs. sustained failures |
Signal 3: Retry Awareness
| Score | Behavior |
|---|---|
| Weak | "Retry 3 times with backoff" without considering load amplification |
| Adequate | Mentions retry storms as a risk |
| Strong | Designs retry budgets, explains jitter, calculates amplification factor, knows when NOT to retry |
Signal 4: Fallback Design
| Score | Behavior |
|---|---|
| Weak | "Return an error" |
| Adequate | "Return cached data" |
| Strong | Proposes tiered fallback options, identifies fallback as a product decision, explains staleness tradeoffs |
Signal 5: Recovery Strategy
| Score | Behavior |
|---|---|
| Weak | "The circuit auto-recovers" |
| Adequate | Describes half-open probing |
| Strong | Designs gradual ramp recovery, addresses thundering herd on close, explains oscillation prevention |
§6 Interview Flow & Pivots
Recommended Pacing (45 min)
| Phase | Time | Focus |
|---|---|---|
| Clarify Intent | 0-5 min | Which dependency? What's the blast radius? What's the SLO? |
| Timeout & Circuit Breaker | 5-15 min | Timeout values, circuit breaker thresholds, failure counting. |
| Retry Strategy | 15-22 min | Per-call vs. budget retries. Amplification reasoning. Idempotency. |
| Fallback Design | 22-30 min | What does the user see? Cached data vs. degraded vs. error. Product decision. |
| Recovery & System-Wide | 30-45 min | Half-open probing, gradual ramp, bulkheads, load shedding. |
Pivot Recognition
| Interviewer Says | They're Testing | Pivot To |
|---|---|---|
| "What if all your dependencies fail simultaneously?" | System-wide resilience | Load shedding, graceful degradation tiers, static fallback page |
| "The dependency is owned by another team" | Organizational resilience | SLO negotiation between teams, timeout contracts, shared on-call for critical paths |
| "Users are seeing stale data from the fallback" | Fallback quality | Staleness bounds, cache warming strategy, user-visible degradation indicators |
| "One instance of your service recovered but others haven't" | Distributed recovery | Per-instance circuit breakers, canary probing, fleet-wide circuit coordination |
| "The circuit keeps tripping every few minutes" | Oscillation | Gradual ramp, exponential backoff on trip duration, root cause investigation |
| "How do you test resilience?" | Chaos engineering | Fault injection, chaos monkey, game days, failure mode testing |
Staff Signal: If the interviewer asks about organizational resilience (cross-team SLOs), they're probing L7 territory. Show you understand that resilience is an organizational discipline — every team must implement patterns consistently, and SLOs must be negotiated across dependency boundaries.
§7 Active Drills
Drill 1: Design a Circuit Breaker for a Payment Service
Scenario: Your checkout service calls a payment provider. Design the resilience layer.
Staff Answer
- Timeout: 3 seconds (payment provider p99 is 800ms; 3s gives headroom without blocking too long)
- Circuit breaker: trip when error rate exceeds 10% over a 30-second sliding window. Open duration: 60 seconds.
- Retry: no retries for charges (not idempotent). Retry once for status checks (idempotent when sent with an idempotency key).
- Fallback: queue the payment for retry. Show user "Payment processing..." with email confirmation when complete.
- Bulkhead: dedicated connection pool for payment provider (10 connections), separate from other dependencies.
- Recovery: gradual ramp. In half-open, route 1 payment every 10 seconds. If it succeeds, ramp to 10%, 25%, 50%, 100%.
- Monitoring: alert on circuit trip. Page on-call if open for >5 minutes.
Why this is L6:
- Distinguishes between non-idempotent charges (never retry) and idempotent status checks (safe to retry)
- Fallback is a product decision (queuing with user notification) rather than technical default (error page)
- Gradual recovery prevents thundering herd on the payment provider when circuit closes
Drill 2: Prevent a Retry Storm
Scenario: 100 instances of your service call a catalog service. The catalog service is slow. Design retry strategy.
Staff Answer
- Baseline: 100 instances × 500 req/s each = 50,000 req/s to catalog service
- With 3 retries: up to 200,000 req/s during failure. This will kill the catalog service.
- Fix: implement a shared retry budget of 10%. Max 5,000 retries/second across all instances.
- Implementation: each instance tracks its own retry rate. When retry percentage exceeds 10% of its traffic, stop retrying and fail fast.
- Alternative: token bucket for retries — each instance gets 50 retry tokens/second (500 × 10%). Retry only if token available.
- Combined with circuit breaker: circuit trips when error rate exceeds 20%. Retries stop when circuit opens. This prevents retries from sustaining load on a failing service.
- Jitter: all retries add random delay (0-1 second) to prevent synchronized bursts.
Why this is L6:
- Quantifies retry amplification (4x load) and demonstrates understanding of compounding effects across instances
- Implements decentralized budget enforcement (per-instance) that achieves global compliance without coordination
- Recognizes the interaction between retries and circuit breakers — retries can prevent circuits from tripping
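The per-instance token bucket from the answer can be sketched in a few lines. This is similar in spirit to the retry budgets in Finagle and gRPC's retry throttling, but the class and parameter names here are illustrative, not a real library's API.

```python
class RetryBudget:
    """Per-instance retry token bucket (illustrative sketch).

    Each ordinary request deposits `ratio` tokens; each retry withdraws one
    full token. With ratio=0.10, retries are bounded to ~10% of traffic per
    instance, which bounds global retry volume with no cross-instance
    coordination.
    """
    def __init__(self, ratio=0.10, max_tokens=50.0):
        self.ratio = ratio
        self.max_tokens = max_tokens
        self.tokens = max_tokens      # start full: allow retries initially

    def on_request(self):
        # every normal request earns a fraction of a retry token
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_retry(self):
        # a retry is permitted only if a full token is available
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                  # budget exhausted: fail fast
```

During steady state the bucket stays full and retries proceed; during a sustained failure the bucket drains and retries are suppressed, which is exactly when retry amplification would otherwise be most damaging.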
Drill 3: Design Bulkhead Isolation
Scenario: Your API gateway calls 5 downstream services. One is slow. Prevent it from affecting the others.
Staff Answer
- Separate thread pool per downstream service. Each pool has max N threads.
- Service A (critical, fast): 50 threads. Service B (important, medium): 30 threads. Service C (nice-to-have, slow): 10 threads. Services D, E: 20 threads each.
- When Service C consumes all 10 threads (slow), only Service C calls are affected. A, B, D, E continue normally.
- Without bulkheads: shared pool of 100 threads. Service C consumes all 100. Every service is now unresponsive.
- Connection pools: similarly isolated. Each dependency gets its own connection pool with max connections.
- Semaphore-based alternative: instead of thread pools, use semaphores to limit concurrent requests per dependency. Lighter weight, works with async I/O.
- Monitoring: track thread pool utilization per dependency. Alert when any pool exceeds 80% utilization.
Why this is L6:
- Sizes bulkheads based on criticality (critical services get more threads) rather than uniform allocation
- Recognizes bulkheads as resource isolation, not just concurrency limiting — prevents one dependency from starving others
- Proposes semaphore-based alternative for async environments, showing awareness of implementation tradeoffs
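The semaphore-based alternative can be sketched as follows; a rejected call fails fast instead of queuing behind the slow dependency. Names and pool sizes are illustrative, taken from the drill's allocation.

```python
import threading

class Bulkhead:
    """Semaphore-based bulkhead (illustrative sketch). Caps concurrent
    in-flight calls to one dependency; excess calls are rejected immediately
    rather than queued, so a slow dependency cannot absorb every thread."""
    def __init__(self, name, max_concurrent):
        self.name = name
        self._sem = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # non-blocking acquire: if the bulkhead is full, reject now
        if not self._sem.acquire(blocking=False):
            raise RuntimeError(f"bulkhead {self.name} full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._sem.release()

# One bulkhead per dependency, sized by criticality as in the drill:
bulkheads = {
    "service_a": Bulkhead("service_a", 50),   # critical, fast
    "service_b": Bulkhead("service_b", 30),   # important, medium
    "service_c": Bulkhead("service_c", 10),   # nice-to-have, slow
}
```

When Service C's 10 slots are occupied by slow calls, further Service C calls raise immediately (and can trigger a fallback), while the other bulkheads are untouched.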
Drill 4: Design Graceful Degradation Tiers
Scenario: Your e-commerce product page depends on 6 services: catalog, pricing, recommendations, reviews, inventory, and user profile. Design degradation strategy.
Staff Answer
- Tier 0 (all healthy): full page with all features
- Tier 1 (recommendations down): hide "You might also like" section. No user impact on purchase flow.
- Tier 2 (reviews down): hide reviews, show "Reviews temporarily unavailable." Purchase still works.
- Tier 3 (inventory down): show "Check availability" button instead of real-time stock count. Accept orders, verify inventory asynchronously.
- Tier 4 (pricing down): show last cached price with "Price as of [timestamp]" disclaimer. Allow purchase at cached price.
- Tier 5 (catalog down): show static cached product page. Allow purchase of recently cached products only.
- Each tier corresponds to a circuit breaker state. When a dependency's circuit trips, the page degrades to the appropriate tier.
- Product owns the tier definitions. Engineering implements the fallback rendering.
Why this is L6:
- Designs tiered degradation by business impact (recommendations dispensable, catalog essential) rather than technical ease
- Recognizes fallback design as a product decision requiring cross-functional input, not just engineering defaults
- Preserves critical purchase flow across all degradation tiers — shows understanding of business continuity
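The tier mapping is simple enough to express as data, which makes it reviewable by product as well as engineering. A minimal sketch, assuming one open circuit maps the page to that dependency's tier and multiple failures degrade to the most severe tier (dependency names follow the scenario; everything else is hypothetical):

```python
# Degradation policy: dependency -> tier reached when its circuit is open.
# The ordering encodes business criticality, not implementation convenience.
TIER_BY_DEPENDENCY = {
    "recommendations": 1,
    "reviews": 2,
    "inventory": 3,
    "pricing": 4,
    "catalog": 5,
}

def page_tier(tripped):
    """Given the set of dependencies whose circuits are open, return the
    page's degradation tier: the most severe tier among the failures,
    or 0 (full page) when everything is healthy."""
    if not tripped:
        return 0
    return max(TIER_BY_DEPENDENCY.get(dep, 0) for dep in tripped)
```

Keeping the policy in a table rather than scattered conditionals also makes the "product owns the tier definitions" handoff concrete: product reviews the table, engineering implements the rendering per tier.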
Drill 5: Design Timeout Configuration
Scenario: Your service has 8 downstream dependencies with different latency profiles. Design the timeout strategy.
Staff Answer
- Principle: timeout = 2-3x the dependency's p99 latency, measured in production. Not arbitrary constants.
- Fast dependencies (cache, config): connect 200ms, read 500ms
- Medium dependencies (CRUD services): connect 500ms, read 2s
- Slow dependencies (search, ML inference): connect 500ms, read 5s
- External APIs (payment, email): connect 1s, read 10s (they're outside your control)
- Total request budget: if the user-facing request has a 3-second SLO, the sum of dependency timeouts can't exceed 3 seconds for the critical path. Parallelize independent calls.
- Review quarterly: p99 latencies drift. Timeouts must be updated based on production data.
- Connection timeout vs. read timeout: always set both. Connection timeout catches unreachable hosts. Read timeout catches slow responses.
Why this is L6:
- Derives timeouts from production metrics (p99) rather than arbitrary constants like 30 seconds
- Validates total timeout budget against user-facing SLO — recognizes that sequential timeouts compound
- Treats timeout configuration as living documentation requiring quarterly review, not set-and-forget
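The two calculations in the answer, deriving a timeout from p99 and validating the critical-path budget against the SLO, are worth making mechanical so they can run against fresh production data each quarter. A sketch with hypothetical function names:

```python
def derive_timeout_ms(p99_ms, multiplier=2.5, floor_ms=100, cap_ms=10_000):
    """Derive a read timeout from measured p99 latency using the 2-3x p99
    heuristic, clamped between a floor and an absolute cap."""
    return min(max(int(p99_ms * multiplier), floor_ms), cap_ms)

def critical_path_fits_slo(sequential_timeouts_ms, slo_ms):
    """Worst case for sequential calls on the critical path is the SUM of
    their timeouts; that sum must fit inside the user-facing SLO.
    (Parallelized calls contribute only their max, not their sum.)"""
    return sum(sequential_timeouts_ms) <= slo_ms
```

For the payment example earlier (p99 of 800ms), the heuristic yields a 2-second read timeout; and three sequential 2-second timeouts against a 3-second SLO fails the budget check, signaling the calls must be parallelized or tightened.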
Drill 6: Implement Chaos Engineering
Scenario: You want to verify your resilience patterns work in production. Design a chaos engineering approach.
Staff Answer
- Level 1: Inject latency. Add 2-second delay to a non-critical dependency. Verify: circuit trips, fallback activates, user experience degrades gracefully. Run during low-traffic hours.
- Level 2: Inject errors. Return 500 from a dependency for 10% of requests. Verify: circuit trips at configured threshold, retries work correctly, no retry storm.
- Level 3: Kill instances. Terminate a dependency instance. Verify: load balancer routes around it, circuit trips on callers of that instance, recovery is clean.
- Level 4: Network partition. Block traffic between service A and service B. Verify: circuit opens, fallback activates, recovery when partition heals.
- Game day: schedule a full-team exercise. Inject failures during business hours with the team watching dashboards. Document gaps, fix, repeat.
- Safety: all chaos experiments have a kill switch. Start with one experiment in staging. Graduate to production only after staging succeeds.
Why this is L6:
- Structures chaos testing as progressive levels of risk, from latency injection to network partitions
- Validates not just that resilience patterns trigger, but that user experience degrades gracefully under failure
- Frames game days as organizational learning exercises (documentation of gaps) rather than pure technical validation
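Levels 1 and 2 can be implemented as a thin wrapper around a dependency call. A minimal sketch, with all parameter names hypothetical; a real deployment would put this behind a dynamic config flag so that setting `error_rate` and `added_latency_s` to zero acts as the kill switch:

```python
import random
import time

def chaos_wrap(fn, error_rate=0.0, added_latency_s=0.0,
               rng=random.random, sleep=time.sleep):
    """Wrap a dependency call with fault injection (illustrative sketch).

    added_latency_s implements Level 1 (latency injection); error_rate
    implements Level 2 (inject a failure for that fraction of requests).
    """
    def wrapped(*args, **kwargs):
        if added_latency_s:
            sleep(added_latency_s)            # Level 1: slow every call down
        if rng() < error_rate:                # Level 2: fail a fraction
            raise RuntimeError("chaos: injected failure")
        return fn(*args, **kwargs)
    return wrapped
```

Injecting the `rng` and `sleep` functions keeps the wrapper deterministic under test, which matters: the chaos tooling itself should be one of the best-tested pieces of the stack.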
Drill 7: Design Cross-Service SLO Contracts
Scenario: Your service has a 99.9% availability SLO. You depend on 5 services. How do you ensure your SLO?
Staff Answer
- Each dependency must have a higher SLO than yours to give you headroom. If you need 99.9%, dependencies should target 99.95%.
- But you can't control other teams' SLOs. So: design for dependency SLOs being 99.5% (worse than your own) and use circuit breakers + fallbacks to bridge the gap.
- Error budget math: 99.9% = 43 minutes downtime/month. If dependency A has 99.5% SLO, it can be down 3.6 hours/month. Your circuit breaker + fallback must handle those 3.6 hours without burning your 43-minute budget.
- This means: fallback must be good enough that users don't notice dependency failures. If fallback is "error page," you're coupling your SLO to your dependency's SLO.
- Formalize dependency SLOs in a contract: "Service X provides <5ms p99 latency and <0.1% error rate. Caller should configure 15ms timeout and circuit breaker at 1% error threshold."
Why this is L6:
- Performs error budget arithmetic to derive resilience requirements from SLO constraints (3.6 hours vs 43 minutes)
- Recognizes SLO dependency as cross-team negotiation requiring formalized contracts, not just technical configuration
- Understands that fallback quality determines whether you can decouple your SLO from dependency SLOs
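The error budget arithmetic in the answer generalizes to a one-line formula worth being able to derive on a whiteboard:

```python
def downtime_minutes_per_month(slo_pct, minutes_in_month=30 * 24 * 60):
    """Allowed downtime implied by an availability SLO, for a 30-day month:
    total minutes x (1 - availability)."""
    return minutes_in_month * (1 - slo_pct / 100)
```

This reproduces the drill's numbers: 99.9% allows about 43 minutes of downtime per month, while a 99.5% dependency may be down about 3.6 hours per month, and the gap between the two is what the circuit breaker and fallback must absorb.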
Drill 8: Design Rate-Based Load Shedding
Scenario: Your service receives 10x normal traffic during a flash sale. Not all requests can be served. Design load shedding.
Staff Answer
- Priority classification: P0 (checkout, payment) — never shed. P1 (product page, cart) — shed last. P2 (recommendations, reviews) — shed first.
- Measurement: track request rate per second. When rate exceeds capacity (e.g., 80% of max throughput), begin shedding.
- Shedding order: first disable P2 features (no recommendation calls, no review calls). Then rate-limit P1 by returning 503 to excess requests with Retry-After header. Never shed P0.
- Implementation: request admission controller at the service entry point. Checks current load, request priority, and admits or rejects.
- Client-side: on 503, client shows "We're experiencing high demand" with auto-retry. For P0 (checkout), queue and retry automatically.
- Monitoring: track shedding rate, shed request types, user impact. Alert if P0 requests are shed (this should never happen).
Why this is L6:
- Designs criticality-based shedding where P0 traffic is protected absolutely, not just rate-limited proportionally
- Uses admission control at service entry (before consuming resources) rather than reactive throttling
- Specifies client-side cooperation (Retry-After header, adaptive retry) to prevent shed requests from becoming retry storms
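The admission controller's core decision reduces to a priority-to-threshold table. A minimal sketch (class names, priorities, and thresholds are illustrative, matching the drill's P0/P1/P2 scheme):

```python
# Shed a priority class when utilization (current load / max throughput)
# exceeds its threshold. P0 has no entry: it is never shed.
SHED_THRESHOLDS = {
    "P2": 0.80,   # recommendations, reviews: shed first
    "P1": 0.95,   # product page, cart: shed last
}

def admit(priority, utilization):
    """Admission decision at the service entry point, before any downstream
    resources are consumed. Returns True to admit, False to reject (the
    caller should return 503 with a Retry-After header)."""
    threshold = SHED_THRESHOLDS.get(priority)
    if threshold is None:          # P0 (checkout, payment): always admit
        return True
    return utilization <= threshold
```

Note the check runs before the request consumes threads or connections; that ordering is what distinguishes admission control from reactive throttling.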
8. Deep Dive Scenarios
Scenario-based analysis for Staff-level depth
Deep Dive 1: Netflix's Resilience Architecture
Context: Netflix was an early pioneer of microservice resilience with Hystrix (now deprecated), Resilience4j, and chaos engineering (Chaos Monkey).
Questions You Should Ask First:
- Is this a library adoption exercise, or does it require an organizational model change — who owns resilience configuration per service?
- Why was Hystrix deprecated in favor of Resilience4j — is it a threading model problem (thread-pool vs functional), and does our stack have the same constraint?
- Do our teams actually test their resilience configurations, or do they set defaults and assume they work until a production incident?
- What's the gap between 'we imported the library' and 'we validated our fallback behavior under realistic failure conditions'?
Staff-Level Discussion:
- Netflix's insight: in a microservice architecture with hundreds of dependencies, something is always failing. Resilience isn't about preventing failure — it's about containing it.
- Hystrix introduced: circuit breakers, thread pool bulkheads, fallback mechanisms, and real-time monitoring dashboards.
- Why Hystrix was deprecated: it was thread-pool-based, which doesn't work well with reactive/async programming (Project Reactor, WebFlux). Resilience4j replaced it with a functional, lightweight, non-thread-pool approach.
- Netflix's organizational model: each team is responsible for their own resilience configuration. There's no central "resilience team." The platform team provides the library and defaults. Teams customize thresholds for their specific SLOs.
- Key lesson: resilience libraries don't help if teams don't use them correctly. Netflix invested heavily in education, game days, and automated chaos testing to ensure every team understood and tested their resilience patterns.
Metrics to Watch:
resilience.circuit_breaker.trip_rate_by_service (which teams' breakers are tripping and how often), resilience.fallback.activation_count (are fallbacks actually being exercised), resilience.chaos_test.coverage_pct (percentage of services with validated resilience), resilience.config.staleness_days (how long since thresholds were reviewed).
Organizational Follow-up: Establish that every service team owns their resilience thresholds — the platform team provides tooling and defaults, not configuration. Schedule quarterly chaos game days where teams validate their fallback behavior. Create a resilience maturity scorecard: has the team tested their circuit breaker? Have they run a game day? Are thresholds SLO-derived?
Ownership Question: "Should there be a central resilience team?" Staff answer: No. Resilience is owned by every team. A platform team provides tooling and defaults, but threshold tuning, fallback design, and recovery testing must be owned by the team that owns the service. Centralizing resilience creates a bottleneck and false sense of security.
Staff Signals:
- Frames Hystrix deprecation as a threading-model limitation, not a pattern failure
- Insists resilience ownership lives with each service team, not a central resilience team
- Measures chaos test coverage percentage rather than just library adoption
Deep Dive 2: The Retry Budget Pattern in Detail
Context: Retries are the most dangerous resilience pattern because they amplify load. The retry budget pattern caps total retry volume.
Questions You Should Ask First:
- Have we calculated the retry amplification factor — if 100 clients each retry 3 times, what's the total load multiplier on a degraded service?
- Is the retry budget enforced per-instance or globally — and does per-instance enforcement with a 10% budget provide sufficient global bounding?
- How do retries interact with circuit breaker thresholds — can retries prevent circuits from tripping by masking failures?
- Who sets the retry budget — the calling service team unilaterally, or is it negotiated with the callee team as part of the SLO contract?
Staff-Level Discussion:
- The problem: if 100 clients each retry 3 times, a failing service sees 4x normal load. During a degradation (not total failure), retries can push it from degraded to completely failed.
- Retry budget: each service tracks the ratio of retries to total requests. When retries exceed the budget (e.g., 10%), stop retrying.
- Implementation: per-instance token bucket. Refill rate = 10% of normal traffic rate. Each retry consumes a token. No token = fail fast.
- Cross-instance coordination: not needed. If each instance independently maintains 10% retry budget, the total retry volume is bounded to 10% of total traffic.
- The interaction with circuit breakers: retries happen during the circuit breaker's counting window. If the retries succeed, they mask failures and delay tripping; if they fail, they accelerate it. The retry budget caps total retry volume, so retries can never sustain enough load on a failing dependency to keep its circuit from ever tripping.
Metrics to Watch:
retry.budget_utilization_pct (current retry volume as percentage of budget — alert at 80%), retry.amplification_factor (total requests including retries divided by original requests), retry.budget_exhausted_count (how often the budget hits zero and retries are suppressed), circuit_breaker.trip_delay_from_retries_ms (time retries extend before circuit trips).
Organizational Follow-up: Define retry budgets in the service SLO contract: the callee publishes acceptable retry load (e.g., '110% of normal traffic during degradation'). Add retry amplification factor to the cross-service dependency dashboard. Create a runbook for when retry budgets are exhausted — this signals a dependency is failing and the circuit breaker should be inspected.
Ownership Question: "Who sets the retry budget — the caller or the callee?" Staff answer: The callee should publish a recommended retry budget in its SLO contract. The caller implements it. If the callee says "I can handle 110% of normal load during degradation," the retry budget is 10%.
Staff Signals:
- Calculates the retry amplification factor before choosing a retry strategy
- Recognizes that successful retries mask failures and delay circuit breaker tripping
- Proposes the callee publish acceptable retry budgets in its SLO contract
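The amplification arithmetic referenced throughout this deep dive is worth writing down explicitly. A sketch (function names are hypothetical) covering both the single-hop case from Drill 2 and the layered case discussed later in Opinion 2:

```python
def amplification_factor(max_retries):
    """Worst-case load multiplier at a single hop when every request fails
    and is retried: 1 original attempt + max_retries retries."""
    return 1 + max_retries

def layered_amplification(attempts_per_layer, layers):
    """Worst-case fan-out when each of `layers` tiers independently makes
    `attempts_per_layer` attempts per incoming call: exponential growth."""
    return attempts_per_layer ** layers
```

Three retries at one hop means 4x load on the degraded service; three attempts per layer through ten layers means up to 3^10 = 59,049 requests from one user action. The exponential case is why retry budgets at every layer, not just the edge, matter.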
Deep Dive 3: Resilience in Service Mesh (Istio/Envoy)
Context: Service meshes like Istio move resilience patterns (retries, timeouts, circuit breakers) from application code to the sidecar proxy.
Questions You Should Ask First:
- Can the mesh distinguish between a retryable 500 (transient) and a permanent 500 (bad request) — or does it blindly retry all errors?
- What's the right split between mesh-level and application-level resilience — should the mesh handle transport concerns while the app handles semantic concerns?
- With 500 services, how do we manage the configuration explosion of VirtualService and DestinationRule resources — standardized policies with per-service overrides?
- If the mesh is handling retries and the application is also handling retries, are we accidentally double-retrying and amplifying load?
Staff-Level Discussion:
- Advantage: consistent resilience across all services without requiring each team to implement it. The mesh handles retries, timeouts, and circuit breaking at the network layer.
- Disadvantage: mesh-level resilience doesn't understand application semantics. It can't distinguish between "this 500 is retryable" and "this 500 is permanent." It can't implement application-aware fallbacks.
- The right split: mesh handles transport-level resilience (connection timeouts, retries on connection errors, basic circuit breaking). Application handles semantic resilience (fallback rendering, retry on specific error codes, degradation tiers).
- Configuration explosion: Istio's VirtualService and DestinationRule resources configure retries, timeouts, and circuit breakers per route. With 500 services, this becomes a management challenge. Need standardized policies with per-service overrides.
Metrics to Watch:
mesh.retry.total_volume (retries generated by the mesh layer — watch for amplification), mesh.circuit_breaker.ejection_rate (Envoy outlier detection ejections), mesh.config.resource_count (total VirtualService + DestinationRule resources — track growth), mesh.app_retry.overlap_pct (requests retried by both mesh and application).
Organizational Follow-up: Define a clear boundary: mesh handles transport-level resilience (connection timeouts, connection-error retries, basic circuit breaking), applications handle semantic resilience (fallbacks, error-code-specific retries, degradation tiers). Create standardized Istio policy templates that teams extend, not copy. Audit for double-retry patterns where both mesh and application are retrying the same request.
Ownership Question: "Should resilience live in the mesh or the application?" Staff answer: Both. The mesh provides a consistent baseline (timeouts, connection-level retries, basic circuit breaking). The application provides semantic resilience (fallbacks, degradation tiers, retry on specific errors). Don't try to do everything in one layer.
Staff Signals:
- Splits resilience into transport-level (mesh) and semantic-level (application) responsibilities
- Audits for double-retry patterns where both mesh and application retry the same request
- Addresses configuration explosion with standardized policies and per-service overrides
Deep Dive 4: Load Shedding at Google Scale
Context: Google's approach to load shedding (described in the SRE book) uses adaptive admission control to protect services during overload.
Questions You Should Ask First:
- Is our load shedding static (fixed rate threshold) or adaptive (service self-measures health and sheds progressively)?
- How do we prevent criticality tag inflation — if every team labels their requests as CRITICAL, load shedding becomes meaningless?
- Does our load shedding trigger a retry storm from clients — are clients using adaptive backoff when they receive 503s?
- Who decides the criticality of a request — the caller team based on user impact, or the callee team based on server capacity?
Staff-Level Discussion:
- Google's insight: instead of static rate limits, use adaptive load shedding. The service measures its own health (CPU, latency, error rate) and starts rejecting requests when it's approaching overload.
- Client-side cooperation: clients respect the rejection and back off. This prevents the "thundering herd on recovery" problem.
- Criticality levels: each request carries a criticality tag (CRITICAL_PLUS, CRITICAL, SHEDDABLE_PLUS, SHEDDABLE). During overload, shed SHEDDABLE first, then SHEDDABLE_PLUS, etc.
- The organizational challenge: every team must correctly tag request criticality. If everything is tagged CRITICAL, load shedding doesn't work.
- Anti-pattern: load shedding that triggers more retries from clients, which generates more load, which triggers more shedding. The fix: clients must use adaptive retry (back off when they see 503s) and circuit breakers.
Metrics to Watch:
load_shedding.rejection_rate_by_criticality (are we shedding SHEDDABLE first, as designed?), load_shedding.criticality_distribution (percentage of requests tagged at each level — alert if >50% are CRITICAL), load_shedding.retry_storm_detection (rejection rate increasing despite shedding), service.adaptive_health_score (composite of CPU, latency, error rate).
Organizational Follow-up: Define criticality tagging as an SLO negotiation between caller and callee teams, not a unilateral decision. Publish a criticality taxonomy with examples (e.g., checkout = CRITICAL, analytics = SHEDDABLE). Create quarterly audits of criticality tag distribution to detect inflation. Add load shedding behavior to game day exercises.
Ownership Question: "Who decides the criticality of a request — the caller or the callee?" Staff answer: The caller tags criticality based on user impact. The callee sheds based on tags. If there's disagreement (caller says CRITICAL, callee wants to shed it), this is an SLO negotiation between teams, not a technical problem.
Staff Signals:
- Treats criticality tagging as an SLO negotiation, not a technical configuration
- Designs shedding to be adaptive based on service health, not static rate thresholds
- Ensures clients implement adaptive backoff so shedding does not trigger a retry storm
Deep Dive 5: Testing Resilience Patterns
Context: Resilience patterns are notoriously hard to test because they activate only during failure conditions that are rare in testing environments.
Questions You Should Ask First:
- Are we only testing single-failure scenarios, or do we test multi-failure cascades — what happens when 3 dependencies fail simultaneously?
- Do our chaos experiments run in production during low-traffic windows, or only in staging where the conditions don't match reality?
- What's the most valuable output of a game day — is it the technical findings, or the documentation of response time gaps and missing runbooks?
- How often are we running chaos experiments — continuously for automated fault injection, monthly for planned scenarios, quarterly for full-team game days?
Staff-Level Discussion:
- Unit testing: test the circuit breaker state machine in isolation. Verify: correct state transitions, threshold counting, half-open probing, gradual ramp.
- Integration testing: inject failures in staging. Verify: end-to-end fallback behavior, timeout handling, retry budget enforcement.
- Chaos testing in production: inject failures in production during low-traffic periods. Verify: monitoring detects the failure, circuit trips, fallback activates, recovery is clean, no customer impact.
- Game days: full-team exercise where a planned failure is injected and the team responds as if it were real. Document response time, gaps in monitoring, and missing runbooks.
- The hardest test: "What happens when 3 dependencies fail simultaneously?" Most teams test single failures. Multi-failure scenarios are where real cascading failures happen.
Metrics to Watch:
resilience.test_coverage_pct (percentage of services with validated circuit breaker, fallback, and timeout behavior), chaos.experiment.blast_radius (number of users affected by each chaos experiment), game_day.mean_time_to_detect_minutes (how long the team takes to notice injected failure), game_day.runbook_gap_count (missing or outdated runbook steps discovered).
Organizational Follow-up: Schedule quarterly game days with the full engineering team — inject multi-dependency failures and measure response time. Create a resilience testing maturity model: Level 1 (unit tests), Level 2 (integration tests), Level 3 (production chaos), Level 4 (game days). Assign each service a maturity level and track progression. Document every game day's findings — response time gaps and missing runbooks are the most valuable output.
Ownership Question: "How often should you run chaos experiments?" Staff answer: Continuously for automated fault injection (Chaos Monkey-style random instance termination). Monthly for planned chaos experiments (specific failure scenarios). Quarterly for game days (full-team exercise with stakeholders).
Staff Signals:
- Layers four testing tiers: unit, integration, production chaos, and full-team game days
- Values game day documentation of response gaps over the technical pass/fail result
- Tests multi-dependency simultaneous failures, not just single-service isolation
| Dimension | L5 (Senior) | L6 (Staff) | L7 (Principal) |
|---|---|---|---|
| Circuit Breaker | Implements the pattern | SLO-derived thresholds, gradual recovery ramp, oscillation prevention | Designs the organization's circuit breaker standards and defaults |
| Retries | Adds retry with backoff | Implements retry budgets, calculates amplification, knows when NOT to retry | Defines retry policy contracts between services |
| Fallback | Returns cached data | Designs tiered degradation with product input, staleness bounds | Defines degradation strategy as organizational policy |
| Bulkheads | Separate thread pools | Sizes pools based on dependency SLOs, monitors utilization | Designs resource isolation as a platform pattern |
| Load Shedding | Rate limiting | Priority-based shedding, adaptive admission control | Designs organization-wide criticality taxonomy |
| Testing | Unit tests circuit breaker | Chaos engineering in production, game days | Defines resilience testing standards and maturity model |
Self-Check Questions
Before your interview, verify you can answer:
- Why are per-call retries dangerous in a microservice architecture?
- How do you set circuit breaker thresholds based on an SLO?
- What's the difference between a timeout, a circuit breaker, and a bulkhead?
- Who decides what the user sees when a dependency fails?
- What happens when a circuit breaker transitions from open to half-open with 1000 instances?
9. Level Expectations Summary
| Dimension | L5 (Senior) | L6 (Staff) | L7 (Principal) |
|---|---|---|---|
| Scope | Single service | Cross-service | Org-wide strategy |
| Failure reasoning | Lists failure modes | Proposes mitigation with cost analysis | Designs preventive architecture |
| Ownership signal | Implements solution | Owns rollout + monitoring | Sets policy + review cadence |
The Bar for This Question
Mid-level (L4/E4): Can explain the three circuit breaker states (closed, open, half-open) and why the pattern exists — preventing a failing downstream from consuming upstream resources and causing cascading failures. Knows that a circuit breaker "trips" after a threshold of errors and that the half-open state tests whether the downstream has recovered. May use default library thresholds without justification, but understands the core concept.
Senior (L5/E5): Reasons about the interaction between timeouts, circuit breakers, and retries as a coordinated resilience strategy rather than independent knobs. Tunes circuit breaker thresholds based on actual service SLOs and traffic patterns instead of using arbitrary defaults. Designs meaningful fallback responses (cached data, degraded functionality, graceful error messages) and applies bulkhead isolation to prevent a single failing dependency from consuming all threads or connections.
Staff+ (L6/E6+): Derives circuit breaker thresholds from SLO error budgets — if the service has a 99.9% SLO and the error budget burns at 10x the sustainable rate, the breaker should trip within seconds, not minutes. Quantifies retry amplification: N upstream callers each retrying 3 times means a struggling downstream sees 4x its normal load at exactly the moment it can least handle it. Designs recovery strategies that prevent thundering herd on half-open transitions. Frames fallback design as a product decision (what do users see when the dependency is down?) rather than a pure engineering decision, involving product stakeholders in the degraded experience design.
10. Staff Insiders: Controversial Opinions
Opinion 1: "Most Circuit Breakers Are Misconfigured"
Teams install Resilience4j, use the default thresholds (a 50% failure rate over the default sliding window, with a 60-second wait in the open state), and never tune them. Default thresholds are almost always wrong for the specific service's traffic pattern and SLO. A circuit breaker with arbitrary thresholds provides false security — it either trips too late (damage already done) or too early (false positive degradation). If you can't explain why your threshold is set to its current value, it's wrong.
Why this differentiates: Shows you think about resilience quantitatively, not as checkbox compliance.
Opinion 2: "Retries Are the Most Dangerous Resilience Pattern"
Retries are the first pattern everyone adds and the most likely to make things worse. In a system with 10 layers of services, retries at each layer multiply: with 3 attempts per layer across 10 layers, a single user action can fan out into up to 3^10 = 59,049 requests. Retries should be the last resilience pattern you add, not the first. Start with timeouts, then circuit breakers, then bulkheads. Only add retries for idempotent operations with a retry budget.
Why this differentiates: Shows you understand the compounding effect of patterns in distributed systems.
Opinion 3: "Fallback Design Is Harder Than Circuit Breaker Design"
Building the circuit breaker takes a day. Designing the fallback takes a month. What does the product page look like without recommendations? Without pricing? Without inventory? Each combination requires product review, design mockups, and testing. Most teams build the circuit breaker and skip the fallback, which means the fallback is an error page — the worst possible user experience.
Why this differentiates: Shows you understand resilience as a product problem, not just an infrastructure problem.
Opinion 4: "Chaos Engineering Is Organizational Therapy"
The value of chaos engineering isn't the technical findings — it's forcing the organization to confront its fear of failure. Teams that run regular chaos experiments build confidence in their resilience patterns and develop muscle memory for incident response. Teams that don't are perpetually anxious about "what if X fails?" and over-engineer in the wrong places.
Why this differentiates: Demonstrates organizational thinking about engineering culture.
Opinion 5: "The Best Resilience Pattern Is Fewer Dependencies"
Every resilience pattern (circuit breakers, retries, bulkheads, fallbacks) is a band-aid for the real problem: too many dependencies. Before adding resilience patterns, ask: can we remove the dependency? Can we make the call async? Can we cache the result and tolerate staleness? The most resilient system is the one with the fewest failure domains.
Why this differentiates: Shows you think about architecture simplification, not just pattern accumulation.
Expand: How Patterns Work Together (and Against Each Other)
| Pattern A | Pattern B | Interaction | Risk |
|---|---|---|---|
| Retries | Circuit Breaker | Retries happen within the CB's counting window. Retries can mask failures (if some succeed) or accelerate tripping (if all fail). | Retries sustaining load prevent CB from tripping |
| Retries | Timeout | Retries restart the timeout clock. 3 retries × 5s timeout = 15s total wait. | Total wait time exceeds user expectation |
| Circuit Breaker | Bulkhead | CB trips on error rate; bulkhead limits concurrent requests. Both protect the caller. | Neither alone is sufficient — CB handles error rate, bulkhead handles slow responses |
| Timeout | Bulkhead | Timeout releases threads; bulkhead limits thread consumption. Complementary. | Without timeout, bulkhead threads stay occupied by slow requests |
| Load Shedding | Circuit Breaker | Load shedding rejects excess requests; CB handles dependency failures. Different failure modes. | Load shedding returning 503 can trigger callers' circuit breakers (may be desirable) |
| Retry | Load Shedding | Retries add load; load shedding rejects excess. Retries can defeat load shedding. | Retry storms overwhelm load shedding capacity |
Golden rule: Apply patterns in this order: Timeout → Circuit Breaker → Bulkhead → Retry (with budget). Each layer builds on the previous one. Don't add retries before you have timeouts and circuit breakers.
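The golden rule's layering can be sketched as a single call path, innermost timeout to outermost retry. Everything here is illustrative (class name, thresholds, budget size, and the assumption that the transport enforces the per-attempt timeout); it is a sketch of the ordering, not a production implementation:

```python
import time

class CircuitOpenError(Exception):
    pass

class ResilientClient:
    """Sketch of the layering: timeout (innermost), circuit breaker,
    bulkhead, retry with budget (outermost). Numbers are illustrative."""

    def __init__(self):
        self.cb_failures = 0        # circuit breaker: consecutive failures
        self.cb_open_until = 0.0    # circuit breaker: fail-fast deadline
        self.in_flight = 0          # bulkhead: current concurrency
        self.max_in_flight = 20     # bulkhead: concurrency cap
        self.retry_tokens = 10      # retry budget (~10% of normal traffic)

    def call(self, attempt, timeout=2.0, max_retries=2):
        last_exc = None
        for i in range(1 + max_retries):
            # Circuit breaker check comes before spending a retry token,
            # so an open circuit never consumes the retry budget.
            if time.monotonic() < self.cb_open_until:
                raise CircuitOpenError("failing fast")
            if i > 0:
                if self.retry_tokens < 1:
                    break           # budget exhausted: stop retrying
                self.retry_tokens -= 1
            if self.in_flight >= self.max_in_flight:
                raise RuntimeError("bulkhead full")
            self.in_flight += 1
            try:
                # Innermost layer: the transport is assumed to enforce
                # the per-attempt timeout (e.g. a socket read timeout).
                result = attempt(timeout=timeout)
            except Exception as exc:
                last_exc = exc
                self.cb_failures += 1
                if self.cb_failures >= 5:       # trip threshold (illustrative)
                    self.cb_open_until = time.monotonic() + 30.0
            else:
                self.cb_failures = 0            # success resets the count
                return result
            finally:
                self.in_flight -= 1
        raise last_exc
```

Note the ordering inside the loop: an open circuit rejects before a retry token is spent, and the bulkhead counter is only held for the duration of a single bounded attempt.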
Expand: How to Set Timeouts Correctly
Step 1: Measure Production Latency
For each dependency, collect p50, p95, p99, and p999 latency from production metrics over 7 days.
Step 2: Set Timeouts Based on Percentiles
| Dependency Type | Connection Timeout | Read Timeout | Rationale |
|---|---|---|---|
| In-memory cache (Redis) | 200ms | 500ms | Should be <10ms normally. 500ms = definitely broken. |
| CRUD database | 500ms | 2s | p99 is usually <500ms. 2s covers slow queries. |
| Downstream microservice | 500ms | p99 × 2-3 | Gives headroom above p99 without excessive wait. |
| Search/ML inference | 500ms | p99 × 2 (max 5s) | ML inference varies. Cap at 5s absolute. |
| External API | 1s | 10s | Outside your control. Generous but bounded. |
Step 3: Verify Total Request Budget
Sum the timeouts on the critical path. If the user-facing SLO is 3 seconds and the critical path has 3 sequential dependencies with a 2s timeout each, the worst-case wait is 6s, double the SLO. Fix: parallelize calls or reduce timeouts.
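Steps 2 and 3 reduce to simple arithmetic. A minimal sketch (function names and the 2.5× multiplier are illustrative):

```python
def read_timeout_ms(p99_ms, multiplier=2.5, cap_ms=5000):
    """Step 2 heuristic: read timeout = p99 x 2-3, capped at an absolute max."""
    return min(int(p99_ms * multiplier), cap_ms)

def fits_budget(sequential_timeout_ms, slo_ms):
    """Step 3: the worst-case sequential wait must fit the user-facing SLO."""
    return sum(sequential_timeout_ms) <= slo_ms

# Three sequential dependencies with measured p99s of 300ms, 800ms, and 400ms:
timeouts = [read_timeout_ms(p99) for p99 in (300, 800, 400)]  # [750, 2000, 1000]
fits_budget(timeouts, slo_ms=3000)  # False: 3750ms worst case exceeds a 3s SLO
```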
Step 4: Review Quarterly
Latency profiles change as services evolve. Review and update timeouts quarterly using fresh production data.
Expand: Full State Machine Specification with Configuration Parameters
State Transition Diagram
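A text sketch of the three states and the transitions the parameters below configure:

```
           failure rate >= threshold
           (min request volume met)
  CLOSED ----------------------------------> OPEN
    ^                                          |
    | all probes succeed                       | reset timeout elapses
    | (success threshold met)                  v
    +------------------------------------- HALF-OPEN
                                               |
                                               | any probe fails
                                               +--> OPEN (timeout x backoff)
```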
Configuration Parameter Reference
| Parameter | Description | Typical Range | How to Derive | Common Mistake |
|---|---|---|---|---|
| Failure threshold | Error rate or count that triggers the circuit to open | 20-60% error rate over window | Set to the rate that would burn your SLO error budget within 1-4 hours | Setting too low (5%) — trips on transient blips; setting too high (90%) — only trips during total outages |
| Sliding window size | Time window over which failures are counted | 10-60 seconds | Must contain enough requests for statistical significance; at 100 req/s, a 10s window has 1000 samples | Window too short with low traffic — 2 failures in 5 requests looks like 40% but is noise |
| Minimum request volume | Minimum requests in the window before the threshold is evaluated | 10-100 requests | Set to the count where error rate is statistically meaningful | Omitting this entirely — circuit trips on 1 failure out of 2 requests during off-peak |
| Reset timeout (open duration) | How long the circuit stays open before transitioning to half-open | 15-120 seconds | Long enough for the dependency to recover; short enough to detect recovery promptly | Too short (5s) — hammers recovering service; too long (10 min) — unnecessary fallback duration |
| Half-open max requests | Number of probe requests allowed in half-open state | 3-10 requests | Enough to be statistically confident the dependency recovered; few enough to not overload it | Allowing unlimited probes — defeats the purpose of half-open |
| Success threshold | Number of consecutive successes in half-open to close the circuit | Equal to half-open max requests (all must succeed) | All probes must succeed to confirm recovery; partial success means the dependency is still flaky | Requiring only 1 success — closing the circuit based on a single lucky request |
| Failure count reset interval | How often the failure counter resets in closed state | Equal to sliding window | Prevents stale failures from accumulating across unrelated incidents | Never resetting — ancient failures from hours ago contribute to a trip today |
| Exponential backoff on re-trip | Multiplier on reset timeout after repeated trips | 2× per consecutive trip, max 5-10 minutes | Prevents oscillation — each re-trip doubles the cool-down period | No backoff — circuit oscillates between open and closed every 30 seconds indefinitely |
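A minimal sketch of the state machine these parameters drive: failure threshold over a sliding window with a minimum volume, reset timeout with exponential backoff on re-trip, and half-open probes that must all succeed. All names and defaults are illustrative, not from any specific library, and in-flight probe accounting is deliberately omitted:

```python
import time

class CircuitBreaker:
    """Illustrative state machine for the parameters above; not production code."""
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=0.5, window_seconds=30,
                 min_request_volume=20, reset_timeout=30, half_open_max=5,
                 backoff_factor=2, max_reset_timeout=600, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.min_request_volume = min_request_volume
        self.base_reset_timeout = reset_timeout
        self.half_open_max = half_open_max
        self.backoff_factor = backoff_factor
        self.max_reset_timeout = max_reset_timeout
        self.clock = clock
        self.state = self.CLOSED
        self.samples = []               # (timestamp, ok) in the sliding window
        self.opened_at = 0.0
        self.consecutive_trips = 0
        self.half_open_results = []

    def _reset_timeout(self):
        # Exponential backoff on re-trip, capped at max_reset_timeout.
        t = self.base_reset_timeout * self.backoff_factor ** max(0, self.consecutive_trips - 1)
        return min(t, self.max_reset_timeout)

    def allow_request(self):
        now = self.clock()
        if self.state == self.OPEN:
            if now - self.opened_at >= self._reset_timeout():
                self.state = self.HALF_OPEN     # start a probe cycle
                self.half_open_results = []
            else:
                return False                    # fail fast
        if self.state == self.HALF_OPEN:
            return len(self.half_open_results) < self.half_open_max
        return True

    def record(self, ok):
        now = self.clock()
        if self.state == self.HALF_OPEN:
            self.half_open_results.append(ok)
            if not ok:                          # any probe failure reopens
                self._trip(now)
            elif len(self.half_open_results) == self.half_open_max:
                self.state = self.CLOSED        # all probes succeeded
                self.consecutive_trips = 0
                self.samples = []
            return
        self.samples.append((now, ok))
        self.samples = [(t, r) for t, r in self.samples
                        if now - t <= self.window_seconds]
        failures = sum(1 for _, r in self.samples if not r)
        if (len(self.samples) >= self.min_request_volume
                and failures / len(self.samples) >= self.failure_threshold):
            self._trip(now)

    def _trip(self, now):
        self.state = self.OPEN
        self.opened_at = now
        self.consecutive_trips += 1
        self.samples = []
```

The `clock` parameter exists so the machine can be exercised deterministically in tests, which is also how you would verify a real configuration against the table's "common mistakes" column.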
Expand: How Circuit Breakers Interact with Retries, Bulkheads, and Timeouts
The Resilience Stack
Resilience patterns are not independent — they form a layered stack where each pattern interacts with the others. The order matters: applying them incorrectly creates compounding failures instead of preventing them.
Pattern Combination Reference
| Pattern Combination | When to Use | Risk if Misconfigured | Correct Configuration |
|---|---|---|---|
| Timeout + Circuit Breaker | Every outgoing call. Timeout prevents blocking; CB prevents repeated calls to failed dependency. | Timeout too long → threads block before CB trips. Timeout too short → false failures feed CB's failure counter. | Timeout = 2-3× p99. CB evaluates failures after timeout applies. |
| Retry + Circuit Breaker | Idempotent operations where transient recovery is likely. | Retries keep the failure counter from reaching threshold (some succeed, resetting counts). Retries sustain load on a degraded service. | Retry budget ≤ 10%. CB failure counter includes retried requests. Retries stop when circuit opens. |
| Retry + Timeout | Short-lived transient failures (DNS blip, connection reset). | An initial attempt plus 3 retries at a 5s timeout each = up to 20s total wait — exceeds user-facing SLA. | Total retry budget (attempts × timeout) must fit within the request's overall SLA budget. |
| Bulkhead + Circuit Breaker | When multiple dependencies share infrastructure. Bulkhead isolates resource consumption; CB isolates failure detection. | Bulkhead sized too large → slow dependency consumes all resources before CB trips. Bulkhead too small → rejects requests even when dependency is healthy. | Bulkhead size = expected concurrent requests at p99 latency × 1.5 headroom. CB threshold based on error rate within the bulkhead. |
| Bulkhead + Timeout | Any service calling multiple downstream dependencies. | Without timeout, bulkhead threads are occupied indefinitely by slow calls. Bulkhead "works" but all threads are blocked. | Timeout releases threads. Bulkhead caps concurrent threads. Together they bound both time and concurrency per dependency. |
| Circuit Breaker + Fallback | Every circuit breaker must have a defined fallback. | No fallback → circuit open means hard error to user. Stale fallback → incorrect data silently served. | Fallback defined by product team. Staleness bounds documented. Monitoring tracks fallback utilization rate. |
| Full Stack (Timeout → Retry → CB → Bulkhead → Fallback) | Critical paths with strict SLOs and multiple dependencies. | Any misconfiguration compounds: retries amplify load, slow timeouts exhaust bulkheads, CBs trip too late. | Each layer configured relative to the others. Total timeout budget ≤ SLA. Retry budget ≤ 10%. Bulkhead ≥ peak concurrency. CB threshold derived from SLO burn rate. |
Integration Anti-Patterns
- Retries outside the circuit breaker: If retries happen before the circuit breaker check, a retry to an open circuit wastes a retry token and adds latency. Retries must be inside the circuit breaker scope — the CB rejects immediately, no retry consumed.
- Timeout longer than circuit breaker evaluation window: If timeout is 30 seconds but the CB evaluates over a 10-second window, threads block for 30 seconds but the CB only sees failures from 10 seconds ago. The CB may never trip because each 10-second window only contains 1-2 timeout failures.
- Bulkhead without timeout: The bulkhead limits concurrency to 20 threads. All 20 threads call a dependency that takes 5 minutes to respond. The bulkhead is "full" but doing nothing useful. Timeouts must release threads for the bulkhead to function.
- Per-layer fallbacks: Each layer (timeout, retry, CB) has its own error handling. When they conflict, the user gets inconsistent behavior — a timeout returns "try again later" while the CB fallback returns cached data. Unify fallback behavior at the outermost layer.
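The first anti-pattern has a concrete shape in code. With the circuit breaker check outermost, an open circuit rejects once, before any retry token is spent. A sketch, assuming a breaker object with hypothetical `allow_request()`/`record()` methods:

```python
class CircuitOpenError(Exception):
    pass

def call_with_retries(breaker, attempt, max_retries=2):
    """Correct nesting: the circuit breaker check wraps the retry loop.
    When the circuit is open, we fail fast once -- no retry tokens are
    consumed and no retry latency is added."""
    if not breaker.allow_request():
        raise CircuitOpenError("open circuit: rejected before any retry")
    last_exc = None
    for _ in range(1 + max_retries):
        try:
            result = attempt()
        except Exception as exc:        # retryable failure (sketch: all errors)
            last_exc = exc
            breaker.record(ok=False)    # each attempt feeds the failure counter
        else:
            breaker.record(ok=True)
            return result
    raise last_exc
```

Inverting the nesting (retry loop outside the breaker check) is exactly the anti-pattern: every retry pays the breaker's rejection latency and burns a token that a healthy request could have used.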
Expand: Metrics, Dashboards, and Alerting for Circuit Breaker Health
Key Metrics to Instrument
| Metric | What It Measures | Why It Matters | Collection Frequency |
|---|---|---|---|
| State transitions/min | How often the circuit changes state (closed→open, open→half-open, half-open→closed) | High transition rate indicates oscillation — the dependency is flapping between healthy and unhealthy | Real-time (emit on every transition) |
| Time-in-open (seconds) | Duration the circuit stays in open state per trip | Long open durations mean the dependency is down for extended periods; short durations with frequent trips mean oscillation | Calculated per trip, aggregated over 1h/24h |
| Half-open success rate | Percentage of probe requests that succeed in half-open state | Low success rate means the dependency isn't recovering between probe cycles; consider extending the open duration | Per probe cycle |
| Fallback utilization rate | Percentage of requests served by fallback vs primary dependency | Rising fallback rate without a circuit trip means the CB threshold may be too conservative — users are getting degraded data but the circuit hasn't tripped | Per minute, 5-minute rolling average |
| Request rejection rate | Percentage of requests rejected by the open circuit (fail-fast) | Shows user impact during outage — how many users are seeing degraded experience | Per second, real-time |
| Error rate within closed state | The current error rate while the circuit is closed | Tracks whether the dependency is approaching the trip threshold — an early warning signal | Per evaluation window (10-60 seconds) |
| Recovery ramp progress | Current traffic percentage during gradual ramp recovery | Shows recovery progress — stuck at 10% for 5 minutes means the dependency can't handle more load | Real-time during recovery |
| Retry budget consumption | Percentage of the retry budget currently in use | Approaching 100% means the system is under stress; hitting 100% means retries are being dropped | Per second |
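Whatever library or proxy emits them, most of these metrics reduce to a few counters and gauges fed by state-change hooks. A minimal sketch, assuming a hypothetical `on_transition`/`on_request` callback interface (metric names are illustrative, not from any specific library):

```python
import time
from collections import Counter

class BreakerMetrics:
    """Sketch of the instrumentation from the table above."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.transitions = Counter()    # state_transitions_total{from,to}
        self.time_in_open = []          # per-trip open durations (seconds)
        self.fallback_served = 0
        self.total_requests = 0
        self._opened_at = None

    def on_transition(self, old, new):
        """Called by the breaker on every state change."""
        self.transitions[(old, new)] += 1
        if new == "open":
            self._opened_at = self.clock()
        elif old == "open" and self._opened_at is not None:
            self.time_in_open.append(self.clock() - self._opened_at)
            self._opened_at = None

    def on_request(self, served_by_fallback):
        """Called once per request, after routing to primary or fallback."""
        self.total_requests += 1
        if served_by_fallback:
            self.fallback_served += 1

    def fallback_utilization(self):
        return self.fallback_served / max(1, self.total_requests)
```

In practice you would export these through your metrics library (e.g. Prometheus counters and histograms) rather than keeping them in memory; the point is that state-transition hooks are sufficient to derive transitions/min, time-in-open, and fallback utilization.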
Alerting Rules by Circuit State
| Circuit State | Alert Condition | Severity | Action |
|---|---|---|---|
| Closed | Error rate > 50% of trip threshold for > 5 min | Warning | Investigate dependency health — may be approaching a trip |
| Closed | Retry budget consumption > 80% | Warning | Dependency is degraded but not tripping the circuit — check thresholds |
| Open | Circuit open for > 2 min on a critical dependency | Page (P1) | Dependency is down. Verify fallback is active. Check dependency team's status. |
| Open | Circuit open for > 10 min on any dependency | Page (P2) | Extended outage. Escalate to dependency team. Verify fallback data freshness. |
| Half-Open | 3 consecutive probe cycles fail | Warning | Dependency not recovering. Consider extending open duration or escalating. |
| Half-Open | All instances enter half-open within 5 seconds of each other | Warning | Thundering herd on probe. Add jitter to reset timeout per instance. |
| Any | > 5 state transitions in 10 minutes | Warning | Oscillation detected. Increase open duration or add exponential backoff. |
| Closed | Fallback utilization > 5% while circuit is closed | Warning | Errors below trip threshold but fallback is serving traffic — possible threshold misconfiguration. |
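The oscillation rule (> 5 state transitions in 10 minutes) reduces to a sliding-window count over transition timestamps. A minimal sketch (names are illustrative):

```python
def oscillation_alert(transition_timestamps, now, window_s=600, threshold=5):
    """Fire when more than `threshold` state transitions occurred in the
    last `window_s` seconds -- the breaker is flapping and needs a longer
    open duration or exponential backoff."""
    recent = [t for t in transition_timestamps if now - t <= window_s]
    return len(recent) > threshold
```

The same shape works for the half-open rule (count consecutive failed probe cycles) and the thundering-herd rule (count instances entering half-open within a short interval).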
Dashboard Design Recommendations
The circuit breaker dashboard should be organized as a dependency-centric view, not a service-centric view. Each dependency gets a panel showing:
- Current state — large status indicator (green/red/yellow for closed/open/half-open)
- Error rate trend — 1-hour graph with the trip threshold marked as a horizontal line
- State transition timeline — when the circuit last tripped, how long it stayed open, recovery time
- Fallback utilization — percentage of requests served by fallback in the last hour
- Retry budget gauge — current consumption as a percentage bar
Expand: Circuit Breaker Implementations Across Languages and Infrastructure Layers
Implementation Comparison Matrix
| Implementation | Language / Platform | Client vs Proxy | Configuration Style | Observability | Maintenance Status | Best For |
|---|---|---|---|---|---|---|
| Hystrix | Java | Client-side (in-process) | Programmatic (annotations, code) | Excellent (Hystrix Dashboard, Turbine aggregation) | Deprecated (2018) — in maintenance mode, no new features | Legacy Java services that already use it; do not adopt for new projects |
| Resilience4j | Java, Kotlin | Client-side (in-process) | Programmatic + config files (YAML, properties) | Good (Micrometer integration, Prometheus, Grafana) | Active — actively maintained, functional/reactive programming support | New Java/Kotlin services; Spring Boot integration; reactive applications |
| Polly | .NET (C#, F#) | Client-side (in-process) | Fluent API (policy builder pattern) | Moderate (manual metric emission, custom telemetry) | Active — widely adopted in .NET ecosystem | .NET microservices; Azure-hosted applications; clean policy-as-code |
| Envoy | Any (sidecar proxy) | Proxy-side (out-of-process) | Declarative (YAML, xDS API) | Excellent (built-in stats, Prometheus integration, distributed tracing) | Active — CNCF graduated project | Polyglot environments; infrastructure-managed resilience; Kubernetes |
| Istio | Any (service mesh) | Mesh-level (control plane + Envoy data plane) | Declarative (Kubernetes CRDs: VirtualService, DestinationRule) | Excellent (Kiali, Jaeger, Prometheus — built into the mesh) | Active — CNCF graduated project | Large Kubernetes deployments; organization-wide resilience policies |
| Sentinel | Java | Client-side + dashboard | Programmatic + rules engine (dashboard UI) | Good (built-in dashboard, Prometheus) | Active — Alibaba open source, widely used in Chinese tech ecosystem | High-traffic Java services; flow control + circuit breaking combined |
| gobreaker | Go | Client-side (in-process) | Programmatic (struct configuration) | Basic (callback hooks for custom metrics) | Active — lightweight, minimal dependencies | Go microservices wanting a simple, no-dependency circuit breaker |
| opossum | Node.js | Client-side (in-process) | Programmatic (options object) | Moderate (event emitter for custom metrics, Prometheus plugin) | Active — most popular Node.js circuit breaker | Node.js microservices; Express/Fastify middleware integration |
Application-Level vs Infrastructure-Level Circuit Breaking
| Dimension | Application-Level (Resilience4j, Polly) | Infrastructure-Level (Envoy, Istio) |
|---|---|---|
| Semantics awareness | Full — can distinguish retryable vs permanent errors, apply business-logic fallbacks | None — operates on HTTP status codes and connection errors only |
| Configuration ownership | Application team owns thresholds per dependency | Platform/infra team owns default policies; application teams override via CRDs |
| Deployment coupling | Deployed with the application; changes require app deployment | Deployed independently; policy changes are live without app redeployment |
| Language constraints | Library must exist for your language | Language-agnostic — proxy handles all traffic regardless of application language |
| Fallback behavior | Application implements rich fallbacks (cached data, degraded UX, queued operations) | Limited to returning error codes or routing to a static fallback endpoint |
| Operational overhead | Per-application: each team maintains their own configuration | Centralized: platform team manages mesh-wide policies with per-service overrides |
| Debugging complexity | Stack traces show circuit breaker state in application logs | Circuit breaker state lives in the proxy — requires Envoy admin interface or mesh dashboard |
When to Use Each Layer
- Application-level only: Small teams (<5 services), single language, need rich fallback behavior, no Kubernetes.
- Infrastructure-level only: Polyglot environment, need consistent baseline resilience, can accept simple fallbacks (error codes, static responses).
- Both layers (recommended for Staff-level systems): Infrastructure layer handles transport-level resilience (connection errors, basic circuit breaking). Application layer handles semantic resilience (application-specific fallbacks, retry on specific error codes, degradation tiers). The mesh provides a safety net; the application provides intelligence.