Degraded Mode Framework
What This Framework Covers
Every distributed system faces the question: when a dependency fails, what do we do with requests?
This is the fail-open vs fail-closed decision. It's not a technical choice — it's a business risk decision with explicit ownership.
The Core Decision
When to Use Each
| Context | Recommended | Why | Who Signs Off |
|---|---|---|---|
| Abuse protection | Fail-open | Limiter shouldn't become a kill switch | Security + Infra |
| Billing / quota | Fail-closed | Cannot give away paid resources | Product + Finance |
| Auth / access control | Fail-closed | Cannot grant unauthorized access | Security |
| Feature flags | Fail-closed (usually) | New code shouldn't run without validation | Product |
| Cache miss | Fail-open to origin | Stale data often acceptable | Product |
| Internal service call | Depends on cascade | Analyze downstream impact | Service owner |
The Staff Framework
Step 1: Name the Intent
Before deciding fail-open or fail-closed, answer:
- What is this system protecting?
- What is the cost of over-admission? (resources, money, security)
- What is the cost of over-rejection? (availability, revenue, user trust)
Step 2: Choose with Guardrails
Fail-open is never "just allow everything." It requires:
| Guardrail | Purpose |
|---|---|
| Aggressive timeout | Don't let the dependency slow you down (5-10ms for fast paths) |
| Conservative fallback limits | If bypassing rate limiter, apply local caps |
| Bypass-rate alerting | If bypass_rate > threshold, page on-call |
| Circuit breaker | If backend shows stress, tighten limits or shed load |
| Audit logging | Record bypassed requests for post-incident analysis |
Fail-closed is never "just reject." It requires:
| Guardrail | Purpose |
|---|---|
| Graceful degradation | Return partial data or cached response if possible |
| Clear error messaging | Tell the client what happened and when to retry |
| Fast recovery detection | Reopen the circuit quickly when dependency recovers |
| Escalation path | On-call can manually override if needed |
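The fail-closed guardrails can be sketched the same way: reject on dependency failure, but first try graceful degradation (a recent cached decision), and always return clear retry guidance. The quota client and cache shape here are illustrative assumptions.

```python
import time

CACHE_TTL = 60  # Seconds a cached quota decision remains acceptable

def check_quota(quota_client, tenant, cache):
    """Fail closed on dependency failure, but degrade gracefully first."""
    try:
        allowed = quota_client.check(tenant)         # Authoritative answer
        cache[tenant] = (allowed, time.monotonic())
        return {"status": 200, "allowed": allowed}
    except Exception:
        cached = cache.get(tenant)
        if cached and time.monotonic() - cached[1] < CACHE_TTL:
            # Graceful degradation: a recent cached decision beats a hard reject
            return {"status": 200, "allowed": cached[0], "degraded": True}
        # Fail closed with clear messaging and retry guidance
        return {"status": 503, "error": "quota_unavailable", "retry_after_sec": 5}
```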
Step 3: Define the Governance
| Question | Must Have Answer |
|---|---|
| Who can flip fail-open/closed? | On-call? SRE? Requires approval? |
| What is the kill-switch scope? | Per-endpoint? Per-tenant? Global? |
| How fast can we change it? | Config push? Requires deploy? |
| What is the audit trail? | Who changed what, when, why? |
| What post-incident analysis is required? | Review bypassed requests? Customer impact? |
L6 vs L7 Calibration
| Dimension | L6 (Staff) | L7 (Principal) |
|---|---|---|
| Decision speed | Makes the choice in 30 seconds with clear reasoning | Same, but immediately asks about governance |
| Guardrails | Names 3-4 specific mitigations | Designs the guardrail system + monitoring |
| Ownership | "Who operates this?" | "Who can change this behavior in production?" |
| Blast radius | Considers single service | Considers org-wide consistency |
Common Interview Probes
| After You Say... | They Will Ask... |
|---|---|
| "Fail-open for availability" | "What prevents abuse during the window?" |
| "Fail-closed for correctness" | "What's the user experience? Can they retry?" |
| "We'll add alerting" | "What's the threshold? Who gets paged?" |
| "Circuit breaker" | "What signals trigger it? How fast does it recover?" |
Anti-Patterns
1. "We'll decide at runtime"
Degraded mode behavior should be decided at design time, not discovered during an incident.
2. "Fail-open by default"
Without explicit guardrails, fail-open becomes "accept all abuse."
3. "The dependency will be reliable"
Dependencies fail. The question is when, not if.
4. "We'll just add retries"
Retries without backoff + jitter turn a partial outage into a retry storm.
5. "Fail-closed is always safer"
Fail-closed can cause cascading failures if upstream services can't handle rejections.
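Anti-pattern 4's fix, retries with exponential backoff plus full jitter, can be sketched in a few lines. The `sleep` parameter is injectable purely so the delays can be observed in tests; names are illustrative.

```python
import random
import time

def retry_with_jitter(operation, max_attempts=5, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Retry a failing operation with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the failure
            # Full jitter: uniform delay in [0, min(cap, base * 2^attempt)]
            sleep(rng() * min(cap, base * 2 ** attempt))
```

The jitter spreads retries across time so thousands of clients do not hammer a recovering dependency in lockstep.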
Applying to Specific Systems
Rate Limiting
- Abuse protection: Fail-open with local caps + bypass alerting
- Billing/quota: Fail-closed with graceful 503
Circuit Breaker
- Default: Fail-closed (that's the point)
- But: Define the half-open behavior for recovery
Feature Flags
- New features: Fail-closed (don't expose untested code)
- Kill switches: Fail-open (the kill switch should always work)
Cache
- Read path: Fail-open to origin (stale is better than error)
- Write path: Depends on consistency requirements
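The cache read path above ("stale is better than error") can be sketched as a read-through helper that fails open to the origin on a miss and falls back to stale data when the origin is down. The cache shape and origin fetcher are illustrative assumptions.

```python
import time

TTL = 30  # Seconds before a cached entry is considered stale

def read_through(key, cache, fetch_origin):
    entry = cache.get(key)  # (value, stored_at) or None
    if entry and time.monotonic() - entry[1] < TTL:
        return entry[0]                        # Fresh hit
    try:
        value = fetch_origin(key)              # Miss or stale: fail open to origin
        cache[key] = (value, time.monotonic())
        return value
    except Exception:
        if entry:
            return entry[0]                    # Origin down: stale beats error
        raise                                  # No copy at all: surface the failure
```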
Staff Sentence Templates
Implementation Deep Dive
1. Circuit Breaker with Fallback — The Core Degraded-Mode Primitive
A circuit breaker monitors downstream health and short-circuits requests when a dependency is failing. The key Staff insight: the circuit breaker itself is not the degradation strategy — the fallback behavior is.
Circuit Breaker State Machine
A runnable Python sketch of the state machine; `metrics` stands in for an assumed stats client:

```python
import time

class CircuitBreaker:
    """Three states: closed (normal), open (short-circuit), half-open (probing)."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0,
                 half_open_max_attempts=3, call_timeout=2.0):
        self.state = "closed"                                 # Normal operation
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.half_open_attempts = 0
        self.failure_threshold = failure_threshold            # Failures before opening
        self.reset_timeout = reset_timeout                    # Seconds before trying half-open
        self.half_open_max_attempts = half_open_max_attempts  # Successes needed to close
        self.call_timeout = call_timeout                      # Per-request timeout (seconds)

    def call(self, operation, fallback):
        if self.state == "open":
            if time.monotonic() - self.last_failure_time > self.reset_timeout:
                self.state = "half-open"                      # Time to probe again
                self.half_open_attempts = 0
            else:
                metrics.increment("circuit.short_circuited")
                return fallback()

        if self.state == "half-open":
            try:
                result = operation(timeout=self.call_timeout)
            except Exception:
                self.state = "open"                           # Probe failed: reopen
                self.last_failure_time = time.monotonic()
                return fallback()
            self.half_open_attempts += 1
            if self.half_open_attempts >= self.half_open_max_attempts:
                self.state = "closed"
                self.failure_count = 0
                metrics.increment("circuit.closed")
            return result

        # state == "closed"
        try:
            result = operation(timeout=self.call_timeout)
            self.failure_count = max(0, self.failure_count - 1)  # Gradual recovery
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            if self.failure_count >= self.failure_threshold:
                self.state = "open"
                metrics.increment("circuit.opened")
            return fallback()
```
2. Tiered Degradation — The Staff Default
A single fail-open/fail-closed decision is too coarse. Staff engineers implement tiered degradation where each tier provides less functionality but maintains core business operations.
Tiered Degradation Implementation
A Python sketch; the health checks, response builders, and `metrics` client are assumed externals:

```python
# Service health tiers, ordered from most to least functional
DEGRADATION_TIERS = {
    "full":      {"features": ["all"],            "description": "Normal operation"},
    "degraded":  {"features": ["core", "cached"], "description": "Non-critical features disabled"},
    "minimal":   {"features": ["core"],           "description": "Only essential operations"},
    "emergency": {"features": [],                 "description": "Static error page / maintenance mode"},
}
TIER_INDEX = {name: i for i, name in enumerate(DEGRADATION_TIERS)}  # For numeric gauges

def get_current_tier():
    health_checks = {
        "database": check_database(),        # Primary + replicas
        "cache": check_cache(),              # Redis cluster
        "auth": check_auth(),                # Auth service
        "search": check_search(),            # Elasticsearch
        "recommendations": check_recs(),     # ML service
    }
    # Database or auth down: nothing meaningful can be served
    if not health_checks["database"] or not health_checks["auth"]:
        return "emergency"
    failing_count = sum(1 for healthy in health_checks.values() if not healthy)
    if failing_count >= 3:
        return "minimal"
    if failing_count >= 1:
        return "degraded"
    return "full"

def handle_request(request):
    tier = get_current_tier()
    metrics.gauge("service.degradation_tier", TIER_INDEX[tier])  # Gauges take numbers
    if tier == "full":
        return full_response(request)
    if tier == "degraded":
        response = core_response(request)
        response.recommendations = cached_recommendations(request.user_id)  # Stale OK
        response.search_results = None                                      # Feature disabled
        response.headers["X-Degraded"] = "search,recommendations"
        return response
    if tier == "minimal":
        return core_only_response(request)   # No personalization, no search
    return static_maintenance_page()         # tier == "emergency"
```
3. Load Shedding — Protecting the Core Under Overload
When the system is overwhelmed, serving everyone slowly is worse than serving some users well and rejecting others cleanly. Load shedding proactively rejects requests to maintain SLA for accepted traffic.
Priority-Based Load Shedding
A Python sketch; `metrics`, `Response`, and `process_request` are assumed externals:

```python
# Shed low-priority traffic first, then medium, and so on
PRIORITY_THRESHOLDS = {
    "low":      0.70,  # Start shedding at 70% CPU
    "medium":   0.80,  # Start shedding at 80% CPU
    "high":     0.90,  # Start shedding at 90% CPU
    "critical": 0.98,  # Almost never shed
}

def classify_priority(request):
    if request.path.startswith(("/api/payments", "/api/checkout")):
        return "critical"                    # Revenue-generating
    if request.path.startswith("/api/auth"):
        return "high"                        # User access
    if request.path.startswith("/api/search"):
        return "medium"                      # Valuable but not essential
    if request.path.startswith("/api/recommendations"):
        return "low"                         # Nice to have
    return "medium"

def admission_control(request):
    """Admission control at the load balancer / gateway."""
    priority = classify_priority(request)
    current_load = metrics.get("system.cpu_utilization")  # 0.0 to 1.0
    if current_load > PRIORITY_THRESHOLDS[priority]:
        metrics.increment("load_shed.rejected", tags=["priority:" + priority])
        return Response(503, headers={"Retry-After": "5"},
                        body={"error": "service_overloaded", "retry_after_sec": 5})
    return process_request(request)
```
4. Feature Flags for Graceful Degradation
Feature flags are not just for rollouts — they are the primary mechanism for disabling non-essential features during incidents.
Kill Switch Pattern
A Python sketch; `kill_switch_service`, `audit_log`, `alerting`, `metrics`, and `now` are assumed externals:

```python
# Feature flag configuration (stored in Redis or a config service)
FEATURE_FLAGS = {
    "recommendations.enabled": {"default": True, "kill_switch": True},
    "search.autocomplete":     {"default": True, "kill_switch": True},
    "analytics.client_events": {"default": True, "kill_switch": True},
    "notifications.push":      {"default": True, "kill_switch": True},
    "checkout.express":        {"default": True, "kill_switch": False},  # Never disable
}

def is_feature_enabled(feature_name, context):
    flag = FEATURE_FLAGS[feature_name]
    # Kill switch override: the fastest possible check
    if flag["kill_switch"] and kill_switch_service.is_disabled(feature_name):
        metrics.increment("feature.killed", tags=["feature:" + feature_name])
        return False
    return flag["default"]

# On-call incident response: disable, audit, notify
def disable_feature(feature_name, reason, operator):
    kill_switch_service.disable(feature_name)
    audit_log.record({
        "action": "feature_disabled",
        "feature": feature_name,
        "reason": reason,
        "operator": operator,
        "timestamp": now(),
    })
    alerting.notify(f"Feature disabled: {feature_name} by {operator}")
```
Architecture Diagram
Normal flow (solid): Requests pass through load shedding, circuit breakers, and feature flag checks. All services respond normally.
Degraded flow (dashed): When a dependency fails, the circuit breaker routes to the fallback — cached data for catalog/recommendations, static responses for emergency mode. Kill switches disable non-critical services instantly.
Failure Scenarios
1. Cascading Failure — Auth Service Latency Spike
Timeline: The auth service response time increases from 5ms to 2,000ms due to a database migration. Every request now takes 2 seconds longer. Thread pools fill across all API servers. Requests queue behind slow auth calls. Within 3 minutes, all services are unresponsive — not because they are broken, but because they are waiting for auth.
Blast radius: Total site outage. Every authenticated endpoint is affected. Even health check endpoints return 503 because the thread pool is exhausted.
Detection: Auth service p99 latency exceeds 500ms. API server thread pool utilization hits 100%. Error rate spikes across all services simultaneously (a signature of upstream dependency failure, not service-specific bugs).
Recovery:
- Circuit breaker on auth opens after 5 consecutive timeouts (configured at 200ms) — requests fail fast instead of queueing
- Auth circuit breaker fallback: fail-closed (return 401) for new sessions, fail-open (accept existing valid JWT) for requests with unexpired tokens
- Once auth service recovers, circuit breaker transitions to half-open and gradually restores traffic
- Post-incident: add bulkhead isolation — auth calls get their own thread pool so a slow auth cannot starve other operations
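The mixed fallback in the recovery steps (fail-open for unexpired tokens, fail-closed otherwise) works because an HS256 JWT's signature and expiry can be checked locally. A minimal sketch, assuming HS256 and a shared secret; claim names are illustrative, and a production check would also validate the header's `alg`.

```python
import base64
import hashlib
import hmac
import json
import time

def b64url_decode(s):
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def validate_jwt_locally(token, secret):
    """Return claims if signature and expiry check out locally, else None (fail closed)."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        expected = hmac.new(secret,
                            f"{header_b64}.{payload_b64}".encode(),
                            hashlib.sha256).digest()
        if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
            return None                       # Bad signature: reject
        claims = json.loads(b64url_decode(payload_b64))
        if claims.get("exp", 0) <= time.time():
            return None                       # Expired: requires the auth service
        return claims
    except Exception:
        return None                           # Anything malformed: fail closed
```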
2. Kill Switch Race — Feature Disabled During High Traffic
Timeline: On-call disables the recommendations feature during a Black Friday traffic spike to reduce load. The kill switch propagation takes 15 seconds to reach all 50 app servers. During the propagation window, some servers return recommendations while others do not. Clients see inconsistent behavior — products appear and disappear from recommendation carousels on page refresh.
Blast radius: User-facing inconsistency for 15 seconds. No data loss, no incorrect transactions — purely a visual inconsistency that resolves once propagation completes.
Detection: Feature flag audit log shows the disable event. Recommendation service request rate drops gradually over 15 seconds (not instantly).
Recovery:
- Accept the 15-second propagation window as a known limitation — document it in the runbook
- For features where inconsistency matters, use client-side feature flags: the server responds with the flag state, and the client caches it for the session
- For faster propagation, use a push-based mechanism (Redis pub/sub) instead of polling with a 5-second interval
3. Load Shedding Starvation — Low-Priority Requests Never Served
Timeline: System operates at 75% CPU during peak hours for three consecutive weeks. Load shedding continuously rejects "low" priority requests (recommendations API, analytics events). Recommendation model training starves because it depends on analytics events. Recommendation quality degrades. Product team files a bug: "recommendations are terrible." Root cause: analytics data pipeline has been shed for three weeks.
Blast radius: Recommendation quality degrades silently over weeks. The feedback loop (user behavior → analytics → model training → recommendations) is broken. Revenue impact from poor recommendations is real but difficult to attribute.
Detection: Load shedding metrics show sustained rejection of low-priority traffic for >24 hours. Analytics pipeline lag alert (if monitored). Recommendation A/B test metrics show declining engagement.
Recovery:
- Immediate: increase capacity to reduce CPU below shedding threshold, or temporarily promote analytics events to "medium" priority
- Short-term: add a "minimum guaranteed throughput" per priority tier — even low-priority traffic gets at least 5% of capacity
- Long-term: separate the analytics ingestion path from the user-facing API. Analytics events should not compete with user requests for the same capacity
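The short-term fix above, a minimum guaranteed throughput per tier, can be sketched as an admission check with a per-tier floor. The shares are illustrative; `in_flight` maps each tier to its currently admitted request count.

```python
# Illustrative reserved shares: even "low" keeps 5% of capacity under shedding
MIN_SHARE = {"critical": 0.50, "high": 0.25, "medium": 0.20, "low": 0.05}

def admit(priority, in_flight, capacity, shedding):
    """Admit while under the tier's guaranteed share, even during shedding."""
    if in_flight[priority] < int(MIN_SHARE[priority] * capacity):
        return True                            # Guaranteed minimum is never shed
    if shedding:
        return False                           # Above the floor, shedding applies
    return sum(in_flight.values()) < capacity  # Normal operation: any headroom
```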
Staff Interview Application
How to Introduce This Pattern
Lead with the classification framework, then give specific fallback behaviors. This tells the interviewer you think about failure as a design constraint, not an afterthought.
When NOT to Use This Pattern
- Single-dependency system: If your service has one database and no external dependencies, degraded mode means "database is down, nothing works." No amount of degradation design helps — invest in database reliability instead.
- Batch systems: Offline jobs can simply retry or wait. Degraded mode is for real-time request paths where users are waiting.
- Strong consistency requirements everywhere: If the business cannot tolerate stale data in any response (financial trading, medical records), fail-closed on every dependency is the only option. Degradation via cached data is not applicable.
Follow-Up Questions to Anticipate
| Interviewer Asks | What They Are Testing | How to Respond |
|---|---|---|
| "Should auth fail open or closed?" | Business risk reasoning | "Fail-closed for new sessions — we cannot grant unauthorized access. But for requests with a valid, unexpired JWT, we can validate locally without calling the auth service. The token's signature and expiry are self-contained." |
| "How do you decide what to shed first?" | Priority classification | "Revenue-generating paths (checkout, payments) are critical. User access (auth, profile) is high. Discovery (search, recommendations) is medium. Analytics and telemetry are low. We shed from the bottom." |
| "What if the circuit breaker flaps?" | Operational maturity | "The half-open state sends a small percentage of traffic to test recovery. If the dependency flaps, the circuit stays open with exponential backoff on the reset timeout — 30s, 60s, 120s — so we're not constantly reopening into a failing service." |
| "How do you test degraded mode?" | Quality practices | "Chaos engineering: inject failures in staging and verify fallback behavior. Every degradation tier has an integration test that kills the dependency and asserts the expected response shape." |
| "Who decides fail-open vs fail-closed?" | Governance awareness | "It is a business decision, not an engineering decision. Product owns the 'what happens when X is down' answer. Engineering implements it. Security signs off on anything auth-related." |