StaffSignal
Cross-Cutting Framework

Degraded Mode Framework

Fail-open vs fail-closed decisions, fallback strategies, and ownership during dependency failures.

What This Framework Covers

Every distributed system faces the question: when a dependency fails, what do we do with requests?

This is the fail-open vs fail-closed decision. It's not a technical choice — it's a business risk decision with explicit ownership.


The Core Decision

When a dependency fails, fail-open admits the request (availability over control); fail-closed rejects it (control over availability).

When to Use Each

| Context | Recommended | Why | Who Signs Off |
|---|---|---|---|
| Abuse protection | Fail-open | Limiter shouldn't become a kill switch | Security + Infra |
| Billing / quota | Fail-closed | Cannot give away paid resources | Product + Finance |
| Auth / access control | Fail-closed | Cannot grant unauthorized access | Security |
| Feature flags | Fail-closed (usually) | New code shouldn't run without validation | Product |
| Cache miss | Fail-open to origin | Stale data often acceptable | Product |
| Internal service call | Depends on cascade | Analyze downstream impact | Service owner |
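
One way to keep these decisions explicit is a small policy registry the service consults at startup; a sketch, where the context names, field names, and `degradedMode` helper are illustrative assumptions rather than anything from an existing library:

```python
# Hypothetical registry encoding the table above. An unknown context raises
# instead of defaulting, so the decision is forced at design time.
FAILURE_POLICIES = {
    "abuse_protection": {"mode": "fail-open",   "signoff": ["security", "infra"]},
    "billing_quota":    {"mode": "fail-closed", "signoff": ["product", "finance"]},
    "auth":             {"mode": "fail-closed", "signoff": ["security"]},
    "feature_flags":    {"mode": "fail-closed", "signoff": ["product"]},
    "cache_miss":       {"mode": "fail-open",   "signoff": ["product"]},
}

def degradedMode(context):
    policy = FAILURE_POLICIES.get(context)
    if policy is None:
        # No recorded decision means no implicit default.
        raise KeyError("no degraded-mode decision recorded for " + context)
    return policy["mode"]
```

The point is not the dict itself but that the registry is reviewable: the sign-off list makes the ownership column enforceable in code review.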

The Staff Framework

Step 1: Name the Intent

Before deciding fail-open or fail-closed, answer:

  • What is this system protecting?
  • What is the cost of over-admission? (resources, money, security)
  • What is the cost of over-rejection? (availability, revenue, user trust)

Step 2: Choose with Guardrails

Fail-open is never "just allow everything." It requires:

| Guardrail | Purpose |
|---|---|
| Aggressive timeout | Don't let the dependency slow you down (5-10ms for fast paths) |
| Conservative fallback limits | If bypassing the rate limiter, apply local caps |
| Bypass-rate alerting | If bypass_rate > threshold, page on-call |
| Circuit breaker | If the backend shows stress, tighten limits or shed load |
| Audit logging | Record bypassed requests for post-incident analysis |
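
A minimal sketch of those guardrails working together, assuming a remote limiter call that can time out; the cap, timeout value, and metrics dict are illustrative:

```python
LOCAL_CAP = 100        # Conservative per-instance fallback limit
TIMEOUT_S = 0.005      # Aggressive 5 ms budget for the limiter call
metrics = {"bypass": 0}
localCounts = {}

def allow(clientId, remoteCheck):
    try:
        # Normal path: the remote limiter decides, within a strict budget.
        return remoteCheck(clientId, timeout=TIMEOUT_S)
    except TimeoutError:
        # Fail-open, but never "allow everything": local cap + bypass counter
        # that alerting and post-incident analysis can consume.
        metrics["bypass"] += 1
        localCounts[clientId] = localCounts.get(clientId, 0) + 1
        return localCounts[clientId] <= LOCAL_CAP
```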

Fail-closed is never "just reject." It requires:

| Guardrail | Purpose |
|---|---|
| Graceful degradation | Return partial data or a cached response if possible |
| Clear error messaging | Tell the client what happened and when to retry |
| Fast recovery detection | Close the circuit again quickly when the dependency recovers |
| Escalation path | On-call can manually override if needed |
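
A sketch of a fail-closed path with those guardrails, assuming a billing-style backend; the last-good cache and response shapes are illustrative:

```python
lastGood = {}

def fetchQuota(userId, backend):
    try:
        result = backend(userId)
        lastGood[userId] = result      # Keep a copy for graceful degradation
        return {"status": 200, "data": result, "stale": False}
    except ConnectionError:
        if userId in lastGood:
            # A cached response beats a bare failure
            return {"status": 200, "data": lastGood[userId], "stale": True}
        # Clear messaging: what happened, and when to retry
        return {"status": 503, "error": "quota backend unavailable",
                "retry_after_sec": 30}
```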

Step 3: Define the Governance

| Question | Must-Have Answer |
|---|---|
| Who can flip fail-open/closed? | On-call? SRE? Requires approval? |
| What is the kill-switch scope? | Per-endpoint? Per-tenant? Global? |
| How fast can we change it? | Config push? Requires a deploy? |
| What is the audit trail? | Who changed what, when, and why? |
| What post-incident analysis is required? | Review bypassed requests? Customer impact? |
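
These answers can live as reviewable config rather than tribal knowledge. A sketch; every field name below is an illustrative assumption:

```python
DEGRADED_MODE_GOVERNANCE = {
    "override_roles":       ["oncall-primary", "sre-lead"],  # Who can flip it
    "scope":                "per-endpoint",                  # Kill-switch granularity
    "change_mechanism":     "config-push",                   # No deploy required
    "audit_required":       True,                            # Who/what/when/why recorded
    "post_incident_review": True,                            # Bypassed requests analyzed
}

def canOverride(role):
    return role in DEGRADED_MODE_GOVERNANCE["override_roles"]
```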

L6 vs L7 Calibration

| Dimension | L6 (Staff) | L7 (Principal) |
|---|---|---|
| Decision speed | Makes the choice in 30 seconds with clear reasoning | Same, but immediately asks about governance |
| Guardrails | Names 3-4 specific mitigations | Designs the guardrail system + monitoring |
| Ownership | "Who operates this?" | "Who can change this behavior in production?" |
| Blast radius | Considers a single service | Considers org-wide consistency |

Common Interview Probes

| After You Say... | They Will Ask... |
|---|---|
| "Fail-open for availability" | "What prevents abuse during the window?" |
| "Fail-closed for correctness" | "What's the user experience? Can they retry?" |
| "We'll add alerting" | "What's the threshold? Who gets paged?" |
| "Circuit breaker" | "What signals trigger it? How fast does it recover?" |

Anti-Patterns

1. "We'll decide at runtime"

Degraded mode behavior should be decided at design time, not discovered during an incident.

2. "Fail-open by default"

Without explicit guardrails, fail-open becomes "accept all abuse."

3. "The dependency will be reliable"

Dependencies fail. The question is when, not if.

4. "We'll just add retries"

Retries without backoff + jitter turn a partial outage into a retry storm.
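
A sketch of retries done safely: bounded attempts, exponential backoff with full jitter, and a capped delay. Parameter values are illustrative, and `sleep` is injectable so the logic can be tested without waiting:

```python
import random
import time

def retryWithJitter(operation, maxAttempts=4, baseS=0.1, capS=2.0, sleep=time.sleep):
    for attempt in range(maxAttempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == maxAttempts - 1:
                raise                                  # Give up; let the caller degrade
            backoff = min(capS, baseS * (2 ** attempt))
            sleep(random.uniform(0, backoff))          # Full jitter desynchronizes clients
```

Full jitter (a uniform draw between zero and the backoff) is what prevents the retry storm: clients that failed together do not retry together.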

5. "Fail-closed is always safer"

Fail-closed can cause cascading failures if upstream services can't handle rejections.


Applying to Specific Systems

Rate Limiting

  • Abuse protection: Fail-open with local caps + bypass alerting
  • Billing/quota: Fail-closed with graceful 503

Circuit Breaker

  • Default: Fail-closed (that's the point)
  • But: Define the half-open behavior for recovery

Feature Flags

  • New features: Fail-closed (don't expose untested code)
  • Kill switches: Fail-open (the kill switch should always work)

Cache

  • Read path: Fail-open to origin (stale is better than error)
  • Write path: Depends on consistency requirements
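
The read path above can be sketched as two fail-open hops, assuming simple callables for the cache and origin; the stale-copy dict is an illustrative stand-in for a local LRU:

```python
lastKnown = {}

def read(key, cacheGet, origin):
    try:
        value = cacheGet(key)
        if value is not None:
            return value               # Cache hit
    except ConnectionError:
        pass                           # Cache down: fail open to origin
    try:
        value = origin(key)
        lastKnown[key] = value         # Refresh the stale copy
        return value
    except ConnectionError:
        if key in lastKnown:
            return lastKnown[key]      # Stale is better than error
        raise                          # Nothing to serve: let the caller decide
```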

Staff Sentence Templates


Implementation Deep Dive

1. Circuit Breaker with Fallback — The Core Degraded-Mode Primitive

A circuit breaker monitors downstream health and short-circuits requests when a dependency is failing. The key Staff insight: the circuit breaker itself is not the degradation strategy — the fallback behavior is.

Circuit Breaker State Machine

# Circuit breaker with three states (closed -> open -> half-open), as Python.
# `metrics` is an assumed stats client, as elsewhere in this document;
# `operation` is assumed to accept a `timeout` keyword argument.
import time

class CircuitBreaker:
    def __init__(self):
        self.state = "closed"            # Normal operation
        self.failureCount = 0
        self.lastFailureTime = None
        self.halfOpenAttempts = 0
        # Config
        self.failureThreshold = 5        # Failures before opening
        self.resetTimeout = 30.0         # Seconds before trying half-open
        self.halfOpenMaxAttempts = 3     # Successful calls to close
        self.timeout = 2.0               # Per-request timeout (seconds)

    def call(self, operation, fallback):
        if self.state == "open":
            if time.monotonic() - self.lastFailureTime > self.resetTimeout:
                self.state = "half-open"
                self.halfOpenAttempts = 0
                # Fall through to the half-open path below
            else:
                metrics.increment("circuit.short_circuited")
                return fallback()

        if self.state == "half-open":
            try:
                result = operation(timeout=self.timeout)
            except Exception:
                self.state = "open"      # Probe failed: reopen immediately
                self.lastFailureTime = time.monotonic()
                return fallback()
            self.halfOpenAttempts += 1
            if self.halfOpenAttempts >= self.halfOpenMaxAttempts:
                self.state = "closed"
                self.failureCount = 0
                metrics.increment("circuit.closed")
            return result

        # state == "closed"
        try:
            result = operation(timeout=self.timeout)
        except Exception:
            self.failureCount += 1
            self.lastFailureTime = time.monotonic()
            if self.failureCount >= self.failureThreshold:
                self.state = "open"
                metrics.increment("circuit.opened")
            return fallback()
        self.failureCount = max(0, self.failureCount - 1)   # Gradual recovery
        return result

2. Tiered Degradation — The Staff Default

A single fail-open/fail-closed decision is too coarse. Staff engineers implement tiered degradation where each tier provides less functionality but maintains core business operations.

Tiered Degradation Implementation

# Service health tiers
DEGRADATION_TIERS = {
    "full":     { "features": ["all"],          "description": "Normal operation" },
    "degraded": { "features": ["core", "cached"], "description": "Non-critical features disabled" },
    "minimal":  { "features": ["core"],          "description": "Only essential operations" },
    "emergency": { "features": [],               "description": "Static error page / maintenance mode" },
}

def getCurrentTier():
    healthChecks = {
        "database": checkDatabase(),         # Primary + replicas
        "cache": checkCache(),               # Redis cluster
        "auth": checkAuth(),                 # Auth service
        "search": checkSearch(),             # Elasticsearch
        "recommendations": checkRecs(),      # ML service
    }

    criticalDown = not healthChecks["database"] or not healthChecks["auth"]
    if criticalDown:
        return "emergency"

    failingCount = sum(1 for v in healthChecks.values() if not v)
    if failingCount >= 3:
        return "minimal"
    if failingCount >= 1:
        return "degraded"
    return "full"

def handleRequest(request):
    tier = getCurrentTier()
    metrics.gauge("service.degradation_tier", tier)

    match tier:
        case "full":
            return fullResponse(request)

        case "degraded":
            response = coreResponse(request)
            response.recommendations = cachedRecommendations(request.userId)  # Stale OK
            response.searchResults = None                                      # Feature disabled
            response.headers["X-Degraded"] = "search,recommendations"
            return response

        case "minimal":
            return coreOnlyResponse(request)    # No personalization, no search

        case "emergency":
            return staticMaintenancePage()

3. Load Shedding — Protecting the Core Under Overload

When the system is overwhelmed, serving everyone slowly is worse than serving some users well and rejecting others cleanly. Load shedding proactively rejects requests to maintain SLA for accepted traffic.

Priority-Based Load Shedding

# Request priority classification
def classifyPriority(request):
    if request.path.startswith("/api/payments"):
        return "critical"          # Revenue-generating
    if request.path.startswith("/api/checkout"):
        return "critical"
    if request.path.startswith("/api/auth"):
        return "high"              # User access
    if request.path.startswith("/api/search"):
        return "medium"            # Valuable but not essential
    if request.path.startswith("/api/recommendations"):
        return "low"               # Nice to have
    return "medium"

# Admission control at the load balancer / gateway
def admissionControl(request):
    priority = classifyPriority(request)
    currentLoad = metrics.get("system.cpu_utilization")

    # Shed low-priority traffic first, then medium, etc.
    thresholds = {
        "low":      0.70,     # Start shedding at 70% CPU
        "medium":   0.80,     # Start shedding at 80% CPU
        "high":     0.90,     # Start shedding at 90% CPU
        "critical": 0.98,     # Almost never shed
    }

    if currentLoad > thresholds[priority]:
        metrics.increment("load_shed.rejected", tags=["priority:" + priority])
        return Response(503, headers={"Retry-After": "5"},
                       body={"error": "service_overloaded", "retry_after_sec": 5})

    return processRequest(request)

4. Feature Flags for Graceful Degradation

Feature flags are not just for rollouts — they are the primary mechanism for disabling non-essential features during incidents.

Kill Switch Pattern

# Feature flag configuration (stored in Redis or config service)
FEATURE_FLAGS = {
    "recommendations.enabled":  { "default": True,  "kill_switch": True },
    "search.autocomplete":      { "default": True,  "kill_switch": True },
    "analytics.client_events":  { "default": True,  "kill_switch": True },
    "notifications.push":       { "default": True,  "kill_switch": True },
    "checkout.express":         { "default": True,  "kill_switch": False },  # Never disable
}

def isFeatureEnabled(featureName, context):
    flag = FEATURE_FLAGS[featureName]

    # Kill switch override — fastest possible check
    if flag["kill_switch"] and killSwitchService.isDisabled(featureName):
        metrics.increment("feature.killed", tags=["feature:" + featureName])
        return False

    return flag["default"]

# On-call incident response
def disableFeature(featureName, reason, operator):
    killSwitchService.disable(featureName)
    auditLog.record({
        "action": "feature_disabled",
        "feature": featureName,
        "reason": reason,
        "operator": operator,
        "timestamp": now()
    })
    alerting.notify("Feature disabled: " + featureName + " by " + operator)

Architecture Diagram

Normal flow (solid): Requests pass through load shedding, circuit breakers, and feature flag checks. All services respond normally.

Degraded flow (dashed): When a dependency fails, the circuit breaker routes to the fallback — cached data for catalog/recommendations, static responses for emergency mode. Kill switches disable non-critical services instantly.


Failure Scenarios

1. Cascading Failure — Auth Service Latency Spike

Timeline: The auth service response time increases from 5ms to 2,000ms due to a database migration. Every request now takes 2 seconds longer. Thread pools fill across all API servers. Requests queue behind slow auth calls. Within 3 minutes, all services are unresponsive — not because they are broken, but because they are waiting for auth.

Blast radius: Total site outage. Every authenticated endpoint is affected. Even health check endpoints return 503 because the thread pool is exhausted.

Detection: Auth service p99 latency exceeds 500ms. API server thread pool utilization hits 100%. Error rate spikes across all services simultaneously (a signature of upstream dependency failure, not service-specific bugs).

Recovery:

  1. Circuit breaker on auth opens after 5 consecutive timeouts (configured at 200ms) — requests fail fast instead of queueing
  2. Auth circuit breaker fallback: fail-closed (return 401) for new sessions, fail-open (accept existing valid JWT) for requests with unexpired tokens
  3. Once auth service recovers, circuit breaker transitions to half-open and gradually restores traffic
  4. Post-incident: add bulkhead isolation — auth calls get their own thread pool so a slow auth cannot starve other operations
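
Step 2's fallback works because an HS256 JWT is self-contained: the signature and expiry can be checked locally, without calling the auth service. A minimal sketch, assuming a shared HMAC secret and a standard `exp` claim; the function names are illustrative:

```python
import base64
import hashlib
import hmac
import json
import time

def b64urlDecode(part):
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))

def validateLocally(token, secret):
    try:
        headerB64, payloadB64, sigB64 = token.split(".")
        signingInput = (headerB64 + "." + payloadB64).encode()
        expected = hmac.new(secret, signingInput, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, b64urlDecode(sigB64)):
            return False                              # Tampered: fail closed
        claims = json.loads(b64urlDecode(payloadB64))
        return claims.get("exp", 0) > time.time()     # Only unexpired tokens pass
    except (ValueError, KeyError):
        return False                                  # Malformed token: fail closed
```

The fail-open window is bounded by token lifetime: revocations will not be seen until the auth service recovers, which is part of the business risk being accepted.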

2. Kill Switch Race — Feature Disabled During High Traffic

Timeline: On-call disables the recommendations feature during a Black Friday traffic spike to reduce load. The kill switch propagation takes 15 seconds to reach all 50 app servers. During the propagation window, some servers return recommendations while others do not. Clients see inconsistent behavior — products appear and disappear from recommendation carousels on page refresh.

Blast radius: User-facing inconsistency for 15 seconds. No data loss, no incorrect transactions — purely a visual inconsistency that resolves once propagation completes.

Detection: Feature flag audit log shows the disable event. Recommendation service request rate drops gradually over 15 seconds (not instantly).

Recovery:

  1. Accept the 15-second propagation window as a known limitation — document it in the runbook
  2. For features where inconsistency matters, use client-side feature flags: the server responds with the flag state, and the client caches it for the session
  3. For faster propagation, use a push-based mechanism (Redis pub/sub) instead of polling with a 5-second interval
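
Recovery item 2 can be sketched as a per-session snapshot, so a mid-session flip never produces visible flicker; the dicts here are illustrative stand-ins for a flag service and session store:

```python
serverFlags = {"recommendations.enabled": True}
sessionFlags = {}     # sessionId -> snapshot taken at session start

def flagsForSession(sessionId):
    if sessionId not in sessionFlags:
        sessionFlags[sessionId] = dict(serverFlags)   # Pin current values
    return sessionFlags[sessionId]
```

Existing sessions keep the values they started with; only new sessions observe the flip, trading propagation speed for consistency.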

3. Load Shedding Starvation — Low-Priority Requests Never Served

Timeline: System operates at 75% CPU during peak hours for three consecutive weeks. Load shedding continuously rejects "low" priority requests (recommendations API, analytics events). Recommendation model training starves because it depends on analytics events. Recommendation quality degrades. Product team files a bug: "recommendations are terrible." Root cause: analytics data pipeline has been shed for three weeks.

Blast radius: Recommendation quality degrades silently over weeks. The feedback loop (user behavior → analytics → model training → recommendations) is broken. Revenue impact from poor recommendations is real but difficult to attribute.

Detection: Load shedding metrics show sustained rejection of low-priority traffic for >24 hours. Analytics pipeline lag alert (if monitored). Recommendation A/B test metrics show declining engagement.

Recovery:

  1. Immediate: increase capacity to reduce CPU below shedding threshold, or temporarily promote analytics events to "medium" priority
  2. Short-term: add a "minimum guaranteed throughput" per priority tier — even low-priority traffic gets at least 5% of capacity
  3. Long-term: separate the analytics ingestion path from the user-facing API. Analytics events should not compete with user requests for the same capacity
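
The "minimum guaranteed throughput" fix in step 2 can be sketched as a fixed floor share per tier; the share values are illustrative, and rounding leftovers go to the critical tier:

```python
GUARANTEED_SHARE = {"critical": 0.50, "high": 0.25, "medium": 0.20, "low": 0.05}

def allocateSlots(capacity):
    # Floor share per tier, so sustained overload cannot starve a tier to zero
    # (assuming capacity is large enough that each share rounds to >= 1 slot).
    slots = {tier: int(capacity * share) for tier, share in GUARANTEED_SHARE.items()}
    slots["critical"] += capacity - sum(slots.values())   # Rounding remainder
    return slots
```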

Staff Interview Application

How to Introduce This Pattern

Lead with the classification framework, then give specific fallback behaviors. This tells the interviewer you think about failure as a design constraint, not an afterthought.

When NOT to Use This Pattern

  • Single-dependency system: If your service has one database and no external dependencies, degraded mode means "database is down, nothing works." No amount of degradation design helps — invest in database reliability instead.
  • Batch systems: Offline jobs can simply retry or wait. Degraded mode is for real-time request paths where users are waiting.
  • Strong consistency requirements everywhere: If the business cannot tolerate stale data in any response (financial trading, medical records), fail-closed on every dependency is the only option. Degradation via cached data is not applicable.

Follow-Up Questions to Anticipate

| Interviewer Asks | What They Are Testing | How to Respond |
|---|---|---|
| "Should auth fail open or closed?" | Business risk reasoning | "Fail-closed for new sessions — we cannot grant unauthorized access. But for requests with a valid, unexpired JWT, we can validate locally without calling the auth service. The token's signature and expiry are self-contained." |
| "How do you decide what to shed first?" | Priority classification | "Revenue-generating paths (checkout, payments) are critical. User access (auth, profile) is high. Discovery (search, recommendations) is medium. Analytics and telemetry are low. We shed from the bottom." |
| "What if the circuit breaker flaps?" | Operational maturity | "The half-open state sends a small percentage of traffic to test recovery. If the dependency flaps, the circuit stays open with exponential backoff on the reset timeout — 30s, 60s, 120s — so we're not constantly reopening into a failing service." |
| "How do you test degraded mode?" | Quality practices | "Chaos engineering: inject failures in staging and verify fallback behavior. Every degradation tier has an integration test that kills the dependency and asserts the expected response shape." |
| "Who decides fail-open vs fail-closed?" | Governance awareness | "It is a business decision, not an engineering decision. Product owns the 'what happens when X is down' answer. Engineering implements it. Security signs off on anything auth-related." |