StaffSignal
Cross-Cutting Framework

Degraded Mode Framework

Fail-open vs fail-closed decisions, fallback strategies, and ownership during dependency failures.

What This Framework Covers

Every distributed system faces the question: when a dependency fails, what do we do with requests?

This is the fail-open vs fail-closed decision. It's not a technical choice — it's a business risk decision with explicit ownership.


The Core Decision

When a dependency fails, fail-open admits the request (availability over control); fail-closed rejects it (control over availability).

When to Use Each

| Context | Recommended | Why | Who Signs Off |
|---|---|---|---|
| Abuse protection | Fail-open | Limiter shouldn't become a kill switch | Security + Infra |
| Billing / quota | Fail-closed | Cannot give away paid resources | Product + Finance |
| Auth / access control | Fail-closed | Cannot grant unauthorized access | Security |
| Feature flags | Fail-closed (usually) | New code shouldn't run without validation | Product |
| Cache miss | Fail-open to origin | Stale data often acceptable | Product |
| Internal service call | Depends on cascade | Analyze downstream impact | Service owner |
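
One way to keep these decisions explicit is a small policy registry the service consults at startup; a sketch, where the context names, field names, and `degradedMode` helper are illustrative assumptions rather than anything from an existing library:

```python
# Hypothetical registry encoding the table above. An unknown context raises
# instead of defaulting, so the decision is forced at design time.
FAILURE_POLICIES = {
    "abuse_protection": {"mode": "fail-open",   "signoff": ["security", "infra"]},
    "billing_quota":    {"mode": "fail-closed", "signoff": ["product", "finance"]},
    "auth":             {"mode": "fail-closed", "signoff": ["security"]},
    "feature_flags":    {"mode": "fail-closed", "signoff": ["product"]},
    "cache_miss":       {"mode": "fail-open",   "signoff": ["product"]},
}

def degradedMode(context):
    policy = FAILURE_POLICIES.get(context)
    if policy is None:
        # No recorded decision means no implicit default.
        raise KeyError("no degraded-mode decision recorded for " + context)
    return policy["mode"]
```

The point is not the dict itself but that the registry is reviewable: the sign-off list makes the ownership column enforceable in code review.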

The Staff Framework

Step 1: Name the Intent

Before deciding fail-open or fail-closed, answer:

  • What is this system protecting?
  • What is the cost of over-admission? (resources, money, security)
  • What is the cost of over-rejection? (availability, revenue, user trust)

Step 2: Choose with Guardrails

Fail-open is never "just allow everything." It requires:

| Guardrail | Purpose |
|---|---|
| Aggressive timeout | Don't let the dependency slow you down (5-10ms for fast paths) |
| Conservative fallback limits | If bypassing the rate limiter, apply local caps |
| Bypass-rate alerting | If bypass_rate > threshold, page on-call |
| Circuit breaker | If the backend shows stress, tighten limits or shed load |
| Audit logging | Record bypassed requests for post-incident analysis |
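
A minimal sketch of those guardrails working together, assuming a remote limiter call that can time out; the cap, timeout value, and metrics dict are illustrative:

```python
LOCAL_CAP = 100        # Conservative per-instance fallback limit
TIMEOUT_S = 0.005      # Aggressive 5 ms budget for the limiter call
metrics = {"bypass": 0}
localCounts = {}

def allow(clientId, remoteCheck):
    try:
        # Normal path: the remote limiter decides, within a strict budget.
        return remoteCheck(clientId, timeout=TIMEOUT_S)
    except TimeoutError:
        # Fail-open, but never "allow everything": local cap + bypass counter
        # that alerting and post-incident analysis can consume.
        metrics["bypass"] += 1
        localCounts[clientId] = localCounts.get(clientId, 0) + 1
        return localCounts[clientId] <= LOCAL_CAP
```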

Fail-closed is never "just reject." It requires:

| Guardrail | Purpose |
|---|---|
| Graceful degradation | Return partial data or a cached response if possible |
| Clear error messaging | Tell the client what happened and when to retry |
| Fast recovery detection | Close the circuit again quickly when the dependency recovers |
| Escalation path | On-call can manually override if needed |
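
A sketch of a fail-closed path with those guardrails, assuming a billing-style backend; the last-good cache and response shapes are illustrative:

```python
lastGood = {}

def fetchQuota(userId, backend):
    try:
        result = backend(userId)
        lastGood[userId] = result      # Keep a copy for graceful degradation
        return {"status": 200, "data": result, "stale": False}
    except ConnectionError:
        if userId in lastGood:
            # A cached response beats a bare failure
            return {"status": 200, "data": lastGood[userId], "stale": True}
        # Clear messaging: what happened, and when to retry
        return {"status": 503, "error": "quota backend unavailable",
                "retry_after_sec": 30}
```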

Step 3: Define the Governance

| Question | Must-Have Answer |
|---|---|
| Who can flip fail-open/closed? | On-call? SRE? Requires approval? |
| What is the kill-switch scope? | Per-endpoint? Per-tenant? Global? |
| How fast can we change it? | Config push? Requires a deploy? |
| What is the audit trail? | Who changed what, when, and why? |
| What post-incident analysis is required? | Review bypassed requests? Customer impact? |
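
These answers can live as reviewable config rather than tribal knowledge. A sketch; every field name below is an illustrative assumption:

```python
DEGRADED_MODE_GOVERNANCE = {
    "override_roles":       ["oncall-primary", "sre-lead"],  # Who can flip it
    "scope":                "per-endpoint",                  # Kill-switch granularity
    "change_mechanism":     "config-push",                   # No deploy required
    "audit_required":       True,                            # Who/what/when/why recorded
    "post_incident_review": True,                            # Bypassed requests analyzed
}

def canOverride(role):
    return role in DEGRADED_MODE_GOVERNANCE["override_roles"]
```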

L6 vs L7 Calibration

| Dimension | L6 (Staff) | L7 (Principal) |
|---|---|---|
| Decision speed | Makes the choice in 30 seconds with clear reasoning | Same, but immediately asks about governance |
| Guardrails | Names 3-4 specific mitigations | Designs the guardrail system + monitoring |
| Ownership | "Who operates this?" | "Who can change this behavior in production?" |
| Blast radius | Considers a single service | Considers org-wide consistency |

Common Interview Probes

| After You Say... | They Will Ask... |
|---|---|
| "Fail-open for availability" | "What prevents abuse during the window?" |
| "Fail-closed for correctness" | "What's the user experience? Can they retry?" |
| "We'll add alerting" | "What's the threshold? Who gets paged?" |
| "Circuit breaker" | "What signals trigger it? How fast does it recover?" |

Anti-Patterns

1. "We'll decide at runtime"

Degraded mode behavior should be decided at design time, not discovered during an incident.

2. "Fail-open by default"

Without explicit guardrails, fail-open becomes "accept all abuse."

3. "The dependency will be reliable"

Dependencies fail. The question is when, not if.

4. "We'll just add retries"

Retries without backoff + jitter turn a partial outage into a retry storm.
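
A sketch of retries done safely: bounded attempts, exponential backoff with full jitter, and a capped delay. Parameter values are illustrative, and `sleep` is injectable so the logic can be tested without waiting:

```python
import random
import time

def retryWithJitter(operation, maxAttempts=4, baseS=0.1, capS=2.0, sleep=time.sleep):
    for attempt in range(maxAttempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == maxAttempts - 1:
                raise                                  # Give up; let the caller degrade
            backoff = min(capS, baseS * (2 ** attempt))
            sleep(random.uniform(0, backoff))          # Full jitter desynchronizes clients
```

Full jitter (a uniform draw between zero and the backoff) is what prevents the retry storm: clients that failed together do not retry together.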

5. "Fail-closed is always safer"

Fail-closed can cause cascading failures if upstream services can't handle rejections.


Applying to Specific Systems

Rate Limiting

  • Abuse protection: Fail-open with local caps + bypass alerting
  • Billing/quota: Fail-closed with graceful 503

Circuit Breaker

  • Default: Fail-closed (that's the point)
  • But: Define the half-open behavior for recovery

Feature Flags

  • New features: Fail-closed (don't expose untested code)
  • Kill switches: Fail-open (the kill switch should always work)

Cache

  • Read path: Fail-open to origin (stale is better than error)
  • Write path: Depends on consistency requirements
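
The read path above can be sketched as two fail-open hops, assuming simple callables for the cache and origin; the stale-copy dict is an illustrative stand-in for a local LRU:

```python
lastKnown = {}

def read(key, cacheGet, origin):
    try:
        value = cacheGet(key)
        if value is not None:
            return value               # Cache hit
    except ConnectionError:
        pass                           # Cache down: fail open to origin
    try:
        value = origin(key)
        lastKnown[key] = value         # Refresh the stale copy
        return value
    except ConnectionError:
        if key in lastKnown:
            return lastKnown[key]      # Stale is better than error
        raise                          # Nothing to serve: let the caller decide
```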

Staff Sentence Templates


Implementation Deep Dive

1. Circuit Breaker with Fallback — The Core Degraded-Mode Primitive

A circuit breaker monitors downstream health and short-circuits requests when a dependency is failing. The key Staff insight: the circuit breaker itself is not the degradation strategy — the fallback behavior is.

Circuit Breaker State Machine

# Circuit breaker with three states (closed -> open -> half-open), as Python.
# `metrics` is an assumed stats client, as elsewhere in this document;
# `operation` is assumed to accept a `timeout` keyword argument.
import time

class CircuitBreaker:
    def __init__(self):
        self.state = "closed"            # Normal operation
        self.failureCount = 0
        self.lastFailureTime = None
        self.halfOpenAttempts = 0
        # Config
        self.failureThreshold = 5        # Failures before opening
        self.resetTimeout = 30.0         # Seconds before trying half-open
        self.halfOpenMaxAttempts = 3     # Successful calls to close
        self.timeout = 2.0               # Per-request timeout (seconds)

    def call(self, operation, fallback):
        if self.state == "open":
            if time.monotonic() - self.lastFailureTime > self.resetTimeout:
                self.state = "half-open"
                self.halfOpenAttempts = 0
                # Fall through to the half-open path below
            else:
                metrics.increment("circuit.short_circuited")
                return fallback()

        if self.state == "half-open":
            try:
                result = operation(timeout=self.timeout)
            except Exception:
                self.state = "open"      # Probe failed: reopen immediately
                self.lastFailureTime = time.monotonic()
                return fallback()
            self.halfOpenAttempts += 1
            if self.halfOpenAttempts >= self.halfOpenMaxAttempts:
                self.state = "closed"
                self.failureCount = 0
                metrics.increment("circuit.closed")
            return result

        # state == "closed"
        try:
            result = operation(timeout=self.timeout)
        except Exception:
            self.failureCount += 1
            self.lastFailureTime = time.monotonic()
            if self.failureCount >= self.failureThreshold:
                self.state = "open"
                metrics.increment("circuit.opened")
            return fallback()
        self.failureCount = max(0, self.failureCount - 1)   # Gradual recovery
        return result

2. Tiered Degradation — The Staff Default

A single fail-open/fail-closed decision is too coarse. Staff engineers implement tiered degradation where each tier provides less functionality but maintains core business operations.

Tiered Degradation Implementation

# Service health tiers
DEGRADATION_TIERS = {
    "full":     { "features": ["all"],          "description": "Normal operation" },
    "degraded": { "features": ["core", "cached"], "description": "Non-critical features disabled" },
    "minimal":  { "features": ["core"],          "description": "Only essential operations" },
    "emergency": { "features": [],               "description": "Static error page / maintenance mode" },
}

def getCurrentTier():
    healthChecks = {
        "database": checkDatabase(),         # Primary + replicas
        "cache": checkCache(),               # Redis cluster
        "auth": checkAuth(),                 # Auth service
        "search": checkSearch(),             # Elasticsearch
        "recommendations": checkRecs(),      # ML service
    }

    criticalDown = not healthChecks["database"] or not healthChecks["auth"]
    if criticalDown:
        return "emergency"

    failingCount = sum(1 for v in healthChecks.values() if not v)
    if failingCount >= 3:
        return "minimal"
    if failingCount >= 1:
        return "degraded"
    return "full"

def handleRequest(request):
    tier = getCurrentTier()
    metrics.gauge("service.degradation_tier", tier)

    match tier:
        case "full":
            return fullResponse(request)

        case "degraded":
            response = coreResponse(request)
            response.recommendations = cachedRecommendations(request.userId)  # Stale OK
            response.searchResults = None                                      # Feature disabled
            response.headers["X-Degraded"] = "search,recommendations"
            return response

        case "minimal":
            return coreOnlyResponse(request)    # No personalization, no search

        case "emergency":
            return staticMaintenancePage()

3. Load Shedding — Protecting the Core Under Overload

When the system is overwhelmed, serving everyone slowly is worse than serving some users well and rejecting others cleanly. Load shedding proactively rejects requests to maintain SLA for accepted traffic.

Priority-Based Load Shedding

# Request priority classification
def classifyPriority(request):
    if request.path.startswith("/api/payments"):
        return "critical"          # Revenue-generating
    if request.path.startswith("/api/checkout"):
        return "critical"
    if request.path.startswith("/api/auth"):
        return "high"              # User access
    if request.path.startswith("/api/search"):
        return "medium"            # Valuable but not essential
    if request.path.startswith("/api/recommendations"):
        return "low"               # Nice to have
    return "medium"

# Admission control at the load balancer / gateway
def admissionControl(request):
    priority = classifyPriority(request)
    currentLoad = metrics.get("system.cpu_utilization")

    # Shed low-priority traffic first, then medium, etc.
    thresholds = {
        "low":      0.70,     # Start shedding at 70% CPU
        "medium":   0.80,     # Start shedding at 80% CPU
        "high":     0.90,     # Start shedding at 90% CPU
        "critical": 0.98,     # Almost never shed
    }

    if currentLoad > thresholds[priority]:
        metrics.increment("load_shed.rejected", tags=["priority:" + priority])
        return Response(503, headers={"Retry-After": "5"},
                       body={"error": "service_overloaded", "retry_after_sec": 5})

    return processRequest(request)

4. Feature Flags for Graceful Degradation

Feature flags are not just for rollouts — they are the primary mechanism for disabling non-essential features during incidents.

Kill Switch Pattern

# Feature flag configuration (stored in Redis or config service)
FEATURE_FLAGS = {
    "recommendations.enabled":  { "default": True,  "kill_switch": True },
    "search.autocomplete":      { "default": True,  "kill_switch": True },
    "analytics.client_events":  { "default": True,  "kill_switch": True },
    "notifications.push":       { "default": True,  "kill_switch": True },
    "checkout.express":         { "default": True,  "kill_switch": False },  # Never disable
}

def isFeatureEnabled(featureName, context):
    flag = FEATURE_FLAGS[featureName]

    # Kill switch override — fastest possible check
    if flag["kill_switch"] and killSwitchService.isDisabled(featureName):
        metrics.increment("feature.killed", tags=["feature:" + featureName])
        return False

    return flag["default"]

# On-call incident response
def disableFeature(featureName, reason, operator):
    killSwitchService.disable(featureName)
    auditLog.record({
        "action": "feature_disabled",
        "feature": featureName,
        "reason": reason,
        "operator": operator,
        "timestamp": now()
    })
    alerting.notify("Feature disabled: " + featureName + " by " + operator)

Architecture Diagram

Normal flow (solid): Requests pass through load shedding, circuit breakers, and feature flag checks. All services respond normally.

Degraded flow (dashed): When a dependency fails, the circuit breaker routes to the fallback — cached data for catalog/recommendations, static responses for emergency mode. Kill switches disable non-critical services instantly.


Failure Scenarios

1. Cascading Failure — Auth Service Latency Spike

Timeline: The auth service response time increases from 5ms to 2,000ms due to a database migration. Every request now takes 2 seconds longer. Thread pools fill across all API servers. Requests queue behind slow auth calls. Within 3 minutes, all services are unresponsive — not because they are broken, but because they are waiting for auth.

Blast radius: Total site outage. Every authenticated endpoint is affected. Even health check endpoints return 503 because the thread pool is exhausted.

Detection: Auth service p99 latency exceeds 500ms. API server thread pool utilization hits 100%. Error rate spikes across all services simultaneously (a signature of upstream dependency failure, not service-specific bugs).

Recovery:

  1. Circuit breaker on auth opens after 5 consecutive timeouts (configured at 200ms) — requests fail fast instead of queueing
  2. Auth circuit breaker fallback: fail-closed (return 401) for new sessions, fail-open (accept existing valid JWT) for requests with unexpired tokens
  3. Once auth service recovers, circuit breaker transitions to half-open and gradually restores traffic
  4. Post-incident: add bulkhead isolation — auth calls get their own thread pool so a slow auth cannot starve other operations
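
Step 2's fallback works because an HS256 JWT is self-contained: the signature and expiry can be checked locally, without calling the auth service. A minimal sketch, assuming a shared HMAC secret and a standard `exp` claim; the function names are illustrative:

```python
import base64
import hashlib
import hmac
import json
import time

def b64urlDecode(part):
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))

def validateLocally(token, secret):
    try:
        headerB64, payloadB64, sigB64 = token.split(".")
        signingInput = (headerB64 + "." + payloadB64).encode()
        expected = hmac.new(secret, signingInput, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, b64urlDecode(sigB64)):
            return False                              # Tampered: fail closed
        claims = json.loads(b64urlDecode(payloadB64))
        return claims.get("exp", 0) > time.time()     # Only unexpired tokens pass
    except (ValueError, KeyError):
        return False                                  # Malformed token: fail closed
```

The fail-open window is bounded by token lifetime: revocations will not be seen until the auth service recovers, which is part of the business risk being accepted.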

2. Kill Switch Race — Feature Disabled During High Traffic

Timeline: On-call disables the recommendations feature during a Black Friday traffic spike to reduce load. The kill switch propagation takes 15 seconds to reach all 50 app servers. During the propagation window, some servers return recommendations while others do not. Clients see inconsistent behavior — products appear and disappear from recommendation carousels on page refresh.

Blast radius: User-facing inconsistency for 15 seconds. No data loss, no incorrect transactions — purely a visual inconsistency that resolves once propagation completes.

Detection: Feature flag audit log shows the disable event. Recommendation service request rate drops gradually over 15 seconds (not instantly).

Recovery:

  1. Accept the 15-second propagation window as a known limitation — document it in the runbook
  2. For features where inconsistency matters, use client-side feature flags: the server responds with the flag state, and the client caches it for the session
  3. For faster propagation, use a push-based mechanism (Redis pub/sub) instead of polling with a 5-second interval
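
Recovery item 2 can be sketched as a per-session snapshot, so a mid-session flip never produces visible flicker; the dicts here are illustrative stand-ins for a flag service and session store:

```python
serverFlags = {"recommendations.enabled": True}
sessionFlags = {}     # sessionId -> snapshot taken at session start

def flagsForSession(sessionId):
    if sessionId not in sessionFlags:
        sessionFlags[sessionId] = dict(serverFlags)   # Pin current values
    return sessionFlags[sessionId]
```

Existing sessions keep the values they started with; only new sessions observe the flip, trading propagation speed for consistency.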

3. Load Shedding Starvation — Low-Priority Requests Never Served

Timeline: System operates at 75% CPU during peak hours for three consecutive weeks. Load shedding continuously rejects "low" priority requests (recommendations API, analytics events). Recommendation model training starves because it depends on analytics events. Recommendation quality degrades. Product team files a bug: "recommendations are terrible." Root cause: analytics data pipeline has been shed for three weeks.

Blast radius: Recommendation quality degrades silently over weeks. The feedback loop (user behavior → analytics → model training → recommendations) is broken. Revenue impact from poor recommendations is real but difficult to attribute.

Detection: Load shedding metrics show sustained rejection of low-priority traffic for >24 hours. Analytics pipeline lag alert (if monitored). Recommendation A/B test metrics show declining engagement.

Recovery:

  1. Immediate: increase capacity to reduce CPU below shedding threshold, or temporarily promote analytics events to "medium" priority
  2. Short-term: add a "minimum guaranteed throughput" per priority tier — even low-priority traffic gets at least 5% of capacity
  3. Long-term: separate the analytics ingestion path from the user-facing API. Analytics events should not compete with user requests for the same capacity
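
The "minimum guaranteed throughput" fix in step 2 can be sketched as a fixed floor share per tier; the share values are illustrative, and rounding leftovers go to the critical tier:

```python
GUARANTEED_SHARE = {"critical": 0.50, "high": 0.25, "medium": 0.20, "low": 0.05}

def allocateSlots(capacity):
    # Floor share per tier, so sustained overload cannot starve a tier to zero
    # (assuming capacity is large enough that each share rounds to >= 1 slot).
    slots = {tier: int(capacity * share) for tier, share in GUARANTEED_SHARE.items()}
    slots["critical"] += capacity - sum(slots.values())   # Rounding remainder
    return slots
```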

Staff Interview Application

How to Introduce This Pattern

Lead with the classification framework, then give specific fallback behaviors. This tells the interviewer you think about failure as a design constraint, not an afterthought.

When NOT to Use This Pattern

  • Single-dependency system: If your service has one database and no external dependencies, degraded mode means "database is down, nothing works." No amount of degradation design helps — invest in database reliability instead.
  • Batch systems: Offline jobs can simply retry or wait. Degraded mode is for real-time request paths where users are waiting.
  • Strong consistency requirements everywhere: If the business cannot tolerate stale data in any response (financial trading, medical records), fail-closed on every dependency is the only option. Degradation via cached data is not applicable.

Follow-Up Questions to Anticipate

| Interviewer Asks | What They Are Testing | How to Respond |
|---|---|---|
| "Should auth fail open or closed?" | Business risk reasoning | "Fail-closed for new sessions — we cannot grant unauthorized access. But for requests with a valid, unexpired JWT, we can validate locally without calling the auth service. The token's signature and expiry are self-contained." |
| "How do you decide what to shed first?" | Priority classification | "Revenue-generating paths (checkout, payments) are critical. User access (auth, profile) is high. Discovery (search, recommendations) is medium. Analytics and telemetry are low. We shed from the bottom." |
| "What if the circuit breaker flaps?" | Operational maturity | "The half-open state sends a small percentage of traffic to test recovery. If the dependency flaps, the circuit stays open with exponential backoff on the reset timeout — 30s, 60s, 120s — so we're not constantly reopening into a failing service." |
| "How do you test degraded mode?" | Quality practices | "Chaos engineering: inject failures in staging and verify fallback behavior. Every degradation tier has an integration test that kills the dependency and asserts the expected response shape." |
| "Who decides fail-open vs fail-closed?" | Governance awareness | "It is a business decision, not an engineering decision. Product owns the 'what happens when X is down' answer. Engineering implements it. Security signs off on anything auth-related." |