StaffSignal

Design a Distributed Cache

Staff-Level Playbook

How to Use This Playbook

This playbook supports three reading modes:

| Mode | Time | What to Read |
|---|---|---|
| Quick Review | 15 min | Executive Summary → Interview Walkthrough → Fault Lines (§3) → Drills (§7) |
| Targeted Study | 1-2 hrs | Executive Summary → Interview Walkthrough → Core Flow, expand appendices where you're weak |
| Deep Dive | 3+ hrs | Everything, including all appendices |
Expandable sections contain deeper mechanics. Open them when you need the detail.
What is Distributed Caching? — Quick primer if you're unfamiliar

The Problem

A cache stores copies of frequently accessed data in fast storage (usually memory) to avoid repeatedly hitting slower backends like databases or external APIs. "Distributed" means the cache spans multiple nodes, allowing it to scale beyond a single machine's memory and survive individual node failures. The tradeoff: you're trading consistency for speed—cached data can become stale.

Common Use Cases

  • Database Query Caching: Store expensive query results to reduce database load (e.g., product catalogs, user profiles)
  • Session Storage: Keep user sessions in fast-access memory across a cluster of web servers
  • API Response Caching: Cache third-party API responses to reduce latency and avoid rate limits
  • Computed Result Caching: Store results of expensive computations (ML model outputs, aggregations)
  • CDN Edge Caching: Cache static and semi-dynamic content at edge locations for global users

Why Interviewers Ask About This

Caching seems simple but hides brutal complexity: invalidation is a consistency problem, not a timeout problem. Interviewers want to see if you understand that adding a cache means you now have two sources of truth that can disagree. Can you articulate when staleness is acceptable? Do you know what happens when the cache goes down? This topic reveals whether you've dealt with real production issues—cache stampedes, thundering herds, and the dreaded "why is this showing old data?" bug.

What This Interview Actually Tests

Caching is not a "make it faster" question. Everyone knows Redis.

This is a consistency and operational ownership question that tests:

  • Whether you understand why caching introduces complexity, not just speed
  • Whether you reason about invalidation before discussing eviction
  • Whether you can articulate what "stale" means for your use case
  • Whether you understand the blast radius when the cache fails

The key insight: Caching is a consistency problem disguised as a performance optimization. Staff engineers reason about who pays for staleness and who owns the invalidation contract.

The L5 vs L6 Contrast (Memorize This)

Level Calibration
| Behavior | L5 (Senior) | L6 (Staff) |
|---|---|---|
| First move | "We'll add Redis in front of the database" | Asks "What's the staleness tolerance? Who's the source of truth?" |
| Invalidation | "TTL of 5 minutes" | "TTL is the last resort. What's our invalidation contract?" |
| Failure | "We'll add replicas" | "When the cache fails, do we hit the DB or return errors? What's the thundering herd plan?" |
| Consistency | Assumes cache is always helpful | Articulates when caching makes things worse (write-heavy, low hit rate) |
| Ownership | Focuses on cache implementation | Asks "Who owns cache warming? Who gets paged when hit rate drops?" |

The Three Caching Intents (Pick One and Commit)

Who Pays Analysis
| Intent | Constraint | Strategy | Staleness Bar |
|---|---|---|---|
| Latency Reduction | Speed is everything | Aggressive caching, read-through | Seconds to minutes acceptable |
| Origin Protection | Shield backend from load | Cache-aside with circuit breaker | Minutes acceptable, freshness secondary |
| Cost Optimization | Reduce expensive computation/queries | Precompute + cache, longer TTLs | Minutes to hours acceptable |

Staff Move: "I'll assume we're protecting the origin from read load while maintaining sub-second staleness for user-facing data. This is the hardest case because we need both availability and freshness."

The Five Fault Lines (The Core of This Interview)

  1. Freshness vs Performance — Shorter TTLs mean fresher data but more origin load. Who decides the staleness budget?

  2. Cache-Aside vs Read-Through vs Write-Through — Where does the invalidation logic live? Who owns it?

  3. Availability vs Consistency — When the cache is down, do we serve stale data, hit origin, or fail?

  4. Local vs Distributed — In-process cache (fast, inconsistent) vs shared cache (slower, consistent)?

  5. Proactive vs Reactive Invalidation — Push invalidation on write, or let TTL expire? Who coordinates?

Each fault line has a tradeoff matrix with explicit "who pays" analysis. See §3.

Default Staff Positions (Unless Proven Otherwise)

Who Pays Analysis
| Position | Rationale |
|---|---|
| Cache-aside over write-through | Decouples cache availability from read availability |
| TTL is safety net, not strategy | Primary invalidation should be explicit on write path |
| Invalidation before eviction | Design the invalidation contract before discussing LRU vs LFU |
| Serve stale only with sign-off | Product must explicitly approve staleness budget per data type |
| Financial/auth data bypasses cache | Correctness cost of staleness exceeds performance benefit |
| If invalidation can't be owned, don't cache | Unowned invalidation = guaranteed stale data incidents |

Quick Reference: What Interviewers Probe

| After You Say... | They Will Ask... |
|---|---|
| "Add Redis cache" | "What's your invalidation strategy? What happens on write?" |
| "TTL of 5 minutes" | "What if the data changes? Is 5 minutes of staleness acceptable?" |
| "Cache-aside pattern" | "What about thundering herd on cache miss?" |
| "We'll add replicas" | "Replicas don't answer cold-start. What's your warming strategy?" |
| "Invalidate on write" | "How do you handle race conditions between write and invalidation?" |

Jump to Practice

Active Drills (§7) — 10 practice prompts with expected answer shapes

System Architecture Overview

[Diagram: full system architecture — browser cache → CDN edge → Redis Cluster → PostgreSQL]

Interview Walkthrough: How to Present This in 45 Minutes

The HelloInterview-style guides walk you through each step at tutorial pace. That's fine for Senior candidates. At Staff level, the basics should take 10-12 minutes — fast enough that you spend the remaining 30+ minutes on the invalidation, failure, and consistency questions that actually determine your level.

The six phases below add up to 45 minutes. The ratios matter: phases 1-4 are deliberately compressed so phase 5 gets the lion's share of time. If you're spending more than 12 minutes before the transition to depth, you're pacing like an L5.

Phase 1: Requirements & Framing (2-3 minutes)

State functional requirements in 30 seconds — don't enumerate, state the category:

  • "We need a distributed caching layer to reduce database load and serve repeated reads at sub-millisecond latency."

That's it. Don't list every data type or cache operation.

Invest time on non-functional requirements (this is the Staff move):

  • "What's the staleness budget? Product needs to define acceptable staleness per data type — prices need <30s, product descriptions can tolerate 5 minutes, user sessions need zero staleness."
  • Clarify: read-to-write ratio (100:1 justifies caching, 2:1 probably doesn't), dataset size (does it fit in memory?), consistency model (eventual vs strong)
  • "I'll assume a read-heavy workload with per-type staleness budgets, because that's the most common production scenario and forces the hardest invalidation decisions."

Phase 2: Core Entities & API (1-2 minutes)

State entities quickly (30 seconds):

  • CacheEntry — key, value, TTL, version/etag (the version enables conditional invalidation)
  • InvalidationEvent — key_pattern, source, timestamp, propagation_status (first-class entity, not an afterthought)
  • CachePolicy — data_type, staleness_budget, eviction_strategy, origin_fallback behavior

API (1 minute) — transparent cache-aside in the application layer, not a separate API:

get(key) → HIT(value, age) | MISS
set(key, value, ttl, invalidation_policy) → OK
invalidate(key_or_pattern, reason) → OK

The invalidation path is the one that matters:

on_write(entity) → invalidate(cache_key(entity), "source_write")
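The three paths above can be sketched as a minimal in-memory cache-aside client. This is a sketch under assumptions: `cache` and `db` are plain dicts standing in for Redis and PostgreSQL, and the `CacheAside` class is illustrative, not an API from this playbook.

```python
import time

class CacheAside:
    """Illustrative cache-aside client. `cache` and `db` are plain dicts
    standing in for Redis and PostgreSQL (an assumption, not a real API)."""

    def __init__(self, cache, db, ttl_seconds=300):
        self.cache, self.db, self.ttl = cache, db, ttl_seconds

    def get(self, key):
        entry = self.cache.get(key)
        if entry is not None:
            value, stored_at = entry
            if time.time() - stored_at < self.ttl:
                return value               # HIT(value, age)
            self.cache.pop(key, None)      # expired: treat as MISS
        value = self.db[key]               # MISS: read the origin
        self.cache[key] = (value, time.time())   # populate with TTL
        return value

    def write(self, key, value):
        # on_write: write the origin first, then DELETE the cache entry
        # (delete, not update, to avoid concurrent-writer races).
        self.db[key] = value
        self.cache.pop(key, None)

cache, db = {}, {"user:1": "alice"}
store = CacheAside(cache, db)
store.get("user:1")           # miss: populates the cache
store.write("user:1", "bob")  # invalidates "user:1"
```

Note the write path deletes rather than updates the cached value; updating in place is where write/invalidation races come from.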

Phase 3: High-Level Architecture (5-7 minutes)

Draw the core cache-aside flow on the whiteboard:

[Diagram: core cache-aside flow — app server → Redis → PostgreSQL]

Walk the interviewer through the four data flows (reference the full System Architecture diagram above for the complete multi-layer picture):

  1. Read path → App checks Redis first; on miss, reads from PostgreSQL, populates cache with data-type-specific TTL
  2. Write path → App writes to PostgreSQL, then invalidates the cache key (delete, not update — avoids race conditions)
  3. Invalidation propagation → For critical data, CDC (change data capture) publishes invalidation events as a backstop — so even if the application forgets to invalidate, the database change stream catches it
  4. Failure path → When Redis is unavailable, requests fall through to PostgreSQL with circuit breaker protection and request coalescing (singleflight) to prevent thundering herd

Scripted walkthrough: "Read path: the app server checks Redis first — cache-aside. On miss, reads from PostgreSQL, populates cache with TTL. We use request coalescing so if 100 concurrent requests miss on the same key, only one hits the database. Write path: app writes to PostgreSQL, then invalidates the cache key. CDC publishes invalidation events as a backstop."

Key points to hit on the whiteboard:

  1. Cache-aside pattern — application controls both read and write paths (not write-through, which couples cache to write latency)
  2. Redis Cluster with hash slots — 6 shards for horizontal scaling, consistent hashing for key distribution
  3. Request coalescing — singleflight pattern prevents thundering herd on cache miss
  4. Write-path invalidation — delete on write, not update; avoids race conditions between concurrent writers
  5. CDN as first cache layer — browser cache → CDN edge → Redis → PostgreSQL; four-layer hierarchy

Then immediately flag the key tension: "This works for the happy path. The interesting questions are: what happens when Redis goes down and 100% of traffic hits PostgreSQL? Who owns the invalidation contract when 5 different services write to the same entity? And how do you detect that your cache is serving stale data without anyone noticing?"

Phase 4: Transition to Depth (1 minute)

At this point you have a correct, simple architecture on the board. Now you pivot:

"The basic architecture is well-understood — cache-aside with Redis and TTL-based expiry. What makes this Staff-level is the consistency and operational reasoning. Let me dive into three areas: (1) invalidation strategy and who owns it, (2) failure modes when the cache layer goes down, (3) how to detect and measure staleness in production."

Then offer the interviewer a choice:

"I can go deep on any of these. Which is most interesting to you?"

If the interviewer doesn't have a preference, lead with invalidation strategy — it's the most universally asked and the most misunderstood.

Phase 5: Deep Dives (25-30 minutes)

The interviewer will steer, but be prepared to go deep on any of these. For each, follow the Staff pattern: state the tradeoff → pick a position → quantify the cost → explain who absorbs that cost.

Fault Line 1: Freshness vs performance — the staleness budget (5-7 min)

Open with the business framing:

"Every cache entry has a staleness budget — the maximum age the business will tolerate. For product prices, that's <30 seconds (stale prices lose money). For product descriptions, 5 minutes is fine (nobody cares if a typo fix takes 5 minutes to propagate). For user sessions, it's zero (stale session = security vulnerability)."

Go deeper — walk through the TTL decision framework:

  1. Classify data types by staleness tolerance: real-time (<10s), near-real-time (30s-5min), eventual (>5min)
  2. For real-time data: TTL is a safety net, not the primary mechanism. Use event-driven invalidation (CDC or explicit delete-on-write)
  3. For near-real-time: TTL alone is sufficient. Set TTL = staleness_budget × 0.8 (leave 20% margin for clock skew)
  4. For eventual: Long TTL (hours/days) with background refresh. These entries are the highest-value cache entries — they offload the most database reads
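A minimal sketch of this decision framework, assuming the tier thresholds above and the 0.8 margin rule; the `ttl_policy` function name and return shape are illustrative:

```python
def ttl_policy(staleness_budget_s: float) -> dict:
    """Map a business staleness budget (seconds) to a cache policy.
    Tier thresholds and the 0.8 margin follow the framework above."""
    ttl = staleness_budget_s * 0.8        # 20% margin for clock skew
    if staleness_budget_s < 10:           # real-time tier
        return {"tier": "real-time",
                "invalidation": "event-driven (CDC / delete-on-write)",
                "ttl_s": ttl}             # TTL is only a safety net here
    if staleness_budget_s <= 300:         # near-real-time tier
        return {"tier": "near-real-time",
                "invalidation": "ttl-only",
                "ttl_s": ttl}
    return {"tier": "eventual",           # long TTL + background refresh
            "invalidation": "background-refresh",
            "ttl_s": ttl}

ttl_policy(30)    # e.g. prices: near-real-time, 24s TTL
ttl_policy(3600)  # e.g. slow-changing content: eventual tier
```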

The Staff follow-up: "The dangerous case is when someone sets a 24-hour TTL on price data because 'it rarely changes.' It doesn't change — until it does, and then customers see stale prices for 24 hours. That's why TTL ownership should be in the product spec, not the code."

Cross-reference §3.1 Freshness vs Performance for the full analysis.

Fault Line 2: Cache failure and thundering herd (5-7 min)

"When Redis goes down, every request becomes a cache miss. If you have a 95% hit rate and 10K requests/second, that means 9,500 requests/second that were hitting cache now hit PostgreSQL directly. Your database is provisioned for 500 requests/second. It dies."

Name the mitigations in order of priority:

  1. Request coalescing (singleflight) — if 100 concurrent requests miss on key X, only one hits the database; the other 99 wait for the result. This alone handles 80% of thundering herd scenarios
  2. Circuit breaker on the database — if database latency exceeds threshold, reject new requests with a fallback (stale data from local cache, degraded response, or 503)
  3. Stale-while-revalidate — serve the expired cache entry while refreshing in the background. The client gets slightly stale data instead of a slow or failed response
  4. Local in-process cache (L1) — small LRU cache in each app server (1000 hot keys). Survives Redis failures with degraded consistency

Quantify the recovery: "With singleflight + circuit breaker, a Redis outage degrades latency from 2ms to 50ms (database reads) but doesn't cascade. Without protection, the database dies within 30 seconds and recovery takes 5-10 minutes because the connection pool is exhausted."
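A minimal sketch of request coalescing, assuming a threading-based app server; the `SingleFlight` name follows Go's singleflight package, and the error propagation a production version needs is omitted:

```python
import threading
import time

class SingleFlight:
    """Request-coalescing sketch: concurrent misses on the same key
    share one origin fetch."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}   # key -> (done_event, result_holder)

    def do(self, key, fetch):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        done, result = entry
        if leader:
            try:
                result["value"] = fetch()      # the ONE origin hit
            finally:
                with self._lock:
                    self._inflight.pop(key, None)
                done.set()                     # wake the followers
        else:
            done.wait()                        # followers wait, no DB hit
        return result["value"]
```

With this in place, 100 concurrent misses on the same key produce exactly one `fetch` call; the other 99 callers block until the leader's result is ready.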

Fault Line 3: Invalidation strategy — proactive vs reactive (5-7 min)

"There are three invalidation approaches: (a) TTL-only (reactive — you wait for expiry), (b) explicit invalidation on write (proactive — you delete when data changes), (c) CDC-driven invalidation (proactive + reliable — the database change stream triggers invalidation). I'd use a combination: TTL as a safety net + explicit invalidation for the write path + CDC as a backstop."

The organizational problem: "With 5 microservices writing to the products table, who is responsible for invalidating the product cache? If the price service updates the price but doesn't invalidate the cache, the product service serves stale prices. That's why CDC is the backstop — it doesn't depend on every service remembering to invalidate."
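The CDC backstop can be sketched as a small consumer of database change events. The event shape, `op` values, and `table:pk` key scheme here are assumptions for illustration:

```python
def cache_key(event: dict) -> str:
    # Hypothetical key scheme: "<table>:<primary_key>"
    return f"{event['table']}:{event['pk']}"

def apply_cdc_event(event: dict, cache: dict) -> None:
    """Backstop invalidation: regardless of which service wrote the row,
    the change stream deletes the cached copy (delete, never update)."""
    if event["op"] in ("insert", "update", "delete"):
        cache.pop(cache_key(event), None)

cache = {"products:42": {"price": 10}}
# The price service updated the row but forgot to invalidate; CDC catches it.
apply_cdc_event({"table": "products", "pk": 42, "op": "update"}, cache)
```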

Cache-aside vs write-through vs read-through (3-5 min)

"Cache-aside gives the application full control — it's the most common pattern and the most debuggable. Write-through couples your write latency to cache latency (now every write is DB + cache round trip). Read-through hides the caching logic in a library but makes debugging harder — when data is stale, you can't tell if the cache missed, the TTL is wrong, or the invalidation failed."

Pick a position: "I default to cache-aside for most systems. Write-through only when the write volume is low and you need guaranteed cache warmth (e.g., configuration data). Read-through only when you want the cache to be the primary interface and can accept the debugging cost."

Operational maturity: measuring staleness in production (3-5 min)

"How do you know your cache is serving stale data? You can't just check TTLs — you need to measure actual staleness. The approach: periodically sample cache entries, compare their version/etag against the database source of truth, and report the staleness distribution."

Three metrics that matter:

  1. Hit rate per data type — if product prices have a 99.9% hit rate with a 30s TTL, you're serving a lot of cached prices. Is that safe?
  2. Staleness distribution — p50/p95/p99 age of cache entries at read time. If p99 staleness exceeds the business SLA, your TTL or invalidation is broken
  3. Invalidation propagation delay — time between database write and cache invalidation completion. If this exceeds 5 seconds, your "real-time" invalidation isn't real-time
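A minimal sketch of the sampling approach, assuming cache entries carry a `(version, written_at)` pair and a `db_version(key)` lookup for the authoritative version; nearest-rank percentiles keep it short:

```python
import random
import time

def sample_staleness(cache, db_version, sample_size=100, now=None):
    """Sample cached entries, compare versions against the source of
    truth, and report the age distribution. `cache` maps
    key -> (version, written_at); `db_version(key)` returns the
    authoritative version (both shapes are assumptions)."""
    now = time.time() if now is None else now
    keys = random.sample(list(cache), min(sample_size, len(cache)))
    ages = sorted(now - cache[k][1] for k in keys)
    stale = sum(1 for k in keys if cache[k][0] != db_version(k))
    def pct(p):   # nearest-rank percentile of entry age
        return ages[min(len(ages) - 1, int(p * len(ages)))]
    return {"p50_age_s": pct(0.50), "p99_age_s": pct(0.99),
            "stale_fraction": stale / len(keys)}
```

Run on a schedule, this gives the p50/p99 staleness and stale-fraction metrics above without trusting TTLs to be correct.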

Phase 6: Wrap-Up (2-3 minutes)

Summarize the key tradeoff — don't just restate your architecture, synthesize the insight:

"Distributed caching is a staleness management problem, not a performance optimization. The Staff-level challenge is: who defines the staleness budget, who owns the invalidation contract, and how do you detect when the contract is violated? The architecture is straightforward — cache-aside with Redis, TTL as safety net, CDC for proactive invalidation. The hard problem is organizational: making sure every write path has a corresponding invalidation path, and measuring staleness in production rather than assuming TTLs are correct."

If time permits, add the counterintuitive insight:

"Sometimes the right answer is to remove the cache. If your cache hit rate is 30%, you're paying for Redis infrastructure to serve 30% of reads while adding invalidation complexity for 100% of writes. At that point, scale the database instead. Caching is not free — it's a complexity trade for latency. Only make that trade when the hit rate justifies it."

Common Timing Mistakes

Level Calibration
| Mistake | L5 Does This | L6 Does This |
|---|---|---|
| 10 min on requirements | Lists every data type to cache | States staleness budget concept in 1 min, moves on |
| 10 min on cache-aside | Explains get-check-miss-fill at tutorial pace | "Cache-aside with Redis. Here's the architecture." |
| No invalidation discussion | Only mentions TTL | Draws the write-path invalidation + CDC backstop proactively |
| No thundering herd | Waits for "what if Redis goes down?" | Volunteers singleflight + circuit breaker in the architecture phase |
| No staleness measurement | Assumes TTLs are correct | Proposes staleness sampling and p99 staleness metrics |
| No numbers | "It should be fast" | "95% hit rate → database sees 5% of traffic. Redis failure = 20x load increase." |

Reading the Interviewer

| Interviewer Signal | What They Care About | Where to Go Deep |
|---|---|---|
| Asks about consistency | Data correctness | Invalidation strategy, CDC, staleness measurement (§3.1) |
| Asks about Redis failure | Operational maturity | Thundering herd, circuit breaker, singleflight (§3.3) |
| Asks about write patterns | Architecture depth | Cache-aside vs write-through, invalidation ownership (§3.2) |
| Asks "who decides TTLs?" | Organizational design | Staleness budgets, product ownership, per-type policies |
| Asks about local cache | Performance engineering | L1/L2 cache hierarchy, consistency tradeoffs (§3.4) |
| Pushes back on cache-aside | Wants to see you reason about alternatives | Write-through for warm cache, read-through for abstraction |

What to Deliberately Skip

These topics are traps. L5 candidates spend time on them. Staff candidates name them, dismiss them, and redirect to what matters.

Level Calibration
| Topic | Why L5 Goes Here | What L6 Says Instead |
|---|---|---|
| Redis vs Memcached | Feels like showing breadth | "Redis — richer data structures, persistence, Pub/Sub for invalidation." |
| LRU vs LFU eviction | Easy to explain | "LRU for general use, LFU for hot-key workloads. Not the interesting problem." |
| Cache warming strategies | Seems like completeness | "Pre-warm on deploy from the database. Straightforward." |
| Redis Sentinel vs Cluster | Infrastructure trivia | "Redis Cluster for sharding, Sentinel for HA on single-shard. Moving on." |
| Serialization format | Easy to enumerate | "Protocol Buffers for compact binary. JSON for debugging. Not a design decision." |

The pattern: acknowledge you know it, state your position in one sentence, redirect to the interesting problem — invalidation, staleness, and organizational ownership.

→ Continue to The Five Fault Lines (§3) for the Staff-grade tradeoff reasoning.

1. The Staff Lens

1.1 Why This Problem Exists in Staff Interviews

This is NOT a "speed up reads" question. Everyone knows how to add Redis.

This is a Consistency & Operational Ownership question that tests:

  • Whether you understand that caching trades correctness for performance
  • Whether you reason about invalidation contracts, not just TTLs
  • Whether you can articulate failure scenarios and their blast radius
  • Whether you understand the operational burden of cache infrastructure

1.2 The L5 vs L6 Contrast

Level Calibration
| Behavior | L5 (Senior) Candidate | L6 (Staff) Candidate |
|---|---|---|
| First move | "Add Redis in front of the DB" | Asks "What's the read/write ratio? What's acceptable staleness?" |
| Invalidation | Defaults to TTL | Designs explicit invalidation contract tied to write path |
| Failure mode | "Add replicas" | "What's the thundering herd mitigation? Cache-miss circuit breaker?" |
| Consistency | Assumes caching always helps | Knows when caching hurts (write-heavy, low reuse, consistency-critical) |
| Ownership | Implementation focus | Platform thinking: who warms, who monitors, who gets paged |

Behavior 1: First move (understand the access pattern)

Staff signal: Characterize the workload before proposing architecture.

Why this matters (L5 vs L6)

L5: Jumps to "add a cache" without understanding the access pattern. This leads to caches with low hit rates (write-heavy data), or caches that make consistency bugs worse.

L6: Asks about read/write ratio, access skew (hot keys vs uniform), staleness tolerance, and data size. Then commits to a caching strategy that fits. "This is 95% reads with high locality — caching will help. If it were 50/50 reads/writes, I'd question whether caching adds value."

Behavior 2: Invalidation strategy (TTL is not a strategy)

Staff signal: Design an explicit invalidation contract before discussing eviction.

Why this matters (L5 vs L6)

L5: Says "TTL of 5 minutes" as the primary invalidation mechanism. TTL is a fallback, not a strategy. It means "we'll serve stale data for up to 5 minutes after a write."

L6: Designs invalidation around the write path: "On user update, we invalidate the user cache entry. TTL is a safety net for orphaned entries, not the primary mechanism." The Staff question is: who owns the invalidation contract, and what happens if it fails?

[Diagram: write-path invalidation contract]

Behavior 3: Failure handling (replicas don't answer thundering herd)

Staff signal: Design for cache failure, not just cache slowness.

Why this matters (L5 vs L6)

L5: Treats cache failure as "add replicas / HA." That improves availability, but doesn't answer: what happens when the cache is cold (restart, failover, new deployment)? What prevents thousands of requests from stampeding the origin?

L6: Designs for thundering herd: request coalescing, cache-miss circuit breaker, probabilistic early expiry. Names the blast radius: "If the cache goes cold, can the origin handle the load? If not, we need a degraded mode."
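Probabilistic early expiry deserves a concrete form, since it is the least-known of the three mitigations. This sketch uses one common formulation (sometimes called XFetch): the refresh probability grows as expiry nears, scaled by how expensive the recompute is. Names and defaults are illustrative:

```python
import math
import random
import time

def should_refresh_early(expiry_ts, recompute_cost_s, beta=1.0,
                         now=None, rng=random.random):
    """Probabilistic early expiry: each reader independently refreshes a
    still-valid entry with a probability that rises as expiry approaches,
    so a hot key is never recomputed by a herd at the same instant.
    beta > 1 refreshes more eagerly."""
    now = time.time() if now is None else now
    # log(rng()) < 0, so this shifts `now` forward by a random amount
    # proportional to the recompute cost.
    return now - recompute_cost_s * beta * math.log(rng()) >= expiry_ts
```

Far from expiry this almost never fires; close to expiry some single reader fires first and refreshes for everyone.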

Behavior 4: Consistency model (know when caching hurts)

Staff signal: Articulate when caching makes the system worse.

Why this matters (L5 vs L6)

L5: Treats caching as universally good. "Caching always improves performance."

L6: Knows caching can hurt: write-heavy workloads (invalidation churn), low-reuse data (cache pollution), consistency-critical paths (stale reads cause bugs), small datasets that fit in DB buffer pool (redundant layer). The Staff move is to state when you wouldn't cache.

Behavior 5: Ownership (who warms, who monitors)

Staff signal: Design for the organization, not just one service.

Why this matters (L5 vs L6)

L5: Focuses on "how to implement caching" in isolation. Doesn't consider who warms the cache on deploy, who monitors hit rate, who gets paged when the cache is slow.

L6: Treats caching as a platform concern: standardized patterns, shared infrastructure, consistent observability. "Cache hit rate below 80% for 5 minutes pages the owning team. Warming is automated on deploy."

1.3 The Key Insight

Caching is a consistency problem disguised as a performance optimization. Staff engineers reason about who pays for staleness and who owns the invalidation contract.

2. Problem Framing & Intent

2.1 The Three Intents

Before drawing any boxes, ask Why? The caching strategy changes entirely based on intent:

Who Pays Analysis
| Intent | Constraint | Strategy | Staleness Tolerance | Failure Mode |
|---|---|---|---|---|
| Latency Reduction | Speed is everything | Aggressive caching, read-through, local caches | Seconds OK | Serve stale > fail |
| Origin Protection | Shield backend from load | Cache-aside, circuit breakers, request coalescing | Minutes OK | Degrade gracefully |
| Cost Optimization | Reduce expensive operations | Precompute + cache, longer TTLs, lazy refresh | Minutes-hours OK | Stale > recompute |

Staff Move: "I'll assume we're protecting the origin from read load while maintaining sub-second staleness for user-facing data. This is the hardest case because we need both availability and freshness."

This sentence alone separates L5 from L6.

2.2 What's Intentionally Underspecified

The interviewer deliberately avoids specifying:

  • Read/write ratio
  • Data size and cardinality
  • Staleness tolerance
  • Cache budget (memory cost)
  • Multi-region requirements

Staff engineers surface these unknowns. Senior engineers assume them away.

2.3 How to Open (The First 2 Minutes)

  1. Ask 1-2 clarifying questions about access pattern and staleness
  2. State your assumption explicitly
  3. Outline your plan: access pattern → caching strategy → invalidation → failure modes → observability

Example opening:

"Before I draw anything: what's the read/write ratio, what's the access skew, and what staleness can the business tolerate? I'll assume a read-heavy workload with per-type staleness budgets. My plan: access pattern → caching strategy → invalidation → failure modes → observability."

If Asked: How to characterize workload without sounding junior

What interviewers expect you to name:

  • Read/write ratio (heavy reads = cache helps; balanced = question it)
  • Access skew (hot keys = cache helps; uniform = less value)
  • Data freshness requirements (real-time vs eventual)
  • Data size (fits in memory? needs eviction strategy?)

What NOT to say:

  • "We need to cache everything" (no selectivity)
  • "5 minute TTL should be fine" (no reasoning about staleness)
  • "Redis will handle it" (no architecture)

Staff-calibrated phrasing:

"This is 95% reads with high locality — caching will help. If it were 50/50 reads/writes, I'd question whether caching adds value. I'll set a staleness budget per data type and design invalidation around the write path."

2.4 Terminology (Use Precise Words)

Caching interviews are ambiguous about where and how caching happens. Use precise terms:

| Term | What It Means | Consistency Model |
|---|---|---|
| Local/In-Process Cache | Same process as application | Per-instance, no coordination |
| Distributed Cache | Shared cache cluster (Redis, Memcached) | Shared state, single source |
| CDN Cache | Edge caching for static/semi-static content | Eventually consistent |
| Database Buffer Pool | DB's internal page cache | Transparent to application |

| Pattern | Write Path | Read Path | Invalidation |
|---|---|---|---|
| Cache-Aside | Write to DB, invalidate cache | Read cache, miss → read DB → populate cache | Explicit on write |
| Read-Through | Write to DB | Read cache, miss → cache fetches from DB | TTL or explicit |
| Write-Through | Write to cache → cache writes to DB | Read from cache | Implicit (write updates cache) |
| Write-Behind | Write to cache → async write to DB | Read from cache | Implicit (write updates cache) |
If Asked: Cache topology you should be able to articulate

Describe the layers, not implementation details:

"Browser cache → CDN edge → in-process L1 → shared L2 (Redis) → PostgreSQL. Each layer absorbs a class of repeat reads; the invalidation contract lives at L2."

If pressed for specifics:

  • L1: In-process LRU, 1000 entries, 10-second TTL
  • L2: Redis cluster, sharded by key hash, 5-minute TTL
  • Invalidation: Write path invalidates L2, L1 expires via short TTL

What you do NOT need:

  • Exact memory sizes
  • Redis cluster configuration
  • Serialization format details

Staff insight: The topology is simple. The hard part is the invalidation contract and failure handling.

2.5 When NOT to Cache (Staff Candidates Say No)

Staff candidates win interviews by knowing when to not cache. This is the highest-signal behavior.

Do NOT cache when:

| Scenario | Why Caching Hurts |
|---|---|
| Write-heavy data | Invalidation churn exceeds read savings. Cache becomes a write amplifier. |
| Low-reuse data | Each key read once. Cache fills with cold data, evicts hot data. |
| Consistency-critical paths | Auth, permissions, financial data. Stale = wrong = incident. |
| Data already hot in DB | DB buffer pool already caches it. You're adding a redundant layer. |
| Small datasets | If it fits in DB memory, caching adds latency (network hop) not savings. |
| Invalidation can't be owned | If no one knows when to invalidate, you will serve stale data forever. |

Staff Move: "Before I add caching, let me check if it's even appropriate. What's the read/write ratio? What's the reuse factor? What's the staleness tolerance? For write-heavy or consistency-critical data, I'd push back on caching entirely."

Bar-Raiser Follow-up: "When would you tell the team NOT to cache this data?"

Expected answer: "If the write rate is close to the read rate, invalidation will dominate. If the data is consistency-critical and staleness causes bugs or compliance issues, caching is the wrong tool."

3. The Five Fault Lines

This section contains the Staff-grade tradeoff reasoning. Each fault line includes:

  • A tradeoff matrix
  • Explicit "who pays" analysis
  • L6 vs L7 calibration
  • Bar-raiser follow-up questions

3.1 Fault Line 1: Freshness vs Performance

Who Pays Analysis
| Choice | What Works | What Breaks | Who Pays |
|---|---|---|---|
| Short TTL / Aggressive Invalidation | Fresh data | Higher origin load, more invalidation traffic | Infra (capacity), Eng (complexity) |
| Long TTL / Lazy Invalidation | Lower origin load | Stale data, user confusion | Product (UX), Support (complaints) |

The tradeoff: Every second of TTL is a second of potential staleness. But shorter TTLs mean more origin traffic.

L6 (Staff) answer: Ties staleness budget to business requirements. "For user profile data, 30 seconds is acceptable — users don't expect instant updates. For inventory counts, we need tighter consistency, so we use write-through invalidation with a short TTL fallback."

L7 (Principal) answer: Establishes org-wide staleness SLAs by data class. "We have three data tiers: real-time (no caching or <1s), near-real-time (sub-minute), and batch (hour+). Each tier has a standard pattern and observability."

3.2 Fault Line 2: Cache-Aside vs Read-Through vs Write-Through

Who Pays Analysis
| Pattern | Invalidation Ownership | Pros | Cons | Who Pays |
|---|---|---|---|---|
| Cache-Aside | Application owns invalidation | Explicit control, cache failure = DB fallback | Invalidation logic scattered, race conditions | Application teams (implement invalidation per service), Infra (cache operational burden) |
| Read-Through | Cache owns fetch | Simple read path, automatic population | Cache failure = read failure, less control | Infra (tight coupling to cache service), Application teams (lose control, cache downtime = service downtime) |
| Write-Through | Cache owns persistence | Always consistent, simple reads | Write latency includes cache, tight coupling | Users (slower writes), Infra (cache-DB coupling, complex failure modes) |
[Diagram: cache-aside vs read-through vs write-through data flows]

L6 (Staff) answer: Chooses cache-aside for most cases because it decouples cache availability from read availability. "If the cache is down, we fall back to the DB — slower, but not broken. Write-through couples us too tightly."

L7 (Principal) answer: Evaluates pattern choice based on failure modes and organizational capability. "Cache-aside requires every team to implement invalidation correctly. If we don't have that discipline, read-through with a managed cache might be safer despite the coupling."

3.3 Fault Line 3: Availability vs Consistency

Decision Framework:

| Context | Recommended | Why |
|---|---|---|
| User-facing reads | Serve stale | UX > perfect consistency |
| Financial data | Fail or hit origin | Correctness > availability |
| Inventory/stock | Depends | Oversell vs undersell tolerance |
[Diagram: serve-stale vs hit-origin decision on cache failure]

L6 (Staff) answer: Classifies data by consistency requirements. "User preferences can serve stale — users won't notice a 30-second delay. But inventory must hit origin on cache miss because overselling is worse than latency."

L7 (Principal) answer: Defines consistency tiers as organizational policy with standard patterns and observability for each tier.

Ownership note: In practice, availability-first choices (serve stale) shift cost to Users (stale data experience) and Support (complaint volume). Consistency-first choices (fail or hit origin) shift cost to Engineering (circuit breaker complexity) and Infra (origin capacity to handle cache-miss load).

→ For the complete decision framework, see Degraded Mode Framework — applies to cache failures, origin protection, and graceful degradation.

3.4 Fault Line 4: Local Cache vs Distributed Cache

Who Pays Analysis
| Choice | Latency | Consistency | Memory Efficiency | Who Pays |
|---|---|---|---|---|
| Local (in-process) | Microseconds | Per-instance (inconsistent across instances) | Duplicated per instance | Users (inconsistent experience), Infra (N × cache memory cost) |
| Distributed (Redis) | Milliseconds | Shared (consistent across instances) | Shared pool | Users (network latency per request), Infra (Redis operational burden) |
| Multi-tier (L1 + L2) | Microseconds + fallback | L1 inconsistent, L2 consistent | Best of both | Engineering (two-tier invalidation complexity), Infra (manage both systems) |

The tradeoff: Local caches are fast but create consistency issues across instances. Distributed caches are consistent but add network latency.


L6 (Staff) answer: Uses multi-tier for high-traffic data. "L1 in-process cache with 10-second TTL absorbs repeat requests within a single instance. L2 Redis handles cross-instance consistency. L1 staleness is bounded by its short TTL."

L7 (Principal) answer: Standardizes multi-tier patterns across the org with clear guidance on when to use each tier and how to reason about aggregate staleness (L1 TTL + L2 TTL).
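The L1 + L2 read path described above can be sketched as follows. This is a minimal illustration, not production code: a plain dict stands in for the shared L2 store (Redis in practice), and the class name and 10-second L1 TTL are illustrative.

```python
import time

class TwoTierCache:
    """L1: per-process dict with a short TTL. L2: shared store (a dict here,
    standing in for Redis). L1 staleness is bounded by l1_ttl seconds."""

    def __init__(self, l2_store, l1_ttl=10.0):
        self.l1 = {}          # key -> (value, expires_at)
        self.l2 = l2_store    # shared across instances
        self.l1_ttl = l1_ttl

    def get(self, key, load_from_origin):
        entry = self.l1.get(key)
        if entry and entry[1] > time.time():
            return entry[0]                  # L1 hit: microseconds
        value = self.l2.get(key)             # L2 hit: milliseconds
        if value is None:
            value = load_from_origin(key)    # miss: hit the database
            self.l2[key] = value
        self.l1[key] = (value, time.time() + self.l1_ttl)
        return value
```

Note the staleness arithmetic this implies: worst case, a value is L2 TTL old when L1 picks it up, then lives another L1 TTL in-process.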

3.5 Fault Line 5: Proactive vs Reactive Invalidation

Who Pays Analysis
| Approach | Freshness | Complexity | Failure Mode | Who Pays |
|---|---|---|---|---|
| Proactive (push on write) | Immediate | High (coordination) | Failed invalidation = stale data | Engineering (invalidation logic in all write paths), Infra (coordination overhead) |
| Reactive (TTL expiry) | Delayed | Low | Guaranteed staleness up to TTL | Product (stale data complaints), Support (explain delays to users) |
| Hybrid (push + TTL fallback) | Immediate with safety | Medium | Best of both | Engineering (implementation complexity), but reduces Product/Support burden |

L6 (Staff) answer: Uses hybrid — proactive invalidation on write with TTL as safety net. "On user update, we delete the cache key. TTL handles cases where invalidation fails or the write path changes."

L7 (Principal) answer: Implements change data capture (CDC) for invalidation at scale. "Rather than coupling invalidation to every write path, we tail the DB changelog and invalidate asynchronously. This decouples writers from cache knowledge."
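The hybrid approach (delete on write, TTL as safety net) can be sketched like this. Plain dicts stand in for the real database and cache client, and the key format and 5-minute TTL are illustrative assumptions.

```python
CACHE_TTL = 300  # 5-minute safety net for missed invalidations

def update_user(db, cache, user_id, fields):
    """Write path: persist first, then invalidate. If the delete fails,
    the TTL bounds how long stale data can live."""
    db[user_id] = {**db.get(user_id, {}), **fields}
    try:
        cache.pop(f"user:{user_id}", None)   # proactive invalidation
    except Exception:
        pass  # a real cache client can fail here; TTL still caps staleness

def get_user(db, cache, user_id, now, ttl=CACHE_TTL):
    """Read path: cache-aside, with the TTL stamped at populate time.
    `now` is passed in rather than read from the clock, for testability."""
    entry = cache.get(f"user:{user_id}")
    if entry and entry["expires_at"] > now:
        return entry["value"]
    value = db.get(user_id)
    cache[f"user:{user_id}"] = {"value": value, "expires_at": now + ttl}
    return value
```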

→ For invalidation coordination patterns, see Distributed Coordination Framework.

4. Failure Modes & Degradation

→ This section applies the Degraded Mode Framework. Review it if you need the full availability vs consistency decision tree.

4.1 Thundering Herd (Cache Stampede)

The most common cache failure mode. When a popular cache entry expires or the cache restarts, many requests simultaneously miss the cache and hit the origin.

Timeline:

t=0:     Cache entry for "homepage_data" expires (TTL)
t=0-1s:  1000 concurrent requests hit cache → all miss
t=0-1s:  1000 requests hit database simultaneously
t=1-2s:  Database CPU spikes to 100%, queries slow
t=2-5s:  Database connection pool exhausted
t=5s+:   Cascading failures, homepage down

What breaks first: Database connection pool, then query latency, then availability.


Mitigations:

| Mitigation | How It Works | Tradeoff |
|---|---|---|
| Request coalescing | First request fetches; others wait | Added latency for waiters |
| Probabilistic early expiry | Expire slightly before TTL (jitter) | Some extra origin load |
| Cache-miss circuit breaker | Limit concurrent origin requests | Some requests fail |
| Background refresh | Refresh before expiry | Complexity, always slightly stale |
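Probabilistic early expiry is often implemented with an XFetch-style rule: each reader independently decides, with rising probability near expiry, to refresh now. A sketch, with illustrative parameter names:

```python
import math
import random
import time

def should_refresh_early(expires_at, fetch_cost_s, beta=1.0):
    """XFetch-style probabilistic early expiry. Readers near the TTL edge
    occasionally refresh early, so refreshes spread out in time instead of
    all firing at the instant the entry expires."""
    rand = max(random.random(), 1e-12)   # guard against log(0)
    # The further we are from expiry, the less likely this triggers;
    # beta > 1 refreshes earlier, beta < 1 later.
    return time.time() - fetch_cost_s * beta * math.log(rand) >= expires_at
```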


4.2 Hot Key Problem

The same hot-key problem seen in rate limiting: one cache key receives disproportionate traffic, overwhelming a single cache shard.

Scenario: Viral Content

Timeline:

t=0:     Content goes viral
t=1min:  Traffic to one key grows 100x
t=2min:  Single Redis shard CPU at 100%
t=3min:  Shard latency spikes, timeouts begin
t=5min:  Shard becomes unavailable
t=5min+: All requests for hot key fail

Mitigations:

| Mitigation | How It Works | Tradeoff |
|---|---|---|
| Local cache (L1) for hot keys | Absorb traffic at application tier | Staleness across instances |
| Key replication | Replicate hot keys across shards | Memory overhead, invalidation complexity |
| Read replicas | Direct hot-key reads to replicas | Eventual consistency |
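Key replication can be sketched by suffixing the logical key, so reads for one hot key fan out across several physical keys (and therefore shards). A minimal illustration with a dict standing in for the cache; the replica count and key format are arbitrary choices.

```python
import random

REPLICAS = 8  # copies of each hot key, spread across shards by key hash

def hot_key_write(cache, key, value, replicas=REPLICAS):
    """Write every replica so any copy a reader picks is fresh.
    Invalidation must also touch all replicas: that is the cost."""
    for i in range(replicas):
        cache[f"{key}#{i}"] = value

def hot_key_read(cache, key, replicas=REPLICAS):
    """Each reader picks a random replica, so traffic for one logical
    key spreads over `replicas` physical keys."""
    return cache.get(f"{key}#{random.randrange(replicas)}")
```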


4.3 Cold Start (Cache Miss Storm)

Scenario: Cache Cluster Restart

Timeline:

t=0:     Redis cluster restarts (maintenance, failure)
t=0:     100% cache miss rate
t=0-1m:  All reads hit database
t=1m:    Database connection pool exhausted
t=2m:    Service degradation or outage

Why "just restart" doesn't work: If your cache handles 90% of read load, a cold cache means 10x the origin load, all at once.

Mitigations:

Who Pays Analysis
| Mitigation | How It Works | Tradeoff |
|---|---|---|
| Cache warming on deploy | Pre-populate cache before traffic shift | Deployment complexity, warming time |
| Gradual traffic shift | Slowly move traffic to new cache | Longer rollout, coordination |
| Origin capacity headroom | Size origin for cache-miss load | Cost (overprovisioned DB) |
| Stale-while-revalidate | Serve old cache + async refresh | Needs persistent cache |
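Stale-while-revalidate can be sketched as follows: expired entries are served immediately while a single background refresh runs, so the origin never sees a synchronous miss storm. A thread-based illustration under simplified assumptions, not production code.

```python
import threading
import time

class SWRCache:
    """Stale-while-revalidate: on expiry, serve the old value and refresh
    in the background. Only a truly cold key blocks on the origin."""

    def __init__(self, ttl=60.0):
        self.ttl = ttl
        self.entries = {}        # key -> (value, expires_at)
        self.refreshing = set()  # keys with an in-flight refresh
        self.lock = threading.Lock()

    def get(self, key, load_from_origin):
        with self.lock:
            entry = self.entries.get(key)
            if entry and entry[1] <= time.time() and key not in self.refreshing:
                self.refreshing.add(key)     # exactly one refresh per key
                threading.Thread(
                    target=self._refresh, args=(key, load_from_origin)
                ).start()
        if entry:
            return entry[0]                  # fresh OR stale, but fast
        value = load_from_origin(key)        # cold key: blocking load
        with self.lock:
            self.entries[key] = (value, time.time() + self.ttl)
        return value

    def _refresh(self, key, load_from_origin):
        value = load_from_origin(key)
        with self.lock:
            self.entries[key] = (value, time.time() + self.ttl)
            self.refreshing.discard(key)
```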


4.4 Cache Inconsistency (Stale Data Bugs)

Scenario: Failed Invalidation

Timeline:

t=0:     User updates their email
t=0:     DB write succeeds
t=0:     Cache invalidation fails (network blip)
t=0-5m:  User sees old email in UI
t=5m:    TTL expires, fresh data served

What makes this insidious: Silent failure. No errors, no alerts. Just wrong data.

Mitigations:

| Mitigation | How It Works | Tradeoff |
|---|---|---|
| Write-through invalidation | Invalidate in same transaction | Coupling, latency |
| Version/generation stamps | Include version in cache key | Key cardinality |
| Idempotent invalidation | Retry invalidation | Complexity |
| Short TTL fallback | Limit staleness window | More origin load |
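Version/generation stamps can be sketched like this: the write path bumps a version instead of deleting cache entries, so readers compute a new key and stale entries simply become unreachable (the key-cardinality cost noted in the table, since orphaned keys linger until evicted). Dicts stand in for the real stores; names are illustrative.

```python
def cached_read(cache, versions, db, entity_id):
    """Version-stamped cache-aside read. The version lives in a small,
    authoritative store (a dict here); bumping it orphans old cache keys."""
    version = versions.get(entity_id, 0)
    key = f"user:{entity_id}:v{version}"   # version is part of the key
    if key in cache:
        return cache[key]
    value = db[entity_id]
    cache[key] = value
    return value

def write(versions, db, entity_id, value):
    """Write path never touches the cache: it bumps the version, so every
    reader computes a new key and misses through to the fresh value."""
    db[entity_id] = value
    versions[entity_id] = versions.get(entity_id, 0) + 1
```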


4.5 Operational Reality Matrix

| Failure | Loud/Silent | User Impact | Detection Time |
|---|---|---|---|
| Cache down | Loud | Latency spike or errors | Seconds |
| Cache slow | Medium | Latency degradation | Minutes |
| Thundering herd | Loud | Origin overload | Seconds to minutes |
| Hot key | Medium | Single-key latency | Minutes |
| Stale data | Silent | Wrong data shown | Hours to never |
| Cold start | Loud | Origin overload | Seconds |

5. Evaluation Rubric

5.1 Level-Based Signals

Level Calibration
| Dimension | L5/Senior | L6/Staff | L7/Principal |
|---|---|---|---|
| Access Pattern | Assumes caching helps | Characterizes workload; knows when caching hurts | Establishes patterns by data classification |
| Invalidation | TTL-only | Explicit invalidation contract + TTL fallback | CDC-based invalidation, org-wide patterns |
| Consistency | Assumes fresh data | Articulates staleness budget by data type | Defines consistency tiers as policy |
| Failure Modes | "Add replicas" | Thundering herd, hot key, cold start mitigations | Capacity planning for cache-miss scenarios |
| Multi-tier | Single cache layer | L1 + L2 with clear reasoning | Standardized multi-tier patterns |
| Ownership | Implementation focus | Warming, monitoring, paging ownership | Platform-wide caching strategy |

5.2 Strong Hire Signals

| Signal | What It Looks Like |
|---|---|
| Staleness Reasoning | "30 seconds of staleness is acceptable for this use case because..." |
| Invalidation Design | "We invalidate on write, with TTL as safety net for failed invalidations" |
| Failure Awareness | "When the cache is cold, we need to protect the origin with circuit breakers" |
| Ownership Thinking | "Who warms the cache on deploy? Who gets paged when hit rate drops?" |

5.3 Lean No Hire Signals

| Signal | What It Looks Like |
|---|---|
| Redis Fixation | 15 minutes on Redis internals without tradeoffs |
| TTL-Only Thinking | "We'll set a 5-minute TTL" with no invalidation strategy |
| Ignoring Failures | No mention of thundering herd, cold start, or stale data |
| Missing Intent | Caches everything without reasoning about staleness tolerance |

5.4 Common False Positives

  • Knows Redis deeply: Deep Redis knowledge ≠ good cache design
  • Mentions all patterns: Breadth without depth is Senior, not Staff
  • Complex diagrams: Multi-tier diagrams without invalidation reasoning

6. Interview Flow & Pivots

6.1 Typical 45-Minute Structure

| Phase | Time | What Happens |
|---|---|---|
| Framing | 5 min | Clarify access pattern, staleness tolerance |
| Requirements | 5 min | Read/write ratio, data size, consistency needs |
| High-Level Design | 10 min | Caching pattern, invalidation strategy |
| Deep Dive | 15 min | Failure modes, thundering herd, consistency |
| Wrap-Up | 10 min | Operations, monitoring, evolution |

6.2 How Interviewers Pivot

| After You Say... | They Will Probe... |
|---|---|
| After "add Redis" | "What's your invalidation strategy?" |
| After invalidation discussion | "What happens during thundering herd?" |
| After scaling discussion | "How do you handle hot keys?" |
| After happy path | "What if the cache is completely cold?" |

6.3 What Silence Means

  • After tradeoff question: Interviewer wants you to reason aloud
  • After "what about consistency?": You're missing staleness reasoning
  • After definitive answer: They may disagree or want nuance

6.4 Follow-Up Questions to Expect

  1. "What happens when the cache is cold?"
  2. "How do you handle thundering herd?"
  3. "What's your invalidation strategy on write?"
  4. "How do you detect stale data bugs?"
  5. "What's the staleness budget for this data?"
  6. "When would you NOT cache this data?"

7. Active Drills

Practice these scenarios to internalize Staff-level thinking. Try answering before revealing the Staff approach.


Drill 1: The Opening (Access Pattern + Staleness)


Interview prompt: "Design a caching layer for our product catalog."

Staff Answer
| Step | Staff Answer |
|---|---|
| Clarify | Ask about read/write ratio, catalog size, staleness tolerance, traffic patterns |
| Assume | "I'll assume 95% reads, 100K products, 1-minute staleness acceptable" |
| Outline | Access pattern → caching strategy → invalidation → failure modes → observability |

Why this is L6:

  • Starts with access-pattern discovery before choosing a technology — intent-driven design, not solution-first
  • States explicit staleness assumptions up front — shows product-awareness and frames the tradeoff space
  • Includes failure modes and observability in the outline — proves ownership extends beyond the happy path

Drill 2: Invalidation Strategy


Interview prompt: "How do you keep the cache consistent with the database?"

Staff Answer
| Step | Staff Answer |
|---|---|
| Primary | Write-path invalidation: "On product update, delete cache key" |
| Fallback | TTL as safety net: "5-minute TTL catches failed invalidations" |
| Edge cases | Race conditions, eventual consistency, version stamps if needed |

Why this is L6:

  • Layers primary invalidation with a TTL safety net — defense-in-depth thinking, not single-strategy reliance
  • Calls out race conditions and version stamps — anticipates failure modes a Senior would overlook
  • Treats consistency as a spectrum, not a binary — articulates tradeoffs rather than picking one extreme

Drill 3: Thundering Herd


Interview prompt: "A popular cache key expires. What happens?"

Staff Answer
| Step | Staff Answer |
|---|---|
| Problem | 1000 requests hit origin simultaneously |
| Mitigation | Request coalescing: first request fetches, others wait |
| Prevention | Probabilistic early expiry (jitter), background refresh |

Why this is L6:

  • Separates mitigation (coalescing) from prevention (jitter, background refresh) — shows systems thinking across time horizons
  • Names the specific failure cascade (1000 requests hit origin) — quantifies blast radius rather than hand-waving
  • Proposes probabilistic early expiry — demonstrates awareness of techniques that prevent problems at scale, not just react to them

Drill 4: Cache Failure


Interview prompt: "Redis is down. What happens to your service?"

Staff Answer
| Step | Staff Answer |
|---|---|
| Decide | By data type: user data falls back to DB, financial data fails explicitly |
| Protect | Circuit breaker on origin, request limiting |
| Observe | Alert on cache unavailability, monitor origin load |

→ Review Degraded Mode Framework for the complete decision tree.

Why this is L6:

  • Differentiates fallback strategy by data type (user data vs financial data) — not a one-size-fits-all answer
  • Explicitly chooses "fail loudly" for financial data — demonstrates safety-first thinking over availability bias
  • Includes circuit breakers and origin protection — owns the downstream impact, not just the cache layer

Drill 5: Hot Key


Interview prompt: "One product gets 90% of traffic. What breaks?"

Staff Answer
| Step | Staff Answer |
|---|---|
| Problem | Single cache shard overwhelmed |
| Mitigation | L1 local cache for hot keys, key replication |
| Detection | Metrics on per-key QPS, automated hot-key detection |

Why this is L6:

  • Identifies the root infrastructure failure (single shard overwhelmed) — reasons about the system layer, not just the application layer
  • Proposes L1 local cache and key replication — shows multi-tier caching awareness beyond basic Redis usage
  • Includes automated hot-key detection — builds operational feedback loops rather than relying on manual monitoring

Drill 6: Consistency vs Performance


Interview prompt: "Users complain they see stale data after updating their profile."

Staff Answer
| Step | Staff Answer |
|---|---|
| Diagnose | Is invalidation failing, or is TTL too long? |
| Fix | Write-through invalidation, shorter TTL, or read-your-writes pattern |
| Communicate | Product decision: "changes may take a moment" vs immediate consistency |

Why this is L6:

  • Starts with diagnosis before jumping to a fix — distinguishes root cause from symptoms
  • Offers read-your-writes as a targeted pattern — applies the right consistency model for the use case, not blanket strong consistency
  • Brings product communication into a technical answer — recognizes that user expectation is an engineering constraint, not just a PM concern

Drill 7: Build vs Buy


Interview prompt: "Should we use Redis, Memcached, or a managed service?"

Staff Answer
Who Pays Analysis
| Step | Staff Answer |
|---|---|
| Evaluate | Data structures needed, persistence requirements, operational capacity |
| Compare | Self-managed (control, cost) vs managed (ops burden, features) |
| Recommend | Usually managed unless specific requirements demand self-hosted |

Why this is L6:

  • Evaluates operational capacity as a first-class criterion — understands that team ability to run infrastructure matters as much as features
  • Frames the decision around requirements, not preferences — avoids the "I like Redis" trap that signals Senior-level thinking
  • Defaults to managed with an explicit escape hatch — shows organizational awareness of where engineering time is best spent

→ For the complete framework, see Build vs Buy Framework.


Drill 8: Multi-Region Caching


Interview prompt: "We're expanding to Europe. How does caching change?"

Staff Answer
| Step | Staff Answer |
|---|---|
| Options | Per-region caches (simple, inconsistent) vs global cache (complex, consistent) |
| Tradeoff | Cross-region latency (100ms+) vs staleness across regions |
| Recommend | Per-region caches with async invalidation for most use cases |

Why this is L6:

  • Lays out both architecture options with clear tradeoff dimensions (latency vs consistency) — structured decision-making, not gut feel
  • Quantifies the cross-region latency cost (100ms+) — grounds the tradeoff in real numbers that drive the recommendation
  • Recommends async invalidation as the default — balances pragmatism with correctness rather than chasing perfect global consistency

Drill 9: Ownership Conflict — TTL Disagreement


Interview prompt: "Product wants 1-hour TTL for faster page loads. Infra says 5 minutes max because of stale data risk. You're the Staff engineer. How do you resolve this?"

Staff Answer
| Step | Staff Answer |
|---|---|
| Reframe | This isn't a TTL debate — it's a staleness tolerance question. What's the actual business impact of stale data? |
| Investigate | What data is this? User profile (1hr OK)? Inventory (5min risky)? Pricing (unacceptable)? |
| Propose | Data classification: different TTLs for different data sensitivity. Not one global TTL. |
| Ownership | Product owns staleness SLA per data type. Infra provides the mechanisms. |
| Document | Write down: "User preferences: 1hr TTL approved by Product. Inventory: 5min with write-through invalidation." |

Staff insight: The conflict exists because nobody defined the staleness contract. The fix is explicit ownership, not splitting the difference on TTL.

Why this is L6:

  • Reframes a technical disagreement as a missing contract — solves the organizational root cause, not the surface argument
  • Introduces data classification with per-type TTLs — shows that the right answer is nuanced, not a single compromise number
  • Assigns explicit ownership (Product owns staleness SLA, Infra provides mechanisms) — demonstrates cross-team boundary thinking

Drill 10: Ownership Conflict — Cache Hides a Bug


Interview prompt: "A pricing bug went unnoticed for 2 weeks because the cache kept serving correct (cached) prices while the DB had wrong values. Now leadership wants to know why QA didn't catch it. What do you say?"

Staff Answer
| Step | Staff Answer |
|---|---|
| Diagnose | The cache masked the bug. QA tested via the cache (fast path), never hit the DB (slow path). |
| Root cause | No cache-bypass testing. No correctness monitoring comparing cache vs source. |
| Immediate fix | Add cache-bypass test cases. Add periodic cache-vs-DB consistency checks. |
| Systemic fix | Cache correctness metrics: sample reads and compare to source of truth. Alert on divergence. |
| Ownership | Who owns cache correctness? Not QA — they test features. Infra or platform team owns cache health. |

Staff insight: Caches amplify bugs by serving wrong answers faster and more consistently. The fix is correctness observability, not blaming QA.

Why this is L6:

  • Redirects blame away from QA toward a systemic gap (no cache-bypass testing) — shows leadership maturity in incident response
  • Proposes cache-vs-source correctness monitoring — builds continuous verification, not just one-time test coverage
  • Identifies that caches amplify bugs as an architectural insight — reasons about emergent system behavior, not just component behavior

8. Deep Dive Scenarios

Scenario-based analysis for Staff-level depth

These scenarios test Staff-level operational thinking. Unlike drills (which test interview responses), deep dives test ownership reasoning.

Deep Dive 1: Black Friday Cache Failure

Staff Answer
| Phase | What to do |
|---|---|
| Immediate (0-5 min) | Is this a cache problem or an origin problem? Check if origin can handle the extra 25% load. |
| Triage | Hot keys (one product viral)? Memory pressure (evictions)? Network saturation? |
| Quick fix | If hot keys, enable L1 caching. If memory, increase cluster size or reduce TTL for low-value keys. |
| Guardrails | Circuit breaker on origin, shed low-priority traffic if needed. |
| Post-mortem | Why didn't capacity planning catch this? Load test with Black Friday traffic patterns. |

Staff insight: A 25% cache hit rate drop on Black Friday could mean a 5x increase in DB load. The cache isn't the patient — the origin is.

Deep Dive 2: Stale Data Incident

Staff Answer
| Dimension | Staff Answer |
|---|---|
| Root cause | Payment data was cached with 3-hour TTL, no write-through invalidation |
| Immediate | Purge cache for affected customer, verify current state |
| System fix | Payment data should not be cached, OR write-through invalidation with <1min TTL |
| Process fix | Data classification policy: financial data requires explicit cache approval |
| Broader question | What other sensitive data is cached with long TTLs? |

Staff insight: Some data should never be cached, or only with write-through invalidation. This is a data classification problem, not a TTL tuning problem.

Deep Dive 3: Cache Warming Gone Wrong

Staff Answer
| Phase | What to do |
|---|---|
| Immediate | Why was old cache decommissioned before warming completed? Process failure. |
| Root cause | Warming job didn't account for dataset growth. What was 10 min last month is 2 hours now. |
| System fix | Warming job with progress tracking, don't cutover until warming complete |
| Process fix | Warming is part of deployment checklist, not a background task |
| Capacity planning | Monitor warming time, alert if it grows significantly |

Staff insight: Cache warming is deployment-critical. It should block cutover, not run in parallel.

Deep Dive 4: Multi-Tier Cache Debugging

Staff Answer
| Dimension | Staff Answer |
|---|---|
| Hypothesis | L2 was invalidated on write, but L1 on some instances still has stale data |
| Investigation | Which instances have stale data? Are they missing invalidation events? |
| Root cause | L1 invalidation wasn't implemented — relied on TTL only |
| Fix | Either accept 10s staleness (document it) or implement L1 invalidation |
| Tradeoff | L1 invalidation adds complexity; short TTL may be acceptable |

Staff insight: Multi-tier caching multiplies consistency complexity. Aggregate staleness is L1 TTL + L2 TTL in the worst case.

Deep Dive 5: Cache Cost Optimization

Staff Answer
| Phase | What to do |
|---|---|
| Analysis | What's in the cache? Key cardinality, size distribution, hit rate by key pattern |
| Low-hanging fruit | Remove low-hit-rate keys, reduce TTL for rarely-accessed data |
| Architecture | Can we move cold data to cheaper storage? Tiered caching? |
| Compression | Compress large values, more efficient serialization |
| Eviction tuning | Are we caching data that's never reused? Adjust eviction policy |

Staff insight: Cache cost optimization starts with understanding what's being cached and why. Often 20% of keys drive 80% of value.

9. Level Expectations Summary

What gets you each level in a caching interview:

Level Calibration
| Level | Minimum Bar | Key Signals |
|---|---|---|
| L5 (Senior) | Knows cache-aside pattern + Redis basics + understands TTL | Can implement a working cache layer |
| L6 (Staff) | Access pattern analysis + invalidation contracts + failure modes + ownership thinking | Designs a cache you can operate |
| L7 (Principal) | Data classification + org-wide patterns + consistency tiers + build-vs-buy reasoning | Designs a caching platform |

What Separates Each Level

Level Calibration
| Transition | The Gap |
|---|---|
| L5 → L6 | From "add a cache" to "what's the staleness contract and who owns it" |
| L6 → L7 | From "my service's cache" to "the organization's caching strategy" |

Quick Self-Check

Before your interview, verify you can answer:

  • What's the read/write ratio threshold where caching stops helping?
  • What's your invalidation strategy, and what's the TTL fallback?
  • How do you handle thundering herd on cache miss?
  • What's the staleness budget, and who signed off on it?
  • When would you NOT cache this data?

The Bar for This Question

Mid-level (L4/E4): You should be able to implement cache-aside with Redis, set reasonable TTLs, and explain the basic read path (check cache → miss → query DB → populate cache → return). You can describe cache hits and misses and why caching improves latency. Understanding cache eviction policies (LRU) or invalidation challenges would be a bonus but isn't expected.

Senior (L5/E5): You should quickly establish the caching pattern (cache-aside vs read-through vs write-through) based on the access pattern and spend time on the real problems: cache invalidation strategy (TTL vs event-driven), thundering herd on cache miss (locking, request coalescing), cache key design and its impact on hit rate, and the staleness contract — how stale is acceptable and who signed off on it. You should quantify: "We cache product catalog with a 5-minute TTL because the business accepts 5 minutes of stale pricing in exchange for 10x lower DB load." Having an opinion on consistent hashing for cache distribution would be strong.

Staff+ (L6/E6+): You should dispatch the baseline architecture in 5 minutes and spend 25+ minutes on operational depth: multi-tier caching (L1 in-process → L2 Redis → L3 CDN), cache warming strategies for cold starts and deploys, the organizational question of who owns the staleness contract (engineering proposes TTLs, product signs off on user-facing staleness), and failure mode analysis — what happens when Redis goes down (do you fall through to DB and crush it, or serve stale from a local cache?). You should reason about cache sizing economics (memory cost vs DB query cost), hot key detection and mitigation (dedicated cache nodes or key splitting), and how caching intersects with consistency requirements across services. The interviewer should see you treat caching as a data freshness contract, not just a performance optimization.

10. Staff Insiders: Controversial Opinions

These are uncomfortable truths that distinguish Staff engineers from Seniors. They're based on operating caches at scale, not on textbook knowledge. Strong engineers disagree on some of these — that's the point.

Most Stale Data Incidents Are Never Detected

The uncomfortable truth: Your cache is probably serving stale data right now. You just don't know it.

Why it's invisible:

| Factor | Why It Hides Staleness |
|---|---|
| No correctness metrics | You measure hit rate, not accuracy |
| Intermittent symptoms | Users refresh and it "fixes itself" |
| Blame shifting | "The data was always like that" |
| TTL masks evidence | By the time you investigate, the stale entry expired |

The Staff position: If you can't measure staleness, you can't claim your cache is correct. Most teams measure cache performance but not cache correctness.

Bar-raiser question: "How would you know if your cache served incorrect data for the last hour?"

High Hit Rate Can Mask Correctness Bugs

The uncomfortable truth: A 99% hit rate can mean you're serving a confidently wrong answer to 99% of requests. Hit rate says nothing about whether the cached value is correct.

Why this happens:

  • Cache is fast, so nobody questions its answers
  • Bugs in the origin path are never exercised
  • Stale data looks like correct data if you don't check

Real-world example: A payment cache served 3-hour-old card data with 99.5% hit rate. Nobody noticed until a customer complained about being charged on a canceled card.

The Staff position: Hit rate is a performance metric, not a correctness metric. A cache with 99% hit rate that's 1% wrong can cause more damage than a cache with 80% hit rate that's always correct.

Removing a Cache Is Often the Right Fix

The uncomfortable truth: Many caching problems are best solved by removing the cache entirely.

Signs you should delete the cache:

  • Hit rate below 50% (you're paying for misses)
  • Write rate approaches read rate (invalidation churn)
  • Consistency bugs that nobody can debug
  • The origin can handle the load without the cache
  • Multiple incidents traced to cache staleness

Why teams don't remove caches:

  • "We already built it"
  • "It must be helping somehow"
  • Fear of origin load (often unfounded)
  • Nobody owns the decision to remove

The Staff position: Adding a cache is easy. Removing one requires courage. The Staff engineer asks: "What if we just... didn't cache this?"

Cache Invalidation Is an Ownership Problem, Not a Technical One

The uncomfortable truth: "Cache invalidation is hard" is a cop-out. It's hard because nobody owns it.

Why invalidation fails:

| Failure Mode | Root Cause |
|---|---|
| Missed invalidation | Writer doesn't know about cache |
| Partial invalidation | Multiple caches, one forgot |
| Race conditions | Nobody designed for concurrency |
| Schema changes break it | Cache contract undocumented |

The Staff position: Invalidation is hard because it's a coordination problem across teams, not a technical problem within one service. The fix is ownership clarity, not better algorithms.

Bar-raiser question: "Who is responsible for invalidation correctness across all services that write this data?"

Caches Amplify Bugs — They Don't Just Hide Them

The uncomfortable truth: A bug that affects 1% of requests without a cache might affect 99% of requests with a cache.

How caches amplify:

Without cache:
  - Bug writes bad data to DB
  - 1% of reads hit the bug
  - 99% read correct data from DB

With cache:
  - Bug writes bad data to DB
  - Bad data cached
  - 99% of reads hit cache → 99% see bad data

The Staff position: Caches turn transient bugs into persistent outages. A single bad write + aggressive caching = widespread incorrect data served for TTL duration.

"TTL Was Too Long" Is Never the Root Cause

The uncomfortable truth: When stale data causes an incident, "reduce TTL" is the wrong fix.

Why TTL tuning fails:

  • It treats symptoms, not causes
  • Shorter TTL = more origin load
  • The real question: why wasn't invalidation triggered?

The Staff position: Every "TTL was too long" incident is actually an invalidation ownership failure. The fix is better invalidation, not shorter TTL. Shorter TTL is a band-aid that increases cost.

Real root causes:

  • Writer didn't know to invalidate
  • Invalidation code had a bug
  • Invalidation was async and lost
  • Nobody owned the cache contract
Appendices (Deep Dive)
Appendix A: Caching Patterns Deep Dive — Cache-aside, read-through, write-through, write-behind

A.1 Cache-Aside (Lazy Loading)

The most common pattern. Application manages cache explicitly.

Read path:

  1. Check cache
  2. On miss: read from DB, populate cache
  3. Return data

Write path:

  1. Write to DB
  2. Invalidate cache (delete key)
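The two paths above can be sketched as follows. This is a minimal illustration with a dict standing in for the cache client; key format and names are illustrative.

```python
def get_product(cache, db, product_id):
    """Cache-aside read: check cache, fall back to DB, populate on miss."""
    key = f"product:{product_id}"
    value = cache.get(key)
    if value is not None:
        return value                    # hit: one round trip
    value = db[product_id]              # miss: second round trip to the DB
    cache[key] = value                  # populate for the next reader
    return value

def update_product(cache, db, product_id, value):
    """Cache-aside write: persist first, then delete (not update) the key.
    Deleting avoids racing writers leaving cache and DB with different values."""
    db[product_id] = value
    cache.pop(f"product:{product_id}", None)
```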

Pros:

  • Cache failure = DB fallback (resilient)
  • Only cache what's actually read (efficient)
  • Simple mental model

Cons:

  • Cache miss = two round trips (latency)
  • Invalidation logic in application (scattered)
  • Race condition window between write and invalidate

When to use: Most read-heavy workloads where you want cache failure to be non-fatal.

A.2 Read-Through

Cache handles fetching. Application always reads from cache; cache fetches from DB on miss.

Read path:

  1. Read from cache
  2. On miss: cache fetches from DB, stores, returns

Write path:

  1. Write to DB
  2. Invalidate cache OR let TTL expire

Pros:

  • Simple application code (always read cache)
  • Automatic population

Cons:

  • Cache failure = read failure (coupled)
  • Cache must understand DB schema
  • Less control over fetch logic

When to use: When you want to centralize caching logic and can accept cache-as-dependency.

A.3 Write-Through

Cache handles persistence. Write to cache; cache synchronously writes to DB.

Write path:

  1. Write to cache
  2. Cache writes to DB (synchronous)
  3. Confirm to client

Read path:

  1. Always read from cache (always fresh)

Pros:

  • Cache always consistent with DB
  • Simple read path
  • No invalidation needed

Cons:

  • Write latency includes cache
  • Cache failure = write failure
  • Cache must understand DB schema

When to use: When consistency is critical and you can accept cache-as-dependency on writes.

A.4 Write-Behind (Write-Back)

Async persistence. Write to cache; cache asynchronously writes to DB.

Write path:

  1. Write to cache (immediate return)
  2. Cache queues DB write
  3. Background process persists to DB

Read path:

  1. Always read from cache

Pros:

  • Lowest write latency
  • Batching opportunities
  • Absorbs write spikes

Cons:

  • Data loss risk (cache crash before DB write)
  • Consistency complexity
  • Requires durable cache or careful failure handling

When to use: Write-heavy workloads where you can tolerate some data loss risk (analytics, logs, non-critical counters).

A.5 Pattern Comparison

| Pattern | Read Path | Write Path | Consistency | Failure Impact |
|---|---|---|---|---|
| Cache-Aside | App → Cache → (miss) → DB | App → DB → Invalidate Cache | Eventual | Cache down = DB fallback |
| Read-Through | App → Cache → (miss) → Cache fetches DB | App → DB → Invalidate Cache | Eventual | Cache down = reads fail |
| Write-Through | App → Cache | App → Cache → Cache writes DB | Strong | Cache down = writes fail |
| Write-Behind | App → Cache | App → Cache → async DB | Eventual | Cache down = data loss risk |

Appendix B: Eviction Strategies — LRU, LFU, TTL, and when each matters

B.1 Why Eviction Matters

Cache memory is finite. When full, something must go. The eviction policy determines what.

The wrong eviction policy:

  • Evicts hot data → cache misses → origin load
  • Keeps cold data → wasted memory → poor hit rate

B.2 Common Eviction Policies

| Policy | Evicts | Best For | Weakness |
|---|---|---|---|
| LRU | Least Recently Used | General workloads | One-time scans pollute cache |
| LFU | Least Frequently Used | Stable hot set | Slow to adapt to changing access patterns |
| TTL | Expired entries | Time-sensitive data | Doesn't handle memory pressure |
| Random | Random entry | Large caches, uniform access | Can evict hot data |

B.3 LRU (Least Recently Used)

How it works: Evict the entry that hasn't been accessed for the longest time.

Good for: Workloads with temporal locality (recently accessed = likely accessed again).

Bad for: Scan workloads — one-time reads push out hot data.

Access pattern: A B C D A B E F G H ...
LRU cache (size 4):
  After A B C D: [D, C, B, A]
  After A B:     [B, A, D, C]
  After E:       [E, B, A, D] (C evicted)
  After F:       [F, E, B, A] (D evicted)
  After G:       [G, F, E, B] (A evicted)
  After H:       [H, G, F, E] (B evicted)
  The one-time scan keys E–H eventually push even the hot keys A, B out
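
The mechanics traced above can be sketched with Python's `collections.OrderedDict`, which keeps insertion order and lets us move entries to the back on access. This is a toy model, not a production cache:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU sketch: most recently used entries live at the back
    of an OrderedDict; eviction pops from the front."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def set(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
```

Replaying the trace: after A B C D, re-accessing A and B, then inserting E and F evicts C and D while A and B survive.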

B.4 LFU (Least Frequently Used)

How it works: Evict the entry with the lowest access count.

Good for: Stable hot sets where popular items stay popular.

Bad for: Changing popularity — old popular items block new popular items.

Variants:

  • LFU with decay: Counts decay over time, allowing popularity shifts
  • Window LFU: Only count recent accesses
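
A toy sketch of LFU-with-decay bookkeeping (the halving schedule, class, and method names are illustrative assumptions, not any library's API). Periodically halving counts preserves relative ordering while letting newly popular keys overtake long-stale ones:

```python
class DecayingLFUCounter:
    """Access-frequency bookkeeping for LFU with decay.
    Only tracks counts and picks an eviction victim; storage is elsewhere."""

    def __init__(self):
        self.counts = {}

    def touch(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1

    def decay(self):
        # Called on a timer: halving shrinks old leads so popularity can shift.
        for key in self.counts:
            self.counts[key] //= 2

    def victim(self):
        # Evict the least frequently used key.
        return min(self.counts, key=self.counts.get)
```

Without `decay()`, a key that was hot last week blocks this week's hot keys indefinitely, which is the core weakness plain LFU has.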

B.5 TTL-Based Eviction

How it works: Entries expire after a fixed time, regardless of access.

Good for: Data with known staleness windows, session data.

Not a memory pressure solution: TTL doesn't help when cache is full but entries haven't expired.

Best practice: Combine TTL with LRU/LFU — TTL for freshness, LRU for capacity.

B.6 Choosing an Eviction Policy

Workload                          | Recommended           | Why
--------------------------------- | --------------------- | ----------------------------------
General web app                   | LRU                   | Good default, temporal locality
Stable hot set (popular products) | LFU                   | Protects frequently-accessed items
Session data                      | TTL                   | Natural expiration
Scan-heavy (batch reads)          | LRU + scan resistance | Prevent scans from polluting

Staff insight: Most production systems use LRU with TTL. LFU is rarely worth the complexity unless you have a proven stable hot set.

Appendix C: Thundering Herd Mitigations — Request coalescing, probabilistic expiry, circuit breakers

C.1 The Problem

When a popular cache entry expires or is invalidated:

  1. Many requests arrive simultaneously
  2. All miss the cache
  3. All query the origin
  4. Origin is overwhelmed

C.2 Request Coalescing (Singleflight)

How it works: First request triggers the fetch; concurrent requests wait for that result.

Implementation: Use a lock or promise per key. First request acquires lock and fetches; others wait on the promise.
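
A minimal per-key coalescing sketch using standard threading primitives (class and method names are illustrative; Go's `singleflight` package is the canonical version of this idea):

```python
import threading

class SingleFlight:
    """Coalesce concurrent fetches for the same key into one origin call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> {"event", "result", "error"}

    def do(self, key, fetch):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = {"event": threading.Event(), "result": None, "error": None}
                self._inflight[key] = entry
        if leader:
            try:
                entry["result"] = fetch()        # only the leader hits origin
            except Exception as exc:
                entry["error"] = exc             # failure propagates to waiters
            finally:
                with self._lock:
                    del self._inflight[key]
                entry["event"].set()             # wake all waiters
        else:
            entry["event"].wait()                # waiters block on the leader
        if entry["error"] is not None:
            raise entry["error"]
        return entry["result"]
```

Note the con from below is visible in the code: if the leader's `fetch` raises, every waiter sees the same exception.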

Tradeoffs:

  • ✅ Origin sees single request instead of N
  • ❌ Waiting requests add latency
  • ❌ If fetch fails, all waiters fail

C.3 Probabilistic Early Expiry

How it works: Entries expire slightly before their TTL, with randomness to spread refreshes.

Formula (with u drawn uniformly from (0, 1]):

should_refresh = (now - TTL * beta * log(u) >= expiry)

Since log(u) <= 0, each request sees the expiry pulled earlier by a random, beta-scaled amount. beta controls how early refreshes can happen (larger beta refreshes earlier; beta = 1 is a common starting point).

Effect: Instead of 1000 requests hitting exactly at TTL, refreshes spread over a window.

Tradeoffs:

  • ✅ Spreads load over time
  • ❌ Some entries refresh "too early" (wasted work)
  • ❌ Requires tuning beta
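
A runnable sketch of the early-expiry check, in the spirit of the XFetch approach (u is drawn from (0, 1] so log(u) ≤ 0; the function and parameter names are illustrative):

```python
import math
import random

def should_refresh(now, expiry, ttl, beta=1.0):
    """Probabilistic early expiry: refresh a random, beta-scaled
    amount of time *before* the nominal expiry."""
    u = 1.0 - random.random()  # uniform in (0, 1], so log(u) <= 0
    # Subtracting a non-positive log term pulls the effective expiry earlier,
    # spreading refreshes over a window instead of a single instant.
    return now - ttl * beta * math.log(u) >= expiry
```

At a gap of one TTL before expiry with beta = 1, roughly e⁻¹ ≈ 37% of requests refresh; the probability rises smoothly to 100% at expiry.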

C.4 Background Refresh (Stale-While-Revalidate)

How it works: Serve stale data immediately while refreshing in background.

Request arrives → Cache has stale entry
  → Return stale data immediately
  → Trigger background refresh
  → Next request gets fresh data

Tradeoffs:

  • ✅ No latency spike on refresh
  • ✅ Origin load is smoothed
  • ❌ Guaranteed staleness during refresh
  • ❌ Complexity (background job, stale tracking)
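
The flow above can be sketched as a single-process, thread-based model (names are illustrative; a real implementation would also bound refresh concurrency and handle fetch errors visibly):

```python
import threading
import time

class StaleWhileRevalidate:
    """Serve stale entries immediately; refresh them in the background."""

    def __init__(self, fetch, ttl):
        self.fetch = fetch          # origin loader: key -> value
        self.ttl = ttl              # freshness window in seconds
        self.entries = {}           # key -> (value, fetched_at)
        self.refreshing = set()     # keys with an in-flight refresh
        self.lock = threading.Lock()

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            value = self.fetch(key)            # cold miss: must block once
            self.entries[key] = (value, time.time())
            return value
        value, fetched_at = entry
        if time.time() - fetched_at > self.ttl:
            with self.lock:                    # start at most one refresh per key
                start = key not in self.refreshing
                if start:
                    self.refreshing.add(key)
            if start:
                threading.Thread(target=self._refresh, args=(key,)).start()
        return value                           # stale or fresh, never blocks

    def _refresh(self, key):
        try:
            self.entries[key] = (self.fetch(key), time.time())
        finally:
            self.refreshing.discard(key)
```

The `refreshing` set is the "stale tracking" complexity called out in the cons: without it, every request during the stale window would spawn its own refresh.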

C.5 Cache-Miss Circuit Breaker

How it works: Limit concurrent requests to origin on cache miss.

Cache miss:
  if (concurrent_origin_requests < limit):
    fetch from origin
  else:
    fail fast (or serve stale if available)
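
The pseudocode above maps directly onto a bounded semaphore. This is a sketch under assumed names; a production version would also serve stale data and emit metrics on rejection:

```python
import threading

class OriginLimiter:
    """Cap concurrent origin fetches on cache miss; excess requests fail fast."""

    def __init__(self, limit):
        self.slots = threading.BoundedSemaphore(limit)

    def fetch(self, origin_call):
        # Non-blocking acquire: if all slots are taken, reject immediately
        # instead of queueing more load behind a struggling origin.
        if not self.slots.acquire(blocking=False):
            raise RuntimeError("origin overloaded: failing fast")
        try:
            return origin_call()
        finally:
            self.slots.release()
```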

Tradeoffs:

  • ✅ Protects origin from overload
  • ❌ Some requests fail or get stale data
  • ❌ Requires tuning limit

C.6 Choosing a Mitigation

Scenario                    | Recommended                | Why
--------------------------- | -------------------------- | ----------------------------------
High-traffic keys           | Request coalescing         | Prevents duplicate origin fetches
Many keys expiring together | Probabilistic early expiry | Spreads refresh load
Latency-sensitive           | Stale-while-revalidate     | No refresh latency visible to user
Origin fragile              | Circuit breaker            | Hard limit on origin load

Staff answer: "I'll use request coalescing as the primary defense, with probabilistic early expiry to prevent synchronized refreshes. Circuit breaker protects the origin if coalescing isn't enough."

Appendix D: Cache Consistency Patterns — Invalidation, versioning, read-your-writes

D.1 The Fundamental Problem

Cache and database are separate systems. Writes go to DB; reads may come from cache. Keeping them consistent is hard.

D.2 Invalidation Patterns

Delete on Write

How it works: After writing to DB, delete the cache key.

Write: UPDATE users SET email='new' WHERE id=123
Then:  DELETE cache:user:123

Problem: Race condition.

t=0: Thread A reads user:123 from DB (old data)
t=1: Thread B writes user:123 to DB (new data)
t=2: Thread B deletes cache:user:123
t=3: Thread A writes old data to cache
Result: Cache has stale data until TTL

Mitigation: Version stamps or short TTL.

Update on Write (Write-Through)

How it works: After writing to DB, update the cache with new value.

Write: UPDATE users SET email='new' WHERE id=123
Then:  SET cache:user:123 = {new data}

Problem: Same race condition, plus you're computing cache value in write path.

D.3 Version Stamps

How it works: Include a version number in the cache key or value.

Cache key: user:123:v7
On write: increment version → user:123:v8
Old cached data at v7 is naturally orphaned
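
A toy sketch of version-stamped keys (here the version map lives in-process for illustration; in practice the current version must live in a small authoritative store, which is the tracking cost listed below):

```python
class VersionedCache:
    """Cache keyed by entity id + version; bumping the version on write
    orphans old entries instead of racing to delete them."""

    def __init__(self):
        self.store = {}      # versioned key -> value
        self.versions = {}   # entity id -> current version

    def _key(self, entity_id):
        return f"{entity_id}:v{self.versions.get(entity_id, 0)}"

    def get(self, entity_id):
        return self.store.get(self._key(entity_id))

    def set(self, entity_id, value):
        self.store[self._key(entity_id)] = value

    def bump(self, entity_id):
        # Called on write: readers immediately look at the new versioned key;
        # stale entries at old versions age out via TTL/LRU.
        self.versions[entity_id] = self.versions.get(entity_id, 0) + 1
```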

Tradeoffs:

  • ✅ No race conditions
  • ❌ Key cardinality increases
  • ❌ Need to track current version somewhere

D.4 Read-Your-Writes Consistency

Problem: User updates data but immediately sees old cached data.

Solution: After write, user's session bypasses cache for that key.

Write: User updates profile
Set:   session.bypass_cache['user:123'] = now + 30s
Read:  If bypass active, read from DB, not cache
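
The bypass bookkeeping above can be sketched as follows (the class name and the 30-second window are illustrative; in practice this state lives in the user's session):

```python
import time

class SessionCacheBypass:
    """Per-session read-your-writes: after a write, reads for that key
    go to the DB until the bypass window elapses."""

    BYPASS_SECONDS = 30

    def __init__(self):
        self.bypass_until = {}  # cache key -> timestamp when bypass ends

    def mark_written(self, key):
        # Call after the user's write commits to the DB.
        self.bypass_until[key] = time.time() + self.BYPASS_SECONDS

    def should_bypass(self, key):
        # Read path: True means skip the cache and read from the DB.
        return time.time() < self.bypass_until.get(key, 0)
```

The window only needs to outlast the cache's worst-case staleness (TTL or invalidation lag) for that key.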

Tradeoffs:

  • ✅ User always sees their own writes
  • ❌ Complexity in read path
  • ❌ Per-session state required

D.5 Change Data Capture (CDC)

How it works: Subscribe to DB changelog; invalidate cache asynchronously.

DB write → Changelog (binlog, WAL)
         → CDC consumer
         → Cache invalidation

Tradeoffs:

  • ✅ Decouples writers from cache knowledge
  • ✅ Guaranteed to catch all writes
  • ❌ Eventual consistency (lag between write and invalidation)
  • ❌ Infrastructure complexity

When to use: Large-scale systems where write paths are diverse and can't all know about caching.

Appendix E: Metrics & Observability — Hit rate, latency, eviction monitoring

E.1 Core Metrics (Non-Negotiable)

cache_hit_total
cache_miss_total
cache_hit_rate (derived: hit / (hit + miss))
cache_latency_ms
origin_latency_ms
cache_eviction_total
cache_size_bytes
cache_key_count

E.2 Hit Rate Is Not Enough

High hit rate can hide problems:

  • Hit rate 95% but 5% misses overwhelm origin
  • Hit rate 99% but misses are the important requests
  • Hit rate 90% but most hits are stale data

Better signals:

  • Origin load during cache operation
  • Miss latency vs hit latency ratio
  • Staleness metrics (if trackable)

E.3 Metric Dimensions

Slice metrics by:

  • Key pattern: user:*, product:*, session:*
  • Operation: get, set, delete
  • Result: hit, miss, error

E.4 Alerting

Alert               | Threshold       | Why
------------------- | --------------- | -------------------------
Hit rate drop       | < 80% for 5 min | Origin may be overwhelmed
Cache latency spike | p99 > 50ms      | Network or capacity issue
Eviction rate spike | > 1000/s        | Memory pressure
Cache unavailable   | > 30s           | Failover or outage

E.5 Dashboards

Operational dashboard:

  • Hit rate (real-time)
  • Origin load vs cache load
  • Latency percentiles (p50, p95, p99)
  • Error rate

Capacity dashboard:

  • Memory usage vs capacity
  • Key count and growth rate
  • Eviction rate
  • Connection count

Appendix F: Multi-Region Caching — Per-region, global, invalidation strategies

F.1 The Multi-Region Problem

Users in EU should hit EU cache; users in US should hit US cache. But what happens when data is updated?

F.2 Strategies

Strategy                           | Consistency           | Latency                | Complexity
---------------------------------- | --------------------- | ---------------------- | ----------
Per-region caches, no sync         | Eventually consistent | Low                    | Low
Per-region with async invalidation | Eventually consistent | Low                    | Medium
Global cache (single region)       | Strong                | High (cross-region)    | Low
Global cache (replicated)          | Eventually consistent | Low reads, high writes | High

F.3 Per-Region with Async Invalidation

How it works:

  1. Write happens in one region
  2. Invalidation message published to message bus
  3. Other regions consume and invalidate their caches

Staleness window: Cross-region propagation delay (typically 100-500ms).

F.4 When to Accept Per-Region Inconsistency

  • User data: Users rarely switch regions mid-session
  • Product catalog: Brief inconsistency rarely matters
  • Sessions: Should be region-sticky anyway

F.5 When to Require Global Consistency

  • Inventory/stock: Overselling is expensive
  • Financial: Compliance requirements
  • Global rate limiting: Abuse protection

Staff insight: Most user-facing data can tolerate per-region caching with async invalidation. Reserve global consistency for data where inconsistency has real cost.

Appendix G: Cache Sizing and Capacity Planning — Memory estimation, hit rate modeling

G.1 Basic Sizing

Formula:

Memory needed = working_set_size × overhead_factor
Where:
  working_set_size = num_keys × avg_value_size
  overhead_factor ≈ 1.5-2x (for Redis data structures, fragmentation)

Example:

  • 1M users, 2KB per user profile
  • Working set: 1M × 2KB = 2GB
  • With overhead: 2GB × 1.5 = 3GB
  • Add headroom: 3GB × 1.2 = 3.6GB minimum
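
The arithmetic above as a tiny helper (decimal GB; the 1.5x overhead and 1.2x headroom factors are the assumptions stated in the example):

```python
def cache_memory_gb(num_keys, avg_value_bytes, overhead=1.5, headroom=1.2):
    """Back-of-envelope cache sizing: working set × overhead × headroom,
    using decimal units (1 GB = 1e9 bytes)."""
    working_set_bytes = num_keys * avg_value_bytes
    return working_set_bytes * overhead * headroom / 1e9
```

For the worked example: 1M keys at 2KB each comes out to 3.6 GB.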

G.2 Hit Rate vs Cache Size

The Pareto insight: Often 20% of keys serve 80% of traffic. Caching the hot set gives most of the benefit.

Hit rate curve: Typically logarithmic — doubling cache size doesn't double hit rate.

Cache size: 10%  → Hit rate: 70%
Cache size: 20%  → Hit rate: 85%
Cache size: 50%  → Hit rate: 95%
Cache size: 100% → Hit rate: 99%

Staff insight: Model your access pattern. If you have high skew (few hot keys), a small cache gives great hit rate. If access is uniform, you need to cache more.

G.3 Capacity Planning Questions

  1. What's the working set? All data that could be cached, or just hot data?
  2. What hit rate do we need? 80%? 95%? 99%?
  3. What's the cost of a miss? DB query latency, cost, capacity
  4. What's the cost of cache memory? $/GB for your cache tier
  5. What's the growth rate? How will working set grow?

G.4 Cost Optimization

Technique               | Savings                                    | Tradeoff
----------------------- | ------------------------------------------ | ----------------
Compression             | 2-5x                                       | CPU overhead
Shorter TTL             | Less memory for cold data                  | More origin load
Tiered caching          | Hot data in expensive cache, cold in cheap | Complexity
Efficient serialization | 10-50%                                     | Developer effort

These frameworks are referenced throughout this playbook and apply to many system design problems:

  • Distributed State Coordination
    • Cache invalidation coordination, multi-tier consistency, leader election for cache warming
    • Applies to: caching, rate limiting, locks, sessions
  • Degraded Mode Framework
    • Cache failure handling, serving stale vs failing, circuit breakers
    • Applies to: caching, rate limiting, dependency isolation
  • Build vs Buy Framework
    • Redis vs Memcached vs managed services, self-hosted vs cloud
    • Applies to: caching, observability, databases, queues

Ready to test your knowledge?

Practice LRU Cache with an L6-calibrated mock interviewer.

Start Mock Interview