Design a Distributed Cache | StaffSignal Playbook

How to Use This Playbook

This playbook supports three reading modes:

Mode	Time	What to Read
Quick Review	15 min	Executive Summary → Interview Walkthrough → Fault Lines (§3) → Drills (§7)
Targeted Study	1-2 hrs	Executive Summary → Interview Walkthrough → Core Flow, expand appendices where you're weak
Deep Dive	3+ hrs	Everything, including all appendices

Expandable sections contain deeper mechanics. Open them when you need the detail.

What is Distributed Caching? — Quick primer if you're unfamiliar

The Problem

A cache stores copies of frequently accessed data in fast storage (usually memory) to avoid repeatedly hitting slower backends like databases or external APIs. "Distributed" means the cache spans multiple nodes, allowing it to scale beyond a single machine's memory and survive individual node failures. The tradeoff: you're trading consistency for speed—cached data can become stale.

Common Use Cases

Database Query Caching: Store expensive query results to reduce database load (e.g., product catalogs, user profiles)
Session Storage: Keep user sessions in fast-access memory across a cluster of web servers
API Response Caching: Cache third-party API responses to reduce latency and avoid rate limits
Computed Result Caching: Store results of expensive computations (ML model outputs, aggregations)
CDN Edge Caching: Cache static and semi-dynamic content at edge locations for global users

Why Interviewers Ask About This

Caching seems simple but hides brutal complexity: invalidation is a consistency problem, not a timeout problem. Interviewers want to see if you understand that adding a cache means you now have two sources of truth that can disagree. Can you articulate when staleness is acceptable? Do you know what happens when the cache goes down? This topic reveals whether you've dealt with real production issues—cache stampedes, thundering herds, and the dreaded "why is this showing old data?" bug.

What This Interview Actually Tests

Caching is not a "make it faster" question. Everyone knows Redis.

This is a consistency and operational ownership question that tests:

Whether you understand why caching introduces complexity, not just speed
Whether you reason about invalidation before discussing eviction
Whether you can articulate what "stale" means for your use case
Whether you understand the blast radius when the cache fails

The key insight: Caching is a consistency problem disguised as a performance optimization. Staff engineers reason about who pays for staleness and who owns the invalidation contract.

The L5 vs L6 Contrast (Memorize This)

Level Calibration

Behavior	L5 (Senior)	L6 (Staff)
First move	"We'll add Redis in front of the database"	Asks "What's the staleness tolerance? Who's the source of truth?"
Invalidation	"TTL of 5 minutes"	"TTL is the last resort. What's our invalidation contract?"
Failure	"We'll add replicas"	"When the cache fails, do we hit the DB or return errors? What's the thundering herd plan?"
Consistency	Assumes cache is always helpful	Articulates when caching makes things worse (write-heavy, low hit rate)
Ownership	Focuses on cache implementation	Asks "Who owns cache warming? Who gets paged when hit rate drops?"

The Three Caching Intents (Pick One and Commit)

Who Pays Analysis

Intent	Constraint	Strategy	Staleness Bar
Latency Reduction	Speed is everything	Aggressive caching, read-through	Seconds to minutes acceptable
Origin Protection	Shield backend from load	Cache-aside with circuit breaker	Minutes acceptable, freshness secondary
Cost Optimization	Reduce expensive computation/queries	Precompute + cache, longer TTLs	Minutes to hours acceptable

Staff Move: "I'll assume we're protecting the origin from read load while maintaining sub-second staleness for user-facing data. This is the hardest case because we need both availability and freshness."

The Five Fault Lines (The Core of This Interview)

Freshness vs Performance — Shorter TTLs mean fresher data but more origin load. Who decides the staleness budget?
Cache-Aside vs Read-Through vs Write-Through — Where does the invalidation logic live? Who owns it?
Availability vs Consistency — When the cache is down, do we serve stale data, hit origin, or fail?
Local vs Distributed — In-process cache (fast, inconsistent) vs shared cache (slower, consistent)?
Proactive vs Reactive Invalidation — Push invalidation on write, or let TTL expire? Who coordinates?

Each fault line has a tradeoff matrix with explicit "who pays" analysis. See §3.

Default Staff Positions (Unless Proven Otherwise)

Who Pays Analysis

Position	Rationale
Cache-aside over write-through	Decouples cache availability from read availability
TTL is safety net, not strategy	Primary invalidation should be explicit on write path
Invalidation before eviction	Design the invalidation contract before discussing LRU vs LFU
Serve stale only with sign-off	Product must explicitly approve staleness budget per data type
Financial/auth data bypasses cache	Correctness cost of staleness exceeds performance benefit
If invalidation can't be owned, don't cache	Unowned invalidation = guaranteed stale data incidents

Quick Reference: What Interviewers Probe

After You Say...	They Will Ask...
"Add Redis cache"	"What's your invalidation strategy? What happens on write?"
"TTL of 5 minutes"	"What if the data changes? Is 5 minutes of staleness acceptable?"
"Cache-aside pattern"	"What about thundering herd on cache miss?"
"We'll add replicas"	"Replicas don't answer cold-start. What's your warming strategy?"
"Invalidate on write"	"How do you handle race conditions between write and invalidation?"

Jump to Practice

→ Active Drills (§7) — 10 practice prompts with expected answer shapes

System Architecture Overview

[Diagram: system architecture overview, not rendered in this text version]

Interview Walkthrough: How to Present This in 45 Minutes

The HelloInterview-style guides walk you through each step at tutorial pace. That's fine for Senior candidates. At Staff level, the basics should take 10-12 minutes — fast enough that you spend the remaining 30+ minutes on the invalidation, failure, and consistency questions that actually determine your level.

The six phases below add up to 45 minutes. The ratios matter: phases 1-4 are deliberately compressed so phase 5 gets the lion's share of time. If you're spending more than 12 minutes before the transition to depth, you're pacing like an L5.

Phase 1: Requirements & Framing (2-3 minutes)

State functional requirements in 30 seconds — don't enumerate, state the category:

"We need a distributed caching layer to reduce database load and serve repeated reads at sub-millisecond latency."

That's it. Don't list every data type or cache operation.

Invest time on non-functional requirements (this is the Staff move):

"What's the staleness budget? Product needs to define acceptable staleness per data type — prices need <30s, product descriptions can tolerate 5 minutes, user sessions need zero staleness."
Clarify: read-to-write ratio (100:1 justifies caching, 2:1 probably doesn't), dataset size (does it fit in memory?), consistency model (eventual vs strong)
"I'll assume a read-heavy workload with per-type staleness budgets, because that's the most common production scenario and forces the hardest invalidation decisions."

Phase 2: Core Entities & API (1-2 minutes)

State entities quickly (30 seconds):

CacheEntry — key, value, TTL, version/etag (the version enables conditional invalidation)
InvalidationEvent — key_pattern, source, timestamp, propagation_status (first-class entity, not an afterthought)
CachePolicy — data_type, staleness_budget, eviction_strategy, origin_fallback behavior

API (1 minute) — transparent cache-aside in the application layer, not a separate API:

get(key) → HIT(value, age) | MISS
set(key, value, ttl, invalidation_policy) → OK
invalidate(key_or_pattern, reason) → OK

The invalidation path is the one that matters:

on_write(entity) → invalidate(cache_key(entity), "source_write")
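The three calls and the write-path hook above can be sketched in a few lines. This is a minimal in-memory stand-in, not a Redis client: the `Cache` class, the `product:` key prefix, the `db` dict, and the injectable `clock` are all assumptions made for illustration.

```python
import fnmatch
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Tuple


@dataclass
class CacheEntry:
    value: Any
    version: int      # version/etag, kept so callers can do conditional invalidation
    stored_at: float  # clock reading when the entry was written
    ttl: float        # seconds


class Cache:
    """In-memory stand-in for the shared cache (hypothetical)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, CacheEntry] = {}

    def get(self, key: str) -> Optional[Tuple[Any, float]]:
        """Return (value, age_seconds) on HIT, or None on MISS or expiry."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        age = self._clock() - entry.stored_at
        if age >= entry.ttl:
            del self._entries[key]
            return None
        return entry.value, age

    def set(self, key: str, value: Any, ttl: float, version: int = 0) -> None:
        self._entries[key] = CacheEntry(value, version, self._clock(), ttl)

    def invalidate(self, key_or_pattern: str, reason: str) -> int:
        """Delete matching keys and return how many were removed.

        `reason` is accepted only so callers can log or count it.
        """
        doomed = [k for k in self._entries if fnmatch.fnmatchcase(k, key_or_pattern)]
        for k in doomed:
            del self._entries[k]
        return len(doomed)


def on_write(cache: Cache, db: Dict[str, Any], entity_id: str, row: Any) -> None:
    """Write path: persist first, then delete (not update) the cached copy."""
    db[entity_id] = row
    cache.invalidate(f"product:{entity_id}", "source_write")
```

Returning the entry's age on a hit is what later makes staleness measurable; a cache that only returns the value cannot tell you how old the answer was.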

Phase 3: High-Level Architecture (5-7 minutes)

Draw the core cache-aside flow on the whiteboard:

[Diagram: core cache-aside flow, not rendered in this text version]

Walk the interviewer through the four data flows (reference the full System Architecture diagram above for the complete multi-layer picture):

Read path → App checks Redis first; on miss, reads from PostgreSQL, populates cache with data-type-specific TTL
Write path → App writes to PostgreSQL, then invalidates the cache key (delete, not update — avoids race conditions)
Invalidation propagation → For critical data, CDC (change data capture) publishes invalidation events as a backstop — so even if the application forgets to invalidate, the database change stream catches it
Failure path → When Redis is unavailable, requests fall through to PostgreSQL with circuit breaker protection and request coalescing (singleflight) to prevent thundering herd

Scripted walkthrough: "Read path: the app server checks Redis first — cache-aside. On miss, reads from PostgreSQL, populates cache with TTL. We use request coalescing so if 100 concurrent requests miss on the same key, only one hits the database. Write path: app writes to PostgreSQL, then invalidates the cache key. CDC publishes invalidation events as a backstop."

Key points to hit on the whiteboard:

Cache-aside pattern — application controls both read and write paths (not write-through, which couples cache to write latency)
Redis Cluster with hash slots: 6 shards for horizontal scaling; keys map to one of 16,384 hash slots via CRC16 (hash-slot sharding, not consistent hashing)
Request coalescing — singleflight pattern prevents thundering herd on cache miss
Write-path invalidation — delete on write, not update; avoids race conditions between concurrent writers
CDN as first cache layer — browser cache → CDN edge → Redis → PostgreSQL; four-layer hierarchy
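If the interviewer asks how keys land on shards, Redis Cluster documents the mapping as CRC16(key) mod 16384, with an optional `{hash tag}` that restricts hashing to part of the key. A minimal sketch, assuming Python's `binascii.crc_hqx` with an initial value of 0 is the same CRC16 variant; the even split of slots across 6 shards in `shard_for_slot` is purely illustrative, since real clusters assign slot ranges explicitly.

```python
import binascii

NUM_SLOTS = 16384


def hash_slot(key: str) -> int:
    """Slot for a key: CRC16 of the key (or of its hash tag) modulo 16384."""
    data = key.encode("utf-8")
    # Hash tags: if the key contains a non-empty {...} section, only that part
    # is hashed, so related keys can be forced onto the same slot.
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % NUM_SLOTS


def shard_for_slot(slot: int, shards: int = 6) -> int:
    """Illustrative only: split the slot range evenly across `shards` shards."""
    return slot * shards // NUM_SLOTS


if __name__ == "__main__":
    for key in ("product:42", "{user1000}.following", "{user1000}.followers"):
        slot = hash_slot(key)
        print(key, slot, shard_for_slot(slot))
```

The hash-tag rule is the part worth remembering: multi-key operations only work when every key involved lives in the same slot.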

Then immediately flag the key tension: "This works for the happy path. The interesting questions are: what happens when Redis goes down and 100% of traffic hits PostgreSQL? Who owns the invalidation contract when 5 different services write to the same entity? And how do you detect that your cache is serving stale data without anyone noticing?"

Phase 4: Transition to Depth (1 minute)

At this point you have a correct, simple architecture on the board. Now you pivot:

"The basic architecture is well-understood — cache-aside with Redis and TTL-based expiry. What makes this Staff-level is the consistency and operational reasoning. Let me dive into three areas: (1) invalidation strategy and who owns it, (2) failure modes when the cache layer goes down, (3) how to detect and measure staleness in production."

Then offer the interviewer a choice:

"I can go deep on any of these. Which is most interesting to you?"

If the interviewer doesn't have a preference, lead with invalidation strategy — it's the most universally asked and the most misunderstood.

Phase 5: Deep Dives (25-30 minutes)

The interviewer will steer, but be prepared to go deep on any of these. For each, follow the Staff pattern: state the tradeoff → pick a position → quantify the cost → explain who absorbs that cost.

Fault Line 1: Freshness vs performance — the staleness budget (5-7 min)

Open with the business framing:

"Every cache entry has a staleness budget — the maximum age the business will tolerate. For product prices, that's <30 seconds (stale prices lose money). For product descriptions, 5 minutes is fine (nobody cares if a typo fix takes 5 minutes to propagate). For user sessions, it's zero (stale session = security vulnerability)."

Go deeper — walk through the TTL decision framework:

Classify data types by staleness tolerance: real-time (<10s), near-real-time (30s-5min), eventual (>5min)
For real-time data: TTL is a safety net, not the primary mechanism. Use event-driven invalidation (CDC or explicit delete-on-write)
For near-real-time: TTL alone is sufficient. Set TTL = staleness_budget × 0.8 (leave 20% margin for clock skew)
For eventual: Long TTL (hours/days) with background refresh. These entries are the highest-value cache entries — they offload the most database reads
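The framework above reduces to a small lookup. The sketch below is one way to encode it, with two assumptions that are not in the text: budgets between 10 and 30 seconds are treated as near-real-time, and the 0.8 margin is applied to every tier rather than only the near-real-time one. `TtlPolicy` and `ttl_policy` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TtlPolicy:
    tier: str
    ttl_s: float
    primary_mechanism: str


def ttl_policy(staleness_budget_s: float, margin: float = 0.8) -> TtlPolicy:
    """Map a product-defined staleness budget (seconds) to a tier and a TTL."""
    if staleness_budget_s <= 0:
        # Zero-staleness data (for example sessions) is outside this sketch.
        raise ValueError("zero staleness budget: consider not caching this data")
    ttl_s = staleness_budget_s * margin  # TTL = budget x 0.8 by default
    if staleness_budget_s < 10:
        return TtlPolicy("real-time", ttl_s,
                         "event-driven invalidation; TTL is only the safety net")
    if staleness_budget_s <= 300:  # assumption: the 10-30s gap counts as near-real-time
        return TtlPolicy("near-real-time", ttl_s, "TTL expiry")
    return TtlPolicy("eventual", ttl_s, "long TTL with background refresh")


if __name__ == "__main__":
    for budget in (5, 300, 6 * 3600):
        print(budget, ttl_policy(budget))
```

Keeping the budget as the input, and the TTL as a derived value, is the code-level version of putting TTL ownership in the product spec.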

The Staff follow-up: "The dangerous case is when someone sets a 24-hour TTL on price data because 'it rarely changes.' It doesn't change — until it does, and then customers see stale prices for 24 hours. That's why TTL ownership should be in the product spec, not the code."

Cross-reference §3.1 Freshness vs Performance for the full analysis.

Fault Line 2: Cache failure and thundering herd (5-7 min)

"When Redis goes down, every request becomes a cache miss. If you have a 95% hit rate and 10K requests/second, that means 9,500 requests/second that were hitting cache now hit PostgreSQL directly. Your database is provisioned for 500 requests/second. It dies."

Name the mitigations in order of priority:

Request coalescing (singleflight) — if 100 concurrent requests miss on key X, only one hits the database; the other 99 wait for the result. This alone handles 80% of thundering herd scenarios
Circuit breaker on the database — if database latency exceeds threshold, reject new requests with a fallback (stale data from local cache, degraded response, or 503)
Stale-while-revalidate — serve the expired cache entry while refreshing in the background. The client gets slightly stale data instead of a slow or failed response
Local in-process cache (L1) — small LRU cache in each app server (1000 hot keys). Survives Redis failures with degraded consistency
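Request coalescing is the mitigation interviewers most often ask you to spell out. A minimal thread-based sketch follows; the `SingleFlight` class is hypothetical (it borrows the name of the Go pattern) and only deduplicates calls within one process.

```python
import threading
from typing import Any, Callable, Dict, Optional


class _Call:
    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: Any = None
        self.error: Optional[Exception] = None


class SingleFlight:
    """Collapse concurrent loads of the same key into one call to `fn`."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._calls: Dict[str, _Call] = {}

    def do(self, key: str, fn: Callable[[], Any]) -> Any:
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._calls[key] = call
        if leader:
            try:
                call.result = fn()  # only the leader touches the database
            except Exception as exc:
                call.error = exc    # followers see the same failure
            finally:
                with self._lock:
                    del self._calls[key]
                call.done.set()
        else:
            call.done.wait()        # followers block until the leader finishes
        if call.error is not None:
            raise call.error
        return call.result
```

Usage is `flight.do(cache_key, lambda: load_from_db(cache_key))` on the miss path, where `load_from_db` is whatever origin read the service already has. Because the in-flight entry is removed as soon as the leader finishes, a later miss triggers a fresh load rather than reusing an old result.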

Quantify the recovery: "With singleflight + circuit breaker, a Redis outage degrades latency from 2ms to 50ms (database reads) but doesn't cascade. Without protection, the database dies within 30 seconds and recovery takes 5-10 minutes because the connection pool is exhausted."

Fault Line 3: Invalidation strategy — proactive vs reactive (5-7 min)

"There are three invalidation approaches: (a) TTL-only (reactive — you wait for expiry), (b) explicit invalidation on write (proactive — you delete when data changes), (c) CDC-driven invalidation (proactive + reliable — the database change stream triggers invalidation). I'd use a combination: TTL as a safety net + explicit invalidation for the write path + CDC as a backstop."

The organizational problem: "With 5 microservices writing to the products table, who is responsible for invalidating the product cache? If the price service updates the price but doesn't invalidate the cache, the product service serves stale prices. That's why CDC is the backstop — it doesn't depend on every service remembering to invalidate."
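The CDC backstop can be pictured as a consumer that turns row-change events into cache deletes, independent of which service made the write. The event shape (`table`, `pk`), the `KEY_TEMPLATES` registry, and the key names below are assumptions for illustration and do not follow any particular CDC tool's format.

```python
from typing import Dict, Iterable, List

# Hypothetical registry: which cache keys are derived from which table.
KEY_TEMPLATES: Dict[str, List[str]] = {
    "products": ["product:{pk}", "product_price:{pk}"],
}


def keys_for_change(event: dict) -> List[str]:
    """Cache keys affected by one change event; unknown tables map to nothing."""
    templates = KEY_TEMPLATES.get(event["table"], [])
    return [template.format(pk=event["pk"]) for template in templates]


def apply_changes(cache: dict, events: Iterable[dict]) -> int:
    """Delete every cached key touched by the change stream; returns the count."""
    removed = 0
    for event in events:
        for key in keys_for_change(event):
            # Deleting is idempotent, so replayed or duplicated events are harmless.
            if key in cache:
                del cache[key]
                removed += 1
    return removed
```

The registry is where ownership becomes visible: one reviewed mapping from tables to keys, instead of invalidation calls spread across five services.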

Cache-aside vs write-through vs read-through (3-5 min)

"Cache-aside gives the application full control — it's the most common pattern and the most debuggable. Write-through couples your write latency to cache latency (now every write is DB + cache round trip). Read-through hides the caching logic in a library but makes debugging harder — when data is stale, you can't tell if the cache missed, the TTL is wrong, or the invalidation failed."

Pick a position: "I default to cache-aside for most systems. Write-through only when the write volume is low and you need guaranteed cache warmth (e.g., configuration data). Read-through only when you want the cache to be the primary interface and can accept the debugging cost."

Operational maturity: measuring staleness in production (3-5 min)

"How do you know your cache is serving stale data? You can't just check TTLs — you need to measure actual staleness. The approach: periodically sample cache entries, compare their version/etag against the database source of truth, and report the staleness distribution."

Three metrics that matter:

Hit rate per data type — if product prices have a 99.9% hit rate with a 30s TTL, you're serving a lot of cached prices. Is that safe?
Staleness distribution — p50/p95/p99 age of cache entries at read time. If p99 staleness exceeds the business SLA, your TTL or invalidation is broken
Invalidation propagation delay — time between database write and cache invalidation completion. If this exceeds 5 seconds, your "real-time" invalidation isn't real-time
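The sampling approach described above can be sketched as follows. The data shapes are assumptions: `cache` maps a key to `(cached_version, stored_at)` and `source_versions` holds the current version per key in the database. Percentiles use the nearest-rank method.

```python
import math
import random
from typing import Dict, List, Tuple


def percentile(values: List[float], p: int) -> float:
    """Nearest-rank percentile of a non-empty list (p is an integer 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def sample_staleness(
    cache: Dict[str, Tuple[int, float]],  # key -> (cached_version, stored_at)
    source_versions: Dict[str, int],      # key -> version in the source of truth
    now: float,
    sample_size: int,
    rng: random.Random,
) -> dict:
    """Sample cached keys and report entry age and version mismatches."""
    keys = rng.sample(sorted(cache), min(sample_size, len(cache)))
    if not keys:
        return {"sampled": 0}
    ages = [now - cache[k][1] for k in keys]
    stale = sum(1 for k in keys if cache[k][0] != source_versions.get(k))
    return {
        "sampled": len(keys),
        "stale_fraction": stale / len(keys),  # entries whose version lags the source
        "age_p50": percentile(ages, 50),
        "age_p95": percentile(ages, 95),
        "age_p99": percentile(ages, 99),
    }
```

Age and version mismatch answer different questions: an old entry whose version still matches is harmless, while a young entry with a lagging version means invalidation was missed.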

Phase 6: Wrap-Up (2-3 minutes)

Summarize the key tradeoff — don't just restate your architecture, synthesize the insight:

"Distributed caching is a staleness management problem, not a performance optimization. The Staff-level challenge is: who defines the staleness budget, who owns the invalidation contract, and how do you detect when the contract is violated? The architecture is straightforward — cache-aside with Redis, TTL as safety net, CDC for proactive invalidation. The hard problem is organizational: making sure every write path has a corresponding invalidation path, and measuring staleness in production rather than assuming TTLs are correct."

If time permits, add the counterintuitive insight:

"Sometimes the right answer is to remove the cache. If your cache hit rate is 30%, you're paying for Redis infrastructure to serve 30% of reads while adding invalidation complexity for 100% of writes. At that point, scale the database instead. Caching is not free — it's a complexity trade for latency. Only make that trade when the hit rate justifies it."

Common Timing Mistakes

Level Calibration

Mistake	L5 Does This	L6 Does This
10 min on requirements	Lists every data type to cache	States staleness budget concept in 1 min, moves on
10 min on cache-aside	Explains get-check-miss-fill at tutorial pace	"Cache-aside with Redis. Here's the architecture."
No invalidation discussion	Only mentions TTL	Draws the write-path invalidation + CDC backstop proactively
No thundering herd	Waits for "what if Redis goes down?"	Volunteers singleflight + circuit breaker in the architecture phase
No staleness measurement	Assumes TTLs are correct	Proposes staleness sampling and p99 staleness metrics
No numbers	"It should be fast"	"95% hit rate → database sees 5% of traffic. Redis failure = 20x load increase."
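The "20x" figure in the last row follows from the hit rate alone, and it is worth being able to derive on the spot. A small sketch of the arithmetic; the function names are illustrative.

```python
def origin_rps(total_rps: float, hit_rate: float) -> float:
    """Requests per second that reach the database while the cache is healthy."""
    if not 0.0 <= hit_rate < 1.0:
        raise ValueError("hit_rate must be in [0, 1)")
    return total_rps * (1.0 - hit_rate)


def outage_amplification(hit_rate: float) -> float:
    """How many times more load the database sees when every request misses."""
    if not 0.0 <= hit_rate < 1.0:
        raise ValueError("hit_rate must be in [0, 1)")
    return 1.0 / (1.0 - hit_rate)


if __name__ == "__main__":
    # 10K requests/second at a 95% hit rate: about 500 rps reach the database,
    # and a full cache outage multiplies that by about 20.
    print(round(origin_rps(10_000, 0.95)), round(outage_amplification(0.95)))
```

The amplification grows quickly with hit rate: at 99% it is 100x, which is why a very effective cache makes the origin more fragile to cache loss, not less.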

Reading the Interviewer

Interviewer Signal	What They Care About	Where to Go Deep
Asks about consistency	Data correctness	Invalidation strategy, CDC, staleness measurement (§3.1, §3.5)
Asks about Redis failure	Operational maturity	Thundering herd, circuit breaker, singleflight (§3.3)
Asks about write patterns	Architecture depth	Cache-aside vs write-through, invalidation ownership (§3.2)
Asks "who decides TTLs?"	Organizational design	Staleness budgets, product ownership, per-type policies
Asks about local cache	Performance engineering	L1/L2 cache hierarchy, consistency tradeoffs (§3.4)
Pushes back on cache-aside	Wants to see you reason about alternatives	Write-through for warm cache, read-through for abstraction

What to Deliberately Skip

These topics are traps. L5 candidates spend time on them. Staff candidates name them, dismiss them, and redirect to what matters.

Level Calibration

Topic	Why L5 Goes Here	What L6 Says Instead
Redis vs Memcached	Feels like showing breadth	"Redis — richer data structures, persistence, Pub/Sub for invalidation."
LRU vs LFU eviction	Easy to explain	"LRU for general use, LFU for hot-key workloads. Not the interesting problem."
Cache warming strategies	Seems like completeness	"Pre-warm on deploy from the database. Straightforward."
Redis Sentinel vs Cluster	Infrastructure trivia	"Redis Cluster for sharding, Sentinel for HA on single-shard. Moving on."
Serialization format	Easy to enumerate	"Protocol Buffers for compact binary. JSON for debugging. Not a design decision."

The pattern: acknowledge you know it, state your position in one sentence, redirect to the interesting problem — invalidation, staleness, and organizational ownership.

→ Continue to The Five Fault Lines (§3) for the Staff-grade tradeoff reasoning.

1. The Staff Lens

1.1 Why This Problem Exists in Staff Interviews

This is NOT a "speed up reads" question. Everyone knows how to add Redis.

This is a Consistency & Operational Ownership question that tests:

Whether you understand that caching trades correctness for performance
Whether you reason about invalidation contracts, not just TTLs
Whether you can articulate failure scenarios and their blast radius
Whether you understand the operational burden of cache infrastructure

1.2 The L5 vs L6 Contrast

Level Calibration

Behavior	L5 (Senior) Candidate	L6 (Staff) Candidate
First move	"Add Redis in front of the DB"	Asks "What's the read/write ratio? What's acceptable staleness?"
Invalidation	Defaults to TTL	Designs explicit invalidation contract tied to write path
Failure mode	"Add replicas"	"What's the thundering herd mitigation? Cache-miss circuit breaker?"
Consistency	Assumes caching always helps	Knows when caching hurts (write-heavy, low reuse, consistency-critical)
Ownership	Implementation focus	Platform thinking: who warms, who monitors, who gets paged

Behavior 1: First move (understand the access pattern)

Staff signal: Characterize the workload before proposing architecture.

Why this matters (L5 vs L6)

L5: Jumps to "add a cache" without understanding the access pattern. This leads to caches with low hit rates (write-heavy data), or caches that make consistency bugs worse.

L6: Asks about read/write ratio, access skew (hot keys vs uniform), staleness tolerance, and data size. Then commits to a caching strategy that fits. "This is 95% reads with high locality — caching will help. If it were 50/50 reads/writes, I'd question whether caching adds value."

Behavior 2: Invalidation strategy (TTL is not a strategy)

Staff signal: Design an explicit invalidation contract before discussing eviction.

Why this matters (L5 vs L6)

L5: Says "TTL of 5 minutes" as the primary invalidation mechanism. TTL is a fallback, not a strategy. It means "we'll serve stale data for up to 5 minutes after a write."

L6: Designs invalidation around the write path: "On user update, we invalidate the user cache entry. TTL is a safety net for orphaned entries, not the primary mechanism." The Staff question is: who owns the invalidation contract, and what happens if it fails?

Behavior 3: Failure handling (replicas don't answer thundering herd)

Staff signal: Design for cache failure, not just cache slowness.

Why this matters (L5 vs L6)

L5: Treats cache failure as "add replicas / HA." That improves availability, but doesn't answer: what happens when the cache is cold (restart, failover, new deployment)? What prevents thousands of requests from stampeding the origin?

L6: Designs for thundering herd: request coalescing, cache-miss circuit breaker, probabilistic early expiry. Names the blast radius: "If the cache goes cold, can the origin handle the load? If not, we need a degraded mode."
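Probabilistic early expiry is the least familiar of the three mitigations, so a sketch helps. One common formulation, often called XFetch, refreshes an entry before its TTL with a probability that rises as expiry approaches and scales with how long a recompute takes. The parameter names and the `beta` default are assumptions.

```python
import math
import random
from typing import Callable


def should_refresh_early(
    now: float,
    expires_at: float,
    recompute_s: float,
    beta: float = 1.0,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether this request should refresh the entry before it expires.

    recompute_s: how long the last recomputation took, in seconds.
    beta: values above 1 refresh earlier, values below 1 refresh later.
    """
    # rng() is in [0, 1), so 1 - rng() is in (0, 1] and log() never sees zero.
    jitter = -recompute_s * beta * math.log(1.0 - rng())
    return now + jitter >= expires_at
```

Each request draws its own random jitter, so with many concurrent readers one of them is likely to refresh a hot key shortly before expiry while the rest keep serving the cached value, instead of all missing at the same instant.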

Behavior 4: Consistency model (know when caching hurts)

Staff signal: Articulate when caching makes the system worse.

Why this matters (L5 vs L6)

L5: Treats caching as universally good. "Caching always improves performance."

L6: Knows caching can hurt: write-heavy workloads (invalidation churn), low-reuse data (cache pollution), consistency-critical paths (stale reads cause bugs), small datasets that fit in DB buffer pool (redundant layer). The Staff move is to state when you wouldn't cache.

Behavior 5: Ownership (who warms, who monitors)

Staff signal: Design for the organization, not just one service.

Why this matters (L5 vs L6)

L5: Focuses on "how to implement caching" in isolation. Doesn't consider who warms the cache on deploy, who monitors hit rate, who gets paged when the cache is slow.

L6: Treats caching as a platform concern: standardized patterns, shared infrastructure, consistent observability. "Cache hit rate below 80% for 5 minutes pages the owning team. Warming is automated on deploy."

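1.3 The Key Insight

Caching is a consistency problem disguised as a performance optimization. Staff engineers reason about who pays for staleness and who owns the invalidation contract.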

2. Problem Framing & Intent

2.1 The Three Intents

Before drawing any boxes, ask Why? The caching strategy changes entirely based on intent:

Who Pays Analysis

Intent	Constraint	Strategy	Staleness Tolerance	Failure Mode
Latency Reduction	Speed is everything	Aggressive caching, read-through, local caches	Seconds OK	Serve stale > fail
Origin Protection	Shield backend from load	Cache-aside, circuit breakers, request coalescing	Minutes OK	Degrade gracefully
Cost Optimization	Reduce expensive operations	Precompute + cache, longer TTLs, lazy refresh	Minutes-hours OK	Stale > recompute

This sentence alone separates L5 from L6.

2.2 What's Intentionally Underspecified

The interviewer deliberately avoids specifying:

Read/write ratio
Data size and cardinality
Staleness tolerance
Cache budget (memory cost)
Multi-region requirements

Staff engineers surface these unknowns. Senior engineers assume them away.

2.3 How to Open (The First 2 Minutes)

Ask 1-2 clarifying questions about access pattern and staleness
State your assumption explicitly
Outline your plan: access pattern → caching strategy → invalidation → failure modes → observability

Example opening:

If Asked: How to characterize workload without sounding junior

What interviewers expect you to name:

Read/write ratio (heavy reads = cache helps; balanced = question it)
Access skew (hot keys = cache helps; uniform = less value)
Data freshness requirements (real-time vs eventual)
Data size (fits in memory? needs eviction strategy?)

What NOT to say:

"We need to cache everything" (no selectivity)
"5 minute TTL should be fine" (no reasoning about staleness)
"Redis will handle it" (no architecture)

Staff-calibrated phrasing:

2.4 Terminology (Use Precise Words)

Caching interviews are ambiguous about where and how caching happens. Use precise terms:

Term	What It Means	Consistency Model
Local/In-Process Cache	Same process as application	Per-instance, no coordination
Distributed Cache	Shared cache cluster (Redis, Memcached)	Shared state, single source
CDN Cache	Edge caching for static/semi-static content	Eventually consistent
Database Buffer Pool	DB's internal page cache	Transparent to application

Pattern	Write Path	Read Path	Invalidation
Cache-Aside	Write to DB, invalidate cache	Read cache, miss → read DB → populate cache	Explicit on write
Read-Through	Write to DB	Read cache, miss → cache fetches from DB	TTL or explicit
Write-Through	Write to cache → cache writes to DB	Read from cache	Implicit (write updates cache)
Write-Behind	Write to cache → async write to DB	Read from cache	Implicit (write updates cache)
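The write-path differences in the table are easier to see side by side. The sketch below uses plain dicts for both the cache and the database, so it shows ordering only, not failure handling; all function and class names are illustrative.

```python
from collections import deque
from typing import Any, Deque, Dict, Tuple


def cache_aside_write(db: Dict[str, Any], cache: Dict[str, Any], key: str, value: Any) -> None:
    """Cache-aside: write the database, then invalidate; the next read repopulates."""
    db[key] = value
    cache.pop(key, None)


def write_through_write(db: Dict[str, Any], cache: Dict[str, Any], key: str, value: Any) -> None:
    """Write-through: cache and database are updated in the same synchronous call."""
    cache[key] = value
    db[key] = value


class WriteBehind:
    """Write-behind: the cache is updated now, the database catches up on flush()."""

    def __init__(self, db: Dict[str, Any], cache: Dict[str, Any]) -> None:
        self.db = db
        self.cache = cache
        self.pending: Deque[Tuple[str, Any]] = deque()

    def write(self, key: str, value: Any) -> None:
        self.cache[key] = value
        self.pending.append((key, value))  # the database is stale until flushed

    def flush(self) -> None:
        while self.pending:
            key, value = self.pending.popleft()
            self.db[key] = value


def read_with_fill(db: Dict[str, Any], cache: Dict[str, Any], key: str) -> Any:
    """Read path shared by cache-aside and read-through: miss, fetch, populate.

    The patterns differ in who runs this code: the application (cache-aside)
    or the cache layer itself (read-through).
    """
    if key in cache:
        return cache[key]
    value = db[key]
    cache[key] = value
    return value
```

Write-behind is the only variant where the database can lag the cache, which is why losing the cache node before a flush means losing writes.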

If Asked: Cache topology you should be able to articulate

Describe the layers, not implementation details:

If pressed for specifics:

L1: In-process LRU, 1000 entries, 10-second TTL
L2: Redis cluster, sharded by key hash, 5-minute TTL
Invalidation: Write path invalidates L2, L1 expires via short TTL

What you do NOT need:

Exact memory sizes
Redis cluster configuration
Serialization format details

Staff insight: The topology is simple. The hard part is the invalidation contract and failure handling.

2.5 When NOT to Cache (Staff Candidates Say No)

Staff candidates win interviews by knowing when to not cache. This is the highest-signal behavior.

Do NOT cache when:

Scenario	Why Caching Hurts
Write-heavy data	Invalidation churn exceeds read savings. Cache becomes a write amplifier.
Low-reuse data	Each key read once. Cache fills with cold data, evicts hot data.
Consistency-critical paths	Auth, permissions, financial data. Stale = wrong = incident.
Data already hot in DB	DB buffer pool already caches it. You're adding a redundant layer.
Small datasets	If it fits in DB memory, caching adds latency (network hop) not savings.
Invalidation can't be owned	If no one knows when to invalidate, you will serve stale data forever.

Staff Move: "Before I add caching, let me check if it's even appropriate. What's the read/write ratio? What's the reuse factor? What's the staleness tolerance? For write-heavy or consistency-critical data, I'd push back on caching entirely."

Bar-Raiser Follow-up: "When would you tell the team NOT to cache this data?"

Expected answer: "If the write rate is close to the read rate, invalidation will dominate. If the data is consistency-critical and staleness causes bugs or compliance issues, caching is the wrong tool."

3. The Five Fault Lines

This section contains the Staff-grade tradeoff reasoning. Each fault line includes:

A tradeoff matrix
Explicit "who pays" analysis
L6 vs L7 calibration
Bar-raiser follow-up questions

3.1 Fault Line 1: Freshness vs Performance

Who Pays Analysis

Choice	What Works	What Breaks	Who Pays
Short TTL / Aggressive Invalidation	Fresh data	Higher origin load, more invalidation traffic	Infra (capacity), Eng (complexity)
Long TTL / Lazy Invalidation	Lower origin load	Stale data, user confusion	Product (UX), Support (complaints)

The tradeoff: Every second of TTL is a second of potential staleness. But shorter TTLs mean more origin traffic.

L6 (Staff) answer: Ties staleness budget to business requirements. "For user profile data, 30 seconds is acceptable because users don't expect instant updates. For inventory counts, we need tighter consistency, so we invalidate explicitly on the write path and keep a short TTL as a fallback."

L7 (Principal) answer: Establishes org-wide staleness SLAs by data class. "We have three data tiers: real-time (no caching or <1s), near-real-time (sub-minute), and batch (hour+). Each tier has a standard pattern and observability."

3.2 Fault Line 2: Cache-Aside vs Read-Through vs Write-Through

Who Pays Analysis

Pattern	Invalidation Ownership	Pros	Cons	Who Pays
Cache-Aside	Application owns invalidation	Explicit control, cache failure = DB fallback	Invalidation logic scattered, race conditions	Application teams (implement invalidation per service), Infra (cache operational burden)
Read-Through	Cache owns fetch	Simple read path, automatic population	Cache failure = read failure, less control	Infra (tight coupling to cache service), Application teams (lose control, cache downtime = service downtime)
Write-Through	Cache owns persistence	Always consistent, simple reads	Write latency includes cache, tight coupling	Users (slower writes), Infra (cache-DB coupling, complex failure modes)

L6 (Staff) answer: Chooses cache-aside for most cases because it decouples cache availability from read availability. "If the cache is down, we fall back to the DB — slower, but not broken. Write-through couples us too tightly."

L7 (Principal) answer: Evaluates pattern choice based on failure modes and organizational capability. "Cache-aside requires every team to implement invalidation correctly. If we don't have that discipline, read-through with a managed cache might be safer despite the coupling."

3.3 Fault Line 3: Availability vs Consistency

Decision Framework:

Context	Recommended	Why
User-facing reads	Serve stale	UX > perfect consistency
Financial data	Fail or hit origin	Correctness > availability
Inventory/stock	Depends	Oversell vs undersell tolerance
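The decision table can be expressed as a per-data-class policy that the read path consults when the cache is unavailable. This is a sketch under assumptions: the class names, the `POLICY` mapping, and the choice to send inventory to the origin are illustrative, and "Fail or hit origin" is modelled as hitting the origin when it is healthy and failing otherwise.

```python
from enum import Enum
from typing import Any, Callable, Optional


class OnCacheFailure(Enum):
    SERVE_STALE = "serve_stale"
    HIT_ORIGIN = "hit_origin"


class Unavailable(Exception):
    """Raised when failing is preferred over returning a possibly wrong answer."""


# Hypothetical per-class policy; in practice product signs off on each row.
POLICY = {
    "user_facing": OnCacheFailure.SERVE_STALE,
    "financial": OnCacheFailure.HIT_ORIGIN,
    "inventory": OnCacheFailure.HIT_ORIGIN,  # assumes overselling is the worse outcome
}


def read_degraded(
    data_class: str,
    key: str,
    stale_copy: Optional[Any],
    origin: Callable[[str], Any],
    origin_healthy: bool,
) -> Any:
    """Read path used while the shared cache is unavailable."""
    policy = POLICY[data_class]
    if policy is OnCacheFailure.SERVE_STALE and stale_copy is not None:
        return stale_copy  # possibly old, but the request succeeds
    if origin_healthy:
        return origin(key)
    raise Unavailable(f"{data_class}: no safe answer for {key}")
```

An unknown data class raises `KeyError` on purpose: a data type with no agreed policy should not silently inherit one.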

L6 (Staff) answer: Classifies data by consistency requirements. "User preferences can serve stale — users won't notice a 30-second delay. But inventory must hit origin on cache miss because overselling is worse than latency."

L7 (Principal) answer: Defines consistency tiers as organizational policy with standard patterns and observability for each tier.

Ownership note: In practice, availability-first choices (serve stale) shift cost to Users (stale data experience) and Support (complaint volume). Consistency-first choices (fail or hit origin) shift cost to Engineering (circuit breaker complexity) and Infra (origin capacity to handle cache-miss load).

→ For the complete decision framework, see the Degraded Mode Framework, which applies to cache failures, origin protection, and graceful degradation.

3.4 Fault Line 4: Local Cache vs Distributed Cache

Who Pays Analysis

Choice	Latency	Consistency	Memory Efficiency	Who Pays
Local (in-process)	Microseconds	Per-instance (inconsistent across instances)	Duplicated per instance	Users (inconsistent experience), Infra (N × cache memory cost)
Distributed (Redis)	Milliseconds	Shared (consistent across instances)	Shared pool	Users (network latency per request), Infra (Redis operational burden)
Multi-tier (L1 + L2)	Microseconds + fallback	L1 inconsistent, L2 consistent	Best of both	Engineering (two-tier invalidation complexity), Infra (manage both systems)

The tradeoff: Local caches are fast but create consistency issues across instances. Distributed caches are consistent but add network latency.

L6 (Staff) answer: Uses multi-tier for high-traffic data. "L1 in-process cache with 10-second TTL absorbs repeat requests within a single instance. L2 Redis handles cross-instance consistency. L1 staleness is bounded by its short TTL."

L7 (Principal) answer: Standardizes multi-tier patterns across the org with clear guidance on when to use each tier and how to reason about aggregate staleness (L1 TTL + L2 TTL).
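To make the L1 + L2 staleness arithmetic concrete, here is a minimal Python sketch of a two-tier read path. `TTLCache`, `tiered_get`, and the injectable clock are illustrative names rather than any library's API, and a plain dict stands in for the shared L2 store (Redis in the text above).

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny TTL map; stands in for either tier in this sketch."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data: Dict[str, Tuple[Any, float]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._data[key] = (value, self.clock() + self.ttl)


def tiered_get(key: str, l1: TTLCache, l2: TTLCache,
               load_from_origin: Callable[[str], Any]) -> Any:
    """Read L1, then L2, then the origin. A None value counts as a miss here."""
    value = l1.get(key)
    if value is not None:
        return value
    value = l2.get(key)
    if value is None:
        value = load_from_origin(key)
        l2.set(key, value)
    l1.set(key, value)  # L1 gets a fresh TTL even if the L2 entry is old
    return value


def worst_case_staleness(l1_ttl: float, l2_ttl: float) -> float:
    """Upper bound on staleness when neither tier is explicitly invalidated."""
    return l1_ttl + l2_ttl
```

Because L1 is refilled from L2 with a fresh TTL, an entry that is about to expire in L2 can live on in L1 for a further L1 TTL. That is where the sum comes from.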

3.5 Fault Line 5: Proactive vs Reactive Invalidation

Who Pays Analysis

Approach	Freshness	Complexity	Failure Mode	Who Pays
Proactive (push on write)	Immediate	High (coordination)	Failed invalidation = stale data	Engineering (invalidation logic in all write paths), Infra (coordination overhead)
Reactive (TTL expiry)	Delayed	Low	Guaranteed staleness up to TTL	Product (stale data complaints), Support (explain delays to users)
Hybrid (push + TTL fallback)	Immediate with safety	Medium	Best of both	Engineering (implementation complexity), but reduces Product/Support burden

L6 (Staff) answer: Uses hybrid — proactive invalidation on write with TTL as safety net. "On user update, we delete the cache key. TTL handles cases where invalidation fails or the write path changes."

L7 (Principal) answer: Implements change data capture (CDC) for invalidation at scale. "Rather than coupling invalidation to every write path, we tail the DB changelog and invalidate asynchronously. This decouples writers from cache knowledge."
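The hybrid approach from the L6 answer (delete on write, TTL as safety net) can be sketched as follows. This is a minimal illustration under stated assumptions: `CacheAsideStore` is a hypothetical class, a dict stands in for both the database and the cache, and the `invalidate=False` flag exists only to simulate a failed invalidation.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CacheAsideStore:
    """Cache-aside reads with delete-on-write plus a TTL fallback."""

    def __init__(self, db: Dict[str, Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.db = db
        self.ttl = ttl_seconds
        self.clock = clock
        self.cache: Dict[str, Tuple[Any, float]] = {}

    def read(self, key: str) -> Any:
        entry = self.cache.get(key)
        if entry is not None and self.clock() < entry[1]:
            return entry[0]  # fresh enough according to the TTL
        value = self.db[key]  # miss or expired: go to the source of truth
        self.cache[key] = (value, self.clock() + self.ttl)
        return value

    def write(self, key: str, value: Any, invalidate: bool = True) -> None:
        self.db[key] = value  # update the source of truth first
        if invalidate:
            # Proactive invalidation: delete rather than overwrite the entry.
            self.cache.pop(key, None)
        # With invalidate=False (simulated failure) the stale entry survives
        # until its TTL expires, which bounds the damage.
```

If the delete is skipped or lost, readers see the old value for at most one TTL, which is the "safety net" behavior described above.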

→ For invalidation coordination patterns, see →Distributed Coordination Framework.

4. Failure Modes & Degradation

→ This section applies the →Degraded Mode Framework. Review it if you need the full availability vs consistency decision tree.

4.1 Thundering Herd (Cache Stampede)

The most common cache failure mode. When a popular cache entry expires or the cache restarts, many requests simultaneously miss the cache and hit the origin.

Scenario: Popular Cache Key Expires

Timeline:

t=0:     Cache entry for "homepage_data" expires (TTL)
t=0-1s:  1000 concurrent requests hit cache → all miss
t=0-1s:  1000 requests hit database simultaneously
t=1-2s:  Database CPU spikes to 100%, queries slow
t=2-5s:  Database connection pool exhausted
t=5s+:   Cascading failures, homepage down

What breaks first: Database connection pool, then query latency, then availability.

Mitigations:

Mitigation	How It Works	Tradeoff
Request coalescing	First request fetches; others wait	Added latency for waiters
Probabilistic early expiry	Expire slightly before TTL (jitter)	Some extra origin load
Cache-miss circuit breaker	Limit concurrent origin requests	Some requests fail
Background refresh	Refresh before expiry	Complexity, always slightly stale
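As an illustration of request coalescing from the table above, here is a minimal single-process sketch. `SingleFlightCache` is a hypothetical name; the sketch omits TTLs and only coalesces callers inside one process (coalescing across instances would need a distributed lock or similar, which is not shown).

```python
import threading
from concurrent.futures import Future
from typing import Any, Callable, Dict


class SingleFlightCache:
    """On a miss, one caller loads from the origin; concurrent callers wait."""

    def __init__(self, loader: Callable[[str], Any]):
        self._loader = loader
        self._lock = threading.Lock()
        self._cache: Dict[str, Any] = {}
        self._inflight: Dict[str, Future] = {}

    def get(self, key: str) -> Any:
        with self._lock:
            if key in self._cache:
                return self._cache[key]
            future = self._inflight.get(key)
            leader = future is None
            if leader:
                future = Future()
                self._inflight[key] = future

        if not leader:
            # Follower: block until the leader finishes (or re-raise its error).
            return future.result()

        try:
            value = self._loader(key)  # the only origin call for this miss
        except Exception as exc:
            with self._lock:
                del self._inflight[key]
            future.set_exception(exc)
            raise
        with self._lock:
            self._cache[key] = value
            del self._inflight[key]
        future.set_result(value)
        return value
```

With this in place, the 1000 concurrent misses in the timeline above would produce one origin query, at the cost of the waiters' added latency noted in the table.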

4.2 Hot Key Problem

Similar to rate limiting hot keys. One cache key receives disproportionate traffic, overwhelming a single cache shard.

Scenario: Viral Content

Timeline:

t=0:     Content goes viral
t=1min:  Traffic to one key grows 100x
t=2min:  Single Redis shard CPU at 100%
t=3min:  Shard latency spikes, timeouts begin
t=5min:  Shard becomes unavailable
t=5min+: All requests for hot key fail

Mitigations:

Mitigation	How It Works	Tradeoff
Local cache (L1) for hot keys	Absorb traffic at application tier	Staleness across instances
Key replication	Replicate hot keys across shards	Memory overhead, invalidation complexity
Read replicas	Direct hot-key reads to replicas	Eventual consistency
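Key replication from the table above can be sketched as writing the hot value under several suffixed keys and reading a random copy. The `#r{i}` suffix scheme and the function names are assumptions for illustration, and a dict stands in for the cache cluster. In a sharded cache that hashes the full key, the suffixed copies would typically land on different shards.

```python
import random
from typing import Any, Callable, Dict, List, Optional


def replica_keys(key: str, replicas: int) -> List[str]:
    """All physical keys that hold a copy of one logical hot key."""
    return [f"{key}#r{i}" for i in range(replicas)]


def write_replicated(store: Dict[str, Any], key: str, value: Any, replicas: int) -> None:
    # Every copy must be written; this is the memory overhead in the table.
    for replica in replica_keys(key, replicas):
        store[replica] = value


def read_replicated(store: Dict[str, Any], key: str, replicas: int,
                    rng: Callable[[int], int] = random.randrange) -> Optional[Any]:
    # Spread reads by picking one copy at random.
    return store.get(f"{key}#r{rng(replicas)}")


def invalidate_replicated(store: Dict[str, Any], key: str, replicas: int) -> None:
    # Every copy must be deleted; missing one leaves a stale replica behind.
    for replica in replica_keys(key, replicas):
        store.pop(replica, None)
```

The invalidation loop is where the "invalidation complexity" tradeoff shows up: a partial failure leaves some readers on stale data until TTL.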

4.3 Cold Start (Cache Miss Storm)

Scenario: Cache Cluster Restart

Timeline:

t=0:     Redis cluster restarts (maintenance, failure)
t=0:     100% cache miss rate
t=0-1m:  All reads hit database
t=1m:    Database connection pool exhausted
t=2m:    Service degradation or outage

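Why "just restart" doesn't work: If your cache absorbs 90% of read load, the origin normally sees only 10% of reads. A cold cache sends 100% of reads to the origin, which is 10x its usual load, and it arrives immediately rather than gradually.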

Mitigations:

Mitigation	How It Works	Tradeoff
Cache warming on deploy	Pre-populate cache before traffic shift	Deployment complexity, warming time
Gradual traffic shift	Slowly move traffic to new cache	Longer rollout, coordination
Origin capacity headroom	Size origin for cache-miss load	Cost (overprovisioned DB)
Stale-while-revalidate	Serve old cache + async refresh	Need persistent cache

4.4 Cache Inconsistency (Stale Data Bugs)

Scenario: Failed Invalidation

Timeline:

t=0:     User updates their email
t=0:     DB write succeeds
t=0:     Cache invalidation fails (network blip)
t=0-5m:  User sees old email in UI
t=5m:    TTL expires, fresh data served

What makes this insidious: Silent failure. No errors, no alerts. Just wrong data.

Mitigations:

Mitigation	How It Works	Tradeoff
Write-through invalidation	Invalidate in same transaction	Coupling, latency
Version/generation stamps	Include version in cache key	Key cardinality
Idempotent invalidation	Retry invalidation	Complexity
Short TTL fallback	Limit staleness window	More origin load
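A minimal sketch of the "idempotent invalidation" row: because deleting a cache key is idempotent, it is safe to retry after a transient failure. The function name, the `ConnectionError` failure type, and the backoff values are assumptions for illustration; a real client would raise its own error types.

```python
import time
from typing import Callable


def invalidate_with_retry(delete: Callable[[str], None], key: str,
                          attempts: int = 3, base_delay: float = 0.05,
                          sleep: Callable[[float], None] = time.sleep) -> bool:
    """Retry a cache delete with exponential backoff.

    Returns True on success and False if every attempt failed. The caller
    should emit a metric or log on False so the failure is not silent.
    """
    for attempt in range(attempts):
        try:
            delete(key)  # deleting an already-missing key is a no-op
            return True
        except ConnectionError:
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    return False
```

Returning a boolean instead of swallowing the error gives the write path a hook for alerting, which addresses the "no errors, no alerts" problem described above. The TTL still bounds staleness when all retries fail.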

4.5 Operational Reality Matrix

Failure	Loud/Silent	User Impact	Detection Time
Cache down	Loud	Latency spike or errors	Seconds
Cache slow	Medium	Latency degradation	Minutes
Thundering herd	Loud	Origin overload	Seconds to minutes
Hot key	Medium	Single-key latency	Minutes
Stale data	Silent	Wrong data shown	Hours to never
Cold start	Loud	Origin overload	Seconds

5. Evaluation Rubric

5.1 Level-Based Signals

Level Calibration

Dimension	L5/Senior	L6/Staff	L7/Principal
Access Pattern	Assumes caching helps	Characterizes workload; knows when caching hurts	Establishes patterns by data classification
Invalidation	TTL-only	Explicit invalidation contract + TTL fallback	CDC-based invalidation, org-wide patterns
Consistency	Assumes fresh data	Articulates staleness budget by data type	Defines consistency tiers as policy
Failure Modes	"Add replicas"	Thundering herd, hot key, cold start mitigations	Capacity planning for cache-miss scenarios
Multi-tier	Single cache layer	L1 + L2 with clear reasoning	Standardized multi-tier patterns
Ownership	Implementation focus	Warming, monitoring, paging ownership	Platform-wide caching strategy

5.2 Strong Hire Signals

Signal	What It Looks Like
Staleness Reasoning	"30 seconds of staleness is acceptable for this use case because..."
Invalidation Design	"We invalidate on write, with TTL as safety net for failed invalidations"
Failure Awareness	"When the cache is cold, we need to protect the origin with circuit breakers"
Ownership Thinking	"Who warms the cache on deploy? Who gets paged when hit rate drops?"

5.3 Lean No Hire Signals

Signal	What It Looks Like
Redis Fixation	15 minutes on Redis internals without tradeoffs
TTL-Only Thinking	"We'll set a 5 minute TTL" with no invalidation strategy
Ignoring Failures	No mention of thundering herd, cold start, or stale data
Missing Intent	Caches everything without reasoning about staleness tolerance

5.4 Common False Positives

Knows Redis deeply: Deep Redis knowledge ≠ good cache design
Mentions all patterns: Breadth without depth is Senior, not Staff
Complex diagrams: Multi-tier diagrams without invalidation reasoning

6. Interview Flow & Pivots

6.1 Typical 45-Minute Structure

Phase	Time	What Happens
Framing	5 min	Clarify access pattern, staleness tolerance
Requirements	5 min	Read/write ratio, data size, consistency needs
High-Level Design	10 min	Caching pattern, invalidation strategy
Deep Dive	15 min	Failure modes, thundering herd, consistency
Wrap-Up	10 min	Operations, monitoring, evolution

6.2 How Interviewers Pivot

After You Say...	They Will Probe...
After "add Redis"	"What's your invalidation strategy?"
After invalidation discussion	"What happens during thundering herd?"
After scaling discussion	"How do you handle hot keys?"
After happy path	"What if the cache is completely cold?"

6.3 What Silence Means

After tradeoff question: Interviewer wants you to reason aloud
After "what about consistency?": You're missing staleness reasoning
After definitive answer: They may disagree or want nuance

6.4 Follow-Up Questions to Expect

"What happens when the cache is cold?"
"How do you handle thundering herd?"
"What's your invalidation strategy on write?"
"How do you detect stale data bugs?"
"What's the staleness budget for this data?"
"When would you NOT cache this data?"

7. Active Drills

Practice these scenarios to internalize Staff-level thinking. Try answering before revealing the Staff approach.

Drill 1: The Opening (Access Pattern + Staleness)

Interview Prompt

Interview prompt: "Design a caching layer for our product catalog."

Staff Answer

Step	Staff Answer
Clarify	Ask about read/write ratio, catalog size, staleness tolerance, traffic patterns
Assume	"I'll assume 95% reads, 100K products, 1-minute staleness acceptable"
Outline	Access pattern → caching strategy → invalidation → failure modes → observability

Why this is L6:

Starts with access-pattern discovery before choosing a technology — intent-driven design, not solution-first
States explicit staleness assumptions up front — shows product-awareness and frames the tradeoff space
Includes failure modes and observability in the outline — proves ownership extends beyond the happy path

Drill 2: Invalidation Strategy

Interview Prompt

Interview prompt: "How do you keep the cache consistent with the database?"

Staff Answer

Step	Staff Answer
Primary	Write-path invalidation: "On product update, delete cache key"
Fallback	TTL as safety net: "5-minute TTL catches failed invalidations"
Edge cases	Race conditions, eventual consistency, version stamps if needed

Why this is L6:

Layers primary invalidation with a TTL safety net — defense-in-depth thinking, not single-strategy reliance
Calls out race conditions and version stamps — anticipates failure modes a Senior would overlook
Treats consistency as a spectrum, not a binary — articulates tradeoffs rather than picking one extreme

Drill 3: Thundering Herd

Interview Prompt

Interview prompt: "A popular cache key expires. What happens?"

Staff Answer

Step	Staff Answer
Problem	1000 requests hit origin simultaneously
Mitigation	Request coalescing: first request fetches, others wait
Prevention	Probabilistic early expiry (jitter), background refresh

Why this is L6:

Separates mitigation (coalescing) from prevention (jitter, background refresh) — shows systems thinking across time horizons
Names the specific failure cascade (1000 requests hit origin) — quantifies blast radius rather than hand-waving
Proposes probabilistic early expiry — demonstrates awareness of techniques that prevent problems at scale, not just react to them
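One way to implement probabilistic early expiry is the formulation often called XFetch: each reader refreshes early with a probability that rises as expiry approaches, scaled by how long a recompute takes. The playbook does not name a specific algorithm, so treat this as one possible choice. The function and parameter names are illustrative.

```python
import math
import random
from typing import Callable


def should_refresh_early(now: float, expires_at: float, recompute_seconds: float,
                         beta: float = 1.0,
                         rng: Callable[[], float] = random.random) -> bool:
    """Return True if this reader should recompute the value before expiry.

    recompute_seconds: how long the last origin fetch took.
    beta: values above 1.0 refresh earlier, below 1.0 refresh later.
    """
    u = 1.0 - rng()  # uniform in (0, 1], which avoids log(0)
    # -log(u) is non-negative, so the reader effectively looks ahead in time
    # by a random amount proportional to the recompute cost.
    return now - recompute_seconds * beta * math.log(u) >= expires_at
```

Since each reader draws its own random number, refreshes are spread over the window before expiry instead of all arriving at the instant the TTL lapses.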

Drill 4: Cache Failure

Interview Prompt

Interview prompt: "Redis is down. What happens to your service?"

Staff Answer

Step	Staff Answer
Decide	By data type: user data falls back to DB, financial data fails explicitly
Protect	Circuit breaker on origin, request limiting
Observe	Alert on cache unavailability, monitor origin load

→ Review →Degraded Mode Framework for the complete decision tree.

Why this is L6:

Differentiates fallback strategy by data type (user data vs financial data) — not a one-size-fits-all answer
Explicitly chooses "fail loudly" for financial data — demonstrates safety-first thinking over availability bias
Includes circuit breakers and origin protection — owns the downstream impact, not just the cache layer
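The decide-and-protect steps can be sketched as a per-data-type policy plus a concurrency limit on origin calls. This is a simplified stand-in: the policy table, exception names, and `OriginLimiter` are hypothetical, and the limiter is a plain concurrency cap rather than a full circuit breaker with open and half-open states.

```python
import threading
from enum import Enum
from typing import Any, Callable, Optional


class OnCacheDown(Enum):
    FALL_BACK_TO_ORIGIN = "fall_back_to_origin"
    FAIL = "fail"


# Illustrative policy table; the real one is a product and risk decision.
POLICY = {
    "user_profile": OnCacheDown.FALL_BACK_TO_ORIGIN,
    "account_balance": OnCacheDown.FAIL,
}


class CacheUnavailable(Exception):
    pass


class OriginOverloaded(Exception):
    pass


class OriginLimiter:
    """Caps concurrent origin calls; excess callers fail fast."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn: Callable[..., Any], *args: Any) -> Any:
        if not self._slots.acquire(blocking=False):
            raise OriginOverloaded("origin concurrency limit reached")
        try:
            return fn(*args)
        finally:
            self._slots.release()


def read(data_type: str, key: str,
         cache_get: Callable[[str], Optional[Any]],
         origin_get: Callable[[str], Any],
         limiter: OriginLimiter) -> Any:
    try:
        cached = cache_get(key)
        if cached is not None:
            return cached
    except CacheUnavailable:
        if POLICY[data_type] is OnCacheDown.FAIL:
            raise  # fail loudly for data that must not degrade silently
    return limiter.call(origin_get, key)  # ordinary miss or permitted fallback
```

Routing every origin call through the limiter means a cache outage degrades into fast failures for some requests instead of an exhausted database connection pool.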

Drill 5: Hot Key

Interview Prompt

Interview prompt: "One product gets 90% of traffic. What breaks?"

Staff Answer

Step	Staff Answer
Problem	Single cache shard overwhelmed
Mitigation	L1 local cache for hot keys, key replication
Detection	Metrics on per-key QPS, automated hot-key detection

Why this is L6:

Identifies the root infrastructure failure (single shard overwhelmed) — reasons about the system layer, not just the application layer
Proposes L1 local cache and key replication — shows multi-tier caching awareness beyond basic Redis usage
Includes automated hot-key detection — builds operational feedback loops rather than relying on manual monitoring
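Automated hot-key detection can start as simply as counting key frequency over a recent window of requests. The sketch below assumes you can sample accessed keys into a list; `hot_keys` and the 10% default threshold are illustrative choices, not recommendations.

```python
from collections import Counter
from typing import Iterable, List


def hot_keys(recent_keys: Iterable[str], share_threshold: float = 0.1) -> List[str]:
    """Return keys whose share of recent traffic meets the threshold."""
    counts = Counter(recent_keys)
    total = sum(counts.values())
    return [key for key, n in counts.most_common() if n / total >= share_threshold]
```

The output could feed the mitigations above, for example by promoting flagged keys into the L1 local cache.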

Drill 6: Consistency vs Performance

Interview Prompt

Interview prompt: "Users complain they see stale data after updating their profile."

Staff Answer

Step	Staff Answer
Diagnose	Is invalidation failing, or is TTL too long?
Fix	Write-through invalidation, shorter TTL, or read-your-writes pattern
Communicate	Product decision: "changes may take a moment" vs immediate consistency

Why this is L6:

Starts with diagnosis before jumping to a fix — distinguishes root cause from symptoms
Offers read-your-writes as a targeted pattern — applies the right consistency model for the use case, not blanket strong consistency
Brings product communication into a technical answer — recognizes that user expectation is an engineering constraint, not just a PM concern

Drill 7: Build vs Buy

Interview Prompt

Interview prompt: "Should we use Redis, Memcached, or a managed service?"

Staff Answer

Step	Staff Answer
Evaluate	Data structures needed, persistence requirements, operational capacity
Compare	Self-managed (more control over configuration and cost, but you carry the ops burden) vs managed (ops burden offloaded, bundled features)
Recommend	Usually managed unless specific requirements demand self-hosted

Why this is L6:

Evaluates operational capacity as a first-class criterion — understands that team ability to run infrastructure matters as much as features
Frames the decision around requirements, not preferences — avoids the "I like Redis" trap that signals Senior-level thinking
Defaults to managed with an explicit escape hatch — shows organizational awareness of where engineering time is best spent

→ For the complete framework, see →Build vs Buy Framework.

Drill 8: Multi-Region Caching

Interview Prompt

Interview prompt: "We're expanding to Europe. How does caching change?"

Staff Answer

Step	Staff Answer
Options	Per-region caches (simple, inconsistent) vs global cache (complex, consistent)
Tradeoff	Cross-region latency (100ms+) vs staleness across regions
Recommend	Per-region caches with async invalidation for most use cases

Why this is L6:

Lays out both architecture options with clear tradeoff dimensions (latency vs consistency) — structured decision-making, not gut feel
Quantifies the cross-region latency cost (100ms+) — grounds the tradeoff in real numbers that drive the recommendation
Recommends async invalidation as the default — balances pragmatism with correctness rather than chasing perfect global consistency

Drill 9: Ownership Conflict — TTL Disagreement

Interview Prompt

Interview prompt: "Product wants 1-hour TTL for faster page loads. Infra says 5 minutes max because of stale data risk. You're the Staff engineer. How do you resolve this?"

Staff Answer

Step	Staff Answer
Reframe	This isn't a TTL debate — it's a staleness tolerance question. What's the actual business impact of stale data?
Investigate	What data is this? User profile (1hr OK)? Inventory (5min risky)? Pricing (unacceptable)?
Propose	Data classification: different TTLs for different data sensitivity. Not one global TTL.
Ownership	Product owns staleness SLA per data type. Infra provides the mechanisms.
Document	Write down: "User preferences: 1hr TTL approved by Product. Inventory: 5min with write-through invalidation."

Staff insight: The conflict exists because nobody defined the staleness contract. The fix is explicit ownership, not splitting the difference on TTL.

Why this is L6:

Reframes a technical disagreement as a missing contract — solves the organizational root cause, not the surface argument
Introduces data classification with per-type TTLs — shows that the right answer is nuanced, not a single compromise number
Assigns explicit ownership (Product owns staleness SLA, Infra provides mechanisms) — demonstrates cross-team boundary thinking

Drill 10: Ownership Conflict — Cache Hides a Bug

Interview Prompt

Interview prompt: "A pricing bug went unnoticed for 2 weeks because the cache kept serving correct (cached) prices while the DB had wrong values. Now leadership wants to know why QA didn't catch it. What do you say?"

Staff Answer

Step	Staff Answer
Diagnose	The cache masked the bug. QA tested via the cache (fast path), never hit the DB (slow path).
Root cause	No cache-bypass testing. No correctness monitoring comparing cache vs source.
Immediate fix	Add cache-bypass test cases. Add periodic cache-vs-DB consistency checks.
Systemic fix	Cache correctness metrics: sample reads and compare to source of truth. Alert on divergence.
Ownership	Who owns cache correctness? Not QA — they test features. Infra or platform team owns cache health.

Staff insight: Caches amplify bugs by serving wrong answers faster and more consistently. The fix is correctness observability, not blaming QA.

Why this is L6:

Redirects blame away from QA toward a systemic gap (no cache-bypass testing) — shows leadership maturity in incident response
Proposes cache-vs-source correctness monitoring — builds continuous verification, not just one-time test coverage
Identifies that caches amplify bugs as an architectural insight — reasons about emergent system behavior, not just component behavior
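The "sample reads and compare to source of truth" check can be sketched as a small function run periodically or on a fraction of live reads. The names and the sampling approach are assumptions for illustration; `cache_get` and `origin_get` stand in for your real clients.

```python
import random
from typing import Any, Callable, Iterable, Optional, Tuple


def sample_divergence(keys: Iterable[str],
                      cache_get: Callable[[str], Optional[Any]],
                      origin_get: Callable[[str], Any],
                      sample_rate: float,
                      rng: Callable[[], float] = random.random) -> Tuple[int, int]:
    """Compare a sample of cached values against the source of truth.

    Returns (checked, mismatched). Alert when mismatched / checked exceeds
    whatever divergence budget the data owner has agreed to.
    """
    checked = 0
    mismatched = 0
    for key in keys:
        if rng() >= sample_rate:
            continue  # not sampled
        cached = cache_get(key)
        if cached is None:
            continue  # nothing cached, nothing to compare
        checked += 1
        if cached != origin_get(key):
            mismatched += 1
    return checked, mismatched
```

A check like this flags divergence in either direction, including the case in this drill where the cache was right and the database was wrong.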

8. Deep Dive Scenarios

Scenario-based analysis for Staff-level depth

These scenarios test Staff-level operational thinking. Unlike drills (which test interview responses), deep dives test ownership reasoning.

Deep Dive 1: Black Friday Cache Failure

Staff-grade phrasing

Typical L5 Approach: A Senior engineer would focus on the cache itself — checking Redis metrics, looking for memory pressure or high CPU on the cache nodes, and likely proposing to scale up the Redis cluster or add read replicas. They would treat this as a caching infrastructure problem and try to restore hit rate by increasing cache capacity. Their response is technically sound but scoped entirely to the cache layer without considering downstream blast radius.

Staff Approach: The Staff answer shifts the focus from the cache to the origin. A 25% hit rate drop on Black Friday could mean a 5x increase in database load, so the real patient is the database, not the cache. The Staff engineer triages by asking whether the origin can survive the extra load, identifies the root cause pattern (hot keys from a viral deal vs. memory-driven evictions vs. traffic exceeding capacity plans), and activates guardrails like circuit breakers and traffic shedding while diagnosing. Critically, the post-mortem question — why didn't capacity planning model Black Friday traffic patterns? — reveals organizational thinking about preventing recurrence, not just resolving the incident.

Staff Answer

Phase	What to do
Immediate (0-5 min)	Is this a cache problem or an origin problem? Check whether the origin can absorb the extra miss traffic, which can be a multiple of its normal load.
Triage	Hot keys (one product viral)? Memory pressure (evictions)? Network saturation?
Quick fix	If hot keys, enable L1 caching. If memory, increase cluster size or reduce TTL for low-value keys.
Guardrails	Circuit breaker on origin, shed low-priority traffic if needed.
Post-mortem	Why didn't capacity planning catch this? Load test with Black Friday traffic patterns.

Staff insight: A 25% cache hit rate drop on Black Friday could mean a 5x increase in DB load. The cache isn't the patient; the origin is.
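The relationship between hit rate and origin load is simple arithmetic: origin load is proportional to the miss rate, so the multiplier is the ratio of miss rates after and before. A small sketch, assuming total read traffic is unchanged and reading the 25% as percentage points (the function name is illustrative):

```python
def origin_load_multiplier(hit_rate_before: float, hit_rate_after: float) -> float:
    """How many times more reads reach the origin after a hit-rate change."""
    miss_before = 1.0 - hit_rate_before
    miss_after = 1.0 - hit_rate_after
    if miss_before <= 0:
        raise ValueError("hit_rate_before must be below 1.0")
    return miss_after / miss_before
```

For example, a 25-point drop from 93.75% to 68.75% moves the miss rate from 6.25% to 31.25%, a 5x increase in origin reads. The same 25-point drop from a lower starting hit rate produces a smaller multiplier, so the baseline matters as much as the size of the drop.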

Deep Dive 2: Stale Data Incident

Staff-grade phrasing

Typical L5 Approach: A Senior engineer would investigate the specific cache entry, find the 3-hour TTL, and propose shortening it to something like 5 minutes. They might also add write-through invalidation for the payment method cache key so updates take effect immediately. The fix is correct for this one key and this one incident, but it treats the problem as a TTL tuning exercise rather than a systemic data classification gap.

Staff Approach: The Staff answer recognizes this as a data classification failure, not a TTL problem. Payment data should never have been cached with a long TTL in the first place — or arguably should not be cached at all. The Staff engineer asks the broader question: what other sensitive data (billing, authentication tokens, permissions) is sitting in the cache with similarly inappropriate TTLs? They propose an organizational remedy — a data classification policy where financial and compliance-sensitive data requires explicit approval before caching, with mandatory write-through invalidation if caching is approved. The fix is not patching one key; it is ensuring the class of failure cannot recur across any service.

Staff Answer

Dimension	Staff Answer
Root cause	Payment data was cached with 3-hour TTL, no write-through invalidation
Immediate	Purge cache for affected customer, verify current state
System fix	Payment data should not be cached, OR write-through invalidation with <1min TTL
Process fix	Data classification policy: financial data requires explicit cache approval
Broader question	What other sensitive data is cached with long TTLs?

Staff insight: Some data should never be cached, or only with write-through invalidation. This is a data classification problem, not a TTL tuning problem.

Deep Dive 3: Cache Warming Gone Wrong

Staff-grade phrasing

Typical L5 Approach: A Senior engineer would fix the warming job — profile why it took 2 hours instead of 10 minutes, optimize the queries or parallelism, and perhaps add a timeout so the team knows when warming is running long. They might also add a check to avoid decommissioning the old cache until warming finishes. These are correct fixes to the proximate cause, but they treat cache warming as a background optimization task rather than a deployment-critical gate.

Staff Approach: The Staff answer treats cache warming as a deployment-blocking operation that should never run in parallel with cutover. The deeper question is why the warming job's runtime grew 12x without anyone noticing — the dataset grew, but no one tracked warming duration as a capacity metric. The Staff engineer designs warming with progress tracking and an explicit readiness gate: traffic does not shift until warming reaches a target hit-rate threshold. They also establish ongoing monitoring of warming time trends so that dataset growth is caught weeks before it turns a 10-minute job into a 2-hour outage during a deployment.

Staff Answer

Phase	What to do
Immediate	Why was old cache decommissioned before warming completed? Process failure.
Root cause	Warming job didn't account for dataset growth. What was 10 min last month is 2 hours now.
System fix	Warming job with progress tracking; don't cut over until warming completes
Process fix	Warming is part of deployment checklist, not a background task
Capacity planning	Monitor warming time, alert if it grows significantly

Staff insight: Cache warming is deployment-critical. It should block cutover, not run in parallel.
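A readiness gate of the kind described above can be sketched as a warming job that reports progress plus a check that blocks cutover until the new cache reaches a target hit rate. The function names, the shadow-traffic hit and miss counters, and the thresholds are assumptions for illustration.

```python
from typing import Any, Callable, Dict, Iterator, List


def warm_cache(keys: List[str], load: Callable[[str], Any],
               cache: Dict[str, Any], batch_size: int = 100) -> Iterator[float]:
    """Populate the cache in batches, yielding the fraction completed."""
    total = len(keys)
    for start in range(0, total, batch_size):
        for key in keys[start:start + batch_size]:
            cache[key] = load(key)
        yield min(start + batch_size, total) / total


def cutover_allowed(shadow_hits: int, shadow_misses: int,
                    target_hit_rate: float, min_samples: int = 1000) -> bool:
    """Gate traffic shift on the hit rate observed against the new cache."""
    total = shadow_hits + shadow_misses
    if total < min_samples:
        return False  # not enough evidence yet
    return shadow_hits / total >= target_hit_rate
```

Recording how long `warm_cache` takes on each deploy gives the trend metric mentioned above, so growth in warming time is visible well before it becomes an outage.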

Deep Dive 4: Multi-Tier Cache Debugging

Staff-grade phrasing

Typical L5 Approach: A Senior engineer would investigate the L2 invalidation logic, confirm it fires on write, and then shorten the L1 TTL from 10 seconds to something smaller to reduce the staleness window. They might also look at whether certain application instances are missing invalidation events. The debugging is methodical and correct, but it stays within the bounds of "fix the bug in front of me" without questioning the architectural choice that created the complexity.

Staff Approach: The Staff answer starts by framing the worst-case staleness arithmetic: with L1 at 10 seconds and L2 at 5 minutes, users can see data up to 5 minutes and 10 seconds stale if both tiers are populated just before an invalidation. The intermittent nature of the bug — some users see fresh, some see stale — is the signature of per-instance L1 caches that were never wired into the invalidation path. The Staff engineer forces an explicit architectural decision: either accept and document the 10-second L1 staleness window (with product sign-off), or implement L1 invalidation via pub/sub broadcast, understanding that this adds operational complexity. They also establish aggregate staleness observability so the team can detect multi-tier consistency drift before users report it.

Staff Answer

Dimension	Staff Answer
Hypothesis	L2 was invalidated on write, but L1 on some instances still has stale data
Investigation	Which instances have stale data? Are they missing invalidation events?
Root cause	L1 invalidation wasn't implemented — relied on TTL only
Fix	Either accept 10s staleness (document it) or implement L1 invalidation
Tradeoff	L1 invalidation adds complexity; short TTL may be acceptable

Staff insight: Multi-tier caching multiplies consistency complexity. Aggregate staleness is L1 TTL + L2 TTL in the worst case.

Deep Dive 5: Cache Cost Optimization

Staff-grade phrasing

Typical L5 Approach: A Senior engineer would look at the Redis cluster configuration, identify overprovisioned nodes, and apply standard cost-reduction techniques: compress large values, reduce TTLs for infrequently accessed keys, and switch to a smaller instance type where possible. They might also recommend enabling Redis memory-efficient data structures. These optimizations are valid and may achieve the 40% target, but they optimize the existing cache footprint without questioning what should be cached in the first place.

Staff Approach: The Staff answer begins with a usage audit — not how much memory the cache uses, but what is in it and whether it should be there. Often 20% of cached keys drive 80% of hit-rate value, meaning a large fraction of cache memory is occupied by low-reuse or never-reused entries that were cached by default rather than by design. The Staff engineer examines hit rate by key pattern to identify entire categories of data that can be removed from the cache without meaningful performance impact. They also investigate why costs tripled — was it organic traffic growth, a new service caching aggressively without review, or dataset bloat from missing eviction? The fix is a caching governance process where new cache usage is reviewed for cost-benefit, not just a one-time optimization pass.

Staff Answer

Phase	What to do
Analysis	What's in the cache? Key cardinality, size distribution, hit rate by key pattern
Low-hanging fruit	Remove low-hit-rate keys, reduce TTL for rarely-accessed data
Architecture	Can we move cold data to cheaper storage? Tiered caching?
Compression	Compress large values, more efficient serialization
Eviction tuning	Are we caching data that's never reused? Adjust eviction policy

Staff insight: Cache cost optimization starts with understanding what's being cached and why. Often 20% of keys drive 80% of value.
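The "hit rate by key pattern" analysis can be sketched as an aggregation over sampled cache accesses. This assumes keys follow a `namespace:id` convention and that you can log `(key, was_hit)` pairs; both are assumptions, and the function name is illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def hit_rate_by_prefix(access_log: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate hit rate per key namespace (the text before the first ':')."""
    stats: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # [hits, misses]
    for key, was_hit in access_log:
        prefix = key.split(":", 1)[0]
        stats[prefix][0 if was_hit else 1] += 1
    return {prefix: hits / (hits + misses) for prefix, (hits, misses) in stats.items()}
```

Namespaces with a persistently low hit rate are candidates for removal from the cache, especially when combined with their memory footprint.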

9. Level Expectations Summary

What gets you each level in a caching interview:

Level Calibration

Level	Minimum Bar	Key Signals
L5 (Senior)	Knows cache-aside pattern + Redis basics + understands TTL	Can implement a working cache layer
L6 (Staff)	Access pattern analysis + invalidation contracts + failure modes + ownership thinking	Designs a cache you can operate
L7 (Principal)	Data classification + org-wide patterns + consistency tiers + build-vs-buy reasoning	Designs a caching platform

What Separates Each Level

Level Calibration

Transition	The Gap
L5 → L6	From "add a cache" to "what's the staleness contract and who owns it"
L6 → L7	From "my service's cache" to "the organization's caching strategy"

Quick Self-Check

Before your interview, verify you can answer:

What's the read/write ratio threshold where caching stops helping?
What's your invalidation strategy, and what's the TTL fallback?
How do you handle thundering herd on cache miss?
What's the staleness budget, and who signed off on it?
When would you NOT cache this data?

The Bar for This Question

Mid-level (L4/E4): You should be able to implement cache-aside with Redis, set reasonable TTLs, and explain the basic read path (check cache → miss → query DB → populate cache → return). You can describe cache hits and misses and why caching improves latency. Understanding cache eviction policies (LRU) or invalidation challenges would be a bonus but isn't expected.

Senior (L5/E5): You should quickly establish the caching pattern (cache-aside vs read-through vs write-through) based on the access pattern and spend time on the real problems: cache invalidation strategy (TTL vs event-driven), thundering herd on cache miss (locking, request coalescing), cache key design and its impact on hit rate, and the staleness contract — how stale is acceptable and who signed off on it. You should quantify: "We cache product catalog with a 5-minute TTL because the business accepts 5 minutes of stale pricing in exchange for 10x lower DB load." Having an opinion on consistent hashing for cache distribution would be strong.

Staff+ (L6/E6+): You should dispatch the baseline architecture in 5 minutes and spend 25+ minutes on operational depth: multi-tier caching (L1 in-process → L2 Redis → L3 CDN), cache warming strategies for cold starts and deploys, the organizational question of who owns the staleness contract (engineering proposes TTLs, product signs off on user-facing staleness), and failure mode analysis — what happens when Redis goes down (do you fall through to DB and crush it, or serve stale from a local cache?). You should reason about cache sizing economics (memory cost vs DB query cost), hot key detection and mitigation (dedicated cache nodes or key splitting), and how caching intersects with consistency requirements across services. The interviewer should see you treat caching as a data freshness contract, not just a performance optimization.

10. Staff Insiders: Controversial Opinions

These are uncomfortable truths that distinguish Staff engineers from Seniors. They're based on operating caches at scale, not on textbook knowledge. Strong engineers disagree on some of these — that's the point.

Most Stale Data Incidents Are Never Detected

The uncomfortable truth: Your cache is probably serving stale data right now. You just don't know it.

Why it's invisible:

Factor	Why It Hides Staleness
No correctness metrics	You measure hit rate, not accuracy
Intermittent symptoms	Users refresh and it "fixes itself"
Blame shifting	"The data was always like that"
TTL masks evidence	By the time you investigate, the stale entry expired

The Staff position: If you can't measure staleness, you can't claim your cache is correct. Most teams measure cache performance but not cache correctness.

Bar-raiser question: "How would you know if your cache served incorrect data for the last hour?"

High Hit Rate Can Mask Correctness Bugs

The uncomfortable truth: 99% hit rate might mean you're serving confidently wrong answers 99% of the time.

Why this happens:

Cache is fast, so nobody questions its answers
Bugs in the origin path are never exercised
Stale data looks like correct data if you don't check

Real-world example: A payment cache served 3-hour-old card data with 99.5% hit rate. Nobody noticed until a customer complained about being charged on a canceled card.

The Staff position: Hit rate is a performance metric, not a correctness metric. A cache with 99% hit rate that's 1% wrong can cause more damage than a cache with 80% hit rate that's always correct.

Removing a Cache Is Often the Right Fix

The uncomfortable truth: Many caching problems are best solved by removing the cache entirely.

Signs you should delete the cache:

Hit rate below 50% (you're paying for misses)
Write rate approaches read rate (invalidation churn)
Consistency bugs that nobody can debug
The origin can handle the load without the cache
Multiple incidents traced to cache staleness

Why teams don't remove caches:

"We already built it"
"It must be helping somehow"
Fear of origin load (often unfounded)
Nobody owns the decision to remove

The Staff position: Adding a cache is easy. Removing one requires courage. The Staff engineer asks: "What if we just... didn't cache this?"

Cache Invalidation Is an Ownership Problem, Not a Technical One

The uncomfortable truth: "Cache invalidation is hard" is a cop-out. It's hard because nobody owns it.

Why invalidation fails:

Failure Mode	Root Cause
Missed invalidation	Writer doesn't know about cache
Partial invalidation	Multiple caches, one forgot
Race conditions	Nobody designed for concurrency
Schema changes break it	Cache contract undocumented

The Staff position: Invalidation is hard because it's a coordination problem across teams, not a technical problem within one service. The fix is ownership clarity, not better algorithms.

Bar-raiser question: "Who is responsible for invalidation correctness across all services that write this data?"

Caches Amplify Bugs — They Don't Just Hide Them

The uncomfortable truth: A bug that affects 1% of requests without a cache might affect 99% of requests with a cache.

How caches amplify:

Without cache:
  - Bug writes bad data to DB
  - 1% of reads hit the bug
  - 99% read correct data from DB

With cache:
  - Bug writes bad data to DB
  - Bad data cached
  - 99% of reads hit cache → 99% see bad data

The Staff position: Caches turn transient bugs into persistent outages. A single bad write + aggressive caching = widespread incorrect data served for TTL duration.

"TTL Was Too Long" Is Never the Root Cause

The uncomfortable truth: When stale data causes an incident, "reduce TTL" is the wrong fix.

Why TTL tuning fails:

It treats symptoms, not causes
Shorter TTL = more origin load
The real question: why wasn't invalidation triggered?

The Staff position: Every "TTL was too long" incident is actually an invalidation ownership failure. The fix is better invalidation, not shorter TTL. Shorter TTL is a band-aid that increases cost.

Real root causes:

Writer didn't know to invalidate
Invalidation code had a bug
Invalidation was async and lost
Nobody owned the cache contract

Appendices (Deep Dive)

Appendix A: Caching Patterns Deep Dive — Cache-aside, read-through, write-through, write-behind

A.1 Cache-Aside (Lazy Loading)

The most common pattern. Application manages cache explicitly.

Read path:

Check cache
On miss: read from DB, populate cache
Return data

Write path:

Write to DB
Invalidate cache (delete key)

Pros:

Cache failure = DB fallback (resilient)
Only cache what's actually read (efficient)
Simple mental model

Cons:

Cache miss = two round trips (latency)
Invalidation logic in application (scattered)
Race condition window between write and invalidate

When to use: Most read-heavy workloads where you want cache failure to be non-fatal.
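
The read and write paths above can be sketched as follows. This is a minimal illustration over dict-like stores; the `CacheAside` class is a hypothetical name, and a real client would add TTLs and error handling:

```python
class CacheAside:
    """Cache-aside over dict-like stores (hypothetical stand-ins)."""

    def __init__(self, cache, db):
        self.cache = cache
        self.db = db

    def read(self, key):
        value = self.cache.get(key)
        if value is None:                # miss: fall back to the DB
            value = self.db.get(key)
            if value is not None:
                self.cache[key] = value  # populate for later readers
        return value

    def write(self, key, value):
        self.db[key] = value             # 1. write to the source of truth
        self.cache.pop(key, None)        # 2. invalidate; next read repopulates

db = {"user:123": "old@example.com"}
store = CacheAside(cache={}, db=db)
first = store.read("user:123")           # miss, populates the cache
store.write("user:123", "new@example.com")
second = store.read("user:123")          # miss again after invalidation
```

Deleting the key on write, rather than updating it, keeps the write path from having to compute the cached representation.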

A.2 Read-Through

Cache handles fetching. Application always reads from cache; cache fetches from DB on miss.

Read path:

Read from cache
On miss: cache fetches from DB, stores, returns

Write path:

Write to DB
Invalidate cache OR let TTL expire

Pros:

Simple application code (always read cache)
Automatic population

Cons:

Cache failure = read failure (coupled)
Cache must understand DB schema
Less control over fetch logic

When to use: When you want to centralize caching logic and can accept cache-as-dependency.
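
A minimal sketch of the idea, assuming the cache is constructed with a `loader` callable that knows how to fetch from the DB. `ReadThroughCache` and `loader` are hypothetical names for illustration:

```python
class ReadThroughCache:
    def __init__(self, loader):
        self._loader = loader   # hypothetical callable: key -> value from the DB
        self._data = {}
        self.loads = 0          # how many times the DB was consulted

    def get(self, key):
        if key not in self._data:
            # The cache, not the application, fetches on a miss.
            self._data[key] = self._loader(key)
            self.loads += 1
        return self._data[key]

    def invalidate(self, key):
        self._data.pop(key, None)

db = {"product:9": "widget"}
cache = ReadThroughCache(loader=db.__getitem__)
cache.get("product:9")          # miss: loader runs
cache.get("product:9")          # hit: loader not called
db["product:9"] = "gadget"
cache.invalidate("product:9")
latest = cache.get("product:9") # miss again: loader sees the new value
```

The application only ever calls `get`. The price is that a loader failure is now a read failure.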

A.3 Write-Through

Cache handles persistence. Write to cache; cache synchronously writes to DB.

Write path:

Write to cache
Cache writes to DB (synchronous)
Confirm to client

Read path:

Always read from cache (always fresh)

Pros:

Cache stays consistent with DB, provided every write goes through the cache
Simple read path
No invalidation needed for writes that go through the cache

Cons:

Write latency includes cache
Cache failure = write failure
Cache must understand DB schema

When to use: When consistency is critical and you can accept cache-as-dependency on writes.

A.4 Write-Behind (Write-Back)

Async persistence. Write to cache; cache asynchronously writes to DB.

Write path:

Write to cache (immediate return)
Cache queues DB write
Background process persists to DB

Read path:

Always read from cache

Pros:

Lowest write latency
Batching opportunities
Absorbs write spikes

Cons:

Data loss risk (cache crash before DB write)
Consistency complexity
Requires durable cache or careful failure handling

When to use: Write-heavy workloads where you can tolerate some data loss risk (analytics, logs, non-critical counters).
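
The data loss window is easiest to see in code. A minimal sketch, assuming an in-memory queue and a dict-like `db`; `WriteBehindCache` and `flush` are hypothetical names, and a real system would run the flush in a background worker:

```python
from collections import deque

class WriteBehindCache:
    def __init__(self, db):
        self.db = db
        self.data = {}
        self.pending = deque()   # queued (key, value) writes not yet in the DB

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))   # returns before the DB sees it

    def read(self, key):
        return self.data.get(key)

    def flush(self, batch_size=100):
        """Persist up to batch_size queued writes; returns how many."""
        written = 0
        while self.pending and written < batch_size:
            key, value = self.pending.popleft()
            self.db[key] = value
            written += 1
        return written

db = {}
cache = WriteBehindCache(db)
cache.write("views", 1)
cache.write("views", 2)
at_risk = len(cache.pending)   # writes that a crash right now would lose
flushed = cache.flush()
```

Everything in `pending` exists only in cache memory until `flush` runs, which is why this pattern needs a durable queue or a tolerance for loss.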

A.5 Pattern Comparison

Pattern	Read Path	Write Path	Consistency	Failure Impact
Cache-Aside	App → Cache → (miss) → DB	App → DB → Invalidate Cache	Eventual	Cache down = DB fallback
Read-Through	App → Cache → (miss) → Cache fetches DB	App → DB → Invalidate Cache	Eventual	Cache down = reads fail
Write-Through	App → Cache	App → Cache → Cache writes DB	Strong	Cache down = writes fail
Write-Behind	App → Cache	App → Cache → async DB	Eventual	Cache down = data loss risk

Appendix B: Eviction Strategies — LRU, LFU, TTL, and when each matters

B.1 Why Eviction Matters

Cache memory is finite. When full, something must go. The eviction policy determines what.

The wrong eviction policy:

Evicts hot data → cache misses → origin load
Keeps cold data → wasted memory → poor hit rate

B.2 Common Eviction Policies

Policy	Evicts	Best For	Weakness
LRU	Least Recently Used	General workloads	One-time scans pollute cache
LFU	Least Frequently Used	Stable hot set	Slow to adapt to changing access patterns
TTL	Expired entries	Time-sensitive data	Doesn't handle memory pressure
Random	Random entry	Large caches, uniform access	Can evict hot data

B.3 LRU (Least Recently Used)

How it works: Evict the entry that hasn't been accessed for the longest time.

Good for: Workloads with temporal locality (recently accessed = likely accessed again).

Bad for: Scan workloads — one-time reads push out hot data.

Access pattern: A B C D A B E F G H ...
LRU cache (size 4):
  After A B C D: [D, C, B, A]
  After A B:     [B, A, D, C]
  After E:       [E, B, A, D] (C evicted)
  After F:       [F, E, B, A] (D evicted)
  After G:       [G, F, E, B] (A evicted)
  After H:       [H, G, F, E] (B evicted)
  The one-time scan E, F, G, H evicts hot keys A and B
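
The same access pattern can be replayed against a small LRU built on `collections.OrderedDict`. This is a sketch for tracing evictions, not a production cache; the `LRUCache` class and its method names are hypothetical:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()   # oldest access first, newest last

    def access(self, key):
        """Touch key; return the evicted key, if any."""
        if key in self._data:
            self._data.move_to_end(key)   # refresh recency on a hit
            return None
        self._data[key] = True
        if len(self._data) > self.capacity:
            evicted, _ = self._data.popitem(last=False)  # drop least recent
            return evicted
        return None

    def keys_mru_first(self):
        return list(reversed(self._data))

cache = LRUCache(4)
evictions = [cache.access(k) for k in "ABCDABEFGH"]
print(evictions)               # evicted key per access, None when nothing was evicted
print(cache.keys_mru_first())  # final contents, most recently used first
```

Running the trace to the end shows the scan keys E, F, G, H pushing out C, D, A, and B in turn, so the hot keys A and B do not outlast a scan that is at least as long as the cache.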

B.4 LFU (Least Frequently Used)

How it works: Evict the entry with the lowest access count.

Good for: Stable hot sets where popular items stay popular.

Bad for: Changing popularity — old popular items block new popular items.

Variants:

LFU with decay: Counts decay over time, allowing popularity shifts
Window LFU: Only count recent accesses

B.5 TTL-Based Eviction

How it works: Entries expire after a fixed time, regardless of access.

Good for: Data with known staleness windows, session data.

Not a memory pressure solution: TTL doesn't help when cache is full but entries haven't expired.

Best practice: Combine TTL with LRU/LFU — TTL for freshness, LRU for capacity.
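
A minimal sketch of that combination, assuming an injectable `clock` so expiry can be exercised without sleeping. `TTLLRUCache` is a hypothetical name:

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    def __init__(self, capacity, ttl_seconds, clock=time.monotonic):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = OrderedDict()   # key -> (value, expires_at), LRU first

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self.clock() >= expires_at:   # TTL handles freshness
            del self._data[key]
            return None
        self._data.move_to_end(key)
        return value

    def set(self, key, value):
        self._data[key] = (value, self.clock() + self.ttl)
        self._data.move_to_end(key)
        while len(self._data) > self.capacity:   # LRU handles capacity
            self._data.popitem(last=False)

now = [0.0]                                 # fake clock for the example
cache = TTLLRUCache(capacity=2, ttl_seconds=10, clock=lambda: now[0])
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")                              # "a" becomes most recently used
cache.set("c", 3)                           # over capacity: evicts "b"
evicted_b = cache.get("b") is None
now[0] = 11.0                               # past the TTL
expired_a = cache.get("a") is None
```

Expiry here is lazy: an expired entry is only removed when it is read, so it still occupies memory until then or until LRU pushes it out.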

B.6 Choosing an Eviction Policy

Workload	Recommended	Why
General web app	LRU	Good default, temporal locality
Stable hot set (popular products)	LFU	Protects frequently-accessed items
Session data	TTL	Natural expiration
Scan-heavy (batch reads)	LRU + scan resistance	Prevent scans from polluting

Staff insight: Most production systems use LRU with TTL. LFU is rarely worth the complexity unless you have a proven stable hot set.

Appendix C: Thundering Herd Mitigations — Request coalescing, probabilistic expiry, circuit breakers

C.1 The Problem

When a popular cache entry expires or is invalidated:

Many requests arrive simultaneously
All miss the cache
All query the origin
Origin is overwhelmed

C.2 Request Coalescing (Singleflight)

How it works: First request triggers the fetch; concurrent requests wait for that result.

Implementation: Use a lock or promise per key. First request acquires lock and fetches; others wait on the promise.

Tradeoffs:

✅ Origin sees single request instead of N
❌ Waiting requests add latency
❌ If fetch fails, all waiters fail
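
A thread-based sketch of the lock-plus-promise approach. `SingleFlight`, `_Call`, and `do` are hypothetical names chosen for this example; the `waiters` counter is an addition for observability, not something the pattern requires:

```python
import threading

class _Call:
    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None
        self.waiters = 0   # followers coalesced onto this fetch

class SingleFlight:
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}   # key -> _Call currently being fetched

    def do(self, key, fetch):
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = self._inflight[key] = _Call()
            else:
                call.waiters += 1
        if leader:
            try:
                call.result = fetch()          # only the leader hits the origin
            except Exception as exc:
                call.error = exc               # shared with every waiter
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()
        else:
            call.done.wait()                   # followers block on the leader
        if call.error is not None:
            raise call.error
        return call.result

sf = SingleFlight()
value = sf.do("user:123", lambda: "profile")   # no contention: a plain call
```

The in-flight entry is removed before waiters are released, so a request arriving after completion starts a new fetch instead of reading a finished one.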

C.3 Probabilistic Early Expiry

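How it works: Each request may treat an entry as expired slightly before its TTL, with randomness to spread refreshes.

Formula: should_refresh = (now >= expiry + TTL * beta * log(random()))

Where beta controls how early refreshes can happen. Because log(random()) is negative for random() in (0, 1), the term pulls the refresh threshold earlier than the nominal expiry; a larger beta pulls it earlier on average.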

Effect: Instead of 1000 requests hitting exactly at TTL, refreshes spread over a window.

Tradeoffs:

✅ Spreads load over time
❌ Some entries refresh "too early" (wasted work)
❌ Requires tuning beta
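
A minimal sketch of the check, written so the early window is an explicit non-negative number. Since log(random()) is negative, the code subtracts its magnitude from the expiry. The function name, the `rng` parameter, and the default `beta=0.05` are assumptions for illustration, not tuned values:

```python
import math
import random

def should_refresh(now, expiry, ttl, beta=0.05, rng=random.random):
    """Return True when this request should refresh the entry early."""
    u = 1.0 - rng()                        # in (0, 1], avoids log(0)
    early_by = -ttl * beta * math.log(u)   # >= 0, exponentially distributed
    return now >= expiry - early_by

# With rng pinned to 0.0 there is no jitter: refresh happens exactly at expiry.
at_expiry = should_refresh(now=100.0, expiry=100.0, ttl=60.0, rng=lambda: 0.0)
before_expiry = should_refresh(now=99.0, expiry=100.0, ttl=60.0, rng=lambda: 0.0)
```

The mean of `early_by` is `ttl * beta`, so with a 60-second TTL and beta of 0.05 the average refresh lands about 3 seconds before expiry. Each request rolls its own random number, so busy keys tend to refresh earlier than quiet ones.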

C.4 Background Refresh (Stale-While-Revalidate)

How it works: Serve stale data immediately while refreshing in background.

Request arrives → Cache has stale entry
  → Return stale data immediately
  → Trigger background refresh
  → Next request gets fresh data

Tradeoffs:

✅ No latency spike on refresh
✅ Origin load is smoothed
❌ Guaranteed staleness during refresh
❌ Complexity (background job, stale tracking)
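
A single-threaded sketch of the flow, assuming a `loader` callable for the origin and an explicit `run_refreshes` method standing in for the background worker. All names here are hypothetical:

```python
import time

class SWRCache:
    def __init__(self, loader, fresh_for, clock=time.monotonic):
        self.loader = loader
        self.fresh_for = fresh_for
        self.clock = clock
        self._data = {}            # key -> (value, fetched_at)
        self.refresh_queue = []    # keys awaiting background refresh

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:          # cold miss: must block on the origin
            return self._load(key)
        value, fetched_at = entry
        is_stale = self.clock() - fetched_at > self.fresh_for
        if is_stale and key not in self.refresh_queue:
            self.refresh_queue.append(key)   # schedule one refresh per key
        return value               # serve immediately, fresh or stale

    def run_refreshes(self):
        """Stand-in for a background worker draining the queue."""
        while self.refresh_queue:
            self._load(self.refresh_queue.pop(0))

    def _load(self, key):
        value = self.loader(key)
        self._data[key] = (value, self.clock())
        return value

origin = {"k": "v1"}
now = [0.0]                        # fake clock for the example
cache = SWRCache(loader=origin.__getitem__, fresh_for=60, clock=lambda: now[0])
first = cache.get("k")             # cold miss, loads "v1"
origin["k"] = "v2"
now[0] = 100.0
stale = cache.get("k")             # still "v1", refresh queued
cache.run_refreshes()
fresh = cache.get("k")             # now "v2"
```

Only the first request for a key ever waits on the origin; every later refresh happens off the request path.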

C.5 Cache-Miss Circuit Breaker

How it works: Limit concurrent requests to origin on cache miss.

Cache miss:
  if (concurrent_origin_requests < limit):
    fetch from origin
  else:
    fail fast (or serve stale if available)

Tradeoffs:

✅ Protects origin from overload
❌ Some requests fail or get stale data
❌ Requires tuning limit
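
The pseudocode above can be sketched with a semaphore that caps concurrent origin fetches. `OriginLimiter` and its parameters are hypothetical names, and treating `stale=None` as "no stale copy available" is an assumption of this example:

```python
import threading

class OriginLimiter:
    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)

    def fetch(self, key, fetch_origin, stale=None):
        # Non-blocking acquire: never queue behind a saturated origin.
        if not self._slots.acquire(blocking=False):
            if stale is not None:
                return stale                       # degrade to stale data
            raise RuntimeError(f"origin saturated; failing fast for {key!r}")
        try:
            return fetch_origin(key)
        finally:
            self._slots.release()

limiter = OriginLimiter(limit=1)

def busy_origin(key):
    # While this fetch holds the only slot, a second miss arrives.
    return limiter.fetch(key, lambda k: "fresh", stale="stale-copy")

nested = limiter.fetch("k", busy_origin)           # inner call is over the limit
after = limiter.fetch("k", lambda k: "fresh")      # slot released, fetch succeeds
```

The limit is a hard ceiling on origin concurrency from this process. With many cache clients, the effective ceiling is the limit multiplied by the number of processes.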

C.6 Choosing a Mitigation

Scenario	Recommended	Why
High-traffic keys	Request coalescing	Prevents duplicate origin fetches
Many keys expiring together	Probabilistic early expiry	Spreads refresh load
Latency-sensitive	Stale-while-revalidate	No refresh latency visible to user
Origin fragile	Circuit breaker	Hard limit on origin load

Staff answer: "I'll use request coalescing as the primary defense, with probabilistic early expiry to prevent synchronized refreshes. Circuit breaker protects the origin if coalescing isn't enough."

Appendix D: Cache Consistency Patterns — Invalidation, versioning, read-your-writes

D.1 The Fundamental Problem

Cache and database are separate systems. Writes go to DB; reads may come from cache. Keeping them consistent is hard.

D.2 Invalidation Patterns

Delete on Write

How it works: After writing to DB, delete the cache key.

Write: UPDATE users SET email='new' WHERE id=123
Then:  DELETE cache:user:123

Problem: Race condition.

t=0: Thread A reads user:123 from DB (old data)
t=1: Thread B writes user:123 to DB (new data)
t=2: Thread B deletes cache:user:123
t=3: Thread A writes old data to cache
Result: Cache has stale data until TTL

Mitigation: Version stamps or short TTL.

Update on Write (Write-Through)

How it works: After writing to DB, update the cache with new value.

Write: UPDATE users SET email='new' WHERE id=123
Then:  SET cache:user:123 = {new data}

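Problem: A similar race, this time between concurrent writers: if the DB applies write A then write B but the cache applies B then A, the cache keeps the older value until TTL. You're also computing the cache value in the write path.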

D.3 Version Stamps

How it works: Include a version number in the cache key or value.

Cache key: user:123:v7
On write: increment version → user:123:v8
Old cached data at v7 is naturally orphaned

Tradeoffs:

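✅ Avoids the stale-populate race: a late write of old data lands on an orphaned key (readers must resolve the version before reading the DB)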
❌ Key cardinality increases
❌ Need to track current version somewhere
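
A minimal sketch that replays the delete-on-write race with versioned keys. `VersionedCache` is a hypothetical name, and keeping the version map in process memory is an assumption; in practice the current version would live with the DB row or in a shared store:

```python
class VersionedCache:
    def __init__(self):
        self.versions = {}   # entity id -> current version
        self.cache = {}      # versioned key -> value

    def key_for(self, entity_id, version):
        return f"{entity_id}:v{version}"

    def current_version(self, entity_id):
        return self.versions.get(entity_id, 0)

    def bump(self, entity_id):
        """Called on write: the old versioned key is orphaned."""
        self.versions[entity_id] = self.current_version(entity_id) + 1

    def get(self, entity_id):
        key = self.key_for(entity_id, self.current_version(entity_id))
        return self.cache.get(key)

    def put(self, entity_id, version, value):
        self.cache[self.key_for(entity_id, version)] = value

vc = VersionedCache()
vc.versions["user:123"] = 7
seen = vc.current_version("user:123")   # Thread A resolves v7, reads old data
vc.bump("user:123")                     # Thread B writes and bumps to v8
vc.put("user:123", seen, "old data")    # Thread A's late populate lands on v7
after_race = vc.get("user:123")         # readers look up v8 and miss
```

The late write is harmless because nobody reads `user:123:v7` anymore. The orphaned entry lingers until TTL or eviction removes it, which is the key cardinality cost.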

D.4 Read-Your-Writes Consistency

Problem: User updates data but immediately sees old cached data.

Solution: After write, user's session bypasses cache for that key.

Write: User updates profile
Set:   session.bypass_cache['user:123'] = now + 30s
Read:  If bypass active, read from DB, not cache

Tradeoffs:

✅ User always sees their own writes
❌ Complexity in read path
❌ Per-session state required
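
The bypass can be sketched as a per-session map of expiry times. `Session`, `note_write`, and `read` are hypothetical names; the 30-second window mirrors the example above, and `cache` and `db` are dict-like stand-ins:

```python
import time

class Session:
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.bypass_until = {}   # key -> time until which reads skip the cache

    def note_write(self, key, window=30.0):
        self.bypass_until[key] = self.clock() + window

    def should_bypass(self, key):
        return self.clock() < self.bypass_until.get(key, float("-inf"))

def read(session, key, cache, db):
    # Go to the DB if this session wrote the key recently, or on a miss.
    if session.should_bypass(key) or key not in cache:
        return db[key]
    return cache[key]

now = [0.0]                                   # fake clock for the example
session = Session(clock=lambda: now[0])
cache = {"user:123": "old profile"}
db = {"user:123": "new profile"}
before = read(session, "user:123", cache, db)  # served from cache
session.note_write("user:123")
during = read(session, "user:123", cache, db)  # bypass active: served from DB
now[0] = 31.0
after = read(session, "user:123", cache, db)   # window over: cache again
```

Only the writing session bypasses the cache, so other users may still see the old value until invalidation or TTL catches up.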

D.5 Change Data Capture (CDC)

How it works: Subscribe to DB changelog; invalidate cache asynchronously.

DB write → Changelog (binlog, WAL)
         → CDC consumer
         → Cache invalidation

Tradeoffs:

✅ Decouples writers from cache knowledge
✅ Catches every committed write, regardless of which service made it
❌ Eventual consistency (lag between write and invalidation)
❌ Infrastructure complexity

When to use: Large-scale systems where write paths are diverse and can't all know about caching.
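
The consumer side can be sketched as a function that maps change events to cache keys and deletes them. The event shape (`table`, `id`, `op` fields) and the `key_for` mapping are assumptions for illustration; real CDC tools emit their own formats:

```python
def apply_change_events(events, cache, key_for):
    """Invalidate cache entries for a batch of change events.

    `events` is an iterable of dicts with hypothetical 'table' and 'id'
    fields; `key_for` maps an event to the cache key it affects.
    Returns how many cached entries were actually removed.
    """
    invalidated = 0
    for event in events:
        key = key_for(event)
        if cache.pop(key, None) is not None:
            invalidated += 1
    return invalidated

cache = {"users:123": {"email": "old"}, "users:456": {"email": "keep"}}
events = [
    {"table": "users", "id": 123, "op": "update"},
    {"table": "users", "id": 999, "op": "delete"},   # not cached: no-op
]
removed = apply_change_events(events, cache, lambda e: f"{e['table']}:{e['id']}")
```

Deleting a key that is not cached is a no-op, so replaying events after a consumer restart is safe.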

Appendix E: Metrics & Observability — Hit rate, latency, eviction monitoring

E.1 Core Metrics (Non-Negotiable)

cache_hit_total
cache_miss_total
cache_hit_rate (derived: hit / (hit + miss))
cache_latency_ms
origin_latency_ms
cache_eviction_total
cache_size_bytes
cache_key_count

E.2 Hit Rate Is Not Enough

High hit rate can hide problems:

Hit rate 95% but 5% misses overwhelm origin
Hit rate 99% but misses are the important requests
Hit rate 90% but most hits are stale data

Better signals:

Origin load during cache operation
Miss latency vs hit latency ratio
Staleness metrics (if trackable)

E.3 Metric Dimensions

Slice metrics by:

Key pattern: user:*, product:*, session:*
Operation: get, set, delete
Result: hit, miss, error

E.4 Alerting

Alert	Threshold	Why
Hit rate drop	< 80% for 5 min	Origin may be overwhelmed
Cache latency spike	p99 > 50ms	Network or capacity issue
Eviction rate spike	> 1000/s	Memory pressure
Cache unavailable	> 30s	Failover or outage

E.5 Dashboards

Operational dashboard:

Hit rate (real-time)
Origin load vs cache load
Latency percentiles (p50, p95, p99)
Error rate

Capacity dashboard:

Memory usage vs capacity
Key count and growth rate
Eviction rate
Connection count

Appendix F: Multi-Region Caching — Per-region, global, invalidation strategies

F.1 The Multi-Region Problem

Users in EU should hit EU cache; users in US should hit US cache. But what happens when data is updated?

F.2 Strategies

Strategy	Consistency	Latency	Complexity
Per-region caches, no sync	Eventually consistent	Low	Low
Per-region with async invalidation	Eventually consistent	Low	Medium
Global cache (single region)	Strong	High (cross-region)	Low
Global cache (replicated)	Eventually consistent	Low reads, high writes	High

F.3 Per-Region with Async Invalidation

How it works:

Write happens in one region
Invalidation message published to message bus
Other regions consume and invalidate their caches

Staleness window: Cross-region propagation delay (typically 100-500ms).

F.4 When to Accept Per-Region Inconsistency

User data: Users rarely switch regions mid-session
Product catalog: Brief inconsistency rarely matters
Sessions: Should be region-sticky anyway

F.5 When to Require Global Consistency

Inventory/stock: Overselling is expensive
Financial: Compliance requirements
Global rate limiting: Abuse protection

Staff insight: Most user-facing data can tolerate per-region caching with async invalidation. Reserve global consistency for data where inconsistency has real cost.

Appendix G: Cache Sizing and Capacity Planning — Memory estimation, hit rate modeling

G.1 Basic Sizing

Formula:

Memory needed = working_set_size × overhead_multiplier
Where:
  working_set_size = num_keys × avg_value_size
  overhead_multiplier ≈ 1.5-2x (for Redis data structures, fragmentation)

Example:

1M users, 2KB per user profile
Working set: 1M × 2KB = 2GB
With overhead: 2GB × 1.5 = 3GB
Add headroom: 3GB × 1.2 = 3.6GB minimum
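
The worked example can be reproduced with a small calculator. This sketch treats the overhead as a direct multiplier, as the example arithmetic does, and uses decimal units (1 KB = 1,000 bytes); the function and parameter names are hypothetical:

```python
def estimate_cache_memory_gb(num_keys, avg_value_bytes,
                             overhead_multiplier=1.5, headroom=1.2):
    """Rough memory estimate in GB (decimal units)."""
    working_set_gb = num_keys * avg_value_bytes / 1e9
    return working_set_gb * overhead_multiplier * headroom

# 1M users at 2KB each: 2 GB working set, 3 GB with overhead, 3.6 GB with headroom
needed = estimate_cache_memory_gb(num_keys=1_000_000, avg_value_bytes=2_000)
print(f"{needed:.1f} GB")
```

Rerunning with `overhead_multiplier=2.0` gives the pessimistic end of the range, which is the safer number to provision against.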

G.2 Hit Rate vs Cache Size

The Pareto insight: Often 20% of keys serve 80% of traffic. Caching the hot set gives most of the benefit.

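Hit rate curve: Typically shows diminishing returns (roughly logarithmic); doubling cache size doesn't double hit rate. Illustrative numbers, with cache size as a percentage of the cacheable dataset: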

Cache size: 10%  → Hit rate: 70%
Cache size: 20%  → Hit rate: 85%
Cache size: 50%  → Hit rate: 95%
Cache size: 100% → Hit rate: 99%

Staff insight: Model your access pattern. If you have high skew (few hot keys), a small cache gives great hit rate. If access is uniform, you need to cache more.

G.3 Capacity Planning Questions

What's the working set? All data that could be cached, or just hot data?
What hit rate do we need? 80%? 95%? 99%?
What's the cost of a miss? DB query latency, cost, capacity
What's the cost of cache memory? $/GB for your cache tier
What's the growth rate? How will working set grow?

G.4 Cost Optimization

Technique	Savings	Tradeoff
Compression	2-5x	CPU overhead
Shorter TTL	Less memory for cold data	More origin load
Tiered caching	Hot data in expensive cache, cold in cheap	Complexity
Efficient serialization	10-50%	Developer effort

These frameworks are referenced throughout this playbook and apply to many system design problems:

→Distributed State Coordination
- Cache invalidation coordination, multi-tier consistency, leader election for cache warming
- Applies to: caching, rate limiting, locks, sessions
→Degraded Mode Framework
- Cache failure handling, serving stale vs failing, circuit breakers
- Applies to: caching, rate limiting, dependency isolation
→Build vs Buy Framework
- Redis vs Memcached vs managed services, self-hosted vs cloud
- Applies to: caching, observability, databases, queues