Technologies referenced in this playbook: Redis
How to Use This Playbook
Organized for interview use first, reference second. Read front-to-back once. Return to individual sections for targeted review.
| Mode | Time | What to Read |
|---|---|---|
| Quick Review | 15 min | Executive Summary → Interview Walkthrough → Fault Lines → Active Drills |
| Targeted Study | 1–2 hrs | Executive Summary → Interview Walkthrough → Fault Lines → weak-spot Deep Dives |
| Deep Dive | 3+ hrs | Everything, including appendices |
What is Distributed Caching? — Why interviewers pick this topic
A cache stores copies of frequently accessed data in fast storage (usually memory) to avoid repeatedly hitting slower backends like databases or external APIs. "Distributed" means the cache spans multiple nodes, allowing it to scale beyond a single machine's memory and survive individual node failures. The tradeoff is consistency for speed: cached data can become stale, and you now have two sources of truth that can disagree.
Why interviewers reach for this question: Caching seems simple but hides brutal complexity. Invalidation is a consistency problem, not a timeout problem. Interviewers want to see if you understand that adding a cache means you've created two sources of truth that can disagree. Can you articulate when staleness is acceptable? Do you know what happens when the cache goes down? This topic reveals whether you've dealt with real production issues — cache stampedes, thundering herds, and the dreaded "why is this showing old data?" bug that nobody noticed for three hours.
The Staff-level frame: Caching is a consistency problem disguised as a performance optimization. The Staff question is: who pays for staleness, and who owns the invalidation contract?
The L5 vs L6 Contrast — Start Here
| Behavior | Senior (L5) | Staff (L6) |
|---|---|---|
| First move | "We'll add Redis in front of the database" | "What's the staleness tolerance? Who's the source of truth? Who owns invalidation?" |
| Invalidation | "TTL of 5 minutes" | "TTL is the safety net, not the strategy. What's our invalidation contract on write?" |
| Failure | "We'll add replicas" | "When the cache fails, do we hit the DB or return errors? What's the thundering herd plan? Who has permission to shed load?" |
| Consistency | Assumes the cache is always helpful | "When does caching make things worse? Write-heavy data, consistency-critical paths, low hit rate — I'd push back on caching all three." |
| Ownership | Focuses on cache implementation | "Who owns cache warming? Who gets paged when hit rate drops? Who signed off on the staleness budget?" |
Why "first move" matters
L5: "We'll use Redis with cache-aside pattern. Here's the read flow: check cache, miss, query DB, populate cache with TTL." Technically correct. But the interviewer hasn't specified what type of data this is, what staleness is acceptable, or who writes to this data.
L6: "Before I add any caching: what's the read/write ratio for this data? What staleness is the business willing to accept? And who owns the write path — because invalidation coordination across multiple writers is where every cache breaks. If the answers to those questions are 'mostly reads, 30 seconds is fine, and one service writes,' this is a great caching candidate. If they're 'balanced reads and writes, must be fresh, and five teams all write,' I'd push back on adding a cache entirely."
Why "TTL as safety net" matters
L5: "We'll use a 5-minute TTL. That balances freshness against origin load." Sounds reasonable. But TTL-only invalidation means every change can be served stale for up to 5 minutes. At 1,000 writes/minute, up to 5,000 entries (1,000 writes/minute × 5 minutes, if each write touches a different key) can be serving wrong data at the same time.
L6: "TTL is the last line of defense: it caps the worst-case staleness window when everything else fails. Primary invalidation is explicit: when this entity is written, we delete the cache key. TTL catches the cases where the write path missed the invalidation: a developer forgot, a new write path was added, or an async write happened outside the main flow. I'd set TTL at 2× the acceptable staleness window, not the acceptable staleness window itself."
Why "when not to cache" matters
L5: Never raises this — treats caches as universally beneficial.
L6: "Before I add a cache: write-heavy data is a bad caching candidate — invalidation churn exceeds read savings. Low-reuse data is bad — keys get evicted before they're read twice. Consistency-critical paths are bad — auth, permissions, financial data. Caching makes the system faster but also makes staleness bugs invisible and insidious. Every cache is a consistency downgrade. The question isn't 'can we cache this?' — it's 'should we accept the consistency cost?'"
What This Interview Actually Tests
Distributed caching is not a "make it faster with Redis" question. This is a consistency and organizational ownership question that tests:
- Whether you reason about invalidation before discussing eviction
- Whether you can articulate what "stale" means for your specific use case and who decided
- Whether you understand the blast radius when the cache fails at 3× origin load
- Whether you know when caching makes things worse, not better
The Staff Positions
| Position | Rationale |
|---|---|
| Cache-aside over write-through | Decouples cache availability from write availability; cache failure = DB fallback, not write failure |
| TTL is safety net, not strategy | Primary invalidation is explicit on write; TTL catches missed invalidations |
| Invalidation before eviction | Design the invalidation contract before discussing LRU vs LFU |
| Product owns staleness SLA | Engineering proposes TTL values; product signs off on user-visible staleness per data type |
| Financial/auth data bypasses cache | Correctness cost of staleness exceeds performance benefit |
| If invalidation can't be owned, don't cache | Unowned invalidation = guaranteed stale data incidents |
The Three Intents
Commit to one intent before designing anything.
| Intent | Primary Constraint | Strategy | Staleness Bar |
|---|---|---|---|
| Origin Protection | Shield backend from read load | Cache-aside with request coalescing; circuit breaker on cache failure | Product decides per data type |
| Latency Reduction | Sub-millisecond reads | Aggressive L1 + L2 multi-tier; CDN at edge | Seconds to minutes acceptable |
| Cost Optimization | Reduce expensive computation | Precompute + cache; longer TTLs | Minutes to hours acceptable |
What Interviewers Probe
| If you say... | They will ask... |
|---|---|
| "Add Redis with cache-aside" | "What's your invalidation strategy on write? What if the write path forgets to invalidate?" |
| "TTL of 5 minutes" | "Data changes at 1:00. User reads at 1:04. What do they see? Who decided that was acceptable?" |
| "Invalidate on write" | "Race condition: reader repopulates the cache between your write and your invalidation. Now what?" |
| "We'll add replicas for HA" | "Replicas don't solve cold start. What's your warming strategy?" |
| "Request coalescing for thundering herd" | "You have 100 app instances. One acquires the lock. The other 99 are waiting. How long?" |
Quick-Reference: The 30-Second Cheat Sheet
| Topic | The L5 Answer | The L6 Answer — Say This |
|---|---|---|
| Invalidation | "5-minute TTL" | "Write-path DEL on write, TTL as safety net. TTL caps worst-case window; explicit invalidation handles normal case." |
| Pattern | "Cache-aside" | "Cache-aside because it decouples cache availability from write availability. Cache down = fallback to DB, not write failure." |
| Thundering herd | "That's unlikely" | "Request coalescing (singleflight): first miss fetches, others wait. Probabilistic early expiry spreads load over time." |
| Cache failure | "Add replicas" | "Circuit breaker on origin. Request coalescing to prevent stampede. Decision by data type: user prefs → serve stale; financial → fail explicit." |
| Hot key | "Redis handles it" | "Single shard becomes bottleneck. L1 in-process cache absorbs traffic within instance. Detect via per-key QPS metrics." |
| Ownership | "Infrastructure team" | "Product owns staleness SLA. Engineering owns mechanism. Named person on the pager when hit rate drops below threshold." |
System Architecture Overview
Phase 1: Frame the Problem (2 minutes)
Before drawing any boxes, surface the three questions that change the entire architecture.
Say this:
- "Before I design anything: what's the read/write ratio? What staleness is acceptable per data type? And who owns the write path — because invalidation ownership is where every cache breaks."
Commit to intent and assumption:
- "I'll assume origin protection on a read-heavy workload — 95% reads, mixed data types with per-type staleness budgets. That's the hardest case: it requires both availability reasoning (cache failure) and consistency reasoning (multi-writer invalidation). I'll design for that."
Name the "when not to cache" check explicitly:
- "Quick check before adding any cache: write rate close to read rate? If so, invalidation churn may exceed savings. Consistency-critical data? Financial, auth, permissions bypass the cache entirely. Data already hot in the DB buffer pool? Adding a network hop would make it slower. None of those apply here, so caching is appropriate."
Phase 2: Core Entities (30 seconds)
State three first-class entities — not just GET and SET.
- CacheEntry — key, value, TTL, version/etag (version enables conditional invalidation)
- InvalidationEvent — key_pattern, source, timestamp, propagation_status (first-class, not an afterthought)
- CachePolicy — data_type, staleness_budget, eviction_strategy, origin_fallback behavior
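The three entities above could be sketched as plain data classes. This is an illustrative shape only: the field types, the `propagation_status` values, and the `should_invalidate` helper are assumptions, not a schema taken from any real system. The helper shows what "version enables conditional invalidation" can mean in practice: a late or duplicated invalidation event does not delete an entry that is already newer than the write it describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheEntry:
    key: str
    value: bytes
    ttl_seconds: int
    version: int  # assumed to increase by one on every write to the entity


@dataclass(frozen=True)
class InvalidationEvent:
    key_pattern: str
    source: str              # which service or pipeline emitted the event
    timestamp: float
    propagation_status: str = "pending"  # assumed states: pending, delivered, failed


@dataclass(frozen=True)
class CachePolicy:
    data_type: str
    staleness_budget_seconds: int
    eviction_strategy: str   # for example "lru" or "lfu"
    origin_fallback: str     # for example "serve_stale" or "fail_explicit"


def should_invalidate(entry: CacheEntry, written_version: int) -> bool:
    """Delete only when the write is newer than what the cache already holds."""
    return written_version > entry.version
```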
Phase 3: The 3-Minute Architecture (3 minutes)
Walk four layers in under 3 minutes.
1. Read path (cache-aside). App checks L1 in-process cache first (microseconds, 10s TTL), then Redis L2 (milliseconds, 5min TTL), then PostgreSQL with request coalescing on miss. Request coalescing: if 100 concurrent requests miss the same key, only one hits the database. Others wait and receive the same result. Prevents thundering herd without additional infrastructure.
2. Write + invalidation. App writes to PostgreSQL. Synchronously DELs the Redis key — delete, not update, to avoid race conditions between concurrent writers. Publishes to Redis pub/sub channel for L1 eviction across all app instances. CDC consumer provides an async backstop: if the write path missed the explicit DEL, the database changelog catches it within ~1 second.
3. Failure path. Circuit breaker on origin. When Redis is unavailable, requests fall through to PostgreSQL via the circuit breaker. Decision by data type: user preferences → serve stale from L1 if available; financial data → fail explicit with clear error. Request coalescing prevents stampede on the database during cache failure.
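4. Layered caching. L1 in-process (per instance, 10s TTL) → L2 Redis cluster (shared, TTL by data type) → CDN edge (semi-static content, 60s). Each layer has a different staleness tolerance. Total worst-case staleness through the app tiers is L1 TTL + L2 TTL (10s + 5min = 310s with the TTLs above), because L1 can copy an L2 entry just before that entry expires; CDN-served content adds the edge TTL on top. Product signed off on that number for each data type.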
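A minimal sketch of the read and write paths described in the four layers above, using in-memory stand-ins for the L1 cache, Redis, and PostgreSQL. `TTLStore`, `TwoTierCache`, and `worst_case_staleness` are hypothetical names, not part of any library. Request coalescing and cross-instance L1 eviction are deliberately left out to keep the example short.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLStore:
    """Tiny in-memory TTL map standing in for either the L1 cache or Redis."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._data[key] = (value, self.clock() + self.ttl)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


class TwoTierCache:
    def __init__(self, l1: TTLStore, l2: TTLStore, db: Dict[str, str]) -> None:
        self.l1, self.l2, self.db = l1, l2, db
        self.db_reads = 0  # counts origin hits so the effect is observable

    def read(self, key: str) -> Optional[str]:
        value = self.l1.get(key)
        if value is not None:
            return value
        value = self.l2.get(key)
        if value is None:
            self.db_reads += 1
            value = self.db.get(key)
            if value is None:
                return None
            self.l2.set(key, value)
        self.l1.set(key, value)
        return value

    def write(self, key: str, value: str) -> None:
        self.db[key] = value   # 1. source of truth first
        self.l2.delete(key)    # 2. delete rather than update the shared cache
        self.l1.delete(key)    # 3. local L1 only; other instances need pub/sub


def worst_case_staleness(l1_ttl: float, l2_ttl: float) -> float:
    """L1 can copy an L2 entry just before L2 expires, so the TTLs add up."""
    return l1_ttl + l2_ttl
```

With the TTLs quoted above, `worst_case_staleness(10, 300)` gives 310 seconds, which is the number product would need to sign off on for data that relies on TTL alone.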
Phase 4: Transition to Depth (15 seconds)
"The happy-path architecture is well-understood. What makes this Staff-level is the consistency and operational reasoning. Three areas: invalidation ownership — who's responsible when five teams write to the same entity; failure modes — thundering herd, hot key, cold start, and silent staleness bugs; and organizational ownership — who signs the staleness contract. Which do you want to go into?"
Phase 5: Deep Dives When Probed (5–15 minutes)
Probe A: "What happens when invalidation fails?"
"This is the silent failure mode. DB write succeeds. Invalidation fails (network blip, Redis timeout, async drop). No error, no alert. Users see stale data until TTL expires. Three layers of defense: (1) TTL as safety net — caps the worst-case window. (2) CDC backstop — database changelog publishes an invalidation event even if the write path missed it, arriving within ~1 second. (3) Correctness monitoring — sample a small percentage of reads, compare cached value against source of truth, alert on divergence. Most teams measure cache hit rate — a performance metric. Almost nobody measures cache correctness. That's where incidents hide."
Probe B: "Walk me through thundering herd."
"Popular cache key expires. 1,000 concurrent requests check the cache simultaneously. All miss. All hit the database. Database connection pool exhausts in seconds. Service degradation. This is the most common cache failure mode and almost always under-mitigated. Three defenses in order of preference: (1) Request coalescing (singleflight) — first miss acquires a lock, fetches from origin, populates cache. Other 999 wait for the result. Cost: added latency for waiters, but only one origin request. (2) Probabilistic early expiry — entries expire slightly before their TTL with some jitter, spreading refresh load over time instead of at one cliff-edge moment. (3) Background refresh — before expiry, a background process refreshes the entry so it never actually misses. Cost: always slightly stale. Coalescing is the default; probabilistic jitter is the enhancement."
Probe C: "The cache is down. What happens?"
"Distinguish by data type immediately. Financial data, auth tokens, permissions: fail explicit — circuit breaker returns a clear error, no serving of stale data. User preferences, product descriptions, feed content: serve stale from L1 if available; if not, fall through to PostgreSQL with request coalescing and circuit breaker. The circuit breaker on origin is non-negotiable — if Redis handles 90% of read load, a cold fallback means 10× the origin load. Without a circuit breaker, the origin collapses under the cache-failure load, turning a cache incident into a full service outage."
Probe D: "How do you handle hot keys?"
"One product goes viral. One Redis shard gets 100× normal traffic. Shard CPU hits 100%, latency spikes, shard fails. Every request for that key now misses everywhere. Three mitigations: (1) L1 in-process caching for predictable hot keys — absorbs traffic at the application tier with a short TTL. (2) Key replication — distribute the hot key across N shards with {key}:shard_id suffix; read operations choose a random shard. Memory cost: N copies. Invalidation complexity: N DEL operations. (3) Read replicas — route hot-key reads to Redis replicas. Eventual consistency: stale by replication lag. Detection: track per-key QPS metrics; alert when a single key exceeds a percentage of total shard QPS. Hot keys are usually predictable (homepage, trending, viral content) but occasionally surprise — the detection metrics matter."
Phase 6: Close with Constraints (30 seconds)
"Distributed caching is a consistency problem disguised as a performance optimization. The performance part is easy — add Redis, add TTL. The consistency part is the real work: who owns the invalidation contract when five services write to the same entity? Who signed off on 30 seconds of staleness for this data type? Who gets paged when the cache serves wrong data for three hours and nobody notices? Those organizational answers determine whether the cache runs itself or generates an incident every month."
1 The Staff Lens
1.1 Why Caching Separates L5 from L6
Caching is the system design topic with the highest gap between apparent simplicity and operational complexity. Everyone has used Redis. Almost nobody has debugged a silent invalidation failure that served wrong data for six hours before anyone noticed.
The interviewer is testing whether you see caching as a consistency contract with organizational ownership — not just a performance optimization with a TTL setting.
1.2 The L5 vs L6 Contrast — Visual
1.3 The Key Insight
2 Problem Framing & Intent
2.1 When NOT to Cache
Staff candidates win interviews by knowing when to push back. This is the highest-signal behavior.
| Scenario | Why Caching Hurts |
|---|---|
| Write-heavy data | Invalidation churn exceeds read savings. Cache becomes a write amplifier. |
| Low-reuse data | Each key read once. Cache fills with cold data, evicts hot data, adds latency without benefit. |
| Consistency-critical paths | Auth, permissions, financial data. Stale = wrong = incident. |
| Data already hot in DB | DB buffer pool already caches it. You're adding a network hop. |
| Invalidation can't be owned | If no single owner knows when to invalidate, you will serve stale data indefinitely. |
2.2 The Three Intents
Intent 1: Origin Protection
- The origin can't handle all reads. Caching absorbs the majority of load.
- Failure mode: cache down = DB overload. Circuit breaker and request coalescing are non-negotiable.
- Key metric: origin RPS (not just hit rate). Origin RPS rising while hit rate holds means absolute miss volume is growing (more total traffic, or a hot key that keeps missing), which hit rate alone hides.
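The circuit breaker named in the failure-mode bullet can be sketched as a small wrapper around origin calls. This is a simplified, single-threaded illustration under assumed parameters (`failure_threshold`, `reset_after_seconds`); `CircuitBreaker` and `CircuitOpenError` are hypothetical names, and a production breaker would also need thread safety and metrics.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised instead of calling the origin while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int, reset_after_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("origin circuit open; shedding load")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

The point of the design is that once the origin starts failing, callers get a fast, explicit error instead of queueing more work onto a database that is already overloaded.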
Intent 2: Latency Reduction
- Reads must be sub-millisecond. Network hops to the DB are too slow.
- Multi-tier caching: L1 in-process for the hottest keys, L2 Redis for the warm tier.
- Staleness tolerance defines which data can be in L1 (shorter TTL) vs L2 (longer TTL).
Intent 3: Cost Optimization
- Expensive computation (ML inference, complex joins) is cached to avoid recomputation.
- Longer TTLs are acceptable — these results change infrequently.
- Invalidation on model update or data change; TTL as the outer bound.
2.3 What the Interviewer Leaves Underspecified
Interviewers deliberately omit:
- Read/write ratio
- Staleness tolerance per data type
- Whether consistency is more or less important than availability
- Number of write services (invalidation coordination complexity)
- Cache budget (memory)
Staff engineers surface these. Senior engineers assume them away.
2.4 Terminology Reference
| Pattern | Write Path | Read Path | Invalidation Ownership |
|---|---|---|---|
| Cache-Aside | Write to DB, DEL cache key | Read cache; miss → DB → populate | Application teams |
| Read-Through | Write to DB | Cache fetches from DB on miss | Cache service |
| Write-Through | Write to cache → cache writes to DB | Read from cache | Implicit on write |
| Write-Behind | Write to cache → async write to DB | Read from cache | Implicit on write |
3 Fault Lines
3.1 Fault Line 1: Freshness vs Performance
The tension: Shorter TTLs mean fresher data but more origin load. Longer TTLs reduce origin load but guarantee staleness. The resolution: staleness is a business decision owned by Product, not a technical constant owned by Engineering.
| Strategy | Data Freshness | Origin Load | Who Pays |
|---|---|---|---|
| Short TTL (< 30s) | High | High — frequent cache misses | Infra (origin capacity) |
| Long TTL (> 5min) | Low | Low | Product (stale UX) + Support (complaints) |
| Per-type TTL | High where needed | Balanced | Engineering (classification complexity) |
| Explicit invalidation + short TTL backstop | Highest | Low for most traffic | Engineering (write-path invalidation) |
L6 answer: Per-type staleness budget with explicit invalidation as primary mechanism and TTL as safety net. Classify data types: real-time (<10s) → inventory, pricing; near-real-time (30s–5min) → user profiles, catalog descriptions; batch (>5min) → aggregations, pre-computed feeds. Product signs off on each classification. Engineering sets the TTL to match the classification plus buffer.
3.2 Fault Line 2: Cache-Aside vs Write-Through vs Write-Behind
The tension: Write-through guarantees cache-DB consistency but couples write availability to cache availability. Cache-aside decouples availability but requires the application to own invalidation. Write-behind improves write performance but risks data loss if the cache fails before the async write completes.
| Pattern | Consistency | Write Availability | Read Availability | Who Pays |
|---|---|---|---|---|
| Cache-Aside | Eventual (miss or TTL-based) | Independent of cache | DB fallback on cache failure | Application teams (must invalidate on write) |
| Write-Through | Strong (always in cache) | Coupled — cache failure = write failure | Always from cache | All services (write latency includes cache) + Users (writes slower) |
| Write-Behind | Eventual (async lag) | Decoupled | Always from cache | Infra (async write complexity) + Data integrity risk (cache failure = lost writes) |
L6 answer: Cache-aside for most cases, with explicit write-path invalidation as primary mechanism and CDC backstop for anything that gets missed. "Cache-aside means: if the cache is down, writes still succeed and reads fall through to the database. Write-through means: if the cache is down, writes fail. At the scale where caching matters, cache-aside's availability advantage outweighs write-through's consistency simplicity."
3.3 Fault Line 3: Availability vs Consistency on Cache Failure
The tension: When the cache fails, serving stale data maximizes availability but may serve incorrect data. Hitting the origin maximizes consistency but risks overloading it. Failing explicit is honest but degrades the user experience. The right answer depends on the data type — and it's a product decision.
| Data Type | Cache Failure Strategy | Reason |
|---|---|---|
| User preferences, feed content | Serve stale → async refresh | UX > correctness |
| Product catalog, pricing | Fall through to origin with circuit breaker | Correctness > UX |
| Financial data, auth tokens, permissions | Fail explicit | Stale = security or compliance violation |
| Inventory / stock | Fall through + error if origin unavailable | Overselling is expensive |
L6 answer: Classify every data type before the incident, not during it. Document the failure behavior in the service's runbook. The distinction between "serve stale" and "fail explicit" is a product decision that engineering should confirm in writing before shipping a cache layer.
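One way to make the classification concrete is a lookup table that is written down before the incident. The sketch below encodes the table above; the data-type names, `FAILURE_POLICY`, and `read_during_cache_outage` are hypothetical, and defaulting unknown types to fail-explicit is an assumption rather than something the table states.

```python
from enum import Enum
from typing import Callable, Dict


class OnCacheFailure(Enum):
    SERVE_STALE = "serve_stale"
    FALL_THROUGH = "fall_through"
    FAIL_EXPLICIT = "fail_explicit"


FAILURE_POLICY: Dict[str, OnCacheFailure] = {
    "user_preferences": OnCacheFailure.SERVE_STALE,
    "feed_content": OnCacheFailure.SERVE_STALE,
    "product_catalog": OnCacheFailure.FALL_THROUGH,
    "pricing": OnCacheFailure.FALL_THROUGH,
    "inventory": OnCacheFailure.FALL_THROUGH,
    "financial": OnCacheFailure.FAIL_EXPLICIT,
    "auth_tokens": OnCacheFailure.FAIL_EXPLICIT,
    "permissions": OnCacheFailure.FAIL_EXPLICIT,
}


class CacheUnavailable(RuntimeError):
    """Explicit failure for data types that must never be served stale."""


def read_during_cache_outage(data_type: str, key: str,
                             l1_stale: Dict[str, str],
                             fetch_origin: Callable[[str], str]) -> str:
    # Assumption: unclassified data types fail closed.
    policy = FAILURE_POLICY.get(data_type, OnCacheFailure.FAIL_EXPLICIT)
    if policy is OnCacheFailure.FAIL_EXPLICIT:
        raise CacheUnavailable(f"{data_type} is not served while the cache is down")
    if policy is OnCacheFailure.SERVE_STALE and key in l1_stale:
        return l1_stale[key]
    # Fall through to origin; an origin error propagates to the caller.
    return fetch_origin(key)
```

In a real service `fetch_origin` would itself sit behind a circuit breaker and request coalescing, so that falling through does not overload the database.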
3.4 Fault Line 4: Local vs Distributed Cache
The tension: In-process (local) caches are microsecond-fast but create per-instance consistency issues — different app instances serve different data. Distributed caches (Redis) are consistent across instances but add network latency per request. Multi-tier uses both.
| Tier | Latency | Consistency | Memory | Who Pays |
|---|---|---|---|---|
| L1 In-process | Microseconds | Per-instance (inconsistent) | Duplicated × N instances | Users (inconsistent experience across instances) |
| L2 Distributed (Redis) | 1–3ms | Shared state (consistent) | Shared pool | Infra (Redis operational burden) |
| Multi-tier (L1 + L2) | Microseconds (hot), ms (warm) | L1 stale by its TTL | L1 duplicated × N instances + shared L2 pool | Engineering (two-tier invalidation complexity) |
L6 answer: Multi-tier for high-traffic systems. L1 absorbs hot-key pressure — if one product gets viral traffic, all instances have it in L1 and the Redis shard never sees the load spike. L1 TTL must be short (5–10s) because there is no push invalidation to L1 by default — unless you implement a pub/sub channel for L1 eviction, which is the right answer for any data where 10-second staleness is unacceptable.
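The pub/sub eviction idea can be illustrated with an in-process stand-in for the channel. `InvalidationBus` and `AppInstance` are hypothetical names; in a real deployment the bus would be something like a Redis pub/sub channel that every app instance subscribes to at startup.

```python
from typing import Callable, Dict, List


class InvalidationBus:
    """In-process stand-in for a pub/sub channel shared by all app instances."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, handler: Callable[[str], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, key: str) -> None:
        for handler in list(self._subscribers):
            handler(key)


class AppInstance:
    def __init__(self, name: str, bus: InvalidationBus) -> None:
        self.name = name
        self.l1: Dict[str, str] = {}  # per-instance in-process cache
        bus.subscribe(self.evict)

    def evict(self, key: str) -> None:
        self.l1.pop(key, None)  # evicting an absent key is a no-op
```

Redis pub/sub delivery is fire-and-forget, so an instance that is disconnected when the message is published never sees it. That is why the short L1 TTL stays in place as the backstop even with push eviction.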
3.5 Fault Line 5: Proactive vs Reactive Invalidation
The tension: Proactive invalidation (DEL on write) gives immediate freshness but requires every write path to know about and correctly implement invalidation. Reactive invalidation (TTL expiry) is simple but guarantees staleness up to TTL. The Staff answer is hybrid with a formal backstop.
| Approach | Freshness | Complexity | Failure Mode | Who Pays |
|---|---|---|---|---|
| Reactive (TTL only) | Delayed by TTL | Low | Guaranteed staleness up to TTL | Product (stale UX) |
| Proactive (DEL on write) | Immediate | Medium | Missed invalidation = stale forever | Engineering (must instrument every write path) |
| Hybrid (DEL + TTL safety net) | Immediate for known writes | Medium | Missed invalidation = stale up to TTL | Balanced |
| CDC backstop (DB changelog) | ~1s lag | Medium-High | Infrastructure complexity | Infra (CDC pipeline ownership) |
L6 answer: Hybrid with CDC backstop. On every write: synchronously DEL the cache key. TTL of 2× the acceptable staleness window as safety net. CDC consumer as backstop: tails the database changelog, publishes invalidation events for any row change regardless of which write path caused it. This means: even if a developer adds a new write path and forgets to add cache invalidation, the CDC consumer catches it within ~1 second.
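A sketch of the CDC backstop consumer, assuming change events arrive as simple records with a table name and primary key. `ChangeEvent`, `KEY_TEMPLATES`, and `apply_change_events` are hypothetical; a real pipeline would read these events from the database changelog through whatever CDC tooling the team runs, and the table-to-key mapping would have to match the key scheme used on the read path.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    table: str
    primary_key: str
    op: str  # "insert", "update", or "delete"


# Assumed mapping from table name to the cache key used on the read path.
KEY_TEMPLATES: Dict[str, str] = {
    "users": "user:{pk}",
    "products": "product:{pk}",
}


def cache_key_for(event: ChangeEvent) -> Optional[str]:
    template = KEY_TEMPLATES.get(event.table)
    return template.format(pk=event.primary_key) if template else None


def apply_change_events(events: Iterable[ChangeEvent],
                        delete_key: Callable[[str], None]) -> int:
    """Delete the cache key for every changed row; returns how many were sent."""
    invalidated = 0
    for event in events:
        key = cache_key_for(event)
        if key is None:
            continue  # table is not cached, nothing to invalidate
        delete_key(key)
        invalidated += 1
    return invalidated
```

Because the consumer reacts to row changes rather than to application code paths, it does not matter which service performed the write. Deleting an already-deleted key is harmless, so it is safe for this to overlap with the synchronous DEL.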
4 Failure Modes & Operational Reality
4.1 Thundering Herd (Cache Stampede)
What happens: A popular cache key expires. Hundreds of concurrent requests miss simultaneously. All hit the database. DB CPU spikes, connections exhaust, service degrades.
Timeline:
t=0s: Cache entry for "homepage_data" expires (TTL)
t=0-1s: 1,000 concurrent requests → all miss
t=0-1s: 1,000 DB queries → DB CPU 100%
t=1-2s: DB connection pool exhausted
t=2-5s: Query latency spikes to 10s
t=5s+: Cascading failure — homepage down
Mitigations in priority order:
| Mitigation | How | Tradeoff |
|---|---|---|
| Request coalescing (singleflight) | First miss fetches, others wait for same result | Latency for waiters; only 1 origin request |
| Probabilistic early expiry | Key expires slightly before TTL (random jitter) | Some extra origin load; prevents cliff-edge |
| Background refresh | Refresh before expiry; never actually misses | Always slightly stale; complexity |
| Cache-miss circuit breaker | Limit concurrent origin requests | Some requests fail under pressure |
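Probabilistic early expiry from the table above can be sketched with one commonly used formulation, in which each reader independently decides to refresh slightly early and the chance rises as expiry approaches. The function name and the `recompute_seconds` and `beta` parameters are illustrative assumptions; simpler schemes that just add random jitter to the TTL at write time are also common.

```python
import math
import random
import time
from typing import Optional


def should_refresh_early(expires_at: float, recompute_seconds: float,
                         beta: float = 1.0, now: Optional[float] = None,
                         rng: Optional[random.Random] = None) -> bool:
    """Return True if this reader should refresh the entry before it expires.

    recompute_seconds: roughly how long the origin fetch takes.
    beta: values above 1.0 refresh earlier, values below 1.0 refresh later.
    """
    now = time.time() if now is None else now
    draw = rng.random() if rng is not None else random.random()
    # 1 - draw lies in (0, 1], so log() never receives zero.
    jitter = -recompute_seconds * beta * math.log(1.0 - draw)
    return now + jitter >= expires_at
```

Far from expiry the function almost always returns False, so normal reads are unaffected. Close to expiry a small random subset of readers refreshes early, which spreads origin load instead of concentrating it at the expiry instant.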
Staff answer: Request coalescing is the default. One lock per key; first acquirer fetches from origin; others wait. Combined with probabilistic early expiry to spread refresh load over time rather than at one expiry cliff.
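A minimal in-process sketch of request coalescing using Python threads. `SingleFlight` is a hypothetical class written for this example (the name echoes Go's singleflight package, but this is not that API). The first caller for a key runs the fetch; concurrent callers for the same key wait and receive the same result or the same exception.

```python
import threading
from typing import Any, Callable, Dict, Optional


class _Call:
    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: Any = None
        self.error: Optional[BaseException] = None


class SingleFlight:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[str, _Call] = {}

    def do(self, key: str, fetch: Callable[[], Any],
           timeout: Optional[float] = None) -> Any:
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
        if leader:
            try:
                call.result = fetch()  # the only origin request for this key
            except BaseException as exc:
                call.error = exc       # waiters see the same failure
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()
        elif not call.done.wait(timeout):
            raise TimeoutError(f"timed out waiting for in-flight fetch of {key!r}")
        if call.error is not None:
            raise call.error
        return call.result
```

This coalesces within one process only. With 100 app instances, the origin can still see up to 100 concurrent requests for the same key unless a distributed lock is added, and that lock then needs its own expiry so a crashed holder does not block the waiters indefinitely. The `timeout` argument bounds how long a waiter is willing to block.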
4.2 Hot Key Problem
What happens: One cache key receives 100× normal traffic (viral content, trending product, popular user). Single Redis shard CPU hits 100%. Shard latency spikes. Requests fail.
Timeline:
t=0: Content goes viral
t=1min: One key: 10,000 req/s → single Redis shard overloaded
t=2min: Shard CPU at 100%, latency spikes
t=3min: Shard timeouts → fallback to DB
t=4min: DB overwhelmed by thundering herd from failed shard
Mitigations:
| Mitigation | How | Tradeoff |
|---|---|---|
| L1 in-process cache | Absorb hot key within each instance | Per-instance staleness, memory × N |
| Key replication | Copy key to N shards with suffix | N DEL on invalidation; memory overhead |
| Read replicas | Route reads to Redis replicas | Replication lag staleness |
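The key replication row can be sketched as follows, with a plain dictionary standing in for the sharded cache. `replica_keys` and `ReplicatedHotKey` are hypothetical names, and the `key:shard_id` naming is one possible reading of the suffix scheme described earlier.

```python
import random
from typing import Dict, List, Optional


def replica_keys(key: str, replicas: int) -> List[str]:
    """Names of the N copies of a hot key, e.g. product:42:0 .. product:42:2."""
    return [f"{key}:{shard_id}" for shard_id in range(replicas)]


class ReplicatedHotKey:
    def __init__(self, store: Dict[str, str], replicas: int,
                 rng: Optional[random.Random] = None) -> None:
        self.store = store
        self.replicas = replicas
        self.rng = rng or random.Random()

    def set(self, key: str, value: str) -> None:
        for replica in replica_keys(key, self.replicas):
            self.store[replica] = value  # memory cost: N copies

    def get(self, key: str) -> Optional[str]:
        replica = self.rng.choice(replica_keys(key, self.replicas))
        return self.store.get(replica)   # reads spread across the copies

    def invalidate(self, key: str) -> int:
        deleted = 0
        for replica in replica_keys(key, self.replicas):
            if self.store.pop(replica, None) is not None:
                deleted += 1             # invalidation cost: N deletes
        return deleted
```

In Redis Cluster, text inside literal braces in a key is treated as a hash tag, so the shard suffix has to sit outside any braces or every copy hashes to the same slot and the replication buys nothing.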
Detection: Per-key QPS metrics. Alert when a single key exceeds X% of total shard QPS. Hot keys for predictable events (Black Friday homepage, scheduled content drops) should be pre-populated in L1 before traffic arrives.
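The detection rule could look like the sketch below, assuming per-key hit counts for one shard have already been collected over a sampling window. `hot_keys` and `share_threshold` are hypothetical names; the threshold corresponds to the "X%" above and is left for the team to choose.

```python
from collections import Counter
from typing import List, Tuple


def hot_keys(key_hits: Counter, share_threshold: float) -> List[Tuple[str, float]]:
    """Keys whose share of a shard's traffic exceeds the threshold, hottest first."""
    total = sum(key_hits.values())
    if total == 0:
        return []
    return [
        (key, hits / total)
        for key, hits in key_hits.most_common()
        if hits / total > share_threshold
    ]
```

For example, a shard window of 900 hits on `product:42` and 50 each on two other keys reports `product:42` at a 0.9 share for any threshold below that.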
4.3 Cold Start (Cache Miss Storm)
What happens: Redis cluster restarts (maintenance or failure). 100% miss rate. All requests hit the database. Service degrades within seconds.
Timeline:
t=0: Redis cluster unavailable
t=0: 100% cache miss rate
t=0-30s: All reads hit PostgreSQL
t=30s: DB connection pool approaches capacity
t=1min: Service degradation or outage
Mitigations:
| Mitigation | How | Tradeoff |
|---|---|---|
| Cache warming on deploy | Pre-populate before traffic shift | Deployment complexity; warming time |
| Gradual traffic shift | Route new traffic slowly | Longer rollout |
| Origin capacity headroom | Size origin for 100% cache-miss load | Cost (overprovisioned DB) |
| Persistent cache | RDB/AOF persistence in Redis | Disk I/O and fork memory overhead; slower restarts; restored entries may already be stale |
Staff answer: Warming is deployment-critical, not a background task. Traffic does not shift until the new cache reaches a hit-rate threshold (e.g., 80%). Warming duration is tracked as a trend metric — alert if it grows >50% over baseline (dataset growth outpacing the warming job).
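The two gates in the Staff answer reduce to simple arithmetic. The sketch below uses the 80% hit-rate threshold and 50% growth limit quoted above as defaults; the function names are hypothetical.

```python
def ready_for_traffic(hits: int, misses: int, min_hit_rate: float = 0.80) -> bool:
    """Gate the traffic shift on the warmed cache reaching the hit-rate threshold."""
    total = hits + misses
    if total == 0:
        return False  # no evidence yet, so do not shift traffic
    return hits / total >= min_hit_rate


def warming_regressed(duration_seconds: float, baseline_seconds: float,
                      max_growth: float = 0.50) -> bool:
    """Alert when warming takes more than max_growth longer than the baseline."""
    return duration_seconds > baseline_seconds * (1.0 + max_growth)
```

For instance, 850 hits against 150 misses is an 85% hit rate and passes the gate, while a warming run of 95 seconds against a 60-second baseline is about 58% slower and trips the alert.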
4.4 Silent Staleness (The Invisible Failure)
What happens: Cache invalidation fails silently. Users see stale data. No errors, no alerts — just wrong answers.
Timeline:
t=0: User updates email in UI
t=0: DB write succeeds
t=0: Invalidation call silently times out
t=0-5min: User sees old email in every view
t=5min: TTL expires, correct data served
Why it's dangerous: No error rate. No latency spike. No alert fires. The only signal is a user complaint — or worse, nothing.
Mitigations:
- CDC backstop catches missed invalidations within ~1 second
- Correctness monitoring: sample 0.1% of reads, compare cached value to source of truth, alert on divergence
- Version stamps: put the entity version in the cache key so a stale entry becomes unreachable, not just wrong
- Short TTL fallback: cap the worst-case window
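The correctness-monitoring mitigation can be sketched as a sampler that sits on the read path. `CorrectnessSampler` is a hypothetical class; `fetch_truth` stands for whatever lookup returns the source-of-truth value, and the alert threshold is an assumption for the team to set.

```python
import random
from typing import Callable, Optional


class CorrectnessSampler:
    def __init__(self, sample_rate: float,
                 fetch_truth: Callable[[str], Optional[str]],
                 rng: Optional[random.Random] = None) -> None:
        self.sample_rate = sample_rate  # e.g. 0.001 to check 0.1% of reads
        self.fetch_truth = fetch_truth
        self.rng = rng or random.Random()
        self.checked = 0
        self.diverged = 0

    def observe(self, key: str, cached_value: Optional[str]) -> None:
        """Call on cache hits; compares a sampled fraction against the origin."""
        if self.rng.random() >= self.sample_rate:
            return
        self.checked += 1
        if self.fetch_truth(key) != cached_value:
            self.diverged += 1

    def divergence_ratio(self) -> float:
        return self.diverged / self.checked if self.checked else 0.0

    def should_alert(self, threshold: float) -> bool:
        return self.checked > 0 and self.divergence_ratio() > threshold
```

A write that lands between the cache read and the comparison shows up as a false divergence, so the alert threshold should sit above that background rate rather than at zero.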
4.5 Operational Reality Matrix
| Failure | Loud vs Silent | User Impact | Detection Time | Who Pays |
|---|---|---|---|---|
| Cache cluster down | Loud | Latency spike or errors | Seconds | Infra + users |
| Thundering herd | Loud | Origin overload, cascading failure | Seconds | Origin + all users |
| Hot key | Medium | Single-key latency, then cascade | Minutes | Hot-key users |
| Cold start | Loud | Origin overload on restart | Seconds | All users |
| Silent staleness | Silent | Wrong data served | Hours to never | Product + user trust |
| Cache amplifying a bug | Silent | Bug affects 99% instead of 1% | Hours to never | Product + user trust |
5 Evaluation Rubric
5.1 Level-Based Signals
| Dimension | Senior (L5) | Staff (L6) | Principal (L7) |
|---|---|---|---|
| First move | "Add Redis cache-aside" | "What's the staleness budget, who owns invalidation, when do we not cache?" | Defines org-wide caching strategy by data classification |
| Invalidation | TTL-only | Explicit DEL + TTL safety net + CDC backstop | CDC-based org-wide invalidation platform |
| Consistency | Assumed | Staleness budget per data type; product sign-off documented | Consistency tiers as org policy |
| Failure modes | "Add replicas" | Thundering herd, hot key, cold start, silent staleness — mitigations for each | Capacity planning for cache-miss scenarios |
| Multi-tier | Single cache layer | L1 + L2 with explicit aggregate staleness math | Standardized multi-tier patterns |
| Ownership | Implementation focus | Named owner for warming, monitoring, paging | Platform-wide caching governance |
5.2 Strong Hire Signals
| Signal | What It Sounds Like |
|---|---|
| Staleness is a product decision | "30 seconds is the TTL because product approved 30 seconds of staleness for this data type." |
| Invalidation contract designed | "Explicit DEL on write, TTL as safety net, CDC as backstop — defense in depth." |
| Failure modes classified by data type | "Financial data fails explicit. User prefs serve stale. Pricing falls through to origin." |
| When not to cache | "Write rate close to read rate here — I'd push back on adding a cache." |
| Correctness over hit rate | "Hit rate is a performance metric. Correctness is the safety metric. Most teams measure the wrong thing." |
5.3 Lean No-Hire Signals
| Signal | Why It Misses the Bar |
|---|---|
| "Add Redis with 5-minute TTL" | No staleness reasoning, no invalidation contract, no failure plan |
| TTL-only thinking | "TTL handles staleness" without any write-path invalidation |
| Missing thundering herd | No mention of request coalescing on cache miss |
| "Cache is always helpful" | Never pushes back on whether caching is appropriate |
| No ownership thinking | No mention of who warms, who monitors, who gets paged |
5.4 Common False Positives
- Knows Redis data structures deeply ≠ good cache design. Sorted sets and HyperLogLog knowledge is not the bar.
- Multi-tier diagram ≠ multi-tier design. Drawing L1 + L2 + CDN without explaining aggregate staleness, L1 invalidation via pub/sub, and the failure mode of each tier is architectural decoration.
- Mentions CDC ≠ understands CDC. "Use CDC for invalidation" without explaining the lag, the infrastructure complexity, and when it's appropriate signals pattern-matching rather than operational experience.
6 Interview Flow & Pivots
6.1 Typical 45-Minute Shape
| Phase | Time | Goal |
|---|---|---|
| Framing | 0–5 min | Staleness budget, read/write ratio, data types, "when not to cache" |
| High-level design | 5–12 min | Cache-aside, invalidation path, failure path — dispatch fast |
| Invalidation deep dive | 12–25 min | Write-path DEL, CDC backstop, race conditions, ownership |
| Failure modes | 25–35 min | Thundering herd, hot key, cold start, silent staleness |
| Operations + wrap-up | 35–45 min | Warming, monitoring, correctness metrics, organizational ownership |
6.2 How Interviewers Pivot
| After You Say... | They Will Probe... | What They're Evaluating |
|---|---|---|
| "Add Redis cache-aside" | "What's your invalidation strategy on write?" | Whether invalidation is explicit or TTL-only |
| "DEL on write" | "Five services write this entity. One adds a new write path without invalidation. How long is the cache stale?" | Whether you have a CDC backstop and correctness monitoring |
| "TTL of 5 minutes" | "Data changes. User reads at T+4min. What do they see? Who approved 5 minutes?" | Whether staleness is a product contract or an assumption |
| "Request coalescing" | "You have 100 app instances. One acquires the lock. The other 99 wait. Lock holder crashes. What happens?" | Whether coalescing handles failure |
| "Serve stale on failure" | "Cache is serving stale prices. User pays wrong amount. Who's responsible?" | Whether data-type classification was done before shipping |
6.3 Follow-Up Questions to Expect
- "What happens when the cache is completely cold?"
- "How do you handle thundering herd when a popular key expires?"
- "What's your invalidation strategy on write? What if the invalidation fails?"
- "How do you detect that your cache is serving stale data without anyone noticing?"
- "When would you NOT cache this data?"
- "Your cache costs tripled last quarter. How do you investigate?"
7 Active Drills
Drill 1: Product Catalog Cache Design
Staff Answer
Before designing: read/write ratio, catalog size, staleness tolerance per field type (descriptions vs pricing vs inventory), and who writes to catalog. Assume: 95% reads, 100K products, product descriptions tolerate 5 minutes, pricing tolerates 30 seconds, inventory bypasses cache entirely.
Cache-aside with per-field TTL: prod:{id}:desc (5min), prod:{id}:price (30s). Explicit DEL on write plus CDC backstop. Request coalescing on miss. Pricing falls through to origin on cache miss — never serve stale price. Description serves stale on origin failure. L1 in-process for top 1K products (high access skew; absorbs hot-key pressure on specific product launches).
Why this is L6:
- Different TTLs per field type based on explicit staleness tolerances — not one global TTL
- Inventory explicitly excluded from caching — pushes back on caching everything
- Pricing in "fall through to origin" bucket — not "serve stale"
❌ Common L5 Trap
"Cache the full product object with a 5-minute TTL. Invalidate on update."
Why this misses: One TTL for the entire product object means pricing is stale for up to 5 minutes — the same as the description. A price change during a flash sale is invisible to users for 5 minutes. Inventory count is cached and may show items as in-stock that are sold out. The interviewer asks: "A flash sale drops the price by 50% at noon. Customers visiting between noon and 12:05 see the old price. What's the revenue and trust impact?" The product object TTL conflates fields with very different staleness requirements.
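A minimal sketch of the per-field approach from the Staff answer, assuming an in-memory stand-in for Redis. `FIELD_TTL`, `TTLCache`, and `read_field` are hypothetical names introduced for illustration; the TTL values mirror the assumptions stated above (descriptions 5 minutes, pricing 30 seconds, inventory never cached).

```python
import time

# Hypothetical per-field policy: TTL in seconds; None means "never cache".
FIELD_TTL = {"desc": 300, "price": 30, "inventory": None}


class TTLCache:
    """Tiny in-memory stand-in for Redis GET / SET-with-expiry / DEL."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        hit = self._data.get(key)
        if hit is None or hit[1] <= self._clock():
            return None          # absent or expired
        return hit[0]

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)

    def delete(self, key):
        self._data.pop(key, None)


def read_field(cache, load_from_origin, product_id, field):
    """Cache-aside read of one product field, using that field's own TTL."""
    ttl = FIELD_TTL[field]
    if ttl is None:              # inventory: bypass the cache entirely
        return load_from_origin(product_id, field)
    key = f"prod:{product_id}:{field}"
    value = cache.get(key)
    if value is None:            # miss: load from origin, populate with the field TTL
        value = load_from_origin(product_id, field)
        cache.set(key, value, ttl)
    return value
```

Splitting the key by field is what lets a price change expire in 30 seconds while the description keeps its 5-minute TTL. The write-path DEL and CDC backstop are left out of this sketch.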
Drill 2: Invalidation Strategy
Staff Answer
Three layers, applied in sequence. Primary: explicit DEL on every write path. When this entity is written, delete the cache key synchronously before returning success to the caller. Safety net: TTL set to 2× the acceptable staleness window. If the DEL fires at T=0 and a concurrent read repopulates the key with stale data, the TTL is the hard upper bound on how long that entry survives. Backstop: a CDC consumer tails the database changelog and fires DEL for any row change, regardless of which write path caused it, within ~1 second. A developer who adds a new write path without cache invalidation logic is therefore covered by CDC within about 1 second.
What I don't do: TTL-only. TTL as the primary invalidation mechanism guarantees staleness up to TTL for every write. That's a worst-case bug amplifier — a single bad write caches wrong data for the full TTL duration.
Why this is L6:
- Three-layer defense in depth (DEL + TTL + CDC) — not single-mechanism reliance
- CDC backstop decouples invalidation from write-path correctness — new write paths are automatically covered
- Frames TTL as a safety net, not a strategy
❌ Common L5 Trap
"Invalidate on write — delete the cache key when the data changes. Use TTL as a fallback."
Why this misses: Technically correct for a single-writer system. The interviewer asks: "You have 8 write endpoints across 3 microservices. A new microservice was added 3 months ago that also writes this data. Nobody added cache invalidation to the new service. How long has the cache been serving stale data for any writes from that service?" Answer: 3 months. Without a CDC backstop, every new write path is a potential invalidation gap. The explicit DEL on existing write paths is insufficient organizational defense.
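The gap described above can be illustrated with a small sketch. Plain dicts stand in for the database and for Redis, and the change-event shape (`{"key": ...}`) is an assumption made for this example, not the format of any particular CDC tool.

```python
def write_entity(db, cache, key, value):
    """Primary layer: a well-behaved write path deletes the key before returning."""
    db[key] = value
    cache.pop(key, None)          # stands in for Redis DEL


def cdc_backstop(change_events, cache):
    """Backstop: delete the key for every changed row, whichever path wrote it."""
    for event in change_events:   # e.g. rows tailed from the database changelog
        cache.pop(event["key"], None)


db = {"user:1": "old"}
cache = {"user:1": "old"}

write_entity(db, cache, "user:1", "new")   # known write path: cache entry removed
cache["user:1"] = "new"                    # a later read repopulates the cache

db["user:1"] = "newer"                     # new service writes without any DEL
stale = cache["user:1"]                    # the cache still holds "new"

cdc_backstop([{"key": "user:1"}], cache)   # CDC sees the row change and clears it
```

The TTL layer is not shown; in this sketch it would simply be an expiry attached to each cache entry.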
Drill 3: Thundering Herd
Staff Answer
1,000 concurrent requests check the cache simultaneously. All miss. Without protection: all 1,000 query the database. Database CPU spikes. Connection pool exhausts. Service degrades within 2–5 seconds. This is not a corner case — it's a predictable failure mode for any high-traffic key.
Mitigation: request coalescing (singleflight). The first miss acquires a distributed lock on the cache key. The other 999 see the lock and wait. The first request queries the database, populates the cache, releases the lock. The waiters receive the cached result. One database query instead of 1,000. Latency cost: waiters experience slightly higher latency, but this is bounded by the one DB query, not 1,000 parallel queries.
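A minimal in-process sketch of the coalescing idea, assuming threads inside one application instance rather than the distributed lock described above. `SingleFlight` is a hypothetical helper written for this example. A cross-instance version would additionally need a lock with an expiry so that a crashed lock holder does not block waiters forever.

```python
import threading


class _Call:
    """State shared between the leader and the followers for one key."""

    def __init__(self):
        self.done = threading.Event()
        self.value = None
        self.error = None


class SingleFlight:
    """Collapse concurrent loads of the same key into one call (in-process only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}               # key -> _Call

    def do(self, key, load):
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = self._inflight[key] = _Call()
        if not leader:
            call.done.wait()              # followers block until the leader finishes
            if call.error is not None:
                raise call.error          # followers see the leader's failure
            return call.value
        try:
            call.value = load()           # only the leader queries the origin
            return call.value
        except Exception as exc:
            call.error = exc
            raise
        finally:
            with self._lock:
                del self._inflight[key]   # the next miss starts a fresh load
            call.done.set()
```

If the leader's load raises, the followers receive the same exception instead of hanging, and the in-flight entry is cleared so a later request can retry.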
Prevention: probabilistic early expiry. Keys don't expire at a hard TTL boundary — they have a probability of expiring slightly before TTL based on elapsed time and a random factor. This spreads refresh load over a window instead of a single cliff-edge moment. Eliminates the thundering herd at the cost of slight average staleness increase.
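One common formulation of probabilistic early expiry is sketched below. The text above does not say which formula is intended, so this is an assumption: each reader decides independently whether to refresh, with a probability that rises as the key approaches its hard expiry. `recompute_seconds` (how long a refresh takes) and `beta` (how aggressive early refresh is) are illustrative parameters.

```python
import math
import random


def should_refresh_early(now, expires_at, recompute_seconds, beta=1.0, rng=random.random):
    """Return True if this reader should refresh the key before its hard expiry.

    -log(u) for u in (0, 1] is a non-negative random value, so the chance of an
    early refresh grows as `now` approaches `expires_at`.
    """
    u = 1.0 - rng()                       # random.random() is in [0, 1); shift to (0, 1]
    return now - recompute_seconds * beta * math.log(u) >= expires_at
```

A reader that gets `True` refreshes the value and resets the expiry while other readers keep serving the still-valid cached entry, which is how refresh load is spread over a window instead of landing at one instant.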
Why this is L6:
- Quantifies the blast radius (1,000 simultaneous DB queries, not "that's unlikely")
- Distinguishes mitigation (coalescing — reactive) from prevention (jitter — proactive)
- Explains the latency cost tradeoff of coalescing honestly
❌ Common L5 Trap
"The database handles the extra load during the miss window."
Why this misses: Without coalescing, 1,000 req/s on a cache key whose origin query averages 100ms puts 100 concurrent database queries in flight during a miss (1,000/s × 0.1s). At 10,000 req/s that becomes 1,000 concurrent queries, likely 10–50× the connection pool capacity. The interviewer asks: "What's your database's connection pool size?" The answer is almost always smaller than 1,000. The database doesn't "handle the extra load"; it collapses under it.
Drill 4: Cache Failure
Staff Answer
Three separate questions: what type of data is this, what's the circuit breaker configuration on origin, and was cold-start capacity planned for?
By data type: user preferences and feed content → serve stale from L1 in-process if available; if not, fall through to origin with circuit breaker and request coalescing. Pricing and catalog → fall through to origin, circuit breaker protects the database. Financial data and auth → fail explicit with a clear error message, never serve stale.
The critical concern: if Redis handles 90% of read load, a cache failure means 10× origin load. Without a circuit breaker and load shedding, the database collapses — the cache incident becomes a database incident. Pre-plan: circuit breaker trips at 2× normal origin QPS, request coalescing reduces duplicate queries, and low-priority traffic (recommendations, analytics) is shed first.
Why this is L6:
- Data-type classification determines failure behavior — not a single answer for everything
- Quantifies the origin load multiplier (10×) — makes the blast radius concrete
- Includes load shedding as the missing piece most candidates forget
❌ Common L5 Trap
"We fall back to the database — the service is still functional, just slower."
Why this misses: "Just slower" hides the blast radius. A cache handling 90% of read load failing means the database receives 10× its normal traffic. A database sized for cache-hit-scenario load is not sized for 10× overload. The service doesn't become "slower" — it likely becomes fully unavailable within 30–60 seconds as the database connection pool exhausts. "Fall back to DB" without a circuit breaker and load shedding plan is not resilience — it's a cascade waiting to be triggered.
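The data-type classification from the Staff answer can be written down as an explicit policy table. The sketch below is illustrative only: `OnCacheFailure`, `POLICY`, and `read_when_cache_down` are hypothetical names, a dict stands in for the L1 in-process cache, and the circuit breaker and load shedding around the origin call are omitted.

```python
from enum import Enum


class OnCacheFailure(Enum):
    SERVE_STALE = "serve stale from L1, else origin"
    ORIGIN = "fall through to origin"
    FAIL = "fail explicit"


# Hypothetical classification, following the data types named in the answer.
POLICY = {
    "user_prefs": OnCacheFailure.SERVE_STALE,
    "feed": OnCacheFailure.SERVE_STALE,
    "pricing": OnCacheFailure.ORIGIN,
    "catalog": OnCacheFailure.ORIGIN,
    "financial": OnCacheFailure.FAIL,
    "auth": OnCacheFailure.FAIL,
}


class CacheUnavailable(Exception):
    """Raised for data types that must never be served stale."""


def read_when_cache_down(data_type, key, l1, load_from_origin):
    """Pick the failure behavior by data type while the shared cache is down."""
    policy = POLICY[data_type]
    if policy is OnCacheFailure.FAIL:
        raise CacheUnavailable(f"{data_type} is unavailable while the cache is down")
    if policy is OnCacheFailure.SERVE_STALE and key in l1:
        return l1[key]                    # possibly stale, acceptable for this type
    return load_from_origin(key)          # should sit behind a circuit breaker
```

An unclassified data type raises `KeyError` here, which forces the classification to happen before the data is cached at all.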
Drill 5: Hot Key
Staff Answer
One cache key receives 90% of total requests. With 12 Redis shards, the single shard that owns the key handles roughly 90% of all traffic. That shard's CPU hits 100%, latency spikes, and requests start failing, not just for the hot key but for all other keys on the same shard.
Fix 1 (immediate, if predictable): Pre-populate L1 in-process caches on all instances before the flash sale. Each of the 50 application instances absorbs traffic locally. Redis never sees the hot-key traffic spike because L1 cache hits in process. Requires L1 TTL management and L1 invalidation via pub/sub when the price changes.
Fix 2 (architectural): Key replication. Create N copies of the hot key across shards: prod:{id}:shard_0 through prod:{id}:shard_{N-1}. Each read picks one copy at random. Writes DEL all N copies. This spreads reads across up to N shards at the cost of N invalidation calls and N × memory.
Detection: per-key QPS metric. Alert when any single key exceeds X% of its shard's total QPS. For predictable hot keys, build a pre-warming playbook for high-traffic events.
Why this is L6:
- Identifies that one hot key can take down an entire shard affecting all keys — not just the hot key itself
- Distinguishes predictable (pre-warm) from unpredictable (key replication) hot key scenarios
- Includes automated detection rather than relying on incident response
❌ Common L5 Trap
"Add more Redis replicas to distribute the read load."
Why this misses: Redis read replicas require explicit routing by the client to replicas for specific keys. If the client isn't aware of which key is hot and isn't routing that key to replicas, all reads still go to the primary shard. More importantly, replicas don't solve the hot-key problem — they distribute reads but the primary still receives all writes and replication traffic. The interviewer asks: "The flash sale starts in 10 minutes. Your provisioning takes 15 minutes. What do you do right now?" Without a pre-warming strategy, the answer is nothing.
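A minimal sketch of key replication (Fix 2), with a dict standing in for the sharded cache. `replica_keys`, `read_hot`, and `invalidate_hot` are hypothetical helper names. One caveat worth checking against the Redis Cluster documentation: a key that literally contains `{...}` is hashed only on the text inside the braces, so the copies must differ outside any such hash tag or they all land on the same slot.

```python
import random


def replica_keys(base_key, n):
    """Names of the N copies of a hot key: <base>:shard_0 .. <base>:shard_{n-1}."""
    return [f"{base_key}:shard_{i}" for i in range(n)]


def read_hot(cache, base_key, n, load_from_origin, rng=random):
    """Read one randomly chosen copy; populate that copy on a miss."""
    key = rng.choice(replica_keys(base_key, n))
    value = cache.get(key)
    if value is None:
        value = load_from_origin(base_key)
        cache[key] = value
    return value


def invalidate_hot(cache, base_key, n):
    """A write must delete every copy, otherwise some readers stay stale."""
    for key in replica_keys(base_key, n):
        cache.pop(key, None)
```

The value of `n` has to be known to both readers and writers; if a writer uses a smaller `n` than the readers, the extra copies are never invalidated.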
Drill 6: Stale Data Incident
Staff Answer
Diagnose before fixing: is invalidation failing (write-path DEL not firing or timing out), is CDC lag unusually high (backstop delayed), or is L1 still holding the old value (pub/sub eviction not wired)?
Investigation: check invalidation success rate for the profile write path. Check CDC consumer lag. Check if the app instances have different values in their L1 caches (would appear as intermittent — some requests stale, some fresh, depending on which instance serves the request).
If invalidation is failing: fix the write-path DEL and add retry. If CDC lag is high: investigate the CDC consumer — likely a topic lag or consumer group issue. If L1 is the problem: add or fix the L1 pub/sub eviction.
The product communication: "Profile changes may take up to 30 seconds to reflect everywhere" — this is the staleness contract. If users expect instant updates (which they do for their own profile), the staleness contract was wrong. Engineering adjusts to read-your-writes: the session that made the update reads from origin for 30 seconds after the write, bypassing cache for that user-session pair.
Why this is L6:
- Diagnoses root cause (which tier) before proposing a fix
- Surfaces read-your-writes as the user-expectation-appropriate pattern
- Recognizes that "stale data" can be intermittent (per-instance L1) without the user noticing the pattern
❌ Common L5 Trap
"Reduce the TTL to 30 seconds so changes propagate faster."
Why this misses: TTL reduction treats the symptom (staleness duration) without fixing the root cause (invalidation not firing). If the write-path DEL is failing, a shorter TTL reduces the staleness window but doesn't eliminate it — and increases origin load for every profile read. The interviewer asks: "After reducing TTL, the user updates their profile and still sees stale data for 28 seconds. Is this fixed?" No — the invalidation still failed; now the window is just shorter. The TTL-tuning instinct is the canonical L5 anti-pattern for this interview.
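The read-your-writes pattern from the Staff answer can be sketched as a small tracker that tells the read path when to bypass the cache. `ReadYourWrites` is a hypothetical helper; it keeps state in process memory, so a real deployment would need the marker to live somewhere every instance can see (for example, in the session itself).

```python
import time


class ReadYourWrites:
    """Remember recent writes per (session, key) so that session reads from origin."""

    def __init__(self, window_seconds=30, clock=time.monotonic):
        self._window = window_seconds
        self._clock = clock
        self._last_write = {}             # (session_id, key) -> write timestamp

    def record_write(self, session_id, key):
        self._last_write[(session_id, key)] = self._clock()

    def must_bypass_cache(self, session_id, key):
        """True while the session's own write is younger than the window."""
        ts = self._last_write.get((session_id, key))
        return ts is not None and self._clock() - ts < self._window
```

Only the session that made the change pays the origin round trip; every other reader keeps using the cache under the normal staleness contract.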
Drill 7: Multi-Region Expansion
Staff Answer
Per-region caches (EU Redis cluster, US Redis cluster) with async invalidation via message bus. Users in EU hit EU cache — low latency, no cross-continental roundtrip. When data is updated in one region, an invalidation event is published to a cross-region message bus (Kafka topic, SNS) and consumed by all regional caches. Cross-region propagation latency: 100–500ms.
What tolerates per-region eventual consistency: user profiles (users don't switch regions mid-session), product descriptions (brief inconsistency rarely matters), session data (should be region-sticky regardless).
What requires global consistency: inventory/stock (overselling is expensive), global rate limiting (abuse protection), financial data (compliance). For these: bypass regional caches entirely and read from a global authoritative store, accepting cross-region latency.
Important: don't replicate the entire cache globally — that doubles cost and adds replication complexity for data that doesn't need it. Per-region caches with async invalidation for 90% of data; global bypass for the remaining 10%.
Why this is L6:
- Explicitly classifies what tolerates regional eventual consistency vs what requires global consistency
- Doesn't default to "global cache" — recognizes that 90% of data doesn't need global consistency
- Quantifies propagation latency (100–500ms) so the product team can make an informed decision
❌ Common L5 Trap
"Use a global Redis cluster so all regions see the same data."
Why this misses: A global Redis cluster in us-east-1 means EU users pay 100–200ms for every cache operation — the same as going to the database directly. Caching adds latency instead of removing it. The interviewer asks: "Your EU users are experiencing 150ms cache latency instead of 2ms. What did adding a cache accomplish for them?" Nothing — and the operational complexity increased. Global consistency is only appropriate for the small subset of data where regional inconsistency has real cost.
Drill 8: TTL Ownership Conflict
Staff Answer
This isn't a TTL debate — it's a missing staleness contract. The conflict exists because nobody classified the data or defined acceptable staleness. Both Product and Infra are trying to fill a governance gap by lobbying for a number.
Reframe: "What data is this specifically? What's the worst case if a user sees an hour-old value? For user preferences — completely fine, a stale theme setting is harmless. For pricing — unacceptable, a user could pay wrong. For personalized recommendations — probably fine, a stale recommendation is a minor UX miss."
Propose data classification: different TTLs for different data types based on actual business impact. User preferences → 1hr TTL (Product's 1hr is fine). Pricing → 30s + write-through invalidation (Infra's concern is valid for this type). Inventory → no cache.
Write it down: "User preferences: 1hr TTL, approved by Product. Pricing: 30s + explicit invalidation, approved by Product and Finance." This makes the staleness a named contract, not a number someone picked.
Why this is L6:
- Reframes a technical disagreement as a missing organizational contract
- Proposes data classification rather than a compromise number
- Assigns explicit ownership (Product approves staleness per data type)
❌ Common L5 Trap
"Compromise at 15 minutes — split the difference between 1 hour and 5 minutes."
Why this misses: Splitting the difference satisfies nobody and solves nothing. Product wanted 1 hour for a reason (fewer cache misses = better performance). Infra wanted 5 minutes for a reason (stale data risk). 15 minutes is wrong for both motivations. The interviewer asks: "A pricing bug goes unnoticed for 12 minutes because the 15-minute cache hides it. Who is accountable?" Nobody — because there was never an explicit agreement about what 15 minutes of staleness means for pricing data.
Drill 9: Cache Correctness Incident
Staff Answer
This is a cache correctness gap, not a QA gap. QA tested through the cache (the fast path that most users take). The bug was in the database (the slow path that rarely gets tested directly). The cache served the old, correct price with high confidence — 99% hit rate, zero errors, zero alerts — while the database silently had wrong data.
Root cause: no cache-bypass testing (QA never validated DB values directly), no correctness monitoring (nobody compared cached values against source of truth), and a cache that masked the bug: a bug that affected 1% of DB reads was invisible because 99% of reads hit the correctly cached value.
Immediate fixes: add cache-bypass test cases that read from origin directly. Add periodic correctness sampling: read 0.1% of requests from origin, compare against cached value, alert on divergence above threshold.
Structural fix: caching governance policy for financial data — pricing requires write-through invalidation (<30s) and mandatory correctness monitoring. PR checklist for any code touching pricing must include cache invalidation review and a correctness test.
Why this is L6:
- Redirects from blaming QA to identifying a systemic correctness monitoring gap
- Explains how caches mask bugs (1% DB error → 0% user-visible error → bug invisible)
- Proposes correctness sampling as the ongoing defense, not just one-time test coverage
❌ Common L5 Trap
"QA should have tested the DB path directly. Add a test case that bypasses cache."
Why this misses: Technically correct as an immediate fix, but it treats a one-time occurrence rather than a class of problem. The interviewer asks: "Next quarter, a similar bug happens in the inventory count calculation. Does the same QA test catch it?" No — because no correctness monitoring was added to detect cache-vs-source divergence at runtime. Test coverage catches known gaps. Correctness monitoring catches unknown gaps.
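Correctness sampling, as described in the immediate fixes, might look like the sketch below. `sample_divergence` is a hypothetical function; `cache_get` and `origin_get` are caller-supplied callables, and the 0.1% default mirrors the sampling rate mentioned above.

```python
import random


def sample_divergence(keys, cache_get, origin_get, sample_rate=0.001, rng=random.random):
    """Compare a random sample of cached values against the source of truth.

    Returns (checked, diverged) so the caller can alert on diverged / checked.
    """
    checked = diverged = 0
    for key in keys:
        if rng() >= sample_rate:
            continue                      # not sampled
        cached = cache_get(key)
        if cached is None:
            continue                      # nothing cached, nothing to compare
        checked += 1
        if cached != origin_get(key):
            diverged += 1
    return checked, diverged
```

Some divergence is expected inside the approved staleness window, so the alert threshold should be set above that baseline rather than at zero.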
Drill 10: Cache Cost Optimization
Staff Answer
Start with a usage audit before making any changes. Three questions: why did costs triple (organic traffic growth, a new service caching aggressively, or dataset bloat from missing eviction), what percentage of cached keys are accessed more than once before eviction (keys read only once add cost with no hit-rate benefit), and what's the hit rate by key pattern (are there entire categories with <50% hit rate that shouldn't be cached at all).
Identify the quick wins: remove low-hit-rate key categories, reduce TTL for infrequently-accessed data (less memory occupied by cold entries), and stop caching computations that are cheaper to recompute than to store. These are removal decisions, not optimization decisions.
Then optimize what remains: compress large values (2–5× reduction in some cases), use more efficient serialization for common objects (MessagePack vs JSON), tiered caching for cold data (slower Redis tier or in-memory store with lower cost-per-GB).
Structural fix: caching governance process — new cache usage requires a cost-benefit review before shipping. Per-team cache cost attribution (tag every key with owning team) so Finance conversations are data-driven. Quarterly cache audit.
Why this is L6:
- Audits usage before optimizing — often finds data that shouldn't be cached at all
- Establishes a governance process so the problem doesn't recur
- Per-team attribution makes cost ownership visible to the teams creating the spend
❌ Common L5 Trap
"Reduce TTLs across the board to save memory. Compress all cached values. Rightsize the Redis cluster."
Why this misses: These optimizations reduce the cost of the cache you have, without questioning what you have. If 30% of cache memory is occupied by keys with <10% hit rate (never actually reused before eviction), optimizing their storage is wasted effort. The interviewer asks: "After all optimizations, what's the hit rate for the keys you kept? If it's 60%, you're still paying for a lot of misses." The real win is finding data that shouldn't be cached and removing it — not making the wrong cache more efficient.
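The usage audit starts with hit rate per key pattern. A minimal sketch, assuming an access log of `(key, was_hit)` pairs is available (how that log is collected is outside this example) and that numeric key segments are IDs that can be collapsed into `*`:

```python
from collections import defaultdict


def key_pattern(key):
    """Collapse numeric segments so prod:123:price and prod:456:price group together."""
    return ":".join("*" if part.isdigit() else part for part in key.split(":"))


def hit_rate_by_pattern(access_log):
    """access_log: iterable of (key, was_hit) pairs. Returns {pattern: hit_rate}."""
    stats = defaultdict(lambda: [0, 0])   # pattern -> [hits, total]
    for key, was_hit in access_log:
        entry = stats[key_pattern(key)]
        entry[0] += int(was_hit)
        entry[1] += 1
    return {pattern: hits / total for pattern, (hits, total) in stats.items()}
```

Patterns that come back with a very low hit rate are the candidates for removal before any compression or TTL tuning is attempted.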
8 Deep Dive Scenarios
Scenario-based analysis for Staff-level depth
Deep Dive 1: Black Friday Cache Failure
Context: It's Black Friday. Your Redis cluster shows 50% higher latency than normal and cache hit rate has dropped from 95% to 70%. On-call escalates to you.
Questions to Surface First:
- Is this a cache problem or an origin problem? A hit rate drop from 95% to 70% raises the miss rate from 5% to 30%, which at constant traffic is a 6× increase in database load. The database may be the real patient.
- What caused the hit rate drop — hot keys from a viral deal, memory pressure causing evictions, or traffic exceeding capacity plans?
- Were circuit breakers and fallbacks configured in advance, or are they being improvised during the incident?
- Was Black Friday traffic included in capacity planning? If not, this is a process failure, not a technical one.
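The load amplification behind the first question follows from miss rates rather than hit rates. A minimal sketch, assuming total traffic stays constant and every miss becomes exactly one origin query (`origin_load_ratio` is a name chosen for this example):

```python
def origin_load_ratio(hit_rate_before, hit_rate_after):
    """Origin load after a hit-rate change, relative to before, at constant traffic."""
    miss_before = 1.0 - hit_rate_before
    miss_after = 1.0 - hit_rate_after
    if miss_before <= 0:
        raise ValueError("baseline hit rate must be below 1.0")
    return miss_after / miss_before


# The scenario's numbers: hit rate falls from 95% to 70%.
black_friday = origin_load_ratio(0.95, 0.70)

# Total cache failure on a cache that absorbs 90% of reads.
cache_down = origin_load_ratio(0.90, 0.0)
```

On Black Friday the traffic itself is also higher than normal, so the real multiplier on the database is this ratio times the traffic increase.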
Staff Approach — Full Reasoning
| Phase | Action |
|---|---|
| T+0–5 min | Is the database OK? Check DB CPU, connection pool, and query latency. This determines urgency. |
| Triage | Hot keys (one viral deal)? Memory pressure (eviction spike)? Traffic exceeding design? |
| Immediate fix | Hot keys → enable L1 pre-warming for the affected keys. Memory pressure → evict low-value TTL buckets. Traffic → activate circuit breaker, shed recommendations and analytics first. |
| Guardrails | Circuit breaker on origin. Request coalescing active. Shed P2 traffic (recommendations, analytics) to protect P0 (checkout, cart, payment). |
| Post-mortem | Why didn't capacity planning catch Black Friday patterns? Load test with 3× traffic + cache-miss scenario. |
Metrics to Watch:
cache.hit_rate (alert on drop >5% for 3 min), origin.rps (inverse of hit rate effectiveness), origin.error_rate, cache.eviction_rate (spike = memory pressure), cache.memory_utilization
Organizational Follow-up: Add Black Friday traffic patterns to annual load testing. Schedule pre-event capacity review with SRE 2 weeks before peak. Create a runbook for "cache degradation during peak" with pre-approved remediation steps (L1 pre-warming, traffic shedding, circuit breaker thresholds).
Staff Signals:
- Checks origin health before cache health: recognizes that a 25-point hit rate drop multiplies backend load, and that the backend is the real risk
- Activates load shedding by priority immediately rather than just investigating
- Post-mortem frames it as a capacity planning process failure, not a technical failure
Deep Dive 2: Stale Payment Data Incident
Context: A customer reports that after updating their payment method, the old card kept getting charged for 3 hours. Investigation shows the payment cache had a 3-hour TTL with no write-through invalidation.
Questions to Surface First:
- What category is this data — financial, PII, preferences? Different categories have different staleness tolerances, and financial data should have been classified as "never stale."
- Was the 3-hour TTL an intentional design choice, or a default that nobody reviewed?
- What other sensitive data (billing, auth tokens, permissions) might have similarly misconfigured cache settings?
- Who approved caching payment method data at all?
Staff Approach — Full Reasoning
| Dimension | Staff Answer |
|---|---|
| Root cause | Payment data was classified as "cacheable with TTL" when it should have been classified as "no staleness tolerated" |
| Immediate | Purge the affected customer's cache entry; verify correct state |
| System fix | Payment method data: either no cache, or write-through invalidation with <60s TTL and mandatory correctness monitoring |
| Process fix | Data classification policy: financial data requires explicit cache approval with a staleness contract signed by Finance and Legal |
| Broader audit | What other sensitive data is cached with long TTLs? Run a key-pattern audit immediately. |
Metrics to Watch:
cache.staleness_age_by_key_pattern (for classified data types), cache.invalidation_latency (time from write to cache update), cache.sensitive_data_access_count (track which sensitive data types are being served from cache)
Organizational Follow-up: Create a data classification policy requiring explicit approval for caching financial and PII data. Audit all existing cache key patterns against the classification. Add cache configuration review to the PR checklist for any code touching payment, billing, or auth data. Add correctness monitoring for classified data types.
Staff Signals:
- Reframes from "TTL was wrong" to "financial data requires a data classification policy"
- Asks "what else is misconfigured" — broadens from one key to an org-wide audit
- Adds cache config review to PR checklist as a preventive structural fix
Deep Dive 3: Cache Warming Gone Wrong
Context: During deployment, the cache warming job ran for 2 hours instead of the expected 10 minutes. The old cache was decommissioned on schedule. The site was slow for 2 hours.
Questions to Surface First:
- Why was the old cache decommissioned before warming completed? Was there a readiness gate, or was the decommission time-based?
- What caused warming to take 12× longer — dataset growth, degraded query performance, or insufficient parallelism?
- Is warming time tracked as a capacity trend metric? Does anyone monitor it?
- Should cache warming block deployment traffic shift?
Staff Approach — Full Reasoning
| Phase | Action |
|---|---|
| Root cause | Old cache decommissioned before warming completed (process failure). Warming duration grew 12× without detection (capacity metric not tracked). |
| System fix | Warming job with real-time progress tracking + hit-rate reporting. Traffic shift gates on "hit rate > 80% in new cache" — not a fixed timer. |
| Process fix | Cache warming is a deployment-blocking checklist item. Rollback criteria: "warming must complete within 2× expected duration or deployment is rolled back." |
| Capacity planning | Track warming duration as a weekly trend metric. Alert if it grows >50% over a rolling 4-week baseline. |
Metrics to Watch:
cache.warming_duration_seconds (trend over time), cache.warming_progress_pct (real-time), cache.hit_rate_during_warmup, origin.load_during_warmup
Organizational Follow-up: Add warming duration to quarterly capacity planning. Make warming a deployment-blocking gate in the deployment runbook. Create a warming SLA with an explicit rollback trigger.
Staff Signals:
- Treats warming as deployment-critical, not background
- Tracks warming duration as a trend metric — catches dataset growth before it becomes an incident
- Creates explicit rollback criteria so the decision isn't made under pressure
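The readiness gate and rollback criteria from the table above can be expressed as one decision function. This is a sketch under the thresholds stated there (hit rate above 80%, rollback after 2× the expected duration); `warming_decision` and its return values are hypothetical names.

```python
def warming_decision(hit_rate, elapsed_s, expected_s, min_hit_rate=0.80, max_overrun=2.0):
    """Decide the next deployment step from warming progress, not from a fixed timer."""
    if hit_rate > min_hit_rate:
        return "shift_traffic"            # new cache is warm enough to take traffic
    if elapsed_s > max_overrun * expected_s:
        return "rollback"                 # warming blew its budget; keep the old cache
    return "keep_warming"
```

With the incident's numbers (10 minutes expected), this gate would have returned "rollback" once warming passed 20 minutes, long before the 2-hour mark, and the old cache would not have been decommissioned on a timer.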
Deep Dive 4: Multi-Tier Intermittent Stale Data
Context: Users report intermittent stale data — sometimes fresh, sometimes stale. You have L1 (10s TTL) and L2 (5min TTL) caching. L2 is invalidated on write; L1 is not.
Questions to Surface First:
- What's the worst-case staleness window? With L1 at 10s and L2 at 5min, data can be stale for up to 5min 10s in the worst case.
- Is L1 invalidation wired to the write path, or does it rely solely on TTL expiration?
- Is the intermittent pattern because different app instances have different L1 state?
- Has product approved "up to 10 seconds stale from L1" explicitly?
Staff Approach — Full Reasoning
| Dimension | Staff Answer |
|---|---|
| Hypothesis | L2 is invalidated on write. L1 on some instances has a stale entry that hasn't expired yet. Different instances serve different data — intermittent from the user's perspective. |
| Investigation | Which instances serve stale data? Are they all stale (L2 invalidation failing) or only some (L1 issue)? |
| Root cause | L1 invalidation not wired — L1 relies on TTL expiry only. Worst-case staleness: L1 TTL + potential L2 TTL = 10s + 0s (L2 was invalidated) = 10 seconds. |
| Decision | Accept 10s L1 staleness (document + get product sign-off) OR implement L1 invalidation via Redis pub/sub to all instances. |
| Staleness math | Document explicitly: "L1 may serve data up to 10 seconds stale. Approved by product on [date] for [data type]." |
Metrics to Watch:
cache.l1_staleness_age (per instance), cache.l2_staleness_age, cache.cross_instance_consistency (compare L1 values across instances on a sample), cache.invalidation_propagation_latency
Organizational Follow-up: Document the multi-tier staleness contract explicitly. If L1 invalidation via pub/sub is implemented, add monitoring for pub/sub delivery lag. Add "aggregate staleness = L1 TTL + L2 TTL" to the caching design checklist for multi-tier systems.
Staff Signals:
- Calculates worst-case aggregate staleness (L1 TTL + L2 TTL) — reasons about the full system
- Documents staleness contract explicitly with product approval
- Frames the architectural question: accept TTL-based L1 staleness or implement pub/sub invalidation
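The staleness arithmetic in this scenario can be captured in a few lines. The sketch assumes exactly two tiers, with L1 expiring by TTL only (no pub/sub invalidation) and populated from L2; `worst_case_staleness` is a name chosen for this example.

```python
def worst_case_staleness(l1_ttl_s, l2_ttl_s, l2_invalidated_on_write):
    """Worst-case seconds a reader can see old data in an L1 (TTL-only) + L2 setup.

    If L2 is invalidated on write, only L1 can still hold the old value.
    If not, L1 can copy a stale value from L2 just before L2 expires,
    so the two TTLs add up.
    """
    l2_contribution = 0 if l2_invalidated_on_write else l2_ttl_s
    return l1_ttl_s + l2_contribution


# This scenario: L1 = 10s, L2 = 5min.
with_l2_invalidation = worst_case_staleness(10, 300, True)
without_l2_invalidation = worst_case_staleness(10, 300, False)
```

The result is the number that belongs in the written staleness contract, which is why it is worth computing rather than assuming it equals the shortest TTL.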
Deep Dive 5: Cache Cost Tripled
Context: Finance reports Redis costs tripled in 6 months. Leadership asks for a 40% reduction without performance impact.
Questions to Surface First:
- Why did costs triple — organic traffic growth, a new service caching aggressively without review, or dataset bloat from missing eviction policies?
- What percentage of cached keys are accessed more than once before eviction? (Single-use cached data is pure waste.)
- What's the hit rate by key pattern? Are there entire categories with no measurable benefit?
- Is the 40% target based on analysis, or is it an arbitrary Finance number?
Staff Approach — Full Reasoning
| Phase | Action |
|---|---|
| Analysis | Key cardinality, size distribution, hit rate by key pattern, key reuse rate (% accessed >1× before eviction) |
| Quick wins | Remove low-hit-rate categories, reduce TTL for rarely accessed data, stop caching multi-step computations that are cheaper to recompute |
| Architecture | Move cold-tier data to cheaper storage. Tiered caching: hot data in Redis, warm data in a lower-cost memory store. |
| Compression | Compress large values (2–5× savings), more efficient serialization (MessagePack vs JSON, 10–40% savings) |
| Governance | New cache usage requires a cost-benefit review. Per-team attribution so Finance conversations are data-driven. |
Metrics to Watch:
cache.hit_rate_by_key_pattern, cache.key_reuse_rate, cache.memory_cost_per_hit, cache.new_key_categories_per_week
Organizational Follow-up: Caching governance process: new cache usage requires cost-benefit review before shipping. Quarterly cache audit to catch organic growth before it becomes a Finance escalation. Per-team cache cost attribution so teams own their own cache spend.
Staff Signals:
- Usage audit before optimization — finds data that shouldn't be cached at all
- Caching governance process so the tripling doesn't happen again
- Per-team attribution makes cost a team-level accountability, not a platform team problem
9 Level Expectations Summary
After studying this playbook, you should be able to:
- Ask "what's the staleness budget and who signed off on it?" before drawing any boxes
- Design explicit write-path invalidation + TTL safety net + CDC backstop
- Explain thundering herd and design request coalescing with probabilistic early expiry
- Classify data types into failure behaviors (serve stale / fall through to origin / fail explicit)
- Calculate worst-case aggregate staleness for multi-tier caching
- Know when to push back on adding a cache entirely
The Bar for This Question
Mid-level (L4/E4): Implements cache-aside with Redis, sets reasonable TTLs, explains the basic read path. Understands cache hits and misses and why caching improves latency. Can describe LRU eviction.
Senior (L5/E5): Quickly establishes the caching pattern based on access patterns and spends time on real problems: cache invalidation strategy (TTL vs event-driven), thundering herd on cache miss, cache key design and hit rate, and the staleness contract. Can quantify: "We cache product catalog with a 5-minute TTL because the business accepts 5 minutes of stale pricing in exchange for 10× lower DB load."
Staff+ (L6/E6+): Dispatches the baseline architecture in 5 minutes and spends 25+ minutes on operational depth: multi-tier caching with aggregate staleness arithmetic, cache warming strategies for cold starts, the organizational question of who owns the staleness contract (product signs off on user-facing staleness, engineering provides the mechanism), failure mode analysis classified by data type, correctness monitoring (not just hit rate), hot key detection and mitigation, and when NOT to cache. The interviewer should see you treat caching as a consistency contract with organizational ownership — not a performance optimization with a TTL value.
10 Staff Insiders: Controversial Opinions
10.1 "Most Stale Data Incidents Are Never Detected"
Your cache is probably serving stale data right now. You just don't know it. No errors, no alerts, no latency spike — just confidently wrong answers. Stale data is the only failure mode that generates no technical signal. The only signals are user complaints (if users notice) or business metrics (if the data is business-critical). Most teams measure cache hit rate. Almost nobody measures cache correctness. If you can't tell me the last time you validated cached values against source of truth, your cache correctness is an assumption, not a guarantee.
10.2 "High Hit Rate Can Mask Correctness Bugs"
A 99% hit rate might mean you're serving confidently wrong answers 99% of the time. A pricing bug writes an incorrect value to the database. That incorrect value gets cached. Without caching, every read goes to the database, so correcting the row fixes all subsequent reads immediately. With caching, 99% of reads keep hitting the cached copy and return the wrong price until the TTL (5 minutes here) expires or someone explicitly invalidates the key. The cache turns a bug that would end with the database fix into an incident that outlives it. Hit rate is a performance metric. Correctness is the safety metric.
10.3 "Removing a Cache Is Often the Right Fix"
Signs you should delete the cache: hit rate below 50%, write rate approaching read rate (invalidation churn), multiple incidents traced to staleness, origin can handle the load without it, or nobody can explain the staleness contract. Teams don't remove caches because "we already built it" and "it must be helping somehow." The Staff engineer asks: "What if we just didn't cache this?" This requires courage — removing infrastructure is politically harder than adding it. But the most resilient cache is the one that doesn't exist.
10.4 "Cache Invalidation Is an Ownership Problem, Not a Technical One"
"Cache invalidation is hard" is a cop-out. It's hard because nobody owns it. Invalidation fails because: the writer doesn't know about the cache, there are multiple caches (one forgot), nobody designed for concurrent write race conditions, or the cache contract was never documented. The technical fixes (CDC, version stamps, pub/sub) exist and work. The organizational fix is harder: every write path must have a named owner of its cache contract. Without that, you'll add CDC, and then a new write path will appear that bypasses the CDC pipeline, and the bug returns.
10.5 "TTL Was Too Long Is Never the Root Cause"
When stale data causes an incident, "reduce TTL" is the wrong fix. It treats the symptom (staleness duration) not the cause (invalidation not firing). Shorter TTL means more origin load — you pay in performance to paper over an ownership gap. The real root causes: writer didn't know to invalidate, invalidation code had a bug, invalidation was async and lost, nobody owned the cache contract. Every "TTL was too long" incident is actually an invalidation ownership failure. Fix the ownership, not the number.
Appendix A: Caching Pattern Reference
A.1 Cache-Aside (Lazy Loading)
Most common pattern. Application manages cache explicitly.
Read: Check cache → miss → query DB → populate cache → return. Write: Write to DB → DEL cache key.
Pros: Cache failure = DB fallback (not write failure). Explicit control over both paths. Cons: Invalidation logic must exist in every write path. Race condition between write and DEL.
When to use: Default for most cases. Decouples cache availability from service availability.
A.2 Write-Through
Write: Write to cache → cache writes to DB. Read: Always from cache (fully populated).
Pros: Cache always consistent with DB. Simple read path. Cons: Write latency includes both the cache write and the DB write (roughly doubled). Cache failure = write failure. Tight coupling.
When to use: When write latency is not user-facing (async write paths), or when cache must always be populated.
A.3 Read-Through
Read: Cache handles miss by fetching from DB and populating. Application code never touches DB directly.
Pros: Simpler application code. Cache handles all DB interaction. Cons: Cache failure = read failure. Less control. Tight coupling.
When to use: Managed cache services where the cache service handles origin fetch. Reduces per-team invalidation burden at the cost of cache availability coupling.
A.4 Write-Behind (Write-Back)
Write: Write to cache → async write to DB (via queue or background job). Read: Always from cache.
Pros: Lowest write latency. Absorbs write bursts. Cons: Data loss risk if cache fails before async write completes. Eventual consistency lag. Complex failure modes.
When to use: Rarely. High-write workloads where loss tolerance exists (counters, analytics). Not for authoritative data.
Appendix B: Thundering Herd Mitigation Reference
Request Coalescing (Singleflight)
Implementation: Redis SET NX EX as distributed lock. singleflight library in Go. In-process deduplication per worker.
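As an illustration of the in-process variant, here is a minimal sketch of request coalescing using only the Python standard library. The names `SingleFlight`, `_Call`, and `do` are hypothetical and only loosely modeled on the idea behind Go's singleflight; coalescing across processes or hosts would still need a distributed lock such as the Redis SET NX EX approach mentioned above.

```python
import threading
from typing import Any, Callable, Dict, Optional


class _Call:
    """State shared between the leader and the followers of one in-flight load."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: Any = None
        self.error: Optional[Exception] = None


class SingleFlight:
    """Ensure only one concurrent load per key runs inside this process."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._calls: Dict[str, _Call] = {}

    def do(self, key: str, fn: Callable[[], Any]) -> Any:
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._calls[key] = call
        if leader:
            try:
                call.result = fn()  # the single origin fetch
            except Exception as exc:
                call.error = exc  # followers see the same failure
            finally:
                with self._lock:
                    del self._calls[key]  # later requests start a fresh load
                call.done.set()
        else:
            call.done.wait()  # followers block until the leader finishes
        if call.error is not None:
            raise call.error
        return call.result
```

A caller would wrap the cache-miss path, for example `sf.do("user:1", load_from_db)`, so that a burst of simultaneous misses on one key produces a single origin query per worker process.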
Probabilistic Early Expiry
Instead of expiring at exactly TTL, expire with increasing probability as TTL approaches:
if (current_time - creation_time) > (TTL - random(0, jitter)):
    expire()
Jitter spreads refresh load over a window instead of a single cliff-edge moment. Cost: average age of served entries increases slightly.
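A runnable version of the check above might look like the sketch below. The function names and the injectable `rng` parameter are assumptions added for illustration and testability; the rule itself is the one stated above, with the random draw taken uniformly from the jitter window.

```python
import random
from typing import Callable


def should_refresh(
    age_seconds: float,
    ttl_seconds: float,
    jitter_seconds: float,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Treat the entry as expired once its age passes TTL minus a random slice of jitter."""
    early_by = rng() * jitter_seconds  # uniform in [0, jitter_seconds)
    return age_seconds > ttl_seconds - early_by


def refresh_probability(
    age_seconds: float, ttl_seconds: float, jitter_seconds: float
) -> float:
    """Probability that should_refresh returns True for an entry of this age."""
    if jitter_seconds <= 0:
        return 1.0 if age_seconds > ttl_seconds else 0.0
    p = (age_seconds - (ttl_seconds - jitter_seconds)) / jitter_seconds
    return min(1.0, max(0.0, p))
```

With a 300-second TTL and 60 seconds of jitter, entries younger than 240 seconds are never refreshed early, and the refresh probability then climbs linearly until it reaches 1 at the TTL.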
Appendix C: Cache Sizing Quick Reference
Basic Formula
memory_needed = working_set_size × overhead_factor
working_set_size = num_keys × avg_value_size
overhead_factor = 1.5–2× (Redis data structures, fragmentation)
Example: 1M users × 2KB per user profile = 2GB working set. 2GB × 1.5 overhead = 3GB. 3GB × 1.2 headroom = 3.6GB minimum.
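The same arithmetic as a small helper, for plugging in your own numbers. The function name is hypothetical; the default overhead factor and the 1.2 headroom factor come from the example above, and sizes use decimal units (1KB = 1,000 bytes) to match it.

```python
def cache_memory_bytes(
    num_keys: int,
    avg_value_bytes: int,
    overhead_factor: float = 1.5,
    headroom_factor: float = 1.2,
) -> float:
    """Estimate the memory to provision for a cache, in bytes."""
    working_set = num_keys * avg_value_bytes       # raw payload size
    with_overhead = working_set * overhead_factor  # data structures, fragmentation
    return with_overhead * headroom_factor         # spare room for growth and spikes


# The example above: 1M user profiles at 2KB each.
estimate_gb = cache_memory_bytes(1_000_000, 2_000) / 1e9
print(f"{estimate_gb:.1f} GB")
```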
Hit Rate vs Cache Size (Typical Curve)
Cache size: 10% of working set → Hit rate: ~70%
Cache size: 20% of working set → Hit rate: ~85%
Cache size: 50% of working set → Hit rate: ~95%
Cache size: 100% of working set → Hit rate: ~99%
Hit rate curves are logarithmic — doubling cache size does not double hit rate. If access is highly skewed (20% of keys serve 80% of requests), a small cache gives outsized hit rate. If access is uniform, you need to cache most of the working set.
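The effect of skew can be checked with a small simulation. The sketch below replays a uniform stream and an 80/20-skewed stream through the same LRU cache sized at 10% of the key space. All parameters (1,000 keys, 20,000 requests, the seed) are arbitrary assumptions, and the resulting numbers will not match the typical curve above, which depends on the real workload; the point is only the direction of the difference.

```python
import random
from collections import OrderedDict
from typing import Iterable, List


def lru_hit_rate(requests: Iterable[int], capacity: int) -> float:
    """Replay a request stream through an LRU cache and return the hit rate."""
    cache: "OrderedDict[int, bool]" = OrderedDict()
    hits = 0
    total = 0
    for key in requests:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / total if total else 0.0


def uniform_stream(rng: random.Random, num_keys: int, n: int) -> List[int]:
    return [rng.randrange(num_keys) for _ in range(n)]


def skewed_stream(
    rng: random.Random,
    num_keys: int,
    n: int,
    hot_fraction: float = 0.2,
    hot_traffic: float = 0.8,
) -> List[int]:
    """hot_fraction of the keys receive hot_traffic of the requests."""
    hot_keys = max(1, int(num_keys * hot_fraction))
    out = []
    for _ in range(n):
        if rng.random() < hot_traffic:
            out.append(rng.randrange(hot_keys))
        else:
            out.append(rng.randrange(hot_keys, num_keys))
    return out


rng = random.Random(0)
NUM_KEYS, CAPACITY, N = 1000, 100, 20000
uniform_rate = lru_hit_rate(uniform_stream(rng, NUM_KEYS, N), CAPACITY)
skewed_rate = lru_hit_rate(skewed_stream(rng, NUM_KEYS, N), CAPACITY)
print(f"uniform: {uniform_rate:.2f}, skewed: {skewed_rate:.2f}")
```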
Capacity Questions to Answer
- What's the working set? Hot data only, or everything potentially cached?
- What hit rate do you need? 80%? 95%? 99%?
- What's the cost of a miss (DB query latency, capacity impact)?
- What's the cost of cache memory ($/GB for your tier)?
- What's the growth rate (how does working set grow over 12 months)?
Appendix D: Invalidation Strategy Reference
D.1 Write-Path DEL (Primary)
On every write: synchronously delete the cache key. Immediate freshness. Requires every write path to know about and correctly implement the DEL.
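Race condition: A reader that missed the cache and read the old row just before the DB write commits can write that old value back into the cache after the writer's DEL has run. The stale entry then survives until TTL expiry. Mitigation: version stamps, or accept the risk, since the window in which the race can occur is brief (usually <1ms at normal latency).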
D.2 TTL Safety Net
Always set TTL. Even with perfect write-path invalidation, set TTL to 2× the acceptable staleness window as a safety net for:
- New write paths added without invalidation
- Network failures on the DEL call
- Race conditions that slip through
D.3 CDC Backstop (Change Data Capture)
Subscribe to the DB changelog (Debezium for PostgreSQL WAL, Maxwell for MySQL binlog). Publish invalidation events for every row change. Consume and DEL asynchronously.
Latency: ~100ms–1s depending on CDC pipeline configuration. Benefit: Decouples writers from cache knowledge. Any write path, including ones added after the cache was built, automatically triggers invalidation. Cost: Infrastructure complexity (CDC pipeline, Kafka topic, consumer group). Operational burden.
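The consumer side of this pipeline can be sketched as below. The event shape (`table`, `pk`, `op`) and the key patterns are hypothetical simplifications, not the actual Debezium or Maxwell message format, and a plain dict stands in for the cache. The important property is that the handler is idempotent, so redelivered events are harmless.

```python
from typing import Dict, Iterable, Optional

# Hypothetical mapping from table name to cache key pattern.
KEY_PATTERNS: Dict[str, str] = {
    "users": "user:{id}",
    "products": "product:{id}",
}


def cache_key_for(event: dict) -> Optional[str]:
    """Map a row-change event to its cache key, or None if the table is not cached."""
    pattern = KEY_PATTERNS.get(event["table"])
    if pattern is None:
        return None
    return pattern.format(id=event["pk"])


def consume(events: Iterable[dict], cache: Dict[str, str]) -> int:
    """Delete the cache key for every row change; return how many keys were removed."""
    deleted = 0
    for event in events:
        key = cache_key_for(event)
        if key is None:
            continue  # change to a table nobody caches
        if cache.pop(key, None) is not None:
            deleted += 1  # deleting a missing key is a no-op, so replays are safe
    return deleted
```

Because the mapping lives in the consumer, a new write path against the `users` table triggers invalidation without its author knowing the cache exists.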
D.4 Version Stamps
Include version in cache key: user:{id}:v{version}. When user is updated, increment version in DB. Reads check current version, construct the key with the new version, and miss naturally (old key is simply unreachable, not deleted). No DEL required.
Benefit: Eliminates the write-DEL race condition entirely. Stale entries expire naturally by TTL; they're never served to clients checking the current version. Cost: Key cardinality grows. Old-version entries occupy memory until TTL expiry.
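A minimal sketch of the versioned-key scheme follows. The dicts `db` and `cache` are in-memory stand-ins, and the function names are hypothetical. The sketch reads the current version from the same row as the data; in practice this only pays off when the version lookup is much cheaper than building the cached value, which is an assumption here.

```python
from typing import Dict

db: Dict[int, dict] = {1: {"name": "alice", "version": 1}}  # source of truth
cache: Dict[str, str] = {}  # stand-in for the cache; real entries would carry a TTL


def versioned_key(user_id: int, version: int) -> str:
    return f"user:{user_id}:v{version}"


def read_user(user_id: int) -> str:
    """Look up the current version, then read through the versioned key."""
    row = db[user_id]  # assumed to be a cheap version lookup
    key = versioned_key(user_id, row["version"])
    if key in cache:
        return cache[key]
    cache[key] = row["name"]  # miss: populate under the current version
    return cache[key]


def update_user(user_id: int, name: str) -> None:
    """Write the new value and bump the version; no cache DEL is issued."""
    current = db[user_id]
    db[user_id] = {"name": name, "version": current["version"] + 1}
```

After an update, the old `user:1:v1` entry is still in the cache but no reader constructs that key any more, which is the memory cost noted above.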
Appendix E: Observability Reference
Core Metrics
| Metric | Measures | Alert Threshold |
|---|---|---|
| cache.hit_rate | Performance effectiveness | <80% for 5 min → P2 |
| origin.rps | Load reaching the origin (rises as cache effectiveness falls) | >150% of baseline → P2 |
| cache.eviction_rate | Memory pressure | Spike >1000/s → P2 |
| cache.latency_p99_ms | Cache infrastructure health | >50ms → P2 |
| cache.invalidation_lag_seconds | Staleness detection | >30s → P1 for financial data |
| cache.correctness_divergence_rate | Actual correctness | >0.1% → P1 |
What Hit Rate Doesn't Tell You
High hit rate can hide correctness bugs. A 99% hit rate with a bug that wrote wrong data = 99% of reads confidently returning the wrong answer.
Better signals:
- Origin load during normal cache operation (rising = hit rate falling or hot key)
- Invalidation lag (time from write to cache refresh)
- Correctness sample (compare cached value to source of truth on 0.1% of reads)
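The correctness sample in the last bullet can be sketched as a small counter that sits on the read path. The class name, the `fetch_truth` callback, and the injectable `rng` are hypothetical; the 0.1% default sample rate is the figure from the bullet above.

```python
import random
from typing import Any, Callable


class CorrectnessSampler:
    """Compare a small fraction of cache hits against the source of truth."""

    def __init__(
        self,
        sample_rate: float = 0.001,
        rng: Callable[[], float] = random.random,
    ) -> None:
        self.sample_rate = sample_rate
        self._rng = rng
        self.sampled = 0
        self.diverged = 0

    def observe(self, cached_value: Any, fetch_truth: Callable[[], Any]) -> None:
        """Call on a cache hit; fetch_truth runs only for sampled reads."""
        if self._rng() >= self.sample_rate:
            return
        self.sampled += 1
        if fetch_truth() != cached_value:
            self.diverged += 1

    @property
    def divergence_rate(self) -> float:
        """Fraction of sampled reads whose cached value disagreed with the origin."""
        return self.diverged / self.sampled if self.sampled else 0.0
```

Exporting `divergence_rate` as cache.correctness_divergence_rate gives the alert in the table above something to fire on. Some divergence is expected while an invalidation is in flight, so the threshold should sit above that baseline.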
Dashboard Structure
One panel per cache tier and per key-pattern family:
- Current hit rate (trend + alert line)
- Eviction rate (memory pressure signal)
- Invalidation lag (staleness signal)
- Origin RPS (blast radius signal)
- Correctness sample divergence rate (correctness signal)
Appendix F: Multi-Region Caching Reference
Strategies
| Strategy | Consistency | Latency | Complexity |
|---|---|---|---|
| Per-region, no sync | Eventually consistent | Low (no cross-region) | Low |
| Per-region + async invalidation | Eventually consistent | Low | Medium |
| Global cache (single region) | Strong | High (cross-region reads) | Low |
| Global cache (replicated) | Eventually consistent | Low reads, high writes | High |
Default: Per-Region with Async Invalidation
Write happens in one region → DB change triggers CDC event → invalidation event published to cross-region message bus → all regional caches consume and DEL. Propagation: 100–500ms.
Acceptable for: User profiles, product catalog, session data. These tolerate brief cross-region inconsistency.
Not acceptable for: Inventory (overselling risk), global rate limiting (abuse protection), financial data (compliance).
When to Accept Per-Region Inconsistency
Users rarely switch regions mid-session, and brief inconsistency in the product catalog rarely matters. Use per-region caching for these: global consistency adds latency and infrastructure complexity without meaningful user benefit.
When to Require Global Consistency
Inventory/stock, financial transactions, global rate limiting. For these, bypass regional caches entirely and read from a globally authoritative source. Accept cross-region latency. Do not attempt to build a globally consistent cache: for data that needs strong consistency, the complexity of keeping distributed caches in agreement is not worth it.