How to Use This Playbook
This playbook supports three reading modes:
| Mode | Time | What to Read |
|---|---|---|
| Quick Review | 15 min | Executive Summary → Interview Walkthrough → Fault Lines (§3) → Drills (§7) |
| Targeted Study | 1-2 hrs | Executive Summary → Interview Walkthrough → Core Flow, expand appendices where you're weak |
| Deep Dive | 3+ hrs | Everything, including all appendices |
What is Distributed Caching? — Quick primer if you're unfamiliar
The Problem
A cache stores copies of frequently accessed data in fast storage (usually memory) to avoid repeatedly hitting slower backends like databases or external APIs. "Distributed" means the cache spans multiple nodes, allowing it to scale beyond a single machine's memory and survive individual node failures. The tradeoff: you're trading consistency for speed—cached data can become stale.
Common Use Cases
- Database Query Caching: Store expensive query results to reduce database load (e.g., product catalogs, user profiles)
- Session Storage: Keep user sessions in fast-access memory across a cluster of web servers
- API Response Caching: Cache third-party API responses to reduce latency and avoid rate limits
- Computed Result Caching: Store results of expensive computations (ML model outputs, aggregations)
- CDN Edge Caching: Cache static and semi-dynamic content at edge locations for global users
Why Interviewers Ask About This
Caching seems simple but hides brutal complexity: invalidation is a consistency problem, not a timeout problem. Interviewers want to see if you understand that adding a cache means you now have two sources of truth that can disagree. Can you articulate when staleness is acceptable? Do you know what happens when the cache goes down? This topic reveals whether you've dealt with real production issues—cache stampedes, thundering herds, and the dreaded "why is this showing old data?" bug.
What This Interview Actually Tests
Caching is not a "make it faster" question. Everyone knows Redis.
This is a consistency and operational ownership question that tests:
- Whether you understand why caching introduces complexity, not just speed
- Whether you reason about invalidation before discussing eviction
- Whether you can articulate what "stale" means for your use case
- Whether you understand the blast radius when the cache fails
The key insight: Caching is a consistency problem disguised as a performance optimization. Staff engineers reason about who pays for staleness and who owns the invalidation contract.
The L5 vs L6 Contrast (Memorize This)
| Behavior | L5 (Senior) | L6 (Staff) |
|---|---|---|
| First move | "We'll add Redis in front of the database" | Asks "What's the staleness tolerance? Who's the source of truth?" |
| Invalidation | "TTL of 5 minutes" | "TTL is the last resort. What's our invalidation contract?" |
| Failure | "We'll add replicas" | "When the cache fails, do we hit the DB or return errors? What's the thundering herd plan?" |
| Consistency | Assumes cache is always helpful | Articulates when caching makes things worse (write-heavy, low hit rate) |
| Ownership | Focuses on cache implementation | Asks "Who owns cache warming? Who gets paged when hit rate drops?" |
The Three Caching Intents (Pick One and Commit)
| Intent | Constraint | Strategy | Staleness Bar |
|---|---|---|---|
| Latency Reduction | Speed is everything | Aggressive caching, read-through | Seconds to minutes acceptable |
| Origin Protection | Shield backend from load | Cache-aside with circuit breaker | Minutes acceptable, freshness secondary |
| Cost Optimization | Reduce expensive computation/queries | Precompute + cache, longer TTLs | Minutes to hours acceptable |
Staff Move: "I'll assume we're protecting the origin from read load while maintaining sub-second staleness for user-facing data. This is the hardest case because we need both availability and freshness."
The Five Fault Lines (The Core of This Interview)
1. Freshness vs Performance — Shorter TTLs mean fresher data but more origin load. Who decides the staleness budget?
2. Cache-Aside vs Read-Through vs Write-Through — Where does the invalidation logic live? Who owns it?
3. Availability vs Consistency — When the cache is down, do we serve stale data, hit origin, or fail?
4. Local vs Distributed — In-process cache (fast, inconsistent) vs shared cache (slower, consistent)?
5. Proactive vs Reactive Invalidation — Push invalidation on write, or let TTL expire? Who coordinates?
Each fault line has a tradeoff matrix with explicit "who pays" analysis. See §3.
Default Staff Positions (Unless Proven Otherwise)
| Position | Rationale |
|---|---|
| Cache-aside over write-through | Decouples cache availability from read availability |
| TTL is safety net, not strategy | Primary invalidation should be explicit on write path |
| Invalidation before eviction | Design the invalidation contract before discussing LRU vs LFU |
| Serve stale only with sign-off | Product must explicitly approve staleness budget per data type |
| Financial/auth data bypasses cache | Correctness cost of staleness exceeds performance benefit |
| If invalidation can't be owned, don't cache | Unowned invalidation = guaranteed stale data incidents |
Quick Reference: What Interviewers Probe
| After You Say... | They Will Ask... |
|---|---|
| "Add Redis cache" | "What's your invalidation strategy? What happens on write?" |
| "TTL of 5 minutes" | "What if the data changes? Is 5 minutes of staleness acceptable?" |
| "Cache-aside pattern" | "What about thundering herd on cache miss?" |
| "We'll add replicas" | "Replicas don't answer cold-start. What's your warming strategy?" |
| "Invalidate on write" | "How do you handle race conditions between write and invalidation?" |
Jump to Practice
→ Active Drills (§7) — 10 practice prompts with expected answer shapes
System Architecture Overview
Interview Walkthrough: How to Present This in 45 Minutes
The HelloInterview-style guides walk you through each step at tutorial pace. That's fine for Senior candidates. At Staff level, the basics should take 10-12 minutes — fast enough that you spend the remaining 30+ minutes on the invalidation, failure, and consistency questions that actually determine your level.
The six phases below add up to 45 minutes. The ratios matter: phases 1-4 are deliberately compressed so phase 5 gets the lion's share of time. If you're spending more than 12 minutes before the transition to depth, you're pacing like an L5.
Phase 1: Requirements & Framing (2-3 minutes)
State functional requirements in 30 seconds — don't enumerate, state the category:
- "We need a distributed caching layer to reduce database load and serve repeated reads at sub-millisecond latency."
That's it. Don't list every data type or cache operation.
Invest time on non-functional requirements (this is the Staff move):
- "What's the staleness budget? Product needs to define acceptable staleness per data type — prices need <30s, product descriptions can tolerate 5 minutes, user sessions need zero staleness."
- Clarify: read-to-write ratio (100:1 justifies caching, 2:1 probably doesn't), dataset size (does it fit in memory?), consistency model (eventual vs strong)
- "I'll assume a read-heavy workload with per-type staleness budgets, because that's the most common production scenario and forces the hardest invalidation decisions."
Phase 2: Core Entities & API (1-2 minutes)
State entities quickly (30 seconds):
- CacheEntry — key, value, TTL, version/etag (the version enables conditional invalidation)
- InvalidationEvent — key_pattern, source, timestamp, propagation_status (first-class entity, not an afterthought)
- CachePolicy — data_type, staleness_budget, eviction_strategy, origin_fallback behavior
API (1 minute) — transparent cache-aside in the application layer, not a separate API:
get(key) → HIT(value, age) | MISS
set(key, value, ttl, invalidation_policy) → OK
invalidate(key_or_pattern, reason) → OK
The invalidation path is the one that matters:
on_write(entity) → invalidate(cache_key(entity), "source_write")
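A minimal sketch of the cache-aside contract above, assuming an in-process dict standing in for Redis and a `db` object with `get`/`put` (illustrative names, not a real client API):

```python
import time

class CacheAside:
    """Illustrative cache-aside sketch: an in-process dict stands in for Redis,
    and `db` is any object exposing get(key)/put(key, value)."""

    def __init__(self, db, default_ttl=300):
        self.db = db
        self.default_ttl = default_ttl
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None:
            value, expires_at = entry
            if time.monotonic() < expires_at:
                return value          # HIT
            del self._store[key]      # expired entry, fall through to origin
        value = self.db.get(key)      # MISS: read from the source of truth
        self._store[key] = (value, time.monotonic() + self.default_ttl)
        return value

    def write(self, key, value):
        self.db.put(key, value)       # write to the source of truth first...
        self._store.pop(key, None)    # ...then DELETE the cache key (not update)
```

The write path deletes rather than updates the key, which sidesteps races between concurrent writers racing to populate the cache with different values.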
Phase 3: High-Level Architecture (5-7 minutes)
Draw the core cache-aside flow on the whiteboard, then walk the interviewer through the four data flows (reference the full System Architecture diagram above for the complete multi-layer picture):
- Read path → App checks Redis first; on miss, reads from PostgreSQL, populates cache with data-type-specific TTL
- Write path → App writes to PostgreSQL, then invalidates the cache key (delete, not update — avoids race conditions)
- Invalidation propagation → For critical data, CDC (change data capture) publishes invalidation events as a backstop — so even if the application forgets to invalidate, the database change stream catches it
- Failure path → When Redis is unavailable, requests fall through to PostgreSQL with circuit breaker protection and request coalescing (singleflight) to prevent thundering herd
Scripted walkthrough: "Read path: the app server checks Redis first — cache-aside. On miss, reads from PostgreSQL, populates cache with TTL. We use request coalescing so if 100 concurrent requests miss on the same key, only one hits the database. Write path: app writes to PostgreSQL, then invalidates the cache key. CDC publishes invalidation events as a backstop."
Key points to hit on the whiteboard:
- Cache-aside pattern — application controls both read and write paths (not write-through, which couples cache to write latency)
- Redis Cluster with hash slots — 6 shards for horizontal scaling, consistent hashing for key distribution
- Request coalescing — singleflight pattern prevents thundering herd on cache miss
- Write-path invalidation — delete on write, not update; avoids race conditions between concurrent writers
- CDN as first cache layer — browser cache → CDN edge → Redis → PostgreSQL; four-layer hierarchy
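The request-coalescing point can be sketched with a lock and per-key events. This is a single-process illustration of the singleflight pattern (no timeouts, and errors are not propagated to waiters; names are ours, not a library API):

```python
import threading

class SingleFlight:
    """Request coalescing: concurrent callers for the same key share one
    origin fetch. A sketch only: no timeout, no error sharing with waiters."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (Event, result holder)

    def do(self, key, fetch):
        with self._lock:
            if key in self._inflight:
                event, holder = self._inflight[key]   # join the in-flight fetch
                leader = False
            else:
                event, holder = threading.Event(), {}
                self._inflight[key] = (event, holder)
                leader = True                          # this caller hits origin
        if leader:
            try:
                holder["value"] = fetch()
            finally:
                with self._lock:
                    del self._inflight[key]
                event.set()                            # release the waiters
            return holder["value"]
        event.wait()
        return holder["value"]
```

With this in the read path, 100 concurrent misses on the same key produce one origin fetch; the other 99 callers block briefly and reuse the result.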
Then immediately flag the key tension: "This works for the happy path. The interesting questions are: what happens when Redis goes down and 100% of traffic hits PostgreSQL? Who owns the invalidation contract when 5 different services write to the same entity? And how do you detect that your cache is serving stale data without anyone noticing?"
Phase 4: Transition to Depth (1 minute)
At this point you have a correct, simple architecture on the board. Now you pivot:
"The basic architecture is well-understood — cache-aside with Redis and TTL-based expiry. What makes this Staff-level is the consistency and operational reasoning. Let me dive into three areas: (1) invalidation strategy and who owns it, (2) failure modes when the cache layer goes down, (3) how to detect and measure staleness in production."
Then offer the interviewer a choice:
"I can go deep on any of these. Which is most interesting to you?"
If the interviewer doesn't have a preference, lead with invalidation strategy — it's the most universally asked and the most misunderstood.
Phase 5: Deep Dives (25-30 minutes)
The interviewer will steer, but be prepared to go deep on any of these. For each, follow the Staff pattern: state the tradeoff → pick a position → quantify the cost → explain who absorbs that cost.
Fault Line 1: Freshness vs performance — the staleness budget (5-7 min)
Open with the business framing:
"Every cache entry has a staleness budget — the maximum age the business will tolerate. For product prices, that's <30 seconds (stale prices lose money). For product descriptions, 5 minutes is fine (nobody cares if a typo fix takes 5 minutes to propagate). For user sessions, it's zero (stale session = security vulnerability)."
Go deeper — walk through the TTL decision framework:
- Classify data types by staleness tolerance: real-time (<10s), near-real-time (30s-5min), eventual (>5min)
- For real-time data: TTL is a safety net, not the primary mechanism. Use event-driven invalidation (CDC or explicit delete-on-write)
- For near-real-time: TTL alone is sufficient. Set TTL = staleness_budget × 0.8 (leave 20% margin for clock skew)
- For eventual: Long TTL (hours/days) with background refresh. These entries are the highest-value cache entries — they offload the most database reads
The Staff follow-up: "The dangerous case is when someone sets a 24-hour TTL on price data because 'it rarely changes.' It doesn't change — until it does, and then customers see stale prices for 24 hours. That's why TTL ownership should be in the product spec, not the code."
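The 0.8 rule as code, with illustrative per-type budgets (the values and the `ttl_for` name are assumptions for illustration, matching the classification above):

```python
# Illustrative staleness budgets in seconds, per the classification above.
STALENESS_BUDGET = {
    "price": 30,                 # real-time: event-driven invalidation, TTL is a safety net
    "product_description": 300,  # near-real-time: TTL alone is sufficient
    "category_tree": 86400,      # eventual: long TTL plus background refresh
}

def ttl_for(data_type: str) -> int:
    """TTL = staleness_budget * 0.8; the 20% margin absorbs clock skew."""
    return int(STALENESS_BUDGET[data_type] * 0.8)
```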
Cross-reference §3.1 Freshness vs Performance for the full analysis.
Fault Line 2: Cache failure and thundering herd (5-7 min)
"When Redis goes down, every request becomes a cache miss. If you have a 95% hit rate and 10K requests/second, that means 9,500 requests/second that were hitting cache now hit PostgreSQL directly. Your database is provisioned for 500 requests/second. It dies."
Name the mitigations in order of priority:
- Request coalescing (singleflight) — if 100 concurrent requests miss on key X, only one hits the database; the other 99 wait for the result. This alone handles 80% of thundering herd scenarios
- Circuit breaker on the database — if database latency exceeds threshold, reject new requests with a fallback (stale data from local cache, degraded response, or 503)
- Stale-while-revalidate — serve the expired cache entry while refreshing in the background. The client gets slightly stale data instead of a slow or failed response
- Local in-process cache (L1) — small LRU cache in each app server (1000 hot keys). Survives Redis failures with degraded consistency
Quantify the recovery: "With singleflight + circuit breaker, a Redis outage degrades latency from 2ms to 50ms (database reads) but doesn't cascade. Without protection, the database dies within 30 seconds and recovery takes 5-10 minutes because the connection pool is exhausted."
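A single-process sketch of stale-while-revalidate, assuming one background thread per stale read (`fetch` and the class name are illustrative):

```python
import threading
import time

class StaleWhileRevalidate:
    """Serve an expired entry immediately while refreshing in the background.
    Sketch only: single process, one refresh thread per stale read."""

    def __init__(self, fetch, ttl=60):
        self.fetch = fetch
        self.ttl = ttl
        self._store = {}        # key -> (value, fresh_until)
        self._refreshing = set()
        self._lock = threading.Lock()

    def _refresh(self, key):
        try:
            self._store[key] = (self.fetch(key), time.monotonic() + self.ttl)
        finally:
            with self._lock:
                self._refreshing.discard(key)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:                      # cold miss: must fetch inline
            value = self.fetch(key)
            self._store[key] = (value, time.monotonic() + self.ttl)
            return value
        value, fresh_until = entry
        if time.monotonic() >= fresh_until:    # stale: serve it, refresh async
            with self._lock:
                if key not in self._refreshing:
                    self._refreshing.add(key)
                    threading.Thread(target=self._refresh, args=(key,)).start()
        return value
```

The caller always gets an immediate answer; the cost of freshness is paid off the request path.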
Fault Line 3: Invalidation strategy — proactive vs reactive (5-7 min)
"There are three invalidation approaches: (a) TTL-only (reactive — you wait for expiry), (b) explicit invalidation on write (proactive — you delete when data changes), (c) CDC-driven invalidation (proactive + reliable — the database change stream triggers invalidation). I'd use a combination: TTL as a safety net + explicit invalidation for the write path + CDC as a backstop."
The organizational problem: "With 5 microservices writing to the products table, who is responsible for invalidating the product cache? If the price service updates the price but doesn't invalidate the cache, the product service serves stale prices. That's why CDC is the backstop — it doesn't depend on every service remembering to invalidate."
Cache-aside vs write-through vs read-through (3-5 min)
"Cache-aside gives the application full control — it's the most common pattern and the most debuggable. Write-through couples your write latency to cache latency (now every write is DB + cache round trip). Read-through hides the caching logic in a library but makes debugging harder — when data is stale, you can't tell if the cache missed, the TTL is wrong, or the invalidation failed."
Pick a position: "I default to cache-aside for most systems. Write-through only when the write volume is low and you need guaranteed cache warmth (e.g., configuration data). Read-through only when you want the cache to be the primary interface and can accept the debugging cost."
Operational maturity: measuring staleness in production (3-5 min)
"How do you know your cache is serving stale data? You can't just check TTLs — you need to measure actual staleness. The approach: periodically sample cache entries, compare their version/etag against the database source of truth, and report the staleness distribution."
Three metrics that matter:
- Hit rate per data type — if product prices have a 99.9% hit rate with a 30s TTL, you're serving a lot of cached prices. Is that safe?
- Staleness distribution — p50/p95/p99 age of cache entries at read time. If p99 staleness exceeds the business SLA, your TTL or invalidation is broken
- Invalidation propagation delay — time between database write and cache invalidation completion. If this exceeds 5 seconds, your "real-time" invalidation isn't real-time
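The sampling approach can be sketched as below. `get_version` on both sides is an assumed interface (an etag or version column), not a real client method:

```python
import random

def stale_fraction(cache, db, keys, sample_size=100):
    """Sample cached entries and compare versions against the source of truth.
    Entries absent from the cache are skipped; returns the fraction stale."""
    sample = random.sample(list(keys), min(sample_size, len(keys)))
    checked = stale = 0
    for k in sample:
        cached = cache.get_version(k)
        if cached is None:          # not cached, so nothing to be stale
            continue
        checked += 1
        if cached != db.get_version(k):
            stale += 1
    return stale / checked if checked else 0.0
```

Run on a schedule, this gives an observed staleness rate to report alongside the p50/p95/p99 age distribution, rather than trusting TTLs on faith.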
Phase 6: Wrap-Up (2-3 minutes)
Summarize the key tradeoff — don't just restate your architecture, synthesize the insight:
"Distributed caching is a staleness management problem, not a performance optimization. The Staff-level challenge is: who defines the staleness budget, who owns the invalidation contract, and how do you detect when the contract is violated? The architecture is straightforward — cache-aside with Redis, TTL as safety net, CDC for proactive invalidation. The hard problem is organizational: making sure every write path has a corresponding invalidation path, and measuring staleness in production rather than assuming TTLs are correct."
If time permits, add the counterintuitive insight:
"Sometimes the right answer is to remove the cache. If your cache hit rate is 30%, you're paying for Redis infrastructure to serve 30% of reads while adding invalidation complexity for 100% of writes. At that point, scale the database instead. Caching is not free — it's a complexity trade for latency. Only make that trade when the hit rate justifies it."
Common Timing Mistakes
| Mistake | L5 Does This | L6 Does This |
|---|---|---|
| 10 min on requirements | Lists every data type to cache | States staleness budget concept in 1 min, moves on |
| 10 min on cache-aside | Explains get-check-miss-fill at tutorial pace | "Cache-aside with Redis. Here's the architecture." |
| No invalidation discussion | Only mentions TTL | Draws the write-path invalidation + CDC backstop proactively |
| No thundering herd | Waits for "what if Redis goes down?" | Volunteers singleflight + circuit breaker in the architecture phase |
| No staleness measurement | Assumes TTLs are correct | Proposes staleness sampling and p99 staleness metrics |
| No numbers | "It should be fast" | "95% hit rate → database sees 5% of traffic. Redis failure = 20x load increase." |
Reading the Interviewer
| Interviewer Signal | What They Care About | Where to Go Deep |
|---|---|---|
| Asks about consistency | Data correctness | Invalidation strategy, CDC, staleness measurement (§3.1) |
| Asks about Redis failure | Operational maturity | Thundering herd, circuit breaker, singleflight (§3.3) |
| Asks about write patterns | Architecture depth | Cache-aside vs write-through, invalidation ownership (§3.2) |
| Asks "who decides TTLs?" | Organizational design | Staleness budgets, product ownership, per-type policies |
| Asks about local cache | Performance engineering | L1/L2 cache hierarchy, consistency tradeoffs (§3.4) |
| Pushes back on cache-aside | Wants to see you reason about alternatives | Write-through for warm cache, read-through for abstraction |
What to Deliberately Skip
These topics are traps. L5 candidates spend time on them. Staff candidates name them, dismiss them, and redirect to what matters.
| Topic | Why L5 Goes Here | What L6 Says Instead |
|---|---|---|
| Redis vs Memcached | Feels like showing breadth | "Redis — richer data structures, persistence, Pub/Sub for invalidation." |
| LRU vs LFU eviction | Easy to explain | "LRU for general use, LFU for hot-key workloads. Not the interesting problem." |
| Cache warming strategies | Seems like completeness | "Pre-warm on deploy from the database. Straightforward." |
| Redis Sentinel vs Cluster | Infrastructure trivia | "Redis Cluster for sharding, Sentinel for HA on single-shard. Moving on." |
| Serialization format | Easy to enumerate | "Protocol Buffers for compact binary. JSON for debugging. Not a design decision." |
The pattern: acknowledge you know it, state your position in one sentence, redirect to the interesting problem — invalidation, staleness, and organizational ownership.
→ Continue to The Five Fault Lines (§3) for the Staff-grade tradeoff reasoning.
1. The Staff Lens
1.1 Why This Problem Exists in Staff Interviews
This is NOT a "speed up reads" question. Everyone knows how to add Redis.
This is a Consistency & Operational Ownership question that tests:
- Whether you understand that caching trades correctness for performance
- Whether you reason about invalidation contracts, not just TTLs
- Whether you can articulate failure scenarios and their blast radius
- Whether you understand the operational burden of cache infrastructure
1.2 The L5 vs L6 Contrast
| Behavior | L5 (Senior) Candidate | L6 (Staff) Candidate |
|---|---|---|
| First move | "Add Redis in front of the DB" | Asks "What's the read/write ratio? What's acceptable staleness?" |
| Invalidation | Defaults to TTL | Designs explicit invalidation contract tied to write path |
| Failure mode | "Add replicas" | "What's the thundering herd mitigation? Cache-miss circuit breaker?" |
| Consistency | Assumes caching always helps | Knows when caching hurts (write-heavy, low reuse, consistency-critical) |
| Ownership | Implementation focus | Platform thinking: who warms, who monitors, who gets paged |
Behavior 1: First move (understand the access pattern)
Staff signal: Characterize the workload before proposing architecture.
Why this matters (L5 vs L6)
L5: Jumps to "add a cache" without understanding the access pattern. This leads to caches with low hit rates (write-heavy data), or caches that make consistency bugs worse.
L6: Asks about read/write ratio, access skew (hot keys vs uniform), staleness tolerance, and data size. Then commits to a caching strategy that fits. "This is 95% reads with high locality — caching will help. If it were 50/50 reads/writes, I'd question whether caching adds value."
Behavior 2: Invalidation strategy (TTL is not a strategy)
Staff signal: Design an explicit invalidation contract before discussing eviction.
Why this matters (L5 vs L6)
L5: Says "TTL of 5 minutes" as the primary invalidation mechanism. TTL is a fallback, not a strategy. It means "we'll serve stale data for up to 5 minutes after a write."
L6: Designs invalidation around the write path: "On user update, we invalidate the user cache entry. TTL is a safety net for orphaned entries, not the primary mechanism." The Staff question is: who owns the invalidation contract, and what happens if it fails?
Behavior 3: Failure handling (replicas don't answer thundering herd)
Staff signal: Design for cache failure, not just cache slowness.
Why this matters (L5 vs L6)
L5: Treats cache failure as "add replicas / HA." That improves availability, but doesn't answer: what happens when the cache is cold (restart, failover, new deployment)? What prevents thousands of requests from stampeding the origin?
L6: Designs for thundering herd: request coalescing, cache-miss circuit breaker, probabilistic early expiry. Names the blast radius: "If the cache goes cold, can the origin handle the load? If not, we need a degraded mode."
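The "probabilistic early expiry" named here is often implemented with the XFetch rule from the cache-stampede literature: each reader refreshes early with a probability that rises as expiry approaches, so concurrent readers don't all miss at the same instant. A sketch:

```python
import math
import random
import time

def should_refresh_early(expires_at, recompute_time, beta=1.0):
    """XFetch-style check: treat 'now' as randomly skewed forward by a few
    multiples of recompute_time; refresh if the skewed now passes expiry.
    beta > 1 refreshes earlier (more conservative), beta < 1 later."""
    # 1 - random.random() is in (0, 1], so log(...) <= 0 and the skew >= 0.
    skew = -recompute_time * beta * math.log(1.0 - random.random())
    return time.monotonic() + skew >= expires_at
```

Callers check this on every read of a hot key; far from expiry the probability of an early refresh is negligible, and it approaches 1 as the deadline nears, spreading refreshes out instead of synchronizing misses.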
Behavior 4: Consistency model (know when caching hurts)
Staff signal: Articulate when caching makes the system worse.
Why this matters (L5 vs L6)
L5: Treats caching as universally good. "Caching always improves performance."
L6: Knows caching can hurt: write-heavy workloads (invalidation churn), low-reuse data (cache pollution), consistency-critical paths (stale reads cause bugs), small datasets that fit in DB buffer pool (redundant layer). The Staff move is to state when you wouldn't cache.
Behavior 5: Ownership (who warms, who monitors)
Staff signal: Design for the organization, not just one service.
Why this matters (L5 vs L6)
L5: Focuses on "how to implement caching" in isolation. Doesn't consider who warms the cache on deploy, who monitors hit rate, who gets paged when the cache is slow.
L6: Treats caching as a platform concern: standardized patterns, shared infrastructure, consistent observability. "Cache hit rate below 80% for 5 minutes pages the owning team. Warming is automated on deploy."
1.3 The Key Insight
Caching is a consistency problem disguised as a performance optimization. Staff engineers reason about who pays for staleness and who owns the invalidation contract.
2. Problem Framing & Intent
2.1 The Three Intents
Before drawing any boxes, ask Why? The caching strategy changes entirely based on intent:
| Intent | Constraint | Strategy | Staleness Tolerance | Failure Mode |
|---|---|---|---|---|
| Latency Reduction | Speed is everything | Aggressive caching, read-through, local caches | Seconds OK | Serve stale > fail |
| Origin Protection | Shield backend from load | Cache-aside, circuit breakers, request coalescing | Minutes OK | Degrade gracefully |
| Cost Optimization | Reduce expensive operations | Precompute + cache, longer TTLs, lazy refresh | Minutes-hours OK | Stale > recompute |
Naming a single intent and committing to it separates L5 from L6.
2.2 What's Intentionally Underspecified
The interviewer deliberately avoids specifying:
- Read/write ratio
- Data size and cardinality
- Staleness tolerance
- Cache budget (memory cost)
- Multi-region requirements
Staff engineers surface these unknowns. Senior engineers assume them away.
2.3 How to Open (The First 2 Minutes)
- Ask 1-2 clarifying questions about access pattern and staleness
- State your assumption explicitly
- Outline your plan: access pattern → caching strategy → invalidation → failure modes → observability
Example opening:
"Two questions before I design: what's the read/write ratio, and what staleness can each data type tolerate? I'll assume a read-heavy workload with per-type staleness budgets. My plan: access pattern → caching strategy → invalidation → failure modes → observability."
If Asked: How to characterize workload without sounding junior
What interviewers expect you to name:
- Read/write ratio (heavy reads = cache helps; balanced = question it)
- Access skew (hot keys = cache helps; uniform = less value)
- Data freshness requirements (real-time vs eventual)
- Data size (fits in memory? needs eviction strategy?)
What NOT to say:
- "We need to cache everything" (no selectivity)
- "5 minute TTL should be fine" (no reasoning about staleness)
- "Redis will handle it" (no architecture)
Staff-calibrated phrasing:
"This looks like 95% reads with high key locality, so caching will help. If it were 50/50 reads and writes, I'd question whether caching adds value at all."
2.4 Terminology (Use Precise Words)
Caching interviews are ambiguous about where and how caching happens. Use precise terms:
| Term | What It Means | Consistency Model |
|---|---|---|
| Local/In-Process Cache | Same process as application | Per-instance, no coordination |
| Distributed Cache | Shared cache cluster (Redis, Memcached) | Shared state, single source |
| CDN Cache | Edge caching for static/semi-static content | Eventually consistent |
| Database Buffer Pool | DB's internal page cache | Transparent to application |
And distinguish the caching patterns by who owns each path:
| Pattern | Write Path | Read Path | Invalidation |
|---|---|---|---|
| Cache-Aside | Write to DB, invalidate cache | Read cache, miss → read DB → populate cache | Explicit on write |
| Read-Through | Write to DB | Read cache, miss → cache fetches from DB | TTL or explicit |
| Write-Through | Write to cache → cache writes to DB | Read from cache | Implicit (write updates cache) |
| Write-Behind | Write to cache → async write to DB | Read from cache | Implicit (write updates cache) |
If Asked: Cache topology you should be able to articulate
Describe the layers, not implementation details: browser/CDN edge → L1 in-process → L2 distributed cache → database.
If pressed for specifics:
- L1: In-process LRU, 1000 entries, 10-second TTL
- L2: Redis cluster, sharded by key hash, 5-minute TTL
- Invalidation: Write path invalidates L2, L1 expires via short TTL
What you do NOT need:
- Exact memory sizes
- Redis cluster configuration
- Serialization format details
Staff insight: The topology is simple. The hard part is the invalidation contract and failure handling.
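A sketch of the two-tier shape, where `l2` is a plain dict standing in for a Redis client and the sizes and TTLs mirror the illustrative numbers above:

```python
import time
from collections import OrderedDict

class TwoTierCache:
    """L1: in-process LRU with a short TTL. L2: any dict-like shared cache
    (a real deployment would use a Redis client; `l2` is a stand-in)."""

    def __init__(self, l2, fetch, l1_size=1000, l1_ttl=10):
        self.l1 = OrderedDict()   # key -> (value, expires_at)
        self.l2 = l2
        self.fetch = fetch        # origin read on L2 miss
        self.l1_size = l1_size
        self.l1_ttl = l1_ttl

    def get(self, key):
        entry = self.l1.get(key)
        if entry and time.monotonic() < entry[1]:
            self.l1.move_to_end(key)          # LRU touch
            return entry[0]
        value = self.l2.get(key)
        if value is None:                     # L2 miss: read origin, fill L2
            value = self.fetch(key)
            self.l2[key] = value
        self.l1[key] = (value, time.monotonic() + self.l1_ttl)
        self.l1.move_to_end(key)
        if len(self.l1) > self.l1_size:
            self.l1.popitem(last=False)       # evict least-recently-used
        return value

    def invalidate(self, key):
        # Write path invalidates L2; L1 copies expire via the short TTL,
        # which bounds cross-instance staleness to l1_ttl seconds.
        self.l2.pop(key, None)
```

Note the aggregate staleness bound: an L1 hit can be up to `l1_ttl` seconds behind an L2 invalidation, which is exactly the consistency cost the multi-tier fault line trades for microsecond reads.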
2.5 When NOT to Cache (Staff Candidates Say No)
Staff candidates win interviews by knowing when to not cache. This is the highest-signal behavior.
Do NOT cache when:
| Scenario | Why Caching Hurts |
|---|---|
| Write-heavy data | Invalidation churn exceeds read savings. Cache becomes a write amplifier. |
| Low-reuse data | Each key read once. Cache fills with cold data, evicts hot data. |
| Consistency-critical paths | Auth, permissions, financial data. Stale = wrong = incident. |
| Data already hot in DB | DB buffer pool already caches it. You're adding a redundant layer. |
| Small datasets | If it fits in DB memory, caching adds latency (network hop) not savings. |
| Invalidation can't be owned | If no one knows when to invalidate, you will serve stale data forever. |
Staff Move: "Before I add caching, let me check if it's even appropriate. What's the read/write ratio? What's the reuse factor? What's the staleness tolerance? For write-heavy or consistency-critical data, I'd push back on caching entirely."
Bar-Raiser Follow-up: "When would you tell the team NOT to cache this data?"
Expected answer: "If the write rate is close to the read rate, invalidation will dominate. If the data is consistency-critical and staleness causes bugs or compliance issues, caching is the wrong tool."
3. The Five Fault Lines
This section contains the Staff-grade tradeoff reasoning. Each fault line includes:
- A tradeoff matrix
- Explicit "who pays" analysis
- L6 vs L7 calibration
- Bar-raiser follow-up questions
3.1 Fault Line 1: Freshness vs Performance
| Choice | What Works | What Breaks | Who Pays |
|---|---|---|---|
| Short TTL / Aggressive Invalidation | Fresh data | Higher origin load, more invalidation traffic | Infra (capacity), Eng (complexity) |
| Long TTL / Lazy Invalidation | Lower origin load | Stale data, user confusion | Product (UX), Support (complaints) |
The tradeoff: Every second of TTL is a second of potential staleness. But shorter TTLs mean more origin traffic.
L6 (Staff) answer: Ties staleness budget to business requirements. "For user profile data, 30 seconds is acceptable — users don't expect instant updates. For inventory counts, we need tighter consistency, so we use write-through invalidation with a short TTL fallback."
L7 (Principal) answer: Establishes org-wide staleness SLAs by data class. "We have three data tiers: real-time (no caching or <1s), near-real-time (sub-minute), and batch (hour+). Each tier has a standard pattern and observability."
3.2 Fault Line 2: Cache-Aside vs Read-Through vs Write-Through
| Pattern | Invalidation Ownership | Pros | Cons | Who Pays |
|---|---|---|---|---|
| Cache-Aside | Application owns invalidation | Explicit control, cache failure = DB fallback | Invalidation logic scattered, race conditions | Application teams (implement invalidation per service), Infra (cache operational burden) |
| Read-Through | Cache owns fetch | Simple read path, automatic population | Cache failure = read failure, less control | Infra (tight coupling to cache service), Application teams (lose control, cache downtime = service downtime) |
| Write-Through | Cache owns persistence | Always consistent, simple reads | Write latency includes cache, tight coupling | Users (slower writes), Infra (cache-DB coupling, complex failure modes) |
L6 (Staff) answer: Chooses cache-aside for most cases because it decouples cache availability from read availability. "If the cache is down, we fall back to the DB — slower, but not broken. Write-through couples us too tightly."
L7 (Principal) answer: Evaluates pattern choice based on failure modes and organizational capability. "Cache-aside requires every team to implement invalidation correctly. If we don't have that discipline, read-through with a managed cache might be safer despite the coupling."
3.3 Fault Line 3: Availability vs Consistency
Decision Framework:
| Context | Recommended | Why |
|---|---|---|
| User-facing reads | Serve stale | UX > perfect consistency |
| Financial data | Fail or hit origin | Correctness > availability |
| Inventory/stock | Depends | Oversell vs undersell tolerance |
L6 (Staff) answer: Classifies data by consistency requirements. "User preferences can serve stale — users won't notice a 30-second delay. But inventory must hit origin on cache miss because overselling is worse than latency."
L7 (Principal) answer: Defines consistency tiers as organizational policy with standard patterns and observability for each tier.
Ownership note: In practice, availability-first choices (serve stale) shift cost to Users (stale data experience) and Support (complaint volume). Consistency-first choices (fail or hit origin) shift cost to Engineering (circuit breaker complexity) and Infra (origin capacity to handle cache-miss load).
→ For the complete decision framework, see the Degraded Mode Framework — it applies to cache failures, origin protection, and graceful degradation.
3.4 Fault Line 4: Local Cache vs Distributed Cache
| Choice | Latency | Consistency | Memory Efficiency | Who Pays |
|---|---|---|---|---|
| Local (in-process) | Microseconds | Per-instance (inconsistent across instances) | Duplicated per instance | Users (inconsistent experience), Infra (N × cache memory cost) |
| Distributed (Redis) | Milliseconds | Shared (consistent across instances) | Shared pool | Users (network latency per request), Infra (Redis operational burden) |
| Multi-tier (L1 + L2) | Microseconds + fallback | L1 inconsistent, L2 consistent | Best of both | Engineering (two-tier invalidation complexity), Infra (manage both systems) |
The tradeoff: Local caches are fast but create consistency issues across instances. Distributed caches are consistent but add network latency.
L6 (Staff) answer: Uses multi-tier for high-traffic data. "L1 in-process cache with 10-second TTL absorbs repeat requests within a single instance. L2 Redis handles cross-instance consistency. L1 staleness is bounded by its short TTL."
L7 (Principal) answer: Standardizes multi-tier patterns across the org with clear guidance on when to use each tier and how to reason about aggregate staleness (L1 TTL + L2 TTL).
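The two-tier read path described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: `TwoTierCache`, `load_from_origin`, and the dict standing in for Redis are all names invented here for the sketch.

```python
import time

class TwoTierCache:
    """Minimal L1 (in-process dict) over L2 (shared store) sketch."""

    def __init__(self, l2_store, l1_ttl=10.0, clock=time.monotonic):
        self.l1 = {}               # key -> (value, expires_at)
        self.l2 = l2_store         # stands in for Redis; any dict-like store works
        self.l1_ttl = l1_ttl
        self.clock = clock

    def get(self, key, load_from_origin):
        now = self.clock()
        hit = self.l1.get(key)
        if hit and hit[1] > now:           # fresh L1 entry: the microsecond path
            return hit[0]
        value = self.l2.get(key)           # L2: shared, consistent across instances
        if value is None:
            value = load_from_origin(key)  # miss everywhere: hit the origin
            self.l2[key] = value
        self.l1[key] = (value, now + self.l1_ttl)  # staleness bounded by short L1 TTL
        return value
```

Note that worst-case staleness is the sum of both tiers' TTLs, which is exactly the aggregate-staleness reasoning the L7 answer calls for.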
3.5 Fault Line 5: Proactive vs Reactive Invalidation
| Approach | Freshness | Complexity | Failure Mode | Who Pays |
|---|---|---|---|---|
| Proactive (push on write) | Immediate | High (coordination) | Failed invalidation = stale data | Engineering (invalidation logic in all write paths), Infra (coordination overhead) |
| Reactive (TTL expiry) | Delayed | Low | Guaranteed staleness up to TTL | Product (stale data complaints), Support (explain delays to users) |
| Hybrid (push + TTL fallback) | Immediate with safety | Medium | Best of both | Engineering (implementation complexity), but reduces Product/Support burden |
L6 (Staff) answer: Uses hybrid — proactive invalidation on write with TTL as safety net. "On user update, we delete the cache key. TTL handles cases where invalidation fails or the write path changes."
L7 (Principal) answer: Implements change data capture (CDC) for invalidation at scale. "Rather than coupling invalidation to every write path, we tail the DB changelog and invalidate asynchronously. This decouples writers from cache knowledge."
→ For invalidation coordination patterns, see →Distributed Coordination Framework.
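The CDC approach in the Principal answer above can be sketched as a changelog consumer. This is a schematic, assuming Debezium-style change events of the form `{"table": ..., "pk": ...}` and a dict standing in for the cache; a real consumer would tail Kafka or the DB replication log.

```python
def invalidate_from_changelog(cache, events):
    """CDC-style invalidation sketch: instead of every writer knowing about the
    cache, one consumer tails the DB changelog and deletes affected keys
    asynchronously, decoupling write paths from cache knowledge."""
    for event in events:
        # Key scheme "table:pk" is an assumption for this sketch.
        cache.pop(f'{event["table"]}:{event["pk"]}', None)
```

Because invalidation is asynchronous, there is a short staleness window between the DB commit and the changelog event being consumed; the TTL fallback from the hybrid pattern still applies.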
44. Failure Modes & Degradation
→ This section applies the →Degraded Mode Framework. Review it if you need the full availability vs consistency decision tree.
4.1 Thundering Herd (Cache Stampede)
The most common cache failure mode. When a popular cache entry expires or the cache restarts, many requests simultaneously miss the cache and hit the origin.
Scenario: Popular Cache Key Expires
Timeline:
t=0: Cache entry for "homepage_data" expires (TTL)
t=0-1s: 1000 concurrent requests hit cache → all miss
t=0-1s: 1000 requests hit database simultaneously
t=1-2s: Database CPU spikes to 100%, queries slow
t=2-5s: Database connection pool exhausted
t=5s+: Cascading failures, homepage down
What breaks first: Database connection pool, then query latency, then availability.
Mitigations:
| Mitigation | How It Works | Tradeoff |
|---|---|---|
| Request coalescing | First request fetches; others wait | Added latency for waiters |
| Probabilistic early expiry | Expire slightly before TTL (jitter) | Some extra origin load |
| Cache-miss circuit breaker | Limit concurrent origin requests | Some requests fail |
| Background refresh | Refresh before expiry | Complexity, always slightly stale |
Staff answer: Layer the mitigations: request coalescing so only one request per key reaches the origin, plus jittered early expiry so popular keys don't expire in lockstep. A circuit breaker on cache-miss traffic caps origin load if both fail.
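Request coalescing, the first mitigation in the table above, can be sketched as a single-flight guard. This is an illustrative sketch (class and method names are invented here); error propagation to waiters is omitted for brevity.

```python
import threading

class Coalescer:
    """Single-flight sketch: the first caller for a key fetches from the origin;
    concurrent callers wait and share that result instead of stampeding."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}   # key -> Event carrying the fetched result

    def get(self, key, fetch):
        with self._lock:
            ev = self._inflight.get(key)
            if ev is None:                 # we are the leader for this key
                ev = threading.Event()
                self._inflight[key] = ev
                leader = True
            else:
                leader = False
        if leader:
            try:
                ev.result = fetch(key)     # only one origin call per key
            finally:
                with self._lock:
                    del self._inflight[key]
                ev.set()
            return ev.result
        ev.wait()                          # followers block until the leader finishes
        return ev.result
```

The tradeoff from the table is visible here: followers pay the leader's fetch latency instead of issuing their own origin request.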
4.2 Hot Key Problem
Analogous to hot keys in rate limiting: one cache key receives disproportionate traffic, overwhelming a single cache shard.
Scenario: Viral Content
Timeline:
t=0: Content goes viral
t=1min: Traffic to one key grows 100x
t=2min: Single Redis shard CPU at 100%
t=3min: Shard latency spikes, timeouts begin
t=5min: Shard becomes unavailable
t=5min+: All requests for hot key fail
Mitigations:
| Mitigation | How It Works | Tradeoff |
|---|---|---|
| Local cache (L1) for hot keys | Absorb traffic at application tier | Staleness across instances |
| Key replication | Replicate hot keys across shards | Memory overhead, invalidation complexity |
| Read replicas | Direct hot-key reads to replicas | Eventual consistency |
Staff answer: Absorb hot-key traffic in an L1 local cache with a short TTL, and replicate the key across shards if the application tier alone isn't enough. Detection matters as much as mitigation: per-key QPS metrics should flag a hot key before a shard saturates.
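Key replication, the second mitigation above, can be sketched by fanning a hot key out to N suffixed copies and reading a random one. Names here are illustrative; the cost shown in the table (memory overhead, N-way invalidation) is visible in `write_replicated`.

```python
import random

def replicated_key(key, n_replicas, rng=random):
    """Pick one of N replica keys ("product:42#0" .. "#N-1") so reads for a hot
    key spread across shards instead of hammering one."""
    return f"{key}#{rng.randrange(n_replicas)}"

def write_replicated(cache, key, value, n_replicas):
    # Fan-out write: every replica must be written (and later invalidated),
    # which is the memory and invalidation overhead from the table.
    for i in range(n_replicas):
        cache[f"{key}#{i}"] = value

def read_replicated(cache, key, n_replicas, rng=random):
    return cache.get(replicated_key(key, n_replicas, rng))
```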
4.3 Cold Start (Cache Miss Storm)
Scenario: Cache Cluster Restart
Timeline:
t=0: Redis cluster restarts (maintenance, failure)
t=0: 100% cache miss rate
t=0-1m: All reads hit database
t=1m: Database connection pool exhausted
t=2m: Service degradation or outage
Why "just restart" doesn't work: If your cache handles 90% of read load, a cold cache sends 10x the normal load to the origin, all at once.
Mitigations:
| Mitigation | How It Works | Tradeoff |
|---|---|---|
| Cache warming on deploy | Pre-populate cache before traffic shift | Deployment complexity, warming time |
| Gradual traffic shift | Slowly move traffic to new cache | Longer rollout, coordination |
| Origin capacity headroom | Size origin for cache-miss load | Cost (overprovisioned DB) |
| Stale-while-revalidate | Serve old cache + async refresh | Need persistent cache |
Staff answer: Treat cold start as a capacity problem, not a cache problem. Warm the cache before cutover, keep enough origin headroom to survive a realistic miss storm, and gate miss traffic with a circuit breaker so a cold cache degrades the service instead of taking it down.
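Cache warming with a cutover gate, the first mitigation above, can be sketched as below. This is a simplified illustration: function names are invented, and `top_keys` is assumed to come from real traffic data (e.g. yesterday's access log ordered by request count), so the first entries restore most of the hit rate.

```python
def warm_cache(cache, origin_fetch, top_keys):
    """Pre-populate the hottest keys before shifting traffic to a fresh cache."""
    warmed = 0
    for key in top_keys:
        cache[key] = origin_fetch(key)
        warmed += 1
    return warmed

def ready_for_cutover(warmed, total_hot_keys, target_fraction=0.95):
    # Block the traffic shift until warming covers enough of the hot set;
    # warming should gate the cutover, not run in parallel with it.
    return warmed >= target_fraction * total_hot_keys
```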
4.4 Cache Inconsistency (Stale Data Bugs)
Scenario: Failed Invalidation
Timeline:
t=0: User updates their email
t=0: DB write succeeds
t=0: Cache invalidation fails (network blip)
t=0-5m: User sees old email in UI
t=5m: TTL expires, fresh data served
What makes this insidious: Silent failure. No errors, no alerts. Just wrong data.
Mitigations:
| Mitigation | How It Works | Tradeoff |
|---|---|---|
| Write-through invalidation | Invalidate in same transaction | Coupling, latency |
| Version/generation stamps | Include version in cache key | Key cardinality |
| Idempotent invalidation | Retry invalidation | Complexity |
| Short TTL fallback | Limit staleness window | More origin load |
Staff answer: Make invalidation idempotent and retried, and bound the damage with a TTL fallback. For data where staleness is expensive, version-stamped keys make stale entries unreachable instead of hoping the delete succeeded.
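Version/generation stamps from the table above can be sketched as follows. This is a schematic with invented names; it assumes the source of truth carries a version that is bumped on every write (a version column, a version table, or a message header).

```python
def versioned_key(entity, entity_id, version):
    """The version is part of the cache key, so a write makes old entries
    unreachable rather than relying on a delete that might fail. Cost: higher
    key cardinality; old generations linger until TTL or eviction removes them."""
    return f"{entity}:{entity_id}:v{version}"

def read_profile(db, cache, user_id):
    version = db[user_id]["version"]        # cheap version lookup
    key = versioned_key("user", user_id, version)
    if key in cache:
        return cache[key]
    cache[key] = dict(db[user_id])          # populate under the current version
    return cache[key]
```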
4.5 Operational Reality Matrix
| Failure | Loud/Silent | User Impact | Detection Time |
|---|---|---|---|
| Cache down | Loud | Latency spike or errors | Seconds |
| Cache slow | Medium | Latency degradation | Minutes |
| Thundering herd | Loud | Origin overload | Seconds to minutes |
| Hot key | Medium | Single-key latency | Minutes |
| Stale data | Silent | Wrong data shown | Hours to never |
| Cold start | Loud | Origin overload | Seconds |
5. Evaluation Rubric
5.1 Level-Based Signals
| Dimension | L5/Senior | L6/Staff | L7/Principal |
|---|---|---|---|
| Access Pattern | Assumes caching helps | Characterizes workload; knows when caching hurts | Establishes patterns by data classification |
| Invalidation | TTL-only | Explicit invalidation contract + TTL fallback | CDC-based invalidation, org-wide patterns |
| Consistency | Assumes fresh data | Articulates staleness budget by data type | Defines consistency tiers as policy |
| Failure Modes | "Add replicas" | Thundering herd, hot key, cold start mitigations | Capacity planning for cache-miss scenarios |
| Multi-tier | Single cache layer | L1 + L2 with clear reasoning | Standardized multi-tier patterns |
| Ownership | Implementation focus | Warming, monitoring, paging ownership | Platform-wide caching strategy |
5.2 Strong Hire Signals
| Signal | What It Looks Like |
|---|---|
| Staleness Reasoning | "30 seconds of staleness is acceptable for this use case because..." |
| Invalidation Design | "We invalidate on write, with TTL as safety net for failed invalidations" |
| Failure Awareness | "When the cache is cold, we need to protect the origin with circuit breakers" |
| Ownership Thinking | "Who warms the cache on deploy? Who gets paged when hit rate drops?" |
5.3 Lean No Hire Signals
| Signal | What It Looks Like |
|---|---|
| Redis Fixation | 15 minutes on Redis internals without tradeoffs |
| TTL-Only Thinking | "We'll set a 5-minute TTL" with no invalidation strategy |
| Ignoring Failures | No mention of thundering herd, cold start, or stale data |
| Missing Intent | Caches everything without reasoning about staleness tolerance |
5.4 Common False Positives
- Knows Redis deeply: Deep Redis knowledge ≠ good cache design
- Mentions all patterns: Breadth without depth is Senior, not Staff
- Complex diagrams: Multi-tier diagrams without invalidation reasoning
6. Interview Flow & Pivots
6.1 Typical 45-Minute Structure
| Phase | Time | What Happens |
|---|---|---|
| Framing | 5 min | Clarify access pattern, staleness tolerance |
| Requirements | 5 min | Read/write ratio, data size, consistency needs |
| High-Level Design | 10 min | Caching pattern, invalidation strategy |
| Deep Dive | 15 min | Failure modes, thundering herd, consistency |
| Wrap-Up | 10 min | Operations, monitoring, evolution |
6.2 How Interviewers Pivot
| After You Say... | They Will Probe... |
|---|---|
| After "add Redis" | "What's your invalidation strategy?" |
| After invalidation discussion | "What happens during thundering herd?" |
| After scaling discussion | "How do you handle hot keys?" |
| After happy path | "What if the cache is completely cold?" |
6.3 What Silence Means
- After tradeoff question: Interviewer wants you to reason aloud
- After "what about consistency?": You're missing staleness reasoning
- After definitive answer: They may disagree or want nuance
6.4 Follow-Up Questions to Expect
- "What happens when the cache is cold?"
- "How do you handle thundering herd?"
- "What's your invalidation strategy on write?"
- "How do you detect stale data bugs?"
- "What's the staleness budget for this data?"
- "When would you NOT cache this data?"
7. Active Drills
Practice these scenarios to internalize Staff-level thinking. Try answering before revealing the Staff approach.
Drill 1: The Opening (Access Pattern + Staleness)
Interview prompt: "Design a caching layer for our product catalog."
Staff Answer
| Step | Staff Answer |
|---|---|
| Clarify | Ask about read/write ratio, catalog size, staleness tolerance, traffic patterns |
| Assume | "I'll assume 95% reads, 100K products, 1-minute staleness acceptable" |
| Outline | Access pattern → caching strategy → invalidation → failure modes → observability |
Why this is L6:
- Starts with access-pattern discovery before choosing a technology — intent-driven design, not solution-first
- States explicit staleness assumptions up front — shows product-awareness and frames the tradeoff space
- Includes failure modes and observability in the outline — proves ownership extends beyond the happy path
Drill 2: Invalidation Strategy
Interview prompt: "How do you keep the cache consistent with the database?"
Staff Answer
| Step | Staff Answer |
|---|---|
| Primary | Write-path invalidation: "On product update, delete cache key" |
| Fallback | TTL as safety net: "5-minute TTL catches failed invalidations" |
| Edge cases | Race conditions, eventual consistency, version stamps if needed |
Why this is L6:
- Layers primary invalidation with a TTL safety net — defense-in-depth thinking, not single-strategy reliance
- Calls out race conditions and version stamps — anticipates failure modes a Senior would overlook
- Treats consistency as a spectrum, not a binary — articulates tradeoffs rather than picking one extreme
Drill 3: Thundering Herd
Interview prompt: "A popular cache key expires. What happens?"
Staff Answer
| Step | Staff Answer |
|---|---|
| Problem | 1000 requests hit origin simultaneously |
| Mitigation | Request coalescing: first request fetches, others wait |
| Prevention | Probabilistic early expiry (jitter), background refresh |
Why this is L6:
- Separates mitigation (coalescing) from prevention (jitter, background refresh) — shows systems thinking across time horizons
- Names the specific failure cascade (1000 requests hit origin) — quantifies blast radius rather than hand-waving
- Proposes probabilistic early expiry — demonstrates awareness of techniques that prevent problems at scale, not just react to them
Drill 4: Cache Failure
Interview prompt: "Redis is down. What happens to your service?"
Staff Answer
| Step | Staff Answer |
|---|---|
| Decide | By data type: user data falls back to DB, financial data fails explicitly |
| Protect | Circuit breaker on origin, request limiting |
| Observe | Alert on cache unavailability, monitor origin load |
→ Review →Degraded Mode Framework for the complete decision tree.
Why this is L6:
- Differentiates fallback strategy by data type (user data vs financial data) — not a one-size-fits-all answer
- Explicitly chooses "fail loudly" for financial data — demonstrates safety-first thinking over availability bias
- Includes circuit breakers and origin protection — owns the downstream impact, not just the cache layer
Drill 5: Hot Key
Interview prompt: "One product gets 90% of traffic. What breaks?"
Staff Answer
| Step | Staff Answer |
|---|---|
| Problem | Single cache shard overwhelmed |
| Mitigation | L1 local cache for hot keys, key replication |
| Detection | Metrics on per-key QPS, automated hot-key detection |
Why this is L6:
- Identifies the root infrastructure failure (single shard overwhelmed) — reasons about the system layer, not just the application layer
- Proposes L1 local cache and key replication — shows multi-tier caching awareness beyond basic Redis usage
- Includes automated hot-key detection — builds operational feedback loops rather than relying on manual monitoring
Drill 6: Consistency vs Performance
Interview prompt: "Users complain they see stale data after updating their profile."
Staff Answer
| Step | Staff Answer |
|---|---|
| Diagnose | Is invalidation failing, or is TTL too long? |
| Fix | Write-through invalidation, shorter TTL, or read-your-writes pattern |
| Communicate | Product decision: "changes may take a moment" vs immediate consistency |
Why this is L6:
- Starts with diagnosis before jumping to a fix — distinguishes root cause from symptoms
- Offers read-your-writes as a targeted pattern — applies the right consistency model for the use case, not blanket strong consistency
- Brings product communication into a technical answer — recognizes that user expectation is an engineering constraint, not just a PM concern
Drill 7: Build vs Buy
Interview prompt: "Should we use Redis, Memcached, or a managed service?"
Staff Answer
| Step | Staff Answer |
|---|---|
| Evaluate | Data structures needed, persistence requirements, operational capacity |
| Compare | Self-managed (control, cost) vs managed (ops burden, features) |
| Recommend | Usually managed unless specific requirements demand self-hosted |
Why this is L6:
- Evaluates operational capacity as a first-class criterion — understands that team ability to run infrastructure matters as much as features
- Frames the decision around requirements, not preferences — avoids the "I like Redis" trap that signals Senior-level thinking
- Defaults to managed with an explicit escape hatch — shows organizational awareness of where engineering time is best spent
→ For the complete framework, see →Build vs Buy Framework.
Drill 8: Multi-Region Caching
Interview prompt: "We're expanding to Europe. How does caching change?"
Staff Answer
| Step | Staff Answer |
|---|---|
| Options | Per-region caches (simple, inconsistent) vs global cache (complex, consistent) |
| Tradeoff | Cross-region latency (100ms+) vs staleness across regions |
| Recommend | Per-region caches with async invalidation for most use cases |
Why this is L6:
- Lays out both architecture options with clear tradeoff dimensions (latency vs consistency) — structured decision-making, not gut feel
- Quantifies the cross-region latency cost (100ms+) — grounds the tradeoff in real numbers that drive the recommendation
- Recommends async invalidation as the default — balances pragmatism with correctness rather than chasing perfect global consistency
Drill 9: Ownership Conflict — TTL Disagreement
Interview prompt: "Product wants 1-hour TTL for faster page loads. Infra says 5 minutes max because of stale data risk. You're the Staff engineer. How do you resolve this?"
Staff Answer
| Step | Staff Answer |
|---|---|
| Reframe | This isn't a TTL debate — it's a staleness tolerance question. What's the actual business impact of stale data? |
| Investigate | What data is this? User profile (1hr OK)? Inventory (5min risky)? Pricing (unacceptable)? |
| Propose | Data classification: different TTLs for different data sensitivity. Not one global TTL. |
| Ownership | Product owns staleness SLA per data type. Infra provides the mechanisms. |
| Document | Write down: "User preferences: 1hr TTL approved by Product. Inventory: 5min with write-through invalidation." |
Staff insight: The conflict exists because nobody defined the staleness contract. The fix is explicit ownership, not splitting the difference on TTL.
Why this is L6:
- Reframes a technical disagreement as a missing contract — solves the organizational root cause, not the surface argument
- Introduces data classification with per-type TTLs — shows that the right answer is nuanced, not a single compromise number
- Assigns explicit ownership (Product owns staleness SLA, Infra provides mechanisms) — demonstrates cross-team boundary thinking
Drill 10: Ownership Conflict — Cache Hides a Bug
Interview prompt: "A pricing bug went unnoticed for 2 weeks because the cache kept serving correct (cached) prices while the DB had wrong values. Now leadership wants to know why QA didn't catch it. What do you say?"
Staff Answer
| Step | Staff Answer |
|---|---|
| Diagnose | The cache masked the bug. QA tested via the cache (fast path), never hit the DB (slow path). |
| Root cause | No cache-bypass testing. No correctness monitoring comparing cache vs source. |
| Immediate fix | Add cache-bypass test cases. Add periodic cache-vs-DB consistency checks. |
| Systemic fix | Cache correctness metrics: sample reads and compare to source of truth. Alert on divergence. |
| Ownership | Who owns cache correctness? Not QA — they test features. Infra or platform team owns cache health. |
Staff insight: Caches amplify bugs by serving wrong answers faster and more consistently. The fix is correctness observability, not blaming QA.
Why this is L6:
- Redirects blame away from QA toward a systemic gap (no cache-bypass testing) — shows leadership maturity in incident response
- Proposes cache-vs-source correctness monitoring — builds continuous verification, not just one-time test coverage
- Identifies that caches amplify bugs as an architectural insight — reasons about emergent system behavior, not just component behavior
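The cache-vs-source correctness monitoring proposed in this drill can be sketched as a sampling check. This is an illustrative sketch with invented names; production versions would sample on the live read path and emit the divergence rate as a metric to alert on.

```python
import random

def sample_cache_correctness(cache, db, keys, sample_rate=0.01, rng=random):
    """On a small sample of reads, fetch both the cached value and the source
    of truth and count divergence. Alerting on the divergence rate catches the
    silent 'cache hides a DB bug' failure class."""
    checked = divergent = 0
    for key in keys:
        if rng.random() >= sample_rate:   # skip most reads: keep origin load low
            continue
        checked += 1
        if cache.get(key) != db.get(key):
            divergent += 1
    return checked, divergent
```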
8. Deep Dive Scenarios
Scenario-based analysis for Staff-level depth
These scenarios test Staff-level operational thinking. Unlike drills (which test interview responses), deep dives test ownership reasoning.
Deep Dive 1: Black Friday Cache Failure
Staff Answer
| Phase | What to do |
|---|---|
| Immediate (0-5 min) | Is this a cache problem or an origin problem? Check if origin can handle the extra 25% load. |
| Triage | Hot keys (one product viral)? Memory pressure (evictions)? Network saturation? |
| Quick fix | If hot keys, enable L1 caching. If memory, increase cluster size or reduce TTL for low-value keys. |
| Guardrails | Circuit breaker on origin, shed low-priority traffic if needed. |
| Post-mortem | Why didn't capacity planning catch this? Load test with Black Friday traffic patterns. |
Staff insight: A 25-point cache hit rate drop on Black Friday multiplies DB load, because origin load scales with the miss rate: falling from a 95% to a 70% hit rate grows miss traffic from 5% to 30% of requests, a 6x increase. The cache isn't the patient — the origin is.
Deep Dive 2: Stale Data Incident
Staff Answer
| Dimension | Staff Answer |
|---|---|
| Root cause | Payment data was cached with 3-hour TTL, no write-through invalidation |
| Immediate | Purge cache for affected customer, verify current state |
| System fix | Payment data should not be cached, OR write-through invalidation with <1min TTL |
| Process fix | Data classification policy: financial data requires explicit cache approval |
| Broader question | What other sensitive data is cached with long TTLs? |
Staff insight: Some data should never be cached, or only with write-through invalidation. This is a data classification problem, not a TTL tuning problem.
Deep Dive 3: Cache Warming Gone Wrong
Staff Answer
| Phase | What to do |
|---|---|
| Immediate | Why was old cache decommissioned before warming completed? Process failure. |
| Root cause | Warming job didn't account for dataset growth. What was 10 min last month is 2 hours now. |
| System fix | Warming job with progress tracking, don't cutover until warming complete |
| Process fix | Warming is part of deployment checklist, not a background task |
| Capacity planning | Monitor warming time, alert if it grows significantly |
Staff insight: Cache warming is deployment-critical. It should block cutover, not run in parallel.
Deep Dive 4: Multi-Tier Cache Debugging
Staff Answer
| Dimension | Staff Answer |
|---|---|
| Hypothesis | L2 was invalidated on write, but L1 on some instances still has stale data |
| Investigation | Which instances have stale data? Are they missing invalidation events? |
| Root cause | L1 invalidation wasn't implemented — relied on TTL only |
| Fix | Either accept 10s staleness (document it) or implement L1 invalidation |
| Tradeoff | L1 invalidation adds complexity; short TTL may be acceptable |
Staff insight: Multi-tier caching multiplies consistency complexity. Aggregate staleness is L1 TTL + L2 TTL in the worst case.
Deep Dive 5: Cache Cost Optimization
Staff Answer
| Phase | What to do |
|---|---|
| Analysis | What's in the cache? Key cardinality, size distribution, hit rate by key pattern |
| Low-hanging fruit | Remove low-hit-rate keys, reduce TTL for rarely-accessed data |
| Architecture | Can we move cold data to cheaper storage? Tiered caching? |
| Compression | Compress large values, more efficient serialization |
| Eviction tuning | Are we caching data that's never reused? Adjust eviction policy |
Staff insight: Cache cost optimization starts with understanding what's being cached and why. Often 20% of keys drive 80% of value.
9. Level Expectations Summary
What gets you each level in a caching interview:
| Level | Minimum Bar | Key Signals |
|---|---|---|
| L5 (Senior) | Knows cache-aside pattern + Redis basics + understands TTL | Can implement a working cache layer |
| L6 (Staff) | Access pattern analysis + invalidation contracts + failure modes + ownership thinking | Designs a cache you can operate |
| L7 (Principal) | Data classification + org-wide patterns + consistency tiers + build-vs-buy reasoning | Designs a caching platform |
What Separates Each Level
| Transition | The Gap |
|---|---|
| L5 → L6 | From "add a cache" to "what's the staleness contract and who owns it" |
| L6 → L7 | From "my service's cache" to "the organization's caching strategy" |
Quick Self-Check
Before your interview, verify you can answer:
- What's the read/write ratio threshold where caching stops helping?
- What's your invalidation strategy, and what's the TTL fallback?
- How do you handle thundering herd on cache miss?
- What's the staleness budget, and who signed off on it?
- When would you NOT cache this data?
The Bar for This Question
Mid-level (L4/E4): You should be able to implement cache-aside with Redis, set reasonable TTLs, and explain the basic read path (check cache → miss → query DB → populate cache → return). You can describe cache hits and misses and why caching improves latency. Understanding cache eviction policies (LRU) or invalidation challenges would be a bonus but isn't expected.
Senior (L5/E5): You should quickly establish the caching pattern (cache-aside vs read-through vs write-through) based on the access pattern and spend time on the real problems: cache invalidation strategy (TTL vs event-driven), thundering herd on cache miss (locking, request coalescing), cache key design and its impact on hit rate, and the staleness contract — how stale is acceptable and who signed off on it. You should quantify: "We cache product catalog with a 5-minute TTL because the business accepts 5 minutes of stale pricing in exchange for 10x lower DB load." Having an opinion on consistent hashing for cache distribution would be strong.
Staff+ (L6/E6+): You should dispatch the baseline architecture in 5 minutes and spend 25+ minutes on operational depth: multi-tier caching (L1 in-process → L2 Redis → L3 CDN), cache warming strategies for cold starts and deploys, the organizational question of who owns the staleness contract (engineering proposes TTLs, product signs off on user-facing staleness), and failure mode analysis — what happens when Redis goes down (do you fall through to DB and crush it, or serve stale from a local cache?). You should reason about cache sizing economics (memory cost vs DB query cost), hot key detection and mitigation (dedicated cache nodes or key splitting), and how caching intersects with consistency requirements across services. The interviewer should see you treat caching as a data freshness contract, not just a performance optimization.
10. Staff Insiders: Controversial Opinions
These are uncomfortable truths that distinguish Staff engineers from Seniors. They're based on operating caches at scale, not on textbook knowledge. Strong engineers disagree on some of these — that's the point.
Most Stale Data Incidents Are Never Detected
The uncomfortable truth: Your cache is probably serving stale data right now. You just don't know it.
Why it's invisible:
| Factor | Why It Hides Staleness |
|---|---|
| No correctness metrics | You measure hit rate, not accuracy |
| Intermittent symptoms | Users refresh and it "fixes itself" |
| Blame shifting | "The data was always like that" |
| TTL masks evidence | By the time you investigate, the stale entry expired |
The Staff position: If you can't measure staleness, you can't claim your cache is correct. Most teams measure cache performance but not cache correctness.
Bar-raiser question: "How would you know if your cache served incorrect data for the last hour?"
High Hit Rate Can Mask Correctness Bugs
The uncomfortable truth: 99% hit rate might mean you're serving confidently wrong answers 99% of the time.
Why this happens:
- Cache is fast, so nobody questions its answers
- Bugs in the origin path are never exercised
- Stale data looks like correct data if you don't check
Real-world example: A payment cache served 3-hour-old card data with 99.5% hit rate. Nobody noticed until a customer complained about being charged on a canceled card.
The Staff position: Hit rate is a performance metric, not a correctness metric. A cache with 99% hit rate that's 1% wrong can cause more damage than a cache with 80% hit rate that's always correct.
Removing a Cache Is Often the Right Fix
The uncomfortable truth: Many caching problems are best solved by removing the cache entirely.
Signs you should delete the cache:
- Hit rate below 50% (you're paying for misses)
- Write rate approaches read rate (invalidation churn)
- Consistency bugs that nobody can debug
- The origin can handle the load without the cache
- Multiple incidents traced to cache staleness
Why teams don't remove caches:
- "We already built it"
- "It must be helping somehow"
- Fear of origin load (often unfounded)
- Nobody owns the decision to remove
The Staff position: Adding a cache is easy. Removing one requires courage. The Staff engineer asks: "What if we just... didn't cache this?"
Cache Invalidation Is an Ownership Problem, Not a Technical One
The uncomfortable truth: "Cache invalidation is hard" is a cop-out. It's hard because nobody owns it.
Why invalidation fails:
| Failure Mode | Root Cause |
|---|---|
| Missed invalidation | Writer doesn't know about cache |
| Partial invalidation | Multiple caches, one forgot |
| Race conditions | Nobody designed for concurrency |
| Schema changes break it | Cache contract undocumented |
The Staff position: Invalidation is hard because it's a coordination problem across teams, not a technical problem within one service. The fix is ownership clarity, not better algorithms.
Bar-raiser question: "Who is responsible for invalidation correctness across all services that write this data?"
Caches Amplify Bugs — They Don't Just Hide Them
The uncomfortable truth: A bug that affects 1% of requests without a cache might affect 99% of requests with a cache.
How caches amplify:
Without cache:
- Bug writes bad data to DB
- 1% of reads hit the bug
- 99% read correct data from DB
With cache:
- Bug writes bad data to DB
- Bad data cached
- 99% of reads hit cache → 99% see bad data
The Staff position: Caches turn transient bugs into persistent outages. A single bad write + aggressive caching = widespread incorrect data served for TTL duration.
"TTL Was Too Long" Is Never the Root Cause
The uncomfortable truth: When stale data causes an incident, "reduce TTL" is the wrong fix.
Why TTL tuning fails:
- It treats symptoms, not causes
- Shorter TTL = more origin load
- The real question: why wasn't invalidation triggered?
The Staff position: Every "TTL was too long" incident is actually an invalidation ownership failure. The fix is better invalidation, not shorter TTL. Shorter TTL is a band-aid that increases cost.
Real root causes:
- Writer didn't know to invalidate
- Invalidation code had a bug
- Invalidation was async and lost
- Nobody owned the cache contract
Appendix A: Caching Patterns Deep Dive — Cache-aside, read-through, write-through, write-behind
A.1 Cache-Aside (Lazy Loading)
The most common pattern. Application manages cache explicitly.
Read path:
- Check cache
- On miss: read from DB, populate cache
- Return data
Write path:
- Write to DB
- Invalidate cache (delete key)
Pros:
- Cache failure = DB fallback (resilient)
- Only cache what's actually read (efficient)
- Simple mental model
Cons:
- Cache miss = two round trips (latency)
- Invalidation logic in application (scattered)
- Race condition window between write and invalidate
When to use: Most read-heavy workloads where you want cache failure to be non-fatal.
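The read and write paths above can be sketched in one small class. Dicts stand in for Redis and the DB here, and the class name is invented for the sketch; note the comment marking the race window called out in the cons.

```python
import time

class CacheAside:
    """Cache-aside sketch: the application owns both population and invalidation."""

    def __init__(self, db, cache, ttl=300.0, clock=time.monotonic):
        self.db, self.cache, self.ttl, self.clock = db, cache, ttl, clock

    def read(self, key):
        entry = self.cache.get(key)
        if entry and entry[1] > self.clock():      # hit: one round trip
            return entry[0]
        value = self.db.get(key)                   # miss: DB read, then populate
        self.cache[key] = (value, self.clock() + self.ttl)
        return value

    def write(self, key, value):
        self.db[key] = value                       # 1. source of truth first
        self.cache.pop(key, None)                  # 2. invalidate; a concurrent
        # read between 1 and 2 can repopulate stale data -- the race window.
```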
A.2 Read-Through
Cache handles fetching. Application always reads from cache; cache fetches from DB on miss.
Read path:
- Read from cache
- On miss: cache fetches from DB, stores, returns
Write path:
- Write to DB
- Invalidate cache OR let TTL expire
Pros:
- Simple application code (always read cache)
- Automatic population
Cons:
- Cache failure = read failure (coupled)
- Cache must understand DB schema
- Less control over fetch logic
When to use: When you want to centralize caching logic and can accept cache-as-dependency.
A.3 Write-Through
Cache handles persistence. Write to cache; cache synchronously writes to DB.
Write path:
- Write to cache
- Cache writes to DB (synchronous)
- Confirm to client
Read path:
- Always read from cache (always fresh)
Pros:
- Cache always consistent with DB
- Simple read path
- No invalidation needed
Cons:
- Write latency includes cache
- Cache failure = write failure
- Cache must understand DB schema
When to use: When consistency is critical and you can accept cache-as-dependency on writes.
A.4 Write-Behind (Write-Back)
Async persistence. Write to cache; cache asynchronously writes to DB.
Write path:
- Write to cache (immediate return)
- Cache queues DB write
- Background process persists to DB
Read path:
- Always read from cache
Pros:
- Lowest write latency
- Batching opportunities
- Absorbs write spikes
Cons:
- Data loss risk (cache crash before DB write)
- Consistency complexity
- Requires durable cache or careful failure handling
When to use: Write-heavy workloads where you can tolerate some data loss risk (analytics, logs, non-critical counters).
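The write-behind flow above can be sketched with an explicit pending queue. This is a schematic with invented names: the flush runs manually here, where production would run it in the background with batching, and a crash before `flush()` loses queued writes, which is exactly the data-loss risk the pattern accepts.

```python
from collections import deque

class WriteBehindCache:
    """Write-behind sketch: writes land in the cache and a queue; a flusher
    persists them to the DB later."""

    def __init__(self, db):
        self.db = db
        self.cache = {}
        self.pending = deque()        # queued (key, value) writes awaiting the DB

    def write(self, key, value):
        self.cache[key] = value       # immediate acknowledgement to the caller
        self.pending.append((key, value))

    def flush(self):
        while self.pending:           # batching opportunity lives here
            key, value = self.pending.popleft()
            self.db[key] = value
```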
A.5 Pattern Comparison
| Pattern | Read Path | Write Path | Consistency | Failure Impact |
|---|---|---|---|---|
| Cache-Aside | App → Cache → (miss) → DB | App → DB → Invalidate Cache | Eventual | Cache down = DB fallback |
| Read-Through | App → Cache → (miss) → Cache fetches DB | App → DB → Invalidate Cache | Eventual | Cache down = reads fail |
| Write-Through | App → Cache | App → Cache → Cache writes DB | Strong | Cache down = writes fail |
| Write-Behind | App → Cache | App → Cache → async DB | Eventual | Cache down = data loss risk |
Appendix B: Eviction Strategies — LRU, LFU, TTL, and when each matters
B.1 Why Eviction Matters
Cache memory is finite. When full, something must go. The eviction policy determines what.
The wrong eviction policy:
- Evicts hot data → cache misses → origin load
- Keeps cold data → wasted memory → poor hit rate
B.2 Common Eviction Policies
| Policy | Evicts | Best For | Weakness |
|---|---|---|---|
| LRU | Least Recently Used | General workloads | One-time scans pollute cache |
| LFU | Least Frequently Used | Stable hot set | Slow to adapt to changing access patterns |
| TTL | Expired entries | Time-sensitive data | Doesn't handle memory pressure |
| Random | Random entry | Large caches, uniform access | Can evict hot data |
B.3 LRU (Least Recently Used)
How it works: Evict the entry that hasn't been accessed for the longest time.
Good for: Workloads with temporal locality (recently accessed = likely accessed again).
Bad for: Scan workloads — one-time reads push out hot data.
Access pattern: A B C D A B E F G H ...
LRU cache (size 4):
After A B C D: [D, C, B, A]
After A B: [B, A, D, C]
After E: [E, B, A, D] (C evicted)
After F: [F, E, B, A] (D evicted)
...
Hot keys A, B survive; scan keys E, F, G, H cycle through
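The trace above can be reproduced with Python's `OrderedDict`; `LRUCache` here is a toy sketch for the access pattern, not a production implementation:

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU: most recently used sits at the right end of an
    OrderedDict; eviction pops from the left end."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items: "OrderedDict[str, str]" = OrderedDict()

    def access(self, key: str) -> None:
        if key in self._items:
            self._items.move_to_end(key)         # refresh recency on a hit
        else:
            if len(self._items) >= self.capacity:
                self._items.popitem(last=False)  # evict least recently used
            self._items[key] = key

    def contents_mru_first(self):
        return list(reversed(self._items))

lru = LRUCache(4)
for k in "ABCDABEF":
    lru.access(k)
# MRU-first: ['F', 'E', 'B', 'A'] -- hot keys A, B survived; C, D evicted
```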
B.4 LFU (Least Frequently Used)
How it works: Evict the entry with the lowest access count.
Good for: Stable hot sets where popular items stay popular.
Bad for: Changing popularity — old popular items block new popular items.
Variants:
- LFU with decay: Counts decay over time, allowing popularity shifts
- Window LFU: Only count recent accesses
B.5 TTL-Based Eviction
How it works: Entries expire after a fixed time, regardless of access.
Good for: Data with known staleness windows, session data.
Not a memory pressure solution: TTL doesn't help when cache is full but entries haven't expired.
Best practice: Combine TTL with LRU/LFU — TTL for freshness, LRU for capacity.
B.6 Choosing an Eviction Policy
| Workload | Recommended | Why |
|---|---|---|
| General web app | LRU | Good default, temporal locality |
| Stable hot set (popular products) | LFU | Protects frequently-accessed items |
| Session data | TTL | Natural expiration |
| Scan-heavy (batch reads) | LRU + scan resistance | Prevent scans from polluting |
Staff insight: Most production systems use LRU with TTL. LFU is rarely worth the complexity unless you have a proven stable hot set.
Appendix C: Thundering Herd Mitigations — Request coalescing, probabilistic expiry, circuit breakers
C.1 The Problem
When a popular cache entry expires or is invalidated:
- Many requests arrive simultaneously
- All miss the cache
- All query the origin
- Origin is overwhelmed
C.2 Request Coalescing (Singleflight)
How it works: First request triggers the fetch; concurrent requests wait for that result.
Implementation: Use a lock or promise per key. First request acquires lock and fetches; others wait on the promise.
Tradeoffs:
- ✅ Origin sees single request instead of N
- ❌ Waiting requests add latency
- ❌ If fetch fails, all waiters fail
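The lock-and-wait implementation can be sketched with a per-key event; `Singleflight` here is a hand-rolled illustration (Go has a library of the same name), and error propagation to waiters is deliberately omitted:

```python
import threading
import time
from typing import Any, Callable, Dict

class Singleflight:
    """Per-key request coalescing: the first caller (leader) runs the
    fetch; concurrent callers wait on an event and share the result.
    If the fetch raises, waiters are not handled here (see the cons)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[str, threading.Event] = {}
        self._results: Dict[str, Any] = {}

    def do(self, key: str, fetch: Callable[[], Any]) -> Any:
        with self._lock:
            event = self._inflight.get(key)
            leader = event is None
            if leader:
                event = threading.Event()
                self._inflight[key] = event
        if not leader:
            event.wait()                       # followers block on the leader
            with self._lock:
                return self._results[key]
        try:
            result = fetch()                   # only the leader hits origin
            with self._lock:
                self._results[key] = result
            return result
        finally:
            with self._lock:
                self._inflight.pop(key, None)  # next miss starts a new flight
            event.set()

origin_calls = []
sf = Singleflight()

def slow_fetch():
    origin_calls.append(1)
    time.sleep(0.2)        # long enough that all callers pile up
    return "value"

results = []
threads = [threading.Thread(target=lambda: results.append(sf.do("k", slow_fetch)))
           for _ in range(5)]
for t in threads: t.start()
for t in threads: t.join()
# origin saw one fetch; all five callers got "value"
```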
C.3 Probabilistic Early Expiry
How it works: Entries expire slightly before their TTL, with randomness to spread refreshes.
Formula (XFetch-style): should_refresh = (now - delta * beta * log(random()) >= expiry)
Where delta is the cost of a recompute/fetch and beta (≥ 1) controls how early refreshes can happen. Since log(random()) is negative, the subtracted term gives each request a randomized head start before expiry.
Effect: Instead of 1000 requests hitting exactly at TTL, refreshes spread over a window.
Tradeoffs:
- ✅ Spreads load over time
- ❌ Some entries refresh "too early" (wasted work)
- ❌ Requires tuning beta
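A sketch of the refresh check, assuming the XFetch-style formula with a per-key `fetch_cost` (the time to recompute the entry); the function name and parameters are illustrative:

```python
import math
import random

def should_refresh(now: float, expiry: float,
                   fetch_cost: float, beta: float = 1.0) -> bool:
    """Probabilistic early expiry: log(random()) is negative, so
    -fetch_cost * beta * log(u) is a positive, randomized head start.
    beta > 1 refreshes more aggressively; beta < 1 less so."""
    u = random.random()  # (effectively) in (0, 1)
    return now - fetch_cost * beta * math.log(u) >= expiry

# Spread check: with expiry 10s away and a 1s fetch cost, almost no
# calls refresh early; at the expiry instant, every call refreshes.
early = sum(should_refresh(now=0.0, expiry=10.0, fetch_cost=1.0)
            for _ in range(10_000))
late = sum(should_refresh(now=10.0, expiry=10.0, fetch_cost=1.0)
           for _ in range(10_000))
```

As `now` approaches `expiry`, the probability of an early refresh rises smoothly, which is exactly what spreads the 1000-requests-at-once spike over a window.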
C.4 Background Refresh (Stale-While-Revalidate)
How it works: Serve stale data immediately while refreshing in background.
Request arrives → Cache has stale entry
→ Return stale data immediately
→ Trigger background refresh
→ Next request gets fresh data
Tradeoffs:
- ✅ No latency spike on refresh
- ✅ Origin load is smoothed
- ❌ Guaranteed staleness during refresh
- ❌ Complexity (background job, stale tracking)
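A minimal stale-while-revalidate sketch; `SWRCache`, its `loader` callable, and the single-refresh guard are illustrative, and real implementations also need error handling and a bound on staleness:

```python
import threading
import time
from typing import Any, Callable, Dict, Tuple

class SWRCache:
    """Stale-while-revalidate sketch: expired entries are served
    immediately while a background thread fetches the fresh value."""

    def __init__(self, loader: Callable[[str], Any], ttl: float):
        self._loader = loader
        self._ttl = ttl
        self._entries: Dict[str, Tuple[Any, float]] = {}  # value, expiry
        self._refreshing: set = set()
        self._lock = threading.Lock()

    def get(self, key: str) -> Any:
        with self._lock:
            entry = self._entries.get(key)
        if entry is None:
            return self._refresh(key)    # cold miss: must block once
        value, expiry = entry
        if time.monotonic() >= expiry:
            self._kick_refresh(key)      # stale: refresh off the hot path
        return value                     # serve immediately either way

    def _kick_refresh(self, key: str) -> None:
        with self._lock:
            if key in self._refreshing:
                return                   # at most one refresh per key
            self._refreshing.add(key)
        threading.Thread(target=self._refresh, args=(key,),
                         daemon=True).start()

    def _refresh(self, key: str) -> Any:
        try:
            value = self._loader(key)
            with self._lock:
                self._entries[key] = (value, time.monotonic() + self._ttl)
            return value
        finally:
            with self._lock:
                self._refreshing.discard(key)

source = {"k": "v1"}
cache = SWRCache(loader=source.__getitem__, ttl=0.05)
cold = cache.get("k")       # blocking load
source["k"] = "v2"
time.sleep(0.1)             # let the entry go stale
stale = cache.get("k")      # stale served, refresh kicked in background
time.sleep(0.1)
fresh = cache.get("k")      # background refresh has landed
```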
C.5 Cache-Miss Circuit Breaker
How it works: Limit concurrent requests to origin on cache miss.
Cache miss:
if (concurrent_origin_requests < limit):
fetch from origin
else:
fail fast (or serve stale if available)
Tradeoffs:
- ✅ Protects origin from overload
- ❌ Some requests fail or get stale data
- ❌ Requires tuning limit
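The pseudocode above maps onto a non-blocking semaphore; `MissLimiter` is an illustrative name, and the stale-fallback parameter is one possible degradation choice:

```python
import threading
from typing import Any, Callable, Optional

class MissLimiter:
    """Bound concurrent origin fetches on cache misses; beyond the limit,
    serve a stale value if the caller has one, otherwise fail fast."""

    def __init__(self, limit: int):
        self._sem = threading.Semaphore(limit)

    def fetch(self, origin: Callable[[], Any],
              stale: Optional[Any] = None) -> Any:
        if not self._sem.acquire(blocking=False):
            if stale is not None:
                return stale     # degrade: stale beats an error
            raise RuntimeError("origin overloaded: failing fast")
        try:
            return origin()      # under the limit: origin call proceeds
        finally:
            self._sem.release()

limiter = MissLimiter(limit=2)
value = limiter.fetch(lambda: "fresh")
```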
C.6 Choosing a Mitigation
| Scenario | Recommended | Why |
|---|---|---|
| High-traffic keys | Request coalescing | Prevents duplicate origin fetches |
| Many keys expiring together | Probabilistic early expiry | Spreads refresh load |
| Latency-sensitive | Stale-while-revalidate | No refresh latency visible to user |
| Origin fragile | Circuit breaker | Hard limit on origin load |
Staff answer: "I'll use request coalescing as the primary defense, with probabilistic early expiry to prevent synchronized refreshes. Circuit breaker protects the origin if coalescing isn't enough."
Appendix D: Cache Consistency Patterns — Invalidation, versioning, read-your-writes
D.1 The Fundamental Problem
Cache and database are separate systems. Writes go to DB; reads may come from cache. Keeping them consistent is hard.
D.2 Invalidation Patterns
Delete on Write
How it works: After writing to DB, delete the cache key.
Write: UPDATE users SET email='new' WHERE id=123
Then: DELETE cache:user:123
Problem: Race condition.
t=0: Thread A reads user:123 from DB (old data)
t=1: Thread B writes user:123 to DB (new data)
t=2: Thread B deletes cache:user:123
t=3: Thread A writes old data to cache
Result: Cache has stale data until TTL
Mitigation: Version stamps or short TTL.
Update on Write (Write-Through)
How it works: After writing to DB, update the cache with new value.
Write: UPDATE users SET email='new' WHERE id=123
Then: SET cache:user:123 = {new data}
Problem: Same race condition, plus you're computing cache value in write path.
D.3 Version Stamps
How it works: Include a version number in the cache key or value.
Cache key: user:123:v7
On write: increment version → user:123:v8
Old cached data at v7 is naturally orphaned
Tradeoffs:
- ✅ No race conditions
- ❌ Key cardinality increases
- ❌ Need to track current version somewhere
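A sketch of version-stamped keys; `VersionedCache` is illustrative, and the in-memory `_versions` dict stands in for wherever the current version is tracked (the con noted above):

```python
from typing import Any, Callable, Dict

class VersionedCache:
    """Version-stamped keys: writers bump a per-entity version; readers
    build the cache key from the current version, so entries cached under
    old versions are never read again (they age out via LRU/TTL)."""

    def __init__(self):
        self._versions: Dict[str, int] = {}  # current version per entity
        self._cache: Dict[str, Any] = {}

    def _key(self, entity: str) -> str:
        return f"{entity}:v{self._versions.get(entity, 0)}"

    def read(self, entity: str, loader: Callable[[], Any]) -> Any:
        key = self._key(entity)
        if key not in self._cache:
            self._cache[key] = loader()  # miss at this version: load fresh
        return self._cache[key]

    def write(self, entity: str) -> None:
        # After the DB write commits, bump the version. No delete needed:
        # the old key is simply orphaned.
        self._versions[entity] = self._versions.get(entity, 0) + 1

db = {"user:123": "old"}
vc = VersionedCache()
before = vc.read("user:123", lambda: db["user:123"])  # cached at v0
db["user:123"] = "new"
vc.write("user:123")                                  # key becomes v1
after = vc.read("user:123", lambda: db["user:123"])   # v1 miss loads fresh
```

This sidesteps the delete-on-write race entirely: a slow reader can only ever populate an old-version key, which no future reader will look at.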
D.4 Read-Your-Writes Consistency
Problem: User updates data but immediately sees old cached data.
Solution: After write, user's session bypasses cache for that key.
Write: User updates profile
Set: session.bypass_cache['user:123'] = now + 30s
Read: If bypass active, read from DB, not cache
Tradeoffs:
- ✅ User always sees their own writes
- ❌ Complexity in read path
- ❌ Per-session state required
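The bypass window can be sketched as per-session state; `Session` and its callables are illustrative names, with the cache and DB reads passed in to keep the sketch self-contained:

```python
import time
from typing import Any, Callable, Dict

class Session:
    """Read-your-writes sketch: after a write, this user's session
    bypasses the cache for that key for a short window and reads the
    DB directly, so they always see their own update."""

    def __init__(self, bypass_window_s: float = 30.0):
        self._bypass_until: Dict[str, float] = {}
        self._window = bypass_window_s

    def note_write(self, key: str) -> None:
        self._bypass_until[key] = time.monotonic() + self._window

    def read(self, key: str, from_cache: Callable[[str], Any],
             from_db: Callable[[str], Any]) -> Any:
        if time.monotonic() < self._bypass_until.get(key, 0.0):
            return from_db(key)   # bypass active: user sees their write
        return from_cache(key)    # normal cached path for everyone else

db = {"user:123": "new"}
stale_cache = {"user:123": "old"}
writer = Session()
writer.note_write("user:123")   # this session just updated the profile
```

Note the asymmetry this creates: the writer's session reads `"new"` from the DB while every other session still reads `"old"` from the cache until invalidation or TTL catches up.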
D.5 Change Data Capture (CDC)
How it works: Subscribe to DB changelog; invalidate cache asynchronously.
DB write → Changelog (binlog, WAL)
→ CDC consumer
→ Cache invalidation
Tradeoffs:
- ✅ Decouples writers from cache knowledge
- ✅ Guaranteed to catch all writes
- ❌ Eventual consistency (lag between write and invalidation)
- ❌ Infrastructure complexity
When to use: Large-scale systems where write paths are diverse and can't all know about caching.
Appendix E: Metrics & Observability — Hit rate, latency, eviction monitoring
E.1 Core Metrics (Non-Negotiable)
cache_hit_total
cache_miss_total
cache_hit_rate (derived: hit / (hit + miss))
cache_latency_ms
origin_latency_ms
cache_eviction_total
cache_size_bytes
cache_key_count
E.2 Hit Rate Is Not Enough
High hit rate can hide problems:
- Hit rate 95% but 5% misses overwhelm origin
- Hit rate 99% but misses are the important requests
- Hit rate 90% but most hits are stale data
Better signals:
- Origin load during cache operation
- Miss latency vs hit latency ratio
- Staleness metrics (if trackable)
E.3 Metric Dimensions
Slice metrics by:
- Key pattern: user:*, product:*, session:*
- Operation: get, set, delete
- Result: hit, miss, error
E.4 Alerting
| Alert | Threshold | Why |
|---|---|---|
| Hit rate drop | < 80% for 5 min | Origin may be overwhelmed |
| Cache latency spike | p99 > 50ms | Network or capacity issue |
| Eviction rate spike | > 1000/s | Memory pressure |
| Cache unavailable | > 30s | Failover or outage |
E.5 Dashboards
Operational dashboard:
- Hit rate (real-time)
- Origin load vs cache load
- Latency percentiles (p50, p95, p99)
- Error rate
Capacity dashboard:
- Memory usage vs capacity
- Key count and growth rate
- Eviction rate
- Connection count
Appendix F: Multi-Region Caching — Per-region, global, invalidation strategies
F.1 The Multi-Region Problem
Users in EU should hit EU cache; users in US should hit US cache. But what happens when data is updated?
F.2 Strategies
| Strategy | Consistency | Latency | Complexity |
|---|---|---|---|
| Per-region caches, no sync | Eventually consistent | Low | Low |
| Per-region with async invalidation | Eventually consistent | Low | Medium |
| Global cache (single region) | Strong | High (cross-region) | Low |
| Global cache (replicated) | Eventually consistent | Low reads, high writes | High |
F.3 Per-Region with Async Invalidation
How it works:
- Write happens in one region
- Invalidation message published to message bus
- Other regions consume and invalidate their caches
Staleness window: Cross-region propagation delay (typically 100-500ms).
F.4 When to Accept Per-Region Inconsistency
- User data: Users rarely switch regions mid-session
- Product catalog: Brief inconsistency rarely matters
- Sessions: Should be region-sticky anyway
F.5 When to Require Global Consistency
- Inventory/stock: Overselling is expensive
- Financial: Compliance requirements
- Global rate limiting: Abuse protection
Staff insight: Most user-facing data can tolerate per-region caching with async invalidation. Reserve global consistency for data where inconsistency has real cost.
Appendix G: Cache Sizing and Capacity Planning — Memory estimation, hit rate modeling
G.1 Basic Sizing
Formula:
Memory needed = working_set_size × overhead_factor
Where:
working_set_size = num_keys × avg_value_size
overhead_factor ≈ 1.5-2 (for Redis data structures, fragmentation)
Example:
- 1M users, 2KB per user profile
- Working set: 1M × 2KB = 2GB
- With overhead: 2GB × 1.5 = 3GB
- Add headroom: 3GB × 1.2 = 3.6GB minimum
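The worked example above can be packaged as a back-of-envelope helper; the function name and the 1.2× headroom default are this playbook's assumptions, not a standard:

```python
def cache_memory_gb(num_keys: int, avg_value_bytes: int,
                    overhead_factor: float = 1.5,
                    headroom_factor: float = 1.2) -> float:
    """Sizing estimate: working set, times structure/fragmentation
    overhead (1.5-2 for Redis), times headroom for growth."""
    working_set_bytes = num_keys * avg_value_bytes
    return working_set_bytes * overhead_factor * headroom_factor / 1e9

# The worked example: 1M user profiles at 2KB each
estimate = cache_memory_gb(num_keys=1_000_000, avg_value_bytes=2_000)
# about 3.6 GB minimum, matching the hand calculation
```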
G.2 Hit Rate vs Cache Size
The Pareto insight: Often 20% of keys serve 80% of traffic. Caching the hot set gives most of the benefit.
Hit rate curve: Typically logarithmic — doubling cache size doesn't double hit rate.
Cache size: 10% → Hit rate: 70%
Cache size: 20% → Hit rate: 85%
Cache size: 50% → Hit rate: 95%
Cache size: 100% → Hit rate: 99%
Staff insight: Model your access pattern. If you have high skew (few hot keys), a small cache gives great hit rate. If access is uniform, you need to cache more.
G.3 Capacity Planning Questions
- What's the working set? All data that could be cached, or just hot data?
- What hit rate do we need? 80%? 95%? 99%?
- What's the cost of a miss? DB query latency, cost, capacity
- What's the cost of cache memory? $/GB for your cache tier
- What's the growth rate? How will working set grow?
G.4 Cost Optimization
| Technique | Savings | Tradeoff |
|---|---|---|
| Compression | 2-5x | CPU overhead |
| Shorter TTL | Less memory for cold data | More origin load |
| Tiered caching | Hot data in expensive cache, cold in cheap | Complexity |
| Efficient serialization | 10-50% | Developer effort |
These frameworks are referenced throughout this playbook and apply to many system design problems:
- Distributed State Coordination
  - Cache invalidation coordination, multi-tier consistency, leader election for cache warming
  - Applies to: caching, rate limiting, locks, sessions
- Cache failure handling, serving stale vs failing, circuit breakers
  - Applies to: caching, rate limiting, dependency isolation
- Redis vs Memcached vs managed services, self-hosted vs cloud
  - Applies to: caching, observability, databases, queues