StaffSignal
Foundation — Study Guide · 23 min read

Caching Fundamentals

Cache is consistency debt — every cached value is a promise that staleness is acceptable. Cache-aside, write-through, write-behind tradeoffs, thundering herd prevention, and how Facebook TAO, Netflix EVCache, and Discord's L1 cache handle billions of reads/second.

Why This Matters

Every system design interview touches caching. Not because it is a clever optimization, but because caching is where you make your first real tradeoff: speed vs. correctness. The moment you put data in a cache, you have introduced a second copy of the truth. That copy can go stale, go missing, or go wrong — and how you handle each of those scenarios tells an interviewer whether you think in systems or just in features.

Caching appears in almost every architecture diagram, but most candidates treat it as a black box labeled "Redis" sitting between the app and the database. Staff-level engineers understand caching as a consistency problem first and a performance problem second. They can articulate why a particular caching strategy fits a given workload, what breaks when the cache disappears, and how the business defines "acceptable staleness."

If you can design a caching layer that gracefully handles cold starts, invalidation races, and tail-latency amplification — and explain why you made those choices — you can handle the distributed state reasoning that Staff interviews demand.

The 60-Second Version

  • Cache = consistency debt. Every cached value is a promise that staleness is acceptable for this use case.
  • Three core strategies: cache-aside (app manages reads), write-through (sync to cache + DB), write-behind (async DB writes, durability risk). If you can't explain why you chose one over the other, you haven't made a design decision — you've made a default.
  • TTL is not a technical constant — it is business staleness tolerance expressed in seconds. If product cannot define acceptable staleness, you cannot set a TTL.
  • Invalidation is the hard problem. TTL-based is simple but stale. Event-based is fresh but operationally complex. In production, you use both: event-driven as the primary mechanism, TTL as the consistency backstop.
  • Thundering herd on cold start or mass expiry will turn a cache miss into a database incident. Singleflight / request coalescing is mandatory for any key above 100 RPS.
  • Cache penetration (non-existent keys bypassing cache) is a quiet cost that shows up as unexplained DB load, not as cache errors. Negative caching is the standard mitigation.
  • A cache with a 90% hit rate is not "almost there" — dropping from a 99% hit rate to 90% sends 10x the miss traffic to your database. Segment your hit rate by key type, not in aggregate.

How Caching Works

The Basic Idea

A cache stores a copy of frequently accessed data in a faster storage layer so the application avoids repeatedly computing or fetching the same result from a slower source.

Without a cache, every request follows the same path:

  1. Application receives request
  2. Application queries the database
  3. Database reads from disk (or its own buffer pool), processes the query, returns the result
  4. Application responds to the user

The database query might take 5–50ms. Under high read throughput — thousands of requests per second for the same data — you are paying that cost repeatedly for identical results.

A cache short-circuits this path. The first request still goes to the database, but the result is stored in the cache. Subsequent requests for the same data hit the cache instead, returning in under 1ms.

The cost of this speedup is simple to state and hard to manage: the cached copy can diverge from the source of truth. Every caching decision you make is fundamentally about managing that divergence.

Cache Hit vs. Cache Miss

When the application looks up a key in the cache:

  • Cache hit: The key exists and the value is returned immediately. This is the fast path — sub-millisecond for Redis, microseconds for in-process caches.
  • Cache miss: The key does not exist. The application must fall back to the database, fetch the result, and (usually) populate the cache for future requests.

The hit rate — the percentage of requests served from cache — determines whether your cache is earning its keep. A 95% hit rate means 1 in 20 requests still hits the database. Whether that is acceptable depends entirely on the total request volume and the database's capacity.

Where Caches Live

Caches can sit at multiple levels of the stack, each with different latency and consistency characteristics:

| Cache Layer | Latency | Scope | Invalidation Complexity |
|---|---|---|---|
| CPU cache (L1/L2/L3) | 1–30ns | Single core/socket | Managed by hardware |
| In-process cache (HashMap, Guava) | ~100ns | Single application instance | Requires coordination across instances |
| Distributed cache (Redis, Memcached) | 0.2–1ms | Shared across all instances | Central but requires network hop |
| CDN/Edge cache (Cloudflare, CloudFront) | 1–50ms | Global edge network | Propagation delay across PoPs |

In system design interviews, you will almost always be talking about distributed caches (Redis/Memcached) and occasionally CDN caches for static or semi-static content. In-process caches appear when you need to avoid the network hop for extremely hot keys.

Core Caching Strategies

There are three fundamental caching strategies. Each makes a different tradeoff between read performance, write complexity, and consistency. Every other caching pattern is a variation or combination of these three.

Cache-Aside (Lazy Loading)

The application manages the cache explicitly. On read, the application checks the cache first. On miss, it queries the database, writes the result to the cache, and returns.

read(key):
    value = cache.get(key)
    if value != null:
        return value              # cache hit

    value = db.query(key)         # cache miss → fetch from DB
    cache.set(key, value, ttl)    # populate cache for next time
    return value

write(key, value):
    db.write(key, value)          # write to DB only
    cache.delete(key)             # invalidate stale cache entry

Why this is the default strategy:

  • The cache only contains data that has actually been requested. No wasted memory on unused keys.
  • Read path is simple and fast. Cache misses are self-healing — the next read populates the cache.
  • The application controls exactly what gets cached, with what TTL, and under what conditions.

What can go wrong:

  • Cold start: A new deployment or cache restart means every request is a miss. If your application handles 10K RPS, that is 10K database queries per second until the cache warms up.
  • Race condition on write: If a read and a write happen concurrently, the cache can store stale data. The read fetches old data from the DB while the write updates the DB and deletes the cache entry — but the read's cache.set happens after the delete, putting the old value back.
  • Thundering herd: A popular key expires, and hundreds of concurrent requests all miss at the same time, all query the database for the same row. One request's work would have been enough; instead the database absorbs hundreds of redundant queries.
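The race condition above can be replayed deterministically. In this sketch, plain dicts stand in for the database and the cache, and the four steps mirror the interleaving described in the bullet:

```python
# Deterministic replay of the cache-aside write race.
# `db` and `cache` are plain dicts standing in for the real stores.
db = {"user:1": "old"}
cache = {}

# Step 1: reader misses the cache and fetches the soon-to-be-stale value.
reader_value = db["user:1"]

# Steps 2-3: writer persists the new value and invalidates the cache.
db["user:1"] = "new"
cache.pop("user:1", None)

# Step 4: reader resumes and populates the cache AFTER the invalidation,
# putting the old value back. Cache and DB now disagree until TTL expiry.
cache["user:1"] = reader_value
```

The standard mitigations are a TTL that bounds the divergence window, or compare-and-set/versioned cache writes that reject a set carrying an older version.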

Write-Through

Every write goes to both the cache and the database synchronously. The cache is always up to date.

write(key, value):
    db.write(key, value)           # persist to DB first (durability)
    cache.set(key, value)          # then update cache (freshness)

read(key):
    value = cache.get(key)         # kept current by the write path
    if value == null:              # misses still happen: eviction, restart
        value = db.query(key)
        cache.set(key, value)
    return value

When to use it:

  • The application cannot tolerate any staleness. Financial balances, inventory counts, session state.
  • Write volume is moderate. Every write pays the cost of two operations (cache + DB).

What can go wrong:

  • Write latency doubles. Every write now waits for both the cache and the database. Under high write volume, this becomes the bottleneck.
  • Cache fills with unread data. If you write 100K keys per hour but only 10K are ever read, 90% of your cache memory holds data nobody wants.
  • Partial failure. If the cache write succeeds but the DB write fails (or vice versa), you have an inconsistency. You need to handle this with retries or compensating writes.

Write-Behind (Write-Back)

Writes go to the cache immediately and are asynchronously flushed to the database in the background, typically in batches.

write(key, value):
    cache.set(key, value)            # write to cache immediately
    write_queue.enqueue(key, value)  # async flush to DB later

# Background worker:
flush():
    batch = write_queue.drain()
    db.batch_write(batch)

When to use it:

  • Write throughput is the bottleneck and the application can tolerate a brief window where data exists only in the cache.
  • Batching writes to the database provides significant throughput gains (e.g., gaming leaderboards, analytics counters).

What can go wrong:

  • Data loss on cache failure. If the cache dies before the write queue flushes to the database, those writes are gone. This is the fundamental risk of write-behind — you have accepted a durability window.
  • Ordering issues. If writes arrive out of order and the flush batching does not preserve order, the database may end up with an older value overwriting a newer one.
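One common mitigation for the ordering hazard, not specific to any library, is to tag each queued write with a monotonically increasing version and coalesce the drained batch before flushing, so only the newest value per key reaches the database:

```python
def coalesce(batch):
    """Reduce a drained batch of (key, version, value) tuples to the
    newest value per key, so an older queued write can never overwrite
    a newer one at flush time."""
    newest = {}
    for key, version, value in batch:
        if key not in newest or version > newest[key][0]:
            newest[key] = (version, value)
    return {key: value for key, (version, value) in newest.items()}
```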

Cache Invalidation

Populating a cache is easy. Keeping it correct is the hard problem. There are three approaches to invalidation, each with a different reliability-complexity tradeoff.

TTL-Based Expiry

Every cached entry has a time-to-live. After the TTL expires, the entry is removed and the next request re-fetches from the database.

  • Simplicity: No coordination needed. Set the TTL and walk away.
  • Staleness: The cache serves stale data for the entire duration of the TTL window. A 5-minute TTL means up to 5 minutes of stale data after every write.
  • Mass expiry risk: If you batch-load 100K keys with the same TTL, they all expire at the same instant. The result is a thundering herd. Add jitter (random ±10% variation) to stagger expiry times.

TTL is not an invalidation strategy — it is a consistency safety net. Relying on TTL alone means you have chosen to accept the staleness window as a product decision.
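The jitter mentioned above is a one-liner. A sketch, where the ±10% default is the figure this guide uses rather than a universal constant:

```python
import random

def jittered_ttl(base_ttl, jitter=0.10):
    """Spread expiry times by +/- `jitter` so batch-loaded keys
    don't all expire in the same instant."""
    return base_ttl * random.uniform(1 - jitter, 1 + jitter)
```

A 300-second base TTL comes back as anything from 270 to 330 seconds, staggering the expiry of keys loaded together.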

Event-Driven Invalidation

The write path publishes an invalidation event (via CDC stream, message queue, or direct call) and the cache deletes or updates the affected entry.

  • Freshness: Near-real-time. The cache is stale only for the propagation latency of the event (milliseconds to low seconds).
  • Complexity: You now have a distributed system problem. Events can be delayed, duplicated, or lost. The invalidation pipeline itself needs monitoring and SLOs.
  • Coupling: Every service that writes data must publish events that the cache layer consumes. This is an organizational dependency, not just a technical one.

Use event-driven invalidation when the business does not tolerate the staleness window that TTL provides, and combine it with a TTL backstop so the cache self-heals if an event is lost.

Manual / Application-Level Invalidation

The application explicitly deletes or updates cache entries on the write path. This is the simplest form of event-driven invalidation.

update_user_profile(user_id, new_data):
    db.update("users", user_id, new_data)
    cache.delete(f"user:{user_id}")           # explicit invalidation

This works well for simple cases but breaks down when:

  • Multiple services write to the same data (which one is responsible for invalidation?)
  • The cached value is derived from multiple database rows (invalidating one row does not invalidate the aggregate)
  • The write path has no knowledge of the cache keys affected


Implementation Patterns

These are the patterns that separate production-grade caching from textbook caching. Each one addresses a specific failure mode that the basic strategies leave open.

Singleflight / Request Coalescing

When a popular key expires, hundreds of concurrent requests miss the cache simultaneously. Without protection, every one of them queries the database for the same row.

Singleflight collapses concurrent misses for the same key into a single database query. The first request acquires a lock, fetches from the database, and populates the cache. All other concurrent requests for the same key wait for the first to complete and share its result.

# Pseudocode for singleflight
inflight = ConcurrentMap<Key, Future<Value>>

read(key):
    value = cache.get(key)
    if value != null:
        return value

    # Check if another request is already fetching this key
    if key in inflight:
        return inflight[key].await()

    # This request wins — fetch from DB
    future = new Future()
    inflight[key] = future

    try:
        value = db.query(key)
        cache.set(key, value, ttl)
        future.resolve(value)
    except error:
        future.reject(error)       # wake waiters; a silent failure would wedge the key
        raise error
    finally:
        inflight.remove(key)
    return value

Use this when: you have keys accessed at >100 RPS and cache misses trigger expensive database queries (>10ms).
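A runnable sketch of the same pattern, using one `threading.Event` per in-flight key. This is an illustration under the assumption of a synchronous fetch callable, not a production client (for brevity, a fetch that raises leaves waiters without a value rather than propagating the original error):

```python
import threading

class Singleflight:
    """Collapse concurrent fetches for the same key into one call.
    The first caller (the leader) runs `fetch`; everyone else blocks
    on an event and shares the leader's result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (threading.Event, result box)

    def do(self, key, fetch):
        with self._lock:                       # atomic check-and-register
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        event, box = entry
        if leader:
            try:
                box["value"] = fetch()
            finally:
                with self._lock:
                    del self._inflight[key]    # clear even on failure
                event.set()                    # wake all waiters
            return box["value"]
        event.wait()                           # follower: share leader's result
        return box["value"]
```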

Negative Caching

If the application frequently looks up keys that do not exist in the database (nonexistent user IDs, deleted records, typos), every lookup is a cache miss followed by a database query that returns nothing. The cache never stores the result because there is no result to store.

Negative caching stores the absence of a value. When a database lookup returns empty, cache a sentinel value with a short TTL (30–120 seconds).

read(key):
    result = cache.get(key)
    if result == NEGATIVE_SENTINEL:
        return null                    # known nonexistent
    if result != null:
        return result

    value = db.query(key)
    if value == null:
        cache.set(key, NEGATIVE_SENTINEL, short_ttl)  # 30-120s
        return null
    cache.set(key, value, normal_ttl)
    return value

Use this when: 10%+ of lookups are for nonexistent keys. Common in user-facing search, URL shorteners, and APIs exposed to external clients.

Multi-Tier Caching

When the network hop to Redis (0.2–1ms) is too costly for extremely hot keys — keys accessed thousands of times per second per instance — add an in-process L1 cache in front of the distributed L2 cache.

| Tier | Technology | Latency | Scope |
|---|---|---|---|
| L1 | In-process (Caffeine, HashMap) | ~100ns | Single instance |
| L2 | Redis / Memcached | 0.2–1ms | Shared across all instances |
| Origin | Database | 5–50ms | Source of truth |

L1 caches introduce a new consistency challenge: each application instance has its own copy. When the data changes, L2 is invalidated centrally but L1 copies across N instances may remain stale until their TTL expires. Keep L1 TTLs very short (5–30 seconds) and use this only for data where brief staleness is acceptable.
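A minimal sketch of the L1 → L2 → origin read path, with a plain dict standing in for Redis and a short per-entry L1 TTL:

```python
import time

class TwoTierCache:
    """L1 (per-instance dict with a short TTL) in front of L2 (a shared
    store; a plain dict stands in for Redis here) in front of the origin."""

    def __init__(self, l2, load_from_origin, l1_ttl=5.0):
        self.l1 = {}                       # key -> (value, expires_at)
        self.l2 = l2
        self.load = load_from_origin
        self.l1_ttl = l1_ttl

    def get(self, key):
        hit = self.l1.get(key)
        if hit is not None and hit[1] > time.monotonic():
            return hit[0]                  # L1 hit: no network hop
        if key in self.l2:
            value = self.l2[key]           # L2 hit: one network hop
        else:
            value = self.load(key)         # origin: slowest path
            self.l2[key] = value
        self.l1[key] = (value, time.monotonic() + self.l1_ttl)
        return value
```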

Cache Warming

Instead of waiting for traffic to populate the cache (lazy loading), pre-populate the cache before traffic arrives.

Use cache warming for:

  • New deployments where the cache is empty and traffic is immediate
  • New regions being brought online with no organic traffic to build up the cache
  • Predictable spikes (Black Friday, marketing campaigns) where you know which keys will be hot

Warming is typically implemented as a background job that reads the most-accessed keys from the database and pre-populates the cache during a maintenance window or before a traffic shift.

The Numbers in Context

Raw numbers are meaningless without context. Here is what each number means for your design decisions.

| Number | Value | What It Means for Your Design |
|---|---|---|
| Redis throughput | ~100K ops/s (single-threaded) | One Redis instance handles most applications. If you need more, shard the key space — don't add complexity prematurely. |
| Memcached throughput | ~500K ops/s (multithreaded) | Choose Memcached over Redis when you need pure key-value throughput and don't need Redis data structures. |
| Redis p99 latency | <1ms in-region | Network hop to Redis is ~0.2ms. If your p99 is higher, check serialization costs and key sizes, not Redis itself. |
| Healthy hit rate | 95%+ aggregate | But segment by key type. A 95% aggregate rate can hide a 40% rate on long-tail keys that generate 70% of DB load. |
| Thundering herd threshold | 1K+ RPS on a single key | At this request rate, a single key expiry triggers hundreds of redundant DB queries. Singleflight is mandatory. |
| TTL jitter | ±10% of base TTL | Without jitter, batch-loaded keys expire simultaneously. A 300s TTL becomes 270–330s with jitter. |
| Serialization overhead | 1–5ms for JSON on complex objects | On hot paths (>1K RPS), JSON serialization can cost more than the cache lookup itself. Use protobuf or msgpack. |
| Negative cache TTL | 30–120s | Long enough to absorb burst lookups for nonexistent keys. Short enough that real data is visible within 2 minutes of creation. |

Hit Rate Segmentation

A 95% aggregate hit rate is a vanity metric. The real story is in the segments:

| Key Segment | Hit Rate | Query Share | DB Load Share |
|---|---|---|---|
| Popular (top 1K) | 99.5% | 60% | 3% |
| Mid-tail (next 100K) | 92% | 30% | 28% |
| Long-tail (rest) | 40% | 10% | 69% |

The long tail generates the majority of database load despite being a minority of traffic. If you optimize only for aggregate hit rate, you are ignoring the segment that actually stresses your database. Address the long tail with negative caching, shorter TTLs, and bloom filters for nonexistent-key filtering.
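The DB-load-share column falls out of simple arithmetic: each segment's share of database load is proportional to its query share times its miss rate. A quick sketch reproduces the numbers above:

```python
segments = {
    "popular":   {"hit_rate": 0.995, "query_share": 0.60},
    "mid_tail":  {"hit_rate": 0.92,  "query_share": 0.30},
    "long_tail": {"hit_rate": 0.40,  "query_share": 0.10},
}

# A segment's DB load is proportional to query_share * (1 - hit_rate).
misses = {name: s["query_share"] * (1 - s["hit_rate"])
          for name, s in segments.items()}
total_misses = sum(misses.values())
db_load_share = {name: m / total_misses for name, m in misses.items()}
```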

How This Shows Up in Interviews

Scenario 1: "Add caching to this read-heavy service"

The interviewer is testing whether you can justify which caching strategy and why. Do not say "add Redis." Say: "The read-to-write ratio is 100:1 and product tolerates 30-second staleness, so cache-aside with a 25-second TTL plus jitter is the right fit. I would add singleflight for the top-1K keys to prevent thundering herd on expiry."

Scenario 2: "The cache goes down — what happens?" (Full Walkthrough)

This tests whether your system degrades gracefully or collapses. Here's how a Staff engineer works through it:

Step 1 — Size the blast radius. "First, let me quantify the impact. If our cache serves 95% of reads at 10K RPS, a full cache failure means 10K RPS hitting the database. Our database is provisioned for 500 RPS of direct traffic. So this isn't 'reads are a bit slower' — this is a cascading failure that takes the database down within seconds."

Step 2 — Immediate protection: circuit breaker. "The application has a circuit breaker on the cache client. When cache error rates exceed 50% over a 10-second window, the circuit opens. In open state, we stop attempting cache reads entirely — no point adding timeout latency to every request."

Step 3 — Degraded mode, not failure mode. "With the circuit open, we enter degraded mode. The critical path (checkout, auth, account balance) routes directly to the database with aggressive connection pooling — this is ~5% of total read traffic, well within database capacity. The non-critical path (feed, recommendations, search suggestions) returns stale data from a local in-process fallback or a simplified response. Users see a slightly degraded experience, not an error page."

Step 4 — Rate limiting to protect the database. "For the non-critical reads that still need fresh data, we apply a request rate limiter: at most 200 RPS to the database for non-critical paths. Requests beyond this threshold get the degraded response. This ensures the database stays healthy for critical operations."

Step 5 — Recovery without a thundering herd. "When the cache comes back online, we don't flip all traffic back at once. We ramp the circuit breaker from open → half-open, allowing 10% of traffic through to warm the cache. As the hit rate climbs, we gradually increase to 50%, then 100%. This prevents a cold-cache thundering herd from immediately overwhelming the database again."

Why this is a Staff answer: It quantifies the impact before proposing a solution, distinguishes critical from non-critical traffic, and addresses the recovery path — not just the failure response.

Scenario 3: "Users are seeing stale data after updating their profile"

This tests invalidation and read-your-writes consistency. The answer involves routing the writing user to the primary for a brief window post-write (5 seconds), while other users continue reading from replicas that converge within the TTL window. Explain that this is a per-user consistency guarantee, not a global one.
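One way to implement that per-user window is to record each user's last write and pin their reads to the primary while the window is open. A sketch with an injectable clock for testability; names like `read_target` are illustrative, not from any framework:

```python
import time

PIN_WINDOW = 5.0   # seconds a writer's reads stay pinned to the primary
_last_write = {}   # user_id -> monotonic timestamp of their last write

def record_write(user_id, now=None):
    _last_write[user_id] = time.monotonic() if now is None else now

def read_target(user_id, now=None):
    """Route the writing user to the primary briefly; everyone else
    reads from replicas that converge within the TTL window."""
    now = time.monotonic() if now is None else now
    if now - _last_write.get(user_id, float("-inf")) < PIN_WINDOW:
        return "primary"
    return "replica"
```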

Scenario 4: "Your Redis cluster is at 90% memory"

This tests operational maturity. Do not say "add more memory." First audit: What is the TTL distribution? Are there keys with TTLs >24 hours? Run MEMORY USAGE on the largest keys — often 5% of keys consume 50% of memory. Evaluate the eviction policy (allkeys-lru is self-managing; noeviction means writes fail). After optimization, state the remaining capacity need and whether horizontal scaling (more shards) or vertical scaling (bigger instances) is appropriate.

Advanced Patterns

| Pattern | How It Works | When to Use |
|---|---|---|
| Singleflight | Concurrent misses for the same key collapse into one DB query | Popular keys above 100 RPS |
| Negative caching | Cache the fact that a key doesn't exist (short TTL) | 10%+ lookups for non-existent keys |
| Read-through | Cache itself fetches from DB on miss | Uniform miss handling across all consumers |
| Cache warming | Pre-populate before traffic arrives | Deployments, new regions, predictable spikes |
| Multi-tier | L1 (in-process) → L2 (Redis) → DB | When L2 network hop is too costly for hot keys |
| Stampede lock | On miss, one request acquires lock; others wait or get stale | Expensive DB queries (>100ms) on popular keys |

In the Wild

Abstract patterns are easier to internalize when you see how they were applied at scale. These are public, documented examples — not speculation.

Facebook TAO: The Graph Cache

Facebook's TAO (The Associations and Objects) cache handles billions of reads per second for the social graph — friend lists, posts, likes, comments. The architecture is a two-tier cache: per-region leader caches that handle writes and follower caches that handle reads. Read-through is the pattern — on miss, TAO fetches from MySQL and populates the cache. Write-through is used for mutations: every write goes to the leader cache, then synchronously to MySQL, then asynchronously invalidates follower caches across regions.

The Staff-level insight: TAO's per-object TTLs are set by access pattern, not by a global default. A celebrity's profile (read millions of times per second) has an aggressive TTL with event-driven invalidation. A dormant user's profile (read once a month) has a long TTL or no cache entry at all. This is hit-rate segmentation implemented at Facebook scale.

Netflix EVCache: Memcached at Planetary Scale

Netflix's EVCache serves 30+ million requests per second across AWS regions. It is a wrapper around Memcached with zone-aware replication: every write goes to Memcached instances in multiple availability zones. Reads prefer the local zone for latency but fall back to cross-zone replicas on miss.

The Staff-level insight: Netflix chose Memcached over Redis for EVCache because their workload is pure key-value — they don't need Redis data structures and Memcached's multithreaded architecture gives 3–5x more throughput per node. Technology selection is a tradeoff decision, not a default.

Discord: In-Process Caching for Hot Guilds

Discord found that their largest servers (guilds with millions of members) generated enough traffic to overwhelm their distributed cache. Their solution: an in-process L1 cache for hot guild data in front of their distributed cache. The L1 cache uses a 5-second TTL — brief enough that staleness is invisible in a chat context, long enough to absorb thousands of redundant reads per second for viral servers.

The Staff-level insight: This is multi-tier caching driven by observed hot-key behavior, not a premature optimization. They measured the problem (specific guilds saturating cache throughput) and applied L1 caching only where the data justified it.


Staff Calibration

The sections below are calibration tools for Staff-level interviews. If you already understand caching mechanics, start here to sharpen the framing that separates L5 from L6 answers.

What Staff Engineers Say (That Seniors Don't)

| Concept | Senior Response | Staff Response |
|---|---|---|
| Strategy choice | "We use Redis as a cache" | "We use cache-aside here because eventual staleness up to 30s is acceptable and write volume doesn't justify write-through overhead" |
| TTL selection | "Set TTL to 5 minutes" | "Product tolerates 60s stale for feed ranking; TTL is 45s with jitter to avoid synchronized expiry storms" |
| Invalidation | "Invalidate on write" | "Event-driven invalidation via CDC stream, with TTL as a consistency backstop, not the primary mechanism" |
| Failure mode | "Cache miss falls through to DB" | "Under cache failure we shed load with request coalescing and circuit-break to degraded responses rather than stampeding the database" |
| Hit rate | "Our hit rate is 95%" | "Hit rate is 95% aggregate but 70% for long-tail queries — that tail drives 40% of DB read load, so we added negative caching with short TTL" |

Common Interview Traps

  • Naming a strategy without justifying the tradeoff. "Cache-aside" is not an answer. Why cache-aside and not write-through? What staleness does the business accept?
  • Ignoring cache failure as a mode. Caches fail. If your design falls over when the cache is cold or unavailable, you have a single point of failure you called an "optimization."
  • Treating TTL as the invalidation strategy. TTL is a safety net. If TTL is your only invalidation mechanism, you have chosen to serve stale data for the duration of every TTL window and you should say so explicitly.
  • Overlooking cache penetration. Candidates almost always address thundering herd. They rarely address the steady-state cost of requests for keys that will never exist. Bloom filters or negative caching are the standard mitigations.
  • Using cache as a primary data store. Redis with no persistence is a speed layer. If it dies and data is gone, you had a SPOF you called "caching."
  • TTL jitter neglect. Same TTL on all batch keys = simultaneous expiry. Add ±10% random jitter.
  • Ignoring serialization cost. JSON serialization of complex objects can take 1–5ms. Consider protobuf or msgpack for hot paths.
  • Cache key collision. Using user:{id} for different endpoints caching different projections leads to wrong data. Include query shape in the key.

Practice Drill

Staff-Caliber Answer Shape
  1. Audit current usage. What's the TTL distribution? Are there keys with TTLs >24h that could be shortened? Stale data from deprecated features?
  2. Analyze key-size distribution. Run redis-cli --bigkeys. Often 5% of keys consume 50% of memory. Compress large values or move them to a dedicated store.
  3. Evaluate eviction policy. allkeys-lru self-manages. noeviction (common mistake) means writes fail when full.
  4. Consider tiering. Move cold keys to a cheaper store. Keep only the hot working set in Redis.
  5. Scale horizontally. Add Redis Cluster shards rather than resizing the instance.

The Staff move: Don't just ask for more memory. Show you've exhausted optimization first: "After TTL tuning and compression, we need 15% more capacity, not 30%."

Where This Appears

These playbooks apply caching foundations to complete system design problems with full Staff-level walkthroughs, evaluator-grade rubrics, and practice drills.

  • Distributed Caching — Multi-region cache topologies, CDC-based invalidation pipelines, cache-aside vs read-through at scale, and 6 Staff-level practice drills with answer shapes
  • CDN & Edge Caching — Edge TTL ownership, cache key design for personalized content, purge propagation, and the operational cost of stale-while-revalidate
  • Feed Generation — Fan-out-on-write vs fan-out-on-read caching, pre-computed feed materialization, and ranking freshness vs relevance tradeoffs
  • Search & Indexing — Query result caching, index warming strategies, and the consistency challenge of caching search results that change with every new document

Related Technologies: Redis · Elasticsearch

This is one of 9 foundation guides. The full library also includes deep-dive system design playbooks with evaluator-grade breakdowns, practice drills, and failure-mode analysis.