Scaling Reads — Cross-Cutting Pattern
The Problem
Every system eventually becomes read-heavy. The naive response is to throw caches at it, but every read optimization is a consistency debt. Staff engineers treat read scaling as a consistency budget allocation — deciding where, how long, and for whom stale data is acceptable — not as a performance trick.
Playbooks That Use This Pattern
- Distributed Caching — Cache-aside, write-through, invalidation
- CDN & Edge Caching — Edge layer for static and semi-static content
- Search & Indexing — Read-heavy query paths at scale
- Database Sharding — Partitioned read distribution
- Feed Generation — Pre-computed feed reads
The Core Tradeoff
| Strategy | What Works | What Breaks | Who Pays |
|---|---|---|---|
| Read replicas | Linear read throughput scaling | Replication lag becomes business staleness; lag spikes during write bursts | Consumers of stale reads — they see old data and make decisions on it |
| Cache-aside | Sub-millisecond reads for hot data | Invalidation is an organizational problem — who invalidates, when, and what happens if they don't | The team that owns the write path; they inherit cache coherence responsibility |
| CDN / Edge | Massive scale, zero origin load for cached content | Staleness ownership shifts to whoever sets the TTL; purge propagation is not instant | The team that sets TTL — they own every stale response served |
| Denormalization / CQRS | Purpose-built read models, zero join overhead | Write amplification — every write fans out to N read models; operational complexity doubles | The write path pays in latency and failure surface; the platform team pays in operational burden |
| Search index | Full-text, faceted, ranked queries the primary store cannot serve | Index lag means searches miss recent writes; reindexing at scale is a multi-hour operational event | Search infrastructure team owns the lag budget and reindex runbooks |
Staff Default Position
Start with read replicas and cache-aside. These are well-understood, operationally simple, and cover 80% of read scaling needs. Add CDN for static or semi-static content where staleness tolerance is measured in minutes. Reach for CQRS or search indexes only when the read model diverges structurally from the write model — not as a performance optimization.
When to Deviate
- Strong consistency required on reads — financial balances, inventory counts, access control checks. No caching layer; read from primary or use synchronous replication with read-your-writes guarantees.
- Write-heavy workload masquerading as read-heavy — if the bottleneck is actually write contention, adding read replicas solves nothing. Profile before scaling.
- Tiny dataset, high QPS — if the working set fits in a single node's memory, a local in-process cache beats a distributed cache. Don't add network hops for data that fits in RAM.
- Multi-region with strict ordering — CDN and replicas introduce cross-region lag. If users see their own writes inconsistently across regions, the read layer is the problem, not the solution.
Common Interview Mistakes
| What Candidates Say | What Interviewers Hear | What Staff Engineers Say |
|---|---|---|
| "We'll add a cache" | No invalidation strategy, no consistency thinking | "We'll use cache-aside with TTL-based expiry. Invalidation is owned by the write service via explicit deletes. Stale window is bounded to [N] seconds." |
| "Read replicas solve it" | No awareness of replication lag or its business impact | "Replicas handle the throughput. Replication lag is our staleness budget — we need product sign-off that [N]ms lag is acceptable for this read path." |
| "We'll put a CDN in front" | No TTL reasoning, no purge strategy | "CDN for assets and semi-static API responses. TTL of [N] minutes means users see stale data for up to [N] minutes after a write. Product owns that decision." |
| "CQRS separates reads and writes" | Doubled operational complexity with no justification | "CQRS is warranted here because the read model is structurally different — [specific shape]. The cost is write amplification and two data paths to monitor." |
| "We'll index everything in Elasticsearch" | No understanding of index lag or reindex operations | "Search index handles [specific query pattern] the primary store can't. Index lag of [N]s is acceptable for search but not for the detail page — that reads from primary." |
Implementation Deep Dive
Cache-Aside with Redis — The Staff Default
Cache-aside is the dominant read-scaling pattern because it is simple, explicit, and gives the application full control over what gets cached, for how long, and when it gets invalidated. Here is how to implement it properly.
Application-Level Pseudocode
```
function getUser(userId):
    cacheKey = "user:" + userId

    # Step 1 — Check cache
    cached = redis.GET(cacheKey)
    if cached != null:
        metrics.increment("cache.hit", tags=["prefix:user"])
        return deserialize(cached)

    # Step 2 — Cache miss: fetch from DB
    metrics.increment("cache.miss", tags=["prefix:user"])
    user = db.query("SELECT * FROM users WHERE id = ?", userId)
    if user == null:
        # Cache negative result to prevent repeated DB lookups
        redis.SET(cacheKey, "NULL", EX=60)
        return null

    # Step 3 — Populate cache
    redis.SET(cacheKey, serialize(user), EX=3600)
    return user
```
Why this order matters. The application reads from cache first, falls through to DB on miss, and writes back to cache after. The cache never writes to the DB — the write path is completely separate. This means a cache failure degrades to "all reads hit DB" rather than "writes are lost."
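To make that ordering concrete, here is a minimal runnable sketch of the same read path in Python. `FakeRedis` and the plain-dict `db` are stand-ins for the real Redis client and SQL lookup, and the `NEGATIVE` sentinel is an illustrative choice for negative caching, not a Redis convention.

```python
import time

class FakeRedis:
    """Minimal in-memory stand-in for Redis GET / SET-with-TTL semantics."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # lazily expire on read
            return None
        return value

    def set(self, key, value, ex):
        self._store[key] = (value, time.monotonic() + ex)

    def delete(self, key):
        self._store.pop(key, None)

NEGATIVE = "NULL"  # sentinel meaning "row confirmed absent"

def get_user(cache, db, user_id, stats):
    """Cache-aside read: cache first, DB on miss, write-back after."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        stats["hit"] += 1
        return None if cached == NEGATIVE else cached
    stats["miss"] += 1
    user = db.get(user_id)  # stand-in for the SQL point lookup
    if user is None:
        cache.set(key, NEGATIVE, ex=60)  # negative caching, short TTL
        return None
    cache.set(key, user, ex=3600)
    return user
```

Note that the cache never writes to `db`: a cache outage degrades every call to the `db.get` branch, but no write is ever lost.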
Redis Commands in Practice
| Command | Purpose | Example |
|---|---|---|
| GET key | Cache lookup | GET user:12345 |
| SET key value EX ttl | Populate cache with TTL | SET user:12345 '{...}' EX 3600 |
| DEL key | Explicit invalidation on write | DEL user:12345 |
| MGET key1 key2 ... | Batch cache lookup | MGET user:1 user:2 user:3 |
| SETEX key ttl value | Atomic set + expire | SETEX user:12345 3600 '{...}' |
🎯 Staff Move: Use MGET for batch lookups instead of looping GET. Fetching 100 keys in one round-trip takes ~1ms; 100 sequential GETs take ~50ms on a typical network.
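A sketch of the batching idea, assuming a dict stands in for Redis and `batch_size` is an application-level cap (an assumption, not a Redis limit). Real clients expose this directly, e.g. redis-py's `mget`.

```python
def batched_mget(store, keys, batch_size=100):
    """MGET-style lookup: order-preserving, None for misses, and one
    round trip per batch instead of one per key."""
    values = []
    for i in range(0, len(keys), batch_size):
        chunk = keys[i:i + batch_size]
        values.extend(store.get(k) for k in chunk)  # one MGET per chunk
    return values
```

Chunking bounds the payload of any single command, so one huge batch cannot monopolize the single-threaded Redis event loop.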
Connection Pooling Configuration
Redis is single-threaded, so connection count matters less than connection reuse:
| Setting | Recommended Value | Rationale |
|---|---|---|
| Pool size per app instance | 10–20 connections | Matches typical concurrent request count |
| Max idle time | 300s | Prevents stale connections from accumulating |
| Connection timeout | 250ms | Fail fast on network issues; don't queue behind slow connects |
| Command timeout | 100ms | Cache misses should degrade, not block |
| Max retries | 1 | One retry for transient errors; beyond that, fall through to DB |
For 20 app servers with pool size 15, you have 300 total connections to Redis. A single Redis 7 node handles ~10K connections comfortably, so this leaves headroom for growth.
Thundering Herd Protection — Singleflight Pattern
When a popular cache key expires, hundreds of requests simultaneously hit the DB for the same row. The singleflight (request coalescing) pattern ensures only one request fetches from DB while others wait for its result:
```
# In-process lock map (per app instance)
inflight = ConcurrentMap<String, Future>

function getUserCoalesced(userId):
    cacheKey = "user:" + userId
    cached = redis.GET(cacheKey)
    if cached != null:
        return deserialize(cached)

    # Check if another request is already fetching this key
    if inflight.containsKey(cacheKey):
        return inflight.get(cacheKey).await()  # Wait for the in-flight request

    # This request wins — it will fetch from DB
    future = new Future()
    inflight.put(cacheKey, future)
    try:
        user = db.query("SELECT * FROM users WHERE id = ?", userId)
        redis.SET(cacheKey, serialize(user), EX=3600)
        future.complete(user)  # Broadcast result to all waiters
        return user
    catch error:
        future.completeExceptionally(error)  # Fail the waiters too, or they hang forever
        raise error
    finally:
        inflight.remove(cacheKey)
```
🎯 Staff Move: Singleflight is per-process. In a 20-server fleet, up to 20 DB queries fire simultaneously on a cold key. If that is still too many, use a distributed lock (SET key:lock NX EX 5) to coalesce across the fleet — but the added complexity is rarely justified for cache misses.
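The same coalescing logic, sketched as a runnable thread-based Python version using `concurrent.futures.Future`. The `Singleflight` class and `loader` callback are illustrative names, not a standard library API.

```python
import threading
from concurrent.futures import Future

class Singleflight:
    """Coalesce concurrent loads of the same key into a single loader call."""
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> Future for the in-flight load

    def do(self, key, loader):
        with self._lock:
            fut = self._inflight.get(key)
            if fut is not None:
                is_winner = False  # someone else is already loading this key
            else:
                fut = Future()
                self._inflight[key] = fut
                is_winner = True
        if not is_winner:
            return fut.result()  # block until the winner finishes (or fails)
        try:
            value = loader()
            fut.set_result(value)  # broadcast the value to all waiters
            return value
        except BaseException as exc:
            fut.set_exception(exc)  # wake waiters on failure too
            raise
        finally:
            with self._lock:
                self._inflight.pop(key, None)
```

The check-and-insert happens under one lock, so exactly one caller per key becomes the winner; every other concurrent caller blocks on the shared future instead of hitting the DB.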
Monitoring: What to Track
| Metric | Target | Alert Threshold | Why |
|---|---|---|---|
| Cache hit rate (per prefix) | >95% for hot paths | <90% sustained 5 min | Low hit rate means the cache is not covering the working set |
| GET latency p50 / p99 | 0.5ms / 2ms | p99 >10ms | Network or memory pressure on Redis node |
| Eviction rate | Near zero | >100/sec sustained | Working set exceeds maxmemory; increase memory or reduce TTLs |
| Connection pool utilization | <70% | >90% sustained | Pool exhaustion causes queuing; increase pool or reduce concurrency |
| Miss-to-fill latency (DB round trip) | <50ms | p99 >200ms | Slow fills mean slow user requests on every cache miss |
The numbers that matter. For 10K QPS reads, a 95% hit rate means 500 DB reads/sec. A 99% hit rate means 100. That is a 5x difference in DB load from a 4-percentage-point change. This is why hit rate is the single most important cache metric — small improvements have outsized impact on backend pressure.
🎯 Staff Move: Track hit rate per key prefix, not globally. A 97% aggregate hit rate can hide a 40% hit rate on your most expensive query pattern. Break out user:, product:, feed: separately.
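A per-prefix tracker is only a few lines; this sketch assumes keys follow the `prefix:id` convention used above, and the class name is illustrative.

```python
from collections import defaultdict

class HitRateTracker:
    """Track cache hit rate per key prefix (e.g. 'user:', 'product:')."""
    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])  # prefix -> [hits, total]

    def record(self, key, hit):
        prefix = key.split(":", 1)[0]
        self.counts[prefix][1] += 1
        if hit:
            self.counts[prefix][0] += 1

    def rate(self, prefix):
        hits, total = self.counts[prefix]
        return hits / total if total else 0.0

    def aggregate(self):
        hits = sum(h for h, _ in self.counts.values())
        total = sum(t for _, t in self.counts.values())
        return hits / total if total else 0.0
```

A quick worked case shows why the breakdown matters: 990/1000 hits on user keys plus 4/10 on product keys yields an aggregate above 98%, while the product prefix is running at 40%.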
Architecture Diagram
Read path (numbered): App server checks Redis (1). On miss, reads from a read replica (2). Writes result back to Redis (3). CDN handles static content before it reaches the application tier.
Write path (dashed): Writes go to the primary DB. The write service issues DEL on the cache key immediately after commit. Subsequent reads re-fill the cache from the replica.
🎯 Staff Move: Mention the three-tier cache hierarchy in interviews: in-process LRU (sub-microsecond, tiny capacity) → Redis (sub-millisecond, shared across fleet) → read replica (millisecond, full dataset). Each tier has a different staleness budget and failure mode.
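The three-tier lookup can be sketched as a single walk with back-fill; the dicts here stand in for the in-process LRU, Redis, and the replica.

```python
def tiered_get(key, tiers):
    """Walk cache tiers from fastest to slowest; back-fill faster tiers on a hit."""
    for i, tier in enumerate(tiers):
        value = tier.get(key)
        if value is not None:
            for faster in tiers[:i]:
                faster[key] = value  # promote so the next read hits a faster tier
            return value
    return None
```

The promotion step is what makes the hierarchy pay off: a key that misses the in-process tier once is served from memory on the next request.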
Failure Scenarios
1. Thundering Herd on Cache Cold Start
Timeline: Redis cluster restarts after maintenance. All cache keys are gone. First requests all miss cache simultaneously. DB receives 50x normal read traffic in the first 30 seconds.
Blast radius: Database connection pool exhaustion → cascading timeouts → 500 errors for all users, not just those with cold keys.
Detection: Cache hit rate drops to 0%. DB active connections spike to pool max. Error rate exceeds threshold.
Recovery:
- Enable singleflight coalescing to limit concurrent DB fills per key
- Use cache warming: pre-populate top 1,000 hot keys from a scheduled job before shifting traffic
- Apply request rate limiting per user to cap the fill rate
- If DB is overwhelmed, enable a circuit breaker that returns stale data from a backup source
🎯 Staff Insight: A cold cache is not a performance problem — it is an availability problem. Warm the cache as part of the deployment runbook, not as an afterthought.
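A warming job reduces to a batched fill; `fetch_rows`, `cache_set`, and `hot_keys` are assumed hooks into your DB and cache clients, named here for illustration.

```python
def warm_cache(cache_set, fetch_rows, hot_keys, ttl=3600):
    """Pre-fill the hottest keys from the DB before shifting traffic."""
    warmed = 0
    for key, row in fetch_rows(hot_keys):  # one batched scan, not N point queries
        cache_set(key, row, ttl)
        warmed += 1
    return warmed
```

Run this from the deployment pipeline and gate the traffic shift on its completion, so the first production request never lands on a fully cold cache.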
2. Cache Inconsistency After Failed Invalidation
Timeline: Write service updates a user's email in PostgreSQL, then calls DEL user:12345 on Redis. The Redis DEL fails (network timeout). Cache still holds the old email. User sees stale data for up to the remaining TTL (potentially hours).
Blast radius: One user sees stale data. If the failed invalidation affects a shared entity (product price, feature flag), all users see stale data.
Detection: Hard to detect proactively. Often found via user complaint. Can be detected by comparing cache values to DB values in a background audit job.
Recovery:
- Retry invalidation with exponential backoff (fire-and-forget is insufficient for critical keys)
- Use a CDC (change data capture) stream as a secondary invalidation channel — Debezium watches the binlog and issues a Redis DEL for every changed row
- TTL provides a bounded staleness window — stale data eventually expires even if explicit invalidation fails
🎯 Staff Insight: Never rely on a single invalidation path. The write service DEL handles the happy path; CDC handles the failure path. TTL is the backstop.
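The happy-path retry can be sketched as follows; `delete_key` is an assumed callback wrapping the Redis DEL, and the backoff values are illustrative defaults.

```python
import time

def invalidate_with_retry(delete_key, key, attempts=3, base_delay=0.05):
    """Explicit invalidation with exponential backoff. Returns False when all
    attempts fail, at which point CDC replay and the key's TTL bound staleness."""
    for attempt in range(attempts):
        try:
            delete_key(key)  # e.g. the Redis DEL issued by the write service
            return True
        except ConnectionError:
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))  # 50ms, 100ms, ...
    return False
```

The return value matters: a False should emit a metric or enqueue the key for the CDC/audit path, never be silently dropped.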
3. Read Replica Lag Causing Stale Business Decisions
Timeline: Primary DB processes a payment that sets order.status = 'paid'. The read replica has 800ms replication lag. A read request hits the replica and sees order.status = 'pending'. The system triggers a duplicate payment reminder or blocks the user from proceeding.
Blast radius: Any business logic reading from replicas during the lag window makes decisions on stale state. Payment double-charges, inventory oversells, access control violations are all possible.
Detection: Monitor pg_stat_replication.replay_lag. Alert when lag exceeds your staleness budget (e.g., >500ms for order status reads).
Recovery:
- For critical reads (payment status, inventory counts, access control), always read from primary
- Implement read-your-writes consistency: after a write, the response includes a monotonic position marker; subsequent reads include this marker and are routed to primary if the replica has not caught up
- Accept replica lag for non-critical reads (profile display, feed content, analytics)
🎯 Staff Insight: Classify every read path as "critical" or "best-effort." Critical reads go to primary; best-effort reads go to replicas. This classification is a product decision, not an engineering decision.
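The read-your-writes routing above can be sketched with a monotonic position marker; `Primary`, `Replica`, and `replay_to` are simplified stand-ins for a real database and its replication stream.

```python
class Primary:
    """Stand-in primary: every write advances a monotonic position (like an LSN)."""
    def __init__(self):
        self.rows, self.position = {}, 0

    def write(self, key, value):
        self.position += 1
        self.rows[key] = value
        return self.position  # the client keeps this marker with its session

class Replica:
    """Stand-in replica that lags until replay_to() is called."""
    def __init__(self, primary):
        self.primary, self.rows, self.position = primary, {}, 0

    def replay_to(self, position):
        self.rows = dict(self.primary.rows)  # simulated replication catch-up
        self.position = position

def read(key, marker, primary, replica):
    """Route to primary only when the replica hasn't replayed the client's write."""
    if marker is not None and replica.position < marker:
        return primary.rows.get(key)
    return replica.rows.get(key)
```

A client that passes no marker gets the cheap best-effort read; a client that just wrote passes its marker and is routed to the primary until the replica catches up.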
Staff Interview Application
How to Introduce This Pattern
Open with the read-to-write ratio to anchor the conversation in data, then state the default approach and the consistency tradeoff. This tells the interviewer you have a repeatable mental model, not just a grab-bag of techniques.
When NOT to Use This Pattern
- Write-heavy workload: If the bottleneck is write throughput, read scaling techniques waste effort. Profile first — the constraint might be lock contention, not read volume.
- Strong consistency required on every read: Financial balances, inventory decrements, access control checks. Caching introduces staleness. Route these reads to the primary.
- Tiny dataset, high QPS: If the entire working set fits in application memory (<1 GB), an in-process LRU cache avoids the network round-trip to Redis entirely. Don't add distributed cache infrastructure for data that fits in RAM.
- Rapidly changing data: If every cache entry is invalidated within seconds of being set, the hit rate will be too low to justify the cache layer. The cache only helps when data lives long enough to be reused.
Follow-Up Questions to Anticipate
| Interviewer Asks | What They Are Testing | How to Respond |
|---|---|---|
| "What happens when Redis goes down?" | Graceful degradation thinking | "All traffic falls through to read replicas. Latency increases from ~1ms to ~10ms, but no data loss. We monitor cache availability and alert on connection failures." |
| "How do you handle cache invalidation?" | Consistency reasoning | "Explicit DEL on the write path, backed by a CDC stream as a secondary channel. TTL provides bounded staleness as a backstop." |
| "How do you size the cache?" | Capacity planning instincts | "Working set analysis: identify the top N keys by access frequency. For user profiles, the 80/20 rule usually means 20% of users generate 80% of reads. Size for that hot set." |
| "What if the cache hit rate is low?" | Debugging and analysis skills | "Segment hit rate by key prefix. Common causes: TTL too short, key cardinality too high (unique queries), or cold-start after deployment. Fix depends on the root cause." |
| "Why not use a write-through cache?" | Understanding of pattern tradeoffs | "Write-through couples the write path to cache availability — if Redis is slow, writes are slow. Cache-aside decouples them. I use write-through only when read-after-write consistency is critical." |
Capacity Planning Quick Reference
Sizing a Redis Cache-Aside Layer
| Parameter | Formula | Example (10K QPS, 95% hit rate) |
|---|---|---|
| Required cache memory | working_set_size × avg_value_size | 5M keys × 2KB = 10GB |
| Redis node count | memory / per_node_usable (leave 30% headroom) | 10GB / 10GB usable per node = 1 primary; +1 replica for HA = 2 nodes |
| Connection pool (total) | app_servers × pool_size_per_server | 20 × 15 = 300 connections |
| DB read load (misses) | total_QPS × (1 - hit_rate) | 10K × 0.05 = 500 reads/sec |
| CDN offload (cacheable) | total_requests × cacheable_pct × hit_rate | 50K × 0.40 × 0.95 = 19K req/sec offloaded |
| Read replica count | miss_QPS / replica_throughput | 500 / 5,000 = 1 replica (add 1 for HA = 2) |
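These formulas are simple enough to sanity-check in a few lines; this sketch reproduces the worked example, with the parameter names chosen here for illustration.

```python
def plan(qps, hit_rate, keys, avg_value_bytes, app_servers, pool_size,
         replica_capacity_qps):
    """Back-of-envelope sizing for a cache-aside layer."""
    miss_qps = qps * (1 - hit_rate)
    replicas = max(1, int(-(-miss_qps // replica_capacity_qps)))  # ceiling division
    return {
        "cache_memory_gb": keys * avg_value_bytes / 1e9,
        "total_connections": app_servers * pool_size,
        "db_miss_qps": miss_qps,
        "replica_count_with_ha": replicas + 1,  # add one replica for HA
    }
```

Running it with the example inputs (10K QPS, 95% hit rate, 5M keys at 2KB, 20 servers with pools of 15, 5K QPS per replica) reproduces the table: 10GB of cache memory, 300 connections, 500 miss reads/sec, and 2 replicas.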
Latency Budget Breakdown
| Tier | p50 Latency | p99 Latency | When It Is Used |
|---|---|---|---|
| In-process LRU | <0.01ms | <0.05ms | Ultra-hot keys, config, feature flags |
| Redis cache hit | 0.3ms | 1ms | Majority of reads (95%+ of traffic) |
| Read replica (cache miss) | 2ms | 15ms | Long tail, cold keys |
| Primary DB (strong read) | 3ms | 20ms | Consistency-critical reads only |
| CDN edge hit | 1ms | 5ms | Static/semi-static content, geographically distributed |
Key Numbers Worth Memorizing
| Metric | Value | Why It Matters |
|---|---|---|
| Redis GET throughput (single node) | ~100K ops/sec | Ceiling for a single cache node; pipeline or cluster beyond this |
| Redis memory overhead per key | ~80-100 bytes | 10M keys with 0-byte values still costs ~1GB of overhead |
| PostgreSQL read replica lag (typical) | 1-10ms | The staleness floor for replica-based reads; spikes during write bursts |
| CDN cache invalidation propagation | 5-30 seconds | TTL-based expiry is faster and simpler than explicit invalidation for most use cases |
| Cache cold-start fill time (10M keys) | 5-15 minutes | Warm caches before traffic shift; never cut over cold |
| Network round-trip (same AZ) | 0.1-0.5ms | The baseline cost of any distributed cache lookup |
| MGET batch lookup (100 keys) | ~1ms | 100x faster than 100 sequential GETs (~50ms); always batch |
| Cache hit rate impact on DB | 95% → 500 DB reads/sec; 99% → 100 | 4-percentage-point improvement = 5x reduction in DB load |
| Typical working set ratio | 20% of keys serve 80% of reads | Size the cache for the hot set, not the full dataset |
Common Pitfalls Checklist
| Pitfall | Symptom | Fix |
|---|---|---|
| No negative caching | Cache miss storm on non-existent keys (e.g., user lookup by email typo) | Cache NULL results with short TTL (60s) |
| Global hit rate metric only | Aggregate 97% hides a 40% hit rate on the most expensive query | Track hit rate per key prefix and alert per-prefix |
| No cache warming on deploy | New deployment starts cold; DB overwhelmed for first 5 minutes | Pre-warm top 10K keys from a scheduled job before traffic shift |
| TTL too uniform | All keys expire at the same second; thundering herd on mass expiry | Add random jitter: TTL = base_ttl + random(0, base_ttl * 0.1) |
| Unbounded value size | A single 10MB cached value blocks Redis for 5ms, starving other operations | Enforce max value size (e.g., 512KB); compress or paginate larger values |
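The TTL jitter fix from the checklist is one line in practice; this sketch assumes a 10% jitter window, which is a common convention rather than a fixed rule.

```python
import random

def jittered_ttl(base_ttl, jitter_frac=0.1, rng=random):
    """Spread expiry times so keys populated together don't all expire together."""
    return base_ttl + rng.randint(0, int(base_ttl * jitter_frac))
```

For a 3600s base TTL this yields values between 3600 and 3960, turning a single mass-expiry spike into a ten-minute trickle of refills.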