Scaling Reads — Cross-Cutting Pattern
The Problem
Every system eventually becomes read-heavy. The naive response is to throw caches at it, but every read optimization is a consistency debt. Staff engineers treat read scaling as a consistency budget allocation — deciding where, how long, and for whom stale data is acceptable — not as a performance trick.
Playbooks That Use This Pattern
- Distributed Caching — Cache-aside, write-through, invalidation
- CDN & Edge Caching — Edge layer for static and semi-static content
- Search & Indexing — Read-heavy query paths at scale
- Database Sharding — Partitioned read distribution
- Feed Generation — Pre-computed feed reads
The Core Tradeoff
| Strategy | What Works | What Breaks | Who Pays |
|---|---|---|---|
| Read replicas | Linear read throughput scaling | Replication lag becomes business staleness; lag spikes during write bursts | Consumers of stale reads — they see old data and make decisions on it |
| Cache-aside | Sub-millisecond reads for hot data | Invalidation is an organizational problem — who invalidates, when, and what happens if they don't | The team that owns the write path; they inherit cache coherence responsibility |
| CDN / Edge | Massive scale, zero origin load for cached content | Staleness ownership shifts to whoever sets the TTL; purge propagation is not instant | The team that sets TTL — they own every stale response served |
| Denormalization / CQRS | Purpose-built read models, zero join overhead | Write amplification — every write fans out to N read models; operational complexity doubles | The write path pays in latency and failure surface; the platform team pays in operational burden |
| Search index | Full-text, faceted, ranked queries the primary store cannot serve | Index lag means searches miss recent writes; reindexing at scale is a multi-hour operational event | Search infrastructure team owns the lag budget and reindex runbooks |
Staff Default Position
Start with read replicas and cache-aside. These are well-understood, operationally simple, and cover 80% of read scaling needs. Add CDN for static or semi-static content where staleness tolerance is measured in minutes. Reach for CQRS or search indexes only when the read model diverges structurally from the write model — not as a performance optimization.
When to Deviate
- Strong consistency required on reads — financial balances, inventory counts, access control checks. No caching layer; read from primary or use synchronous replication with read-your-writes guarantees.
- Write-heavy workload masquerading as read-heavy — if the bottleneck is actually write contention, adding read replicas solves nothing. Profile before scaling.
- Tiny dataset, high QPS — if the working set fits in a single node's memory, a local in-process cache beats a distributed cache. Don't add network hops for data that fits in RAM.
- Multi-region with strict ordering — CDN and replicas introduce cross-region lag. If users see their own writes inconsistently across regions, the read layer is the problem, not the solution.
Common Interview Mistakes
| What Candidates Say | What Interviewers Hear | What Staff Engineers Say |
|---|---|---|
| "We'll add a cache" | No invalidation strategy, no consistency thinking | "We'll use cache-aside with TTL-based expiry. Invalidation is owned by the write service via explicit deletes. Stale window is bounded to [N] seconds." |
| "Read replicas solve it" | No awareness of replication lag or its business impact | "Replicas handle the throughput. Replication lag is our staleness budget — we need product sign-off that [N]ms lag is acceptable for this read path." |
| "We'll put a CDN in front" | No TTL reasoning, no purge strategy | "CDN for assets and semi-static API responses. TTL of [N] minutes means users see stale data for up to [N] minutes after a write. Product owns that decision." |
| "CQRS separates reads and writes" | Doubled operational complexity with no justification | "CQRS is warranted here because the read model is structurally different — [specific shape]. The cost is write amplification and two data paths to monitor." |
| "We'll index everything in Elasticsearch" | No understanding of index lag or reindex operations | "Search index handles [specific query pattern] the primary store can't. Index lag of [N]s is acceptable for search but not for the detail page — that reads from primary." |
Implementation Deep Dive
Cache-Aside with Redis — The Staff Default
Cache-aside is the dominant read-scaling pattern because it is simple, explicit, and gives the application full control over what gets cached, for how long, and when it gets invalidated. Here is how to implement it properly.
Application-Level Pseudocode
```
function getUser(userId):
    cacheKey = "user:" + userId

    # Step 1 — Check cache
    cached = redis.GET(cacheKey)
    if cached != null:
        metrics.increment("cache.hit", tags=["prefix:user"])
        return deserialize(cached)

    # Step 2 — Cache miss: fetch from DB
    metrics.increment("cache.miss", tags=["prefix:user"])
    user = db.query("SELECT * FROM users WHERE id = ?", userId)
    if user == null:
        # Cache negative result to prevent repeated DB lookups
        redis.SET(cacheKey, "NULL", EX=60)
        return null

    # Step 3 — Populate cache
    redis.SET(cacheKey, serialize(user), EX=3600)
    return user
```
Why this order matters. The application reads from cache first, falls through to DB on miss, and writes back to cache after. The cache never writes to the DB — the write path is completely separate. This means a cache failure degrades to "all reads hit DB" rather than "writes are lost."
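To make that ordering concrete, here is a minimal runnable sketch of the same read path in Python. `FakeRedis` and the plain-dict `db` are stand-ins for the real Redis client and SQL lookup, and the `NEGATIVE` sentinel is an illustrative choice for negative caching, not a Redis convention.

```python
import time

class FakeRedis:
    """Minimal in-memory stand-in for Redis GET / SET-with-TTL semantics."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # lazily expire on read
            return None
        return value

    def set(self, key, value, ex):
        self._store[key] = (value, time.monotonic() + ex)

    def delete(self, key):
        self._store.pop(key, None)

NEGATIVE = "NULL"  # sentinel meaning "row confirmed absent"

def get_user(cache, db, user_id, stats):
    """Cache-aside read: cache first, DB on miss, write-back after."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        stats["hit"] += 1
        return None if cached == NEGATIVE else cached
    stats["miss"] += 1
    user = db.get(user_id)  # stand-in for the SQL point lookup
    if user is None:
        cache.set(key, NEGATIVE, ex=60)  # negative caching, short TTL
        return None
    cache.set(key, user, ex=3600)
    return user
```

Note that the cache never writes to `db`: a cache outage degrades every call to the `db.get` branch, but no write is ever lost.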
Redis Commands in Practice
| Command | Purpose | Example |
|---|---|---|
| GET key | Cache lookup | GET user:12345 |
| SET key value EX ttl | Populate cache with TTL | SET user:12345 '{...}' EX 3600 |
| DEL key | Explicit invalidation on write | DEL user:12345 |
| MGET key1 key2 ... | Batch cache lookup | MGET user:1 user:2 user:3 |
| SETEX key ttl value | Atomic set + expire | SETEX user:12345 3600 '{...}' |
🎯 Staff Move: Use MGET for batch lookups instead of looping GET. Fetching 100 keys in one round-trip takes ~1ms; 100 sequential GETs take ~50ms on a typical network.
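A sketch of the batching idea, assuming a dict stands in for Redis and `batch_size` is an application-level cap (an assumption, not a Redis limit). Real clients expose this directly, e.g. redis-py's `mget`.

```python
def batched_mget(store, keys, batch_size=100):
    """MGET-style lookup: order-preserving, None for misses, and one
    round trip per batch instead of one per key."""
    values = []
    for i in range(0, len(keys), batch_size):
        chunk = keys[i:i + batch_size]
        values.extend(store.get(k) for k in chunk)  # one MGET per chunk
    return values
```

Chunking bounds the payload of any single command, so one huge batch cannot monopolize the single-threaded Redis event loop.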
Connection Pooling Configuration
Redis is single-threaded, so connection count matters less than connection reuse:
| Setting | Recommended Value | Rationale |
|---|---|---|
| Pool size per app instance | 10–20 connections | Matches typical concurrent request count |
| Max idle time | 300s | Prevents stale connections from accumulating |
| Connection timeout | 250ms | Fail fast on network issues; don't queue behind slow connects |
| Command timeout | 100ms | Cache misses should degrade, not block |
| Max retries | 1 | One retry for transient errors; beyond that, fall through to DB |
For 20 app servers with pool size 15, you have 300 total connections to Redis. A single Redis 7 node handles ~10K connections comfortably, so this leaves headroom for growth.
Thundering Herd Protection — Singleflight Pattern
When a popular cache key expires, hundreds of requests simultaneously hit the DB for the same row. The singleflight (request coalescing) pattern ensures only one request fetches from DB while others wait for its result:
```
# In-process lock map (per app instance)
inflight = ConcurrentMap<String, Future>

function getUserCoalesced(userId):
    cacheKey = "user:" + userId
    cached = redis.GET(cacheKey)
    if cached != null:
        return deserialize(cached)

    # Check if another request is already fetching this key
    if inflight.containsKey(cacheKey):
        return inflight.get(cacheKey).await()  # Wait for the in-flight request

    # This request wins — it will fetch from DB
    future = new Future()
    inflight.put(cacheKey, future)
    try:
        user = db.query("SELECT * FROM users WHERE id = ?", userId)
        redis.SET(cacheKey, serialize(user), EX=3600)
        future.complete(user)  # Broadcast result to all waiters
        return user
    catch error:
        future.completeExceptionally(error)  # Fail the waiters too, or they hang forever
        raise error
    finally:
        inflight.remove(cacheKey)
```
🎯 Staff Move: Singleflight is per-process. In a 20-server fleet, up to 20 DB queries fire simultaneously on a cold key. If that is still too many, use a distributed lock (SET key:lock NX EX 5) to coalesce across the fleet — but the added complexity is rarely justified for cache misses.
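The same coalescing logic, sketched as a runnable thread-based Python version using `concurrent.futures.Future`. The `Singleflight` class and `loader` callback are illustrative names, not a standard library API.

```python
import threading
from concurrent.futures import Future

class Singleflight:
    """Coalesce concurrent loads of the same key into a single loader call."""
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> Future for the in-flight load

    def do(self, key, loader):
        with self._lock:
            fut = self._inflight.get(key)
            if fut is not None:
                is_winner = False  # someone else is already loading this key
            else:
                fut = Future()
                self._inflight[key] = fut
                is_winner = True
        if not is_winner:
            return fut.result()  # block until the winner finishes (or fails)
        try:
            value = loader()
            fut.set_result(value)  # broadcast the value to all waiters
            return value
        except BaseException as exc:
            fut.set_exception(exc)  # wake waiters on failure too
            raise
        finally:
            with self._lock:
                self._inflight.pop(key, None)
```

The check-and-insert happens under one lock, so exactly one caller per key becomes the winner; every other concurrent caller blocks on the shared future instead of hitting the DB.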
Monitoring: What to Track
| Metric | Target | Alert Threshold | Why |
|---|---|---|---|
| Cache hit rate (per prefix) | >95% for hot paths | <90% sustained 5 min | Low hit rate means the cache is not covering the working set |
| GET latency p50 / p99 | 0.5ms / 2ms | p99 >10ms | Network or memory pressure on Redis node |
| Eviction rate | Near zero | >100/sec sustained | Working set exceeds maxmemory; increase memory or reduce TTLs |
| Connection pool utilization | <70% | >90% sustained | Pool exhaustion causes queuing; increase pool or reduce concurrency |
| Miss-to-fill latency (DB round trip) | <50ms | p99 >200ms | Slow fills mean slow user requests on every cache miss |
The numbers that matter. For 10K QPS reads, a 95% hit rate means 500 DB reads/sec. A 99% hit rate means 100. That is a 5x difference in DB load from a 4-percentage-point change. This is why hit rate is the single most important cache metric — small improvements have outsized impact on backend pressure.
🎯 Staff Move: Track hit rate per key prefix, not globally. A 97% aggregate hit rate can hide a 40% hit rate on your most expensive query pattern. Break out user:, product:, feed: separately.
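A per-prefix tracker is only a few lines; this sketch assumes keys follow the `prefix:id` convention used above, and the class name is illustrative.

```python
from collections import defaultdict

class HitRateTracker:
    """Track cache hit rate per key prefix (e.g. 'user:', 'product:')."""
    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])  # prefix -> [hits, total]

    def record(self, key, hit):
        prefix = key.split(":", 1)[0]
        self.counts[prefix][1] += 1
        if hit:
            self.counts[prefix][0] += 1

    def rate(self, prefix):
        hits, total = self.counts[prefix]
        return hits / total if total else 0.0

    def aggregate(self):
        hits = sum(h for h, _ in self.counts.values())
        total = sum(t for _, t in self.counts.values())
        return hits / total if total else 0.0
```

A quick worked case shows why the breakdown matters: 990/1000 hits on user keys plus 4/10 on product keys yields an aggregate above 98%, while the product prefix is running at 40%.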
Architecture Diagram
Read path (numbered): App server checks Redis (1). On miss, reads from a read replica (2). Writes result back to Redis (3). CDN handles static content before it reaches the application tier.
Write path (dashed): Writes go to the primary DB. The write service issues DEL on the cache key immediately after commit. Subsequent reads re-fill the cache from the replica.
🎯 Staff Move: Mention the three-tier cache hierarchy in interviews: in-process LRU (sub-microsecond, tiny capacity) → Redis (sub-millisecond, shared across fleet) → read replica (millisecond, full dataset). Each tier has a different staleness budget and failure mode.
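The three-tier lookup can be sketched as a single walk with back-fill; the dicts here stand in for the in-process LRU, Redis, and the replica.

```python
def tiered_get(key, tiers):
    """Walk cache tiers from fastest to slowest; back-fill faster tiers on a hit."""
    for i, tier in enumerate(tiers):
        value = tier.get(key)
        if value is not None:
            for faster in tiers[:i]:
                faster[key] = value  # promote so the next read hits a faster tier
            return value
    return None
```

The promotion step is what makes the hierarchy pay off: a key that misses the in-process tier once is served from memory on the next request.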
Failure Scenarios
1. Thundering Herd on Cache Cold Start
Timeline: Redis cluster restarts after maintenance. All cache keys are gone. First requests all miss cache simultaneously. DB receives 50x normal read traffic in the first 30 seconds.
Blast radius: Database connection pool exhaustion → cascading timeouts → 500 errors for all users, not just those with cold keys.
Detection: Cache hit rate drops to 0%. DB active connections spike to pool max. Error rate exceeds threshold.
Recovery:
- Enable singleflight coalescing to limit concurrent DB fills per key
- Use cache warming: pre-populate top 1,000 hot keys from a scheduled job before shifting traffic
- Apply request rate limiting per user to cap the fill rate
- If DB is overwhelmed, enable a circuit breaker that returns stale data from a backup source
🎯 Staff Insight: A cold cache is not a performance problem — it is an availability problem. Warm the cache as part of the deployment runbook, not as an afterthought.
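A warming job reduces to a batched fill; `fetch_rows`, `cache_set`, and `hot_keys` are assumed hooks into your DB and cache clients, named here for illustration.

```python
def warm_cache(cache_set, fetch_rows, hot_keys, ttl=3600):
    """Pre-fill the hottest keys from the DB before shifting traffic."""
    warmed = 0
    for key, row in fetch_rows(hot_keys):  # one batched scan, not N point queries
        cache_set(key, row, ttl)
        warmed += 1
    return warmed
```

Run this from the deployment pipeline and gate the traffic shift on its completion, so the first production request never lands on a fully cold cache.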
2. Cache Inconsistency After Failed Invalidation
Timeline: Write service updates a user's email in PostgreSQL, then calls DEL user:12345 on Redis. The Redis DEL fails (network timeout). Cache still holds the old email. User sees stale data for up to the remaining TTL (potentially hours).
Blast radius: One user sees stale data. If the failed invalidation affects a shared entity (product price, feature flag), all users see stale data.
Detection: Hard to detect proactively. Often found via user complaint. Can be detected by comparing cache values to DB values in a background audit job.
Recovery:
- Retry invalidation with exponential backoff (fire-and-forget is insufficient for critical keys)
- Use a CDC (change data capture) stream as a secondary invalidation channel — Debezium watches the binlog and issues a Redis DEL for every changed row
- TTL provides a bounded staleness window — stale data eventually expires even if explicit invalidation fails
🎯 Staff Insight: Never rely on a single invalidation path. The write service DEL handles the happy path; CDC handles the failure path. TTL is the backstop.
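The happy-path retry can be sketched as follows; `delete_key` is an assumed callback wrapping the Redis DEL, and the backoff values are illustrative defaults.

```python
import time

def invalidate_with_retry(delete_key, key, attempts=3, base_delay=0.05):
    """Explicit invalidation with exponential backoff. Returns False when all
    attempts fail, at which point CDC replay and the key's TTL bound staleness."""
    for attempt in range(attempts):
        try:
            delete_key(key)  # e.g. the Redis DEL issued by the write service
            return True
        except ConnectionError:
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))  # 50ms, 100ms, ...
    return False
```

The return value matters: a False should emit a metric or enqueue the key for the CDC/audit path, never be silently dropped.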
3. Read Replica Lag Causing Stale Business Decisions
Timeline: Primary DB processes a payment that sets order.status = 'paid'. The read replica has 800ms replication lag. A read request hits the replica and sees order.status = 'pending'. The system triggers a duplicate payment reminder or blocks the user from proceeding.
Blast radius: Any business logic reading from replicas during the lag window makes decisions on stale state. Payment double-charges, inventory oversells, access control violations are all possible.
Detection: Monitor pg_stat_replication.replay_lag. Alert when lag exceeds your staleness budget (e.g., >500ms for order status reads).
Recovery:
- For critical reads (payment status, inventory counts, access control), always read from primary
- Implement read-your-writes consistency: after a write, the response includes a monotonic position marker; subsequent reads include this marker and are routed to primary if the replica has not caught up
- Accept replica lag for non-critical reads (profile display, feed content, analytics)
🎯 Staff Insight: Classify every read path as "critical" or "best-effort." Critical reads go to primary; best-effort reads go to replicas. This classification is a product decision, not an engineering decision.
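The read-your-writes routing above can be sketched with a monotonic position marker; `Primary`, `Replica`, and `replay_to` are simplified stand-ins for a real database and its replication stream.

```python
class Primary:
    """Stand-in primary: every write advances a monotonic position (like an LSN)."""
    def __init__(self):
        self.rows, self.position = {}, 0

    def write(self, key, value):
        self.position += 1
        self.rows[key] = value
        return self.position  # the client keeps this marker with its session

class Replica:
    """Stand-in replica that lags until replay_to() is called."""
    def __init__(self, primary):
        self.primary, self.rows, self.position = primary, {}, 0

    def replay_to(self, position):
        self.rows = dict(self.primary.rows)  # simulated replication catch-up
        self.position = position

def read(key, marker, primary, replica):
    """Route to primary only when the replica hasn't replayed the client's write."""
    if marker is not None and replica.position < marker:
        return primary.rows.get(key)
    return replica.rows.get(key)
```

A client that passes no marker gets the cheap best-effort read; a client that just wrote passes its marker and is routed to the primary until the replica catches up.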
Staff Interview Application
How to Introduce This Pattern
Open with the read-to-write ratio to anchor the conversation in data, then state the default approach and the consistency tradeoff. This tells the interviewer you have a repeatable mental model, not just a grab-bag of techniques.
When NOT to Use This Pattern
- Write-heavy workload: If the bottleneck is write throughput, read scaling techniques waste effort. Profile first — the constraint might be lock contention, not read volume.
- Strong consistency required on every read: Financial balances, inventory decrements, access control checks. Caching introduces staleness. Route these reads to the primary.
- Tiny dataset, high QPS: If the entire working set fits in application memory (<1 GB), an in-process LRU cache avoids the network round-trip to Redis entirely. Don't add distributed cache infrastructure for data that fits in RAM.
- Rapidly changing data: If every cache entry is invalidated within seconds of being set, the hit rate will be too low to justify the cache layer. The cache only helps when data lives long enough to be reused.
Follow-Up Questions to Anticipate
| Interviewer Asks | What They Are Testing | How to Respond |
|---|---|---|
| "What happens when Redis goes down?" | Graceful degradation thinking | "All traffic falls through to read replicas. Latency increases from ~1ms to ~10ms, but no data loss. We monitor cache availability and alert on connection failures." |
| "How do you handle cache invalidation?" | Consistency reasoning | "Explicit DEL on the write path, backed by a CDC stream as a secondary channel. TTL provides bounded staleness as a backstop." |
| "How do you size the cache?" | Capacity planning instincts | "Working set analysis: identify the top N keys by access frequency. For user profiles, the 80/20 rule usually means 20% of users generate 80% of reads. Size for that hot set." |
| "What if the cache hit rate is low?" | Debugging and analysis skills | "Segment hit rate by key prefix. Common causes: TTL too short, key cardinality too high (unique queries), or cold-start after deployment. Fix depends on the root cause." |
| "Why not use a write-through cache?" | Understanding of pattern tradeoffs | "Write-through couples the write path to cache availability — if Redis is slow, writes are slow. Cache-aside decouples them. I use write-through only when read-after-write consistency is critical." |
Capacity Planning Quick Reference
Sizing a Redis Cache-Aside Layer
| Parameter | Formula | Example (10K QPS, 95% hit rate) |
|---|---|---|
| Required cache memory | working_set_size × avg_value_size | 5M keys × 2KB = 10GB |
| Redis node count | memory / per_node_usable (leave 30% headroom) | 10GB / 10GB usable per node = 1 primary; +1 replica for HA = 2 nodes |
| Connection pool (total) | app_servers × pool_size_per_server | 20 × 15 = 300 connections |
| DB read load (misses) | total_QPS × (1 - hit_rate) | 10K × 0.05 = 500 reads/sec |
| CDN offload (cacheable) | total_requests × cacheable_pct × hit_rate | 50K × 0.40 × 0.95 = 19K req/sec offloaded |
| Read replica count | miss_QPS / replica_throughput | 500 / 5,000 = 1 replica (add 1 for HA = 2) |
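These formulas are simple enough to sanity-check in a few lines; this sketch reproduces the worked example, with the parameter names chosen here for illustration.

```python
def plan(qps, hit_rate, keys, avg_value_bytes, app_servers, pool_size,
         replica_capacity_qps):
    """Back-of-envelope sizing for a cache-aside layer."""
    miss_qps = qps * (1 - hit_rate)
    replicas = max(1, int(-(-miss_qps // replica_capacity_qps)))  # ceiling division
    return {
        "cache_memory_gb": keys * avg_value_bytes / 1e9,
        "total_connections": app_servers * pool_size,
        "db_miss_qps": miss_qps,
        "replica_count_with_ha": replicas + 1,  # add one replica for HA
    }
```

Running it with the example inputs (10K QPS, 95% hit rate, 5M keys at 2KB, 20 servers with pools of 15, 5K QPS per replica) reproduces the table: 10GB of cache memory, 300 connections, 500 miss reads/sec, and 2 replicas.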
Latency Budget Breakdown
| Tier | p50 Latency | p99 Latency | When It Is Used |
|---|---|---|---|
| In-process LRU | <0.01ms | <0.05ms | Ultra-hot keys, config, feature flags |
| Redis cache hit | 0.3ms | 1ms | Majority of reads (95%+ of traffic) |
| Read replica (cache miss) | 2ms | 15ms | Long tail, cold keys |
| Primary DB (strong read) | 3ms | 20ms | Consistency-critical reads only |
| CDN edge hit | 1ms | 5ms | Static/semi-static content, geographically distributed |
Key Numbers Worth Memorizing
| Metric | Value | Why It Matters |
|---|---|---|
| Redis GET throughput (single node) | ~100K ops/sec | Ceiling for a single cache node; pipeline or cluster beyond this |
| Redis memory overhead per key | ~80-100 bytes | 10M keys with 0-byte values still costs ~1GB of overhead |
| PostgreSQL read replica lag (typical) | 1-10ms | The staleness floor for replica-based reads; spikes during write bursts |
| CDN cache invalidation propagation | 5-30 seconds | TTL-based expiry is faster and simpler than explicit invalidation for most use cases |
| Cache cold-start fill time (10M keys) | 5-15 minutes | Warm caches before traffic shift; never cut over cold |
| Network round-trip (same AZ) | 0.1-0.5ms | The baseline cost of any distributed cache lookup |
| MGET batch lookup (100 keys) | ~1ms | 100x faster than 100 sequential GETs (~50ms); always batch |
| Cache hit rate impact on DB | 95% → 500 DB reads/sec; 99% → 100 | 4-percentage-point improvement = 5x reduction in DB load |
| Typical working set ratio | 20% of keys serve 80% of reads | Size the cache for the hot set, not the full dataset |
Common Pitfalls Checklist
| Pitfall | Symptom | Fix |
|---|---|---|
| No negative caching | Cache miss storm on non-existent keys (e.g., user lookup by email typo) | Cache NULL results with short TTL (60s) |
| Global hit rate metric only | Aggregate 97% hides a 40% hit rate on the most expensive query | Track hit rate per key prefix and alert per-prefix |
| No cache warming on deploy | New deployment starts cold; DB overwhelmed for first 5 minutes | Pre-warm top 10K keys from a scheduled job before traffic shift |
| TTL too uniform | All keys expire at the same second; thundering herd on mass expiry | Add random jitter: TTL = base_ttl + random(0, base_ttl * 0.1) |
| Unbounded value size | A single 10MB cached value blocks Redis for 5ms, starving other operations | Enforce max value size (e.g., 512KB); compress or paginate larger values |
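The TTL jitter fix from the checklist is one line in practice; this sketch assumes a 10% jitter window, which is a common convention rather than a fixed rule.

```python
import random

def jittered_ttl(base_ttl, jitter_frac=0.1, rng=random):
    """Spread expiry times so keys populated together don't all expire together."""
    return base_ttl + rng.randint(0, int(base_ttl * jitter_frac))
```

For a 3600s base TTL this yields values between 3600 and 3960, turning a single mass-expiry spike into a ten-minute trickle of refills.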