StaffSignal
Technology Guide

Redis

In-memory data store used as cache, message broker, and real-time leaderboard engine. The most referenced technology in system design interviews.

Design with Redis — Staff-Level Technology Guide

The 60-Second Pitch

Redis is an in-memory data structure store written in C. Single-threaded, sub-millisecond latency, 100K+ ops/sec baseline. In system design interviews, Redis is the Swiss Army knife: cache, counter, queue, pub/sub, leaderboard, geospatial index, distributed lock. Learn Redis deeply and you can address 60% of system design problems with one technology.

The Staff-level insight: Redis is not a database — it is an ephemeral data structure server. The moment you need durability guarantees, complex queries, or cross-key transactions, you have outgrown Redis for that use case. The Staff move is always to name Redis AND its complement: "Redis for the hot path, PostgreSQL as the source of truth."


Architecture & Internals

Single-Threaded Event Loop

Redis processes every command on a single thread. This is a feature, not a limitation. A single-threaded architecture eliminates lock contention, makes latency deterministic, and ensures every operation is serialized — no race conditions between concurrent writes, no mutex overhead, no deadlocks. When an interviewer asks "isn't single-threaded a bottleneck?", the Staff answer is: "Single-threaded is what gives Redis its sub-millisecond p99 guarantee. Lock-free data structures on a single core will beat mutex-heavy multithreaded designs for simple operations every time."

Redis uses I/O multiplexing via epoll (Linux) or kqueue (macOS/BSD) to handle thousands of concurrent connections on that single thread. The event loop reads commands from all connected clients, executes them sequentially, and writes responses — all without context switching. This is why Redis handles 100K+ operations per second on commodity hardware: the bottleneck is network I/O and serialization, not command execution.

The 100K ops/s baseline assumes small values (<1KB), simple commands (GET/SET/INCR), and reasonable pipeline depth. What degrades it: large values (serialization cost scales linearly with value size), CPU-bound Lua scripts that block the event loop, KEYS * in production (O(N) scan of the entire keyspace — never use this), and HGETALL on hashes with thousands of fields. The SLOWLOG command is your diagnostic tool — any command taking >10ms deserves investigation.

Redis 6.0 introduced I/O threading: network read/write is parallelized across multiple threads, but command execution remains single-threaded. This improves throughput by 2x for I/O-bound workloads (large values, many connections) without sacrificing the deterministic execution model. In interviews, mention this only if asked about Redis performance limits — it shows you know the current architecture, not just the textbook version.


Memory Model

Redis stores everything in memory. This is not optional and it is not configurable — every key, every value, every data structure lives in RAM. The optional persistence mechanisms (RDB, AOF) provide recovery after restart, but do not reduce memory requirements during operation. Plan for the full working set to fit in RAM at all times.

Memory overhead per key is significant for small values. Every key in Redis carries approximately 70 bytes of overhead (dict entry, SDS string header, redisObject wrapper, expiry metadata if set). A key with a 10-byte string value costs ~80 bytes total. Redis compensates with encoding optimizations: small hashes (fewer than 128 fields, values under 64 bytes) use ziplist encoding at ~10 bytes per field instead of ~120 bytes per field in hashtable encoding. Understanding these thresholds matters for capacity planning — a hash with 127 fields and one with 129 fields may differ by 10x in memory.
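The arithmetic above can be sketched directly. This is a back-of-envelope model using only the approximate figures quoted in this guide (the ~70-byte overhead and per-field costs are rough averages, and the 128-field threshold mirrors the default hash-max-ziplist-entries), not measured values:

```python
# Back-of-envelope Redis memory math using the approximations quoted above.
KEY_OVERHEAD = 70  # ~bytes: dict entry + SDS header + redisObject + expiry metadata


def string_key_cost(value_bytes: int) -> int:
    """Approximate cost of one string key."""
    return KEY_OVERHEAD + value_bytes


def hash_cost(fields: int, ziplist_per_field: int = 10,
              hashtable_per_field: int = 120,
              max_ziplist_entries: int = 128) -> int:
    """Approximate cost of a hash, crossing the ziplist -> hashtable threshold.

    128 mirrors the default hash-max-ziplist-entries setting."""
    per_field = ziplist_per_field if fields <= max_ziplist_entries else hashtable_per_field
    return fields * per_field


print(string_key_cost(10))  # ~80 bytes for a 10-byte value
print(hash_cost(127))       # 1270 — ziplist encoding
print(hash_cost(129))       # 15480 — hashtable encoding, >10x more
```

The 127-vs-129-field cliff is why capacity planning has to account for encoding thresholds, not just value sizes.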

Redis offers eight eviction policies (maxmemory-policy), and choosing the wrong one is a production incident waiting to happen. allkeys-lru evicts the least recently used key across the entire keyspace — correct for cache workloads. volatile-lru evicts only keys with a TTL set — correct when some keys are "permanent" and others are cache entries. allkeys-lfu (least frequently used) is better than LRU when access patterns have heavy tails. noeviction returns errors when memory is full — correct for primary data, catastrophic if accidentally set on a cache. Configure maxmemory to 75% of available RAM to leave headroom for fork operations (RDB/AOF rewrite) and connection buffers.

Persistence Options

RDB (Redis Database) snapshots use fork() and copy-on-write to create a point-in-time snapshot while the parent process continues serving requests. Recovery time = snapshot size / disk read speed (a 10GB RDB loads in ~30 seconds from SSD). Data loss = time since last snapshot. RDB is excellent for backups and disaster recovery, but a 15-minute snapshot interval means up to 15 minutes of data loss on crash.

AOF (Append Only File) logs every write operation. Three fsync modes control the durability/performance tradeoff: always (fsync after every command — safest, slowest, ~10x throughput penalty), everysec (fsync once per second — default, max 1 second data loss), and no (OS decides when to flush — fastest, data loss depends on OS buffer size, typically 30 seconds). AOF files grow over time; Redis periodically rewrites them in the background to compact the log.

Hybrid persistence (Redis 4.0+) combines both: RDB for the base snapshot, AOF for incremental changes since the last snapshot. This gives you fast recovery (load RDB, then replay short AOF) with minimal data loss. In interviews, state the tradeoff directly: "RDB for backups and fast cold start, AOF with everysec for production durability. If we need stronger guarantees than 1-second RPO, Redis is the wrong primary store — use PostgreSQL with WAL."
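The persistence choices above map to a handful of redis.conf directives. A sketch of a production-leaning configuration (directive names are from the stock config file — verify them against your Redis version before relying on this):

```
# RDB: snapshot if >=1 write in 15 min, or >=100 writes in 5 min
save 900 1
save 300 100

# AOF: log every write, fsync once per second (max ~1s data loss)
appendonly yes
appendfsync everysec

# Hybrid persistence (Redis 4.0+): RDB preamble + AOF tail for fast recovery
aof-use-rdb-preamble yes
```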

Replication & High Availability

Redis replication is asynchronous by default. The primary sends a stream of write commands to replicas, which apply them in order. Replication lag — the time between a write on the primary and its application on a replica — is your RPO (Recovery Point Objective). Under normal load, lag is 2-5ms. Under write bursts (bulk imports, batch updates), lag can spike to seconds. Every millisecond of replication lag is data you lose if the primary dies before the replica catches up.

Redis Sentinel provides monitoring, notification, and automatic failover. Sentinel is a separate process (not part of Redis itself) that watches your primary and replicas. You need at least 3 Sentinel instances for quorum — 2 Sentinels agreeing that the primary is unreachable triggers failover. Failover time is the detection timeout (down-after-milliseconds, default 30 seconds, commonly tuned down to 5) plus election, promotion, and client reconfiguration — typically 5-30 seconds in practice, depending on that tuning. During this window, writes fail. Your application must handle this gracefully — retry with backoff, queue writes in-process, or serve degraded responses.

Redis Cluster provides horizontal scaling via hash slots. The keyspace is divided into 16,384 slots, and each primary node owns a range of slots. This is not consistent hashing — it is discrete slot assignment, and the client must know the slot map. When a client sends a command to the wrong node, it receives a MOVED redirect. Smart clients (Jedis, ioredis) cache the slot map and route directly, falling back to redirects after topology changes. Resharding (moving slots between nodes) is live but generates network traffic and latency spikes proportional to the data being moved.

The critical Cluster constraint for interviews: all keys in a multi-key command (MGET, MSET, SUNION, SINTER, MULTI/EXEC transactions) must hash to the same slot. Redis enforces this — cross-slot multi-key commands return an error. The workaround is hash tags: keys {user:123}:profile and {user:123}:sessions both hash on user:123, landing on the same slot. This is the most common Redis Cluster gotcha in interviews.

Split-brain scenarios occur when a network partition isolates the primary from Sentinel but clients can still reach it. The old primary accepts writes, Sentinel promotes a replica, and when the partition heals the old primary's writes are lost. Mitigation: min-replicas-to-write 1 + min-replicas-max-lag 10 — the primary refuses writes if no replica has acknowledged within 10 seconds. This trades availability for consistency during partitions.


Core Data Structures for Interviews

Strings & Counters

The simplest Redis data type and the foundation of most interview answers. SET key value, GET key, INCR key, DECR key, INCRBY key amount. All atomic, all O(1). SET key value NX EX 30 is the atomic test-and-set with expiry — the primitive for distributed locks (advisory only; even multi-node Redlock is not safe under clock skew and process pauses — use ZooKeeper or etcd if correctness is critical).

Atomic INCR is the backbone of distributed counters. For rate limiting, the pattern is: INCR api_key:{key}:{window} + EXPIRE in a Lua script. The counter is atomic, the expiry is atomic, and the combination in Lua is atomic. This gives you a fixed-window rate limiter in five lines of server-side code (a true sliding window needs a sorted set of request timestamps).
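A sketch of the pattern, with names of my choosing. The Lua source is what would run server-side via EVAL; the Python function simulates the same fixed-window semantics against a plain dict so the logic can be read end to end without a live Redis:

```python
# Server-side version: the whole script runs atomically as one Redis command.
# KEYS[1] = counter key, ARGV[1] = window TTL in seconds, ARGV[2] = limit.
RATE_LIMIT_LUA = """
local n = redis.call('INCR', KEYS[1])
if n == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return n <= tonumber(ARGV[2]) and 1 or 0
"""


def allow(store: dict, api_key: str, limit: int, window_s: int, now: float) -> bool:
    """Pure-Python simulation of the fixed-window counter above."""
    bucket = f"rl:{api_key}:{int(now // window_s)}"  # key embeds the window number
    store[bucket] = store.get(bucket, 0) + 1         # INCR
    # (against real Redis, the first INCR would also set EXPIRE window_s)
    return store[bucket] <= limit


store = {}
results = [allow(store, "k1", limit=3, window_s=60, now=100.0) for _ in range(5)]
print(results)  # [True, True, True, False, False]
```

Note the classic edge of fixed windows: a client can burst at the boundary of two adjacent windows, which is the tradeoff versus a sorted-set sliding window.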

SETNX (SET if Not eXists) returns 1 if the key was set, 0 if it already existed. Combined with EX (expiry), this is the simplest distributed lock: acquire = SET lock:resource token NX EX 30; release = delete only if you still hold it, which means comparing the stored token and deleting in a single Lua script — a separate GET-then-DEL can delete another client's lock. This lock is advisory: clock skew, GC pauses, and network delays can cause double-grants. For financial or inventory operations, use ZooKeeper/etcd with fencing tokens.
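The acquire/release pair can be sketched as follows. The Lua string is the standard compare-and-delete release; the Python functions mimic the same semantics against a dict (TTL omitted for brevity), so the "wrong token is a no-op" property is visible:

```python
import uuid

# Atomic release (server-side): delete only if the stored token is still ours.
RELEASE_LUA = """
if redis.call('GET', KEYS[1]) == ARGV[1] then
  return redis.call('DEL', KEYS[1])
end
return 0
"""


def acquire(store: dict, resource: str, token: str) -> bool:
    """SET resource token NX semantics against a plain dict."""
    if resource in store:
        return False
    store[resource] = token
    return True


def release(store: dict, resource: str, token: str) -> bool:
    """The same check-and-delete the Lua script performs atomically in Redis."""
    if store.get(resource) == token:
        del store[resource]
        return True
    return False


store = {}
mine, theirs = str(uuid.uuid4()), str(uuid.uuid4())
assert acquire(store, "lock:res", mine)
assert not acquire(store, "lock:res", theirs)  # second client is blocked
assert not release(store, "lock:res", theirs)  # wrong token: no-op
assert release(store, "lock:res", mine)
```

The random token per client is what prevents a client whose lock expired from deleting a lock now held by someone else.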

Hashes

HSET user:1 name "Alice" age 30, HGET user:1 name, HMGET user:1 name age, HGETALL user:1. Hashes are the idiomatic way to store objects in Redis. Memory-efficient for small objects: hashes with fewer than 128 fields and values under 64 bytes use ziplist encoding, costing ~10 bytes per field versus ~120 bytes in hashtable encoding.

Use cases in interviews: user session storage (session ID → hash of session data), object caching (entity ID → hash of entity fields), configuration storage. HGETALL is O(N) where N is the number of fields — fine for hashes with 10-50 fields, dangerous for objects with 1,000+ fields where it blocks the event loop for milliseconds. Use HSCAN for large hashes.

Sets & Sorted Sets

Sets: SADD, SMEMBERS, SINTER, SUNION, SDIFF, SRANDMEMBER. Unordered, unique members. O(1) add/remove/check, O(N) enumeration. Use cases: tag membership, user interests, mutual friends (SINTER user:1:friends user:2:friends), random sampling (SRANDMEMBER).

Sorted Sets are Redis's most powerful data structure for interviews. ZADD leaderboard 100 alice, ZINCRBY leaderboard 5 alice, ZREVRANGE leaderboard 0 9 (top 10), ZRANK leaderboard alice (position), ZRANGEBYSCORE leaderboard 90 100 (range query). Every operation is O(log N) — for 10 million members, that is ~23 comparisons per operation. Combined with ZRANGEBYLEX for lexicographic pagination, sorted sets give you a distributed priority queue, leaderboard, timeline, and scheduled job queue — all in one data structure.

For the Leaderboard playbook: ZADD for score updates, ZREVRANGE for top-K, ZRANK for a user's global position — all O(log N), all atomic. No other in-memory store matches this combination of operations at this complexity.
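The leaderboard operations above can be illustrated with a small class. This mimics the semantics of ZADD/ZINCRBY/ZREVRANGE/ZREVRANK only — real Redis gets O(log N) from a skiplist, whereas this sketch re-sorts on every read for clarity:

```python
class Leaderboard:
    """Semantics of a Redis sorted set leaderboard (illustrative only).

    Redis achieves O(log N) per operation via a skiplist; this sketch
    re-sorts on read, which is O(N log N)."""

    def __init__(self):
        self.scores = {}  # member -> score

    def zadd(self, member: str, score: float):
        self.scores[member] = score

    def zincrby(self, member: str, delta: float):
        self.scores[member] = self.scores.get(member, 0) + delta

    def _descending(self):
        return sorted(self.scores, key=lambda m: (-self.scores[m], m))

    def top(self, k: int):          # ZREVRANGE key 0 k-1
        return self._descending()[:k]

    def rank(self, member: str):    # ZREVRANK: 0-based, highest score first
        return self._descending().index(member)


lb = Leaderboard()
lb.zadd("alice", 100)
lb.zadd("bob", 250)
lb.zadd("carol", 175)
lb.zincrby("alice", 200)            # alice -> 300
print(lb.top(2))                    # ['alice', 'bob']
print(lb.rank("carol"))             # 2
```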

For the Feed Generation playbook: sorted set keyed by timestamp for timeline fanout. ZADD user:1:feed {timestamp} {post_id} for fan-out-on-write, ZREVRANGE user:1:feed 0 19 for the latest 20 posts.

Lists & Streams

Lists: LPUSH, RPUSH, LPOP, RPOP, BRPOP (blocking pop), LRANGE. Implemented as a quicklist (a linked list of compact packed nodes) — O(1) push/pop at either end, O(N) access by index. Use cases: simple FIFO queues (LPUSH + BRPOP), recent items lists, activity logs. No consumer groups, no acknowledgment, no replay — if the consumer crashes after RPOP, the message is gone.

Streams (Redis 5.0+): XADD stream * field value, XREAD COUNT 10 STREAMS stream 0, XREADGROUP GROUP mygroup consumer1 COUNT 10 STREAMS stream >, XACK stream mygroup id, XPENDING stream mygroup. Streams are Kafka-lite: append-only log with consumer groups, message acknowledgment, pending entry tracking, and replay from any position. Each entry has an auto-generated time-based ID (1614886345000-0), making streams naturally ordered by insertion time.

For the Message Queue playbook: Redis Streams for lightweight queuing when Kafka is overkill. Streams support consumer groups and acknowledgment, but lack Kafka's durability guarantees (replicated partitions, configurable replication factor), partition-level parallelism, and multi-day retention. Use Streams for <10K messages/sec where simplicity matters more than durability.

Geospatial

GEOADD locations 13.361389 38.115556 "Palermo", GEOSEARCH locations FROMLONLAT 15 37 BYRADIUS 100 km ASC COUNT 10, GEODIST locations "Palermo" "Catania" km. Under the hood, Redis encodes coordinates as geohashes and stores them in a sorted set, making GEOSEARCH effectively O(N+log(M)) where N is the number of results and M is the total set size.

For the Proximity Matching (Tinder) playbook and Logistics (Uber) playbook: GEOSEARCH FROMLONLAT for radius queries, combine with sorted set scores for distance-based ranking. Limitations: 2D only (no elevation), no polygon/boundary search, no complex spatial queries. For polygon geofencing or spatial joins, use PostGIS.
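GEODIST boils down to great-circle distance, which is worth being able to derive on a whiteboard. A haversine sketch — the Earth radius constant here is the one I believe Redis's geo code uses, and the Catania coordinates are approximate (only Palermo's appear above), so treat both as assumptions:

```python
import math


def haversine_m(lon1: float, lat1: float, lon2: float, lat2: float,
                r: float = 6372797.560856) -> float:
    """Great-circle distance in meters between two (lon, lat) points.

    r is believed to match Redis's geo Earth-radius constant; any
    ~6371km value gives nearly the same answer."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))


# Palermo (coordinates from the GEOADD example above) vs. approximate Catania
d = haversine_m(13.361389, 38.115556, 15.087269, 37.502669)
print(round(d / 1000, 1), "km")  # ~166 km
```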

Pub/Sub

SUBSCRIBE channel, PUBLISH channel message, PSUBSCRIBE pattern:*. Fire and forget: no persistence, no acknowledgment, no replay, no message ordering guarantees across channels. Messages fan out to all current subscribers on the channel. If no subscriber is connected, the message is silently dropped.

Use cases: real-time UI updates, presence notifications ("user came online"), cache invalidation broadcasts, inter-service signaling where loss is acceptable. For the Chat Messaging playbook: Pub/Sub for presence updates only — message delivery goes through a durable store.

Bloom Filters & HyperLogLog

Bloom Filters (BF.ADD, BF.EXISTS — provided by the RedisBloom module, not core Redis): probabilistic set membership testing. Returns "definitely not in the set" or "probably in the set" (false positives at a configurable rate, typically 1%; never false negatives). O(1) for both add and check; memory is fixed for a given capacity and error rate, regardless of how many items you actually insert. Use case: URL Shortener playbook — check if a generated short code already exists without hitting the database.
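A minimal bloom filter makes the "no false negatives" guarantee concrete. This construction (k salted SHA-256 hashes over an m-bit array) is my own illustration, not how RedisBloom implements it internally:

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter: k salted SHA-256 hashes over an m-bit array.

    Illustrative construction only — not RedisBloom's internals."""

    def __init__(self, m_bits: int = 8192, k: int = 4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


bf = BloomFilter()
for code in ("abc123", "xYz789"):
    bf.add(code)
print(bf.might_contain("abc123"))  # True — added items are never missed
print(bf.might_contain("zzzzzz"))  # almost certainly False; FP rate depends on m and k
```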

HyperLogLog (PFADD, PFCOUNT, PFMERGE): cardinality estimation (count distinct elements) with ±0.81% error. Uses 12KB of memory regardless of set size — counting 1 billion unique items costs the same 12KB as counting 100. Use case: unique visitor counting, unique event counting, any "count distinct" requirement where ±1% error is acceptable.
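Both quoted numbers fall out of the standard HyperLogLog parameters Redis uses — 2^14 registers of 6 bits each, with the textbook error bound 1.04/sqrt(m):

```python
import math

m = 2 ** 14                       # registers; Redis's HLL uses 16384
bits_per_register = 6
memory_kb = m * bits_per_register / 8 / 1024
std_error = 1.04 / math.sqrt(m)   # standard HLL relative-error formula

print(memory_kb)                  # 12.0 — the fixed 12KB
print(round(std_error * 100, 2))  # 0.81 — the ±0.81% quoted above
```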


Scaling Redis

Cluster Mode

Redis Cluster distributes data across N primary nodes using 16,384 hash slots. A key's slot is CRC16(key) % 16384. Smart clients cache the slot-to-node mapping and route commands directly; if the mapping is stale, the server responds with MOVED slot ip:port and the client updates its map.

Hash tags control slot assignment: {user:123}:profile and {user:123}:sessions both hash on user:123, guaranteeing co-location on the same node. This is mandatory for multi-key operations — MGET {user:123}:profile {user:123}:sessions works; MGET user:123:profile user:456:profile returns a CROSSSLOT error.
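Slot assignment is simple enough to implement from the cluster spec: extract the first non-empty {...} hash tag if present, then CRC16 (the XMODEM variant) mod 16384. A sketch, assuming my reading of the tag-extraction rules is right:

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16/XMODEM (poly 0x1021, init 0) — the variant named by the cluster spec."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def hash_slot(key: str) -> int:
    """HASH_SLOT = CRC16(effective_key) mod 16384, where the effective key is
    the content of the first non-empty {...} hash tag, if any."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:                 # empty tags are ignored
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384


# The tag content is what gets hashed, so these co-locate on one slot:
print(hash_slot("{user:123}:profile") == hash_slot("{user:123}:sessions"))  # True
print(hash_slot("{user:123}:profile") == hash_slot("user:123"))             # True
```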

Resharding (moving slots between nodes) uses the MIGRATE command under the hood. It is live — the cluster continues serving traffic — but migrating a slot with 1M keys generates significant network traffic and latency spikes on the source node. Plan resharding during low-traffic windows. A 6-node cluster (3 primaries + 3 replicas) is the minimum production configuration; 9 nodes (3 primaries with 2 replicas each) provides better fault tolerance.

Pipelining

Pipelining batches N commands into a single network round trip. Instead of send-wait-receive for each command (100 round trips × 0.5ms RTT = 50ms), pipeline sends all 100 commands at once and reads all 100 responses (1 round trip × 0.5ms = 0.5ms + execution time). Throughput improvement is typically 5-10x for batch operations.
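The round-trip arithmetic above can be written as a one-line latency model (the per-command execution cost here is an assumed illustrative figure):

```python
def total_ms(commands: int, rtt_ms: float, per_cmd_exec_ms: float,
             pipelined: bool) -> float:
    """Latency model from the paragraph above: pipelining pays the RTT once,
    while the per-command execution cost is paid either way."""
    round_trips = 1 if pipelined else commands
    return round_trips * rtt_ms + commands * per_cmd_exec_ms


naive = total_ms(100, rtt_ms=0.5, per_cmd_exec_ms=0.01, pipelined=False)
piped = total_ms(100, rtt_ms=0.5, per_cmd_exec_ms=0.01, pipelined=True)
print(naive, piped)  # 51.0 vs 1.5 — the win is entirely avoided round trips
```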

Pipelining has no atomicity guarantee — commands execute sequentially but other clients' commands may interleave. For atomic batch operations, use MULTI/EXEC (transactions) or Lua scripts. Pipelining and MULTI/EXEC can be combined: pipeline a transaction for both atomicity and reduced round trips.

For the Rate Limiter playbook: pipeline 100 counter checks in a single round trip instead of 100 individual INCR commands. For bulk data loading: redis-cli --pipe uses the Redis protocol's pipelining mode to load millions of keys at wire speed.

Lua Scripting

EVAL "script" numkeys key1 key2 arg1 arg2 or EVALSHA sha1 numkeys ... (preloaded scripts). The entire Lua script executes as a single command — atomically, without interleaving. This is the mechanism for operations that need atomicity across multiple keys or multiple commands: the fixed-window rate limiter (INCR + EXPIRE + threshold check), the compare-and-swap pattern, the conditional update.

The trap: Lua scripts block the event loop for their entire execution. A script that runs for 100ms blocks ALL other clients for 100ms. The lua-time-limit configuration (default 5 seconds) triggers a BUSY error for new commands after the limit, but does not kill the running script — SCRIPT KILL works only if the script has not yet performed a write. In practice, keep Lua scripts under 10ms — if your logic is more complex, move it to application code and accept the atomicity tradeoff.


Failure Modes

Memory Exhaustion

Symptom: OOM command not allowed when used memory > maxmemory errors in application logs, or (worse) silent eviction of keys with allkeys-lru causing cache misses that cascade to the database.

Detection: redis_memory_used_bytes approaching maxmemory. Alert at 80% utilization. Monitor evicted_keys counter — any non-zero value in a non-cache workload is a data loss event.

Business Impact: With noeviction, writes fail — users cannot update profiles, place orders, or send messages. With allkeys-lru, arbitrary keys disappear — a session key eviction logs out a user mid-checkout; a rate limiter key eviction lets a burst through.

Staff Response: L5 says "increase memory." L6 says: "First, audit key-size distribution with redis-cli --bigkeys and MEMORY USAGE key. Often 5% of keys consume 50% of memory. Compress large values, shorten TTLs on cold data, move large blobs to dedicated storage. Then right-size the instance based on projected growth with 1.5x headroom for fork operations."

Replication Lag

Symptom: Read-after-write inconsistency — user updates their profile, immediately reloads, sees old data (read hit the replica before replication caught up).

Detection: master_repl_offset - slave_repl_offset on replicas. Alert when lag exceeds 1 second. Monitor repl_backlog_active — if 0, full resync is needed (expensive).

Business Impact: Stale reads. For cache use cases, usually acceptable. For session stores or rate limiters, stale reads mean incorrect authorization or rate limit bypass.

Staff Response: L5 says "replication is async, reads might be stale." L6 says: "Read-after-write consistency for the writing user: route reads back to the primary for N seconds after a write using a read-your-writes proxy or application-level routing. For rate limiting, always read from the primary — stale counters on replicas mean rate limits are ineffective."

Hot Key

Symptom: One Redis node at 100% CPU while others are idle. p99 latency spikes on a subset of operations. Application-level metrics show one key accessed 100x more than the next most popular.

Detection: redis-cli --hotkeys (Redis 4.0+ with LFU policy), OBJECT FREQ key, or application-level key access counters. Monitor per-node CPU — asymmetry indicates hot key/slot.

Business Impact: Throughput ceiling — one key can saturate one Redis node regardless of cluster size. A viral post's counter, a global config key, or a popular user's session can become a bottleneck.

Staff Response: L5 says "add more replicas." L6 says: "Replicas help for read-heavy hot keys (route reads to replicas with READONLY). For write-heavy hot keys, shard the key: counter:popular:{0..7}, hash the writer to a shard, sum all shards on read. For configuration-type hot keys, use client-side caching with server-assisted invalidation (Redis 6.0 tracking)."

Cluster Partition / Split Brain

Symptom: Two primaries accepting writes for the same slot range. After partition heals, one set of writes is discarded.

Detection: Sentinel +sdown and +odown events. Monitor cluster_state (should be ok). Alert on cluster_slots_fail > 0.

Business Impact: Data loss — writes accepted by the old primary during partition are silently dropped when it demotes to replica and resyncs.

Staff Response: L5 says "Redis has Sentinel for failover." L6 says: "Configure min-replicas-to-write 1 and min-replicas-max-lag 10 on the primary. This makes the primary refuse writes if no replica has acknowledged within 10 seconds — trading availability for consistency during partitions. For critical data, complement Redis with a durable store that serves as source of truth."

Thundering Herd on Cache Miss

Symptom: A popular cached key expires. Hundreds of concurrent requests miss the cache simultaneously and all query the database. Database latency spikes, some queries timeout, cache never gets repopulated because the underlying query is too slow under load.

Detection: Cache hit rate drops sharply. Database query latency spikes. Connection pool exhaustion on the database. keyspace_misses counter spikes.

Business Impact: Cascading failure — cache miss → database overload → slower responses → more timeouts → more retries → database collapse.

Staff Response: L5 says "set longer TTLs." L6 says: "Three-layer defense: (1) Singleflight / request coalescing — only one request fetches from DB, others wait for the result. (2) Stale-while-revalidate — serve the expired value while one request refreshes in the background. (3) TTL jitter — add ±10% random variation to prevent synchronized expiry across batch-loaded keys."
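Layer (1) can be sketched as a singleflight guard. This is a minimal in-process version (class and method names are mine); production implementations also handle loader exceptions and combine this with stale-while-revalidate and TTL jitter:

```python
import threading


class SingleFlight:
    """Request coalescing: concurrent callers for the same key share one
    loader execution instead of stampeding the database on a cache miss."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done_event, result_box)

    def do(self, key, loader):
        with self._lock:
            entry = self._inflight.get(key)
            if entry is None:                    # no one is loading: lead
                entry = (threading.Event(), {})
                self._inflight[key] = entry
                leader = True
            else:
                leader = False
        done, box = entry
        if leader:
            try:
                box["value"] = loader()          # only the leader hits the DB
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()
        else:
            done.wait()                          # followers wait for the leader
        return box["value"]
```

Against Redis specifically, the same idea is often enforced cross-process with a short-TTL "refresh in progress" lock key, while other callers serve the stale value.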


When to Use vs. Alternatives

Use Case | Redis | Alternative | Choose Redis When | Choose Alternative When
Cache | ✓ | Memcached | Need data structures beyond k/v, persistence option | Need multi-threaded simplicity, no persistence, larger values
Counter | ✓ | PostgreSQL | >10K increments/sec, sub-ms latency required | <1K/sec and already have PostgreSQL
Queue | ⚠️ | Kafka | Simple queue, <10K msg/s, no replay needed | Need durability, replay, ordering guarantees, >100K msg/s
Pub/Sub | ⚠️ | Kafka, SNS+SQS | Ephemeral notifications, presence | Need delivery guarantee, offline delivery
Leaderboard | ✓ | PostgreSQL + index | Real-time ranking, >1K updates/sec | Complex ranking logic, joins with user data
Session Store | ✓ | Database | High throughput, TTL needed | Need durability across full restarts
Distributed Lock | ⚠️ | ZooKeeper, etcd | Advisory lock, can tolerate rare double-grant | Correctness-critical (financial, inventory)
Geospatial | ✓ | PostGIS | Simple radius queries, <1M items | Complex spatial queries, polygons, geofencing
Full-text Search | ✗ | Elasticsearch | Never use Redis for full-text search | Always prefer Elasticsearch or equivalent
Relational Data | ✗ | PostgreSQL | Never use Redis for relational data | Always prefer a relational database

Deployment Topologies


Topology selection rule: Start with Sentinel HA. Move to Cluster only when you hit the ceiling of a single node's memory (~25GB usable after persistence overhead) or throughput (~100K ops/s sustained). Most production Redis deployments at healthy companies never need Cluster.


Staff-Level Operational Concerns

Monitoring essentials: redis_memory_used_bytes (alert at 80% of maxmemory), redis_connected_clients (alert on sudden spikes — connection leak), keyspace_hits / (keyspace_hits + keyspace_misses) for hit rate (alert below 90% for cache workloads), master_repl_offset - slave_repl_offset for replication lag (alert above 1s), redis_commands_processed_total for throughput trends. Use INFO ALL for the complete metrics dump; SLOWLOG GET 10 for recent slow commands.

Capacity planning formula: (average_key_overhead + average_value_size) × expected_key_count × 1.5 safety_margin. The 1.5x multiplier covers: fork() copy-on-write during persistence (up to 2x under heavy writes), connection buffers (~1KB per connection), replication backlog (default 1MB, increase for high-write workloads), and Lua script memory. Use MEMORY DOCTOR for automated recommendations.
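The formula as a worked example, with illustrative input values of my choosing (the 70-byte overhead matches the figure quoted earlier in this guide):

```python
def capacity_bytes(avg_key_overhead: int = 70, avg_value_size: int = 256,
                   key_count: int = 50_000_000, safety: float = 1.5) -> float:
    """The planning formula above: (overhead + value) * keys * safety margin.

    Defaults are illustrative: 50M keys of ~256-byte values."""
    return (avg_key_overhead + avg_value_size) * key_count * safety


gb = capacity_bytes() / 1024 ** 3
print(round(gb, 1))  # 22.8 — GB to provision for this workload
```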

Backup strategy: RDB snapshots to S3 every 4-6 hours for disaster recovery. AOF with everysec for minimal data loss in production. Test restore from RDB monthly — a backup you have never restored is not a backup. Monitor rdb_last_save_time and alert if the last successful save is older than 2× the expected interval.

The "big key" problem: Keys with values >1MB cause latency spikes on serialization, replication, and eviction. A single 50MB hash blocks the event loop for tens of milliseconds during serialization. Detection: redis-cli --bigkeys or MEMORY USAGE key SAMPLES 0. Mitigation: break large values into chunks, move blobs to S3/object storage, compress before storing. Enforce a key-size policy in code review — big keys are always a design smell.


Interview Application

Which Playbooks Use Redis

Playbook | How Redis Is Used | Key Operation
Rate Limiter | Fixed-window counters | INCR + EXPIRE in Lua
Distributed Cache | Primary caching layer | SET + GET with TTL
Leaderboard | Real-time ranking | ZADD + ZREVRANGE + ZRANK
Chat Messaging | Presence, recent messages | Pub/Sub + sorted sets
Proximity Matching | Nearby user search | GEOADD + GEOSEARCH
Feed Generation | Timeline cache | Sorted sets by timestamp
Flash Sales | Inventory counter | DECR with floor check in Lua
URL Shortener | Code existence check, redirect cache | Bloom filter + SET with TTL

How to Introduce Redis in an Interview

Use this template: "I'd use Redis here for [specific capability] because [specific reason]. The tradeoff is [durability/consistency concern] which is acceptable because [business justification]."

Example: "I'd use Redis sorted sets for the leaderboard because we need O(log N) ranking at 50K updates per second. The tradeoff is that Redis data is in-memory with async replication — a node failure loses the most recent 2-5ms of score updates. That's acceptable because scores can be reconstructed from the event log within seconds."

What L5 Says vs. What L6 Says

Topic | L5 Says | L6 Says
Choice | "We'll use Redis" | "We'll use Redis for the hot counter path because we need sub-ms latency at 50K+ checks/sec. PostgreSQL can't serve that read QPS without significant connection pooling overhead."
Failure | "Redis has replication" | "Redis async replication means our RPO is the replication lag — 2-5ms under normal load, potentially seconds during write bursts. For rate limiting, this means briefly over-counting, which is acceptable for abuse protection but not for billing."
Scaling | "We'll add more nodes" | "Redis Cluster with hash tags ensures all counters for a given API key land on the same shard. We need to monitor for hot keys — one viral endpoint could saturate a single shard."
Durability | "Redis is in-memory" | "AOF with fsync=everysec gives us max 1s data loss window. For rate limiting counters, the blast radius of losing 1s of counter state is briefly resetting some windows — acceptable."
Data Structure | "We'll use a key-value store" | "Sorted sets give us O(log N) insert and range query for the leaderboard. For 10M users, that's ~23 comparisons per operation. ZRANGEBYLEX for pagination avoids the ZRANGE offset trap."

Common Interview Mistakes

Mistake | Why It's Wrong | What to Say Instead
"Redis for everything" | Redis is not a database — no complex queries, no joins, limited durability | "Redis for the hot path, PostgreSQL for the source of truth"
"Redis is always fast" | Hot keys, large values, KEYS command, Lua scripts can all cause latency | "Redis is fast for simple operations on well-distributed keys under 1KB. Hot keys and large values are the exceptions."
"Redis Pub/Sub for messaging" | No delivery guarantee, no persistence, no replay | "Pub/Sub for ephemeral updates like presence. Kafka or SQS for anything that requires delivery guarantee."
"Redlock for distributed locking" | Martin Kleppmann demonstrated it is not safe under async clocks and process pauses | "Redis lock as advisory, with fencing tokens. ZooKeeper if correctness is critical — the lock must survive network partitions."
"Just add more Redis nodes" | Cluster adds cross-slot restrictions, resharding complexity, client compatibility requirements | "Sentinel HA first. Cluster only when we genuinely exceed single-node memory or throughput limits."

Staff Insight