StaffSignal
Technology Guide


Design with Cassandra — Staff-Level Technology Guide

The 60-Second Pitch

Apache Cassandra is a distributed wide-column NoSQL database designed for massive write throughput, linear horizontal scaling, and zero single points of failure. It handles millions of writes per second across hundreds of nodes, with tunable consistency that lets you trade correctness for availability on a per-query basis. In system design interviews, Cassandra is the answer when you need time-series data, messaging history, IoT telemetry, activity feeds, or any workload with known access patterns and writes that vastly outnumber reads.

The Staff-level insight: Cassandra is not a database you query — it is a database you design tables for. Every table is a denormalized, pre-computed answer to exactly one query. If you design your tables around entities and relationships (like you would in PostgreSQL), you will get a system that is slow, expensive, and impossible to scale. If you design your tables around the queries your application will execute, you get a system that is absurdly fast at any scale. The entire game is partition key design. Get it right and Cassandra is unstoppable. Get it wrong and no amount of hardware will save you.


Architecture & Internals

Ring Topology & Consistent Hashing

Cassandra uses a peer-to-peer architecture — there is no master node, no leader election, no single point of failure. Every node in the cluster is identical in role and responsibility. This is fundamentally different from leader-follower systems like PostgreSQL, Kafka, or Elasticsearch. When an interviewer asks "what happens when a node fails?", the answer is: the cluster continues operating at reduced capacity. No failover, no election, no downtime.

Nodes are arranged in a logical ring, where each node owns a range of tokens in a hash space (typically -2^63 to 2^63-1 using the Murmur3 partitioner). When a row is written, Cassandra hashes the partition key to produce a token, and that token determines which node owns the data. The node responsible for a token range is called the primary replica; additional replicas are placed on subsequent nodes in the ring according to the replication strategy.
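The token-to-replica walk can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: md5 stands in for the Murmur3 partitioner, and the tokens are tiny integers for readability.

```python
import bisect
import hashlib

def token(partition_key: str) -> int:
    # Real Cassandra uses Murmur3 over [-2^63, 2^63-1]; md5 is a stand-in.
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16)

class Ring:
    """Toy ring: each node owns the range ending at each of its tokens."""

    def __init__(self, node_tokens):  # {node_name: [tokens]}
        self.entries = sorted(
            (t, node) for node, toks in node_tokens.items() for t in toks
        )
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, partition_key: str, rf: int = 3):
        # Walk clockwise from the key's token, collecting distinct nodes.
        i = bisect.bisect(self.tokens, token(partition_key)) % len(self.entries)
        owners = []
        while len(owners) < rf:
            node = self.entries[i % len(self.entries)][1]
            if node not in owners:
                owners.append(node)
            i += 1
        return owners

# Three nodes, two tokens each (a miniature vnode layout):
ring = Ring({"n1": [100, 400], "n2": [200, 500], "n3": [300, 600]})
print(ring.replicas("conversation:42", rf=3))
```

With RF=3 on a three-node cluster, every node holds a replica of every partition; the interesting placement behavior appears once the cluster is larger than RF.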

The critical design decision is the replication factor (RF). An RF of 3 means every piece of data is stored on three different nodes. Combined with tunable consistency levels, this gives you fine-grained control over the consistency-availability tradeoff. RF=3 with QUORUM reads/writes gives you strong consistency while tolerating one node failure — this is the most common production configuration and the default you should propose in interviews.


Virtual Nodes (vnodes)

Early Cassandra assigned one large token range per physical node. This created two problems: adding a node required a manual token split and data migration, and heterogeneous hardware led to uneven load. Virtual nodes solve both. Instead of one token range, each physical node owns many (typically 128-256) small, randomly distributed token ranges across the hash space.

When a new node joins the cluster, it claims vnodes from every existing node rather than splitting one node's range. This means data rebalancing is distributed across the entire cluster rather than concentrated on one or two nodes. The result: adding capacity is as simple as starting a new node and waiting for streaming to complete. At Staff level, know that vnode count is a knob — more vnodes means faster rebalancing and better distribution, but increases the overhead of gossip metadata and repair coordination. The default of 256 vnodes in Cassandra 3.x was reduced to 16 in Cassandra 4.x with token allocation improvements.
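A rough simulation shows why many small token ranges per node even out load. The idealized unit-interval hash space and purely random token placement are assumptions for illustration; real clusters use the full Murmur3 range and, since 4.0, a smarter token allocator.

```python
import random

def ownership_spread(nodes: int, vnodes: int, seed: int = 7) -> float:
    """Gap between the most- and least-loaded node's share of the ring."""
    rng = random.Random(seed)
    points = sorted(
        (rng.random(), n) for n in range(nodes) for _ in range(vnodes)
    )
    share = [0.0] * nodes
    prev = 0.0
    for tok, node in points:
        share[node] += tok - prev  # arc ending at this token belongs to its node
        prev = tok
    share[points[0][1]] += 1.0 - prev  # wraparound arc goes to the first token's node
    return max(share) - min(share)

# One token per node produces wildly uneven ownership; 256 vnodes converge
# toward an even 1/6 share each.
print(ownership_spread(6, 1), ownership_spread(6, 256))
```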

Gossip Protocol

Cassandra nodes discover each other and share cluster state through gossip — a peer-to-peer protocol where every node periodically (every second) exchanges state information with one to three other randomly selected nodes. Gossip propagates: node liveness (up/down), token ownership, schema versions, data center/rack topology, and load information.

Gossip convergence is eventually consistent. When a node joins or fails, the entire cluster learns about it within a few gossip rounds (typically 5-10 seconds for a 100-node cluster). The failure detector uses a phi accrual algorithm rather than a simple timeout — it calculates a suspicion level based on historical gossip intervals and flags a node as down when the phi value exceeds a threshold (default 8). This avoids false positives from transient network hiccups.
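The intuition can be sketched with the simplified exponential form of phi accrual (an assumption for illustration; Cassandra's detector maintains a sliding window of observed heartbeat intervals rather than a fixed mean):

```python
import math

def phi(time_since_last: float, mean_interval: float) -> float:
    # phi = -log10(P(a heartbeat is still coming)), under an exponential
    # inter-arrival model. Suspicion grows smoothly with silence, scaled
    # by how regular heartbeats have historically been.
    p_heartbeat_still_coming = math.exp(-time_since_last / mean_interval)
    return -math.log10(p_heartbeat_still_coming)

# Heartbeats have been arriving every ~1s. A 2s gap is mildly suspicious;
# a 20s gap crosses the default threshold of 8 and marks the node down.
THRESHOLD = 8
print(phi(2.0, 1.0), phi(20.0, 1.0))
```

Note how the same 2-second silence would be alarming if the historical interval were 100ms — the detector adapts to observed network behavior instead of using one fixed timeout.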

Coordinator Node & Request Path

Any node can serve as the coordinator for any request — this is the peer-to-peer advantage. The client connects to one node (via a load-balanced contact list), and that node becomes the coordinator for the request. The coordinator determines which nodes hold replicas for the given partition key, forwards the request, waits for the required consistency level responses, and returns the result.


Write path on each replica: The write is first appended to the commit log (sequential disk write — fast), then written to the memtable (in-memory sorted structure). When the memtable reaches a threshold, it is flushed to an immutable SSTable (Sorted String Table) on disk. Writes never modify existing files — they only append. This is why Cassandra writes are so fast: no read-before-write, no random I/O, no lock contention. A typical SSD-backed Cassandra node sustains 10,000-50,000 writes per second on a single core precisely because every write is two sequential operations (commit log append + memtable insert).

The commit log exists solely for crash recovery. If a node dies before memtable flush, the commit log is replayed on restart to reconstruct the memtable. Once an SSTable is flushed, the corresponding commit log segment is discarded. This means the commit log is usually small (hundreds of MB to a few GB) — it is a WAL, not a storage layer.
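The write path and crash recovery described above, reduced to a toy model (all names and the flush threshold are illustrative, not Cassandra internals):

```python
class Node:
    def __init__(self, flush_threshold: int = 3):
        self.commit_log = []   # sequential, append-only (durability)
        self.memtable = {}     # in-memory structure; kept sorted at flush
        self.sstables = []     # immutable, flushed segments
        self.flush_threshold = flush_threshold

    def write(self, key, value, ts):
        self.commit_log.append((key, value, ts))   # 1. durability first
        self.memtable[key] = (value, ts)           # 2. then memory
        if len(self.memtable) >= self.flush_threshold:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}
            self.commit_log = []                   # segment no longer needed

    def recover(self):
        # Crash recovery: rebuild the memtable by replaying the commit log.
        for key, value, ts in self.commit_log:
            self.memtable[key] = (value, ts)

    def read(self, key):
        # Merge memtable + SSTables; highest timestamp wins.
        candidates = [self.memtable.get(key)] + [t.get(key) for t in self.sstables]
        live = [c for c in candidates if c is not None]
        return max(live, key=lambda c: c[1])[0] if live else None

node = Node()
node.write("a", "v1", ts=1)
node.write("a", "v2", ts=2)
print(node.read("a"))  # "v2" — last write wins

# Simulate a crash before flush: only the commit log survives.
replayed = Node()
replayed.commit_log = list(node.commit_log)
replayed.recover()
print(replayed.read("a"))  # "v2" — reconstructed from the log
```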

Read path: Reads are more expensive. The coordinator contacts replicas, and each replica must check the memtable, potentially multiple SSTables, and merge results using timestamps (last-write-wins). For each SSTable, the read path uses a cascade of optimizations:

  1. Bloom filter — a probabilistic data structure that tells you "definitely not in this SSTable" or "maybe in this SSTable." Eliminates 99%+ of unnecessary SSTable reads.
  2. Partition key cache — maps partition keys to their byte offset in the SSTable index, avoiding an index lookup.
  3. Partition index summary — a sampled index of the full partition index, held in memory, that narrows the search to a small disk range.
  4. Compression offset map — maps logical byte positions to compressed block positions on disk, enabling direct random reads.
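A toy Bloom filter makes step 1 concrete: it answers "definitely not here" with zero disk I/O, at the cost of occasional false positives. Sizing and hash construction here are illustrative; Cassandra tunes bits-per-key via the table's bloom_filter_fp_chance option.

```python
import hashlib

class BloomFilter:
    def __init__(self, bits: int = 1024, hashes: int = 3):
        self.bits, self.hashes = bits, hashes
        self.array = 0  # bitmap stored as a Python int

    def _positions(self, key: str):
        # Derive `hashes` independent bit positions from one keyed digest.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, key: str):
        for p in self._positions(key):
            self.array |= 1 << p

    def might_contain(self, key: str) -> bool:
        # All bits set → "maybe"; any bit clear → "definitely not".
        return all(self.array & (1 << p) for p in self._positions(key))

bf = BloomFilter()
bf.add("partition-1")
print(bf.might_contain("partition-1"))   # True — never a false negative
print(bf.might_contain("partition-999")) # almost certainly False
```

The asymmetry is the point: a "no" lets the read path skip the SSTable entirely, while a "maybe" only costs a lookup that might have happened anyway.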

Even with these optimizations, reads are fundamentally more expensive than writes. This is the core Cassandra tradeoff: writes are cheap, reads pay the cost. In interviews, always acknowledge this asymmetry — it is the reason Cassandra is ideal for write-heavy workloads and why read-heavy workloads should consider LCS compaction or alternative databases.


Data Modeling — The Entire Game

Query-First Design

This is the single most important concept in Cassandra and the one that trips up every candidate with relational database experience. In PostgreSQL, you model your entities and relationships first, then write queries against them. In Cassandra, you identify your queries first, then design one table per query.

Example — Chat messaging application:

  • Get messages for a conversation, newest first → messages_by_conversation — PK: (conversation_id), CK: (timestamp DESC, message_id)
  • Get all conversations for a user → conversations_by_user — PK: (user_id), CK: (last_activity DESC, conversation_id)
  • Get unread count per user → unread_counts — PK: (user_id), CK: (conversation_id)

Yes, this means storing the same data multiple times. Yes, this means writes are duplicated. This is the correct design. Cassandra trades storage and write amplification for read performance and horizontal scalability. The alternative — a normalized schema with joins — does not exist in Cassandra. There are no joins, no subqueries, no GROUP BY across partitions. If your query requires data from two tables, you denormalize the data into one table or execute two queries from the application layer.

CQL schema for the chat example:

-- Table 1: Messages by conversation (the main read query)
CREATE TABLE messages_by_conversation (
    conversation_id UUID,
    timestamp       TIMESTAMP,
    message_id      UUID,
    sender_id       UUID,
    body            TEXT,
    PRIMARY KEY ((conversation_id), timestamp, message_id)
) WITH CLUSTERING ORDER BY (timestamp DESC, message_id ASC);

-- Table 2: Conversations by user (inbox view)
CREATE TABLE conversations_by_user (
    user_id         UUID,
    last_activity   TIMESTAMP,
    conversation_id UUID,
    title           TEXT,
    last_message    TEXT,
    PRIMARY KEY ((user_id), last_activity, conversation_id)
) WITH CLUSTERING ORDER BY (last_activity DESC, conversation_id ASC);

-- Table 3: Unread counts (badge counter)
CREATE TABLE unread_counts (
    user_id         UUID,
    conversation_id UUID,
    count           COUNTER,
    PRIMARY KEY ((user_id), conversation_id)
);

Notice the pattern: every table name describes the query it serves (messages_by_conversation, conversations_by_user). This naming convention is not cosmetic — it documents the data model's intent and prevents future developers from misusing a table for queries it was not designed for.

Partition Key & Clustering Key

The primary key in Cassandra has two parts:

  • Partition key — determines which node stores the row (hashed to a token). All rows with the same partition key are stored on the same node, in the same physical partition.
  • Clustering key — determines the sort order of rows within a partition. Rows within a partition are stored sorted by clustering columns, enabling efficient range scans.

The syntax is: PRIMARY KEY ((partition_key_columns), clustering_key_columns). The double parentheses around the partition key are mandatory when using compound partition keys.

The golden rules:

  1. Partition key = your WHERE clause equality predicate. Every query must specify the full partition key. There is no alternative.
  2. Clustering key = your ORDER BY clause and range scan target. You can filter on clustering columns with =, >, <, >=, <= — but only in the order they are defined.
  3. One partition, one query. A well-designed query reads from exactly one partition. Multi-partition queries (scatter-gather) scale poorly and indicate a data model problem.
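Rule 2 is what makes clustering keys powerful: because rows within a partition are stored sorted, a range predicate becomes a binary search plus one contiguous read, as in this small in-memory model (illustrative only; Cassandra does this inside SSTables on disk):

```python
import bisect

# One partition's rows, sorted by clustering key (timestamp, message_id):
partition = sorted([
    (20240101, "m1"), (20240102, "m2"), (20240103, "m3"), (20240105, "m4"),
])

def range_scan(rows, lo, hi):
    # WHERE ts >= lo AND ts <= hi → slice bounds via binary search,
    # then a contiguous read. No row outside the slice is touched.
    start = bisect.bisect_left(rows, (lo,))
    end = bisect.bisect_left(rows, (hi + 1,))
    return rows[start:end]

print(range_scan(partition, 20240102, 20240104))
# → [(20240102, 'm2'), (20240103, 'm3')]
```

This is also why clustering-column filters must follow the declared order: the sort is lexicographic over the clustering columns, so a range on the second column is only contiguous once the first is fixed.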

Denormalization-First Design

In relational databases, normalization reduces redundancy. In Cassandra, denormalization is the strategy. The Staff-level perspective: you are trading storage (cheap, linear scaling) for read performance (the bottleneck you cannot fix later). Every time you resist denormalizing, ask yourself: "Am I optimizing for disk space or for query latency at 10x scale?" The answer is always latency.

Practical denormalization patterns:

  • Duplicate columns across tables. If user_name appears in 5 query tables, store it in all 5. Update all 5 on write.
  • Materialized views. Cassandra supports server-side materialized views — the database automatically maintains a second table from a base table. In practice, materialized views have consistency bugs and performance issues. Use application-side dual-writes instead.
  • Write-path aggregation. Instead of computing aggregates at read time (SUM, COUNT, AVG), maintain pre-computed counters at write time. Cassandra has a counter column type, but it has limitations (no TTL, no conditional update). Application-layer counters with lightweight transactions are more reliable.

Consistency Levels — The Tunable Tradeoff

How Consistency Levels Work

Cassandra lets you choose a consistency level (CL) per query — not per table, not per cluster. This means you can use QUORUM for payment-critical writes and ONE for analytics reads within the same application. The consistency level determines how many replicas must acknowledge a read or write before the coordinator returns success to the client.

With RF=3 (three replicas per partition):

  • ONE — 1 of 3 replicas. Lowest latency (single replica); highest availability (survives 2 node failures). Use for: logging, metrics, non-critical reads.
  • TWO — 2 of 3. Medium latency; survives 1 node failure. Use for: moderate durability needs.
  • QUORUM — 2 of 3 (⌊RF/2⌋+1). Medium-high latency; survives 1 node failure. Use for: most production workloads.
  • ALL — 3 of 3. Highest latency (waits on the slowest replica); zero fault tolerance. Use for: almost never in production.
  • LOCAL_QUORUM — quorum in the local DC. Medium latency; survives 1 local node failure. Use for: multi-datacenter deployments.
  • EACH_QUORUM — quorum in every DC. High latency; survives 1 node failure per DC. Use for: cross-DC strong consistency.

Strong Consistency Formula

Cassandra provides strong consistency — every read is guaranteed to observe the most recent acknowledged write — when R + W > RF, where R is the read consistency level, W is the write consistency level, and RF is the replication factor. With RF=3 and QUORUM for both reads and writes: 2 + 2 > 3 — at least one replica participates in both the read and the write, guaranteeing you read the latest value. (This is not full linearizability: concurrent writes are resolved by last-write-wins timestamps; true linearizability requires LWT.)
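The formula is mechanical enough to encode directly (consistency levels expressed as replica counts; a sketch, not any driver's API):

```python
def replicas_for(level: str, rf: int) -> int:
    # Map a consistency level name to the number of replicas it waits on.
    return {"ONE": 1, "TWO": 2, "QUORUM": rf // 2 + 1, "ALL": rf}[level]

def strongly_consistent(read_cl: str, write_cl: str, rf: int = 3) -> bool:
    # R + W > RF → the read set and write set must overlap in at least
    # one replica, so some replica the read contacts saw the latest write.
    return replicas_for(read_cl, rf) + replicas_for(write_cl, rf) > rf

print(strongly_consistent("QUORUM", "QUORUM"))  # 2 + 2 > 3 → True
print(strongly_consistent("ONE", "QUORUM"))     # 1 + 2 > 3 → False
print(strongly_consistent("ONE", "ALL"))        # 1 + 3 > 3 → True
```

The last line shows the flexibility of the tradeoff: you can pay the overlap cost on either side, e.g. expensive ALL writes buy you cheap ONE reads.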

The Staff-level nuance: QUORUM + QUORUM gives you strong consistency for individual rows, not transactions across rows. If you need to read two rows and make a decision based on both, those reads can return values from different points in time. Cassandra has lightweight transactions (LWT) using Paxos for compare-and-set operations on a single partition, but they are 4-10x slower than normal writes. Use them sparingly.


Lightweight Transactions (LWT)

Cassandra's tunable consistency applies to individual reads and writes — it does not provide compare-and-set semantics. For conditional operations ("insert this row only if it doesn't exist," "update this value only if it currently equals X"), Cassandra offers lightweight transactions using Paxos consensus.

-- Only insert if the username doesn't exist (uniqueness constraint)
INSERT INTO users (username, email, created_at)
VALUES ('alice', 'alice@example.com', toTimestamp(now()))
IF NOT EXISTS;

-- Only update if the current balance matches (optimistic locking)
UPDATE accounts SET balance = 950
WHERE account_id = 'acc-123'
IF balance = 1000;

LWTs are powerful but expensive. Each LWT requires four network round-trips (Paxos prepare/promise, read, propose/accept, commit) compared to one for a normal write. In practice, LWTs are 4-10x slower than regular writes and should be used sparingly — for unique constraint enforcement, leader election, and compare-and-set operations only. If you find yourself using LWTs for the majority of your writes, Cassandra is probably not the right database for that workload.

Read Repair & Hinted Handoff

When a read at QUORUM returns different values from different replicas, the coordinator returns the value with the highest timestamp and triggers a read repair — it sends the latest value to the stale replica in the background. Read repair is a consistency healing mechanism, not a consistency guarantee. It only triggers on reads, so infrequently-read data can remain inconsistent until anti-entropy repair runs.

Hinted handoff handles writes when a replica is temporarily down. The coordinator stores a "hint" (the write data + the target replica) locally. When the down node comes back, the coordinator forwards the stored hints. Hints are stored for a configurable window (default 3 hours). If a node is down longer than the hint window, hints expire and the data must be recovered through full repair. In interviews, mention hinted handoff as the mechanism that makes CL=ONE writes durable even during node failures — but note that hints are not a substitute for repair.


Anti-Patterns — What Kills Cassandra Deployments

Secondary Indexes at Scale

Cassandra supports secondary indexes (CREATE INDEX), and they work fine in development with small datasets. In production at scale, they are a trap. A secondary index is local to each node — it only indexes the data that node owns. A query using a secondary index must be sent to every node in the cluster (scatter-gather), wait for all responses, and merge results. With 100 nodes, that is 100 sub-queries for a single SELECT.

The rule: If your query filters by something other than the partition key, create a denormalized table with that column as the partition key. Do not use a secondary index. The only exception is low-cardinality columns queried alongside the partition key (e.g., WHERE user_id = ? AND status = 'active' where status has <10 distinct values and you always provide user_id).

Large Partitions

A partition that exceeds 100MB or 100,000 rows is a ticking time bomb. Large partitions cause compaction to consume excessive memory and disk I/O, reads to time out as the node scans through millions of cells, and GC pauses to spike as the JVM processes large heap allocations.

Time-series anti-pattern: Using a sensor_id as the partition key for IoT data. As data accumulates over months, each sensor's partition grows unboundedly. The fix: use a compound partition key like (sensor_id, date_bucket) where date_bucket is a day, week, or month. This creates bounded partitions that you can TTL independently.
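A sketch of the bucketed key (the bucket granularity and string format are illustrative design choices):

```python
from datetime import date

def partition_key(sensor_id: str, ts: date, bucket: str = "month") -> tuple:
    # Compose the partition key from the sensor id plus a time bucket, so
    # no single partition grows without bound and whole buckets can be
    # TTL'd or dropped. Granularity is a knob: size buckets so each stays
    # well under the ~100MB guideline at your write rate.
    if bucket == "day":
        return (sensor_id, ts.isoformat())
    if bucket == "month":
        return (sensor_id, f"{ts.year:04d}-{ts.month:02d}")
    raise ValueError(f"unknown bucket: {bucket}")

# Readings from the same sensor land in different partitions across months:
print(partition_key("sensor-7", date(2024, 1, 15)))  # ('sensor-7', '2024-01')
print(partition_key("sensor-7", date(2024, 2, 3)))   # ('sensor-7', '2024-02')
```

The read-side cost is that a query spanning months must hit one partition per bucket — bounded, known in advance, and parallelizable.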

Tombstone Accumulation

Cassandra does not delete data immediately. A DELETE writes a tombstone — a marker that says "this data was deleted at timestamp T." Tombstones must persist until all replicas have been notified of the deletion and the gc_grace_seconds window (default 10 days) has passed. During reads, the node must scan through tombstones to determine which data is live, which burns CPU and I/O.

The tombstone death spiral: a table with frequent deletes accumulates tombstones. Reads slow down as they scan tombstones. Read timeouts increase. Clients retry, adding load. More timeouts cascade. The fix: design around deletes — use TTLs instead of explicit deletes where possible, use time-bucketed partitions so old data ages out via partition-level drops, and monitor tombstone counts per read with nodetool tablestats.


Compaction Strategies

Compaction is the process of merging SSTables — combining multiple small immutable files into fewer, larger files, discarding tombstones and superseded values. The compaction strategy determines when and how SSTables are merged, and choosing the wrong strategy for your workload is a common cause of performance degradation.

Size-Tiered Compaction (STCS)

The default strategy. SSTables are grouped into "tiers" by size, and when enough SSTables of similar size accumulate (default: 4), they are merged into one larger SSTable. STCS is write-optimized — it minimizes the total I/O spent on compaction relative to the data written. The tradeoff: read amplification increases because data for a single partition can be spread across many SSTables at different tiers.

Best for: Write-heavy workloads with infrequent reads (logs, metrics, event streams). It also absorbs write bursts well, deferring compaction pressure rather than paying it immediately.

Worst for: Read-heavy workloads or tables with frequent updates/deletes. STCS can temporarily require 2x the table size in disk space during compaction (old + new SSTables exist simultaneously).

Leveled Compaction (LCS)

Organizes SSTables into levels (L0, L1, L2...) where each level is 10x the size of the previous. SSTables within a level have non-overlapping token ranges, so a read for a single partition touches at most one SSTable per level. LCS guarantees bounded read amplification (at most ~10 SSTables consulted) but at the cost of higher write amplification — data is rewritten 10x more often than STCS.

Best for: Read-heavy workloads, tables with frequent updates, and workloads where consistent read latency matters more than write throughput.

Worst for: Write-heavy workloads where write amplification will saturate disk I/O. Bootstrapping new nodes is painfully slow with LCS.

Time-Window Compaction (TWCS)

Designed specifically for time-series data. SSTables are grouped by time window (e.g., 1 hour, 1 day). Within each window, STCS is used. Across windows, no compaction occurs — old windows are left as-is. This works perfectly with TTL: when all data in a window expires, the entire SSTable is dropped without any compaction I/O.

Best for: Time-series and immutable event data with TTL-based expiration. This is the only correct strategy for time-bucketed data.

Worst for: Tables with updates or deletes that span time windows (tombstones cannot be resolved across windows).

  • STCS — write amplification: low (~1x); read amplification: high (many SSTables); space overhead: up to 2x during compaction. Best workload: write-heavy, few reads.
  • LCS — write amplification: high (~10x); read amplification: low (bounded); space overhead: ~10% steady-state. Best workload: read-heavy, frequent updates.
  • TWCS — write amplification: lowest; read amplification: low within a window; space overhead: minimal. Best workload: time-series with TTL.

Scaling — Linear Horizontal Growth

Adding Capacity

Cassandra scales linearly: double the nodes, double the throughput. This is not theoretical — Netflix publicly demonstrated linear scaling to over one million writes per second by simply adding nodes. The mechanism is straightforward: new nodes join the ring, claim vnode token ranges from existing nodes, and existing nodes stream the relevant data to the new node. During streaming, the cluster remains fully operational.

Capacity planning heuristic: Each Cassandra node handles roughly 3,000-5,000 transactions per second per core on SSDs (mixed read/write workload with QUORUM consistency). A 3-node cluster with 8-core machines gives you ~100K ops/s. Need 1M ops/s? Deploy 30 nodes. The scaling is genuinely linear because there is no coordination bottleneck — no master to become a bottleneck, no global lock, no distributed transaction manager.
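The heuristic reduces to arithmetic (the 4,000 ops/s-per-core midpoint is an assumption taken from the 3,000-5,000 range above):

```python
import math

def nodes_needed(target_ops: int, cores_per_node: int = 8,
                 ops_per_core: int = 4_000) -> int:
    # Linear scaling assumption: cluster throughput = nodes × cores × per-core rate.
    return math.ceil(target_ops / (cores_per_node * ops_per_core))

print(nodes_needed(96_000))     # 3 — matches "~100K ops/s on 3 nodes"
print(nodes_needed(1_000_000))  # 32 — same ballpark as the text's 30 nodes
```

In practice, pad the result for headroom: compaction, repair streams, and node-failure scenarios all consume capacity the steady-state math ignores.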

Scaling Considerations & Limits

Linear scaling has prerequisites. The most common mistake is assuming that adding nodes fixes performance problems caused by bad data models. If your partition keys create hotspots, adding nodes just means more nodes are idle while the hot node is overwhelmed. If your partitions are unbounded, adding nodes distributes the unbounded partitions across more machines — each one still causing compaction and read issues.

Hardware recommendations per node (production):

  • CPU: 8-16 cores — compaction and GC are CPU-intensive.
  • Memory: 32-64GB (8-16GB heap) — off-heap memory is used for Bloom filters and caches.
  • Storage: SSDs (NVMe preferred) — reads generate random I/O; writes are sequential.
  • Network: 10 Gbps — streaming during rebalance and repair saturates 1G links.
  • Disk capacity: 1-2TB per node — smaller nodes mean faster repairs and streaming.

The "smaller nodes" strategy is counterintuitive but critical: running 20 nodes with 1TB each is operationally far better than 5 nodes with 4TB each. Repair, streaming, and recovery all scale with data per node. If a 4TB node fails, the surviving nodes must stream 4TB of data to the replacement — which can take hours. A 1TB node recovers in minutes.

Multi-Datacenter Replication

Cassandra has first-class multi-datacenter support. The NetworkTopologyStrategy replication strategy lets you specify independent replication factors per datacenter: {'dc-east': 3, 'dc-west': 3} means three replicas in each DC. Writes with LOCAL_QUORUM are acknowledged by the local DC and asynchronously replicated to remote DCs. This gives you:

  • Active-active deployment — both DCs serve reads and writes simultaneously.
  • Disaster recovery — lose an entire DC, the other continues with all data.
  • Geographic latency optimization — users connect to the nearest DC.

The tradeoff: cross-DC replication is asynchronous, so brief inconsistency windows exist between DCs. For most workloads (social feeds, messaging, IoT), this is acceptable. For financial transactions requiring cross-DC consistency, use EACH_QUORUM — but expect 10x higher latency due to cross-DC round trips.


Failure Modes & Recovery

1. Tombstone Accumulation Spiral

Symptoms: Read latency spikes, TombstoneOverwhelmingException, GC pauses >5s, read timeouts increasing over time.

Root cause: Tables with frequent deletes (e.g., inbox tables where read messages are deleted, session tables with explicit expiry deletes) accumulate tombstones faster than compaction resolves them. The gc_grace_seconds window (10 days default) prevents tombstone removal because deleted data might still exist on a replica that was down during the delete.

Fix: Redesign to use TTLs instead of DELETE. Use time-bucketed partition keys so old partitions are dropped entirely. Reduce gc_grace_seconds if all replicas are consistently healthy (risky — zombie data can resurface). Monitor with nodetool tablestats — if Average tombstones per slice exceeds a few hundred, investigate.

2. Large Partition Warnings → OOM

Symptoms: WARN Compacting large partition in logs, compaction falling behind, node OOM crashes during compaction, uneven disk usage across nodes.

Root cause: A partition key with unbounded cardinality (e.g., user_id for a power user with millions of activities, channel_id for a public chat room). Compaction must read the entire partition into memory to merge SSTables.

Fix: Add a bucketing column to the partition key: (user_id, activity_month) or (channel_id, day_bucket). For existing data, create a new table with the bucketed key and run a migration. Monitor partition sizes with nodetool tablehistograms.

3. Gossip Instability & Split Brain

Symptoms: Nodes repeatedly marked down/up in rapid succession, SchemaDisagreement errors, clients receiving conflicting schema versions, operations timing out sporadically.

Root cause: Network partitions, GC pauses exceeding the gossip timeout, or too many nodes joining/leaving simultaneously. The phi accrual failure detector misinterprets long GC pauses as node failures.

Fix: Tune GC to minimize stop-the-world pauses (G1GC with appropriate heap sizing — typically 8-16GB, not more). Ensure network latency between nodes is <10ms within a DC. During rolling upgrades, wait for each node to fully rejoin and stream before proceeding to the next. Never restart more than one node simultaneously.

4. Repair Debt

Symptoms: nodetool repair takes increasingly long, consistent reads returning stale data, read repair metrics showing high disagreement rates.

Root cause: Anti-entropy repair (the process that synchronizes replicas) has not been run within gc_grace_seconds. If repair does not run before tombstones expire, deleted data can be resurrected by a replica that missed the delete — the "zombie data" problem.

Fix: Run incremental repair on every node at least once within gc_grace_seconds (default 10 days, so schedule repair weekly). Use Reaper (an open-source repair orchestrator) rather than manual nodetool repair. Monitor repair completion status — incomplete repairs are worse than no repair because they mark ranges as repaired when they are not.

5. Hotspot Partitions

Symptoms: One or two nodes at 90%+ CPU while others idle, high latency for specific partition keys, uneven disk usage despite uniform vnode distribution.

Root cause: A partition key with low cardinality or skewed distribution. Classic example: using country_code as partition key where 40% of traffic is from one country, or using date as partition key where today's partition receives all writes.

Fix: Add a bucketing/sharding dimension to the partition key. For time-based hotspots: (date, shard_bucket) where shard_bucket is hash(entity_id) % N. For geographic hotspots: include a sub-region or user segment in the key. The tradeoff is that reads must now query N partitions and merge — but this fan-out is bounded and parallelizable.
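The write-side bucketing and read-side fan-out can be sketched like this (N and the use of Python's built-in hash are illustrative; a production system needs a stable hash such as Murmur3 so writers and readers agree across processes):

```python
N_SHARDS = 8  # tuning knob: more shards = less hotspotting, wider fan-out

def write_key(day: str, entity_id: str) -> tuple:
    # Compound partition key spreads one hot date across N_SHARDS partitions.
    # NOTE: Python's str hash is salted per process; within one process it
    # is consistent, which is all this sketch needs.
    return (day, hash(entity_id) % N_SHARDS)

def read_day(store: dict, day: str) -> list:
    # Bounded fan-out: one lookup per shard, merged in the application.
    rows = []
    for shard in range(N_SHARDS):
        rows.extend(store.get((day, shard), []))
    return rows

store = {}
for entity in ["user-1", "user-2", "user-3"]:
    store.setdefault(write_key("2024-06-01", entity), []).append(entity)

print(sorted(read_day(store, "2024-06-01")))  # ['user-1', 'user-2', 'user-3']
```

The fan-out is the price of the fix, but it is a fixed N parallel queries rather than an unbounded scatter-gather across the cluster.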


When to Use vs. Alternatives

  • Best for — Cassandra: known query patterns, massive write throughput, multi-DC. DynamoDB: serverless, unpredictable traffic, AWS-native. ScyllaDB: Cassandra workloads needing lower latency. PostgreSQL: complex queries, transactions, evolving schema.
  • Scaling model — Cassandra: linear, self-managed nodes. DynamoDB: automatic, pay-per-request. ScyllaDB: linear, C++ (no JVM GC). PostgreSQL: vertical + read replicas; Citus for sharding.
  • Consistency — Cassandra: tunable per-query. DynamoDB: eventually consistent or strong per-item. ScyllaDB: tunable (Cassandra-compatible). PostgreSQL: ACID transactions, serializable isolation.
  • Data model — Cassandra: wide-column, query-first tables. DynamoDB: key-value + GSIs, single-table design. ScyllaDB: wide-column (CQL-compatible). PostgreSQL: relational, normalized, JOIN-capable.
  • Operational cost — Cassandra: high (JVM tuning, repair, compaction). DynamoDB: zero (fully managed). ScyllaDB: medium (less GC tuning). PostgreSQL: medium (VACUUM, replication lag).
  • Multi-DC — Cassandra: first-class, active-active. DynamoDB: global tables (limited). ScyllaDB: first-class. PostgreSQL: BDR/Citus (complex).
  • Query flexibility — Cassandra: low (partition key required). DynamoDB: low (partition key required). ScyllaDB: low (CQL-compatible). PostgreSQL: high (arbitrary SQL).
  • When to avoid — Cassandra: evolving access patterns, ad-hoc queries, small scale. DynamoDB: multi-cloud needs, complex queries. ScyllaDB: less mature ecosystem. PostgreSQL: >10TB or >100K writes/sec.

Decision framework for interviews:

  • "We need to store time-series data at massive scale" → Cassandra (or TimescaleDB if you need SQL)
  • "We need active-active multi-region" → Cassandra (best native multi-DC story)
  • "We need flexible queries and might change access patterns" → PostgreSQL
  • "We're on AWS and want zero ops" → DynamoDB
  • "We want Cassandra performance without JVM overhead" → ScyllaDB
  • "We need transactions across entities" → PostgreSQL (Cassandra has no cross-partition transactions)


Operational Concerns

JVM Tuning

Cassandra runs on the JVM, and garbage collection is the primary source of latency spikes. The standard production configuration:

  • Heap size: 8-16GB. Never more than 16GB — larger heaps cause longer GC pauses. If you need more memory for caching, Cassandra uses off-heap memory for row caches and Bloom filters.
  • GC algorithm: G1GC (default in Cassandra 4.x) or ZGC for Java 11+. Avoid CMS — it is deprecated and causes unpredictable full GC pauses.
  • Target GC pause: <200ms p99. Monitor with nodetool gcstats and GC logs. If p99 exceeds 500ms, the phi accrual failure detector may false-positive, marking the node as down.

Repair & Anti-Entropy

Repair is not optional. Without regular repair, replicas drift, tombstones cannot be safely expired, and deleted data can resurface. The two repair modes:

  • Full repair: Compares all data between replicas using Merkle trees. Expensive but thorough. Run weekly.
  • Incremental repair: Only compares data written since the last repair. Faster but has had bugs in older Cassandra versions (3.x). Preferred in Cassandra 4.x+.

Use Reaper (cassandra-reaper.io) to orchestrate repairs. It schedules repairs across the cluster, avoids overwhelming nodes with concurrent repair streams, and tracks completion. Manual nodetool repair across 50+ nodes is not sustainable.

Monitoring Essentials

  • Read latency p99 — healthy: <10ms; alert: >50ms
  • Write latency p99 — healthy: <5ms; alert: >25ms
  • Pending compactions — healthy: <5; alert: >20
  • Tombstones per read — healthy: <100; alert: >1000
  • Dropped mutations — healthy: 0; alert: any >0
  • GC pause max — healthy: <200ms; alert: >500ms
  • Heap usage — healthy: <75%; alert: >85%
  • Disk usage — healthy: <50%; alert: >70% (need room for compaction)

Backup & Restore

Cassandra snapshots are node-local — nodetool snapshot creates hard links to SSTables on that node. A full backup requires snapshotting every node. Restoring requires placing snapshot files on the correct nodes based on token ownership. For incremental backups, Cassandra can copy new SSTables to a backup directory automatically. Use Medusa (open-source) or commercial tools for orchestrated backup/restore to cloud storage (S3, GCS).


Interview Application — Staff-Level Plays

Which Playbooks Use Cassandra

  • Database Sharding — Cassandra as an example of native horizontal partitioning. Key pattern: consistent hashing with vnodes, no manual resharding needed.
  • Data Consistency — tunable consistency model case study. Key pattern: R + W > RF for strong consistency, LWT for linearizability.
  • Chat Messaging — message storage with time-ordered retrieval. Key pattern: (conversation_id, day_bucket) partition key + message_ts clustering key.
  • Leaderboard & Counting — high-write counter storage. Key pattern: counter columns with eventual consistency.

L5 vs L6 Cassandra Responses

  • "Design message storage for chat"
    L5: "Use Cassandra with conversation_id as partition key."
    L6: "Use Cassandra with (conversation_id, day_bucket) as a compound partition key. CL=LOCAL_QUORUM for sends, ONE for reads. Time-bucketing prevents unbounded partitions for long-lived group chats. Dual-write to conversations_by_user for inbox queries."
  • "How do you handle consistency?"
    L5: "Cassandra is eventually consistent."
    L6: "Cassandra is tunably consistent. QUORUM reads + QUORUM writes give strong consistency per partition (R+W>RF). Cross-partition consistency requires LWT with Paxos, which is ~4x slower. For most workloads, eventual consistency with read repair is sufficient — design the application to be idempotent."
  • "How do you scale?"
    L5: "Add more nodes."
    L6: "Add nodes — Cassandra scales linearly with vnodes auto-rebalancing. But first, verify the data model isn't the bottleneck: check for hotspot partitions, large partition warnings, and tombstone accumulation. Scaling hardware won't fix a bad partition key."
  • "What about deletes?"
    L5: "DELETE the row."
    L6: "Never raw DELETE in Cassandra. Use TTLs for time-bounded data (sessions, caches). Use time-bucketed partitions so old data is dropped via partition deletion, not tombstones. Monitor gc_grace_seconds and ensure repair runs within that window to prevent zombie data."

The Staff Cassandra Checklist

When proposing Cassandra in a system design interview, hit these points:

  1. Justify the choice: "We need >100K writes/sec with known access patterns and multi-DC replication. Cassandra is designed for this exact workload."
  2. State the query patterns first: "We have three queries: get messages by conversation, get conversations by user, get unread count by user. That gives us three Cassandra tables."
  3. Design partition keys explicitly: Name the partition key and clustering key for every table. Explain why the partition key prevents hotspots and bounds partition size.
  4. State the consistency level: "QUORUM for writes (durability), LOCAL_QUORUM for reads (latency), ONE for analytics and non-critical paths."
  5. Address the failure modes: Tombstones (use TTL), large partitions (use time bucketing), repair (schedule weekly with Reaper).
  6. Name the complement: "Cassandra for the hot write path, PostgreSQL as the system of record for user profiles and financial data that needs transactions and flexible queries."

Quick Reference Card

Write path:    Client → Coordinator → Replicas (commit log → memtable → SSTable)
Read path:     Client → Coordinator → Replicas (memtable + SSTables merge by timestamp)
Consistency:   R + W > RF = strong consistency (per partition only)
Partition key: Determines node placement — MUST be in every query's WHERE clause
Clustering:    Determines sort order within partition — enables range scans
Compaction:    STCS (writes) | LCS (reads) | TWCS (time-series)
Replication:   RF=3 standard, NetworkTopologyStrategy for multi-DC
Repair:        Run weekly, within gc_grace_seconds, use Reaper
Anti-patterns: Secondary indexes at scale, large partitions, raw DELETEs