Why This Matters
Indexing is the first performance lever every engineer reaches for — and the one most frequently misused. "Just add an index" is the system design equivalent of "just add more servers." It sounds reasonable until you realize that every index is a write-time cost, a storage cost, and a maintenance burden that compounds over the life of the table.
Staff-level interviewers use indexing questions to test whether a candidate thinks in tradeoffs. A senior engineer says "add an index on the slow column." A Staff engineer says "what's the read/write ratio? We have 5 indexes on a write-heavy table — adding a 6th compounds a problem we should be reducing. Let's audit which indexes are actually scanned before adding more."
The indexing conversation in a system design interview is never just about making reads faster. It's about the cost structure of your storage layer: how B-trees and LSM-trees make different tradeoffs for different workloads, why composite index column order is a design decision, how covering indexes eliminate IO entirely, and when partitioning is a better answer than more indexes.
This guide teaches you to reason about indexes as a cost conversation — not a performance trick — which is exactly the framing Staff interviewers evaluate.
The 60-Second Version
- B-tree: balanced tree, O(log N) lookup. Read-optimized. Default for most RDBMS (Postgres, MySQL). The workhorse of OLTP with mixed read/write workloads.
- LSM-tree: log-structured merge tree. Write-optimized via sequential writes. Powers RocksDB, Cassandra, LevelDB. Trade-off: read amplification and compaction overhead.
- Composite indexes obey the leftmost prefix rule. An index on (A, B, C) serves queries on A, (A, B), and (A, B, C) — but NOT B alone or (B, C). Column order is a design decision, not an afterthought.
- Covering indexes include every column the query needs, eliminating heap/table lookups entirely. The fastest query is one that never touches the table.
- Write amplification is the hidden cost. Every index adds a write on every insert/update. Five indexes means roughly 5x write overhead. Index design is always a cost conversation.
- Partial/filtered indexes target only rows matching a condition. Smaller footprint, faster lookups for high-frequency query patterns on a subset of data.
- The Staff-level question is never "should we add an index?" It's "what's the cost of this index, and does the read benefit justify it?"
How Database Indexes Work
The Problem Indexes Solve
Without an index, a database must examine every row in the table to find the ones matching your query. This is a full table scan — O(N) where N is the number of rows.
For a 1,000-row table, a full scan takes microseconds. For a 100-million-row table, it takes seconds to minutes. For a user-facing query with a 50ms latency budget, a full scan on a large table is not slow — it is architecturally impossible.
An index is a sorted data structure that maps column values to row locations, allowing the database to find matching rows without scanning the entire table. The cost: the index must be maintained on every write (insert, update, delete).
B-Tree Indexes
B-trees are the default index type for Postgres, MySQL, Oracle, and SQL Server. They maintain a balanced tree structure where:
- The root node contains a small number of key ranges, pointing to child nodes
- Internal nodes narrow the search space by directing to the correct subtree
- Leaf nodes contain the actual key values and pointers to the heap (the table data)
A B-tree on 1 billion rows is typically 3–4 levels deep. Each level requires one disk read (or memory lookup if cached). So finding a specific row in a billion-row table costs 3–4 reads — the same as finding a row in a thousand-row table.
# B-tree lookup on 1B rows
Tree depth: ~4 levels (branching factor ~500)
Level 1: Root node → 1 read → narrows to 1/500 of the tree
Level 2: Internal node → 1 read → narrows to 1/250,000
Level 3: Internal node → 1 read → narrows to 1/125,000,000
Level 4: Leaf node → 1 read → finds the exact row
Total: 4 reads for a point lookup on 1 billion rows
Each "read" is ~8 KB from disk or ~100ns from buffer pool
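The depth figure can be sanity-checked with two lines of arithmetic (a sketch; the ~500 branching factor is the assumption from the example above):

```python
import math

def btree_depth(rows: int, branching_factor: int = 500) -> int:
    """Levels needed so that branching_factor ** depth >= rows."""
    return max(1, math.ceil(math.log(rows, branching_factor)))

# 1 billion rows with a ~500-way branching factor fits in 4 levels
print(btree_depth(1_000_000_000))  # → 4
print(btree_depth(1_000))          # → 2
```

This is why "table size" barely matters for point lookups: going from a thousand rows to a billion adds only two levels to the tree.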
B-tree properties that matter in interviews:
- Reads are fast: O(log N), practically 3–4 disk reads regardless of table size
- Writes are in-place: updates modify the existing leaf page (plus WAL entry)
- Range scans are efficient: leaf nodes are linked, so WHERE created_at > X scans sequentially
- Write amplification: each index adds ~1 write per insert/update (page modification + WAL)
LSM-Tree Indexes
Log-Structured Merge trees take the opposite approach: optimize for writes at the cost of reads.
All writes go to an in-memory buffer (memtable). When the buffer fills, it is flushed to disk as a sorted, immutable file (SSTable). Background compaction merges SSTables to keep read performance manageable.
# LSM-tree write path
write(key, value):
    memtable.insert(key, value)        # O(log N) in-memory
    if memtable.full():
        flush_to_disk_as_sstable()     # sequential write — fast

# LSM-tree read path
read(key):
    if key in memtable: return value   # check memory first
    for sstable in levels:             # check each level on disk
        if key in sstable.bloom_filter:
            return sstable.get(key)
    return not_found
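The pseudocode maps onto a toy implementation (a deliberately minimal sketch: real engines add a WAL, bloom filters, and background compaction; ToyLSM and its limits are illustrative names, not any real API):

```python
from typing import Optional

class ToyLSM:
    """Minimal LSM sketch: in-memory memtable, flushed to immutable sorted runs."""
    def __init__(self, memtable_limit: int = 4):
        self.memtable: dict[str, str] = {}
        self.sstables: list[dict[str, str]] = []  # newest run first
        self.limit = memtable_limit

    def write(self, key: str, value: str) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # flush: sort and persist as an immutable run (sequential write)
            self.sstables.insert(0, dict(sorted(self.memtable.items())))
            self.memtable = {}

    def read(self, key: str) -> Optional[str]:
        if key in self.memtable:       # check memory first
            return self.memtable[key]
        for sstable in self.sstables:  # newest run wins
            if key in sstable:
                return sstable[key]
        return None

db = ToyLSM()
for i in range(10):
    db.write(f"k{i}", f"v{i}")
print(db.read("k3"), db.read("k9"))  # older keys live in SSTables, newest in memtable
```

Note the read-path asymmetry: a write touches exactly one structure, while a read may probe the memtable plus every run. That asymmetry is the read amplification the section describes, and it is exactly what bloom filters and compaction exist to contain.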
LSM-tree properties that matter in interviews:
- Writes are fast: always sequential (append to memtable, flush to SSTable)
- Reads may be slow: worst case checks multiple SSTables across multiple levels
- Compaction is the hidden cost: background IO to merge SSTables, can compete with foreground queries
- Space amplification: data may exist in multiple SSTables until compacted
When to Use Which
| Factor | B-Tree | LSM-Tree |
|---|---|---|
| Read-heavy (>5:1 read/write) | Optimal | Read amplification becomes expensive |
| Write-heavy (<2:1 read/write) | Write amplification compounds | Optimal — sequential writes |
| Balanced workload | Good — in-place updates | Good — but budget for compaction IO |
| Point lookups | 3–4 reads, predictable | Variable — depends on SSTable count |
| Range scans | Excellent — linked leaf pages | Good after compaction, poor before |
| Space efficiency | 1x data + index overhead | 1.1–1.5x due to duplicate entries across levels |
| Compaction overhead | None | 10–30% of disk IO goes to compaction |
Staff rule: Don't say "B-tree is standard" or "LSM is faster for writes." Say: "Our workload is 80% reads with latency-sensitive point lookups — B-tree gives us predictable 3–4 read performance without compaction overhead. If we had a write-heavy ingest pipeline, LSM would be better because sequential writes avoid the random IO cost of in-place B-tree updates."
Composite Indexes and Column Order
This is the most common indexing mistake in both production systems and interviews. Column order in a composite index determines which queries the index can serve.
The Leftmost Prefix Rule
An index on (A, B, C) can efficiently serve:
- WHERE A = ? — uses the first column
- WHERE A = ? AND B = ? — uses the first two columns
- WHERE A = ? AND B = ? AND C = ? — uses all three columns
- WHERE A = ? AND B > ? — equality on A, range on B
- WHERE A = ? ORDER BY B — equality on A, sorted by B
It cannot efficiently serve:
- WHERE B = ? — skips the first column; the index is useless
- WHERE B = ? AND C = ? — skips A; still useless
- WHERE A = ? AND C = ? — can use A, but must scan all B values to check C
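The rule is mechanical enough to express as a small check (a sketch with a hypothetical helper, simplified to equality-only predicates):

```python
def index_serves(index_cols: list[str], query_cols: set[str]) -> bool:
    """An index serves a query only if the queried columns form a leftmost prefix."""
    prefix = []
    for col in index_cols:
        if col in query_cols:
            prefix.append(col)
        else:
            break  # first gap ends the usable prefix
    return set(prefix) == query_cols

idx = ["A", "B", "C"]
print(index_serves(idx, {"A"}))       # → True
print(index_serves(idx, {"A", "B"}))  # → True
print(index_serves(idx, {"B"}))       # → False (skips A)
print(index_serves(idx, {"B", "C"}))  # → False
```

A query on {A, C} also returns False here, which mirrors the real behavior: the index can seek on A but must scan every B value to check C.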
The Column Order Rule
For composite indexes, the optimal column order follows a specific pattern:
- Equality columns first — columns in WHERE col = ? clauses
- Range columns next — columns in WHERE col > ? or WHERE col BETWEEN clauses
- Sort columns last — columns in ORDER BY
# Query: WHERE user_id = ? AND created_at > ? ORDER BY score DESC
# Good: (user_id, created_at, score)
# → equality on user_id, range on created_at, sort by score
# Bad: (created_at, user_id, score)
# → range on created_at breaks the ability to use user_id efficiently
# Bad: (user_id, score, created_at)
# → range on created_at must scan all score values
Why this matters in interviews: If the interviewer asks "what index would you add?" and you propose (A, B) without explaining why A comes first (equality vs. range, selectivity, query pattern), you've given a senior answer. The Staff answer explains the column ordering rationale.
Covering Indexes
A covering index contains all the columns the query reads, so the database never needs to access the heap (table data). The query is served entirely from the index — zero table IO.
# Query: SELECT user_id, email FROM users WHERE status = 'active'
# Regular index on (status):
# 1. Scan index to find rows where status = 'active'
# 2. For each row, follow the pointer to the heap to read user_id and email
# → N index reads + N heap reads (random IO)
# Covering index on (status) INCLUDE (user_id, email):
# 1. Scan index — user_id and email are already in the leaf nodes
# → N index reads, 0 heap reads
The performance difference is dramatic. The gap between 0 random IOs and 1 random IO per row is larger than the gap between 1 and 10. For a query returning 10,000 rows, a covering index eliminates 10,000 random disk reads.
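A back-of-envelope cost model makes the gap concrete (illustrative page-fanout numbers, not measurements from any particular database):

```python
def query_io(rows_returned: int, rows_per_index_page: int = 400,
             covering: bool = False) -> int:
    """Rough page-read count: index leaf pages, plus one random heap read
    per matching row when the index is not covering."""
    index_reads = -(-rows_returned // rows_per_index_page)  # ceiling division
    heap_reads = 0 if covering else rows_returned
    return index_reads + heap_reads

# 10,000 matching rows
print(query_io(10_000))                 # → 10025 page reads (regular index)
print(query_io(10_000, covering=True))  # → 25 page reads (covering index)
```

The 25 index-page reads are also sequential within the leaf chain, while the 10,000 heap reads are random. The covering index wins on both count and access pattern.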
When to use covering indexes:
- High-frequency queries that read a small, fixed set of columns
- Queries where the table is large but the index is narrow
- Analytics queries that aggregate over a subset of columns
When not to:
- If the query reads most columns (the covering index would be nearly as large as the table)
- If the table is small enough to fit in memory anyway
Index Design Cheat Sheet
| Query Pattern | Index Design | Why |
|---|---|---|
| WHERE user_id = ? | Single-column on user_id | Point lookup, B-tree ideal |
| WHERE user_id = ? AND created_at > ? | Composite (user_id, created_at) | Equality first, then range |
| WHERE status = 'active' AND ... | Partial index WHERE status = 'active' | Indexes only matching rows, smaller footprint |
| SELECT id, email FROM users WHERE ... | Covering index (filter) INCLUDE (id, email) | Index-only scan, zero table access |
| WHERE text LIKE '%search%' | Full-text index (GIN in Postgres) | B-tree can't help with a leading wildcard |
| ORDER BY score DESC LIMIT 10 | Index on score DESC | Avoids sorting; reads the top 10 directly |
| WHERE region = ? AND created_at > ? | Composite (region, created_at) | Equality on region first, then range on time |
Write Amplification
Every index is a write-time cost. Understanding this cost is what separates the Staff answer ("let me quantify the write overhead") from the senior answer ("just add an index").
| Indexes | Write Cost per INSERT | Total IO per Insert |
|---|---|---|
| 0 | 1 write (heap only) | ~1 page write + WAL |
| 1 | ~2 writes | Heap + 1 index page + WAL |
| 3 | ~4 writes | Heap + 3 index pages + WAL |
| 5 | ~6 writes | Heap + 5 index pages + WAL |
| 10 | ~11 writes | Anti-pattern territory |
At 5+ indexes on a write-heavy table, the overhead is architectural, not incidental. Each index page modification also triggers WAL writes and potentially page splits.
Partial indexes reduce this cost. A partial index WHERE status = 'active' on a table where 95% of rows are status = 'completed' is 20x smaller than a full index — faster to scan and cheaper to maintain.
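The scaling in the table is worth being able to reproduce on a whiteboard. A quick sketch using the example's assumptions (8 KB pages, WAL excluded, decimal GB):

```python
def index_write_load(inserts_per_day: int, num_indexes: int,
                     page_kb: int = 8) -> tuple[int, float]:
    """Page writes/day and GB/day from heap plus index maintenance."""
    page_writes = inserts_per_day * (1 + num_indexes)  # heap page + one page per index
    gb_per_day = page_writes * page_kb / 1_000_000     # KB to decimal GB
    return page_writes, gb_per_day

writes, gb = index_write_load(2_000_000, 8)
print(writes, gb)  # → 18000000 144.0 (18M page writes, 144 GB/day)
```

These are the exact figures used in the walkthrough of Scenario 2 below: 2M inserts/day against 8 indexes produces 18M page writes and roughly 144 GB/day of IO from index maintenance alone.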
Visual Guide
B-Tree vs LSM Decision
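The decision can be sketched as a rule-of-thumb function using the thresholds from the "When to Use Which" table above (a heuristic sketch, not a prescription; real choices also weigh operational maturity and ecosystem):

```python
def choose_index_structure(read_write_ratio: float,
                           latency_sensitive_reads: bool) -> str:
    """Rule of thumb: B-tree for read-heavy, LSM for write-heavy workloads."""
    if read_write_ratio > 5:
        return "b-tree"  # read-heavy: predictable 3-4 read point lookups
    if read_write_ratio < 2:
        return "lsm"     # write-heavy: sequential writes win
    # balanced workload: prefer predictable reads unless writes dominate
    return "b-tree" if latency_sensitive_reads else "lsm"

print(choose_index_structure(8.0, True))   # → b-tree (user-facing OLTP)
print(choose_index_structure(0.5, False))  # → lsm (ingest pipeline)
```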
The Numbers That Matter
| Metric | Value | Design Implication |
|---|---|---|
| B-tree lookup on 1B rows | 3–4 disk reads | Tree depth is O(log N) with branching factor 300–500 depending on key size |
| B-tree leaf page size | 8 KB (Postgres) | Each leaf holds 300–500 index entries (integer key ≈ 500, UUID ≈ 350) |
| Index entry size | 8–16 bytes typical | Key + 6-byte row pointer; wider keys reduce branching factor |
| Full table scan on 100M rows | seconds to minutes | Unusable for user-facing latency |
| Covering index benefit | 0 heap IO | The gap between 0 and 1 random IO is larger than 1 to 10 |
| Each additional index | +1 write per insert | Plus page splits, WAL entries, and potential rebalancing |
| LSM compaction IO | 10–30% of total disk IO | Must be budgeted; can't be eliminated |
| Bloom filter false positive rate | ~1% with 10 bits/key | LSM uses bloom filters to skip SSTables that don't contain the key |
| Index bloat after heavy updates | 2–3x necessary size | Dead tuples accumulate without vacuum/compaction |
| CREATE INDEX on 100M rows | minutes (blocking) | Always use CONCURRENTLY in production |
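The ~1% bloom-filter figure follows from the standard false-positive formula, p = (1 − e^(−kn/m))^k, with the optimal hash count k = (m/n)·ln 2. This is a sketch of the math, not any engine's implementation:

```python
import math

def bloom_fpr(bits_per_key: float) -> float:
    """False positive rate with the optimal hash count k = bits_per_key * ln 2.
    At optimal k the fill term (1 - e^(-k/bits_per_key)) equals 1/2, so p = 0.5 ** k."""
    k = bits_per_key * math.log(2)
    return (1 - math.exp(-k / bits_per_key)) ** k

print(round(bloom_fpr(10), 4))  # → 0.0082, i.e. ~1% at 10 bits/key
```

Doubling the budget to 20 bits/key squares the rate to roughly 0.007%, which is why bits-per-key is the main bloom-filter tuning knob in LSM engines.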
How This Shows Up in Interviews
Scenario 1: "This query is slow — what do you do?"
Do not say "add an index." Say: "Three questions first. What's the query? SELECT * FROM orders WHERE user_id = ? AND created_at > ? ORDER BY total DESC — that tells me I need a composite index with equality column first: (user_id, created_at, total). What indexes exist already? If there's already (user_id), the composite supersedes it — I'd drop the single-column to reduce write overhead. What's the table's read/write ratio? If it's write-heavy with 5 existing indexes, adding a 6th might make the slow query faster but degrade overall write throughput. The answer might be moving this query to a read replica, not adding an index."
Scenario 2: "Orders table with 500M rows, 8 indexes, degrading write latency" (Full Walkthrough)
This is the classic "too many indexes" problem. Here's how a Staff engineer works through it:
Step 1 — Quantify the write cost. "8 indexes means every INSERT touches the heap plus 8 index pages — roughly 9 page writes per row, plus WAL entries. At 2M inserts/day, that's 18M page writes/day just for index maintenance. Each page write is ~8 KB, so ~144 GB/day of write IO from indexes alone."
Step 2 — Audit index usage. "First, check pg_stat_user_indexes to see which indexes are actually scanned. In my experience, 2–3 of 8 indexes are doing all the work. The rest were added speculatively and never used — or were superseded by composite indexes that cover the same queries. Every unused index costs writes with zero read benefit."
Step 3 — Check for redundancy. "If we have both (user_id) and (user_id, created_at), the composite index serves all queries that the single-column index does. Drop the single-column index — it's pure overhead. Similarly, if we have (status) and (status, region), the composite covers single-column lookups on status."
Step 4 — Move analytics off OLTP. "COUNT(*) GROUP BY status on 500M rows is an analytics query running on the OLTP primary. This forces a full index scan that competes with transaction writes. Move it to a read replica or a materialized view refreshed every 5 minutes. The dashboard doesn't need real-time data."
Step 5 — Partition the table. "500M rows with 2M/day growth means the table and all 8 indexes are growing unboundedly. Range-partition by created_at (monthly). Old partitions are cold; new partitions are hot. Index maintenance is per-partition, not per-table — this keeps the actively-maintained index size manageable."
Step 6 — Consider partial indexes. "If 95% of orders are status = 'completed', a partial index WHERE status != 'completed' covers active-order queries with 5% of the storage. The write amplification for the other 95% of inserts is zero — they don't touch this index because they don't match the predicate."
Why this is a Staff answer: The candidate doesn't add more indexes — they remove unnecessary ones, move analytics workloads elsewhere, and partition to keep index sizes manageable. Every step is justified with numbers (18M page writes/day, 144 GB/day write IO).
Scenario 3: "B-tree or LSM for this use case?"
Do not say "B-tree is standard." Say: "What's the read/write ratio? For our user-facing OLTP with 80% reads and latency-sensitive point lookups, B-tree — it gives predictable 3–4 read performance without compaction overhead. But for our event ingestion pipeline at 50K writes/sec with batch analytical reads, LSM — sequential writes avoid the random IO cost of in-place B-tree updates, and compaction runs during off-peak hours. The tradeoff: B-tree pays write amplification per index but has zero background IO. LSM pays 10–30% of disk IO to compaction but gets sequential writes."
Scenario 4: "How would you index a multi-tenant system?"
The Staff answer: tenant_id goes first in every composite index because it's the highest-selectivity equality predicate. (tenant_id, created_at) not (created_at, tenant_id). Consider a partial index per tenant if one tenant dominates the workload. If the largest tenant has 100x more data than average, their queries will scan 100x more index pages — you may need tenant-specific optimization (dedicated partition, dedicated read replica, or a covering index for their top queries).
In the Wild
Postgres: The Index Ecosystem
PostgreSQL supports 6 index types: B-tree (default), Hash, GiST, SP-GiST, GIN, and BRIN. In practice, B-tree handles 90% of workloads, GIN handles full-text search and JSONB queries, and BRIN handles time-series data on append-only tables. Postgres's INCLUDE clause (added in v11) enables covering indexes — CREATE INDEX ON orders (user_id) INCLUDE (total, status) turns a point lookup + heap read into a pure index scan.
The Staff-level insight: Postgres's query planner chooses between index scan, bitmap index scan, and sequential scan based on estimated selectivity. An index on a boolean column (is_active) where 90% of rows are true will actually perform worse than a sequential scan for WHERE is_active = true because the random IO of following 90% of index pointers is more expensive than a sequential scan. Partial indexes (WHERE is_active = false) fix this by indexing only the minority partition.
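The planner's crossover can be sketched as a toy cost model (constants loosely modeled on Postgres's default seq_page_cost = 1.0 and random_page_cost = 4.0; the table size and rows-per-page are made-up illustration values):

```python
def prefer_index_scan(selectivity: float, table_pages: int = 100_000,
                      rows_per_page: int = 100,
                      seq_cost: float = 1.0, random_cost: float = 4.0) -> bool:
    """True when following index pointers is estimated cheaper than
    sequentially scanning every page of the table."""
    seq_scan_cost = table_pages * seq_cost
    # one random heap read per matching row (the worst-case index-scan shape)
    index_scan_cost = selectivity * table_pages * rows_per_page * random_cost
    return index_scan_cost < seq_scan_cost

print(prefer_index_scan(0.001))  # → True  (rare rows: index scan wins)
print(prefer_index_scan(0.9))    # → False (90% of rows: seq scan wins)
```

Under these constants the crossover sits well below 1% selectivity, which is why an index on a 90%-true boolean column loses to a sequential scan, and why a partial index over the rare 10% side works.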
Cassandra: LSM-Tree at Scale
Cassandra's storage engine uses LSM-trees with configurable compaction strategies. Size-Tiered Compaction (STCS) merges SSTables when N tables of similar size accumulate — good for write-heavy workloads but causes temporary space amplification (2x during compaction). Leveled Compaction (LCS) maintains a strict hierarchy of SSTable levels with guaranteed read amplification bounds — better for read-heavy workloads but with higher steady-state write amplification.
The Staff-level insight: Cassandra's compaction strategy choice is a production-impacting decision, not a tuning knob. STCS on a read-heavy table means unpredictable read latency as the number of SSTables fluctuates. LCS on a write-heavy table means compaction can't keep up, causing write stalls. The correct strategy depends on the access pattern of each specific table — different tables in the same cluster can use different strategies.
RocksDB: The LSM Engine Under Everything
RocksDB powers the storage layers of CockroachDB, TiKV (used by TiDB), and MyRocks (MySQL with LSM storage). Its block cache keeps frequently accessed SSTable blocks in memory, and its bloom filters (configurable bits per key) skip SSTables that definitely don't contain the target key. With 10 bits per key, the false positive rate is ~1% — meaning 99% of SSTables that don't contain your key are skipped without reading from disk.
The Staff-level insight: RocksDB's write amplification is tunable via the level multiplier (default 10). At a multiplier of 10, data is rewritten ~10x across compaction levels. Reducing to 4 cuts write amplification but increases read amplification (more SSTables per level). This is the same read-write tradeoff that B-tree vs LSM represents, but expressed as a continuous dial rather than a binary choice.
Staff Calibration
The sections below are calibration tools for Staff-level interviews. If you already understand indexing mechanics, start here to sharpen the framing that separates L5 from L6 answers.
What Staff Engineers Say (That Seniors Don't)
| Concept | Senior Response | Staff Response |
|---|---|---|
| Index selection | "Add an index on the slow column" | "What's the read/write ratio? Five indexes on a write-heavy table is a cost we need to quantify — each adds ~1 write per insert" |
| B-tree vs LSM | "B-tree is standard" | "B-tree for read-heavy OLTP, LSM for write-heavy ingest — and we need to budget for LSM compaction IO at 10–30% of disk throughput" |
| Composite indexes | "Create an index on those columns" | "Column order follows the equality-range-sort rule. Reordering changes which queries benefit and which are left scanning" |
| Covering indexes | "It avoids a table lookup" | "It turns random heap IO into zero IO. At our query volume, that eliminates N million random disk seeks per day" |
| Index bloat | "We should rebuild indexes" | "Our index-to-data ratio is 1.4x. Partial indexes on the hot partition cut storage by 80% and keep the working set in memory" |
| Too many indexes | "Indexes make reads faster" | "Every index is a write-time contract. At 8 indexes on this table, we're writing 9 pages per INSERT. Let's audit which are actually scanned before adding more." |
Common Interview Traps
- "Just add an index" without discussing write amplification. Staff candidates frame indexing as a tradeoff between read latency, write throughput, and storage cost.
- Ignoring composite column order. Treating (A, B) and (B, A) as equivalent reveals a gap in understanding the leftmost prefix rule.
- Forgetting LSM compaction costs. Citing LSM as "faster for writes" without acknowledging compaction IO, space amplification, and the read penalty is an incomplete answer.
- Missing the B-tree vs LSM decision boundary. The choice is workload-dependent, not universal. Interviewers probe whether you can articulate when each structure wins and why.
- Creating indexes during peak traffic. CREATE INDEX on a large table acquires a write lock (without CONCURRENTLY in Postgres) and can block writes for minutes. Always mention CONCURRENTLY in production contexts.
- Ignoring index bloat. B-tree indexes accumulate dead tuples from updates/deletes. Without REINDEX or autovacuum tuning, an index can be 2–3x its necessary size, wasting buffer pool memory.
- Composite index column order by intuition. The correct order is: equality first, range second, sort third. Getting this wrong means the range condition breaks the sort optimization.
Practice Drill
Staff-Caliber Answer Shape
- Audit all 8 indexes. Use pg_stat_user_indexes to check which indexes are actually scanned. Unused indexes cost writes with zero read benefit — drop them.
- Check for redundant indexes. If you have both (user_id) and (user_id, created_at), the composite covers single-column queries. Drop the single-column index.
- Move the dashboard query off OLTP. COUNT(*) GROUP BY status on 500M rows is an analytics query. Use a materialized view refreshed every 5 minutes or route it to a replica.
- Partition the table. Range-partition by created_at (monthly). Index maintenance is per-partition, keeping active index sizes manageable.
- Consider partial indexes. If 95% of orders are status = 'completed', a partial index WHERE status != 'completed' covers active-order queries with 5% of the storage.
The Staff move: Don't add more indexes. Remove unnecessary ones, move analytics queries elsewhere, and partition to keep index sizes manageable.
Where This Appears
These playbooks apply indexing concepts to complete system design problems with full Staff-level walkthroughs, evaluator-grade rubrics, and practice drills.
- Database Selection — How storage engine choice (B-tree vs LSM) drives index strategy, and when index requirements should influence your database selection rather than the other way around
- Database Sharding — Per-shard index management, how sharding reduces per-node index size, and the tradeoff between global indexes (cross-shard queries) and local indexes (write performance)
- Search & Indexing — Inverted indexes, tokenization, relevance scoring, and why full-text search requires fundamentally different index structures than OLTP queries
- Geospatial Indexing — R-trees, geohash-based indexes, and how spatial queries require specialized index structures that B-trees cannot efficiently serve
Related Technologies: PostgreSQL · Cassandra · Elasticsearch