Database Indexing — Staff Interview Quick Reference
The 60-Second Version
- B-tree: balanced tree, O(log N) lookup. Read-optimized. Default for most RDBMS (Postgres, MySQL). The workhorse of OLTP with mixed read/write workloads.
- LSM-tree: log-structured merge tree. Write-optimized via sequential writes. Powers RocksDB, Cassandra, LevelDB. Trade-off: read amplification and compaction overhead.
- Composite indexes obey the leftmost prefix rule. An index on (A, B, C) serves queries on A, (A, B), and (A, B, C) — but NOT B alone. Column order is a design decision, not an afterthought.
- Covering indexes include every column the query needs, eliminating heap/table lookups entirely. The fastest query is one that never touches the table.
- Write amplification is the hidden cost. Every index adds a write on every insert/update. Five indexes means roughly 5x write overhead. Index design is always a cost conversation.
- Partial/filtered indexes target only rows matching a condition. Smaller footprint, faster lookups for high-frequency query patterns.
- Index-only scans occur when the index contains all requested data. No table access. Massive performance wins, especially for analytics queries.
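The composite, partial, and leftmost-prefix points above map directly to DDL. A sketch in Postgres syntax; the `orders` table and its columns are hypothetical:

```sql
-- Composite index: serves WHERE user_id = ?  and  WHERE user_id = ? AND created_at > ?,
-- but NOT WHERE created_at > ? alone (leftmost prefix rule).
CREATE INDEX idx_orders_user_created ON orders (user_id, created_at);

-- Partial index: indexes only the rows a hot query pattern actually touches.
-- Smaller footprint, and writes to non-matching rows skip it entirely.
CREATE INDEX idx_orders_pending ON orders (created_at) WHERE status = 'pending';
```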
What Staff Engineers Say (That Seniors Don't)
| Concept | Senior Response | Staff Response |
|---|---|---|
| Index selection | "Add an index on the slow column" | "What's the read/write ratio? Five indexes on a write-heavy table is a cost we need to quantify" |
| B-tree vs LSM | "B-tree is standard" | "B-tree for read-heavy OLTP, LSM for write-heavy ingest — and we need to budget for LSM compaction IO" |
| Composite indexes | "Create an index on those columns" | "Column order follows query selectivity and the leftmost prefix rule; reordering changes which queries benefit" |
| Covering indexes | "It avoids a table lookup" | "It turns a random IO into zero IO. At our query volume that eliminates N million disk seeks per day" |
| Index bloat | "We should rebuild indexes" | "Our index-to-data ratio is 1.4x. Partial indexes on the hot partition cut storage and keep the working set in memory" |
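The index-to-data ratio in the last row is directly measurable. A sketch using Postgres size functions, with `orders` as a hypothetical table:

```sql
-- pg_table_size excludes indexes; pg_indexes_size sums all indexes on the table.
SELECT pg_table_size('orders')   AS table_bytes,
       pg_indexes_size('orders') AS index_bytes,
       round(pg_indexes_size('orders')::numeric
             / nullif(pg_table_size('orders'), 0), 2) AS index_to_data_ratio;
```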
The Numbers That Matter
- B-tree lookup on 1B rows: 3-4 disk reads (tree depth)
- Index entry size: 8-16 bytes typical (key + pointer)
- Full table scan at scale: effectively unusable for user-facing latency
- Each additional index: +1 write per insert/update (plus page splits, WAL entries)
- Covering index benefit: eliminates random IO entirely — the gap between 0 disk reads and 1 disk read is larger than the gap between 1 and 10
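The 3-4 disk reads figure follows from tree depth: with a fanout of a few hundred keys per page, depth grows as log base fanout of N. A quick sanity check (a fanout of 512 is an assumption):

```sql
-- ceil(log_512(1e9)): levels needed to cover 1B rows at 512 keys per page
SELECT ceil(ln(1e9) / ln(512)) AS btree_depth;  -- 4
```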
Common Interview Traps
- "Just add an index" without discussing write amplification. Staff candidates frame indexing as a trade-off between read latency, write throughput, and storage cost.
- Ignoring composite column order. Candidates who treat (A, B) and (B, A) as equivalent reveal a gap in understanding the leftmost prefix rule.
- Forgetting LSM compaction costs. Citing LSM as "faster for writes" without acknowledging compaction IO, space amplification, and read penalty is an incomplete answer.
- Missing the B-tree vs LSM decision boundary. The choice is workload-dependent, not universal. Interviewers probe whether you can articulate when each structure wins and why.
B-Tree vs LSM Decision
- Read-heavy or mixed OLTP: B-tree. Predictable O(log N) point and range lookups, updates in place.
- Write-heavy ingest (logs, events, time series): LSM-tree. Sequential writes win, but budget for compaction IO and read amplification.
- Rule of thumb: start with B-tree; switch to LSM only when write throughput is the proven bottleneck and reads can tolerate the extra work.
Index Design Cheat Sheet
| Query Pattern | Index Design | Why |
|---|---|---|
| `WHERE user_id = ?` | Single-column index on `user_id` | Point lookup, B-tree ideal |
| `WHERE user_id = ? AND created_at > ?` | Composite `(user_id, created_at)` | Equality first, then range |
| `WHERE status = 'active' AND ...` | Partial index `WHERE status = 'active'` | Indexes only matching rows, smaller footprint |
| `SELECT user_id, email FROM users WHERE ...` | Covering index `(filter_col) INCLUDE (user_id, email)` | Index-only scan, zero table access |
| `WHERE text LIKE '%search%'` | Full-text index (GIN in Postgres) | B-tree can't help with leading wildcard |
| `ORDER BY score DESC LIMIT 10` | Index on `score DESC` | Avoids sorting; reads top 10 directly from index |
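The last three rows of the table, written out as Postgres DDL; table and column names are illustrative:

```sql
-- Covering index via INCLUDE (Postgres 11+): non-key payload columns
-- ride along in the leaf pages, enabling index-only scans.
CREATE INDEX idx_users_email_cover ON users (filter_col) INCLUDE (user_id, email);

-- GIN full-text index: handles searches a B-tree can't, such as
-- leading-wildcard or word-level matching.
CREATE INDEX idx_docs_fts ON docs USING gin (to_tsvector('english', body));

-- Descending index: feeds ORDER BY score DESC LIMIT 10 without a sort step.
CREATE INDEX idx_scores_desc ON scores (score DESC);
```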
Write Amplification Math
Each index adds write overhead. For a table with N indexes:
| Indexes | Write Cost per INSERT | Example Table |
|---|---|---|
| 0 | 1 write (heap only) | Append-only log |
| 1 | ~2 writes | Simple lookup table |
| 3 | ~4 writes | Typical OLTP table |
| 5 | ~6 writes | Over-indexed table |
| 10 | ~11 writes | Anti-pattern: audit table with analytics indexes |
Staff rule: Every index proposal must justify the read gain against the write cost. At 5+ indexes on a write-heavy table, the overhead is architectural, not incidental.
Practice Prompt
Staff-Caliber Answer Shape
- Audit all 8 indexes. Use `pg_stat_user_indexes` to check which indexes are actually scanned. Unused indexes cost writes with zero read benefit — drop them.
- Check for redundant indexes. If you have both `(user_id)` and `(user_id, created_at)`, the composite index covers the single-column queries. Drop the single-column index.
- Move the dashboard query off OLTP. `COUNT(*) GROUP BY status` on 500M rows is an analytics query. It should hit a materialized view or a replica — not the primary. Refresh the materialized view every 5 minutes.
- Partition the table. 500M rows with 2M/day growth means indexes are growing too. Range-partition by `created_at` (monthly). Old partitions are cold; new partitions are hot. Index maintenance is per-partition, not per-table.
- Consider partial indexes. If 95% of orders are `status = 'completed'`, a partial index `WHERE status != 'completed'` covers the active-order queries with a fraction of the storage.
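The audit step can start from the statistics views. A sketch against the Postgres system catalogs (the zero-scan threshold assumes stats have been collecting long enough to be representative):

```sql
-- Indexes never scanned since stats were last reset: candidates for dropping,
-- since each one still costs a write on every INSERT/UPDATE.
SELECT schemaname, relname, indexrelname, idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```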
The Staff move: Don't add more indexes. Remove unnecessary ones, move analytics queries elsewhere, and partition to keep index sizes manageable.
Additional Traps
- Creating indexes during peak traffic. `CREATE INDEX` on a large table (without `CONCURRENTLY` in Postgres) blocks writes for the duration of the build — minutes or more on a big table. Always use `CREATE INDEX CONCURRENTLY` in production.
- Ignoring index bloat. B-tree indexes accumulate dead tuples from updates/deletes. Without `REINDEX` or autovacuum tuning, an index can be 2-3x its necessary size, wasting memory and slowing scans.
- Composite index column order by intuition. The correct order is: equality columns first, then range columns, then sort columns. Getting this wrong means the range condition breaks the sort optimization.
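The first two traps in practice, assuming a hypothetical `orders` table on Postgres:

```sql
-- Build without blocking writes. Slower (two table scans), and a failed
-- build leaves an INVALID index behind: check pg_index and drop it.
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id);

-- Rebuild a bloated index without taking an exclusive lock (Postgres 12+).
REINDEX INDEX CONCURRENTLY idx_orders_user_id;
```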