Numbers to Know — Staff Interview Quick Reference
The 60-Second Version
- System design interviews test whether you can size a system without a calculator. Wrong orders of magnitude signal you have not operated production infrastructure.
- Interviewers do not expect exact figures. They expect you to stay within 2x of reality. Being 10x off on latency or throughput raises immediate credibility concerns.
- Memorize order of magnitude, not decimal places. L1 cache is nanoseconds, disk seek is milliseconds, cross-region is tens to hundreds of milliseconds.
- Numbers anchor every capacity plan, every SLA discussion, and every sharding decision. They are not trivia — they are the language of trade-off conversations.
- Staff candidates connect numbers to architectural choices: "At 100K QPS we need horizontal scaling; at 1K QPS a single Postgres instance is fine."
- Round aggressively. Use powers of 10. State your assumptions out loud. This is what interviewers actually evaluate.
Staff vs Senior: How Numbers Change the Conversation
| Number | Senior Engineers Say | Staff Engineers Say |
|---|---|---|
| 99th percentile latency | "We should optimize the p99" | "p99 at 500ms means 1% of our 10M daily users hit this — that's 100K frustrated sessions. Is that acceptable for checkout vs. search?" |
| Throughput (QPS) | "We need to handle 50K QPS" | "50K QPS average means 150-500K peak. A single Postgres instance tops out at 50K reads/s — we need a caching layer, not more replicas" |
| Storage cost | "We'll store everything in S3" | "100TB across 3 replicas with indexes is 500TB actual. At $0.023/GB that's $11.5K/month — do we need 7-year retention or can we tier to Glacier after 90 days?" |
| Network bandwidth | "We have 10Gbps links" | "10Gbps theoretical is ~7Gbps goodput after overhead. Our 5TB/day outbound needs ~460Mbps sustained, under 7% of one link, so even 5x peaks fit; it's a 1Gbps link that peak traffic would saturate" |
| Failure rate | "We target 99.9% availability" | "99.9% = 43 minutes downtime/month. With 3 dependencies each at 99.95%, our composite availability is 99.85% — we need circuit breakers and fallbacks to close the 0.05% gap" |
| Cache hit ratio | "Our cache hit rate is 95%" | "95% hit rate at 100K QPS means 5K cache misses/second hitting the database. If DB handles 10K reads/s, we're at 50% capacity from misses alone — a cache failure doubles DB load instantly" |
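The downtime, composite-availability, and cache-miss figures in the rows above reduce to three one-line formulas. A minimal sketch (the 99.9% target, three 99.95% dependencies, and 95% hit rate are the table's examples, not real services):

```python
# Back-of-envelope checks for the availability and cache rows above.

def downtime_minutes_per_month(availability: float) -> float:
    """Minutes of allowed downtime in a 30-day month."""
    return 30 * 24 * 60 * (1 - availability)

def composite_availability(*dependency_availabilities: float) -> float:
    """Serial dependencies multiply: every one must be up."""
    result = 1.0
    for a in dependency_availabilities:
        result *= a
    return result

def db_miss_load(qps: float, hit_rate: float) -> float:
    """Cache misses per second that fall through to the database."""
    return qps * (1 - hit_rate)

print(round(downtime_minutes_per_month(0.999)))                   # ~43 min/month
print(round(composite_availability(0.9995, 0.9995, 0.9995), 4))   # ~0.9985
print(round(db_miss_load(100_000, 0.95)))                         # ~5000 misses/s
```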
Latency Numbers
| Operation | Latency | Order of Magnitude |
|---|---|---|
| L1 cache reference | 1 ns | nanoseconds |
| L2 cache reference | 4 ns | nanoseconds |
| Main memory reference | 100 ns | nanoseconds |
| SSD random read | 100 us | microseconds |
| SSD sequential read (1 MB) | 1 ms | milliseconds |
| Network round-trip, same AZ | 0.5 ms | milliseconds |
| Network round-trip, same region | 1-2 ms | milliseconds |
| Network round-trip, cross-region | 50-150 ms | tens to hundreds of milliseconds |
| Disk seek (HDD) | 10 ms | milliseconds |
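The table spans roughly eight orders of magnitude, so unit mistakes are the easiest way to blow an estimate. A small sketch that normalizes the table's values to nanoseconds and renders each in its natural unit:

```python
# Latency numbers from the table above, normalized to nanoseconds.
LATENCY_NS = {
    "L1 cache reference": 1,
    "L2 cache reference": 4,
    "Main memory reference": 100,
    "SSD random read": 100_000,
    "Network round-trip, same AZ": 500_000,
    "SSD sequential read (1 MB)": 1_000_000,
    "Disk seek (HDD)": 10_000_000,
    "Network round-trip, cross-region": 100_000_000,
}

def humanize(ns: float) -> str:
    """Render a latency in its most natural unit."""
    if ns < 1_000:
        return f"{ns:g} ns"
    if ns < 1_000_000:
        return f"{ns / 1_000:g} us"
    return f"{ns / 1_000_000:g} ms"

for op, ns in LATENCY_NS.items():
    print(f"{op:35s} {humanize(ns):>8s}")
```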
Throughput Numbers
| System | Throughput | Notes |
|---|---|---|
| Single web server | ~10K req/s | CPU-bound; I/O-bound workloads vary |
| Redis (single thread) | ~100K ops/s | ~500K with pipelining |
| Kafka (per broker) | ~1M msgs/s | Small messages, batched; scales with brokers and partitions |
| Postgres | ~10K writes/s, ~50K reads/s | Assumes tuned config, SSDs |
| MySQL | ~15K writes/s | InnoDB, commodity hardware |
| 1 Gbps network link | ~120 MB/s | Practical ceiling after overhead |
Storage & Scale Numbers
| Calculation | Result | Rule of Thumb |
|---|---|---|
| 1M users x 1 KB each | 1 GB | Fits in RAM on a single machine |
| 1B events/day x 100 bytes | 100 GB/day, ~36 TB/year | Plan for compression + retention policy |
| 500M tweets/day x 300 bytes | 150 GB/day | ~55 TB/year raw, before indexes or replicas |
| Seconds in a day | ~86,400 (~100K) | Use 100K for quick QPS math |
Back-of-Envelope Reasoning
Example 1 — URL shortener write QPS. 100M new URLs/day. 100M / 100K seconds = ~1K writes/s. A single Postgres instance handles this comfortably.
Example 2 — Chat message storage. 1B messages/day, 200 bytes average. 200 GB/day raw. Over one year: ~73 TB. You need sharding and a retention strategy.
Example 3 — Image service bandwidth. 10M images served/day, 500 KB average. 5 TB/day outbound, which is ~58 MB/s sustained, about half a 1 Gbps link on average. At a 3-5x peak factor the link saturates, so serve through a CDN.
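All three examples can be reproduced with the rounded constants from the tables (100K seconds/day for quick QPS math, the actual 86,400 where it matters):

```python
SECONDS_PER_DAY = 100_000     # rounded from 86,400 for quick math

# Example 1: URL shortener write QPS
urls_per_day = 100_000_000
write_qps = urls_per_day / SECONDS_PER_DAY
print(f"{write_qps:,.0f} writes/s")

# Example 2: chat message storage
messages_per_day = 1_000_000_000
bytes_per_message = 200
gb_per_day = messages_per_day * bytes_per_message / 1e9
print(f"{gb_per_day:,.0f} GB/day, {gb_per_day * 365 / 1e3:,.0f} TB/year")

# Example 3: image service bandwidth
tb_out_per_day = 10_000_000 * 500e3 / 1e12   # 10M images x 500 KB
mb_per_s = tb_out_per_day * 1e6 / 86_400     # sustained MB/s, exact seconds
print(f"{tb_out_per_day:.0f} TB/day, ~{mb_per_s:.0f} MB/s sustained")
```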
Common Interview Traps
- Confusing latency units. Mixing up microseconds and milliseconds changes your architecture. SSD random read is 100 us, not 100 ms.
- Ignoring replication and indexing overhead. Raw data size is never the full storage cost. Multiply by 3x for replicas, add 30-50% for indexes.
- Forgetting to convert units consistently. Always normalize to the same time horizon (per second, per day, per year) before comparing.
- Over-precision. Saying "we need 11,574 QPS" instead of "roughly 12K QPS" signals inexperience with real estimation.
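The replication-and-index trap is worth scripting once, since it is the most common factor-of-five miss. A sketch using the 3x-replica, 30-50% index rule from the bullet above:

```python
def actual_storage_tb(raw_tb: float, replicas: int = 3,
                      index_overhead: float = 0.5) -> float:
    """Raw data -> provisioned storage: replicas x raw, plus index overhead.
    Defaults follow the 3x / 30-50% rule of thumb above (upper bound)."""
    return raw_tb * replicas * (1 + index_overhead)

print(actual_storage_tb(100))   # 450.0 -> "100 TB raw is really ~500 TB"
```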
Quick Conversion Table
| From | To | Rule |
|---|---|---|
| Daily volume → QPS | ÷ 86,400 (use 100K) | 1M/day ≈ 10 QPS |
| QPS → daily volume | × 86,400 (use 100K) | 100 QPS ≈ 10M/day |
| GB/day → MB/s | ÷ 86,400 × 1,000 | 100 GB/day ≈ 1.2 MB/s |
| Users → concurrent | × 0.01 to 0.10 | 10M users → 100K-1M concurrent |
| Monthly active → daily active | × 0.30 to 0.50 | 100M MAU → 30-50M DAU |
| Peak → average | × 3 to 10 | Average 1K QPS → peak 3K-10K |
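The rules in the table translate directly into small helper functions. A sketch (the default 5x peak factor is an illustrative pick from the table's 3-10x range):

```python
SECONDS_PER_DAY = 86_400   # round to 100_000 when doing mental math

def daily_to_qps(per_day: float) -> float:
    return per_day / SECONDS_PER_DAY

def qps_to_daily(qps: float) -> float:
    return qps * SECONDS_PER_DAY

def gb_per_day_to_mb_per_s(gb_per_day: float) -> float:
    return gb_per_day * 1_000 / SECONDS_PER_DAY

def peak_qps(avg_qps: float, factor: float = 5) -> float:
    """Peak is typically 3-10x average; pick a factor for your workload."""
    return avg_qps * factor

print(round(daily_to_qps(1_000_000)))          # ~12 (≈10 with 100K math)
print(round(gb_per_day_to_mb_per_s(100), 1))   # ~1.2
```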
Practice Prompt
Estimate the read QPS and outbound bandwidth for a social feed: 500M daily active users, 8 feed loads per user per day, 20 posts per load, ~2 KB per post. Can a single database handle it?
Staff-Caliber Answer Shape
- Total feed loads/day: 500M × 8 = 4B feed loads
- Total post reads/day: 4B × 20 = 80B post reads
- QPS: 80B / 100K seconds ≈ 800K QPS (peak: 2-3M QPS)
- Bandwidth: 80B × 2 KB = 160 TB/day ≈ 1.8 GB/s sustained
- Can a single DB handle it? No. Postgres handles ~50K reads/s. We need at least 16 read replicas for average load and 40+ for peak. This is a caching problem — a 95% cache hit rate reduces DB load to 40K QPS, within single-instance range.
The Staff move: Don't just compute the number. Follow through to the architectural implication: this volume demands a caching layer, not just database scaling.
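The answer shape above is executable arithmetic. A sketch with the same assumed inputs (500M DAU, 8 loads/day, 20 posts/load, 2 KB/post, 95% cache hit rate):

```python
# The practice estimate above, step by step.
dau = 500_000_000
feed_loads_per_user = 8
posts_per_load = 20
kb_per_post = 2
seconds_per_day = 100_000          # rounded for quick math

post_reads_per_day = dau * feed_loads_per_user * posts_per_load   # 80B
avg_qps = post_reads_per_day / seconds_per_day                    # 800K
db_qps_at_95_hit = avg_qps * 0.05                                 # 40K
tb_per_day = post_reads_per_day * kb_per_post / 1e9               # 160 TB

print(f"{post_reads_per_day / 1e9:.0f}B reads/day, {avg_qps / 1e3:.0f}K QPS")
print(f"misses to DB at 95% hit: {db_qps_at_95_hit / 1e3:.0f}K QPS")
print(f"{tb_per_day:.0f} TB/day outbound")
```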
Common Scale Anchors
Use these as sanity checks when estimating:
| System | Known Scale | Useful As |
|---|---|---|
| Twitter/X | ~500M tweets/day | High-write social benchmark |
| Google Search | ~8.5B queries/day (~100K QPS) | Read-heavy search benchmark |
| Uber | ~20M rides/day, 5M driver location updates/second | Real-time location at scale |
| Stripe | ~millions of transactions/day | Payment processing benchmark |
| WhatsApp | ~100B messages/day | Messaging throughput ceiling |
| YouTube | ~500 hours uploaded/minute, ~1B hours watched/day | Media storage + bandwidth |
Additional Traps
- Forgetting peak-to-average ratio. Average QPS is useless for capacity planning. You provision for peak, which is 3-10x average depending on the workload.
- Treating storage as free. "We'll just store everything" ignores that 100 TB of hot data across 3 replicas with indexes is 500+ TB of actual storage cost.
- Ignoring write amplification. One user action (post a tweet) can generate 10+ writes: the tweet itself, timeline fan-out, index updates, notification triggers, analytics events.
- Confusing network throughput with goodput. Protocol overhead, retransmissions, and encryption reduce usable throughput to ~70% of theoretical maximum.
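The goodput trap in the last bullet is one line of arithmetic. A sketch using the ~70% efficiency rule of thumb stated above (stricter than the ~120 MB/s wire-level figure in the throughput table, because goodput also pays for retransmissions and encryption):

```python
def goodput_mb_per_s(link_gbps: float, efficiency: float = 0.7) -> float:
    """Usable MB/s after protocol overhead, retransmits, and encryption.
    The 0.7 default is the ~70% rule of thumb from the bullet above."""
    return link_gbps * 1_000 / 8 * efficiency

print(round(goodput_mb_per_s(1)))    # ~88 MB/s usable on a 1 Gbps link
print(round(goodput_mb_per_s(10)))   # ~875 MB/s, i.e. ~7 Gbps on a 10 Gbps link
```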