StaffSignal
Foundation — Quick Reference

Data Modeling

Normalization vs. denormalization is a read/write cost tradeoff. Schema evolution is migration risk. The Staff question is "what's the query pattern?", not "what's the data shape?"

Data Modeling — Staff Interview Quick Reference

The 60-Second Version

  • The Staff question is "What's the query pattern?" not "What's the data shape?" Model for how you read, not how the entity looks in a whiteboard diagram.
  • Normalization eliminates redundancy and enforces integrity. Good for write-heavy OLTP workloads where correctness matters. The cost is joins on reads.
  • Denormalization pre-joins data for fast reads. Good for read-heavy and analytics workloads. The cost is write amplification and stale data risk.
  • Break 3NF when read latency matters more than write correctness. Be explicit about who pays: engineering maintains consistency, and product explains staleness to users.
  • Document model (MongoDB, DynamoDB) works when "what you read is what you wrote" — self-contained documents fetched in a single call. It breaks down when relationships span documents.
  • Relational model works when relationships are the point. Foreign keys enforce integrity at the storage layer. Joins are cheap when indexed properly.
  • Embed when data is always read together. Reference when data changes independently. Getting this wrong means either read amplification (too many references) or update amplification (too much embedding).
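The embed-vs-reference tradeoff above can be sketched in a few lines of Python. The data is hypothetical (a restaurant name copied into order documents); the point is that embedding a child with an independent lifecycle turns one logical change into N writes, while referencing turns it into one write plus a lookup at read time.

```python
# Embedded: each order carries a copy of the restaurant's name.
orders_embedded = [
    {"id": i, "restaurant": {"id": 7, "name": "Luigi's"}} for i in range(50)
]

# Referenced: orders store only the restaurant id; the name lives once.
restaurants = {7: {"name": "Luigi's"}}
orders_referenced = [{"id": i, "restaurant_id": 7} for i in range(50)]

def rename_embedded(orders, restaurant_id, new_name):
    """Update amplification: one logical change fans out to every parent."""
    writes = 0
    for order in orders:
        if order["restaurant"]["id"] == restaurant_id:
            order["restaurant"]["name"] = new_name
            writes += 1
    return writes

def rename_referenced(restaurants, restaurant_id, new_name):
    """One write; readers pay a lookup (the 'join') instead."""
    restaurants[restaurant_id]["name"] = new_name
    return 1

assert rename_embedded(orders_embedded, 7, "Luigi's Trattoria") == 50
assert rename_referenced(restaurants, 7, "Luigi's Trattoria") == 1
```

The asymmetry is the whole argument: embedding moves cost to the write path, referencing moves it to the read path.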

What Staff Engineers Say (That Seniors Don't)

| Concept | Senior Response | Staff Response |
| --- | --- | --- |
| Schema choice | "Use relational for structured data, NoSQL for unstructured" | "Start from the top-three query patterns and work backward. The access pattern picks the schema, not the data shape." |
| Denormalization | "Denormalize for performance" | "Every denormalized field is a consistency contract you maintain in application code. Quantify the read gain against the write amplification before committing." |
| Document vs. relational | "Documents are more flexible" | "Documents trade query-time flexibility for write-time simplicity. Once you need cross-document joins, you've rebuilt a relational engine badly." |
| Schema evolution | "Just add a column" | "Adding a column is cheap. Renaming is a migration. Removing requires backfill and a deprecation window. Every schema change is a deployment risk with a blast radius." |
| Embedding vs. referencing | "Embed for speed" | "Embed when the child entity has no independent lifecycle. If it changes on a different cadence or is accessed alone, reference it — otherwise updates fan out to every parent." |

The Numbers That Matter

  • Row size and read cost: a 1 KB row scanned from an index is ~0.1 ms; a 100 KB denormalized row with embedded arrays is still one read but balloons cache and network cost.
  • Join cost with proper indexes: two-table indexed join on 10M rows is typically <5 ms. Without indexes, it becomes a full table scan at seconds-scale.
  • Write amplification from denormalization: updating one normalized source row versus updating N denormalized copies. At N=50 fan-out, you have turned one write into a consistency problem.
  • Document size limits: MongoDB caps documents at 16 MB; DynamoDB items at 400 KB. Hitting these limits forces schema redesign under production pressure.
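The indexed-vs-unindexed join cost above is easy to see at the query-plan level. A minimal sketch using Python's built-in sqlite3 (table and column names are illustrative; absolute timings will differ from the figures above, but the scan-vs-seek distinction is the same in any relational engine):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
""")

def join_plan():
    # EXPLAIN QUERY PLAN returns one row per step; the last column
    # is the human-readable detail ("SCAN ..." or "SEARCH ... USING ...").
    rows = db.execute("""
        EXPLAIN QUERY PLAN
        SELECT o.id, u.name FROM orders o JOIN users u ON u.id = o.user_id
        WHERE o.user_id = 42
    """).fetchall()
    return " | ".join(r[-1] for r in rows)

before = join_plan()   # without an index: full scan of orders
db.execute("CREATE INDEX idx_orders_user ON orders(user_id)")
after = join_plan()    # with the index: a seek on orders(user_id)

print(before)
print(after)
```

On 10M rows, that plan difference is exactly the <5 ms vs. seconds-scale gap quoted above.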

Common Interview Traps

  • Normalizing everything by default. Candidates who insist on 3NF without acknowledging the read path signal a lack of production experience. State the trade-off explicitly.
  • Denormalizing without quantifying the cost. Saying "denormalize for speed" without discussing write amplification, staleness windows, or consistency mechanisms is a senior-level answer.
  • Choosing document model because it is "simpler." Documents defer complexity to the read layer. If the interviewer's scenario involves cross-entity queries, pivot to relational and explain why.
  • Ignoring schema evolution. Production systems live for years. Mention backward-compatible changes, dual-write migration strategies, and the operational cost of schema changes.

Model Selection Decision Tree

Start from the dominant query pattern:

  • Cross-entity queries and integrity constraints → relational
  • Self-contained "read what you wrote" fetches → document
  • High-frequency, short-lived state (locations, status) → KV/cache
  • Aggregations over history (revenue, distributions) → columnar/OLAP

Embed vs. Reference Decision Matrix

| Factor | Embed | Reference |
| --- | --- | --- |
| Read together always? | Yes → embed | No → reference |
| Child has independent lifecycle? | No → embed | Yes → reference |
| Child data changes frequently? | No → embed | Yes → reference (avoid update fan-out) |
| 1:1 or 1:few relationship? | Embed (bounded size) | N/A |
| 1:many or many:many? | Reference (unbounded) | Reference |
| Document size risk? | <100 KB total → embed | Risk of exceeding limits → reference |
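The matrix above can be codified as a checklist function. This is a hedged sketch, not a library API: the parameter names are mine, and the ordering encodes one opinion — any single "reference" signal wins, because update fan-out and document size limits are the failure modes you cannot patch after the fact.

```python
def embed_or_reference(read_together: bool,
                       independent_lifecycle: bool,
                       changes_frequently: bool,
                       bounded_cardinality: bool,
                       fits_under_100kb: bool) -> str:
    if independent_lifecycle or changes_frequently:
        return "reference"   # child changes on its own cadence
    if not bounded_cardinality or not fits_under_100kb:
        return "reference"   # unbounded growth risks document size limits
    if read_together:
        return "embed"       # always fetched with the parent
    return "reference"

# Order line items: read with the order, no independent life, bounded.
assert embed_or_reference(True, False, False, True, True) == "embed"
# A user's order history: unbounded growth — reference it.
assert embed_or_reference(True, False, False, False, True) == "reference"
```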

Schema Evolution Playbook

| Change Type | Risk Level | Strategy |
| --- | --- | --- |
| Add nullable column | Low | Deploy code that handles null → add column → backfill if needed |
| Add required column | Medium | Add as nullable → backfill → add NOT NULL constraint → update code |
| Rename column | High | Add new column → dual-write old+new → migrate reads → drop old |
| Change column type | High | Add shadow column with new type → dual-write → migrate → drop old |
| Remove column | Medium | Stop reading → stop writing → drop column (wait for all deploys to roll out) |
| Split table | Very High | Dual-write → backfill → migrate reads → drop old table (weeks-long project) |
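The rename-column row above is the classic expand/contract sequence. A compressed sketch using SQLite (names are illustrative; in production each step is a separate deploy, with the old column still readable until the final contract step, and `DROP COLUMN` requires SQLite ≥ 3.35):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
db.execute("INSERT INTO users VALUES (1, 'Ada Lovelace')")

# Step 1: expand — add the new column (cheap, backward compatible).
db.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Step 2: dual-write window — application code writes both columns.
db.execute("""UPDATE users SET fullname = 'Ada L.', display_name = 'Ada L.'
              WHERE id = 1""")

# Step 3: backfill rows written before the dual-write deploy.
db.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

# Step 4: migrate reads to display_name, then (a later deploy) contract.
db.execute("ALTER TABLE users DROP COLUMN fullname")

row = db.execute("SELECT display_name FROM users WHERE id = 1").fetchone()
assert row == ("Ada L.",)
```

The blast radius lives in the gaps between steps: any deploy still reading `fullname` after step 4 breaks, which is why the drop waits for every reader to roll out.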

Practice Prompt

Staff-Caliber Answer Shape
  1. Start from the three query patterns:

    • Order creation: User → restaurant → menu items → order (transactional, relational)
    • Real-time tracking: Order status + driver location (high-frequency updates, KV/cache)
    • Analytics: Order history by date, revenue by restaurant (OLAP, columnar)
  2. Relational for the core domain. Users, restaurants, menu_items, orders in PostgreSQL. Foreign keys enforce integrity. Indexes on (user_id, created_at) and (restaurant_id, status) for the hot read paths.

  3. Denormalize the order view. An order_summary table (or materialized view) pre-joins order + restaurant + items. This eliminates the 3-table join on the order tracking page — the most latency-sensitive read.

  4. KV store for real-time state. Driver location and order status in Redis. Updated every 3 seconds. TTL of 24 hours (orders complete within hours). This keeps the real-time path off the relational DB.

  5. CDC to analytics. Stream order events to a columnar store (ClickHouse or BigQuery) for business analytics. Revenue reports, delivery time distributions, restaurant performance — all served from the analytical store, not the OLTP database.

The Staff move: Three access patterns → three storage strategies, connected by CDC. Don't force one database to serve all patterns.
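Point 3 of the answer shape — the denormalized `order_summary` — is only safe if the copy cannot drift on the happy path. A minimal sketch using SQLite, with a hypothetical schema: the normalized insert and the denormalized copy go in one transaction, so readers of the summary never see a half-written order.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE restaurants (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, restaurant_id INTEGER);
    CREATE TABLE order_summary (          -- pre-joined read model
        order_id INTEGER PRIMARY KEY,
        restaurant_name TEXT              -- denormalized copy
    );
    INSERT INTO restaurants VALUES (7, 'Luigi''s');
""")

def create_order(order_id, restaurant_id):
    # One transaction: the normalized write and the denormalized copy
    # commit together or not at all.
    with db:
        db.execute("INSERT INTO orders VALUES (?, ?)",
                   (order_id, restaurant_id))
        db.execute("""INSERT INTO order_summary
                      SELECT ?, name FROM restaurants WHERE id = ?""",
                   (order_id, restaurant_id))

create_order(1, 7)
# The tracking page now reads one row — no 3-table join.
name, = db.execute(
    "SELECT restaurant_name FROM order_summary WHERE order_id = 1").fetchone()
assert name == "Luigi's"
```

The remaining consistency contract is the restaurant rename: either accept staleness in old summaries (often fine for a tracking page) or fan the update out, which is exactly the write amplification cost called out earlier.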

Additional Traps

  • Over-normalizing from the start. If you need 5 JOINs to render the most common page, you've optimized for write correctness at the expense of read performance. Staff engineers ask "what's the read cost of this normalization?"
  • Embedding unbounded arrays. A user's order history embedded in the user document grows forever. This is a ticking time bomb for document size limits and read performance.
  • Ignoring soft deletes. DELETE FROM orders WHERE id = X is irreversible. Staff engineers default to soft deletes (deleted_at timestamp) for any business-critical entity and define a hard-delete retention policy.
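The soft-delete default in the last bullet is a one-line schema change plus a filter on every read path. A sketch in SQLite (names are illustrative; the retention purge at the end is shown as a comment, not run):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        status TEXT,
        deleted_at TEXT              -- NULL means live
    )
""")
db.executemany("INSERT INTO orders VALUES (?, ?, NULL)",
               [(1, "delivered"), (2, "cancelled")])

# "Delete" is an update; the row survives for audit and recovery.
db.execute("UPDATE orders SET deleted_at = datetime('now') WHERE id = 2")

# Every read path filters deleted rows (a partial index or view helps).
live = db.execute("SELECT id FROM orders WHERE deleted_at IS NULL").fetchall()
assert live == [(1,)]

# The hard-delete retention policy runs as a scheduled job, e.g.:
# DELETE FROM orders WHERE deleted_at < datetime('now', '-90 days')
```

The trap inside the trap: once you soft-delete, every query, unique constraint, and join must account for `deleted_at`, which is why the retention policy belongs in the same design decision.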

Where This Appears