StaffSignal

Design a Notification System

Staff-Level Playbook

Technologies referenced in this playbook: Apache Kafka · Elasticsearch

How to Use This Playbook

This playbook supports three reading modes:

| Mode | Time | What to Read |
| --- | --- | --- |
| Quick Review | 15 min | Executive Summary → Interview Walkthrough → Fault Lines (§3) → Drills (§7) |
| Targeted Study | 1-2 hrs | Executive Summary → Interview Walkthrough → Core Flow, expand appendices where you're weak |
| Deep Dive | 3+ hrs | Everything, including all appendices |

What are Notification & Delivery Systems? — Why interviewers pick this topic

The Problem

Notification systems deliver time-sensitive information to users across multiple channels — push notifications, email, SMS, in-app messages, webhooks. The challenge isn't sending a message; it's building a system that sends the right message to the right user at the right time through the right channel — without annoying them, losing messages, or sending duplicates.

Common Use Cases

  • Transactional: Order confirmations, password resets, payment receipts (must deliver, user expects them)
  • Engagement: "Someone liked your post," "New message from Alice" (important but deferrable)
  • Marketing: Promotions, re-engagement campaigns (lowest priority, highest volume)
  • Operational: Alert escalations, incident notifications, deploy status (internal, often urgent)
  • Webhook delivery: Third-party integrations requiring reliable HTTP callbacks

Why Interviewers Ask About This

Notification systems expose the Staff-level skill of managing competing priorities across organizational boundaries. Product wants more engagement notifications. Users want fewer notifications. Infrastructure wants predictable load. Compliance wants audit trails. The Staff question isn't "how do you send a push notification" — it's "who decides what gets sent, what gets suppressed, and what happens when the delivery pipeline falls behind?"

Mechanics Refresher: Delivery Channel Characteristics — Push, email, SMS, webhook reliability profiles
Who Pays Analysis
| Channel | Delivery Guarantee | Latency | Cost per Message | Failure Mode |
| --- | --- | --- | --- | --- |
| Push (APNs/FCM) | Best-effort (device must be online) | 100ms-5s | ~$0 | Token expiry, device offline, OS throttling |
| Email (SMTP/SES) | Best-effort (spam filters, bounce) | Seconds to minutes | $0.10/1000 | Spam classification, bounce, reputation damage |
| SMS (Twilio/SNS) | High (carrier delivery) | 1-10s | $0.01-0.05 each | Carrier filtering, number portability, cost explosion |
| In-app | Guaranteed (if user opens app) | Instant | $0 | User never opens app |
| Webhook | At-least-once (with retry) | 100ms-30s | $0 | Endpoint down, timeout, backpressure |

Why this matters for fault lines: Each channel has different reliability, cost, and user experience characteristics. A notification system that treats all channels the same will either over-deliver (user annoyance, cost) or under-deliver (missed critical messages). Staff engineers design channel routing as a first-class concern.
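To make the cost column concrete, here is a rough sketch that estimates spend per channel for a given volume. The per-message figures come from the table above; the function name and the one-million-message volume in the usage note are illustrative assumptions, and infrastructure cost is ignored.

```python
# Per-message cost ranges in USD, taken from the channel table above.
# Push, in-app, and webhook are listed as roughly $0 there.
COST_PER_MESSAGE = {
    "push": (0.0, 0.0),
    "email": (0.10 / 1000, 0.10 / 1000),  # $0.10 per 1000 messages
    "sms": (0.01, 0.05),
    "in_app": (0.0, 0.0),
    "webhook": (0.0, 0.0),
}


def estimate_cost(channel: str, messages: int) -> tuple:
    """Return the (low, high) estimated spend in USD for sending `messages`."""
    low, high = COST_PER_MESSAGE[channel]
    return (round(low * messages, 2), round(high * messages, 2))
```

Under these figures, `estimate_cost("email", 1_000_000)` comes to about $100, while `estimate_cost("sms", 1_000_000)` lands between $10,000 and $50,000, which is the "cost explosion" failure mode in the SMS row.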

What This Interview Actually Tests

The notification interview is not a "send a push notification" question. Anyone can call the FCM API.

This is a delivery reliability and user experience ownership question that tests:

  • Whether you separate notification generation from delivery routing (and who owns each)
  • Whether you design suppression and deduplication as first-class features, not afterthoughts
  • Whether you reason about notification fatigue as a product metric, not just a UX complaint
  • Whether you can own the operational cost of multi-channel delivery at scale

The key insight: The hardest problem in notifications isn't delivery — it's deciding what not to send. Staff engineers design systems where suppression logic is as well-designed as delivery logic.

The L5 vs L6 Contrast (Memorize This)

Level Calibration
| Behavior | L5 (Senior) | L6 (Staff) |
| --- | --- | --- |
| First move | "We'll use a message queue and push API" | Asks "What are the notification categories? What are the delivery SLAs per category?" |
| Architecture | Event → queue → send | Event → priority classification → suppression → channel routing → delivery → tracking |
| Deduplication | "We'll use idempotency keys" | Designs suppression: rate limits per user per channel, quiet hours, preference-aware routing |
| Failure | "Retry with backoff" | Asks "What's the retry budget? Dead-letter queue? Who gets paged when delivery rate drops below 95%?" |
| Ownership | Focuses on the delivery pipeline | Asks "Who owns notification content? Who owns suppression rules? Who measures notification fatigue?" |
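The Failure row asks about a retry budget and a dead-letter queue. One way to picture both is the sketch below: exponential backoff with jitter, bounded by an attempt limit and a total wait budget, with exhausted messages parked in a dead-letter list. The function name, the constants, and the in-memory list standing in for a dead-letter queue are all assumptions for illustration.

```python
import random
import time
from typing import Callable, Optional

BASE_DELAY_S = 1.0   # assumed ceiling for the first backoff wait
MAX_DELAY_S = 30.0   # assumed cap on any single wait


def deliver_with_budget(
    send: Callable[[dict], bool],
    payload: dict,
    dead_letters: list,
    max_attempts: int = 4,
    budget_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> bool:
    """Call `send` until it succeeds, attempts run out, or the wait budget is spent.

    On exhaustion the payload is appended to `dead_letters` rather than dropped.
    """
    rng = rng or random.Random()
    attempts = 0
    waited = 0.0
    while attempts < max_attempts:
        attempts += 1
        if send(payload):
            return True
        if attempts == max_attempts:
            break  # no point waiting after the final attempt
        # Full jitter: wait a random time up to the exponential ceiling.
        ceiling = min(MAX_DELAY_S, BASE_DELAY_S * 2 ** (attempts - 1))
        delay = rng.uniform(0, ceiling)
        if waited + delay > budget_seconds:
            break  # the next wait would exceed the retry budget
        sleep(delay)
        waited += delay
    dead_letters.append({"payload": payload, "attempts": attempts})
    return False
```

The `sleep` parameter is injectable so the logic can be exercised without real waiting. In a design like this, the size and growth rate of the dead-letter store is a natural signal to alert on, which gives a concrete answer to "who gets paged."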

The Three Intents (Pick One and Commit)

Who Pays Analysis
| Intent | Constraint | Strategy | Delivery SLA |
| --- | --- | --- | --- |
| Transactional Delivery | Must deliver, user expects it | Retry aggressively, multi-channel fallback | 99.9% within 60 seconds |
| Engagement Optimization | Should deliver, but not annoy | Smart routing, suppression, batching | 95% within 5 minutes |
| Scale Broadcast | Must reach millions, cost matters | Batching, prioritization, degradation strategy | 90% within 30 minutes |

🎯 Staff Insight: "I'll assume we're building a notification platform that handles both transactional and engagement notifications. Transactional messages (order confirmations, password resets) have a 99.9% delivery SLA within 60 seconds. Engagement notifications (social activity, recommendations) are subject to user preferences, suppression rules, and batching. I'll focus on the architecture that separates these concerns."

The Five Fault Lines (The Core of This Interview)

  1. Delivery Guarantee vs User Experience — Retrying aggressively ensures delivery but risks duplicates and annoyance. How do you balance reliability with user happiness?

  2. Channel Routing — Push, email, SMS, in-app — each has different reliability, cost, and user impact. Who decides which channel, and what's the fallback chain?

  3. Suppression & Rate Limiting — The most important feature users never see. How do you prevent notification fatigue without losing critical messages?

  4. Priority & Ordering — Password resets must beat marketing emails in the queue. How do you prioritize without starving lower-priority notifications?

  5. Delivery Tracking & Observability — How do you know a notification was actually received (not just sent)? Who owns delivery metrics?

Each fault line has a tradeoff matrix with explicit "who pays" analysis. See §3.
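For fault line 4, one common way to prioritize without starving low-priority traffic is aging: an item's effective priority improves the longer it waits. The sketch below is a minimal illustration of that idea, not a prescribed design; the class name, the numeric priorities, and the aging rate are assumptions.

```python
import heapq
import itertools

# Lower number = more urgent. Category names follow the use cases above;
# the numeric values and the aging rate are illustrative assumptions.
BASE_PRIORITY = {"transactional": 0, "engagement": 10, "marketing": 20}
AGING_PER_SECOND = 0.1  # priority points an item gains per second of waiting


class AgingPriorityQueue:
    """Priority queue ordered by base_priority + AGING_PER_SECOND * enqueue_time.

    Effective priority at time t is base - rate * (t - enqueued). The "- rate * t"
    term is identical for every item at comparison time, so sorting by
    base + rate * enqueued gives the same order with a static heap key.
    """

    def __init__(self) -> None:
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order for equal keys

    def push(self, notification_id: str, category: str, now: float) -> None:
        key = BASE_PRIORITY[category] + AGING_PER_SECOND * now
        heapq.heappush(self._heap, (key, next(self._seq), notification_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

With these numbers, a marketing message enqueued at t=0 still loses to a password reset enqueued at t=100, but it wins against one enqueued at t=300, so its wait behind fresh transactional traffic is bounded at roughly 200 seconds.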

Quick Reference: What Interviewers Probe

| After You Say... | They Will Ask... |
| --- | --- |
| "Push notification via FCM" | "What if the device is offline? What's the fallback? How long do you retry?" |
| "Message queue for async" | "How do you ensure a password reset email arrives in 30 seconds, not 5 minutes behind a marketing batch?" |
| "Retry with exponential backoff" | "What's your retry budget? What happens after exhausting retries? Who gets paged?" |
| "User preferences table" | "How do you handle 'do not disturb' across time zones? Who owns the suppression rules?" |
| "We'll batch notifications" | "What if two events happen 1 second apart? Does the user get two notifications or one?" |

Jump to Practice

Active Drills (§7) — 8 practice prompts with expected answer shapes

System Architecture Overview


Interview Walkthrough

Phase 1: Requirements & Framing (30 seconds)

  • "Notification systems deliver messages across multiple channels — push (APNs/FCM), email, SMS, in-app — based on user preferences, urgency, and content type. The hard part isn't sending notifications; it's deciding WHAT to send, to WHOM, through WHICH channel, and WHEN."

Phase 2: Core Entities (30 seconds)

  • NotificationEvent: the trigger — what happened (order shipped, friend request, price drop)
  • NotificationPreference: per-user, per-category channel preferences (push for messages, email for receipts, no SMS)
  • DeliveryChannel: the transport — APNs, FCM, SendGrid, Twilio, in-app websocket
  • NotificationLog: immutable record of what was sent, when, through which channel, and the delivery status
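The four entities above could be modeled roughly as follows. This is a sketch under assumptions: the field names, the two enums, and the `allowed` helper are hypothetical and only illustrate the shape of each entity.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Protocol


class Channel(Enum):
    PUSH = "push"
    EMAIL = "email"
    SMS = "sms"
    IN_APP = "in_app"


class DeliveryStatus(Enum):
    SENT = "sent"
    DELIVERED = "delivered"
    READ = "read"
    FAILED = "failed"


@dataclass(frozen=True)
class NotificationEvent:
    """The trigger: what happened, and to whom."""
    event_id: str
    user_id: str
    category: str  # e.g. "order_shipped", "friend_request", "price_drop"
    payload: dict
    occurred_at: datetime


@dataclass
class NotificationPreference:
    """Per-user, per-category channel preferences."""
    user_id: str
    # category -> channels the user allows, in order of preference
    channels_by_category: dict = field(default_factory=dict)

    def allowed(self, category: str) -> list:
        return self.channels_by_category.get(category, [])


class DeliveryChannel(Protocol):
    """The transport: an APNs, FCM, SendGrid, Twilio, or WebSocket adapter."""
    def send(self, user_id: str, payload: dict) -> bool: ...


@dataclass(frozen=True)
class NotificationLog:
    """Immutable record of one delivery attempt and its status."""
    event_id: str
    user_id: str
    channel: Channel
    status: DeliveryStatus
    recorded_at: datetime
```

Marking `NotificationLog` as frozen mirrors the "immutable record" wording: a status change becomes a new log row rather than an update in place.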

Phase 3: The 2-Minute Architecture (2 minutes)
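As a rough sketch of the pipeline this playbook describes elsewhere (event → priority classification → suppression → channel routing → delivery → tracking), the stages can be chained as small functions. Everything named here is an assumption for illustration: the `Notification` shape, the category priorities, the daily cap of 5, and the rule that transactional messages bypass the cap.

```python
from dataclasses import dataclass
from typing import Optional

PRIORITY = {"transactional": 0, "engagement": 1, "marketing": 2}  # assumed categories
DAILY_CAP = 5  # assumed per-user cap for non-transactional traffic


@dataclass
class Notification:
    user_id: str
    category: str
    body: str
    priority: int = 0
    channel: Optional[str] = None


def classify(n: Notification) -> Notification:
    n.priority = PRIORITY.get(n.category, 2)  # unknown categories get lowest priority
    return n


def is_suppressed(n: Notification, sent_today: dict) -> bool:
    # Assumption: transactional messages bypass the cap; everything else is capped.
    return n.priority > 0 and sent_today.get(n.user_id, 0) >= DAILY_CAP


def route(n: Notification, preferences: dict) -> Notification:
    # preferences maps (user_id, category) to an ordered channel list; in-app is the default.
    n.channel = (preferences.get((n.user_id, n.category)) or ["in_app"])[0]
    return n


def process(n: Notification, sent_today: dict, preferences: dict,
            outbox: list, log: list) -> bool:
    """Run one notification through every stage; return True if it was sent."""
    n = classify(n)
    if is_suppressed(n, sent_today):
        log.append({"user_id": n.user_id, "category": n.category, "status": "suppressed"})
        return False
    n = route(n, preferences)
    outbox.append((n.channel, n.user_id, n.body))  # stand-in for a channel worker
    sent_today[n.user_id] = sent_today.get(n.user_id, 0) + 1
    log.append({"user_id": n.user_id, "category": n.category,
                "channel": n.channel, "status": "sent"})
    return True
```

Logging suppressed notifications alongside sent ones keeps the "what did we decide not to send" question answerable from the same tracking data.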

Phase 4: Transition to Depth (15 seconds)

"The pipeline is straightforward. The hard problems are: notification fatigue (sending too many), delivery reliability across channels, and real-time vs batched delivery."

Phase 5: Deep Dives (5-15 minutes if probed)

Probe 1: "How do you prevent notification fatigue?" (3-5 min)

"The biggest failure mode of a notification system isn't failed delivery — it's successful delivery of too many notifications. Users disable notifications, and you permanently lose the channel."

Walk through the mitigation:

  1. Per-user daily budget: Maximum N notifications per day per channel. Priority-based: if the budget is 5 and 6 notifications are pending, drop the lowest-priority one.
  2. Notification scoring: Each notification gets a relevance score based on: user's historical engagement with this category, recency of similar notifications, content relevance. Low-scoring notifications are suppressed.
  3. Batching/digest: Instead of 15 individual "someone liked your post" notifications, send one: "15 people liked your post." The batch window is time-based (for example, every 30 minutes) or count-based (for example, every 5 similar events).
  4. Smart timing: Don't send notifications at 3 AM. Predict the user's active hours from historical app usage and queue notifications for their next active window.
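Three of these mitigations (the daily budget, the digest, and timing) can be sketched in a few lines each. This is an illustration under assumptions: the dict shapes and function names are hypothetical, and a fixed 22:00 to 08:00 quiet window stands in for the predicted active hours described above. The time-zone argument is any `tzinfo`; `zoneinfo.ZoneInfo` would supply real IANA zones.

```python
from collections import defaultdict
from datetime import datetime, time, timedelta, timezone, tzinfo


def apply_daily_budget(pending: list, budget: int) -> tuple:
    """Keep the `budget` most important notifications; return (kept, dropped).

    Lower `priority` number means more important. sorted() is stable, so
    notifications with equal priority keep their arrival order.
    """
    ranked = sorted(pending, key=lambda n: n["priority"])
    return ranked[:budget], ranked[budget:]


def digest(events: list) -> list:
    """Collapse events with the same type and target into one message each."""
    groups = defaultdict(list)
    for e in events:
        groups[(e["type"], e["target"])].append(e["actor"])
    messages = []
    for (etype, target), actors in groups.items():
        if len(actors) == 1:
            messages.append(f"{actors[0]} {etype} your {target}")
        else:
            messages.append(f"{len(actors)} people {etype} your {target}")
    return messages


def next_send_time(now_utc: datetime, user_tz: tzinfo,
                   quiet_start: time = time(22, 0),
                   quiet_end: time = time(8, 0)) -> datetime:
    """Return `now_utc` if the user is outside quiet hours, else the window's end in UTC.

    Assumes the quiet window wraps midnight (quiet_start is later than quiet_end)
    and that `now_utc` is timezone-aware.
    """
    local = now_utc.astimezone(user_tz)
    t = local.time()
    if not (t >= quiet_start or t < quiet_end):
        return now_utc
    release = local.replace(hour=quiet_end.hour, minute=quiet_end.minute,
                            second=0, microsecond=0)
    if t >= quiet_start:
        release += timedelta(days=1)  # evening side of the window: release tomorrow morning
    return release.astimezone(timezone.utc)
```

Evaluating quiet hours in the user's local time, then converting back to UTC for scheduling, is what keeps "don't send at 3 AM" correct across time zones.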

Probe 2: "How do you handle delivery reliability?" (3-5 min)

"Each channel has different delivery guarantees:"

  • Push (APNs/FCM): Best-effort. APNs returns success or failure immediately, but 'success' means 'Apple accepted it,' not 'the user saw it'; the device may be offline. If a device token is invalid (for example, the user uninstalled the app), APNs reports that in its response, so remove the token from your device registry.
  • Email (SendGrid/SES): Delivery and engagement events arrive via webhook (delivered, bounced, opened, clicked). Bounce handling is critical: suppress a hard-bounced address immediately, because continued sends to dead addresses damage sender reputation.
  • SMS (Twilio): Delivery receipts available but not universal (depends on carrier). SMS is expensive ($0.01-$0.05/message) — use only for high-urgency notifications.
  • In-app: WebSocket push if the user is online; persist to an in-app inbox for later viewing if offline.
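The per-channel feedback handling above might look like the following sketch. The event dictionaries are a hypothetical normalized format, not the actual APNs, SendGrid, or SES payloads; a real system would translate each provider's response into something like this first. `DeliveryState` and both function names are also assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DeliveryState:
    device_tokens: dict = field(default_factory=dict)    # user_id -> set of push tokens
    suppressed_emails: set = field(default_factory=set)  # addresses we must not mail
    inbox: dict = field(default_factory=dict)            # user_id -> stored in-app messages


def handle_feedback(state: DeliveryState, event: dict) -> str:
    """Apply one normalized provider feedback event (hypothetical format)."""
    kind = event["kind"]
    if kind == "push_invalid_token":
        # The provider rejected the token: drop it from the device registry.
        state.device_tokens.get(event["user_id"], set()).discard(event["token"])
        return "token_removed"
    if kind == "email_hard_bounce":
        # Hard bounce: stop mailing this address immediately.
        state.suppressed_emails.add(event["address"].lower())
        return "address_suppressed"
    if kind == "email_soft_bounce":
        return "retry_later"
    return "ignored"


def deliver_in_app(state: DeliveryState, user_id: str, message: str,
                   online_users: set, ws_send: Callable[[str, str], None]) -> str:
    """Push over the WebSocket if the user is online, otherwise store in the inbox."""
    if user_id in online_users:
        ws_send(user_id, message)
        return "pushed"
    state.inbox.setdefault(user_id, []).append(message)
    return "stored"
```

Keeping the suppression list and device registry in one place means the routing stage can consult them before choosing a channel, instead of discovering the dead address or token on the next failed send.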

Probe 3: "How do you handle multi-channel delivery?" (3-5 min)

"A high-urgency notification (fraud alert) should try multiple channels: push first, then SMS if push fails, then email. A low-urgency notification (weekly digest) goes to email only."

Walk through the cascade:

  1. Primary channel: Based on user preference and notification type. Try push first.
  2. Fallback on failure: If push fails (device token invalid or user didn't engage within 5 minutes), try the next channel in the cascade.
  3. Deduplication: If the user engages with the push notification (opens or taps it), cancel the pending SMS/email. "The notification system must track cross-channel delivery and suppress duplicates."
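The cascade steps above can be sketched as a loop that stops as soon as the user has engaged on any channel. This is a simplification under assumptions: `senders` and `engaged` are hypothetical callables, and the `engaged` check stands in for "did the user engage within the wait window"; a production system would schedule that check after the window rather than evaluate it inline.

```python
from typing import Callable

Sender = Callable[[str, str], bool]  # (user_id, message) -> accepted by the provider?


def run_cascade(user_id: str, message: str, cascade: list,
                senders: dict, engaged: Callable[[str, str], bool]) -> list:
    """Try channels in order and return the channels actually attempted.

    The loop stops once a channel accepts the message and the user engages
    with it, which cancels the remaining fallbacks (cross-channel dedup).
    """
    attempted = []
    for channel in cascade:
        attempted.append(channel)
        accepted = senders[channel](user_id, message)
        if accepted and engaged(user_id, channel):
            break
    return attempted
```

For a fraud alert, the cascade list would be something like `["push", "sms", "email"]`; for a weekly digest it would be `["email"]` alone.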

Phase 6: Wrap-Up

"Notification systems are a restraint engineering problem. The technology (Kafka, APNs, SendGrid) is commodity. The Staff-level insight: the system's most important function is deciding NOT to send. Frequency caps, notification scoring, smart timing, and batching — these are the features that determine whether users keep notifications enabled or disable them forever."

Quick-Reference: The 30-Second Cheat Sheet

Level Calibration
| Topic | The L5 Answer | The L6 Answer (say this) |
| --- | --- | --- |
| Purpose | "Send notifications" | "Decide IF, WHEN, and HOW to notify; sending is the easy part" |
| Architecture | "Service sends a push" | "Event → routing (preferences, frequency cap, scoring) → channel delivery → tracking" |
| Fatigue | "Users can disable notifications" | "Per-user daily budget, notification scoring, digest batching, smart timing" |
| Multi-channel | "Send to all channels" | "Priority cascade with fallback + cross-channel dedup" |
| Metrics | "Track sends" | "Track sent → delivered → read funnel per channel per notification type" |
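The sent → delivered → read funnel in the Metrics row could be computed from delivery log rows along these lines. The row shape is an assumption: one row per status transition, so a notification that was read contributes three rows (sent, delivered, read).

```python
from collections import Counter


def funnel(log: list) -> dict:
    """Aggregate the sent -> delivered -> read funnel per (channel, notification type).

    Each log row is a dict with "channel", "type", and "status" keys
    (hypothetical shape). Rates are expressed relative to the sent count.
    """
    counts = {}
    for row in log:
        key = (row["channel"], row["type"])
        counts.setdefault(key, Counter())[row["status"]] += 1
    report = {}
    for key, c in counts.items():
        sent = c["sent"]
        report[key] = {
            "sent": sent,
            "delivered_rate": c["delivered"] / sent if sent else 0.0,
            "read_rate": c["read"] / sent if sent else 0.0,
        }
    return report
```

Breaking the funnel out per channel and per type is what makes a drop visible: a falling `delivered_rate` on push alone points at tokens or the provider, while a falling `read_rate` on one notification type points at content or fatigue.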

Core sections
  • 1. The Staff Lens
  • 2. Problem Framing & Intent
  • 3. The Five Fault Lines
  • 4. Failure Modes & Degradation
  • 5. Evaluation Rubric
  • 6. Interview Flow & Pivots
  • 7. Active Drills
  • 8. Deep Dive Scenarios
  • 9. Level Expectations Summary
  • 10. Staff Insiders: Controversial Opinions