Design a Notification System | StaffSignal Playbook

Technologies referenced in this playbook: Apache Kafka · Elasticsearch

How to Use This Playbook

This playbook supports three reading modes:

Mode	Time	What to Read
Quick Review	15 min	Executive Summary → Interview Walkthrough → Fault Lines (§3) → Drills (§7)
Targeted Study	1-2 hrs	Executive Summary → Interview Walkthrough → Core Flow, expand appendices where you're weak
Deep Dive	3+ hrs	Everything, including all appendices

What are Notification & Delivery Systems? — Why interviewers pick this topic

The Problem

Notification systems deliver time-sensitive information to users across multiple channels — push notifications, email, SMS, in-app messages, webhooks. The challenge isn't sending a message; it's building a system that sends the right message to the right user at the right time through the right channel — without annoying them, losing messages, or sending duplicates.

Common Use Cases

Transactional: Order confirmations, password resets, payment receipts (must deliver, user expects them)
Engagement: "Someone liked your post," "New message from Alice" (important but deferrable)
Marketing: Promotions, re-engagement campaigns (lowest priority, highest volume)
Operational: Alert escalations, incident notifications, deploy status (internal, often urgent)
Webhook delivery: Third-party integrations requiring reliable HTTP callbacks

Why Interviewers Ask About This

Notification systems expose the Staff-level skill of managing competing priorities across organizational boundaries. Product wants more engagement notifications. Users want fewer notifications. Infrastructure wants predictable load. Compliance wants audit trails. The Staff question isn't "how do you send a push notification" — it's "who decides what gets sent, what gets suppressed, and what happens when the delivery pipeline falls behind?"

Mechanics Refresher: Delivery Channel Characteristics — Push, email, SMS, webhook reliability profiles

Who Pays Analysis

Channel	Delivery Guarantee	Latency	Cost per Message	Failure Mode
Push (APNs/FCM)	Best-effort (device must be online)	100ms-5s	~$0	Token expiry, device offline, OS throttling
Email (SMTP/SES)	Best-effort (spam filters, bounce)	Seconds to minutes	$0.10 per 1,000	Spam classification, bounce, reputation damage
SMS (Twilio/SNS)	High (carrier delivery)	1-10s	$0.01-0.05 each	Carrier filtering, number portability, cost explosion
In-app	Guaranteed (if user opens app)	Instant	$0	User never opens app
Webhook	At-least-once (with retry)	100ms-30s	$0	Endpoint down, timeout, backpressure

Why this matters for fault lines: Each channel has different reliability, cost, and user experience characteristics. A notification system that treats all channels the same will either over-deliver (user annoyance, cost) or under-deliver (missed critical messages). Staff engineers design channel routing as a first-class concern.
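
As a rough illustration of why per-channel cost drives routing, the sketch below estimates provider spend for a single send to a large audience, using the per-message figures from the table above. The audience size and the use of the midpoint of the SMS price range are assumptions made only for this example.

```python
# Illustrative per-message costs in USD, taken from the table above.
# SMS uses the midpoint of the $0.01-0.05 range (an assumption).
COST_PER_MESSAGE = {
    "push": 0.0,
    "email": 0.10 / 1000,
    "sms": 0.03,
    "in_app": 0.0,
    "webhook": 0.0,
}


def campaign_cost(channel: str, recipients: int) -> float:
    """Estimate the provider cost of sending one message to each recipient."""
    return COST_PER_MESSAGE[channel] * recipients


if __name__ == "__main__":
    # Hypothetical audience of one million users.
    for channel in COST_PER_MESSAGE:
        print(f"{channel:8s} ${campaign_cost(channel, 1_000_000):,.2f}")
```

Under these assumptions, one SMS to each of a million users costs about $30,000, while the same send by email costs about $100. That gap is one reason to treat channel selection as a routing decision rather than a default.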

What This Interview Actually Tests

The notification question is not a "send a push notification" question. Anyone can call the FCM API.

This is a delivery reliability and user experience ownership question that tests:

Whether you separate notification generation from delivery routing (and who owns each)
Whether you design suppression and deduplication as first-class features, not afterthoughts
Whether you reason about notification fatigue as a product metric, not just a UX complaint
Whether you can own the operational cost of multi-channel delivery at scale

The key insight: The hardest problem in notifications isn't delivery — it's deciding what not to send. Staff engineers design systems where suppression logic is as well-designed as delivery logic.

The L5 vs L6 Contrast (Memorize This)

Level Calibration

Behavior	L5 (Senior)	L6 (Staff)
First move	"We'll use a message queue and push API"	Asks "What are the notification categories? What are the delivery SLAs per category?"
Architecture	Event → queue → send	Event → priority classification → suppression → channel routing → delivery → tracking
Deduplication	"We'll use idempotency keys"	Designs suppression: rate limits per user per channel, quiet hours, preference-aware routing
Failure	"Retry with backoff"	Asks "What's the retry budget? Dead-letter queue? Who gets paged when delivery rate drops below 95%?"
Ownership	Focuses on the delivery pipeline	Asks "Who owns notification content? Who owns suppression rules? Who measures notification fatigue?"
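
The Failure row above contrasts plain "retry with backoff" with a bounded retry budget and a dead-letter queue. A minimal sketch of that idea follows. The function name, the default budget of five attempts, the delay parameters, and the in-memory list standing in for a dead-letter queue are all illustrative assumptions.

```python
import random
import time
from typing import Callable, List


def deliver_with_retry(
    notification_id: str,
    send: Callable[[], bool],
    dead_letter: List[str],
    max_attempts: int = 5,       # the retry budget (assumed value)
    base_delay: float = 1.0,     # seconds before the first retry (assumed)
    max_delay: float = 60.0,     # cap on any single wait (assumed)
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to send; back off between failures; dead-letter when the budget is spent."""
    for attempt in range(max_attempts):
        if send():
            return True
        if attempt < max_attempts - 1:
            # Full jitter: wait a random time up to the capped exponential delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    # Budget exhausted: park the notification for inspection instead of retrying forever.
    dead_letter.append(notification_id)
    return False
```

The `sleep` parameter is injectable so the function can be exercised without real waiting. In a real system the dead-letter list would be a durable queue, and its depth is a natural metric to alert on.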

The Three Intents (Pick One and Commit)

Intent	Constraint	Strategy	Delivery SLA
Transactional Delivery	Must deliver, user expects it	Retry aggressively, multi-channel fallback	99.9% within 60 seconds
Engagement Optimization	Should deliver, but not annoy	Smart routing, suppression, batching	95% within 5 minutes
Scale Broadcast	Must reach millions, cost matters	Batching, prioritization, degradation strategy	90% within 30 minutes
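
Each SLA in the table above is a fraction of notifications delivered within a deadline. One way to check a batch of measured latencies against such a target is sketched below. The function name, the intent keys, and the convention that `None` marks an undelivered notification are assumptions for illustration.

```python
from typing import Optional, Sequence

# (deadline in seconds, required on-time fraction), from the table above.
SLA_BY_INTENT = {
    "transactional": (60, 0.999),
    "engagement": (5 * 60, 0.95),
    "broadcast": (30 * 60, 0.90),
}


def meets_sla(intent: str, latencies_s: Sequence[Optional[float]]) -> bool:
    """Return True if enough notifications arrived within the intent's deadline.

    A latency of None means the notification was never delivered.
    """
    deadline_s, target = SLA_BY_INTENT[intent]
    if not latencies_s:
        return True  # nothing was sent, so nothing missed the SLA
    on_time = sum(
        1 for lat in latencies_s if lat is not None and lat <= deadline_s
    )
    return on_time / len(latencies_s) >= target
```

Counting undelivered notifications in the denominator matters: an SLA computed only over delivered messages hides exactly the failures it is meant to expose.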

🎯 Staff Insight: "I'll assume we're building a notification platform that handles both transactional and engagement notifications. Transactional messages (order confirmations, password resets) have a 99.9% delivery SLA within 60 seconds. Engagement notifications (social activity, recommendations) are subject to user preferences, suppression rules, and batching. I'll focus on the architecture that separates these concerns."

The Five Fault Lines (The Core of This Interview)

Delivery Guarantee vs User Experience — Retrying aggressively ensures delivery but risks duplicates and annoyance. How do you balance reliability with user happiness?
Channel Routing — Push, email, SMS, in-app — each has different reliability, cost, and user impact. Who decides which channel, and what's the fallback chain?
Suppression & Rate Limiting — The most important feature users never see. How do you prevent notification fatigue without losing critical messages?
Priority & Ordering — Password resets must beat marketing emails in the queue. How do you prioritize without starving lower-priority notifications?
Delivery Tracking & Observability — How do you know a notification was actually received (not just sent)? Who owns delivery metrics?

Each fault line has a tradeoff matrix with explicit "who pays" analysis. See §3.
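
For the Priority & Ordering fault line, one common way to prioritize without starving low-priority traffic is aging: a message's effective priority grows the longer it waits. The sketch below is one possible shape, not a prescribed design. The tier names, base weights, and aging rate are hypothetical values chosen for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Optional

# Hypothetical tiers and weights; a higher score is served first.
BASE_WEIGHT = {"transactional": 100.0, "engagement": 10.0, "marketing": 1.0}
AGING_PER_SECOND = 0.5  # assumed: score gained per second spent waiting


@dataclass
class Queued:
    notification_id: str
    tier: str
    enqueued_at: float  # seconds, on any monotonic clock


class AgingScheduler:
    """One FIFO queue per tier; serve the head with the highest aged score."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[Queued]] = {t: deque() for t in BASE_WEIGHT}

    def enqueue(self, item: Queued) -> None:
        self._queues[item.tier].append(item)

    def _score(self, item: Queued, now: float) -> float:
        return BASE_WEIGHT[item.tier] + AGING_PER_SECOND * (now - item.enqueued_at)

    def dequeue(self, now: float) -> Optional[Queued]:
        # Each queue is FIFO, so its head is its oldest (highest-scoring) item.
        heads = [q[0] for q in self._queues.values() if q]
        if not heads:
            return None
        best = max(heads, key=lambda item: self._score(item, now))
        return self._queues[best.tier].popleft()
```

With these assumed constants, a marketing message overtakes a freshly enqueued transactional message after waiting about 198 seconds. Tuning the weights and aging rate is where the tradeoff between transactional latency and marketing starvation gets decided.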

Quick Reference: What Interviewers Probe

After You Say...	They Will Ask...
"Push notification via FCM"	"What if the device is offline? What's the fallback? How long do you retry?"
"Message queue for async"	"How do you ensure a password reset email arrives in 30 seconds, not 5 minutes behind a marketing batch?"
"Retry with exponential backoff"	"What's your retry budget? What happens after exhausting retries? Who gets paged?"
"User preferences table"	"How do you handle 'do not disturb' across time zones? Who owns the suppression rules?"
"We'll batch notifications"	"What if two events happen 1 second apart? Does the user get two notifications or one?"

Jump to Practice

→ Active Drills (§7) — 8 practice prompts with expected answer shapes

System Architecture Overview

[Figure: system architecture diagram]

Interview Walkthrough

Phase 1: Requirements & Framing (30 seconds)

"Notification systems deliver messages across multiple channels — push (APNs/FCM), email, SMS, in-app — based on user preferences, urgency, and content type. The hard part isn't sending notifications; it's deciding WHAT to send, to WHOM, through WHICH channel, and WHEN."

Phase 2: Core Entities (30 seconds)

NotificationEvent: the trigger — what happened (order shipped, friend request, price drop)
NotificationPreference: per-user, per-category channel preferences (push for messages, email for receipts, no SMS)
DeliveryChannel: the transport — APNs, FCM, SendGrid, Twilio, in-app websocket
NotificationLog: immutable record of what was sent, when, through which channel, and the delivery status
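
The four entities above could be modeled roughly as follows. This is a sketch under assumptions: the event fields mirror those named in the event ingestion stage (event_type, target_user, payload, urgency), while the enum members, the preference lookup method, and the remaining field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Dict, List


class Channel(Enum):
    PUSH = "push"
    EMAIL = "email"
    SMS = "sms"
    IN_APP = "in_app"


class DeliveryStatus(Enum):
    SENT = "sent"
    DELIVERED = "delivered"
    READ = "read"
    FAILED = "failed"


@dataclass(frozen=True)
class NotificationEvent:
    event_type: str   # what happened, e.g. "order_shipped"
    target_user: str
    payload: dict
    urgency: str


@dataclass
class NotificationPreference:
    user_id: str
    # category -> channels the user has opted into for that category
    channels_by_category: Dict[str, List[Channel]] = field(default_factory=dict)

    def channels_for(self, category: str) -> List[Channel]:
        """Channels allowed for a category; empty means the user opted out."""
        return self.channels_by_category.get(category, [])


@dataclass(frozen=True)  # frozen: log entries are immutable once written
class NotificationLogEntry:
    event: NotificationEvent
    channel: Channel
    status: DeliveryStatus
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
```

Because log entries are frozen, a status change (sent, then delivered, then read) is recorded as a new entry rather than an update, which keeps the log append-only.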

Phase 3: The 2-Minute Architecture (2 minutes)

Staff-grade phrasing

"The notification pipeline has four stages:

1. Event ingestion. Services publish notification events to a Kafka topic. Events carry: event_type, target_user, payload, urgency.

2. Routing. The router checks user preferences: does this user want this notification type? Through which channel? Is the user within their daily notification budget (frequency cap)? If all checks pass, enqueue to the appropriate channel queue.

3. Channel delivery. Each channel has its own delivery service: APNs adapter, FCM adapter, email via SendGrid, SMS via Twilio. Each handles retries, rate limits, and delivery confirmation independently.

4. Tracking. Every sent notification is logged with delivery status (sent, delivered, read, failed). This feeds analytics (open rates, click rates) and the notification scoring model."
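
The routing stage described above can be sketched in a few lines. To keep the example self-contained, in-memory dictionaries stand in for the Kafka topic, the preference store, and the per-channel queues. The class name, the preference layout, and the single per-user daily cap are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Event:
    event_type: str
    target_user: str
    payload: dict
    urgency: str


class Router:
    """Checks preferences and the frequency cap, then enqueues per channel."""

    def __init__(self, preferences: Dict[str, Dict[str, str]], daily_cap: int) -> None:
        # preferences: user -> event_type -> preferred channel (absent = opted out)
        self.preferences = preferences
        self.daily_cap = daily_cap
        self.sent_today: Dict[str, int] = defaultdict(int)
        self.channel_queues: Dict[str, List[Event]] = defaultdict(list)

    def route(self, event: Event) -> Optional[str]:
        """Return the channel the event was enqueued to, or None if suppressed."""
        channel = self.preferences.get(event.target_user, {}).get(event.event_type)
        if channel is None:
            return None  # user does not want this notification type
        if self.sent_today[event.target_user] >= self.daily_cap:
            return None  # daily notification budget exhausted
        self.sent_today[event.target_user] += 1
        self.channel_queues[channel].append(event)
        return channel
```

This sketch applies the cap to every event. A production router would likely exempt transactional events from the cap, since those must be delivered regardless of budget.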

Phase 4: Transition to Depth (15 seconds)

"The pipeline is straightforward. The hard problems are: notification fatigue (sending too many), delivery reliability across channels, and real-time vs batched delivery."

Phase 5: Deep Dives (5-15 minutes if probed)

Probe 1: "How do you prevent notification fatigue?" (3-5 min)

"The biggest failure mode of a notification system isn't failed delivery — it's successful delivery of too many notifications. Users disable notifications, and you permanently lose the channel."

Walk through the mitigation:

Per-user daily budget: Maximum N notifications per day per channel. Priority-based: if the budget is 5 and 6 notifications are pending, drop the lowest-priority one.
Notification scoring: Each notification gets a relevance score based on: user's historical engagement with this category, recency of similar notifications, content relevance. Low-scoring notifications are suppressed.
Batching/digest: Instead of 10 individual "someone liked your post" notifications, batch them into one: "10 people liked your post." The batch window is time-based (every 30 minutes) or count-based (every 5 similar events).
Smart timing: Don't send notifications at 3 AM. Predict the user's active hours from historical app usage and queue notifications for their next active window.
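
Two of the mitigations above, the per-user daily budget and digest batching, are simple enough to sketch directly. The data shape and the message wording are hypothetical, and the functions assume they are called once per batch window with everything pending in that window.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Pending:
    user_id: str
    category: str   # e.g. "post_like"
    actor: str      # who triggered the event
    priority: int   # higher = more important


def apply_daily_budget(pending: List[Pending], budget: int) -> List[Pending]:
    """Keep the `budget` highest-priority notifications for one user and channel."""
    ranked = sorted(pending, key=lambda n: n.priority, reverse=True)
    return ranked[:budget]


def build_digests(pending: List[Pending]) -> List[str]:
    """Collapse similar events per user into a single message each."""
    groups: Dict[Tuple[str, str], List[Pending]] = defaultdict(list)
    for n in pending:
        groups[(n.user_id, n.category)].append(n)
    messages = []
    for (user_id, category), items in groups.items():
        if len(items) == 1:
            messages.append(f"{user_id}: {items[0].actor} triggered {category}")
        else:
            messages.append(f"{user_id}: {len(items)} people triggered {category}")
    return messages
```

Running the digest step before the budget step is usually the better order, since ten likes then consume one slot of the daily budget instead of ten.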

Probe 2: "How do you handle delivery reliability?" (3-5 min)

"Each channel has different delivery guarantees:"

Push (APNs/FCM): Effectively fire-and-forget. APNs returns success or failure immediately, but "success" means "Apple accepted it," not "the user saw it." The device may be offline. If a device token is no longer valid (for example, the user uninstalled the app), APNs reports that in its response, and you should remove the token from your device registry.
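Email (SendGrid/SES): Delivery confirmation via webhook (delivered, bounced, opened, clicked). Bounce handling is critical: a hard bounce means the address is permanently undeliverable, so stop sending to it immediately (add it to a suppression list) to protect sender reputation.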
SMS (Twilio): Delivery receipts available but not universal (depends on carrier). SMS is expensive ($0.01-$0.05/message) — use only for high-urgency notifications.
In-app: WebSocket push if the user is online; persist to an in-app inbox for later viewing if offline.
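
The per-channel feedback described above ultimately has to update your own records. The sketch below shows one way to act on it, assuming provider responses have already been normalized into simple outcome strings. The outcome names (`invalid_token`, `hard_bounce`, `accepted`), the class, and its fields are hypothetical and do not correspond to any provider's actual API.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ContactRegistry:
    device_tokens: Dict[str, Set[str]] = field(default_factory=dict)  # user -> tokens
    suppressed_emails: Set[str] = field(default_factory=set)

    def handle_feedback(
        self, channel: str, outcome: str, user_id: str, address: str
    ) -> str:
        """Apply a normalized delivery outcome and return the action taken."""
        if channel == "push" and outcome == "invalid_token":
            # The provider says this token is dead: stop targeting it.
            self.device_tokens.get(user_id, set()).discard(address)
            return "token_removed"
        if channel == "email" and outcome == "hard_bounce":
            # Permanently undeliverable: never send to this address again.
            self.suppressed_emails.add(address)
            return "email_suppressed"
        if outcome == "accepted":
            # Accepted by the provider, which is not the same as seen by the user.
            return "sent"
        return "no_action"
```

Keeping "sent" distinct from "delivered" or "read" in the returned action reflects the point above: provider acceptance is only the first of several delivery states.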

Probe 3: "How do you handle multi-channel delivery?" (3-5 min)

"A high-urgency notification (fraud alert) should try multiple channels: push first, then SMS if push fails, then email. A low-urgency notification (weekly digest) goes to email only."

Walk through the cascade:

Primary channel: Based on user preference and notification type. Try push first.
Fallback on failure: If push fails (device token invalid or user didn't engage within 5 minutes), try the next channel in the cascade.
Deduplication: If the user sees the push notification, cancel the pending SMS/email. "The notification system must track cross-channel delivery and suppress duplicates."
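
The cascade above can be sketched as a loop over an ordered channel list that stops as soon as the user has engaged. The cascade table and function names are illustrative assumptions. The `engaged` callback stands in for "wait out the engagement window (5 minutes in the example above), then check whether the user interacted."

```python
from typing import Callable, Dict, List

# Hypothetical cascades keyed by urgency, following the examples above.
CASCADES: Dict[str, List[str]] = {
    "high": ["push", "sms", "email"],
    "low": ["email"],
}


def deliver_with_fallback(
    urgency: str,
    send: Callable[[str], bool],
    engaged: Callable[[str], bool],
) -> List[str]:
    """Try each channel in order and return the channels actually attempted.

    Stopping the loop once the user engages is what suppresses the
    cross-channel duplicates: later channels are never sent.
    """
    attempted: List[str] = []
    for channel in CASCADES[urgency]:
        attempted.append(channel)
        if send(channel) and engaged(channel):
            break  # user saw it, so cancel the remaining fallbacks
    return attempted
```

For example, `deliver_with_fallback("high", send, engaged)` attempts only push when the push is sent and the user engages, and escalates to SMS and then email otherwise.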

Phase 6: Wrap-Up

"Notification systems are a restraint engineering problem. The technology (Kafka, APNs, SendGrid) is commodity. The Staff-level insight: the system's most important function is deciding NOT to send. Frequency caps, notification scoring, smart timing, and batching — these are the features that determine whether users keep notifications enabled or disable them forever."

Quick-Reference: The 30-Second Cheat Sheet