Technologies referenced in this playbook: Apache Kafka · Elasticsearch
How to Use This Playbook
This playbook supports three reading modes:
| Mode | Time | What to Read |
|---|---|---|
| Quick Review | 15 min | Executive Summary → Interview Walkthrough → Fault Lines (§3) → Drills (§7) |
| Targeted Study | 1-2 hrs | Executive Summary → Interview Walkthrough → Core Flow, expand appendices where you're weak |
| Deep Dive | 3+ hrs | Everything, including all appendices |
What are Notification & Delivery Systems? — Why interviewers pick this topic
The Problem
Notification systems deliver time-sensitive information to users across multiple channels — push notifications, email, SMS, in-app messages, webhooks. The challenge isn't sending a message; it's building a system that sends the right message to the right user at the right time through the right channel — without annoying them, losing messages, or sending duplicates.
Common Use Cases
- Transactional: Order confirmations, password resets, payment receipts (must deliver, user expects them)
- Engagement: "Someone liked your post," "New message from Alice" (important but deferrable)
- Marketing: Promotions, re-engagement campaigns (lowest priority, highest volume)
- Operational: Alert escalations, incident notifications, deploy status (internal, often urgent)
- Webhook delivery: Third-party integrations requiring reliable HTTP callbacks
Why Interviewers Ask About This
Notification systems expose the Staff-level skill of managing competing priorities across organizational boundaries. Product wants more engagement notifications. Users want fewer notifications. Infrastructure wants predictable load. Compliance wants audit trails. The Staff question isn't "how do you send a push notification" — it's "who decides what gets sent, what gets suppressed, and what happens when the delivery pipeline falls behind?"
Mechanics Refresher: Delivery Channel Characteristics — Push, email, SMS, webhook reliability profiles
| Channel | Delivery Guarantee | Latency | Cost per Message | Failure Mode |
|---|---|---|---|---|
| Push (APNs/FCM) | Best-effort (device must be online) | 100ms-5s | ~$0 | Token expiry, device offline, OS throttling |
| Email (SMTP/SES) | Best-effort (spam filters, bounce) | Seconds to minutes | $0.10/1000 | Spam classification, bounce, reputation damage |
| SMS (Twilio/SNS) | High (carrier delivery) | 1-10s | $0.01-0.05 each | Carrier filtering, number portability, cost explosion |
| In-app | Guaranteed (if user opens app) | Instant | $0 | User never opens app |
| Webhook | At-least-once (with retry) | 100ms-30s | $0 | Endpoint down, timeout, backpressure |
Why this matters for fault lines: Each channel has different reliability, cost, and user experience characteristics. A notification system that treats all channels the same will either over-deliver (user annoyance, cost) or under-deliver (missed critical messages). Staff engineers design channel routing as a first-class concern.
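To make the cost column concrete, here is a minimal sketch that estimates spend for a given channel mix. The per-message prices come from the table above (push is treated as exactly $0, and SMS uses both ends of its range); the message volumes are hypothetical.

```python
# Per-message cost in USD as (low, high), taken from the table above.
# Only SMS has a real range; the other channels use a single price.
COST_PER_MESSAGE = {
    "push": (0.0, 0.0),
    "email": (0.10 / 1000, 0.10 / 1000),
    "sms": (0.01, 0.05),
    "in_app": (0.0, 0.0),
    "webhook": (0.0, 0.0),
}


def estimate_cost(volumes: dict) -> tuple:
    """Return (low, high) total cost in USD for per-channel message counts."""
    low = sum(COST_PER_MESSAGE[ch][0] * n for ch, n in volumes.items())
    high = sum(COST_PER_MESSAGE[ch][1] * n for ch, n in volumes.items())
    return low, high


# Hypothetical volume: the same 1M messages routed two different ways.
all_sms = estimate_cost({"sms": 1_000_000})
mostly_push = estimate_cost({"push": 900_000, "email": 90_000, "sms": 10_000})
print(f"all SMS:     ${all_sms[0]:,.0f} to ${all_sms[1]:,.0f}")
print(f"mostly push: ${mostly_push[0]:,.0f} to ${mostly_push[1]:,.0f}")
```

Under these assumed volumes, routing everything over SMS costs $10,000 to $50,000, while the push-first mix costs roughly $109 to $509. That gap is why channel routing is a design decision and not a delivery detail.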
What This Interview Actually Tests
A notifications question is not a "send a push notification" question. Everyone can call the FCM API.
This is a delivery reliability and user experience ownership question that tests:
- Whether you separate notification generation from delivery routing (and who owns each)
- Whether you design suppression and deduplication as first-class features, not afterthoughts
- Whether you reason about notification fatigue as a product metric, not just a UX complaint
- Whether you can own the operational cost of multi-channel delivery at scale
The key insight: The hardest problem in notifications isn't delivery — it's deciding what not to send. Staff engineers design systems where suppression logic is as well-designed as delivery logic.
The L5 vs L6 Contrast (Memorize This)
| Behavior | L5 (Senior) | L6 (Staff) |
|---|---|---|
| First move | "We'll use a message queue and push API" | Asks "What are the notification categories? What are the delivery SLAs per category?" |
| Architecture | Event → queue → send | Event → priority classification → suppression → channel routing → delivery → tracking |
| Deduplication | "We'll use idempotency keys" | Designs suppression: rate limits per user per channel, quiet hours, preference-aware routing |
| Failure | "Retry with backoff" | Asks "What's the retry budget? Dead-letter queue? Who gets paged when delivery rate drops below 95%?" |
| Ownership | Focuses on the delivery pipeline | Asks "Who owns notification content? Who owns suppression rules? Who measures notification fatigue?" |
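The L6 architecture row in the table above can be sketched as a chain of stages where any stage may drop the notification. This is an illustrative skeleton only: the `Notification` fields, the `PRIORITY` and `MUTED` tables, and the routing rule are all hypothetical, and `deliver` and `track` are placeholders for provider calls and logging.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Notification:
    user_id: str
    category: str                     # e.g. "password_reset", "like", "promo"
    priority: int = 0                 # set by classify()
    channel: Optional[str] = None     # set by route()
    trail: list = field(default_factory=list)  # names of stages that ran


# Hypothetical lookup tables; a real system would load these from config.
PRIORITY = {"password_reset": 3, "like": 2, "promo": 1}
MUTED = {("u1", "promo")}             # (user_id, category) opt-outs


def classify(n: Notification) -> Optional[Notification]:
    n.priority = PRIORITY.get(n.category, 1)
    return n


def suppress(n: Notification) -> Optional[Notification]:
    # Returning None drops the notification before any channel is touched.
    return None if (n.user_id, n.category) in MUTED else n


def route(n: Notification) -> Optional[Notification]:
    n.channel = "push" if n.priority >= 2 else "email"
    return n


def deliver(n: Notification) -> Optional[Notification]:
    return n  # placeholder: a provider call (APNs, FCM, SMTP) would go here


def track(n: Notification) -> Optional[Notification]:
    return n  # placeholder: append to the notification log


STAGES = (classify, suppress, route, deliver, track)


def run_pipeline(n: Notification, stages=STAGES) -> Optional[Notification]:
    """Run each stage in order; stop early if a stage suppresses the message."""
    for stage in stages:
        n.trail.append(stage.__name__)
        result = stage(n)
        if result is None:
            return None
        n = result
    return n
```

Keeping suppression as its own stage ahead of routing means a suppressed message never reaches a provider, and each stage can have a separate owner.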
The Three Intents (Pick One and Commit)
| Intent | Constraint | Strategy | Delivery SLA |
|---|---|---|---|
| Transactional Delivery | Must deliver, user expects it | Retry aggressively, multi-channel fallback | 99.9% within 60 seconds |
| Engagement Optimization | Should deliver, but not annoy | Smart routing, suppression, batching | 95% within 5 minutes |
| Scale Broadcast | Must reach millions, cost matters | Batching, prioritization, degradation strategy | 90% within 30 minutes |
🎯 Staff Insight: "I'll assume we're building a notification platform that handles both transactional and engagement notifications. Transactional messages (order confirmations, password resets) have a 99.9% delivery SLA within 60 seconds. Engagement notifications (social activity, recommendations) are subject to user preferences, suppression rules, and batching. I'll focus on the architecture that separates these concerns."
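One way to make the three SLAs checkable is to store them as data and evaluate them against observed latencies. The sketch below assumes the targets from the table above; the `DeliverySla` type, the intent keys, and the latency sample format are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeliverySla:
    target_fraction: float   # share of messages that must arrive on time
    deadline_seconds: int    # what "on time" means for this intent


# Targets copied from the intent table above.
SLAS = {
    "transactional": DeliverySla(0.999, 60),
    "engagement": DeliverySla(0.95, 5 * 60),
    "broadcast": DeliverySla(0.90, 30 * 60),
}


def meets_sla(intent: str, latencies: list) -> bool:
    """latencies: seconds from event to delivery; None means never delivered."""
    sla = SLAS[intent]
    if not latencies:
        return True  # nothing was sent, so nothing was violated
    on_time = sum(
        1 for s in latencies if s is not None and s <= sla.deadline_seconds
    )
    return on_time / len(latencies) >= sla.target_fraction


# Hypothetical sample: 998 fast deliveries, one slow, one lost.
sample = [10.0] * 998 + [75.0, None]
print(meets_sla("transactional", sample))  # 998/1000 on time
print(meets_sla("engagement", sample))     # 999/1000 on time
```

The same sample fails the transactional target and passes the engagement target, which is the practical reason to commit to one intent per notification category.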
The Five Fault Lines (The Core of This Interview)
- Delivery Guarantee vs User Experience: Retrying aggressively ensures delivery but risks duplicates and annoyance. How do you balance reliability with user happiness?
- Channel Routing: Push, email, SMS, and in-app each have different reliability, cost, and user impact. Who decides which channel, and what's the fallback chain?
- Suppression & Rate Limiting: The most important feature users never see. How do you prevent notification fatigue without losing critical messages?
- Priority & Ordering: Password resets must beat marketing emails in the queue. How do you prioritize without starving lower-priority notifications?
- Delivery Tracking & Observability: How do you know a notification was actually received (not just sent)? Who owns delivery metrics?
Each fault line has a tradeoff matrix with explicit "who pays" analysis. See §3.
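The Priority & Ordering fault line can be illustrated with weighted dequeueing across per-tier queues, which is one possible answer to "prioritize without starving." This is a sketch, not the playbook's prescribed design: the tier names and weights are hypothetical, and aging or separate worker pools are equally valid alternatives.

```python
from collections import deque

# Hypothetical weights: items taken per tier in each scheduling cycle.
WEIGHTS = {"transactional": 8, "engagement": 3, "marketing": 1}


class WeightedQueues:
    def __init__(self, weights: dict):
        self.weights = weights
        self.queues = {tier: deque() for tier in weights}

    def put(self, tier: str, item) -> None:
        self.queues[tier].append(item)

    def drain_cycle(self) -> list:
        """One cycle: up to `weight` items per tier, heaviest tier first."""
        batch = []
        for tier in sorted(self.weights, key=self.weights.get, reverse=True):
            q = self.queues[tier]
            for _ in range(min(self.weights[tier], len(q))):
                batch.append(q.popleft())
        return batch


# A transactional backlog does not block marketing entirely.
wq = WeightedQueues(WEIGHTS)
for i in range(20):
    wq.put("transactional", f"t{i}")
for i in range(5):
    wq.put("marketing", f"m{i}")
first_cycle = wq.drain_cycle()
print(first_cycle)
```

Each cycle serves the highest tier first, so a password reset on an otherwise idle system goes out immediately, while the lowest tier still makes progress on every cycle even under a transactional backlog.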
Quick Reference: What Interviewers Probe
| After You Say... | They Will Ask... |
|---|---|
| "Push notification via FCM" | "What if the device is offline? What's the fallback? How long do you retry?" |
| "Message queue for async" | "How do you ensure a password reset email arrives in 30 seconds, not 5 minutes behind a marketing batch?" |
| "Retry with exponential backoff" | "What's your retry budget? What happens after exhausting retries? Who gets paged?" |
| "User preferences table" | "How do you handle 'do not disturb' across time zones? Who owns the suppression rules?" |
| "We'll batch notifications" | "What if two events happen 1 second apart? Does the user get two notifications or one?" |
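The retry probe in the table above has a concrete shape: a capped number of attempts, exponential backoff with jitter between them, and a dead-letter destination once attempts are exhausted. The sketch below assumes a per-message attempt cap as the "retry budget"; the `send` callable, the in-memory `dead_letters` list, and the default delays are hypothetical stand-ins.

```python
import random
import time
from typing import Callable


def deliver_with_retry(
    send: Callable[[dict], bool],
    message: dict,
    dead_letters: list,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try `send` up to `max_attempts` times, then dead-letter the message."""
    for attempt in range(max_attempts):
        if send(message):
            return True
        if attempt < max_attempts - 1:
            # Exponential backoff with full jitter, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    dead_letters.append(message)  # budget exhausted: park it for inspection
    return False


# Usage with a stub sender that fails twice, then succeeds (no real sleeping).
attempts = []


def flaky_send(message: dict) -> bool:
    attempts.append(message["id"])
    return len(attempts) >= 3


dlq = []
ok = deliver_with_retry(flaky_send, {"id": "m1"}, dlq, sleep=lambda _s: None)
print(ok, len(attempts), dlq)
```

The dead-letter list is where the "who gets paged" question attaches: its growth rate is a natural alerting signal.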
Jump to Practice
→ Active Drills (§7) — 8 practice prompts with expected answer shapes
System Architecture Overview
Interview Walkthrough
Phase 1: Requirements & Framing (30 seconds)
- "Notification systems deliver messages across multiple channels — push (APNs/FCM), email, SMS, in-app — based on user preferences, urgency, and content type. The hard part isn't sending notifications; it's deciding WHAT to send, to WHOM, through WHICH channel, and WHEN."
Phase 2: Core Entities (30 seconds)
- NotificationEvent: the trigger — what happened (order shipped, friend request, price drop)
- NotificationPreference: per-user, per-category channel preferences (push for messages, email for receipts, no SMS)
- DeliveryChannel: the transport — APNs, FCM, SendGrid, Twilio, in-app websocket
- NotificationLog: immutable record of what was sent, when, through which channel, and the delivery status
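The four entities above might be modeled as follows. This is a minimal sketch: every field name is an assumption, `DeliveryChannel` is reduced to an enum of channel types where a real system would also carry provider configuration, and a category with no stored preference is treated as "no channels allowed."

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class DeliveryChannel(Enum):
    PUSH = "push"
    EMAIL = "email"
    SMS = "sms"
    IN_APP = "in_app"


@dataclass(frozen=True)
class NotificationEvent:
    event_id: str
    user_id: str
    category: str                      # e.g. "order_shipped"
    payload: dict = field(default_factory=dict)


@dataclass
class NotificationPreference:
    user_id: str
    # category -> channels the user allows for that category
    channels_by_category: dict = field(default_factory=dict)

    def allowed(self, category: str) -> list:
        # Assumption: no stored preference means no channel is allowed.
        return self.channels_by_category.get(category, [])


@dataclass(frozen=True)               # frozen: log rows are never edited
class NotificationLog:
    event_id: str
    user_id: str
    channel: DeliveryChannel
    status: str                        # e.g. "sent", "delivered", "failed"
    at: datetime


prefs = NotificationPreference("u1", {"receipts": [DeliveryChannel.EMAIL]})
row = NotificationLog(
    "e1", "u1", DeliveryChannel.EMAIL, "sent", datetime.now(timezone.utc)
)
print(prefs.allowed("receipts"), row.status)
```

Because the log is immutable, a status change such as sent to delivered is recorded as a new row, not an update.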
Phase 3: The 2-Minute Architecture (2 minutes)
Phase 4: Transition to Depth (15 seconds)
"The pipeline is straightforward. The hard problems are: notification fatigue (sending too many), delivery reliability across channels, and real-time vs batched delivery."
Phase 5: Deep Dives (5-15 minutes if probed)
Probe 1: "How do you prevent notification fatigue?" (3-5 min)
"The biggest failure mode of a notification system isn't failed delivery — it's successful delivery of too many notifications. Users disable notifications, and you permanently lose the channel."
Walk through the mitigation:
- Per-user daily budget: Maximum N notifications per day per channel. Priority-based: if the budget is 5 and 6 notifications are pending, drop the lowest-priority one.
- Notification scoring: Each notification gets a relevance score based on: user's historical engagement with this category, recency of similar notifications, content relevance. Low-scoring notifications are suppressed.
- Batching/digest: Instead of 10 individual "someone liked your post" notifications, send one: "10 people liked your post." The batch window is time-based (every 30 minutes) or count-based (every 5 similar events).
- Smart timing: Don't send notifications at 3 AM. Predict the user's active hours from historical app usage and queue notifications for their next active window.
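Two of the mitigations above, the daily budget and digest batching, are simple enough to sketch directly. The dictionary shapes and the `VERB` table are hypothetical; `apply_daily_budget` is assumed to receive the pending list for one user, one channel, and one day, and the logic that decides when a batch window closes is not shown.

```python
from collections import defaultdict

VERB = {"like": "liked", "comment": "commented on"}  # hypothetical wording


def apply_daily_budget(pending: list, budget: int) -> tuple:
    """Keep the `budget` highest-priority notifications; return (kept, dropped).

    sorted() is stable, so equal priorities keep their arrival order.
    """
    ranked = sorted(pending, key=lambda n: n["priority"], reverse=True)
    return ranked[:budget], ranked[budget:]


def build_digests(events: list) -> list:
    """Collapse similar events into one message per (user, type, target)."""
    groups = defaultdict(list)
    for e in events:
        groups[(e["user_id"], e["type"], e["target"])].append(e["actor"])
    digests = []
    for (user_id, etype, target), actors in groups.items():
        who = actors[0] if len(actors) == 1 else f"{len(actors)} people"
        digests.append(
            {"user_id": user_id, "text": f"{who} {VERB[etype]} your {target}"}
        )
    return digests


# Budget of 5 with 6 pending: the lowest-priority notification is dropped.
pending = [{"id": i, "priority": p} for i, p in enumerate([3, 1, 2, 5, 4, 2])]
kept, dropped = apply_daily_budget(pending, budget=5)

# Ten likes on the same post within one batch window become one message.
likes = [
    {"user_id": "u1", "type": "like", "target": "post", "actor": f"user{i}"}
    for i in range(10)
]
print(dropped, build_digests(likes))
```

Whether a dropped notification is discarded or deferred to the next day is a product decision that the sketch leaves open.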
Probe 2: "How do you handle delivery reliability?" (3-5 min)
"Each channel has different delivery guarantees:"
- Push (APNs/FCM): Effectively fire-and-forget. APNs returns success or failure immediately, but "success" means "Apple accepted it," not "the user saw it." The device may be offline. "If a device token is invalid (for example, the user uninstalled the app), APNs says so in its response, so remove that token from your device registry."
- Email (SendGrid/SES): Delivery confirmation via webhook (delivered, bounced, opened, clicked). Bounce handling is critical: a hard bounce must immediately suppress further sends to that address, because repeated sends to dead addresses damage sender reputation.
- SMS (Twilio): Delivery receipts available but not universal (depends on carrier). SMS is expensive ($0.01-$0.05/message) — use only for high-urgency notifications.
- In-app: WebSocket push if the user is online; persist to an in-app inbox for later viewing if offline.
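The per-channel feedback described above translates into a few state updates. The sketch below assumes provider responses have already been normalized into the hypothetical strings `"invalid_token"` and `"hard_bounce"`; real providers report these in their own formats, and the in-memory `DeliveryState` stands in for a device registry, a suppression list, and an inbox store.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DeliveryState:
    device_tokens: dict = field(default_factory=dict)     # user_id -> set of tokens
    suppressed_emails: set = field(default_factory=set)
    inbox: dict = field(default_factory=dict)             # user_id -> list of messages


def handle_push_result(state: DeliveryState, user_id: str, token: str, outcome: str) -> None:
    """Drop tokens the provider reports as invalid so they are not retried."""
    if outcome == "invalid_token":
        state.device_tokens.get(user_id, set()).discard(token)


def handle_email_event(state: DeliveryState, address: str, event: str) -> None:
    """Stop sending to an address after a hard bounce."""
    if event == "hard_bounce":
        state.suppressed_emails.add(address)


def deliver_in_app(
    state: DeliveryState,
    user_id: str,
    message: str,
    online_users: set,
    push_to_socket: Callable[[str, str], None],
) -> bool:
    """Push live if the user is connected; otherwise persist to the inbox."""
    if user_id in online_users:
        push_to_socket(user_id, message)
        return True
    state.inbox.setdefault(user_id, []).append(message)
    return False


state = DeliveryState(device_tokens={"u1": {"tok-a", "tok-b"}})
handle_push_result(state, "u1", "tok-a", "invalid_token")
handle_email_event(state, "old@example.com", "hard_bounce")
live = deliver_in_app(state, "u1", "Order shipped", online_users=set(),
                      push_to_socket=lambda uid, msg: None)
print(state.device_tokens, state.suppressed_emails, state.inbox, live)
```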
Probe 3: "How do you handle multi-channel delivery?" (3-5 min)
"A high-urgency notification (fraud alert) should try multiple channels: push first, then SMS if push fails, then email. A low-urgency notification (weekly digest) goes to email only."
Walk through the cascade:
- Primary channel: Based on user preference and notification type. Try push first.
- Fallback on failure: If push fails (device token invalid or user didn't engage within 5 minutes), try the next channel in the cascade.
- Deduplication: If the user sees the push notification, cancel the pending SMS/email. "The notification system must track cross-channel delivery and suppress duplicates."
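The cascade above can be sketched as a small state machine driven by explicit timestamps. This is an illustration under assumptions: the `Cascade` class, its `send` callable, and the polling `tick` method are hypothetical, and the 300-second default mirrors the 5-minute engagement window mentioned in the list.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Cascade:
    channels: list                      # ordered, e.g. ["push", "sms", "email"]
    send: Callable[[str], bool]         # returns False on an immediate failure
    fallback_after: float = 300.0       # seconds to wait for engagement
    index: int = -1
    engaged: bool = False
    sent_at: float = 0.0
    attempted: list = field(default_factory=list)

    def start(self, now: float) -> None:
        self._advance(now)

    def _advance(self, now: float) -> None:
        """Move down the chain until a channel accepts the message."""
        while self.index + 1 < len(self.channels):
            self.index += 1
            channel = self.channels[self.index]
            self.attempted.append(channel)
            if self.send(channel):
                self.sent_at = now
                return

    def mark_engaged(self) -> None:
        """User saw or opened the notification: pending fallbacks are cancelled."""
        self.engaged = True

    def tick(self, now: float) -> None:
        """Called periodically; escalates only if there was no engagement in time."""
        if not self.engaged and now - self.sent_at >= self.fallback_after:
            self._advance(now)


# User engages with the push, so SMS and email are never sent.
seen = Cascade(["push", "sms", "email"], send=lambda ch: True)
seen.start(now=0)
seen.mark_engaged()
seen.tick(now=400)

# No engagement: the cascade escalates one channel per window.
ignored = Cascade(["push", "sms", "email"], send=lambda ch: True)
ignored.start(now=0)
ignored.tick(now=300)
ignored.tick(now=600)
print(seen.attempted, ignored.attempted)
```

Cross-channel dedup here is just the `engaged` flag: once it is set, no later channel fires for that notification.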
Phase 6: Wrap-Up
"Notification systems are a restraint engineering problem. The technology (Kafka, APNs, SendGrid) is commodity. The Staff-level insight: the system's most important function is deciding NOT to send. Frequency caps, notification scoring, smart timing, and batching — these are the features that determine whether users keep notifications enabled or disable them forever."
Quick-Reference: The 30-Second Cheat Sheet
| Topic | The L5 Answer | The L6 Answer (say this) |
|---|---|---|
| Purpose | "Send notifications" | "Decide IF, WHEN, and HOW to notify — sending is the easy part" |
| Architecture | "Service sends a push" | "Event → routing (preferences, frequency cap, scoring) → channel delivery → tracking" |
| Fatigue | "Users can disable notifications" | "Per-user daily budget, notification scoring, digest batching, smart timing" |
| Multi-channel | "Send to all channels" | "Priority cascade with fallback + cross-channel dedup" |
| Metrics | "Track sends" | "Track sent → delivered → read funnel per channel per notification type" |
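The funnel in the Metrics row can be computed from log rows with a simple counter. The sketch assumes each row records the furthest stage a notification reached, using the hypothetical status values `"sent"`, `"delivered"`, and `"read"`; the row shape is also an assumption.

```python
from collections import Counter

STAGES = ("sent", "delivered", "read")


def funnel(log_rows: list) -> Counter:
    """Count sent -> delivered -> read per (channel, notification type).

    A row whose status is "read" also counts as sent and delivered.
    """
    counts = Counter()
    for row in log_rows:
        reached = STAGES.index(row["status"])
        for stage in STAGES[: reached + 1]:
            counts[(row["channel"], row["type"], stage)] += 1
    return counts


def rate(counts: Counter, channel: str, ntype: str, stage: str) -> float:
    """Fraction of sent notifications that reached `stage`."""
    sent = counts[(channel, ntype, "sent")]
    return counts[(channel, ntype, stage)] / sent if sent else 0.0


# Hypothetical log: four push notifications of type "like".
rows = [
    {"channel": "push", "type": "like", "status": "sent"},
    {"channel": "push", "type": "like", "status": "delivered"},
    {"channel": "push", "type": "like", "status": "delivered"},
    {"channel": "push", "type": "like", "status": "read"},
]
counts = funnel(rows)
print(rate(counts, "push", "like", "delivered"), rate(counts, "push", "like", "read"))
```

Breaking the funnel down by channel and type is what lets a drop in read rate be attributed to a specific notification category instead of to the platform as a whole.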