Technologies referenced in this playbook: Redis · PostgreSQL · Kafka · Flink · ClickHouse · ZooKeeper
How to Use This Playbook
This playbook supports three reading modes:
| Mode | Time | What to Read |
|---|---|---|
| Quick Review | 15 min | Executive Summary → Interview Walkthrough → Fault Lines (§3) → Drills (§7) |
| Targeted Study | 1-2 hrs | Interview Walkthrough + Core Flow, expand appendices where you're weak |
| Deep Dive | 3+ hrs | Everything, including all appendices |
What is URL Shortening? — Why interviewers pick this topic
The Problem
URL shortening maps long URLs to short, unique aliases (e.g., sho.rt/xK9m2 → https://example.com/very/long/path?with=params). The short link redirects to the original. At scale, this becomes a distributed key generation, high-throughput redirect, and analytics pipeline problem.
Common Use Cases
- Marketing & Attribution: Track click-through rates across campaigns, channels, and geographies
- Character-Limited Sharing: SMS, tweets, QR codes where URL length matters
- Link Management: Branded short domains, expiration policies, A/B redirect targets
- Internal Tooling: Short aliases for dashboards, runbooks, incident links
Why Interviewers Ask About This
URL shortening surfaces the core Staff-level skill: reasoning about ID generation at scale. The naive solution (hash + collision check) breaks under concurrency. The real interview tests whether you can design a globally unique, non-predictable, low-latency key generation system — and whether you understand the read-heavy redirect path as a caching and availability problem.
Mechanics Refresher: Base62 Encoding & Key Space Math — Why 7 characters gives you 3.5 trillion keys
Base62 alphabet: [a-zA-Z0-9] = 62 characters.
| Key Length | Unique Keys | Enough For |
|---|---|---|
| 6 chars | 56.8 billion | ~1.8 years at 1K writes/sec |
| 7 chars | 3.52 trillion | ~111 years at 1K writes/sec |
| 8 chars | 218 trillion | Effectively unlimited |
Why not Base64? Standard Base64 includes + and /, which require escaping in URLs (and = padding adds noise); even the URL-safe variant swaps in - and _, which read poorly in short links. Base62 is URL-safe by default.
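The encoding and the key-space math can be checked in a few lines. This is a minimal sketch; `base62_encode` is an illustrative helper, not an API defined by this playbook:

```python
# Base62 alphabet matching the refresher above: [a-zA-Z0-9], 62 characters.
ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def base62_encode(n: int) -> str:
    """Encode a non-negative integer as a Base62 string."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

# Key-space sizes from the table:
print(62 ** 6)  # 56800235584 (~56.8 billion)
print(62 ** 7)  # 3521614606208 (~3.52 trillion)

# Exhaustion time for 7-char keys at 1K writes/sec:
seconds_per_year = 365 * 24 * 3600
print(62 ** 7 / 1000 / seconds_per_year)  # ~111.7 years
```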
Collision probability (birthday paradox): with 7-char keys and 1 billion URLs already stored, each new hashed key has a ~0.03% chance of colliding with an existing one (n/N); at 10 billion, ~0.3%. Across the full set, collisions are essentially certain: the birthday estimate n²/2N predicts ~142,000 collisions at 1 billion URLs. This matters for hash-based approaches but not for counter-based approaches.
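Both collision figures fall out of two one-line formulas (function names here are illustrative):

```python
# Collision math for 7-char Base62 keys: N = 62**7 ≈ 3.52 trillion.
N = 62 ** 7

def per_insert_collision_prob(n: int) -> float:
    """Chance that one new random key collides with n existing keys: n / N."""
    return n / N

def expected_collisions(n: int) -> float:
    """Birthday-paradox estimate of collisions among n random keys: n^2 / 2N."""
    return n * n / (2 * N)

print(f"{per_insert_collision_prob(1_000_000_000):.4%}")  # ~0.0284%
print(f"{expected_collisions(1_000_000_000):,.0f}")       # ~141,980
```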
Why this matters later: Fault Line 1 (Hash vs Counter) and Fault Line 3 (Predictability) both depend on understanding key space exhaustion rates and collision math.
Mechanics Refresher: 301 vs 302 Redirects — The caching decision that changes your analytics
| Redirect | HTTP Semantics | Browser Behavior | Analytics Impact |
|---|---|---|---|
| 301 Moved Permanently | Cacheable by default | Browser caches, skips your server on repeat visits | You lose click tracking after first visit |
| 302 Found | Not cached by default | Browser hits your server every time | Full analytics, higher server load |
| 307 Temporary Redirect | Like 302, preserves HTTP method | Same as 302 for GET | Use when POST redirect matters |
The tension: 301 reduces server load (browser caches the redirect) but blinds your analytics. 302 gives full visibility but every click hits your infrastructure.
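The decision can be reduced to a pure function over status and headers. This is an illustrative sketch, and the 60-second Cache-Control TTL is one reasonable middle ground, not a prescription:

```python
def redirect_response(target: str, permanent: bool) -> tuple[int, dict]:
    """Return (status, headers) for a short-URL redirect to `target`."""
    if permanent:
        # 301: browsers cache indefinitely, so repeat clicks never reach
        # your servers. Cheap, but click tracking stops after the first visit.
        return 301, {"Location": target}
    # 302 with a short CDN TTL: browsers re-request every time, the CDN
    # absorbs hot-URL bursts, and analytics stays (nearly) complete.
    return 302, {"Location": target, "Cache-Control": "public, max-age=60"}
```

The short `max-age` on the 302 is the hedge: it trades at most 60 seconds of analytics granularity for CDN absorption of viral traffic.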
Why this matters later: Fault Line 4 (Redirect Semantics) forces you to choose between infrastructure cost and analytics completeness. Most candidates pick 301 without realizing they've killed click tracking.
What This Interview Actually Tests
URL shortening is not a hashing question. Everyone can Base62-encode a number.
This is a distributed key generation and read-path optimization question that tests:
- Whether you can design a globally unique ID generation scheme without coordination bottlenecks
- Whether you reason about the read path (redirects) as a caching and availability problem
- Whether you understand that analytics is the real product, not the short URL itself
- Whether you can articulate who owns the blast radius of key collisions
The key insight: The write path (URL creation) is the interesting distributed systems problem. The read path (redirect) is the interesting performance and caching problem. Most candidates spend 30 minutes on the wrong one.
The L5 vs L6 Contrast (Memorize This)
| Behavior | L5 (Senior) | L6 (Staff) |
|---|---|---|
| First move | Draws hash function + collision check | Asks "What's the read:write ratio and what analytics do we need?" |
| Key generation | MD5/SHA → truncate → check DB | Designs a coordination-free counter (Snowflake, pre-allocated ranges) |
| Read path | "Add a cache in front of the DB" | Quantifies cache hit ratio at 99.9%+, designs for cache-only serving |
| Failure awareness | "Add replicas" | Asks "What happens when two services generate the same key? Who detects it?" |
| Ownership | Focuses on the shortening service | Frames analytics pipeline as the primary value — short URLs are just the entry point |
The Three Intents (Pick One and Commit)
| Intent | Constraint | Strategy | Key Generation Approach |
|---|---|---|---|
| High-Throughput Link Creation | Write speed is everything | Pre-allocated ID ranges, no coordination on write | Counter-based, batch allocation |
| Analytics-First Platform | Click data completeness is everything | 302 redirects, event pipeline, real-time dashboards | Any, but redirect path must emit events |
| Branded Link Management | Custom aliases, expiration, A/B testing | CRUD API with slug validation, TTL management | User-chosen + auto-generated hybrid |
The Four Fault Lines (The Core of This Interview)
| # | Fault Line | The Tension | Staff Question |
|---|---|---|---|
| 1 | Hash vs Counter | Hash is stateless but collides; counters need coordination but never collide | "How do you generate globally unique keys without a central bottleneck?" |
| 2 | Write Path Coordination | Pre-allocation is fast but wastes ranges on crash; single counter is simple but bottlenecked | "What happens when a node crashes mid-range?" |
| 3 | Key Predictability | Sequential keys are guessable; random keys waste cache locality | "Can an attacker enumerate all your short URLs?" |
| 4 | Redirect Semantics & Caching | 301 saves infrastructure but kills analytics; 302 preserves data but costs more | "How do you serve 100K redirects/sec with sub-10ms p99?" |
System Architecture Overview
Interview Walkthrough: How to Present This in 45 Minutes
Most interview prep covers the basics — step-by-step architecture walkthroughs at tutorial pace. This section is different. Senior candidates spend 25 minutes on the basics and run out of time before reaching anything interesting. Staff candidates speed through the baseline in 10-12 minutes — fast enough to spend the remaining 30+ minutes on the fault lines, failure modes, and ownership questions that actually determine your level.
The six phases below add up to 45 minutes. The ratios matter: phases 1-4 are deliberately compressed so phase 5 gets the lion's share of time. If you're spending more than 12 minutes before the transition to depth, you're pacing like an L5.
Phase 1: Requirements & Framing (2-3 minutes)
State functional requirements in 30 seconds — don't enumerate, state the category:
- "Users create short URLs that redirect to long URLs. The system tracks click analytics and supports custom aliases and TTL-based expiry."
That's it. Don't describe the entire UI flow. The interviewer knows what a URL shortener does.
Invest time on non-functional requirements (this is the Staff move):
- "The hard constraint is the read-to-write ratio: 100:1 or higher. We optimize the redirect path — every millisecond of redirect latency is user-visible."
- "URL uniqueness guarantee: no two short codes map to different long URLs, even under concurrent creation."
- "Analytics at scale: every click generates an analytics event. At 100B redirects/month, that's ~38K events/second."
Phase 2: Core Entities & API (1-2 minutes)
State entities quickly (30 seconds):
- ShortURL: short code, long URL, creation timestamp, expiry, owner
- IDRange: pre-allocated range of IDs for collision-free generation (e.g., range 1M-2M assigned to server A)
- ClickEvent: timestamp, short code, referrer, user agent, geo — analytics payload
API (1 minute):
POST /urls → { long_url, custom_alias?, ttl? } → { short_url }
GET /{code} → 302 redirect (or 301 for permanent)
GET /urls/{code}/stats → click analytics (count, geo, referrer breakdown)
Phase 3: High-Level Architecture (5-7 minutes)
Walk through the flow:
- CDN Edge → First layer for cached 301 redirects; cache miss goes to Redirect Service
- Redirect Service → Looks up short code in Redis cache (hit rate: ~95%), falls back to database on miss
- Analytics Pipeline → Every redirect fires an async event to Kafka for click tracking — never blocks the redirect
- Write Service + ID Allocator → Creates short URLs using pre-allocated ID ranges; no coordination needed between write servers
Key points to hit on the whiteboard:
- Read path is king — CDN → Redis → DB fallback. Three layers, sub-10ms for 95% of requests
- Analytics are async — fire-and-forget to Kafka, never on the redirect critical path
- ID generation is pre-allocated — no hash collisions, no database-level dedup on writes
- Write path is simple — take next ID from allocated range, base62-encode, store
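The write path above fits in a toy sketch, assuming an in-memory dict stands in for the URL table and the server has already leased the ID range starting at 1,000,000 (all names here are hypothetical):

```python
import itertools

ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def base62(n: int) -> str:
    """Base62-encode a non-negative integer."""
    s = ""
    while True:
        n, rem = divmod(n, 62)
        s = ALPHABET[rem] + s
        if n == 0:
            return s

store = {}                            # stands in for the URL table
next_id = itertools.count(1_000_000)  # start of this server's leased range

def create_short_url(long_url: str) -> str:
    code = base62(next(next_id))  # local increment: no coordination
    store[code] = long_url        # single insert: no dedup check needed
    return code
```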
Phase 4: Transition to Depth (1 minute)
At this point you've spent ~12 minutes. Now pivot:
"The basic architecture is straightforward — redirect service with Redis cache, async analytics. What makes this a Staff-level problem is three things: (1) ID generation strategy — hashing vs pre-allocated ranges vs sequential, (2) the analytics pipeline at 38K events/second, (3) hot URL handling when a single short URL gets millions of clicks."
Then offer the interviewer a choice:
"I can go deep on any of these. Which is most interesting to you?"
If the interviewer doesn't have a preference, lead with ID generation — it's the most differentiating topic.
Phase 5: Deep Dives (25-30 minutes)
The interviewer will steer, but be prepared to go deep on any of these. For each, follow the Staff pattern: state the tradeoff → pick a position → quantify the cost → explain who absorbs that cost.
Fault Line 1: ID generation strategy (7-10 min)
Open with the three approaches:
"Option A: Hash the long URL (MD5/SHA256, take first 7 chars). Simple, but collisions are guaranteed at scale — the birthday paradox gives us a 50% collision probability at ~62^7/2 ≈ 1.8 trillion URLs. That sounds safe, but collision handling adds complexity."
"Option B: Global auto-increment counter (e.g., PostgreSQL SERIAL). No collisions, but the database becomes a bottleneck at high write rates — every write serializes on the counter."
"Option C: Pre-allocated ID ranges. A range allocator hands out chunks of 1M IDs to each write server. Server A gets IDs 1M-2M, Server B gets 2M-3M. Each server increments locally — zero coordination. When a range is exhausted, request a new one."
Pick a position: "I'd use Option C. It eliminates collisions, requires no per-write coordination, and scales linearly with write servers. The range allocator is a simple counter in ZooKeeper or Redis; coordination cost drops to one allocator call per million writes instead of one per write."
Then address custom aliases: "Custom aliases bypass the ID allocator — the user-provided string goes directly into the URL table. We check for conflicts at write time: if the alias already exists, reject with 409 Conflict. Custom aliases are stored in the same table but with a flag distinguishing them from auto-generated codes."
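A minimal single-process sketch of Option C, where a plain integer plus a lock stands in for the ZooKeeper/Redis counter (class and method names are hypothetical):

```python
import threading

class RangeAllocator:
    """Central allocator: hands out disjoint ID ranges. In production this
    is one counter in ZooKeeper or Redis; here it is an in-process int."""
    def __init__(self, range_size: int = 1_000_000):
        self.range_size = range_size
        self._next = 0
        self._lock = threading.Lock()

    def lease(self) -> tuple[int, int]:
        with self._lock:  # the ONLY coordination point in the system
            start = self._next
            self._next += self.range_size
        return start, start + self.range_size

class WriteServer:
    """Each write server burns through its leased range locally."""
    def __init__(self, allocator: RangeAllocator):
        self.allocator = allocator
        self.cur, self.end = allocator.lease()

    def next_id(self) -> int:
        if self.cur >= self.end:  # range exhausted:
            self.cur, self.end = self.allocator.lease()  # one call per million IDs
        n = self.cur
        self.cur += 1
        return n
```

Note the crash behavior this implies: a server that dies mid-range abandons the rest of its lease, which wastes IDs but never duplicates them. That is Fault Line 2's tradeoff in code.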
Fault Line 2: Analytics pipeline at scale (5-7 min)
"Every redirect generates a click event: timestamp, short code, IP, user agent, referrer, geo (derived from IP). At 38K events/second, that's 3.3 billion events/day. How do we store and query this?"
Walk through the pipeline:
- Redirect Service fires event to Kafka (async, non-blocking)
- Stream processor (Flink/Kafka Streams) enriches events: IP → geo lookup, user agent parsing
- Real-time counters in Redis: total clicks per code, updated every event
- Batch sink to ClickHouse (columnar store): full event history for analytical queries (clicks by day, by geo, by referrer)
The key tradeoff: "Real-time counters give instant click counts but no dimensional breakdown. ClickHouse gives full analytics but with 30-60 second query latency. For the dashboard, I'd show real-time total count from Redis and lazy-load dimensional breakdowns from ClickHouse."
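The fire-and-forget contract can be sketched in-process, with a `queue.Queue` standing in for the Kafka topic and a `Counter` for the Redis click counters (a toy model of the pipeline, not the pipeline itself):

```python
import collections
import queue
import threading
import time

events = queue.Queue(maxsize=100_000)  # stands in for the Kafka topic
clicks = collections.Counter()         # stands in for Redis real-time counters

def emit_click(code: str, meta: dict) -> None:
    """Called on the redirect path: enqueue and return immediately."""
    try:
        events.put_nowait({"code": code, "ts": time.time(), **meta})
    except queue.Full:
        pass  # drop rather than block the redirect: analytics is best-effort

def consumer() -> None:
    """Background worker playing the role of the stream processor."""
    while True:
        ev = events.get()
        if ev is None:
            break
        clicks[ev["code"]] += 1  # real-time counter update
        events.task_done()

threading.Thread(target=consumer, daemon=True).start()
```

The `queue.Full` branch encodes the key decision: when the pipeline backs up, the redirect path sheds analytics events rather than latency.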
Fault Line 3: Hot URL handling (5-7 min)
"A tweet goes viral with a short URL. That URL goes from 10 clicks/sec to 500,000 clicks/sec in 5 minutes. The Redis cache entry for that code becomes a hot key — every redirect request hits the same cache shard."
Mitigations in order:
- CDN caching — For 302 redirects, set a short cache-control header (e.g., 60 seconds). The CDN absorbs 99% of traffic for popular URLs. For 301 redirects, the browser caches forever — the CDN barely sees traffic.
- Read replicas — Redis read replicas with client-side routing. Distribute reads across 3-5 replicas for the hot shard.
- Local cache — In-process LRU cache on each redirect server. Cache the top 10K most-accessed codes in memory. Cache hit eliminates even the Redis round-trip.
- Tiered strategy — Combine all three: local cache (sub-μs) → CDN (1-2ms) → Redis (2-5ms) → DB (10-50ms). For a viral URL, 99.99% of requests never leave the CDN or local cache.
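The local-cache tier and fallback order look roughly like this sketch, where plain dicts stand in for Redis and the database (the CDN tier sits outside this process and is omitted):

```python
from collections import OrderedDict

class LocalLRU:
    """In-process LRU for the hottest codes (tier 1)."""
    def __init__(self, capacity: int = 10_000):
        self.capacity = capacity
        self.d = OrderedDict()

    def get(self, key):
        if key not in self.d:
            return None
        self.d.move_to_end(key)  # mark as recently used
        return self.d[key]

    def put(self, key, value):
        self.d[key] = value
        self.d.move_to_end(key)
        if len(self.d) > self.capacity:
            self.d.popitem(last=False)  # evict least recently used

local = LocalLRU()
redis_cache = {}  # stands in for Redis (tier 2)
database = {}     # stands in for the URL table (tier 3)

def resolve(code: str):
    url = local.get(code)                       # ~sub-microsecond
    if url is None and code in redis_cache:
        url = redis_cache[code]                 # ~2-5 ms over the network
    if url is None and code in database:
        url = database[code]                    # ~10-50 ms
        redis_cache[code] = url                 # backfill tier 2
    if url is not None:
        local.put(code, url)                    # promote into tier 1
    return url
```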
URL expiry and cleanup (3-5 min)
"Short URLs with TTL need cleanup. But we can't just delete expired URLs — someone might have bookmarked them. Options: (1) hard delete after TTL + 30 days grace period, (2) soft delete that returns a 'this URL has expired' page with the original long URL shown, (3) tombstone that prevents re-use of the short code for 1 year."
"I'd use option 2 — soft delete with a grace period. The expired URL still resolves but shows a landing page instead of redirecting. After 30 days, hard delete and return the short code to the ID pool for reuse."
Phase 6: Wrap-Up (2-3 minutes)
Summarize the key insight — don't just restate your architecture:
"URL shortening is the simplest system design question, which makes it the most dangerous. Every candidate can build a working URL shortener in 15 minutes. What separates Staff from Senior is reasoning about the 100:1 read-write asymmetry, the ID generation strategy that eliminates coordination, and the analytics pipeline that handles 38K events/second without blocking the redirect path."
If time permits, add the operational insight:
"The hardest operational problem isn't redirect performance — it's abuse. Short URLs are a perfect tool for phishing and malware distribution. The system needs a URL safety check (Google Safe Browsing API) on creation and a reporting mechanism for malicious URLs. This is where the simple system design question becomes a real-world engineering problem."
Common Timing Mistakes
| Mistake | L5 Does This | L6 Does This |
|---|---|---|
| 10 min on requirements | Lists custom domains, QR codes, link previews | States read-write ratio in 1 min, moves to what's hard |
| 15 min on hashing | MD5 vs SHA256 vs base62, collision math on the whiteboard | "Pre-allocated ID ranges. Zero collisions. Moving on." |
| No analytics discussion | "We log clicks" with no pipeline | Designs a Kafka → Flink → ClickHouse pipeline with real-time counters |
| No hot URL handling | Assumes uniform traffic | Volunteers hot-key mitigation with quantified CDN absorption |
| Spreads thin | Touches 6 topics at surface level | Goes deep on 2-3 fault lines with numbers |
| No abuse discussion | Ignores the trust & safety angle | Names URL safety checking and reporting as operational concerns |
Reading the Interviewer
| Interviewer Signal | What They Care About | Where to Go Deep |
|---|---|---|
| Asks about collisions | Algorithm/data structure depth | ID generation strategy comparison |
| Asks about scale | Distributed systems | Hot URL handling, Redis sharding, CDN caching |
| Asks about analytics | Data engineering | Click pipeline, real-time vs batch, ClickHouse |
| Asks about 301 vs 302 | Product/caching tradeoffs | Redirect semantics, cache implications, analytics impact |
| Asks about custom URLs | Edge cases | Alias conflict handling, reserved words, namespace management |
| Pushes back on your architecture | Wants to see you defend or adapt | State your reasoning, acknowledge alternatives, explain your tradeoff |
What to Deliberately Skip
| Topic | Why L5 Goes Here | What L6 Says Instead |
|---|---|---|
| Hash function comparison | Textbook material | "Pre-allocated ranges. No hashing needed. Moving on." |
| Database schema | Feels productive | "Two columns: code, long_url. Plus metadata. Trivial." |
| QR code generation | Feature creep | "Client-side library generates QR from the short URL. Not a backend concern." |
| Custom domain support | Complex but not differentiating | "DNS CNAME + host header routing. Operational, not architectural." |
| Link preview / Open Graph | Nice-to-have | "Proxy the target page's OG tags. Separate service." |