StaffSignal

Design a URL Shortener

Staff-Level Playbook

Technologies referenced in this playbook: Redis · PostgreSQL

How to Use This Playbook

This playbook supports three reading modes:

| Mode | Time | What to Read |
| --- | --- | --- |
| Quick Review | 15 min | Executive Summary → Interview Walkthrough → Fault Lines (§3) → Drills (§7) |
| Targeted Study | 1-2 hrs | Interview Walkthrough + Core Flow, expand appendices where you're weak |
| Deep Dive | 3+ hrs | Everything, including all appendices |
What is URL Shortening? — Why interviewers pick this topic

The Problem

URL shortening maps long URLs to short, unique aliases (e.g., sho.rt/xK9m2 → https://example.com/very/long/path?with=params). The short link redirects to the original. At scale, this becomes a distributed key-generation, high-throughput redirect, and analytics-pipeline problem.

Common Use Cases

  • Marketing & Attribution: Track click-through rates across campaigns, channels, and geographies
  • Character-Limited Sharing: SMS, tweets, QR codes where URL length matters
  • Link Management: Branded short domains, expiration policies, A/B redirect targets
  • Internal Tooling: Short aliases for dashboards, runbooks, incident links

Why Interviewers Ask About This

URL shortening surfaces the core Staff-level skill: reasoning about ID generation at scale. The naive solution (hash + collision check) breaks under concurrency. The real interview tests whether you can design a globally unique, non-predictable, low-latency key generation system — and whether you understand the read-heavy redirect path as a caching and availability problem.

Mechanics Refresher: Base62 Encoding & Key Space Math — Why 7 characters gives you 3.5 trillion keys

Base62 alphabet: [a-zA-Z0-9] = 62 characters.

| Key Length | Unique Keys | Enough For |
| --- | --- | --- |
| 6 chars | 56.8 billion | ~1,800 years at 1K writes/sec |
| 7 chars | 3.52 trillion | ~111,000 years at 1K writes/sec |
| 8 chars | 218 trillion | Effectively unlimited |

Why not Base64? Standard Base64 uses + and /, which must be percent-escaped in URLs (the URL-safe variant swaps in - and _ instead). Base62 sticks to [a-zA-Z0-9] and is URL-safe by default.
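A minimal Base62 codec is worth having in your head. This sketch uses a lowercase-first alphabet; the ordering is a convention, and any fixed permutation works:

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
BASE = len(ALPHABET)  # 62

def encode(n: int) -> str:
    """Encode a non-negative integer as a Base62 string."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, BASE)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))  # most-significant digit first

def decode(s: str) -> int:
    """Invert encode(); assumes s contains only alphabet characters."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return n
```

Every ID below 62^7 encodes to at most 7 characters, which is where the key-space numbers in the table come from.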

Collision math (birthday paradox): with 7-char keys (N ≈ 3.52 trillion), the 50% collision point arrives at roughly 1.18√N ≈ 2.2 million URLs, far sooner than intuition suggests. At 1 billion URLs, a collision is a statistical certainty: expect on the order of 140,000 colliding pairs. This matters for hash-based approaches but not for counter-based approaches, which never collide.
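The 7-char birthday math can be checked in a few lines:

```python
import math

N = 62 ** 7  # ≈ 3.52 trillion possible 7-char keys

# Items needed for ~50% collision probability: about 1.1774 * sqrt(N)
n_50 = 1.1774 * math.sqrt(N)  # ≈ 2.2 million

def expected_collisions(n: int) -> float:
    """Expected number of colliding pairs among n uniformly random keys."""
    return n * (n - 1) / (2 * N)
```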

Why this matters later: Fault Line 1 (Hash vs Counter) and Fault Line 3 (Predictability) both depend on understanding key space exhaustion rates and collision math.

Mechanics Refresher: 301 vs 302 Redirects — The caching decision that changes your analytics
| Redirect | HTTP Semantics | Browser Behavior | Analytics Impact |
| --- | --- | --- | --- |
| 301 Moved Permanently | Cacheable by default | Browser caches, skips your server on repeat visits | You lose click tracking after first visit |
| 302 Found | Not cached by default | Browser hits your server every time | Full analytics, higher server load |
| 307 Temporary Redirect | Like 302, preserves HTTP method | Same as 302 for GET | Use when POST redirect matters |

The tension: 301 reduces server load (browser caches the redirect) but blinds your analytics. 302 gives full visibility but every click hits your infrastructure.
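One middle ground worth naming in the interview: a 302 with a short Cache-Control TTL lets the CDN absorb burst traffic while keeping most clicks visible. A sketch, with an illustrative 60-second TTL:

```python
def redirect_response(long_url: str, permanent: bool = False) -> tuple[int, dict]:
    """Build redirect status + headers. A short max-age on the 302 lets
    CDNs and browsers absorb repeat clicks briefly without losing all
    click tracking, as a 301 would."""
    if permanent:
        return 301, {"Location": long_url}  # cached indefinitely by default
    return 302, {
        "Location": long_url,
        "Cache-Control": "public, max-age=60",
    }
```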

Why this matters later: Fault Line 4 (Redirect Semantics) forces you to choose between infrastructure cost and analytics completeness. Most candidates pick 301 without realizing they've killed click tracking.

What This Interview Actually Tests

URL shortening is not a hashing question. Everyone can Base62-encode a number.

This is a distributed key generation and read-path optimization question that tests:

  • Whether you can design a globally unique ID generation scheme without coordination bottlenecks
  • Whether you reason about the read path (redirects) as a caching and availability problem
  • Whether you understand that analytics is the real product, not the short URL itself
  • Whether you can articulate who owns the blast radius of key collisions

The key insight: The write path (URL creation) is the interesting distributed systems problem. The read path (redirect) is the interesting performance and caching problem. Most candidates spend 30 minutes on the wrong one.

The L5 vs L6 Contrast (Memorize This)

Level Calibration
| Behavior | L5 (Senior) | L6 (Staff) |
| --- | --- | --- |
| First move | Draws hash function + collision check | Asks "What's the read:write ratio and what analytics do we need?" |
| Key generation | MD5/SHA → truncate → check DB | Designs a coordination-free counter (Snowflake, pre-allocated ranges) |
| Read path | "Add a cache in front of the DB" | Quantifies cache hit ratio at 99.9%+, designs for cache-only serving |
| Failure awareness | "Add replicas" | Asks "What happens when two services generate the same key? Who detects it?" |
| Ownership | Focuses on the shortening service | Frames analytics pipeline as the primary value — short URLs are just the entry point |

The Three Intents (Pick One and Commit)

| Intent | Constraint | Strategy | Key Generation Approach |
| --- | --- | --- | --- |
| High-Throughput Link Creation | Write speed is everything | Pre-allocated ID ranges, no coordination on write | Counter-based, batch allocation |
| Analytics-First Platform | Click data completeness is everything | 302 redirects, event pipeline, real-time dashboards | Any, but redirect path must emit events |
| Branded Link Management | Custom aliases, expiration, A/B testing | CRUD API with slug validation, TTL management | User-chosen + auto-generated hybrid |

The Four Fault Lines (The Core of This Interview)

Who Pays Analysis
| # | Fault Line | The Tension | Staff Question |
| --- | --- | --- | --- |
| 1 | Hash vs Counter | Hash is stateless but collides; counters need coordination but never collide | "How do you generate globally unique keys without a central bottleneck?" |
| 2 | Write Path Coordination | Pre-allocation is fast but wastes ranges on crash; single counter is simple but bottlenecked | "What happens when a node crashes mid-range?" |
| 3 | Key Predictability | Sequential keys are guessable; random keys waste cache locality | "Can an attacker enumerate all your short URLs?" |
| 4 | Redirect Semantics & Caching | 301 saves infrastructure but kills analytics; 302 preserves data but costs more | "How do you serve 100K redirects/sec with sub-10ms p99?" |

System Architecture Overview

[Architecture diagram]

Interview Walkthrough: How to Present This in 45 Minutes

Most interview prep covers the basics — step-by-step architecture walkthroughs at tutorial pace. This section is different. Senior candidates spend 25 minutes on the basics and run out of time before reaching anything interesting. Staff candidates speed through the baseline in 10-12 minutes — fast enough to spend the remaining 30+ minutes on the fault lines, failure modes, and ownership questions that actually determine your level.

The six phases below add up to 45 minutes. The ratios matter: phases 1-4 are deliberately compressed so phase 5 gets the lion's share of time. If you're spending more than 12 minutes before the transition to depth, you're pacing like an L5.

Phase 1: Requirements & Framing (2-3 minutes)

State functional requirements in 30 seconds — don't enumerate, state the category:

  • "Users create short URLs that redirect to long URLs. The system tracks click analytics and supports custom aliases and TTL-based expiry."

That's it. Don't describe the entire UI flow. The interviewer knows what a URL shortener does.

Invest time on non-functional requirements (this is the Staff move):

  • "The hard constraint is the read-to-write ratio: 100:1 or higher. We optimize the redirect path — every millisecond of redirect latency is user-visible."
  • "URL uniqueness guarantee: each short code maps to exactly one long URL, even under concurrent creation."
  • "Analytics at scale: every click generates an analytics event. At 100B redirects/month, that's ~38K events/second."
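The ~38K figure is worth being able to derive on the spot:

```python
redirects_per_month = 100e9           # 100B redirects/month
seconds_per_month = 30 * 24 * 3600    # ≈ 2.59 million seconds
events_per_second = redirects_per_month / seconds_per_month  # ≈ 38,580
```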

Phase 2: Core Entities & API (1-2 minutes)

State entities quickly (30 seconds):

  • ShortURL: short code, long URL, creation timestamp, expiry, owner
  • IDRange: pre-allocated range of IDs for collision-free generation (e.g., range 1M-2M assigned to server A)
  • ClickEvent: timestamp, short code, referrer, user agent, geo — analytics payload

API (1 minute):

POST /urls            → { long_url, custom_alias?, ttl? } → { short_url }
GET  /{code}          → 302 redirect (or 301 for permanent)
GET  /urls/{code}/stats → click analytics (count, geo, referrer breakdown)

Phase 3: High-Level Architecture (5-7 minutes)

[High-level architecture diagram]

Walk through the flow:

  1. CDN Edge → First layer for cached 301 redirects; cache miss goes to Redirect Service
  2. Redirect Service → Looks up short code in Redis cache (hit rate: ~95%), falls back to database on miss
  3. Analytics Pipeline → Every redirect fires an async event to Kafka for click tracking — never blocks the redirect
  4. Write Service + ID Allocator → Creates short URLs using pre-allocated ID ranges; no coordination needed between write servers

Key points to hit on the whiteboard:

  1. Read path is king — CDN → Redis → DB fallback. Three layers, sub-10ms for 95% of requests
  2. Analytics are async — fire-and-forget to Kafka, never on the redirect critical path
  3. ID generation is pre-allocated — no hash collisions, no database-level dedup on writes
  4. Write path is simple — take next ID from allocated range, base62-encode, store

Phase 4: Transition to Depth (1 minute)

At this point you've spent ~12 minutes. Now pivot:

"The basic architecture is straightforward — redirect service with Redis cache, async analytics. What makes this a Staff-level problem is three things: (1) ID generation strategy — hashing vs pre-allocated ranges vs sequential, (2) the analytics pipeline at 38K events/second, (3) hot URL handling when a single short URL gets millions of clicks."

Then offer the interviewer a choice:

"I can go deep on any of these. Which is most interesting to you?"

If the interviewer doesn't have a preference, lead with ID generation — it's the most differentiating topic.

Phase 5: Deep Dives (25-30 minutes)

The interviewer will steer, but be prepared to go deep on any of these. For each, follow the Staff pattern: state the tradeoff → pick a position → quantify the cost → explain who absorbs that cost.

Fault Line 1: ID generation strategy (7-10 min)

Open with the three approaches:

"Option A: Hash the long URL (MD5/SHA-256, take the first 7 Base62 characters). Simple, but the birthday paradox bites early: with a 62^7 key space, collision probability hits 50% at roughly 1.18 × √(62^7) ≈ 2.2 million URLs, and at billions of URLs collisions are routine. Every write needs a check-and-retry loop against the store."
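A sketch of Option A, to make the collision handling concrete. The salt-and-retry loop is the hidden cost; the helper name is illustrative:

```python
import hashlib

ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def hash_code(long_url: str, salt: str = "", length: int = 7) -> str:
    """SHA-256 the URL, take 64 bits, render as `length` Base62 chars.
    On a collision in the store, the caller must retry with a new salt."""
    digest = hashlib.sha256((long_url + salt).encode()).digest()
    n = int.from_bytes(digest[:8], "big")
    chars = []
    for _ in range(length):
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(chars)
```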

"Option B: Global auto-increment counter (e.g., PostgreSQL SERIAL). No collisions, but the database becomes a bottleneck at high write rates — every write serializes on the counter."

"Option C: Pre-allocated ID ranges. A range allocator hands out chunks of 1M IDs to each write server. Server A gets IDs 1M-2M, Server B gets 2M-3M. Each server increments locally — zero coordination. When a range is exhausted, request a new one."

Pick a position: "I'd use Option C. It eliminates collisions, requires no per-write coordination, and scales linearly with write servers. The range allocator is a simple counter in ZooKeeper or Redis; coordination cost drops from once per write to once per million writes."
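Option C reduces to a few lines. In this sketch, the shared counter is an in-process stand-in for the ZooKeeper/Redis allocator:

```python
import itertools
import threading

class RangeAllocator:
    """Hands out disjoint ID ranges. In production the counter lives in a
    coordination service (ZooKeeper/Redis), not in process memory."""
    def __init__(self, range_size: int = 1_000_000):
        self.range_size = range_size
        self._next_range = itertools.count(0)  # stand-in for shared counter
        self._lock = threading.Lock()

    def allocate(self) -> range:
        with self._lock:
            start = next(self._next_range) * self.range_size
        return range(start, start + self.range_size)

class WriteServer:
    """Each write server consumes its range locally: zero per-write coordination."""
    def __init__(self, allocator: RangeAllocator):
        self.allocator = allocator
        self._ids = iter(self.allocator.allocate())

    def next_id(self) -> int:
        try:
            return next(self._ids)
        except StopIteration:            # range exhausted: fetch a new one
            self._ids = iter(self.allocator.allocate())
            return next(self._ids)
```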

Then address custom aliases: "Custom aliases bypass the ID allocator — the user-provided string goes directly into the URL table. We check for conflicts at write time: if the alias already exists, reject with 409 Conflict. Custom aliases are stored in the same table but with a flag distinguishing them from auto-generated codes."

Fault Line 2: Analytics pipeline at scale (5-7 min)

"Every redirect generates a click event: timestamp, short code, IP, user agent, referrer, geo (derived from IP). At 38K events/second, that's 3.3 billion events/day. How do we store and query this?"

Walk through the pipeline:

  1. Redirect Service fires event to Kafka (async, non-blocking)
  2. Stream processor (Flink/Kafka Streams) enriches events: IP → geo lookup, user agent parsing
  3. Real-time counters in Redis: total clicks per code, updated every event
  4. Batch sink to ClickHouse (columnar store): full event history for analytical queries (clicks by day, by geo, by referrer)

The key tradeoff: "Real-time counters give instant click counts but no dimensional breakdown. ClickHouse gives full analytics but with 30-60 second query latency. For the dashboard, I'd show real-time total count from Redis and lazy-load dimensional breakdowns from ClickHouse."

Fault Line 3: Hot URL handling (5-7 min)

"A tweet goes viral with a short URL. That URL goes from 10 clicks/sec to 500,000 clicks/sec in 5 minutes. The Redis cache entry for that code becomes a hot key — every redirect request hits the same cache shard."

Mitigations in order:

  1. CDN caching — For 302 redirects, set a short cache-control header (e.g., 60 seconds). The CDN absorbs 99% of traffic for popular URLs. For 301 redirects, the browser caches forever — the CDN barely sees traffic.
  2. Read replicas — Redis read replicas with client-side routing. Distribute reads across 3-5 replicas for the hot shard.
  3. Local cache — In-process LRU cache on each redirect server. Cache the top 10K most-accessed codes in memory. Cache hit eliminates even the Redis round-trip.
  4. Tiered strategy — Combine all three: local cache (sub-μs) → CDN (1-2ms) → Redis (2-5ms) → DB (10-50ms). For a viral URL, 99.99% of requests never leave the CDN or local cache.
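The in-process and shared-cache tiers compose naturally. A sketch where redis_get and db_get stand in for real client calls; note that lru_cache also caches misses, which doubles as negative caching:

```python
from functools import lru_cache

def make_resolver(redis_get, db_get, local_size: int = 10_000):
    """Tiered lookup: in-process LRU -> Redis -> DB."""
    @lru_cache(maxsize=local_size)       # tier 1: sub-microsecond
    def resolve(code: str):
        url = redis_get(code)            # tier 2: shared cache (2-5ms)
        if url is None:
            url = db_get(code)           # tier 3: source of truth (10-50ms)
        return url
    return resolve
```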

URL expiry and cleanup (3-5 min)

"Short URLs with TTL need cleanup. But we can't just delete expired URLs — someone might have bookmarked them. Options: (1) hard delete after TTL + 30 days grace period, (2) soft delete that returns a 'this URL has expired' page with the original long URL shown, (3) tombstone that prevents re-use of the short code for 1 year."

"I'd use option 2 — soft delete with a grace period. The expired URL still resolves but shows a landing page instead of redirecting. After 30 days, hard delete and return the short code to the ID pool for reuse."

Phase 6: Wrap-Up (2-3 minutes)

Summarize the key insight — don't just restate your architecture:

"URL shortening is the simplest system design question, which makes it the most dangerous. Every candidate can build a working URL shortener in 15 minutes. What separates Staff from Senior is reasoning about the 100:1 read-write asymmetry, the ID generation strategy that eliminates coordination, and the analytics pipeline that handles 38K events/second without blocking the redirect path."

If time permits, add the operational insight:

"The hardest operational problem isn't redirect performance — it's abuse. Short URLs are a perfect tool for phishing and malware distribution. The system needs a URL safety check (Google Safe Browsing API) on creation and a reporting mechanism for malicious URLs. This is where the simple system design question becomes a real-world engineering problem."

Common Timing Mistakes

Level Calibration
| Mistake | L5 Does This | L6 Does This |
| --- | --- | --- |
| 10 min on requirements | Lists custom domains, QR codes, link previews | States read-write ratio in 1 min, moves to what's hard |
| 15 min on hashing | MD5 vs SHA256 vs base62, collision math on the whiteboard | "Pre-allocated ID ranges. Zero collisions. Moving on." |
| No analytics discussion | "We log clicks" with no pipeline | Designs a Kafka → Flink → ClickHouse pipeline with real-time counters |
| No hot URL handling | Assumes uniform traffic | Volunteers hot-key mitigation with quantified CDN absorption |
| Spreads thin | Touches 6 topics at surface level | Goes deep on 2-3 fault lines with numbers |
| No abuse discussion | Ignores the trust & safety angle | Names URL safety checking and reporting as operational concerns |

Reading the Interviewer

| Interviewer Signal | What They Care About | Where to Go Deep |
| --- | --- | --- |
| Asks about collisions | Algorithm/data structure depth | ID generation strategy comparison |
| Asks about scale | Distributed systems | Hot URL handling, Redis sharding, CDN caching |
| Asks about analytics | Data engineering | Click pipeline, real-time vs batch, ClickHouse |
| Asks about 301 vs 302 | Product/caching tradeoffs | Redirect semantics, cache implications, analytics impact |
| Asks about custom URLs | Edge cases | Alias conflict handling, reserved words, namespace management |
| Pushes back on your architecture | Wants to see you defend or adapt | State your reasoning, acknowledge alternatives, explain your tradeoff |

What to Deliberately Skip

Level Calibration
| Topic | Why L5 Goes Here | What L6 Says Instead |
| --- | --- | --- |
| Hash function comparison | Textbook material | "Pre-allocated ranges. No hashing needed. Moving on." |
| Database schema | Feels productive | "Two columns: code, long_url. Plus metadata. Trivial." |
| QR code generation | Feature creep | "Client-side library generates QR from the short URL. Not a backend concern." |
| Custom domain support | Complex but not differentiating | "DNS CNAME + host header routing. Operational, not architectural." |
| Link preview / Open Graph | Nice-to-have | "Proxy the target page's OG tags. Separate service." |


Core sections
  • 1. The Staff Lens
  • 2. Problem Framing & Intent
  • 3. The Four Fault Lines
  • 4. Failure Modes & Degradation
  • 5. Evaluation Rubric
  • 6. Interview Flow & Pivots
  • 7. Active Drills
  • 8. Deep Dive Scenarios
  • 9. Level Expectations Summary
  • 10. Staff Insiders: Controversial Opinions