StaffSignal

Design a URL Shortener

Staff-Level Playbook

Technologies referenced in this playbook: Redis · PostgreSQL

How to Use This Playbook

This playbook supports three reading modes:

| Mode | Time | What to Read |
| --- | --- | --- |
| Quick Review | 15 min | Executive Summary → Interview Walkthrough → Fault Lines (§3) → Drills (§7) |
| Targeted Study | 1-2 hrs | Interview Walkthrough + Core Flow, expand appendices where you're weak |
| Deep Dive | 3+ hrs | Everything, including all appendices |
What is URL Shortening? — Why interviewers pick this topic

The Problem

URL shortening maps long URLs to short, unique aliases (e.g., sho.rt/xK9m2 → https://example.com/very/long/path?with=params). The short link redirects to the original. At scale, this becomes a distributed key-generation, high-throughput redirect, and analytics-pipeline problem.

Common Use Cases

  • Marketing & Attribution: Track click-through rates across campaigns, channels, and geographies
  • Character-Limited Sharing: SMS, tweets, QR codes where URL length matters
  • Link Management: Branded short domains, expiration policies, A/B redirect targets
  • Internal Tooling: Short aliases for dashboards, runbooks, incident links

Why Interviewers Ask About This

URL shortening surfaces the core Staff-level skill: reasoning about ID generation at scale. The naive solution (hash + collision check) breaks under concurrency. The real interview tests whether you can design a globally unique, non-predictable, low-latency key generation system — and whether you understand the read-heavy redirect path as a caching and availability problem.

Mechanics Refresher: Base62 Encoding & Key Space Math — Why 7 characters gives you 3.5 trillion keys

Base62 alphabet: [a-zA-Z0-9] = 62 characters.

| Key Length | Unique Keys | Enough For |
| --- | --- | --- |
| 6 chars | 56.8 billion | ~1,800 years at 1K writes/sec |
| 7 chars | 3.52 trillion | ~111,000 years at 1K writes/sec |
| 8 chars | 218 trillion | Effectively unlimited |

Why not Base64? Standard Base64 uses + and /, which must be percent-escaped in URLs (the URL-safe variant swaps in - and _ instead). Base62 sticks to [a-zA-Z0-9] and is URL-safe by default.
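A minimal Base62 codec is worth having in your head. This sketch uses a lowercase-first alphabet; the ordering is a convention, and any fixed permutation works:

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
BASE = len(ALPHABET)  # 62

def encode(n: int) -> str:
    """Encode a non-negative integer as a Base62 string."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, BASE)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))  # most-significant digit first

def decode(s: str) -> int:
    """Invert encode(); assumes s contains only alphabet characters."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return n
```

Every ID below 62^7 encodes to at most 7 characters, which is where the key-space numbers in the table come from.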

Collision math (birthday paradox): with 7-char keys (N ≈ 3.52 trillion), the 50% collision point arrives at roughly 1.18√N ≈ 2.2 million URLs, far sooner than intuition suggests. At 1 billion URLs, a collision is a statistical certainty: expect on the order of 140,000 colliding pairs. This matters for hash-based approaches but not for counter-based approaches, which never collide.
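The 7-char birthday math can be checked in a few lines:

```python
import math

N = 62 ** 7  # ≈ 3.52 trillion possible 7-char keys

# Items needed for ~50% collision probability: about 1.1774 * sqrt(N)
n_50 = 1.1774 * math.sqrt(N)  # ≈ 2.2 million

def expected_collisions(n: int) -> float:
    """Expected number of colliding pairs among n uniformly random keys."""
    return n * (n - 1) / (2 * N)
```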

Why this matters later: Fault Line 1 (Hash vs Counter) and Fault Line 3 (Predictability) both depend on understanding key space exhaustion rates and collision math.

Mechanics Refresher: 301 vs 302 Redirects — The caching decision that changes your analytics
| Redirect | HTTP Semantics | Browser Behavior | Analytics Impact |
| --- | --- | --- | --- |
| 301 Moved Permanently | Cacheable by default | Browser caches, skips your server on repeat visits | You lose click tracking after first visit |
| 302 Found | Not cached by default | Browser hits your server every time | Full analytics, higher server load |
| 307 Temporary Redirect | Like 302, preserves HTTP method | Same as 302 for GET | Use when POST redirect matters |

The tension: 301 reduces server load (browser caches the redirect) but blinds your analytics. 302 gives full visibility but every click hits your infrastructure.
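One middle ground worth naming in the interview: a 302 with a short Cache-Control TTL lets the CDN absorb burst traffic while keeping most clicks visible. A sketch, with an illustrative 60-second TTL:

```python
def redirect_response(long_url: str, permanent: bool = False) -> tuple[int, dict]:
    """Build redirect status + headers. A short max-age on the 302 lets
    CDNs and browsers absorb repeat clicks briefly without losing all
    click tracking, as a 301 would."""
    if permanent:
        return 301, {"Location": long_url}  # cached indefinitely by default
    return 302, {
        "Location": long_url,
        "Cache-Control": "public, max-age=60",
    }
```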

Why this matters later: Fault Line 4 (Redirect Semantics) forces you to choose between infrastructure cost and analytics completeness. Most candidates pick 301 without realizing they've killed click tracking.

What This Interview Actually Tests

URL shortening is not a hashing question. Everyone can Base62-encode a number.

This is a distributed key generation and read-path optimization question that tests:

  • Whether you can design a globally unique ID generation scheme without coordination bottlenecks
  • Whether you reason about the read path (redirects) as a caching and availability problem
  • Whether you understand that analytics is the real product, not the short URL itself
  • Whether you can articulate who owns the blast radius of key collisions

The key insight: The write path (URL creation) is the interesting distributed systems problem. The read path (redirect) is the interesting performance and caching problem. Most candidates spend 30 minutes on the wrong one.

The L5 vs L6 Contrast (Memorize This)

Level Calibration
| Behavior | L5 (Senior) | L6 (Staff) |
| --- | --- | --- |
| First move | Draws hash function + collision check | Asks "What's the read:write ratio and what analytics do we need?" |
| Key generation | MD5/SHA → truncate → check DB | Designs a coordination-free counter (Snowflake, pre-allocated ranges) |
| Read path | "Add a cache in front of the DB" | Quantifies cache hit ratio at 99.9%+, designs for cache-only serving |
| Failure awareness | "Add replicas" | Asks "What happens when two services generate the same key? Who detects it?" |
| Ownership | Focuses on the shortening service | Frames analytics pipeline as the primary value — short URLs are just the entry point |

The Three Intents (Pick One and Commit)

| Intent | Constraint | Strategy | Key Generation Approach |
| --- | --- | --- | --- |
| High-Throughput Link Creation | Write speed is everything | Pre-allocated ID ranges, no coordination on write | Counter-based, batch allocation |
| Analytics-First Platform | Click data completeness is everything | 302 redirects, event pipeline, real-time dashboards | Any, but redirect path must emit events |
| Branded Link Management | Custom aliases, expiration, A/B testing | CRUD API with slug validation, TTL management | User-chosen + auto-generated hybrid |

The Four Fault Lines (The Core of This Interview)

Who Pays Analysis
| # | Fault Line | The Tension | Staff Question |
| --- | --- | --- | --- |
| 1 | Hash vs Counter | Hash is stateless but collides; counters need coordination but never collide | "How do you generate globally unique keys without a central bottleneck?" |
| 2 | Write Path Coordination | Pre-allocation is fast but wastes ranges on crash; single counter is simple but bottlenecked | "What happens when a node crashes mid-range?" |
| 3 | Key Predictability | Sequential keys are guessable; random keys waste cache locality | "Can an attacker enumerate all your short URLs?" |
| 4 | Redirect Semantics & Caching | 301 saves infrastructure but kills analytics; 302 preserves data but costs more | "How do you serve 100K redirects/sec with sub-10ms p99?" |

System Architecture Overview

[Architecture diagram]

Interview Walkthrough: How to Present This in 45 Minutes

Most interview prep covers the basics — step-by-step architecture walkthroughs at tutorial pace. This section is different. Senior candidates spend 25 minutes on the basics and run out of time before reaching anything interesting. Staff candidates speed through the baseline in 10-12 minutes — fast enough to spend the remaining 30+ minutes on the fault lines, failure modes, and ownership questions that actually determine your level.

The six phases below add up to 45 minutes. The ratios matter: phases 1-4 are deliberately compressed so phase 5 gets the lion's share of time. If you're spending more than 12 minutes before the transition to depth, you're pacing like an L5.

Phase 1: Requirements & Framing (2-3 minutes)

State functional requirements in 30 seconds — don't enumerate, state the category:

  • "Users create short URLs that redirect to long URLs. The system tracks click analytics and supports custom aliases and TTL-based expiry."

That's it. Don't describe the entire UI flow. The interviewer knows what a URL shortener does.

Invest time on non-functional requirements (this is the Staff move):

  • "The hard constraint is the read-to-write ratio: 100:1 or higher. We optimize the redirect path — every millisecond of redirect latency is user-visible."
  • "URL uniqueness guarantee: each short code maps to exactly one long URL, even under concurrent creation."
  • "Analytics at scale: every click generates an analytics event. At 100B redirects/month, that's ~38K events/second."
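The ~38K figure is worth being able to derive on the spot:

```python
redirects_per_month = 100e9           # 100B redirects/month
seconds_per_month = 30 * 24 * 3600    # ≈ 2.59 million seconds
events_per_second = redirects_per_month / seconds_per_month  # ≈ 38,580
```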

Phase 2: Core Entities & API (1-2 minutes)

State entities quickly (30 seconds):

  • ShortURL: short code, long URL, creation timestamp, expiry, owner
  • IDRange: pre-allocated range of IDs for collision-free generation (e.g., range 1M-2M assigned to server A)
  • ClickEvent: timestamp, short code, referrer, user agent, geo — analytics payload

API (1 minute):

POST /urls            → { long_url, custom_alias?, ttl? } → { short_url }
GET  /{code}          → 302 redirect (or 301 for permanent)
GET  /urls/{code}/stats → click analytics (count, geo, referrer breakdown)

Phase 3: High-Level Architecture (5-7 minutes)

[High-level architecture diagram]

Walk through the flow:

  1. CDN Edge → First layer for cached 301 redirects; cache miss goes to Redirect Service
  2. Redirect Service → Looks up short code in Redis cache (hit rate: ~95%), falls back to database on miss
  3. Analytics Pipeline → Every redirect fires an async event to Kafka for click tracking — never blocks the redirect
  4. Write Service + ID Allocator → Creates short URLs using pre-allocated ID ranges; no coordination needed between write servers

Key points to hit on the whiteboard:

  1. Read path is king — CDN → Redis → DB fallback. Three layers, sub-10ms for 95% of requests
  2. Analytics are async — fire-and-forget to Kafka, never on the redirect critical path
  3. ID generation is pre-allocated — no hash collisions, no database-level dedup on writes
  4. Write path is simple — take next ID from allocated range, base62-encode, store

Phase 4: Transition to Depth (1 minute)

At this point you've spent ~12 minutes. Now pivot:

"The basic architecture is straightforward — redirect service with Redis cache, async analytics. What makes this a Staff-level problem is three things: (1) ID generation strategy — hashing vs pre-allocated ranges vs sequential, (2) the analytics pipeline at 38K events/second, (3) hot URL handling when a single short URL gets millions of clicks."

Then offer the interviewer a choice:

"I can go deep on any of these. Which is most interesting to you?"

If the interviewer doesn't have a preference, lead with ID generation — it's the most differentiating topic.

Phase 5: Deep Dives (25-30 minutes)

The interviewer will steer, but be prepared to go deep on any of these. For each, follow the Staff pattern: state the tradeoff → pick a position → quantify the cost → explain who absorbs that cost.

Fault Line 1: ID generation strategy (7-10 min)

Open with the three approaches:

"Option A: Hash the long URL (MD5/SHA-256, take the first 7 Base62 characters). Simple, but the birthday paradox bites early: with a 62^7 key space, collision probability hits 50% at roughly 1.18 × √(62^7) ≈ 2.2 million URLs, and at billions of URLs collisions are routine. Every write needs a check-and-retry loop against the store."
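A sketch of Option A, to make the collision handling concrete. The salt-and-retry loop is the hidden cost; the helper name is illustrative:

```python
import hashlib

ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def hash_code(long_url: str, salt: str = "", length: int = 7) -> str:
    """SHA-256 the URL, take 64 bits, render as `length` Base62 chars.
    On a collision in the store, the caller must retry with a new salt."""
    digest = hashlib.sha256((long_url + salt).encode()).digest()
    n = int.from_bytes(digest[:8], "big")
    chars = []
    for _ in range(length):
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(chars)
```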

"Option B: Global auto-increment counter (e.g., PostgreSQL SERIAL). No collisions, but the database becomes a bottleneck at high write rates — every write serializes on the counter."

"Option C: Pre-allocated ID ranges. A range allocator hands out chunks of 1M IDs to each write server. Server A gets IDs 1M-2M, Server B gets 2M-3M. Each server increments locally — zero coordination. When a range is exhausted, request a new one."

Pick a position: "I'd use Option C. It eliminates collisions, requires no per-write coordination, and scales linearly with write servers. The range allocator is a simple counter in ZooKeeper or Redis; coordination cost drops from once per write to once per million writes."
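Option C reduces to a few lines. In this sketch, the shared counter is an in-process stand-in for the ZooKeeper/Redis allocator:

```python
import itertools
import threading

class RangeAllocator:
    """Hands out disjoint ID ranges. In production the counter lives in a
    coordination service (ZooKeeper/Redis), not in process memory."""
    def __init__(self, range_size: int = 1_000_000):
        self.range_size = range_size
        self._next_range = itertools.count(0)  # stand-in for shared counter
        self._lock = threading.Lock()

    def allocate(self) -> range:
        with self._lock:
            start = next(self._next_range) * self.range_size
        return range(start, start + self.range_size)

class WriteServer:
    """Each write server consumes its range locally: zero per-write coordination."""
    def __init__(self, allocator: RangeAllocator):
        self.allocator = allocator
        self._ids = iter(self.allocator.allocate())

    def next_id(self) -> int:
        try:
            return next(self._ids)
        except StopIteration:            # range exhausted: fetch a new one
            self._ids = iter(self.allocator.allocate())
            return next(self._ids)
```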

Then address custom aliases: "Custom aliases bypass the ID allocator — the user-provided string goes directly into the URL table. We check for conflicts at write time: if the alias already exists, reject with 409 Conflict. Custom aliases are stored in the same table but with a flag distinguishing them from auto-generated codes."

Fault Line 2: Analytics pipeline at scale (5-7 min)

"Every redirect generates a click event: timestamp, short code, IP, user agent, referrer, geo (derived from IP). At 38K events/second, that's 3.3 billion events/day. How do we store and query this?"

Walk through the pipeline:

  1. Redirect Service fires event to Kafka (async, non-blocking)
  2. Stream processor (Flink/Kafka Streams) enriches events: IP → geo lookup, user agent parsing
  3. Real-time counters in Redis: total clicks per code, updated every event
  4. Batch sink to ClickHouse (columnar store): full event history for analytical queries (clicks by day, by geo, by referrer)

The key tradeoff: "Real-time counters give instant click counts but no dimensional breakdown. ClickHouse gives full analytics but with 30-60 second query latency. For the dashboard, I'd show real-time total count from Redis and lazy-load dimensional breakdowns from ClickHouse."

Fault Line 3: Hot URL handling (5-7 min)

"A tweet goes viral with a short URL. That URL goes from 10 clicks/sec to 500,000 clicks/sec in 5 minutes. The Redis cache entry for that code becomes a hot key — every redirect request hits the same cache shard."

Mitigations in order:

  1. CDN caching — For 302 redirects, set a short cache-control header (e.g., 60 seconds). The CDN absorbs 99% of traffic for popular URLs. For 301 redirects, the browser caches forever — the CDN barely sees traffic.
  2. Read replicas — Redis read replicas with client-side routing. Distribute reads across 3-5 replicas for the hot shard.
  3. Local cache — In-process LRU cache on each redirect server. Cache the top 10K most-accessed codes in memory. Cache hit eliminates even the Redis round-trip.
  4. Tiered strategy — Combine all three: local cache (sub-μs) → CDN (1-2ms) → Redis (2-5ms) → DB (10-50ms). For a viral URL, 99.99% of requests never leave the CDN or local cache.
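The in-process and shared-cache tiers compose naturally. A sketch where redis_get and db_get stand in for real client calls; note that lru_cache also caches misses, which doubles as negative caching:

```python
from functools import lru_cache

def make_resolver(redis_get, db_get, local_size: int = 10_000):
    """Tiered lookup: in-process LRU -> Redis -> DB."""
    @lru_cache(maxsize=local_size)       # tier 1: sub-microsecond
    def resolve(code: str):
        url = redis_get(code)            # tier 2: shared cache (2-5ms)
        if url is None:
            url = db_get(code)           # tier 3: source of truth (10-50ms)
        return url
    return resolve
```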

URL expiry and cleanup (3-5 min)

"Short URLs with TTL need cleanup. But we can't just delete expired URLs — someone might have bookmarked them. Options: (1) hard delete after TTL + 30 days grace period, (2) soft delete that returns a 'this URL has expired' page with the original long URL shown, (3) tombstone that prevents re-use of the short code for 1 year."

"I'd use option 2 — soft delete with a grace period. The expired URL still resolves but shows a landing page instead of redirecting. After 30 days, hard delete and return the short code to the ID pool for reuse."

Phase 6: Wrap-Up (2-3 minutes)

Summarize the key insight — don't just restate your architecture:

"URL shortening is the simplest system design question, which makes it the most dangerous. Every candidate can build a working URL shortener in 15 minutes. What separates Staff from Senior is reasoning about the 100:1 read-write asymmetry, the ID generation strategy that eliminates coordination, and the analytics pipeline that handles 38K events/second without blocking the redirect path."

If time permits, add the operational insight:

"The hardest operational problem isn't redirect performance — it's abuse. Short URLs are a perfect tool for phishing and malware distribution. The system needs a URL safety check (Google Safe Browsing API) on creation and a reporting mechanism for malicious URLs. This is where the simple system design question becomes a real-world engineering problem."

Common Timing Mistakes

Level Calibration
| Mistake | L5 Does This | L6 Does This |
| --- | --- | --- |
| 10 min on requirements | Lists custom domains, QR codes, link previews | States read-write ratio in 1 min, moves to what's hard |
| 15 min on hashing | MD5 vs SHA256 vs base62, collision math on the whiteboard | "Pre-allocated ID ranges. Zero collisions. Moving on." |
| No analytics discussion | "We log clicks" with no pipeline | Designs a Kafka → Flink → ClickHouse pipeline with real-time counters |
| No hot URL handling | Assumes uniform traffic | Volunteers hot-key mitigation with quantified CDN absorption |
| Spreads thin | Touches 6 topics at surface level | Goes deep on 2-3 fault lines with numbers |
| No abuse discussion | Ignores the trust & safety angle | Names URL safety checking and reporting as operational concerns |

Reading the Interviewer

| Interviewer Signal | What They Care About | Where to Go Deep |
| --- | --- | --- |
| Asks about collisions | Algorithm/data structure depth | ID generation strategy comparison |
| Asks about scale | Distributed systems | Hot URL handling, Redis sharding, CDN caching |
| Asks about analytics | Data engineering | Click pipeline, real-time vs batch, ClickHouse |
| Asks about 301 vs 302 | Product/caching tradeoffs | Redirect semantics, cache implications, analytics impact |
| Asks about custom URLs | Edge cases | Alias conflict handling, reserved words, namespace management |
| Pushes back on your architecture | Wants to see you defend or adapt | State your reasoning, acknowledge alternatives, explain your tradeoff |

What to Deliberately Skip

Level Calibration
| Topic | Why L5 Goes Here | What L6 Says Instead |
| --- | --- | --- |
| Hash function comparison | Textbook material | "Pre-allocated ranges. No hashing needed. Moving on." |
| Database schema | Feels productive | "Two columns: code, long_url. Plus metadata. Trivial." |
| QR code generation | Feature creep | "Client-side library generates QR from the short URL. Not a backend concern." |
| Custom domain support | Complex but not differentiating | "DNS CNAME + host header routing. Operational, not architectural." |
| Link preview / Open Graph | Nice-to-have | "Proxy the target page's OG tags. Separate service." |


Core sections
  • 1. The Staff Lens
  • 2. Problem Framing & Intent
  • 3. The Four Fault Lines
  • 4. Failure Modes & Degradation
  • 5. Evaluation Rubric
  • 6. Interview Flow & Pivots
  • 7. Active Drills
  • 8. Deep Dive Scenarios
  • 9. Level Expectations Summary
  • 10. Staff Insiders: Controversial Opinions