Technologies referenced in this playbook: Redis · API Gateways
How to Use This Playbook
Organized for interview use first, reference second. Read front-to-back once. Return to individual sections for targeted review.
| Mode | Time | What to Read |
|---|---|---|
| Quick Review | 15 min | Executive Summary → Interview Walkthrough → Fault Lines → Active Drills |
| Targeted Study | 1–2 hrs | Executive Summary → Interview Walkthrough → Fault Lines → weak-spot Deep Dives |
| Deep Dive | 3+ hrs | Everything, including appendices |
What is Rate Limiting? — Why interviewers pick this topic
Rate limiting controls how many requests a client can make to your system within a time window. Without it, a single misbehaving client can overwhelm your servers, degrade performance for everyone, or run up infrastructure costs.
Before vs After — Flash Sale scenario:
Without rate limiting:
t=0: "50% off everything" push notification
t=+10s: Traffic spikes 10x — 50,000 req/s hits the API
t=+30s: Database connection pool exhausted
t=+45s: Backend returning 500 errors
t=+2min: Full outage. All users see error pages.
t=+45min: Engineers restore service. Revenue lost. Trust damaged.
With rate limiting:
t=0: Same push notification, same 10x spike
t=+10s: Rate limiter kicks in — excess traffic gets 429 responses
t=+10s: Backend stays at healthy capacity (5,000 req/s)
t=+1min: Traffic normalizes as retries spread out
t=+5min: Zero downtime. Graceful degradation. A metric blip, not a page.
Why interviewers reach for this question: Rate limiting surfaces the core Staff-level skill — reasoning about tradeoffs under uncertainty. There's no perfect solution. Every choice has a cost. Do you optimize for accuracy or latency? Who absorbs the cost of false positives? How do you handle distributed coordination without adding latency? Interviewers want to see you navigate these tensions, not recite Token Bucket mechanics.
Mechanics Refresher: Algorithms
| Algorithm | How It Works | Pros | Cons |
|---|---|---|---|
| Token Bucket | Tokens refill at a fixed rate; each request costs a token | Allows controlled bursts; O(1) per check | Slightly more complex state |
| Fixed Window | Count requests in fixed time intervals | Simple to implement | Boundary exploit: 2× burst at window edges |
| Sliding Window Log | Count requests over continuously sliding window | Accurate | O(N) memory per client — doesn't scale |
| Sliding Window Counter | Weighted approximation between two fixed windows | Memory-efficient approximation | ±1% error at boundaries |
| Leaky Bucket | Queue requests; process at fixed rate | Smooth output | Queuing abusive traffic is worse than rejecting it |
For most production systems: Token bucket. The algorithm is almost never the interview question — coordination, failure modes, and ownership are.
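For reference, the token bucket row above can be sketched in a few lines. This is a minimal single-process illustration, not a production limiter: the class name, the injectable `clock` parameter, and the choice to start the bucket full are assumptions made for the example.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: refills at `rate` tokens/sec, holds at most `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock               # injectable so tests can control time
        self.tokens = capacity           # assumption: start full, so an initial burst is allowed
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill lazily from elapsed time, then try to spend `cost` tokens. O(1)."""
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The burst size is `capacity` and the sustained rate is `rate`; the two are tuned independently, which is the property the table credits as "controlled bursts".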
What This Interview Actually Tests
Rate limiting is not an algorithm question. Everyone knows Token Bucket.
This is a distributed systems ownership question that tests:
- Whether you clarify intent before designing
- Whether you reason about failure modes proactively
- Whether you understand who pays for each tradeoff
- Whether you can own the operational burden
The key insight: Rate limiting is fundamentally a policy enforcement problem with no perfect answer. Staff engineers reason about who absorbs the cost of imperfection.
The L5 vs L6 Contrast — Start Here
| Behavior | Senior (L5) | Staff (L6) |
|---|---|---|
| First move | Draws Redis + Token Bucket | Asks "What are we protecting against?" |
| Algorithm | Selects Token Bucket | Identifies the Latency Trap: central Redis adds 5–10ms to every request |
| Consistency | Assumes strong consistency | Argues rate limiting is "fuzzy" — eventual consistency may be acceptable |
| Failure | Mentions "Redis replicas" | Asks "Fail-open or fail-closed? Who signs off on that?" |
| Ownership | Focuses on implementation | Moves enforcement to Gateway/Sidecar to avoid "client library hell" |
Why "first move" separates levels
L5: Starts with a solution shape ("token bucket + Redis") before the problem is defined. This reads as pattern-matching and creates downstream confusion — mixing abuse protection, billing/quota, and fairness into one design with incompatible correctness and failure expectations.
L6: Names intent out loud and commits to one path before drawing a box. "Are we protecting infrastructure from abuse, enforcing paid quotas, or isolating tenants? These are fundamentally different systems with incompatible failure modes. I'll assume abuse protection."
Why "failure" separates levels
L5: Treats failure as "add Redis replicas / HA." That improves availability, but it dodges the hard question: during slowness, failovers, or partial outages, what do we do with requests?
L6: Makes the decision quickly and ties it to ownership: "For abuse protection we fail-open with conservative local caps + alerting so the limiter doesn't become a kill switch. For billing/quota we fail-closed because we can't give away resources." Also states who signs off on the risk and how you prevent silent bypass.
Why "ownership" separates levels
L5: Focuses on implementation ("we'll add middleware to services") and underestimates organizational drift: polyglot stacks, version skew, inconsistent enforcement, and slow rollouts for policy changes.
L6: Treats rate limiting as a platform control: enforce at the Gateway/Ingress (or sidecar/mesh), keep policy declarative, make changes safe to roll out and roll back. This is the Staff signal: you're designing for the organization, not just for one service.
The Staff Positions
| Position | Rationale |
|---|---|
| Token bucket over fixed window | Fixed window has boundary exploits; token bucket smooths traffic |
| Local-first over Redis-per-request | Avoid the latency trap; coordinate periodically, not per-request |
| Fail-open for abuse protection | The limiter shouldn't become a kill switch; protect availability |
| Fail-closed for billing/quota | Can't give away resources; accuracy trumps availability |
| Gateway/sidecar over SDK | Avoid "client library hell"; enforce at the infrastructure layer |
| Bounded drift is acceptable | For abuse protection, ±10% accuracy is fine; don't over-engineer |
The Three Intents
Three intents drive every design decision. Each leads to a fundamentally different architecture.
| Intent | Constraint | Strategy | Failure Mode | Correctness Bar |
|---|---|---|---|---|
| Abuse Protection | Speed is everything | Fail-open, local-first counters, high throughput | Some over-admission | Bounded drift acceptable (ε) |
| Billing/Quota | Accuracy is everything | Fail-closed, strong consistency, strict accounting | Cannot give away resources | Drift unacceptable; audit required |
| Multi-Tenant Fairness | Isolation is everything | Weighted quotas, reservations, bounded bursts | Noisy neighbor isolation | Per-tenant SLO preservation |
The Five Fault Lines
| # | Fault Line | The Tension |
|---|---|---|
| 1 | Protection vs Correctness | Prioritize system survival (allow drift) or exact limits (risk collapse)? |
| 2 | Centralized vs Distributed State | Redis per-request (simple, accurate, SPOF) vs local-first (resilient, fast, drifty)? |
| 3 | Latency vs Accuracy | Pay the 5–10ms Redis tax on every request, or accept approximation? |
| 4 | Fail-Open vs Fail-Closed | When the limiter fails, protect availability or protect resources? |
| 5 | Infra Ownership vs Team Autonomy | Central service vs sidecar vs SDK? Who owns policy changes? |
In the Wild: Real Production Systems
Stripe — Local-First Token Bucket with Centralized Billing
Stripe uses a local-first token bucket for abuse protection at the gateway layer, keeping enforcement sub-millisecond. For billing/quota (tracking API call usage against paid plans), they use a separate centralized path with stronger consistency guarantees. Every response includes RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset headers — treating rate limit transparency as a first-class API contract.
Staff insight: Stripe's separation of abuse protection from billing enforcement is the canonical example of why these systems cannot share one limiter configuration. Abuse protection fails open; billing fails closed. These failure modes are incompatible.
Figma — Backpressure-Based Throttling
Figma's collaborative editing infrastructure uses backpressure-based throttling rather than hard 429 rejection — graceful degradation through reduced cursor update fidelity, delayed sync batching, and selective feature throttling under load. When a document has too many simultaneous editors, the system reduces real-time sync frequency rather than disconnecting users.
Staff insight: "Throttling doesn't always mean 429s. In stateful systems, graceful degradation is the preferred strategy." Fail-open vs fail-closed is a spectrum, not a binary — the right answer depends on whether your protocol is stateless (HTTP) or stateful (WebSocket).
Cloudflare — Edge-First Multi-Layer Defense
Cloudflare enforces rate limiting at multiple layers: L3/L4 filtering at the network edge for volumetric DDoS, WAF-level rules at L7 for application-aware limits, and challenge pages as an intermediate step before hard blocks. Each layer catches what the previous one misses, with different accuracy/latency tradeoffs at each tier.
Staff insight: Cloudflare's multi-layer approach is the production implementation of the "CDN/WAF for coarse filtering, gateway for identity-aware limits" architecture. Rate limiting is a stack, not a single component.
What Interviewers Probe
| After You Say... | They Will Ask... |
|---|---|
| "Token bucket + Redis" | "What's the latency tax? What happens when Redis is slow?" |
| "Hybrid local + sync" | "How does coordination actually work? What's the drift bound?" |
| "We'll shard Redis" | "What about hot keys? One identity can dominate one shard." |
| "Fail-open for availability" | "What prevents the backend from melting? What's the circuit breaker?" |
| "We'll add replicas" | "Replicas don't answer degraded mode. What's your fallback behavior?" |
System Architecture Overview
Quick-Reference: The 30-Second Cheat Sheet
| Topic | The L5 Answer | The L6 Answer — Say This |
|---|---|---|
| Algorithm | "Token bucket + Redis per request" | "Token bucket semantics, local-first counters. Redis is for async coordination — never in the critical path." |
| Consistency | "Strong consistency, centralized" | "Bounded drift is acceptable for abuse protection — I'll quantify the drift bound. Billing needs strong consistency via a separate path." |
| Failure | "Add Redis replicas for HA" | "Fail-open for abuse protection (limiter shouldn't be a kill switch), fail-closed for billing (can't give away resources). Replicas don't answer degraded-mode behavior." |
| Hot keys | "Shard Redis" | "Sharding helps for many keys, not one hot key. Mitigate with token leasing, deny-cache, or key-owner routing." |
| Ownership | "Add middleware to each service" | "Enforce at the gateway/sidecar. Per-service SDKs create library hell — 3–4 incompatible versions in production within 6 months." |
| Policy changes | "Update the config file" | "Policy lifecycle: propose → review → staged rollout → observe → enforce. Feature-flag + canary. Who signs off? Who's on-call?" |
Key Numbers Worth Memorizing
| Metric | Value | Why It Matters |
|---|---|---|
| Redis per-request latency overhead | +5–10ms p99 | The latency trap — this is unacceptable on hot paths at high QPS |
| Local bucket check latency | ~0ms | Why local-first is the right default for abuse protection |
| Worst-case drift (10 nodes, 5s sync) | ~83 extra req/window | With 10 gateways syncing every 5s and 100 req/min limit; ~83% overshoot absolute worst case; 10–20% in practice |
| Bypass rate alert threshold | >1% for 5 minutes | Key signal for silent fail-open — if bypass is elevated this long, we're unprotected |
| Redis timeout for circuit break | 5–10ms | Aggressive timeout prevents rate limiter from becoming a latency amplifier |
| Token leasing reduction in Redis QPS | ~10–100× | vs per-request Redis calls — the operational efficiency justification for leasing |
| Enforcement rate SLO alert | <95% | Page on-call if less than 95% of requests are being enforced against Redis |
Phase 1: Requirements & Framing (2–3 minutes)
State functional requirements in 30 seconds — don't enumerate, state the category:
"We need to limit request rates per client to protect backend services from abuse and enforce fair usage across tiers."
Invest remaining time on non-functional requirements — this is the Staff move:
"What's the intent? Abuse protection, billing quota, or multi-tenant fairness? I'll assume abuse protection because that's where the hardest distributed tradeoffs live."
Then commit to a constraint set: "For abuse protection: sub-5ms enforcement latency overhead, fail-open behavior (I'll justify this), and eventual consistency across instances — I'll quantify the drift bound. These three constraints drive the entire design."
Phase 2: Core Entities & API (1–2 minutes)
State entities quickly — 30 seconds:
- RateLimitPolicy: tier, endpoint pattern, window size, threshold, action (reject / throttle / log)
- RateLimitCounter: composite key {identity}:{endpoint}:{window}, count, TTL
- RateLimitDecision: allow/reject, remaining quota, retry_after_ms
Don't draw an ER diagram. Name the three nouns, confirm alignment, move on.
Check path (hot, every request — middleware, not a standalone API):
CheckRateLimit(identity, resource, action) → { allowed: bool, remaining: int, retry_after_ms: int }
Config path (cold, admin only):
PUT /rate-limits/rules { tier, resource, limit, window, action }
GET /rate-limits/rules?tier=free
DELETE /rate-limits/rules/{rule_id}
Response headers on every response: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset. Add Retry-After on 429 responses.
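One way to turn a RateLimitDecision into those headers is sketched below. The field names `limit` and `reset_epoch_s`, and the choice to express the reset time as epoch seconds, are assumptions for illustration; the playbook does not specify them.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RateLimitDecision:
    allowed: bool
    limit: int
    remaining: int
    reset_epoch_s: int    # assumption: reset time expressed as epoch seconds
    retry_after_ms: int   # 0 when the request is allowed


def rate_limit_headers(d: RateLimitDecision) -> dict[str, str]:
    """Build the response headers for one decision."""
    headers = {
        "X-RateLimit-Limit": str(d.limit),
        "X-RateLimit-Remaining": str(max(0, d.remaining)),
        "X-RateLimit-Reset": str(d.reset_epoch_s),
    }
    if not d.allowed:
        # Retry-After takes whole seconds; round up so clients never retry early.
        headers["Retry-After"] = str(math.ceil(d.retry_after_ms / 1000))
    return headers
```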
Phase 3: High-Level Architecture (5–7 minutes)
Draw the architecture in three layers, then walk through the request flow.
Walk the request flow in 90 seconds:
- Every request hits the gateway's rate limit middleware
- Middleware checks local token bucket — this is the hot path, ~0ms
- If tokens available: allow, forward to backend
- If no tokens: return 429 + Retry-After header
- Background: every 500ms, sync local count deltas to Redis for cross-instance coordination
- Redis is never in the critical request path
Key points to hit explicitly:
- Gateway-level enforcement — one enforcement point, not N per-service middlewares
- Local-first with Redis coordination — local token bucket for sub-ms checks; async sync to Redis; Redis is NOT in the critical path
- Fail-open as default — for abuse protection, blocking legitimate users is worse than letting some abuse through
- Config store separate from enforcement — policy changes propagate asynchronously, not in the hot path
- Observability from day one — 429 rate, false positive rate, Redis latency p99, bypass rate
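The local-first flow above can be sketched as follows. This is a simplified illustration under stated assumptions: `InMemoryStore` stands in for Redis, counts are kept for a single window (window rollover is omitted), and a plain counter replaces the token bucket to keep the sync logic visible. All class and method names are hypothetical.

```python
from collections import defaultdict


class InMemoryStore:
    """Stand-in for Redis: holds the global per-key count for the current window."""

    def __init__(self) -> None:
        self.counts: dict[str, int] = defaultdict(int)

    def add_and_get(self, key: str, delta: int) -> int:
        # Against Redis this would be one INCRBY, which returns the new total.
        self.counts[key] += delta
        return self.counts[key]


class LocalFirstLimiter:
    def __init__(self, limit: int, store: InMemoryStore) -> None:
        self.limit = limit
        self.store = store
        self.known_global: dict[str, int] = defaultdict(int)  # total as of last sync
        self.pending: dict[str, int] = defaultdict(int)       # admitted since last sync

    def allow(self, key: str) -> bool:
        """Hot path: in-memory check only, no network call."""
        if self.known_global[key] + self.pending[key] >= self.limit:
            return False
        self.pending[key] += 1
        return True

    def sync(self) -> None:
        """Background path: push local deltas, pull the merged global total."""
        for key, delta in list(self.pending.items()):
            self.known_global[key] = self.store.add_and_get(key, delta)
            self.pending[key] = 0
```

Two instances sharing one store will each admit up to `limit` requests before their first `sync()`, which is exactly the bounded drift the design accepts in exchange for keeping the store off the request path.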
Phase 4: Transition to Depth (1 minute)
"The basic architecture is straightforward. What makes this Staff-level is the failure mode reasoning. Three areas worth going deep: what happens when Redis fails, distributed coordination across multiple gateway instances, and policy management as an organizational problem. Which is most interesting to you?"
If the interviewer doesn't have a preference: lead with fail-open vs fail-closed — most impressive and most universally applicable.
Phase 5: Deep Dives (25–30 minutes)
For each, follow the Staff pattern: state the tradeoff → pick a position → quantify the cost → name who absorbs it.
Fault Line 1: Fail-open vs fail-closed (5–7 min)
Open with the decision framework:
"When Redis is down, do we let all traffic through (fail-open) or block all traffic (fail-closed)? For abuse protection, I default to fail-open: blocking 100% of legitimate users to stop potential abuse is worse than temporarily allowing unchecked traffic. For billing/quota, I'd flip to fail-closed because giving away resources has direct revenue impact."
Walk through the failure sequence:
- Redis goes down → middleware detects failure (connection timeout, 5–10ms aggressive timeout)
- Middleware switches to local in-memory counters — degraded accuracy but non-zero enforcement
- Observability pipeline fires alert: "enforcement rate dropped below 95%"
- On-call engineer sees alert, confirms Redis outage, follows runbook
- Redis recovers → middleware detects healthy connection → resumes centralized counters
The real danger — silent fail-open: "If the fallback silently passes all traffic without alerting, you could run unprotected for hours. This is the scenario most Senior candidates miss. Adding a bypass-rate metric (rate_limit_bypassed_total) with an alert threshold is mandatory, not optional."
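A minimal sketch of "fail-open, but loudly" follows. It assumes a hypothetical `remote_check` callable that raises `TimeoutError` or `ConnectionError` when the central store is unreachable; the `rate_limit_checked_total` counter and the dictionary used as a metrics sink are also illustrative, while `rate_limit_bypassed_total` is the metric named above.

```python
from typing import Callable


class FailOpenLimiter:
    def __init__(self, remote_check: Callable[[str], bool], local_cap: int) -> None:
        self.remote_check = remote_check   # hypothetical call to the central store
        self.local_cap = local_cap         # per-instance fallback cap for the window
        self.local_counts: dict[str, int] = {}
        self.metrics = {"rate_limit_checked_total": 0, "rate_limit_bypassed_total": 0}

    def allow(self, key: str) -> bool:
        self.metrics["rate_limit_checked_total"] += 1
        try:
            return self.remote_check(key)
        except (TimeoutError, ConnectionError):
            # Fail-open path: count every decision made without the central store,
            # and still apply a conservative local cap instead of admitting everything.
            self.metrics["rate_limit_bypassed_total"] += 1
            used = self.local_counts.get(key, 0)
            if used >= self.local_cap:
                return False
            self.local_counts[key] = used + 1
            return True

    def bypass_rate(self) -> float:
        checked = self.metrics["rate_limit_checked_total"]
        return self.metrics["rate_limit_bypassed_total"] / checked if checked else 0.0
```

An alert rule would then watch `bypass_rate()` (or the equivalent ratio in the metrics backend) rather than relying on someone noticing the store outage.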
Fault Line 2: Distributed coordination (5–7 min)
Frame with concrete numbers:
"With 10 gateway instances and a 100 req/min limit, each instance could independently allow 100 — giving the client 1,000 total. Three options:
- (a) Centralized Redis per-request: accurate, +2ms latency per check, SPOF
- (b) Local counters with periodic sync: fast (sub-ms), bounded drift
- (c) Pre-split quotas (100/10 = 10 per instance): no coordination, wastes capacity on cold instances"
Pick a position and quantify: "I'd go with option (b) for abuse protection. With 10 instances syncing every 5 seconds, worst-case overshoot is 10 × (100/60 × 5) ≈ 83 extra requests per window. That's ~83% overshoot absolute worst case — but in practice traffic distributes across instances, so real overshoot is 10–20%. For abuse protection where limits are 1,000+ req/min, that's noise."
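The drift arithmetic is worth being able to reproduce on demand. The helper below simply encodes the formula quoted above; the function name and parameter names are made up for the example.

```python
def worst_case_overshoot(instances: int, limit_per_min: float, sync_interval_s: float) -> float:
    """Extra requests admitted if every instance spends a full sync interval's
    share of the limit before any of them learns about the others."""
    per_instance = limit_per_min / 60.0 * sync_interval_s
    return instances * per_instance


extra = worst_case_overshoot(instances=10, limit_per_min=100, sync_interval_s=5)
overshoot_pct = 100.0 * extra / 100  # relative to the 100 req/min limit
print(f"{extra:.0f} extra requests, {overshoot_pct:.0f}% overshoot")
# prints "83 extra requests, 83% overshoot"
```

The same formula shows how the bound scales: at a 500ms sync interval it gives roughly 8 extra requests, so the sync interval is the main lever for tightening drift.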
Name a concrete coordination mechanism — don't just say "sync with Redis":
- Token leasing: Lease N tokens from Redis, spend locally. Reduces Redis QPS by 10–100×. Failure mode: stranded tokens when gateway crashes mid-lease.
- Key-owner routing: Hash the identity to a consistent gateway owner. Single-writer avoids races. Failure mode: load skew if one identity dominates.
- Bounded reconciliation: Explicit drift budget ε; force global check when local tokens drop below low-watermark. Failure mode: thundering herd on low-watermark trigger.
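Of the three mechanisms, token leasing is the easiest to show in code. The sketch below assumes an in-memory `LeaseStore` standing in for Redis (a real implementation would need the lease operation to be atomic, for example via a server-side script), and it ignores lease expiry and window reset. All names are hypothetical.

```python
class LeaseStore:
    """Stand-in for Redis: hands out tokens from a shared per-window budget."""

    def __init__(self, budget: int) -> None:
        self.remaining = budget
        self.calls = 0                    # how many store round-trips were made

    def lease(self, want: int) -> int:
        # Assumed atomic here; against a real store this must be one atomic operation.
        self.calls += 1
        granted = min(want, self.remaining)
        self.remaining -= granted
        return granted


class LeasingGateway:
    def __init__(self, store: LeaseStore, lease_size: int) -> None:
        self.store = store
        self.lease_size = lease_size
        self.local_tokens = 0             # lost if this process crashes mid-lease

    def allow(self) -> bool:
        if self.local_tokens == 0:
            self.local_tokens = self.store.lease(self.lease_size)
        if self.local_tokens == 0:
            return False                  # global budget exhausted
        self.local_tokens -= 1
        return True
```

With a lease size of 10, 100 admitted requests cost 10 store calls instead of 100. The `local_tokens` field is also where the stranded-token failure mode lives: a crash discards whatever was leased but not yet spent. Note that this sketch calls the store on every request once the budget is empty; a real gateway would back off.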
"For billing/quota where every request has dollar value, I'd switch to centralized Redis. The +2ms latency is acceptable because billing endpoints are lower throughput."
Fault Line 3: Algorithm — why it matters less than you think (3–5 min)
"I'd use token bucket. But the algorithm choice is the least interesting part. Token bucket, sliding window log, sliding window counter — they all work. The real question is where the counter lives, what happens when that store fails, and how you coordinate across instances. I can explain the algorithmic differences if you'd like, but I'd rather spend time on the distributed coordination problem."
This is a power move. It demonstrates you know the algorithms but won't waste time on textbook recitation. If the interviewer insists:
- Token bucket: smooth, allows bursts up to bucket size, O(1) per check
- Sliding window counter: approximation between two fixed windows, low memory, ±1% error at boundaries
- Sliding window log: exact, but O(N) memory per client — doesn't scale for high-volume clients
Then redirect: "The algorithm determines local behavior. The hard problem is distributed coordination — which we just discussed."
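If the interviewer does push on the algorithms, the sliding window counter is the one most worth being able to write out. A minimal sketch, assuming the caller passes the current time in seconds and that one limiter instance tracks one client:

```python
class SlidingWindowCounter:
    """Weighted approximation between the previous and current fixed windows."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self.current_window = 0       # index of the fixed window being counted
        self.current_count = 0
        self.previous_count = 0

    def allow(self, now: float) -> bool:
        window = int(now // self.window_s)
        if window != self.current_window:
            # Roll forward. If more than one window passed, the previous one was empty.
            adjacent = window == self.current_window + 1
            self.previous_count = self.current_count if adjacent else 0
            self.current_count = 0
            self.current_window = window
        elapsed_fraction = (now % self.window_s) / self.window_s
        # Weight the previous window by how much of it still overlaps the sliding window.
        estimated = self.previous_count * (1.0 - elapsed_fraction) + self.current_count
        if estimated >= self.limit:
            return False
        self.current_count += 1
        return True
```

Only two integers per client are stored, which is the memory advantage over the sliding window log; the price is that the previous window's requests are assumed to be evenly spread.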
Fault Line 4: Hot keys & thundering herd (3–5 min)
"What happens when a single API key generates 50% of all traffic? That key's counter becomes a hot key in Redis — every gateway instance contends on the same key. The mitigations:
- (a) Local aggregation: batch increments locally and flush to Redis every 100ms instead of per-request
- (b) Key sharding: split rl:{api_key}:{window} into rl:{api_key}:{window}:{shard_0..7} and sum on read
- (c) Deny-cache: if a key is already 10× over limit, reject locally without hitting Redis at all"
The important distinction: "Sharding Redis helps for many different keys across nodes. It does NOT help for a single hot identity that hashes to one shard. Hot key is structurally different from high cardinality — you need local aggregation or key-owner routing, not just more shards."
Fault Line 5: Ownership & policy management (3–5 min)
"Who writes the rate limit policies? In my experience, this is where rate limiting actually breaks. The platform team owns the enforcement infrastructure, but product teams own the policies for their endpoints. Without a self-service policy API and a review process, you end up with either: the platform team as a bottleneck for every policy change, or product teams setting limits too high (because they fear blocking users) and the limits being effectively useless."
The Staff answer: "Self-service policy API with guardrails. Product teams can set limits within pre-approved ranges. Changes go through a review pipeline — not a human review, but an automated check that the new limit won't exceed the backend's capacity. Deployment is canary: new limits apply to 5% of traffic for 1 hour before full rollout. Rollback is a one-line config change, not a deploy."
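The guardrail and canary steps can be illustrated with two small functions. This is a sketch under assumptions: the approved range and backend capacity are passed in as plain numbers, and canary membership is decided by hashing a client identity so that assignment is stable. None of these names come from a real policy system.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyChange:
    endpoint: str
    new_limit_per_min: int


def validate(change: PolicyChange, approved_range: tuple[int, int],
             backend_capacity_per_min: int) -> list[str]:
    """Automated review: return a list of violations (empty means approved)."""
    errors = []
    low, high = approved_range
    if not low <= change.new_limit_per_min <= high:
        errors.append(f"limit must be within pre-approved range {low}-{high}")
    if change.new_limit_per_min > backend_capacity_per_min:
        errors.append("limit exceeds backend capacity")
    return errors


def in_canary(identity: str, percent: int = 5) -> bool:
    """Stable assignment: the same identity always lands in the same bucket."""
    digest = hashlib.sha256(identity.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent
```

Hashing the identity rather than sampling per request keeps each client on one policy version for the whole canary period, which makes the before/after comparison easier to read.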
Phase 6: Wrap-Up (2–3 minutes)
Synthesize the insight — don't restate the architecture:
"Rate limiting is a policy enforcement problem, not an algorithm problem. The Staff-level challenge is: who absorbs the cost of imperfection? For abuse protection, we bias toward fail-open because blocking legitimate users is worse than letting some abuse through. For billing, we bias toward fail-closed because giving away resources has direct cost. The architecture is the same in both cases — the configuration and failure behavior change."
The organizational closer:
"The harder problem is policy management. The rate limiter is infrastructure — it's a solved technical problem. The unsolved problem is getting 15 product teams to agree on rate limit policies, keep them updated, and actually respond when limits are hit. That's an organizational design problem, not a systems design problem."
Common Timing Mistakes
| Mistake | L5 Does This | L6 Does This Instead |
|---|---|---|
| 10 min on requirements | Lists every functional requirement, asks about each edge case | States intent in 1 min, picks abuse protection, moves on |
| 15 min on algorithm | Deep dive into Token Bucket vs Sliding Window math | "Token bucket, here's why, moving on to what actually matters" |
| No failure discussion | Waits for interviewer to ask "what if Redis goes down?" | Volunteers fail-open/fail-closed proactively in the architecture phase |
| No ownership story | Focuses purely on implementation | Names who owns policies, who's on-call, how config changes deploy |
| Spreads thin | Touches 6 topics at surface level | Goes deep on 2–3 fault lines, shows quantitative reasoning |
| No numbers | "It should be fast" | "Sub-5ms overhead, bounded drift of ~83 requests with 10 instances syncing every 5s" |
1 The Staff Lens
1.1 Why This Problem Exists in Staff Interviews
Rate limiting separates L6 from L5 because it forces you to reason about organizational tradeoffs — who absorbs the cost of imperfection, who owns the policy, who gets paged at 3 AM. Five behaviors below are what interviewers listen for.
1.2 The L5 vs L6 Contrast — Visual
1.3 The Staff Question That Cuts Through Everything
This single question reveals whether a candidate has operated rate limiters in production or only designed them theoretically.
2 Problem Framing & Intent
2.1 The Three Intents — Explained
Abuse Protection → fail-open, speed-first
- Constraint: latency overhead must be sub-5ms; enforcement must not become a kill switch
- Algorithm: token bucket, local-first with async coordination
- Failure mode: fail-open with conservative local caps and bypass-rate alerting
- Who pays for imperfection: security/product team (explaining why some requests got through during Redis outage)
Billing/Quota → fail-closed, accuracy-first
- Constraint: every over-admission has dollar cost; audit trail required
- Algorithm: token bucket with centralized Redis; stricter coordination
- Failure mode: fail-closed (503 or 429 with "service temporarily unavailable") + immediate alert
- Who pays for imperfection: finance/legal (over-admission means giving away paid resources)
Multi-Tenant Fairness → isolation-first
- Constraint: one tenant's traffic burst must not starve other tenants
- Algorithm: hierarchical token buckets — global cap → tenant cap → sub-tenant cap
- Failure mode: per-tenant circuit breaking; global capacity preserved
- Who pays for imperfection: product team (explaining why enterprise customer was throttled)
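The hierarchical bucket idea for fairness can be sketched as an all-or-nothing check across levels. The example below uses two levels (tenant and global) and omits time-based refill to keep the isolation logic visible; a sub-tenant level would be one more bucket in the chain. Names are hypothetical.

```python
class Bucket:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.tokens = capacity

    def refill(self, n: int) -> None:
        # Called by a timer in a real limiter; manual here for clarity.
        self.tokens = min(self.capacity, self.tokens + n)


class HierarchicalLimiter:
    def __init__(self, global_capacity: int, tenant_capacities: dict[str, int]) -> None:
        self.global_bucket = Bucket(global_capacity)
        self.tenants = {name: Bucket(cap) for name, cap in tenant_capacities.items()}

    def allow(self, tenant: str) -> bool:
        # Check the most specific bucket first, then the global cap.
        chain = [self.tenants[tenant], self.global_bucket]
        if any(bucket.tokens < 1 for bucket in chain):
            return False          # reject without spending at any level
        for bucket in chain:
            bucket.tokens -= 1
        return True
```

With a global cap of 10 and tenant caps of 6 each, a noisy tenant is stopped at 6, leaving 4 tokens for everyone else instead of none.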
2.2 What the Interviewer Leaves Underspecified
Interviewers deliberately omit:
- Auth vs unauth traffic
- Client identity strength
- Hard vs soft limits
- Multi-region behavior
- Regulatory constraints
Staff engineers surface these. Senior engineers assume them away.
2.3 Precise Terminology
Rate-limiter interviews are ambiguous about where enforcement runs. Use precise terms:
| Term | What It Means | Identity Context |
|---|---|---|
| API Gateway / Ingress | First programmable hop inside our infrastructure | API key, auth token, IP |
| CDN/WAF (true edge) | Cloudflare/Akamai/AWS WAF — before our gateway | IP, ASN, geo only |
| Service Mesh / Sidecar | Internal rate limiting between services | Service identity |
| Application Middleware | Per-service enforcement | Full request context |
3 The Five Fault Lines
3.1 Fault Line 1: Protection vs Correctness
The tension: Strict correctness requires coordination that can become the bottleneck. Protection-first accepts drift but keeps the system alive.
| Choice | What Works | What Breaks | Who Pays |
|---|---|---|---|
| Prioritize Correctness | Exact limits enforced | System collapse under coordination load | Infra team (outage) |
| Prioritize Protection | System survives | Some over-admission | Security/Product (explaining drift) |
L6 answer: "For abuse protection, I choose protection-first with bounded drift. I'd rather over-admit 10% than add 10ms to every request or cause a self-inflicted outage when Redis degrades. For billing/quota, I flip to correctness-first — the over-admission has direct cost. The key insight: abuse protection and billing cannot share the same rate limiter configuration because they have incompatible failure modes."
L7 answer: Reframes as risk governance — "Which layer enforces what?" CDN/WAF for coarse abuse, gateway for identity-aware limits, app for business invariants. Defines who signs off on fail-open/closed and what blast radius is acceptable.
3.2 Fault Line 2: Centralized vs Distributed State
The tension: Central state is easy to reason about but creates a dependency in the hot path. Distributed state is resilient but requires explicit coordination mechanisms.
| Choice | What Works | What Breaks | Who Pays |
|---|---|---|---|
| Centralized (Redis per-request) | Simple, accurate, observable | SPOF, latency tax (+5–10ms) | Infra (reliability burden) |
| Distributed (local + async sync) | Resilient, fast, scalable | Accuracy loss (bounded drift) | Product (explaining over-admission) |
| Hybrid (leasing / routing) | Reduced Redis QPS, high accuracy | Complexity (lease sizing, reclaim) | Engineering (maintenance burden) |
L6 answer: "I pick token leasing for this design. Each gateway instance leases a chunk of tokens from Redis (say, 50 tokens for a 1,000 req/min limit with 10 gateways), spends them locally with zero coordination per request, then renews when the lease runs low. Redis QPS drops from one call per request to approximately 2 RPCs per gateway per 5 seconds, a 10–100× reduction. Failure mode: if a gateway crashes mid-lease, those leased tokens are stranded for the lease duration (5 seconds). For abuse protection, 5 seconds of stranded capacity is acceptable."
L7 answer: "Do we even need custom distributed coordination? Evaluate managed gateway throttling (Envoy RLS, AWS API Gateway throttling, Cloudflare Rate Limiting) before building custom. If custom is needed, select the coordination mechanism based on operational cost: token leasing for abuse protection at scale, key-owner routing for authenticated billing paths."
3.3 Fault Line 3: Latency vs Accuracy
The tension: Every millisecond added to the critical request path has compound effects at scale. At 1M req/sec, a 5ms Redis round-trip is the difference between a healthy gateway and a latency amplifier.
| Choice | Latency Impact | Accuracy | When Appropriate |
|---|---|---|---|
| Redis per-request | +5–10ms p99 | High | Low QPS, billing-critical paths |
| Local + async sync | ~0ms added | Medium (bounded drift) | High QPS, abuse protection |
| Hybrid (lease/route) | +1–2ms occasional (lease renewal) | Medium-High | Most production systems |
L6 answer: "The latency tax is the reason the local-first design exists. At 100K req/sec, every 1ms of rate-limiter overhead is 100 additional seconds of cumulative delay per second of traffic. I keep the hot path at ~0ms by checking local buckets first. Redis is for coordination, not for enforcement. The 5–10ms overhead only occurs on lease renewals (every 5–10 seconds per gateway instance) or on the first request from a new identity."
3.4 Fault Line 4: Fail-Open vs Fail-Closed
The tension: When the rate limiter's central store fails, you must choose between protecting availability (fail-open, risk abuse) and protecting resources (fail-closed, risk self-inflicted outage).
| Context | Recommended | Why |
|---|---|---|
| Ingress abuse protection | Fail-open | Limiter shouldn't be a kill switch |
| Billing/quota enforcement | Fail-closed | Cannot give away resources |
| Internal service protection | Depends | Cascade analysis required |
Guardrails for fail-open:
- Aggressive Redis timeout (5–10ms max — fail fast, don't let threads pile up)
- Conservative local fallback caps (set per-instance limit to 2× normal, not unlimited)
- Bypass-rate alerting: bypass_rate > 1% for 5 min → page on-call
- Circuit breaker on backend stress signals: if downstream error rate rises while in fail-open, tighten caps
L6 answer: "I make the fail-open/fail-closed decision before writing a single line of code, and I tie it to intent. For this abuse protection limiter: fail-open. The rate limiter should never become a kill switch for legitimate users. I add three guardrails: aggressive Redis timeout (5ms), conservative local fallback caps, and bypass-rate alerting. If the backend shows stress while in fail-open mode, the fallback caps tighten automatically. For billing: fail-closed with immediate on-call escalation — we can't give away paid resources."
L7 answer: Defines a governance model — who can flip fail-open/closed (change management approval required), what the emergency procedure is, what the kill-switch scope is (per-endpoint, per-tenant, or global), and what post-incident analysis is required.
3.5 Fault Line 5: Infra Ownership vs Team Autonomy
The tension: A central rate limiting service creates consistency but becomes a bottleneck. Per-service SDKs give teams flexibility but cause library hell and config drift.
| Model | Who Owns | Pros | Cons |
|---|---|---|---|
| Central service | Platform team | Consistency | Bottleneck, SPOF |
| Gateway / Sidecar | Platform team | Decoupled, consistent | Requires gateway/mesh investment |
| SDK / Library | Each team | Flexibility | Library hell, version drift |
L6 answer: "Enforce at the gateway/sidecar to avoid per-service SDK drift. Within 6 months of distributing a rate limiting SDK, I'd have 3–4 incompatible versions in production: one with a security bug its team hasn't patched, another pinned to a deprecated release. A policy change then requires coordinating 15 teams instead of one config update. The gateway enforcement model means the platform team owns infrastructure and product teams own policy intent via a self-service API. Policy changes deploy to the gateway, with no service redeploy required."
L7 answer: Defines governance: who owns policy definition (product team, within platform-defined bounds), who approves exceptions (architecture review for limits above platform maximums), how staged rollouts work (shadow → warn → enforce), and how you prevent the platform team from becoming a bottleneck.
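The shadow → warn → enforce progression can be expressed as a small decision function. The semantics assigned to each stage here are an assumption for illustration (shadow only records, warn admits but signals the client, enforce rejects); the playbook names the stages without defining them.

```python
from dataclasses import dataclass
from enum import Enum


class RolloutStage(Enum):
    SHADOW = "shadow"
    WARN = "warn"
    ENFORCE = "enforce"


@dataclass(frozen=True)
class Outcome:
    admit: bool            # forward the request to the backend?
    log_violation: bool    # record that the new limit would have been exceeded
    warn_client: bool      # surface a warning to the caller (mechanism left open)


def decide(stage: RolloutStage, over_limit: bool) -> Outcome:
    if not over_limit:
        return Outcome(admit=True, log_violation=False, warn_client=False)
    if stage is RolloutStage.SHADOW:
        return Outcome(admit=True, log_violation=True, warn_client=False)
    if stage is RolloutStage.WARN:
        return Outcome(admit=True, log_violation=True, warn_client=True)
    return Outcome(admit=False, log_violation=True, warn_client=False)
```

Because the stage is data rather than code, rolling back from enforce to shadow is a config change, which matches the rollback story told earlier in the walkthrough.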
4 Failure Modes & Operational Reality
4.1 Store Failures — Full Timeline
Scenario A: Redis becomes slow (most common)
t=0: Redis p99 jumps from 2ms to 100ms
t=0–30s: Gateway worker threads pile up waiting for Redis responses
t=30s: Gateway request queues fill — new requests start queuing
t=1min: Gateway starts returning 503s to ALL traffic
t=2min: "The rate limiter" has become a global outage
(NOT a Redis outage — a rate limiter outage)
What breaks first: P99 latency at the gateway spikes, threads pile up, the rate limiter becomes a latency amplifier for every request regardless of rate limit status.
Bad reaction: "Increase Redis timeouts." This makes it worse: it extends the window before the circuit breaker fires.
Staff reaction: Aggressive Redis timeout (5ms) trips the circuit breaker within 30 seconds → gateway switches to local-only mode → bypass-rate metric climbs → on-call is paged once bypass_rate stays above 1% for 5 minutes. The gateway never queues on Redis.
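The staff reaction above can be sketched as a small circuit breaker. This is a minimal illustration, not a production implementation: the class name, the `min_calls` and `cooldown_s` parameters, and the cumulative (non-rolling) failure count are all assumptions made for brevity, and `redis_check` is a hypothetical callable that raises `TimeoutError` when the 5ms budget is exceeded.

```python
import time
from typing import Callable, Optional


class RedisCircuitBreaker:
    """Opens after too many timeouts; while open, callers use local-only limits."""

    def __init__(self, failure_ratio: float = 0.5, min_calls: int = 20,
                 cooldown_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_ratio = failure_ratio
        self.min_calls = min_calls      # assumption: minimum sample before tripping
        self.cooldown_s = cooldown_s    # assumption: how long to stay open
        self.clock = clock
        self.calls = 0
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: close and let traffic probe Redis again.
            self.opened_at = None
            self.calls = self.failures = 0
            return False
        return True

    def record(self, timed_out: bool) -> None:
        self.calls += 1
        self.failures += int(timed_out)
        if (self.calls >= self.min_calls
                and self.failures / self.calls >= self.failure_ratio):
            self.opened_at = self.clock()


def check(breaker, redis_check, local_check, key):
    """Return (allowed, bypassed). A bypassed decision should bump the bypass metric."""
    if breaker.is_open():
        return local_check(key), True       # never touch Redis while open
    try:
        allowed = redis_check(key)          # expected to enforce its own short timeout
        breaker.record(timed_out=False)
        return allowed, False
    except TimeoutError:
        breaker.record(timed_out=True)
        return local_check(key), True
```

The second element of the returned tuple is what feeds the bypass-rate metric, so a fail-open decision is always visible to alerting.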
Scenario B: Redis is completely down
| Strategy | Effect | Who It Hurts |
|---|---|---|
| Fail-closed | Protect backend; risk total outage | All users |
| Fail-open with local caps | Preserve availability; bounded abuse risk | Security team (explaining gap) |
Staff choice for abuse protection: Fail-open with aggressive local limits, bypass-rate alerting, and circuit breaker on backend stress. The rate limiter must not become a global kill switch.
4.2 Hot Key & Amplification — Why "Just Shard It" Doesn't Work
The critical insight: sharding helps for many different keys. It does NOT help for one single identity that generates 100K req/s — that identity always maps to the same shard regardless of how many shards exist.
Scenario A: Leaked API key used by botnet
- Symptom: one API key drives 50%+ of traffic; single shard CPU spikes
- Mitigation: (1) local deny-cache — reject locally with TTL 30–60s without hitting Redis, (2) revoke/rotate the key, (3) add CDN/WAF edge rules if stable IP/ASN signals
- Tradeoff: fast containment vs false positives if key is shared with legitimate partner
Scenario B: Legitimate tenant burst (partner batch job)
- Symptom: paying tenant looks like "hot key" but isn't malicious
- Mitigation: token leasing with adaptive lease sizing for higher-tier tenants; per-tenant reservation pool
- Tradeoff: fairness vs utilization
Scenario C: IP-based identity collapses (NAT/corporate proxy)
- Symptom: one IP = thousands of real users → false positives
- Mitigation: IP as coarse outer limiter only; shift to stronger identity (API key / user_id) for authenticated paths
- Tradeoff: better UX vs implementation complexity
Mitigations that actually work for hot keys:
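- Local aggregation: batch increments every 100ms instead of per-request. Each gateway then sends at most 10 writes/s per key, so a key receiving 1,000 req/s on that gateway sees a 100× Redis QPS reduction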
- Key-owner routing: single gateway instance owns enforcement for hot identity — no multi-writer races
- Deny-cache: if identity is already 10× over limit, reject locally without hitting Redis at all
- Key sharding + sum-on-read: split rl:{key}:{window} into rl:{key}:{window}:{shard_0..7} and sum at read time, which distributes write load across shards
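As a rough sketch of the last mitigation, the snippet below splits one logical counter into eight sub-keys and sums them on read. An in-memory dict stands in for Redis, and the function names are illustrative assumptions. One caveat worth stating: the braces in the key pattern above are placeholders. In Redis Cluster, literal `{...}` in a key name is a hash tag that pins keys to one slot, which would defeat the purpose.

```python
import random
from collections import defaultdict

NUM_SHARDS = 8  # matches shard_0..7 above

store = defaultdict(int)  # stand-in for Redis: key -> counter


def shard_key(key: str, window: int, shard: int) -> str:
    # No literal braces here, so sub-keys can hash to different cluster slots.
    return f"rl:{key}:{window}:shard_{shard}"


def increment(key: str, window: int, amount: int = 1) -> None:
    # Writers pick a random sub-key, so one hot identity spreads over 8 keys.
    shard = random.randrange(NUM_SHARDS)
    store[shard_key(key, window, shard)] += amount


def current_count(key: str, window: int) -> int:
    # Readers pay the cost: NUM_SHARDS reads summed into one total.
    return sum(store.get(shard_key(key, window, s), 0) for s in range(NUM_SHARDS))
```

The tradeoff is visible in the code: writes get cheaper per key, while every limit check now needs eight reads.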
4.3 Data Integrity Failures
Clock Skew: Token refill depends on wall time. Drift causes over-refill or under-refill across nodes.
- Mitigations: cap refill deltas (never trust a >10s time jump), use monotonic clocks, use Redis server time inside Lua scripts
- Staff note: redis.call("TIME") inside the Lua script uses Redis server time, reducing clock-skew risk. It adds ~0.1ms overhead but is worth it for billing paths
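The "cap refill deltas" mitigation can be illustrated with a small pure function. This is a sketch under stated assumptions: the function name and signature are hypothetical, and timestamps are plain seconds supplied by the caller (for example, from a monotonic clock or Redis server time).

```python
def refill(tokens: float, capacity: float, rate_per_s: float,
           last_ts: float, now_ts: float, max_delta_s: float = 10.0):
    """Return (new_tokens, new_last_ts) with the elapsed time clamped."""
    elapsed = now_ts - last_ts
    if elapsed < 0:
        # Clock moved backwards: grant nothing, keep the old timestamp.
        return tokens, last_ts
    elapsed = min(elapsed, max_delta_s)  # never trust a jump larger than the cap
    return min(capacity, tokens + elapsed * rate_per_s), now_ts
```

The cap has a cost: a key that was genuinely idle for longer than `max_delta_s` refills more slowly than wall time would allow, which is usually an acceptable price for bounding over-refill.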
Script Bugs: Atomic Lua operations silently wrong. Hardest failure mode to detect.
- Detection: integration tests covering boundary conditions, audit logging with remaining_tokens sampled at 1%
- Recovery: script rollback (keep previous version deployable), counter reset procedure
4.4 Operational Reality Matrix
| Failure | Loud/Silent | User Impact | Detection Time | Prevention |
|---|---|---|---|---|
| Redis down | Loud | Immediate if fail-closed | Seconds | Aggressive timeout + bypass alert |
| Redis slow | Medium | Latency spike | Minutes | Timeout + circuit breaker |
| Silent fail-open | Silent | Invisible until attack | Hours | bypass_rate > 1% alert |
| Hot key | Medium | Subset of users (one shard) | Minutes | Per-shard CPU monitoring |
| Clock skew | Silent | Gradual drift | Minutes–hours | Server-side time in Lua |
| Script bug | Silent | Varies | Hours–days | Integration tests + audit log |
| Policy misconfiguration | Medium | Wrong users throttled | Minutes | Shadow mode + canary rollout |
5 Evaluation Rubric
5.1 Level-Based Signals
| Dimension | Senior (L5) | Staff (L6) | Principal (L7) |
|---|---|---|---|
| Semantics | Token bucket + Redis | Defines contract precisely; separates abuse vs billing semantics | Standardizes semantics org-wide: policy language, versioning, client-facing vs internal |
| Placement | API Gateway + central store | Chooses layers intentionally for blast radius; names CDN/WAF vs gateway vs mesh distinction | Sets org strategy; coarse abuse → CDN/WAF, identity-aware quota → gateway, dependency protection → mesh |
| Coordination | "Central Redis per request" | Picks one concrete mechanism (leasing/routing/reconciliation); explains failure behavior | Chooses via TCO + operational risk; prefers managed (Envoy RLS, WAF throttling) unless gap demands custom |
| Hot keys | "Redis cluster + sharding" | Explains why hot key ≠ many keys; proposes mitigations with scenario tradeoffs | Treats as incident + governance problem: key rotation, capacity planning, tenant isolation |
| Failure modes | "Redis replicas / HA" | Timeline-driven; explicit fail-open/closed by intent; bypass-rate observability contract | Governance + blast-radius controls: kill switches, change management, incident drills |
| Latency | "Low latency required" | Quantifies tax; sets p99 budget; keeps enforcement off critical path | Connects latency to business SLO + cost model; chooses where to pay the tax per path |
| Ownership | Implementation focus | Avoids SDK hell; defines policy lifecycle and rollout | Defines org boundaries: platform vs security vs product, staged rollout, long-term simplification |
5.2 Strong Hire Signals
| Signal | What It Sounds Like |
|---|---|
| Intent before architecture | "What are we protecting against? Abuse, billing, or fairness? These are incompatible systems." |
| Latency trap named | "Redis per-request adds 5–10ms to every request. That's unacceptable at 100K req/s." |
| Failure mode quantified | "Worst-case over-admission with 10 gateways and 50-token leases: ~500 extra tokens — 50% overshoot worst case, 10–20% in practice." |
| Silent fail-open addressed | "bypass_rate > 1% for 5 minutes pages on-call. The metric is mandatory, not optional." |
| Policy lifecycle defined | "Shadow → warn → hard-enforce. Policy changes never skip canary. Who signs off? What metrics must be green?" |
5.3 Lean No-Hire Signals
| Signal | Why It Misses the Bar |
|---|---|
| Algorithm fixation | 15 minutes on Token Bucket vs Sliding Window without tradeoffs |
| Over-engineering | "We need multi-region active-active from day one" |
| Ignoring operations | No mention of monitoring, alerting, failure handling, policy management |
| Missing intent | Designs without clarifying what we're protecting against |
| "Just add replicas" | Replicas improve availability; they don't define degraded-mode behavior |
5.4 Common False Positives
- Knows Redis deeply: Deep Redis knowledge ≠ good system design. Implementation detail fluency without tradeoff reasoning.
- Draws complex diagrams: Complexity is not a Staff signal. Can they explain the organizational cost of each component?
- Mentions many algorithms: Breadth without depth is Senior, not Staff.
- "We'll use Envoy RLS": Naming a managed solution without explaining when custom is needed and when managed is sufficient.
6 Interview Flow & Pivots
6.1 Typical 45-Minute Shape
| Phase | Time | Goal |
|---|---|---|
| Framing | 0–3 min | Name intent, commit to abuse protection, state constraints |
| Entities + API | 3–5 min | Three nouns, hot path vs cold path, response headers |
| High-level design | 5–12 min | Gateway enforcement, local-first + Redis coordination, fail-open |
| Transition | 12 min | Offer three fault lines, let interviewer choose |
| Deep dives | 12–40 min | Failure modes → distributed coordination → policy management |
| Wrap-up | 40–45 min | Organizational insight, who's on-call, what the runbook says |
6.2 Reading the Interviewer
| Interviewer Signal | What They Care About | Where to Go Deep |
|---|---|---|
| Asks about Redis failure | Operational maturity | Fail-open vs fail-closed; bypass-rate observability |
| Asks about accuracy | Distributed systems depth | Coordination mechanisms; drift bounds |
| Asks about multi-region | Scale and architecture | Regional quotas; async reconciliation; GDPR on counter data |
| Asks "who decides limits?" | Organizational design | Policy management; self-service API; review process |
| Asks about DDoS | Security depth | CDN/WAF vs gateway vs application layers |
| Pushes back on your design | Wants you to defend or adapt | State reasoning, acknowledge alternatives, commit to tradeoff |
6.3 What to Deliberately Skip
| Topic | Why L5 Goes Here | What L6 Says Instead |
|---|---|---|
| Algorithm deep dive | It's in every textbook, feels safe | "Token bucket. The algorithm isn't the hard part — coordination is." |
| Database schema design | Feels productive to draw tables | "Counters live in Redis, policies in PostgreSQL. Schema is trivial." |
| HTTP status codes | Easy to enumerate | "429 with Retry-After header. Standard. Moving on." |
| Rate limit dashboard UI | Seems complete | "Admin UI is a CRUD app. Not interesting for this interview." |
| Exact sliding window math | Textbook material | "Sliding window approximation, ±1% error at boundaries. Acceptable." |
6.4 Follow-Up Questions to Expect
- "How do you handle clock skew across gateway instances?"
- "What if a single user generates 90% of traffic?"
- "How do you test rate limiting logic in production?"
- "What metrics would you monitor to detect a silent fail-open?"
- "How do you handle a global rate limit across multiple regions?"
- "What's your failure budget for this service?"
7 Active Drills
Drill 1: The Opening (Intent + Constraints)
Staff Answer
"Before I draw anything — what are we protecting against? Abuse protection, billing/quota enforcement, and multi-tenant fairness are three fundamentally different systems with incompatible failure modes. Designing them as one leads to a hybrid that satisfies none.
I'll assume ingress abuse protection: the constraint is sub-5ms overhead, fail-open behavior (the limiter shouldn't become a kill switch), and eventual consistency with a bounded drift. I'll walk through: placement → algorithm semantics → coordination mechanism → failure modes → observability → policy management."
Why this is L6:
- Names intent before proposing architecture — prevents designing a hybrid that satisfies no intent
- States failure mode preference (fail-open) proactively — doesn't wait for the interviewer to ask
- Frames the outline as a decision sequence, not a component list — each step narrows the design space
❌ Common L5 Trap
"I'll design a rate limiter using token bucket with Redis. Each request checks Redis with a Lua script for atomicity..."
Why this misses: Picks an algorithm before defining the problem. The interviewer asks: "What are we protecting against?" Now the candidate has to backtrack from "Redis per request" when they realize abuse protection doesn't need that latency overhead. The L6 answer establishes intent first — the architecture follows from intent, not the other way around.
Drill 2: Token Bucket Semantics
Staff Answer
"'100 requests per minute' hides four implementation decisions:
- Is burst allowed? Token bucket says yes — you can burst up to bucket capacity (e.g., 200) if tokens have accumulated. Fixed window says no burst, but allows boundary exploitation (100 at t=59s, 100 at t=61s = 200 in 2 seconds).
- What's the window? Fixed (resets at :00, :01) or rolling (always looking back 60 seconds)?
- How much boundary skew is acceptable?
- What happens at the exact threshold — is the 100th request allowed or rejected?
For abuse protection: token bucket with capacity=200 (2x burst allowance) and refill_rate=100/60 per second. Users get smooth enforcement with burst tolerance. The 429 response includes Retry-After with the exact time until the next token is available. The client never needs to know the window boundaries."
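The numbers in this answer map onto a short token bucket sketch. It is a single-process illustration only; the class and method names are assumptions, and time is passed in explicitly so the behavior is easy to reason about. The retry value is rounded up because the Retry-After header carries whole seconds.

```python
import math


class TokenBucket:
    def __init__(self, capacity: float = 200, refill_rate: float = 100 / 60,
                 now: float = 0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = capacity          # start full
        self.last = now

    def try_acquire(self, now: float):
        """Return (allowed, retry_after_seconds)."""
        if now > self.last:
            elapsed = now - self.last
            self.tokens = min(self.capacity,
                              self.tokens + elapsed * self.refill_rate)
            self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True, 0
        # Time until one whole token exists, rounded up for the Retry-After header.
        wait = (1 - self.tokens) / self.refill_rate
        return False, math.ceil(wait)
```

A client that has been idle can burst up to 200 requests, after which admissions settle to the steady 100 per minute.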
Why this is L6:
- Distinguishes burst tolerance from steady-state rate — operationally different for users
- Connects algorithm choice back to intent (abuse protection → burst allowed)
- Specifies the client contract (429 + Retry-After) as first-class, not an afterthought
❌ Common L5 Trap
"100 requests per minute means the user can make 100 API calls per minute. We track this with a counter that resets every minute."
Why this misses: Fixed window semantics with unexamined boundary behavior. The interviewer asks: "What happens at t=59s if a user sends 100 requests, then at t=61s sends 100 more?" The answer: "Both are allowed: 200 requests in 2 seconds." This is the fixed window boundary exploit that allows 2× burst at every window reset. Token bucket makes the maximum burst an explicit, configured capacity rather than an accident of window alignment, but the L5 candidate hasn't considered it.
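The boundary exploit is easy to reproduce with a few lines. This is a toy fixed-window counter (the function name and dict-based storage are illustrative assumptions) replaying the interviewer's scenario.

```python
def fixed_window_allow(counters: dict, now_s: float, limit: int = 100,
                       window_s: int = 60) -> bool:
    window = int(now_s // window_s)      # window index; resets at each minute
    counters[window] = counters.get(window, 0) + 1
    return counters[window] <= limit


counters = {}
admitted = sum(fixed_window_allow(counters, 59.0) for _ in range(100))
admitted += sum(fixed_window_allow(counters, 61.0) for _ in range(100))
print(admitted)  # 200 requests admitted within a 2-second span
```

Each batch lands in a different window, so neither counter ever exceeds 100 even though the combined rate is far above the nominal limit.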
Drill 3: "Hybrid Local + Redis" — Make It Concrete
Staff Answer
"I'll use token leasing. Here's the request timeline:
- Gateway starts up → leases 50 tokens for key api_key_abc from Redis: LEASE api_key_abc 50 ttl=5s
- Redis responds: granted, 50 tokens, expires in 5 seconds
- Requests arrive → gateway decrements local counter (0ms per request, no network hop)
- After 45 requests, local counter drops to 5 (below low-watermark threshold)
- Gateway proactively renews: LEASE api_key_abc 50 ttl=5s
- Redis responds: granted, accounting for all 10 gateways' leases against the global limit
The failure mode I'm explicitly accepting: if a gateway crashes mid-lease, those leased tokens are stranded for 5 seconds. For a 1,000 req/min limit across 10 gateways with 50-token leases: worst-case stranded capacity = 50 tokens for 5 seconds. Acceptable for abuse protection."
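The timeline above can be sketched in memory. `LeaseServer` is a hypothetical stand-in for the Redis-side lease logic, and both class names are assumptions; lease TTL expiry and crash recovery are deliberately omitted, so this shows only the grant and local-decrement path.

```python
class LeaseServer:
    """Stand-in for the Redis-side lease logic (hypothetical, in-memory)."""

    def __init__(self, global_limit: int):
        self.remaining = global_limit  # tokens left in the current window

    def lease(self, requested: int) -> int:
        granted = min(requested, self.remaining)
        self.remaining -= granted
        return granted


class Gateway:
    def __init__(self, server: LeaseServer, lease_size: int = 50,
                 low_watermark: int = 5):
        self.server = server
        self.lease_size = lease_size
        self.low_watermark = low_watermark
        self.local_tokens = server.lease(lease_size)

    def allow(self) -> bool:
        if self.local_tokens <= self.low_watermark:
            # Renew before running dry. A real gateway would do this
            # asynchronously so the request path never waits on the network.
            self.local_tokens += self.server.lease(self.lease_size)
        if self.local_tokens > 0:
            self.local_tokens -= 1  # local decrement, no network hop
            return True
        return False
```

With a 1,000-token limit and 10 gateways, startup alone commits 500 tokens to leases, which is why lease size relative to the global limit is the main tuning knob.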
Why this is L6:
- Names one concrete coordination mechanism — not "sync with Redis"
- Walks through the request timeline to prove the atomic boundary is sound
- Explicitly names and accepts the failure mode (stranded tokens on crash)
❌ Common L5 Trap
"We sync the local counter with Redis every 500 milliseconds. Each gateway reads from Redis periodically to stay in sync."
Why this misses: "Sync with Redis" is not a mechanism — it's a frequency. The interviewer asks: "How exactly does the sync work? Do all gateways read the same Redis key? What if two gateways both read 80 tokens and both try to spend them before the next sync?" This exposes the race condition: periodic sync without a named coordination mechanism means multiple gateways can each independently allow 100 requests during the sync window, collectively over-admitting by N×. Token leasing bounds the over-admission explicitly.
Drill 4: Redis Is Slow/Down
Staff Answer
"Two things happen simultaneously:
First: the circuit breaker fires. I've configured an aggressive Redis timeout of 5ms. At 200ms p99, the circuit breaker opens after seeing 50% of requests time out; it stops sending requests to Redis and switches to local-only mode. This happens within 30 seconds of Redis degrading.
Second: the bypass-rate metric fires. rate_limit_bypassed_total starts incrementing. When bypass_rate exceeds 1% of requests for 5 minutes, PagerDuty pages the on-call engineer. The runbook says: (1) check Redis cluster health, (2) identify the cause (network, memory, CPU), (3) if Redis is recovering, monitor, because local fallback is handling the gap, (4) if Redis is down and not recovering, check whether the backend shows stress and, if it does, tighten local fallback caps.
The thing I'm NOT doing: increasing the Redis timeout to 200ms. That's the wrong reaction — it extends the window where threads pile up waiting for slow Redis responses and turns a Redis slowdown into a gateway outage."
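The alert rule in this answer (bypass_rate above 1% sustained for 5 minutes) could be evaluated roughly as follows. In practice this logic usually lives in the monitoring system's rule language; the class below is only a hypothetical illustration of the "sustained breach" semantics, fed one sample per minute.

```python
from collections import deque


class BypassRateAlert:
    """Fires when bypass_rate exceeds the threshold for N consecutive minutes."""

    def __init__(self, threshold: float = 0.01, sustain_minutes: int = 5):
        self.threshold = threshold
        self.recent = deque(maxlen=sustain_minutes)  # one bool per minute

    def observe_minute(self, bypassed: int, total: int) -> bool:
        rate = bypassed / total if total else 0.0
        self.recent.append(rate > self.threshold)
        # Page only when the window is full and every minute breached.
        return len(self.recent) == self.recent.maxlen and all(self.recent)
```

Requiring every minute in the window to breach avoids paging on a single slow Redis call while still catching a limiter that has quietly stopped enforcing.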
Why this is L6:
- Uses aggressive timeout as a circuit break — not "wait for Redis to recover"
- Names the specific metric and alert threshold for bypass detection
- Describes the runbook steps, not just the technical response
- Explicitly says what NOT to do and why
❌ Common L5 Trap
"Redis p99 at 200ms means our rate limit checks are adding 200ms to requests. We should increase Redis capacity or add replicas to bring the latency down."
Why this misses: Treats Redis slowness as a capacity problem to solve, not a failure mode to handle. The interviewer asks: "While you're scaling Redis, what's happening to your gateway for the next 5 minutes?" Answer: gateway threads are blocking for 200ms on every Redis call. With 100 concurrent requests per instance and 200ms Redis calls, gateway threads saturate and the rate limiter becomes a latency amplifier. The circuit break + local fallback prevents this cascade; adding Redis capacity doesn't.
Drill 5: Hot Key — One Identity at 90% of Traffic
Staff Answer
"The critical distinction: a hot key is structurally different from high traffic volume. Redis sharding distributes many different keys across nodes. It does nothing for one identity that always hashes to the same shard.
What breaks first: the Redis shard handling rl:abusive_api_key:* saturates — CPU and network on that shard spike. This causes latency spikes for all OTHER users whose keys hash to the same shard, not just the abusive key. Collateral damage.
The mitigations in priority order:
- Local deny-cache (immediate): Once the key is identified as over-limit, cache the deny decision locally with a 30–60 second TTL. New requests from this key get rejected at the local bucket without touching Redis at all. Redis QPS from this key drops to near-zero.
- Key-owner routing (for sustained hot keys): Route all requests for hot identities to a consistent gateway instance. Single-writer eliminates the Redis contention for that key. Other gateways never touch it.
- Key sharding + sum-on-read (if both fail): Split rl:{key}:{window} into rl:{key}:{window}:{shard_0..7} and sum the counts at read time. Distributes Redis write load across 8 shards. More complex to implement.
The edge case I'd surface: is this a legitimate partner doing a bulk operation (false positive) or actual abuse? Tighten too hard and you cut off a paying customer. I'd check account tier before escalating to full deny."
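A local deny-cache, the first mitigation listed, might look like the sketch below. The class name, the `handle` helper, and the `redis_allows` callable are all assumptions for illustration; the account-tier check mentioned above would sit in front of `cache.deny`.

```python
import time
from typing import Callable, Dict


class DenyCache:
    def __init__(self, ttl_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._expires: Dict[str, float] = {}

    def deny(self, key: str) -> None:
        self._expires[key] = self.clock() + self.ttl_s

    def is_denied(self, key: str) -> bool:
        expiry = self._expires.get(key)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            del self._expires[key]  # entry expired; re-check with Redis next time
            return False
        return True


def handle(key: str, cache: DenyCache, redis_allows: Callable[[str], bool]) -> bool:
    if cache.is_denied(key):
        return False              # rejected locally, zero Redis traffic
    if not redis_allows(key):
        cache.deny(key)           # remember the verdict for ttl_s seconds
        return False
    return True
```

For a key that stays over its limit, Redis sees roughly one check per TTL per gateway instead of one per request.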
Why this is L6:
- Explains why sharding doesn't solve hot key (structurally different problems)
- Names collateral damage to other users on the same shard — not just the hot key user
- Proposes mitigations in priority order with implementation tradeoffs
- Raises the false positive edge case before the interviewer asks
❌ Common L5 Trap
"One API key at 90% of traffic would be handled by our Redis cluster — the key just gets heavy writes to that shard, but Redis can handle it with enough memory and CPU."
Why this misses: "Redis can handle it" ignores the saturation math. At 100K req/s from one key, the Redis shard handling that key receives 100K writes/second for just that one counter. Redis single-shard throughput is ~100K simple operations/second — this one key alone saturates the entire shard. Every other user hashing to that shard experiences latency spikes. The interviewer asks: "User Alice hashes to Shard 3, same as the abusive key. What does Alice experience?" Answer without thinking carefully: "Normal service." Correct answer: "Latency spikes and potential timeouts as Shard 3 is saturated."
Drill 6: Multi-Tenant Fairness
Staff Answer
"Multi-tenant fairness is a different problem than abuse protection — the noisy tenant isn't necessarily malicious, they're just large. The goal is per-tenant SLO preservation, not just global QPS control.
I'd use hierarchical token buckets with reserved floors:
Global capacity bucket (protects platform overall)
├── Tenant A bucket (contracted: 500 req/s, burst: 800 req/s)
│   └── Sub-buckets per user/endpoint within Tenant A
├── Tenant B bucket (contracted: 200 req/s, burst: 300 req/s)
└── Shared burst pool (available to all tenants when slack exists)
The key mechanism: each tenant has a guaranteed reserved_floor (their contracted rate) that they always get, even under global contention. Above the floor, tenants compete for the shared burst pool. Under contention, burst pool allocation falls back to the reserved floor.
Observability for fairness — critical metrics:
- rate_limit_rejected_total{tenant} per tenant
- tenant.p99_latency_ms per tenant: fairness failures show as latency regressions before they show as rejection spikes
- Alert: 'Tenant starvation: tenant heavily throttled while global utilization < 80%'"
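The reserved-floor mechanism can be illustrated with a per-second allocation function. This is a simplified sketch, not hierarchical token buckets proper: the `Tenant` dataclass, the `allocate` function, and the global capacity of 900 req/s used in the usage note are assumptions, and the burst pool is handed out in dict order rather than by a fairness policy.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    floor: int      # contracted req/s, always granted
    burst: int      # ceiling in req/s when slack exists


def allocate(demand: dict, tenants: dict, global_capacity: int) -> dict:
    """Per-second admission: floors first, then share leftover capacity."""
    # Pass 1: every tenant gets up to its reserved floor, regardless of others.
    grant = {t: min(demand[t], tenants[t].floor) for t in tenants}
    pool = global_capacity - sum(grant.values())  # shared burst pool
    # Pass 2: hand out the pool, never exceeding a tenant's burst ceiling.
    for t in tenants:
        extra = min(demand[t] - grant[t], tenants[t].burst - grant[t], max(pool, 0))
        grant[t] += extra
        pool -= extra
    return grant
```

With Tenant A (500/800) demanding 2,000 req/s and Tenant B (200/300) demanding 250 req/s against an assumed capacity of 900, A is granted 700 and B still receives its full 200 floor.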
Why this is L6:
- Frames fairness as a per-tenant SLO problem, not a global QPS problem
- Proposes hierarchical buckets with reserved floors — protects small tenants from large ones
- Includes per-tenant observability with the specific fairness-detection alert
❌ Common L5 Trap
"We'll give each tenant a separate rate limit based on their plan tier. Enterprise gets 1,000 req/s, Pro gets 500, Free gets 100."
Why this misses: Tiered limits prevent individual tenants from exceeding their quota, but don't prevent the noisy neighbor problem — Enterprise Tenant A consuming all 1,000 req/s of their quota might saturate the shared infrastructure in ways that degrade Tenant B's experience even though Tenant B is within their 500 req/s limit. The L5 answer sets per-tenant ceilings; the L6 answer uses hierarchical buckets with reserved floors to guarantee per-tenant minimums.
Drill 7: Build vs Buy
Staff Answer
"My default is always: use what you already have. Before proposing custom rate limiting, I'd inventory existing controls:
- CDN/WAF already has coarse IP-based rate limiting. Does this solve the abuse problem? For volumetric DDoS, yes. For API abuse from authenticated clients, no.
- AWS API Gateway / GCP Apigee / Envoy RLS all have built-in rate limiting. Do they support our identity model (user_id + endpoint + tenant)? Do they support our coordination model (local-first with async sync)? If yes, use them.
- If managed solutions have a gap: Custom rate limiting is justified only if the gap is differentiated and the maintenance cost is worth it. Common gaps: complex multi-tenant fairness, custom token leasing semantics, per-request cost (not all requests cost 1 token), business-logic-aware limiting.
TCO argument for managed: A managed solution means no on-call burden for the rate limiting infrastructure itself — just the policy configuration. A custom solution means rate-limiting incidents, rate-limiting upgrades, and rate-limiting on-call rotation, all on top of the actual product work.
My recommendation for most mid-size companies: start with managed gateway throttling (Envoy RLS, AWS WAF rate-based rules on CloudFront, or API Gateway). Add custom logic only where managed solutions have demonstrable gaps that affect business goals."
Why this is L6:
- Inventories existing controls before proposing new work
- Quantifies the TCO argument (managed vs custom) rather than just "use the managed service"
- Identifies the specific gaps that justify custom engineering
❌ Common L5 Trap
"We should build a custom rate limiter for full control and flexibility."
Why this misses: "Full control" is not a TCO argument. The interviewer asks: "What does 'full control' get us that Envoy RLS or AWS API Gateway throttling doesn't?" If the answer is "we can customize it more," the follow-up is: "Who maintains the custom rate limiter? Who's on-call for it? Who upgrades it when Redis has a CVE?" Custom infrastructure has organizational cost beyond engineering time.
Drill 8: Policy Changes Without Outages
Staff Answer
"Policy changes follow a lifecycle. 'Deploy tomorrow' is an organizational failure — we should already have self-service for this:
Stage 1 — Shadow mode (24 hours): Apply the new limit logically but don't enforce. Log what would have been rejected. Check: would this limit have affected any legitimate paying customers yesterday?
Stage 2 — Soft enforcement (1 week): Send warning headers to requests that would be rejected, but don't actually return 429. Example: X-RateLimit-Warning: You would be rate limited. Reduce usage. Teams see they're about to be throttled without disruption.
Stage 3 — Canary enforcement (48 hours): Apply hard enforcement to 5% of traffic. Monitor per-tenant rejection rates. If top tenants spike, the limit is misconfigured: roll back with one config change, no deploy.
Stage 4 — Full enforcement: Roll out to 100% after clean canary.
The organizational piece: who approves the change? The platform team sets guardrails (maximum limits per tier). Product teams can self-service within those bounds. Changes above the guardrails require an architecture review. This prevents 'product team set a limit too high because they were scared of throttling users' from making the limit useless."
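The four stages can be expressed as one decision function driven by a config value, which is what makes rollback a config change rather than a deploy. The names below (`Stage`, `decide`, `bucket_key`) are illustrative assumptions; the canary cohort is chosen with a stable hash so the same key always lands in the same group.

```python
import zlib
from enum import Enum


class Stage(Enum):
    SHADOW = "shadow"
    WARN = "warn"
    CANARY = "canary"
    ENFORCE = "enforce"


WARNING = {"X-RateLimit-Warning": "You would be rate limited. Reduce usage."}


def decide(stage: Stage, over_limit: bool, bucket_key: str,
           canary_percent: int = 5):
    """Return (action, extra_headers, log_would_reject)."""
    if not over_limit:
        return "allow", {}, False
    if stage is Stage.SHADOW:
        return "allow", {}, True            # log only, no client-visible effect
    if stage is Stage.WARN:
        return "allow", dict(WARNING), True  # warn, but do not return 429
    if stage is Stage.CANARY:
        # Stable hash: a given key is consistently inside or outside the canary.
        in_canary = zlib.crc32(bucket_key.encode()) % 100 < canary_percent
        return ("reject" if in_canary else "allow"), {}, True
    return "reject", {}, True               # full enforcement: respond with 429
```

Because every over-limit request is logged in every stage, the would-be rejection rate stays comparable across the rollout.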
Why this is L6:
- Full policy lifecycle with shadow mode, soft enforcement, and canary — not "update the config"
- Rollback is a one-line config change, not a deploy — pre-conditions for safe changes
- Defines the organizational governance layer (guardrails + self-service + review)
❌ Common L5 Trap
"Update the rate limit config and deploy. We can use a feature flag to enable it."
Why this misses: A feature flag enables/disables the limit but doesn't tell you if the new limit is correct. The interviewer asks: "How do you know if the new limit is too tight before you've already throttled your top customers?" Answer: "We'll find out from support tickets." The shadow mode and soft enforcement stages exist specifically to catch misconfigured limits before they cause user impact. Deploying with a feature flag is not a deployment strategy — it's a kill switch.
8 Deep Dive Scenarios
Scenario-based analysis for Staff-level depth
Deep Dive 1: Flash Sale Incident
Context: It's Black Friday. The rate limiter is returning 429s to legitimate users during a 10× traffic spike. The on-call engineer escalates to you.
Questions to Surface First:
- Is the backend actually unhealthy, or is the limiter rejecting traffic the backend could handle?
- Is this affecting all users or specific tenant tiers? Are VIP/enterprise customers hitting public limits?
- Was this traffic spike predicted? Was there a capacity planning exercise for Black Friday?
- What's the business cost per minute of rejecting legitimate buyers right now?
Staff Approach — Full Reasoning
| Phase | What to Do |
|---|---|
| Immediate (0–5 min) | Check backend health first — CPU, error rate, connection pool. If backend is healthy, the limiter is misconfigured, not the users. |
| Triage | Is this one tenant hitting limits, or system-wide? Check per-tenant dashboards. Are enterprise customers (who have pre-negotiated higher limits) affected? |
| Quick fix | If backend is healthy and legitimate traffic is rejected: emergency limit increase via feature flag. Document the change, set an expiry. |
| Guardrails | While limits are raised: watch backend CPU and error rates. Don't let increased limits cause a cascade. Set a limit on the limit increase. |
| Post-mortem | Why didn't capacity planning catch this? Should we have elastic limits for planned events? Should marketing notify infra before campaigns? |
Staff insight: The rate limiter's job is to protect the backend, not to be "correct." If the backend is healthy and users are being rejected, the limiter is misconfigured for the situation.
Metrics to Watch:
backend.cpu_utilization, backend.error_rate, backend.connection_pool_utilization, rate_limiter.rejection_rate_by_tier (are VIP customers hitting limits?), business.revenue_per_minute (the actual cost of rejections)
Organizational Follow-up: Create a pre-event capacity review process with marketing and product — Black Friday is not a surprise. Build elastic limits for planned events (limits that automatically increase during pre-announced high-traffic windows). Add "large marketing campaigns" to the infra team's change management calendar.
Ownership Question: "Who decides whether to override the rate limit during a flash sale incident?" Staff answer: The Staff on-call engineer, with a time-bounded override that auto-expires in 2 hours. The decision criteria are codified in the runbook: (1) backend health is green, (2) legitimate traffic is being rejected, (3) override logged with reason. This cannot be a per-incident judgment call — it needs to be a pre-approved procedure.
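A time-bounded override with codified criteria could be modeled along these lines. This is a hypothetical sketch: the `Override` dataclass, the function names, and the 2× multiplier are assumptions, while the three approval criteria and the 2-hour expiry come from the runbook description above.

```python
from dataclasses import dataclass


@dataclass
class Override:
    multiplier: float          # assumption: e.g. 2.0 doubles the limit
    reason: str                # logged with the change
    created_at: float
    ttl_s: float = 2 * 3600    # auto-expires in 2 hours


def approve_override(backend_healthy: bool, legit_traffic_rejected: bool,
                     reason: str, now: float, multiplier: float = 2.0):
    # Runbook criteria: all three must hold, otherwise no override.
    if backend_healthy and legit_traffic_rejected and reason.strip():
        return Override(multiplier, reason, created_at=now)
    return None


def effective_limit(base_limit: int, override, now: float) -> int:
    if override and now - override.created_at < override.ttl_s:
        return int(base_limit * override.multiplier)
    return base_limit  # expired or absent: fall back without human action
```

Expiry is checked at read time, so nobody has to remember to remove the override after the incident.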
Staff Signals:
- Checks backend health before blaming the rate limiter
- Has pre-established emergency procedures, not ad-hoc judgment calls
- Identifies the organizational gap (capacity planning didn't include Black Friday) rather than just fixing the immediate symptom
Deep Dive 2: Silent Fail-Open
Context: You discover the rate limiter has been running in fail-open mode for 3 days because Redis was slow. No alerts fired. What went wrong, and how do you fix the system?
Questions to Surface First:
- How long has fail-open been active? Did abuse increase during this window — check login failure rates, API error rates, unusual traffic patterns.
- Is fail-open an intentional design choice, or did it happen by accident (no fallback configured)?
- What other services degrade silently when their dependencies are slow? Is this a systemic pattern?
- Who should have been alerted, and why wasn't existing monitoring sufficient?
Staff Approach — Full Reasoning
| Dimension | Staff Answer |
|---|---|
| Root cause | Missing observability contract: bypass rate wasn't monitored, or alert threshold was wrong |
| Immediate action | Is this an active incident? Check for abuse during the 3-day window — login failures, API anomalies, unusual data access patterns |
| System fix | Add rate_limiter.bypass_rate metric with alert: bypass_rate > 1% for 5min → page. This is non-negotiable. |
| Process fix | Fail-open is a deliberate choice. But deliberate requires visibility. Add time-bounded fail-open: auto-expires after 30 minutes and requires explicit re-approval to extend. |
| Broader question | Audit all services with fail-open behavior for similar observability gaps. "Silent degradation" as a category in post-incident retrospectives. |
Staff insight: Fail-open without alerting is the same as having no rate limiter. Every deliberate degradation mode needs a corresponding observability contract.
Metrics to Watch:
rate_limiter.bypass_rate (alert if >1% for 5min), redis.latency_p99 (warn at 5ms, critical at 10ms), rate_limiter.enforcement_mode (enforcing vs bypass — should be a visible dashboard metric, not buried in logs), login.failure_rate (anomaly during bypass window), api.error_rate_by_endpoint (abuse signal during bypass window)
Organizational Follow-up: Audit all services with fail-open behavior for observability gaps — this is a systemic pattern, not a one-off. Add "silent degradation detection" as a quarterly resilience review checklist item. Create an organizational standard: any service that can degrade silently must have a bypass-rate equivalent metric with a PagerDuty alert.
Ownership Question: "Who is responsible for the 3-day window where we were unprotected?" Staff answer: Two failures of ownership. (1) The engineer who implemented fail-open without adding the bypass-rate alert — they made a deliberate design decision without the corresponding observability contract. (2) The on-call rotation for not detecting the anomaly in Redis latency metrics that would have indicated the issue. Both gaps need process fixes, not blame.
Staff Signals:
- Treats this as a systemic observability contract failure, not a Redis monitoring gap
- Proposes time-bounded fail-open that auto-expires and requires explicit re-approval
- Audits all services with similar silent degradation modes
Deep Dive 3: Large Tenant Onboarding
Context: Sales just closed a deal 5× larger than your current biggest tenant. They go live in 2 weeks. Ensure the rate limiter handles them without affecting other tenants.
Questions to Surface First:
- What's this tenant's expected traffic pattern — steady-state vs bursty? What are their peak hours?
- If this tenant misbehaves, what's the blast radius to other tenants? Are counters shared with others on the same Redis shard?
- What SLA did Sales promise? Is it documented, or implicit? Who approved the capacity commitment?
- Do we have tenant isolation at the infrastructure level, or are all tenants sharing the same Redis shard?
Staff Approach — Full Reasoning
| Phase | What to Do |
|---|---|
| Capacity math | Current largest tenant's peak QPS × 5 ≈ the new tenant's expected peak, which becomes the new hot-key ceiling. Will this saturate a Redis shard? (Redis single shard: ~100K ops/sec) |
| Isolation assessment | Does this tenant need dedicated Redis infrastructure, or can they share with a token-leasing model that bounds per-tenant Redis QPS? |
| Shadow mode | Deploy with shadow mode first — log what the rate limiter would do, don't enforce. Run for 48 hours before live. |
| Staged rollout | 10% of tenant traffic → monitor shard health, latency, rejection rates → 100% |
| SLA documentation | "Tenant X gets guaranteed 50K req/s. Platform reserves 20% headroom for burst. Any traffic above 50K may be throttled." |
Staff insight: "Can we handle this tenant?" is the wrong question. The right question is "What's the blast radius if this tenant misbehaves, and how do we isolate it from affecting our other 10,000 tenants?"
Metrics to Watch:
redis.shard_cpu_utilization (per shard, alert at 70% during onboarding), rate_limiter.tenant_rejection_rate (per tenant; the new tenant's traffic should not raise rejection rates for other tenants), rate_limiter.p99_by_tenant (fairness signal), onboarding.shadow_mode_rejection_rate (would-be rejections before enforcement goes live)
Organizational Follow-up: Create a "large tenant onboarding checklist" for Sales/Engineering handoff: capacity sign-off required before SLA commitment, technical discovery call between tenant's infra team and platform team, 2-week minimum shadow mode before enforcement. Add "large tenant onboarding" as a trigger for automatic capacity review.
Ownership Question: "Sales committed to a 200K req/s SLA without asking infra. Who owns the resulting capacity emergency?" Staff answer: Sales owns the customer relationship and the commitment made without sign-off. Infra owns the implementation of whatever we commit to. The organizational fix: rate limit SLA commitments require infra sign-off above a QPS threshold. Create a named "capacity commitment approval" process. Sales gets fast-path approval for standard tiers; above-standard requires a 1-week infra review.
Staff Signals:
- Reframes from handling load to blast radius isolation
- Uses shadow mode before hard enforcement for high-risk capacity changes
- Negotiates explicit SLA documentation with Sales before committing infrastructure
Deep Dive 4: Post-Mortem — Limiter Let Through an Attack
Context: Post-mortem shows a credential-stuffing attack got through the rate limiter for 45 minutes. 50K accounts were compromised. What do you present to leadership?
Questions to Surface First:
- Was rate limiting ever designed to be the primary defense against credential stuffing, or was it assumed?
- What other defense layers exist (WAF, CAPTCHA, anomaly detection)? Which ones fired, which didn't?
- How did the attacker evade detection — IP rotation, distributed botnet, credential replay from a breach database?
- What's the regulatory exposure? Do we need to notify affected users under GDPR/CCPA?
Defense-in-depth: rate limiting (Layer 2) cannot stop sophisticated attacks alone. Layers 3 and 4 catch what volume-based rules miss.
Staff Approach — Full Reasoning
| Section | Content |
|---|---|
| What happened | Attacker rotated 5K IPs at 10 req/IP/min — each IP stayed under the per-IP threshold of 20 req/min. Identity was weak (IP-only on unauthenticated endpoint). |
| Why we missed it | Rate limiting is volume-based protection. This attack distributed volume across many identities. Behavioral detection (login failure rate anomaly) would have caught it, but wasn't the first line of defense. |
| Immediate actions | Add global login-failure-rate limit (regardless of IP): alert at 1,000 failures/minute globally. Add per-IP login failure rate: alert at 5 failures/min per IP (different from request rate). Integrate with CAPTCHA at threshold. |
| Systemic fix | Layered defense: CDN/WAF (known attack patterns) + rate limiting (volume protection) + anomaly detection (behavioral patterns) + CAPTCHA (human verification). Each layer catches different attack vectors. |
| Ownership boundary | Define who owns login abuse detection: Security team (threat intelligence, attack patterns) vs Platform team (infrastructure enforcement). Both own the gap that allowed 45 minutes of undetected abuse. |
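The numbers in the table can be checked with a short calculation. The sketch below is illustrative only: `evaluate_minute` and the dictionary shapes are hypothetical, and it assumes the simplification that every attacker request is a failed login. The thresholds are the ones quoted above.

```python
from typing import Dict

PER_IP_REQUEST_LIMIT = 20      # req/min per IP (existing rule)
PER_IP_FAILURE_ALERT = 5       # failed logins/min per IP (proposed)
GLOBAL_FAILURE_ALERT = 1_000   # failed logins/min across all IPs (proposed)


def evaluate_minute(requests_by_ip: Dict[str, int], failures_by_ip: Dict[str, int]) -> Dict[str, object]:
    """Summarize which signals would fire for one minute of login traffic."""
    total_failures = sum(failures_by_ip.values())
    return {
        "ips_over_request_limit": sum(1 for n in requests_by_ip.values() if n > PER_IP_REQUEST_LIMIT),
        "ips_over_failure_alert": sum(1 for n in failures_by_ip.values() if n >= PER_IP_FAILURE_ALERT),
        "global_failures": total_failures,
        "global_alert": total_failures >= GLOBAL_FAILURE_ALERT,
    }


# 5K rotating IPs at 10 req/IP/min; assumed worst case: every attempt fails.
attack = {f"ip-{i}": 10 for i in range(5_000)}
report = evaluate_minute(requests_by_ip=attack, failures_by_ip=attack)
```

Under these assumptions no IP crosses the request-rate limit, while both failure-rate signals fire: 50,000 failures per minute globally, and every IP above the per-IP failure threshold.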
Metrics to Watch:
login.failure_rate_global (alert at 1K failures/min), login.failure_rate_per_ip (alert at 5/min per IP), login.unique_ips_per_minute (spread = botnet signal), waf.blocked_rate (Layer 1 effectiveness), captcha.trigger_rate (Layer 4 load)
Organizational Follow-up: Define ownership boundary: who owns login abuse detection? Security team owns threat intelligence and policy; Platform team owns enforcement infrastructure. Joint review of defense-in-depth posture with specific ownership assigned at each layer. File regulatory notification if breach threshold is met (GDPR: 72-hour notification). Quarterly adversarial simulation to test each defense layer.
Ownership Question: "The rate limiter was operating as designed. Who's accountable for the 50K compromised accounts?" Staff answer: Accountability is shared. (1) The original architecture decision that used rate limiting as the sole authentication defense — the engineer who made that decision and the architect who approved it. (2) Security team for not having behavioral anomaly detection on login failure rates. (3) Product for not prioritizing multi-factor authentication. Single points of accountability for systemic security failures are usually wrong — the postmortem should identify the layer of defense that should have caught this and didn't, and assign that to the team responsible for that layer.
Staff Signals:
- Presents to leadership with architectural framing (rate limiting was never the right sole defense) rather than just fixing the specific gap
- Proposes layered defense-in-depth, not just tighter thresholds
- Defines ownership boundaries across Security and Platform teams as an organizational action item
Deep Dive 5: Multi-Region Expansion
Context: Your company is expanding to EU. The rate limiter is currently single-region. What's your recommendation?
Questions to Surface First:
- What's the primary use case for EU expansion — latency reduction, data residency compliance, or both?
- Which rate limiting intents need global accuracy (billing/quota) vs regional approximation (abuse protection)?
- Do rate limit counters contain PII or request metadata that falls under GDPR data residency requirements?
- What's the acceptable drift window — can a global abuser temporarily get 2× their limit across two regions?
Hybrid approach: abuse protection runs independently per region with async sync; billing/quota uses region-scoped quotas or tighter global coordination where accuracy matters.
Staff Approach — Full Reasoning
| Option | Tradeoffs |
|---|---|
| Independent per-region | Simple. A global abuser can hit 2× their limit (once per region). No cross-region latency. Recommended for abuse protection. |
| Global coordination | Accurate. But cross-region latency (+50–100ms per request) adds to the hot path or requires complex async sync. Only for billing-critical endpoints. |
| Hybrid (per-region + async reconciliation) | Per-region enforcement with async global reconciliation every 60 seconds. Accepts bounded drift. Best of both for most cases. |
The GDPR question that Senior candidates miss: Rate limit counters store {user_id}:{endpoint}:{count}. User IDs are personal data under GDPR. Replicating EU user rate limit counters to US Redis may require a legal basis (SCCs, adequacy decision) or be prohibited under data sovereignty requirements. Check with Legal before designing cross-region replication of counters.
Staff recommendation:
- For abuse protection: per-region enforcement with optional asynchronous reconciliation, kept out of the hot path
- For billing/quota: region-scoped quotas (EU users get EU quota, US users get US quota) OR tighter global coordination only on billing endpoints
Migration plan:
- Week 1–2: Deploy EU rate limiter infrastructure (Redis, gateway middleware) — zero migration
- Week 3–4: Route new EU signups to EU enforcement
- Week 5–8: Migrate existing EU users to EU enforcement (canary, then full)
- Week 9+: Optional global reconciliation for abuse counters
Metrics to Watch:
cross_region.counter_sync_latency_ms, cross_region.drift_bound (max counter difference between regions at any point), geo_routing.misroute_rate (EU users accidentally hitting US enforcement), gdpr.counter_data_residency_violations (EU user counters found in US store)
Organizational Follow-up: Legal/compliance sign-off on counter data residency before any cross-region replication is deployed. Define SLA for cross-region reconciliation latency. Create runbook for region failover: if US goes down, does EU absorb global traffic with global limits or regional-only limits?
Ownership Question: "Who decides whether billing uses global or region-scoped quotas?" Staff answer: Finance (they understand the revenue impact of 2× quota across regions), Legal (data residency constraints on counter replication), and Product (customer contractual implications of region-scoped vs global quotas). Engineering provides the technical options and cost tradeoffs. The actual decision is a business decision, not an engineering decision.
Staff Signals:
- Separates abuse protection (tolerates drift) from billing/quota (needs accuracy) rather than applying one architecture to both
- Raises GDPR/data residency constraints on counter data — the non-technical dimension most candidates miss
- Plans a phased rollout rather than a big-bang migration
9 Level Expectations Summary
After studying this playbook, you should be able to:
- Name the three intents and explain why they require incompatible failure modes
- Quantify the latency trap: a per-request Redis call adds 5–10ms of overhead at scale
- Design a local-first coordination model with a named mechanism (token leasing, key-owner routing, bounded reconciliation) and its specific failure mode
- Explain fail-open vs fail-closed by intent, with bypass-rate observability as a mandatory component
- Walk through the silent fail-open failure mode from detection to runbook
- Explain why hot keys are structurally different from high-cardinality traffic and propose mitigations
- Design a policy management system with a lifecycle (shadow → warn → canary → enforce) and organizational governance
The Bar for This Question
Mid-level (L4/E4): Understands rate limiting as a counter in Redis with a TTL. Can explain token bucket semantics. Knows rate limiting prevents abuse. Doesn't reason about distributed coordination, failure modes, or ownership.
Senior (L5/E5): Builds the baseline architecture quickly — gateway middleware + Redis counters with Lua atomicity. Reasons about the consistency-availability tradeoff. Has an opinion on local vs centralized counters and can explain the latency implications of a synchronous Redis call. Gets to fail-open vs fail-closed when prompted.
Staff+ (L6/E6+): Spends under 5 minutes on architecture and 30+ minutes on depth. Quantifies the latency trap and names the coordination mechanism (leasing, routing, or reconciliation) with its failure mode. Makes fail-open/fail-closed a proactive decision, not a response to the interviewer's question. Defines bypass-rate observability as mandatory. Reasons about policy management as an organizational problem. The interviewer should learn something from the answer.
10 Staff Insiders: Controversial Opinions
10.1 "Exact Rate Limiting" Is a Myth
At scale, you are never enforcing the limit you think you're enforcing.
| Factor | Impact |
|---|---|
| Clock skew | Token refill varies by 10–100ms across nodes |
| Network delay | Coordination messages arrive late |
| Retry amplification | Rejected requests retry, adding load |
| Batching | Requests arrive in bursts, not smoothly |
| Measurement lag | By the time you measure, you've already over-admitted |
The Staff position: Stop pretending you're enforcing "exactly 1000 req/s." You're enforcing "approximately 1000 req/s ± ε." The Staff question is: what's your ε, and is it acceptable for your intent?
Why this matters in interviews: Candidates who claim exact enforcement without acknowledging drift reveal they haven't operated rate limiters at scale. The bar-raiser question: "What's the worst-case over-admission in your design, and who signed off on it?"
10.2 Abuse Protection and Billing Cannot Share an Algorithm
If you're using the same rate limiter for abuse protection and billing enforcement, one of them is wrong.
| Dimension | Abuse Protection | Billing/Quota |
|---|---|---|
| Failure mode | Fail-open (limiter shouldn't be kill switch) | Fail-closed (can't give away resources) |
| Correctness | Bounded drift acceptable | Drift unacceptable; audit required |
| Latency | Cannot add latency to hot path | Latency acceptable for accuracy |
| Identity | Weak (IP, fingerprint) | Strong (authenticated user, API key) |
The Staff position: These are fundamentally different systems. Trying to serve both with "one rate limiter" leads to billing drift (abuse limiter too loose) or availability problems (billing limiter too strict for abuse). Stripe figured this out and separated them. Companies that conflate them eventually have an incident.
10.3 Global Fairness Dies at Scale — And That's OK
Many companies that claim "global rate limiting" are actually doing per-region rate limiting and calling it global. At true global scale, they've abandoned global fairness.
| Scale | What Works | What Breaks |
|---|---|---|
| Single region | Redis per-request, strong consistency | Works fine |
| Multi-region, low QPS | Cross-region coordination | Acceptable for billing |
| Multi-region, high QPS | Per-region enforcement | Global accuracy is approximate |
| True global scale (10M+ req/s) | Per-region, no reconciliation | Regions are effectively independent |
The dirty secret: At hyperscale, a user with a "1,000 req/min global limit" might actually get 1,000 × N (where N = number of regions) if they distribute traffic across regions.
The Staff position: Global fairness is a spectrum. The honest question is: "What's the blast radius of our approximation, and is that acceptable?" At hyperscale, the answer is almost always "yes: an N× worst case across N regions (2× with two regions) is acceptable for abuse protection."
10.4 The Policy Management Problem Is Harder Than the Technical Problem
The rate limiter is a solved technical problem. The unsolved problem is getting 15 product teams to agree on rate limit policies, keep them updated, respond when limits are hit, and not override the platform team's guardrails when their customers complain.
In practice:
- Product teams set limits too high because they fear blocking their customers — the limits become theater
- Product teams set limits too low and then complain when customers are throttled
- Nobody updates limits when a product changes behavior significantly
- The platform team becomes a bottleneck for every policy change, or abdicates policy ownership entirely
The Staff position: Rate limiting's technical implementation is a 2-week project. Rate limiting's organizational governance is a 6-month project. Staff engineers work on the 6-month project after shipping the 2-week project. Self-service policy API with guardrails, policy lifecycle with shadow mode, and clear ownership boundaries are the deliverables that actually matter long-term.
Appendix A: Algorithm Mechanics — Token Bucket, Fixed Window, Sliding Window
A.1 What "Limit" Actually Means
"100 requests per second" hides four decisions:
- Is burst allowed?
- Over what window?
- How much boundary skew is acceptable?
- What happens at the exact threshold?
A.2 Fixed Window — Why It's Almost Always Wrong
t=0.99s: 100 requests → Allowed (window 1)
t=1.01s: 100 requests → Allowed (window 2 just started)
Result: 200 requests in ~20ms
Users can time requests to get 2× burst at every window boundary.
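A minimal sketch of the boundary problem, assuming a limit of 100 requests per 1-second window as in the example. `FixedWindowLimiter` is a hypothetical class written for illustration, with the clock passed in explicitly so the run is deterministic.

```python
class FixedWindowLimiter:
    """Counter that resets at every window boundary."""

    def __init__(self, limit: int, window_s: float = 1.0) -> None:
        self.limit = limit
        self.window_s = window_s
        self.window_id = None
        self.count = 0

    def allow(self, now_s: float) -> bool:
        window_id = int(now_s // self.window_s)
        if window_id != self.window_id:
            # New window: the counter forgets everything from the previous one.
            self.window_id = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


limiter = FixedWindowLimiter(limit=100)
first = sum(limiter.allow(0.99) for _ in range(100))   # end of window 1
second = sum(limiter.allow(1.01) for _ in range(100))  # start of window 2
```

Both bursts are admitted in full, so `first + second` is 200 requests within roughly 20ms against a nominal limit of 100 per second.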
A.3 Token Bucket — Gateway Default
Parameters:
- capacity (burst): maximum tokens the bucket can hold; sets the burst ceiling
- refill_rate (steady-state): tokens added per second; sets the sustained rate
- cost (optional): tokens per request (default 1; set >1 for expensive endpoints)
Timeline example:
t=0.0s: 150 requests → 150 allowed (tokens: 200→50)
t=0.5s: 120 requests → refilled 50 tokens (50→100)
→ 100 allowed, 20 rejected (429)
t=0.6s: 10 requests → refilled ~10 tokens (0→10)
→ 10 allowed (tokens: 0)
Why token bucket fits gateway: allows controlled bursts, smooths traffic, O(1) evaluation, degrades gracefully, easy to approximate locally.
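The timeline above is consistent with capacity = 200 and refill_rate = 100 tokens/s. The sketch below replays it under that assumption; `TokenBucket` and `send` are hypothetical names, and time is passed in as integer milliseconds to keep the arithmetic exact.

```python
class TokenBucket:
    def __init__(self, capacity: int, refill_rate_per_sec: float, now_ms: int = 0) -> None:
        self.capacity = capacity
        self.refill_rate_per_sec = refill_rate_per_sec
        self.tokens = float(capacity)  # start full
        self.last_ms = now_ms

    def allow(self, now_ms: int, cost: int = 1) -> bool:
        # Refill for the elapsed time, capped at capacity.
        elapsed_ms = max(0, now_ms - self.last_ms)
        refilled = elapsed_ms * self.refill_rate_per_sec / 1000
        self.tokens = min(self.capacity, self.tokens + refilled)
        self.last_ms = now_ms
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def send(bucket: TokenBucket, now_ms: int, n: int):
    """Send n requests at the same instant; return (allowed, rejected)."""
    allowed = sum(bucket.allow(now_ms) for _ in range(n))
    return allowed, n - allowed


bucket = TokenBucket(capacity=200, refill_rate_per_sec=100)
results = [send(bucket, 0, 150), send(bucket, 500, 120), send(bucket, 600, 10)]
```

`results` matches the timeline: 150 allowed, then 100 allowed with 20 rejected, then 10 allowed.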
A.4 Sliding Window Counter — Memory-Efficient Approximation
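Weighted sum of the current and previous fixed-window counts, where overlap_fraction is the share of the sliding window that still falls inside the previous fixed window. Roughly ±1% error at boundaries when traffic in the previous window is evenly spread. Memory: O(1) per client.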
count ≈ (prev_window_count × overlap_fraction) + current_window_count
Good for high-cardinality clients where O(N) sliding window log is too expensive.
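A minimal sketch of the formula above, replaying the same boundary burst that defeats a fixed window. `SlidingWindowCounter` is a hypothetical class; it keeps only two integers per client and takes the clock as integer milliseconds.

```python
class SlidingWindowCounter:
    def __init__(self, limit: int, window_ms: int) -> None:
        self.limit = limit
        self.window_ms = window_ms
        self.window_id = 0
        self.current = 0
        self.previous = 0

    def _roll(self, now_ms: int) -> None:
        window_id = now_ms // self.window_ms
        if window_id == self.window_id + 1:
            # Moved to the next window: current becomes previous.
            self.previous, self.current = self.current, 0
        elif window_id > self.window_id + 1:
            # Idle for more than one full window: nothing overlaps.
            self.previous, self.current = 0, 0
        self.window_id = window_id

    def estimate(self, now_ms: int) -> float:
        self._roll(now_ms)
        elapsed_ms = now_ms % self.window_ms
        overlap_ms = self.window_ms - elapsed_ms  # overlap_fraction * window
        return self.previous * overlap_ms / self.window_ms + self.current

    def allow(self, now_ms: int) -> bool:
        if self.estimate(now_ms) < self.limit:
            self.current += 1
            return True
        return False


limiter = SlidingWindowCounter(limit=100, window_ms=1000)
first = sum(limiter.allow(990) for _ in range(100))
second = sum(limiter.allow(1010) for _ in range(100))
```

The first burst is admitted in full. At t=1010ms the previous window still carries 99% weight, so only one more request fits before the estimate reaches the limit.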
A.5 Leaky Bucket — Why It's Rarely Used for Abuse Protection
Leaky bucket queues requests and processes at a fixed rate. The problem: queuing abusive traffic is worse than rejecting it. Under attack, the queue fills with attacker requests, adding memory pressure and increasing latency for legitimate users waiting in queue. For abuse protection: reject fast (token bucket 429), don't queue.
Appendix B: Client Identification Patterns
B.1 Identity Waterfall
Identity resolution follows priority order — use the strongest available:
| Identity | Strength | Notes |
|---|---|---|
| User ID (JWT sub, session) | Strong | Authenticated; stable; 1:1 with account |
| API Key | Medium | Can be leaked; should be rotatable |
| Device fingerprint | Weak | Evasive; false positives on shared devices |
| Source IP | Weak | NAT/corporate proxy = many users → one IP |
| IP + User-Agent hash | Slightly stronger | Minor improvement; still evasive |
Principle: IP as coarse outer limiter only. Shift to stronger identity (API key / user_id) for authenticated paths.
B.2 Rate Limit Key Construction
rate_limit:{identity}:{scope}:{window}
Examples:
rate_limit:ip:203.0.113.42:/login:60s # IP + endpoint + window
rate_limit:apikey:abc123:/search:10s # API key + endpoint + window
rate_limit:user:u_789:/api/v1:1m # User ID + API prefix + window
rate_limit:tenant:t_42:/api/v1:1m # Tenant ID for multi-tenant fairness
Key design considerations:
- Cardinality → memory pressure: user_id:endpoint has much higher cardinality than user_id alone
- Hot key risk: user_id:{hot_user} can dominate a Redis shard
- TTL strategy: refill_window + 10s buffer, which allows slightly stale entries to expire naturally
B.3 Endpoint Sensitivity Classes
Different endpoints warrant different limits and policies:
| Endpoint | Abuse Risk | Recommended Strategy |
|---|---|---|
| /login, /signup | High (credential stuffing) | Strict: per-IP + global + login-failure-rate |
| /search, /catalog | Medium (scraping) | Moderate: per-user + per-IP fallback |
| /api/v1/* (authenticated) | Low-Medium (business use) | Generous: per-user, per-tenant |
| /health, /metrics | Minimal | Exempt or extremely high limit |
| /webhook/* | Partner traffic | Per-partner limits, separate tier |
| /admin/* | Security-critical | Very strict + 2FA required |
Appendix C: Storage & Coordination Patterns
C.1 Centralized Redis — Lua Atomicity
The check-refill-decrement operation must be atomic to prevent race conditions:
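-- KEYS[1] = bucket key
-- ARGV: capacity, refill_rate_per_sec, now_ms, cost_tokens
-- Returns: {allowed (0/1), remaining_tokens, retry_after_ms}
local capacity = tonumber(ARGV[1])
local refill_rate_per_sec = tonumber(ARGV[2])
local now_ms = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])
-- Assumed storage layout: a hash with fields "tokens" and "last_ms"; a missing key is a full bucket
local state = redis.call("HMGET", KEYS[1], "tokens", "last_ms")
local tokens = tonumber(state[1]) or capacity
local last_ms = tonumber(state[2]) or now_ms
-- Refill for the elapsed time, capped at capacity
local elapsed_ms = math.max(0, now_ms - last_ms)
tokens = math.min(capacity, tokens + elapsed_ms * refill_rate_per_sec / 1000)
local allowed, retry_after_ms = 0, 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
else
  retry_after_ms = math.ceil((cost - tokens) / refill_rate_per_sec * 1000)
end
-- Persist state; TTL = time to refill a full bucket + 10s buffer
redis.call("HSET", KEYS[1], "tokens", tokens, "last_ms", now_ms)
redis.call("PEXPIRE", KEYS[1], math.ceil(capacity / refill_rate_per_sec * 1000) + 10000)
return {allowed, math.floor(tokens), retry_after_ms}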
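Staff note on time: Use redis.call("TIME") inside the script instead of a client-supplied now_ms to reduce clock-skew risk, at ~0.1ms overhead. On Redis versions before 5.0, call redis.replicate_commands() before any write in the script, because TIME is non-deterministic.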
C.2 Token Leasing — Reduce Redis QPS by 10–100×
Failure mode: Gateway crashes mid-lease → stranded tokens for ttl duration (5 seconds). Acceptable for abuse protection; not for billing.
Lease sizing: Too small → frequent renewals → higher Redis QPS. Too large → fairness issues (one gateway holds a disproportionate share) + more stranded on crash.
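A minimal sketch of the leasing idea, assuming an in-memory `CentralBudget` as a stand-in for the shared Redis counter. Both class names are hypothetical, and lease TTL and refill are omitted to keep the example short.

```python
class CentralBudget:
    """Stand-in for the shared store; counts how often it is called."""

    def __init__(self, tokens: int) -> None:
        self.tokens = tokens
        self.calls = 0

    def lease(self, want: int) -> int:
        self.calls += 1
        granted = min(want, self.tokens)
        self.tokens -= granted
        return granted


class LeasingGateway:
    """Spends leased tokens locally; contacts the budget only to renew."""

    def __init__(self, budget: CentralBudget, lease_size: int) -> None:
        self.budget = budget
        self.lease_size = lease_size
        self.local_tokens = 0

    def allow(self) -> bool:
        if self.local_tokens == 0:
            self.local_tokens = self.budget.lease(self.lease_size)
        if self.local_tokens > 0:
            self.local_tokens -= 1
            return True
        return False


budget = CentralBudget(tokens=1_000)
gateway = LeasingGateway(budget, lease_size=100)
allowed = sum(gateway.allow() for _ in range(1_000))
central_calls = budget.calls

# A second gateway that leases and then crashes strands its tokens.
fresh = CentralBudget(tokens=1_000)
crashed = LeasingGateway(fresh, lease_size=100)
crashed.allow()
stranded = crashed.local_tokens
```

With a lease size of 100, the 1,000 admitted requests cost 10 calls to the central store instead of 1,000. The crashed gateway shows the other side of lease sizing: 99 unspent tokens are unavailable to everyone until the lease expires.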
C.3 Key-Owner Routing — Single-Writer Per Identity
Route requests so one gateway instance owns enforcement for each identity. Eliminates multi-writer races.
hash(api_key) % gateway_count → routes to consistent owner
Owner handles all enforcement locally, no coordination needed
Failure mode: Owner crashes → requests reroute to new owner, who must fetch current counter from Redis (one-time fallback). Load skew if one identity dominates (all traffic for hot key hits one gateway).
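The routing rule can be sketched as a stable hash followed by a modulo. `owner_index` is a hypothetical helper; SHA-256 is used here only because Python's built-in `hash()` is salted per process and would not give the same owner on every gateway.

```python
import hashlib


def owner_index(api_key: str, gateway_count: int) -> int:
    """Map an identity to the gateway instance that owns its counter."""
    digest = hashlib.sha256(api_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % gateway_count


owner_a = owner_index("abc123", gateway_count=8)
owner_b = owner_index("abc123", gateway_count=8)
```

One caveat on the design: plain modulo remaps most identities whenever gateway_count changes. A consistent or rendezvous hashing scheme would limit that churn, at the cost of more moving parts.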
C.4 Bounded Reconciliation — Explicit Drift Budget
Accept drift but make it explicit and bounded. Force global check when local tokens drop below a low-watermark.
if local_tokens < low_watermark:
force_global_check(key) # One-time Redis call, not per-request
# Periodic: every 500ms
send_usage_deltas(key, local_count_delta)
Redis reconciles global budget across all gateways
Drift bound: G × local_slack where G = gateway count and local_slack = tokens between forced global checks.
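As a worked example of the drift bound, the sketch below computes the worst-case over-admission and expresses it as a fraction of the global limit. The function names and the sample numbers (20 gateways, 25 tokens of slack, a 10,000-token limit) are illustrative assumptions.

```python
def drift_bound(gateway_count: int, local_slack: int) -> int:
    """Worst-case over-admission: every gateway spends its full slack."""
    return gateway_count * local_slack


def drift_fraction(gateway_count: int, local_slack: int, global_limit: int) -> float:
    """Drift bound as a fraction of the global limit (the epsilon)."""
    return drift_bound(gateway_count, local_slack) / global_limit


bound = drift_bound(gateway_count=20, local_slack=25)
epsilon = drift_fraction(gateway_count=20, local_slack=25, global_limit=10_000)
```

Here the bound is 500 tokens, or 5% of the limit. Whether 5% is acceptable depends on the intent: plausible for abuse protection, likely not for billing.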
C.5 Quick Comparison
| Mechanism | Redis QPS | Correctness | Hot Key | Best For |
|---|---|---|---|---|
| Per-request | 1:1 with traffic | Highest | Saturation risk | Billing-critical, low QPS |
| Token leasing | ~1:100 reduction | Medium-High | Renewal only | Abuse protection at scale |
| Key-owner routing | ~1:N (per owner) | High | Isolated to owner | Authenticated traffic |
| Bounded reconciliation | Periodic + watermark | Medium (ε) | Low | Abuse where drift OK |
Appendix D: Response Semantics
D.1 On Rejection
HTTP/1.1 429 Too Many Requests
Retry-After: 2
Content-Type: application/json
{"error": "rate_limit_exceeded", "retry_after_seconds": 2}
D.2 On Allow (Optional Headers)
HTTP/1.1 200 OK
X-RateLimit-Remaining-Tokens: 42
X-RateLimit-Refill-Rate: 100
X-RateLimit-Burst-Capacity: 200
D.3 Token Bucket Reset Time
Token bucket has no single reset time — it refills continuously. If you must provide a reset-like value:
retry_after_seconds = ceil((cost - tokens_remaining) / refill_rate)
D.4 Retry Behavior — The Thundering Herd Risk
Rate limiting tightly couples to client retry behavior. Naive clients turn a 429 into a retry storm:
429 received → retry immediately → another 429 → retry immediately...
10,000 clients each retry 3 times → 30,000 requests in the next 100ms
For trusted clients: Publish guidance and SDKs enforcing exponential backoff + jitter. Retry-After header must be respected.
For untrusted callers: Assume they ignore guidance. Use local deny-cache (reject locally without Redis for known-bad identities), progressive backoff (increase delay on repeated 429s for same identity), temporary blocks.
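For the trusted-client guidance, a minimal sketch of exponential backoff with jitter that also honors Retry-After. `backoff_delay` and its default base and cap are hypothetical choices, and the random source is injectable so the behavior can be tested.

```python
import random
from typing import Callable


def backoff_delay(
    attempt: int,
    retry_after_s: float = 0.0,
    base_s: float = 0.5,
    cap_s: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The jittered delay is added on top of Retry-After, so clients never
    retry early and do not all wake at the same instant.
    """
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return retry_after_s + rng() * ceiling


delays = [backoff_delay(n, retry_after_s=2.0) for n in range(5)]
```

Adding jitter to Retry-After, rather than taking the maximum of the two, is a deliberate choice in this sketch: a plain maximum would let every client that received the same Retry-After value retry at the same moment.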
Appendix E: Metrics & Observability
E.1 Core Metrics — Non-Negotiable
# Three categories every request falls into:
rate_limit_allowed_total{identity, endpoint, tier}
rate_limit_rejected_total{identity, endpoint, tier, reason}
rate_limit_bypassed_total{identity, endpoint, reason} ← SILENT FAIL-OPEN DETECTOR
# Infrastructure health:
rate_limit_check_latency_ms{path} # local vs Redis path
redis_latency_p99{operation} # lease renewal, counter check
redis_timeout_total{operation} # circuit break trigger
# Policy health:
rate_limit_enforcement_rate_pct # % of requests checked against Redis (vs local only)
E.2 Critical Alerts
| Alert | Condition | Severity | Why |
|---|---|---|---|
| Silent fail-open | bypass_rate > 1% for 5 min | Page | Running unprotected |
| Enforcement rate | enforcement_rate < 95% | Page | Systemic coordination failure |
| Redis latency | redis_p99 > 5ms | Warn | Approaching circuit break |
| Redis latency | redis_p99 > 10ms | Page | Circuit break imminent |
| Rejection spike | rejected_total increase >10× in 5 min | Warn | Attack or misconfiguration |
| 429 drops to zero during traffic spike | rejected_total → 0 during high traffic | Page | Rate limiter may have failed |
E.3 Control Plane vs Data Plane
Control plane (policy management — cold path):
- Policy store: versioned configs with schema validation
- Rollout: shadow → warn → canary → enforce
- Kill switch: fast rollback at endpoint/tenant/global scope
- Audit trail: who approved, what traffic affected, when deployed
Data plane (per-request enforcement — hot path):
- Gateway/sidecar reads policy from local cache (never calls policy store per-request)
- Makes allow/deny decision locally
- Emits metrics asynchronously
E.4 Debugging Silent Fail-Open
When bypass_rate fires:
- Check Redis health: redis-cli PING, cluster status, latency histogram
- Check circuit breaker state: is the gateway in local-only mode?
- Check bypass_rate trend: has it been elevated for hours (silent) or minutes (recent)?
- Check for abuse during the bypass window: login failures, API error rates, traffic anomalies
- Confirm local fallback caps are set: are per-instance limits protecting the backend?
Appendix F: Scaling Considerations
F.1 What Works at Each Scale
| Scale | What Works | What Breaks | Recommended Change |
|---|---|---|---|
| 1K req/s | Redis per-request | Nothing yet | Keep it simple |
| 10K req/s | Redis per-request | Latency starting to show | Consider local-first |
| 100K req/s | Local-first + leasing | Redis cluster memory | Token leasing mandatory |
| 1M req/s | Local-first, async | Global accuracy | Per-region enforcement |
| 10M+ req/s | Per-region, no global sync | Global fairness is theoretical | Accept regional limits |
F.2 Multi-Region Evolution Path
- Phase 1: Single region. Redis cluster handles all traffic. Simple, accurate.
- Phase 2: Multi-region, low QPS. Cross-region replication with 50–100ms latency — acceptable for billing paths.
- Phase 3: Multi-region, high QPS. Per-region enforcement with async reconciliation for abuse. Billing uses region-scoped quotas.
- Phase 4: True global scale. Regions are effectively independent. "Global" limits are per-region approximations.
F.3 What You Don't Build on Day One
- Multi-region replication — start single-region, add regions when you have users there
- Adaptive rate limiting (auto-tightens under load) — complex to tune, often causes oscillation
- Per-endpoint granularity beyond 3–4 tiers — over-engineering for most products
- Real-time abuse analytics dashboard — build this in month 3, not month 1
Start simple. Add complexity only when production data shows you need it.
Appendix G: Multi-Tenant Fairness Deep Dive
G.1 The Noisy Neighbor Failure Mode
Fairness failures show up as:
- Uneven SLO burn: Small tenants see p99 spike during large tenant burst
- Support ambiguity: "Your platform is unreliable" tickets with no global incident
- Hidden starvation: Small tenant throttled while large tenant consumes most shared capacity
G.2 Hierarchical Token Buckets
Enforce at multiple layers simultaneously:
Global capacity bucket (protects platform)
├── Enterprise Tier bucket (protects other tenants)
│   ├── Tenant A bucket (contracted 1,000 req/s burst 1,500)
│   │   └── Sub-buckets per user/endpoint within Tenant A
│   └── Tenant B bucket (contracted 500 req/s burst 750)
├── Pro Tier bucket
└── Free Tier bucket
Why this is Staff-grade: Answers "what if one user inside a tenant is the noisy neighbor?" The per-user sub-bucket within a tenant limits intra-tenant noisy neighbors, not just inter-tenant.
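A minimal sketch of the hierarchy, assuming a request is admitted only if every bucket on the path from the user up to the global bucket has tokens. `Bucket` and `allow` are hypothetical, the token counts are made up, and refill is omitted so the example shows a single instant.

```python
from typing import Iterator, Optional


class Bucket:
    def __init__(self, name: str, tokens: int, parent: "Optional[Bucket]" = None) -> None:
        self.name = name
        self.tokens = tokens
        self.parent = parent

    def chain(self) -> "Iterator[Bucket]":
        """Yield this bucket and then each ancestor up to the root."""
        node = self
        while node is not None:
            yield node
            node = node.parent


def allow(leaf: Bucket, cost: int = 1) -> bool:
    chain = list(leaf.chain())
    # Check every level first, then deduct, so a rejection consumes nothing.
    if all(b.tokens >= cost for b in chain):
        for b in chain:
            b.tokens -= cost
        return True
    return False


platform = Bucket("global", 5_000)
enterprise = Bucket("enterprise-tier", 3_000, parent=platform)
tenant_a = Bucket("tenant-a", 1_500, parent=enterprise)
noisy_user = Bucket("tenant-a/user-1", 100, parent=tenant_a)
quiet_user = Bucket("tenant-a/user-2", 100, parent=tenant_a)

noisy_allowed = sum(allow(noisy_user) for _ in range(500))
quiet_allowed = sum(allow(quiet_user) for _ in range(50))
```

The noisy user is capped at its own 100 tokens, so the quiet user's 50 requests still pass and Tenant A keeps most of its budget.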
G.3 Reserved Floor + Shared Burst Pool
- reserved_rate[tenant] is protected even under global contention, so tenants always get their contracted rate
- burst_pool absorbs temporary spikes if global slack exists
- Under contention, burst pool allocation falls back to reserved floor
Borrow rules:
- Who can borrow from burst pool? (Paid tiers only?)
- How much can they borrow? (Cap at 2× reserved?)
- Under contention? (First-come-first-served or proportional?)
Tradeoff: Reservations improve isolation but reduce utilization if tenants are idle. An enterprise customer paying for 1,000 req/s but only using 100 req/s holds 900 req/s of reserved capacity that other tenants can't access.
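One way to make the borrow rules concrete is sketched below. It assumes a borrow cap of 2× the reserved rate and proportional sharing of the burst pool under contention; both are illustrative answers to the questions above, not the only reasonable ones, and `allocate` is a hypothetical function.

```python
from typing import Dict


def allocate(
    demand: Dict[str, float],
    reserved: Dict[str, float],
    burst_pool: float,
    borrow_cap_multiple: float = 2.0,
) -> Dict[str, float]:
    """Admitted rate per tenant for one interval."""
    # Step 1: every tenant gets its demand up to the reserved floor.
    floor = {t: min(demand[t], reserved[t]) for t in demand}
    # Step 2: demand above the floor, limited by the borrow cap.
    extra = {
        t: max(0.0, min(demand[t], reserved[t] * borrow_cap_multiple) - floor[t])
        for t in demand
    }
    total_extra = sum(extra.values())
    # Step 3: share the pool proportionally if it cannot cover all borrowing.
    scale = 1.0 if total_extra <= burst_pool else burst_pool / total_extra
    return {t: floor[t] + extra[t] * scale for t in demand}


grants = allocate(
    demand={"tenant_a": 1_500, "tenant_b": 100},
    reserved={"tenant_a": 1_000, "tenant_b": 500},
    burst_pool=300,
)
```

Tenant A receives its 1,000 floor plus the whole 300-token pool, while Tenant B's idle 400 req/s of reservation stays unused. That idle capacity is the utilization cost described in the tradeoff.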
G.4 Observability for Fairness
Critical metrics sliced by tenant:
rate_limit_allowed_total{tenant}
rate_limit_rejected_total{tenant, reason}
request_latency_p99{tenant} ← fairness failures appear here first
Fairness-specific alerts:
- Tenant starvation: tenant is heavily throttled while global utilization < 80%
- Plan-change regression: top tenants spike rejections after a policy rollout
- Burst pool domination: one tenant continuously consumes the shared burst pool
G.5 Tradeoff Summary
| Mechanism | Isolation | Utilization | Complexity | Debuggability |
|---|---|---|---|---|
| Flat quota per tier | Low | High | Low | High |
| Weighted quotas | Medium | High | Low-Medium | Medium |
| Reserved floor + pool | High | Medium-High | Medium | Medium |
| Hierarchical buckets | Highest | Medium | High | Medium-Low |
Staff recommendation: Start with weighted quotas. Move to reserved floor + pool when a large tenant (>10% of global capacity) causes fairness incidents. Hierarchical buckets only when intra-tenant fairness (per-user-within-tenant) becomes a support issue.