Ssstaffsignal

Design a Distributed Rate Limiter

Staff-Level Playbook

Technologies referenced in this playbook: Redis · API Gateways

How to Use This Playbook

Organized for interview use first, reference second. Read front-to-back once. Return to individual sections for targeted review.

Mode | Time | What to Read
Quick Review | 15 min | Executive Summary → Interview Walkthrough → Fault Lines → Active Drills
Targeted Study | 1–2 hrs | Executive Summary → Interview Walkthrough → Fault Lines → weak-spot Deep Dives
Deep Dive | 3+ hrs | Everything, including appendices

What is Rate Limiting? — Why interviewers pick this topic

Rate limiting controls how many requests a client can make to your system within a time window. Without it, a single misbehaving client can overwhelm your servers, degrade performance for everyone, or run up infrastructure costs.

Before vs After — Flash Sale scenario:

Without rate limiting:
t=0:      "50% off everything" push notification
t=+10s:   Traffic spikes 10x — 50,000 req/s hits the API
t=+30s:   Database connection pool exhausted
t=+45s:   Backend returning 500 errors
t=+2min:  Full outage. All users see error pages.
t=+45min: Engineers restore service. Revenue lost. Trust damaged.

With rate limiting:
t=0:      Same push notification, same 10x spike
t=+10s:   Rate limiter kicks in — excess traffic gets 429 responses
t=+10s:   Backend stays at healthy capacity (5,000 req/s)
t=+1min:  Traffic normalizes as retries spread out
t=+5min:  Zero downtime. Graceful degradation. A metric blip, not a page.

Why interviewers reach for this question: Rate limiting surfaces the core Staff-level skill — reasoning about tradeoffs under uncertainty. There's no perfect solution. Every choice has a cost. Do you optimize for accuracy or latency? Who absorbs the cost of false positives? How do you handle distributed coordination without adding latency? Interviewers want to see you navigate these tensions, not recite Token Bucket mechanics.

Mechanics Refresher: Algorithms
Algorithm | How It Works | Pros | Cons
Token Bucket | Tokens refill at a fixed rate; each request costs a token | Allows controlled bursts; O(1) per check | Slightly more complex state
Fixed Window | Count requests in fixed time intervals | Simple to implement | Boundary exploit: 2× burst at window edges
Sliding Window Log | Count requests over continuously sliding window | Accurate | O(N) memory per client; doesn't scale
Sliding Window Counter | Weighted approximation between two fixed windows | Memory-efficient approximation | ±1% error at boundaries
Leaky Bucket | Queue requests; process at fixed rate | Smooth output | Queuing abusive traffic is worse than rejecting it

For most production systems: Token bucket. The algorithm is almost never the interview question — coordination, failure modes, and ownership are.
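
To make the table's first row concrete, here is a minimal single-process sketch of a token bucket. The class name, parameter names, and the injectable `clock` are illustrative assumptions, not part of the playbook; a production version would also need locking and per-identity storage.

```python
import time
from typing import Callable


class TokenBucket:
    """Illustrative single-process token bucket (not thread-safe)."""

    def __init__(self, capacity: float, refill_rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity        # maximum burst size, in tokens
        self.refill_rate = refill_rate  # tokens added per second
        self.clock = clock              # injectable so tests can control time
        self.tokens = capacity          # start full: an idle client may burst
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill lazily from elapsed time, then try to spend `cost` tokens."""
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Refilling lazily on each call keeps the check O(1) with two numbers of state per client, which is the "slightly more complex state" the table mentions.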

What This Interview Actually Tests

Rate limiting is not an algorithm question. Everyone knows Token Bucket.

This is a distributed systems ownership question that tests:

  • Whether you clarify intent before designing
  • Whether you reason about failure modes proactively
  • Whether you understand who pays for each tradeoff
  • Whether you can own the operational burden

The key insight: Rate limiting is fundamentally a policy enforcement problem with no perfect answer. Staff engineers reason about who absorbs the cost of imperfection.

The L5 vs L6 Contrast — Start Here

Level Calibration
Behavior | Senior (L5) | Staff (L6)
First move | Draws Redis + Token Bucket | Asks "What are we protecting against?"
Algorithm | Selects Token Bucket | Identifies the Latency Trap: central Redis adds 5–10ms to every request
Consistency | Assumes strong consistency | Argues rate limiting is "fuzzy": eventual consistency may be acceptable
Failure | Mentions "Redis replicas" | Asks "Fail-open or fail-closed? Who signs off on that?"
Ownership | Focuses on implementation | Moves enforcement to Gateway/Sidecar to avoid "client library hell"

Why "first move" separates levels

L5: Starts with a solution shape ("token bucket + Redis") before the problem is defined. This reads as pattern-matching and creates downstream confusion — mixing abuse protection, billing/quota, and fairness into one design with incompatible correctness and failure expectations.

L6: Names intent out loud and commits to one path before drawing a box. "Are we protecting infrastructure from abuse, enforcing paid quotas, or isolating tenants? These are fundamentally different systems with incompatible failure modes. I'll assume abuse protection."

Why "failure" separates levels

L5: Treats failure as "add Redis replicas / HA." That improves availability, but it dodges the hard question: during slowness, failovers, or partial outages, what do we do with requests?

L6: Makes the decision quickly and ties it to ownership: "For abuse protection we fail-open with conservative local caps + alerting so the limiter doesn't become a kill switch. For billing/quota we fail-closed because we can't give away resources." Also states who signs off on the risk and how you prevent silent bypass.

Why "ownership" separates levels

L5: Focuses on implementation ("we'll add middleware to services") and underestimates organizational drift: polyglot stacks, version skew, inconsistent enforcement, and slow rollouts for policy changes.

L6: Treats rate limiting as a platform control: enforce at the Gateway/Ingress (or sidecar/mesh), keep policy declarative, make changes safe to roll out and roll back. This is the Staff signal: you're designing for the organization, not just for one service.

The Staff Positions

Default Staff Positions
Position | Rationale
Token bucket over fixed window | Fixed window has boundary exploits; token bucket smooths traffic
Local-first over Redis-per-request | Avoid the latency trap; coordinate periodically, not per-request
Fail-open for abuse protection | The limiter shouldn't become a kill switch; protect availability
Fail-closed for billing/quota | Can't give away resources; accuracy trumps availability
Gateway/sidecar over SDK | Avoid "client library hell"; enforce at the infrastructure layer
Bounded drift is acceptable | For abuse protection, ±10% accuracy is fine; don't over-engineer

The Three Intents

Three intents drive every design decision. Each leads to a fundamentally different architecture.

Intent | Constraint | Strategy | Failure Mode | Correctness Bar
Abuse Protection | Speed is everything | Fail-open, local-first counters, high throughput | Some over-admission | Bounded drift acceptable (ε)
Billing/Quota | Accuracy is everything | Fail-closed, strong consistency, strict accounting | Cannot give away resources | Drift unacceptable; audit required
Multi-Tenant Fairness | Isolation is everything | Weighted quotas, reservations, bounded bursts | Noisy neighbor isolation | Per-tenant SLO preservation

The Five Fault Lines

# | Fault Line | The Tension
1 | Protection vs Correctness | Prioritize system survival (allow drift) or exact limits (risk collapse)?
2 | Centralized vs Distributed State | Redis per-request (simple, accurate, SPOF) vs local-first (resilient, fast, drifty)?
3 | Latency vs Accuracy | Pay the 5–10ms Redis tax on every request, or accept approximation?
4 | Fail-Open vs Fail-Closed | When the limiter fails, protect availability or protect resources?
5 | Infra Ownership vs Team Autonomy | Central service vs sidecar vs SDK? Who owns policy changes?

In the Wild: Real Production Systems

Stripe — Local-First Token Bucket with Centralized Billing

Stripe uses a local-first token bucket for abuse protection at the gateway layer, keeping enforcement sub-millisecond. For billing/quota (tracking API call usage against paid plans), they use a separate centralized path with stronger consistency guarantees. Every response includes RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset headers — treating rate limit transparency as a first-class API contract.

Staff insight: Stripe's separation of abuse protection from billing enforcement is the canonical example of why these systems cannot share an algorithm. Abuse protection fails open; billing fails closed. These failure modes are incompatible.

Figma — Backpressure-Based Throttling

Figma's collaborative editing infrastructure uses backpressure-based throttling rather than hard 429 rejection — graceful degradation through reduced cursor update fidelity, delayed sync batching, and selective feature throttling under load. When a document has too many simultaneous editors, the system reduces real-time sync frequency rather than disconnecting users.

Staff insight: "Throttling doesn't always mean 429s. In stateful systems, graceful degradation is the preferred strategy." Fail-open vs fail-closed is a spectrum, not a binary — the right answer depends on whether your protocol is stateless (HTTP) or stateful (WebSocket).

Cloudflare — Edge-First Multi-Layer Defense

Cloudflare enforces rate limiting at multiple layers: L3/L4 filtering at the network edge for volumetric DDoS, WAF-level rules at L7 for application-aware limits, and challenge pages as an intermediate step before hard blocks. Each layer catches what the previous one misses, with different accuracy/latency tradeoffs at each tier.

Staff insight: Cloudflare's multi-layer approach is the production implementation of the "CDN/WAF for coarse filtering, gateway for identity-aware limits" architecture. Rate limiting is a stack, not a single component.

What Interviewers Probe

After You Say... | They Will Ask...
"Token bucket + Redis" | "What's the latency tax? What happens when Redis is slow?"
"Hybrid local + sync" | "How does coordination actually work? What's the drift bound?"
"We'll shard Redis" | "What about hot keys? One identity can dominate one shard."
"Fail-open for availability" | "What prevents the backend from melting? What's the circuit breaker?"
"We'll add replicas" | "Replicas don't answer degraded mode. What's your fallback behavior?"

System Architecture Overview

[Diagram: system architecture overview]

Quick-Reference: The 30-Second Cheat Sheet

Level Calibration
Topic | The L5 Answer | The L6 Answer (Say This)
Algorithm | "Token bucket + Redis per request" | "Token bucket semantics, local-first counters. Redis is for async coordination, never in the critical path."
Consistency | "Strong consistency, centralized" | "Bounded drift is acceptable for abuse protection; I'll quantify the drift bound. Billing needs strong consistency via a separate path."
Failure | "Add Redis replicas for HA" | "Fail-open for abuse protection (limiter shouldn't be a kill switch), fail-closed for billing (can't give away resources). Replicas don't answer degraded-mode behavior."
Hot keys | "Shard Redis" | "Sharding helps for many keys, not one hot key. Mitigate with token leasing, deny-cache, or key-owner routing."
Ownership | "Add middleware to each service" | "Enforce at the gateway/sidecar. Per-service SDKs create library hell: 3–4 incompatible versions in production within 6 months."
Policy changes | "Update the config file" | "Policy lifecycle: propose → review → staged rollout → observe → enforce. Feature-flag + canary. Who signs off? Who's on-call?"

Key Numbers Worth Memorizing

Metric | Value | Why It Matters
Redis per-request latency overhead | +5–10ms p99 | The latency trap: unacceptable on hot paths at high QPS
Local bucket check latency | ~0ms | Why local-first is the right default for abuse protection
Worst-case drift (10 nodes, 5s sync) | ~83 extra req/window | With 10 gateways syncing every 5s and 100 req/min limit; ~83% overshoot absolute worst case; 10–20% in practice
Bypass rate alert threshold | >1% for 5 minutes | Key signal for silent fail-open: if bypass is elevated this long, we're unprotected
Redis timeout for circuit break | 5–10ms | Aggressive timeout prevents rate limiter from becoming a latency amplifier
Token leasing reduction in Redis QPS | ~10–100× | Compared with per-request Redis calls; the operational efficiency justification for leasing
Enforcement rate SLO alert | <95% | Page on-call if less than 95% of requests are being enforced against Redis

Phase 1: Requirements & Framing (2–3 minutes)

State functional requirements in 30 seconds — don't enumerate, state the category:

"We need to limit request rates per client to protect backend services from abuse and enforce fair usage across tiers."

Invest remaining time on non-functional requirements — this is the Staff move:

"What's the intent? Abuse protection, billing quota, or multi-tenant fairness? I'll assume abuse protection because that's where the hardest distributed tradeoffs live."

Then commit to a constraint set: "For abuse protection: sub-5ms enforcement latency overhead, fail-open behavior (I'll justify this), and eventual consistency across instances — I'll quantify the drift bound. These three constraints drive the entire design."

Phase 2: Core Entities & API (1–2 minutes)

State entities quickly — 30 seconds:

  • RateLimitPolicy: tier, endpoint pattern, window size, threshold, action (reject / throttle / log)
  • RateLimitCounter: composite key {identity}:{endpoint}:{window}, count, TTL
  • RateLimitDecision: allow/reject, remaining quota, retry_after_ms

Don't draw an ER diagram. Name the three nouns, confirm alignment, move on.

Check path (hot, every request — middleware, not a standalone API):

CheckRateLimit(identity, resource, action) → { allowed: bool, remaining: int, retry_after_ms: int }

Config path (cold, admin only):

PUT /rate-limits/rules   { tier, resource, limit, window, action }
GET /rate-limits/rules?tier=free
DELETE /rate-limits/rules/{rule_id}

Response headers on every request: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After
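
As a rough illustration of how a decision could be turned into the headers listed above, here is a small sketch. The `limit` and `reset_epoch_s` fields and the `to_headers` helper are assumptions added for the example; so is the choice to treat `X-RateLimit-Reset` as an epoch timestamp and to emit `Retry-After` only on rejection. `Retry-After` is expressed in whole seconds, as HTTP defines it.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RateLimitDecision:
    allowed: bool
    limit: int            # assumed field: the policy threshold
    remaining: int
    retry_after_ms: int
    reset_epoch_s: int    # assumed field: when the current window resets


def to_headers(decision: RateLimitDecision) -> Dict[str, str]:
    """Map a decision to response headers (hypothetical helper)."""
    headers = {
        "X-RateLimit-Limit": str(decision.limit),
        "X-RateLimit-Remaining": str(max(decision.remaining, 0)),
        "X-RateLimit-Reset": str(decision.reset_epoch_s),
    }
    if not decision.allowed:
        # Retry-After takes whole seconds, so round the millisecond hint up.
        headers["Retry-After"] = str(math.ceil(decision.retry_after_ms / 1000))
    return headers
```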

Phase 3: High-Level Architecture (5–7 minutes)

Draw the architecture in three layers, then walk through the request flow:

[Diagram: three-layer architecture and request flow]

Walk the request flow in 90 seconds:

  1. Every request hits the gateway's rate limit middleware
  2. Middleware checks local token bucket — this is the hot path, ~0ms
  3. If tokens available: allow, forward to backend
  4. If no tokens: return 429 + Retry-After header
  5. Background: every 500ms, sync local count deltas to Redis for cross-instance coordination
  6. Redis is never in the critical request path
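
The flow above can be sketched roughly as follows. To keep it short, the sketch swaps the token bucket for a plain per-window counter and omits window resets; `LocalFirstLimiter` and the `publish` callback (which stands in for the background Redis sync and returns the latest global counts) are hypothetical names.

```python
from collections import defaultdict
from typing import Callable, Dict


class LocalFirstLimiter:
    """Hot path touches only local memory; sync() runs on a background timer."""

    def __init__(self, limit: int,
                 publish: Callable[[Dict[str, int]], Dict[str, int]]) -> None:
        self.limit = limit
        self.publish = publish                # hypothetical: push deltas, get global counts
        self.local_delta = defaultdict(int)   # admitted locally, not yet synced
        self.global_seen = defaultdict(int)   # last global count we heard about

    def check(self, identity: str) -> int:
        """Return an HTTP status: 200 to forward, 429 to reject. No network I/O."""
        used = self.global_seen[identity] + self.local_delta[identity]
        if used >= self.limit:
            return 429
        self.local_delta[identity] += 1
        return 200

    def sync(self) -> None:
        """Flush local deltas to the shared store and refresh the global view."""
        deltas = dict(self.local_delta)
        self.local_delta.clear()
        self.global_seen.update(self.publish(deltas))
```

Between two `sync()` calls each instance decides on stale global counts, which is exactly where the bounded drift comes from.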

Key points to hit explicitly:

  1. Gateway-level enforcement — one enforcement point, not N per-service middlewares
  2. Local-first with Redis coordination — local token bucket for sub-ms checks; async sync to Redis; Redis is NOT in the critical path
  3. Fail-open as default — for abuse protection, blocking legitimate users is worse than letting some abuse through
  4. Config store separate from enforcement — policy changes propagate asynchronously, not in the hot path
  5. Observability from day one — 429 rate, false positive rate, Redis latency p99, bypass rate

Phase 4: Transition to Depth (1 minute)

"The basic architecture is straightforward. What makes this Staff-level is the failure mode reasoning. Three areas worth going deep: what happens when Redis fails, distributed coordination across multiple gateway instances, and policy management as an organizational problem. Which is most interesting to you?"

If the interviewer doesn't have a preference: lead with fail-open vs fail-closed — most impressive and most universally applicable.

Phase 5: Deep Dives (25–30 minutes)

For each, follow the Staff pattern: state the tradeoff → pick a position → quantify the cost → name who absorbs it.

Fault Line 1: Fail-open vs fail-closed (5–7 min)

Open with the decision framework:

"When Redis is down, do we let all traffic through (fail-open) or block all traffic (fail-closed)? For abuse protection, I default to fail-open: blocking 100% of legitimate users to stop potential abuse is worse than temporarily allowing unchecked traffic. For billing/quota, I'd flip to fail-closed because giving away resources has direct revenue impact."

Walk through the failure sequence:

  1. Redis goes down → middleware detects failure (connection timeout, 5–10ms aggressive timeout)
  2. Middleware switches to local in-memory counters — degraded accuracy but non-zero enforcement
  3. Observability pipeline fires alert: "enforcement rate dropped below 95%"
  4. On-call engineer sees alert, confirms Redis outage, follows runbook
  5. Redis recovers → middleware detects healthy connection → resumes centralized counters

The real danger — silent fail-open: "If the fallback silently passes all traffic without alerting, you could run unprotected for hours. This is the scenario most Senior candidates miss. Adding a bypass-rate metric (rate_limit_bypassed_total) with an alert threshold is mandatory, not optional."
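
A minimal sketch of a fallback that cannot fail open silently: every bypass increments a counter that an alert can watch. `FailOpenLimiter`, `central_check`, and `local_check` are hypothetical names; the assumption is that the central check raises `TimeoutError` or `ConnectionError` when the store is slow or down.

```python
from typing import Callable


class FailOpenLimiter:
    def __init__(self, central_check: Callable[[str], bool],
                 local_check: Callable[[str], bool]) -> None:
        self.central_check = central_check  # hypothetical store-backed check
        self.local_check = local_check      # degraded, in-memory enforcement
        self.checked_total = 0
        self.bypassed_total = 0             # would back rate_limit_bypassed_total

    def allow(self, identity: str) -> bool:
        self.checked_total += 1
        try:
            return self.central_check(identity)
        except (TimeoutError, ConnectionError):
            # Count the bypass before falling back, so it is never silent.
            self.bypassed_total += 1
            return self.local_check(identity)

    def bypass_rate(self) -> float:
        if self.checked_total == 0:
            return 0.0
        return self.bypassed_total / self.checked_total
```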

Fault Line 2: Distributed coordination (5–7 min)

Frame with concrete numbers:

"With 10 gateway instances and a 100 req/min limit, each instance could independently allow 100 — giving the client 1,000 total. Three options:

  • (a) Centralized Redis per-request: accurate, +2ms latency per check, SPOF
  • (b) Local counters with periodic sync: fast (sub-ms), bounded drift
  • (c) Pre-split quotas (100/10 = 10 per instance): no coordination, wastes capacity on cold instances"

Pick a position and quantify: "I'd go with option (b), local counters with periodic sync, for abuse protection. Between syncs, each instance can admit up to one sync interval's worth of the limit without seeing the others. With 10 instances syncing every 5 seconds, worst-case overshoot is 10 × (100/60 × 5) ≈ 83 extra requests per window. That's ~83% overshoot in the absolute worst case, but in practice traffic distributes across instances, so real overshoot is 10–20%. For abuse protection where limits are 1,000+ req/min, that's noise."
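
The arithmetic in that answer is easy to check. The helper below is a hypothetical name that simply restates the formula from the text.

```python
def worst_case_overshoot(instances: int, limit_per_min: float,
                         sync_interval_s: float) -> float:
    """Extra requests admitted if every instance spends a full sync
    interval's worth of the limit before hearing about the others."""
    per_instance = limit_per_min / 60.0 * sync_interval_s
    return instances * per_instance


extra = worst_case_overshoot(instances=10, limit_per_min=100, sync_interval_s=5)
print(f"{extra:.0f} extra requests, {extra / 100:.0%} of the limit")
# prints: 83 extra requests, 83% of the limit
```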

Name a concrete coordination mechanism — don't just say "sync with Redis":

  • Token leasing: Lease N tokens from Redis, spend locally. Reduces Redis QPS by 10–100×. Failure mode: stranded tokens when gateway crashes mid-lease.
  • Key-owner routing: Hash the identity to a consistent gateway owner. Single-writer avoids races. Failure mode: load skew if one identity dominates.
  • Bounded reconciliation: Explicit drift budget ε; force global check when local tokens drop below low-watermark. Failure mode: thundering herd on low-watermark trigger.
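
For key-owner routing, one way to pick a stable owner is rendezvous (highest-random-weight) hashing, sketched below. The text does not name a specific hashing scheme, so this choice, the `owner_for` helper, and the gateway names are assumptions.

```python
import hashlib
from typing import Sequence


def owner_for(identity: str, gateways: Sequence[str]) -> str:
    """Pick the gateway with the highest hash score for this identity."""
    def score(gateway: str) -> int:
        digest = hashlib.sha256(f"{gateway}|{identity}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(gateways, key=score)


gateways = ["gw-a", "gw-b", "gw-c"]
owner = owner_for("api_key_123", gateways)
```

Because the owner is whichever gateway scores highest, removing any other gateway leaves the owner unchanged. The load-skew failure mode is visible here too: one dominant identity still lands entirely on one gateway.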

"For billing/quota where every request has dollar value, I'd switch to centralized Redis. The +2ms latency is acceptable because billing endpoints are lower throughput."

Fault Line 3: Algorithm — why it matters less than you think (3–5 min)

"I'd use token bucket. But the algorithm choice is the least interesting part. Token bucket, sliding window log, sliding window counter — they all work. The real question is where the counter lives, what happens when that store fails, and how you coordinate across instances. I can explain the algorithmic differences if you'd like, but I'd rather spend time on the distributed coordination problem."

This is a power move. It demonstrates you know the algorithms but won't waste time on textbook recitation. If the interviewer insists:

  • Token bucket: smooth, allows bursts up to bucket size, O(1) per check
  • Sliding window counter: approximation between two fixed windows, low memory, ±1% error at boundaries
  • Sliding window log: exact, but O(N) memory per client — doesn't scale for high-volume clients

Then redirect: "The algorithm determines local behavior. The hard problem is distributed coordination — which we just discussed."
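
If the interviewer does push on the sliding window counter, the weighted approximation fits in a few lines. The function and parameter names are illustrative.

```python
def sliding_window_estimate(prev_count: int, curr_count: int,
                            window_s: float, elapsed_in_curr_s: float) -> float:
    """Estimate requests in the trailing window from two fixed-window counters."""
    # Fraction of the previous fixed window still covered by the sliding window.
    prev_weight = (window_s - elapsed_in_curr_s) / window_s
    return prev_count * prev_weight + curr_count


def allow(prev_count: int, curr_count: int, window_s: float,
          elapsed_in_curr_s: float, limit: int) -> bool:
    return sliding_window_estimate(prev_count, curr_count,
                                   window_s, elapsed_in_curr_s) < limit
```

For example, 15 seconds into a 60-second window with 100 requests in the previous window and 20 in the current one, the estimate is 100 × 0.75 + 20 = 95.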

Fault Line 4: Hot keys & thundering herd (3–5 min)

"What happens when a single API key generates 50% of all traffic? That key's counter becomes a hot key in Redis — every gateway instance contends on the same key. The mitigations:

  • (a) Local aggregation: batch increments locally and flush to Redis every 100ms instead of per-request
  • (b) Key sharding: split rl:{api_key}:{window} into rl:{api_key}:{window}:{shard_0..7} and sum on read
  • (c) Deny-cache: if a key is already 10× over limit, reject locally without hitting Redis at all"

The important distinction: "Sharding Redis helps for many different keys across nodes. It does NOT help for a single hot identity that hashes to one shard. Hot key is structurally different from high cardinality — you need local aggregation or key-owner routing, not just more shards."
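
A sketch of the key-sharding mitigation, using a plain dict as a stand-in for Redis. The helper names are hypothetical; the key layout follows the `rl:{api_key}:{window}:{shard_0..7}` pattern above.

```python
import random
from typing import Dict

NUM_SHARDS = 8  # shard_0 .. shard_7, as in the text


def shard_key(api_key: str, window: int, shard: int) -> str:
    return f"rl:{api_key}:{window}:shard_{shard}"


def increment(store: Dict[str, int], api_key: str, window: int) -> None:
    """Spread writes for one identity across NUM_SHARDS counters."""
    key = shard_key(api_key, window, random.randrange(NUM_SHARDS))
    store[key] = store.get(key, 0) + 1


def read_total(store: Dict[str, int], api_key: str, window: int) -> int:
    """Sum every shard on read; this costs NUM_SHARDS lookups per check."""
    return sum(store.get(shard_key(api_key, window, s), 0)
               for s in range(NUM_SHARDS))
```

The write contention drops by roughly the shard count, at the price of a more expensive read, which is one reason to pair it with local aggregation.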

Fault Line 5: Ownership & policy management (3–5 min)

"Who writes the rate limit policies? In my experience, this is where rate limiting actually breaks. The platform team owns the enforcement infrastructure, but product teams own the policies for their endpoints. Without a self-service policy API and a review process, you end up with either: the platform team as a bottleneck for every policy change, or product teams setting limits too high (because they fear blocking users) and the limits being effectively useless."

The Staff answer: "Self-service policy API with guardrails. Product teams can set limits within pre-approved ranges. Changes go through a review pipeline — not a human review, but an automated check that the new limit won't exceed the backend's capacity. Deployment is canary: new limits apply to 5% of traffic for 1 hour before full rollout. Rollback is a one-line config change, not a deploy."
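
The automated check could look something like the sketch below. How capacity is modeled is not specified in the text; multiplying the per-client limit by an expected client count is an assumption, as are all the names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PolicyChange:
    endpoint: str
    limit_per_min: int
    expected_active_clients: int  # assumed input for the capacity estimate


def validate(change: PolicyChange, approved_range: Tuple[int, int],
             backend_capacity_per_min: int) -> List[str]:
    """Return a list of guardrail violations; empty means safe to canary."""
    errors: List[str] = []
    low, high = approved_range
    if not low <= change.limit_per_min <= high:
        errors.append(
            f"limit {change.limit_per_min} outside approved range {low}-{high}")
    # Assumed worst case: every expected client uses its full limit.
    worst_case_load = change.limit_per_min * change.expected_active_clients
    if worst_case_load > backend_capacity_per_min:
        errors.append(
            f"worst-case load {worst_case_load}/min exceeds "
            f"backend capacity {backend_capacity_per_min}/min")
    return errors
```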

Phase 6: Wrap-Up (2–3 minutes)

Synthesize the insight — don't restate the architecture:

"Rate limiting is a policy enforcement problem, not an algorithm problem. The Staff-level challenge is: who absorbs the cost of imperfection? For abuse protection, we bias toward fail-open because blocking legitimate users is worse than letting some abuse through. For billing, we bias toward fail-closed because giving away resources has direct cost. The architecture is the same in both cases — the configuration and failure behavior change."

The organizational closer:

"The harder problem is policy management. The rate limiter is infrastructure — it's a solved technical problem. The unsolved problem is getting 15 product teams to agree on rate limit policies, keep them updated, and actually respond when limits are hit. That's an organizational design problem, not a systems design problem."

Common Timing Mistakes

Level Calibration
Mistake | L5 Does This | L6 Does This Instead
10 min on requirements | Lists every functional requirement, asks about each edge case | States intent in 1 min, picks abuse protection, moves on
15 min on algorithm | Deep dive into Token Bucket vs Sliding Window math | "Token bucket, here's why, moving on to what actually matters"
No failure discussion | Waits for interviewer to ask "what if Redis goes down?" | Volunteers fail-open/fail-closed proactively in the architecture phase
No ownership story | Focuses purely on implementation | Names who owns policies, who's on-call, how config changes deploy
Spreads thin | Touches 6 topics at surface level | Goes deep on 2–3 fault lines, shows quantitative reasoning
No numbers | "It should be fast" | "Sub-5ms overhead, bounded drift of ~83 requests with 10 instances syncing every 5s"

1 The Staff Lens

1.1 Why This Problem Exists in Staff Interviews

Rate limiting separates L6 from L5 because it forces you to reason about organizational tradeoffs — who absorbs the cost of imperfection, who owns the policy, who gets paged at 3 AM. Five behaviors below are what interviewers listen for.

1.2 The L5 vs L6 Contrast — Visual

[Diagram: L5 vs L6 contrast]

1.3 The Staff Question That Cuts Through Everything

This single question reveals whether a candidate has operated rate limiters in production or only designed them theoretically.

2 Problem Framing & Intent

2.1 The Three Intents — Explained

Abuse Protection → fail-open, speed-first

  • Constraint: latency overhead must be sub-5ms; enforcement must not become a kill switch
  • Algorithm: token bucket, local-first with async coordination
  • Failure mode: fail-open with conservative local caps and bypass-rate alerting
  • Who pays for imperfection: security/product team (explaining why some requests got through during Redis outage)

Billing/Quota → fail-closed, accuracy-first

  • Constraint: every over-admission has dollar cost; audit trail required
  • Algorithm: token bucket with centralized Redis; stricter coordination
  • Failure mode: fail-closed (503 or 429 with "service temporarily unavailable") + immediate alert
  • Who pays for imperfection: finance/legal (over-admission means giving away paid resources)

Multi-Tenant Fairness → isolation-first

  • Constraint: one tenant's traffic burst must not starve other tenants
  • Algorithm: hierarchical token buckets — global cap → tenant cap → sub-tenant cap
  • Failure mode: per-tenant circuit breaking; global capacity preserved
  • Who pays for imperfection: product team (explaining why enterprise customer was throttled)
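
A simplified sketch of the hierarchical check for the fairness intent: a request is admitted only if every level on its path still has a token, and a rejection charges no level. Refill is omitted for brevity, and the `Bucket` class and `allow` helper are illustrative names.

```python
from typing import Sequence


class Bucket:
    def __init__(self, name: str, capacity: int) -> None:
        self.name = name
        self.tokens = capacity  # refill omitted to keep the sketch short


def allow(path: Sequence[Bucket]) -> bool:
    """`path` runs from the global cap down to the most specific cap."""
    if any(bucket.tokens < 1 for bucket in path):
        return False            # reject without charging any level
    for bucket in path:
        bucket.tokens -= 1
    return True


global_cap = Bucket("global", 3)
tenant_a = Bucket("tenant-a", 2)
tenant_b = Bucket("tenant-b", 2)
```

Checking all levels before charging any of them matters: otherwise a tenant that is over its own cap would still drain the global bucket and starve its neighbors.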

2.2 What the Interviewer Leaves Underspecified

Interviewers deliberately omit:

  • Auth vs unauth traffic
  • Client identity strength
  • Hard vs soft limits
  • Multi-region behavior
  • Regulatory constraints

Staff engineers surface these. Senior engineers assume them away.

2.3 Precise Terminology

Rate-limiter interviews are ambiguous about where enforcement runs. Use precise terms:

Term | What It Means | Identity Context
API Gateway / Ingress | First programmable hop inside our infrastructure | API key, auth token, IP
CDN/WAF (true edge) | Cloudflare/Akamai/AWS WAF, before our gateway | IP, ASN, geo only
Service Mesh / Sidecar | Internal rate limiting between services | Service identity
Application Middleware | Per-service enforcement | Full request context

3 The Five Fault Lines

3.1 Fault Line 1: Protection vs Correctness

The tension: Strict correctness requires coordination that can become the bottleneck. Protection-first accepts drift but keeps the system alive.

Who Pays Analysis
Choice | What Works | What Breaks | Who Pays
Prioritize Correctness | Exact limits enforced | System collapse under coordination load | Infra team (outage)
Prioritize Protection | System survives | Some over-admission | Security/Product (explaining drift)

L6 answer: "For abuse protection, I choose protection-first with bounded drift. I'd rather over-admit 10% than add 10ms to every request or cause a self-inflicted outage when Redis degrades. For billing/quota, I flip to correctness-first — the over-admission has direct cost. The key insight: abuse protection and billing cannot share the same rate limiter configuration because they have incompatible failure modes."

L7 answer: Reframes as risk governance — "Which layer enforces what?" CDN/WAF for coarse abuse, gateway for identity-aware limits, app for business invariants. Defines who signs off on fail-open/closed and what blast radius is acceptable.

3.2 Fault Line 2: Centralized vs Distributed State

The tension: Central state is easy to reason about but creates a dependency in the hot path. Distributed state is resilient but requires explicit coordination mechanisms.

Who Pays Analysis
Choice | What Works | What Breaks | Who Pays
Centralized (Redis per-request) | Simple, accurate, observable | SPOF, latency tax (+5–10ms) | Infra (reliability burden)
Distributed (local + async sync) | Resilient, fast, scalable | Accuracy loss (bounded drift) | Product (explaining over-admission)
Hybrid (leasing / routing) | Reduced Redis QPS, high accuracy | Complexity (lease sizing, reclaim) | Engineering (maintenance burden)

L6 answer: "I pick token leasing for this design. Each gateway instance leases a chunk of tokens from Redis (say, 50 tokens for a 1,000 req/min limit with 10 gateways), spends them locally with zero coordination per request, then renews when the lease runs low. Redis QPS drops from one call per request to approximately 2 RPCs per gateway per 5 seconds, a 10–100× reduction. Failure mode: if a gateway crashes mid-lease, those leased tokens are stranded for the lease duration (5 seconds). For abuse protection, 5 seconds of stranded capacity is acceptable."
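
Token leasing can be sketched as below. `LeaseStore` is an in-memory stand-in for Redis, and both class names are hypothetical. As a simplification, the gateway renews only when its local tokens hit zero rather than at a low-watermark, and lease expiry and window resets are left out.

```python
class LeaseStore:
    """In-memory stand-in for the central store (Redis in the text)."""

    def __init__(self, tokens_per_window: int) -> None:
        self.available = tokens_per_window
        self.calls = 0  # counts round-trips, to show the QPS reduction

    def lease(self, want: int) -> int:
        self.calls += 1
        granted = min(want, self.available)
        self.available -= granted
        return granted


class LeasingGateway:
    def __init__(self, store: LeaseStore, lease_size: int) -> None:
        self.store = store
        self.lease_size = lease_size
        self.local_tokens = 0
        self.exhausted = False  # stop calling the store once it grants nothing

    def allow(self) -> bool:
        if self.local_tokens <= 0 and not self.exhausted:
            granted = self.store.lease(self.lease_size)
            self.exhausted = granted == 0
            self.local_tokens += granted
        if self.local_tokens > 0:
            self.local_tokens -= 1  # spent locally, no coordination
            return True
        return False
```

With a lease size of 50, admitting 100 requests costs two store calls instead of 100. The stranded-token failure mode is also visible: whatever sits in `local_tokens` when a gateway dies is lost until the lease would have expired.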

L7 answer: "Do we even need custom distributed coordination? Evaluate managed gateway throttling (Envoy RLS, AWS API Gateway throttling, Cloudflare Rate Limiting) before building custom. If custom is needed, select the coordination mechanism based on operational cost: token leasing for abuse protection at scale, key-owner routing for authenticated billing paths."

3.3 Fault Line 3: Latency vs Accuracy

The tension: Every millisecond added to the critical request path has compound effects at scale. At 1M req/sec, a 5ms Redis round-trip is the difference between a healthy gateway and a latency amplifier.

Choice | Latency Impact | Accuracy | When Appropriate
Redis per-request | +5–10ms p99 | High | Low QPS, billing-critical paths
Local + async sync | ~0ms added | Medium (bounded drift) | High QPS, abuse protection
Hybrid (lease/route) | +1–2ms occasional (lease renewal) | Medium-High | Most production systems

L6 answer: "The latency tax is the reason the local-first design exists. At 100K req/sec, every 1ms of rate-limiter overhead is 100 additional seconds of cumulative delay per second of traffic. I keep the hot path at ~0ms by checking local buckets first. Redis is for coordination, not for enforcement. The 5–10ms overhead only occurs on lease renewals (every 5–10 seconds per gateway instance) or on the first request from a new identity."

3.4 Fault Line 4: Fail-Open vs Fail-Closed

The tension: When the rate limiter's central store fails, you must choose between protecting availability (fail-open, risk abuse) and protecting resources (fail-closed, risk self-inflicted outage).

Context | Recommended | Why
Ingress abuse protection | Fail-open | Limiter shouldn't be a kill switch
Billing/quota enforcement | Fail-closed | Cannot give away resources
Internal service protection | Depends | Cascade analysis required

Guardrails for fail-open:

  • Aggressive Redis timeout (5–10ms max — fail fast, don't let threads pile up)
  • Conservative local fallback caps (set per-instance limit to 2× normal, not unlimited)
  • Bypass-rate alerting: bypass_rate > 1% for 5 min → page on-call
  • Circuit breaker on backend stress signals: if downstream error rate rises while in fail-open, tighten caps
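
Two of these guardrails reduce to small decision functions, sketched here. The 1% for 5 minutes threshold and the 2× normal cap come from the list above; the 5% stress threshold and the halving of the cap under stress are assumptions chosen only for illustration.

```python
from typing import Sequence


def should_page(bypass_rate_by_minute: Sequence[float],
                threshold: float = 0.01, sustained_minutes: int = 5) -> bool:
    """True when the bypass rate exceeded `threshold` in each of the last N minutes."""
    recent = list(bypass_rate_by_minute)[-sustained_minutes:]
    return len(recent) == sustained_minutes and all(r > threshold for r in recent)


def fallback_cap(normal_limit: int, backend_error_rate: float,
                 stress_threshold: float = 0.05) -> int:
    """Per-instance cap to apply while in fail-open mode."""
    if backend_error_rate > stress_threshold:
        return max(1, normal_limit // 2)  # assumed tightening factor
    return 2 * normal_limit               # the 2x-normal cap from the list
```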

L6 answer: "I make the fail-open/fail-closed decision before writing a single line of code, and I tie it to intent. For this abuse protection limiter: fail-open. The rate limiter should never become a kill switch for legitimate users. I add three guardrails: aggressive Redis timeout (5ms), conservative local fallback caps, and bypass-rate alerting. If the backend shows stress while in fail-open mode, the fallback caps tighten automatically. For billing: fail-closed with immediate on-call escalation — we can't give away paid resources."

L7 answer: Defines a governance model — who can flip fail-open/closed (change management approval required), what the emergency procedure is, what the kill-switch scope is (per-endpoint, per-tenant, or global), and what post-incident analysis is required.

3.5 Fault Line 5: Infra Ownership vs Team Autonomy

The tension: A central rate limiting service creates consistency but becomes a bottleneck. Per-service SDKs give teams flexibility but cause library hell and config drift.

Model | Who Owns | Pros | Cons
Central service | Platform team | Consistency | Bottleneck, SPOF
Gateway / Sidecar | Platform team | Decoupled, consistent | Requires gateway/mesh investment
SDK / Library | Each team | Flexibility | Library hell, version drift

L6 answer: "Enforce at the gateway/sidecar to avoid per-service SDK drift. Within 6 months of distributing a rate limiting SDK, I'd have 3–4 incompatible versions in production — one with a security bug the team hasn't updated, one on a deprecated version. A policy change requires coordinating 15 teams instead of one config update. The gateway enforcement model means: platform team owns infrastructure, product teams own policy intent via a self-service API. Policy changes deploy to the gateway — no service redeploy required."

L7 answer: Defines governance: who owns policy definition (product team, within platform-defined bounds), who approves exceptions (architecture review for limits above platform maximums), how staged rollouts work (shadow → warn → enforce), and how you prevent the platform team from becoming a bottleneck.

4 Failure Modes & Operational Reality

4.1 Store Failures — Full Timeline

Scenario A: Redis becomes slow (most common)

t=0:     Redis p99 jumps from 2ms to 100ms
t=0–30s: Gateway worker threads pile up waiting for Redis responses
t=30s:   Gateway request queues fill — new requests start queuing
t=1min:  Gateway starts returning 503s to ALL traffic
t=2min:  "The rate limiter" has become a global outage
         (NOT a Redis outage — a rate limiter outage)

What breaks first: P99 latency at the gateway spikes, threads pile up, the rate limiter becomes a latency amplifier for every request regardless of rate limit status.

Bad reaction: "Increase Redis timeouts" — makes it worse, extends the window before circuit break fires.

Staff reaction: An aggressive Redis timeout (5ms) trips the circuit breaker within about 30 seconds → the gateway switches to local-only mode → the bypass-rate metric starts climbing → on-call is paged once bypass_rate stays above 1% for 5 minutes. The gateway never queues on Redis.

Scenario B: Redis is completely down

| Strategy | Effect | Who It Hurts |
| --- | --- | --- |
| Fail-closed | Protect backend; risk total outage | All users |
| Fail-open with local caps | Preserve availability; bounded abuse risk | Security team (explaining gap) |

Staff choice for abuse protection: Fail-open with aggressive local limits, bypass-rate alerting, and circuit breaker on backend stress. The rate limiter must not become a global kill switch.
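
A minimal sketch of the fail-open path described above, assuming a hypothetical `remote_check` callable that raises `TimeoutError` when the Redis call exceeds its timeout. The class name, the failure threshold, and the cooldown are illustrative assumptions, and the local cap here is per gateway instance rather than per key.

```python
import time
from typing import Callable, Optional


class FailOpenLimiter:
    """Hypothetical wrapper: consult the remote store, fall back to a local cap on failure."""

    def __init__(self, remote_check: Callable[[str], bool], local_cap_per_sec: int,
                 failure_threshold: int = 5, cooldown_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self.remote_check = remote_check          # assumed to raise TimeoutError when slow
        self.local_cap_per_sec = local_cap_per_sec
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.consecutive_failures = 0
        self.open_until: Optional[float] = None   # while set and in the future, skip Redis
        self.bypassed_total = 0                   # feeds the bypass-rate alert
        self._window_start = clock()
        self._window_count = 0

    def _local_allow(self) -> bool:
        # Conservative fixed one-second cap used only while Redis is bypassed.
        now = self.clock()
        if now - self._window_start >= 1.0:
            self._window_start, self._window_count = now, 0
        self._window_count += 1
        return self._window_count <= self.local_cap_per_sec

    def allow(self, key: str) -> bool:
        now = self.clock()
        if self.open_until is not None and now < self.open_until:
            self.bypassed_total += 1
            return self._local_allow()
        try:
            allowed = self.remote_check(key)      # also serves as the probe after cooldown
            self.consecutive_failures = 0
            self.open_until = None
            return allowed
        except TimeoutError:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.open_until = now + self.cooldown_s
            self.bypassed_total += 1
            return self._local_allow()
```

Once the breaker is open, requests never wait on Redis; they are either admitted under the local cap or rejected, and every such decision increments `bypassed_total`.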

4.2 Hot Key & Amplification — Why "Just Shard It" Doesn't Work


The critical insight: sharding helps for many different keys. It does NOT help for one single identity that generates 100K req/s — that identity always maps to the same shard regardless of how many shards exist.

Scenario A: Leaked API key used by botnet

  • Symptom: one API key drives 50%+ of traffic; single shard CPU spikes
  • Mitigation: (1) local deny-cache: reject locally with TTL 30–60s without hitting Redis, (2) revoke/rotate the key, (3) add CDN/WAF edge rules if the traffic shows stable IP/ASN signals
  • Tradeoff: fast containment vs false positives if key is shared with legitimate partner

Scenario B: Legitimate tenant burst (partner batch job)

  • Symptom: paying tenant looks like "hot key" but isn't malicious
  • Mitigation: token leasing with adaptive lease sizing for higher-tier tenants; per-tenant reservation pool
  • Tradeoff: fairness vs utilization

Scenario C: IP-based identity collapses (NAT/corporate proxy)

  • Symptom: one IP = thousands of real users → false positives
  • Mitigation: IP as coarse outer limiter only; shift to stronger identity (API key / user_id) for authenticated paths
  • Tradeoff: better UX vs implementation complexity

Mitigations that actually work for hot keys:

  1. Local aggregation: batch increments every 100ms instead of per-request; Redis sees at most 10 writes/s per gateway per key, a 100× reduction for a key arriving at 1,000 req/s on that gateway
  2. Key-owner routing: single gateway instance owns enforcement for hot identity — no multi-writer races
  3. Deny-cache: if identity is already 10× over limit, reject locally without hitting Redis at all
  4. Key sharding + sum-on-read: split rl:{key}:{window} into rl:{key}:{window}:{shard_0..7} and sum at read time — distributes write load across shards
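
As an illustration of mitigation 4, here is a small sketch of a sharded counter with sum-on-read. A plain dict stands in for Redis, and the helper names (`shard_key`, `incr`, `read_total`) are hypothetical.

```python
import random

NUM_SHARDS = 8
store = {}   # stands in for Redis; each sub-key could live on a different shard


def shard_key(key: str, window: int, shard: int) -> str:
    # Literal key name; the braces in rl:{key}:{window} above are placeholders.
    return f"rl:{key}:{window}:{shard}"


def incr(key: str, window: int, amount: int = 1) -> None:
    # Writes pick a random sub-key so no single counter absorbs every increment.
    k = shard_key(key, window, random.randrange(NUM_SHARDS))
    store[k] = store.get(k, 0) + amount


def read_total(key: str, window: int) -> int:
    # Reads pay for the write fan-out: one lookup per sub-key, summed.
    return sum(store.get(shard_key(key, window, s), 0) for s in range(NUM_SHARDS))
```

One caveat if this is mapped onto Redis Cluster: a literal `{...}` hash tag in the key name would pin all sub-keys to the same slot, which defeats the purpose, so the sub-key names should not share one.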

4.3 Data Integrity Failures

Clock Skew: Token refill depends on wall time. Drift causes over-refill or under-refill across nodes.

  • Mitigations: cap refill deltas (never trust a >10s time jump), use monotonic clocks, use Redis server time inside Lua scripts
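  • Staff note: redis.call("TIME") inside the Lua script uses Redis server time, reducing clock-skew risk. It adds ~0.1ms overhead but is worth it for billing paths. On Redis versions before 5.0, the script must call redis.replicate_commands() before it can write after calling TIME.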
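
The "cap refill deltas" mitigation can be sketched as a pure function. This is an illustration under assumptions: the 10-second cap comes from the bullet above, while the function name and signature are hypothetical.

```python
from typing import Tuple

MAX_REFILL_DELTA_S = 10.0   # never trust a time jump larger than this


def refill(tokens: float, capacity: float, rate_per_s: float,
           last_ts: float, now_ts: float) -> Tuple[float, float]:
    """Return (new_tokens, new_last_ts) with the elapsed time clamped to a sane range."""
    elapsed = now_ts - last_ts
    if elapsed < 0:
        elapsed = 0.0                       # clock moved backwards: refill nothing
    elapsed = min(elapsed, MAX_REFILL_DELTA_S)
    new_tokens = min(capacity, tokens + elapsed * rate_per_s)
    # Keep the stored timestamp from going backwards, so a later
    # forward correction does not double-count the same interval.
    return new_tokens, max(now_ts, last_ts)
```

The same clamping logic applies whether the timestamps come from a monotonic clock on the gateway or from Redis server time inside a script.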

Script Bugs: Atomic Lua operations silently wrong. Hardest failure mode to detect.

  • Detection: integration tests covering boundary conditions, audit logging with remaining_tokens sampled 1%
  • Recovery: script rollback (keep previous version deployable), counter reset procedure

4.4 Operational Reality Matrix

| Failure | Loud/Silent | User Impact | Detection Time | Prevention |
| --- | --- | --- | --- | --- |
| Redis down | Loud | Immediate if fail-closed | Seconds | Aggressive timeout + bypass alert |
| Redis slow | Medium | Latency spike | Minutes | Timeout circuit break |
| Silent fail-open | Silent | Invisible until attack | Hours | bypass_rate > 1% alert |
| Hot key | Medium | Subset of users (one shard) | Minutes | Per-shard CPU monitoring |
| Clock skew | Silent | Gradual drift | Minutes–hours | Server-side time in Lua |
| Script bug | Silent | Varies | Hours–days | Integration tests + audit log |
| Policy misconfiguration | Medium | Wrong users throttled | Minutes | Shadow mode + canary rollout |

5 Evaluation Rubric

5.1 Level-Based Signals

Level Calibration
| Dimension | Senior (L5) | Staff (L6) | Principal (L7) |
| --- | --- | --- | --- |
| Semantics | Token bucket + Redis | Defines contract precisely; separates abuse vs billing semantics | Standardizes semantics org-wide: policy language, versioning, client-facing vs internal |
| Placement | API Gateway + central store | Chooses layers intentionally for blast radius; names CDN/WAF vs gateway vs mesh distinction | Sets org strategy; coarse abuse → CDN/WAF, identity-aware quota → gateway, dependency protection → mesh |
| Coordination | "Central Redis per request" | Picks one concrete mechanism (leasing/routing/reconciliation); explains failure behavior | Chooses via TCO + operational risk; prefers managed (Envoy RLS, WAF throttling) unless gap demands custom |
| Hot keys | "Redis cluster + sharding" | Explains why hot key ≠ many keys; proposes mitigations with scenario tradeoffs | Treats as incident + governance problem: key rotation, capacity planning, tenant isolation |
| Failure modes | "Redis replicas / HA" | Timeline-driven; explicit fail-open/closed by intent; bypass-rate observability contract | Governance + blast-radius controls: kill switches, change management, incident drills |
| Latency | "Low latency required" | Quantifies tax; sets p99 budget; keeps enforcement off critical path | Connects latency to business SLO + cost model; chooses where to pay the tax per path |
| Ownership | Implementation focus | Avoids SDK hell; defines policy lifecycle and rollout | Defines org boundaries: platform vs security vs product, staged rollout, long-term simplification |

5.2 Strong Hire Signals

| Signal | What It Sounds Like |
| --- | --- |
| Intent before architecture | "What are we protecting against? Abuse, billing, or fairness? These are incompatible systems." |
| Latency trap named | "Redis per-request adds 5–10ms to every request. That's unacceptable at 100K req/s." |
| Failure mode quantified | "Worst-case over-admission with 10 gateways and 50-token leases: ~500 extra tokens (50% overshoot worst case, 10–20% in practice)." |
| Silent fail-open addressed | "bypass_rate > 1% for 5 minutes pages on-call. The metric is mandatory, not optional." |
| Policy lifecycle defined | "Shadow → warn → hard-enforce. Policy changes never skip canary. Who signs off? What metrics must be green?" |

5.3 Lean No-Hire Signals

| Signal | Why It Misses the Bar |
| --- | --- |
| Algorithm fixation | 15 minutes on Token Bucket vs Sliding Window without tradeoffs |
| Over-engineering | "We need multi-region active-active from day one" |
| Ignoring operations | No mention of monitoring, alerting, failure handling, policy management |
| Missing intent | Designs without clarifying what we're protecting against |
| "Just add replicas" | Replicas improve availability; they don't define degraded-mode behavior |

5.4 Common False Positives

  • Knows Redis deeply: Deep Redis knowledge ≠ good system design. Implementation detail fluency without tradeoff reasoning.
  • Draws complex diagrams: Complexity is not a Staff signal. Can they explain the organizational cost of each component?
  • Mentions many algorithms: Breadth without depth is Senior, not Staff.
  • "We'll use Envoy RLS": Naming a managed solution without explaining when custom is needed and when managed is sufficient.

6 Interview Flow & Pivots

6.1 Typical 45-Minute Shape

| Phase | Time | Goal |
| --- | --- | --- |
| Framing | 0–3 min | Name intent, commit to abuse protection, state constraints |
| Entities + API | 3–5 min | Three nouns, hot path vs cold path, response headers |
| High-level design | 5–12 min | Gateway enforcement, local-first + Redis coordination, fail-open |
| Transition | 12 min | Offer three fault lines, let interviewer choose |
| Deep dives | 12–40 min | Failure modes → distributed coordination → policy management |
| Wrap-up | 40–45 min | Organizational insight, who's on-call, what the runbook says |

6.2 Reading the Interviewer

| Interviewer Signal | What They Care About | Where to Go Deep |
| --- | --- | --- |
| Asks about Redis failure | Operational maturity | Fail-open vs fail-closed; bypass-rate observability |
| Asks about accuracy | Distributed systems depth | Coordination mechanisms; drift bounds |
| Asks about multi-region | Scale and architecture | Regional quotas; async reconciliation; GDPR on counter data |
| Asks "who decides limits?" | Organizational design | Policy management; self-service API; review process |
| Asks about DDoS | Security depth | CDN/WAF vs gateway vs application layers |
| Pushes back on your design | Wants you to defend or adapt | State reasoning, acknowledge alternatives, commit to tradeoff |

6.3 What to Deliberately Skip

Level Calibration
| Topic | Why L5 Goes Here | What L6 Says Instead |
| --- | --- | --- |
| Algorithm deep dive | It's in every textbook, feels safe | "Token bucket. The algorithm isn't the hard part; coordination is." |
| Database schema design | Feels productive to draw tables | "Counters live in Redis, policies in PostgreSQL. Schema is trivial." |
| HTTP status codes | Easy to enumerate | "429 with Retry-After header. Standard. Moving on." |
| Rate limit dashboard UI | Seems complete | "Admin UI is a CRUD app. Not interesting for this interview." |
| Exact sliding window math | Textbook material | "Sliding window approximation, ±1% error at boundaries. Acceptable." |

6.4 Follow-Up Questions to Expect

  1. "How do you handle clock skew across gateway instances?"
  2. "What if a single user generates 90% of traffic?"
  3. "How do you test rate limiting logic in production?"
  4. "What metrics would you monitor to detect a silent fail-open?"
  5. "How do you handle a global rate limit across multiple regions?"
  6. "What's your failure budget for this service?"

7 Active Drills

Drill 1: The Opening (Intent + Constraints)

Staff Answer

"Before I draw anything — what are we protecting against? Abuse protection, billing/quota enforcement, and multi-tenant fairness are three fundamentally different systems with incompatible failure modes. Designing them as one leads to a hybrid that satisfies none.

I'll assume ingress abuse protection: the constraint is sub-5ms overhead, fail-open behavior (the limiter shouldn't become a kill switch), and eventual consistency with a bounded drift. I'll walk through: placement → algorithm semantics → coordination mechanism → failure modes → observability → policy management."

Why this is L6:

  • Names intent before proposing architecture — prevents designing a hybrid that satisfies no intent
  • States failure mode preference (fail-open) proactively — doesn't wait for the interviewer to ask
  • Frames the outline as a decision sequence, not a component list — each step narrows the design space
❌ Common L5 Trap

"I'll design a rate limiter using token bucket with Redis. Each request checks Redis with a Lua script for atomicity..."

Why this misses: Picks an algorithm before defining the problem. The interviewer asks: "What are we protecting against?" Now the candidate has to backtrack from "Redis per request" when they realize abuse protection doesn't need that latency overhead. The L6 answer establishes intent first — the architecture follows from intent, not the other way around.

Drill 2: Token Bucket Semantics

Staff Answer

"'100 requests per minute' hides four implementation decisions:

  1. Is burst allowed? Token bucket says yes — you can burst up to bucket capacity (e.g., 200) if tokens have accumulated. Fixed window says no burst, but allows boundary exploitation (100 at t=59s, 100 at t=61s = 200 in 2 seconds).
  2. What's the window? Fixed (resets at :00, :01) or rolling (always looking back 60 seconds)?
  3. How much boundary skew is acceptable?
  4. What happens at the exact threshold — is the 100th request allowed or rejected?

For abuse protection: token bucket with capacity=200 (2x burst allowance) and refill_rate=100/60 ≈ 1.67 tokens per second. Users get smooth enforcement with burst tolerance. The 429 response includes Retry-After with the time until the next token is available, rounded up to whole seconds because the header's delay form is an integer number of seconds. The client never needs to know the window boundaries."
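
A minimal single-process sketch of the bucket described in this answer (capacity 200, refill 100/60 tokens per second), including the Retry-After calculation. The class and method names are hypothetical, and time is passed in explicitly so the behavior is easy to check.

```python
import math
from typing import Tuple


class TokenBucket:
    def __init__(self, capacity: float, refill_per_s: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.tokens = capacity          # start full so a new client may burst
        self.last = now

    def try_acquire(self, now: float, cost: float = 1.0) -> Tuple[bool, int]:
        """Return (allowed, retry_after_seconds); retry_after is 0 when allowed."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True, 0
        deficit = cost - self.tokens
        # Round up: the delay form of Retry-After is a whole number of seconds.
        return False, math.ceil(deficit / self.refill_per_s)


bucket = TokenBucket(capacity=200, refill_per_s=100 / 60)
```

With these parameters a client can spend 200 tokens at once, after which the sustained rate settles at 100 requests per minute.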

Why this is L6:

  • Distinguishes burst tolerance from steady-state rate — operationally different for users
  • Connects algorithm choice back to intent (abuse protection → burst allowed)
  • Specifies the client contract (429 + Retry-After) as first-class, not an afterthought
❌ Common L5 Trap

"100 requests per minute means the user can make 100 API calls per minute. We track this with a counter that resets every minute."

Why this misses: Fixed window semantics with unexamined boundary behavior. The interviewer asks: "What happens at t=59s if a user sends 100 requests, then at t=61s sends 100 more?" The answer: "Both are allowed: 200 requests in 2 seconds." This is the fixed window boundary exploit that allows 2× burst at every window reset. Token bucket replaces that accidental burst with an explicit, configured one: the burst is bounded by bucket capacity and has to be earned back through refill instead of resetting for free at each boundary. The L5 candidate hasn't considered the difference.
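
The boundary exploit is easy to reproduce. The sketch below is a hypothetical fixed-window counter (function name and defaults are illustrative) fed the exact traffic pattern from the interviewer's question.

```python
from collections import defaultdict


def fixed_window_admitted(timestamps, limit=100, window_s=60):
    """Return the timestamps admitted by a naive fixed-window counter."""
    counts = defaultdict(int)
    admitted = []
    for t in timestamps:
        window = int(t // window_s)      # window 0 covers [0, 60), window 1 covers [60, 120)
        if counts[window] < limit:
            counts[window] += 1
            admitted.append(t)
    return admitted


burst = [59.0] * 100 + [61.0] * 100      # 100 requests just before and just after the reset
admitted = fixed_window_admitted(burst)
# Every request in `burst` is admitted, although all of them arrive within 2 seconds.
```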

Drill 3: "Hybrid Local + Redis" — Make It Concrete

Staff Answer

"I'll use token leasing. Here's the request timeline:

  1. Gateway starts up → leases 50 tokens for key api_key_abc from Redis: LEASE api_key_abc 50 ttl=5s (LEASE is shorthand for an atomic lease script, not a built-in Redis command)
  2. Redis responds: granted, 50 tokens, expires in 5 seconds
  3. Requests arrive → gateway decrements local counter (0ms per request, no network hop)
  4. After 45 requests, local counter drops to 5 (below low-watermark threshold)
  5. Gateway proactively renews: LEASE api_key_abc 50 ttl=5s
  6. Redis responds: granted, accounting for all 10 gateways' outstanding leases against the global limit

The failure mode I'm explicitly accepting: if a gateway crashes mid-lease, those leased tokens are stranded for 5 seconds. For a 1,000 req/min limit across 10 gateways with 50-token leases: worst-case stranded capacity = 50 tokens for 5 seconds. Acceptable for abuse protection."
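
The lease timeline can be sketched in a few lines. This is an in-memory illustration under assumptions: `LeaseStore` stands in for the Redis-side lease script, `Gateway` for one gateway instance, both names are hypothetical, and lease TTLs and window resets are left out for brevity.

```python
class LeaseStore:
    """Stands in for the atomic Redis lease script; grants tokens from a global budget."""

    def __init__(self, global_limit: int):
        self.remaining = global_limit

    def lease(self, requested: int) -> int:
        granted = min(requested, self.remaining)
        self.remaining -= granted
        return granted


class Gateway:
    def __init__(self, store: LeaseStore, lease_size: int = 50, low_watermark: int = 5):
        self.store = store
        self.lease_size = lease_size
        self.low_watermark = low_watermark
        self.local_tokens = store.lease(lease_size)   # step 1: lease at startup
        self.store_calls = 1
        self.exhausted = False                        # set once the store grants nothing

    def allow(self) -> bool:
        # Steps 4 and 5: renew proactively when the local counter hits the watermark.
        if self.local_tokens <= self.low_watermark and not self.exhausted:
            granted = self.store.lease(self.lease_size)
            self.store_calls += 1
            self.exhausted = granted == 0
            self.local_tokens += granted
        # Step 3: the per-request decision is a local decrement, no network hop.
        if self.local_tokens > 0:
            self.local_tokens -= 1
            return True
        return False


store = LeaseStore(global_limit=1000)
gateways = [Gateway(store) for _ in range(10)]
admitted = sum(gateways[i % 10].allow() for i in range(1500))
```

In this sketch the tokens a crashed gateway would strand are simply its unspent `local_tokens`, which is at most one lease plus the watermark.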

Why this is L6:

  • Names one concrete coordination mechanism — not "sync with Redis"
  • Walks through the request timeline to prove the atomic boundary is sound
  • Explicitly names and accepts the failure mode (stranded tokens on crash)
❌ Common L5 Trap

"We sync the local counter with Redis every 500 milliseconds. Each gateway reads from Redis periodically to stay in sync."

Why this misses: "Sync with Redis" is not a mechanism — it's a frequency. The interviewer asks: "How exactly does the sync work? Do all gateways read the same Redis key? What if two gateways both read 80 tokens and both try to spend them before the next sync?" This exposes the race condition: periodic sync without a named coordination mechanism means multiple gateways can each independently allow 100 requests during the sync window, collectively over-admitting by N×. Token leasing bounds the over-admission explicitly.

Drill 4: Redis Is Slow/Down

Staff Answer

"Two things happen simultaneously:

First: the circuit breaker fires. I've configured an aggressive Redis timeout of 5ms. With Redis p99 at 200ms, a large share of calls exceed that timeout; once 50% of calls in the breaker's window time out, the circuit opens, the gateway stops sending requests to Redis, and it switches to local-only mode. This happens within 30 seconds of Redis degrading.

Second: the bypass-rate metric fires. rate_limit_bypassed_total starts incrementing. When bypass_rate exceeds 1% of requests for 5 minutes, PagerDuty pages the on-call engineer. The runbook says: (1) check Redis cluster health, (2) identify the cause (network, memory, CPU), (3) if Redis is recovering, monitor — local fallback is handling the gap. (4) If Redis is down and not recovering, check if backend shows stress — if yes, tighten local fallback caps.

The thing I'm NOT doing: increasing the Redis timeout to 200ms. That's the wrong reaction — it extends the window where threads pile up waiting for slow Redis responses and turns a Redis slowdown into a gateway outage."

Why this is L6:

  • Uses aggressive timeout as a circuit break — not "wait for Redis to recover"
  • Names the specific metric and alert threshold for bypass detection
  • Describes the runbook steps, not just the technical response
  • Explicitly says what NOT to do and why
❌ Common L5 Trap

"Redis p99 at 200ms means our rate limit checks are adding 200ms to requests. We should increase Redis capacity or add replicas to bring the latency down."

Why this misses: Treats Redis slowness as a capacity problem to solve, not a failure mode to handle. The interviewer asks: "While you're scaling Redis, what's happening to your gateway for the next 5 minutes?" Answer: gateway threads are blocking for 200ms on every Redis call. With a concurrency cap of 100 requests per instance and 200ms Redis calls, each instance completes at most 100 / 0.2s = 500 req/s; anything above that queues, and the rate limiter becomes a latency amplifier. The circuit break + local fallback prevents this cascade; adding Redis capacity doesn't.
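
The saturation argument is Little's Law applied to gateway workers. A small sketch, assuming each in-flight request holds one of a fixed number of worker slots for the duration of the Redis call (the function name is hypothetical):

```python
def max_throughput_rps(concurrency: int, latency_s: float) -> float:
    """Little's Law: sustained throughput = concurrency / time each request holds a slot."""
    return concurrency / latency_s


slow_redis = max_throughput_rps(100, 0.200)    # 100 slots, each blocked for 200ms
with_timeout = max_throughput_rps(100, 0.005)  # same slots, but each call is cut off at 5ms
```

Under these assumptions the 5ms timeout caps the time a slot can be held, so a single instance keeps roughly 40 times more headroom than it has while blocking for 200ms.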

Drill 5: Hot Key — One Identity at 90% of Traffic

Staff Answer

"The critical distinction: a hot key is structurally different from high traffic volume. Redis sharding distributes many different keys across nodes. It does nothing for one identity that always hashes to the same shard.

What breaks first: the Redis shard handling rl:abusive_api_key:* saturates — CPU and network on that shard spike. This causes latency spikes for all OTHER users whose keys hash to the same shard, not just the abusive key. Collateral damage.

The mitigations in priority order:

  1. Local deny-cache (immediate): Once the key is identified as over-limit, cache the deny decision locally with a 30–60 second TTL. New requests from this key get rejected at the local bucket without touching Redis at all. Redis QPS from this key drops to near-zero.

  2. Key-owner routing (for sustained hot keys): Route all requests for hot identities to a consistent gateway instance. Single-writer eliminates the Redis contention for that key. Other gateways never touch it.

  3. Key sharding + sum-on-read (if both fail): Split rl:{key}:{window} into rl:{key}:{window}:{shard_0..7} and sum the counts at read time. Distributes Redis write load across 8 shards. More complex to implement.

The edge case I'd surface: is this a legitimate partner doing a bulk operation (false positive) or actual abuse? Tighten too hard and you cut off a paying customer. I'd check account tier before escalating to full deny."
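
A minimal sketch of the local deny-cache from mitigation 1. The names (`DenyCache`, `check`, `remote_check`) are hypothetical, and as a simplification this version caches any remote deny, whereas a production version might only cache keys that are far over their limit.

```python
import time
from typing import Callable, Dict


class DenyCache:
    def __init__(self, ttl_s: float = 30.0, clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._denied_until: Dict[str, float] = {}

    def mark_denied(self, key: str) -> None:
        self._denied_until[key] = self.clock() + self.ttl_s

    def is_denied(self, key: str) -> bool:
        expiry = self._denied_until.get(key)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            del self._denied_until[key]     # TTL elapsed: ask the remote store again
            return False
        return True


def check(key: str, cache: DenyCache, remote_check: Callable[[str], bool]) -> bool:
    if cache.is_denied(key):
        return False                        # rejected locally, no Redis round trip
    allowed = remote_check(key)
    if not allowed:
        cache.mark_denied(key)
    return allowed
```

The TTL is the tradeoff knob: a longer TTL shields Redis better but keeps a legitimate partner locked out longer after a false positive.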

Why this is L6:

  • Explains why sharding doesn't solve hot key (structurally different problems)
  • Names collateral damage to other users on the same shard — not just the hot key user
  • Proposes mitigations in priority order with implementation tradeoffs
  • Raises the false positive edge case before the interviewer asks
❌ Common L5 Trap

"One API key at 90% of traffic would be handled by our Redis cluster — the key just gets heavy writes to that shard, but Redis can handle it with enough memory and CPU."

Why this misses: "Redis can handle it" ignores the saturation math. At 100K req/s from one key, the Redis shard handling that key receives 100K writes/second for just that one counter. Redis single-shard throughput is ~100K simple operations/second — this one key alone saturates the entire shard. Every other user hashing to that shard experiences latency spikes. The interviewer asks: "User Alice hashes to Shard 3, same as the abusive key. What does Alice experience?" Answer without thinking carefully: "Normal service." Correct answer: "Latency spikes and potential timeouts as Shard 3 is saturated."

Drill 6: Multi-Tenant Fairness

Staff Answer

"Multi-tenant fairness is a different problem than abuse protection — the noisy tenant isn't necessarily malicious, they're just large. The goal is per-tenant SLO preservation, not just global QPS control.

I'd use hierarchical token buckets with reserved floors:

Global capacity bucket (protects platform overall)
  └── Tenant A bucket (contracted: 500 req/s, burst: 800 req/s)
       └── Sub-buckets per user/endpoint within Tenant A
  └── Tenant B bucket (contracted: 200 req/s, burst: 300 req/s)
  └── Shared burst pool (available to all tenants when slack exists)

The key mechanism: each tenant has a guaranteed reserved_floor (their contracted rate) that they always get, even under global contention. Above the floor, tenants compete for the shared burst pool. Under contention, burst pool allocation falls back to the reserved floor.

Observability for fairness — critical metrics:

  • rate_limit_rejected_total{tenant} per tenant
  • tenant.p99_latency_ms per tenant — fairness failures show as latency regressions before they show as rejection spikes
  • Alert: 'Tenant starvation — tenant heavily throttled while global utilization < 80%'"
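
One way to make the reserved-floor mechanism concrete is as an allocation function over per-tenant demand. This is a sketch, not the only possible policy: the function name is hypothetical, slack is shared in proportion to unmet demand, and the floors are assumed to sum to no more than global capacity.

```python
from typing import Dict


def allocate(global_capacity: float, floors: Dict[str, float],
             bursts: Dict[str, float], demand: Dict[str, float]) -> Dict[str, float]:
    """Admit each tenant up to its floor first, then share leftover capacity up to burst."""
    granted = {t: min(demand[t], floors[t]) for t in floors}
    slack = global_capacity - sum(granted.values())
    # Unmet demand that is still under the tenant's burst ceiling.
    wants = {t: max(0.0, min(demand[t], bursts[t]) - granted[t]) for t in floors}
    total_want = sum(wants.values())
    if slack > 0 and total_want > 0:
        scale = min(1.0, slack / total_want)
        for t in floors:
            granted[t] += wants[t] * scale
    return granted


floors = {"A": 500, "B": 200}
bursts = {"A": 800, "B": 300}
```

When there is no slack, every tenant falls back to exactly its floor, which is the guarantee the answer above relies on.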

Why this is L6:

  • Frames fairness as a per-tenant SLO problem, not a global QPS problem
  • Proposes hierarchical buckets with reserved floors — protects small tenants from large ones
  • Includes per-tenant observability with the specific fairness-detection alert
❌ Common L5 Trap

"We'll give each tenant a separate rate limit based on their plan tier. Enterprise gets 1,000 req/s, Pro gets 500, Free gets 100."

Why this misses: Tiered limits prevent individual tenants from exceeding their quota, but don't prevent the noisy neighbor problem — Enterprise Tenant A consuming all 1,000 req/s of their quota might saturate the shared infrastructure in ways that degrade Tenant B's experience even though Tenant B is within their 500 req/s limit. The L5 answer sets per-tenant ceilings; the L6 answer uses hierarchical buckets with reserved floors to guarantee per-tenant minimums.

Drill 7: Build vs Buy

Staff Answer

"My default is always: use what you already have. Before proposing custom rate limiting, I'd inventory existing controls:

  1. CDN/WAF already has coarse IP-based rate limiting. Does this solve the abuse problem? For volumetric DDoS, yes. For API abuse from authenticated clients, no.

  2. AWS API Gateway / GCP Apigee / Envoy RLS all have built-in rate limiting. Do they support our identity model (user_id + endpoint + tenant)? Do they support our coordination model (local-first with async sync)? If yes, use them.

  3. If managed solutions have a gap: Custom rate limiting is justified only if the gap is differentiated and the maintenance cost is worth it. Common gaps: complex multi-tenant fairness, custom token leasing semantics, per-request cost (not all requests cost 1 token), business-logic-aware limiting.

TCO argument for managed: A managed solution means no on-call burden for the rate limiting infrastructure itself — just the policy configuration. A custom solution means rate-limiting incidents, rate-limiting upgrades, and rate-limiting on-call rotation, all on top of the actual product work.

My recommendation for most mid-size companies: start with managed gateway throttling (Envoy RLS, AWS WAF rate-based rules in front of CloudFront, or API Gateway). Add custom logic only where managed solutions have demonstrable gaps that affect business goals."

Why this is L6:

  • Inventories existing controls before proposing new work
  • Makes the TCO argument explicit (on-call, upgrades, and incidents for managed vs custom) rather than just saying "use the managed service"
  • Identifies the specific gaps that justify custom engineering
❌ Common L5 Trap

"We should build a custom rate limiter for full control and flexibility."

Why this misses: "Full control" is not a TCO argument. The interviewer asks: "What does 'full control' get us that Envoy RLS or AWS API Gateway throttling doesn't?" If the answer is "we can customize it more," the follow-up is: "Who maintains the custom rate limiter? Who's on-call for it? Who upgrades it when Redis has a CVE?" Custom infrastructure has organizational cost beyond engineering time.

Drill 8: Policy Changes Without Outages

Staff Answer

"Policy changes follow a lifecycle. 'Deploy tomorrow' is an organizational failure — we should already have self-service for this:

Stage 1 — Shadow mode (24 hours): Apply the new limit logically but don't enforce. Log what would have been rejected. Check: would this limit have affected any legitimate paying customers yesterday?

Stage 2 — Soft enforcement (1 week): Send warning headers to requests that would be rejected, but don't actually return 429. Example: X-RateLimit-Warning: You would be rate limited. Reduce usage. Teams see they're about to be throttled without disruption.

Stage 3 — Canary enforcement (48 hours): Apply hard enforcement to 5% of traffic. Monitor per-tenant rejection rates. If top tenants spike, the limit is misconfigured — rollback with one config change, no deploy.

Stage 4 — Full enforcement: Roll out to 100% after clean canary.

The organizational piece: who approves the change? The platform team sets guardrails (maximum limits per tier). Product teams can self-service within those bounds. Changes above the guardrails require an architecture review. This prevents 'product team set a limit too high because they were scared of throttling users' from making the limit useless."
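
The four stages can be expressed as a single decision function, which is what makes rollback a config change rather than a deploy. A sketch under assumptions: the `Stage` enum, the `decide` function, and bucketing the canary by a hash of a stable key are illustrative choices, not a prescribed design.

```python
import zlib
from enum import Enum
from typing import Dict, Tuple


class Stage(Enum):
    SHADOW = "shadow"
    WARN = "warn"
    CANARY = "canary"
    ENFORCE = "enforce"


def decide(stage: Stage, over_limit: bool, bucket_key: str,
           canary_percent: int = 5) -> Tuple[int, Dict[str, str], bool]:
    """Return (status_code, extra_headers, log_would_reject)."""
    if not over_limit:
        return 200, {}, False
    if stage is Stage.SHADOW:
        return 200, {}, True                  # log only, never reject
    if stage is Stage.WARN:
        return 200, {"X-RateLimit-Warning": "You would be rate limited. Reduce usage."}, True
    if stage is Stage.CANARY:
        # Stable hash so the same key always lands in the same bucket.
        in_canary = zlib.crc32(bucket_key.encode()) % 100 < canary_percent
        return (429 if in_canary else 200), {}, True
    return 429, {}, True
```

Moving a policy forward or back is then a change to the stored stage value; the `log_would_reject` flag is what feeds the "who would this have affected" check in Stage 1.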

Why this is L6:

  • Full policy lifecycle with shadow mode, soft enforcement, and canary — not "update the config"
  • Rollback is a one-line config change, not a deploy, which is the precondition for making policy changes safely
  • Defines the organizational governance layer (guardrails + self-service + review)
❌ Common L5 Trap

"Update the rate limit config and deploy. We can use a feature flag to enable it."

Why this misses: A feature flag enables/disables the limit but doesn't tell you if the new limit is correct. The interviewer asks: "How do you know if the new limit is too tight before you've already throttled your top customers?" Answer: "We'll find out from support tickets." The shadow mode and soft enforcement stages exist specifically to catch misconfigured limits before they cause user impact. Deploying with a feature flag is not a deployment strategy — it's a kill switch.

8 Deep Dive Scenarios

Scenario-based analysis for Staff-level depth

Deep Dive 1: Flash Sale Incident

Context: It's Black Friday. The rate limiter is returning 429s to legitimate users during a 10× traffic spike. The on-call engineer escalates to you.

Questions to Surface First:

  • Is the backend actually unhealthy, or is the limiter rejecting traffic the backend could handle?
  • Is this affecting all users or specific tenant tiers? Are VIP/enterprise customers hitting public limits?
  • Was this traffic spike predicted? Was there a capacity planning exercise for Black Friday?
  • What's the business cost per minute of rejecting legitimate buyers right now?
Staff Approach — Full Reasoning
| Phase | What to Do |
| --- | --- |
| Immediate (0–5 min) | Check backend health first: CPU, error rate, connection pool. If backend is healthy, the limiter is misconfigured, not the users. |
| Triage | Is this one tenant hitting limits, or system-wide? Check per-tenant dashboards. Are enterprise customers (who have pre-negotiated higher limits) affected? |
| Quick fix | If backend is healthy and legitimate traffic is rejected: emergency limit increase via feature flag. Document the change, set an expiry. |
| Guardrails | While limits are raised: watch backend CPU and error rates. Don't let increased limits cause a cascade. Set a limit on the limit increase. |
| Post-mortem | Why didn't capacity planning catch this? Should we have elastic limits for planned events? Should marketing notify infra before campaigns? |

Staff insight: The rate limiter's job is to protect the backend, not to be "correct." If the backend is healthy and users are being rejected, the limiter is misconfigured for the situation.

Metrics to Watch: backend.cpu_utilization, backend.error_rate, backend.connection_pool_utilization, rate_limiter.rejection_rate_by_tier (are VIP customers hitting limits?), business.revenue_per_minute (the actual cost of rejections)

Organizational Follow-up: Create a pre-event capacity review process with marketing and product — Black Friday is not a surprise. Build elastic limits for planned events (limits that automatically increase during pre-announced high-traffic windows). Add "large marketing campaigns" to the infra team's change management calendar.

Ownership Question: "Who decides whether to override the rate limit during a flash sale incident?" Staff answer: The Staff on-call engineer, with a time-bounded override that auto-expires in 2 hours. The decision criteria are codified in the runbook: (1) backend health is green, (2) legitimate traffic is being rejected, (3) override logged with reason. This cannot be a per-incident judgment call — it needs to be a pre-approved procedure.
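
A time-bounded override with a ceiling can be modeled as data plus one pure function. This sketch assumes a hypothetical `LimitOverride` record and a platform-defined `max_multiplier`; the 2-hour default TTL mirrors the procedure above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LimitOverride:
    tenant: str
    multiplier: float
    reason: str                      # logged with the override, per the runbook
    created_at: float                # seconds, same clock as `now` below
    ttl_s: float = 2 * 60 * 60       # auto-expires after 2 hours

    def active(self, now: float) -> bool:
        return self.created_at <= now < self.created_at + self.ttl_s


def effective_limit(base_limit: int, override: Optional[LimitOverride],
                    now: float, max_multiplier: float = 3.0) -> int:
    """Apply an active override, but never beyond the platform ceiling."""
    if override is None or not override.active(now):
        return base_limit
    return int(base_limit * min(override.multiplier, max_multiplier))
```

Because expiry is evaluated on every read, nobody has to remember to remove the override after the incident.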

Staff Signals:

  • Checks backend health before blaming the rate limiter
  • Has pre-established emergency procedures, not ad-hoc judgment calls
  • Identifies the organizational gap (capacity planning didn't include Black Friday) rather than just fixing the immediate symptom

Deep Dive 2: Silent Fail-Open

Context: You discover the rate limiter has been running in fail-open mode for 3 days because Redis was slow. No alerts fired. What went wrong, and how do you fix the system?

Questions to Surface First:

  • How long has fail-open been active? Did abuse increase during this window — check login failure rates, API error rates, unusual traffic patterns.
  • Is fail-open an intentional design choice, or did it happen by accident (no fallback configured)?
  • What other services degrade silently when their dependencies are slow? Is this a systemic pattern?
  • Who should have been alerted, and why wasn't existing monitoring sufficient?
Staff Approach — Full Reasoning
| Dimension | Staff Answer |
| --- | --- |
| Root cause | Missing observability contract: bypass rate wasn't monitored, or alert threshold was wrong |
| Immediate action | Is this an active incident? Check for abuse during the 3-day window: login failures, API anomalies, unusual data access patterns |
| System fix | Add rate_limiter.bypass_rate metric with alert: bypass_rate > 1% for 5min → page. This is non-negotiable. |
| Process fix | Fail-open is a deliberate choice. But deliberate requires visibility. Add time-bounded fail-open: auto-expires after 30 minutes and requires explicit re-approval to extend. |
| Broader question | Audit all services with fail-open behavior for similar observability gaps. "Silent degradation" as a category in post-incident retrospectives. |

Staff insight: Fail-open without alerting is the same as having no rate limiter. Every deliberate degradation mode needs a corresponding observability contract.

Metrics to Watch: rate_limiter.bypass_rate (alert if >1% for 5min), redis.latency_p99 (warn at 5ms, critical at 10ms), rate_limiter.enforcement_mode (enforcing vs bypass — should be a visible dashboard metric, not buried in logs), login.failure_rate (anomaly during bypass window), api.error_rate_by_endpoint (abuse signal during bypass window)
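
The alert condition "bypass_rate > 1% for 5min" can be stated precisely as a function over scraped samples. In practice this rule would live in the monitoring system; the sketch below (hypothetical function name and sample format) only pins down the semantics.

```python
from typing import Iterable, Tuple


def should_page(samples: Iterable[Tuple[float, int, int]],
                threshold: float = 0.01, sustained_s: float = 300) -> bool:
    """samples: (timestamp_s, bypassed_count, total_count) per scrape, oldest first."""
    breach_start = None
    for ts, bypassed, total in samples:
        rate = bypassed / total if total else 0.0
        if rate > threshold:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= sustained_s:
                return True               # above threshold for the whole sustained window
        else:
            breach_start = None           # any healthy sample resets the timer
    return False
```

A sustained-window rule like this tolerates a brief Redis blip without paging, while a 3-day silent bypass would trip it within the first few minutes.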

Organizational Follow-up: Audit all services with fail-open behavior for observability gaps — this is a systemic pattern, not a one-off. Add "silent degradation detection" as a quarterly resilience review checklist item. Create an organizational standard: any service that can degrade silently must have a bypass-rate equivalent metric with a PagerDuty alert.

Ownership Question: "Who is responsible for the 3-day window where we were unprotected?" Staff answer: Two failures of ownership. (1) The engineer who implemented fail-open without adding the bypass-rate alert — they made a deliberate design decision without the corresponding observability contract. (2) The on-call rotation for not detecting the anomaly in Redis latency metrics that would have indicated the issue. Both gaps need process fixes, not blame.

Staff Signals:

  • Treats this as a systemic observability contract failure, not a Redis monitoring gap
  • Proposes time-bounded fail-open that auto-expires and requires explicit re-approval
  • Audits all services with similar silent degradation modes

Deep Dive 3: Large Tenant Onboarding

Context: Sales just closed a deal 5× larger than your current biggest tenant. They go live in 2 weeks. Ensure the rate limiter handles them without affecting other tenants.

Questions to Surface First:

  • What's this tenant's expected traffic pattern — steady-state vs bursty? What are their peak hours?
  • If this tenant misbehaves, what's the blast radius to other tenants? Are counters shared with others on the same Redis shard?
  • What SLA did Sales promise? Is it documented, or implicit? Who approved the capacity commitment?
  • Do we have tenant isolation at the infrastructure level, or are all tenants sharing the same Redis shard?
Staff Approach — Full Reasoning

| Phase | What to Do |
| --- | --- |
| Capacity math | Current largest tenant's peak QPS × 5 = new hot-key ceiling. Will this saturate a Redis shard? (Redis single shard: ~100K ops/sec) |
| Isolation assessment | Does this tenant need dedicated Redis infrastructure, or can they share with a token-leasing model that bounds per-tenant Redis QPS? |
| Shadow mode | Deploy in shadow mode first: log what the rate limiter would do, but don't enforce. Run for 48 hours before going live. |
| Staged rollout | 10% of tenant traffic → monitor shard health, latency, rejection rates → 100% |
| SLA documentation | "Tenant X gets guaranteed 50K req/s. Platform reserves 20% headroom for burst. Any traffic above 50K may be throttled." |
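
The capacity-math row can be made concrete with a few lines of arithmetic. This is a rough sketch only: the ~100K ops/sec shard ceiling comes from the table, while the 10K req/s baseline tenant, the lease batch of 100, and the helper name `shard_utilization` are illustrative assumptions.

```python
SHARD_OPS_CEILING = 100_000  # approximate single-shard Redis throughput cited in the table


def shard_utilization(peak_qps: float, requests_per_redis_op: int = 1) -> float:
    """Fraction of one shard consumed by a single tenant's rate-limit key.

    requests_per_redis_op=1 models a Redis check on every request; a larger
    value models token leasing, where one round trip covers a batch of tokens.
    """
    return (peak_qps / requests_per_redis_op) / SHARD_OPS_CEILING


# Hypothetical figures: the current largest tenant peaks at 10K req/s,
# and the new tenant is 5x that.
new_tenant_peak_qps = 10_000 * 5

per_request = shard_utilization(new_tenant_peak_qps)
leased = shard_utilization(new_tenant_peak_qps, requests_per_redis_op=100)
print(f"per-request: {per_request:.0%} of a shard; leased: {leased:.1%} of a shard")
# per-request: 50% of a shard; leased: 0.5% of a shard
```

Under these assumed numbers, one hot key would take half a shard on the per-request path, which is the kind of result that pushes the isolation assessment toward leasing or dedicated infrastructure.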

Staff insight: "Can we handle this tenant?" is the wrong question. The right question is "What's the blast radius if this tenant misbehaves, and how do we isolate it from affecting our other 10,000 tenants?"

Metrics to Watch: redis.shard_cpu_utilization (per shard, alert at 70% during onboarding), rate_limiter.tenant_rejection_rate (per tenant; the new tenant's traffic shouldn't raise rejection rates for other tenants), rate_limiter.p99_by_tenant (fairness signal), onboarding.shadow_mode_rejection_rate (would-be rejections before enforcement goes live)

Organizational Follow-up: Create a "large tenant onboarding checklist" for Sales/Engineering handoff: capacity sign-off required before SLA commitment, technical discovery call between tenant's infra team and platform team, 2-week minimum shadow mode before enforcement. Add "large tenant onboarding" as a trigger for automatic capacity review.

Ownership Question: "Sales committed to a 200K req/s SLA without asking infra. Who owns the resulting capacity emergency?" Staff answer: Sales owns the customer relationship and the commitment made without sign-off. Infra owns the implementation of whatever we commit to. The organizational fix: rate limit SLA commitments require infra sign-off above a QPS threshold. Create a named "capacity commitment approval" process. Sales gets fast-path approval for standard tiers; above-standard requires a 1-week infra review.

Staff Signals:

  • Reframes from handling load to blast radius isolation
  • Uses shadow mode before hard enforcement for high-risk capacity changes
  • Negotiates explicit SLA documentation with Sales before committing infrastructure

Deep Dive 4: Post-Mortem — Limiter Let Through an Attack

Context: Post-mortem shows a credential-stuffing attack got through the rate limiter for 45 minutes. 50K accounts were compromised. What do you present to leadership?

Questions to Surface First:

  • Was rate limiting ever designed to be the primary defense against credential stuffing, or was it assumed?
  • What other defense layers exist (WAF, CAPTCHA, anomaly detection)? Which ones fired, which didn't?
  • How did the attacker evade detection — IP rotation, distributed botnet, credential replay from a breach database?
  • What's the regulatory exposure? Do we need to notify affected users under GDPR/CCPA?

Defense-in-depth: rate limiting (Layer 2) cannot stop sophisticated attacks alone. Layers 3 and 4 catch what volume-based rules miss.

Staff Approach — Full Reasoning

| Section | Content |
| --- | --- |
| What happened | Attacker rotated 5K IPs at 10 req/IP/min; each IP stayed under the per-IP threshold of 20 req/min. Identity was weak (IP-only on unauthenticated endpoint). |
| Why we missed it | Rate limiting is volume-based protection. This attack distributed volume across many identities. Behavioral detection (login failure rate anomaly) would have caught it, but wasn't the first line of defense. |
| Immediate actions | Add global login-failure-rate limit (regardless of IP): alert at 1,000 failures/minute globally. Add per-IP login failure rate: alert at 5 failures/min per IP (different from request rate). Integrate with CAPTCHA at threshold. |
| Systemic fix | Layered defense: CDN/WAF (known attack patterns) + rate limiting (volume protection) + anomaly detection (behavioral patterns) + CAPTCHA (human verification). Each layer catches different attack vectors. |
| Ownership boundary | Define who owns login abuse detection: Security team (threat intelligence, attack patterns) vs Platform team (infrastructure enforcement). Both own the gap that allowed 45 minutes of undetected abuse. |
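
To see why the per-IP limiter stayed silent while a failure-rate signal would have fired, the figures from the table can be plugged into a small check. This is a sketch under one loud assumption: the post-mortem does not state what share of stuffed credentials failed, so the 95% failure ratio below is hypothetical, as are the function names.

```python
def per_ip_request_limit_trips(req_per_ip_per_min: int, per_ip_threshold: int) -> bool:
    """A per-IP request limiter only fires when a single IP exceeds its threshold."""
    return req_per_ip_per_min > per_ip_threshold


def global_failure_alert_trips(ips: int, req_per_ip_per_min: int,
                               failure_ratio: float, global_threshold: int) -> bool:
    """A global login-failure limit looks at the sum across all IPs."""
    failures_per_min = ips * req_per_ip_per_min * failure_ratio
    return failures_per_min > global_threshold


# From the table: 5K IPs at 10 req/IP/min, per-IP threshold of 20 req/min,
# proposed global alert at 1,000 failures/min and per-IP alert at 5 failures/min.
ASSUMED_FAILURE_RATIO = 0.95  # hypothetical; not stated in the post-mortem

per_ip = per_ip_request_limit_trips(10, per_ip_threshold=20)
global_failures = global_failure_alert_trips(5_000, 10, ASSUMED_FAILURE_RATIO, 1_000)
per_ip_failures = 10 * ASSUMED_FAILURE_RATIO > 5

print(per_ip, global_failures, per_ip_failures)  # False True True
```

The volume rule sees 10 req/min per IP and stays quiet; the aggregate failure count is tens of thousands per minute and would have crossed the proposed global threshold almost immediately.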

Metrics to Watch: login.failure_rate_global (alert at 1K failures/min), login.failure_rate_per_ip (alert at 5/min per IP), login.unique_ips_per_minute (spread = botnet signal), waf.blocked_rate (Layer 1 effectiveness), captcha.trigger_rate (Layer 4 load)

Organizational Follow-up: Define ownership boundary: who owns login abuse detection? Security team owns threat intelligence and policy; Platform team owns enforcement infrastructure. Joint review of defense-in-depth posture with specific ownership assigned at each layer. File regulatory notification if breach threshold is met (GDPR: 72-hour notification). Quarterly adversarial simulation to test each defense layer.

Ownership Question: "The rate limiter was operating as designed. Who's accountable for the 50K compromised accounts?" Staff answer: Accountability is shared. (1) The original architecture decision that used rate limiting as the sole authentication defense — the engineer who made that decision and the architect who approved it. (2) Security team for not having behavioral anomaly detection on login failure rates. (3) Product for not prioritizing multi-factor authentication. Single points of accountability for systemic security failures are usually wrong — the postmortem should identify the layer of defense that should have caught this and didn't, and assign that to the team responsible for that layer.

Staff Signals:

  • Presents to leadership with architectural framing (rate limiting was never the right sole defense) rather than just fixing the specific gap
  • Proposes layered defense-in-depth, not just tighter thresholds
  • Defines ownership boundaries across Security and Platform teams as an organizational action item

Deep Dive 5: Multi-Region Expansion

Context: Your company is expanding to EU. The rate limiter is currently single-region. What's your recommendation?

Questions to Surface First:

  • What's the primary use case for EU expansion — latency reduction, data residency compliance, or both?
  • Which rate limiting intents need global accuracy (billing/quota) vs regional approximation (abuse protection)?
  • Do rate limit counters contain PII or request metadata that falls under GDPR data residency requirements?
  • What's the acceptable drift window — can a global abuser temporarily get 2× their limit across two regions?

Hybrid approach: abuse protection runs independently per region with async sync; billing/quota uses region-scoped quotas or tighter global coordination where accuracy matters.

Staff Approach — Full Reasoning

| Option | Tradeoffs |
| --- | --- |
| Independent per-region | Simple. A global abuser can hit 2× their limit (once per region). No cross-region latency. Recommended for abuse protection. |
| Global coordination | Accurate. But cross-region latency (+50–100ms per request) adds to the hot path or requires complex async sync. Only for billing-critical endpoints. |
| Hybrid (per-region + async reconciliation) | Per-region enforcement with async global reconciliation every 60 seconds. Accepts bounded drift. Best of both for most cases. |

The GDPR question that Senior candidates miss: Rate limit counters store {user_id}:{endpoint}:{count}. User IDs are personal data under GDPR. Replicating EU user rate limit counters to US Redis may require a legal basis (SCCs, adequacy decision) or be prohibited under data sovereignty requirements. Check with Legal before designing cross-region replication of counters.

Staff recommendation:

  • For abuse protection: per-region enforcement with optional async reconciliation (async, not sync — never in the hot path)
  • For billing/quota: region-scoped quotas (EU users get EU quota, US users get US quota) OR tighter global coordination only on billing endpoints

Migration plan:

  • Week 1–2: Deploy EU rate limiter infrastructure (Redis, gateway middleware); no traffic is migrated yet
  • Week 3–4: Route new EU signups to EU enforcement
  • Week 5–8: Migrate existing EU users to EU enforcement (canary, then full)
  • Week 9+: Optional global reconciliation for abuse counters

Metrics to Watch: cross_region.counter_sync_latency_ms, cross_region.drift_bound (max counter difference between regions at any point), geo_routing.misroute_rate (EU users accidentally hitting US enforcement), gdpr.counter_data_residency_violations (EU user counters found in US store)

Organizational Follow-up: Legal/compliance sign-off on counter data residency before any cross-region replication is deployed. Define SLA for cross-region reconciliation latency. Create runbook for region failover: if US goes down, does EU absorb global traffic with global limits or regional-only limits?

Ownership Question: "Who decides whether billing uses global or region-scoped quotas?" Staff answer: Finance (they understand the revenue impact of 2× quota across regions), Legal (data residency constraints on counter replication), and Product (customer contractual implications of region-scoped vs global quotas). Engineering provides the technical options and cost tradeoffs. The actual decision is a business decision, not an engineering decision.

Staff Signals:

  • Separates abuse protection (tolerates drift) from billing/quota (needs accuracy) rather than applying one architecture to both
  • Raises GDPR/data residency constraints on counter data — the non-technical dimension most candidates miss
  • Plans a phased rollout rather than a big-bang migration

9 Level Expectations Summary

After studying this playbook, you should be able to:

  • Name the three intents and explain why they require incompatible failure modes
  • Quantify the latency trap: a synchronous Redis check on every request adds 5–10ms of overhead at scale
  • Design a local-first coordination model with a named mechanism (token leasing, key-owner routing, bounded reconciliation) and its specific failure mode
  • Explain fail-open vs fail-closed by intent, with bypass-rate observability as a mandatory component
  • Walk through the silent fail-open failure mode from detection to runbook
  • Explain why hot keys are structurally different from high-cardinality traffic and propose mitigations
  • Design a policy management system with a lifecycle (shadow → warn → canary → enforce) and organizational governance

The Bar for This Question

Mid-level (L4/E4): Understands rate limiting as a counter in Redis with a TTL. Can explain token bucket semantics. Knows rate limiting prevents abuse. Doesn't reason about distributed coordination, failure modes, or ownership.

Senior (L5/E5): Builds the baseline architecture quickly — gateway middleware + Redis counters with Lua atomicity. Reasons about the consistency-availability tradeoff. Has an opinion on local vs centralized counters and can explain the latency implications of a synchronous Redis call. Gets to fail-open vs fail-closed when prompted.

Staff+ (L6/E6+): Spends under 5 minutes on architecture and 30+ minutes on depth. Quantifies the latency trap and names the coordination mechanism (leasing, routing, or reconciliation) with its failure mode. Makes fail-open/fail-closed a proactive decision, not a response to the interviewer's question. Defines bypass-rate observability as mandatory. Reasons about policy management as an organizational problem. The interviewer should learn something from the answer.

10 Staff Insiders: Controversial Opinions

10.1 "Exact Rate Limiting" Is a Myth

At scale, you are never enforcing the limit you think you're enforcing.

| Factor | Impact |
| --- | --- |
| Clock skew | Token refill varies by 10–100ms across nodes |
| Network delay | Coordination messages arrive late |
| Retry amplification | Rejected requests retry, adding load |
| Batching | Requests arrive in bursts, not smoothly |
| Measurement lag | By the time you measure, you've already over-admitted |

The Staff position: Stop pretending you're enforcing "exactly 1000 req/s." You're enforcing "approximately 1000 req/s ± ε." The Staff question is: what's your ε, and is it acceptable for your intent?

Why this matters in interviews: Candidates who claim exact enforcement without acknowledging drift reveal they haven't operated rate limiters at scale. The bar-raiser question: "What's the worst-case over-admission in your design, and who signed off on it?"

10.2 Abuse Protection and Billing Cannot Share an Algorithm

If you're using the same rate limiter for abuse protection and billing enforcement, one of them is wrong.

| Dimension | Abuse Protection | Billing/Quota |
| --- | --- | --- |
| Failure mode | Fail-open (limiter shouldn't be kill switch) | Fail-closed (can't give away resources) |
| Correctness | Bounded drift acceptable | Drift unacceptable; audit required |
| Latency | Cannot add latency to hot path | Latency acceptable for accuracy |
| Identity | Weak (IP, fingerprint) | Strong (authenticated user, API key) |

The Staff position: These are fundamentally different systems. Trying to serve both with "one rate limiter" leads to billing drift (abuse limiter too loose) or availability problems (billing limiter too strict for abuse). Stripe figured this out and separated them. Companies that conflate them eventually have an incident.

10.3 Global Fairness Dies at Scale — And That's OK

Many companies that claim "global rate limiting" are actually doing per-region rate limiting and calling it global. At true global scale, they've abandoned global fairness.

| Scale | What Works | What Breaks |
| --- | --- | --- |
| Single region | Redis per-request, strong consistency | Works fine |
| Multi-region, low QPS | Cross-region coordination | Acceptable for billing |
| Multi-region, high QPS | Per-region enforcement | Global accuracy is approximate |
| True global scale (10M+ req/s) | Per-region, no reconciliation | Regions are effectively independent |

The dirty secret: At hyperscale, a user with a "1,000 req/min global limit" might actually get 1,000 × N (where N = number of regions) if they distribute traffic across regions.

The Staff position: Global fairness is a spectrum. The honest question is: "What's the blast radius of our approximation, and is that acceptable?" At hyperscale, the answer is almost always "yes — 2× regional burst is acceptable for abuse protection."

10.4 The Policy Management Problem Is Harder Than the Technical Problem

The rate limiter is a solved technical problem. The unsolved problem is getting 15 product teams to agree on rate limit policies, keep them updated, respond when limits are hit, and not override the platform team's guardrails when their customers complain.

In practice:

  • Product teams set limits too high because they fear blocking their customers — the limits become theater
  • Product teams set limits too low and then complain when customers are throttled
  • Nobody updates limits when a product changes behavior significantly
  • The platform team becomes a bottleneck for every policy change, or abdicates policy ownership entirely

The Staff position: Rate limiting's technical implementation is a 2-week project. Rate limiting's organizational governance is a 6-month project. Staff engineers work on the 6-month project after shipping the 2-week project. Self-service policy API with guardrails, policy lifecycle with shadow mode, and clear ownership boundaries are the deliverables that actually matter long-term.

Appendices
Appendix A: Algorithm Mechanics — Token Bucket, Fixed Window, Sliding Window

A.1 What "Limit" Actually Means

"100 requests per second" hides four decisions:

  1. Is burst allowed?
  2. Over what window?
  3. How much boundary skew is acceptable?
  4. What happens at the exact threshold?

A.2 Fixed Window — Why It's Almost Always Wrong

t=0.99s: 100 requests → Allowed (window 1)
t=1.01s: 100 requests → Allowed (window 2 just started)
Result:  200 requests in ~20ms

Users can time requests to get 2× burst at every window boundary.
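
The boundary burst is easy to reproduce. The following is a minimal in-memory sketch (the class name `FixedWindowLimiter` and the 1-second window are illustrative assumptions, not a production implementation) that replays the two timestamps from the example above.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Counts admitted requests per fixed window; the count resets at each boundary."""

    def __init__(self, limit: int, window_s: float = 1.0):
        self.limit = limit
        self.window_s = window_s
        self.counts = defaultdict(int)  # window index -> admitted count

    def allow(self, now_s: float) -> bool:
        window = int(now_s // self.window_s)
        if self.counts[window] >= self.limit:
            return False
        self.counts[window] += 1
        return True


limiter = FixedWindowLimiter(limit=100)
admitted = sum(limiter.allow(0.99) for _ in range(100))   # tail of window 0
admitted += sum(limiter.allow(1.01) for _ in range(100))  # head of window 1
print(admitted)  # 200
```

Both batches are admitted because each lands in a window whose counter is still at zero, even though they arrive about 20ms apart.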

A.3 Token Bucket — Gateway Default


Parameters:

  • capacity (burst): maximum tokens the bucket can hold — sets the burst ceiling
  • refill_rate (steady-state): tokens added per second — sets the sustained rate
  • cost (optional): tokens per request (default 1; set >1 for expensive endpoints)

Timeline example (assuming capacity = 200 and refill_rate = 100 tokens/s, which is what the balances below imply):

t=0.0s: 150 requests → 150 allowed (tokens: 200→50)
t=0.5s: 120 requests → refilled 50 tokens (50→100)
                     → 100 allowed, 20 rejected (429)
t=0.6s: 10 requests  → refilled ~10 tokens (0→10)
                     → 10 allowed (tokens: 0)

Why token bucket fits gateway: allows controlled bursts, smooths traffic, O(1) evaluation, degrades gracefully, easy to approximate locally.
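
A lazy-refill token bucket that replays the timeline above might look like the sketch below. It assumes capacity = 200 and refill_rate = 100 tokens/s (inferred from the timeline's balances) and uses integer milliseconds so the refill arithmetic stays exact; the class and helper names are illustrative.

```python
class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float, now_ms: int = 0):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = float(capacity)   # the bucket starts full
        self.last_ms = now_ms

    def allow(self, now_ms: int, cost: float = 1.0) -> bool:
        # Lazy refill: credit tokens earned since the last call, capped at capacity.
        elapsed_ms = max(0, now_ms - self.last_ms)
        self.tokens = min(self.capacity, self.tokens + elapsed_ms * self.refill_rate / 1000)
        self.last_ms = now_ms
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def send(bucket: TokenBucket, now_ms: int, n: int):
    """Fire n requests at the same instant; return (allowed, rejected)."""
    allowed = sum(bucket.allow(now_ms) for _ in range(n))
    return allowed, n - allowed


bucket = TokenBucket(capacity=200, refill_rate=100)
print(send(bucket, 0, 150))    # (150, 0)
print(send(bucket, 500, 120))  # (100, 20)
print(send(bucket, 600, 10))   # (10, 0)
```

Each decision is a handful of arithmetic operations on two stored values, which is the O(1) evaluation the paragraph above refers to.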

A.4 Sliding Window Counter — Memory-Efficient Approximation

Weighted average of current and previous fixed windows. ±1% error at boundaries. Memory: O(1) per client.

count ≈ (prev_window_count × overlap_fraction) + current_window_count

Good for high-cardinality clients where O(N) sliding window log is too expensive.
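
The weighted-average formula can be sketched with two counters per client. The class below is an illustrative, single-process version (names and the millisecond clock are assumptions); it replays the same boundary timing that defeats the fixed window in A.2.

```python
class SlidingWindowCounter:
    def __init__(self, limit: int, window_ms: int):
        self.limit = limit
        self.window_ms = window_ms
        self.current_window = 0  # index of the fixed window current_count refers to
        self.current_count = 0
        self.prev_count = 0

    def _roll(self, now_ms: int) -> None:
        window = now_ms // self.window_ms
        if window == self.current_window + 1:
            self.prev_count, self.current_count = self.current_count, 0
        elif window > self.current_window + 1:
            self.prev_count, self.current_count = 0, 0  # idle for a full window or more
        self.current_window = max(window, self.current_window)

    def estimate(self, now_ms: int) -> float:
        self._roll(now_ms)
        # Part of the sliding window that still overlaps the previous fixed window.
        overlap_ms = self.window_ms - (now_ms % self.window_ms)
        return self.prev_count * overlap_ms / self.window_ms + self.current_count

    def allow(self, now_ms: int) -> bool:
        if self.estimate(now_ms) >= self.limit:
            return False
        self.current_count += 1
        return True


limiter = SlidingWindowCounter(limit=100, window_ms=1000)
first = sum(limiter.allow(990) for _ in range(100))    # fills window 0
second = sum(limiter.allow(1010) for _ in range(100))  # 10ms into window 1
print(first, second)  # 100 1
```

Ten milliseconds into the new window, 99% of the previous window still counts, so only one more request fits instead of another full burst of 100.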

A.5 Leaky Bucket — Why It's Rarely Used for Abuse Protection

Leaky bucket queues requests and processes at a fixed rate. The problem: queuing abusive traffic is worse than rejecting it. Under attack, the queue fills with attacker requests, adding memory pressure and increasing latency for legitimate users waiting in queue. For abuse protection: reject fast (token bucket 429), don't queue.

Appendix B: Client Identification Patterns

B.1 Identity Waterfall

Identity resolution follows priority order — use the strongest available:

| Identity | Strength | Notes |
| --- | --- | --- |
| User ID (JWT sub, session) | Strong | Authenticated; stable; 1:1 with account |
| API Key | Medium | Can be leaked; should be rotatable |
| Device fingerprint | Weak | Easy to evade; false positives on shared devices |
| Source IP | Weak | NAT/corporate proxy = many users → one IP |
| IP + User-Agent hash | Slightly stronger | Minor improvement; still easy to evade |

Principle: IP as coarse outer limiter only. Shift to stronger identity (API key / user_id) for authenticated paths.

B.2 Rate Limit Key Construction

rate_limit:{identity}:{scope}:{window}

Examples:
  rate_limit:ip:203.0.113.42:/login:60s        # IP + endpoint + window
  rate_limit:apikey:abc123:/search:10s          # API key + endpoint + window
  rate_limit:user:u_789:/api/v1:1m              # User ID + API prefix + window
  rate_limit:tenant:t_42:/api/v1:1m             # Tenant ID for multi-tenant fairness

Key design considerations:

  • Cardinality → memory pressure: user_id:endpoint has much higher cardinality than user_id alone
  • Hot key risk: user_id:{hot_user} can dominate a Redis shard
  • TTL strategy: refill_window + 10s buffer — allows slightly stale entries to expire naturally
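
The waterfall from B.1 and the key template above can be combined in a small helper. This is a sketch only: the function names are hypothetical, device fingerprints are left out for brevity, and the TTL follows the "window + 10s buffer" rule from the list above.

```python
from typing import Optional, Tuple

TTL_BUFFER_S = 10  # the "+10s buffer" from the TTL strategy bullet


def resolve_identity(user_id: Optional[str] = None, api_key: Optional[str] = None,
                     source_ip: Optional[str] = None) -> Tuple[str, str]:
    """Walk the identity waterfall and return (kind, value) for the strongest identity."""
    for kind, value in (("user", user_id), ("apikey", api_key), ("ip", source_ip)):
        if value:
            return kind, value
    raise ValueError("request carries no usable identity")


def build_key(kind: str, value: str, scope: str, window_s: int) -> Tuple[str, int]:
    """Return the Redis key and the TTL (in seconds) to set on it."""
    key = f"rate_limit:{kind}:{value}:{scope}:{window_s}s"
    return key, window_s + TTL_BUFFER_S


kind, value = resolve_identity(source_ip="203.0.113.42")
print(build_key(kind, value, "/login", 60))
# ('rate_limit:ip:203.0.113.42:/login:60s', 70)
```

Keeping key construction in one place also makes the cardinality question visible: every extra component in the f-string multiplies the number of live keys.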

B.3 Endpoint Sensitivity Classes

Different endpoints warrant different limits and policies:

| Endpoint | Abuse Risk | Recommended Strategy |
| --- | --- | --- |
| /login, /signup | High: credential stuffing | Strict: per-IP + global + login-failure-rate |
| /search, /catalog | Medium: scraping | Moderate: per-user + per-IP fallback |
| /api/v1/* (authenticated) | Low-Medium: business use | Generous: per-user, per-tenant |
| /health, /metrics | Minimal | Exempt or extremely high limit |
| /webhook/* | Partner traffic | Per-partner limits, separate tier |
| /admin/* | Security-critical | Very strict + 2FA required |

Appendix C: Storage & Coordination Patterns

C.1 Centralized Redis — Lua Atomicity

The check-refill-decrement operation must be atomic to prevent race conditions:

Lua
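-- KEYS[1] = bucket key (assumed layout: a hash with fields "tokens" and "last_ms")
-- ARGV: capacity, refill_rate_per_sec, now_ms, cost_tokens
-- Returns: {allowed (0/1), remaining_tokens, retry_after_ms}

local capacity = tonumber(ARGV[1])
local refill_rate_per_sec = tonumber(ARGV[2])
local now_ms = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])

-- Load bucket state; a missing key is treated as a full bucket
local state = redis.call("HMGET", KEYS[1], "tokens", "last_ms")
local tokens = tonumber(state[1]) or capacity
local last_ms = tonumber(state[2]) or now_ms

-- Refill for the elapsed time, capped at capacity
local elapsed_ms = math.max(0, now_ms - last_ms)
tokens = math.min(capacity, tokens + elapsed_ms * refill_rate_per_sec / 1000)

local allowed, retry_after_ms = 0, 0
if tokens >= cost then
    tokens = tokens - cost
    allowed = 1
else
    retry_after_ms = math.ceil((cost - tokens) / refill_rate_per_sec * 1000)
end
-- Persist state; expire idle buckets after a full refill plus a 10s buffer
redis.call("HSET", KEYS[1], "tokens", tokens, "last_ms", now_ms)
redis.call("PEXPIRE", KEYS[1], math.ceil(capacity / refill_rate_per_sec * 1000) + 10000)
return {allowed, math.floor(tokens), retry_after_ms}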

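Staff note on time: Gateway clocks drift, so a caller-supplied now_ms can refill the same bucket inconsistently. Reading the server clock with redis.call("TIME") inside the script removes that skew, at ~0.1ms overhead.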

C.2 Token Leasing — Reduce Redis QPS by 10–100×


Failure mode: Gateway crashes mid-lease → stranded tokens for ttl duration (5 seconds). Acceptable for abuse protection; not for billing.

Lease sizing: Too small → frequent renewals → higher Redis QPS. Too large → fairness issues (one gateway holds a disproportionate share) + more stranded on crash.
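
The QPS reduction from leasing can be illustrated with a toy simulation. Everything here is an assumption made for the sketch: `GlobalBudget` is an in-memory stand-in for the Redis-side budget, lease TTLs and refill are not modeled, and the lease size of 100 is arbitrary.

```python
class GlobalBudget:
    """In-memory stand-in for the Redis-side budget (a real system decrements atomically in Redis)."""

    def __init__(self, tokens: int):
        self.tokens = tokens
        self.calls = 0  # number of simulated Redis round trips

    def lease(self, want: int) -> int:
        self.calls += 1
        granted = min(want, self.tokens)
        self.tokens -= granted
        return granted


class LeasingGateway:
    def __init__(self, budget: GlobalBudget, lease_size: int):
        self.budget = budget
        self.lease_size = lease_size
        self.local_tokens = 0

    def allow(self) -> bool:
        if self.local_tokens == 0:
            # Only an empty local lease triggers a round trip to the global budget.
            self.local_tokens = self.budget.lease(self.lease_size)
        if self.local_tokens == 0:
            return False
        self.local_tokens -= 1
        return True


budget = GlobalBudget(tokens=1_000)
gateways = [LeasingGateway(budget, lease_size=100) for _ in range(2)]
allowed = sum(gateways[i % 2].allow() for i in range(1_000))
print(allowed, budget.calls)  # 1000 10
```

One thousand decisions cost ten round trips in this run. The same structure shows the crash failure mode: whatever sits in `local_tokens` when a gateway dies is simply lost until the lease expires. A production version would also cache denials once the budget is exhausted, so that rejected requests do not each trigger a renewal attempt.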

C.3 Key-Owner Routing — Single-Writer Per Identity

Route requests so one gateway instance owns enforcement for each identity. Eliminates multi-writer races.

hash(api_key) % gateway_count → routes to consistent owner
Owner handles all enforcement locally, no coordination needed

Failure mode: Owner crashes → requests reroute to new owner, who must fetch current counter from Redis (one-time fallback). Load skew if one identity dominates (all traffic for hot key hits one gateway).
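
One caveat with plain modulo routing is that changing gateway_count remaps most identities, not just those of the failed owner. A common alternative is rendezvous (highest-random-weight) hashing, sketched below; note that this is a swapped-in technique, not something the routing line above prescribes, and the gateway names are hypothetical.

```python
import hashlib
from typing import Sequence


def _score(key: str, node: str) -> int:
    digest = hashlib.sha256(f"{node}|{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def owner(key: str, nodes: Sequence[str]) -> str:
    """Each key is owned by the node with the highest hash score for that key."""
    return max(nodes, key=lambda node: _score(key, node))


nodes = ["gw-a", "gw-b", "gw-c", "gw-d"]
keys = [f"apikey:{i}" for i in range(1_000)]

before = {k: owner(k, nodes) for k in keys}
survivors = [n for n in nodes if n != "gw-d"]  # simulate gw-d crashing
after = {k: owner(k, survivors) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
# Only identities that were owned by the crashed gateway change owner.
assert all(before[k] == "gw-d" for k in moved)
print(f"{len(moved)} of {len(keys)} keys changed owner")
```

Only the crashed owner's identities need the one-time Redis fallback described in the failure mode; every other owner keeps its local state.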

C.4 Bounded Reconciliation — Explicit Drift Budget

Accept drift but make it explicit and bounded. Force global check when local tokens drop below a low-watermark.

if local_tokens < low_watermark:
    force_global_check(key)  # One-time Redis call, not per-request

# Periodic: every 500ms
send_usage_deltas(key, local_count_delta)
# Redis then reconciles the global budget across all gateways

Drift bound: G × local_slack where G = gateway count and local_slack = tokens between forced global checks.
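
As a worked example of the drift bound, the numbers below are purely hypothetical (20 gateways, 50 tokens of local slack, a 10,000-request global limit); the point is only to show how G × local_slack turns into an explicit ε.

```python
def worst_case_over_admission(gateway_count: int, local_slack: int) -> int:
    """Drift bound: every gateway spends its full local slack before the next global check."""
    return gateway_count * local_slack


GLOBAL_LIMIT = 10_000  # hypothetical global budget per window
over = worst_case_over_admission(gateway_count=20, local_slack=50)
print(f"{over} extra requests, {over / GLOBAL_LIMIT:.0%} of the limit")
# 1000 extra requests, 10% of the limit
```

Whether a 10% overshoot is acceptable depends on intent: plausible for abuse protection, not for billing.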

C.5 Quick Comparison

| Mechanism | Redis QPS | Correctness | Hot Key | Best For |
| --- | --- | --- | --- | --- |
| Per-request | 1:1 with traffic | Highest | Saturation risk | Billing-critical, low QPS |
| Token leasing | ~1:100 reduction | Medium-High | Renewal only | Abuse protection at scale |
| Key-owner routing | ~1:N (per owner) | High | Isolated to owner | Authenticated traffic |
| Bounded reconciliation | Periodic + watermark | Medium (ε) | Low | Abuse where drift OK |

Appendix D: Response Semantics

D.1 On Rejection

http
HTTP/1.1 429 Too Many Requests
Retry-After: 2
Content-Type: application/json

{"error": "rate_limit_exceeded", "retry_after_seconds": 2}

D.2 On Allow (Optional Headers)

http
HTTP/1.1 200 OK
X-RateLimit-Remaining-Tokens: 42
X-RateLimit-Refill-Rate: 100
X-RateLimit-Burst-Capacity: 200

D.3 Token Bucket Reset Time

Token bucket has no single reset time — it refills continuously. If you must provide a reset-like value:

retry_after_seconds = ceil((cost - tokens_remaining) / refill_rate)

D.4 Retry Behavior — The Thundering Herd Risk

Rate limiting tightly couples to client retry behavior. Naive clients turn a 429 into a retry storm:

429 received → retry immediately → another 429 → retry immediately...
10,000 clients each retry 3 times → 30,000 requests in the next 100ms

For trusted clients: Publish guidance and SDKs enforcing exponential backoff + jitter. Retry-After header must be respected.

For untrusted callers: Assume they ignore guidance. Use local deny-cache (reject locally without Redis for known-bad identities), progressive backoff (increase delay on repeated 429s for same identity), temporary blocks.
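
For the trusted-client side, an SDK retry helper might compute its delay as sketched below. The policy shown (wait at least Retry-After, then add full jitter drawn from an exponentially growing range) is one reasonable reading of "exponential backoff + jitter"; the function name, base delay, and cap are assumptions.

```python
import random
from typing import Optional


def retry_delay_s(attempt: int, retry_after_s: Optional[float] = None,
                  base_s: float = 0.5, cap_s: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based) after a 429."""
    rng = rng or random.Random()
    backoff_s = min(cap_s, base_s * (2 ** attempt))  # exponential growth, capped
    floor_s = retry_after_s or 0.0                   # never retry before Retry-After
    return floor_s + rng.uniform(0.0, backoff_s)     # jitter de-synchronizes clients


rng = random.Random(7)  # seeded only to make the demo repeatable
for attempt in range(4):
    print(attempt, round(retry_delay_s(attempt, retry_after_s=2, rng=rng), 2))
```

Adding the jitter on top of Retry-After matters: if 10,000 clients all wait exactly 2 seconds, the retry storm is merely postponed, not spread out.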

Appendix E: Metrics & Observability

E.1 Core Metrics — Non-Negotiable

# Three categories every request falls into:
rate_limit_allowed_total{identity, endpoint, tier}
rate_limit_rejected_total{identity, endpoint, tier, reason}
rate_limit_bypassed_total{identity, endpoint, reason}  ← SILENT FAIL-OPEN DETECTOR

# Infrastructure health:
rate_limit_check_latency_ms{path}  # local vs Redis path
redis_latency_p99{operation}       # lease renewal, counter check
redis_timeout_total{operation}     # circuit break trigger

# Policy health:
rate_limit_enforcement_rate_pct    # % of requests checked against Redis (vs local only)

E.2 Critical Alerts

| Alert | Condition | Severity | Why |
| --- | --- | --- | --- |
| Silent fail-open | bypass_rate > 1% for 5 min | Page | Running unprotected |
| Enforcement rate | enforcement_rate < 95% | Page | Systemic coordination failure |
| Redis latency | redis_p99 > 5ms | Warn | Approaching circuit break |
| Redis latency | redis_p99 > 10ms | Page | Circuit break imminent |
| Rejection spike | rejected_total increase >10× in 5 min | Warn | Attack or misconfiguration |
| 429 drops to zero during traffic spike | rejected_total → 0 during high traffic | Page | Rate limiter may have failed |
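
The "sustained for 5 minutes" condition on the silent fail-open alert is the part that is easy to get wrong, so here is a minimal evaluator for it. In practice this logic would live in the alerting system's rule language; the class below is only an illustrative sketch with assumed names and one-minute samples.

```python
class BypassRateAlert:
    """Fires once the bypass rate has stayed above the threshold for sustain_s seconds."""

    def __init__(self, threshold: float = 0.01, sustain_s: int = 300):
        self.threshold = threshold
        self.sustain_s = sustain_s
        self.breach_started_s = None  # timestamp of the first sample in the current breach

    def observe(self, now_s: int, allowed: int, rejected: int, bypassed: int) -> bool:
        total = allowed + rejected + bypassed
        rate = bypassed / total if total else 0.0
        if rate > self.threshold:
            if self.breach_started_s is None:
                self.breach_started_s = now_s
        else:
            self.breach_started_s = None  # any healthy sample resets the clock
        return (self.breach_started_s is not None
                and now_s - self.breach_started_s >= self.sustain_s)


alert = BypassRateAlert()
# One sample per minute, with 2% of requests bypassing the limiter.
fired = [alert.observe(t, allowed=970, rejected=10, bypassed=20) for t in range(0, 360, 60)]
print(fired)  # [False, False, False, False, False, True]
```

The denominator includes all three counters from E.1, which is why allowed, rejected, and bypassed need to be emitted as separate metrics.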

E.3 Control Plane vs Data Plane

Control plane (policy management — cold path):

  • Policy store: versioned configs with schema validation
  • Rollout: shadow → warn → canary → enforce
  • Kill switch: fast rollback at endpoint/tenant/global scope
  • Audit trail: who approved, what traffic affected, when deployed
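
The shadow → warn → canary → enforce rollout can be expressed as a small decision function that the data plane evaluates once a request is found to be over its limit. The exact semantics of each stage are not spelled out above, so the behavior below (warn admits with a warning header, canary enforces for a stable percentage of identities) is an assumption, as are all names.

```python
import zlib
from enum import Enum
from typing import NamedTuple


class Stage(Enum):
    SHADOW = "shadow"
    WARN = "warn"
    CANARY = "canary"
    ENFORCE = "enforce"


class Decision(NamedTuple):
    reject: bool            # return 429 to the caller
    log_would_reject: bool  # record the would-be rejection for analysis
    warn_client: bool       # admit the request but attach a warning header


def in_canary(identity: str, canary_pct: int) -> bool:
    # Stable bucketing: the same identity always lands in the same cohort.
    return zlib.crc32(identity.encode()) % 100 < canary_pct


def decide(stage: Stage, over_limit: bool, identity: str, canary_pct: int = 10) -> Decision:
    if not over_limit:
        return Decision(False, False, False)
    if stage is Stage.SHADOW:
        return Decision(False, True, False)
    if stage is Stage.WARN:
        return Decision(False, True, True)
    if stage is Stage.CANARY:
        enforce = in_canary(identity, canary_pct)
        return Decision(enforce, True, not enforce)
    return Decision(True, True, False)


print(decide(Stage.SHADOW, True, "tenant:t_42"))
# Decision(reject=False, log_would_reject=True, warn_client=False)
```

Because the stage is just a field on the policy, the kill switch reduces to flipping a policy back to shadow rather than redeploying gateways.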

Data plane (per-request enforcement — hot path):

  • Gateway/sidecar reads policy from local cache (never calls policy store per-request)
  • Makes allow/deny decision locally
  • Emits metrics asynchronously

E.4 Debugging Silent Fail-Open

When bypass_rate fires:

  1. Check Redis health: redis-cli PING, cluster status, latency histogram
  2. Check circuit breaker state: is the gateway in local-only mode?
  3. Check bypass_rate trend: has it been elevated for hours (silent) or minutes (recent)?
  4. Check for abuse during the bypass window: login failures, API error rates, traffic anomalies
  5. Confirm local fallback caps are set: are per-instance limits protecting the backend?
Appendix F: Scaling Considerations

F.1 What Works at Each Scale

| Scale | What Works | What Breaks | Recommended Change |
| --- | --- | --- | --- |
| 1K req/s | Redis per-request | Nothing yet | Keep it simple |
| 10K req/s | Redis per-request | Latency starting to show | Consider local-first |
| 100K req/s | Local-first + leasing | Redis cluster memory | Token leasing mandatory |
| 1M req/s | Local-first, async | Global accuracy | Per-region enforcement |
| 10M+ req/s | Per-region, no global sync | Global fairness is theoretical | Accept regional limits |

F.2 Multi-Region Evolution Path

  1. Phase 1: Single region. Redis cluster handles all traffic. Simple, accurate.
  2. Phase 2: Multi-region, low QPS. Cross-region replication with 50–100ms latency — acceptable for billing paths.
  3. Phase 3: Multi-region, high QPS. Per-region enforcement with async reconciliation for abuse. Billing uses region-scoped quotas.
  4. Phase 4: True global scale. Regions are effectively independent. "Global" limits are per-region approximations.

F.3 What You Don't Build on Day One

  • Multi-region replication — start single-region, add regions when you have users there
  • Adaptive rate limiting (auto-tightens under load) — complex to tune, often causes oscillation
  • Per-endpoint granularity beyond 3–4 tiers — over-engineering for most products
  • Real-time abuse analytics dashboard — build this in month 3, not month 1

Start simple. Add complexity only when production data shows you need it.

Appendix G: Multi-Tenant Fairness Deep Dive

G.1 The Noisy Neighbor Failure Mode

Fairness failures show up as:

  • Uneven SLO burn: Small tenants see p99 spike during large tenant burst
  • Support ambiguity: "Your platform is unreliable" tickets with no global incident
  • Hidden starvation: Small tenant throttled while large tenant consumes most shared capacity

G.2 Hierarchical Token Buckets

Enforce at multiple layers simultaneously:

Global capacity bucket (protects platform)
  ├── Enterprise Tier bucket (protects other tenants)
  │    ├── Tenant A bucket (contracted 1,000 req/s, burst 1,500)
  │    │    └── Sub-buckets per user/endpoint within Tenant A
  │    └── Tenant B bucket (contracted 500 req/s, burst 750)
  ├── Pro Tier bucket
  └── Free Tier bucket

Why this is Staff-grade: Answers "what if one user inside a tenant is the noisy neighbor?" The per-user sub-bucket within a tenant limits intra-tenant noisy neighbors, not just inter-tenant.

G.3 Reserved Floor + Shared Burst Pool

  • reserved_rate[tenant] is protected even under global contention — tenants always get their contracted rate
  • burst_pool absorbs temporary spikes if global slack exists
  • Under contention, burst pool allocation falls back to reserved floor

Borrow rules:

  • Who can borrow from burst pool? (Paid tiers only?)
  • How much can they borrow? (Cap at 2× reserved?)
  • Under contention? (First-come-first-served or proportional?)

Tradeoff: Reservations improve isolation but reduce utilization if tenants are idle. An enterprise customer paying for 1,000 req/s but only using 100 req/s holds 900 req/s of reserved capacity that other tenants can't access.
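
One possible answer to the borrow rules is sketched below: every tenant gets its reserved floor, borrowing is capped at 2× reserved, and under contention the burst pool is split in proportion to each tenant's capped unmet demand. These choices, the function name, and the sample numbers are assumptions for illustration, not the only valid policy.

```python
from typing import Dict


def allocate(demand: Dict[str, float], reserved: Dict[str, float],
             burst_pool: float, borrow_cap: float = 2.0) -> Dict[str, float]:
    """Grant each tenant its reserved floor, then share the burst pool."""
    floor = {t: min(demand[t], reserved[t]) for t in reserved}
    # Unmet demand eligible for borrowing, capped at borrow_cap x reserved.
    want = {t: max(0.0, min(demand[t], borrow_cap * reserved[t]) - floor[t])
            for t in reserved}
    total_want = sum(want.values())
    # Under contention, scale every request down proportionally.
    scale = 1.0 if total_want <= burst_pool else burst_pool / total_want
    return {t: floor[t] + want[t] * scale for t in reserved}


reserved = {"A": 1_000, "B": 500, "C": 200}
demand = {"A": 3_000, "B": 600, "C": 50}
print(allocate(demand, reserved, burst_pool=550))
# {'A': 1500.0, 'B': 550.0, 'C': 50.0}
```

Tenant C's unused 150 req/s of reserved capacity is deliberately not redistributed here, which is exactly the utilization cost the tradeoff above describes.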

G.4 Observability for Fairness

Critical metrics sliced by tenant:

rate_limit_allowed_total{tenant}
rate_limit_rejected_total{tenant, reason}
request_latency_p99{tenant}  ← fairness failures appear here first

Fairness-specific alerts:

  • Tenant starvation: tenant is heavily throttled while global utilization < 80%
  • Plan-change regression: top tenants spike rejections after a policy rollout
  • Burst pool domination: one tenant continuously consumes the shared burst pool

G.5 Tradeoff Summary

| Mechanism | Isolation | Utilization | Complexity | Debuggability |
| --- | --- | --- | --- | --- |
| Flat quota per tier | Low | High | Low | High |
| Weighted quotas | Medium | High | Low-Medium | Medium |
| Reserved floor + pool | High | Medium-High | Medium | Medium |
| Hierarchical buckets | Highest | Medium | High | Medium-Low |

Staff recommendation: Start with weighted quotas. Move to reserved floor + pool when a large tenant (>10% of global capacity) causes fairness incidents. Hierarchical buckets only when intra-tenant fairness (per-user-within-tenant) becomes a support issue.
