Distributed State Coordination
What This Framework Covers
When multiple nodes need to coordinate on shared state, you face a fundamental tradeoff:
- Centralized: Simple, accurate, but creates a dependency
- Distributed: Resilient, fast, but requires explicit coordination
This framework covers the three production-grade coordination mechanisms and when to use each.
The Core Problem
```
Node A: READ counter=50
Node B: READ counter=50
Node A: WRITE counter=49
Node B: WRITE counter=49
Result: Two operations consumed one unit
```
Every multi-writer system must answer: How do we prevent double-spend, ensure fairness, and handle hot keys?
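The trace above can be reproduced in a few lines. This is a minimal sketch: a plain dict stands in for the shared store, and an atomic decrement (what Redis `DECRBY` provides) is the standard fix.

```python
counter = {"value": 50}  # stands in for the shared store

def read():
    return counter["value"]

def write(v):
    counter["value"] = v

# The racy interleaving: both nodes read before either writes
a = read()      # Node A reads 50
b = read()      # Node B reads 50
write(a - 1)    # Node A writes 49
write(b - 1)    # Node B writes 49 -- A's decrement is lost
assert counter["value"] == 49  # two operations consumed one unit

# An atomic read-modify-write closes the gap
def decrby(cost):
    counter["value"] -= cost  # one indivisible step on the store
    return counter["value"]

counter["value"] = 50
decrby(1)
decrby(1)
assert counter["value"] == 48  # both decrements applied
```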
The Three Mechanisms
| Mechanism | Latency | Correctness | Complexity | Best For |
|---|---|---|---|---|
| Token Leasing | Low (mostly local) | Medium-High | Medium-High | High QPS, abuse protection |
| Key-Owner Routing | Medium | High | High | Authenticated traffic, clean correctness |
| Bounded Approximation | Low | Medium (explicit ε) | Medium | Abuse protection where drift is OK |
Rule of thumb:
- Abuse protection: Token leasing OR bounded approximation
- Authenticated internal traffic: Key-owner routing
- Billing/quota: Key-owner routing OR strict centralized + audit
Mechanism 1: Token Leasing / Reservations
The Idea
Reduce shared-state operations by having each node lease a chunk of capacity from the central store, then spend locally.
Key Design Choices
| Choice | Too Small | Too Large | Staff Move |
|---|---|---|---|
| Lease size (L) | Too many renewals (back to per-request bottleneck) | Fairness issues + stranded tokens on crash | Start with L = expected_traffic_per_node * lease_duration |
| Lease TTL | Frequent renewals | Long recovery on node crash | TTL = 2-5x expected renewal interval |
| Reclaim on crash | N/A | Tokens stranded until TTL expires | Track lease ownership + background reclaim |
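The lease-sizing rule of thumb, worked through with assumed traffic figures:

```python
# Hypothetical node: 200 req/s, renewing once per 10 s lease TTL
expected_traffic_per_node = 200  # requests/second (assumption)
lease_duration = 10              # seconds, matching the 10 s TTL above

L = expected_traffic_per_node * lease_duration       # tokens per lease
renewals_per_second = expected_traffic_per_node / L  # store load per node

assert L == 2000
assert renewals_per_second == 0.1  # ~1 renewal every 10 s, not 200 ops/s
```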
Failure Modes
| Failure | Impact | Mitigation |
|---|---|---|
| Store slow/down | Can't renew leases | Fail-open with local caps (abuse) or fail-closed (billing) |
| Node crash mid-lease | Tokens stranded | TTL-based reclaim OR ownership tracking |
| Uneven traffic | Some nodes exhaust leases, others have surplus | Adaptive lease sizing or rebalancing |
When to Use
- Gateway/ingress rate limiting at scale
- High QPS where centralized store in critical path is unacceptable
- When bounded approximation is acceptable (or you implement audit)
Mechanism 2: Key-Owner Routing (Consistent Hashing)
The Idea
Route operations so a given key is handled by a single "owner" node. Single-writer semantics = no distributed coordination on every operation.
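The owner lookup itself is small. A minimal consistent-hash ring sketch (virtual nodes smooth the distribution; node names and the vnode count are illustrative):

```python
import bisect
import hashlib

def h(s: str) -> int:
    # Stable hash -> position on the ring
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=64):
        # Each node gets `vnodes` points on the ring for a smoother spread
        self.points = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.hashes = [p for p, _ in self.points]

    def owner(self, key: str) -> str:
        # The owner is the first point clockwise from the key's hash
        i = bisect.bisect(self.hashes, h(key)) % len(self.hashes)
        return self.points[i][1]

ring = Ring(["shard-0", "shard-1", "shard-2"])
# The same key always routes to the same owner node
assert ring.owner("acme-123") == ring.owner("acme-123")
```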
Implementation Shapes
- Load balancer consistent hash — Route at ingress (Envoy ring hash, NGINX)
- Dedicated coordination service — Sharded service called by nodes
- Peer-to-peer routing — Nodes forward to owner
Key Design Choices
| Choice | Consideration |
|---|---|
| Hash function | Must be stable; changes cause reshuffling |
| Replication | Single owner (simple) vs replicated owners (HA but complexity) |
| Failover | On owner failure, who takes over? Double-spend during transition? |
| Hot keys | Single owner means hot key = hot node |
Failure Modes
| Failure | Impact | Mitigation |
|---|---|---|
| Owner node failure | Keys reassigned; potential double-spend during transition | Graceful handoff protocol OR accept brief inconsistency |
| Hot key | One node overwhelmed | Dedicated capacity for hot keys OR split hot keys artificially |
| Network partition | Split-brain risk | Quorum-based ownership OR accept partition behavior |
When to Use
- Authenticated traffic with stable routing keys (API key, user ID)
- When you need clean correctness story
- When you already operate service mesh with consistent hashing
Mechanism 3: Bounded Approximation (Batch + Reconciliation)
The Idea
Accept that distributed enforcement will have drift, but make the drift explicit and bounded.
The Problem with "Sync Every 500ms"
If every node can admit traffic locally and only reconcile later:
- G = number of nodes
- S = local slack per node
- Worst-case over-admission = G × S
Without bounding knobs, "hybrid" is just "we'll be wrong and hope it's fine."
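Plugging in hypothetical numbers makes the bound concrete:

```python
# Assumed fleet: 20 gateway nodes, each allowed 50 requests of local slack
# before it must reconcile with the central store
G = 20
S = 50
worst_case_over_admission = G * S
assert worst_case_over_admission == 1_000  # requests beyond the global limit
```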
Bounding Knobs
| Knob | How It Works |
|---|---|
| Low-watermark global check | When local capacity drops below threshold, force global check before admitting more |
| Explicit drift budget (ε) | Define allowed drift (e.g., "1% over-admission over 2s window"), cap local slack accordingly |
| Identity-aware slack | Unknown identities (IP-only) get small slack; authenticated get larger |
Key Design Choices
| Choice | Consideration |
|---|---|
| Sync interval | Shorter = more accurate, higher store load |
| Low watermark | Higher = more global checks, better accuracy |
| Drift budget (ε) | Must be explicit; "small drift" is not a bound |
Failure Modes
| Failure | Impact | Mitigation |
|---|---|---|
| Store slow/down | Nodes can't reconcile | Fail-open with local caps + alerting |
| Sync lag | Over-admission during lag | Tighter low watermark |
| Uneven traffic | Some nodes have stale view | More frequent sync for high-traffic nodes |
When to Use
- Abuse protection where bounded drift is acceptable
- When store latency is variable and you can't afford it in critical path
- When you want simpler implementation than leasing
Decision Framework
Quick Comparison
| Dimension | Token Leasing | Key-Owner Routing | Bounded Approximation |
|---|---|---|---|
| Store QPS | Low (renewals only) | Low (owner handles locally) | Medium (periodic sync) |
| Latency added | ~0 (local), occasional renewal | +hop to owner | ~0 (local) |
| Correctness | High with proper lease sizing | Highest (single-writer) | Medium (explicit ε) |
| Hot key handling | Renewal becomes hot, not every op | Isolated to owner shard | Still problematic |
| Implementation | Medium-High | High | Medium |
| Audit story | Needs lease logging | Clean (single source of truth) | Needs reconciliation logs |
L6 vs L7 Calibration
| Dimension | L6 (Staff) | L7 (Principal) |
|---|---|---|
| Mechanism choice | Picks one and defends it | Evaluates managed alternatives first |
| Failure modes | Names 2-3 for chosen mechanism | Designs the monitoring + runbook |
| Tradeoff articulation | "Latency vs accuracy" | "TCO vs correctness vs operational burden" |
| Hot key story | Has a mitigation | Has a capacity planning + incident response story |
Interview Probes
| After You Say... | They Will Ask... |
|---|---|
| "Hybrid local + sync" | "How does coordination actually work?" |
| "Token leasing" | "What's the lease size? What if a node crashes?" |
| "Consistent hashing" | "What about hot keys? Failover?" |
| "Bounded approximation" | "What's the worst-case over-admission?" |
| "We'll sync every 500ms" | "That's not a bound. What's the drift budget?" |
Implementation Deep Dive
1. Token Leasing — Redis Implementation
Token leasing reduces central store pressure by pre-allocating capacity to each node. Here is the shape of an implementation, in Python-style pseudocode.
Lease Manager
```python
class LeaseManager:
    def __init__(self):
        self.leases = {}        # key -> {"remaining": int, "expiry": timestamp}
        self.leaseSize = 100    # default tokens per lease
        self.leaseTTL = 10_000  # 10 seconds, in ms

    def tryConsume(self, key, cost=1):
        lease = self.leases.get(key)
        # Check the local lease first
        if lease and lease["remaining"] >= cost and now() < lease["expiry"]:
            lease["remaining"] -= cost
            metrics.increment("lease.local_hit", tags=["key:" + key])
            return {"allowed": True, "source": "local"}
        # Lease exhausted or expired -- request a new lease from the store
        return self.renewLease(key, cost)

    def renewLease(self, key, immediateCost):
        try:
            # Atomic: check remaining budget, grant lease, decrement
            result = redis.eval("""
                local budget = tonumber(redis.call('GET', KEYS[1]) or '0')
                local leaseSize = tonumber(ARGV[1])
                local grant = math.min(budget, leaseSize)
                if grant <= 0 then return 0 end
                redis.call('DECRBY', KEYS[1], grant)
                return grant
            """, keys=["budget:" + key], args=[self.leaseSize])
            if result <= 0:
                return {"allowed": False, "reason": "budget_exhausted"}
            self.leases[key] = {
                "remaining": result - immediateCost,
                "expiry": now() + self.leaseTTL,
            }
            metrics.increment("lease.renewed", tags=["key:" + key])
            return {"allowed": True, "source": "renewed"}
        except RedisError:
            # Store unreachable -- apply the degraded-mode policy
            metrics.increment("lease.store_failure", tags=["key:" + key])
            return self.fallbackPolicy(key, immediateCost)

    def fallbackPolicy(self, key, cost):
        # Fail open for abuse protection, fail closed for billing
        if isBillingKey(key):
            return {"allowed": False, "reason": "store_unavailable"}
        # Allow with a conservative local cap
        return {"allowed": True, "source": "fallback_local_cap"}
```
Adaptive Lease Sizing
```python
# Adjust lease size based on the observed consumption rate
def adaptLeaseSize(key):
    consumptionRate = metrics.rate("lease.local_hit", window=60, tags=["key:" + key])
    renewalRate = metrics.rate("lease.renewed", window=60, tags=["key:" + key])
    # Target: one renewal per leaseTTL (every 10 seconds)
    idealLeaseSize = consumptionRate * (leaseTTL / 1000)
    # Clamp to reasonable bounds
    leaseSize = clamp(idealLeaseSize, min=10, max=10_000)
    metrics.gauge("lease.adaptive_size", leaseSize, tags=["key:" + key])
    return leaseSize
```
2. Key-Owner Routing — Envoy Consistent Hashing
Key-owner routing eliminates distributed coordination by ensuring only one node handles operations for a given key.
Envoy Ring-Hash Configuration
```yaml
# Envoy load balancer config -- route by API key
clusters:
  - name: rate-limit-service
    lb_policy: RING_HASH
    ring_hash_lb_config:
      minimum_ring_size: 1024      # higher = more even distribution
      maximum_ring_size: 8388608
    load_assignment:
      cluster_name: rate-limit-service
      endpoints:
        - lb_endpoints:
            - endpoint: { address: { socket_address: { address: shard-0, port_value: 8080 }}}
            - endpoint: { address: { socket_address: { address: shard-1, port_value: 8080 }}}
            - endpoint: { address: { socket_address: { address: shard-2, port_value: 8080 }}}

routes:
  - match: { prefix: "/api" }
    route:
      cluster: rate-limit-service
      hash_policy:
        - header:
            header_name: "X-API-Key"   # route by the API key
```
Hot Key Detection and Mitigation
```python
def handleRequest(key, request):
    metrics.increment("key_owner.request", tags=["key:" + key])
    # Detect a hot key: >10x the average traffic for this shard
    if isHotKey(key):
        metrics.increment("key_owner.hot_key", tags=["key:" + key])
        # Option 1: route to a dedicated hot-key handler
        return hotKeyHandler.process(key, request)
        # Option 2: split the key artificially
        # subKey = key + ":shard:" + str(hash(request.id) % 4)
        # return process(subKey, request)
    return process(key, request)

def isHotKey(key):
    keyRate = metrics.rate("key_owner.request", window=10, tags=["key:" + key])
    avgRate = metrics.rate("key_owner.request", window=10) / shardCount
    return keyRate > avgRate * 10
```
3. Bounded Approximation — Batch Sync Implementation
```python
class BoundedApproximation:
    def __init__(self):
        self.localCounters = {}    # key -> count consumed since last sync
        self.localBudgets = {}     # key -> remaining budget from last sync
        self.syncInterval = 2_000  # sync every 2 seconds, in ms
        self.lowWatermark = 0.20   # force a global check at 20% remaining

    def tryConsume(self, key, cost=1):
        budget = self.localBudgets.get(key, 0)
        consumed = self.localCounters.get(key, 0)
        remaining = budget - consumed
        if remaining < budget * self.lowWatermark:
            # Below the watermark -- force a global check
            return self.globalCheck(key, cost)
        if remaining >= cost:
            self.localCounters[key] = consumed + cost
            return {"allowed": True, "source": "local"}
        return {"allowed": False, "reason": "local_budget_exhausted"}

    def globalCheck(self, key, cost):
        totalConsumed = self.localCounters.get(key, 0)
        result = redis.eval("""
            local budget = tonumber(redis.call('GET', KEYS[1]) or '0')
            local delta = tonumber(ARGV[1])
            local cost = tonumber(ARGV[2])
            budget = budget - delta
            if budget < cost then
                redis.call('SET', KEYS[1], budget)
                return -1
            end
            budget = budget - cost
            redis.call('SET', KEYS[1], budget)
            return budget
        """, keys=["budget:" + key], args=[totalConsumed, cost])
        self.localCounters[key] = 0  # reset the local delta after syncing
        if result < 0:
            return {"allowed": False, "reason": "global_budget_exhausted"}
        self.localBudgets[key] = result
        return {"allowed": True, "source": "global_check"}

    # Background sync every 2 seconds
    def periodicSync(self):
        for key, consumed in list(self.localCounters.items()):
            if consumed > 0:
                self.globalCheck(key, 0)  # sync deltas without consuming
```
Architecture Diagram
Authenticated traffic (API key present): Envoy routes by consistent hash on the API key. One node owns all requests for that key — no distributed coordination needed for per-key enforcement. Local counters are exact.
Unauthenticated traffic (IP only): Round-robin across nodes. Each node uses token leasing or bounded approximation with the central Redis store. Coordination is approximate but bounded.
Failure Scenarios
1. Central Store Outage — Redis Cluster Down
Timeline: Redis cluster loses quorum during a network partition. All lease renewals fail. Nodes fall back to their local policies. Abuse-protection nodes fail-open with conservative local caps. Billing-enforcement nodes fail-closed, rejecting all requests.
Blast radius: Abuse protection degrades — total over-admission bounded by G × local_cap × outage_duration. For 20 nodes with 100 req/s local cap and a 5-minute outage: worst case 600K extra requests. Billing enforcement blocks all paid-tier customers from making API calls.
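The blast-radius arithmetic, checked directly (all numbers are the scenario's assumptions):

```python
G = 20                   # nodes
local_cap = 100          # requests/second per node under fallback
outage_seconds = 5 * 60  # 5-minute outage
extra_requests = G * local_cap * outage_seconds
assert extra_requests == 600_000  # matches the stated worst case
```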
Detection: Redis health check fails. Lease renewal error rate hits 100%. Fallback policy activation rate spikes.
Recovery:
- Abuse protection: local caps prevent unbounded abuse during the outage. Alert on sustained fallback usage >5 minutes
- Billing enforcement: switch to cached quota snapshots (each node caches the last known quota). Accept bounded staleness for the outage window
- Redis recovery: nodes re-lease from the restored store. Brief over-admission during the transition is acceptable
- Post-incident: evaluate whether a local fallback quota cache (snapshotted every 60 seconds) is acceptable for billing enforcement during short outages
2. Token Lease Stranding After Node Crash
Timeline: Node 3 leases 500 tokens from the central store for key acme-123. Node 3 crashes. The 500 tokens are stranded — they are decremented from the global budget but not available for consumption. Other nodes cannot access them. The customer's effective rate limit is reduced by 500 tokens until the lease TTL expires (10 seconds).
Blast radius: One customer's effective rate limit is temporarily reduced. If the lease is large relative to the total budget (500 out of 1,000 = 50%), the impact is significant. If the lease is small relative to the total budget (500 out of 100,000 = 0.5%), the impact is negligible.
Detection: Node health check fails. Lease ownership tracking shows unreachable node holding active leases.
Recovery:
- TTL-based recovery: the lease expires in 10 seconds. The central store reclaims the tokens automatically. No manual intervention.
- Active reclaim: a background process detects crashed nodes (via heartbeat failure) and reclaims their leases immediately
- Prevention: size leases relative to total budget so that stranding one lease does not materially impact the customer. Rule of thumb: lease size < 5% of total budget
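The sizing comparison above, in numbers (budgets are the scenario's examples):

```python
lease = 500
assert lease / 1_000 == 0.50     # 50% of the budget stranded: significant
assert lease / 100_000 == 0.005  # 0.5% stranded: negligible
# The <5% rule of thumb implies a minimum budget for a 500-token lease
assert lease / 0.05 == 10_000    # budgets under 10K tokens need smaller leases
```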
3. Hot Key Overwhelming Single Owner
Timeline: A viral API customer (acme-123) sends 50K req/s. All requests hash to Node 2 (the owner). Node 2's CPU hits 100%. Other customers routed to Node 2 experience elevated latency. Node 2 starts dropping requests.
Blast radius: The hot customer and all other customers on Node 2 experience degradation. Customers on other nodes are unaffected.
Detection: Per-node CPU utilization alert. Per-key request rate monitoring. Node 2's error rate exceeds threshold while other nodes are healthy.
Recovery:
- Immediate: rate-limit the hot customer at the load balancer level (pre-routing) to cap their impact
- Short-term: route the hot key to a dedicated handler (Node 3 overflow) so other customers on Node 2 are not affected
- Long-term: split hot keys into sub-keys (`acme-123:0`, `acme-123:1`, etc.) and distribute them across multiple owners. Aggregate the sub-key counter values when the customer queries their usage.
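A minimal sketch of the sub-key split, with illustrative names and a fan-out of 4 (a stable hash keeps a given request id on the same sub-key):

```python
import zlib

SPLITS = 4  # fan the hot key out across 4 owners (assumption)

def sub_key(key: str, request_id: str) -> str:
    # crc32 is stable across processes, unlike Python's built-in hash()
    return f"{key}:{zlib.crc32(request_id.encode()) % SPLITS}"

def usage(key: str, store: dict) -> int:
    # Aggregate sub-key counters when the customer queries their usage
    return sum(store.get(f"{key}:{i}", 0) for i in range(SPLITS))

store = {"acme-123:0": 10, "acme-123:1": 5, "acme-123:3": 7}
assert sub_key("acme-123", "req-42") == sub_key("acme-123", "req-42")
assert usage("acme-123", store) == 22
```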
Staff Interview Application
How to Introduce This Pattern
Lead with the identity question because it determines which mechanism is feasible. Then explain the specific mechanism.
When NOT to Use This Pattern
- Single-node system: If you have one server, there is no distributed coordination problem. Use in-process state.
- Coordination via database already exists: If all nodes share a PostgreSQL database and contention is low, a simple `SELECT FOR UPDATE` or atomic increment is sufficient. Don't add Redis leasing when a database counter works.
- Exact consistency is truly required on every operation: If the business literally cannot tolerate any approximation (financial trading, ballot counting), centralized enforcement with a sub-millisecond timeout is the only answer. Accept the latency and scale the central store.
Follow-Up Questions to Anticipate
| Interviewer Asks | What They Are Testing | How to Respond |
|---|---|---|
| "What is the worst-case drift?" | Quantitative reasoning | "With token leasing: G nodes × lease_size. With bounded approximation: G × local_budget × (1 - watermark). I can quantify this for any configuration." |
| "What happens when the central store goes down?" | Failure mode thinking | "Policy-dependent. For abuse protection: fail-open with local caps. For billing: fail-closed with cached quota snapshots. The degradation behavior is designed, not discovered." |
| "How do you handle hot keys?" | Practical operations | "Detect via per-key request rate. Mitigate by routing to a dedicated handler or splitting into sub-keys. The key insight: hot keys are a capacity problem, not a correctness problem." |
| "Why not just call Redis on every request?" | Latency and scaling awareness | "At 100K QPS, that is 100K Redis round-trips/sec. Each round-trip adds 0.5-1ms of latency. Token leasing cuts that to ~1K renewals/sec, roughly 100x less store pressure, with drift bounded by the lease sizes." |
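The worst-case drift formulas quoted in the first row, made concrete with assumed numbers:

```python
G = 20              # nodes (assumption)
lease_size = 500    # tokens per lease (assumption)
local_budget = 200  # tokens held locally between syncs (assumption)
watermark = 0.20    # low-watermark fraction

token_leasing_drift = G * lease_size                       # tokens
bounded_approx_drift = G * local_budget * (1 - watermark)  # tokens

assert token_leasing_drift == 10_000
assert round(bounded_approx_drift) == 3_200
```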