StaffSignal
Cross-Cutting Framework

Distributed Coordination Framework

Local-first vs centralized state, leasing, routing, and reconciliation patterns for distributed systems.

Distributed State Coordination

What This Framework Covers

When multiple nodes need to coordinate on shared state, you face a fundamental tradeoff:

  • Centralized: Simple, accurate, but creates a dependency
  • Distributed: Resilient, fast, but requires explicit coordination

This framework covers the three production-grade coordination mechanisms and when to use each.


The Core Problem

Node A: READ counter=50
Node B: READ counter=50
Node A: WRITE counter=49
Node B: WRITE counter=49
Result: Two operations consumed one unit

Every multi-writer system must answer: How do we prevent double-spend, ensure fairness, and handle hot keys?
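The race above disappears once the read-check-write collapses into a single atomic step. A minimal sketch in Python, with a lock standing in for the store's atomic primitive (a real deployment would use something like Redis DECRBY):

```python
import threading

class AtomicCounter:
    """Single-writer semantics via a lock; a real store would use an
    atomic primitive such as Redis DECRBY instead."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def try_consume(self, cost=1):
        # Read, check, and write happen as one atomic step: no lost updates.
        with self._lock:
            if self._value >= cost:
                self._value -= cost
                return True
            return False

counter = AtomicCounter(50)
admits = []

def worker():
    for _ in range(40):
        admits.append(counter.try_consume())

t1 = threading.Thread(target=worker)
t2 = threading.Thread(target=worker)
t1.start(); t2.start(); t1.join(); t2.join()

# 80 attempts against 50 units: exactly 50 admitted, never 51.
assert sum(admits) == 50
```

The catch is that this atomic step now lives somewhere, and putting it in a central store on every request is exactly the dependency the rest of this framework works around.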


The Three Mechanisms

Mechanism              | Latency                | Correctness         | Complexity   | Best For
Token Leasing          | Low (mostly local)     | Medium-High         | Medium-High  | High QPS, abuse protection
Key-Owner Routing      | Medium                 | High                | High         | Authenticated traffic, clean correctness
Bounded Approximation  | Low                    | Medium (explicit ε) | Medium       | Abuse protection where drift is OK

Rule of thumb:

  • Abuse protection: Token leasing OR bounded approximation
  • Authenticated internal traffic: Key-owner routing
  • Billing/quota: Key-owner routing OR strict centralized + audit

Mechanism 1: Token Leasing / Reservations

The Idea

Reduce shared-state operations by having each node lease a chunk of capacity from the central store, then spend locally.


Key Design Choices

Choice            | Too Small                                            | Too Large                                   | Staff Move
Lease size (L)    | Too many renewals (back to per-request bottleneck)   | Fairness issues + stranded tokens on crash  | Start with L = expected_traffic_per_node * lease_duration
Lease TTL         | Frequent renewals                                    | Long recovery on node crash                 | TTL = 2-5x expected renewal interval
Reclaim on crash  | N/A                                                  | Tokens stranded until TTL expires           | Track lease ownership + background reclaim
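The sizing rule in the first row is just expected traffic times lease duration; a quick worked example (numbers are illustrative):

```python
def lease_size(expected_rps, lease_ttl_s):
    """Rule of thumb: size the lease to cover one TTL of expected traffic,
    so each node renews roughly once per TTL instead of once per request."""
    return expected_rps * lease_ttl_s

# A node serving ~200 req/s with a 10 s lease TTL should lease ~2,000 tokens.
assert lease_size(200, 10) == 2000
```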

Failure Modes

Failure               | Impact                                         | Mitigation
Store slow/down       | Can't renew leases                             | Fail-open with local caps (abuse) or fail-closed (billing)
Node crash mid-lease  | Tokens stranded                                | TTL-based reclaim OR ownership tracking
Uneven traffic        | Some nodes exhaust leases, others have surplus | Adaptive lease sizing or rebalancing

When to Use

  • Gateway/ingress rate limiting at scale
  • High QPS where centralized store in critical path is unacceptable
  • When bounded approximation is acceptable (or you implement audit)

Mechanism 2: Key-Owner Routing (Consistent Hashing)

The Idea

Route operations so a given key is handled by a single "owner" node. Single-writer semantics = no distributed coordination on every operation.
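Single-owner routing can be sketched with a small consistent-hash ring (shard names and the virtual-node count here are illustrative, not from the framework):

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each node contributes many virtual points so keys spread evenly.
        self.ring = sorted(
            (self._hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self.points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def owner(self, key):
        # The first ring point clockwise from the key's hash owns the key.
        i = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["shard-0", "shard-1", "shard-2"])

# A given key always lands on the same owner: single-writer semantics,
# so the owner can count locally with no cross-node coordination.
assert ring.owner("acme-123") == ring.owner("acme-123")
```

Because only the owner mutates state for a key, the per-operation coordination cost drops to zero; the price is paid in failover and hot-key handling, covered below.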


Implementation Shapes

  1. Load balancer consistent hash — Route at ingress (Envoy ring hash, NGINX)
  2. Dedicated coordination service — Sharded service called by nodes
  3. Peer-to-peer routing — Nodes forward to owner

Key Design Choices

Choice         | Consideration
Hash function  | Must be stable; changes cause reshuffling
Replication    | Single owner (simple) vs replicated owners (HA but complexity)
Failover       | On owner failure, who takes over? Double-spend during transition?
Hot keys       | Single owner means hot key = hot node

Failure Modes

Failure             | Impact                                                     | Mitigation
Owner node failure  | Keys reassigned; potential double-spend during transition  | Graceful handoff protocol OR accept brief inconsistency
Hot key             | One node overwhelmed                                       | Dedicated capacity for hot keys OR split hot keys artificially
Network partition   | Split-brain risk                                           | Quorum-based ownership OR accept partition behavior

When to Use

  • Authenticated traffic with stable routing keys (API key, user ID)
  • When you need clean correctness story
  • When you already operate service mesh with consistent hashing

Mechanism 3: Bounded Approximation (Batch + Reconciliation)

The Idea

Accept that distributed enforcement will have drift, but make the drift explicit and bounded.

The Problem with "Sync Every 500ms"

If every node can admit traffic locally and only reconcile later:

  • G = number of nodes
  • S = local slack per node
  • Worst-case over-admission = G × S

Without bounding knobs, "hybrid" is just "we'll be wrong and hope it's fine."
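The bound above is worth computing before picking slack values; a quick illustration with made-up numbers:

```python
def worst_case_over_admission(nodes, local_slack):
    # Every node can admit its full local slack before any sync lands.
    return nodes * local_slack

# 20 gateway nodes, each holding 50 requests of local slack, can jointly
# over-admit up to 1,000 requests between syncs. That product is the real
# bound; the sync interval alone tells you nothing.
assert worst_case_over_admission(20, 50) == 1000
```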

Bounding Knobs

Knob                        | How It Works
Low-watermark global check  | When local capacity drops below threshold, force global check before admitting more
Explicit drift budget (ε)   | Define allowed drift (e.g., "1% over-admission over 2s window"), cap local slack accordingly
Identity-aware slack        | Unknown identities (IP-only) get small slack; authenticated get larger
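The drift-budget knob works backwards from ε to a per-node slack value; a small sketch under illustrative numbers:

```python
def local_slack(global_limit, drift_budget, nodes):
    """Cap per-node slack so worst-case drift (nodes * slack) stays
    inside the explicit budget epsilon."""
    return (global_limit * drift_budget) / nodes

# 100K-unit global limit, 1% drift budget, 20 nodes:
# each node may hold at most 50 units of un-synced local slack.
assert local_slack(100_000, 0.01, 20) == 50
```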

Key Design Choices

Choice            | Consideration
Sync interval     | Shorter = more accurate, higher store load
Low watermark     | Higher = more global checks, better accuracy
Drift budget (ε)  | Must be explicit; "small drift" is not a bound

Failure Modes

Failure          | Impact                     | Mitigation
Store slow/down  | Nodes can't reconcile      | Fail-open with local caps + alerting
Sync lag         | Over-admission during lag  | Tighter low watermark
Uneven traffic   | Some nodes have stale view | More frequent sync for high-traffic nodes

When to Use

  • Abuse protection where bounded drift is acceptable
  • When store latency is variable and you can't afford it in critical path
  • When you want simpler implementation than leasing

Decision Framework


Quick Comparison

Dimension         | Token Leasing                     | Key-Owner Routing             | Bounded Approximation
Store QPS         | Low (renewals only)               | Low (owner handles locally)   | Medium (periodic sync)
Latency added     | ~0 (local), occasional renewal    | +hop to owner                 | ~0 (local)
Correctness       | High with proper lease sizing     | Highest (single-writer)       | Medium (explicit ε)
Hot key handling  | Renewal becomes hot, not every op | Isolated to owner shard       | Still problematic
Implementation    | Medium-High                       | High                          | Medium
Audit story       | Needs lease logging               | Clean (single source of truth)| Needs reconciliation logs

L6 vs L7 Calibration

Dimension              | L6 (Staff)                      | L7 (Principal)
Mechanism choice       | Picks one and defends it        | Evaluates managed alternatives first
Failure modes          | Names 2-3 for chosen mechanism  | Designs the monitoring + runbook
Tradeoff articulation  | "Latency vs accuracy"           | "TCO vs correctness vs operational burden"
Hot key story          | Has a mitigation                | Has a capacity planning + incident response story

Interview Probes

After You Say...         | They Will Ask...
"Hybrid local + sync"    | "How does coordination actually work?"
"Token leasing"          | "What's the lease size? What if a node crashes?"
"Consistent hashing"     | "What about hot keys? Failover?"
"Bounded approximation"  | "What's the worst-case over-admission?"
"We'll sync every 500ms" | "That's not a bound. What's the drift budget?"

Staff Sentence Templates


Implementation Deep Dive

1. Token Leasing — Redis Implementation

Token leasing reduces central store pressure by pre-allocating capacity to each node. The pseudocode below sketches a production-shaped implementation.

Lease Manager

class LeaseManager:
    leases = {}            # key → { remaining, expiry }
    leaseSize = 100        # Default tokens per lease
    leaseTTL = 10_000      # 10 seconds

    function tryConsume(key, cost=1):
        lease = leases.get(key)

        # Check local lease
        if lease and lease.remaining >= cost and now() < lease.expiry:
            lease.remaining -= cost
            metrics.increment("lease.local_hit", tags=["key:" + key])
            return { allowed: true, source: "local" }

        # Lease exhausted or expired — request new lease from store
        return renewLease(key, cost)

    function renewLease(key, immediateCost):
        try:
            # Atomic: check remaining budget, grant lease, decrement
            result = redis.eval("""
                local budget = tonumber(redis.call('GET', KEYS[1]) or '0')
                local leaseSize = tonumber(ARGV[1])
                local grant = math.min(budget, leaseSize)
                if grant <= 0 then return 0 end
                redis.call('DECRBY', KEYS[1], grant)
                return grant
            """, keys=["budget:" + key], args=[leaseSize])

            if result <= 0:
                return { allowed: false, reason: "budget_exhausted" }

            if result < immediateCost:
                # Partial grant smaller than this request: bank it, deny the op
                leases[key] = { remaining: result, expiry: now() + leaseTTL }
                return { allowed: false, reason: "budget_exhausted" }

            leases[key] = {
                remaining: result - immediateCost,
                expiry: now() + leaseTTL
            }
            metrics.increment("lease.renewed", tags=["key:" + key])
            return { allowed: true, source: "renewed" }

        except RedisError:
            # Store unreachable — apply degraded mode policy
            metrics.increment("lease.store_failure", tags=["key:" + key])
            return fallbackPolicy(key, immediateCost)

    function fallbackPolicy(key, cost):
        # Fail-open for abuse protection, fail-closed for billing
        if isBillingKey(key):
            return { allowed: false, reason: "store_unavailable" }

        # Fail-open, but only under a conservative per-node cap
        if localFallbackCount(key) >= localCap:
            return { allowed: false, reason: "fallback_cap_exceeded" }
        return { allowed: true, source: "fallback_local_cap" }

Adaptive Lease Sizing

# Adjust lease size based on consumption rate
function adaptLeaseSize(key):
    consumptionRate = metrics.rate("lease.local_hit", window=60, tags=["key:" + key])

    # Target: one renewal per lease TTL (every 10 seconds)
    idealLeaseSize = consumptionRate * (leaseTTL / 1000)

    # Clamp to reasonable bounds
    leaseSize = clamp(idealLeaseSize, min=10, max=10_000)

    metrics.gauge("lease.adaptive_size", leaseSize, tags=["key:" + key])
    return leaseSize

2. Key-Owner Routing — Envoy Consistent Hashing

Key-owner routing eliminates distributed coordination by ensuring only one node handles operations for a given key.

Envoy Ring-Hash Configuration

# Envoy load balancer config (abbreviated; full listener/route_config omitted) — route by API key
clusters:
  - name: rate-limit-service
    lb_policy: RING_HASH
    ring_hash_lb_config:
      minimum_ring_size: 1024      # Higher = more even distribution
      maximum_ring_size: 8388608
    load_assignment:
      endpoints:
        - lb_endpoints:
            - endpoint: { address: { socket_address: { address: shard-0, port: 8080 }}}
            - endpoint: { address: { socket_address: { address: shard-1, port: 8080 }}}
            - endpoint: { address: { socket_address: { address: shard-2, port: 8080 }}}

routes:
  - match: { prefix: "/api" }
    route:
      cluster: rate-limit-service
      hash_policy:
        - header:
            header_name: "X-API-Key"     # Route by API key

Hot Key Detection and Mitigation

function handleRequest(key, request):
    metrics.increment("key_owner.request", tags=["key:" + key])

    # Detect hot key: >10x average traffic for this shard
    if isHotKey(key):
        metrics.increment("key_owner.hot_key", tags=["key:" + key])
        # Option 1: Route to dedicated hot-key handler
        return hotKeyHandler.process(key, request)
        # Option 2: Split key artificially
        # subKey = key + ":shard:" + hash(request.id) % 4
        # return process(subKey, request)

    return process(key, request)

function isHotKey(key):
    keyRate = metrics.rate("key_owner.request", window=10, tags=["key:" + key])
    avgRate = metrics.rate("key_owner.request", window=10) / shardCount
    return keyRate > avgRate * 10

3. Bounded Approximation — Batch Sync Implementation

class BoundedApproximation:
    localCounters = {}           # key → count consumed since last sync
    localBudgets = {}            # key → remaining budget from last sync
    syncInterval = 2_000         # Sync every 2 seconds
    lowWatermark = 0.20          # Force global check at 20% remaining

    function tryConsume(key, cost=1):
        budget = localBudgets.get(key, 0)
        consumed = localCounters.get(key, 0)
        remaining = budget - consumed

        # No local budget yet, or below the watermark: force a global check
        if budget == 0 or remaining < budget * lowWatermark:
            return globalCheck(key, cost)

        if remaining >= cost:
            localCounters[key] = consumed + cost
            return { allowed: true, source: "local" }

        # Enough headroom overall but not for this cost: confirm globally
        return globalCheck(key, cost)

    function globalCheck(key, cost):
        totalConsumed = localCounters.get(key, 0)
        result = redis.eval("""
            local budget = tonumber(redis.call('GET', KEYS[1]) or '0')
            local delta = tonumber(ARGV[1])
            local cost = tonumber(ARGV[2])
            budget = budget - delta
            if budget < cost then
                redis.call('SET', KEYS[1], budget)
                return -1
            end
            budget = budget - cost
            redis.call('SET', KEYS[1], budget)
            return budget
        """, keys=["budget:" + key], args=[totalConsumed, cost])

        localCounters[key] = 0    # Reset after sync

        if result < 0:
            localBudgets[key] = 0    # Don't keep admitting on a stale budget
            return { allowed: false, reason: "global_budget_exhausted" }

        localBudgets[key] = result
        return { allowed: true, source: "global_check" }

    # Background sync every 2 seconds
    function periodicSync():
        for key, consumed in localCounters.items():
            if consumed > 0:
                globalCheck(key, 0)    # Sync deltas without consuming

Architecture Diagram


Authenticated traffic (API key present): Envoy routes by consistent hash on the API key. One node owns all requests for that key — no distributed coordination needed for per-key enforcement. Local counters are exact.

Unauthenticated traffic (IP only): Round-robin across nodes. Each node uses token leasing or bounded approximation with the central Redis store. Coordination is approximate but bounded.
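The split above can be expressed as a dispatch on identity strength; the mechanism names mirror the sections above, and the request shape is a stand-in:

```python
def choose_mechanism(request):
    """Authenticated keys get exact single-owner counting; anonymous
    traffic gets approximate, centrally reconciled enforcement."""
    if request.get("api_key"):
        return "key_owner_routing"      # exact, routed by consistent hash
    return "bounded_approximation"      # approximate, bounded by epsilon

assert choose_mechanism({"api_key": "acme-123"}) == "key_owner_routing"
assert choose_mechanism({"ip": "203.0.113.9"}) == "bounded_approximation"
```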


Failure Scenarios

1. Central Store Outage — Redis Cluster Down

Timeline: Redis cluster loses quorum during a network partition. All lease renewals fail. Nodes fall back to their local policies. Abuse-protection nodes fail-open with conservative local caps. Billing-enforcement nodes fail-closed, rejecting all requests.

Blast radius: Abuse protection degrades — total over-admission bounded by G × local_cap × outage_duration. For 20 nodes with 100 req/s local cap and a 5-minute outage: worst case 600K extra requests. Billing enforcement blocks all paid-tier customers from making API calls.
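The 600K figure follows directly from the fail-open bound; a quick check of the arithmetic:

```python
def outage_over_admission(nodes, local_cap_rps, outage_s):
    # Fail-open worst case: every node admits at its local cap
    # for the entire outage window.
    return nodes * local_cap_rps * outage_s

# 20 nodes x 100 req/s x 5 minutes = 600,000 extra requests, worst case.
assert outage_over_admission(20, 100, 5 * 60) == 600_000
```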

Detection: Redis health check fails. Lease renewal error rate hits 100%. Fallback policy activation rate spikes.

Recovery:

  1. Abuse protection: local caps prevent unbounded abuse during the outage. Alert on sustained fallback usage >5 minutes
  2. Billing enforcement: switch to cached quota snapshots (each node caches the last known quota). Accept bounded staleness for the outage window
  3. Redis recovery: nodes re-lease from the restored store. Brief over-admission during the transition is acceptable
  4. Post-incident: evaluate whether a local fallback quota cache (snapshotted every 60 seconds) is acceptable for billing enforcement during short outages

2. Token Lease Stranding After Node Crash

Timeline: Node 3 leases 500 tokens from the central store for key acme-123. Node 3 crashes. The 500 tokens are stranded — they are decremented from the global budget but not available for consumption. Other nodes cannot access them. The customer's effective rate limit is reduced by 500 tokens until the lease TTL expires (10 seconds).

Blast radius: One customer's effective rate limit is temporarily reduced. If the lease is large relative to the total budget (500 out of 1,000 = 50%), the impact is significant. If the lease is small relative to the total budget (500 out of 100,000 = 0.5%), the impact is negligible.

Detection: Node health check fails. Lease ownership tracking shows unreachable node holding active leases.

Recovery:

  1. TTL-based recovery: the lease expires in 10 seconds. The central store reclaims the tokens automatically. No manual intervention.
  2. Active reclaim: a background process detects crashed nodes (via heartbeat failure) and reclaims their leases immediately
  3. Prevention: size leases relative to total budget so that stranding one lease does not materially impact the customer. Rule of thumb: lease size < 5% of total budget
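The 5% rule of thumb is easy to sanity-check against the stranding numbers above:

```python
def stranding_impact(lease_size, total_budget):
    # Fraction of the customer's budget frozen until TTL-based reclaim.
    return lease_size / total_budget

assert stranding_impact(500, 1_000) == 0.5      # 50%: significant impact
assert stranding_impact(500, 100_000) == 0.005  # 0.5%: negligible
```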

3. Hot Key Overwhelming Single Owner

Timeline: A viral API customer (acme-123) sends 50K req/s. All requests hash to Node 2 (the owner). Node 2's CPU hits 100%. Other customers routed to Node 2 experience elevated latency. Node 2 starts dropping requests.

Blast radius: The hot customer and all other customers on Node 2 experience degradation. Customers on other nodes are unaffected.

Detection: Per-node CPU utilization alert. Per-key request rate monitoring. Node 2's error rate exceeds threshold while other nodes are healthy.

Recovery:

  1. Immediate: rate-limit the hot customer at the load balancer level (pre-routing) to cap their impact
  2. Short-term: route the hot key to a dedicated handler (Node 3 overflow) so other customers on Node 2 are not affected
  3. Long-term: split hot keys into sub-keys (acme-123:0, acme-123:1, etc.) and distribute across multiple owners. The sub-key counter values are aggregated when the customer queries their usage.

Staff Interview Application

How to Introduce This Pattern

Lead with the identity question because it determines which mechanism is feasible. Then explain the specific mechanism.

When NOT to Use This Pattern

  • Single-node system: If you have one server, there is no distributed coordination problem. Use in-process state.
  • Coordination via database already exists: If all nodes share a PostgreSQL database and contention is low, a simple SELECT FOR UPDATE or atomic increment is sufficient. Don't add Redis leasing when a database counter works.
  • Exact consistency is truly required on every operation: If the business literally cannot tolerate any approximation (financial trading, ballot counting), centralized with sub-millisecond timeout is the only answer. Accept the latency and scale the central store.

Follow-Up Questions to Anticipate

Interviewer Asks                                  | What They Are Testing          | How to Respond
"What is the worst-case drift?"                   | Quantitative reasoning         | "With token leasing: G nodes × lease_size. With bounded approximation: G × local_budget × (1 - watermark). I can quantify this for any configuration."
"What happens when the central store goes down?"  | Failure mode thinking          | "Policy-dependent. For abuse protection: fail-open with local caps. For billing: fail-closed with cached quota snapshots. The degradation behavior is designed, not discovered."
"How do you handle hot keys?"                     | Practical operations           | "Detect via per-key request rate. Mitigate by routing to a dedicated handler or splitting into sub-keys. The key insight: hot keys are a capacity problem, not a correctness problem."
"Why not just call Redis on every request?"       | Latency and scaling awareness  | "At 100K QPS, that is 100K Redis round-trips/sec. Each round-trip adds 0.5-1ms of latency. Token leasing reduces it to ~1K renewals/sec — 100x less store pressure."
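The store-pressure claim in the last row follows directly from the lease size; a quick check (numbers from the table):

```python
def store_qps(request_qps, lease_size):
    # With leasing, the store only sees renewals:
    # one renewal per lease_size requests.
    return request_qps / lease_size

# 100K request QPS with 100-token leases -> ~1K renewals/sec at the store.
assert store_qps(100_000, 100) == 1_000
```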