
Managing Long-Running Processes

Sagas, state machines, compensation, and idempotency. Every long-running process needs a cancel button.


The Problem

Any operation that outlives a single request/response cycle needs explicit state management. Without it, your only window into what happened is grepping logs after something already went wrong. Every long-running process needs three things from day one: a cancel button, a status query, and a recovery path.


The Core Tradeoff

| Strategy | What Works | What Breaks | Who Pays |
|---|---|---|---|
| State machine | Explicit states + transitions, queryable status, enforced invariants | Upfront design cost; every new edge case means a new state | Engineering team at design time — but you pay once |
| Saga (orchestration) | Central coordinator owns visibility; easy to reason about step ordering | Coordinator is a single point of failure; can become a god service | Platform team maintaining the orchestrator |
| Saga (choreography) | Fully decoupled services; each step owns its own logic | Nobody knows what step you're on; debugging is archaeology | On-call engineer at 3 AM tracing events across six services |
| Polling / checkpointing | Resilient to crashes; resume from last good state | Checkpoint frequency is a latency-vs-durability knob you will tune forever | Users waiting for progress, ops tuning intervals |
| Fire-and-forget + reconciliation | Simplest to build; works when eventual consistency is truly fine | "Eventually" can mean hours; reconciliation jobs become critical path | Business team explaining delays to customers |

Staff Default Position

"Every long-running process needs a cancel button." If the process is business-critical, use an explicit state machine. You want to answer "what state is order 48271 in?" with a database query, not a log search. Prefer orchestration sagas over choreography — visibility beats decoupling when money or SLAs are involved. Make every step idempotent: idempotency keys, at-least-once delivery, deduplication at the consumer. Staff engineers ask "what happens when this crashes at step 3 of 5?" before writing the design doc, not after the first incident.

Compensation deserves its own mental model. Rollback is not undo. A refund is a new forward action, not an un-charge. Design compensation as explicit steps in the saga, not as afterthoughts bolted onto error handlers.
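
The "rollback is not undo" model fits in a few lines: an append-only ledger where a refund is a new forward entry, never a mutation of the original charge. A minimal sketch (names are illustrative):

```python
# Append-only ledger: compensation adds a new entry; it never edits history.
ledger = []

def charge(order_id, amount_cents):
    entry = {"type": "charge", "order_id": order_id, "amount": amount_cents}
    ledger.append(entry)
    return entry

def refund(order_id, amount_cents):
    # A refund is a new forward action with a negative amount,
    # not a deletion of the original charge.
    entry = {"type": "refund", "order_id": order_id, "amount": -amount_cents}
    ledger.append(entry)
    return entry

def balance(order_id):
    return sum(e["amount"] for e in ledger if e["order_id"] == order_id)

charge("order-48271", 5000)
refund("order-48271", 5000)
# Net balance is zero, but both events remain visible in history.
```

The audit trail survives compensation, which is exactly what finance and debugging both need.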

When to Deviate

  • Low-value, high-volume work (thumbnail generation, cache warming) — fire-and-forget with reconciliation is fine. The cost of orchestration exceeds the cost of occasional failure.
  • Single-service scope — if the entire process lives in one service with one database, a state machine column on the row is enough. Don't import a saga framework for what a status enum can solve.
  • Latency-critical paths — synchronous state machine transitions add milliseconds. If you're on a hot path, consider async checkpointing with a reconciler as a safety net.
  • Rapidly evolving domains — choreography can be the right call when teams ship independently and the process definition changes weekly. Accept the visibility cost and invest in distributed tracing instead.
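
For the single-service case above, the "status enum" approach can be as small as one guarded UPDATE: a compare-and-swap on the row's current state. A minimal sketch using sqlite3 (table and state names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO orders VALUES ('order-1', 'created')")

def transition(conn, order_id, expected, target):
    # Guarded UPDATE: a compare-and-swap on the status column. It succeeds
    # only if the row is still in the expected state, so two concurrent
    # transitions cannot both win.
    cur = conn.execute(
        "UPDATE orders SET status = ? WHERE id = ? AND status = ?",
        (target, order_id, expected),
    )
    conn.commit()
    return cur.rowcount == 1  # False means another writer got there first

moved = transition(conn, "order-1", "created", "payment_pending")  # True
stale = transition(conn, "order-1", "created", "cancelled")        # False
```

No framework, no locks held across requests; the database's row-level atomicity does the work.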

Common Interview Mistakes

| What Candidates Say | What Interviewers Hear | What Staff Engineers Say |
|---|---|---|
| "We'd use a message queue" | No failure model, no state tracking | "We'd model the process as a state machine persisted in Postgres, with an orchestrator driving transitions and publishing events for observability." |
| "Each service fires an event to the next" | Choreography with no visibility plan | "Choreography works here only if we invest in correlation IDs and a trace view — otherwise on-call can't diagnose stuck orders." |
| "We'd just retry on failure" | No idempotency, no poison-pill handling | "Every step gets an idempotency key. Retries are safe because the consumer deduplicates, and we dead-letter after three attempts." |
| "We'd roll back the transaction" | Distributed rollback doesn't exist | "Compensation is a forward action — we issue a refund, release the inventory hold, and send a cancellation event." |
| "We can add monitoring later" | Observability is an afterthought | "The state machine gives us status queries for free. We alert on any process stuck in a state longer than its SLA." |

Quick Reference


Orchestrator drives each step. On failure at any point, compensation runs forward (not backward) through explicit undo-actions.




Implementation Deep Dive

1. Saga Pattern with Compensation — Orchestrated Order Fulfillment

A saga decomposes a long-running transaction into a sequence of local transactions, each with a compensating action. The orchestrator drives the saga forward on success and backward (compensation) on failure.

Orchestrator Pseudocode

# Saga definition: order fulfillment
SAGA_STEPS = [
    {
        "name": "reserve_inventory",
        "action": inventoryService.reserve,
        "compensation": inventoryService.releaseReservation,
    },
    {
        "name": "charge_payment",
        "action": paymentService.charge,
        "compensation": paymentService.refund,
    },
    {
        "name": "create_shipment",
        "action": shippingService.createShipment,
        "compensation": shippingService.cancelShipment,
    },
    {
        "name": "send_confirmation",
        "action": notificationService.sendConfirmation,
        "compensation": None,  # No compensation for sent emails
    },
]

function executeSaga(orderId, orderData):
    saga = db.insert("sagas", {
        "saga_id": generateId(),
        "order_id": orderId,
        "status": "running",
        "current_step": 0,
        "step_results": {},
        "started_at": now()
    })

    for i, step in enumerate(SAGA_STEPS):
        saga = db.update("sagas", saga.id, { "current_step": i })

        try:
            result = step.action(orderId, orderData, idempotencyKey=saga.id + ":" + step.name)
            saga.step_results[step.name] = { "status": "completed", "result": result }
            db.update("sagas", saga.id, { "step_results": saga.step_results })

        except Exception as e:
            saga.step_results[step.name] = { "status": "failed", "error": str(e) }
            db.update("sagas", saga.id, {
                "status": "compensating",
                "step_results": saga.step_results
            })
            runCompensation(saga, i)
            return { "status": "failed", "failed_step": step.name }

    db.update("sagas", saga.id, { "status": "completed", "completed_at": now() })
    return { "status": "completed" }

function runCompensation(saga, failedStepIndex):
    # Compensate completed steps in reverse order. The failed step is skipped
    # on the assumption its action took no effect; for ambiguous failures
    # (timeouts), include it as well: compensations are idempotent, so
    # over-compensating is safe while under-compensating leaks resources.
    for i in range(failedStepIndex - 1, -1, -1):
        step = SAGA_STEPS[i]
        if step.compensation is None:
            continue

        try:
            step.compensation(saga.order_id, saga.step_results[step.name]["result"],
                              idempotencyKey=saga.id + ":compensate:" + step.name)
            saga.step_results[step.name + ":compensation"] = { "status": "completed" }
        except Exception as e:
            # Compensation failure: manual intervention required
            saga.step_results[step.name + ":compensation"] = { "status": "failed", "error": str(e) }
            db.update("sagas", saga.id, {
                "status": "compensation_failed",
                "step_results": saga.step_results
            })
            alertOncall("Saga compensation failed", saga)
            return

    db.update("sagas", saga.id, {
        "status": "compensated",
        "step_results": saga.step_results,
        "completed_at": now()
    })

Why compensation is not rollback: A refund is a new financial transaction, not an un-charge. Releasing inventory is a new write, not an un-reserve. Each compensation action is idempotent and has its own idempotency key. If compensation itself fails, the saga enters compensation_failed state and alerts on-call for manual resolution.

Saga State Transitions

[Diagram: running → completed (all steps succeed); running → compensating (a step fails); compensating → compensated (all compensations succeed); compensating → compensation_failed (a compensation fails, on-call alerted)]

2. State Machine with Persistent State — PostgreSQL + Status Column

For processes that live within a single service, a state machine backed by a database column is simpler than a full saga framework. The state column makes the process queryable, the transition rules enforce invariants, and the database provides durability.

State Machine Implementation

# State transitions defined as a map
TRANSITIONS = {
    "created":     ["payment_pending"],
    "payment_pending": ["payment_confirmed", "payment_failed"],
    "payment_confirmed": ["preparing", "cancelled"],
    "preparing":   ["ready_for_pickup", "cancelled"],
    "ready_for_pickup": ["picked_up", "cancelled"],
    "picked_up":   ["delivered", "delivery_failed"],
    "delivered":   [],                    # Terminal state
    "cancelled":   [],                    # Terminal state
    "payment_failed": ["payment_pending"],  # Retry allowed
    "delivery_failed": ["picked_up"],       # Retry allowed
}

function transitionOrder(orderId, targetState):
    # Atomic read + validate + update in a single transaction
    db.beginTransaction()

    order = db.query(
        "SELECT id, status, updated_at FROM orders WHERE id = ? FOR UPDATE",
        orderId
    )

    if targetState not in TRANSITIONS[order.status]:
        db.rollback()
        raise InvalidTransitionError(
            f"Cannot transition from {order.status} to {targetState}"
        )

    db.execute("""
        UPDATE orders
        SET status = ?, updated_at = now(), status_history = status_history || ?
        WHERE id = ?
    """, targetState, serialize({"from": order.status, "to": targetState, "at": now()}),
         orderId)

    db.commit()

    # Publish event for downstream consumers
    eventBus.publish("order.status_changed", {
        "order_id": orderId,
        "from": order.status,
        "to": targetState,
        "timestamp": now()
    })

    return { "status": targetState }

Key design choices:

  • FOR UPDATE prevents concurrent transitions on the same order
  • status_history (JSONB array) provides a full audit trail without a separate table
  • Publishing after commit means consumers only see changes that actually committed; a crash between commit and publish can still drop an event, so back it with a reconciler or a transactional outbox
  • Explicit transition map prevents invalid state changes (e.g., jumping from created to delivered)
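
Publish-after-commit leaves one gap: a crash between the commit and the publish silently drops the event. A transactional-outbox sketch closes it by writing the event row in the same transaction and letting a relay publish later (sqlite3 used for illustration; the `outbox` table and relay are assumptions, not part of the design above):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        topic TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO orders VALUES ('order-1', 'created');
""")

def transition_with_outbox(conn, order_id, target):
    # The state change and the event row commit atomically:
    # either both exist or neither does.
    with conn:
        conn.execute("UPDATE orders SET status = ? WHERE id = ?", (target, order_id))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order.status_changed", json.dumps({"order_id": order_id, "to": target})),
        )

def relay(conn, publish):
    # A poller publishes unpublished rows. Delivery is at-least-once, so
    # consumers deduplicate (the outbox row id works as an event id).
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()
```

If the process crashes after commit but before the relay runs, the event is published on the next poll instead of being lost.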

Stuck Process Detection

# Alert on orders stuck in non-terminal states
RECOVERY_ACTIONS = {
    "payment_pending": "retry_payment or cancel",
    "preparing": "check_kitchen_service_health",
    "ready_for_pickup": "notify_driver_again",
    "picked_up": "check_delivery_tracking",
}

function detectStuckOrders():
    stuck = db.query("""
        SELECT id, status, updated_at
        FROM orders
        WHERE status NOT IN ('delivered', 'cancelled')
          AND updated_at < now() - interval '1 hour'
        ORDER BY updated_at ASC
    """)

    for order in stuck:
        alertOncall("Order stuck in state",
            orderId=order.id,
            state=order.status,
            stuckSince=order.updated_at,
            suggestedAction=RECOVERY_ACTIONS.get(order.status, "manual_review"))

3. Timeout and Retry with Exponential Backoff

Retries without backoff are a denial-of-service attack on your own infrastructure. Exponential backoff with jitter spreads retry traffic and prevents thundering herds.

Retry Implementation

function callWithRetry(operation, maxAttempts=5, baseDelay=1000):
    for attempt in range(maxAttempts):
        try:
            result = operation()
            return result

        except RetryableError as e:
            if attempt == maxAttempts - 1:
                # Exhausted retries — dead-letter
                deadLetterQueue.enqueue({
                    "operation": operation.name,
                    "error": str(e),
                    "attempts": maxAttempts,
                    "last_attempt": now()
                })
                raise MaxRetriesExceeded(e)

            # Exponential backoff with full jitter
            maxDelay = baseDelay * (2 ** attempt)     # 1s, 2s, 4s, 8s, 16s
            delay = random(0, maxDelay)               # Full jitter
            metrics.increment("retry.attempt",
                tags=["op:" + operation.name, "attempt:" + str(attempt)])
            sleep(delay)

        except NonRetryableError as e:
            # Business logic error — do not retry
            raise e

Why full jitter over equal jitter: Equal jitter (maxDelay/2 + random(0, maxDelay/2)) keeps every retry in the upper half of the window, so traffic still clusters. Full jitter (random(0, maxDelay)) distributes retries uniformly across the entire window. AWS's analysis showed full jitter reduces total retry time by ~40% under high contention.
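
The difference is easy to verify numerically: equal jitter never produces a delay below half the window, while full jitter spans all of it. A sketch (delays in milliseconds):

```python
import random

def full_jitter(base_ms, attempt, rng=random.random):
    cap = base_ms * (2 ** attempt)
    return rng() * cap                     # uniform over [0, cap)

def equal_jitter(base_ms, attempt, rng=random.random):
    cap = base_ms * (2 ** attempt)
    return cap / 2 + rng() * (cap / 2)     # uniform over [cap/2, cap)

# With base_ms=1000 and attempt=3 the window is 8000 ms: equal jitter is
# always >= 4000 ms, while full jitter can land anywhere in [0, 8000).
rng = random.Random(42)
full = [full_jitter(1000, 3, rng.random) for _ in range(1000)]
equal = [equal_jitter(1000, 3, rng.random) for _ in range(1000)]
```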

Retry Budget per Service

| Error Type | Retryable? | Max Attempts | Backoff | Dead-Letter? |
|---|---|---|---|---|
| Network timeout | Yes | 3 | Exponential + jitter | Yes |
| HTTP 503 | Yes | 5 | Exponential + jitter | Yes |
| HTTP 429 (rate limit) | Yes | 3 | Respect Retry-After header | Yes |
| HTTP 400 (bad request) | No | 1 | None | No — fix the caller |
| HTTP 409 (conflict) | Depends | 2 | Short delay | Log for investigation |
| Database deadlock | Yes | 3 | Fixed 100ms | Yes |
| Constraint violation | No | 1 | None | No — data error |
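
A budget like the one above can live as data rather than scattered if-statements, so every call site consults one policy table. A sketch (HTTP 409 is marked retryable-twice here as a judgment call, matching the "Depends" row):

```python
# Policy table mirroring the retry budget: (retryable, max_attempts, backoff).
# "attempt" below counts attempts already made, starting at 1.
RETRY_POLICY = {
    "network_timeout":      (True,  3, "exponential_jitter"),
    "http_503":             (True,  5, "exponential_jitter"),
    "http_429":             (True,  3, "retry_after_header"),
    "http_400":             (False, 1, None),
    "http_409":             (True,  2, "short_delay"),   # "Depends": bounded at 2
    "db_deadlock":          (True,  3, "fixed_100ms"),
    "constraint_violation": (False, 1, None),
}

def should_retry(error_type, attempt):
    # Unknown errors default to non-retryable: failing loudly beats
    # retrying a poison pill forever.
    retryable, max_attempts, _backoff = RETRY_POLICY.get(error_type, (False, 1, None))
    return retryable and attempt < max_attempts
```

The same table can drive the backoff choice and the dead-letter decision, keeping all three consistent.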

4. Idempotency Keys for Safe Retry

Every operation in a long-running process must be safe to retry. Idempotency keys ensure that executing the same step twice produces the same result without side effects.

Idempotency Key Implementation

function processPayment(request):
    idempotencyKey = request.idempotencyKey    # Client-provided or saga-generated

    # Check if this key has been processed before
    existing = db.query(
        "SELECT response, status FROM idempotency_store WHERE key = ?",
        idempotencyKey
    )

    if existing and existing.status == "completed":
        metrics.increment("idempotency.hit")
        return existing.response               # Return cached response

    if existing and existing.status == "in_progress":
        # Another request with the same key is currently processing
        return { "status": "conflict", "message": "Request in progress" }

    # Claim the key. ON CONFLICT DO NOTHING succeeds silently even when the
    # key already exists, so check the affected-row count to learn who won.
    claimed = db.execute("""
        INSERT INTO idempotency_store (key, status, created_at, expires_at)
        VALUES (?, 'in_progress', now(), now() + interval '24 hours')
        ON CONFLICT (key) DO NOTHING
    """, idempotencyKey)

    if claimed.rowcount == 0:
        # Lost the race to a concurrent request that claimed the key
        # between our SELECT and this INSERT
        return { "status": "conflict", "message": "Request in progress" }

    try:
        # Execute the actual operation
        result = paymentGateway.charge(request.amount, request.currency)

        # Store the result
        db.execute("""
            UPDATE idempotency_store
            SET status = 'completed', response = ?, completed_at = now()
            WHERE key = ?
        """, serialize(result), idempotencyKey)

        return result

    except Exception as e:
        # Remove the key so the operation can be retried
        db.execute("DELETE FROM idempotency_store WHERE key = ?", idempotencyKey)
        raise e

Key design decisions:

  • in_progress status prevents concurrent execution of the same request (two retries arriving simultaneously)
  • ON CONFLICT DO NOTHING is an atomic claim — only one request wins the race
  • Delete on failure allows the operation to be retried. If you keep the key on failure, the operation can never succeed.
  • 24-hour expiry bounds storage growth. After 24 hours, a retry with the same key is either a bug or a new attempt.
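
The atomic claim is worth seeing end to end: because ON CONFLICT DO NOTHING succeeds silently for the loser of the race, the caller must inspect the affected-row count. A self-contained sqlite3 sketch (this syntax needs SQLite >= 3.24; the schema is trimmed to the essentials):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE idempotency_store (key TEXT PRIMARY KEY, status TEXT)")

def claim(conn, key):
    # Atomic claim: only the first INSERT for a given key affects a row.
    # ON CONFLICT DO NOTHING never raises, so the row count is the only
    # signal telling winners apart from losers.
    cur = conn.execute(
        "INSERT INTO idempotency_store (key, status) VALUES (?, 'in_progress') "
        "ON CONFLICT (key) DO NOTHING",
        (key,),
    )
    conn.commit()
    return cur.rowcount == 1

won = claim(conn, "saga-456:charge_payment")   # first request claims the key
lost = claim(conn, "saga-456:charge_payment")  # concurrent retry must not execute
```

The pattern ports directly to Postgres, where `ON CONFLICT DO NOTHING` has the same silent-success behavior.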

Idempotency Scope

| Layer | Key Source | Scope | Example |
|---|---|---|---|
| API endpoint | Client-provided header (Idempotency-Key) | Single API call | X-Idempotency-Key: order-123-charge |
| Saga step | Saga ID + step name | One step in a workflow | saga-456:charge_payment |
| Background job | Job ID + attempt number | One job execution | job-789:attempt-2 |
| Event consumer | Event ID + consumer group | One event processing | evt-012:consumer-leaderboard |

Architecture Diagram


Happy path (solid lines): The orchestrator executes saga steps 1-5 in sequence. Each service checks the idempotency store before executing. The saga state is persisted after each step.

Compensation path (dashed lines): On failure at any step, the orchestrator runs compensations in reverse order. Each compensation is idempotent and recorded in the saga state store.

Failure path: Operations that fail after max retries go to the dead-letter queue. Sagas stuck in non-terminal states trigger monitoring alerts.


Failure Scenarios

1. Saga Compensation Failure — Inventory Release Fails After Refund

Timeline: An order saga reaches step 3 (create shipment). The shipping API returns a 500 error. The orchestrator begins compensation: step 2 compensation (refund) succeeds, but step 1 compensation (release inventory reservation) fails because the inventory service is experiencing a network partition.

Blast radius: The customer has been refunded but the inventory remains reserved. The reservation hold has a 15-minute TTL, so inventory will auto-release — but during those 15 minutes, other customers cannot purchase those items. If the inventory service outage lasts longer than the TTL, the hold expires naturally and no manual intervention is needed.

Detection: Saga enters compensation_failed state. Alert fires for sagas in compensation_failed longer than 5 minutes. Dashboard shows compensation success rate dropping.

Recovery:

  1. Immediate: the orchestrator retries the failed compensation with exponential backoff (up to 3 attempts)
  2. If retries fail: saga remains in compensation_failed — on-call manually releases the reservation after investigating the root cause
  3. Long-term: inventory reservations have TTL-based auto-release as a safety net. Even if compensation never succeeds, the reservation expires and inventory is reclaimed

2. State Machine Stuck in Non-Terminal State

Timeline: An order transitions to preparing at 14:00. The kitchen service processes the order but its callback to update the order status fails (network timeout). The order remains in preparing state indefinitely. The customer sees "preparing" on their app for 3 hours. The food was ready 2.5 hours ago.

Blast radius: One order is stuck. The customer has a bad experience. The driver is not dispatched. If multiple orders are stuck (systemic callback failure), the entire kitchen's throughput is blocked.

Detection: The stuck-order detection query runs every 5 minutes. Orders in preparing for >30 minutes trigger an alert. The monitoring system shows the gap between kitchen completions and order status updates.

Recovery:

  1. Immediate: on-call manually transitions the order to ready_for_pickup via an admin API
  2. Short-term: add a reconciliation job — poll the kitchen service for orders that completed but whose callback failed, and update status accordingly
  3. Long-term: switch from push-based callbacks to pull-based polling. The orchestrator polls the kitchen service every 30 seconds for orders in preparing state, eliminating the callback failure mode

3. Idempotency Key Collision Causing Lost Payment

Timeline: A client generates an idempotency key based on order_id alone: idem-key = "order-123". The first payment attempt charges $50 and succeeds. The key is stored with the $50 result. Later, the order amount is updated to $75 (price adjustment). The client retries with the same key "order-123". The idempotency store returns the cached $50 result. The customer is charged $50 instead of $75.

Blast radius: Revenue loss on every order with a price adjustment that shares an idempotency key with the original charge. Silent — no error is raised because the idempotency store is working as designed.

Detection: Revenue reconciliation job detects orders where the charged amount does not match the current order amount. Financial audit flags discrepancies.

Recovery:

  1. Immediate: charge the $25 difference as a separate transaction with a new idempotency key
  2. Short-term: include amount in the idempotency key: "order-123:charge:7500" (amount in cents)
  3. Long-term: idempotency keys should encode the full request fingerprint — amount, currency, destination account. If any parameter changes, it is a new request and needs a new key
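
The full-fingerprint rule reduces to a small helper: hash every parameter that defines the request, so changing any of them yields a new key. A sketch (the field set is an illustrative assumption):

```python
import hashlib
import json

def idempotency_key(order_id, amount_cents, currency, destination):
    # Hash every parameter that defines the request; a change to any one of
    # them (price adjustment, currency, payout account) yields a new key,
    # so a stale cached result can never be replayed for a different charge.
    fingerprint = json.dumps(
        {"order": order_id, "amount": amount_cents,
         "currency": currency, "dest": destination},
        sort_keys=True,
    )
    digest = hashlib.sha256(fingerprint.encode()).hexdigest()[:16]
    return f"{order_id}:charge:{digest}"

key_a = idempotency_key("order-123", 5000, "USD", "acct-9")
key_b = idempotency_key("order-123", 7500, "USD", "acct-9")  # amount changed: new key
```

Keeping the order ID as a readable prefix preserves debuggability while the digest carries the uniqueness.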

Staff Interview Application

How to Introduce This Pattern

Lead with the scope question (single-service vs multi-service), then name the specific pattern. This tells the interviewer you have a decision framework, not just a tool preference.

When NOT to Use This Pattern

  • Process completes in <1 second: If the entire workflow fits in a single HTTP request/response with a database transaction, there is no long-running process. A simple BEGIN; ... COMMIT; is the right answer.
  • Failure is acceptable: Thumbnail generation, cache warming, non-critical background jobs. If a failure means "we'll try again on the next cron run" and the business impact is zero, fire-and-forget with reconciliation is simpler than a saga.
  • All steps are in one service and one database: Use a regular database transaction. A saga exists because distributed transactions don't. If you're not distributed, you don't need one.
  • Process changes weekly: If the workflow is evolving rapidly (startup iterating on the onboarding flow), a rigid state machine creates friction. Use a simpler workflow engine or even a checklist table until the process stabilizes.

Follow-Up Questions to Anticipate

| Interviewer Asks | What They Are Testing | How to Respond |
|---|---|---|
| "What if compensation fails?" | Understanding of partial failure | "Compensation has its own retry logic and idempotency key. If it exhausts retries, the saga enters 'compensation_failed' and alerts on-call. The underlying resource (reservation, auth hold) has a TTL that auto-expires as a safety net." |
| "Why not use distributed transactions (2PC)?" | Understanding of distributed systems limits | "2PC blocks all participants until the coordinator decides. A single slow participant blocks the entire transaction. Sagas with compensation are less correct — we have a window of inconsistency — but they never block." |
| "How do you handle concurrent updates to the same order?" | Concurrency control | "SELECT FOR UPDATE on the order row. Only one transition runs at a time. Concurrent requests get a 409 Conflict. The state machine's transition map prevents invalid jumps regardless of concurrency." |
| "How do you test a saga?" | Practical engineering | "Inject failures at each step and verify compensation runs correctly. The key test cases: failure at step 1 (no compensation needed), failure at step N (full compensation chain), and compensation failure (verify alerting and manual recovery path)." |
| "What about the orchestrator being a single point of failure?" | Availability thinking | "The orchestrator is stateless — the saga state is in the database. If the orchestrator crashes, a new instance picks up sagas in 'running' or 'compensating' state and resumes from the last recorded step. No data loss, just latency during failover." |