StaffSignal
Cross-Cutting Framework

Data Pipeline Patterns

Batch vs stream, backfill strategies, schema evolution, and exactly-once delivery. Pipeline reliability is measured by the backfill you never had to run.

Data Pipeline Patterns — Cross-Cutting Pattern

The Problem

Data needs to flow from producers to consumers, with transformations along the way. The fundamental choice is batch vs stream — and the answer depends on latency requirements, exactly-once needs, and the cost of the backfill you'll inevitably need to run. Every team gets this wrong at least once, usually by picking the architecture that sounds impressive rather than the one that matches their actual SLA.

The Core Tradeoff

| Strategy | What Works | What Breaks | Who Pays |
| --- | --- | --- | --- |
| Batch (scheduled) | Simple, retryable, idempotent by design | Latency floor is the batch interval; late-arriving data causes silent gaps | On-call when a 4-hour job fails at hour 3 |
| Micro-batch | Compromise between batch simplicity and stream latency | Still not real-time; window semantics leak into business logic | Platform team maintaining the scheduling layer |
| Stream processing | Sub-second latency, natural event-driven fit | Exactly-once is hard; state management is complex; debugging is painful | Every engineer who touches the topology |
| Change Data Capture (CDC) | Captures mutations at the source; no producer changes needed | Couples pipeline to source schema; binlog formats vary across databases | The team that owns the source when they want to migrate |
| Lambda architecture | Run both batch and stream — correctness plus latency | Two codepaths to maintain, two sets of bugs, drift between them | Everyone, forever |

Staff Default Position

Staff default: stream processing for latency-sensitive paths, batch for analytical workloads. Always design backfill capability from day one — you will need it. If someone proposes a pipeline and cannot answer "how do we reprocess the last N days?", that pipeline is not production-ready.

Schema evolution is the silent killer. Pipelines break when producers change schema without coordination. Versioning is not a technical problem — it is a governance problem. Staff engineers enforce schema registries and backward-compatibility contracts before the first event is published.

When to Deviate

  • Prototype or internal tooling — Batch with cron is fine. Do not build Kafka infrastructure for a daily CSV export.
  • Regulatory or audit requirements — Lambda architecture earns its operational cost when you need both real-time alerting and provably correct batch reconciliation.
  • Tiny data volumes — If your entire dataset fits in memory, skip the distributed framework. A single-node script with good error handling beats a Flink cluster you cannot debug.
  • Cross-region replication — CDC becomes the only sane option when you need to replicate state across regions without rewriting producer logic.

Common Interview Mistakes

| What Candidates Say | What Interviewers Hear | What Staff Engineers Say |
| --- | --- | --- |
| "We'll use Kafka for everything" | Cargo-culting a tool without understanding the tradeoff | "Kafka for the event bus, but batch ETL for the warehouse backfill" |
| "Exactly-once is guaranteed" | Has not operated a stream processor under failure conditions | "We design for at-least-once and make consumers idempotent" |
| "We'll add streaming later" | Backfill strategy is an afterthought | "The replay path is designed before the forward path ships" |
| "Schema changes are handled by the consumer" | No concept of data contracts or governance | "Producers own backward compatibility; we enforce it at the registry" |
| "Lambda architecture solves everything" | Willing to double operational burden without quantifying the benefit | "We run dual paths only where the correctness-latency gap justifies the cost" |


Implementation Deep Dive

1. The Forward Path — Kafka, Flink, and an Idempotent Sink

The dominant production pattern for sub-second data pipelines is: Kafka as the durable event bus, Flink for stateful stream processing, and a purpose-built sink (database, search index, cache) at the end. Each layer has a distinct failure mode.

End-to-End Pipeline Pseudocode

# ── Producer: application emits business events ──
function onOrderPlaced(order):
    event = {
        "event_type": "order.placed",
        "order_id": order.id,
        "merchant_id": order.merchantId,
        "amount": order.total,
        "timestamp": now(),
        "schema_version": 3
    }
    kafka.produce(
        topic = "orders.events",
        key = order.merchantId,          # Partition by merchant for ordering
        value = avroSerialize(event, "orders.events-value", version=3),
        acks = "all"                     # Wait for all in-sync replicas
    )

# ── Flink job: stateful stream transformation ──
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment()
env.enableCheckpointing(30_000)          # Checkpoint every 30 seconds
env.setStateBackend(RocksDBStateBackend("/checkpoints"))

DataStream<MerchantRevenue> revenue = env
    .addSource(FlinkKafkaConsumer("orders.events", avroSchema, kafkaProps))
    .keyBy(event -> event.merchantId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(new OrderAggregator())    # Sum amounts per merchant per minute

# ── Sink: write aggregated results ──
revenue.addSink(new JdbcSink(
    "INSERT INTO merchant_revenue_1min (merchant_id, window_start, total_amount, order_count) " +
    "VALUES (?, ?, ?, ?) " +
    "ON CONFLICT (merchant_id, window_start) DO UPDATE SET total_amount = EXCLUDED.total_amount, " +
    "order_count = EXCLUDED.order_count",    # Idempotent upsert
    connectionConfig
))

Why each component:

  • Kafka (source): Durable, replayable, partitioned. If Flink crashes, it restarts from the last committed offset. No data loss.
  • Flink (processing): Exactly-once state management via checkpoints. Handles event-time windowing, late data, watermarks. RocksDB backend for state that exceeds memory.
  • JDBC sink (destination): Idempotent upsert (ON CONFLICT DO UPDATE) ensures reprocessed windows don't create duplicate rows.
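To make the window semantics concrete without Flink, here is a minimal single-process sketch of the same tumbling-window aggregation (illustrative only; `window_start` and `aggregate` are names invented here, and real Flink adds event time, watermarks, and durable state):

```python
from collections import defaultdict

WINDOW_MS = 60_000  # one-minute tumbling windows, matching the Flink job

def window_start(ts_ms):
    """Align an event timestamp to the start of its tumbling window."""
    return ts_ms - ts_ms % WINDOW_MS

def aggregate(events):
    """Sum amounts per (merchant, window): the shape OrderAggregator emits."""
    totals = defaultdict(lambda: {"total": 0.0, "count": 0})
    for e in events:
        key = (e["merchant_id"], window_start(e["timestamp"]))
        totals[key]["total"] += e["amount"]
        totals[key]["count"] += 1
    return dict(totals)

events = [
    {"merchant_id": "m1", "amount": 10.0, "timestamp": 5_000},
    {"merchant_id": "m1", "amount": 20.0, "timestamp": 59_999},  # same window
    {"merchant_id": "m1", "amount": 5.0,  "timestamp": 60_000},  # next window
]
result = aggregate(events)
# {("m1", 0): {"total": 30.0, "count": 2}, ("m1", 60000): {"total": 5.0, "count": 1}}
```

The property to notice: output is keyed by (merchant, window start), which is exactly what makes the downstream upsert idempotent.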

Pipeline Monitoring — What to Track

| Metric | Target | Alert Threshold | Why |
| --- | --- | --- | --- |
| Consumer lag (messages) | <1,000 | >10,000 sustained 5 min | Pipeline falling behind input rate |
| Checkpoint duration | <5 s | >30 s | State too large or I/O pressure on checkpoint storage |
| Event-time skew | <30 s | >2 min | Clock drift or partition imbalance; late events will be dropped |
| Backpressure (Flink) | <50% | >80% sustained | Processing slower than input; need more parallelism or optimization |
| Records in / records out ratio | ~1.0 for pass-through, stable for aggregation | Sudden change | Logic bug or schema mismatch silently dropping records |
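The consumer-lag alert above reduces to a few lines of logic. A sketch, not a monitoring product (the helper names are made up here):

```python
def consumer_lag(end_offsets, committed_offsets):
    """Total lag = sum over partitions of (log end offset minus committed offset)."""
    return sum(end - committed_offsets.get(p, 0) for p, end in end_offsets.items())

def should_alert(lag_samples, threshold=10_000, sustained=5):
    """Fire only when lag exceeded the threshold for `sustained` consecutive
    samples (one sample per minute gives the 5-minute rule from the table)."""
    return len(lag_samples) >= sustained and all(
        s > threshold for s in lag_samples[-sustained:])
```

The "sustained" guard matters: a single lag spike during a rebalance is normal, and paging on it trains the on-call to ignore the alert.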

2. Backfill Strategy — Replay from Source of Truth

Every pipeline breaks. Schema changes, bugs, infrastructure failures — you will need to reprocess historical data. If you cannot answer "how do we reprocess the last 7 days?" on day one, the pipeline is not production-ready.

Backfill Architecture

# Dual-path pipeline: forward (real-time) + backfill (batch)
#
# Forward path (normal operation):
#   Kafka topic → Flink job → Sink
#
# Backfill path (recovery):
#   Source DB → Batch exporter → Kafka backfill topic → Same Flink job → Sink

function triggerBackfill(startDate, endDate):
    # Step 1 — Pause the forward pipeline to avoid conflicts
    savepoint = flink.savepoint("orders-pipeline")   # capture state + offsets
    flink.stop("orders-pipeline")

    # Step 2 — Export source data to backfill topic
    batchExporter.run(
        query = "SELECT * FROM orders WHERE created_at BETWEEN ? AND ?",
        params = [startDate, endDate],
        outputTopic = "orders.events.backfill",
        batchSize = 10_000
    )

    # Step 3 — Run the same Flink job against the backfill topic
    flink.run("orders-pipeline",
        source = "orders.events.backfill",
        startOffset = "earliest"
    )

    # Step 4 — Wait for backfill completion, then switch back
    flink.waitForCompletion("orders-pipeline")
    flink.run("orders-pipeline",
        source = "orders.events",
        startOffset = savepoint              # resume from the step-1 savepoint
    )

Backfill Tradeoffs

| Strategy | Complexity | Downtime | Cost | Best For |
| --- | --- | --- | --- | --- |
| Kafka offset reset | Low — restart consumer from earlier offset | None (run in parallel) | Kafka storage for retention window | Small backfills, <7 days |
| Separate backfill topic | Medium — batch export + replay | Forward pipeline paused | Compute for batch export + reprocessing | Large backfills, schema changes |
| Dual-write during migration | High — run old and new pipeline in parallel | None | 2x compute during overlap | Zero-downtime migrations |
| Full rebuild from source | High — re-export entire dataset | Forward pipeline paused | Full reprocessing cost | Catastrophic corruption, new pipeline versions |
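The tradeoffs above collapse into a rough decision rule. A hedged sketch, encoding the table as code (the function and its parameters are illustrative, not a standard API):

```python
def pick_backfill_strategy(days_to_reprocess, schema_changed=False,
                           zero_downtime_required=False, sink_corrupted=False,
                           kafka_retention_days=7):
    """Illustrative heuristic derived from the backfill tradeoff table."""
    if zero_downtime_required:
        return "dual-write during migration"
    if sink_corrupted:
        return "full rebuild from source"
    if days_to_reprocess <= kafka_retention_days and not schema_changed:
        return "kafka offset reset"       # cheapest: replay the retained log
    return "separate backfill topic"      # export from source, replay

print(pick_backfill_strategy(3))          # small, recent: offset reset
print(pick_backfill_strategy(30))         # beyond retention: backfill topic
```

The real decision has more inputs (team capacity, sink write limits), but a staff engineer should be able to state the rule this crisply.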

3. Schema Evolution with Avro + Schema Registry

Schema changes are the #1 cause of pipeline outages. A producer changes a schema in a way a consumer does not expect; deserialization fails; the pipeline stops. Schema Registry enforces compatibility rules so that changes are safe by default.

Schema Evolution Rules

# Avro schema for order events (v3)
{
  "type": "record",
  "name": "OrderEvent",
  "namespace": "com.example.orders",
  "fields": [
    {"name": "order_id",    "type": "string"},
    {"name": "merchant_id", "type": "string"},
    {"name": "amount",      "type": "double"},
    {"name": "timestamp",   "type": "long"},
    {"name": "currency",    "type": "string", "default": "USD"},       # Added in v2
    {"name": "channel",     "type": ["null", "string"], "default": null} # Added in v3
  ]
}

Compatibility rules enforced at the registry:

| Change Type | BACKWARD Compatible? | FORWARD Compatible? | FULL Compatible? |
| --- | --- | --- | --- |
| Add field with default | Yes | Yes | Yes |
| Add field without default | No — new consumers crash on old data | Yes | No |
| Remove field without default | Yes | No — old consumers cannot fill the missing field | No |
| Rename field | No | No | No |
| Change field type | No (except promotions: int→long) | No | No |
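Under Avro's resolution rules, these verdicts reduce to a check on defaults: a field added to the new schema needs a default (backward), and a field removed from the old schema needs a default there (forward). A minimal sketch, assuming schemas are represented as lists of field dicts:

```python
def backward_compatible(old_fields, new_fields):
    """Can a reader on the NEW schema read data written with the OLD one?
    Deleting fields is fine; every ADDED field needs a default."""
    old_names = {f["name"] for f in old_fields}
    return all(f["name"] in old_names or "default" in f for f in new_fields)

def forward_compatible(old_fields, new_fields):
    """Can a reader on the OLD schema read data written with the NEW one?
    Adding fields is fine; every REMOVED field needs a default in the old schema."""
    new_names = {f["name"] for f in new_fields}
    return all(f["name"] in new_names or "default" in f for f in old_fields)

def full_compatible(old_fields, new_fields):
    return (backward_compatible(old_fields, new_fields)
            and forward_compatible(old_fields, new_fields))

v2 = [{"name": "order_id", "type": "string"},
      {"name": "currency", "type": "string", "default": "USD"}]
v3 = v2 + [{"name": "channel", "type": ["null", "string"], "default": None}]
v4_bad = v2 + [{"name": "shipping_method", "type": "string"}]  # no default
```

A real registry does much more (type promotion, unions, transitive checks), but this is the core of why defaults are mandatory.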

Schema Registry in the Pipeline

# Producer: serialize with schema registry
function produce(event):
    schemaId = registry.getSchemaId("orders.events-value", version="latest")
    # Avro serializer embeds schema ID in the first 5 bytes
    payload = avroSerialize(event, schemaId)
    kafka.produce("orders.events", key=event.merchantId, value=payload)

# Consumer: deserialize with schema registry
function consume(record):
    schemaId = extractSchemaId(record.value)    # Read from first 5 bytes
    writerSchema = registry.getSchema(schemaId) # Schema used by producer
    readerSchema = localSchema                  # Schema expected by consumer

    # Avro resolves field differences using default values
    event = avroDeserialize(record.value, writerSchema, readerSchema)
    process(event)

The consumer uses both the writer's schema (embedded in the message) and its own reader schema. Avro automatically handles field additions and removals using default values. This is why defaults are mandatory for safe evolution.
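The resolution step itself is small enough to sketch. This toy `resolve` mimics what an Avro deserializer does with reader defaults (real Avro also handles type promotion and aliases; the names here are invented for illustration):

```python
def resolve(record, reader_fields):
    """Toy Avro-style resolution: keep fields the reader knows, fill missing
    ones from reader defaults, drop fields the reader never heard of."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"field {name!r} missing and has no default")
    return resolved

reader_v3 = [
    {"name": "order_id"},
    {"name": "currency", "default": "USD"},  # added in v2, default required
    {"name": "channel", "default": None},    # added in v3
]
old_record = {"order_id": "o1", "legacy_flag": True}  # written with v1
event = resolve(old_record, reader_v3)
# event == {"order_id": "o1", "currency": "USD", "channel": None}
```

Note how `legacy_flag`, unknown to the reader, is silently dropped rather than causing a failure.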

4. Exactly-Once with Idempotent Consumers

Distributed systems cannot guarantee exactly-once delivery. The practical approach is at-least-once delivery with idempotent processing — the same message processed twice produces the same result.

Idempotent Consumer Patterns

# Pattern 1: Deduplication table
function processEvent(event):
    # Check if already processed
    existing = db.query(
        "SELECT 1 FROM processed_events WHERE event_id = ?", event.id)
    if existing:
        metrics.increment("pipeline.dedup.hit")
        return  # Already processed — skip

    db.beginTransaction()
    # Process the event
    applyBusinessLogic(event)
    # Record processing in the same transaction
    db.execute("INSERT INTO processed_events (event_id, processed_at) VALUES (?, ?)",
               event.id, now())
    db.commit()

# Pattern 2: Idempotent upsert (no dedup table needed)
function processAggregation(windowResult):
    db.execute("""
        INSERT INTO merchant_revenue (merchant_id, window_start, total, count)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (merchant_id, window_start)
        DO UPDATE SET total = EXCLUDED.total, count = EXCLUDED.count
    """, windowResult.merchantId, windowResult.windowStart,
         windowResult.total, windowResult.count)

Pattern 1 (dedup table) is the general-purpose approach — works for any event type. The dedup entry and the business effect are in the same database transaction, so they are atomically consistent.

Pattern 2 (idempotent upsert) is simpler but only works when the operation is naturally idempotent (SET, not INCREMENT). Aggregation results are idempotent — the same window reprocessed produces the same totals.
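Pattern 1 can be demonstrated end to end with SQLite standing in for the sink database (table and column names here are invented for the demo). Redelivering the same event leaves the balance unchanged because the dedup marker and the business effect share one transaction:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, total REAL)")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
db.execute("INSERT INTO balances VALUES ('m1', 0.0)")

def process(event):
    """Business effect and dedup marker committed in ONE transaction."""
    try:
        with db:  # commits on success, rolls back on any error
            # PRIMARY KEY makes a duplicate event_id raise IntegrityError
            db.execute("INSERT INTO processed_events VALUES (?)", (event["id"],))
            db.execute("UPDATE balances SET total = total + ? WHERE account = ?",
                       (event["amount"], event["account"]))
    except sqlite3.IntegrityError:
        pass  # duplicate delivery: skip, the effect was already applied

event = {"id": "evt-1", "account": "m1", "amount": 25.0}
process(event)
process(event)  # at-least-once redelivery of the same event

total = db.execute("SELECT total FROM balances WHERE account = 'm1'").fetchone()[0]
# total is 25.0: the non-idempotent increment ran exactly once
```

The ordering inside the transaction does not matter for correctness; what matters is that the dedup insert and the update commit or roll back together.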

Exactly-Once End-to-End: The Full Picture

| Component | Guarantee | Mechanism |
| --- | --- | --- |
| Kafka producer | At-least-once (with acks=all) | Retries on broker failure |
| Kafka producer (idempotent mode) | Exactly-once within a session | Producer ID + sequence number |
| Flink processing | Exactly-once internal state | Checkpoint barriers + state snapshots |
| Sink (database) | Idempotent write | ON CONFLICT DO UPDATE or dedup table |
| End-to-end | Effectively exactly-once | Each layer handles its own guarantee; the chain is as strong as its weakest link |

Architecture Diagram

(Diagram: forward and backfill pipeline paths, described below.)

Forward path (solid lines): Producers serialize events through the Schema Registry and publish to Kafka. Flink consumes, processes (windowing, aggregation, enrichment), and writes to purpose-built sinks. Checkpoints to S3 enable exactly-once recovery.

Backfill path (dashed): When reprocessing is needed, the batch exporter reads from the source database and publishes to a separate Kafka topic. Flink processes the backfill topic using the same transformation logic, ensuring consistency between forward and backfill paths.


Failure Scenarios

1. Schema Mismatch Causing Silent Data Loss

Timeline: The Order Service team adds a required field shipping_method to the Avro schema without a default value. They deploy the producer on Monday. The Flink consumer, running schema v3, encounters v4 messages on Tuesday. Avro deserialization fails. Flink's error handler routes the failed messages to a dead-letter topic.

Blast radius: All orders placed after Monday are silently routed to the dead-letter topic. Revenue dashboards show a cliff — zero new orders. The drop looks like a business problem, not a pipeline problem. It takes 6 hours before someone correlates the dead-letter queue growth with the missing data.

Detection: Dead-letter queue depth alert (threshold: >100 messages/min). Schema compatibility check failed in CI but was overridden with a manual approval. Consumer deserialization error rate spikes.

Recovery:

  1. Immediate: roll back the producer to schema v3, or register a v4 schema with a default value for shipping_method
  2. Reprocess: replay the dead-letter topic through the consumer after the schema fix
  3. Long-term: enforce FULL compatibility in the schema registry with no manual override. Block deploys that fail compatibility checks.
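The DLQ-and-replay flow can be simulated in a few lines. `deserialize_strict` is a stand-in for a consumer pinned to an exact schema version (names invented here); the point is that the same `consume` function handles both live traffic and the replay:

```python
def deserialize_strict(record, known_fields):
    """Stand-in for a consumer pinned to one schema version: any field it
    does not recognize is treated as a hard failure."""
    unknown = set(record) - set(known_fields)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return record

def consume(batch, known_fields, dead_letters):
    """Route messages that fail deserialization to the DLQ instead of
    crashing the job; return the successfully processed messages."""
    processed = []
    for record in batch:
        try:
            processed.append(deserialize_strict(record, known_fields))
        except ValueError:
            dead_letters.append(record)
    return processed

v3_fields = ["order_id", "amount"]
v4_record = {"order_id": "o1", "amount": 10.0, "shipping_method": "ground"}

dlq = []
consume([v4_record], v3_fields, dlq)          # v4 message lands in the DLQ
v4_fields = v3_fields + ["shipping_method"]   # schema fix deployed
recovered = consume(dlq, v4_fields, [])       # replay the DLQ: now succeeds
```

Because the DLQ preserves the original bytes, no data is lost; the replay is just the forward path pointed at a different input.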

2. Unbounded State Growth Breaking Checkpoints

Timeline: Flink's RocksDB state grows to 200GB after a month of operation. Checkpoint duration climbs from 5 seconds to 90 seconds, so with a 30-second checkpoint interval new checkpoints are triggered while the previous one is still running, and checkpoints begin timing out. Flink restarts the job from the last successful checkpoint — which is 10 minutes old. Those 10 minutes of data are reprocessed.

Blast radius: Downstream sinks receive duplicate writes for the reprocessed 10-minute window. If sinks are not idempotent, aggregation counts are inflated. If sinks are idempotent (upsert), no data corruption — but the reprocessing consumes resources and delays new data.

Detection: Checkpoint duration metric exceeds threshold. Checkpoint failure rate increases. RocksDB state size grows beyond capacity plan.

Recovery:

  1. Immediate: increase checkpoint timeout to 120 seconds to stop the failure loop
  2. Short-term: add state TTL — expire entries older than the business-relevant window (e.g., 24 hours for daily aggregations)
  3. Long-term: use incremental checkpoints (only changed state since last checkpoint), partition state across more Flink task managers, or offload cold state to an external store
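Step 2's state TTL is conceptually just this (a sketch; in real Flink, `StateTtlConfig` does the equivalent inside the state backend):

```python
def expire_state(state, now_ms, ttl_ms=24 * 60 * 60 * 1000):
    """Keep only entries touched within the TTL (24h for daily aggregations)."""
    return {key: entry for key, entry in state.items()
            if now_ms - entry["last_update"] <= ttl_ms}

state = {
    "m1": {"total": 410.0, "last_update": 0},           # older than 24h
    "m2": {"total": 99.0, "last_update": 90_000_000},   # recent
}
state = expire_state(state, now_ms=100_000_000)
# only "m2" survives; "m1" is past the 86,400,000 ms TTL
```

Choosing the TTL is a business decision, not a tuning knob: it must exceed the longest window (plus allowed lateness) the job can still legitimately update.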

3. Consumer Lag Spiral During Traffic Spike

Timeline: Black Friday traffic is 5x normal. The Kafka topic ingests 500K events/sec. Flink consumers process at 100K events/sec. Consumer lag grows by 400K messages/sec. After 30 minutes, lag is 720M messages, a backlog representing two hours of processing work at the current rate, and still growing.

Blast radius: All downstream systems — dashboards, search indexes, leaderboards — show data that is 2 hours stale. Business decisions are made on stale data. Real-time fraud detection is compromised because events arrive too late for intervention.

Detection: Consumer group lag alert. Processing throughput vs ingestion throughput divergence. End-to-end latency (event timestamp vs processing timestamp) exceeds SLA.

Recovery:

  1. Immediate: scale Flink parallelism (add task managers) to match ingestion rate. Kafka partitions set the upper bound — you cannot have more Flink consumers than partitions
  2. Short-term: if partitions are the bottleneck, increase the partition count for the topic (note that adding partitions changes the key-to-partition mapping, so per-merchant ordering is not preserved across the change)
  3. Long-term: capacity plan for peak traffic, not average. If normal is 100K events/sec and Black Friday is 500K, provision Flink for 500K sustained, not 100K + auto-scaling (auto-scaling is too slow for burst absorption)
  4. Emergency: drop non-critical events (analytics, logging) and prioritize business-critical events (fraud detection, inventory updates) through separate topics with independent consumer groups
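The scenario's numbers fall out of simple arithmetic, which is worth doing live in an interview:

```python
def lag_after(ingest_per_s, process_per_s, seconds):
    """Backlog accumulated while input outpaces processing."""
    return max(0, (ingest_per_s - process_per_s) * seconds)

def drain_seconds(backlog, ingest_per_s, process_per_s):
    """Time to clear a backlog; unbounded unless processing outpaces input."""
    headroom = process_per_s - ingest_per_s
    return float("inf") if headroom <= 0 else backlog / headroom

lag = lag_after(500_000, 100_000, 30 * 60)        # 720,000,000 messages
recovery = drain_seconds(lag, 100_000, 200_000)   # 7200 s once traffic
                                                  # normalizes and capacity doubles
```

The second function is the one teams forget: scaling to exactly match the ingest rate stops the bleeding but never drains the backlog.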

Staff Interview Application

How to Introduce This Pattern

Lead with the latency requirement, then name backfill capability and the schema contract as non-negotiable. This signals that you've operated pipelines in production, not just built them in a tutorial.

When NOT to Use This Pattern

  • Data fits in memory: If the entire dataset is <1GB, a single-node script with good error handling beats a Flink cluster. Don't add distributed infrastructure for a problem that fits in pandas.
  • Batch interval matches the SLA: If the business needs data refreshed every hour, a cron job with psql COPY is the right answer. Stream processing for hourly updates is over-engineering.
  • No replay requirement: If the data is ephemeral and reprocessing is not needed (metrics aggregation with acceptable loss), a simpler push-based pipeline without Kafka's durability overhead is sufficient.
  • Team cannot operate the stack: Flink requires JVM tuning, checkpoint management, and state backend expertise. If the team does not have this, use a managed service (AWS Kinesis Data Analytics, Google Dataflow) or stick with batch.

Follow-Up Questions to Anticipate

| Interviewer Asks | What They Are Testing | How to Respond |
| --- | --- | --- |
| "How do you handle late-arriving data?" | Event-time vs processing-time understanding | "Event-time windowing with watermarks. Late events within the allowed lateness period refire the window. Events beyond the allowed lateness go to a side output for separate processing." |
| "What happens when the pipeline breaks?" | Operational maturity | "Flink restarts from the last checkpoint — no data loss because Kafka retains events. The sink uses idempotent upserts so reprocessed data doesn't create duplicates." |
| "How do you backfill?" | Production readiness | "Kafka retention covers 7 days. For deeper backfills, we export from the source DB to a backfill topic and run the same Flink job against it. Same code, different input." |
| "How do you handle schema changes?" | Data governance awareness | "Schema Registry with FULL compatibility enforcement. New fields must have defaults. Producers that fail compatibility checks cannot deploy." |
| "Why not just use database triggers?" | Understanding of coupling and scale | "Triggers couple the pipeline to the database engine, don't scale horizontally, and can't be replayed. CDC via Debezium gives us the same change stream without the coupling." |
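The late-data answer in the first row can be sketched as a toy event-time window with allowed lateness (illustrative only; Flink implements this with watermarks, triggers, and side outputs, and the class here is invented for the demo):

```python
WINDOW = 60            # seconds per tumbling window
ALLOWED_LATENESS = 30  # late events within this bound re-fire the window

class WindowedCounter:
    def __init__(self):
        self.windows = {}      # window start -> event count
        self.watermark = 0
        self.fired = []        # (window_start, count) emissions, incl. re-fires
        self.side_output = []  # events too late even for allowed lateness

    def on_event(self, ts):
        start = ts - ts % WINDOW
        if start + WINDOW + ALLOWED_LATENESS <= self.watermark:
            self.side_output.append(ts)   # beyond allowed lateness
            return
        self.windows[start] = self.windows.get(start, 0) + 1
        if start + WINDOW <= self.watermark:
            self.fired.append((start, self.windows[start]))  # late re-fire

    def advance_watermark(self, new_wm):
        for start, count in self.windows.items():
            if self.watermark < start + WINDOW <= new_wm:
                self.fired.append((start, count))            # initial firing
        self.watermark = new_wm

wc = WindowedCounter()
wc.on_event(10); wc.on_event(50)   # both in window [0, 60)
wc.advance_watermark(60)           # window closes: fires (0, 2)
wc.on_event(30)                    # late but within lateness: re-fires (0, 3)
wc.advance_watermark(95)           # watermark passes 0 + 60 + 30
wc.on_event(20)                    # beyond lateness: goes to side output
```

Being able to narrate this state machine, rather than just naming "watermarks," is what distinguishes the staff-level answer.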