
Data Pipeline Patterns

Batch vs stream, backfill strategies, schema evolution, and exactly-once delivery. Pipeline reliability is measured by the backfill you never had to run.

Why This Matters

Every system design interview that involves analytics, search, notifications, or data synchronization is secretly a data pipeline question. The moment data needs to flow from one system to another with transformations along the way, you are designing a pipeline — and the choices you make about batch vs stream, exactly-once vs at-least-once, and backfill strategy determine whether your system is production-ready or a time bomb waiting for its first schema change.

Most candidates reach for Kafka and Flink by default because those are the names they know. Staff engineers start with the latency SLA and work backward: if hours are acceptable, a cron job beats a streaming cluster. If sub-second is required, the complexity of stream processing is justified — but only if you also design the backfill path and the schema contract in the same breath. Pipeline reliability is not measured by uptime. It is measured by the backfill you never had to run because you got the schema governance and exactly-once semantics right from day one.

If you can explain when batch beats stream, why "exactly-once" is a lie that requires idempotent sinks, and how to reprocess the last 7 days without writing new code — you demonstrate the operational maturity that Staff interviews demand.

The 60-Second Version

  • Pipeline reliability is measured by the backfill you never had to run. If you cannot answer "how do we reprocess the last 7 days?" on day one, the pipeline is not production-ready.
  • Batch for hours-level latency, stream for sub-second. Most teams pick stream because it sounds impressive, then spend months debugging state management and exactly-once semantics for a pipeline that could have been a cron job.
  • "Exactly-once" requires idempotent sinks. Kafka gives you at-least-once. Flink gives you exactly-once internal state. But the database write at the end can still duplicate. The real guarantee comes from ON CONFLICT DO UPDATE or a dedup table — not from the framework.
  • Schema evolution is the #1 pipeline killer. A producer adds a field without a default; the consumer crashes at 2 AM. Schema Registry with FULL compatibility enforcement prevents this. Block deploys that fail compatibility checks.
  • Design two paths: forward and backfill. The forward path handles real-time. The backfill path uses the same transformation code against a different source (Kafka offset reset or batch export). Same code, different input.
  • Auto-scaling stream processors is a myth for sudden spikes. Flink needs minutes to tens of minutes to redistribute state across new workers (depending on state size and checkpoint storage throughput). Provision for peak from day one.

The Problem

Data needs to flow from producers to consumers, with transformations along the way. The fundamental choice is batch vs stream — and the answer depends on latency requirements, exactly-once needs, and the cost of the backfill you'll inevitably need to run. Every team gets this wrong at least once, usually by picking the architecture that sounds impressive rather than the one that matches their actual SLA.

The Core Tradeoff

| Strategy | What Works | What Breaks | Who Pays |
|---|---|---|---|
| Batch (scheduled) | Simple, retryable, idempotent by design | Latency floor is the batch interval; late-arriving data causes silent gaps | On-call when a 4-hour job fails at hour 3 |
| Micro-batch | Compromise between batch simplicity and stream latency | Still not real-time; window semantics leak into business logic | Platform team maintaining the scheduling layer |
| Stream processing | Sub-second latency, natural event-driven fit | Exactly-once is hard; state management is complex; debugging is painful | Every engineer who touches the topology |
| Change Data Capture (CDC) | Captures mutations at the source; no producer changes needed | Couples pipeline to source schema; binlog formats vary across databases | The team that owns the source when they want to migrate |
| Lambda architecture | Run both batch and stream — correctness plus latency | Two codepaths to maintain, two sets of bugs, drift between them | Everyone, forever |

Staff Default Position

Staff default: stream processing for latency-sensitive paths, batch for analytical workloads. Always design backfill capability from day one — you will need it. If someone proposes a pipeline and cannot answer "how do we reprocess the last N days?", that pipeline is not production-ready.

Schema evolution is the silent killer. Pipelines break when producers change schema without coordination. Versioning is not a technical problem — it is a governance problem. Staff engineers enforce schema registries and backward-compatibility contracts before the first event is published.
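
In practice that contract is enforced by a pre-deploy check rather than by convention. A minimal sketch of such a CI gate, assuming Confluent Schema Registry's REST compatibility endpoint; the registry URL, subject name, and script name are illustrative:

# ci_schema_check.py — fail the build if the proposed schema breaks compatibility.
# Assumes Confluent Schema Registry's REST API; REGISTRY_URL and SUBJECT are placeholders.
import sys
import requests

REGISTRY_URL = "http://schema-registry:8081"
SUBJECT = "orders.events-value"

def check_compatibility(schema_path: str) -> bool:
    with open(schema_path) as f:
        proposed_schema = f.read()
    resp = requests.post(
        f"{REGISTRY_URL}/compatibility/subjects/{SUBJECT}/versions/latest",
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        json={"schema": proposed_schema},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("is_compatible", False)

if __name__ == "__main__":
    if not check_compatibility(sys.argv[1]):
        print("Schema change is NOT compatible with the latest registered version — blocking deploy.")
        sys.exit(1)
    print("Schema change is compatible.")

Wiring a check like this into the producer's deploy pipeline is what turns "please keep schemas compatible" from a request into a guarantee.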

When to Deviate

  • Prototype or internal tooling — Batch with cron is fine. Do not build Kafka infrastructure for a daily CSV export.
  • Regulatory or audit requirements — Lambda architecture earns its operational cost when you need both real-time alerting and provably correct batch reconciliation.
  • Tiny data volumes — If your entire dataset fits in memory, skip the distributed framework. A single-node script with good error handling beats a Flink cluster you cannot debug.
  • Cross-region replication — CDC becomes the only sane option when you need to replicate state across regions without rewriting producer logic.

Common Interview Mistakes

| What Candidates Say | What Interviewers Hear | What Staff Engineers Say |
|---|---|---|
| "We'll use Kafka for everything" | Cargo-culting a tool without understanding the tradeoff | "Kafka for the event bus, but batch ETL for the warehouse backfill" |
| "Exactly-once is guaranteed" | Has not operated a stream processor under failure conditions | "We design for at-least-once and make consumers idempotent" |
| "We'll add streaming later" | Backfill strategy is an afterthought | "The replay path is designed before the forward path ships" |
| "Schema changes are handled by the consumer" | No concept of data contracts or governance | "Producers own backward compatibility; we enforce it at the registry" |
| "Lambda architecture solves everything" | Willing to double operational burden without quantifying the benefit | "We run dual paths only where the correctness-latency gap justifies the cost" |

Implementation Deep Dive

1. Forward Path — Kafka, Flink, and a Purpose-Built Sink

The dominant production pattern for sub-second data pipelines is: Kafka as the durable event bus, Flink for stateful stream processing, and a purpose-built sink (database, search index, cache) at the end. Each layer has a distinct failure mode.

End-to-End Pipeline Pseudocode

# ── Producer: application emits business events ──
function onOrderPlaced(order):
    event = {
        "event_type": "order.placed",
        "order_id": order.id,
        "merchant_id": order.merchantId,
        "amount": order.total,
        "timestamp": now(),
        "schema_version": 3
    }
    kafka.produce(
        topic = "orders.events",
        key = order.merchantId,          # Partition by merchant for ordering
        value = avroSerialize(event, "orders.events-value", version=3),
        acks = "all"                     # Wait for all in-sync replicas
    )

# ── Flink job: stateful stream transformation ──
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment()
env.enableCheckpointing(30_000)          # Checkpoint every 30 seconds
env.setStateBackend(RocksDBStateBackend("/checkpoints"))

DataStream<MerchantRevenue> revenuePerMinute = env
    .addSource(FlinkKafkaConsumer("orders.events", avroSchema, kafkaProps))
    .keyBy(event -> event.merchantId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(new OrderAggregator())    # Sum amounts per merchant per minute

# ── Sink: write aggregated results ──
revenuePerMinute.addSink(new JdbcSink(
    "INSERT INTO merchant_revenue_1min (merchant_id, window_start, total_amount, order_count) " +
    "VALUES (?, ?, ?, ?) " +
    "ON CONFLICT (merchant_id, window_start) DO UPDATE SET total_amount = EXCLUDED.total_amount, " +
    "order_count = EXCLUDED.order_count",    # Idempotent upsert
    connectionConfig
))

Why each component:

  • Kafka (source): Durable, replayable, partitioned. If Flink crashes, it restarts from the last committed offset. No data loss.
  • Flink (processing): Exactly-once state management via checkpoints. Handles event-time windowing, late data, watermarks. RocksDB backend for state that exceeds memory.
  • JDBC sink (destination): Idempotent upsert (ON CONFLICT DO UPDATE) ensures reprocessed windows don't create duplicate rows.

Pipeline Monitoring — What to Track

| Metric | Target | Alert Threshold | Why |
|---|---|---|---|
| Consumer lag (messages) | <1,000 | >10,000 sustained 5 min | Pipeline falling behind input rate |
| Checkpoint duration | <5 s | >30 s | State too large or I/O pressure on checkpoint storage |
| Event-time skew | <30 s | >2 min | Clock drift or partition imbalance; late events will be dropped |
| Backpressure (Flink) | <50% | >80% sustained | Processing slower than input; need more parallelism or optimization |
| Records in / records out ratio | ~1.0 for pass-through, stable for aggregation | Sudden change | Logic bug or schema mismatch silently dropping records |
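
Consumer lag is the one metric worth computing directly rather than trusting a dashboard you did not configure. A sketch using the confluent_kafka Python client — broker, group, and topic names are illustrative, and a real pipeline would export this to the metrics system instead of printing it:

# lag_check.py — report per-partition consumer lag (log-end offset minus committed offset).
# Assumes the confluent_kafka client; broker, group, and topic names are illustrative.
from confluent_kafka import Consumer, TopicPartition

def consumer_lag(brokers: str, group: str, topic: str) -> dict:
    consumer = Consumer({"bootstrap.servers": brokers, "group.id": group})
    partitions = consumer.list_topics(topic).topics[topic].partitions
    lag = {}
    for p in partitions:
        low, high = consumer.get_watermark_offsets(TopicPartition(topic, p), timeout=10)
        committed = consumer.committed([TopicPartition(topic, p)], timeout=10)[0].offset
        # A group that has never committed reports a negative offset; treat it as "at low watermark"
        position = committed if committed >= 0 else low
        lag[p] = high - position
    consumer.close()
    return lag

if __name__ == "__main__":
    total = sum(consumer_lag("kafka:9092", "orders-flink", "orders.events").values())
    print(f"total lag: {total} messages")   # Alert when this stays above 10,000 for 5 minutes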

2. Backfill Strategy — Replay from Source of Truth

Every pipeline breaks. Schema changes, bugs, infrastructure failures — you will need to reprocess historical data. If you cannot answer "how do we reprocess the last 7 days?" on day one, the pipeline is not production-ready.

Backfill Architecture

# Dual-path pipeline: forward (real-time) + backfill (batch)
#
# Forward path (normal operation):
#   Kafka topic → Flink job → Sink
#
# Backfill path (recovery):
#   Source DB → Batch exporter → Kafka backfill topic → Same Flink job → Sink

function triggerBackfill(startDate, endDate):
    # Step 1 — Pause the forward pipeline to avoid conflicts
    savepointPath = flink.savepoint("orders-pipeline")   # Capture the forward path's position
    flink.stop("orders-pipeline")

    # Step 2 — Export source data to backfill topic
    batchExporter.run(
        query = "SELECT * FROM orders WHERE created_at BETWEEN ? AND ?",
        params = [startDate, endDate],
        outputTopic = "orders.events.backfill",
        batchSize = 10_000
    )

    # Step 3 — Run Flink job on backfill topic
    flink.run("orders-pipeline",
        source = "orders.events.backfill",
        startOffset = "earliest"
    )

    # Step 4 — Wait for backfill completion, then switch back
    flink.waitForCompletion("orders-pipeline")
    flink.run("orders-pipeline",
        source = "orders.events",
        restoreFrom = savepointPath          # Resume the forward path from the Step 1 savepoint
    )

Backfill Tradeoffs

| Strategy | Complexity | Downtime | Cost | Best For |
|---|---|---|---|---|
| Kafka offset reset | Low — restart consumer from earlier offset | None (run in parallel) | Kafka storage for retention window | Small backfills, <7 days |
| Separate backfill topic | Medium — batch export + replay | Forward pipeline paused | Compute for batch export + reprocessing | Large backfills, schema changes |
| Dual-write during migration | High — run old and new pipeline in parallel | None | 2x compute during overlap | Zero-downtime migrations |
| Full rebuild from source | High — re-export entire dataset | Forward pipeline paused | Full reprocessing cost | Catastrophic corruption, new pipeline versions |
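
For the first row, the reset itself needs no new pipeline code — the consumer group's committed offsets are rewound to a timestamp and the existing consumers replay from there. A sketch using the confluent_kafka client, assuming the group is stopped while offsets are rewritten (names are illustrative). Note that a Flink job tracks offsets in its own checkpoints, so for Flink the equivalent move is restarting the job with a start-from-timestamp setting rather than rewriting group offsets.

# offset_reset.py — rewind a stopped consumer group to the offsets at a given timestamp.
# Assumes the confluent_kafka client; broker, group, and topic names are illustrative.
from confluent_kafka import Consumer, TopicPartition

def rewind_to_timestamp(brokers: str, group: str, topic: str, timestamp_ms: int):
    consumer = Consumer({"bootstrap.servers": brokers, "group.id": group,
                         "enable.auto.commit": False})
    partition_ids = consumer.list_topics(topic).topics[topic].partitions
    # offsets_for_times maps a timestamp to the earliest offset at or after it, per partition
    wanted = [TopicPartition(topic, p, timestamp_ms) for p in partition_ids]
    offsets = consumer.offsets_for_times(wanted, timeout=10)
    consumer.commit(offsets=offsets, asynchronous=False)   # Overwrite the group's committed offsets
    consumer.close()

# Rewind 7 days, then restart the normal consumers — they replay from the committed offsets:
# rewind_to_timestamp("kafka:9092", "orders-flink", "orders.events",
#                     timestamp_ms=now_ms - 7 * 24 * 3600 * 1000)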

3. Schema Evolution with Avro + Schema Registry

Schema changes are the #1 cause of pipeline outages. A producer adds a field; a consumer does not expect it; deserialization fails; the pipeline stops. Schema Registry enforces compatibility rules so that changes are safe by default.

Schema Evolution Rules

# Avro schema for order events (v3)
{
  "type": "record",
  "name": "OrderEvent",
  "namespace": "com.example.orders",
  "fields": [
    {"name": "order_id",    "type": "string"},
    {"name": "merchant_id", "type": "string"},
    {"name": "amount",      "type": "double"},
    {"name": "timestamp",   "type": "long"},
    {"name": "currency",    "type": "string", "default": "USD"},       # Added in v2
    {"name": "channel",     "type": ["null", "string"], "default": null} # Added in v3
  ]
}

Compatibility rules enforced at the registry:

| Change Type | BACKWARD Compatible? | FORWARD Compatible? | FULL Compatible? |
|---|---|---|---|
| Add field with default | Yes | Yes | Yes |
| Add field without default | No — new consumers cannot read old data | Yes | No |
| Remove field without a default | Yes | No — old consumers crash on new data | No |
| Rename field | No | No | No |
| Change field type | No (except promotions: int→long) | No | No |

Schema Registry in the Pipeline

# Producer: serialize with schema registry
function produce(event):
    schemaId = registry.getSchemaId("orders.events-value", version="latest")
    # Avro serializer embeds schema ID in the first 5 bytes
    payload = avroSerialize(event, schemaId)
    kafka.produce("orders.events", key=event.merchantId, value=payload)

# Consumer: deserialize with schema registry
function consume(record):
    schemaId = extractSchemaId(record.value)    # Read from first 5 bytes
    writerSchema = registry.getSchema(schemaId) # Schema used by producer
    readerSchema = localSchema                  # Schema expected by consumer

    # Avro resolves field differences using default values
    event = avroDeserialize(record.value, writerSchema, readerSchema)
    process(event)

The consumer uses both the writer's schema (embedded in the message) and its own reader schema. Avro automatically handles field additions and removals using default values. This is why defaults are mandatory for safe evolution.
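
A small, runnable illustration of that resolution, using the fastavro library as a stand-in for whatever Avro implementation the consumer actually uses: a record written with a trimmed v1 schema is read with today's reader schema, and the missing currency and channel fields are filled from their defaults.

# avro_resolution_demo.py — how Avro reconciles writer and reader schemas via defaults.
# Assumes the fastavro library; schemas are trimmed versions of the OrderEvent above.
import io
from fastavro import schemaless_writer, schemaless_reader

writer_schema = {  # v1: what an old producer wrote
    "type": "record", "name": "OrderEvent", "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount",   "type": "double"},
    ],
}
reader_schema = {  # v3: what today's consumer expects
    "type": "record", "name": "OrderEvent", "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount",   "type": "double"},
        {"name": "currency", "type": "string", "default": "USD"},
        {"name": "channel",  "type": ["null", "string"], "default": None},
    ],
}

buf = io.BytesIO()
schemaless_writer(buf, writer_schema, {"order_id": "o-1", "amount": 42.0})
buf.seek(0)

# The reader supplies both schemas; fields missing from the writer's data are filled from defaults
event = schemaless_reader(buf, writer_schema, reader_schema)
print(event)   # {'order_id': 'o-1', 'amount': 42.0, 'currency': 'USD', 'channel': None}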

4. Exactly-Once with Idempotent Consumers

Distributed systems cannot guarantee exactly-once delivery. The practical approach is at-least-once delivery with idempotent processing — the same message processed twice produces the same result.

Idempotent Consumer Patterns

# Pattern 1: Deduplication table
function processEvent(event):
    # Check if already processed
    existing = db.query(
        "SELECT 1 FROM processed_events WHERE event_id = ?", event.id)
    if existing:
        metrics.increment("pipeline.dedup.hit")
        return  # Already processed — skip

    db.beginTransaction()
    # Process the event
    applyBusinessLogic(event)
    # Record processing in the same transaction; event_id is the primary key,
    # so a concurrent duplicate insert fails and the transaction rolls back
    db.execute("INSERT INTO processed_events (event_id, processed_at) VALUES (?, ?)",
               event.id, now())
    db.commit()

# Pattern 2: Idempotent upsert (no dedup table needed)
function processAggregation(windowResult):
    db.execute("""
        INSERT INTO merchant_revenue (merchant_id, window_start, total, count)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (merchant_id, window_start)
        DO UPDATE SET total = EXCLUDED.total, count = EXCLUDED.count
    """, windowResult.merchantId, windowResult.windowStart,
         windowResult.total, windowResult.count)

Pattern 1 (dedup table) is the general-purpose approach — works for any event type. The dedup entry and the business effect are in the same database transaction, so they are atomically consistent.

Pattern 2 (idempotent upsert) is simpler but only works when the operation is naturally idempotent (SET, not INCREMENT). Aggregation results are idempotent — the same window reprocessed produces the same totals.
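
A self-contained sketch of Pattern 1, with SQLite standing in for the real sink database. The business effect here is an increment — exactly the kind of non-idempotent operation the dedup table exists to protect — and processing the same event twice leaves the totals untouched because the dedup insert and the increment share one transaction:

# dedup_demo.py — Pattern 1 in runnable form; SQLite stands in for the real sink database.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY, processed_at TEXT)")
db.execute("CREATE TABLE merchant_totals (merchant_id TEXT PRIMARY KEY, total REAL)")

def process_event(event):
    if db.execute("SELECT 1 FROM processed_events WHERE event_id = ?", (event["id"],)).fetchone():
        return "duplicate — skipped"
    with db:  # one transaction: business effect + dedup record commit (or roll back) together
        db.execute(
            "INSERT INTO merchant_totals (merchant_id, total) VALUES (?, ?) "
            "ON CONFLICT(merchant_id) DO UPDATE SET total = total + excluded.total",
            (event["merchant_id"], event["amount"]),
        )
        db.execute(
            "INSERT INTO processed_events (event_id, processed_at) VALUES (?, datetime('now'))",
            (event["id"],),
        )
    return "processed"

evt = {"id": "evt-1", "merchant_id": "m-42", "amount": 19.99}
print(process_event(evt))   # processed
print(process_event(evt))   # duplicate — skipped
print(db.execute("SELECT total FROM merchant_totals").fetchone())   # (19.99,) — not doubled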

Exactly-Once End-to-End: The Full Picture

| Component | Guarantee | Mechanism |
|---|---|---|
| Kafka producer | At-least-once (with acks=all) | Retries on broker failure |
| Kafka producer (idempotent mode) | Exactly-once within a session | Producer ID + sequence number |
| Flink processing | Exactly-once internal state | Checkpoint barriers + state snapshots |
| Sink (database) | Idempotent write | ON CONFLICT DO UPDATE or dedup table |
| End-to-end | Effectively exactly-once | Each layer handles its own guarantee; the chain is as strong as its weakest link |

Architecture Diagram

[Architecture diagram — forward and backfill paths, described below.]

Forward path (solid lines): Producers serialize events through the Schema Registry and publish to Kafka. Flink consumes, processes (windowing, aggregation, enrichment), and writes to purpose-built sinks. Checkpoints to S3 enable exactly-once recovery.

Backfill path (dashed): When reprocessing is needed, the batch exporter reads from the source database and publishes to a separate Kafka topic. Flink processes the backfill topic using the same transformation logic, ensuring consistency between forward and backfill paths.


Failure Scenarios

1. Schema Mismatch Causing Silent Data Loss

Timeline: The Order Service team adds a required field shipping_method to the Avro schema without a default value. They deploy the producer on Monday. The Flink consumer, running schema v3, encounters v4 messages on Tuesday. Avro deserialization fails. Flink's error handler routes the failed messages to a dead-letter topic.

Blast radius: All orders placed after Monday are silently routed to the dead-letter topic. Revenue dashboards show a cliff — zero new orders. The drop looks like a business problem, not a pipeline problem. It takes 6 hours before someone correlates the dead-letter queue growth with the missing data.

Detection: Dead-letter queue depth alert (threshold: >100 messages/min). Schema compatibility check failed in CI but was overridden with a manual approval. Consumer deserialization error rate spikes.

Recovery:

  1. Immediate: roll back the producer to schema v3, or register a v4 schema with a default value for shipping_method
  2. Reprocess: replay the dead-letter topic through the consumer after the schema fix
  3. Long-term: enforce FULL compatibility in the schema registry with no manual override. Block deploys that fail compatibility checks.

2. Checkpoint Timeouts from Unbounded State Growth

Timeline: Flink's RocksDB state grows to 200GB after a month of operation. Checkpoint duration increases from 5 seconds to 90 seconds. The 30-second checkpoint interval overlaps with the running checkpoint, causing checkpoint timeout. Flink restarts the job from the last successful checkpoint — which is 10 minutes old. Those 10 minutes of data are reprocessed.

Blast radius: Downstream sinks receive duplicate writes for the reprocessed 10-minute window. If sinks are not idempotent, aggregation counts are inflated. If sinks are idempotent (upsert), no data corruption — but the reprocessing consumes resources and delays new data.

Detection: Checkpoint duration metric exceeds threshold. Checkpoint failure rate increases. RocksDB state size grows beyond capacity plan.

Recovery:

  1. Immediate: increase checkpoint timeout to 120 seconds to stop the failure loop
  2. Short-term: add state TTL — expire entries older than the business-relevant window (e.g., 24 hours for daily aggregations)
  3. Long-term: use incremental checkpoints (only changed state since last checkpoint), partition state across more Flink task managers, or offload cold state to an external store

3. Consumer Lag Spiral During Traffic Spike

Timeline: Black Friday traffic is 5x normal. The Kafka topic ingests 500K events/sec. Flink consumers process at 100K events/sec. Consumer lag grows by 400K messages/sec. After 30 minutes, lag is 720M messages — 2 hours behind real-time.
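
The arithmetic behind those numbers is worth being able to do on the spot, because it also answers the inevitable follow-up: how long does catch-up take once you add capacity? A quick sanity check, with the scaled-out rate as an illustrative assumption:

# lag_math.py — backlog growth during the spike and catch-up time after scaling out.
ingest_rate   = 500_000      # events/sec during the spike
process_rate  = 100_000      # events/sec the under-provisioned job sustains
spike_seconds = 30 * 60

backlog = (ingest_rate - process_rate) * spike_seconds
print(f"backlog after 30 min: {backlog:,} messages")                                 # 720,000,000
print(f"time to drain at current rate if traffic stopped: {backlog / process_rate / 3600:.1f} h")  # 2.0 h

# After scaling to 600K events/sec while ingest stays at 500K, spare capacity is 100K/sec
scaled_rate = 600_000
catchup = backlog / (scaled_rate - ingest_rate)
print(f"catch-up after scaling to {scaled_rate:,}/sec: {catchup / 3600:.1f} h")       # 2.0 h

Unless the scaled-out capacity leaves substantial headroom above the ingest rate, draining the backlog still takes hours — the quantitative case for provisioning at peak rather than relying on reactive scaling.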

Blast radius: All downstream systems — dashboards, search indexes, leaderboards — show data that is 2 hours stale. Business decisions are made on stale data. Real-time fraud detection is compromised because events arrive too late for intervention.

Detection: Consumer group lag alert. Processing throughput vs ingestion throughput divergence. End-to-end latency (event timestamp vs processing timestamp) exceeds SLA.

Recovery:

  1. Immediate: scale Flink parallelism (add task managers) to match ingestion rate. Kafka partitions set the upper bound — you cannot have more Flink consumers than partitions
  2. Short-term: if partitions are the bottleneck, increase partition count for the topic (requires producer restart for new partitioning)
  3. Long-term: capacity plan for peak traffic, not average. If normal is 100K events/sec and Black Friday is 500K, provision Flink for 500K sustained, not 100K + auto-scaling (auto-scaling is too slow for burst absorption)
  4. Emergency: drop non-critical events (analytics, logging) and prioritize business-critical events (fraud detection, inventory updates) through separate topics with independent consumer groups

In the Wild

Pipeline patterns are easier to internalize when you see how they were applied under real production pressure. These are public, documented examples — not speculation.

LinkedIn: Kafka as the Universal Data Backbone

LinkedIn invented Kafka because their existing data pipelines — a tangle of point-to-point ETL jobs between dozens of systems — had become unmaintainable. The insight was to decouple producers from consumers via a durable, replayable log. Every system writes events to Kafka; every consuming system reads from Kafka at its own pace. LinkedIn processes trillions of messages per day through Kafka, feeding search indexes, recommendation engines, analytics dashboards, and compliance systems.

The Staff-level insight: LinkedIn's key architectural decision was making Kafka the single source of truth for all data movement, not just a message queue. This means any new system can bootstrap itself by consuming from Kafka from the beginning of the retention window — no custom export jobs, no one-off data dumps. The backfill path and the forward path are the same path. This is why LinkedIn sets Kafka retention to 7+ days even for high-volume topics: the cost of storage is trivial compared to the cost of building custom backfill infrastructure for every new consumer.

Uber: Streaming ETL with Schema Enforcement

Uber processes billions of events per day across thousands of microservices — trip events, pricing signals, driver locations, payment transactions. Their pipeline platform, Marmaray, ingests from Kafka and writes to multiple sinks (Hive for analytics, Elasticsearch for search, Redis for real-time). The critical innovation was enforcing Avro schemas at the pipeline boundary: every event must pass schema validation before being written to Kafka. Events that fail validation are routed to a quarantine topic for investigation.

The Staff-level insight: Uber learned the schema governance lesson the hard way. Before schema enforcement, a single producer team renaming a field could silently break dozens of downstream consumers. Their fix was organizational, not technical: every schema change goes through a review process, the Schema Registry enforces backward compatibility, and any breaking change requires explicit opt-in from all affected consumer teams. The pipeline became reliable not because of better technology, but because of better contracts between teams.

Spotify: Batch-First with Cloud Dataflow

Spotify processes petabytes of listening data for features like Discover Weekly, Wrapped, and royalty calculations. Rather than building a streaming pipeline for these analytical workloads, they use Google Cloud Dataflow (Apache Beam) with a batch-first approach. Listening events are buffered in Pub/Sub and processed in hourly micro-batches. The key insight: Discover Weekly updates once a week, Wrapped updates once a year, and royalty calculations have contractual latency windows of hours to days. Sub-second pipeline latency would be wasted engineering.

The Staff-level insight: Spotify's batch-first approach is a deliberate rejection of the "stream everything" trend. Their pipeline latency is measured in hours, but their pipeline reliability is measured in 9s — because batch jobs are simpler to retry, simpler to backfill, and simpler to debug than stateful stream processors. The lesson: match the pipeline architecture to the actual latency requirement, not to the most impressive technology available.


Practice Drill

The drill: a 10M-row product catalog lives in PostgreSQL and powers search through Elasticsearch. Keep the search index within 30 seconds of any catalog change, at a steady state of roughly 5K updates/hour, with occasional bulk imports of 500K products.

Staff-Caliber Answer Shape
  1. Classify the workload. Two modes: steady-state (5K updates/hour ≈ 1.4 updates/sec) and bulk import (500K updates ≈ 140/sec if spread over an hour). The 30-second SLA rules out hourly batch but does not require sub-second streaming. CDC with micro-batching is the sweet spot.

  2. Source: CDC from PostgreSQL. Use Debezium to capture changes from PostgreSQL's WAL. Debezium publishes row-level changes (INSERT, UPDATE, DELETE) to Kafka topics. No changes needed to the product catalog application — CDC is transparent to the producer.

  3. Event bus: Kafka with schema registry. Topic product.changes, partitioned by product_id for ordering. Avro schema with BACKWARD compatibility. Retention: 7 days (covers weekly backfills). At 140 events/sec peak, a single Kafka partition handles this easily — use 4 partitions for headroom.

  4. Processing: Flink micro-batch to Elasticsearch. Flink consumes from Kafka, transforms the database row format to the Elasticsearch document format (denormalize category names, compute searchable fields), and bulk-indexes to Elasticsearch every 10 seconds. Bulk API with 1000 documents per batch — amortizes the Elasticsearch indexing overhead.

  5. Exactly-once: idempotent by design. Each Elasticsearch document is identified by product_id. An upsert (index with the same _id) is naturally idempotent — reprocessing the same Kafka message produces the same document. No dedup table needed (see the bulk-upsert sketch after this answer shape).

  6. Backfill: full catalog reindex. For bulk imports or corruption recovery, run a batch export from PostgreSQL to the Kafka backfill topic. The same Flink job processes it. For the initial 10M product load: 10M × 1KB average = 10GB. At 10K documents/sec bulk indexing rate, full reindex takes ~17 minutes.

  7. Monitor: end-to-end latency. Track the time between PostgreSQL WAL entry and Elasticsearch document availability. Target: p95 < 30 seconds. Alert on: Debezium lag > 10s, Kafka consumer lag > 1000 messages, Elasticsearch bulk indexing errors.

The Staff move: Show that CDC eliminates the need for application-level event publishing — the database is the event source. This decouples the search pipeline from the product catalog team entirely.
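
A sketch of the idempotent bulk indexing from step 5, assuming the official elasticsearch Python client; the index name and document shape are illustrative:

# es_bulk_upsert.py — idempotent bulk indexing: the document _id is the product_id,
# so reprocessing the same Kafka messages rewrites the same documents.
# Assumes the official elasticsearch Python client; names are illustrative.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def index_batch(products):
    actions = (
        {
            "_op_type": "index",              # full overwrite — naturally idempotent
            "_index": "products",
            "_id": p["product_id"],           # stable identity across reprocessing
            "_source": {
                "name": p["name"],
                "category": p["category_name"],   # denormalized during the Flink transform
                "price": p["price"],
            },
        }
        for p in products
    )
    ok, errors = helpers.bulk(es, actions, chunk_size=1000, raise_on_error=False)
    return ok, errors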


Staff Interview Application

How to Introduce This Pattern

Lead with the latency requirement, then name the backfill path and the schema contract as non-negotiable. This signals that you have operated pipelines in production, not just built them in a tutorial.

When NOT to Use This Pattern

  • Data fits in memory: If the entire dataset is <1GB, a single-node script with good error handling beats a Flink cluster. Don't add distributed infrastructure for a problem that fits in pandas.
  • Batch interval matches the SLA: If the business needs data refreshed every hour, a cron job with psql COPY is the right answer. Stream processing for hourly updates is over-engineering.
  • No replay requirement: If the data is ephemeral and reprocessing is not needed (metrics aggregation with acceptable loss), a simpler push-based pipeline without Kafka's durability overhead is sufficient.
  • Team cannot operate the stack: Flink requires JVM tuning, checkpoint management, and state backend expertise. If the team does not have this, use a managed service (AWS Kinesis Data Analytics, Google Dataflow) or stick with batch.

Follow-Up Questions to Anticipate

| Interviewer Asks | What They Are Testing | How to Respond |
|---|---|---|
| "How do you handle late-arriving data?" | Event-time vs processing-time understanding | "Event-time windowing with watermarks. Late events within the allowed lateness period refire the window. Events beyond the allowed lateness go to a side output for separate processing." |
| "What happens when the pipeline breaks?" | Operational maturity | "Flink restarts from the last checkpoint — no data loss because Kafka retains events. The sink uses idempotent upserts so reprocessed data doesn't create duplicates." |
| "How do you backfill?" | Production readiness | "Kafka retention covers 7 days. For deeper backfills, we export from the source DB to a backfill topic and run the same Flink job against it. Same code, different input." |
| "How do you handle schema changes?" | Data governance awareness | "Schema Registry with FULL compatibility enforcement. New fields must have defaults. Producers that fail compatibility checks cannot deploy." |
| "Why not just use database triggers?" | Understanding of coupling and scale | "Triggers couple the pipeline to the database engine, don't scale horizontally, and can't be replayed. CDC via Debezium gives us the same change stream without the coupling." |
