Data Pipeline Patterns — Cross-Cutting Pattern
The Problem
Data needs to flow from producers to consumers, with transformations along the way. The fundamental choice is batch vs stream — and the answer depends on latency requirements, exactly-once needs, and the cost of the backfill you'll inevitably need to run. Every team gets this wrong at least once, usually by picking the architecture that sounds impressive rather than the one that matches their actual SLA.
Playbooks That Use This Pattern
- Stream Processing — Real-time event aggregation
- Web Crawling & Data Pipelines — Large-scale data ingestion
- Leaderboard & Counting — Score aggregation pipelines
- Search & Indexing — Index build and update pipelines
- Notification & Delivery — Event-driven notification routing
The Core Tradeoff
| Strategy | What Works | What Breaks | Who Pays |
|---|---|---|---|
| Batch (scheduled) | Simple, retryable, idempotent by design | Latency floor is the batch interval; late-arriving data causes silent gaps | On-call when a 4-hour job fails at hour 3 |
| Micro-batch | Compromise between batch simplicity and stream latency | Still not real-time; window semantics leak into business logic | Platform team maintaining the scheduling layer |
| Stream processing | Sub-second latency, natural event-driven fit | Exactly-once is hard; state management is complex; debugging is painful | Every engineer who touches the topology |
| Change Data Capture (CDC) | Captures mutations at the source; no producer changes needed | Couples pipeline to source schema; binlog formats vary across databases | The team that owns the source when they want to migrate |
| Lambda architecture | Run both batch and stream — correctness plus latency | Two codepaths to maintain, two sets of bugs, drift between them | Everyone, forever |
Staff Default Position
Staff default: stream processing for latency-sensitive paths, batch for analytical workloads. Always design backfill capability from day one — you will need it. If someone proposes a pipeline and cannot answer "how do we reprocess the last N days?", that pipeline is not production-ready.
Schema evolution is the silent killer. Pipelines break when producers change schema without coordination. Versioning is not a technical problem — it is a governance problem. Staff engineers enforce schema registries and backward-compatibility contracts before the first event is published.
When to Deviate
- Prototype or internal tooling — Batch with cron is fine. Do not build Kafka infrastructure for a daily CSV export.
- Regulatory or audit requirements — Lambda architecture earns its operational cost when you need both real-time alerting and provably correct batch reconciliation.
- Tiny data volumes — If your entire dataset fits in memory, skip the distributed framework. A single-node script with good error handling beats a Flink cluster you cannot debug.
- Cross-region replication — CDC becomes the only sane option when you need to replicate state across regions without rewriting producer logic.
Common Interview Mistakes
| What Candidates Say | What Interviewers Hear | What Staff Engineers Say |
|---|---|---|
| "We'll use Kafka for everything" | Cargo-culting a tool without understanding the tradeoff | "Kafka for the event bus, but batch ETL for the warehouse backfill" |
| "Exactly-once is guaranteed" | Has not operated a stream processor under failure conditions | "We design for at-least-once and make consumers idempotent" |
| "We'll add streaming later" | Backfill strategy is an afterthought | "The replay path is designed before the forward path ships" |
| "Schema changes are handled by the consumer" | No concept of data contracts or governance | "Producers own backward compatibility; we enforce it at the registry" |
| "Lambda architecture solves everything" | Willing to double operational burden without quantifying the benefit | "We run dual paths only where the correctness-latency gap justifies the cost" |
Implementation Deep Dive
1. Kafka → Flink → Sink — The Canonical Stream Pipeline
The dominant production pattern for sub-second data pipelines is: Kafka as the durable event bus, Flink for stateful stream processing, and a purpose-built sink (database, search index, cache) at the end. Each layer has a distinct failure mode.
End-to-End Pipeline Pseudocode
# ── Producer: application emits business events ──
function onOrderPlaced(order):
    event = {
        "event_type": "order.placed",
        "order_id": order.id,
        "merchant_id": order.merchantId,
        "amount": order.total,
        "timestamp": now(),
        "schema_version": 3
    }
    kafka.produce(
        topic = "orders.events",
        key = order.merchantId,  # Partition by merchant for ordering
        value = avroSerialize(event, "orders.events-value", version=3),
        acks = "all"             # Wait for all in-sync replicas
    )
# ── Flink job: stateful stream transformation ──
env = StreamExecutionEnvironment.getExecutionEnvironment()
env.enableCheckpointing(30_000)                  # Checkpoint every 30 seconds
env.setStateBackend(RocksDBStateBackend("/checkpoints"))

revenuePerMinute = env
    .addSource(FlinkKafkaConsumer("orders.events", avroSchema, kafkaProps))
    .assignTimestampsAndWatermarks(boundedOutOfOrderness(seconds(30)))  # Event-time windows need watermarks
    .keyBy(event -> event.merchantId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(new OrderAggregator())            # Sum amounts per merchant per minute

# ── Sink: write aggregated results ──
revenuePerMinute.addSink(new JdbcSink(
    "INSERT INTO merchant_revenue_1min (merchant_id, window_start, total_amount, order_count) " +
    "VALUES (?, ?, ?, ?) " +
    "ON CONFLICT (merchant_id, window_start) DO UPDATE SET total_amount = EXCLUDED.total_amount, " +
    "order_count = EXCLUDED.order_count",        # Idempotent upsert
    connectionConfig
))
Why each component:
- Kafka (source): Durable, replayable, partitioned. If Flink crashes, it restarts from the last committed offset. No data loss.
- Flink (processing): Exactly-once state management via checkpoints. Handles event-time windowing, late data, watermarks. RocksDB backend for state that exceeds memory.
- JDBC sink (destination): Idempotent upsert (ON CONFLICT DO UPDATE) ensures reprocessed windows don't create duplicate rows.
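The per-merchant ordering that keying by `order.merchantId` buys can be shown with a toy partitioner. Kafka's default partitioner hashes keys with murmur2; the SHA-256 hash below is a stand-in with the same relevant property (deterministic key-to-partition mapping):

```python
import hashlib

def partition_for(key, num_partitions):
    """Deterministic key -> partition mapping. Kafka's default partitioner
    uses murmur2; SHA-256 here is a stand-in with the same property."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events for one merchant hash to one partition, so per-merchant
# ordering is preserved; unrelated merchants may share a partition.
events = [("m-1", "order.placed"), ("m-2", "order.placed"), ("m-1", "order.paid")]
assignments = [partition_for(merchant, 12) for merchant, _ in events]
assert assignments[0] == assignments[2]  # same merchant, same partition
```

This is why the windowed aggregation downstream can key by merchant and trust that it sees each merchant's events in produce order.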
Pipeline Monitoring — What to Track
| Metric | Target | Alert Threshold | Why |
|---|---|---|---|
| Consumer lag (messages) | <1,000 | >10,000 sustained 5 min | Pipeline falling behind input rate |
| Checkpoint duration | <5s | >30s | State too large or I/O pressure on checkpoint storage |
| Event-time skew | <30s | >2 min | Clock drift or partition imbalance; late events will be dropped |
| Backpressure (Flink) | <50% | >80% sustained | Processing slower than input; need more parallelism or optimization |
| Records in / records out ratio | ~1.0 for pass-through, stable for aggregation | Sudden change | Logic bug or schema mismatch silently dropping records |
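The "sustained 5 min" qualifier in the lag row matters: a naive threshold alert flaps on short spikes. A minimal sketch of a sustained-breach check (sample shape and thresholds are illustrative, not a real monitoring API):

```python
def sustained_breach(samples, threshold, window_s):
    """samples: list of (timestamp_seconds, lag_messages), in time order.
    True once lag has stayed above `threshold` for at least `window_s`."""
    breach_start = None
    for ts, lag in samples:
        if lag > threshold:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= window_s:
                return True
        else:
            breach_start = None   # dipped below the threshold: reset the clock
    return False

# Lag sampled once a minute: >10K for six minutes should page,
# a two-minute spike should not (thresholds from the table above).
steady = [(t, 15_000) for t in range(0, 420, 60)]
spike = [(0, 15_000), (60, 15_000), (120, 800)]
assert sustained_breach(steady, threshold=10_000, window_s=300)
assert not sustained_breach(spike, threshold=10_000, window_s=300)
```

The hysteresis keeps on-call from being paged on a one-minute burst while still catching a genuine lag spiral.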
2. Backfill Strategy — Replay from Source of Truth
Every pipeline breaks. Schema changes, bugs, infrastructure failures — you will need to reprocess historical data. If you cannot answer "how do we reprocess the last 7 days?" on day one, the pipeline is not production-ready.
Backfill Architecture
# Dual-path pipeline: forward (real-time) + backfill (batch)
#
# Forward path (normal operation):
# Kafka topic → Flink job → Sink
#
# Backfill path (recovery):
# Source DB → Batch exporter → Kafka backfill topic → Same Flink job → Sink
function triggerBackfill(startDate, endDate):
    # Step 1 — Pause the forward pipeline to avoid conflicts
    flink.savepoint("orders-pipeline")
    flink.stop("orders-pipeline")

    # Step 2 — Export source data to backfill topic
    batchExporter.run(
        query = "SELECT * FROM orders WHERE created_at BETWEEN ? AND ?",
        params = [startDate, endDate],
        outputTopic = "orders.events.backfill",
        batchSize = 10_000
    )

    # Step 3 — Run the same Flink job against the backfill topic
    flink.run("orders-pipeline",
        source = "orders.events.backfill",
        startOffset = "earliest"
    )

    # Step 4 — Wait for backfill completion, then switch back
    flink.waitForCompletion("orders-pipeline")
    flink.run("orders-pipeline",
        source = "orders.events",
        startOffset = savedCheckpoint
    )
Backfill Tradeoffs
| Strategy | Complexity | Downtime | Cost | Best For |
|---|---|---|---|---|
| Kafka offset reset | Low — restart consumer from earlier offset | None (run in parallel) | Kafka storage for retention window | Small backfills, <7 days |
| Separate backfill topic | Medium — batch export + replay | Forward pipeline paused | Compute for batch export + reprocessing | Large backfills, schema changes |
| Dual-write during migration | High — run old and new pipeline in parallel | None | 2x compute during overlap | Zero-downtime migrations |
| Full rebuild from source | High — re-export entire dataset | Forward pipeline paused | Full reprocessing cost | Catastrophic corruption, new pipeline versions |
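The table above collapses into a rough decision helper. This is a sketch of the selection logic only; the strategy labels and the 7-day retention default are illustrative, not a real API:

```python
def backfill_strategy(days, schema_changed, zero_downtime_required,
                      corrupted=False, retention_days=7):
    """Map a backfill scenario onto the strategies in the tradeoff table."""
    if corrupted:
        return "full-rebuild"            # catastrophic corruption: re-export everything
    if zero_downtime_required and schema_changed:
        return "dual-write"              # run old and new pipelines in parallel
    if days <= retention_days and not schema_changed:
        return "offset-reset"            # replay within Kafka's retention window
    return "backfill-topic"              # batch export + replay through the same job

assert backfill_strategy(3, False, False) == "offset-reset"
assert backfill_strategy(30, True, False) == "backfill-topic"
assert backfill_strategy(30, True, True) == "dual-write"
assert backfill_strategy(1, False, False, corrupted=True) == "full-rebuild"
```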
3. Schema Evolution with Avro + Schema Registry
Schema changes are the #1 cause of pipeline outages. A producer adds a field; a consumer does not expect it; deserialization fails; the pipeline stops. Schema Registry enforces compatibility rules so that changes are safe by default.
Schema Evolution Rules
# Avro schema for order events (v3)
{
    "type": "record",
    "name": "OrderEvent",
    "namespace": "com.example.orders",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "merchant_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "timestamp", "type": "long"},
        {"name": "currency", "type": "string", "default": "USD"},        # Added in v2
        {"name": "channel", "type": ["null", "string"], "default": null} # Added in v3
    ]
}
Compatibility rules enforced at the registry:
| Change Type | BACKWARD Compatible? | FORWARD Compatible? | FULL Compatible? |
|---|---|---|---|
| Add field with default | Yes | Yes | Yes |
| Add field without default | No — old consumers crash | Yes | No |
| Remove field with default | Yes | No — old producers crash | No |
| Rename field | No | No | No |
| Change field type | No (except promotions: int→long) | No | No |
Schema Registry in the Pipeline
# Producer: serialize with schema registry
function produce(event):
    schemaId = registry.getSchemaId("orders.events-value", version="latest")
    # Confluent wire format: 1 magic byte + 4-byte schema ID, then the Avro payload
    payload = avroSerialize(event, schemaId)
    kafka.produce("orders.events", key=event.merchantId, value=payload)

# Consumer: deserialize with schema registry
function consume(record):
    schemaId = extractSchemaId(record.value)    # Read from the 5-byte header
    writerSchema = registry.getSchema(schemaId) # Schema used by producer
    readerSchema = localSchema                  # Schema expected by consumer
    # Avro resolves field differences using default values
    event = avroDeserialize(record.value, writerSchema, readerSchema)
    process(event)
The consumer uses both the writer's schema (embedded in the message) and its own reader schema. Avro automatically handles field additions and removals using default values. This is why defaults are mandatory for safe evolution.
4. Exactly-Once with Idempotent Consumers
Distributed systems cannot guarantee exactly-once delivery. The practical approach is at-least-once delivery with idempotent processing — the same message processed twice produces the same result.
Idempotent Consumer Patterns
# Pattern 1: Deduplication table
function processEvent(event):
    # Fast-path check: skip events we have already seen
    existing = db.query(
        "SELECT 1 FROM processed_events WHERE event_id = ?", event.id)
    if existing:
        metrics.increment("pipeline.dedup.hit")
        return  # Already processed — skip

    db.beginTransaction()
    # Process the event
    applyBusinessLogic(event)
    # Record processing in the same transaction. A unique constraint on
    # event_id makes a concurrent duplicate abort here and roll back.
    db.execute("INSERT INTO processed_events (event_id, processed_at) VALUES (?, ?)",
               event.id, now())
    db.commit()

# Pattern 2: Idempotent upsert (no dedup table needed)
function processAggregation(windowResult):
    db.execute("""
        INSERT INTO merchant_revenue (merchant_id, window_start, total, count)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (merchant_id, window_start)
        DO UPDATE SET total = EXCLUDED.total, count = EXCLUDED.count
    """, windowResult.merchantId, windowResult.windowStart,
         windowResult.total, windowResult.count)
Pattern 1 (dedup table) is the general-purpose approach — works for any event type. The dedup entry and the business effect are in the same database transaction, so they are atomically consistent.
Pattern 2 (idempotent upsert) is simpler but only works when the operation is naturally idempotent (SET, not INCREMENT). Aggregation results are idempotent — the same window reprocessed produces the same totals.
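Pattern 2 can be demonstrated end to end with an embedded database. The sketch below uses SQLite (which supports `ON CONFLICT ... DO UPDATE` since 3.24) as a stand-in for the production sink; table and column names mirror the pseudocode above:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE merchant_revenue (
        merchant_id TEXT, window_start INTEGER, total REAL, count INTEGER,
        PRIMARY KEY (merchant_id, window_start))
""")

def write_window(merchant_id, window_start, total, count):
    # Replaying the same window overwrites instead of duplicating.
    db.execute("""
        INSERT INTO merchant_revenue (merchant_id, window_start, total, count)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (merchant_id, window_start)
        DO UPDATE SET total = excluded.total, count = excluded.count
    """, (merchant_id, window_start, total, count))
    db.commit()

write_window("m-1", 1700000000, 42.0, 3)
write_window("m-1", 1700000000, 42.0, 3)   # duplicate delivery, same result

rows = db.execute("SELECT * FROM merchant_revenue").fetchall()
assert rows == [("m-1", 1700000000, 42.0, 3)]
```

Reprocessing the window any number of times leaves exactly one row with the same totals, which is the whole idempotency argument.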
Exactly-Once End-to-End: The Full Picture
| Component | Guarantee | Mechanism |
|---|---|---|
| Kafka producer | At-least-once (with acks=all) | Retries on broker failure |
| Kafka producer (idempotent mode) | Exactly-once within a session | Producer ID + sequence number |
| Flink processing | Exactly-once internal state | Checkpoint barriers + state snapshots |
| Sink (database) | Idempotent write | ON CONFLICT DO UPDATE or dedup table |
| End-to-end | Effectively exactly-once | Each layer handles its own guarantee; the chain is as strong as its weakest link |
Architecture Diagram
Forward path (solid lines): Producers serialize events through the Schema Registry and publish to Kafka. Flink consumes, processes (windowing, aggregation, enrichment), and writes to purpose-built sinks. Checkpoints to S3 enable exactly-once recovery.
Backfill path (dashed): When reprocessing is needed, the batch exporter reads from the source database and publishes to a separate Kafka topic. Flink processes the backfill topic using the same transformation logic, ensuring consistency between forward and backfill paths.
Failure Scenarios
1. Schema Mismatch Causing Silent Data Loss
Timeline: The Order Service team adds a required field shipping_method to the Avro schema without a default value. They deploy the producer on Monday. The Flink consumer, running schema v3, encounters v4 messages on Tuesday. Avro deserialization fails. Flink's error handler routes the failed messages to a dead-letter topic.
Blast radius: All orders placed after Monday are silently routed to the dead-letter topic. Revenue dashboards show a cliff — zero new orders. The drop looks like a business problem, not a pipeline problem. It takes 6 hours before someone correlates the dead-letter queue growth with the missing data.
Detection: Dead-letter queue depth alert (threshold: >100 messages/min). Schema compatibility check failed in CI but was overridden with a manual approval. Consumer deserialization error rate spikes.
Recovery:
- Immediate: roll back the producer to schema v3, or register a v4 schema with a default value for shipping_method
- Reprocess: replay the dead-letter topic through the consumer after the schema fix
- Long-term: enforce FULL compatibility in the schema registry with no manual override. Block deploys that fail compatibility checks.
2. Flink Checkpoint Timeout Causing Reprocessing Cascade
Timeline: Flink's RocksDB state grows to 200GB after a month of operation. Checkpoint duration increases from 5 seconds to 90 seconds. The 30-second checkpoint interval overlaps with the running checkpoint, causing checkpoint timeout. Flink restarts the job from the last successful checkpoint — which is 10 minutes old. Those 10 minutes of data are reprocessed.
Blast radius: Downstream sinks receive duplicate writes for the reprocessed 10-minute window. If sinks are not idempotent, aggregation counts are inflated. If sinks are idempotent (upsert), no data corruption — but the reprocessing consumes resources and delays new data.
Detection: Checkpoint duration metric exceeds threshold. Checkpoint failure rate increases. RocksDB state size grows beyond capacity plan.
Recovery:
- Immediate: increase checkpoint timeout to 120 seconds to stop the failure loop
- Short-term: add state TTL — expire entries older than the business-relevant window (e.g., 24 hours for daily aggregations)
- Long-term: use incremental checkpoints (only changed state since last checkpoint), partition state across more Flink task managers, or offload cold state to an external store
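The state-TTL fix amounts to evicting entries older than the business window. Flink ships StateTtlConfig for this natively; the class below only illustrates the idea (eviction on read, so state stops growing unboundedly):

```python
class TtlState:
    """Per-key state with TTL eviction on read — an illustration of the
    state-TTL idea, not Flink's implementation."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}              # key -> (value, last_write_time)

    def put(self, key, value, now):
        self._store[key] = (value, now)

    def get(self, key, now):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, written = entry
        if now - written > self.ttl:
            del self._store[key]      # expired: evict so state stops growing
            return None
        return value

state = TtlState(ttl_seconds=86_400)           # 24h, matching daily aggregations
state.put("m-1", 42.0, now=0)
assert state.get("m-1", now=3_600) == 42.0     # still within the window
assert state.get("m-1", now=100_000) is None   # expired and evicted
```

Timestamps are passed explicitly here to keep the sketch deterministic; a real implementation would read the clock (or better, the watermark).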
3. Consumer Lag Spiral During Traffic Spike
Timeline: Black Friday traffic is 5x normal. The Kafka topic ingests 500K events/sec. Flink consumers process at 100K events/sec. Consumer lag grows by 400K messages/sec. After 30 minutes, lag is 720M messages — 2 hours behind real-time.
Blast radius: All downstream systems — dashboards, search indexes, leaderboards — show data that is 2 hours stale. Business decisions are made on stale data. Real-time fraud detection is compromised because events arrive too late for intervention.
Detection: Consumer group lag alert. Processing throughput vs ingestion throughput divergence. End-to-end latency (event timestamp vs processing timestamp) exceeds SLA.
Recovery:
- Immediate: scale Flink parallelism (add task managers) to match ingestion rate. Kafka partitions set the upper bound — you cannot have more Flink consumers than partitions
- Short-term: if partitions are the bottleneck, increase partition count for the topic (requires producer restart for new partitioning)
- Long-term: capacity plan for peak traffic, not average. If normal is 100K events/sec and Black Friday is 500K, provision Flink for 500K sustained, not 100K + auto-scaling (auto-scaling is too slow for burst absorption)
- Emergency: drop non-critical events (analytics, logging) and prioritize business-critical events (fraud detection, inventory updates) through separate topics with independent consumer groups
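The arithmetic behind "scale to match ingestion rate" is worth having at your fingertips. A sketch of the catch-up calculation, using the Black Friday numbers from the timeline above:

```python
def time_to_drain(lag_messages, process_rate, ingest_rate):
    """Seconds until the consumer group catches up, or infinity if the
    processing rate never exceeds the ingestion rate."""
    net = process_rate - ingest_rate
    if net <= 0:
        return float("inf")           # lag only grows; scale out or shed load
    return lag_messages / net

# From the timeline: 720M messages behind, ingesting 500K/s, processing 100K/s.
assert time_to_drain(720_000_000, 100_000, 500_000) == float("inf")
# Scaling Flink to process 900K/s drains the backlog in 30 minutes.
assert time_to_drain(720_000_000, 900_000, 500_000) == 1800.0
```

The key observation: drain time depends on the net rate, so "slightly faster than ingest" means hours of catch-up, which is why the recovery above says to provision for peak rather than average.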
Staff Interview Application
How to Introduce This Pattern
Lead with the latency requirement, then name backfill capability and the schema contract as non-negotiable. This signals that you've operated pipelines in production, not just built them in a tutorial.
When NOT to Use This Pattern
- Data fits in memory: If the entire dataset is <1GB, a single-node script with good error handling beats a Flink cluster. Don't add distributed infrastructure for a problem that fits in pandas.
- Batch interval matches the SLA: If the business needs data refreshed every hour, a cron job with psql COPY is the right answer. Stream processing for hourly updates is over-engineering.
- No replay requirement: If the data is ephemeral and reprocessing is not needed (metrics aggregation with acceptable loss), a simpler push-based pipeline without Kafka's durability overhead is sufficient.
- Team cannot operate the stack: Flink requires JVM tuning, checkpoint management, and state backend expertise. If the team does not have this, use a managed service (AWS Kinesis Data Analytics, Google Dataflow) or stick with batch.
Follow-Up Questions to Anticipate
| Interviewer Asks | What They Are Testing | How to Respond |
|---|---|---|
| "How do you handle late-arriving data?" | Event-time vs processing-time understanding | "Event-time windowing with watermarks. Late events within the allowed lateness period refire the window. Events beyond the allowed lateness go to a side output for separate processing." |
| "What happens when the pipeline breaks?" | Operational maturity | "Flink restarts from the last checkpoint — no data loss because Kafka retains events. The sink uses idempotent upserts so reprocessed data doesn't create duplicates." |
| "How do you backfill?" | Production readiness | "Kafka retention covers 7 days. For deeper backfills, we export from the source DB to a backfill topic and run the same Flink job against it. Same code, different input." |
| "How do you handle schema changes?" | Data governance awareness | "Schema Registry with FULL compatibility enforcement. New fields must have defaults. Producers that fail compatibility checks cannot deploy." |
| "Why not just use database triggers?" | Understanding of coupling and scale | "Triggers couple the pipeline to the database engine, don't scale horizontally, and can't be replayed. CDC via Debezium gives us the same change stream without the coupling." |