Data Pipeline Patterns — Cross-Cutting Pattern
The Problem
Data needs to flow from producers to consumers, with transformations along the way. The fundamental choice is batch vs stream — and the answer depends on latency requirements, exactly-once needs, and the cost of the backfill you'll inevitably need to run. Every team gets this wrong at least once, usually by picking the architecture that sounds impressive rather than the one that matches their actual SLA.
Playbooks That Use This Pattern
- Stream Processing — Real-time event aggregation
- Web Crawling & Data Pipelines — Large-scale data ingestion
- Leaderboard & Counting — Score aggregation pipelines
- Search & Indexing — Index build and update pipelines
- Notification & Delivery — Event-driven notification routing
The Core Tradeoff
| Strategy | What Works | What Breaks | Who Pays |
|---|---|---|---|
| Batch (scheduled) | Simple, retryable, idempotent by design | Latency floor is the batch interval; late-arriving data causes silent gaps | On-call when a 4-hour job fails at hour 3 |
| Micro-batch | Compromise between batch simplicity and stream latency | Still not real-time; window semantics leak into business logic | Platform team maintaining the scheduling layer |
| Stream processing | Sub-second latency, natural event-driven fit | Exactly-once is hard; state management is complex; debugging is painful | Every engineer who touches the topology |
| Change Data Capture (CDC) | Captures mutations at the source; no producer changes needed | Couples pipeline to source schema; binlog formats vary across databases | The team that owns the source when they want to migrate |
| Lambda architecture | Run both batch and stream — correctness plus latency | Two codepaths to maintain, two sets of bugs, drift between them | Everyone, forever |
Staff Default Position
Staff default: stream processing for latency-sensitive paths, batch for analytical workloads. Always design backfill capability from day one — you will need it. If someone proposes a pipeline and cannot answer "how do we reprocess the last N days?", that pipeline is not production-ready.
Schema evolution is the silent killer. Pipelines break when producers change schema without coordination. Versioning is not a technical problem — it is a governance problem. Staff engineers enforce schema registries and backward-compatibility contracts before the first event is published.
When to Deviate
- Prototype or internal tooling — Batch with cron is fine. Do not build Kafka infrastructure for a daily CSV export.
- Regulatory or audit requirements — Lambda architecture earns its operational cost when you need both real-time alerting and provably correct batch reconciliation.
- Tiny data volumes — If your entire dataset fits in memory, skip the distributed framework. A single-node script with good error handling beats a Flink cluster you cannot debug.
- Cross-region replication — CDC becomes the only sane option when you need to replicate state across regions without rewriting producer logic.
Common Interview Mistakes
| What Candidates Say | What Interviewers Hear | What Staff Engineers Say |
|---|---|---|
| "We'll use Kafka for everything" | Cargo-culting a tool without understanding the tradeoff | "Kafka for the event bus, but batch ETL for the warehouse backfill" |
| "Exactly-once is guaranteed" | Has not operated a stream processor under failure conditions | "We design for at-least-once and make consumers idempotent" |
| "We'll add streaming later" | Backfill strategy is an afterthought | "The replay path is designed before the forward path ships" |
| "Schema changes are handled by the consumer" | No concept of data contracts or governance | "Producers own backward compatibility; we enforce it at the registry" |
| "Lambda architecture solves everything" | Willing to double operational burden without quantifying the benefit | "We run dual paths only where the correctness-latency gap justifies the cost" |
Implementation Deep Dive
1. Kafka → Flink → Sink — The Canonical Stream Pipeline
The dominant production pattern for sub-second data pipelines is: Kafka as the durable event bus, Flink for stateful stream processing, and a purpose-built sink (database, search index, cache) at the end. Each layer has a distinct failure mode.
End-to-End Pipeline Pseudocode
# ── Producer: application emits business events ──
function onOrderPlaced(order):
    event = {
        "event_type": "order.placed",
        "order_id": order.id,
        "merchant_id": order.merchantId,
        "amount": order.total,
        "timestamp": now(),
        "schema_version": 3
    }
    kafka.produce(
        topic = "orders.events",
        key = order.merchantId,  # Partition by merchant for ordering
        value = avroSerialize(event, "orders.events-value", version=3),
        acks = "all"             # Wait for all in-sync replicas
    )
# ── Flink job: stateful stream transformation ──
env = StreamExecutionEnvironment.getExecutionEnvironment()
env.enableCheckpointing(30_000)                  # Checkpoint every 30 seconds
env.setStateBackend(RocksDBStateBackend("/checkpoints"))

revenuePerMinute = env
    .addSource(FlinkKafkaConsumer("orders.events", avroSchema, kafkaProps))
    .assignTimestampsAndWatermarks(boundedOutOfOrderness(seconds(30)))  # Event-time windows need watermarks
    .keyBy(event -> event.merchantId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(new OrderAggregator())            # Sum amounts per merchant per minute

# ── Sink: write aggregated results ──
revenuePerMinute.addSink(new JdbcSink(
    "INSERT INTO merchant_revenue_1min (merchant_id, window_start, total_amount, order_count) " +
    "VALUES (?, ?, ?, ?) " +
    "ON CONFLICT (merchant_id, window_start) DO UPDATE SET total_amount = EXCLUDED.total_amount, " +
    "order_count = EXCLUDED.order_count",        # Idempotent upsert
    connectionConfig
))
Why each component:
- Kafka (source): Durable, replayable, partitioned. If Flink crashes, it restarts from the last committed offset. No data loss.
- Flink (processing): Exactly-once state management via checkpoints. Handles event-time windowing, late data, watermarks. RocksDB backend for state that exceeds memory.
- JDBC sink (destination): Idempotent upsert (ON CONFLICT DO UPDATE) ensures reprocessed windows don't create duplicate rows.
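The per-merchant ordering that keying by `order.merchantId` buys can be shown with a toy partitioner. Kafka's default partitioner hashes keys with murmur2; the SHA-256 hash below is a stand-in with the same relevant property (deterministic key-to-partition mapping):

```python
import hashlib

def partition_for(key, num_partitions):
    """Deterministic key -> partition mapping. Kafka's default partitioner
    uses murmur2; SHA-256 here is a stand-in with the same property."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events for one merchant hash to one partition, so per-merchant
# ordering is preserved; unrelated merchants may share a partition.
events = [("m-1", "order.placed"), ("m-2", "order.placed"), ("m-1", "order.paid")]
assignments = [partition_for(merchant, 12) for merchant, _ in events]
assert assignments[0] == assignments[2]  # same merchant, same partition
```

This is why the windowed aggregation downstream can key by merchant and trust that it sees each merchant's events in produce order.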
Pipeline Monitoring — What to Track
| Metric | Target | Alert Threshold | Why |
|---|---|---|---|
| Consumer lag (messages) | <1,000 | >10,000 sustained 5 min | Pipeline falling behind input rate |
| Checkpoint duration | <5s | >30s | State too large or I/O pressure on checkpoint storage |
| Event-time skew | <30s | >2 min | Clock drift or partition imbalance; late events will be dropped |
| Backpressure (Flink) | <50% | >80% sustained | Processing slower than input; need more parallelism or optimization |
| Records in / records out ratio | ~1.0 for pass-through, stable for aggregation | Sudden change | Logic bug or schema mismatch silently dropping records |
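The "sustained 5 min" qualifier in the lag row matters: a naive threshold alert flaps on short spikes. A minimal sketch of a sustained-breach check (sample shape and thresholds are illustrative, not a real monitoring API):

```python
def sustained_breach(samples, threshold, window_s):
    """samples: list of (timestamp_seconds, lag_messages), in time order.
    True once lag has stayed above `threshold` for at least `window_s`."""
    breach_start = None
    for ts, lag in samples:
        if lag > threshold:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= window_s:
                return True
        else:
            breach_start = None   # dipped below the threshold: reset the clock
    return False

# Lag sampled once a minute: >10K for six minutes should page,
# a two-minute spike should not (thresholds from the table above).
steady = [(t, 15_000) for t in range(0, 420, 60)]
spike = [(0, 15_000), (60, 15_000), (120, 800)]
assert sustained_breach(steady, threshold=10_000, window_s=300)
assert not sustained_breach(spike, threshold=10_000, window_s=300)
```

The hysteresis keeps on-call from being paged on a one-minute burst while still catching a genuine lag spiral.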
2. Backfill Strategy — Replay from Source of Truth
Every pipeline breaks. Schema changes, bugs, infrastructure failures — you will need to reprocess historical data. If you cannot answer "how do we reprocess the last 7 days?" on day one, the pipeline is not production-ready.
Backfill Architecture
# Dual-path pipeline: forward (real-time) + backfill (batch)
#
# Forward path (normal operation):
# Kafka topic → Flink job → Sink
#
# Backfill path (recovery):
# Source DB → Batch exporter → Kafka backfill topic → Same Flink job → Sink
function triggerBackfill(startDate, endDate):
    # Step 1 — Pause the forward pipeline to avoid conflicts
    flink.savepoint("orders-pipeline")
    flink.stop("orders-pipeline")

    # Step 2 — Export source data to backfill topic
    batchExporter.run(
        query = "SELECT * FROM orders WHERE created_at BETWEEN ? AND ?",
        params = [startDate, endDate],
        outputTopic = "orders.events.backfill",
        batchSize = 10_000
    )

    # Step 3 — Run the same Flink job against the backfill topic
    flink.run("orders-pipeline",
        source = "orders.events.backfill",
        startOffset = "earliest"
    )

    # Step 4 — Wait for backfill completion, then switch back
    flink.waitForCompletion("orders-pipeline")
    flink.run("orders-pipeline",
        source = "orders.events",
        startOffset = savedCheckpoint
    )
Backfill Tradeoffs
| Strategy | Complexity | Downtime | Cost | Best For |
|---|---|---|---|---|
| Kafka offset reset | Low — restart consumer from earlier offset | None (run in parallel) | Kafka storage for retention window | Small backfills, <7 days |
| Separate backfill topic | Medium — batch export + replay | Forward pipeline paused | Compute for batch export + reprocessing | Large backfills, schema changes |
| Dual-write during migration | High — run old and new pipeline in parallel | None | 2x compute during overlap | Zero-downtime migrations |
| Full rebuild from source | High — re-export entire dataset | Forward pipeline paused | Full reprocessing cost | Catastrophic corruption, new pipeline versions |
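The table above collapses into a rough decision helper. This is a sketch of the selection logic only; the strategy labels and the 7-day retention default are illustrative, not a real API:

```python
def backfill_strategy(days, schema_changed, zero_downtime_required,
                      corrupted=False, retention_days=7):
    """Map a backfill scenario onto the strategies in the tradeoff table."""
    if corrupted:
        return "full-rebuild"            # catastrophic corruption: re-export everything
    if zero_downtime_required and schema_changed:
        return "dual-write"              # run old and new pipelines in parallel
    if days <= retention_days and not schema_changed:
        return "offset-reset"            # replay within Kafka's retention window
    return "backfill-topic"              # batch export + replay through the same job

assert backfill_strategy(3, False, False) == "offset-reset"
assert backfill_strategy(30, True, False) == "backfill-topic"
assert backfill_strategy(30, True, True) == "dual-write"
assert backfill_strategy(1, False, False, corrupted=True) == "full-rebuild"
```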
3. Schema Evolution with Avro + Schema Registry
Schema changes are the #1 cause of pipeline outages. A producer adds a field; a consumer does not expect it; deserialization fails; the pipeline stops. Schema Registry enforces compatibility rules so that changes are safe by default.
Schema Evolution Rules
# Avro schema for order events (v3)
{
    "type": "record",
    "name": "OrderEvent",
    "namespace": "com.example.orders",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "merchant_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "timestamp", "type": "long"},
        {"name": "currency", "type": "string", "default": "USD"},        # Added in v2
        {"name": "channel", "type": ["null", "string"], "default": null} # Added in v3
    ]
}
Compatibility rules enforced at the registry:
| Change Type | BACKWARD Compatible? | FORWARD Compatible? | FULL Compatible? |
|---|---|---|---|
| Add field with default | Yes | Yes | Yes |
| Add field without default | No — old consumers crash | Yes | No |
| Remove field with default | Yes | No — old producers crash | No |
| Rename field | No | No | No |
| Change field type | No (except promotions: int→long) | No | No |
Schema Registry in the Pipeline
# Producer: serialize with schema registry
function produce(event):
    schemaId = registry.getSchemaId("orders.events-value", version="latest")
    # Confluent wire format: 1 magic byte + 4-byte schema ID, then the Avro payload
    payload = avroSerialize(event, schemaId)
    kafka.produce("orders.events", key=event.merchantId, value=payload)

# Consumer: deserialize with schema registry
function consume(record):
    schemaId = extractSchemaId(record.value)    # Read from the 5-byte header
    writerSchema = registry.getSchema(schemaId) # Schema used by producer
    readerSchema = localSchema                  # Schema expected by consumer
    # Avro resolves field differences using default values
    event = avroDeserialize(record.value, writerSchema, readerSchema)
    process(event)
The consumer uses both the writer's schema (embedded in the message) and its own reader schema. Avro automatically handles field additions and removals using default values. This is why defaults are mandatory for safe evolution.
4. Exactly-Once with Idempotent Consumers
Distributed systems cannot guarantee exactly-once delivery. The practical approach is at-least-once delivery with idempotent processing — the same message processed twice produces the same result.
Idempotent Consumer Patterns
# Pattern 1: Deduplication table
function processEvent(event):
    # Fast-path check: skip events we have already seen
    existing = db.query(
        "SELECT 1 FROM processed_events WHERE event_id = ?", event.id)
    if existing:
        metrics.increment("pipeline.dedup.hit")
        return  # Already processed — skip

    db.beginTransaction()
    # Process the event
    applyBusinessLogic(event)
    # Record processing in the same transaction. A unique constraint on
    # event_id makes a concurrent duplicate abort here and roll back.
    db.execute("INSERT INTO processed_events (event_id, processed_at) VALUES (?, ?)",
               event.id, now())
    db.commit()

# Pattern 2: Idempotent upsert (no dedup table needed)
function processAggregation(windowResult):
    db.execute("""
        INSERT INTO merchant_revenue (merchant_id, window_start, total, count)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (merchant_id, window_start)
        DO UPDATE SET total = EXCLUDED.total, count = EXCLUDED.count
    """, windowResult.merchantId, windowResult.windowStart,
         windowResult.total, windowResult.count)
Pattern 1 (dedup table) is the general-purpose approach — works for any event type. The dedup entry and the business effect are in the same database transaction, so they are atomically consistent.
Pattern 2 (idempotent upsert) is simpler but only works when the operation is naturally idempotent (SET, not INCREMENT). Aggregation results are idempotent — the same window reprocessed produces the same totals.
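Pattern 2 can be demonstrated end to end with an embedded database. The sketch below uses SQLite (which supports `ON CONFLICT ... DO UPDATE` since 3.24) as a stand-in for the production sink; table and column names mirror the pseudocode above:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE merchant_revenue (
        merchant_id TEXT, window_start INTEGER, total REAL, count INTEGER,
        PRIMARY KEY (merchant_id, window_start))
""")

def write_window(merchant_id, window_start, total, count):
    # Replaying the same window overwrites instead of duplicating.
    db.execute("""
        INSERT INTO merchant_revenue (merchant_id, window_start, total, count)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (merchant_id, window_start)
        DO UPDATE SET total = excluded.total, count = excluded.count
    """, (merchant_id, window_start, total, count))
    db.commit()

write_window("m-1", 1700000000, 42.0, 3)
write_window("m-1", 1700000000, 42.0, 3)   # duplicate delivery, same result

rows = db.execute("SELECT * FROM merchant_revenue").fetchall()
assert rows == [("m-1", 1700000000, 42.0, 3)]
```

Reprocessing the window any number of times leaves exactly one row with the same totals, which is the whole idempotency argument.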
Exactly-Once End-to-End: The Full Picture
| Component | Guarantee | Mechanism |
|---|---|---|
| Kafka producer | At-least-once (with acks=all) | Retries on broker failure |
| Kafka producer (idempotent mode) | Exactly-once within a session | Producer ID + sequence number |
| Flink processing | Exactly-once internal state | Checkpoint barriers + state snapshots |
| Sink (database) | Idempotent write | ON CONFLICT DO UPDATE or dedup table |
| End-to-end | Effectively exactly-once | Each layer handles its own guarantee; the chain is as strong as its weakest link |
Architecture Diagram
Forward path (solid lines): Producers serialize events through the Schema Registry and publish to Kafka. Flink consumes, processes (windowing, aggregation, enrichment), and writes to purpose-built sinks. Checkpoints to S3 enable exactly-once recovery.
Backfill path (dashed): When reprocessing is needed, the batch exporter reads from the source database and publishes to a separate Kafka topic. Flink processes the backfill topic using the same transformation logic, ensuring consistency between forward and backfill paths.
Failure Scenarios
1. Schema Mismatch Causing Silent Data Loss
Timeline: The Order Service team adds a required field shipping_method to the Avro schema without a default value. They deploy the producer on Monday. The Flink consumer, running schema v3, encounters v4 messages on Tuesday. Avro deserialization fails. Flink's error handler routes the failed messages to a dead-letter topic.
Blast radius: All orders placed after Monday are silently routed to the dead-letter topic. Revenue dashboards show a cliff — zero new orders. The drop looks like a business problem, not a pipeline problem. It takes 6 hours before someone correlates the dead-letter queue growth with the missing data.
Detection: Dead-letter queue depth alert (threshold: >100 messages/min). Schema compatibility check failed in CI but was overridden with a manual approval. Consumer deserialization error rate spikes.
Recovery:
- Immediate: roll back the producer to schema v3, or register a v4 schema with a default value for shipping_method
- Reprocess: replay the dead-letter topic through the consumer after the schema fix
- Long-term: enforce FULL compatibility in the schema registry with no manual override. Block deploys that fail compatibility checks.
2. Flink Checkpoint Timeout Causing Reprocessing Cascade
Timeline: Flink's RocksDB state grows to 200GB after a month of operation. Checkpoint duration increases from 5 seconds to 90 seconds. The 30-second checkpoint interval overlaps with the running checkpoint, causing checkpoint timeout. Flink restarts the job from the last successful checkpoint — which is 10 minutes old. Those 10 minutes of data are reprocessed.
Blast radius: Downstream sinks receive duplicate writes for the reprocessed 10-minute window. If sinks are not idempotent, aggregation counts are inflated. If sinks are idempotent (upsert), no data corruption — but the reprocessing consumes resources and delays new data.
Detection: Checkpoint duration metric exceeds threshold. Checkpoint failure rate increases. RocksDB state size grows beyond capacity plan.
Recovery:
- Immediate: increase checkpoint timeout to 120 seconds to stop the failure loop
- Short-term: add state TTL — expire entries older than the business-relevant window (e.g., 24 hours for daily aggregations)
- Long-term: use incremental checkpoints (only changed state since last checkpoint), partition state across more Flink task managers, or offload cold state to an external store
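The state-TTL fix amounts to evicting entries older than the business window. Flink ships StateTtlConfig for this natively; the class below only illustrates the idea (eviction on read, so state stops growing unboundedly):

```python
class TtlState:
    """Per-key state with TTL eviction on read — an illustration of the
    state-TTL idea, not Flink's implementation."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}              # key -> (value, last_write_time)

    def put(self, key, value, now):
        self._store[key] = (value, now)

    def get(self, key, now):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, written = entry
        if now - written > self.ttl:
            del self._store[key]      # expired: evict so state stops growing
            return None
        return value

state = TtlState(ttl_seconds=86_400)           # 24h, matching daily aggregations
state.put("m-1", 42.0, now=0)
assert state.get("m-1", now=3_600) == 42.0     # still within the window
assert state.get("m-1", now=100_000) is None   # expired and evicted
```

Timestamps are passed explicitly here to keep the sketch deterministic; a real implementation would read the clock (or better, the watermark).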
3. Consumer Lag Spiral During Traffic Spike
Timeline: Black Friday traffic is 5x normal. The Kafka topic ingests 500K events/sec. Flink consumers process at 100K events/sec. Consumer lag grows by 400K messages/sec. After 30 minutes, lag is 720M messages — 2 hours behind real-time.
Blast radius: All downstream systems — dashboards, search indexes, leaderboards — show data that is 2 hours stale. Business decisions are made on stale data. Real-time fraud detection is compromised because events arrive too late for intervention.
Detection: Consumer group lag alert. Processing throughput vs ingestion throughput divergence. End-to-end latency (event timestamp vs processing timestamp) exceeds SLA.
Recovery:
- Immediate: scale Flink parallelism (add task managers) to match ingestion rate. Kafka partitions set the upper bound — you cannot have more Flink consumers than partitions
- Short-term: if partitions are the bottleneck, increase partition count for the topic (requires producer restart for new partitioning)
- Long-term: capacity plan for peak traffic, not average. If normal is 100K events/sec and Black Friday is 500K, provision Flink for 500K sustained, not 100K + auto-scaling (auto-scaling is too slow for burst absorption)
- Emergency: drop non-critical events (analytics, logging) and prioritize business-critical events (fraud detection, inventory updates) through separate topics with independent consumer groups
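The arithmetic behind "scale to match ingestion rate" is worth having at your fingertips. A sketch of the catch-up calculation, using the Black Friday numbers from the timeline above:

```python
def time_to_drain(lag_messages, process_rate, ingest_rate):
    """Seconds until the consumer group catches up, or infinity if the
    processing rate never exceeds the ingestion rate."""
    net = process_rate - ingest_rate
    if net <= 0:
        return float("inf")           # lag only grows; scale out or shed load
    return lag_messages / net

# From the timeline: 720M messages behind, ingesting 500K/s, processing 100K/s.
assert time_to_drain(720_000_000, 100_000, 500_000) == float("inf")
# Scaling Flink to process 900K/s drains the backlog in 30 minutes.
assert time_to_drain(720_000_000, 900_000, 500_000) == 1800.0
```

The key observation: drain time depends on the net rate, so "slightly faster than ingest" means hours of catch-up, which is why the recovery above says to provision for peak rather than average.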
Staff Interview Application
How to Introduce This Pattern
Lead with the latency requirement, then name backfill capability and the schema contract as non-negotiable. This signals that you've operated pipelines in production, not just built them in a tutorial.
When NOT to Use This Pattern
- Data fits in memory: If the entire dataset is <1GB, a single-node script with good error handling beats a Flink cluster. Don't add distributed infrastructure for a problem that fits in pandas.
- Batch interval matches the SLA: If the business needs data refreshed every hour, a cron job with psql COPY is the right answer. Stream processing for hourly updates is over-engineering.
- No replay requirement: If the data is ephemeral and reprocessing is not needed (metrics aggregation with acceptable loss), a simpler push-based pipeline without Kafka's durability overhead is sufficient.
- Team cannot operate the stack: Flink requires JVM tuning, checkpoint management, and state backend expertise. If the team does not have this, use a managed service (AWS Kinesis Data Analytics, Google Dataflow) or stick with batch.
Follow-Up Questions to Anticipate
| Interviewer Asks | What They Are Testing | How to Respond |
|---|---|---|
| "How do you handle late-arriving data?" | Event-time vs processing-time understanding | "Event-time windowing with watermarks. Late events within the allowed lateness period refire the window. Events beyond the allowed lateness go to a side output for separate processing." |
| "What happens when the pipeline breaks?" | Operational maturity | "Flink restarts from the last checkpoint — no data loss because Kafka retains events. The sink uses idempotent upserts so reprocessed data doesn't create duplicates." |
| "How do you backfill?" | Production readiness | "Kafka retention covers 7 days. For deeper backfills, we export from the source DB to a backfill topic and run the same Flink job against it. Same code, different input." |
| "How do you handle schema changes?" | Data governance awareness | "Schema Registry with FULL compatibility enforcement. New fields must have defaults. Producers that fail compatibility checks cannot deploy." |
| "Why not just use database triggers?" | Understanding of coupling and scale | "Triggers couple the pipeline to the database engine, don't scale horizontally, and can't be replayed. CDC via Debezium gives us the same change stream without the coupling." |