Design with Elasticsearch — Staff-Level Technology Guide
The 60-Second Pitch
Elasticsearch is a distributed search and analytics engine built on Apache Lucene. It combines inverted indexes, BM25 relevance scoring, near-real-time indexing, and horizontal scaling across commodity hardware. In system design interviews, Elasticsearch is the default choice when the question involves full-text search, log aggregation, autocomplete, faceted navigation, or any workload where "find documents matching a complex query across billions of records in <100ms" is the requirement.
The Staff-level insight: Elasticsearch is not a database — it is a search engine that happens to store data. It has no transactions, no referential integrity, no durable single-write acknowledgment by default, and its eventual consistency model means a document indexed 50ms ago may not yet be visible to search. The Staff move is to pair Elasticsearch with a source of truth (PostgreSQL, Kafka) and treat it as a derived, eventually consistent read model optimized for query patterns that relational databases handle poorly.
Architecture & Internals
Cluster, Nodes, and Roles
An Elasticsearch cluster is a group of nodes that collectively store data and provide search capabilities. Each node can serve one or more roles:
Master-eligible nodes manage cluster state — index creation/deletion, shard allocation, node membership. The cluster elects one active master at a time. You need a minimum of 3 master-eligible nodes for quorum-based leader election (avoiding split brain). On clusters with dedicated masters, the master nodes should be lightweight — they store no data and execute no queries.
Data nodes store index shards and execute search and indexing operations. These are the workhorses — they need fast disks (SSD), significant RAM (for OS page cache and field data), and adequate CPU for query execution. In hot/warm/cold architectures, data nodes are tiered by hardware class: hot nodes (fast SSDs for recent data), warm nodes (HDDs for older data), cold nodes (cheapest storage for archived data).
Coordinating nodes (also called client nodes) route requests, scatter queries across shards, gather results, and merge them. Every node can coordinate, but dedicated coordinating nodes prevent expensive aggregation merges from competing with indexing on data nodes. For high-query-rate clusters, dedicated coordinators prevent search latency spikes during bulk indexing.
Ingest nodes run ingest pipelines (transformation, enrichment) before indexing. Useful for log parsing (Grok patterns), field extraction, GeoIP enrichment — preprocessing that would otherwise happen in Logstash or application code.
Shards and Replicas
An index is divided into primary shards (the unit of data distribution) and replica shards (copies of primaries for redundancy and read scaling). A document is routed to a primary shard by hash(routing_key) % num_primary_shards. The routing key defaults to the document _id but can be overridden — a critical optimization for queries that always filter on the same field (e.g., tenant_id).
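The routing rule above can be sketched in a few lines. Elasticsearch actually hashes the `_routing` value with murmur3; the sketch below uses `zlib.crc32` as a deterministic stand-in purely to illustrate the modulo routing, and the `tenant-42` key is a hypothetical example.

```python
import zlib

def route_to_shard(routing_key: str, num_primary_shards: int) -> int:
    """Pick the primary shard for a document.

    Elasticsearch uses murmur3 on the _routing value (default: the _id);
    crc32 here is a deterministic stand-in for illustration only.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards

# The same routing key always lands on the same shard -- the basis for
# custom routing on tenant_id: all of a tenant's docs share one shard,
# so tenant-scoped queries hit a single shard instead of all of them.
shard_a = route_to_shard("tenant-42", 10)
shard_b = route_to_shard("tenant-42", 10)
assert shard_a == shard_b
```

This is also why the primary shard count is frozen: changing `num_primary_shards` changes the result of the modulo for existing keys, which is exactly the reindexing problem described above.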
The number of primary shards is set at index creation and cannot be changed afterward (without reindexing). This is similar to Kafka's partition count — it is a commitment. Too few shards limit parallelism and cap maximum index size. Too many waste resources (each shard is a Lucene index with its own file handles, memory structures, and thread pool slots). The rule of thumb: 10-50GB per shard, staying toward the lower end for search-heavy workloads and toward 50GB for append-heavy logging workloads.
Replica shards serve two purposes: (1) fault tolerance — if a node with a primary shard fails, a replica is promoted to primary, and (2) read scaling — search queries can be served by either primary or replica. With 1 replica per shard across 3 data nodes, the cluster tolerates the loss of any single node without data loss or search downtime.
The Inverted Index
Elasticsearch's speed comes from the inverted index — a data structure that maps each unique term to the list of documents containing it. When you index a document with text "The quick brown fox", the analyzer tokenizes it into terms (the, quick, brown, fox), and each term is added to the inverted index with a pointer to the document. A search for "brown fox" looks up both terms in the inverted index, finds the intersection of document lists, and returns matching documents — without scanning a single document's full text.
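The lookup-and-intersect flow above can be shown with a toy index — a minimal sketch, not Lucene's actual on-disk structures (which use compressed, sorted posting lists and an FST term dictionary):

```python
from collections import defaultdict

def build_inverted_index(docs: dict) -> dict:
    """Map each term to the set of doc IDs containing it (a toy posting list)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index: dict, query: str) -> set:
    """AND-search: intersect the posting lists of every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "The quick brown fox",
    2: "The lazy brown dog",
    3: "A fox in the henhouse",
}
index = build_inverted_index(docs)

# "brown" appears in docs {1, 2}; "fox" in {1, 3}.
# The intersection {1} is found without scanning any document text.
assert search(index, "brown fox") == {1}
```

The key property: query cost scales with the size of the posting lists (documents containing the terms), not with the total corpus size.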
This is fundamentally different from a B-tree index in a relational database. A B-tree on a text column supports exact match and prefix queries (LIKE 'foo%'). An inverted index supports full-text search, stemming (matching "running" when searching "run"), synonyms, fuzzy matching (typo tolerance), and relevance scoring — with near-constant-time term lookup plus O(N) posting-list intersection, where N is the number of matching documents.
The analyzer pipeline controls how text becomes terms: character filters (strip HTML, normalize unicode) → tokenizer (split on whitespace, ngrams, or language-specific rules) → token filters (lowercase, stemming, stop word removal, synonyms). Choosing the right analyzer is a relevance-critical decision. The standard analyzer (default) tokenizes on word boundaries and lowercases. For autocomplete, use an edge_ngram tokenizer. For exact matching on IDs or enums, use the keyword analyzer (no tokenization).
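The three-stage pipeline can be mimicked with a toy analyzer — a sketch of the shape only, omitting real analyzers' stemming, synonym expansion, and unicode normalization:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in"}

def analyze(text: str) -> list:
    """Toy analyzer chain: character filter -> tokenizer -> token filters."""
    # Character filter: strip HTML tags.
    text = re.sub(r"<[^>]+>", " ", text)
    # Tokenizer: split on non-word characters.
    tokens = re.findall(r"\w+", text)
    # Token filters: lowercase, then drop stop words.
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

assert analyze("<p>The Quick Brown Fox</p>") == ["quick", "brown", "fox"]
```

The same chain runs at both index time and query time — which is why a `term` query (which skips analysis) fails against analyzed `text` fields, as discussed under structured queries below.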
BM25 Relevance Scoring
Elasticsearch ranks search results using BM25 (Best Matching 25), a probabilistic relevance scoring algorithm. BM25 considers three factors: term frequency (how often the term appears in the document — more = higher score, with diminishing returns), inverse document frequency (how rare the term is across all documents — rarer terms score higher), and field length normalization (shorter fields score higher for the same term frequency — a match in a title is more relevant than the same match in a 10,000-word body).
BM25 replaced TF-IDF as the default scoring model in Elasticsearch 5.x. The key improvement: BM25 saturates term frequency impact (a term appearing 100 times is not 100x more relevant than appearing once), making it more robust against keyword stuffing and long documents.
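The three factors and the saturation behavior fall directly out of the formula. Below is classic BM25 for a single term (Lucene's implementation differs slightly in constant factors), with default parameters k1=1.2 and b=0.75:

```python
import math

def bm25_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Classic BM25 score for one term in one document.

    tf: term frequency in the document; df: number of documents containing
    the term; n_docs: corpus size; doc_len/avg_doc_len: field length norm.
    """
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    tf_norm = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_norm

# Term-frequency saturation: 100 occurrences are not 100x more relevant.
once = bm25_score(tf=1, df=10, n_docs=10_000, doc_len=100, avg_doc_len=100)
hundred = bm25_score(tf=100, df=10, n_docs=10_000, doc_len=100, avg_doc_len=100)
assert once < hundred < 100 * once

# Inverse document frequency: rare terms outscore common ones at equal tf.
rare = bm25_score(tf=1, df=5, n_docs=10_000, doc_len=100, avg_doc_len=100)
common = bm25_score(tf=1, df=5_000, n_docs=10_000, doc_len=100, avg_doc_len=100)
assert rare > common
```

With these defaults, tf=100 scores only about 2x tf=1 — that bounded growth is the robustness against keyword stuffing described above.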
In interviews, BM25 is relevant when discussing relevance tuning. The Staff answer to "how do you improve search quality?": "Boosting fields (title^3, body^1), custom analyzers for domain-specific tokenization, and function_score queries that blend BM25 relevance with business signals (recency, popularity, user personalization)."
Near-Real-Time Search
Documents are not immediately searchable after indexing. Elasticsearch writes to an in-memory buffer and periodically refreshes (default every 1 second), which creates a new Lucene segment and makes buffered documents visible to search. This 1-second delay is the "near-real-time" in Elasticsearch's description.
For bulk indexing (initial data load, reindexing), set refresh_interval=-1 to disable automatic refresh, index all documents, then manually refresh. This can improve indexing throughput by 5-10x by avoiding the overhead of creating many small segments.
Documents are durable (survive node restart) only after a flush, which writes the in-memory buffer and transaction log (translog) to disk as a Lucene commit. The translog provides durability between flushes — every write operation is appended to the translog, which is fsynced by default. If the node crashes before a flush, the translog is replayed on recovery.
Core Concepts for Interviews
Mapping Design
A mapping defines the schema of an index — field names, types, and analysis settings. Unlike relational schemas, mappings are not enforced at write time by default (dynamic mapping adds fields automatically). In production, always use explicit mappings to prevent mapping explosion — the accidental creation of thousands of fields from unstructured JSON, each consuming cluster state memory and degrading performance.
Key field types: text (analyzed, for full-text search), keyword (not analyzed, for exact match, sorting, aggregations), integer/long/float/double (numeric), date, geo_point, nested (for arrays of objects that need independent querying), object (flattened, fields merged with parent — no independent querying of array elements).
The most common mapping mistake: using text for fields that should be keyword. A status field with values active/inactive should be keyword — analyzing it into tokens (a, c, t, i, v, e) is meaningless. Use multi-fields ("status": { "type": "keyword", "fields": { "search": { "type": "text" } } }) only when you genuinely need both exact match and full-text search on the same field.
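Putting the pieces together, an explicit mapping with `dynamic: strict` might look like the following (the request body only, shown as a Python dict; field names are a hypothetical product index):

```python
# Explicit mapping: every field declared, dynamic: strict so documents
# with unexpected fields are rejected instead of silently widening the schema.
product_mapping = {
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "name":       {"type": "text"},      # analyzed: full-text search
            "status":     {"type": "keyword"},   # exact match, sort, aggregate
            "price":      {"type": "double"},
            "created_at": {"type": "date"},
            # Multi-field: exact match on "brand", full-text on "brand.search".
            "brand": {
                "type": "keyword",
                "fields": {"search": {"type": "text"}},
            },
        },
    }
}

assert product_mapping["mappings"]["dynamic"] == "strict"
assert product_mapping["mappings"]["properties"]["status"]["type"] == "keyword"
```

This body would be sent as `PUT /products` at index creation; with `strict` in place, a typo like `"statuss"` in a document fails the write loudly rather than creating a stray field.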
Nested vs. Parent-Child
When documents contain arrays of objects, Elasticsearch flattens them by default. An order with items: [{ "name": "shirt", "size": "M" }, { "name": "pants", "size": "L" }] becomes items.name: ["shirt", "pants"] and items.size: ["M", "L"] — losing the association between name and size. A query for "shirt in size L" incorrectly matches this document.
Nested fields preserve the association. Each nested object is stored as a hidden internal document. Queries use nested query syntax to match within a single nested object. Cost: each nested object is a separate Lucene document — a parent with 100 nested objects creates 101 Lucene documents, impacting indexing speed, shard size, and query complexity.
Parent-child (join field) stores parent and child as separate documents in the same index. Children can be updated independently without reindexing the parent. Cost: has_child and has_parent queries are slower than nested queries because they require a global ordinals lookup across the shard.
Decision rule: use nested when the number of nested objects per document is small (<100) and they are always read/written with the parent. Use parent-child when children are updated independently or the number of children per parent is large (>100).
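For the order example above, the nested mapping and the query that correctly rejects "shirt in size L" look like this (request bodies only, shown as Python dicts):

```python
# Nested mapping: each element of "items" is indexed as its own hidden
# Lucene document, preserving the name<->size association.
order_mapping = {
    "mappings": {
        "properties": {
            "items": {
                "type": "nested",
                "properties": {
                    "name": {"type": "keyword"},
                    "size": {"type": "keyword"},
                },
            }
        }
    }
}

# A nested query matches only when BOTH conditions hold within a single
# nested object -- so {shirt/M, pants/L} correctly does not match.
nested_query = {
    "query": {
        "nested": {
            "path": "items",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"items.name": "shirt"}},
                        {"term": {"items.size": "L"}},
                    ]
                }
            },
        }
    }
}

assert order_mapping["mappings"]["properties"]["items"]["type"] == "nested"
```

Without `"type": "nested"` the same bool query would run against the flattened arrays and match incorrectly.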
Denormalization for Search
Elasticsearch has no joins. Every query operates on a single index (cross-index search is possible but not a join — it is parallel queries with merged results). This means search indexes must be denormalized: embed all data needed for search and display into each document.
A product search index denormalizes: { "product_id": 123, "name": "Running Shoe", "brand_name": "Nike", "category_path": ["Shoes", "Athletic", "Running"], "price": 129.99, "avg_rating": 4.5, "review_count": 2847, "in_stock": true }. The brand name is duplicated in every product document. If Nike changes its name, you reindex every Nike product. This is the tradeoff: query speed and simplicity in exchange for update complexity and storage redundancy.
Query Types
Full-Text Queries
match: the workhorse. Analyzes the query string, then searches for each term in the inverted index. { "match": { "title": "running shoes" } } finds documents containing "running" OR "shoes" (default operator). With "operator": "and", both terms must appear.
multi_match: searches across multiple fields. { "multi_match": { "query": "running shoes", "fields": ["title^3", "description", "brand"] } } boosts title matches 3x. Types: best_fields (score from the best-matching field), most_fields (sum scores across fields), cross_fields (term-centric — each term must appear in at least one field).
match_phrase: terms must appear in order and adjacent. { "match_phrase": { "title": "running shoes" } } matches "running shoes" but not "shoes for running". Use slop parameter to allow N words between terms.
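The three query shapes above, written out as full request bodies (field names follow the running product-search example):

```python
# match with operator=and: both "running" and "shoes" must appear in title.
match_query = {
    "match": {"title": {"query": "running shoes", "operator": "and"}}
}

# multi_match with best_fields: score comes from the best-matching field;
# title matches are boosted 3x via the ^3 suffix.
multi_match_query = {
    "multi_match": {
        "query": "running shoes",
        "fields": ["title^3", "description", "brand"],
        "type": "best_fields",
    }
}

# match_phrase with slop=1: one intervening word is tolerated, so
# "running trail shoes" matches but "shoes for running" still does not.
phrase_query = {
    "match_phrase": {"title": {"query": "running shoes", "slop": 1}}
}

assert multi_match_query["multi_match"]["fields"][0] == "title^3"
```

Each dict is the body of a `GET /products/_search` request wrapped in `{"query": ...}`.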
Structured Queries
term: exact match on keyword fields (not analyzed). { "term": { "status": "active" } }. Never use term on text fields — the query is not analyzed but the field was, so "Active" won't match the lowercased token "active".
range: { "range": { "price": { "gte": 50, "lte": 200 } } }. Works on numeric, date, and keyword fields.
bool: combines queries with must (AND, affects score), filter (AND, no scoring — faster), should (OR, boosts score), must_not (NOT, no scoring). The Staff pattern for search: full-text in must, structured filters in filter (price range, category, in-stock), business boosting in should (featured products, high ratings).
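The Staff pattern above, assembled into one request body (field names such as `featured` and `avg_rating` are illustrative):

```python
# Relevance in must (scored), hard constraints in filter (cached, unscored),
# business boosts in should (optional, raise the score when they match).
search_body = {
    "query": {
        "bool": {
            "must": [
                {"match": {"title": "running shoes"}},
            ],
            "filter": [
                {"term": {"category": "shoes"}},
                {"range": {"price": {"gte": 50, "lte": 200}}},
                {"term": {"in_stock": True}},
            ],
            "should": [
                {"term": {"featured": True}},
                {"range": {"avg_rating": {"gte": 4.5}}},
            ],
        }
    }
}

assert "filter" in search_body["query"]["bool"]
```

Keeping the structured constraints in `filter` matters for performance as well as correctness: filter clauses skip scoring and their results are cacheable across requests.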
Aggregations
Bucket aggregations: group documents — terms (group by field value), date_histogram (group by time interval), range (group by numeric ranges). These power faceted navigation ("500 results in Shoes, 200 in Clothing").
Metric aggregations: compute statistics within buckets — avg, sum, min, max, cardinality (approximate distinct count using HyperLogLog), percentiles.
Pipeline aggregations: operate on other aggregation results — moving averages, cumulative sums, bucket selectors. These power analytics dashboards and time-series analysis within Elasticsearch.
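A faceted-navigation request combines bucket and metric aggregations in one call — a sketch using the product-search example (field names illustrative):

```python
# size: 0 skips fetching hits entirely -- only the facet counts come back.
facet_request = {
    "size": 0,
    "query": {"match": {"title": "shoes"}},
    "aggs": {
        # Bucket agg: top 20 categories by document count ("500 in Shoes...").
        "by_category": {
            "terms": {"field": "category", "size": 20},
            # Metric agg nested inside each bucket: average price per category.
            "aggs": {"avg_price": {"avg": {"field": "price"}}},
        },
        # Bucket agg: price ranges in $50 steps for a price-slider facet.
        "price_histogram": {
            "histogram": {"field": "price", "interval": 50}
        },
    },
}

assert facet_request["size"] == 0
```

Note that both facets are computed over the same matching document set in a single request — the usual pattern for rendering a results page with its sidebar filters in one round trip.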
Autocomplete and Suggestions
Completion suggester: purpose-built for autocomplete. Uses an FST (finite state transducer) data structure for prefix lookups in O(length) time. The field type completion stores suggestions separately from the inverted index — sub-millisecond response time regardless of index size.
Edge n-grams: alternative approach — tokenize "elasticsearch" as ["e", "el", "ela", "elas", ...] at index time. Queries use a standard match query. More flexible than completion suggester (supports fuzzy matching, middle-of-word matching) but uses more disk space and indexing time.
The Staff recommendation: completion suggester for simple prefix autocomplete (product names, user names). Edge n-grams for search-as-you-type with fuzzy tolerance and more complex matching requirements.
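What an `edge_ngram` token filter emits is easy to show directly — the index-time token expansion that makes a plain `match` query behave like prefix search:

```python
def edge_ngrams(term: str, min_gram: int = 1, max_gram: int = 10) -> list:
    """Emit the prefixes an edge_ngram token filter would produce for a term."""
    upper = min(max_gram, len(term))
    return [term[:i] for i in range(min_gram, upper + 1)]

assert edge_ngrams("elasticsearch", min_gram=1, max_gram=5) == [
    "e", "el", "ela", "elas", "elast"
]

# At query time, whatever prefix the user has typed so far is an exact
# token match against the index -- no wildcard scan needed.
assert "ela" in edge_ngrams("elasticsearch")
```

The disk cost mentioned above is visible here: a 13-character term indexed with max_gram=10 produces 10 tokens instead of 1, multiplied across every term in the field.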
Scaling Elasticsearch
Shard Sizing
The fundamental scaling unit is the shard. Each shard is an independent Lucene index with its own inverted index, file handles, and fixed heap overhead — Elastic's long-standing guidance is to stay under roughly 20 shards per GB of heap on each node. Oversized shards (>50GB) cause slow recoveries and rebalancing. Undersized shards (<1GB) waste resources with excessive overhead.
Target: 10-50GB per shard for search workloads. For a 500GB dataset with 1 replica: 500GB / 30GB_per_shard ≈ 17 primary shards, 34 total shards (17 primary + 17 replica). Distribute across data nodes so each node holds a manageable number of shards (aim for <500 shards per node).
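The sizing arithmetic above as a small helper — a back-of-envelope calculator, not a substitute for measuring real shard sizes:

```python
import math

def shard_plan(data_gb: float, target_shard_gb: float = 30, replicas: int = 1):
    """Primary and total shard counts from the 10-50GB-per-shard rule of thumb."""
    primaries = math.ceil(data_gb / target_shard_gb)
    total = primaries * (1 + replicas)  # each primary gets `replicas` copies
    return primaries, total

# The worked example from the text: 500GB at ~30GB/shard with 1 replica.
primaries, total = shard_plan(500, target_shard_gb=30, replicas=1)
assert primaries == 17   # ceil(500 / 30)
assert total == 34       # 17 primaries + 17 replicas
```

Since the primary count is fixed at index creation, a common refinement is to run this against projected data volume (say 2x current) rather than today's size.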
Index Lifecycle Management (ILM)
For time-series data (logs, metrics, events), ILM automates index rotation through phases:
| Phase | Duration | Storage | Purpose |
|---|---|---|---|
| Hot | 0-7 days | Fast SSD | Active writes and most queries |
| Warm | 7-30 days | Standard SSD | Read-only, still frequently queried |
| Cold | 30-90 days | HDD or S3 (searchable snapshots) | Rarely queried, retained for compliance |
| Delete | >90 days | None | Purged |
Rollover indices: instead of one massive index, create date-based indices (logs-2024.01.15) with an alias (logs-write pointing to the current index). ILM rolls over when the index hits a size threshold (50GB) or age threshold (1 day). Queries use a read alias (logs-read) that spans all active indices. This pattern enables cheap deletion (drop an entire index vs. deleting documents) and per-phase hardware allocation.
Hot/Warm/Cold Architecture
Assign data nodes to tiers using node attributes (node.attr.data_tier: hot). ILM policies move indices between tiers automatically. Hot nodes run on fast NVMe SSDs with high CPU. Warm nodes run on standard SSDs with moderate CPU. Cold nodes use searchable snapshots from S3 — data is not stored locally, only fetched on demand.
This architecture reduces infrastructure cost by 60-80% compared to keeping all data on hot-tier hardware. A 10TB cluster with 90-day retention might need only 1TB of hot storage (7 days), 3TB warm (30 days), and 6TB cold (searchable snapshots on S3 at pennies per GB).
Failure Modes
Cluster Yellow / Red State
Yellow: all primary shards are assigned, but some replica shards are not. The cluster functions but has reduced redundancy — one more node failure could cause data loss. Common cause: cluster has fewer data nodes than the replication factor, or a node recently left and replicas are being relocated.
Red: one or more primary shards are unassigned. Queries that hit unassigned shards return partial results or errors. This is a data availability incident.
Detection: GET _cluster/health — status field. Alert immediately on red, within 15 minutes on yellow.
Staff Response: "Yellow after a node bounce is expected — replicas relocate within minutes. Sustained yellow means we have a structural problem: insufficient nodes, disk watermarks exceeded (cluster.routing.allocation.disk.watermark.low defaults to 85%), or shard allocation filtering misconfigured. Red requires immediate investigation — check GET _cluster/allocation/explain to understand why the shard cannot be assigned. Most common cause: disk full on all eligible nodes."
Shard Relocation Storms
Symptom: Cluster performance degrades during node addition, removal, or recovery. Network bandwidth is saturated with shard data transfers. Search latency spikes.
Detection: GET _cat/recovery shows many concurrent recoveries. Network I/O metrics on data nodes spike.
Staff Response: "Limit concurrent recoveries with cluster.routing.allocation.node_concurrent_recoveries (default 2). Set indices.recovery.max_bytes_per_sec to cap recovery bandwidth (default 40MB/s). For planned maintenance, use allocation filtering to drain a node gracefully before removing it: PUT _cluster/settings { "transient": { "cluster.routing.allocation.exclude._name": "node-5" } }."
Mapping Explosion
Symptom: Master node memory usage spikes. Cluster state size grows to hundreds of MB. Index creation and mapping updates become slow. Master node becomes unresponsive.
Detection: GET _cluster/state/metadata size exceeds 100MB. Index mapping has >1,000 fields.
Staff Response: "Set index.mapping.total_fields.limit (default 1000) and dynamic: strict on all production indices. Audit existing mappings for unused fields. For log ingestion with variable fields, use flattened field type (stores the entire object as a single field with keyword-only sub-fields) instead of dynamic mapping."
Slow Queries
Symptom: Search latency p99 exceeds SLA. Some queries take seconds instead of milliseconds.
Detection: Enable slow log (index.search.slowlog.threshold.query.warn: 1s). Monitor search_latency metrics per index.
Staff Response: "Profile with GET index/_search?profile=true. Common causes: (1) wildcard or regexp queries at the beginning of a term (no inverted index optimization — full scan), (2) terms aggregation on high-cardinality fields loading millions of unique values into heap, (3) deep pagination (from: 10000, size: 10 — Elasticsearch must fetch and sort 10,010 documents from every shard, then discard 10,000). Fix deep pagination with search_after or scroll API."
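The `search_after` fix for deep pagination can be sketched as request-body construction (no cluster needed; the field names and the `p-1042` document ID are illustrative):

```python
# Each page's request carries the sort values of the previous page's last
# hit instead of a growing "from" offset, so every shard returns only
# page_size documents regardless of how deep the user has paged.
def next_page_request(last_hit_sort=None, page_size: int = 10) -> dict:
    body = {
        "size": page_size,
        "query": {"match": {"title": "shoes"}},
        # A unique tiebreaker (_id here) is required so the cursor is total.
        "sort": [{"price": "asc"}, {"_id": "asc"}],
    }
    if last_hit_sort is not None:
        body["search_after"] = last_hit_sort
    return body

first = next_page_request()
assert "search_after" not in first

# Suppose page 1's last hit sorted at price=59.99 with _id="p-1042":
second = next_page_request([59.99, "p-1042"])
assert second["search_after"] == [59.99, "p-1042"]
```

Contrast with `from: 10000`: there, every shard must materialize and sort 10,010 candidates per request; here the cursor keeps per-shard work constant at every depth.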
Split Brain (Legacy)
Symptom: Two master nodes active simultaneously. Each half of the cluster accepts writes independently. Data diverges.
Detection: In Elasticsearch 7.x+, this is effectively eliminated by the built-in cluster coordination (quorum-based voting configuration). In older versions, query GET _cat/master against several nodes — every node should report the same single elected master; divergent answers indicate a split.
Staff Response: "Elasticsearch 7.x+ uses a quorum-based voting configuration that prevents split brain without manual minimum_master_nodes tuning. For pre-7.x clusters: set discovery.zen.minimum_master_nodes to (master-eligible_nodes / 2) + 1. For any version: use 3 dedicated master nodes (never 2, never an even number) and ensure master nodes are in separate failure domains."
When to Use vs. Alternatives
| Use Case | Elasticsearch | PostgreSQL FTS | Algolia | Typesense | Solr |
|---|---|---|---|---|---|
| Full-text search (>1M docs) | ✅ Best | ⚠️ Viable to ~10M rows | ✅ Managed | ✅ Simpler | ✅ Comparable |
| Relevance tuning | ✅ Extensive | ⚠️ Basic (ts_rank) | ✅ AI-powered | ⚠️ Basic | ✅ Extensive |
| Log aggregation | ✅ Best (ELK stack) | ❌ Not designed for | ❌ Not designed for | ❌ Not designed for | ⚠️ Possible |
| Autocomplete | ✅ Completion suggester | ⚠️ pg_trgm | ✅ Built-in | ✅ Built-in | ✅ Possible |
| Faceted navigation | ✅ Aggregations | ⚠️ Manual GROUP BY | ✅ Built-in | ✅ Built-in | ✅ Built-in |
| Geo search | ✅ Geo queries | ✅ PostGIS (better) | ⚠️ Basic | ⚠️ Basic | ✅ Spatial |
| Ops complexity | ⚠️ High (cluster mgmt) | ✅ Part of existing DB | ✅ Managed SaaS | ✅ Simple | ⚠️ High |
| Real-time indexing | ✅ ~1s delay | ✅ Immediate | ✅ Immediate | ✅ Immediate | ✅ Near-real-time |
| Cost at scale | ⚠️ Memory-intensive | ✅ Cheapest | ❌ Expensive at scale | ✅ Affordable | ⚠️ Similar to ES |
Decision rule: Use Elasticsearch when you need full-text search at scale (>10M documents), log aggregation (ELK/OpenSearch), complex aggregations across large datasets, or relevance tuning beyond basic ranking. Use PostgreSQL tsvector + GIN when search is a secondary feature on <10M rows and you do not want to operate a separate cluster. Use Algolia when search quality and developer experience matter more than cost and you are under 10M records. Use Typesense when you want Algolia-like simplicity with self-hosted control.
Deployment Topologies
Staff-Level Operational Concerns
Heap sizing: Elasticsearch uses Java heap for field data caches, query caches, and cluster state. Set heap to 50% of available RAM, never exceeding 31GB (beyond 31GB, JVM loses compressed ordinary object pointers, and effective heap efficiency drops). The other 50% of RAM is used by the OS page cache for Lucene segment files — this is equally critical. A 64GB server should run with 31GB heap and 33GB for page cache.
Monitoring essentials: Cluster health (red/yellow/green), JVM heap usage and GC frequency (jvm.gc.collectors.old.collection_time_in_millis — long pauses indicate heap pressure), indexing rate and search latency per index, pending tasks queue (master node backlog), thread pool rejections (search.rejected, write.rejected), and disk watermarks.
Reindexing strategy: Unlike schema migrations in PostgreSQL, Elasticsearch mapping changes often require a full reindex. The zero-downtime pattern: create a new index with the updated mapping, reindex from the old index (or from the source of truth), swap the alias from old to new, delete the old index. Use _reindex API for Elasticsearch-to-Elasticsearch, or re-feed from Kafka/PostgreSQL for source-of-truth rebuilds.
Index aliases are non-negotiable in production. Never point application code at concrete index names (products-v1). Always use aliases (products). Aliases enable zero-downtime reindexing (swap alias from v1 to v2), blue-green deployments, and A/B testing of different mappings or analyzers.
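The alias-swap step of the zero-downtime reindex is a single atomic `POST _aliases` call — sketched here as the request body (index names follow the `products-v1`/`products-v2` example above):

```python
# Both actions in one _aliases call apply atomically: no request ever sees
# the alias pointing at zero indices or at both versions during the cutover.
def alias_swap(alias: str, old_index: str, new_index: str) -> dict:
    """Body for POST _aliases, moving `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

body = alias_swap("products", "products-v1", "products-v2")
assert body["actions"][0]["remove"]["index"] == "products-v1"
assert body["actions"][1]["add"]["index"] == "products-v2"
```

Because the application only ever queries `products`, the rollback path is the same call with the arguments reversed.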
Interview Application
Which Playbooks Use Elasticsearch
| Playbook | How Elasticsearch Is Used | Key Pattern |
|---|---|---|
| Search Indexing | Primary search engine | Inverted index with custom analyzers and BM25 tuning |
| Web Crawling | Crawled content indexing | Bulk indexing pipeline with deduplication |
| Feed Generation | Content search and discovery | Multi-field search with recency boosting |
| Notification Systems | Notification search and filtering | Structured queries with date range filters |
What L5 Says vs. What L6 Says
| Topic | L5 Says | L6 Says |
|---|---|---|
| Choice | "We'll use Elasticsearch for search" | "Elasticsearch for the search read model — inverted indexes give us sub-100ms full-text queries across 500M products. PostgreSQL remains the source of truth. We index via a Kafka consumer that denormalizes product data with brand, category, and inventory status." |
| Scaling | "We'll add more nodes" | "10 primary shards at ~30GB each for 300GB of product data. 1 replica for fault tolerance. Hot/warm tiers if we add search analytics logging. Shard count is fixed at index creation so we size for 2x current volume." |
| Relevance | "Elasticsearch handles ranking" | "BM25 base scoring with field boosting (title^3, description^1). Function score query blends relevance with conversion rate and recency decay. A/B test scoring changes via index aliases pointing to different index versions." |
| Failure | "Elasticsearch is distributed" | "Near-real-time means a product indexed now is searchable in ~1 second. During that window, a user who just added a product will not find it via search. We handle this with a read-your-writes pattern: check the source of truth for recently created items and merge with search results." |
| Consistency | "It's eventually consistent" | "The indexing pipeline lag from PostgreSQL change to ES visibility is typically 2-5 seconds via Kafka. If the pipeline falls behind, search results are stale. We monitor consumer lag and alert if search index is >30 seconds behind source of truth." |
Common Interview Mistakes
| Mistake | Why It's Wrong | What to Say Instead |
|---|---|---|
| "Elasticsearch as primary database" | No transactions, eventual consistency, no referential integrity | "Elasticsearch as a search read model, PostgreSQL/Kafka as source of truth" |
| "Just index everything" | Mapping explosion, wasted storage, slower queries | "Explicit mappings with dynamic: strict. Index only fields needed for search, filter, sort, and display." |
| "LIKE '%query%' in PostgreSQL instead" | Full table scan, no relevance scoring, no fuzzy matching | "PostgreSQL tsvector for <10M rows. Elasticsearch for >10M or when relevance tuning, aggregations, or autocomplete are needed." |
| Ignoring near-real-time delay | Documents are not searchable for ~1 second after indexing | "1-second indexing delay is acceptable for most search use cases. For real-time requirements, supplement with a source-of-truth check." |
| "We'll use scroll API for pagination" | Scroll holds a point-in-time snapshot consuming heap; deprecated for user-facing pagination | "search_after with a sort tiebreaker for deep pagination. Scroll only for bulk export." |