StaffSignal

Design Service Discovery

Staff-Level Playbook

Technologies referenced in this playbook: ZooKeeper & etcd · API Gateways

How to Use This Playbook

This playbook supports three reading modes:

| Mode | Time | What to Read |
| --- | --- | --- |
| Quick Review | 15 min | Executive Summary → Interview Walkthrough → Fault Lines (§3) → Drills (§7) |
| Targeted Study | 1-2 hrs | Interview Walkthrough + Core Flow, expand appendices where you're weak |
| Deep Dive | 3+ hrs | Everything, including all appendices |
What is Service Discovery? — Why interviewers pick this topic

The Problem

In a microservices architecture, services need to find and communicate with each other. IP addresses and ports change constantly — containers restart, auto-scaling adds/removes instances, deploys roll new versions. Service discovery is the mechanism that lets services find each other dynamically, without hardcoded addresses. It's the phone book of your distributed system.

Common Use Cases

  • Service-to-Service Communication: Service A needs to call Service B — which instance? At what address?
  • Load Distribution: Spread requests across multiple instances of a service
  • Health-Aware Routing: Only route to healthy instances, skip unhealthy ones
  • Blue-Green/Canary Deployment: Route traffic to specific versions of a service
  • Multi-Region Routing: Route to the closest or most appropriate region

Why Interviewers Ask About This

Service discovery surfaces the core Staff-level tension: availability vs consistency of the service registry. A stale registry routes to dead instances (bad). A strict registry that requires consensus adds latency to every lookup (also bad). Interviewers want to see you reason about this tension, choose the right consistency model for discovery, and understand the operational cost of registry failures.

Mechanics Refresher: DNS Resolution for Service Discovery — How DNS-based discovery actually works

DNS Record Types for Discovery

| Record Type | Returns | Example | Use Case |
| --- | --- | --- | --- |
| A record | IP address | checkout.prod → 10.0.1.5 | Simple service resolution |
| SRV record | IP + Port + Priority + Weight | _http._tcp.checkout.prod → 10.0.1.5:8080 priority=10 weight=50 | Port-aware routing, weighted load distribution |
| CNAME | Another hostname | checkout.prod → checkout.us-east.elb.aws.com | Indirection through load balancer |

SRV records are the "right" record type for service discovery because they include port information and support weighted routing. But many client libraries only support A records, which is why most DNS-based discovery uses A records with a well-known port convention.

DNS Caching Layers (The TTL Problem)

When you set a DNS TTL of 10 seconds, the actual staleness is often longer due to multiple caching layers:

Layer                     Caches?     Respects TTL?
──────────────────────────────────────────────────
Application DNS cache     Sometimes   Often ignores (JVM: 30s default!)
OS resolver cache         Yes         Usually respects
Local DNS server          Yes         Respects
Upstream DNS server       Yes         Respects

Java gotcha: The JVM caches DNS results for 30 seconds by default (forever for successful lookups in some versions). Set networkaddress.cache.ttl=10 in java.security or use -Dsun.net.inetaddr.ttl=10. This is the #1 cause of "DNS TTL is 10s but my service still routes to dead instances for 30s."

Effective staleness = TTL + max(caching layer delays). For a 10s TTL with JVM defaults: up to 40 seconds before a client sees a change.
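The staleness formula above can be made concrete with a small sketch. This is an illustrative model, not a measurement tool; the function name and the layer values are assumptions.

```python
# Hypothetical model: worst-case staleness a client can observe for a DNS
# change is the record TTL plus the slowest client-side caching layer,
# since that layer only re-resolves after its own cache expires.
def effective_staleness(ttl_s, layer_cache_ttls_s):
    """Record TTL plus the slowest caching layer on the client path."""
    return ttl_s + max(layer_cache_ttls_s, default=0)

# 10s record TTL with the JVM's 30s default application-level cache:
print(effective_staleness(10, [30, 0, 0]))  # 40 — the "up to 40 seconds" figure
```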

Mechanics Refresher: Health Check Protocols — How services prove they're alive

Health Check Types

| Check Type | How It Works | What It Proves | What It Misses |
| --- | --- | --- | --- |
| TCP | Open a TCP connection to port | Process is running and accepting connections | Application-level health (may accept TCP but crash on requests) |
| HTTP | Send GET to /health, expect 200 | Application is running and HTTP stack works | Deep health (DB may be disconnected) |
| gRPC | Call grpc.health.v1.Health/Check | gRPC server is running and responding | Same as HTTP — only proves the health endpoint works |
| Script/exec | Run a command inside the container | Arbitrary health logic | Slow — adds subprocess overhead per check |

The 200 OK Lie

A common failure mode: the health endpoint returns 200 OK but the service can't actually serve real requests. This happens because:

  • Health endpoint is a simple handler that doesn't touch the database
  • Connection pool is exhausted but the health thread has its own connection
  • Service is in a degraded state but technically "alive"

Fix: Distinguish liveness (is the process alive?) from readiness (can it serve traffic?). The readiness check should verify downstream dependencies:

GET /health    → 200 if process is alive (for restart decisions)
GET /ready     → 200 only if DB connected AND cache reachable AND <100 pending requests
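The liveness/readiness split can be sketched as two handlers. This is a minimal illustration: `ServiceState` and its fields are invented stand-ins for real dependency probes, not any particular framework's API.

```python
# Minimal sketch of the liveness vs readiness split.
from dataclasses import dataclass

@dataclass
class ServiceState:
    db_connected: bool       # hypothetical downstream checks
    cache_reachable: bool
    pending_requests: int

def health(state: ServiceState) -> int:
    """Liveness: 200 as long as the process can run this handler at all."""
    return 200

def ready(state: ServiceState) -> int:
    """Readiness: 200 only if downstream dependencies are actually usable."""
    ok = state.db_connected and state.cache_reachable and state.pending_requests < 100
    return 200 if ok else 503

degraded = ServiceState(db_connected=False, cache_reachable=True, pending_requests=5)
print(health(degraded), ready(degraded))  # 200 503 — alive, but should not receive traffic
```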

Timing Tradeoffs

| Parameter | Aggressive (fast detection) | Conservative (fewer false positives) |
| --- | --- | --- |
| Check interval | 2s | 10s |
| Failure threshold | 1 failure → remove | 3 failures → remove |
| Recovery threshold | 1 success → add back | 3 successes → add back |
| Detection time | 2-4s | 30s |
| False positive risk | High (network blip → removal) | Low |

Staff insight: Aggressive for readiness checks (remove unhealthy instances fast), conservative for liveness checks (don't restart containers due to transient network issues).
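The detection times in the table follow from simple arithmetic, sketched below. The function and the timeout term are a hypothetical model, not any registry's documented behavior.

```python
# Hypothetical model of worst-case failure detection: an instance may die just
# after a passing check, so each of the `failure_threshold` failed checks costs
# one full interval; a check that times out adds its timeout on top.
def worst_case_detection_s(interval_s, failure_threshold, check_timeout_s=0):
    return interval_s * failure_threshold + check_timeout_s

print(worst_case_detection_s(2, 1, check_timeout_s=2))  # 4  (aggressive row's upper bound)
print(worst_case_detection_s(10, 3))                    # 30 (conservative row)
```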

What This Interview Actually Tests

Service discovery is not a "use Consul or Eureka" question.

This is a registry reliability and failure propagation question that tests:

  • Whether you understand that the registry is a critical dependency for every service
  • Whether you reason about what happens when the registry is wrong (stale entries, missing entries)
  • Whether you design for registry failure (what if the registry itself is down?)
  • Whether you understand the naming and routing abstractions that scale with organizational growth

The key insight: Service discovery failure doesn't cause one service to fail — it causes every service to fail. The registry is the most dangerous dependency in your microservices architecture because it's invisible until it breaks.

The L5 vs L6 Contrast (Memorize This)

Level Calibration
| Behavior | L5 (Senior) | L6 (Staff) |
| --- | --- | --- |
| First move | "We'll use Consul for service discovery" | Asks "How many services? What's the failure mode if discovery is unavailable?" |
| Architecture | Centralized registry (Consul/Eureka) | Evaluates DNS-based vs registry-based vs mesh-based based on organizational needs |
| Failure reasoning | "Registry has health checks" | Identifies the registry SPOF: if discovery is down, no service can find any other service |
| Health checking | "Services register and deregister" | Designs multi-layer health: registry-level, client-level, and application-level health checks |
| Ownership | "Platform team manages the registry" | Defines naming conventions and ownership: who registers, who discovers, who gets paged |

Default Staff Positions (Unless Proven Otherwise)

| Position | Rationale |
| --- | --- |
| DNS-based discovery as default | Simple, well-understood, no additional infrastructure for basic needs |
| Client-side caching of discovery results | Registry failure shouldn't immediately cascade to all services |
| Health checks at multiple layers | Registry health check + client health check + application health check |
| Naming conventions that encode ownership | team-name.service-name.environment — names should be self-documenting |
| Graceful degradation on registry failure | Services continue with last-known-good addresses, not immediate failure |
| Separate data plane from control plane | Discovery lookups (data plane) should not depend on consensus (control plane) |

The Three Intents (Pick One and Commit)

| Intent | Constraint | Strategy | Registry Model |
| --- | --- | --- | --- |
| Simple Discovery | Operational simplicity | DNS-based, minimal infrastructure | DNS records (SRV/A records), no dedicated registry |
| Dynamic Discovery | Flexibility, health-aware routing | Dedicated registry (Consul, Eureka), API-based lookup | Centralized registry with health checks |
| Mesh-Based Discovery | Zero-trust, observability | Service mesh (Istio, Linkerd), sidecar proxies | Distributed, mesh-integrated |

The Four Fault Lines (The Core of This Interview)

  1. Client-Side vs Server-Side Discovery — Who resolves the address: the calling service or a load balancer?
  2. DNS vs Registry vs Mesh — What's the lookup mechanism? Each has different consistency and latency tradeoffs.
  3. Push vs Pull Health Checks — Does the registry actively check services, or do services self-report?
  4. Centralized vs Distributed Registry — One registry cluster or per-region/per-domain registries?

Each fault line has a tradeoff matrix with explicit "who pays" analysis. See §3.

Quick Reference: What Interviewers Probe

| After You Say... | They Will Ask... |
| --- | --- |
| "We'll use Consul" | "Consul is down. What happens to service-to-service communication?" |
| "DNS-based discovery" | "You deploy a new version. DNS TTL is 30 seconds. What happens to in-flight requests?" |
| "Services register on startup" | "Service crashes without deregistering. How long until other services stop routing to it?" |
| "Health checks every 10 seconds" | "A service is unhealthy but the health check hasn't run yet. What happens during that 10-second window?" |
| "Service mesh handles discovery" | "What's the operational cost? Your team of 5 now operates a service mesh for 50 services." |

Jump to Practice

Active Drills (§7) — 8 practice prompts with expected answer shapes

System Architecture Overview

[Architecture diagram]

Interview Walkthrough

The six phases below are compressed for a deep-dive format. Phases 1-3 deliver the crisp answer in 2-3 minutes. If probed, Phase 5 has depth for 15+ minutes.

Phase 1: Requirements & Framing (30 seconds)

Name the problem before the solution:

  • "In a microservices architecture, services move — they scale up, scale down, get redeployed, fail, and recover. Hardcoded IP addresses don't work. Service discovery provides a dynamic registry so callers can always find healthy instances of any service."

Then frame the key decision:

  • "The core tradeoff is client-side vs server-side discovery. Does the caller query a registry and pick an instance (client-side), or does a load balancer handle routing transparently (server-side)?"

Phase 2: Core Entities & API (30 seconds)

  • Service Registry: the authoritative map of service name → healthy instances (IP:port)
  • Service Instance: a running copy of a service with its address, health status, metadata (version, zone, weight)
  • Health Check: active probe or heartbeat confirming an instance is alive and ready
  • Watch/Subscription: a mechanism for callers to be notified of registry changes in real time

Phase 3: The 2-Minute Architecture (2 minutes)

Phase 4: Transition to Depth (15 seconds)

"The basic approaches are well-understood. The hard problems are: registry consistency during network partitions, stale instance information causing requests to dead instances, and the interaction between service discovery and deployment (canary, blue-green)."

Phase 5: Deep Dives (5-15 minutes if probed)

Probe 1: "What happens during a network partition?" (3-5 min)

"The registry is a distributed system — typically backed by Raft consensus (Consul, etcd). During a network partition, the minority side of the partition can't elect a leader and becomes read-only."

Walk through the failure modes:

  1. Caller in minority partition: The registry is read-only. New registrations and deregistrations don't propagate. The caller's cached instance list becomes stale. "Stale is better than empty — the caller should continue using its last known good list."
  2. Service instance in minority partition: The instance can't renew its registration heartbeat. The registry marks it as unhealthy after the TTL expires. Other callers stop routing to it — even though the instance may be healthy within its partition.
  3. Split-brain: Both sides of the partition think they're authoritative. After partition heals, the registry must merge — services that registered during the partition on both sides need reconciliation.

The key design decision: "AP vs CP for the registry. Consul and etcd are CP — they sacrifice availability during partitions (minority can't write). Eureka (Netflix) is AP — it sacrifices consistency (both sides accept registrations, may serve stale data). For service discovery, I prefer AP: serving slightly stale data is better than serving no data."

Probe 2: "How do you handle stale instances?" (3-5 min)

"A service instance crashes without deregistering. Its entry remains in the registry until the health check TTL expires (typically 30-90 seconds). During that window, callers route requests to a dead instance."

Mitigations:

  1. Client-side health tracking: The caller tracks success/failure per instance. If 3 consecutive requests to instance X fail, remove it from the local cache — don't wait for the registry to catch up. "The caller detects failure in 3 requests (~1 second). The registry detects it in 30-90 seconds. Client-side detection is 30-90x faster."
  2. Retry with next instance: On failure, immediately retry on a different instance. The caller never surfaces a single-instance failure to the user if healthy instances are available.
  3. Fast deregistration: Use a shutdown hook to deregister on graceful shutdown. Only crashes leave stale entries. "In practice, 99% of instance removals are graceful deployments with shutdown hooks. Crashes are the 1% edge case."
  4. Lease-based registration with short TTL: Register with a 10-second TTL and heartbeat every 5 seconds. A crashed instance disappears in 10 seconds instead of 90. "The cost: more heartbeat traffic. At 1,000 instances × 1 heartbeat/5s = 200 heartbeats/sec. That's trivial."
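Mitigations 1 and 2 can be sketched together. This is an illustrative sketch under assumed names (`ClientSideBalancer`, `record`, `call_with_retry` are invented, not a real library API), showing consecutive-failure eviction plus retry-on-next-instance.

```python
# Sketch: caller-side health tracking with fast local eviction and retry.
class ClientSideBalancer:
    def __init__(self, instances, max_consecutive_failures=3):
        self.instances = list(instances)
        self.failures = {i: 0 for i in self.instances}
        self.max_failures = max_consecutive_failures

    def record(self, instance, ok):
        # Reset on success; evict locally after N consecutive failures
        # instead of waiting 30-90s for the registry's health check.
        if ok:
            self.failures[instance] = 0
            return
        self.failures[instance] += 1
        if self.failures[instance] >= self.max_failures and instance in self.instances:
            self.instances.remove(instance)

    def call_with_retry(self, call):
        # Mitigation 2: on failure, immediately try the next instance.
        for instance in list(self.instances):
            ok = call(instance)
            self.record(instance, ok)
            if ok:
                return instance
        raise RuntimeError("no healthy instances available")

lb = ClientSideBalancer(["10.0.1.5:8080", "10.0.1.6:8080"])
# 10.0.1.5 is dead: the caller fails over within a single logical request.
print(lb.call_with_retry(lambda inst: inst.endswith("6:8080")))  # 10.0.1.6:8080
```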

Probe 3: "How does service discovery interact with deployments?" (3-5 min)

"A canary deployment puts 5% of traffic on the new version. How does service discovery enable this?"

With client-side discovery:

  • Register canary instances with metadata: version=2.1, canary=true
  • Callers read metadata and route 5% of traffic to canary instances based on a hash or random selection
  • "The routing logic is in the caller — which means every caller needs to support canary routing. Consistent behavior requires a shared client library."

With server-side discovery:

  • The load balancer (Kubernetes Ingress, Envoy) handles weighted routing: 95% to stable, 5% to canary
  • The caller doesn't know about the canary — routing is transparent
  • "Simpler for callers. But the LB must support weighted routing and the deployment pipeline must configure it."
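The client-side variant above can be sketched as a deterministic hash split, so a given user consistently sees the same version. This is an assumed design, not any mesh's or registry's API; the metadata field names are illustrative.

```python
# Sketch: route a stable ~5% of request ids to canary instances.
import hashlib

def is_canary(request_id: str, canary_percent: int = 5) -> bool:
    # Deterministic: the same id always lands in the same bucket of 100.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return bucket < canary_percent

def pick_pool(instances, request_id):
    canary = [i for i in instances if i.get("canary")]
    stable = [i for i in instances if not i.get("canary")]
    if canary and is_canary(request_id):
        return canary
    return stable

instances = [
    {"addr": "10.0.1.5:8080", "version": "2.0"},
    {"addr": "10.0.1.9:8080", "version": "2.1", "canary": True},
]
share = sum(is_canary(f"user-{n}") for n in range(10_000)) / 10_000
print(round(share, 3))  # close to 0.05
```

Putting this in a shared client library is exactly the organizational cost the text mentions: every caller, in every language, must carry this logic.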

Phase 6: Wrap-Up

"Service discovery is the nervous system of a microservices architecture. The technology choice (Consul vs Kubernetes DNS vs Eureka) matters less than three operational decisions: (1) AP vs CP registry behavior during partitions, (2) client-side vs registry-side health detection speed, and (3) how discovery integrates with your deployment model. Get these three wrong and your services find each other in theory but fail to communicate in practice."

Quick-Reference: The 30-Second Cheat Sheet

Level Calibration
| Topic | The L5 Answer | The L6 Answer (say this) |
| --- | --- | --- |
| Purpose | "Services find each other" | "Dynamic registry for ephemeral instances — services move, IPs change, the registry tracks them" |
| Client vs server | "Use Consul" or "Use K8s DNS" | "Server-side by default; client-side when you need routing control (canary, zone-aware)" |
| Partition behavior | "The registry is always available" | "AP vs CP tradeoff — stale data is better than no data for discovery" |
| Stale instances | "Health checks detect failures" | "Client-side detection in 3 requests; registry detection in 30-90 seconds — don't wait for the registry" |
| Deployment | "Deploy and it registers" | "Drain → deregister → update → register — every deployment step has a discovery step" |

1. The Staff Lens

1.1 Why This Problem Exists in Staff Interviews

This is NOT a "pick a service registry" question. Everyone knows Consul exists.

This is a Registry Reliability & Failure Propagation question that tests:

  • Whether you understand the registry as a single point of failure for the entire platform
  • Whether you design for registry unavailability (graceful degradation, not hard failure)
  • Whether you reason about stale discovery data (routing to dead instances, missing new instances)
  • Whether you can design naming conventions that scale with organizational growth

1.2 The L5 vs L6 Contrast

Recall the five key behaviors from the Executive Summary. Below, we explain why each matters and what interviewers listen for.

Behavior 1: First move (ask about failure modes)

Staff signal: Understand the blast radius of registry failure before choosing technology.

Why this matters (L5 vs L6)

L5: Jumps to technology selection — "We'll use Consul" or "Kubernetes services." This skips the critical design question: what happens when the registry is unavailable?

L6: Asks about the failure mode first: "If the registry is down for 5 minutes, what happens to service-to-service communication? Can services continue with cached addresses?" This shapes the entire design — if the answer is "services die when registry is down," you need client-side caching, local DNS, or mesh-based resolution.

Behavior 2: Architecture (match mechanism to organizational needs)

Staff signal: Choose the discovery mechanism based on organizational complexity, not hype.

Why this matters (L5 vs L6)

L5: Defaults to a dedicated registry because it's the "modern" approach. For 10 services with 2 teams, this is over-engineering — DNS-based discovery is simpler, cheaper, and sufficient.

L6: Matches mechanism to organizational needs: DNS for simple platforms (<20 services), dedicated registry for dynamic platforms (20-100+ services with frequent deploys), service mesh for platforms requiring zero-trust and advanced traffic management. Each step up adds capability and operational cost.


Behavior 3: Failure reasoning (registry is a Tier-0 dependency)

Staff signal: Design for registry failure as a first-class concern, not an afterthought.

Why this matters (L5 vs L6)

L5: Treats the registry as reliable infrastructure — "Consul is highly available." This assumption fails when the registry cluster has a quorum issue, network partition, or operational error.

L6: Designs defense-in-depth: (1) client-side caching — services cache discovery results and continue with cached addresses during registry outage, (2) local DNS fallback — if registry-based DNS fails, fall back to static DNS entries, (3) circuit breaker on discovery calls — don't let registry latency cascade to every service call.
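Defense layer (1) can be sketched as a caching discovery client that serves stale results when the registry is unreachable. The class and its parameters are invented names for illustration; `lookup` stands in for a real registry call.

```python
# Sketch: cache discovery results; on registry failure, serve the
# last-known-good answer for up to stale_ttl_s ("stale beats empty").
import time

class CachingDiscoveryClient:
    def __init__(self, lookup, fresh_ttl_s=10, stale_ttl_s=300):
        self.lookup = lookup                  # hypothetical registry call
        self.fresh_ttl = fresh_ttl_s
        self.stale_ttl = stale_ttl_s          # how long stale data is acceptable
        self.cache = {}                       # service -> (instances, fetched_at)

    def resolve(self, service, now=None):
        now = time.monotonic() if now is None else now
        cached = self.cache.get(service)
        if cached and now - cached[1] < self.fresh_ttl:
            return cached[0]
        try:
            instances = self.lookup(service)
            self.cache[service] = (instances, now)
            return instances
        except Exception:
            # Registry down: fall back to stale data, within a bound.
            if cached and now - cached[1] < self.stale_ttl:
                return cached[0]
            raise

def flaky_lookup(service):
    raise ConnectionError("registry unreachable")

client = CachingDiscoveryClient(lambda s: ["10.0.1.5:8080"])
print(client.resolve("checkout", now=0.0))    # fresh fetch from the registry
client.lookup = flaky_lookup
print(client.resolve("checkout", now=60.0))   # registry down: serves cached result
```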

Behavior 4: Health checking (multi-layer defense)

Staff signal: Health checks at multiple layers prevent different classes of failures.

Why this matters (L5 vs L6)

L5: Relies on registry health checks: "Consul checks services every 10 seconds." This misses the 10-second window where a service is down but still registered, and doesn't account for application-level health (service is TCP-reachable but returning errors).

L6: Designs multi-layer health checking: (1) registry-level health check (TCP/HTTP) — is the process alive? (2) client-side health check — is the endpoint responding with acceptable latency? (3) application-level health check — is the service returning correct responses? Each layer catches different failure modes.

Behavior 5: Ownership (naming encodes organizational structure)

Staff signal: Service naming conventions should make ownership obvious.

Why this matters (L5 vs L6)

L5: Uses ad-hoc naming — user-service, my-api, service-v2. At 10 services this works. At 200 services, nobody knows who owns what or what depends on what.

L6: Designs naming conventions that encode ownership: payments.checkout-service.prod, catalog.search-api.staging. Names include team/domain, service name, and environment. This makes ownership discoverable from the service name itself and enables automated routing rules (all payments.* traffic goes through payments gateway).

1.3 The Staff Question That Cuts Through Everything

Ask: who is the authority on instance health, and what happens when that authority is wrong? This reframes the entire interview. The registry isn't the problem — the health authority is the problem. Staff engineers design for the case where the health authority is wrong, not just for the case where the registry is down.

2. Problem Framing & Intent

2.1 The Three Intents

Before choosing any technology, ask What's the organizational complexity?

Who Pays Analysis
| Intent | Constraint | Strategy | Health Model | Operational Cost |
| --- | --- | --- | --- | --- |
| Simple Discovery | Minimize ops | DNS SRV records, Kubernetes Services | Kubelet/LB health checks | Minimal |
| Dynamic Discovery | Health-aware, feature-rich | Consul/Eureka with API | Active + passive health checks | Medium |
| Mesh Discovery | Zero-trust, traffic management | Istio/Linkerd, sidecar proxies | Mesh-integrated health + circuit breaking | High |

2.2 What's Intentionally Underspecified

The interviewer deliberately avoids specifying:

  • Number of services and expected growth
  • Container orchestration (Kubernetes vs bare metal vs hybrid)
  • Multi-region requirements
  • Security requirements (mTLS, zero-trust)
  • Current pain points

Staff engineers surface these unknowns. Senior engineers jump to technology selection.

2.3 How to Open (The First 2 Minutes)

  1. Ask about organizational scale and growth trajectory
  2. State your mechanism assumption explicitly
  3. Outline your plan: discovery mechanism → naming → health checking → failure modes → observability

Example opening:

  • "Before picking a mechanism: how many services and teams, and what's the growth trajectory? I'll assume roughly 50 services with frequent deploys. My plan: discovery mechanism → naming → health checking → failure modes → observability."

3. Fault Lines

3.1 Fault Line 1: Client-Side vs Server-Side Discovery

The tension: Client-side discovery gives services control over routing but requires client libraries. Server-side discovery is transparent but adds a hop.

Who Pays Analysis
| Approach | Latency | Flexibility | Client Complexity | Who Pays |
| --- | --- | --- | --- | --- |
| Client-side (service resolves address directly) | Low — direct connection | High — client controls routing | High — needs discovery library | Engineering (client library maintenance, per-language implementations) |
| Server-side (LB/proxy resolves) | +1-3ms — extra hop | Medium — LB controls routing | Low — client just calls a hostname | Infra (LB infrastructure), Users (extra latency hop) |
| Sidecar (local proxy per service) | +0.5ms — local hop | High — sidecar controls routing | None — transparent to client | Infra (sidecar resource overhead per instance) |

L6 answer: "Server-side or sidecar, depending on platform maturity. For most teams, server-side (Kubernetes Services or a service-aware LB) is sufficient — services call a DNS name, the LB resolves to a healthy instance. For platforms needing fine-grained traffic control (canary, circuit breaking), the sidecar pattern (Envoy) gives control without client library complexity. I'd avoid client-side discovery unless we have a strong reason — maintaining client libraries in 5 languages is organizational debt."

L7 answer: "The pattern should match the organizational stage. Client-side is fine for 3 services in one language. Server-side scales to 50+ services without per-team effort. Sidecar/mesh is for 100+ services needing zero-trust and traffic management. Each migration adds capability and cost. I'd start with server-side and evolve to sidecar only when the pain justifies the operational investment."

3.2 Fault Line 2: DNS vs Registry vs Mesh

The tension: DNS is simple but has TTL staleness. Registries are dynamic but add infrastructure. Service mesh is comprehensive but expensive to operate.

Who Pays Analysis
| Mechanism | Staleness | Infrastructure Cost | Feature Set | Who Pays |
| --- | --- | --- | --- | --- |
| DNS (SRV/A records, CoreDNS) | Bounded by TTL (5-60s) | Minimal — DNS is everywhere | Basic — name resolution only | Users (routing to stale/dead endpoints during TTL), Engineering (no health-aware routing) |
| Registry (Consul, Eureka, etcd) | Low (1-5s with watches/long-polling) | Medium — registry cluster to operate | Rich — health checks, KV store, watches | Infra (registry operational burden), Platform team (registry is Tier-0) |
| Service Mesh (Istio, Linkerd) | Very low (real-time updates) | High — control plane + sidecars | Full — mTLS, traffic management, observability | Infra (significant resource overhead), Engineering (mesh complexity) |

L6 answer: "DNS-based with intelligent DNS (CoreDNS or Consul DNS interface) as the default. Services resolve checkout.payments.svc via DNS. Behind the scenes, DNS returns only healthy instances with short TTLs (5-10 seconds). This gives us health-aware routing with DNS simplicity. For services needing watch-based real-time updates, the registry API is available for direct consumption."

L7 answer: "I'd layer the mechanisms. DNS as the universal interface (every language/framework supports it). Registry API for services needing real-time updates (deployment orchestration, canary tooling). Service mesh for services needing mTLS and traffic shaping. Each layer is optional — teams opt in based on their requirements. This prevents forcing mesh complexity on teams that only need DNS."

3.3 Fault Line 3: Push vs Pull Health Checks

The tension: Active health checks (registry pings services) detect failures proactively but add load. Passive health checks (observe real traffic) have zero overhead but detect failures reactively.

Who Pays Analysis
| Approach | Detection Time | Overhead | False Positives | Who Pays |
| --- | --- | --- | --- | --- |
| Active (registry polls services) | Fast (check interval) | Medium — registry sends N health checks | Some — network blip ≠ service failure | Infra (health check traffic), Services (must expose health endpoint) |
| Self-report (services heartbeat to registry) | Fast (heartbeat interval) | Low — each service sends one heartbeat | Low | Services (must implement heartbeat), Registry (processing heartbeats) |
| Passive (observe real traffic errors) | Slow — requires real traffic failure first | Zero — piggybacks on real requests | Very low | Users (first N requests hit unhealthy instance before detection) |
| Hybrid (active + passive) | Fastest — active catches process death, passive catches application errors | Medium | Balanced | Infra (two systems), best user experience |

L6 answer: "Hybrid: active health checks for process-level health (is the service reachable?), passive health checks for application-level health (is it returning correct responses?). The registry runs active TCP/HTTP checks every 5 seconds. The client-side LB or sidecar tracks real-time success rates — if a specific instance's error rate exceeds 50% over a 10-second window, remove it from the rotation immediately, don't wait for the next registry health check."

L7 answer: "I'd add a third layer: application-level readiness probes that reflect business health. A service can be TCP-healthy (process running), HTTP-healthy (returning 200), but business-unhealthy (database connection pool exhausted, returning errors for real requests). Readiness probes should check downstream dependencies, not just 'am I alive?' The three layers — liveness, readiness, and traffic-based — catch different failure classes."
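The passive layer in the L6 answer can be sketched as a sliding-window error-rate tracker. The class name, thresholds, and `min_samples` guard are illustrative assumptions, not any load balancer's actual implementation.

```python
# Sketch: eject an instance when its error rate over a 10s window exceeds 50%.
from collections import deque

class PassiveHealthTracker:
    def __init__(self, window_s=10.0, max_error_rate=0.5, min_samples=5):
        self.window_s = window_s
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples      # avoid ejecting on 1-of-1 failures
        self.samples = deque()              # (timestamp, ok)

    def record(self, ts, ok):
        self.samples.append((ts, ok))
        # Drop samples that have aged out of the window.
        while self.samples and ts - self.samples[0][0] > self.window_s:
            self.samples.popleft()

    def healthy(self):
        if len(self.samples) < self.min_samples:
            return True
        errors = sum(1 for _, ok in self.samples if not ok)
        return errors / len(self.samples) <= self.max_error_rate

t = PassiveHealthTracker()
for i in range(6):
    t.record(ts=i, ok=(i % 3 == 0))   # 2 successes, 4 errors within the window
print(t.healthy())  # False — error rate 4/6 exceeds 50%
```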

3.4 Fault Line 4: Centralized vs Distributed Registry

The tension: A centralized registry is simple but is a single point of failure. Distributed registries are resilient but complex.

Who Pays Analysis
| Approach | Availability | Consistency | Operational Cost | Who Pays |
| --- | --- | --- | --- | --- |
| Single cluster (one registry) | Limited by cluster health | Strong | Low — one thing to manage | Everyone (blast radius if registry fails) |
| Per-region cluster (registry per region) | High — regional independence | Per-region (cross-region eventual) | Medium — N clusters | Infra (multiple clusters), Engineering (cross-region discovery) |
| Embedded (registry embedded in each service) | Very high — no external dependency | Eventual (gossip) | Low — no separate infrastructure | Engineering (gossip protocol complexity), Infra (convergence time) |

L6 answer: "Per-region registry clusters with cross-region federation. Each region has its own registry cluster — services in us-east register with the us-east registry. For cross-region discovery, registries federate: they share service catalogs asynchronously. If one region's registry fails, other regions continue independently. Cross-region lookups use the federated catalog with eventual consistency (acceptable — cross-region calls are already higher latency)."

L7 answer: "I'd separate the registry's control plane from its data plane. The control plane (consensus-based, for registration and deregistration) can be per-region. The data plane (serving lookups) should be distributed — cached locally on each node, refreshed from the control plane periodically. This way, the control plane can be briefly unavailable without affecting service-to-service communication."

4. Failure Modes & Operational Reality

4.1 The Registry Cascade Failure

Timeline of a registry outage:

t=0:     Registry cluster loses quorum (AZ failure)
t=5s:    Services can't resolve new lookups
t=10s:   Client-side DNS cache starts expiring (TTL=10s)
t=15s:   Services with expired cache can't reach any other service
t=30s:   50 services report "service unavailable" errors
t=1m:    Every inter-service call fails
t=5m:    Registry quorum restored
t=5.5m:  Services re-resolve, traffic resumes

Staff insight: The 5-minute outage duration is registry-specific, but the blast radius (50 services) follows from every service depending on the registry. Client-side caching with a longer fallback TTL (e.g., serve cached discovery results for up to 5 minutes when the registry is unreachable) would have left most services unaffected for the entire outage.

4.2 The Stale Registry Entry

| Scenario | Impact | Detection Time | Mitigation |
| --- | --- | --- | --- |
| Service crashes, no deregister | Registry routes to dead instance | Health check interval (5-10s) | Active health checks + client-side circuit breaker |
| Zombie instance (process alive, not serving) | Registry shows healthy, requests fail | Until health check catches app-level failure | Application-level readiness probes |
| DNS cache stale | Client routes to old IP after service restart | DNS TTL (5-60s) | Short DNS TTL + connection retry on first failure |
| Registry split-brain | Different services see different service catalogs | Until partition heals | Per-region registries, client-side caching |

4.3 The Naming Collision

The scenario: Two teams independently name their service user-service. Both register in the same registry. Traffic intended for Team A's service routes to Team B's service.

Staff position: This is an organizational problem, not a technical one. Enforce naming conventions with namespace requirements: {team}.{service}.{environment}. Registry registration validates namespace ownership — Team A can only register services under team-a.*. Naming collisions become impossible by construction.
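"Impossible by construction" can be sketched as a registration validator. The ownership table, team identifiers, and allowed environments below are invented for illustration; a real registry would check ownership against an authoritative source.

```python
# Sketch: reject registration unless the name is namespaced as
# {team}.{service}.{environment} and the caller's team owns that namespace.
NAMESPACE_OWNERS = {"payments": "team-payments", "catalog": "team-catalog"}

def validate_registration(service_name: str, caller_team: str) -> bool:
    parts = service_name.split(".")
    if len(parts) != 3:                      # enforce {team}.{service}.{env}
        return False
    namespace, _service, env = parts
    if env not in {"prod", "staging", "dev"}:
        return False
    return NAMESPACE_OWNERS.get(namespace) == caller_team

print(validate_registration("payments.checkout-service.prod", "team-payments"))  # True
print(validate_registration("user-service", "team-a"))             # False: not namespaced
print(validate_registration("payments.fraud.prod", "team-catalog"))  # False: wrong owner
```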

4.4 Service Discovery in Deployments

During a rolling deployment, three things must hold: (1) new instances pass health checks before receiving traffic, (2) old instances drain in-flight connections before deregistering, and (3) the registry propagates updates quickly enough that clients don't route to terminated instances.

5. Evaluation Rubric

5.1 Level-Based Signals

Level Calibration
| Dimension | L5 (Senior) | L6 (Staff) | L7 (Principal) |
| --- | --- | --- | --- |
| First move | "Use Consul/Eureka" | Asks about failure modes and organizational scale | Frames as platform reliability problem |
| Mechanism | Dedicated registry for everything | Matches mechanism to scale (DNS → registry → mesh) | Layered mechanisms with opt-in complexity |
| Failure design | "Registry has HA" | Client-side caching, graceful degradation | Registry data plane separation, defense-in-depth |
| Health checks | "Registry health checks" | Multi-layer: registry + client + application | Three-layer health: liveness, readiness, traffic-based |
| Naming | Ad-hoc service names | Namespace conventions encoding ownership | Naming as platform contract with governance |
| Operational cost | Ignores operational burden | Quantifies registry as Tier-0 dependency | TCO analysis: DNS vs registry vs mesh per organizational stage |

5.2 Strong Hire Signals

| Signal | What It Looks Like |
| --- | --- |
| Registry failure reasoning | "If the registry is down for 5 minutes, services should continue with cached addresses." |
| Multi-layer health | "Registry checks TCP health. Client-side checks response health. App checks readiness." |
| Naming governance | "Names follow {team}.{service}.{env} — ownership is discoverable from the name." |
| Graceful degradation | "Services cache discovery results with 5-minute fallback. Registry failure doesn't cascade." |

5.3 Lean No Hire Signals

| Signal | What It Looks Like |
|---|---|
| Technology-first | Spends 10 minutes on Consul features without discussing registry failure |
| No degradation design | "Registry handles HA" without addressing what happens during registry outage |
| No naming strategy | Ad-hoc names without namespace or ownership conventions |
| Single-layer health | Only registry health checks, no client-side or application-level health |

5.4 Common False Positives

  • Deep Consul/Eureka knowledge ≠ good discovery design (technology, not architecture)
  • "Service mesh solves everything" ≠ understanding the operational cost
  • Complex routing rules ≠ Staff thinking (often over-engineering)
  • "Zero-downtime deployments" without explaining the discovery update mechanism

6 Interview Flow & Pivots

6.1 Typical 45-Minute Structure

| Phase | Time | What Happens |
|---|---|---|
| Framing | 0-5 min | Clarify organizational scale, ask about failure tolerance |
| Requirements | 5-10 min | Surface unknowns: service count, deployment frequency, multi-region |
| High-Level Design | 10-20 min | Discovery mechanism, naming, health checking strategy |
| Deep Dive | 20-35 min | Fault lines: registry failure, stale entries, deployment interactions |
| Wrap-Up | 35-45 min | Observability, operational readiness, evolution path |

6.2 How Interviewers Pivot

| After You Say... | They Will Probe... |
|---|---|
| "DNS-based discovery with short TTLs" | "Instance dies mid-TTL. How many requests fail before DNS refreshes?" |
| "Client-side caching for resilience" | "How do you handle cache staleness? What if a cached address points to a new, different service?" |
| "Health checks every 5 seconds" | "Service has a 3-second startup time. It's receiving traffic before it's ready. What's wrong?" |
| "Consul for registration and lookup" | "Consul cluster is in us-east. Your service is in eu-west. What's the lookup latency?" |
| "Service mesh for zero-trust" | "What's the resource overhead per service? You have 200 services with sidecars." |

6.3 What Silence Means

  • After registry failure question → Interviewer wants specific degradation behavior, not just "we have HA"
  • After naming question → Interviewer is testing organizational thinking, not technical naming
  • After health check question → Interviewer wants you to distinguish process health from application health
  • After deployment question → Interviewer wants to see awareness of the discovery-deployment interaction

6.4 Follow-Up Questions to Expect

  1. "How do you handle a service that takes 30 seconds to start? When should it receive traffic?"
  2. "Service A depends on Service B. Service B's registry entry is stale. How does Service A cope?"
  3. "You have 200 services. How do you visualize and debug the dependency graph?"
  4. "A team renames their service. How do you handle backward compatibility?"
  5. "How do you support canary deployments through service discovery?"
  6. "Your registry cluster runs out of memory. What happens?"

7 Active Drills


Drill 1: Discovery Mechanism Selection

Interview Prompt

"You're building service discovery for a 30-service platform. What mechanism do you choose?"

Staff Answer

DNS-based with a registry backend. Services resolve each other via DNS names (checkout.payments.svc). Behind the DNS interface, a registry (Consul or CoreDNS with health-aware backends) returns only healthy instances. DNS TTL of 5-10 seconds balances freshness with DNS server load.

Why DNS as the interface: every language and framework supports DNS natively — no client library needed. Why registry backend: plain DNS doesn't do health checking. The registry provides health awareness while DNS provides the universal lookup interface.

For 30 services, a dedicated registry (Consul) is justified. For 10 services, Kubernetes Services (built-in DNS + health checks) is sufficient. For 200+ services, a service mesh becomes worth the operational investment.

Why this is L6:

  • Matches mechanism to organizational scale
  • Uses DNS as the universal interface with registry backend for health awareness
  • Provides clear scaling inflection points (10 → 30 → 200 services)

Drill 2: Registry Failure Resilience

Interview Prompt

"Your service registry is down for 5 minutes. What happens to inter-service communication?"

Staff Answer

With a naive implementation: all service-to-service calls fail because nobody can resolve addresses.

With proper design: services cache discovery results locally. When the registry is unreachable, services continue using cached addresses. The cache has a fallback TTL (e.g., 5 minutes) — longer than the normal TTL (10 seconds). During the registry outage:

  • Existing services continue communicating using cached addresses (99% of traffic)
  • New service instances can't be discovered (but they can still be reached by IP if known)
  • Instances that died during the outage are still in cache — clients experience connection failures to those instances, but client-side circuit breaking removes them from the rotation

The net impact: brief increase in connection errors to recently-failed instances, but the majority of service-to-service communication continues unaffected.
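The two-tier TTL above can be sketched as a small client-side cache. This is an illustrative sketch, not any particular registry client's API; `lookup_fn` stands in for the real registry call, and the TTL values mirror the text.

```python
import time

class DiscoveryCache:
    """Client-side discovery cache with a short fresh TTL for normal
    operation and a longer fallback TTL used only when the registry
    is unreachable (names and defaults are illustrative)."""

    def __init__(self, lookup_fn, fresh_ttl=10, fallback_ttl=300,
                 clock=time.monotonic):
        self._lookup = lookup_fn           # queries the registry
        self._fresh_ttl = fresh_ttl        # seconds: normal cache lifetime
        self._fallback_ttl = fallback_ttl  # seconds: lifetime during outage
        self._clock = clock
        self._cache = {}                   # service -> (instances, fetched_at)

    def resolve(self, service):
        now = self._clock()
        entry = self._cache.get(service)
        if entry and now - entry[1] < self._fresh_ttl:
            return entry[0]                # fresh: serve from cache
        try:
            instances = self._lookup(service)
        except Exception:
            # Registry unreachable: degrade to stale-but-usable results.
            if entry and now - entry[1] < self._fallback_ttl:
                return entry[0]
            raise                          # no usable cache: surface the failure
        self._cache[service] = (instances, now)
        return instances
```

A caller wires it up as `cache = DiscoveryCache(lambda svc: registry.lookup(svc))` and resolves through `cache.resolve("checkout")`; registry failures only become visible once the fallback TTL has also expired.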

Why this is L6:

  • Designs for degradation, not perfection
  • Client-side caching as first defense
  • Client-side circuit breaking as second defense
  • Quantifies impact: "99% of traffic continues"

Drill 3: Health Check Design

Interview Prompt

"Design the health check strategy for a payment processing service."

Staff Answer

Three layers for a critical service like payments:

Layer 1 — Liveness (is the process alive?): TCP check on the service port, every 5 seconds. Registry removes instance after 3 consecutive failures (15 seconds). This catches process crashes and network issues.

Layer 2 — Readiness (can it serve traffic?): HTTP GET to /ready endpoint, every 5 seconds. The /ready endpoint checks database connectivity, downstream service connectivity, and that initialization is complete. New instances don't receive traffic until readiness passes. This catches startup delays and dependency failures.

Layer 3 — Traffic-based (is it performing well?): Client-side monitoring of response times and error rates. If an instance's p99 latency exceeds 500ms or error rate exceeds 10% over a 30-second window, the client-side LB removes it from rotation. This catches slow degradation that liveness/readiness checks miss.

For payments specifically, I'd add a synthetic transaction check: every 60 seconds, process a $0.01 test transaction end-to-end. This validates the entire payment pipeline, not just HTTP health.
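The traffic-based layer (Layer 3) can be sketched as a sliding-window tracker. This is a minimal illustration with invented names; the thresholds mirror the ones above (p99 > 500 ms or error rate > 10% over a 30-second window).

```python
import time
from collections import deque

class InstanceHealthTracker:
    """Per-instance client-side health: record each call's latency and
    error flag, then decide whether the instance stays in rotation."""

    def __init__(self, window_s=30.0, p99_limit_ms=500.0, error_limit=0.10,
                 clock=time.monotonic):
        self._window = window_s
        self._p99_limit = p99_limit_ms
        self._error_limit = error_limit
        self._clock = clock
        self._samples = deque()  # (timestamp, latency_ms, is_error)

    def record(self, latency_ms, is_error=False):
        self._samples.append((self._clock(), latency_ms, is_error))

    def _prune(self):
        cutoff = self._clock() - self._window
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def healthy(self):
        self._prune()
        if not self._samples:
            return True  # no data: assume healthy, rely on registry checks
        latencies = sorted(s[1] for s in self._samples)
        p99 = latencies[min(len(latencies) - 1, int(len(latencies) * 0.99))]
        error_rate = sum(1 for s in self._samples if s[2]) / len(self._samples)
        return p99 <= self._p99_limit and error_rate <= self._error_limit
```

The client-side load balancer consults `healthy()` before handing an instance a request; once the window slides past a bad burst, the instance re-enters rotation automatically.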

Why this is L6:

  • Three distinct layers catching different failure modes
  • Readiness check includes dependency health (not just "am I alive")
  • Traffic-based checks for progressive degradation
  • Domain-specific synthetic check for payments

Drill 4: Naming Convention Design

Interview Prompt

"Design a service naming convention for a company with 20 teams and 100+ services."

Staff Answer

Format: {domain}.{service-name}.{environment}[.{region}]

Examples:

  • payments.checkout-api.prod.us-east
  • catalog.search-service.staging
  • identity.auth-gateway.prod

Rules: (1) Domain maps to team ownership — the payments domain is owned by the payments team. (2) Service names are lowercase, hyphenated, descriptive. (3) Environment is required — prevents accidental cross-environment calls. (4) Region is optional — only needed for multi-region routing.

Enforcement: registry registration validates that the registering service's namespace matches its team ownership. Team payments can only register payments.* services. This prevents naming collisions by construction.
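Registration-time enforcement could look like the sketch below. The allowed environment values, the region pattern, and the `team_domains` ownership input are assumptions for illustration, not a real registry's API.

```python
import re

# Validator for the {domain}.{service-name}.{environment}[.{region}] convention.
NAME_RE = re.compile(
    r"^(?P<domain>[a-z][a-z0-9-]*)"       # team-owned namespace
    r"\.(?P<service>[a-z][a-z0-9-]*)"     # lowercase, hyphenated service name
    r"\.(?P<env>prod|staging|dev)"        # environment is required
    r"(?:\.(?P<region>[a-z]{2}-[a-z]+))?$"  # region is optional
)

def validate_registration(name, team_domains):
    """Reject names that break the convention or claim a namespace the
    registering team does not own (the ownership set is a hypothetical input)."""
    m = NAME_RE.match(name)
    if not m:
        raise ValueError(f"{name!r} does not match the naming convention")
    if m.group("domain") not in team_domains:
        raise ValueError(f"team does not own the {m.group('domain')!r} namespace")
    return m.groupdict()
```

Because the check runs at registration, collisions are rejected before they enter the registry rather than discovered at lookup time.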

Migration: existing services keep their old names as aliases. New convention is enforced for all new services. Over 6 months, teams migrate existing services during routine maintenance.

Why this is L6:

  • Names encode ownership (discoverable from the name)
  • Namespace enforcement prevents collisions
  • Migration strategy (not big-bang rename)
  • Includes environment to prevent cross-environment mistakes

Drill 5: Stale Discovery Handling

Interview Prompt

"A service instance dies but is still registered. How do you handle the 10-second health check gap?"

Staff Answer

Three defenses for the gap between instance death and health check detection:

  1. Client-side retry with next instance: When a connection to an instance fails, the client immediately retries with the next instance from the discovery result. The dead instance is tried once, fails, and the client transparently fails over. User-visible impact: +50ms latency for the retry, no error.

  2. Client-side circuit breaker: After 2-3 consecutive failures to an instance within 10 seconds, the client marks it as unhealthy locally. Subsequent requests skip it. The circuit breaker resets after 30 seconds (by then, the registry has also deregistered it).

  3. Connection draining on shutdown: For graceful shutdowns, the service deregisters from the registry BEFORE stopping. This eliminates the gap for planned shutdowns. The 10-second gap only applies to unplanned crashes.

Net impact: for unplanned crashes, the first 1-2 requests to the dead instance see a +50ms retry. All subsequent requests are routed to healthy instances by the client-side circuit breaker. No user-visible errors.
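Defenses 1 and 2 combine into a client-side sketch like the following. Names and parameters are illustrative; `call_fn` stands in for the actual RPC, and the breaker thresholds mirror the text (2-3 consecutive failures, 30-second reset).

```python
import time

class CircuitBreakerLB:
    """Retry-with-next-instance plus a local circuit breaker: after
    `failure_limit` consecutive failures an instance is skipped until
    `reset_after` seconds pass (by then the registry should also have
    deregistered it)."""

    def __init__(self, failure_limit=3, reset_after=30.0, clock=time.monotonic):
        self._limit = failure_limit
        self._reset = reset_after
        self._clock = clock
        self._failures = {}  # instance -> (consecutive_failures, last_failure_ts)

    def _breaker_open(self, inst):
        count, ts = self._failures.get(inst, (0, 0.0))
        if count < self._limit:
            return False
        if self._clock() - ts >= self._reset:
            self._failures.pop(inst, None)  # breaker resets: try it again
            return False
        return True

    def call(self, instances, call_fn):
        last_err = None
        for inst in instances:
            if self._breaker_open(inst):
                continue  # locally marked unhealthy: skip without trying
            try:
                result = call_fn(inst)
            except ConnectionError as err:
                count, _ = self._failures.get(inst, (0, 0.0))
                self._failures[inst] = (count + 1, self._clock())
                last_err = err
                continue  # transparent failover to the next instance
            self._failures.pop(inst, None)  # success resets the failure count
            return result
        raise last_err or RuntimeError("no instances available")
```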

Why this is L6:

  • Client-side retry as first defense (transparent to user)
  • Client-side circuit breaking as second defense (removes from rotation)
  • Connection draining for graceful shutdowns (eliminates gap entirely)
  • Quantifies user impact: "+50ms for 1-2 requests"

Drill 6: Cross-Region Discovery

Interview Prompt

"Service A in us-east needs to call Service B in eu-west. How does discovery work?"

Staff Answer

Two options depending on the routing intent:

Option 1: Region-aware routing (prefer local, fallback to remote). Service A first looks up Service B in the us-east registry. If Service B has instances in us-east, route locally. If not, the registry federates — it queries the eu-west registry and returns eu-west instances. Service A connects cross-region (80-100ms latency).

Option 2: Explicit cross-region (Service B is only in eu-west). The naming convention includes region: data-team.service-b.prod.eu-west. Service A explicitly resolves the eu-west address. This is intentional cross-region communication.

I'd default to Option 1 for most services — prefer local instances when available. The discovery system should handle failover transparently: if us-east instances of Service B are all unhealthy, automatically fall back to eu-west instances (with latency awareness).

Why this is L6:

  • Distinguishes region-aware routing from explicit cross-region calls
  • Prefers local with automatic fallback (reduces latency for common case)
  • Naming convention supports both patterns

Drill 7: Service Mesh vs Registry

Interview Prompt

"Your team is debating: dedicated registry (Consul) vs service mesh (Istio). How do you decide?"

Staff Answer

The decision depends on what you need beyond discovery. If you only need discovery + health checking, a registry is simpler and cheaper. If you need mTLS, traffic management (canary, circuit breaking), and deep observability — and you need it platform-wide — a service mesh provides these as infrastructure.

Criteria for mesh adoption: (1) >100 services (enough to justify operational cost), (2) security requirement for mTLS between all services, (3) need for platform-wide traffic management (canary deploys, traffic mirroring), (4) team with mesh operational expertise.

Criteria against mesh: (1) <50 services (over-engineering), (2) no mTLS requirement, (3) no dedicated platform team to operate the mesh, (4) performance-sensitive services that can't tolerate sidecar latency.

I'd start with a registry and migrate to mesh only when the organizational pain (manual mTLS, per-service circuit breaking implementation, inconsistent observability) justifies the mesh investment.

Why this is L6:

  • Clear decision criteria, not opinion
  • Acknowledges mesh operational cost
  • Recommends evolutionary adoption, not big-bang

Drill 8: Discovery During Deploys

Interview Prompt

"You're doing a rolling deployment of Service B. How does service discovery interact with the deployment?"

Staff Answer

The deployment lifecycle must coordinate with discovery:

  1. New instance starts → passes liveness check → registered in registry → DOES NOT receive traffic yet
  2. Readiness check passes → registry marks instance as "serving" → receives traffic
  3. Old instance begins drain → registry marks as "draining" → stops receiving NEW requests → finishes in-flight requests (30-second grace period)
  4. Drain complete → old instance deregisters → terminates

Critical timing: the readiness check gap. If a new instance is registered before it's ready (warm-up, cache loading, connection pooling), it receives traffic it can't serve. The readiness probe must confirm the instance can handle production traffic, not just that the process started.

For canary deployments: register the canary instance with a weight of 5% (or a canary tag). The discovery system routes 5% of traffic to the canary. Monitor error rates for 5 minutes. If healthy, increase weight to 25%, then 100%.
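Weighted routing over discovery results can be sketched as weight-proportional selection. The `(address, weight)` pair shape is an assumption for illustration, not a specific registry's data model.

```python
import random

def pick_instance(instances, rng=random.random):
    """Pick an instance with probability proportional to its weight.
    A canary at weight 5 against a stable pool at weight 95 receives
    ~5% of traffic."""
    total = sum(w for _, w in instances)
    point = rng() * total
    for addr, weight in instances:
        point -= weight
        if point < 0:
            return addr
    return instances[-1][0]  # guard against floating-point edge cases
```

Ramping the canary from 5% to 25% to 100% is then just a weight update in the registry; no client code changes.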

Why this is L6:

  • Full lifecycle coordination (start → ready → drain → terminate)
  • Readiness vs liveness distinction
  • Canary deployment through weighted discovery
  • Specific timing: 30-second drain period

8 Deep Dive Scenarios

Scenario-based analysis for Staff-level depth

Deep Dive 1: The Registry Outage

Context: Your Consul cluster (3 nodes) loses quorum at 2am. 80 services are unable to discover each other. Your on-call engineer has never dealt with a Consul failure before.

Questions You Should Ask First:

  • Are services actually failing right now, or are they operating on cached discovery results — how much time do we have before caches expire?
  • Why did the 3-node Consul cluster lose quorum — is this an AZ failure, disk issue, or network partition, and does the node placement tolerate single-AZ failure?
  • Does the on-call engineer have a Consul failure runbook, or are they debugging from scratch at 2am?
  • After recovery, should we upgrade to a 5-node cluster that tolerates 2 failures instead of 1?
Staff Approach

Immediate triage: (1) Are services actually failing, or are they using cached discovery results? Check service error rates. If most services are still operating (cached results), we have a 5-minute window before caches expire. (2) Why did Consul lose quorum? Check node health — AZ failure, disk full, OOM, network partition?

Recovery: If one node is recoverable (restart it), quorum is restored. If two nodes are lost, we need to bootstrap a new cluster from backup or the remaining node (unsafe but necessary in emergency). Document the procedure in a runbook BEFORE this happens.

Post-incident: (1) Move to 5-node Consul cluster (tolerates 2 failures instead of 1). (2) Implement client-side caching with 5-minute fallback TTL — future registry outages don't cascade. (3) Create a Consul failure runbook with step-by-step recovery. (4) Add synthetic monitoring that detects registry unavailability within 30 seconds.
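The 3-node vs 5-node tradeoff is majority-quorum arithmetic, shown here as a one-liner:

```python
def quorum_fault_tolerance(n):
    """A Raft-style cluster (Consul, etcd) needs a majority (n // 2 + 1)
    of nodes to stay writable, so it tolerates (n - 1) // 2 node failures:
    3 nodes tolerate 1 failure, 5 tolerate 2."""
    return (n - 1) // 2
```

This is also why even-sized clusters are a poor deal: 4 nodes tolerate the same single failure as 3 while adding cost and coordination overhead.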

The organizational fix: Consul is a Tier-0 dependency but was treated as "just infrastructure." It needs dedicated operational runbooks, monitoring, and blast radius analysis.

Metrics to Watch: consul.quorum.status (healthy/degraded/lost — alert immediately on degraded), service.discovery.cache_age_seconds (how stale cached results are — critical during registry outage), service.discovery.lookup_failure_rate (percentage of lookups failing due to registry unavailability), consul.node.health_by_az (per-AZ node health to detect placement problems).

Organizational Follow-up: Create a Consul failure runbook with step-by-step recovery procedures before the next 2am incident. Reclassify the service registry as a Tier-0 dependency with dedicated monitoring and blast radius analysis. Upgrade from 3-node to 5-node Consul with 2-2-1 AZ placement. Implement client-side caching with 5-minute fallback TTL so future registry outages don't cascade to service-to-service communication.

Staff Signals:

  • Triages based on actual impact (are services failing or using cache?)
  • Provides immediate recovery steps AND architectural improvements
  • Identifies the operational gap (no runbook, no monitoring)

Deep Dive 2: The Service Naming Migration

Context: Your company has 150 services with ad-hoc names (user-svc, my-api, service-v2-new-FINAL). Nobody knows who owns what. You need to migrate to a structured naming convention without downtime.

Questions You Should Ask First:

  • Is this actually a technical DNS migration, or is it an organizational ownership discovery exercise disguised as a rename?
  • Can we implement dual registration — both old and new names resolve simultaneously — so there's zero traffic disruption during migration?
  • How do we track migration progress with data instead of deadlines — can we monitor old-name lookup counts and deprecate names only when they hit zero?
  • What's the realistic timeline — is this 2 weeks of DNS changes, or 4-6 months of team coordination?
Staff Approach

This is an organizational migration disguised as a technical one. The naming convention reflects ownership — you need team buy-in, not just DNS changes.

Phase 1 (weeks 1-4): Audit and map. Create a service catalog: current name → owner → new name under the convention. Get each team to claim their services. This often reveals orphaned services nobody owns.

Phase 2 (weeks 4-8): Dual registration. Services register under both old and new names simultaneously. Both names resolve to the same instances. No traffic disruption. Callers can use either name.

Phase 3 (weeks 8-16): Caller migration. Teams update their service calls to use new names. Track old-name resolution counts — when a service's old name has zero lookups for 2 weeks, deprecate it.

Phase 4 (week 16+): Enforce convention. New services must follow the convention (registry validation). Old names are aliases that generate deprecation warnings in logs.

Timeline: 4-6 months for 150 services. The bottleneck is team coordination, not technology.

Metrics to Watch: discovery.old_name.lookup_count (per-service old-name resolution volume — migration is done when this reaches zero), discovery.naming.convention_compliance_pct (percentage of services following the new convention), discovery.orphaned_service.count (services with no identified owner discovered during audit), discovery.dual_registration.active_count (services currently registered under both old and new names).

Organizational Follow-up: Conduct an ownership audit: have every team claim their services in the catalog, surfacing orphaned services nobody owns. Implement dual registration so both names resolve during migration. Enforce the naming convention for all new services via registry validation. Set up a migration dashboard showing per-team progress, using old-name lookup counts as the completion signal — not arbitrary deadlines.

Staff Signals:

  • Recognizes this as organizational migration, not just technical
  • Dual registration prevents disruption
  • Uses lookup metrics to drive migration completion
  • Realistic timeline (months, not days)

Deep Dive 3: The Thundering Herd on Discovery

Context: Your registry serves 500K lookups/sec. During a large-scale deployment (30 services deploying simultaneously), the registry sees a spike to 2M lookups/sec as clients refresh their caches. The registry becomes overloaded, causing lookup failures across the platform.

Questions You Should Ask First:

  • Is the 4x traffic spike caused by synchronized cache invalidation across all callers during simultaneous deployments?
  • Can we stagger deployments in waves of 5 with 2-minute gaps instead of deploying 30 services simultaneously?
  • Are clients polling the registry on a fixed interval, or do they use random jitter so cache refreshes spread across a time window?
  • Can we switch from polling to push-based mechanisms (Consul blocking queries, etcd watches) so clients don't need to poll at all?

Thundering herd on discovery: 30 simultaneous deployments trigger cache invalidation across all callers at once, creating a 4x traffic spike that overloads the registry. Fix: stagger deployments, add client-side refresh jitter, and switch to push-based updates.

Staff Approach

Root cause: deployment triggers cache invalidation across all callers simultaneously. 30 services × N callers × cache refresh = thundering herd on the registry.

Immediate fix: stagger deployments. Don't deploy 30 services simultaneously — deploy in waves of 5 with 2-minute gaps. This spreads the cache refresh load.

Architecture fix: (1) Client-side jitter — add random jitter (0-5 seconds) to cache refresh timing. Instead of all clients refreshing at TTL expiry, they refresh within a 5-second window. (2) Watch/push instead of poll — use registry watch mechanisms (Consul blocking queries, etcd watches) where available. Clients receive updates pushed by the registry, not polling for changes. (3) Local DNS cache — services resolve through a per-node DNS cache. The cache handles the lookup load; only cache misses hit the registry.

The registry should be sized for 3-5x normal traffic to handle deployment spikes without degradation.
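The jitter fix is small. A sketch of the refresh timer, with values mirroring the text (10-second TTL, 0-5 seconds of jitter; the function name is illustrative):

```python
import random

def next_refresh_delay(ttl_s=10.0, jitter_s=5.0, rng=random.random):
    """Spread cache refreshes over a jitter window instead of refreshing
    exactly at TTL expiry, so simultaneous deployments don't synchronize
    every client's registry lookup into one spike."""
    return ttl_s + rng() * jitter_s
```

Each client sleeps `next_refresh_delay()` between refreshes, so a fleet that would have polled in lockstep instead smears its lookups across the 5-second window.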

Metrics to Watch: registry.lookup_rate_vs_baseline (current lookups/sec as a multiple of steady-state — alert at 2x), registry.deployment_concurrency_count (number of services deploying simultaneously), registry.cache_refresh.jitter_spread_seconds (variance in client refresh timing — higher is better), registry.push_subscription.active_count (clients using watch/push instead of polling).

Organizational Follow-up: Implement deployment staggering: no more than 5 services deploy simultaneously, with 2-minute gaps between waves. Add random jitter (0-5 seconds) to all client-side cache refresh timers. Create a registry capacity dashboard showing lookup rate with deployment events overlaid. Migrate clients from polling to watch/push mechanisms, tracking adoption weekly.

Staff Signals:

  • Identifies thundering herd as the root cause pattern
  • Proposes both immediate (stagger) and architectural (jitter, push, cache) fixes
  • Sizes infrastructure for spikes, not just steady state

Deep Dive 4: Multi-Region Service Discovery

Context: Your company operates in US and EU. Each region has its own service registry. A new requirement says that if all US instances of a service are unhealthy, traffic should automatically fail over to EU instances. How do you implement this?

Questions You Should Ask First:

  • What's the failover priority chain — do we prefer local unhealthy instances with circuit breakers before falling back to cross-region, or go directly to the remote region?
  • What's the latency increase when traffic fails over cross-region — have we communicated that users will see 5ms to 80ms latency increase as an explicit tradeoff?
  • How does failback work when the local region recovers — do we shift traffic back gradually to prevent flapping, or instantly?
  • Is the cross-region registry federation async with bounded staleness, or synchronous — and what's the acceptable sync interval?
Staff Approach

This requires cross-region registry federation with health-aware failover.

Design: (1) Each region's registry maintains a local service catalog AND a federated view of other regions' catalogs. Federation happens via async replication (10-second sync interval). (2) Discovery lookup priority: local healthy instances → local unhealthy instances with circuit breaker → remote region instances. (3) Cross-region failover trigger: when local instance count drops to 0 (or all are unhealthy), the discovery system returns remote region instances with a latency warning.

Implementation: DNS-based with geo-aware records. checkout.payments.svc resolves to local instances normally. When local health checks fail, DNS returns eu-west instances. Client sees higher latency (80ms vs 5ms) but the service stays available.

Critical consideration: failback. When US instances recover, traffic should gradually shift back (not instantly — validate US instances are stable for 5 minutes before shifting all traffic back). Instant failback risks a "flapping" pattern where traffic bounces between regions.
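Gradual failback can be sketched as a linear traffic ramp over the 5-minute stability window described above (the function name and linear shape are illustrative choices):

```python
def failback_weight(seconds_since_stable, ramp_s=300.0):
    """Fraction of traffic to send back to the recovered local region,
    ramped linearly over `ramp_s` seconds instead of shifting instantly,
    to avoid flapping between regions."""
    return max(0.0, min(1.0, seconds_since_stable / ramp_s))
```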

Metrics to Watch: discovery.cross_region.failover_active (boolean per service — is cross-region failover currently engaged), discovery.cross_region.latency_increase_ms (latency delta between local and cross-region resolution), discovery.federation.sync_lag_seconds (replication delay between regional registries), discovery.failback.traffic_shift_rate_pct_per_minute (how gradually traffic returns to recovered region).

Organizational Follow-up: Define a cross-region failover policy: failover triggers when local healthy instance count drops to zero, failback is gradual over 5 minutes. Communicate the latency tradeoff (5ms to 80ms) to product teams so they can set user expectations. Create a failover testing schedule: quarterly drills that simulate single-region failure and measure actual failover time. Document the failover and failback procedures in the on-call runbook.

Staff Signals:

  • Federated registry with local-first resolution
  • Clear failover trigger and priority ordering
  • Addresses failback carefully (gradual, not instant)
  • Acknowledges latency increase as explicit tradeoff

Deep Dive 5: The Ghost Service

Context: A service was decommissioned 6 months ago but its registry entry was never cleaned up. Other services still have it in their dependency graphs. A new service is assigned the same port on the same host. Traffic intended for the decommissioned service now reaches the new service.

Questions You Should Ask First:

  • Why did the registry entry persist for 6 months undetected — is there no TTL-based expiry or heartbeat requirement?
  • Is service identity based on host:port (vulnerable to port reuse collisions) or unique instance identifiers (container ID, hostname+PID)?
  • Do we have an automated registry audit that detects entries with no traffic — or does cleanup depend entirely on someone remembering a decommission checklist?
  • Is service decommission defined as a multi-step process (stop traffic, verify zero connections, deregister, remove DNS, update dependency graphs), or a single event?
Staff Approach

This is a registry hygiene and naming safety problem.

Immediate fix: remove the stale entry from the registry. Audit all registry entries for services that haven't sent a heartbeat in >24 hours — these are likely ghost entries.

Prevention: (1) TTL on all entries — every registry entry must be renewed by the service periodically (heartbeat). If the service stops heartbeating (because it's decommissioned), the entry expires automatically. No manual cleanup needed. (2) Unique service identity — register with a service ID that includes a unique instance identifier (container ID, hostname + PID), not just host:port. This prevents port reuse collisions. (3) Decommission checklist — add "remove from service registry" to the decommission runbook. Automate it if possible (decommission pipeline triggers deregistration). (4) Registry audit — monthly automated scan for entries with no traffic in >7 days. Alert the owning team.
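Defenses (1) and (2) can be sketched together: a registry whose lookups ignore entries without a recent heartbeat, keyed by a unique instance id rather than host:port. This is an illustrative model, not a real registry's API.

```python
import time

class HeartbeatRegistry:
    """TTL-based entry expiry: every registration must be renewed by
    heartbeat, and lookups skip entries whose last heartbeat is older
    than the TTL, so decommissioned services disappear without manual
    cleanup. Keying on (service, instance_id) — e.g. a container id —
    avoids port-reuse collisions."""

    def __init__(self, ttl_s=30.0, clock=time.monotonic):
        self._ttl = ttl_s
        self._clock = clock
        self._entries = {}  # (service, instance_id) -> (address, last_heartbeat)

    def register(self, service, instance_id, address):
        self._entries[(service, instance_id)] = (address, self._clock())

    def heartbeat(self, service, instance_id):
        addr, _ = self._entries[(service, instance_id)]
        self._entries[(service, instance_id)] = (addr, self._clock())

    def lookup(self, service):
        cutoff = self._clock() - self._ttl
        return [addr for (svc, _), (addr, ts) in self._entries.items()
                if svc == service and ts >= cutoff]
```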

The organizational fix: service decommission is a process, not an event. It includes: stop traffic → verify zero connections → deregister → remove DNS entries → update dependency graphs → archive.

Metrics to Watch: registry.entry.heartbeat_age_seconds (time since last heartbeat — entries older than 24 hours are likely ghosts), registry.ghost_service.count (entries with no traffic for >7 days), registry.port_collision.detected_count (instances where a new service received traffic intended for a decommissioned service), registry.decommission.checklist_completion_pct (percentage of decommissions following the full process).

Organizational Follow-up: Implement TTL-based entry expiry: every registry entry requires periodic heartbeat renewal, and entries expire automatically when the service stops heartbeating. Add unique instance identifiers (container ID) to service registration to prevent port reuse collisions. Create a monthly automated registry audit that alerts owning teams about entries with no traffic in 7 days. Redefine service decommission as a formal multi-step process with automated triggers, not a manual checklist.

Staff Signals:

  • Identifies root cause: no TTL enforcement + no decommission process
  • Proposes automatic expiry as primary defense
  • Includes organizational process fix, not just technical fix
  • Addresses the port reuse collision specifically

9 Level Expectations Summary

After studying this playbook, you should be able to:

  • Choose between DNS-based, registry-based, and mesh-based discovery based on organizational scale
  • Design client-side caching for graceful degradation during registry outages
  • Implement multi-layer health checking (liveness, readiness, traffic-based)
  • Design naming conventions that encode team ownership and prevent collisions
  • Walk through a registry failure scenario with timeline and blast radius analysis
  • Explain the tradeoffs between client-side and server-side discovery patterns
  • Design cross-region service discovery with health-aware failover

The Bar for This Question

Mid-level (L4/E4): Understands the fundamental problem service discovery solves — how services find each other when instances are ephemeral and IP addresses change. Can describe DNS-based or registry-based discovery at a high level and explain why hardcoded addresses don't work in dynamic environments. Knows that services register themselves and consumers look them up, even if the design lacks depth on failure modes.

Senior (L5/E5): Makes a deliberate choice between client-side discovery (consumer queries registry, picks an instance) and server-side discovery (load balancer queries registry, consumer hits a stable endpoint) with clear tradeoffs. Integrates health checks with registration so that unhealthy instances are removed from the registry promptly. Handles stale registrations from crashed services (TTL-based expiry, active health probes). Reasons about how rolling deployments interact with discovery — new instances registering before they're ready, old instances deregistering while draining connections.

Staff+ (L6/E6+): Identifies the service registry as the most invisible single point of failure in the platform — every service depends on it, but nobody thinks about its availability until it fails. Designs cache-based resilience so that services can continue routing to last-known-good endpoints during a registry outage. Addresses the bootstrapping problem: how does a service discover the registry itself? Thinks about organizational ownership — who runs the discovery platform, what SLOs govern it, and how is the cost of operating a consensus-backed registry justified to leadership.

10 Staff Insiders: Controversial Opinions

10.1 "DNS Is Underrated for Service Discovery"

The conventional wisdom is that you need a dedicated service registry (Consul, Eureka) for microservices. The reality: DNS-based discovery covers 80% of use cases with zero additional infrastructure. Every language, framework, and tool supports DNS. Short TTLs (5-10 seconds) provide near-real-time updates. Health-aware DNS (CoreDNS with health checks, Consul DNS interface) provides the best of both worlds.

The experienced approach: start with DNS. Add a registry API only when you need features DNS can't provide (watches, KV store, ACLs). Most teams add a registry too early and pay the operational cost of running a consensus-based cluster for a problem that DNS solves.

And that's OK. Simple solutions that work are better than sophisticated solutions that require a dedicated team to operate.

10.2 "The Service Registry Is Your Most Invisible Single Point of Failure"

Everyone worries about database availability. Nobody worries about registry availability — until it fails and takes down every service simultaneously. The registry is the one dependency that every service shares, and it's invisible because it "just works" until it doesn't.

The uncomfortable truth: your microservices platform's availability is capped by your registry's availability. A 99.99% available payment service depending on a 99.9% available Consul cluster is effectively 99.9% available. Invest in registry resilience — 5-node clusters, cross-AZ deployment, client-side caching, automated recovery, and chaos testing.
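The availability math is just multiplication along the serial dependency chain; a quick check of the numbers above:

```python
def composite_availability(*parts: float) -> float:
    """Availability of a serial dependency chain: the product of the parts."""
    total = 1.0
    for a in parts:
        total *= a
    return total

# 99.99% payment service depending on a 99.9% registry:
print(round(composite_availability(0.9999, 0.999), 4))  # 0.9989
```

Client-side caching changes this picture: if services can serve lookups from last-known-good state during a registry outage, the registry drops out of the serial chain for the steady state.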

Treat the registry like you treat the database: with respect, dedicated resources, and a runbook for 3am failures.

10.3 "Service Mesh Is the Right Answer for the Wrong Time"

Service mesh (Istio, Linkerd) provides everything: discovery, health checking, mTLS, traffic management, observability. It's the "right" answer architecturally. But the operational cost is enormous — sidecar proxies consume 10-15% of your compute, the control plane is a complex distributed system, and debugging mesh issues requires specialized expertise.

The counterintuitive truth: most companies adopt service mesh too early. They have 30 services, 3 teams, and no mTLS requirement — but they deploy Istio because it's the "modern" approach. The result: 6 months of mesh debugging instead of shipping features.

Staff insight: adopt mesh when the organizational pain (manual mTLS, inconsistent observability, per-service circuit breaking) exceeds the mesh operational cost. For most companies, that's at 100+ services with a dedicated platform team. Before that, DNS + registry + client-side libraries is simpler, cheaper, and more debuggable.

Appendices
Appendix A: Service Discovery Technologies

A.1 Technology Comparison

| Technology | Discovery Model | Health Check | Consistency | Best For |
|---|---|---|---|---|
| Consul | Registry + DNS | Active (TCP/HTTP/gRPC) | Raft consensus | General purpose, multi-datacenter |
| etcd | Key-value store | Client-managed | Raft consensus | Kubernetes, config management |
| ZooKeeper | Hierarchical registry | Ephemeral nodes + watches | ZAB consensus | Legacy systems, Kafka |
| Eureka | Registry (AP model) | Client heartbeat | Eventual (no consensus) | Spring ecosystem, availability-first |
| Kubernetes Services | Built-in DNS + endpoints | Kubelet probes (liveness/readiness) | etcd-backed | Kubernetes-native workloads |
| CoreDNS | DNS server | Plugin-based | Depends on backend | DNS-based discovery, Kubernetes |

A.2 Decision Matrix

| Requirement | Consul | Eureka | K8s Services | Plain DNS |
|---|---|---|---|---|
| Health-aware routing | ✅ | ✅ | ✅ | ❌ |
| Multi-datacenter | ✅ | ❌ | ❌ (needs federation) | ✅ |
| No additional infra | ❌ | ❌ | ✅ (if on K8s) | ✅ |
| KV store / config | ✅ | ❌ | ✅ (ConfigMaps) | ❌ |
| Service mesh integration | ✅ (Consul Connect) | ❌ | ✅ (Istio/Linkerd) | ❌ |
Appendix B: Health Check Patterns

B.1 Three-Layer Health Model

(Diagram: the three layers of health checking: liveness, readiness, and traffic-based, with timings detailed in B.3.)

B.2 Health Endpoint Design

JSON
// GET /health (liveness)
{ "status": "ok" }

// GET /ready (readiness)
{
  "status": "ready",
  "checks": {
    "database": "connected",
    "cache": "connected",
    "downstream_auth": "reachable"
  }
}

// GET /health/detailed (debugging, not for load balancer)
{
  "status": "ready",
  "uptime_seconds": 86400,
  "version": "2.1.3",
  "connections": { "db": 45, "cache": 12 },
  "memory_mb": 256,
  "cpu_percent": 15
}
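A minimal sketch of how a /ready handler could assemble that payload (the helper name and status strings are illustrative, not a prescribed API): return HTTP 200 only when every dependency check passes, and 503 otherwise so the load balancer pulls the instance from rotation.

```python
import json

def readiness_payload(checks: dict[str, bool]) -> tuple[int, str]:
    """Build a /ready response: HTTP 200 only if every dependency is up."""
    ok = all(checks.values())
    body = {
        "status": "ready" if ok else "not_ready",
        "checks": {name: ("connected" if up else "failed")
                   for name, up in checks.items()},
    }
    return (200 if ok else 503), json.dumps(body)

status, body = readiness_payload({"database": True, "cache": True})
```

Keeping the status code authoritative (rather than making the load balancer parse the body) lets any off-the-shelf health checker consume the endpoint.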

B.3 Health Check Timing

| Check Type | Interval | Failure Threshold | Recovery Threshold |
|---|---|---|---|
| Liveness | 5s | 3 failures (15s) | 1 success |
| Readiness | 5s | 1 failure (immediate) | 3 successes (15s) |
| Traffic-based | Continuous | >10% error rate over 30s | <1% error rate over 60s |

Note: readiness is stricter (1 failure removes from rotation) because sending traffic to an unready service causes user-visible errors. Recovery is stricter (3 successes) to prevent flapping.
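The asymmetric readiness thresholds reduce to a small state machine; a sketch, with the class name and default streak length as assumptions:

```python
class ReadinessTracker:
    """Tracks probe results with asymmetric thresholds: one failed probe
    removes the instance from rotation immediately, and a streak of
    consecutive successes is required to re-admit it (prevents flapping)."""

    def __init__(self, recovery: int = 3):
        self.recovery = recovery
        self.in_rotation = True
        self._streak = 0  # consecutive successes while out of rotation

    def record(self, success: bool) -> bool:
        """Record one probe result; return whether the instance takes traffic."""
        if not success:
            self._streak = 0
            self.in_rotation = False          # 1 failure: immediate removal
        elif not self.in_rotation:
            self._streak += 1
            if self._streak >= self.recovery:  # e.g. 3 successes: re-admit
                self.in_rotation = True
        return self.in_rotation
```

A single success while already in rotation changes nothing; the streak only matters on the way back in, which is exactly the anti-flapping asymmetry described above.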

Appendix C: Naming Conventions Reference

C.1 Naming Format

{domain}.{service-name}.{environment}[.{region}]

Examples:
  payments.checkout-api.prod.us-east-1
  catalog.search-service.staging
  identity.auth-gateway.prod
  platform.api-gateway.prod.eu-west-1
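The format lends itself to mechanical validation at registration time. A sketch follows; the regex details and the allowed environment set are assumptions, not part of the convention above.

```python
import re

# {domain}.{service-name}.{environment}[.{region}]
SERVICE_NAME = re.compile(
    r"^(?P<domain>[a-z][a-z0-9-]*)"
    r"\.(?P<service>[a-z][a-z0-9-]*)"
    r"\.(?P<env>prod|staging|dev)"          # allowed environments: an assumption
    r"(?:\.(?P<region>[a-z]{2}-[a-z]+-\d))?$"
)

def parse_service_name(name: str):
    """Return the name's parts as a dict, or None if it breaks the convention."""
    m = SERVICE_NAME.match(name)
    return m.groupdict() if m else None
```

The lowercase-and-hyphen rule falls out of the character classes; a registry can simply reject any registration where parse_service_name returns None.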

C.2 Naming Rules

| Rule | Good | Bad |
|---|---|---|
| Lowercase, hyphenated | checkout-api | CheckoutAPI, checkout_api |
| Descriptive | order-processor | service-v2, my-api |
| Domain-prefixed | payments.checkout-api | checkout-api (no domain) |
| No version in name | checkout-api | checkout-api-v2 |
| Environment explicit | checkout-api.prod | checkout-api (which env?) |

C.3 Reserved Domains

| Domain | Purpose | Owner |
|---|---|---|
| platform | Infrastructure services (gateway, monitoring) | Platform team |
| identity | Auth, user management | Identity team |
| internal | Internal tools, admin services | Various |

C.4 Migration from Ad-Hoc Names

  1. Create service catalog mapping old → new names
  2. Register services under both old and new names (dual registration)
  3. Track old-name lookup counts via metrics
  4. When old-name lookups reach zero for 2 weeks, deprecate the old name
  5. Enforce new naming convention for all new services via registry validation
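Steps 2 and 3 can be sketched as a thin wrapper around the registry (class and method names here are illustrative): deprecated names stay resolvable, but every lookup against one is counted so you know when it is safe to deprecate.

```python
from collections import Counter

class MigratingRegistry:
    """Registry wrapper for a rename migration: serves old and new names,
    and counts lookups against deprecated names."""

    def __init__(self, aliases: dict[str, str]):
        self.aliases = aliases                      # old name -> new name
        self.services: dict[str, list[str]] = {}    # name -> instance addresses
        self.old_name_lookups = Counter()           # feeds the migration metric

    def register(self, name: str, addr: str) -> None:
        self.services.setdefault(name, []).append(addr)

    def lookup(self, name: str) -> list[str]:
        if name in self.aliases:                    # deprecated name: count, forward
            self.old_name_lookups[name] += 1
            name = self.aliases[name]
        return self.services.get(name, [])
```

When old_name_lookups stays at zero for the agreed window, the alias can be dropped (step 4) without breaking any live consumer.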

You just read the full Design Service Discovery playbook.
