StaffSignal

Design Service Discovery

Staff-Level Playbook

Technologies referenced in this playbook: ZooKeeper & etcd · API Gateways

How to Use This Playbook

This playbook supports three reading modes:

| Mode | Time | What to Read |
| --- | --- | --- |
| Quick Review | 15 min | Executive Summary → Interview Walkthrough → Fault Lines (§3) → Drills (§7) |
| Targeted Study | 1-2 hrs | Interview Walkthrough + Core Flow, expand appendices where you're weak |
| Deep Dive | 3+ hrs | Everything, including all appendices |
What is Service Discovery? — Why interviewers pick this topic

The Problem

In a microservices architecture, services need to find and communicate with each other. IP addresses and ports change constantly — containers restart, auto-scaling adds/removes instances, deploys roll new versions. Service discovery is the mechanism that lets services find each other dynamically, without hardcoded addresses. It's the phone book of your distributed system.

Common Use Cases

  • Service-to-Service Communication: Service A needs to call Service B — which instance? At what address?
  • Load Distribution: Spread requests across multiple instances of a service
  • Health-Aware Routing: Only route to healthy instances, skip unhealthy ones
  • Blue-Green/Canary Deployment: Route traffic to specific versions of a service
  • Multi-Region Routing: Route to the closest or most appropriate region

Why Interviewers Ask About This

Service discovery surfaces the core Staff-level tension: availability vs consistency of the service registry. A stale registry routes to dead instances (bad). A strict registry that requires consensus adds latency to every lookup (also bad). Interviewers want to see you reason about this tension, choose the right consistency model for discovery, and understand the operational cost of registry failures.

Mechanics Refresher: DNS Resolution for Service Discovery — How DNS-based discovery actually works

DNS Record Types for Discovery

| Record Type | Returns | Example | Use Case |
| --- | --- | --- | --- |
| A record | IP address | checkout.prod → 10.0.1.5 | Simple service resolution |
| SRV record | IP + Port + Priority + Weight | _http._tcp.checkout.prod → 10.0.1.5:8080 priority=10 weight=50 | Port-aware routing, weighted load distribution |
| CNAME | Another hostname | checkout.prod → checkout.us-east.elb.aws.com | Indirection through load balancer |

SRV records are the "right" record type for service discovery because they include port information and support weighted routing. But many client libraries only support A records, which is why most DNS-based discovery uses A records with a well-known port convention.

DNS Caching Layers (The TTL Problem)

When you set a DNS TTL of 10 seconds, the actual staleness is often longer due to multiple caching layers:

Layer                     Caches?     Respects TTL?
──────────────────────────────────────────────────
Application DNS cache     Sometimes   Often ignores (JVM: 30s default!)
OS resolver cache         Yes         Usually respects
Local DNS server          Yes         Respects
Upstream DNS server       Yes         Respects

Java gotcha: The JVM caches DNS results for 30 seconds by default (forever for successful lookups in some versions). Set networkaddress.cache.ttl=10 in java.security or use -Dsun.net.inetaddr.ttl=10. This is the #1 cause of "DNS TTL is 10s but my service still routes to dead instances for 30s."

Effective staleness = TTL + max(caching layer delays). For a 10s TTL with JVM defaults: up to 40 seconds before a client sees a change.
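The staleness formula above can be made concrete with a small sketch. This is an illustrative model, not a measurement tool; the function name and the layer values are assumptions.

```python
# Hypothetical model: worst-case staleness a client can observe for a DNS
# change is the record TTL plus the slowest client-side caching layer,
# since that layer only re-resolves after its own cache expires.
def effective_staleness(ttl_s, layer_cache_ttls_s):
    """Record TTL plus the slowest caching layer on the client path."""
    return ttl_s + max(layer_cache_ttls_s, default=0)

# 10s record TTL with the JVM's 30s default application-level cache:
print(effective_staleness(10, [30, 0, 0]))  # 40 — the "up to 40 seconds" figure
```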

Mechanics Refresher: Health Check Protocols — How services prove they're alive

Health Check Types

| Check Type | How It Works | What It Proves | What It Misses |
| --- | --- | --- | --- |
| TCP | Open a TCP connection to port | Process is running and accepting connections | Application-level health (may accept TCP but crash on requests) |
| HTTP | Send GET to /health, expect 200 | Application is running and HTTP stack works | Deep health (DB may be disconnected) |
| gRPC | Call grpc.health.v1.Health/Check | gRPC server is running and responding | Same as HTTP — only proves the health endpoint works |
| Script/exec | Run a command inside the container | Arbitrary health logic | Slow — adds subprocess overhead per check |

The 200 OK Lie

A common failure mode: the health endpoint returns 200 OK but the service can't actually serve real requests. This happens because:

  • Health endpoint is a simple handler that doesn't touch the database
  • Connection pool is exhausted but the health thread has its own connection
  • Service is in a degraded state but technically "alive"

Fix: Distinguish liveness (is the process alive?) from readiness (can it serve traffic?). The readiness check should verify downstream dependencies:

GET /health    → 200 if process is alive (for restart decisions)
GET /ready     → 200 only if DB connected AND cache reachable AND <100 pending requests
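The liveness/readiness split can be sketched as two handlers. This is a minimal illustration: `ServiceState` and its fields are invented stand-ins for real dependency probes, not any particular framework's API.

```python
# Minimal sketch of the liveness vs readiness split.
from dataclasses import dataclass

@dataclass
class ServiceState:
    db_connected: bool       # hypothetical downstream checks
    cache_reachable: bool
    pending_requests: int

def health(state: ServiceState) -> int:
    """Liveness: 200 as long as the process can run this handler at all."""
    return 200

def ready(state: ServiceState) -> int:
    """Readiness: 200 only if downstream dependencies are actually usable."""
    ok = state.db_connected and state.cache_reachable and state.pending_requests < 100
    return 200 if ok else 503

degraded = ServiceState(db_connected=False, cache_reachable=True, pending_requests=5)
print(health(degraded), ready(degraded))  # 200 503 — alive, but should not receive traffic
```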

Timing Tradeoffs

| Parameter | Aggressive (fast detection) | Conservative (fewer false positives) |
| --- | --- | --- |
| Check interval | 2s | 10s |
| Failure threshold | 1 failure → remove | 3 failures → remove |
| Recovery threshold | 1 success → add back | 3 successes → add back |
| Detection time | 2-4s | 30s |
| False positive risk | High (network blip → removal) | Low |

Staff insight: Aggressive for readiness checks (remove unhealthy instances fast), conservative for liveness checks (don't restart containers due to transient network issues).
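The detection times in the table follow from simple arithmetic, sketched below. The function and the timeout term are a hypothetical model, not any registry's documented behavior.

```python
# Hypothetical model of worst-case failure detection: an instance may die just
# after a passing check, so each of the `failure_threshold` failed checks costs
# one full interval; a check that times out adds its timeout on top.
def worst_case_detection_s(interval_s, failure_threshold, check_timeout_s=0):
    return interval_s * failure_threshold + check_timeout_s

print(worst_case_detection_s(2, 1, check_timeout_s=2))  # 4  (aggressive row's upper bound)
print(worst_case_detection_s(10, 3))                    # 30 (conservative row)
```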

What This Interview Actually Tests

Service discovery is not a "use Consul or Eureka" question.

This is a registry reliability and failure propagation question that tests:

  • Whether you understand that the registry is a critical dependency for every service
  • Whether you reason about what happens when the registry is wrong (stale entries, missing entries)
  • Whether you design for registry failure (what if the registry itself is down?)
  • Whether you understand the naming and routing abstractions that scale with organizational growth

The key insight: Service discovery failure doesn't cause one service to fail — it causes every service to fail. The registry is the most dangerous dependency in your microservices architecture because it's invisible until it breaks.

The L5 vs L6 Contrast (Memorize This)

Level Calibration
| Behavior | L5 (Senior) | L6 (Staff) |
| --- | --- | --- |
| First move | "We'll use Consul for service discovery" | Asks "How many services? What's the failure mode if discovery is unavailable?" |
| Architecture | Centralized registry (Consul/Eureka) | Evaluates DNS-based vs registry-based vs mesh-based based on organizational needs |
| Failure reasoning | "Registry has health checks" | Identifies the registry SPOF: if discovery is down, no service can find any other service |
| Health checking | "Services register and deregister" | Designs multi-layer health: registry-level, client-level, and application-level health checks |
| Ownership | "Platform team manages the registry" | Defines naming conventions and ownership: who registers, who discovers, who gets paged |

Default Staff Positions (Unless Proven Otherwise)

| Position | Rationale |
| --- | --- |
| DNS-based discovery as default | Simple, well-understood, no additional infrastructure for basic needs |
| Client-side caching of discovery results | Registry failure shouldn't immediately cascade to all services |
| Health checks at multiple layers | Registry health check + client health check + application health check |
| Naming conventions that encode ownership | team-name.service-name.environment — names should be self-documenting |
| Graceful degradation on registry failure | Services continue with last-known-good addresses, not immediate failure |
| Separate data plane from control plane | Discovery lookups (data plane) should not depend on consensus (control plane) |

The Three Intents (Pick One and Commit)

| Intent | Constraint | Strategy | Registry Model |
| --- | --- | --- | --- |
| Simple Discovery | Operational simplicity | DNS-based, minimal infrastructure | DNS records (SRV/A records), no dedicated registry |
| Dynamic Discovery | Flexibility, health-aware routing | Dedicated registry (Consul, Eureka), API-based lookup | Centralized registry with health checks |
| Mesh-Based Discovery | Zero-trust, observability | Service mesh (Istio, Linkerd), sidecar proxies | Distributed, mesh-integrated |

The Four Fault Lines (The Core of This Interview)

  1. Client-Side vs Server-Side Discovery — Who resolves the address: the calling service or a load balancer?
  2. DNS vs Registry vs Mesh — What's the lookup mechanism? Each has different consistency and latency tradeoffs.
  3. Push vs Pull Health Checks — Does the registry actively check services, or do services self-report?
  4. Centralized vs Distributed Registry — One registry cluster or per-region/per-domain registries?

Each fault line has a tradeoff matrix with explicit "who pays" analysis. See §3.

Quick Reference: What Interviewers Probe

| After You Say... | They Will Ask... |
| --- | --- |
| "We'll use Consul" | "Consul is down. What happens to service-to-service communication?" |
| "DNS-based discovery" | "You deploy a new version. DNS TTL is 30 seconds. What happens to in-flight requests?" |
| "Services register on startup" | "Service crashes without deregistering. How long until other services stop routing to it?" |
| "Health checks every 10 seconds" | "A service is unhealthy but the health check hasn't run yet. What happens during that 10-second window?" |
| "Service mesh handles discovery" | "What's the operational cost? Your team of 5 now operates a service mesh for 50 services." |

Jump to Practice

Active Drills (§7) — 8 practice prompts with expected answer shapes

System Architecture Overview

[Architecture diagram]

Interview Walkthrough

The six phases below are compressed for a deep-dive format. Phases 1-3 deliver the crisp answer in 2-3 minutes. If probed, Phase 5 has depth for 15+ minutes.

Phase 1: Requirements & Framing (30 seconds)

Name the problem before the solution:

  • "In a microservices architecture, services move — they scale up, scale down, get redeployed, fail, and recover. Hardcoded IP addresses don't work. Service discovery provides a dynamic registry so callers can always find healthy instances of any service."

Then frame the key decision:

  • "The core tradeoff is client-side vs server-side discovery. Does the caller query a registry and pick an instance (client-side), or does a load balancer handle routing transparently (server-side)?"

Phase 2: Core Entities & API (30 seconds)

  • Service Registry: the authoritative map of service name → healthy instances (IP:port)
  • Service Instance: a running copy of a service with its address, health status, metadata (version, zone, weight)
  • Health Check: active probe or heartbeat confirming an instance is alive and ready
  • Watch/Subscription: a mechanism for callers to be notified of registry changes in real time

Phase 3: The 2-Minute Architecture (2 minutes)

Phase 4: Transition to Depth (15 seconds)

"The basic approaches are well-understood. The hard problems are: registry consistency during network partitions, stale instance information causing requests to dead instances, and the interaction between service discovery and deployment (canary, blue-green)."

Phase 5: Deep Dives (5-15 minutes if probed)

Probe 1: "What happens during a network partition?" (3-5 min)

"The registry is a distributed system — typically backed by Raft consensus (Consul, etcd). During a network partition, the minority side of the partition can't elect a leader and becomes read-only."

Walk through the failure modes:

  1. Caller in minority partition: The registry is read-only. New registrations and deregistrations don't propagate. The caller's cached instance list becomes stale. "Stale is better than empty — the caller should continue using its last known good list."
  2. Service instance in minority partition: The instance can't renew its registration heartbeat. The registry marks it as unhealthy after the TTL expires. Other callers stop routing to it — even though the instance may be healthy within its partition.
  3. Split-brain: Both sides of the partition think they're authoritative. After partition heals, the registry must merge — services that registered during the partition on both sides need reconciliation.

The key design decision: "AP vs CP for the registry. Consul and etcd are CP — they sacrifice availability during partitions (minority can't write). Eureka (Netflix) is AP — it sacrifices consistency (both sides accept registrations, may serve stale data). For service discovery, I prefer AP: serving slightly stale data is better than serving no data."

Probe 2: "How do you handle stale instances?" (3-5 min)

"A service instance crashes without deregistering. Its entry remains in the registry until the health check TTL expires (typically 30-90 seconds). During that window, callers route requests to a dead instance."

Mitigations:

  1. Client-side health tracking: The caller tracks success/failure per instance. If 3 consecutive requests to instance X fail, remove it from the local cache — don't wait for the registry to catch up. "The caller detects failure in 3 requests (~1 second). The registry detects it in 30-90 seconds. Client-side detection is 30-90x faster."
  2. Retry with next instance: On failure, immediately retry on a different instance. The caller never surfaces a single-instance failure to the user if healthy instances are available.
  3. Fast deregistration: Use a shutdown hook to deregister on graceful shutdown. Only crashes leave stale entries. "In practice, 99% of instance removals are graceful deployments with shutdown hooks. Crashes are the 1% edge case."
  4. Lease-based registration with short TTL: Register with a 10-second TTL and heartbeat every 5 seconds. A crashed instance disappears in 10 seconds instead of 90. "The cost: more heartbeat traffic. At 1,000 instances × 1 heartbeat/5s = 200 heartbeats/sec. That's trivial."
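Mitigations 1 and 2 can be sketched together. This is an illustrative sketch under assumed names (`ClientSideBalancer`, `record`, `call_with_retry` are invented, not a real library API), showing consecutive-failure eviction plus retry-on-next-instance.

```python
# Sketch: caller-side health tracking with fast local eviction and retry.
class ClientSideBalancer:
    def __init__(self, instances, max_consecutive_failures=3):
        self.instances = list(instances)
        self.failures = {i: 0 for i in self.instances}
        self.max_failures = max_consecutive_failures

    def record(self, instance, ok):
        # Reset on success; evict locally after N consecutive failures
        # instead of waiting 30-90s for the registry's health check.
        if ok:
            self.failures[instance] = 0
            return
        self.failures[instance] += 1
        if self.failures[instance] >= self.max_failures and instance in self.instances:
            self.instances.remove(instance)

    def call_with_retry(self, call):
        # Mitigation 2: on failure, immediately try the next instance.
        for instance in list(self.instances):
            ok = call(instance)
            self.record(instance, ok)
            if ok:
                return instance
        raise RuntimeError("no healthy instances available")

lb = ClientSideBalancer(["10.0.1.5:8080", "10.0.1.6:8080"])
# 10.0.1.5 is dead: the caller fails over within a single logical request.
print(lb.call_with_retry(lambda inst: inst.endswith("6:8080")))  # 10.0.1.6:8080
```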

Probe 3: "How does service discovery interact with deployments?" (3-5 min)

"A canary deployment puts 5% of traffic on the new version. How does service discovery enable this?"

With client-side discovery:

  • Register canary instances with metadata: version=2.1, canary=true
  • Callers read metadata and route 5% of traffic to canary instances based on a hash or random selection
  • "The routing logic is in the caller — which means every caller needs to support canary routing. Consistent behavior requires a shared client library."

With server-side discovery:

  • The load balancer (Kubernetes Ingress, Envoy) handles weighted routing: 95% to stable, 5% to canary
  • The caller doesn't know about the canary — routing is transparent
  • "Simpler for callers. But the LB must support weighted routing and the deployment pipeline must configure it."
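The client-side variant above can be sketched as a deterministic hash split, so a given user consistently sees the same version. This is an assumed design, not any mesh's or registry's API; the metadata field names are illustrative.

```python
# Sketch: route a stable ~5% of request ids to canary instances.
import hashlib

def is_canary(request_id: str, canary_percent: int = 5) -> bool:
    # Deterministic: the same id always lands in the same bucket of 100.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return bucket < canary_percent

def pick_pool(instances, request_id):
    canary = [i for i in instances if i.get("canary")]
    stable = [i for i in instances if not i.get("canary")]
    if canary and is_canary(request_id):
        return canary
    return stable

instances = [
    {"addr": "10.0.1.5:8080", "version": "2.0"},
    {"addr": "10.0.1.9:8080", "version": "2.1", "canary": True},
]
share = sum(is_canary(f"user-{n}") for n in range(10_000)) / 10_000
print(round(share, 3))  # close to 0.05
```

Putting this in a shared client library is exactly the organizational cost the text mentions: every caller, in every language, must carry this logic.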

Phase 6: Wrap-Up

"Service discovery is the nervous system of a microservices architecture. The technology choice (Consul vs Kubernetes DNS vs Eureka) matters less than three operational decisions: (1) AP vs CP registry behavior during partitions, (2) client-side vs registry-side health detection speed, and (3) how discovery integrates with your deployment model. Get these three wrong and your services find each other in theory but fail to communicate in practice."

Quick-Reference: The 30-Second Cheat Sheet

Level Calibration
| Topic | The L5 Answer | The L6 Answer (say this) |
| --- | --- | --- |
| Purpose | "Services find each other" | "Dynamic registry for ephemeral instances — services move, IPs change, the registry tracks them" |
| Client vs server | "Use Consul" or "Use K8s DNS" | "Server-side by default; client-side when you need routing control (canary, zone-aware)" |
| Partition behavior | "The registry is always available" | "AP vs CP tradeoff — stale data is better than no data for discovery" |
| Stale instances | "Health checks detect failures" | "Client-side detection in 3 requests; registry detection in 30-90 seconds — don't wait for the registry" |
| Deployment | "Deploy and it registers" | "Drain → deregister → update → register — every deployment step has a discovery step" |

1. The Staff Lens

1.1 Why This Problem Exists in Staff Interviews

This is NOT a "pick a service registry" question. Everyone knows Consul exists.

This is a Registry Reliability & Failure Propagation question that tests:

  • Whether you understand the registry as a single point of failure for the entire platform
  • Whether you design for registry unavailability (graceful degradation, not hard failure)
  • Whether you reason about stale discovery data (routing to dead instances, missing new instances)
  • Whether you can design naming conventions that scale with organizational growth

1.2 The L5 vs L6 Contrast

Recall the five key behaviors from the Executive Summary. Below, we explain why each matters and what interviewers listen for.

Behavior 1: First move (ask about failure modes)

Staff signal: Understand the blast radius of registry failure before choosing technology.

Why this matters (L5 vs L6)

L5: Jumps to technology selection — "We'll use Consul" or "Kubernetes services." This skips the critical design question: what happens when the registry is unavailable?

L6: Asks about the failure mode first: "If the registry is down for 5 minutes, what happens to service-to-service communication? Can services continue with cached addresses?" This shapes the entire design — if the answer is "services die when registry is down," you need client-side caching, local DNS, or mesh-based resolution.

Behavior 2: Architecture (match mechanism to organizational needs)

Staff signal: Choose the discovery mechanism based on organizational complexity, not hype.

Why this matters (L5 vs L6)

L5: Defaults to a dedicated registry because it's the "modern" approach. For 10 services with 2 teams, this is over-engineering — DNS-based discovery is simpler, cheaper, and sufficient.

L6: Matches mechanism to organizational needs: DNS for simple platforms (<20 services), dedicated registry for dynamic platforms (20-100+ services with frequent deploys), service mesh for platforms requiring zero-trust and advanced traffic management. Each step up adds capability and operational cost.


Behavior 3: Failure reasoning (registry is a Tier-0 dependency)

Staff signal: Design for registry failure as a first-class concern, not an afterthought.

Why this matters (L5 vs L6)

L5: Treats the registry as reliable infrastructure — "Consul is highly available." This assumption fails when the registry cluster has a quorum issue, network partition, or operational error.

L6: Designs defense-in-depth: (1) client-side caching — services cache discovery results and continue with cached addresses during registry outage, (2) local DNS fallback — if registry-based DNS fails, fall back to static DNS entries, (3) circuit breaker on discovery calls — don't let registry latency cascade to every service call.
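Defense layer (1) can be sketched as a caching discovery client that serves stale results when the registry is unreachable. The class and its parameters are invented names for illustration; `lookup` stands in for a real registry call.

```python
# Sketch: cache discovery results; on registry failure, serve the
# last-known-good answer for up to stale_ttl_s ("stale beats empty").
import time

class CachingDiscoveryClient:
    def __init__(self, lookup, fresh_ttl_s=10, stale_ttl_s=300):
        self.lookup = lookup                  # hypothetical registry call
        self.fresh_ttl = fresh_ttl_s
        self.stale_ttl = stale_ttl_s          # how long stale data is acceptable
        self.cache = {}                       # service -> (instances, fetched_at)

    def resolve(self, service, now=None):
        now = time.monotonic() if now is None else now
        cached = self.cache.get(service)
        if cached and now - cached[1] < self.fresh_ttl:
            return cached[0]
        try:
            instances = self.lookup(service)
            self.cache[service] = (instances, now)
            return instances
        except Exception:
            # Registry down: fall back to stale data, within a bound.
            if cached and now - cached[1] < self.stale_ttl:
                return cached[0]
            raise

def flaky_lookup(service):
    raise ConnectionError("registry unreachable")

client = CachingDiscoveryClient(lambda s: ["10.0.1.5:8080"])
print(client.resolve("checkout", now=0.0))    # fresh fetch from the registry
client.lookup = flaky_lookup
print(client.resolve("checkout", now=60.0))   # registry down: serves cached result
```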

Behavior 4: Health checking (multi-layer defense)

Staff signal: Health checks at multiple layers prevent different classes of failures.

Why this matters (L5 vs L6)

L5: Relies on registry health checks: "Consul checks services every 10 seconds." This misses the 10-second window where a service is down but still registered, and doesn't account for application-level health (service is TCP-reachable but returning errors).

L6: Designs multi-layer health checking: (1) registry-level health check (TCP/HTTP) — is the process alive? (2) client-side health check — is the endpoint responding with acceptable latency? (3) application-level health check — is the service returning correct responses? Each layer catches different failure modes.

Behavior 5: Ownership (naming encodes organizational structure)

Staff signal: Service naming conventions should make ownership obvious.

Why this matters (L5 vs L6)

L5: Uses ad-hoc naming — user-service, my-api, service-v2. At 10 services this works. At 200 services, nobody knows who owns what or what depends on what.

L6: Designs naming conventions that encode ownership: payments.checkout-service.prod, catalog.search-api.staging. Names include team/domain, service name, and environment. This makes ownership discoverable from the service name itself and enables automated routing rules (all payments.* traffic goes through payments gateway).

1.3 The Staff Question That Cuts Through Everything

Ask: who is the authority on instance health, and what happens when that authority is wrong? This reframes the entire interview. The registry isn't the problem — the health authority is the problem. Staff engineers design for the case where the health authority is wrong, not just for the case where the registry is down.

2. Problem Framing & Intent

2.1 The Three Intents

Before choosing any technology, ask What's the organizational complexity?

Who Pays Analysis
| Intent | Constraint | Strategy | Health Model | Operational Cost |
| --- | --- | --- | --- | --- |
| Simple Discovery | Minimize ops | DNS SRV records, Kubernetes Services | Kubelet/LB health checks | Minimal |
| Dynamic Discovery | Health-aware, feature-rich | Consul/Eureka with API | Active + passive health checks | Medium |
| Mesh Discovery | Zero-trust, traffic management | Istio/Linkerd, sidecar proxies | Mesh-integrated health + circuit breaking | High |

2.2 What's Intentionally Underspecified

The interviewer deliberately avoids specifying:

  • Number of services and expected growth
  • Container orchestration (Kubernetes vs bare metal vs hybrid)
  • Multi-region requirements
  • Security requirements (mTLS, zero-trust)
  • Current pain points

Staff engineers surface these unknowns. Senior engineers jump to technology selection.

2.3 How to Open (The First 2 Minutes)

  1. Ask about organizational scale and growth trajectory
  2. State your mechanism assumption explicitly
  3. Outline your plan: discovery mechanism → naming → health checking → failure modes → observability

Example opening:

  • "Before picking a mechanism: how many services and teams, and what's the growth trajectory? I'll assume roughly 50 services with frequent deploys. My plan: discovery mechanism → naming → health checking → failure modes → observability."

3. Fault Lines

3.1 Fault Line 1: Client-Side vs Server-Side Discovery

The tension: Client-side discovery gives services control over routing but requires client libraries. Server-side discovery is transparent but adds a hop.

Who Pays Analysis
| Approach | Latency | Flexibility | Client Complexity | Who Pays |
| --- | --- | --- | --- | --- |
| Client-side (service resolves address directly) | Low — direct connection | High — client controls routing | High — needs discovery library | Engineering (client library maintenance, per-language implementations) |
| Server-side (LB/proxy resolves) | +1-3ms — extra hop | Medium — LB controls routing | Low — client just calls a hostname | Infra (LB infrastructure), Users (extra latency hop) |
| Sidecar (local proxy per service) | +0.5ms — local hop | High — sidecar controls routing | None — transparent to client | Infra (sidecar resource overhead per instance) |

L6 answer: "Server-side or sidecar, depending on platform maturity. For most teams, server-side (Kubernetes Services or a service-aware LB) is sufficient — services call a DNS name, the LB resolves to a healthy instance. For platforms needing fine-grained traffic control (canary, circuit breaking), the sidecar pattern (Envoy) gives control without client library complexity. I'd avoid client-side discovery unless we have a strong reason — maintaining client libraries in 5 languages is organizational debt."

L7 answer: "The pattern should match the organizational stage. Client-side is fine for 3 services in one language. Server-side scales to 50+ services without per-team effort. Sidecar/mesh is for 100+ services needing zero-trust and traffic management. Each migration adds capability and cost. I'd start with server-side and evolve to sidecar only when the pain justifies the operational investment."

3.2 Fault Line 2: DNS vs Registry vs Mesh

The tension: DNS is simple but has TTL staleness. Registries are dynamic but add infrastructure. Service mesh is comprehensive but expensive to operate.

Who Pays Analysis
| Mechanism | Staleness | Infrastructure Cost | Feature Set | Who Pays |
| --- | --- | --- | --- | --- |
| DNS (SRV/A records, CoreDNS) | Bounded by TTL (5-60s) | Minimal — DNS is everywhere | Basic — name resolution only | Users (routing to stale/dead endpoints during TTL), Engineering (no health-aware routing) |
| Registry (Consul, Eureka, etcd) | Low (1-5s with watches/long-polling) | Medium — registry cluster to operate | Rich — health checks, KV store, watches | Infra (registry operational burden), Platform team (registry is Tier-0) |
| Service Mesh (Istio, Linkerd) | Very low (real-time updates) | High — control plane + sidecars | Full — mTLS, traffic management, observability | Infra (significant resource overhead), Engineering (mesh complexity) |

L6 answer: "DNS-based with intelligent DNS (CoreDNS or Consul DNS interface) as the default. Services resolve checkout.payments.svc via DNS. Behind the scenes, DNS returns only healthy instances with short TTLs (5-10 seconds). This gives us health-aware routing with DNS simplicity. For services needing watch-based real-time updates, the registry API is available for direct consumption."

L7 answer: "I'd layer the mechanisms. DNS as the universal interface (every language/framework supports it). Registry API for services needing real-time updates (deployment orchestration, canary tooling). Service mesh for services needing mTLS and traffic shaping. Each layer is optional — teams opt in based on their requirements. This prevents forcing mesh complexity on teams that only need DNS."

3.3 Fault Line 3: Push vs Pull Health Checks

The tension: Active health checks (registry pings services) detect failures proactively but add load. Passive health checks (observe real traffic) have zero overhead but detect failures reactively.

Who Pays Analysis
| Approach | Detection Time | Overhead | False Positives | Who Pays |
| --- | --- | --- | --- | --- |
| Active (registry polls services) | Fast (check interval) | Medium — registry sends N health checks | Some — network blip ≠ service failure | Infra (health check traffic), Services (must expose health endpoint) |
| Self-report (services heartbeat to registry) | Fast (heartbeat interval) | Low — each service sends one heartbeat | Low | Services (must implement heartbeat), Registry (processing heartbeats) |
| Passive (observe real traffic errors) | Slow — requires real traffic failure first | Zero — piggybacks on real requests | Very low | Users (first N requests hit unhealthy instance before detection) |
| Hybrid (active + passive) | Fastest — active catches process death, passive catches application errors | Medium | Balanced | Infra (two systems), best user experience |

L6 answer: "Hybrid: active health checks for process-level health (is the service reachable?), passive health checks for application-level health (is it returning correct responses?). The registry runs active TCP/HTTP checks every 5 seconds. The client-side LB or sidecar tracks real-time success rates — if a specific instance's error rate exceeds 50% over a 10-second window, remove it from the rotation immediately, don't wait for the next registry health check."

L7 answer: "I'd add a third layer: application-level readiness probes that reflect business health. A service can be TCP-healthy (process running), HTTP-healthy (returning 200), but business-unhealthy (database connection pool exhausted, returning errors for real requests). Readiness probes should check downstream dependencies, not just 'am I alive?' The three layers — liveness, readiness, and traffic-based — catch different failure classes."
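The passive layer in the L6 answer can be sketched as a sliding-window error-rate tracker. The class name, thresholds, and `min_samples` guard are illustrative assumptions, not any load balancer's actual implementation.

```python
# Sketch: eject an instance when its error rate over a 10s window exceeds 50%.
from collections import deque

class PassiveHealthTracker:
    def __init__(self, window_s=10.0, max_error_rate=0.5, min_samples=5):
        self.window_s = window_s
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples      # avoid ejecting on 1-of-1 failures
        self.samples = deque()              # (timestamp, ok)

    def record(self, ts, ok):
        self.samples.append((ts, ok))
        # Drop samples that have aged out of the window.
        while self.samples and ts - self.samples[0][0] > self.window_s:
            self.samples.popleft()

    def healthy(self):
        if len(self.samples) < self.min_samples:
            return True
        errors = sum(1 for _, ok in self.samples if not ok)
        return errors / len(self.samples) <= self.max_error_rate

t = PassiveHealthTracker()
for i in range(6):
    t.record(ts=i, ok=(i % 3 == 0))   # 2 successes, 4 errors within the window
print(t.healthy())  # False — error rate 4/6 exceeds 50%
```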

3.4 Fault Line 4: Centralized vs Distributed Registry

The tension: A centralized registry is simple but is a single point of failure. Distributed registries are resilient but complex.

Who Pays Analysis
| Approach | Availability | Consistency | Operational Cost | Who Pays |
| --- | --- | --- | --- | --- |
| Single cluster (one registry) | Limited by cluster health | Strong | Low — one thing to manage | Everyone (blast radius if registry fails) |
| Per-region cluster (registry per region) | High — regional independence | Per-region (cross-region eventual) | Medium — N clusters | Infra (multiple clusters), Engineering (cross-region discovery) |
| Embedded (registry embedded in each service) | Very high — no external dependency | Eventual (gossip) | Low — no separate infrastructure | Engineering (gossip protocol complexity), Infra (convergence time) |

L6 answer: "Per-region registry clusters with cross-region federation. Each region has its own registry cluster — services in us-east register with the us-east registry. For cross-region discovery, registries federate: they share service catalogs asynchronously. If one region's registry fails, other regions continue independently. Cross-region lookups use the federated catalog with eventual consistency (acceptable — cross-region calls are already higher latency)."

L7 answer: "I'd separate the registry's control plane from its data plane. The control plane (consensus-based, for registration and deregistration) can be per-region. The data plane (serving lookups) should be distributed — cached locally on each node, refreshed from the control plane periodically. This way, the control plane can be briefly unavailable without affecting service-to-service communication."

4. Failure Modes & Operational Reality

4.1 The Registry Cascade Failure

Timeline of a registry outage:

t=0:     Registry cluster loses quorum (AZ failure)
t=5s:    Services can't resolve new lookups
t=10s:   Client-side DNS cache starts expiring (TTL=10s)
t=15s:   Services with expired cache can't reach any other service
t=30s:   50 services report "service unavailable" errors
t=1m:    Every inter-service call fails
t=5m:    Registry quorum restored
t=5.5m:  Services re-resolve, traffic resumes

Staff insight: The 5-minute outage duration is registry-specific, but the blast radius (50 services) follows from every service depending on the registry. Client-side caching with a longer fallback TTL (e.g., serve cached discovery results for up to 5 minutes when the registry is unreachable) would have left most services unaffected for the entire outage.

4.2 The Stale Registry Entry

| Scenario | Impact | Detection Time | Mitigation |
| --- | --- | --- | --- |
| Service crashes, no deregister | Registry routes to dead instance | Health check interval (5-10s) | Active health checks + client-side circuit breaker |
| Zombie instance (process alive, not serving) | Registry shows healthy, requests fail | Until health check catches app-level failure | Application-level readiness probes |
| DNS cache stale | Client routes to old IP after service restart | DNS TTL (5-60s) | Short DNS TTL + connection retry on first failure |
| Registry split-brain | Different services see different service catalogs | Until partition heals | Per-region registries, client-side caching |

4.3 The Naming Collision

The scenario: Two teams independently name their service user-service. Both register in the same registry. Traffic intended for Team A's service routes to Team B's service.

Staff position: This is an organizational problem, not a technical one. Enforce naming conventions with namespace requirements: {team}.{service}.{environment}. Registry registration validates namespace ownership — Team A can only register services under team-a.*. Naming collisions become impossible by construction.
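"Impossible by construction" can be sketched as a registration validator. The ownership table, team identifiers, and allowed environments below are invented for illustration; a real registry would check ownership against an authoritative source.

```python
# Sketch: reject registration unless the name is namespaced as
# {team}.{service}.{environment} and the caller's team owns that namespace.
NAMESPACE_OWNERS = {"payments": "team-payments", "catalog": "team-catalog"}

def validate_registration(service_name: str, caller_team: str) -> bool:
    parts = service_name.split(".")
    if len(parts) != 3:                      # enforce {team}.{service}.{env}
        return False
    namespace, _service, env = parts
    if env not in {"prod", "staging", "dev"}:
        return False
    return NAMESPACE_OWNERS.get(namespace) == caller_team

print(validate_registration("payments.checkout-service.prod", "team-payments"))  # True
print(validate_registration("user-service", "team-a"))             # False: not namespaced
print(validate_registration("payments.fraud.prod", "team-catalog"))  # False: wrong owner
```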

4.4 Service Discovery in Deployments

During a rolling deployment, three things must hold: (1) new instances pass health checks before receiving traffic, (2) old instances drain in-flight connections before deregistering, and (3) the registry propagates updates quickly enough that clients don't route to terminated instances.

5. Evaluation Rubric

5.1 Level-Based Signals

Level Calibration
| Dimension | L5 (Senior) | L6 (Staff) | L7 (Principal) |
| --- | --- | --- | --- |
| First move | "Use Consul/Eureka" | Asks about failure modes and organizational scale | Frames as platform reliability problem |
| Mechanism | Dedicated registry for everything | Matches mechanism to scale (DNS → registry → mesh) | Layered mechanisms with opt-in complexity |
| Failure design | "Registry has HA" | Client-side caching, graceful degradation | Registry data plane separation, defense-in-depth |
| Health checks | "Registry health checks" | Multi-layer: registry + client + application | Three-layer health: liveness, readiness, traffic-based |
| Naming | Ad-hoc service names | Namespace conventions encoding ownership | Naming as platform contract with governance |
| Operational cost | Ignores operational burden | Quantifies registry as Tier-0 dependency | TCO analysis: DNS vs registry vs mesh per organizational stage |

5.2 Strong Hire Signals

| Signal | What It Looks Like |
| --- | --- |
| Registry failure reasoning | "If the registry is down for 5 minutes, services should continue with cached addresses." |
| Multi-layer health | "Registry checks TCP health. Client-side checks response health. App checks readiness." |
| Naming governance | "Names follow {team}.{service}.{env} — ownership is discoverable from the name." |
| Graceful degradation | "Services cache discovery results with 5-minute fallback. Registry failure doesn't cascade." |

5.3 Lean No Hire Signals

| Signal | What It Looks Like |
|---|---|
| Technology-first | Spends 10 minutes on Consul features without discussing registry failure |
| No degradation design | "Registry handles HA" without addressing what happens during registry outage |
| No naming strategy | Ad-hoc names without namespace or ownership conventions |
| Single-layer health | Only registry health checks, no client-side or application-level health |

5.4 Common False Positives

  • Deep Consul/Eureka knowledge ≠ good discovery design (technology, not architecture)
  • "Service mesh solves everything" ≠ understanding the operational cost
  • Complex routing rules ≠ Staff thinking (often over-engineering)
  • "Zero-downtime deployments" without explaining the discovery update mechanism

6 Interview Flow & Pivots

6.1 Typical 45-Minute Structure

| Phase | Time | What Happens |
|---|---|---|
| Framing | 0-5 min | Clarify organizational scale, ask about failure tolerance |
| Requirements | 5-10 min | Surface unknowns: service count, deployment frequency, multi-region |
| High-Level Design | 10-20 min | Discovery mechanism, naming, health checking strategy |
| Deep Dive | 20-35 min | Fault lines: registry failure, stale entries, deployment interactions |
| Wrap-Up | 35-45 min | Observability, operational readiness, evolution path |

6.2 How Interviewers Pivot

| After You Say... | They Will Probe... |
|---|---|
| "DNS-based discovery with short TTLs" | "Instance dies mid-TTL. How many requests fail before DNS refreshes?" |
| "Client-side caching for resilience" | "How do you handle cache staleness? What if a cached address points to a new, different service?" |
| "Health checks every 5 seconds" | "Service has a 3-second startup time. It's receiving traffic before it's ready. What's wrong?" |
| "Consul for registration and lookup" | "Consul cluster is in us-east. Your service is in eu-west. What's the lookup latency?" |
| "Service mesh for zero-trust" | "What's the resource overhead per service? You have 200 services with sidecars." |

6.3 What Silence Means

  • After registry failure question → Interviewer wants specific degradation behavior, not just "we have HA"
  • After naming question → Interviewer is testing organizational thinking, not technical naming
  • After health check question → Interviewer wants you to distinguish process health from application health
  • After deployment question → Interviewer wants to see awareness of the discovery-deployment interaction

6.4 Follow-Up Questions to Expect

  1. "How do you handle a service that takes 30 seconds to start? When should it receive traffic?"
  2. "Service A depends on Service B. Service B's registry entry is stale. How does Service A cope?"
  3. "You have 200 services. How do you visualize and debug the dependency graph?"
  4. "A team renames their service. How do you handle backward compatibility?"
  5. "How do you support canary deployments through service discovery?"
  6. "Your registry cluster runs out of memory. What happens?"

7 Active Drills


Drill 1: Discovery Mechanism Selection

Interview Prompt

"You're building service discovery for a 30-service platform. What mechanism do you choose?"

Staff Answer

DNS-based with a registry backend. Services resolve each other via DNS names (checkout.payments.svc). Behind the DNS interface, a registry (Consul or CoreDNS with health-aware backends) returns only healthy instances. DNS TTL of 5-10 seconds balances freshness with DNS server load.

Why DNS as the interface: every language and framework supports DNS natively — no client library needed. Why registry backend: plain DNS doesn't do health checking. The registry provides health awareness while DNS provides the universal lookup interface.

For 30 services, a dedicated registry (Consul) is justified. For 10 services, Kubernetes Services (built-in DNS + health checks) is sufficient. For 200+ services, a service mesh becomes worth the operational investment.

Why this is L6:

  • Matches mechanism to organizational scale
  • Uses DNS as the universal interface with registry backend for health awareness
  • Provides clear scaling inflection points (10 → 30 → 200 services)

Drill 2: Registry Failure Resilience

Interview Prompt

"Your service registry is down for 5 minutes. What happens to inter-service communication?"

Staff Answer

With a naive implementation: all service-to-service calls fail because nobody can resolve addresses.

With proper design: services cache discovery results locally. When the registry is unreachable, services continue using cached addresses. The cache has a fallback TTL (e.g., 5 minutes) — longer than the normal TTL (10 seconds). During the registry outage:

  • Existing services continue communicating using cached addresses (99% of traffic)
  • New service instances can't be discovered (but they can still be reached by IP if known)
  • Instances that died during the outage are still in cache — clients experience connection failures to those instances, but client-side circuit breaking removes them from the rotation

The net impact: brief increase in connection errors to recently-failed instances, but the majority of service-to-service communication continues unaffected.
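The two-tier TTL above can be sketched as a small client-side cache. This is an illustrative sketch, not any particular registry client's API; `lookup_fn` stands in for the real registry call, and the TTL values mirror the text.

```python
import time

class DiscoveryCache:
    """Client-side discovery cache with a short fresh TTL for normal
    operation and a longer fallback TTL used only when the registry
    is unreachable (names and defaults are illustrative)."""

    def __init__(self, lookup_fn, fresh_ttl=10, fallback_ttl=300,
                 clock=time.monotonic):
        self._lookup = lookup_fn           # queries the registry
        self._fresh_ttl = fresh_ttl        # seconds: normal cache lifetime
        self._fallback_ttl = fallback_ttl  # seconds: lifetime during outage
        self._clock = clock
        self._cache = {}                   # service -> (instances, fetched_at)

    def resolve(self, service):
        now = self._clock()
        entry = self._cache.get(service)
        if entry and now - entry[1] < self._fresh_ttl:
            return entry[0]                # fresh: serve from cache
        try:
            instances = self._lookup(service)
        except Exception:
            # Registry unreachable: degrade to stale-but-usable results.
            if entry and now - entry[1] < self._fallback_ttl:
                return entry[0]
            raise                          # no usable cache: surface the failure
        self._cache[service] = (instances, now)
        return instances
```

A caller wires it up as `cache = DiscoveryCache(lambda svc: registry.lookup(svc))` and resolves through `cache.resolve("checkout")`; registry failures only become visible once the fallback TTL has also expired.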

Why this is L6:

  • Designs for degradation, not perfection
  • Client-side caching as first defense
  • Client-side circuit breaking as second defense
  • Quantifies impact: "99% of traffic continues"

Drill 3: Health Check Design

Interview Prompt

"Design the health check strategy for a payment processing service."

Staff Answer

Three layers for a critical service like payments:

Layer 1 — Liveness (is the process alive?): TCP check on the service port, every 5 seconds. Registry removes instance after 3 consecutive failures (15 seconds). This catches process crashes and network issues.

Layer 2 — Readiness (can it serve traffic?): HTTP GET to /ready endpoint, every 5 seconds. The /ready endpoint checks database connectivity, downstream service connectivity, and that initialization is complete. New instances don't receive traffic until readiness passes. This catches startup delays and dependency failures.

Layer 3 — Traffic-based (is it performing well?): Client-side monitoring of response times and error rates. If an instance's p99 latency exceeds 500ms or error rate exceeds 10% over a 30-second window, the client-side LB removes it from rotation. This catches slow degradation that liveness/readiness checks miss.

For payments specifically, I'd add a synthetic transaction check: every 60 seconds, process a $0.01 test transaction end-to-end. This validates the entire payment pipeline, not just HTTP health.
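The traffic-based layer (Layer 3) can be sketched as a sliding-window tracker. This is a minimal illustration with invented names; the thresholds mirror the ones above (p99 > 500 ms or error rate > 10% over a 30-second window).

```python
import time
from collections import deque

class InstanceHealthTracker:
    """Per-instance client-side health: record each call's latency and
    error flag, then decide whether the instance stays in rotation."""

    def __init__(self, window_s=30.0, p99_limit_ms=500.0, error_limit=0.10,
                 clock=time.monotonic):
        self._window = window_s
        self._p99_limit = p99_limit_ms
        self._error_limit = error_limit
        self._clock = clock
        self._samples = deque()  # (timestamp, latency_ms, is_error)

    def record(self, latency_ms, is_error=False):
        self._samples.append((self._clock(), latency_ms, is_error))

    def _prune(self):
        cutoff = self._clock() - self._window
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def healthy(self):
        self._prune()
        if not self._samples:
            return True  # no data: assume healthy, rely on registry checks
        latencies = sorted(s[1] for s in self._samples)
        p99 = latencies[min(len(latencies) - 1, int(len(latencies) * 0.99))]
        error_rate = sum(1 for s in self._samples if s[2]) / len(self._samples)
        return p99 <= self._p99_limit and error_rate <= self._error_limit
```

The client-side load balancer consults `healthy()` before handing an instance a request; once the window slides past a bad burst, the instance re-enters rotation automatically.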

Why this is L6:

  • Three distinct layers catching different failure modes
  • Readiness check includes dependency health (not just "am I alive")
  • Traffic-based checks for progressive degradation
  • Domain-specific synthetic check for payments

Drill 4: Naming Convention Design

Interview Prompt

"Design a service naming convention for a company with 20 teams and 100+ services."

Staff Answer

Format: {domain}.{service-name}.{environment}[.{region}]

Examples:

  • payments.checkout-api.prod.us-east
  • catalog.search-service.staging
  • identity.auth-gateway.prod

Rules: (1) Domain maps to team ownership — the payments domain is owned by the payments team. (2) Service names are lowercase, hyphenated, descriptive. (3) Environment is required — prevents accidental cross-environment calls. (4) Region is optional — only needed for multi-region routing.

Enforcement: registry registration validates that the registering service's namespace matches its team ownership. Team payments can only register payments.* services. This prevents naming collisions by construction.
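Registration-time enforcement could look like the sketch below. The allowed environment values, the region pattern, and the `team_domains` ownership input are assumptions for illustration, not a real registry's API.

```python
import re

# Validator for the {domain}.{service-name}.{environment}[.{region}] convention.
NAME_RE = re.compile(
    r"^(?P<domain>[a-z][a-z0-9-]*)"       # team-owned namespace
    r"\.(?P<service>[a-z][a-z0-9-]*)"     # lowercase, hyphenated service name
    r"\.(?P<env>prod|staging|dev)"        # environment is required
    r"(?:\.(?P<region>[a-z]{2}-[a-z]+))?$"  # region is optional
)

def validate_registration(name, team_domains):
    """Reject names that break the convention or claim a namespace the
    registering team does not own (the ownership set is a hypothetical input)."""
    m = NAME_RE.match(name)
    if not m:
        raise ValueError(f"{name!r} does not match the naming convention")
    if m.group("domain") not in team_domains:
        raise ValueError(f"team does not own the {m.group('domain')!r} namespace")
    return m.groupdict()
```

Because the check runs at registration, collisions are rejected before they enter the registry rather than discovered at lookup time.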

Migration: existing services keep their old names as aliases. New convention is enforced for all new services. Over 6 months, teams migrate existing services during routine maintenance.

Why this is L6:

  • Names encode ownership (discoverable from the name)
  • Namespace enforcement prevents collisions
  • Migration strategy (not big-bang rename)
  • Includes environment to prevent cross-environment mistakes

Drill 5: Stale Discovery Handling

Interview Prompt

"A service instance dies but is still registered. How do you handle the 10-second health check gap?"

Staff Answer

Three defenses for the gap between instance death and health check detection:

  1. Client-side retry with next instance: When a connection to an instance fails, the client immediately retries with the next instance from the discovery result. The dead instance is tried once, fails, and the client transparently fails over. User-visible impact: +50ms latency for the retry, no error.

  2. Client-side circuit breaker: After 2-3 consecutive failures to an instance within 10 seconds, the client marks it as unhealthy locally. Subsequent requests skip it. The circuit breaker resets after 30 seconds (by then, the registry has also deregistered it).

  3. Connection draining on shutdown: For graceful shutdowns, the service deregisters from the registry BEFORE stopping. This eliminates the gap for planned shutdowns. The 10-second gap only applies to unplanned crashes.

Net impact: for unplanned crashes, the first 1-2 requests to the dead instance see a +50ms retry. All subsequent requests are routed to healthy instances by the client-side circuit breaker. No user-visible errors.
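Defenses 1 and 2 combine into a client-side sketch like the following. Names and parameters are illustrative; `call_fn` stands in for the actual RPC, and the breaker thresholds mirror the text (2-3 consecutive failures, 30-second reset).

```python
import time

class CircuitBreakerLB:
    """Retry-with-next-instance plus a local circuit breaker: after
    `failure_limit` consecutive failures an instance is skipped until
    `reset_after` seconds pass (by then the registry should also have
    deregistered it)."""

    def __init__(self, failure_limit=3, reset_after=30.0, clock=time.monotonic):
        self._limit = failure_limit
        self._reset = reset_after
        self._clock = clock
        self._failures = {}  # instance -> (consecutive_failures, last_failure_ts)

    def _breaker_open(self, inst):
        count, ts = self._failures.get(inst, (0, 0.0))
        if count < self._limit:
            return False
        if self._clock() - ts >= self._reset:
            self._failures.pop(inst, None)  # breaker resets: try it again
            return False
        return True

    def call(self, instances, call_fn):
        last_err = None
        for inst in instances:
            if self._breaker_open(inst):
                continue  # locally marked unhealthy: skip without trying
            try:
                result = call_fn(inst)
            except ConnectionError as err:
                count, _ = self._failures.get(inst, (0, 0.0))
                self._failures[inst] = (count + 1, self._clock())
                last_err = err
                continue  # transparent failover to the next instance
            self._failures.pop(inst, None)  # success resets the failure count
            return result
        raise last_err or RuntimeError("no instances available")
```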

Why this is L6:

  • Client-side retry as first defense (transparent to user)
  • Client-side circuit breaking as second defense (removes from rotation)
  • Connection draining for graceful shutdowns (eliminates gap entirely)
  • Quantifies user impact: "+50ms for 1-2 requests"

Drill 6: Cross-Region Discovery

Interview Prompt

"Service A in us-east needs to call Service B in eu-west. How does discovery work?"

Staff Answer

Two options depending on the routing intent:

Option 1: Region-aware routing (prefer local, fallback to remote). Service A first looks up Service B in the us-east registry. If Service B has instances in us-east, route locally. If not, the registry federates — it queries the eu-west registry and returns eu-west instances. Service A connects cross-region (80-100ms latency).

Option 2: Explicit cross-region (Service B is only in eu-west). The naming convention includes region: data-team.service-b.prod.eu-west. Service A explicitly resolves the eu-west address. This is intentional cross-region communication.

I'd default to Option 1 for most services — prefer local instances when available. The discovery system should handle failover transparently: if us-east instances of Service B are all unhealthy, automatically fall back to eu-west instances (with latency awareness).

Why this is L6:

  • Distinguishes region-aware routing from explicit cross-region calls
  • Prefers local with automatic fallback (reduces latency for common case)
  • Naming convention supports both patterns

Drill 7: Service Mesh vs Registry

Interview Prompt

"Your team is debating: dedicated registry (Consul) vs service mesh (Istio). How do you decide?"

Staff Answer

The decision depends on what you need beyond discovery. If you only need discovery + health checking, a registry is simpler and cheaper. If you need mTLS, traffic management (canary, circuit breaking), and deep observability — and you need it platform-wide — a service mesh provides these as infrastructure.

Criteria for mesh adoption: (1) >100 services (enough to justify operational cost), (2) security requirement for mTLS between all services, (3) need for platform-wide traffic management (canary deploys, traffic mirroring), (4) team with mesh operational expertise.

Criteria against mesh: (1) <50 services (over-engineering), (2) no mTLS requirement, (3) no dedicated platform team to operate the mesh, (4) performance-sensitive services that can't tolerate sidecar latency.

I'd start with a registry and migrate to mesh only when the organizational pain (manual mTLS, per-service circuit breaking implementation, inconsistent observability) justifies the mesh investment.

Why this is L6:

  • Clear decision criteria, not opinion
  • Acknowledges mesh operational cost
  • Recommends evolutionary adoption, not big-bang

Drill 8: Discovery During Deploys

Interview Prompt

"You're doing a rolling deployment of Service B. How does service discovery interact with the deployment?"

Staff Answer

The deployment lifecycle must coordinate with discovery:

  1. New instance starts → passes liveness check → registered in registry → DOES NOT receive traffic yet
  2. Readiness check passes → registry marks instance as "serving" → receives traffic
  3. Old instance begins drain → registry marks as "draining" → stops receiving NEW requests → finishes in-flight requests (30-second grace period)
  4. Drain complete → old instance deregisters → terminates

Critical timing: the readiness check gap. If a new instance is registered before it's ready (warm-up, cache loading, connection pooling), it receives traffic it can't serve. The readiness probe must confirm the instance can handle production traffic, not just that the process started.

For canary deployments: register the canary instance with a weight of 5% (or a canary tag). The discovery system routes 5% of traffic to the canary. Monitor error rates for 5 minutes. If healthy, increase weight to 25%, then 100%.
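Weighted routing over discovery results can be sketched as weight-proportional selection. The `(address, weight)` pair shape is an assumption for illustration, not a specific registry's data model.

```python
import random

def pick_instance(instances, rng=random.random):
    """Pick an instance with probability proportional to its weight.
    A canary at weight 5 against a stable pool at weight 95 receives
    ~5% of traffic."""
    total = sum(w for _, w in instances)
    point = rng() * total
    for addr, weight in instances:
        point -= weight
        if point < 0:
            return addr
    return instances[-1][0]  # guard against floating-point edge cases
```

Ramping the canary from 5% to 25% to 100% is then just a weight update in the registry; no client code changes.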

Why this is L6:

  • Full lifecycle coordination (start → ready → drain → terminate)
  • Readiness vs liveness distinction
  • Canary deployment through weighted discovery
  • Specific timing: 30-second drain period

8 Deep Dive Scenarios

Scenario-based analysis for Staff-level depth

Deep Dive 1: The Registry Outage

Context: Your Consul cluster (3 nodes) loses quorum at 2am. 80 services are unable to discover each other. Your on-call engineer has never dealt with a Consul failure before.

Questions You Should Ask First:

  • Are services actually failing right now, or are they operating on cached discovery results — how much time do we have before caches expire?
  • Why did the 3-node Consul cluster lose quorum — is this an AZ failure, disk issue, or network partition, and does the node placement tolerate single-AZ failure?
  • Does the on-call engineer have a Consul failure runbook, or are they debugging from scratch at 2am?
  • After recovery, should we upgrade to a 5-node cluster that tolerates 2 failures instead of 1?
Staff Approach

Immediate triage: (1) Are services actually failing, or are they using cached discovery results? Check service error rates. If most services are still operating (cached results), we have a 5-minute window before caches expire. (2) Why did Consul lose quorum? Check node health — AZ failure, disk full, OOM, network partition?

Recovery: If one node is recoverable (restart it), quorum is restored. If two nodes are lost, we need to bootstrap a new cluster from backup or the remaining node (unsafe but necessary in emergency). Document the procedure in a runbook BEFORE this happens.

Post-incident: (1) Move to 5-node Consul cluster (tolerates 2 failures instead of 1). (2) Implement client-side caching with 5-minute fallback TTL — future registry outages don't cascade. (3) Create a Consul failure runbook with step-by-step recovery. (4) Add synthetic monitoring that detects registry unavailability within 30 seconds.
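The 3-node vs 5-node tradeoff is majority-quorum arithmetic, shown here as a one-liner:

```python
def quorum_fault_tolerance(n):
    """A Raft-style cluster (Consul, etcd) needs a majority (n // 2 + 1)
    of nodes to stay writable, so it tolerates (n - 1) // 2 node failures:
    3 nodes tolerate 1 failure, 5 tolerate 2."""
    return (n - 1) // 2
```

This is also why even-sized clusters are a poor deal: 4 nodes tolerate the same single failure as 3 while adding cost and coordination overhead.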

The organizational fix: Consul is a Tier-0 dependency but was treated as "just infrastructure." It needs dedicated operational runbooks, monitoring, and blast radius analysis.

Metrics to Watch: consul.quorum.status (healthy/degraded/lost — alert immediately on degraded), service.discovery.cache_age_seconds (how stale cached results are — critical during registry outage), service.discovery.lookup_failure_rate (percentage of lookups failing due to registry unavailability), consul.node.health_by_az (per-AZ node health to detect placement problems).

Organizational Follow-up: Create a Consul failure runbook with step-by-step recovery procedures before the next 2am incident. Reclassify the service registry as a Tier-0 dependency with dedicated monitoring and blast radius analysis. Upgrade from 3-node to 5-node Consul with 2-2-1 AZ placement. Implement client-side caching with 5-minute fallback TTL so future registry outages don't cascade to service-to-service communication.

Staff Signals:

  • Triages based on actual impact (are services failing or using cache?)
  • Provides immediate recovery steps AND architectural improvements
  • Identifies the operational gap (no runbook, no monitoring)

Deep Dive 2: The Service Naming Migration

Context: Your company has 150 services with ad-hoc names (user-svc, my-api, service-v2-new-FINAL). Nobody knows who owns what. You need to migrate to a structured naming convention without downtime.

Questions You Should Ask First:

  • Is this actually a technical DNS migration, or is it an organizational ownership discovery exercise disguised as a rename?
  • Can we implement dual registration — both old and new names resolve simultaneously — so there's zero traffic disruption during migration?
  • How do we track migration progress with data instead of deadlines — can we monitor old-name lookup counts and deprecate names only when they hit zero?
  • What's the realistic timeline — is this 2 weeks of DNS changes, or 4-6 months of team coordination?
Staff Approach

This is an organizational migration disguised as a technical one. The naming convention reflects ownership — you need team buy-in, not just DNS changes.

Phase 1 (weeks 1-4): Audit and map. Create a service catalog: current name → owner → new name under the convention. Get each team to claim their services. This often reveals orphaned services nobody owns.

Phase 2 (weeks 4-8): Dual registration. Services register under both old and new names simultaneously. Both names resolve to the same instances. No traffic disruption. Callers can use either name.

Phase 3 (weeks 8-16): Caller migration. Teams update their service calls to use new names. Track old-name resolution counts — when a service's old name has zero lookups for 2 weeks, deprecate it.

Phase 4 (week 16+): Enforce convention. New services must follow the convention (registry validation). Old names are aliases that generate deprecation warnings in logs.

Timeline: 4-6 months for 150 services. The bottleneck is team coordination, not technology.

Metrics to Watch: discovery.old_name.lookup_count (per-service old-name resolution volume — migration is done when this reaches zero), discovery.naming.convention_compliance_pct (percentage of services following the new convention), discovery.orphaned_service.count (services with no identified owner discovered during audit), discovery.dual_registration.active_count (services currently registered under both old and new names).

Organizational Follow-up: Conduct an ownership audit: have every team claim their services in the catalog, surfacing orphaned services nobody owns. Implement dual registration so both names resolve during migration. Enforce the naming convention for all new services via registry validation. Set up a migration dashboard showing per-team progress, using old-name lookup counts as the completion signal — not arbitrary deadlines.

Staff Signals:

  • Recognizes this as organizational migration, not just technical
  • Dual registration prevents disruption
  • Uses lookup metrics to drive migration completion
  • Realistic timeline (months, not days)

Deep Dive 3: The Thundering Herd on Discovery

Context: Your registry serves 500K lookups/sec. During a large-scale deployment (30 services deploying simultaneously), the registry sees a spike to 2M lookups/sec as clients refresh their caches. The registry becomes overloaded, causing lookup failures across the platform.

Questions You Should Ask First:

  • Is the 4x traffic spike caused by synchronized cache invalidation across all callers during simultaneous deployments?
  • Can we stagger deployments in waves of 5 with 2-minute gaps instead of deploying 30 services simultaneously?
  • Are clients polling the registry on a fixed interval, or do they use random jitter so cache refreshes spread across a time window?
  • Can we switch from polling to push-based mechanisms (Consul blocking queries, etcd watches) so clients don't need to poll at all?

Thundering herd on discovery: 30 simultaneous deployments trigger cache invalidation across all callers at once, creating a 4x traffic spike that overloads the registry. Fix: stagger deployments, add client-side refresh jitter, and switch to push-based updates.

Staff Approach

Root cause: deployment triggers cache invalidation across all callers simultaneously. 30 services × N callers × cache refresh = thundering herd on the registry.

Immediate fix: stagger deployments. Don't deploy 30 services simultaneously — deploy in waves of 5 with 2-minute gaps. This spreads the cache refresh load.

Architecture fix: (1) Client-side jitter — add random jitter (0-5 seconds) to cache refresh timing. Instead of all clients refreshing at TTL expiry, they refresh within a 5-second window. (2) Watch/push instead of poll — use registry watch mechanisms (Consul blocking queries, etcd watches) where available. Clients receive updates pushed by the registry, not polling for changes. (3) Local DNS cache — services resolve through a per-node DNS cache. The cache handles the lookup load; only cache misses hit the registry.

The registry should be sized for 3-5x normal traffic to handle deployment spikes without degradation.
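The jitter fix is small. A sketch of the refresh timer, with values mirroring the text (10-second TTL, 0-5 seconds of jitter; the function name is illustrative):

```python
import random

def next_refresh_delay(ttl_s=10.0, jitter_s=5.0, rng=random.random):
    """Spread cache refreshes over a jitter window instead of refreshing
    exactly at TTL expiry, so simultaneous deployments don't synchronize
    every client's registry lookup into one spike."""
    return ttl_s + rng() * jitter_s
```

Each client sleeps `next_refresh_delay()` between refreshes, so a fleet that would have polled in lockstep instead smears its lookups across the 5-second window.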

Metrics to Watch: registry.lookup_rate_vs_baseline (current lookups/sec as a multiple of steady-state — alert at 2x), registry.deployment_concurrency_count (number of services deploying simultaneously), registry.cache_refresh.jitter_spread_seconds (variance in client refresh timing — higher is better), registry.push_subscription.active_count (clients using watch/push instead of polling).

Organizational Follow-up: Implement deployment staggering: no more than 5 services deploy simultaneously, with 2-minute gaps between waves. Add random jitter (0-5 seconds) to all client-side cache refresh timers. Create a registry capacity dashboard showing lookup rate with deployment events overlaid. Migrate clients from polling to watch/push mechanisms, tracking adoption weekly.

Staff Signals:

  • Identifies thundering herd as the root cause pattern
  • Proposes both immediate (stagger) and architectural (jitter, push, cache) fixes
  • Sizes infrastructure for spikes, not just steady state

Deep Dive 4: Multi-Region Service Discovery

Context: Your company operates in US and EU. Each region has its own service registry. A new requirement says that if all US instances of a service are unhealthy, traffic should automatically fail over to EU instances. How do you implement this?

Questions You Should Ask First:

  • What's the failover priority chain — do we prefer local unhealthy instances with circuit breakers before falling back to cross-region, or go directly to the remote region?
  • What's the latency increase when traffic fails over cross-region — have we communicated that users will see 5ms to 80ms latency increase as an explicit tradeoff?
  • How does failback work when the local region recovers — do we shift traffic back gradually to prevent flapping, or instantly?
  • Is the cross-region registry federation async with bounded staleness, or synchronous — and what's the acceptable sync interval?
Staff Approach

This requires cross-region registry federation with health-aware failover.

Design: (1) Each region's registry maintains a local service catalog AND a federated view of other regions' catalogs. Federation happens via async replication (10-second sync interval). (2) Discovery lookup priority: local healthy instances → local unhealthy instances with circuit breaker → remote region instances. (3) Cross-region failover trigger: when local instance count drops to 0 (or all are unhealthy), the discovery system returns remote region instances with a latency warning.

Implementation: DNS-based with geo-aware records. checkout.payments.svc resolves to local instances normally. When local health checks fail, DNS returns eu-west instances. Client sees higher latency (80ms vs 5ms) but the service stays available.

Critical consideration: failback. When US instances recover, traffic should gradually shift back (not instantly — validate US instances are stable for 5 minutes before shifting all traffic back). Instant failback risks a "flapping" pattern where traffic bounces between regions.
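Gradual failback can be sketched as a linear traffic ramp over the 5-minute stability window described above (the function name and linear shape are illustrative choices):

```python
def failback_weight(seconds_since_stable, ramp_s=300.0):
    """Fraction of traffic to send back to the recovered local region,
    ramped linearly over `ramp_s` seconds instead of shifting instantly,
    to avoid flapping between regions."""
    return max(0.0, min(1.0, seconds_since_stable / ramp_s))
```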

Metrics to Watch: discovery.cross_region.failover_active (boolean per service — is cross-region failover currently engaged), discovery.cross_region.latency_increase_ms (latency delta between local and cross-region resolution), discovery.federation.sync_lag_seconds (replication delay between regional registries), discovery.failback.traffic_shift_rate_pct_per_minute (how gradually traffic returns to recovered region).

Organizational Follow-up: Define a cross-region failover policy: failover triggers when local healthy instance count drops to zero, failback is gradual over 5 minutes. Communicate the latency tradeoff (5ms to 80ms) to product teams so they can set user expectations. Create a failover testing schedule: quarterly drills that simulate single-region failure and measure actual failover time. Document the failover and failback procedures in the on-call runbook.

Staff Signals:

  • Federated registry with local-first resolution
  • Clear failover trigger and priority ordering
  • Addresses failback carefully (gradual, not instant)
  • Acknowledges latency increase as explicit tradeoff

Deep Dive 5: The Ghost Service

Context: A service was decommissioned 6 months ago but its registry entry was never cleaned up. Other services still have it in their dependency graphs. A new service is assigned the same port on the same host. Traffic intended for the decommissioned service now reaches the new service.

Questions You Should Ask First:

  • Why did the registry entry persist for 6 months undetected — is there no TTL-based expiry or heartbeat requirement?
  • Is service identity based on host:port (vulnerable to port reuse collisions) or unique instance identifiers (container ID, hostname+PID)?
  • Do we have an automated registry audit that detects entries with no traffic — or does cleanup depend entirely on someone remembering a decommission checklist?
  • Is service decommission defined as a multi-step process (stop traffic, verify zero connections, deregister, remove DNS, update dependency graphs), or a single event?
Staff Approach

This is a registry hygiene and naming safety problem.

Immediate fix: remove the stale entry from the registry. Audit all registry entries for services that haven't sent a heartbeat in >24 hours — these are likely ghost entries.

Prevention: (1) TTL on all entries — every registry entry must be renewed by the service periodically (heartbeat). If the service stops heartbeating (because it's decommissioned), the entry expires automatically. No manual cleanup needed. (2) Unique service identity — register with a service ID that includes a unique instance identifier (container ID, hostname + PID), not just host:port. This prevents port reuse collisions. (3) Decommission checklist — add "remove from service registry" to the decommission runbook. Automate it if possible (decommission pipeline triggers deregistration). (4) Registry audit — monthly automated scan for entries with no traffic in >7 days. Alert the owning team.
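Defenses (1) and (2) can be sketched together: a registry whose lookups ignore entries without a recent heartbeat, keyed by a unique instance id rather than host:port. This is an illustrative model, not a real registry's API.

```python
import time

class HeartbeatRegistry:
    """TTL-based entry expiry: every registration must be renewed by
    heartbeat, and lookups skip entries whose last heartbeat is older
    than the TTL, so decommissioned services disappear without manual
    cleanup. Keying on (service, instance_id) — e.g. a container id —
    avoids port-reuse collisions."""

    def __init__(self, ttl_s=30.0, clock=time.monotonic):
        self._ttl = ttl_s
        self._clock = clock
        self._entries = {}  # (service, instance_id) -> (address, last_heartbeat)

    def register(self, service, instance_id, address):
        self._entries[(service, instance_id)] = (address, self._clock())

    def heartbeat(self, service, instance_id):
        addr, _ = self._entries[(service, instance_id)]
        self._entries[(service, instance_id)] = (addr, self._clock())

    def lookup(self, service):
        cutoff = self._clock() - self._ttl
        return [addr for (svc, _), (addr, ts) in self._entries.items()
                if svc == service and ts >= cutoff]
```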

The organizational fix: service decommission is a process, not an event. It includes: stop traffic → verify zero connections → deregister → remove DNS entries → update dependency graphs → archive.

Metrics to Watch: registry.entry.heartbeat_age_seconds (time since last heartbeat — entries older than 24 hours are likely ghosts), registry.ghost_service.count (entries with no traffic for >7 days), registry.port_collision.detected_count (instances where a new service received traffic intended for a decommissioned service), registry.decommission.checklist_completion_pct (percentage of decommissions following the full process).

Organizational Follow-up: Implement TTL-based entry expiry: every registry entry requires periodic heartbeat renewal, and entries expire automatically when the service stops heartbeating. Add unique instance identifiers (container ID) to service registration to prevent port reuse collisions. Create a monthly automated registry audit that alerts owning teams about entries with no traffic in 7 days. Redefine service decommission as a formal multi-step process with automated triggers, not a manual checklist.

Staff Signals:

  • Identifies root cause: no TTL enforcement + no decommission process
  • Proposes automatic expiry as primary defense
  • Includes organizational process fix, not just technical fix
  • Addresses the port reuse collision specifically

9 Level Expectations Summary

After studying this playbook, you should be able to:

  • Choose between DNS-based, registry-based, and mesh-based discovery based on organizational scale
  • Design client-side caching for graceful degradation during registry outages
  • Implement multi-layer health checking (liveness, readiness, traffic-based)
  • Design naming conventions that encode team ownership and prevent collisions
  • Walk through a registry failure scenario with timeline and blast radius analysis
  • Explain the tradeoffs between client-side and server-side discovery patterns
  • Design cross-region service discovery with health-aware failover

The Bar for This Question

Mid-level (L4/E4): Understands the fundamental problem service discovery solves — how services find each other when instances are ephemeral and IP addresses change. Can describe DNS-based or registry-based discovery at a high level and explain why hardcoded addresses don't work in dynamic environments. Knows that services register themselves and consumers look them up, even if the design lacks depth on failure modes.

Senior (L5/E5): Makes a deliberate choice between client-side discovery (consumer queries registry, picks an instance) and server-side discovery (load balancer queries registry, consumer hits a stable endpoint) with clear tradeoffs. Integrates health checks with registration so that unhealthy instances are removed from the registry promptly. Handles stale registrations from crashed services (TTL-based expiry, active health probes). Reasons about how rolling deployments interact with discovery — new instances registering before they're ready, old instances deregistering while draining connections.

Staff+ (L6/E6+): Identifies the service registry as the most invisible single point of failure in the platform — every service depends on it, but nobody thinks about its availability until it fails. Designs cache-based resilience so that services can continue routing to last-known-good endpoints during a registry outage. Addresses the bootstrapping problem: how does a service discover the registry itself? Thinks about organizational ownership — who runs the discovery platform, what SLOs govern it, and how is the cost of operating a consensus-backed registry justified to leadership.

10 Staff Insiders: Controversial Opinions

10.1 "DNS Is Underrated for Service Discovery"

The conventional wisdom is that you need a dedicated service registry (Consul, Eureka) for microservices. The reality: DNS-based discovery covers 80% of use cases with zero additional infrastructure. Every language, framework, and tool supports DNS. Short TTLs (5-10 seconds) provide near-real-time updates. Health-aware DNS (CoreDNS with health checks, Consul DNS interface) provides the best of both worlds.

The experienced approach: start with DNS. Add a registry API only when you need features DNS can't provide (watches, KV store, ACLs). Most teams add a registry too early and pay the operational cost of running a consensus-based cluster for a problem that DNS solves.

And that's OK. Simple solutions that work are better than sophisticated solutions that require a dedicated team to operate.

10.2 "The Service Registry Is Your Most Invisible Single Point of Failure"

Everyone worries about database availability. Nobody worries about registry availability — until it fails and takes down every service simultaneously. The registry is the one dependency that every service shares, and it's invisible because it "just works" until it doesn't.

The uncomfortable truth: your microservices platform's availability is capped by your registry's availability. A 99.99% available payment service depending on a 99.9% available Consul cluster is effectively 99.9% available. Invest in registry resilience — 5-node clusters, cross-AZ deployment, client-side caching, automated recovery, and chaos testing.
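The availability math is just multiplication along the serial dependency chain; a quick check of the numbers above:

```python
def composite_availability(*parts: float) -> float:
    """Availability of a serial dependency chain: the product of the parts."""
    total = 1.0
    for a in parts:
        total *= a
    return total

# 99.99% payment service depending on a 99.9% registry:
print(round(composite_availability(0.9999, 0.999), 4))  # 0.9989
```

Client-side caching changes this picture: if services can serve lookups from last-known-good state during a registry outage, the registry drops out of the serial chain for the steady state.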

Treat the registry like you treat the database: with respect, dedicated resources, and a runbook for 3am failures.

10.3 "Service Mesh Is the Right Answer for the Wrong Time"

Service mesh (Istio, Linkerd) provides everything: discovery, health checking, mTLS, traffic management, observability. It's the "right" answer architecturally. But the operational cost is enormous — sidecar proxies consume 10-15% of your compute, the control plane is a complex distributed system, and debugging mesh issues requires specialized expertise.

The counterintuitive truth: most companies adopt service mesh too early. They have 30 services, 3 teams, and no mTLS requirement — but they deploy Istio because it's the "modern" approach. The result: 6 months of mesh debugging instead of shipping features.

Staff insight: adopt mesh when the organizational pain (manual mTLS, inconsistent observability, per-service circuit breaking) exceeds the mesh operational cost. For most companies, that's at 100+ services with a dedicated platform team. Before that, DNS + registry + client-side libraries is simpler, cheaper, and more debuggable.

Appendices
Appendix A: Service Discovery Technologies

A.1 Technology Comparison

| Technology | Discovery Model | Health Check | Consistency | Best For |
|---|---|---|---|---|
| Consul | Registry + DNS | Active (TCP/HTTP/gRPC) | Raft consensus | General purpose, multi-datacenter |
| etcd | Key-value store | Client-managed | Raft consensus | Kubernetes, config management |
| ZooKeeper | Hierarchical registry | Ephemeral nodes + watches | ZAB consensus | Legacy systems, Kafka |
| Eureka | Registry (AP model) | Client heartbeat | Eventual (no consensus) | Spring ecosystem, availability-first |
| Kubernetes Services | Built-in DNS + endpoints | Kubelet probes (liveness/readiness) | etcd-backed | Kubernetes-native workloads |
| CoreDNS | DNS server | Plugin-based | Depends on backend | DNS-based discovery, Kubernetes |

A.2 Decision Matrix

| Requirement | Consul | Eureka | K8s Services | Plain DNS |
|---|---|---|---|---|
| Health-aware routing | ✅ | ✅ | ✅ | ❌ |
| Multi-datacenter | ✅ | ❌ | ❌ (needs federation) | ✅ |
| No additional infra | ❌ | ❌ | ✅ (if on K8s) | ✅ |
| KV store / config | ✅ | ❌ | ✅ (ConfigMaps) | ❌ |
| Service mesh integration | ✅ (Consul Connect) | ❌ | ✅ (Istio/Linkerd) | ❌ |
Appendix B: Health Check Patterns

B.1 Three-Layer Health Model

(Diagram: the three layers of health checking: liveness, readiness, and traffic-based, with timings detailed in B.3.)

B.2 Health Endpoint Design

JSON
// GET /health (liveness)
{ "status": "ok" }

// GET /ready (readiness)
{
  "status": "ready",
  "checks": {
    "database": "connected",
    "cache": "connected",
    "downstream_auth": "reachable"
  }
}

// GET /health/detailed (debugging, not for load balancer)
{
  "status": "ready",
  "uptime_seconds": 86400,
  "version": "2.1.3",
  "connections": { "db": 45, "cache": 12 },
  "memory_mb": 256,
  "cpu_percent": 15
}
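A minimal sketch of how a /ready handler could assemble that payload (the helper name and status strings are illustrative, not a prescribed API): return HTTP 200 only when every dependency check passes, and 503 otherwise so the load balancer pulls the instance from rotation.

```python
import json

def readiness_payload(checks: dict[str, bool]) -> tuple[int, str]:
    """Build a /ready response: HTTP 200 only if every dependency is up."""
    ok = all(checks.values())
    body = {
        "status": "ready" if ok else "not_ready",
        "checks": {name: ("connected" if up else "failed")
                   for name, up in checks.items()},
    }
    return (200 if ok else 503), json.dumps(body)

status, body = readiness_payload({"database": True, "cache": True})
```

Keeping the status code authoritative (rather than making the load balancer parse the body) lets any off-the-shelf health checker consume the endpoint.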

B.3 Health Check Timing

| Check Type | Interval | Failure Threshold | Recovery Threshold |
|---|---|---|---|
| Liveness | 5s | 3 failures (15s) | 1 success |
| Readiness | 5s | 1 failure (immediate) | 3 successes (15s) |
| Traffic-based | Continuous | >10% error rate over 30s | <1% error rate over 60s |

Note: readiness is stricter (1 failure removes from rotation) because sending traffic to an unready service causes user-visible errors. Recovery is stricter (3 successes) to prevent flapping.
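The asymmetric readiness thresholds reduce to a small state machine; a sketch, with the class name and default streak length as assumptions:

```python
class ReadinessTracker:
    """Tracks probe results with asymmetric thresholds: one failed probe
    removes the instance from rotation immediately, and a streak of
    consecutive successes is required to re-admit it (prevents flapping)."""

    def __init__(self, recovery: int = 3):
        self.recovery = recovery
        self.in_rotation = True
        self._streak = 0  # consecutive successes while out of rotation

    def record(self, success: bool) -> bool:
        """Record one probe result; return whether the instance takes traffic."""
        if not success:
            self._streak = 0
            self.in_rotation = False          # 1 failure: immediate removal
        elif not self.in_rotation:
            self._streak += 1
            if self._streak >= self.recovery:  # e.g. 3 successes: re-admit
                self.in_rotation = True
        return self.in_rotation
```

A single success while already in rotation changes nothing; the streak only matters on the way back in, which is exactly the anti-flapping asymmetry described above.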

Appendix C: Naming Conventions Reference

C.1 Naming Format

{domain}.{service-name}.{environment}[.{region}]

Examples:
  payments.checkout-api.prod.us-east-1
  catalog.search-service.staging
  identity.auth-gateway.prod
  platform.api-gateway.prod.eu-west-1
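The format lends itself to mechanical validation at registration time. A sketch follows; the regex details and the allowed environment set are assumptions, not part of the convention above.

```python
import re

# {domain}.{service-name}.{environment}[.{region}]
SERVICE_NAME = re.compile(
    r"^(?P<domain>[a-z][a-z0-9-]*)"
    r"\.(?P<service>[a-z][a-z0-9-]*)"
    r"\.(?P<env>prod|staging|dev)"          # allowed environments: an assumption
    r"(?:\.(?P<region>[a-z]{2}-[a-z]+-\d))?$"
)

def parse_service_name(name: str):
    """Return the name's parts as a dict, or None if it breaks the convention."""
    m = SERVICE_NAME.match(name)
    return m.groupdict() if m else None
```

The lowercase-and-hyphen rule falls out of the character classes; a registry can simply reject any registration where parse_service_name returns None.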

C.2 Naming Rules

| Rule | Good | Bad |
|---|---|---|
| Lowercase, hyphenated | checkout-api | CheckoutAPI, checkout_api |
| Descriptive | order-processor | service-v2, my-api |
| Domain-prefixed | payments.checkout-api | checkout-api (no domain) |
| No version in name | checkout-api | checkout-api-v2 |
| Environment explicit | checkout-api.prod | checkout-api (which env?) |

C.3 Reserved Domains

| Domain | Purpose | Owner |
|---|---|---|
| platform | Infrastructure services (gateway, monitoring) | Platform team |
| identity | Auth, user management | Identity team |
| internal | Internal tools, admin services | Various |

C.4 Migration from Ad-Hoc Names

  1. Create service catalog mapping old → new names
  2. Register services under both old and new names (dual registration)
  3. Track old-name lookup counts via metrics
  4. When old-name lookups reach zero for 2 weeks, deprecate the old name
  5. Enforce new naming convention for all new services via registry validation
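Steps 2 and 3 can be sketched as a thin wrapper around the registry (class and method names here are illustrative): deprecated names stay resolvable, but every lookup against one is counted so you know when it is safe to deprecate.

```python
from collections import Counter

class MigratingRegistry:
    """Registry wrapper for a rename migration: serves old and new names,
    and counts lookups against deprecated names."""

    def __init__(self, aliases: dict[str, str]):
        self.aliases = aliases                      # old name -> new name
        self.services: dict[str, list[str]] = {}    # name -> instance addresses
        self.old_name_lookups = Counter()           # feeds the migration metric

    def register(self, name: str, addr: str) -> None:
        self.services.setdefault(name, []).append(addr)

    def lookup(self, name: str) -> list[str]:
        if name in self.aliases:                    # deprecated name: count, forward
            self.old_name_lookups[name] += 1
            name = self.aliases[name]
        return self.services.get(name, [])
```

When old_name_lookups stays at zero for the agreed window, the alias can be dropped (step 4) without breaking any live consumer.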

You just read the full Design Service Discovery playbook.
