Build vs Buy Framework

What This Framework Covers

Every infrastructure decision includes a build vs buy tradeoff. Staff engineers don't default to building; they make explicit decisions based on:

What's actually differentiated
Total cost of ownership (TCO)
Time-to-value
Operational burden

The Staff Mindset

The question is not "can we build it?" — the answer is always yes.

The question is: "What's the smallest custom surface that delivers our differentiated value?"

The Framework

Step 1: Inventory Existing Controls

Before designing anything custom, list what's already available:

Layer	Examples	Identity Context
CDN/WAF	Cloudflare, Akamai, AWS WAF	IP, ASN, geo
Managed Gateway	AWS API Gateway, Kong, Apigee	API key, OAuth
Service Mesh	Envoy RLS, Istio, Linkerd	Service identity
Cloud Provider	AWS throttling, GCP quotas	Account-level

Step 2: Identify the Gap

The gap is where existing solutions don't meet your requirements:

Requirement	CDN/WAF	Managed Gateway	Service Mesh	Custom Needed?
IP-based abuse	✅	✅	❌	No
API key throttling	❌	✅	❌	Maybe
User-level quotas	❌	Maybe	❌	Probably
Billing integration	❌	❌	❌	Yes
Custom fairness logic	❌	❌	❌	Yes

The gap is where you need custom work. Everything else should use existing solutions.

Step 3: Calculate TCO

Cost Category	Build	Buy/Use Managed
Initial development	High	Low
Ongoing maintenance	High (your engineers)	Low (vendor)
On-call burden	High	Low-Medium
Scaling complexity	High	Low (vendor handles)
Feature velocity	Slow (you build it)	Fast (vendor roadmap)
Customization	Full control	Limited
Vendor lock-in	None	Medium-High

Staff-grade TCO calculation:

Build TCO =
  (Engineers × Time × Loaded Cost) +
  (Ongoing maintenance FTE × Years) +
  (Incident cost × Expected incidents) +
  (Opportunity cost of not building product features)

Buy TCO =
  (Vendor cost × Years) +
  (Integration effort) +
  (Migration cost if vendor fails) +
  (Customization workarounds)

Step 4: Time-to-Value Analysis

Scenario	Build	Buy
Need it in 2 weeks	Impossible	Possible
Need it in 3 months	Risky	Safe
Need it in 1 year	Possible	Overkill if simple
Regulatory deadline	High risk	Lower risk

Decision Framework

Rendering diagram...

The "Auth Context Gap"

A common pattern: managed solutions work for coarse signals (IP, ASN) but can't access your auth context (user ID, tenant, plan tier).

Solutions:

Push auth to edge — If CDN/WAF can read a header or cookie with identity, you can use edge limiting
Gateway integration — API gateway often has auth plugins; rate limiting after auth gives you identity
Hybrid — CDN/WAF for volumetric abuse (no auth needed), custom for identity-aware quotas

Applying to Common Systems

Rate Limiting

Layer	Managed Option	What It Can Do	Gap
Edge	Cloudflare Rate Limiting	IP, ASN, path	No auth context
Gateway	Kong, AWS API Gateway	API key, basic quotas	Limited fairness logic
Custom	Your code	Full control	You own it

Staff recommendation:

Push volumetric abuse protection to CDN/WAF
Use gateway for API key / basic auth throttling
Build custom only for billing integration or complex fairness

Observability

Option	Pros	Cons
Datadog/New Relic	Fast setup, rich features	Expensive at scale
Self-hosted (Prometheus/Grafana)	Cost control, customization	Operational burden
Hybrid	Best of both	Complexity

Staff recommendation:

Start with managed (time-to-value)
Migrate high-volume metrics to self-hosted when cost justifies

Authentication

Option	Pros	Cons
Auth0/Okta	Fast, secure, compliant	Expensive, lock-in
Cognito/Firebase Auth	Cheaper, cloud-native	Less flexible
Custom	Full control	Security risk, maintenance

Staff recommendation:

Almost never build custom auth
The "gap" is rarely worth the security risk

L6 vs L7 Calibration

Dimension	L6 (Staff)	L7 (Principal)
First instinct	"What exists?"	"What's our long-term platform strategy?"
Gap analysis	Identifies what's missing	Quantifies the cost of the gap
TCO	Rough estimate	Spreadsheet with multi-year projection
Migration	"We can migrate later"	"Here's the migration plan and exit criteria"
Org impact	"This affects my team"	"This affects 10 teams; here's the adoption plan"

Interview Probes

After You Say...	They Will Ask...
"We'll build a rate limiter"	"What about Cloudflare / AWS WAF / Envoy RLS?"
"We'll use managed"	"What's the gap? How will you handle [X]?"
"Custom for full control"	"What's the TCO? Who operates it?"
"Hybrid approach"	"Where's the boundary? How do they interact?"

Anti-Patterns

1. "Not Invented Here"

Building everything from scratch because "we can do it better" — ignoring TCO and opportunity cost.

2. "Vendor Will Solve Everything"

Assuming managed solutions have no gaps — then discovering limitations in production.

3. "We'll Abstract It Later"

Building tightly coupled to a vendor with "we'll add an abstraction layer" — rarely happens.

4. "It's Just a Small Service"

Underestimating operational burden — "small" services still need monitoring, on-call, upgrades.

5. "Cloud Provider Lock-in is Bad"

Over-rotating on portability — the cost of multi-cloud abstraction often exceeds lock-in risk.

Staff Sentence Templates

Implementation Deep Dive

1. Gap Analysis — The Systematic Approach

The most common Build vs Buy mistake is deciding before enumerating requirements. Staff engineers use a structured gap analysis to identify exactly where managed solutions fall short.

Gap Analysis Worksheet

# Step 1: List every requirement
REQUIREMENTS = [
    { "name": "IP-based rate limiting",        "priority": "P0" },
    { "name": "API key throttling",            "priority": "P0" },
    { "name": "User-level quotas",             "priority": "P0" },
    { "name": "Billing integration (metering)", "priority": "P1" },
    { "name": "Custom fairness (weighted)",    "priority": "P2" },
    { "name": "Real-time usage dashboard",     "priority": "P2" },
]

# Step 2: Evaluate each managed solution
SOLUTIONS = {
    "Cloudflare":       { "covers": ["IP-based rate limiting"],
                          "gaps": ["API key throttling", "User-level quotas",
                                   "Billing integration", "Custom fairness"] },
    "AWS API Gateway":  { "covers": ["IP-based rate limiting", "API key throttling"],
                          "gaps": ["User-level quotas", "Billing integration",
                                   "Custom fairness"] },
    "Kong (gateway)":   { "covers": ["IP-based rate limiting", "API key throttling",
                                      "User-level quotas"],
                          "gaps": ["Billing integration", "Custom fairness"] },
}

# Step 3: Identify the gap
# Kong covers P0 requirements. The gap is P1 (billing) and P2 (fairness, dashboard).
# Decision: Use Kong for P0, build custom for billing integration only.

2. Thin Custom Layer on Top of Managed — The Hybrid Pattern

The Staff default is rarely "build everything" or "buy everything." It is: use managed for commodity functionality, build a thin custom layer for the differentiated gap.

Hybrid Architecture Example — Rate Limiting

# Layer 1: Cloudflare (managed) — IP-based abuse protection
# Handles: DDoS, bot detection, volumetric attacks
# Configuration: Cloudflare dashboard, no custom code
# Cost: $0.05 per 10K requests

# Layer 2: Kong (managed gateway) — API key throttling
# Handles: per-key rate limits, basic quotas
# Configuration: Kong rate-limiting plugin
# Cost: Kong license + hosting

# Layer 3: Custom service — billing integration
# Handles: metered usage, plan-tier quotas, overage billing
# This is the ONLY custom code

class BillingAwareQuota:
    function checkQuota(userId, operation):
        # Read plan limits from billing system cache
        plan = billingCache.getPlan(userId)       # Cached, refreshed every 60s
        usage = usageCounter.get(userId, period="current_month")

        if usage >= plan.hardLimit:
            return { allowed: false, reason: "plan_limit_reached",
                     upgradeUrl: "/pricing" }

        if usage >= plan.softLimit:
            # Overage billing: allow but meter at overage rate
            usageCounter.increment(userId)
            return { allowed: true, overage: true,
                     overageRate: plan.overageRate }

        usageCounter.increment(userId)
        return { allowed: true, overage: false }

3. TCO Calculation — A Concrete Example

Abstract TCO comparisons are useless. Here is a concrete calculation for a rate limiting system.

Build vs Buy TCO (3-Year Projection)

# === BUILD ===
Initial development:
  2 senior engineers × 3 months × $20K/month loaded = $120,000
  Code review, testing, security audit              =  $30,000
                                            Subtotal = $150,000

Ongoing maintenance (per year):
  0.5 engineer (bug fixes, upgrades, on-call)       =  $120,000/year
  Infrastructure (Redis cluster, monitoring)         =   $24,000/year
  Incident cost (2 incidents/year × $5K each)        =   $10,000/year
                                         Subtotal/yr = $154,000

Opportunity cost:
  2 engineers × 3 months not building product features
  (Hard to quantify but real)

3-Year Build TCO = $150,000 + (3 × $154,000) = $612,000

# === BUY (Kong + Cloudflare + custom billing layer) ===
Managed services (per year):
  Kong Enterprise license                            =  $50,000/year
  Cloudflare Pro + rate limiting add-on              =  $12,000/year
                                         Subtotal/yr =  $62,000

Custom billing layer:
  1 engineer × 2 weeks                               =  $10,000
  Ongoing: 0.05 engineer (minimal maintenance)       =  $12,000/year

Integration effort:
  1 engineer × 1 month                               =  $20,000

3-Year Buy TCO = $30,000 + (3 × $74,000) = $252,000

Verdict: Buy saves $360K over 3 years and frees 2 engineers for product work. Build is justified only if the managed solutions cannot meet a P0 requirement that no workaround can address.

4. Vendor Exit Strategy — The Migration Plan

Every "buy" decision needs an exit plan. What happens when the vendor raises prices 3x, gets acquired, or sunsets the product?

Exit Strategy Checklist

VENDOR_EXIT_CHECKLIST = {
    "abstraction_layer": {
        "description": "Is vendor logic behind an interface you control?",
        "example": "RateLimiter interface with KongRateLimiter and CustomRateLimiter implementations",
        "effort_if_missing": "2-4 weeks to extract and abstract"
    },
    "data_portability": {
        "description": "Can you export your configuration, usage data, and audit logs?",
        "example": "Kong config exported as YAML, usage metrics in your own Prometheus",
        "effort_if_missing": "1-2 weeks to build export pipeline"
    },
    "feature_parity_plan": {
        "description": "If you had to replace this vendor in 90 days, what features would you skip?",
        "example": "Skip: dashboard UI, anomaly detection. Keep: core rate limiting, quota enforcement",
        "effort_if_missing": "Depends on gap size — 1 month to 6 months"
    },
    "contractual_protection": {
        "description": "Source code escrow, price lock, or migration period clause?",
        "example": "1-year price lock, 90-day migration period on termination",
        "effort_if_missing": "Negotiate before signing, not after"
    },
}

Architecture Diagram

Rendering diagram...

Each layer handles what it does best: Cloudflare handles volumetric abuse at the edge (no auth needed). Kong handles API key throttling at the gateway (auth available). Your code handles billing integration (requires business context). Traffic is filtered at each layer — by the time a request reaches your application, it has survived three layers of protection with zero custom infrastructure.

Failure Scenarios

1. Vendor Rate Limit Exceeds Your Budget

Timeline: You use AWS API Gateway for rate limiting. Your API grows from 1M requests/day to 50M. AWS API Gateway charges $3.50 per million requests. Your monthly bill goes from $105 to $5,250 — a 50x increase. The finance team asks why infrastructure costs grew 50x when revenue grew 3x.

Blast radius: Budget pressure forces a migration. But migrating mid-growth means engineering time diverted from product features during the highest-growth period.

Detection: Cloud cost monitoring alerts when spend exceeds budget threshold. Monthly cost reviews with finance.

Recovery:

Immediate: optimize API call volume — client-side caching, request batching, CDN for cacheable responses can reduce API Gateway calls by 40-60%
Short-term: move high-volume, low-value endpoints (health checks, polling) to a cheaper path (direct to ALB, bypassing API Gateway)
Long-term: evaluate self-hosted gateway (Kong OSS on EKS) for the commodity functionality. Keep AWS API Gateway for low-volume endpoints where the managed convenience is worth the premium

2. Managed Solution Missing Critical Feature

Timeline: You use Auth0 for authentication. A new compliance requirement mandates that user sessions must be revocable within 30 seconds across all regions. Auth0's token revocation propagation takes 5-10 minutes (their global CDN caches JWKS endpoints). You cannot meet the compliance deadline with Auth0 alone.

Blast radius: Compliance deadline is in 8 weeks. Migrating off Auth0 is a 6-month project. Building a custom auth system from scratch is a 12-month project and a security risk.

Detection: Compliance review identifies the gap during the assessment phase (ideally before the deadline, not after).

Recovery:

Hybrid approach: keep Auth0 for authentication issuance. Add a thin custom revocation layer — a Redis-backed bloom filter checked on every request that overrides the JWT validity
The custom layer is ~200 lines of code: check bloom filter before accepting JWT. If the token ID is in the bloom filter, reject regardless of JWT validity
This meets the 30-second revocation requirement without replacing Auth0

3. Vendor Lock-In During Acquisition

Timeline: Your primary CDN vendor (mid-tier provider) is acquired by a competitor. The acquiring company announces end-of-life for the product in 12 months. Your entire content delivery and edge caching strategy depends on this vendor's proprietary API.

Blast radius: 12 months to migrate all edge logic, caching rules, and delivery configuration to a new vendor. If you did not abstract the vendor's API behind an interface, every CDN integration point in your codebase needs modification.

Detection: Vendor communication (acquisition announcement). Hopefully caught in your vendor risk assessment process.

Recovery:

If you have an abstraction layer: implement a new CDN adapter (Cloudflare, Fastly, AWS CloudFront) behind the same interface. Migration is configuration, not code.
If you do not have an abstraction layer: extract vendor-specific code into an adapter, implement the new vendor adapter, then switch. This takes 2-4 months.
Negotiate migration timeline with the acquiring company (most offer 12-18 months of support post-acquisition)

Staff Interview Application

How to Introduce This Pattern

Lead with the inventory of existing solutions. This signals that you don't default to building, and that you understand the total cost of ownership.

When NOT to Use Managed (Build Instead)

Core business differentiator: If the system IS the product (a payment processor building payment processing, a CDN building content delivery), building is justified because the operational expertise is your competitive advantage.
No managed solution exists: Novel problems (custom ML inference at edge, proprietary protocol handling) have no off-the-shelf solution. Build with an eye toward open-sourcing or spinning out.
Compliance prohibits third-party data handling: Some regulatory environments (defense, certain healthcare) cannot send data to SaaS vendors. Self-hosted or custom-built is the only option.
Cost crossover reached: When managed cost exceeds custom cost at your scale and the custom solution is proven (not speculative), migration is justified.

Follow-Up Questions to Anticipate

Interviewer Asks	What They Are Testing	How to Respond
"Why not just build it?"	TCO and opportunity cost awareness	"I could build it. The question is: should I? The TCO for custom is [X] over 3 years including opportunity cost. Managed is [Y]. Unless the gap is a P0 business differentiator, I'd use managed."
"What about vendor lock-in?"	Pragmatic risk assessment	"I'd abstract the vendor behind an interface from day one. Migration cost with an abstraction layer is weeks, not months. The lock-in risk is real but manageable."
"What if the vendor doesn't support [X]?"	Gap analysis depth	"That's the gap. I'd evaluate: is [X] a P0 requirement? Can we work around it? If it's truly critical and no workaround exists, I'd build only the [X] component and use managed for everything else."
"How do you evaluate vendors?"	Decision process maturity	"Three criteria: does it meet P0 requirements? What is the 3-year TCO including scale? What is the exit plan if the vendor fails? I'd prototype with the top 2 candidates before committing."