StaffSignal
Cross-Cutting Framework

Build vs Buy Framework

TCO analysis, operational burden assessment, and when to use managed services vs custom solutions.

Build vs Buy Framework

What This Framework Covers

Every infrastructure decision includes a build vs buy tradeoff. Staff engineers don't default to building; they make explicit decisions based on:

  • What's actually differentiated
  • Total cost of ownership (TCO)
  • Time-to-value
  • Operational burden

The Staff Mindset

The question is not "can we build it?" — the answer is always yes.

The question is: "What's the smallest custom surface that delivers our differentiated value?"


The Framework

Step 1: Inventory Existing Controls

Before designing anything custom, list what's already available:

LayerExamplesIdentity Context
CDN/WAFCloudflare, Akamai, AWS WAFIP, ASN, geo
Managed GatewayAWS API Gateway, Kong, ApigeeAPI key, OAuth
Service MeshEnvoy RLS, Istio, LinkerdService identity
Cloud ProviderAWS throttling, GCP quotasAccount-level

Step 2: Identify the Gap

The gap is where existing solutions don't meet your requirements:

RequirementCDN/WAFManaged GatewayService MeshCustom Needed?
IP-based abuseNo
API key throttlingMaybe
User-level quotasMaybeProbably
Billing integrationYes
Custom fairness logicYes

The gap is where you need custom work. Everything else should use existing solutions.

Step 3: Calculate TCO

Cost CategoryBuildBuy/Use Managed
Initial developmentHighLow
Ongoing maintenanceHigh (your engineers)Low (vendor)
On-call burdenHighLow-Medium
Scaling complexityHighLow (vendor handles)
Feature velocitySlow (you build it)Fast (vendor roadmap)
CustomizationFull controlLimited
Vendor lock-inNoneMedium-High

Staff-grade TCO calculation:

Build TCO =
  (Engineers × Time × Loaded Cost) +
  (Ongoing maintenance FTE × Years) +
  (Incident cost × Expected incidents) +
  (Opportunity cost of not building product features)

Buy TCO =
  (Vendor cost × Years) +
  (Integration effort) +
  (Migration cost if vendor fails) +
  (Customization workarounds)

Step 4: Time-to-Value Analysis

ScenarioBuildBuy
Need it in 2 weeksImpossiblePossible
Need it in 3 monthsRiskySafe
Need it in 1 yearPossibleOverkill if simple
Regulatory deadlineHigh riskLower risk

Decision Framework

Rendering diagram...

The "Auth Context Gap"

A common pattern: managed solutions work for coarse signals (IP, ASN) but can't access your auth context (user ID, tenant, plan tier).

Solutions:

  1. Push auth to edge — If CDN/WAF can read a header or cookie with identity, you can use edge limiting
  2. Gateway integration — API gateway often has auth plugins; rate limiting after auth gives you identity
  3. Hybrid — CDN/WAF for volumetric abuse (no auth needed), custom for identity-aware quotas

Applying to Common Systems

Rate Limiting

LayerManaged OptionWhat It Can DoGap
EdgeCloudflare Rate LimitingIP, ASN, pathNo auth context
GatewayKong, AWS API GatewayAPI key, basic quotasLimited fairness logic
CustomYour codeFull controlYou own it

Staff recommendation:

  • Push volumetric abuse protection to CDN/WAF
  • Use gateway for API key / basic auth throttling
  • Build custom only for billing integration or complex fairness

Observability

OptionProsCons
Datadog/New RelicFast setup, rich featuresExpensive at scale
Self-hosted (Prometheus/Grafana)Cost control, customizationOperational burden
HybridBest of bothComplexity

Staff recommendation:

  • Start with managed (time-to-value)
  • Migrate high-volume metrics to self-hosted when cost justifies

Authentication

OptionProsCons
Auth0/OktaFast, secure, compliantExpensive, lock-in
Cognito/Firebase AuthCheaper, cloud-nativeLess flexible
CustomFull controlSecurity risk, maintenance

Staff recommendation:

  • Almost never build custom auth
  • The "gap" is rarely worth the security risk

L6 vs L7 Calibration

DimensionL6 (Staff)L7 (Principal)
First instinct"What exists?""What's our long-term platform strategy?"
Gap analysisIdentifies what's missingQuantifies the cost of the gap
TCORough estimateSpreadsheet with multi-year projection
Migration"We can migrate later""Here's the migration plan and exit criteria"
Org impact"This affects my team""This affects 10 teams; here's the adoption plan"

Interview Probes

After You Say...They Will Ask...
"We'll build a rate limiter""What about Cloudflare / AWS WAF / Envoy RLS?"
"We'll use managed""What's the gap? How will you handle [X]?"
"Custom for full control""What's the TCO? Who operates it?"
"Hybrid approach""Where's the boundary? How do they interact?"

Anti-Patterns

1. "Not Invented Here"

Building everything from scratch because "we can do it better" — ignoring TCO and opportunity cost.

2. "Vendor Will Solve Everything"

Assuming managed solutions have no gaps — then discovering limitations in production.

3. "We'll Abstract It Later"

Building tightly coupled to a vendor with "we'll add an abstraction layer" — rarely happens.

4. "It's Just a Small Service"

Underestimating operational burden — "small" services still need monitoring, on-call, upgrades.

5. "Cloud Provider Lock-in is Bad"

Over-rotating on portability — the cost of multi-cloud abstraction often exceeds lock-in risk.


Staff Sentence Templates


Implementation Deep Dive

1. Gap Analysis — The Systematic Approach

The most common Build vs Buy mistake is deciding before enumerating requirements. Staff engineers use a structured gap analysis to identify exactly where managed solutions fall short.

Gap Analysis Worksheet

# Step 1: List every requirement
REQUIREMENTS = [
    { "name": "IP-based rate limiting",        "priority": "P0" },
    { "name": "API key throttling",            "priority": "P0" },
    { "name": "User-level quotas",             "priority": "P0" },
    { "name": "Billing integration (metering)", "priority": "P1" },
    { "name": "Custom fairness (weighted)",    "priority": "P2" },
    { "name": "Real-time usage dashboard",     "priority": "P2" },
]

# Step 2: Evaluate each managed solution
SOLUTIONS = {
    "Cloudflare":       { "covers": ["IP-based rate limiting"],
                          "gaps": ["API key throttling", "User-level quotas",
                                   "Billing integration", "Custom fairness"] },
    "AWS API Gateway":  { "covers": ["IP-based rate limiting", "API key throttling"],
                          "gaps": ["User-level quotas", "Billing integration",
                                   "Custom fairness"] },
    "Kong (gateway)":   { "covers": ["IP-based rate limiting", "API key throttling",
                                      "User-level quotas"],
                          "gaps": ["Billing integration", "Custom fairness"] },
}

# Step 3: Identify the gap
# Kong covers P0 requirements. The gap is P1 (billing) and P2 (fairness, dashboard).
# Decision: Use Kong for P0, build custom for billing integration only.

2. Thin Custom Layer on Top of Managed — The Hybrid Pattern

The Staff default is rarely "build everything" or "buy everything." It is: use managed for commodity functionality, build a thin custom layer for the differentiated gap.

Hybrid Architecture Example — Rate Limiting

# Layer 1: Cloudflare (managed) — IP-based abuse protection
# Handles: DDoS, bot detection, volumetric attacks
# Configuration: Cloudflare dashboard, no custom code
# Cost: $0.05 per 10K requests

# Layer 2: Kong (managed gateway) — API key throttling
# Handles: per-key rate limits, basic quotas
# Configuration: Kong rate-limiting plugin
# Cost: Kong license + hosting

# Layer 3: Custom service — billing integration
# Handles: metered usage, plan-tier quotas, overage billing
# This is the ONLY custom code

class BillingAwareQuota:
    function checkQuota(userId, operation):
        # Read plan limits from billing system cache
        plan = billingCache.getPlan(userId)       # Cached, refreshed every 60s
        usage = usageCounter.get(userId, period="current_month")

        if usage >= plan.hardLimit:
            return { allowed: false, reason: "plan_limit_reached",
                     upgradeUrl: "/pricing" }

        if usage >= plan.softLimit:
            # Overage billing: allow but meter at overage rate
            usageCounter.increment(userId)
            return { allowed: true, overage: true,
                     overageRate: plan.overageRate }

        usageCounter.increment(userId)
        return { allowed: true, overage: false }

3. TCO Calculation — A Concrete Example

Abstract TCO comparisons are useless. Here is a concrete calculation for a rate limiting system.

Build vs Buy TCO (3-Year Projection)

# === BUILD ===
Initial development:
  2 senior engineers × 3 months × $20K/month loaded = $120,000
  Code review, testing, security audit              =  $30,000
                                            Subtotal = $150,000

Ongoing maintenance (per year):
  0.5 engineer (bug fixes, upgrades, on-call)       =  $120,000/year
  Infrastructure (Redis cluster, monitoring)         =   $24,000/year
  Incident cost (2 incidents/year × $5K each)        =   $10,000/year
                                         Subtotal/yr = $154,000

Opportunity cost:
  2 engineers × 3 months not building product features
  (Hard to quantify but real)

3-Year Build TCO = $150,000 + (3 × $154,000) = $612,000

# === BUY (Kong + Cloudflare + custom billing layer) ===
Managed services (per year):
  Kong Enterprise license                            =  $50,000/year
  Cloudflare Pro + rate limiting add-on              =  $12,000/year
                                         Subtotal/yr =  $62,000

Custom billing layer:
  1 engineer × 2 weeks                               =  $10,000
  Ongoing: 0.05 engineer (minimal maintenance)       =  $12,000/year

Integration effort:
  1 engineer × 1 month                               =  $20,000

3-Year Buy TCO = $30,000 + (3 × $74,000) = $252,000

Verdict: Buy saves $360K over 3 years and frees 2 engineers for product work. Build is justified only if the managed solutions cannot meet a P0 requirement that no workaround can address.

4. Vendor Exit Strategy — The Migration Plan

Every "buy" decision needs an exit plan. What happens when the vendor raises prices 3x, gets acquired, or sunsets the product?

Exit Strategy Checklist

VENDOR_EXIT_CHECKLIST = {
    "abstraction_layer": {
        "description": "Is vendor logic behind an interface you control?",
        "example": "RateLimiter interface with KongRateLimiter and CustomRateLimiter implementations",
        "effort_if_missing": "2-4 weeks to extract and abstract"
    },
    "data_portability": {
        "description": "Can you export your configuration, usage data, and audit logs?",
        "example": "Kong config exported as YAML, usage metrics in your own Prometheus",
        "effort_if_missing": "1-2 weeks to build export pipeline"
    },
    "feature_parity_plan": {
        "description": "If you had to replace this vendor in 90 days, what features would you skip?",
        "example": "Skip: dashboard UI, anomaly detection. Keep: core rate limiting, quota enforcement",
        "effort_if_missing": "Depends on gap size — 1 month to 6 months"
    },
    "contractual_protection": {
        "description": "Source code escrow, price lock, or migration period clause?",
        "example": "1-year price lock, 90-day migration period on termination",
        "effort_if_missing": "Negotiate before signing, not after"
    },
}

Architecture Diagram

Rendering diagram...

Each layer handles what it does best: Cloudflare handles volumetric abuse at the edge (no auth needed). Kong handles API key throttling at the gateway (auth available). Your code handles billing integration (requires business context). Traffic is filtered at each layer — by the time a request reaches your application, it has survived three layers of protection with zero custom infrastructure.


Failure Scenarios

1. Vendor Rate Limit Exceeds Your Budget

Timeline: You use AWS API Gateway for rate limiting. Your API grows from 1M requests/day to 50M. AWS API Gateway charges $3.50 per million requests. Your monthly bill goes from $105 to $5,250 — a 50x increase. The finance team asks why infrastructure costs grew 50x when revenue grew 3x.

Blast radius: Budget pressure forces a migration. But migrating mid-growth means engineering time diverted from product features during the highest-growth period.

Detection: Cloud cost monitoring alerts when spend exceeds budget threshold. Monthly cost reviews with finance.

Recovery:

  1. Immediate: optimize API call volume — client-side caching, request batching, CDN for cacheable responses can reduce API Gateway calls by 40-60%
  2. Short-term: move high-volume, low-value endpoints (health checks, polling) to a cheaper path (direct to ALB, bypassing API Gateway)
  3. Long-term: evaluate self-hosted gateway (Kong OSS on EKS) for the commodity functionality. Keep AWS API Gateway for low-volume endpoints where the managed convenience is worth the premium

2. Managed Solution Missing Critical Feature

Timeline: You use Auth0 for authentication. A new compliance requirement mandates that user sessions must be revocable within 30 seconds across all regions. Auth0's token revocation propagation takes 5-10 minutes (their global CDN caches JWKS endpoints). You cannot meet the compliance deadline with Auth0 alone.

Blast radius: Compliance deadline is in 8 weeks. Migrating off Auth0 is a 6-month project. Building a custom auth system from scratch is a 12-month project and a security risk.

Detection: Compliance review identifies the gap during the assessment phase (ideally before the deadline, not after).

Recovery:

  1. Hybrid approach: keep Auth0 for authentication issuance. Add a thin custom revocation layer — a Redis-backed bloom filter checked on every request that overrides the JWT validity
  2. The custom layer is ~200 lines of code: check bloom filter before accepting JWT. If the token ID is in the bloom filter, reject regardless of JWT validity
  3. This meets the 30-second revocation requirement without replacing Auth0

3. Vendor Lock-In During Acquisition

Timeline: Your primary CDN vendor (mid-tier provider) is acquired by a competitor. The acquiring company announces end-of-life for the product in 12 months. Your entire content delivery and edge caching strategy depends on this vendor's proprietary API.

Blast radius: 12 months to migrate all edge logic, caching rules, and delivery configuration to a new vendor. If you did not abstract the vendor's API behind an interface, every CDN integration point in your codebase needs modification.

Detection: Vendor communication (acquisition announcement). Hopefully caught in your vendor risk assessment process.

Recovery:

  1. If you have an abstraction layer: implement a new CDN adapter (Cloudflare, Fastly, AWS CloudFront) behind the same interface. Migration is configuration, not code.
  2. If you do not have an abstraction layer: extract vendor-specific code into an adapter, implement the new vendor adapter, then switch. This takes 2-4 months.
  3. Negotiate migration timeline with the acquiring company (most offer 12-18 months of support post-acquisition)

Staff Interview Application

How to Introduce This Pattern

Lead with the inventory of existing solutions. This signals that you don't default to building, and that you understand the total cost of ownership.

When NOT to Use Managed (Build Instead)

  • Core business differentiator: If the system IS the product (a payment processor building payment processing, a CDN building content delivery), building is justified because the operational expertise is your competitive advantage.
  • No managed solution exists: Novel problems (custom ML inference at edge, proprietary protocol handling) have no off-the-shelf solution. Build with an eye toward open-sourcing or spinning out.
  • Compliance prohibits third-party data handling: Some regulatory environments (defense, certain healthcare) cannot send data to SaaS vendors. Self-hosted or custom-built is the only option.
  • Cost crossover reached: When managed cost exceeds custom cost at your scale and the custom solution is proven (not speculative), migration is justified.

Follow-Up Questions to Anticipate

Interviewer AsksWhat They Are TestingHow to Respond
"Why not just build it?"TCO and opportunity cost awareness"I could build it. The question is: should I? The TCO for custom is [X] over 3 years including opportunity cost. Managed is [Y]. Unless the gap is a P0 business differentiator, I'd use managed."
"What about vendor lock-in?"Pragmatic risk assessment"I'd abstract the vendor behind an interface from day one. Migration cost with an abstraction layer is weeks, not months. The lock-in risk is real but manageable."
"What if the vendor doesn't support [X]?"Gap analysis depth"That's the gap. I'd evaluate: is [X] a P0 requirement? Can we work around it? If it's truly critical and no workaround exists, I'd build only the [X] component and use managed for everything else."
"How do you evaluate vendors?"Decision process maturity"Three criteria: does it meet P0 requirements? What is the 3-year TCO including scale? What is the exit plan if the vendor fails? I'd prototype with the top 2 candidates before committing."