Build vs Buy Framework
What This Framework Covers
Every infrastructure decision includes a build vs buy tradeoff. Staff engineers don't default to building; they make explicit decisions based on:
- What's actually differentiated
- Total cost of ownership (TCO)
- Time-to-value
- Operational burden
The Staff Mindset
The question is not "can we build it?" — the answer is always yes.
The question is: "What's the smallest custom surface that delivers our differentiated value?"
The Framework
Step 1: Inventory Existing Controls
Before designing anything custom, list what's already available:
| Layer | Examples | Identity Context |
|---|---|---|
| CDN/WAF | Cloudflare, Akamai, AWS WAF | IP, ASN, geo |
| Managed Gateway | AWS API Gateway, Kong, Apigee | API key, OAuth |
| Service Mesh | Envoy RLS, Istio, Linkerd | Service identity |
| Cloud Provider | AWS throttling, GCP quotas | Account-level |
Step 2: Identify the Gap
The gap is where existing solutions don't meet your requirements:
| Requirement | CDN/WAF | Managed Gateway | Service Mesh | Custom Needed? |
|---|---|---|---|---|
| IP-based abuse | ✅ | ✅ | ❌ | No |
| API key throttling | ❌ | ✅ | ❌ | Maybe |
| User-level quotas | ❌ | Maybe | ❌ | Probably |
| Billing integration | ❌ | ❌ | ❌ | Yes |
| Custom fairness logic | ❌ | ❌ | ❌ | Yes |
The gap is where you need custom work. Everything else should use existing solutions.
Step 3: Calculate TCO
| Cost Category | Build | Buy/Use Managed |
|---|---|---|
| Initial development | High | Low |
| Ongoing maintenance | High (your engineers) | Low (vendor) |
| On-call burden | High | Low-Medium |
| Scaling complexity | High | Low (vendor handles) |
| Feature velocity | Slow (you build it) | Fast (vendor roadmap) |
| Customization | Full control | Limited |
| Vendor lock-in | None | Medium-High |
Staff-grade TCO calculation:
Build TCO =
(Engineers × Time × Loaded Cost) +
(Ongoing maintenance FTE × Years) +
(Incident cost × Expected incidents) +
(Opportunity cost of not building product features)
Buy TCO =
(Vendor cost × Years) +
(Integration effort) +
(Migration cost if vendor fails) +
(Customization workarounds)
Step 4: Time-to-Value Analysis
| Scenario | Build | Buy |
|---|---|---|
| Need it in 2 weeks | Impossible | Possible |
| Need it in 3 months | Risky | Safe |
| Need it in 1 year | Possible | Overkill if simple |
| Regulatory deadline | High risk | Lower risk |
Decision Framework
The "Auth Context Gap"
A common pattern: managed solutions work for coarse signals (IP, ASN) but can't access your auth context (user ID, tenant, plan tier).
Solutions:
- Push auth to edge — If CDN/WAF can read a header or cookie with identity, you can use edge limiting
- Gateway integration — API gateway often has auth plugins; rate limiting after auth gives you identity
- Hybrid — CDN/WAF for volumetric abuse (no auth needed), custom for identity-aware quotas
Applying to Common Systems
Rate Limiting
| Layer | Managed Option | What It Can Do | Gap |
|---|---|---|---|
| Edge | Cloudflare Rate Limiting | IP, ASN, path | No auth context |
| Gateway | Kong, AWS API Gateway | API key, basic quotas | Limited fairness logic |
| Custom | Your code | Full control | You own it |
Staff recommendation:
- Push volumetric abuse protection to CDN/WAF
- Use gateway for API key / basic auth throttling
- Build custom only for billing integration or complex fairness
Observability
| Option | Pros | Cons |
|---|---|---|
| Datadog/New Relic | Fast setup, rich features | Expensive at scale |
| Self-hosted (Prometheus/Grafana) | Cost control, customization | Operational burden |
| Hybrid | Best of both | Complexity |
Staff recommendation:
- Start with managed (time-to-value)
- Migrate high-volume metrics to self-hosted when cost justifies
Authentication
| Option | Pros | Cons |
|---|---|---|
| Auth0/Okta | Fast, secure, compliant | Expensive, lock-in |
| Cognito/Firebase Auth | Cheaper, cloud-native | Less flexible |
| Custom | Full control | Security risk, maintenance |
Staff recommendation:
- Almost never build custom auth
- The "gap" is rarely worth the security risk
L6 vs L7 Calibration
| Dimension | L6 (Staff) | L7 (Principal) |
|---|---|---|
| First instinct | "What exists?" | "What's our long-term platform strategy?" |
| Gap analysis | Identifies what's missing | Quantifies the cost of the gap |
| TCO | Rough estimate | Spreadsheet with multi-year projection |
| Migration | "We can migrate later" | "Here's the migration plan and exit criteria" |
| Org impact | "This affects my team" | "This affects 10 teams; here's the adoption plan" |
Interview Probes
| After You Say... | They Will Ask... |
|---|---|
| "We'll build a rate limiter" | "What about Cloudflare / AWS WAF / Envoy RLS?" |
| "We'll use managed" | "What's the gap? How will you handle [X]?" |
| "Custom for full control" | "What's the TCO? Who operates it?" |
| "Hybrid approach" | "Where's the boundary? How do they interact?" |
Anti-Patterns
1. "Not Invented Here"
Building everything from scratch because "we can do it better" — ignoring TCO and opportunity cost.
2. "Vendor Will Solve Everything"
Assuming managed solutions have no gaps — then discovering limitations in production.
3. "We'll Abstract It Later"
Building tightly coupled to a vendor with "we'll add an abstraction layer" — rarely happens.
4. "It's Just a Small Service"
Underestimating operational burden — "small" services still need monitoring, on-call, upgrades.
5. "Cloud Provider Lock-in is Bad"
Over-rotating on portability — the cost of multi-cloud abstraction often exceeds lock-in risk.
Staff Sentence Templates
Implementation Deep Dive
1. Gap Analysis — The Systematic Approach
The most common Build vs Buy mistake is deciding before enumerating requirements. Staff engineers use a structured gap analysis to identify exactly where managed solutions fall short.
Gap Analysis Worksheet
# Step 1: List every requirement
REQUIREMENTS = [
{ "name": "IP-based rate limiting", "priority": "P0" },
{ "name": "API key throttling", "priority": "P0" },
{ "name": "User-level quotas", "priority": "P0" },
{ "name": "Billing integration (metering)", "priority": "P1" },
{ "name": "Custom fairness (weighted)", "priority": "P2" },
{ "name": "Real-time usage dashboard", "priority": "P2" },
]
# Step 2: Evaluate each managed solution
SOLUTIONS = {
"Cloudflare": { "covers": ["IP-based rate limiting"],
"gaps": ["API key throttling", "User-level quotas",
"Billing integration", "Custom fairness"] },
"AWS API Gateway": { "covers": ["IP-based rate limiting", "API key throttling"],
"gaps": ["User-level quotas", "Billing integration",
"Custom fairness"] },
"Kong (gateway)": { "covers": ["IP-based rate limiting", "API key throttling",
"User-level quotas"],
"gaps": ["Billing integration", "Custom fairness"] },
}
# Step 3: Identify the gap
# Kong covers P0 requirements. The gap is P1 (billing) and P2 (fairness, dashboard).
# Decision: Use Kong for P0, build custom for billing integration only.
2. Thin Custom Layer on Top of Managed — The Hybrid Pattern
The Staff default is rarely "build everything" or "buy everything." It is: use managed for commodity functionality, build a thin custom layer for the differentiated gap.
Hybrid Architecture Example — Rate Limiting
# Layer 1: Cloudflare (managed) — IP-based abuse protection
# Handles: DDoS, bot detection, volumetric attacks
# Configuration: Cloudflare dashboard, no custom code
# Cost: $0.05 per 10K requests
# Layer 2: Kong (managed gateway) — API key throttling
# Handles: per-key rate limits, basic quotas
# Configuration: Kong rate-limiting plugin
# Cost: Kong license + hosting
# Layer 3: Custom service — billing integration
# Handles: metered usage, plan-tier quotas, overage billing
# This is the ONLY custom code
class BillingAwareQuota:
function checkQuota(userId, operation):
# Read plan limits from billing system cache
plan = billingCache.getPlan(userId) # Cached, refreshed every 60s
usage = usageCounter.get(userId, period="current_month")
if usage >= plan.hardLimit:
return { allowed: false, reason: "plan_limit_reached",
upgradeUrl: "/pricing" }
if usage >= plan.softLimit:
# Overage billing: allow but meter at overage rate
usageCounter.increment(userId)
return { allowed: true, overage: true,
overageRate: plan.overageRate }
usageCounter.increment(userId)
return { allowed: true, overage: false }
3. TCO Calculation — A Concrete Example
Abstract TCO comparisons are useless. Here is a concrete calculation for a rate limiting system.
Build vs Buy TCO (3-Year Projection)
# === BUILD ===
Initial development:
2 senior engineers × 3 months × $20K/month loaded = $120,000
Code review, testing, security audit = $30,000
Subtotal = $150,000
Ongoing maintenance (per year):
0.5 engineer (bug fixes, upgrades, on-call) = $120,000/year
Infrastructure (Redis cluster, monitoring) = $24,000/year
Incident cost (2 incidents/year × $5K each) = $10,000/year
Subtotal/yr = $154,000
Opportunity cost:
2 engineers × 3 months not building product features
(Hard to quantify but real)
3-Year Build TCO = $150,000 + (3 × $154,000) = $612,000
# === BUY (Kong + Cloudflare + custom billing layer) ===
Managed services (per year):
Kong Enterprise license = $50,000/year
Cloudflare Pro + rate limiting add-on = $12,000/year
Subtotal/yr = $62,000
Custom billing layer:
1 engineer × 2 weeks = $10,000
Ongoing: 0.05 engineer (minimal maintenance) = $12,000/year
Integration effort:
1 engineer × 1 month = $20,000
3-Year Buy TCO = $30,000 + (3 × $74,000) = $252,000
Verdict: Buy saves $360K over 3 years and frees 2 engineers for product work. Build is justified only if the managed solutions cannot meet a P0 requirement that no workaround can address.
4. Vendor Exit Strategy — The Migration Plan
Every "buy" decision needs an exit plan. What happens when the vendor raises prices 3x, gets acquired, or sunsets the product?
Exit Strategy Checklist
VENDOR_EXIT_CHECKLIST = {
"abstraction_layer": {
"description": "Is vendor logic behind an interface you control?",
"example": "RateLimiter interface with KongRateLimiter and CustomRateLimiter implementations",
"effort_if_missing": "2-4 weeks to extract and abstract"
},
"data_portability": {
"description": "Can you export your configuration, usage data, and audit logs?",
"example": "Kong config exported as YAML, usage metrics in your own Prometheus",
"effort_if_missing": "1-2 weeks to build export pipeline"
},
"feature_parity_plan": {
"description": "If you had to replace this vendor in 90 days, what features would you skip?",
"example": "Skip: dashboard UI, anomaly detection. Keep: core rate limiting, quota enforcement",
"effort_if_missing": "Depends on gap size — 1 month to 6 months"
},
"contractual_protection": {
"description": "Source code escrow, price lock, or migration period clause?",
"example": "1-year price lock, 90-day migration period on termination",
"effort_if_missing": "Negotiate before signing, not after"
},
}
Architecture Diagram
Each layer handles what it does best: Cloudflare handles volumetric abuse at the edge (no auth needed). Kong handles API key throttling at the gateway (auth available). Your code handles billing integration (requires business context). Traffic is filtered at each layer — by the time a request reaches your application, it has survived three layers of protection with zero custom infrastructure.
Failure Scenarios
1. Vendor Rate Limit Exceeds Your Budget
Timeline: You use AWS API Gateway for rate limiting. Your API grows from 1M requests/day to 50M. AWS API Gateway charges $3.50 per million requests. Your monthly bill goes from $105 to $5,250 — a 50x increase. The finance team asks why infrastructure costs grew 50x when revenue grew 3x.
Blast radius: Budget pressure forces a migration. But migrating mid-growth means engineering time diverted from product features during the highest-growth period.
Detection: Cloud cost monitoring alerts when spend exceeds budget threshold. Monthly cost reviews with finance.
Recovery:
- Immediate: optimize API call volume — client-side caching, request batching, CDN for cacheable responses can reduce API Gateway calls by 40-60%
- Short-term: move high-volume, low-value endpoints (health checks, polling) to a cheaper path (direct to ALB, bypassing API Gateway)
- Long-term: evaluate self-hosted gateway (Kong OSS on EKS) for the commodity functionality. Keep AWS API Gateway for low-volume endpoints where the managed convenience is worth the premium
2. Managed Solution Missing Critical Feature
Timeline: You use Auth0 for authentication. A new compliance requirement mandates that user sessions must be revocable within 30 seconds across all regions. Auth0's token revocation propagation takes 5-10 minutes (their global CDN caches JWKS endpoints). You cannot meet the compliance deadline with Auth0 alone.
Blast radius: Compliance deadline is in 8 weeks. Migrating off Auth0 is a 6-month project. Building a custom auth system from scratch is a 12-month project and a security risk.
Detection: Compliance review identifies the gap during the assessment phase (ideally before the deadline, not after).
Recovery:
- Hybrid approach: keep Auth0 for authentication issuance. Add a thin custom revocation layer — a Redis-backed bloom filter checked on every request that overrides the JWT validity
- The custom layer is ~200 lines of code: check bloom filter before accepting JWT. If the token ID is in the bloom filter, reject regardless of JWT validity
- This meets the 30-second revocation requirement without replacing Auth0
3. Vendor Lock-In During Acquisition
Timeline: Your primary CDN vendor (mid-tier provider) is acquired by a competitor. The acquiring company announces end-of-life for the product in 12 months. Your entire content delivery and edge caching strategy depends on this vendor's proprietary API.
Blast radius: 12 months to migrate all edge logic, caching rules, and delivery configuration to a new vendor. If you did not abstract the vendor's API behind an interface, every CDN integration point in your codebase needs modification.
Detection: Vendor communication (acquisition announcement). Hopefully caught in your vendor risk assessment process.
Recovery:
- If you have an abstraction layer: implement a new CDN adapter (Cloudflare, Fastly, AWS CloudFront) behind the same interface. Migration is configuration, not code.
- If you do not have an abstraction layer: extract vendor-specific code into an adapter, implement the new vendor adapter, then switch. This takes 2-4 months.
- Negotiate migration timeline with the acquiring company (most offer 12-18 months of support post-acquisition)
Staff Interview Application
How to Introduce This Pattern
Lead with the inventory of existing solutions. This signals that you don't default to building, and that you understand the total cost of ownership.
When NOT to Use Managed (Build Instead)
- Core business differentiator: If the system IS the product (a payment processor building payment processing, a CDN building content delivery), building is justified because the operational expertise is your competitive advantage.
- No managed solution exists: Novel problems (custom ML inference at edge, proprietary protocol handling) have no off-the-shelf solution. Build with an eye toward open-sourcing or spinning out.
- Compliance prohibits third-party data handling: Some regulatory environments (defense, certain healthcare) cannot send data to SaaS vendors. Self-hosted or custom-built is the only option.
- Cost crossover reached: When managed cost exceeds custom cost at your scale and the custom solution is proven (not speculative), migration is justified.
Follow-Up Questions to Anticipate
| Interviewer Asks | What They Are Testing | How to Respond |
|---|---|---|
| "Why not just build it?" | TCO and opportunity cost awareness | "I could build it. The question is: should I? The TCO for custom is [X] over 3 years including opportunity cost. Managed is [Y]. Unless the gap is a P0 business differentiator, I'd use managed." |
| "What about vendor lock-in?" | Pragmatic risk assessment | "I'd abstract the vendor behind an interface from day one. Migration cost with an abstraction layer is weeks, not months. The lock-in risk is real but manageable." |
| "What if the vendor doesn't support [X]?" | Gap analysis depth | "That's the gap. I'd evaluate: is [X] a P0 requirement? Can we work around it? If it's truly critical and no workaround exists, I'd build only the [X] component and use managed for everything else." |
| "How do you evaluate vendors?" | Decision process maturity | "Three criteria: does it meet P0 requirements? What is the 3-year TCO including scale? What is the exit plan if the vendor fails? I'd prototype with the top 2 candidates before committing." |