Distributed Systems: Where Physics, Murphy's Law, and Your Career Collide 💥

  • MyrinNew
    Senior Member
    • Feb 2024
    • 5175

    #1


    🎬 The Interview Question That Breaks People

    "Design a system that handles 100,000 requests per second with 99.99% availability across multiple regions."


    Silence. Sweating. "Uh... load balancer?"


    Here's the thing: distributed systems aren't magic. They're a collection of patterns applied to specific problems. Once you learn the patterns, the interview question becomes solvable. And more importantly, the 3 AM production issue becomes debuggable.


    Let's learn the patterns that power the internet.





    🧪 The Fundamental Laws You Can't Break

    CAP Theorem: Pick Two (But Actually Pick One)

    In a distributed system, when a network partition happens (and it WILL), you must choose between:






                 Consistency
                     (C)
                     /\
                    /  \
                   /    \
                  / Pick \
                 /  two   \
                /   but    \
               /  actually  \
              /  one since   \
             /   partitions   \
            /  always happen   \
           /                    \
    Availability ──────────── Partition
        (A)                 Tolerance (P)







    In plain English:






    CP (Consistency + Partition Tolerance):
    "I'd rather refuse a request than give you wrong data."
    Examples: Banking systems, inventory counts, etcd
    When a network partition happens → some requests fail → but data is always correct

    AP (Availability + Partition Tolerance):
    "I'd rather give you possibly-stale data than refuse your request."
    Examples: Shopping cart, social media feed, DNS
    When a network partition happens → all requests succeed → but data might be stale

    CA (Consistency + Availability):
    "Doesn't exist in distributed systems."
    Only works for single-node databases. The moment you go distributed, network
    partitions are possible, so you MUST handle P.
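To make the CP vs. AP trade-off concrete, here is a minimal sketch of a three-replica store during a partition. Everything in it (`Cluster`, `Replica`, `write_cp`, `write_ap`, the `reachable` flag) is a hypothetical name made up for illustration, not any real database's API. The CP path refuses the write when it cannot reach a majority; the AP path accepts it and leaves the unreachable replicas stale.

```python
from dataclasses import dataclass, field


class Unavailable(Exception):
    """Raised when the CP path refuses a request during a partition."""


@dataclass
class Replica:
    value: str = "v0"
    reachable: bool = True  # False simulates being cut off by a partition


@dataclass
class Cluster:
    replicas: list = field(default_factory=lambda: [Replica() for _ in range(3)])

    def _reachable(self):
        return [r for r in self.replicas if r.reachable]

    def write_cp(self, value):
        """CP: require a majority of replicas, otherwise refuse the write."""
        live = self._reachable()
        if len(live) <= len(self.replicas) // 2:
            raise Unavailable("no majority reachable, refusing write")
        for replica in live:
            replica.value = value

    def write_ap(self, value):
        """AP: accept the write on whichever replicas can be reached."""
        for replica in self._reachable():
            replica.value = value


# Simulate a partition that cuts this client off from two of three replicas.
cluster = Cluster()
cluster.replicas[1].reachable = False
cluster.replicas[2].reachable = False

try:
    cluster.write_cp("v1")
    cp_result = "accepted"
except Unavailable:
    cp_result = "refused"

cluster.write_ap("v1")
values_after_ap = [r.value for r in cluster.replicas]
```

After the AP write, one replica holds the new value and two still hold the old one. That gap is exactly the "possibly-stale data" the AP description talks about.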







    🚨 Real Scenario: Choosing Wrong Consistency

    The System: An e-commerce platform with a product catalog replicated across 3 regions.


    The Choice: AP (eventual consistency), because "availability matters more."


    The Disaster: A flash sale. Product price was updated from $99 to $9.99 in the US region. Due to replication lag (3 seconds), the EU region still showed $99. EU customers paid $99 for the same product that US customers got for $9.99. Customer complaints, social media firestorm, $200K in refunds.


    The Lesson: For pricing and inventory, you need strong consistency (CP). For product descriptions and reviews, eventual consistency (AP) is fine.






    Rule of thumb:
    💰 Involves money? → Strong consistency (CP)
    📦 Involves stock? → Strong consistency (CP)
    📝 Involves content? → Eventual consistency (AP) is fine
    👤 Involves profiles? → Eventual consistency (AP) is fine










    🛡️ Resilience Patterns: Surviving the Chaos

    Pattern 1: Circuit Breaker

    Problem: Service A calls Service B. B starts failing. A keeps calling B, wasting resources and cascading the failure everywhere.






    Without circuit breaker:
    Service A → "Call B" → TIMEOUT (5s) → "Try again" → TIMEOUT →
    "Try again" → TIMEOUT → ... (meanwhile, A's thread pool is exhausted)
    → A fails → Everything calling A fails → 💀

    With circuit breaker:

    ┌──────────┐ too many  ┌──────────┐   timer   ┌──────────┐
    │  CLOSED  │─failures─▶│   OPEN   │──────────▶│HALF-OPEN │
    │ (normal) │           │(fast-fail│           │ (test 1  │
    │          │           │ all req) │◀───fail───│ request) │
    └──────────┘           └──────────┘           └──────────┘
          ▲                                             │
          └───────────────────success───────────────────┘











    CLOSED:    Everything is fine. Let requests through.
    OPEN:      B is broken. INSTANTLY fail all requests to B.
               Don't even try. Return a fallback/error immediately.
               This prevents A from drowning in timeouts.
    HALF-OPEN: After 30 seconds, try ONE request.
               If it works → CLOSED (B recovered!)
               If it fails → OPEN (B still broken, wait more)
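The three states above fit in a small class. This is a minimal sketch, not a production library: `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, `reset_timeout`, and the injectable `clock` are all names chosen for this example, and the code is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is OPEN and the call is failed fast."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay OPEN
        self.clock = clock                          # injectable for testing
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast, dependency marked broken")
            self.state = "HALF_OPEN"  # timer elapsed: let one trial request through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial request, or too many failures while CLOSED, opens the breaker.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()

    def _on_success(self):
        self.state = "CLOSED"
        self.failures = 0
```

Usage would look like `breaker.call(fetch_from_b, order_id)`, where `fetch_from_b` is whatever function calls Service B, with a fallback in the `except CircuitOpenError` branch. In real code you would probably reach for a maintained library rather than rolling your own.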







    Pattern 2: Retry with Exponential Backoff + Jitter





    Naive retry:
    Fail → Retry immediately → Fail → Retry immediately → Fail...
    Problem: If 1000 clients all retry at the same time = thundering herd
    → Makes the failing service EVEN MORE overwhelmed

    Smart retry:
    Attempt 1: Wait 100ms + random(0-50ms) = ~125ms
    Attempt 2: Wait 200ms + random(0-100ms) = ~250ms
    Attempt 3: Wait 400ms + random(0-200ms) = ~500ms
    Attempt 4: Wait 800ms + random(0-400ms) = ~1000ms
    Attempt 5: Give up and surface the error (repeated failures like this are what trip the circuit breaker).

    The JITTER (random component) is crucial:
    Without jitter: 1000 clients all retry at 100ms, 200ms, 400ms (synchronized waves)
    With jitter: 1000 clients retry at random times (spread out, no wave)
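Here is a small sketch of the schedule above: the delay doubles each attempt, and a random extra of up to half the delay is added on top. The names `backoff_delay` and `retry`, and the `base`, `cap`, and `sleep` parameters, are assumptions for this example. It retries on any exception, which is too broad for real code (see the retry rules further down).

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random):
    """Seconds to wait after failed attempt number `attempt` (1-based).

    attempt 1 -> 0.1s + random(0..0.05s), attempt 2 -> 0.2s + random(0..0.1s), ...
    `cap` bounds the exponential part so delays cannot grow forever.
    """
    exponential = min(cap, base * 2 ** (attempt - 1))
    jitter = rng.uniform(0, exponential / 2)
    return exponential + jitter


def retry(fn, max_attempts=5, sleep=time.sleep):
    """Call fn() up to max_attempts times, backing off between failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: give up and surface the error
            sleep(backoff_delay(attempt))
```

Because each client draws its own random jitter, a thousand clients that failed at the same instant end up retrying at a thousand slightly different times.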







    Rules for Retries





    ✅ Retry these:
    HTTP 429 (Too Many Requests): you're rate limited, wait and retry
    HTTP 503 (Service Unavailable): server is temporarily overwhelmed
    HTTP 502/504 (Gateway errors): upstream might recover
    Network timeouts: transient network issues

    ❌ Never retry these:
    HTTP 400 (Bad Request): your request is wrong, retrying won't fix it
    HTTP 401/403 (Auth errors): you're not authorized, stop trying
    HTTP 404 (Not Found): it doesn't exist, it won't appear on retry
    HTTP 409 (Conflict): your data is stale, fetch the current state first

    ⚠️ Only retry IDEMPOTENT operations:
    GET, PUT, DELETE: Safe to retry (repeating them leaves the same end state)
    POST: DANGEROUS to retry (might create duplicates!)
    → For POST retries, use idempotency keys
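The rules above boil down to one decision function on the client, plus a server that remembers idempotency keys. The sketch below is illustrative only: `should_retry`, `PaymentAPI`, and `create_charge` are hypothetical names, and a `status` of `None` is my own convention for "no response at all" (a network timeout).

```python
RETRYABLE_STATUS = {429, 502, 503, 504}
IDEMPOTENT_METHODS = {"GET", "PUT", "DELETE"}


def should_retry(method, status, has_idempotency_key=False):
    """Decide whether a failed request is worth retrying.

    status=None stands for a network timeout (no HTTP response received).
    """
    transient = status is None or status in RETRYABLE_STATUS
    if not transient:
        return False  # 400/401/403/404/409 etc. will not fix themselves
    return method.upper() in IDEMPOTENT_METHODS or has_idempotency_key


class PaymentAPI:
    """Toy server that deduplicates POSTs by idempotency key."""

    def __init__(self):
        self._results_by_key = {}
        self.charges = []

    def create_charge(self, idempotency_key, amount):
        if idempotency_key in self._results_by_key:
            # Same key seen before: return the original result, charge nothing.
            return self._results_by_key[idempotency_key]
        self.charges.append(amount)
        charge_id = len(self.charges)
        self._results_by_key[idempotency_key] = charge_id
        return charge_id
```

With the key in place, a client that times out on a POST and sends the same request again gets the original charge back instead of creating a second one.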







    Pattern 3: Bulkhead

    Inspired by ship compartments β€” if one floods, the others stay dry.






    Without bulkhead:
    ┌──────────────────────────────────────┐
    │ Shared thread pool (100 threads)     │
    │ ├── Service A calls (SLOW!)  90/100  │ ← A is broken
    │ ├── Service B calls           5/100  │ ← B starved
    │ └── Service C calls           5/100  │ ← C starved
    └──────────────────────────────────────┘
    A breaks → B and C starve → Everything breaks

    With bulkhead:
    ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
    │ Pool A (40)  │ │ Pool B (30)  │ │ Pool C (30)  │
    │ Service A    │ │ Service B    │ │ Service C    │
    │              │ │              │ │              │
    │ A is slow    │ │ B runs fine  │ │ C runs fine  │
    │ Pool exhausts│ │ Unaffected   │ │ Unaffected   │
    └──────────────┘ └──────────────┘ └──────────────┘
    A breaks → Only A is affected → B and C are fine!







    In practice:
    • Kubernetes: Separate node pools for critical vs. best-effort workloads
    • Code: Separate thread pools / connection pools per dependency
    • Networking: Separate ingress controllers for internal vs external traffic
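For the "separate pools per dependency" bullet, one lightweight way to get the same isolation is a concurrency limit per dependency. The sketch below uses a semaphore as the compartment wall. `Bulkhead` and `BulkheadFull` are names invented for this example, and rejecting immediately when full (rather than queueing) is a design choice I am assuming, not something the pattern requires.

```python
import threading


class BulkheadFull(Exception):
    """Raised when a dependency's compartment has no free slots."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args, **kwargs):
        # Non-blocking acquire: if the compartment is full, reject right away
        # instead of letting callers pile up behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(f"{self.name} bulkhead is full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# One compartment per downstream dependency, sized like the diagram above.
bulkheads = {
    "service_a": Bulkhead("service_a", 40),
    "service_b": Bulkhead("service_b", 30),
    "service_c": Bulkhead("service_c", 30),
}
```

If Service A hangs and all 40 of its slots fill up, calls through `bulkheads["service_b"]` still find free slots, which is the whole point.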





    📈 Scalability Patterns

    Vertical vs. Horizontal Scaling





    Vertical (Scale Up): Buy a bigger machine
    ├── Simple: No code changes
    ├── Limited: There's a maximum VM size
    └── Expensive: Cost grows faster than capacity (4x the price per 2x CPU below)

    $100/mo  →  $400/mo  →  $1,600/mo  →  $6,400/mo
    (2x CPU)    (4x CPU)    (8x CPU)      (16x CPU)

    Horizontal (Scale Out): Add more machines
    ├── Complex: Need load balancing, stateless design
    ├── Unlimited: Add as many as needed
    └── Linear: Cost grows in step with capacity

    $100 × 1  →  $100 × 2  →  $100 × 4  →  $100 × 8
    ($100)       ($200)       ($400)       ($800)







    Database Scaling: The Real Bottleneck





    Your app scales horizontally easily (add more pods).
    Your database is almost always the bottleneck.

    Scaling strategies (in order of complexity):

    1. Read Replicas (easy)
    ┌──── Write ────▶ Primary DB
    │                     │
    │               ┌─────┼─────┐
    │               ▼     ▼     ▼
    └── Read ──▶  Rep 1 Rep 2 Rep 3

    Works when: 80%+ of queries are reads (most apps)
    Doesn't help: Write-heavy workloads

    2. Caching Layer (medium)
    App → Redis Cache → hit? Return cached → miss? Query DB

    Works when: Same data is requested frequently (product pages)
    Gotcha: Cache invalidation (one of the two hardest problems in CS)

    3. Sharding (hard)
    Shard key: user_id
    Users 1 to 1,000,000 → Shard 1
    Users 1,000,001 to 2,000,000 → Shard 2
    Users 2,000,001 to 3,000,000 → Shard 3

    Works when: Data is partitionable by a key
    Gotcha: Cross-shard queries are painful (joins across shards = 💀)
    Gotcha: Rebalancing shards when they grow unevenly

    4. CQRS (Command Query Responsibility Segregation) (complex)
    Writes → Write Model (normalized, consistent)
    Reads → Read Model (denormalized, fast, eventually consistent)

    Works when: Read and write patterns are vastly different
    Gotcha: Eventually consistent reads (fine for most apps)
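Of the four strategies, the caching layer is the easiest to show in a few lines. Below is a minimal cache-aside sketch with a TTL. It uses a plain dict as a stand-in for Redis so it runs anywhere; `CacheAside`, `load_from_db`, `ttl_seconds`, and `invalidate` are all names assumed for this example.

```python
import time


class CacheAside:
    """Check the cache first; on a miss, load from the DB and remember the result."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db   # function: key -> value (the slow DB query)
        self._ttl = ttl_seconds     # how long an entry may be served
        self._clock = clock         # injectable for testing
        self._entries = {}          # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry[1]                     # hit: no DB query
        value = self._load(key)                 # miss or expired: query the DB
        self._entries[key] = (self._clock() + self._ttl, value)
        return value

    def invalidate(self, key):
        """Call this after writing to the DB so readers stop seeing the old value."""
        self._entries.pop(key, None)
```

The TTL bounds how stale a read can get if someone forgets to call `invalidate`. Thinking back to the flash sale story earlier, that staleness window is the thing to watch for prices.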







    🚨 Real-World Disaster: The Database Connection Stampede

    Setup: 50 pods, each with a connection pool of 20 connections = 1,000 database connections. PostgreSQL max_connections = 500.






    Normal operation:
    50 pods × 5 active connections = 250 connections (within limit)

    After a deployment (all pods restart simultaneously):
    50 pods boot up at the same time
    Each opens 20 connections immediately
    50 × 20 = 1,000 connection attempts
    Database: "I can only handle 500!"
    Result: Half the pods fail to start → CrashLoopBackOff
    → Pod restarts → more connection attempts → worse stampede
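The arithmetic above is worth encoding as a check you can run before a deploy. This is a back-of-the-envelope sketch: the function names are made up for this example, and `reserved` is an optional headroom parameter I am assuming you might want for admin or monitoring connections.

```python
def connection_demand(pods, connections_per_pod):
    """Worst-case connections if every pod opens this many at once."""
    return pods * connections_per_pod


def max_safe_pool_size(pods, max_connections, reserved=0):
    """Largest per-pod pool that cannot exceed the database limit."""
    return (max_connections - reserved) // pods


MAX_CONNECTIONS = 500  # PostgreSQL max_connections from the scenario

steady_state = connection_demand(pods=50, connections_per_pod=5)   # normal load
cold_start = connection_demand(pods=50, connections_per_pod=20)    # after deploy
safe_pool = max_safe_pool_size(pods=50, max_connections=MAX_CONNECTIONS)

print(f"steady state: {steady_state} (ok: {steady_state <= MAX_CONNECTIONS})")
print(f"cold start:   {cold_start} (ok: {cold_start <= MAX_CONNECTIONS})")
print(f"largest safe pool per pod: {safe_pool}")
```

With these numbers, a pool of 20 per pod only works as long as the pods never fill their pools at the same time. A pool of 10 would be safe even in the worst case, before any pooler is added.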







    The Fix:






    # 1. Add PgBouncer as a connection pooler
    # PgBouncer sits between your app and PostgreSQL
    # 1000 app connections → PgBouncer → 100 actual DB connections

    # 2. Rolling update instead of the Recreate strategy (Deployment spec)
    spec:
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 1
          maxUnavailable: 0 # One at a time!

    # 3. Startup probe with a generous failure budget (goes in the container spec)
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10 # Wait before the first check
      failureThreshold: 30    # Allow up to 30 × 5s = 150s to finish starting
      periodSeconds: 5










    📬 Event-Driven Architecture: Decoupling Services

    The Problem With Synchronous Communication





    Synchronous (request-response):
    Order Service → Payment Service → Inventory Service → Email Service

    If Payment Service is slow (2s) → EVERYTHING waits
    If Inventory Service is down → EVERYTHING breaks
    Total latency = sum of all service latencies
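A quick calculation shows why long synchronous chains hurt. The latencies and availability figures below are invented for illustration (only the 2 s Payment figure comes from the example above), and multiplying availabilities assumes the services fail independently, which is a simplification.

```python
from math import prod


def chain_latency_ms(latencies_ms):
    """Sequential calls: the caller waits for every hop in turn."""
    return sum(latencies_ms)


def chain_availability(availabilities):
    """All hops must succeed, so availabilities multiply (assuming independence)."""
    return prod(availabilities)


# Hypothetical per-service numbers for Order -> Payment -> Inventory -> Email.
latencies_ms = {"order": 50, "payment": 2000, "inventory": 80, "email": 120}
availabilities = [0.999, 0.999, 0.999, 0.999]

total_latency = chain_latency_ms(latencies_ms.values())
end_to_end = chain_availability(availabilities)

print(f"total latency: {total_latency} ms")
print(f"end-to-end availability: {end_to_end:.4%}")
```

Four services at 99.9% each give a chain that is only about 99.6% available, and every extra hop makes it worse.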







    The Event-Driven Solution





    Event-driven (publish-subscribe):
    Order Service publishes: "OrderCreated" event

    ├── Payment Service subscribes → processes payment
    ├── Inventory Service subscribes → decrements stock
    ├── Email Service subscribes → sends confirmation
    └── Analytics Service subscribes → records metrics

    Services are decoupled:
    ✅ Payment is slow? Order Service doesn't care.
    ✅ Email Service is down? Events queue up, delivered later.
    ✅ New service? Just subscribe. No changes to Order Service.
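To show the shape of this, here is a toy in-memory event bus where each subscriber gets its own queue. It is a sketch only: `EventBus`, `subscribe`, `publish`, and `drain` are names invented here, and a real broker (Kafka, RabbitMQ, a cloud queue) adds persistence, networking, and delivery guarantees that this does not have.

```python
from collections import defaultdict, deque


class EventBus:
    def __init__(self):
        # event_type -> {subscriber_name: queue of undelivered events}
        self._queues = defaultdict(dict)

    def subscribe(self, event_type, subscriber):
        self._queues[event_type][subscriber] = deque()

    def publish(self, event_type, payload):
        """The publisher only enqueues; it never calls or waits on subscribers."""
        for queue in self._queues[event_type].values():
            queue.append(payload)

    def pending(self, event_type, subscriber):
        return len(self._queues[event_type][subscriber])

    def drain(self, event_type, subscriber, handler):
        """Deliver queued events to one subscriber; returns how many were handled."""
        queue = self._queues[event_type][subscriber]
        handled = 0
        while queue:
            handler(queue[0])
            queue.popleft()  # remove only after the handler succeeded
            handled += 1
        return handled


bus = EventBus()
for name in ("payment", "inventory", "email", "analytics"):
    bus.subscribe("OrderCreated", name)

bus.publish("OrderCreated", {"orderId": 123})
bus.publish("OrderCreated", {"orderId": 124})

paid = []
bus.drain("OrderCreated", "payment", paid.append)  # payment is up and consuming
email_backlog = bus.pending("OrderCreated", "email")  # email is "down": events wait
```

Adding a fifth service is one more `subscribe` call. Nothing in the publishing code changes.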







    🚨 Real-World Disaster: The Unordered Events

    System: Event-driven order processing.






    Expected order:
    1. OrderCreated → Payment processes → Inventory decrements → Email sent

    What actually happened:
    Network glitch caused events to arrive out of order:
    1. InventoryDecremented (before payment!)
    2. OrderCreated
    3. PaymentProcessed

    Result: Inventory was decremented for orders where payment FAILED.
    1,200 phantom inventory deductions. Stock counts wrong for 3 days.







    The Fix: Design for out-of-order events.






    Option 1: Include sequence numbers
    Event { orderId: 123, sequence: 1, type: "OrderCreated" }
    Event { orderId: 123, sequence: 2, type: "PaymentProcessed" }
    Consumer: "I got sequence 2 before 1. Buffer it, wait for 1."

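    Option 2: Idempotent consumers
    Each event has a unique ID. Consumer tracks processed IDs.
    If a duplicate arrives → skip it. If an event arrives before the one it
    depends on (InventoryDecremented before PaymentProcessed) → hold or reject it
    instead of applying it.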

    Option 3: Event sourcing
    Store ALL events in order. Replay to build current state.
    The event log IS the truth. Services derive their view from it.
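Option 1 plus the duplicate-skipping half of Option 2 can be sketched as a consumer that buffers early events per order and releases them in sequence. `OrderedConsumer` and its internals are hypothetical names; the event fields (`orderId`, `sequence`, `type`) follow the examples above. The buffer here is unbounded and in-memory, which a real consumer would need to cap and persist.

```python
class OrderedConsumer:
    """Delivers each order's events to `handler` strictly in sequence order."""

    def __init__(self, handler):
        self._handler = handler
        self._next_seq = {}  # orderId -> next sequence number we expect
        self._pending = {}   # orderId -> {sequence: event} that arrived early

    def receive(self, event):
        order_id, seq = event["orderId"], event["sequence"]
        expected = self._next_seq.get(order_id, 1)
        if seq < expected:
            return  # already processed: duplicate delivery, skip it
        buffered = self._pending.setdefault(order_id, {})
        buffered[seq] = event
        # Release every event that is now contiguous with what we've handled.
        while expected in buffered:
            self._handler(buffered.pop(expected))
            expected += 1
        self._next_seq[order_id] = expected


handled = []
consumer = OrderedConsumer(lambda e: handled.append(e["type"]))

# Events arrive scrambled, with one duplicate.
consumer.receive({"orderId": 123, "sequence": 3, "type": "InventoryDecremented"})
consumer.receive({"orderId": 123, "sequence": 1, "type": "OrderCreated"})
consumer.receive({"orderId": 123, "sequence": 1, "type": "OrderCreated"})
consumer.receive({"orderId": 123, "sequence": 2, "type": "PaymentProcessed"})
```

Nothing is applied for sequence 3 until 1 and 2 have gone through, so inventory can no longer be decremented ahead of the payment result.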










    🏗️ Platform Design Patterns

    The Internal Developer Platform (IDP)





    The Problem:
    Developer: "I need a new microservice deployed."
    Developer: Writes code → writes Terraform → writes K8s manifests →
               configures CI/CD → sets up monitoring → creates DNS →
               configures SSL → adds to service mesh →
               2 WEEKS LATER: "It's deployed!"

    The Solution: Internal Developer Platform
    Developer: "I need a new microservice deployed."
    Developer: Fills in a template → clicks deploy →
               15 MINUTES LATER: "It's deployed with monitoring,
               SSL, CI/CD, and service mesh. All standard. All secure."







    The Golden Path (Not the Golden Cage)





    Golden Path = The recommended way to do things
    "Here's a well-paved road with guardrails.
    Use it and go fast."

    NOT Golden Cage:
    "Here's the ONLY way to do things.
    Deviate and face consequences."

    The difference matters. Teams should be able to leave the
    golden path when they have a good reason. But 95% of the
    time, the path should be so good that nobody wants to leave.










    🔥 The Anti-Patterns Hall of Shame





    🏆 Distributed Monolith
    "We have 50 microservices, but they all have to deploy
    together and they all share one database."
    Congratulations, you built a monolith but with network
    latency! The worst of both worlds.

    🏆 The God Service
    "The OrderService handles orders, payments, inventory,
    emails, analytics, and user management."
    That's not a microservice. That's a monolith in a trench coat.

    🏆 Chatty Services
    "To render a product page, we make 47 API calls to 12 services."
    Each call adds latency and failure risk. Use the BFF
    (Backend for Frontend) pattern or GraphQL.

    🏆 Shared Database
    "All 8 services read and write to the same database."
    You lost the entire point of microservices. One schema
    change breaks everything. One slow query blocks everyone.

    🏆 Not Invented Here
    "We built our own message queue because Kafka was too complex."
    Your custom queue doesn't have 15 years of production testing.
    Use the boring technology. It works.










    🧠 System Design Quick Reference





    Problem: Need high availability?
    → Multi-AZ deployment (minimum)
    → Multi-region for critical services
    → Health checks + auto-failover
    → Circuit breakers between services

    Problem: Need low latency?
    → CDN for static content
    → Cache (Redis) for hot data
    → Edge computing for global users
    → Async processing for non-critical work

    Problem: Need high throughput?
    → Horizontal scaling (more instances)
    → Event-driven architecture (decouple services)
    → Database read replicas + sharding
    → Connection pooling everywhere

    Problem: Need data consistency?
    → Strongly consistent DB (PostgreSQL, Azure SQL)
    → Two-phase commit (expensive, avoid if possible)
    → Saga pattern for distributed transactions
    → Idempotency keys for retry safety

    Problem: Need fault tolerance?
    → Circuit breakers between services
    → Retries with exponential backoff + jitter
    → Bulkheads for isolation
    → Graceful degradation (serve cached/partial data)
    → Queue-based architecture (survive downstream failures)










    🎯 Key Takeaways

    1. CAP theorem is real: understand your consistency needs per use case
    2. Circuit breakers prevent cascading failures: they're non-negotiable for microservices
    3. Retries without jitter create thundering herds: always add randomness
    4. The database is almost always the bottleneck: scale it before anything else
    5. Event-driven decoupling saves systems, but design for out-of-order delivery
    6. Anti-patterns are more important than patterns: knowing what NOT to do prevents disasters
    7. Use boring technology: battle-tested beats cutting-edge in production





    🔥 Homework

    1. Draw the architecture of your main system. Identify where a circuit breaker would prevent cascading failures.
    2. Check your retry configurations. Do they have jitter? If not, add it.
    3. Find one synchronous call chain that could be replaced with events. Write the event schema.





    🏁 Series Wrap-Up

    Congratulations, you've made it through the entire DevOps Principal Mastery series!


    Here's what we covered:
    • [Blog 1] Azure Cloud-Native Architecture: subscriptions, networking, identity
    • [Blog 2] Kubernetes Mastery: pods, scaling, security, GitOps
    • [Blog 3] Terraform at Scale: state, modules, testing, environments
    • [Blog 4] CI/CD Standardization: pipelines, DORA, deployment strategies
    • [Blog 5] Observability: metrics, logs, traces, alerting
    • [Blog 6] DevSecOps: supply chain, secrets, container security, zero-trust
    • [Blog 7] SRE: SLOs, error budgets, incidents, chaos engineering
    • [Blog 8] Technical Leadership: ADRs, mentoring, stakeholder management
    • [Blog 9] System Design: CAP, resilience patterns, scalability, events


    Every blog was packed with real incidents, real errors, real fixes, and real patterns used in production. No theoretical fluff. Just the stuff that matters.





    💬 Which blog in the series was most valuable to you? What topic should I deep-dive next? Drop your votes below. The next series depends on YOU. 🎯



