Part 5: From One Server to Many - The Need for Orchestration

  • MyrinNew
    Senior Member
    • Feb 2024
    • 5175

    #1


    Series: From "Just Put It on a Server" to Production DevOps


    Reading time: 11 minutes


    Level: Intermediate



    The Production Reality Check

    Your SSPP platform is live! Docker Compose works beautifully on your local machine and even on your single production server.


    Then Black Friday hits. Traffic spikes 50x.


    What do you do?


    You can't just run docker-compose up --scale worker=50 because:

    1. One server doesn't have 50x the resources
    2. The database would be overwhelmed
    3. You'd need multiple servers
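
    To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The figures used with it (512 MB per worker, 8 GB per server, 20% headroom) are made-up assumptions for illustration, not measurements from SSPP, and the function name is hypothetical.

```python
import math


def servers_needed(instances: int, mem_per_instance_mb: int,
                   server_mem_mb: int, headroom: float = 0.2) -> int:
    """Estimate how many servers are needed to fit `instances` by memory alone.

    All inputs are hypothetical. `headroom` is the fraction of each server
    kept free for the OS and for traffic bursts.
    """
    usable_mb = server_mem_mb * (1 - headroom)
    per_server = int(usable_mb // mem_per_instance_mb)
    if per_server < 1:
        raise ValueError("a single instance does not fit on one server")
    # Round up: a partially filled server is still a server you pay for.
    return math.ceil(instances / per_server)
```

    With those assumed numbers, servers_needed(50, 512, 8192) fits 12 workers per server and comes out to 5 servers, before counting the database or a load balancer.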


    So you start manually:






    # Rent 5 more Linode servers
    # SSH into each one
    # Install Docker on each
    # Copy docker-compose.yml to each
    # Modify each to avoid port conflicts
    # Start containers manually
    # Configure a load balancer somehow
    # Hope nothing breaks







    Time to scale: 3-4 hours (if you're fast and lucky)


    By the time you're done, Black Friday is over.





    Failure Scenario 1: Container Crashes

    Let's simulate a production crash:






    # Start your stack
    docker-compose up -d

    # Kill the API container
    docker kill sspp-api







    What happens?


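    The API is dead. With no restart policy configured (the Compose default), nothing brings it back automatically. A restart: policy can revive a crashed process on this one host, but it can't help when the host itself is the problem.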


    Check status:






    docker-compose ps

    NAME            STATE
    sspp-api        Exited (137)
    sspp-worker     Up
    sspp-postgres   Up
    sspp-redis      Up







    Users see failed requests (connection errors, or 502/503 responses if a proxy sits in front of the API). Your on-call phone explodes. 📱💥


    Manual fix:






    docker-compose up -d api







    Downtime: 2-10 minutes (detection + SSH + restart)


    In a production system, you need automatic recovery.
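
    To make "automatic recovery" concrete, here is a minimal watchdog sketch in Python. It is an illustration of the detect-and-restart loop, not part of the SSPP stack: the function names are hypothetical, and it assumes a Docker CLI recent enough to support the .State placeholder in docker ps --format.

```python
import subprocess
from typing import Callable, List

# Lists every container as "<name><TAB><state>", one per line.
PS_COMMAND = ["docker", "ps", "-a", "--format", "{{.Names}}\t{{.State}}"]


def exited_containers(ps_output: str) -> List[str]:
    """Return the names of containers whose state is 'exited'."""
    names = []
    for line in ps_output.splitlines():
        if not line.strip():
            continue
        name, _, state = line.partition("\t")
        if state.strip() == "exited":
            names.append(name.strip())
    return names


def restart_exited(run: Callable[[List[str]], str]) -> List[str]:
    """Start every exited container and return the names that were started.

    `run` executes a command and returns its stdout, so it can be swapped
    for a fake in tests.
    """
    restarted = []
    for name in exited_containers(run(PS_COMMAND)):
        run(["docker", "start", name])
        restarted.append(name)
    return restarted


def run_command(cmd: List[str]) -> str:
    """Real runner: shells out to the Docker CLI."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
```

    Calling restart_exited(run_command) from a cron job or a loop would cover this one scenario on this one server. It does nothing for a dead host, which is the gap an orchestrator fills.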





    Failure Scenario 2: Server Crashes

    Even worse, the entire server goes down:






    # Simulate server crash (don't actually run this!)
    sudo reboot -f







    What happens?
    • API: Dead
    • Worker: Dead
    • PostgreSQL: Dead (data persisted in volumes, but service down)
    • Redis queue: Empty (all jobs lost, since Redis here is not persisting the queue to a volume)
    • Users: Angry


    Manual recovery:






    # Wait for server to boot (~2 minutes)
    # SSH in
    docker-compose up -d
    # Wait for services to start (~30 seconds)
    # Hope data is intact







    Downtime: 3-5 minutes minimum


    Lost data: All queued jobs





    Failure Scenario 3: Rolling Update Gone Wrong

    You need to deploy a critical bug fix:






    # Build new image
    docker-compose build api

    # Restart with new image
    docker-compose up -d api







    What happens?

    1. Old API container stops (connections dropped)
    2. New API container starts
    3. 5-30 seconds of downtime while it boots
    4. If the new version has a bug, you have to roll back manually


    The deployment strategy:
    • No blue/green deployment
    • No canary releases
    • No gradual rollout
    • Just... restart and pray 🙏
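
    For contrast, here is a minimal Python sketch of what a gradual rollout with an automatic rollback looks like as logic. It is a toy model under stated assumptions: the fleet is just a list of version strings, and is_healthy is a hypothetical callback standing in for a real health check.

```python
from typing import Callable, List, Tuple


def rolling_update(fleet: List[str], new_version: str,
                   is_healthy: Callable[[str], bool],
                   batch_size: int = 1) -> Tuple[List[str], bool]:
    """Replace instances batch by batch, checking health after each batch.

    Returns (final_fleet, succeeded). If a health check fails, the whole
    fleet is returned to its original versions.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    original = list(fleet)
    current = list(fleet)
    for start in range(0, len(current), batch_size):
        # Only this batch changes; the rest keep serving the old version.
        for i in range(start, min(start + batch_size, len(current))):
            current[i] = new_version
        if not is_healthy(new_version):
            return original, False  # roll back instead of pushing on
    return current, True
```

    With three instances and batch_size=1, at most one instance is being replaced at any moment, so the other two keep serving traffic. A single docker-compose up -d api has no equivalent of either the batching or the rollback branch.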





    Failure Scenario 4: Manual Scaling Nightmare

    Traffic is increasing. You need 5 API instances across 3 servers:


    Server 1 (docker-compose.yml):






    services:
      api:
        ports:
          - "3000:3000"  # Occupies port 3000







    Server 2 (docker-compose.yml):






    services:
      api:
        ports:
          - "3000:3000"  # Same port; works because it's a different server







    But how do users reach them? You need a load balancer:






            ┌───────────────┐
            │ Load Balancer │
            │  (HAProxy?)   │
            └───────┬───────┘
                    │
       ┌────────────┼────────────┐
       ▼            ▼            ▼
    Server 1     Server 2     Server 3
    API:3000     API:3000     API:3000







    Manual steps:

    1. Install HAProxy
    2. Configure health checks
    3. Add server IPs manually
    4. Restart HAProxy when adding/removing servers
    5. Handle SSL termination
    6. Monitor everything


    Time to set up: 2-4 hours


    Maintenance burden: High


    Error-prone: Very
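
    The core of what that load balancer does (steps 2 and 3 above) can be sketched in a few lines of Python. This is a simplified model, not HAProxy's actual algorithm or configuration; the class name and the backend addresses are hypothetical.

```python
from typing import Iterable, List, Set


class RoundRobinBalancer:
    """Hand out backends in turn, skipping any that are marked unhealthy."""

    def __init__(self, backends: Iterable[str]) -> None:
        self._backends: List[str] = list(backends)
        self._healthy: Set[str] = set(self._backends)
        self._next = 0

    def mark_down(self, backend: str) -> None:
        """Record a failed health check for `backend`."""
        self._healthy.discard(backend)

    def mark_up(self, backend: str) -> None:
        """Record a passing health check for a known backend."""
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self) -> str:
        """Return the next healthy backend in round-robin order."""
        for _ in range(len(self._backends)):
            candidate = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backends available")
```

    Note what is still manual here: something has to call mark_down when a server fails, and the backend list itself is fixed at construction time. Those are exactly the parts you end up maintaining by hand with a static HAProxy setup.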



    Failure Scenario 5: Database Connection Limits

    Your PostgreSQL server has a max_connections limit (default: 100).


    With 10 API instances and 10 Worker instances, each holding 10 connections:






    10 APIs × 10 connections    = 100
    10 Workers × 10 connections = 100
    Total                       = 200 connections
    Max allowed                 = 100







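    Result: Once the limit is reached, PostgreSQL rejects new connections ("FATAL: sorry, too many clients already"), so roughly half of those 200 connections never open.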


    Manual fix:

    1. Configure connection pooling in each service
    2. Increase PostgreSQL max_connections
    3. Restart everything
    4. Hope you calculated correctly
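
    The "calculated correctly" part is simple arithmetic, sketched here in Python. The function names are hypothetical, and the default of 3 reserved connections is an assumption based on PostgreSQL's default superuser_reserved_connections setting; check your own server's value.

```python
from typing import Dict


def total_connections(instances_by_service: Dict[str, int], pool_size: int) -> int:
    """Connections opened if every instance fills a pool of `pool_size`."""
    return sum(instances_by_service.values()) * pool_size


def max_pool_size(max_connections: int, total_instances: int,
                  reserved: int = 3) -> int:
    """Largest per-instance pool that keeps the total under the server limit.

    `reserved` connections are held back for administrative access
    (assumed default; adjust to match your PostgreSQL configuration).
    """
    if total_instances < 1:
        raise ValueError("total_instances must be at least 1")
    usable = max(max_connections - reserved, 0)
    return usable // total_instances
```

    For the scenario above, total_connections({"api": 10, "worker": 10}, 10) is 200, while max_pool_size(100, 20) is 4. In other words, with the default limit each of the 20 instances can safely hold about 4 connections, not 10.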





    What You Need (But Don't Have)

    At this point, you realize you need:

    1. Self-healing: Automatically restart failed containers
    2. Auto-scaling: Add/remove instances based on load
    3. Load balancing: Distribute traffic across instances
    4. Service discovery: Containers find each other dynamically
    5. Rolling updates: Deploy without downtime
    6. Rollback capability: Revert bad deploys instantly
    7. Health checks: Don't route traffic to sick containers
    8. Resource limits: Prevent one container from starving others
    9. Secrets management: No passwords in plain text
    10. Multi-server orchestration: Run across many machines


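    Docker Compose gives you only basic, single-host versions of a few of these (restart policies, health checks, resource limits, DNS-based discovery) and none of them across multiple servers.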





    The Orchestration Gap

    Docker Compose is great for development:
    • Single server
    • Manual starts/stops
    • Simple networking
    • Quick iteration


    But terrible for production:
    • No multi-server support
    • No automatic recovery beyond single-host restart policies
    • No scaling logic
    • No deployment strategies
    • No cluster-wide resource management
    • No production-grade networking


    You've hit the orchestration wall.





    The Emotional Journey

    Stage 1: Denial


    "Docker Compose works fine. I'll just run it on a big server."


    Stage 2: Anger


    "Why is this so hard?! I just want to run containers!"


    Stage 3: Bargaining


    "Maybe I can script this with bash and cron jobs?"


    Stage 4: Depression


    "I'm spending 80% of my time managing infrastructure, 20% building features."


    Stage 5: Acceptance


    "I need an orchestrator. I need Kubernetes."



    Why Kubernetes Exists

    Kubernetes solves all the problems we just experienced:


    Feature            Docker Compose                Kubernetes
    Auto-restart       ⚠️ Restart policy, one host   ✅ Automatic, cluster-wide
    Multi-server       ❌ Single server              ✅ Cluster of servers
    Load balancing     ❌ Manual HAProxy             ✅ Built-in Service
    Scaling            ❌ Manual --scale             ✅ Auto-scaling (HPA)
    Rolling updates    ❌ Restart (downtime)         ✅ Zero-downtime
    Rollback           ❌ Manual                     ✅ One command
    Health checks      ⚠️ Basic                      ✅ Advanced (liveness, readiness)
    Secrets            ❌ Plain text                 ✅ Secret objects (encryption at rest is opt-in)
    Resource limits    ⚠️ Basic                      ✅ Fine-grained
    Service discovery  ⚠️ DNS-based                  ✅ Dynamic


    Kubernetes is Docker Compose for production, multiplied by 1000.
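
    As one example of what "Auto-scaling (HPA)" means in practice, the Horizontal Pod Autoscaler's core rule is a ratio: scale the replica count by how far the observed metric is from its target, then round up. Here is a simplified Python sketch of that rule. The function name and the min/max defaults are hypothetical, and the real controller also applies a tolerance band and stabilization windows that are left out here.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Simplified HPA rule: ceil(current * observed / target), then clamp."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Never go below the floor or above the ceiling set for the workload.
    return max(min_replicas, min(max_replicas, raw))
```

    With 5 replicas averaging 90% CPU against a 60% target, desired_replicas(5, 90, 60) asks for 8 replicas. At 30% CPU it asks for 3. Nobody has to SSH anywhere for either change.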



    But Why Not Just... [Alternative]?

    "Why not Docker Swarm?"

    Docker Swarm is simpler than Kubernetes, but:
    • Smaller ecosystem
    • Fewer features (no HPA, limited RBAC)
    • Less adoption (most tools target K8s)
    • Docker Inc. de-prioritized it


    Use case: Small teams, simple apps.

    "Why not managed services (AWS ECS, Cloud Run)?"

    Managed services work great, but:
    • Vendor lock-in (can't easily move)
    • Limited customization
    • Higher costs at scale
    • Not portable (can't run locally)


    Use case: Fully bought into one cloud provider.

    "Why not Nomad?"

    HashiCorp Nomad is excellent, but:
    • Smaller community
    • Fewer integrations
    • Less tooling
    • Harder to hire for


    Use case: Already using HashiCorp stack (Terraform, Vault, Consul).

    "Why Kubernetes?"

    • Industry standard (most jobs require it)
    • Huge ecosystem (tools for everything)
    • Cloud-agnostic (AWS, GCP, Azure, Linode)
    • Local development (k3s, Minikube, Kind)
    • Portable (same manifests everywhere)


    Kubernetes won the orchestration war.



    What You'll Learn in Part 6

    In the next article, we'll deploy SSPP to Kubernetes on Linode.


    But we won't just throw kubectl commands at you.


    We'll explain:
    • What Pods, Deployments, and Services actually are
    • Why Kubernetes seems complicated (and how to think about it)
    • How to run Kubernetes locally (k3s) before going to production
    • Real deployment strategies (rolling updates, blue/green)
    • How our SSPP manifests work


    No magic. No copy-paste. Just understanding.



    The Mindset Shift

    Before Kubernetes, you think:


    "I have a server. I'll put containers on it."


    After Kubernetes, you think:


    "I have a cluster. I'll declare what I want running. Kubernetes makes it happen."


    It's declarative infrastructure:






    # You say what you want
    apiVersion: apps/v1
    kind: Deployment
    spec:
      replicas: 5  # I want 5 API instances
      # (metadata, selector, and pod template omitted for brevity)

    # Kubernetes makes it happen
    # - Schedules 5 pods
    # - Distributes across servers
    # - Monitors them
    # - Restarts if they die
    # - Adjusts the count when you change replicas (or automatically, with an HPA)







    You describe the desired state. Kubernetes maintains it.
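
    The idea behind "maintains it" is a reconciliation loop: compare desired state with actual state, then emit whatever actions close the gap. Here is a toy Python sketch of one pass of such a loop. It is a mental model only, not how Kubernetes controllers are implemented; the pod names and the action tuples are hypothetical.

```python
from typing import Dict, List, Tuple


def reconcile(desired_replicas: int, pods: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return the actions needed to move `pods` toward `desired_replicas`.

    `pods` maps pod name to status. For simplicity, anything that is not
    "Running" is treated as failed and replaced.
    """
    actions: List[Tuple[str, str]] = []
    running = sorted(name for name, status in pods.items() if status == "Running")
    for name in sorted(n for n, s in pods.items() if s != "Running"):
        actions.append(("delete", name))

    missing = desired_replicas - len(running)
    if missing > 0:
        # Too few healthy pods: create replacements.
        actions.extend(("create", f"new-pod-{i + 1}") for i in range(missing))
    else:
        # Too many: remove the surplus beyond the desired count.
        actions.extend(("delete", name) for name in running[desired_replicas:])
    return actions
```

    Run against {"api-1": "Running", "api-2": "Failed", "api-3": "Running"} with a desired count of 3, one pass returns a delete for api-2 and a create for one replacement. An orchestrator runs this kind of comparison continuously, which is why a killed container comes back without anyone being paged.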





    Try It Yourself (Before Part 6)

    Challenge: Break Docker Compose in creative ways:

    1. Kill containers: see if they restart (they won't)
    2. Overload the API: see if it auto-scales (it won't)
    3. Deploy a new version: see if there's downtime (there will be)
    4. Simulate high CPU: see if K8s would help (it would)


    Write down your frustrations. They'll make Part 6 more satisfying.





    Discussion

    What production incident convinced you that you needed orchestration?


    Share your war stories on GitHub Discussions.





    Previous: Part 4: Running Multiple Services Locally with Docker Compose


    Next: Part 6: Kubernetes from First Principles (No Magic)

    About the Author


    Documenting a real DevOps journey for the Proton.ai application. Connect with me:


