Part 5: From One Server to Many - The Need for Orchestration

**MyrinNew** · 12-22-2025, 05:45 PM

Series: From "Just Put It on a Server" to Production DevOps

Reading time: 11 minutes

Level: Intermediate

The Production Reality Check

Your SSPP platform is live! Docker Compose works beautifully on your local machine and even on your single production server.

Then Black Friday hits. Traffic spikes 50x.

What do you do?

You can't just run docker-compose up --scale worker=50 because:

One server doesn't have 50x the resources
The database would be overwhelmed
You'd need multiple servers

So you start manually:

# Rent 5 more Linode servers
# SSH into each one
# Install Docker on each
# Copy docker-compose.yml to each
# Modify each to avoid port conflicts
# Start containers manually
# Configure a load balancer somehow
# Hope nothing breaks

Time to scale: 3-4 hours (if you're fast and lucky)

By the time you're done, Black Friday is over.

Failure Scenario 1: Container Crashes

Let's simulate a production crash:

# Start your stack
docker-compose up -d

# Kill the API container
docker kill sspp-api

What happens?

The API is dead. By default (with no restart policy configured), Docker Compose doesn't restart it automatically.

Check status:

docker-compose ps

NAME            STATE
sspp-api        Exited (137)
sspp-worker     Up
sspp-postgres   Up
sspp-redis      Up

Users see 500 errors. Your on-call phone explodes. 📱💥

Manual fix:

docker-compose up -d api

Downtime: 2-10 minutes (detection + SSH + restart)

In a production system, you need automatic recovery.
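On a single host, Compose can cover part of this with a restart policy. A minimal sketch, assuming your services are named api and worker in docker-compose.yml (the names used in the commands above); merge it into your existing service definitions:

```yaml
services:
  api:
    restart: unless-stopped   # Docker restarts the container when its process exits
  worker:
    restart: unless-stopped   # same policy for the worker
```

This only helps when the process crashes on its own. Docker ignores the policy for containers you stop manually, so the docker kill test above may still leave the API down. And nothing here moves a container to another machine when the server itself is gone.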

Failure Scenario 2: Server Crashes

Even worse—the entire server goes down:

# Simulate server crash (don't actually run this!)
sudo reboot -f

What happens?

API: Dead
Worker: Dead
PostgreSQL: Dead (data persisted in volumes, but service down)
Redis queue: Jobs not yet persisted to disk are gone (all of them if persistence is off)
Users: Angry

Manual recovery:

# Wait for server to boot (~2 minutes)
# SSH in
docker-compose up -d
# Wait for services to start (~30 seconds)
# Hope data is intact

Downtime: 3-5 minutes minimum

Lost data: Every queued job that Redis had not yet written to disk
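How many jobs survive depends on how Redis persists data. A sketch of one option, append-only file (AOF) persistence on a named volume, assuming the service is called redis and uses the official redis image (which keeps its data in /data). The volume name redis-data is made up for this example:

```yaml
services:
  redis:
    command: ["redis-server", "--appendonly", "yes"]  # log every write to an append-only file
    volumes:
      - redis-data:/data   # keep the AOF outside the container's writable layer

volumes:
  redis-data:
```

With the default appendfsync everysec setting, a hard crash can still lose about the last second of writes. That is a lot better than losing the whole queue, but it doesn't fix the downtime.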

Failure Scenario 3: Rolling Update Gone Wrong

You need to deploy a critical bug fix:

# Build new image
docker-compose build api

# Restart with new image
docker-compose up -d api

What happens?

Old API container stops (connections dropped)
New API container starts
5-30 seconds of downtime while it boots
If the new version has a bug, you have to roll back manually

The deployment strategy:

No blue/green deployment
No canary releases
No gradual rollout
Just... restart and pray 🙏

Failure Scenario 4: Manual Scaling Nightmare

Traffic is increasing. You need 5 API instances across 3 servers:

Server 1 (docker-compose.yml):

services:
  api:
    ports:
      - "3000:3000" # Occupies port 3000

Server 2 (docker-compose.yml):

services:
  api:
    ports:
      - "3000:3000" # Same port; works because it's a different server

But how do users reach them? You need a load balancer:

       ┌───────────────┐
       │ Load Balancer │
       │  (HAProxy?)   │
       └───────┬───────┘
               │
    ┌──────────┼──────────┐
    ▼          ▼          ▼
Server 1   Server 2   Server 3
API:3000   API:3000   API:3000

Manual steps:

Install HAProxy
Configure health checks
Add server IPs manually
Restart HAProxy when adding/removing servers
Handle SSL termination
Monitor everything

Time to set up: 2-4 hours

Maintenance burden: High

Error-prone: Very

Failure Scenario 5: Database Connection Limits

Your PostgreSQL server has a max_connections limit (default: 100).

With 10 API instances and 10 Worker instances, each holding 10 connections:

10 APIs × 10 connections = 100
10 Workers × 10 connections = 100
Total = 200 connections
Max allowed = 100

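Result: Once the limit is reached, roughly half of the connection attempts are rejected ("sorry, too many clients already"), so many of your containers can't reach the database.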

Manual fix:

Configure connection pooling in each service
Increase PostgreSQL max_connections
Restart everything
Hope you calculated correctly
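The manual fix above could look roughly like this in Compose. It's a sketch, assuming the database service is named postgres and runs the official postgres image, which accepts server settings as -c flags. DB_POOL_MAX is a hypothetical variable: your API and worker code would have to read it and cap their pools themselves.

```yaml
services:
  postgres:
    # 20 instances x 10 connections = 200, plus headroom for admin sessions
    command: ["postgres", "-c", "max_connections=250"]
  api:
    environment:
      DB_POOL_MAX: "10"   # hypothetical: per-instance pool cap read by the app
  worker:
    environment:
      DB_POOL_MAX: "10"   # hypothetical: per-instance pool cap read by the app
```

Every extra connection costs PostgreSQL memory, so raising the limit only goes so far. And you have to redo the math each time the instance count changes.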

What You Need (But Don't Have)

At this point, you realize you need:

Self-healing: Automatically restart failed containers
Auto-scaling: Add/remove instances based on load
Load balancing: Distribute traffic across instances
Service discovery: Containers find each other dynamically
Rolling updates: Deploy without downtime
Rollback capability: Revert bad deploys instantly
Health checks: Don't route traffic to sick containers
Resource limits: Prevent one container from starving others
Secrets management: No passwords in plain text
Multi-server orchestration: Run across many machines

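Docker Compose gives you only basic, single-host versions of a few of these (restart policies, health checks, resource limits), and nothing that spans multiple servers.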

The Orchestration Gap

Docker Compose is great for development:

Single server
Manual starts/stops
Simple networking
Quick iteration

But terrible for production:

No multi-server support
No automatic recovery
No scaling logic
No deployment strategies
No resource management
No production-grade networking

You've hit the orchestration wall.

The Emotional Journey

Stage 1: Denial

"Docker Compose works fine. I'll just run it on a big server."

Stage 2: Anger

"Why is this so hard?! I just want to run containers!"

Stage 3: Bargaining

"Maybe I can script this with bash and cron jobs?"

Stage 4: Depression

"I'm spending 80% of my time managing infrastructure, 20% building features."

Stage 5: Acceptance

"I need an orchestrator. I need Kubernetes."

Why Kubernetes Exists

Kubernetes solves all the problems we just experienced:

Feature	Docker Compose	Kubernetes
Auto-restart	⚠️ Opt-in restart policy	✅ Automatic
Multi-server	❌ Single server	✅ Cluster of servers
Load balancing	❌ Manual HAProxy	✅ Built-in Service
Scaling	❌ Manual --scale	✅ Auto-scaling (HPA)
Rolling updates	❌ Restart (downtime)	✅ Zero-downtime
Rollback	❌ Manual	✅ One command
Health checks	⚠️ Basic	✅ Advanced (liveness, readiness)
Secrets	❌ Plain text	✅ Secret objects (encryption at rest is opt-in)
Resource limits	⚠️ Basic	✅ Fine-grained
Service discovery	⚠️ DNS, single host	✅ Cluster-wide DNS

Kubernetes is Docker Compose for production, multiplied by 1000.
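To make a few rows of that comparison concrete, here is a sketch of a Deployment that asks for a rolling update with no capacity loss, readiness and liveness probes, and resource limits. The name sspp-api, the image, the /health path, and all the numbers are assumptions for illustration, not the real SSPP manifest. Port 3000 is the API port used earlier.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sspp-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sspp-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never drop below the desired replica count
      maxSurge: 1         # start one extra pod at a time during a rollout
  template:
    metadata:
      labels:
        app: sspp-api
    spec:
      containers:
        - name: api
          image: registry.example.com/sspp-api:1.0.0   # placeholder image
          ports:
            - containerPort: 3000
          readinessProbe:          # no traffic until this check passes
            httpGet:
              path: /health        # assumed health endpoint
              port: 3000
          livenessProbe:           # container is restarted if this keeps failing
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 10
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 512Mi
```

If a rollout goes wrong, kubectl rollout undo deployment/sspp-api returns to the previous revision. That's the "one command" in the Rollback row.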

But Why Not Just... [Alternative]?

"Why not Docker Swarm?"

Docker Swarm is simpler than Kubernetes, but:

Smaller ecosystem
Fewer features (no built-in autoscaling, limited RBAC)
Less adoption (most tools target K8s)
Docker Inc. de-prioritized it

Use case: Small teams, simple apps.

"Why not managed services (AWS ECS, Cloud Run)?"

Managed services work great, but:

Vendor lock-in (can't easily move)
Limited customization
Higher costs at scale
Not portable (can't run locally)

Use case: Fully bought into one cloud provider.

"Why not Nomad?"

HashiCorp Nomad is excellent, but:

Smaller community
Fewer integrations
Less tooling
Harder to hire for

Use case: Already using HashiCorp stack (Terraform, Vault, Consul).

"Why Kubernetes?"

Industry standard (a common requirement in DevOps job listings)
Huge ecosystem (tools for everything)
Cloud-agnostic (AWS, GCP, Azure, Linode)
Local development (k3s, Minikube, Kind)
Portable (same manifests everywhere)

Kubernetes won the orchestration war.

What You'll Learn in Part 6

In the next article, we'll deploy SSPP to Kubernetes on Linode.

But we won't just throw kubectl commands at you.

We'll explain:

What Pods, Deployments, and Services actually are
Why Kubernetes seems complicated (and how to think about it)
How to run Kubernetes locally (k3s) before going to production
Real deployment strategies (rolling updates, blue/green)
How our SSPP manifests work

No magic. No copy-paste. Just understanding.

The Mindset Shift

Before Kubernetes, you think:

"I have a server. I'll put containers on it."

After Kubernetes, you think:

"I have a cluster. I'll declare what I want running. Kubernetes makes it happen."

It's declarative infrastructure:

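# You say what you want
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sspp-api
spec:
  replicas: 5 # I want 5 API instances
  selector:
    matchLabels: {app: sspp-api}
  template:
    metadata:
      labels: {app: sspp-api}
    spec:
      containers:
        - name: api
          image: sspp-api:latest # placeholder image name

# Kubernetes makes it happen
# - Schedules 5 pods
# - Distributes across servers
# - Monitors them
# - Restarts if they die
# - Scales up/down dynamically once you add an autoscaler (HPA)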

You describe the desired state. Kubernetes maintains it.
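A fixed replicas value stays fixed. Scaling up and down with load needs a HorizontalPodAutoscaler pointed at the Deployment. A minimal sketch, assuming the Deployment is named sspp-api, the cluster runs metrics-server, and the pods declare CPU requests. The replica bounds and the 70% target are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sspp-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sspp-api          # the Deployment to scale
  minReplicas: 5
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add pods when average CPU exceeds 70% of requests
```

Once the autoscaler owns the replica count, it overrides whatever number the Deployment manifest started with.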

Try It Yourself (Before Part 6)

Challenge: Break Docker Compose in creative ways:

Kill containers—see if they restart (they won't)
Overload the API—see if it auto-scales (it won't)
Deploy a new version—see if there's downtime (there will be)
Simulate high CPU—see if K8s would help (it would)

Write down your frustrations. They'll make Part 6 more satisfying.

Discussion

What production incident convinced you that you needed orchestration?

Share your war stories on GitHub Discussions.

Previous: Part 4: Running Multiple Services Locally with Docker Compose

Next: Part 6: Kubernetes from First Principles (No Magic)

About the Author

Documenting a real DevOps journey for the Proton.ai application. Connect with me:

GitHub: @daviesbrown
LinkedIn: David Nwosu Brown
