Zero-Downtime Deployment Strategies: Keep Your Application Running During Updates
Unplanned downtime costs businesses an average of $5,600 per minute by Gartner's widely cited estimate, yet 43% of companies still take their applications offline during deployments. Zero-downtime deployment is not a luxury; it is an expectation. Customers, search engines, and integration partners all penalize applications that go offline, even briefly.
This guide covers the three primary zero-downtime deployment strategies, database migration techniques that preserve uptime, and automated rollback mechanisms.
Key Takeaways
- Blue-green deployment is the safest strategy: instant rollback by switching traffic back to the previous version
- Database migrations must be backward-compatible --- the old application version must work with the new schema
- Health checks and readiness probes prevent routing traffic to pods that are not ready to serve
- Automated rollback based on error rate monitoring reduces mean time to recovery to under 2 minutes
Strategy Comparison
| Strategy | Complexity | Rollback Speed | Infrastructure Cost | Best For |
|---|---|---|---|---|
| Blue-green | Low | Instant (seconds) | 2x during deployment | Critical applications, infrequent deploys |
| Rolling update | Medium | Minutes | 1.25x during deployment | Kubernetes, frequent deploys |
| Canary | High | Fast (seconds) | 1.05x during deployment | High-traffic, risk-sensitive |
| Feature flags | Medium | Instant | 1x | Gradual feature rollout |
Blue-Green Deployment
Architecture
Load Balancer
|
|--- [ACTIVE] Blue environment (v2.0.0) <-- receives 100% traffic
|
|--- [IDLE] Green environment (v2.1.0) <-- deployed, tested, waiting
On deployment:
- Deploy v2.1.0 to the idle (green) environment
- Run smoke tests against green
- Switch load balancer to green
- Blue becomes idle (available for instant rollback)
Implementation with Nginx
# /etc/nginx/conf.d/app.conf
upstream blue {
    server 10.0.1.10:3000;
    server 10.0.1.11:3000;
}

upstream green {
    server 10.0.2.10:3000;
    server 10.0.2.11:3000;
}

# Active environment - change this during deployment
map $host $active_upstream {
    default blue;  # Change to 'green' to switch
}

server {
    listen 443 ssl;
    server_name app.example.com;

    location / {
        proxy_pass http://$active_upstream;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
Deployment Script
#!/bin/bash
set -e
CURRENT=$(cat /etc/nginx/active-env) # "blue" or "green"
TARGET=$( [ "$CURRENT" = "blue" ] && echo "green" || echo "blue" )
echo "Current: $CURRENT, deploying to: $TARGET"
# Deploy to inactive environment
ssh "deploy@$TARGET-1" "cd /opt/app && git pull && pnpm install --frozen-lockfile && pnpm build && pm2 restart all"
ssh "deploy@$TARGET-2" "cd /opt/app && git pull && pnpm install --frozen-lockfile && pnpm build && pm2 restart all"
# Wait for health checks
for i in 1 2; do
  echo "Checking $TARGET-$i health..."
  healthy=false
  for attempt in $(seq 1 30); do
    if curl -sf "http://$TARGET-$i:3000/health" > /dev/null; then
      echo "$TARGET-$i is healthy"
      healthy=true
      break
    fi
    sleep 2
  done
  if [ "$healthy" != "true" ]; then
    echo "ERROR: $TARGET-$i never became healthy, aborting before traffic switch" >&2
    exit 1
  fi
done
# Run smoke tests
pnpm test:smoke --base-url "http://$TARGET-1:3000"
# Switch traffic
sed -i "s/default $CURRENT/default $TARGET/" /etc/nginx/conf.d/app.conf
nginx -s reload
echo "$TARGET" > /etc/nginx/active-env
echo "Traffic switched to $TARGET. Rollback: change active-env back to $CURRENT"
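Rollback is the same switch in reverse. A minimal sketch, assuming the same `active-env` state file and nginx `map` config as the deploy script above (`flip_env` and `rollback` are illustrative helper names, not part of nginx):

```shell
# flip_env ENV — echo the other environment name
flip_env() {
  [ "$1" = "blue" ] && echo "green" || echo "blue"
}

# rollback STATE_FILE CONF_FILE — point the nginx map default back at the
# previous environment and record the new active environment
rollback() {
  local state_file="$1" conf_file="$2"
  local current previous
  current=$(cat "$state_file")
  previous=$(flip_env "$current")
  sed -i "s/default $current/default $previous/" "$conf_file"
  # reload nginx only where it is actually installed (skipped in dry runs)
  if command -v nginx >/dev/null 2>&1; then
    nginx -s reload || echo "warning: nginx reload failed" >&2
  fi
  echo "$previous" > "$state_file"
  echo "rolled back: traffic now on $previous"
}
```

Because the idle environment still runs the previous version, this completes in the time it takes nginx to reload its configuration.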
Rolling Update
Rolling updates replace instances incrementally, ensuring some capacity is always available.
Kubernetes Rolling Update
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 5
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # Create 1 extra pod during update
      maxUnavailable: 0   # Never reduce below desired replicas
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:v2.1.0
          readinessProbe:
            httpGet:
              path: /health
              port: 3001
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 3001
            initialDelaySeconds: 15
            periodSeconds: 10
The rolling update process with maxSurge: 1 and maxUnavailable: 0:
- Create 1 new pod with v2.1.0 (6 pods total: 5 old + 1 new)
- Wait for new pod readiness probe to pass
- Terminate 1 old pod (5 pods: 4 old + 1 new)
- Create another new pod (6 pods: 4 old + 2 new)
- Repeat until all pods are v2.1.0
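The pod-count sequence above can be sketched as a toy loop that prints the counts at each step (purely illustrative arithmetic, not a Kubernetes API call):

```shell
# Simulate the pod counts during a rolling update with maxSurge=1,
# maxUnavailable=0: total pods stay between replicas and replicas+1.
simulate_rolling_update() {
  local old="$1" new=0
  while [ "$old" -gt 0 ]; do
    new=$((new + 1))    # surge: one new pod is created
    echo "surge: $old old + $new new = $((old + new)) total"
    old=$((old - 1))    # readiness probe passed: retire one old pod
    echo "terminate: $old old + $new new = $((old + new)) total"
  done
}

simulate_rolling_update 5
```

Running it for 5 replicas shows the total oscillating between 5 and 6 pods: capacity never drops below the desired replica count.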
Canary Deployment
Traffic Splitting
# Istio VirtualService for canary routing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-canary
spec:
  hosts:
    - api.example.com
  http:
    - route:
        - destination:
            host: api-stable
            port:
              number: 3001
          weight: 95
        - destination:
            host: api-canary
            port:
              number: 3001
          weight: 5
Progressive Canary Rollout
| Phase | Canary Traffic | Duration | Success Criteria |
|---|---|---|---|
| 1 | 1% | 10 minutes | Error rate <0.1%, latency <500ms |
| 2 | 5% | 30 minutes | Error rate <0.1%, latency <500ms |
| 3 | 25% | 1 hour | Error rate <0.5%, latency <1s |
| 4 | 50% | 2 hours | Error rate <0.5%, latency <1s |
| 5 | 100% | Full rollout | Stable for 24 hours |
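A promotion controller for this schedule can be sketched as a loop. Here `set_canary_weight` and `metrics_healthy` are stand-ins for your actual mesh API (e.g. patching the VirtualService weights) and your monitoring queries; the phase durations come from the table above:

```shell
# Walk the canary through increasing traffic weights; roll back to 0% on
# the first failed metrics check. PHASE_DURATION defaults to 10 minutes.
promote_canary() {
  local weight
  for weight in 1 5 25 50 100; do
    set_canary_weight "$weight"
    sleep "${PHASE_DURATION:-600}"     # hold the phase while metrics accumulate
    if ! metrics_healthy "$weight"; then
      echo "canary failed at ${weight}% traffic, rolling back"
      set_canary_weight 0              # all traffic back to stable
      return 1
    fi
  done
  echo "canary promoted to 100%"
}
```

In practice tools like Argo Rollouts or Flagger implement this loop for you, but the control flow is the same: raise the weight, wait, check, and roll back to zero the moment a success criterion fails.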
Database Migrations Without Downtime
The biggest challenge in zero-downtime deployment is database schema changes. The old application version must work with the new schema, and vice versa.
The Expand-Contract Pattern
Phase 1: Expand (deploy schema change)
-- Add new column (nullable, no default)
ALTER TABLE orders ADD COLUMN shipping_method VARCHAR(50);
Old application code simply ignores the new column. New application code starts writing it; when a change replaces an existing column, the new code writes to both the old and the new column during the transition.
Phase 2: Migrate data
-- Backfill existing data
UPDATE orders SET shipping_method = 'standard' WHERE shipping_method IS NULL;
Phase 3: Contract (deploy code that uses new column exclusively)
After all application instances use the new column:
-- Add the default first so in-flight writes cannot violate the constraint
ALTER TABLE orders ALTER COLUMN shipping_method SET DEFAULT 'standard';
-- Then make the column required
ALTER TABLE orders ALTER COLUMN shipping_method SET NOT NULL;
Dangerous Migration Patterns
| Pattern | Risk | Safe Alternative |
|---|---|---|
| Rename column | Breaks old code | Add new column, migrate, drop old |
| Drop column | Breaks old code | Stop using, then drop in next release |
| Add NOT NULL column | Locks table | Add nullable, backfill, alter to NOT NULL |
| Change column type | Locks table, breaks queries | Add new column with new type, migrate |
| Add unique index | Locks table on large tables | CREATE INDEX CONCURRENTLY |
Automated Rollback
Error Rate Based Rollback
#!/bin/bash
# post-deploy-monitor.sh
DEPLOY_TIME=$(date +%s)
MONITOR_DURATION=300   # 5 minutes
ERROR_THRESHOLD=0.02   # 2%

while [ $(($(date +%s) - DEPLOY_TIME)) -lt $MONITOR_DURATION ]; do
  # Aggregate across all series; default to 0 if Prometheus has no data yet
  ERROR_RATE=$(curl -sG "http://prometheus:9090/api/v1/query" \
    --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[2m])) / sum(rate(http_requests_total[2m]))" \
    | jq -r '.data.result[0].value[1] // "0"')
  [ "$ERROR_RATE" = "NaN" ] && ERROR_RATE="0"
  if (( $(echo "$ERROR_RATE > $ERROR_THRESHOLD" | bc -l) )); then
    echo "ERROR: Rate $ERROR_RATE exceeds threshold $ERROR_THRESHOLD"
    echo "Initiating rollback..."
    kubectl rollout undo deployment/api
    exit 1
  fi
  sleep 15
done
echo "Deployment healthy for $MONITOR_DURATION seconds"
Frequently Asked Questions
Which strategy should we start with?
Start with blue-green deployment. It is the simplest to implement, provides instant rollback, and works with any application architecture. Rolling updates are better for Kubernetes environments with many replicas. Canary deployments are for high-traffic applications where you want to validate changes with real traffic before full rollout.
How do we handle long-running background jobs during deployment?
Use graceful shutdown. When a pod receives a termination signal, stop accepting new jobs, finish in-progress jobs (with a timeout), then shut down. In Kubernetes, configure terminationGracePeriodSeconds to allow enough time for jobs to complete. For jobs that take longer than the grace period, use a job queue (Redis, RabbitMQ) that retries failed jobs on surviving workers.
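The drain loop itself is simple. A minimal shell sketch, with `next_job` and `process_job` standing in for your real queue operations:

```shell
# On SIGTERM, flip a flag instead of exiting immediately; the worker loop
# finishes the in-flight job and then exits cleanly.
DRAINING=0
trap 'DRAINING=1' TERM

worker_loop() {
  while [ "$DRAINING" -eq 0 ]; do
    job=$(next_job) || break      # stand-in for popping your job queue
    process_job "$job"            # in-flight job always runs to completion
  done
  echo "queue drained, exiting cleanly"
}
```

The same shape applies in any language: the signal handler only sets a flag, and the worker checks that flag between jobs, never in the middle of one. Kubernetes sends SIGTERM, waits `terminationGracePeriodSeconds`, then sends SIGKILL, so the grace period must exceed your longest expected job.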
What about WebSocket connections during deployment?
WebSocket connections are long-lived and must be handled carefully. During a rolling update, existing connections on the old pod stay active until the pod terminates. Clients should implement automatic reconnection logic. For blue-green deployments, switch new connections to the new environment while allowing existing connections to drain on the old environment with a timeout.
How do we test zero-downtime deployments?
Run a load test during deployment. Use k6 or a similar tool to generate continuous traffic, then trigger a deployment. Check for any errors, increased latency, or dropped connections during the rollover. See our load testing guide for implementation details.
What Comes Next
Zero-downtime deployment is a prerequisite for frequent, confident releases. Combine it with CI/CD automation for the full deployment pipeline and monitoring for post-deployment verification.
Contact ECOSIRE for deployment strategy consulting, or explore our DevOps guide for the complete infrastructure roadmap.
Published by ECOSIRE -- helping businesses deploy without disruption.
Written by
ECOSIRE Research and Development Team