Zero-Downtime Deployment Strategies: Keep Your Application Running During Updates
Unplanned downtime costs businesses an average of $5,600 per minute by Gartner's widely cited estimate, yet 43% of companies still take their applications offline during deployments. Zero-downtime deployment is not a luxury; it is an expectation. Customers, search engines, and integration partners all penalize applications that go offline, even briefly.
This guide covers the three primary zero-downtime deployment strategies, database migration techniques that preserve uptime, and automated rollback mechanisms.
Key Takeaways
- Blue-green deployment is the safest strategy: instant rollback by switching traffic back to the previous version
- Database migrations must be backward-compatible --- the old application version must work with the new schema
- Health checks and readiness probes prevent routing traffic to pods that are not ready to serve
- Automated rollback based on error rate monitoring reduces mean time to recovery to under 2 minutes
Strategy Comparison
| Strategy | Complexity | Rollback Speed | Infrastructure Cost | Best For |
|---|---|---|---|---|
| Blue-green | Low | Instant (seconds) | 2x during deployment | Critical applications, infrequent deploys |
| Rolling update | Medium | Minutes | 1.25x during deployment | Kubernetes, frequent deploys |
| Canary | High | Fast (seconds) | 1.05x during deployment | High-traffic, risk-sensitive |
| Feature flags | Medium | Instant | 1x | Gradual feature rollout |
Blue-Green Deployment
Architecture
Load Balancer
|
|--- [ACTIVE] Blue environment (v2.0.0) <-- receives 100% traffic
|
|--- [IDLE] Green environment (v2.1.0) <-- deployed, tested, waiting
On deployment:
- Deploy v2.1.0 to the idle (green) environment
- Run smoke tests against green
- Switch load balancer to green
- Blue becomes idle (available for instant rollback)
Implementation with Nginx
# /etc/nginx/conf.d/app.conf
upstream blue {
    server 10.0.1.10:3000;
    server 10.0.1.11:3000;
}

upstream green {
    server 10.0.2.10:3000;
    server 10.0.2.11:3000;
}

# Active environment - change this during deployment
map $host $active_upstream {
    default blue;  # Change to 'green' to switch
}

server {
    listen 443 ssl;
    server_name app.example.com;

    location / {
        proxy_pass http://$active_upstream;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
Deployment Script
#!/bin/bash
set -e
CURRENT=$(cat /etc/nginx/active-env) # "blue" or "green"
TARGET=$( [ "$CURRENT" = "blue" ] && echo "green" || echo "blue" )
echo "Current: $CURRENT, deploying to: $TARGET"
# Deploy to inactive environment
ssh "deploy@$TARGET-1" "cd /opt/app && git pull && pnpm install --frozen-lockfile && pnpm build && pm2 restart all"
ssh "deploy@$TARGET-2" "cd /opt/app && git pull && pnpm install --frozen-lockfile && pnpm build && pm2 restart all"
# Wait for health checks
for i in 1 2; do
  echo "Checking $TARGET-$i health..."
  healthy=false
  for attempt in $(seq 1 30); do
    if curl -sf "http://$TARGET-$i:3000/health" > /dev/null; then
      echo "$TARGET-$i is healthy"
      healthy=true
      break
    fi
    sleep 2
  done
  if [ "$healthy" != "true" ]; then
    echo "ERROR: $TARGET-$i never became healthy, aborting before traffic switch" >&2
    exit 1
  fi
done
# Run smoke tests
pnpm test:smoke --base-url "http://$TARGET-1:3000"
# Switch traffic
sed -i "s/default $CURRENT/default $TARGET/" /etc/nginx/conf.d/app.conf
nginx -s reload
echo "$TARGET" > /etc/nginx/active-env
echo "Traffic switched to $TARGET. Rollback: change active-env back to $CURRENT"
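Rollback is the same switch in reverse. A minimal sketch, assuming the same `active-env` state file and nginx `map` config as the deploy script above (`flip_env` and `rollback` are illustrative helper names, not part of nginx):

```shell
# flip_env ENV — echo the other environment name
flip_env() {
  [ "$1" = "blue" ] && echo "green" || echo "blue"
}

# rollback STATE_FILE CONF_FILE — point the nginx map default back at the
# previous environment and record the new active environment
rollback() {
  local state_file="$1" conf_file="$2"
  local current previous
  current=$(cat "$state_file")
  previous=$(flip_env "$current")
  sed -i "s/default $current/default $previous/" "$conf_file"
  # reload nginx only where it is actually installed (skipped in dry runs)
  if command -v nginx >/dev/null 2>&1; then
    nginx -s reload || echo "warning: nginx reload failed" >&2
  fi
  echo "$previous" > "$state_file"
  echo "rolled back: traffic now on $previous"
}
```

Because the idle environment still runs the previous version, this completes in the time it takes nginx to reload its configuration.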
Rolling Update
Rolling updates replace instances incrementally, ensuring some capacity is always available.
Kubernetes Rolling Update
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 5
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # Create 1 extra pod during update
      maxUnavailable: 0   # Never reduce below desired replicas
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:v2.1.0
          readinessProbe:
            httpGet:
              path: /health
              port: 3001
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 3001
            initialDelaySeconds: 15
            periodSeconds: 10
The rolling update process with maxSurge: 1 and maxUnavailable: 0:
- Create 1 new pod with v2.1.0 (6 pods total: 5 old + 1 new)
- Wait for new pod readiness probe to pass
- Terminate 1 old pod (5 pods: 4 old + 1 new)
- Create another new pod (6 pods: 4 old + 2 new)
- Repeat until all pods are v2.1.0
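The pod-count sequence above can be sketched as a toy loop that prints the counts at each step (purely illustrative arithmetic, not a Kubernetes API call):

```shell
# Simulate the pod counts during a rolling update with maxSurge=1,
# maxUnavailable=0: total pods stay between replicas and replicas+1.
simulate_rolling_update() {
  local old="$1" new=0
  while [ "$old" -gt 0 ]; do
    new=$((new + 1))    # surge: one new pod is created
    echo "surge: $old old + $new new = $((old + new)) total"
    old=$((old - 1))    # readiness probe passed: retire one old pod
    echo "terminate: $old old + $new new = $((old + new)) total"
  done
}

simulate_rolling_update 5
```

Running it for 5 replicas shows the total oscillating between 5 and 6 pods: capacity never drops below the desired replica count.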
Canary Deployment
Traffic Splitting
# Istio VirtualService for canary routing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-canary
spec:
  hosts:
    - api.example.com
  http:
    - route:
        - destination:
            host: api-stable
            port:
              number: 3001
          weight: 95
        - destination:
            host: api-canary
            port:
              number: 3001
          weight: 5
Progressive Canary Rollout
| Phase | Canary Traffic | Duration | Success Criteria |
|---|---|---|---|
| 1 | 1% | 10 minutes | Error rate <0.1%, latency <500ms |
| 2 | 5% | 30 minutes | Error rate <0.1%, latency <500ms |
| 3 | 25% | 1 hour | Error rate <0.5%, latency <1s |
| 4 | 50% | 2 hours | Error rate <0.5%, latency <1s |
| 5 | 100% | Full rollout | Stable for 24 hours |
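A promotion controller for this schedule can be sketched as a loop. Here `set_canary_weight` and `metrics_healthy` are stand-ins for your actual mesh API (e.g. patching the VirtualService weights) and your monitoring queries; the phase durations come from the table above:

```shell
# Walk the canary through increasing traffic weights; roll back to 0% on
# the first failed metrics check. PHASE_DURATION defaults to 10 minutes.
promote_canary() {
  local weight
  for weight in 1 5 25 50 100; do
    set_canary_weight "$weight"
    sleep "${PHASE_DURATION:-600}"     # hold the phase while metrics accumulate
    if ! metrics_healthy "$weight"; then
      echo "canary failed at ${weight}% traffic, rolling back"
      set_canary_weight 0              # all traffic back to stable
      return 1
    fi
  done
  echo "canary promoted to 100%"
}
```

In practice tools like Argo Rollouts or Flagger implement this loop for you, but the control flow is the same: raise the weight, wait, check, and roll back to zero the moment a success criterion fails.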
Database Migrations Without Downtime
The biggest challenge in zero-downtime deployment is database schema changes. The old application version must work with the new schema, and vice versa.
The Expand-Contract Pattern
Phase 1: Expand (deploy schema change)
-- Add new column (nullable, no default)
ALTER TABLE orders ADD COLUMN shipping_method VARCHAR(50);
Old application code simply ignores the new column. New application code starts writing it; when a change replaces an existing column, the new code writes to both the old and the new column during the transition.
Phase 2: Migrate data
-- Backfill existing data
UPDATE orders SET shipping_method = 'standard' WHERE shipping_method IS NULL;
Phase 3: Contract (deploy code that uses new column exclusively)
After all application instances use the new column:
-- Add the default first so in-flight writes cannot violate the constraint
ALTER TABLE orders ALTER COLUMN shipping_method SET DEFAULT 'standard';
-- Then make the column required
ALTER TABLE orders ALTER COLUMN shipping_method SET NOT NULL;
Dangerous Migration Patterns
| Pattern | Risk | Safe Alternative |
|---|---|---|
| Rename column | Breaks old code | Add new column, migrate, drop old |
| Drop column | Breaks old code | Stop using, then drop in next release |
| Add NOT NULL column | Locks table | Add nullable, backfill, alter to NOT NULL |
| Change column type | Locks table, breaks queries | Add new column with new type, migrate |
| Add unique index | Locks table on large tables | CREATE INDEX CONCURRENTLY |
Automated Rollback
Error Rate Based Rollback
#!/bin/bash
# post-deploy-monitor.sh
DEPLOY_TIME=$(date +%s)
MONITOR_DURATION=300   # 5 minutes
ERROR_THRESHOLD=0.02   # 2%

while [ $(($(date +%s) - DEPLOY_TIME)) -lt $MONITOR_DURATION ]; do
  # Aggregate across all series; default to 0 if Prometheus has no data yet
  ERROR_RATE=$(curl -sG "http://prometheus:9090/api/v1/query" \
    --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[2m])) / sum(rate(http_requests_total[2m]))" \
    | jq -r '.data.result[0].value[1] // "0"')
  [ "$ERROR_RATE" = "NaN" ] && ERROR_RATE="0"
  if (( $(echo "$ERROR_RATE > $ERROR_THRESHOLD" | bc -l) )); then
    echo "ERROR: Rate $ERROR_RATE exceeds threshold $ERROR_THRESHOLD"
    echo "Initiating rollback..."
    kubectl rollout undo deployment/api
    exit 1
  fi
  sleep 15
done
echo "Deployment healthy for $MONITOR_DURATION seconds"
Frequently Asked Questions
Which strategy should we start with?
Start with blue-green deployment. It is the simplest to implement, provides instant rollback, and works with any application architecture. Rolling updates are better for Kubernetes environments with many replicas. Canary deployments are for high-traffic applications where you want to validate changes with real traffic before full rollout.
How do we handle long-running background jobs during deployment?
Use graceful shutdown. When a pod receives a termination signal, stop accepting new jobs, finish in-progress jobs (with a timeout), then shut down. In Kubernetes, configure terminationGracePeriodSeconds to allow enough time for jobs to complete. For jobs that take longer than the grace period, use a job queue (Redis, RabbitMQ) that retries failed jobs on surviving workers.
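The drain loop itself is simple. A minimal shell sketch, with `next_job` and `process_job` standing in for your real queue operations:

```shell
# On SIGTERM, flip a flag instead of exiting immediately; the worker loop
# finishes the in-flight job and then exits cleanly.
DRAINING=0
trap 'DRAINING=1' TERM

worker_loop() {
  while [ "$DRAINING" -eq 0 ]; do
    job=$(next_job) || break      # stand-in for popping your job queue
    process_job "$job"            # in-flight job always runs to completion
  done
  echo "queue drained, exiting cleanly"
}
```

The same shape applies in any language: the signal handler only sets a flag, and the worker checks that flag between jobs, never in the middle of one. Kubernetes sends SIGTERM, waits `terminationGracePeriodSeconds`, then sends SIGKILL, so the grace period must exceed your longest expected job.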
What about WebSocket connections during deployment?
WebSocket connections are long-lived and must be handled carefully. During a rolling update, existing connections on the old pod stay active until the pod terminates. Clients should implement automatic reconnection logic. For blue-green deployments, switch new connections to the new environment while allowing existing connections to drain on the old environment with a timeout.
How do we test zero-downtime deployments?
Run a load test during deployment. Use k6 or a similar tool to generate continuous traffic, then trigger a deployment. Check for any errors, increased latency, or dropped connections during the rollover. See our load testing guide for implementation details.
What Comes Next
Zero-downtime deployment is a prerequisite for frequent, confident releases. Combine it with CI/CD automation for the full deployment pipeline and monitoring for post-deployment verification.
Contact ECOSIRE for deployment strategy consulting, or explore our DevOps guide for the complete infrastructure roadmap.
Published by ECOSIRE -- helping businesses deploy without disruption.
Written by
ECOSIRE Research and Development Team