Infrastructure Scaling: Horizontal vs Vertical, Load Balancing & Auto-Scaling
Netflix serves 260 million subscribers across 190 countries by dynamically scaling thousands of instances based on real-time demand. While most businesses do not operate at Netflix scale, the scaling principles are the same: match infrastructure capacity to traffic demand automatically, without manual intervention and without paying for idle resources. The choices you make between horizontal and vertical scaling, L4 and L7 load balancers, and reactive and predictive auto-scaling determine whether your platform grows gracefully or hits a wall.
Key Takeaways
- Start vertical (bigger machines) for simplicity, then switch to horizontal (more machines) when you need high availability or exceed single-machine limits
- L7 load balancers provide content-based routing essential for modern applications, while L4 balancers offer raw throughput for simple TCP distribution
- Auto-scaling policies should combine metric-based triggers (CPU, memory) with predictive scaling for known traffic patterns
- Database scaling follows different rules than application scaling -- read replicas for read-heavy loads, partitioning for write-heavy loads
Horizontal vs Vertical Scaling
The fundamental scaling decision is whether to make existing machines bigger (vertical) or add more machines (horizontal). Each approach has distinct tradeoffs.
| Factor | Vertical Scaling | Horizontal Scaling |
|---|---|---|
| Implementation complexity | Low -- upgrade instance type | High -- requires stateless design, load balancing |
| Maximum ceiling | Limited by largest available machine | Practically unlimited |
| High availability | Single point of failure | Redundancy built in |
| Cost efficiency | Cost-effective up to mid-range | Cost-effective at scale |
| Downtime for scaling | Usually requires restart | Zero downtime (add/remove instances) |
| Data consistency | Simple (single machine) | Requires distributed coordination |
| Best for | Databases, cache servers | Application servers, web servers |
When to Scale Vertically
Vertical scaling is the right first choice for several reasons. It requires no application changes, no load balancer configuration, and no distributed state management. For databases especially, vertical scaling avoids the complexity of sharding, replication lag, and distributed transactions.
Scale vertically when:
- Your application is not yet stateless (sessions stored in memory, file system writes)
- You are running a single database and have not hit connection or CPU limits
- The next instance size up is cheaper than the engineering time to go horizontal
- You need to scale immediately without architectural changes
Stop scaling vertically when:
- You need high availability (single instance = single point of failure)
- The largest available instance is not enough
- You are paying for peak capacity that sits idle 90% of the time
- You need geographic distribution for latency
When to Scale Horizontally
Horizontal scaling requires your application to be stateless -- any request can be handled by any instance. This means:
- Sessions stored in Redis or database, not in-memory
- File uploads stored in object storage (S3), not local disk
- No instance-specific configuration or local caching without replication
- Graceful handling of instance termination (health checks, connection draining)
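The first requirement, externalized sessions, is the one that most often blocks horizontal scaling. A minimal sketch of the idea (in production this would wrap a Redis client; a dict with TTL semantics stands in here so the example is self-contained):

```python
import json
import time
import uuid

class SessionStore:
    """External session store so any instance can serve any request.
    A dict stands in for Redis here; the TTL behaviour mimics Redis key expiry."""

    def __init__(self, ttl_seconds=3600):
        self._data = {}          # session_id -> (expires_at, serialized payload)
        self._ttl = ttl_seconds

    def create(self, payload: dict) -> str:
        session_id = uuid.uuid4().hex
        self._data[session_id] = (time.time() + self._ttl, json.dumps(payload))
        return session_id

    def get(self, session_id: str):
        entry = self._data.get(session_id)
        if entry is None:
            return None
        expires_at, payload = entry
        if time.time() > expires_at:   # expired: behave like a Redis TTL
            del self._data[session_id]
            return None
        return json.loads(payload)

store = SessionStore()
sid = store.create({"user_id": 42})
print(store.get(sid))  # any instance holding the session ID can recover it
```

Because the session lives outside the process, the load balancer is free to send each request to a different instance, and instances can be terminated without losing logged-in users.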
Scale horizontally when:
- High availability is a requirement (your business cannot tolerate minutes of downtime)
- Traffic is variable and auto-scaling can save cost by scaling down during quiet periods
- You need to deploy without downtime (rolling deploys across instances)
- Performance requirements exceed single-machine capacity
Load Balancing Deep Dive
A load balancer distributes incoming traffic across multiple backend servers. The type of load balancer determines what routing decisions are possible.
L4 (Transport Layer) Load Balancing
L4 load balancers operate at the TCP/UDP level. They route connections based on IP address and port without inspecting packet contents. They are fast, simple, and handle any TCP-based protocol.
Best for: Raw TCP distribution, database connection proxying (PgBouncer), non-HTTP protocols, extremely high throughput requirements
Limitations: Cannot make routing decisions based on URL path, headers, or cookies. Typically operates in pass-through mode, leaving SSL termination to the backends (though some managed L4 balancers, such as AWS NLB, do offer TLS termination).
L7 (Application Layer) Load Balancing
L7 load balancers operate at the HTTP level. They inspect request headers, URLs, cookies, and even request bodies to make routing decisions. They handle SSL termination, compression, and can modify requests and responses.
Best for: Web applications, API gateways, content-based routing, SSL termination, A/B testing, canary deployments
| Feature | L4 Load Balancer | L7 Load Balancer |
|---|---|---|
| Routing granularity | IP and port | URL, headers, cookies, method |
| SSL termination | Typically no (pass-through) | Yes |
| WebSocket support | Pass-through | Full support with upgrade |
| Health checks | TCP connection check | HTTP endpoint check with status code |
| Request modification | No | Yes (add headers, rewrite URLs) |
| Throughput | Higher (less processing) | Lower (deeper inspection) |
| Cost | Lower | Higher |
| Example (AWS) | Network Load Balancer (NLB) | Application Load Balancer (ALB) |
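To make "content-based routing" concrete, here is a sketch of the kind of decision an L7 balancer makes per request. The pool names, the path prefix, and the canary header are invented for illustration:

```python
# Hypothetical routing rules, evaluated in order -- first match wins.
ROUTES = [
    ("path_prefix", "/api/", "api-pool"),
    ("header", ("X-Canary", "true"), "canary-pool"),
]
DEFAULT_POOL = "web-pool"

def route(request: dict) -> str:
    """L7-style routing: inspect the path and headers, pick a backend pool.
    An L4 balancer cannot do this -- it only sees IPs and ports."""
    for kind, match, pool in ROUTES:
        if kind == "path_prefix" and request["path"].startswith(match):
            return pool
        if kind == "header":
            name, value = match
            if request.get("headers", {}).get(name) == value:
                return pool
    return DEFAULT_POOL

print(route({"path": "/api/orders"}))                         # api-pool
print(route({"path": "/", "headers": {"X-Canary": "true"}}))  # canary-pool
print(route({"path": "/about"}))                              # web-pool
```

The same first-match rule structure is how ALB listener rules and most reverse proxies (nginx, Envoy) express canary and path-based routing.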
Load Balancing Algorithms
| Algorithm | How It Works | Best For |
|---|---|---|
| Round robin | Requests distributed evenly in rotation | Homogeneous servers with similar capacity |
| Weighted round robin | Servers receive traffic proportional to assigned weights | Mixed server sizes |
| Least connections | Routes to server with fewest active connections | Long-lived connections, varying request duration |
| IP hash | Routes based on client IP hash (sticky sessions) | Stateful applications needing session affinity |
| Least response time | Routes to server with fastest average response time | Heterogeneous performance characteristics |
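The first three algorithms fit in a few lines each. A sketch with invented server names, weights, and connection counts:

```python
import itertools

servers = ["app-1", "app-2", "app-3"]

# Round robin: rotate through the servers in order.
rr = itertools.cycle(servers)

# Weighted round robin: repeat each server proportionally to its weight
# (app-1 is assumed to be a bigger box, so it gets 3x the traffic).
weights = {"app-1": 3, "app-2": 1, "app-3": 1}
wrr = itertools.cycle([s for s, w in weights.items() for _ in range(w)])

# Least connections: track active connections per server, pick the minimum.
active = {"app-1": 12, "app-2": 4, "app-3": 9}

def least_connections() -> str:
    return min(active, key=active.get)

print(next(rr), next(rr))   # app-1 app-2
print(least_connections())  # app-2
```

Real balancers additionally keep these counters consistent across worker processes and decay them as connections close, which is where most of the implementation effort actually goes.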
Health Checks and Graceful Degradation
Health checks determine whether a backend server should receive traffic. Configure them carefully:
- Shallow health check -- a simple TCP connection check or HTTP 200 on a dedicated endpoint. Catches server crashes but not application-level failures.
- Deep health check -- verifies database connectivity, Redis availability, and external service reachability. Catches more issues but risks pulling healthy instances out of rotation when a non-critical dependency is down.
- Grace period -- new instances need time to warm up (JIT compilation, cache population). Set a startup grace period before the load balancer sends full traffic.
- Draining -- when removing an instance, stop sending new requests but allow existing requests to complete (connection draining). Typically 30-60 seconds.
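The shallow/deep distinction can be sketched as two endpoint handlers. The dependency probes here are stand-in callables; real checks would open a connection with a short timeout:

```python
def shallow_health() -> dict:
    """Shallow check: the process is up and can answer HTTP at all."""
    return {"status": 200}

def deep_health(check_db, check_cache) -> dict:
    """Deep check: verify dependencies the instance cannot serve without.
    Note the tradeoff from the text: failing on a non-critical dependency
    pulls an otherwise healthy instance out of rotation, so only include
    truly critical ones here."""
    failures = []
    if not check_db():
        failures.append("database")
    if not check_cache():
        failures.append("cache")
    status = 200 if not failures else 503
    return {"status": status, "failing": failures}

# Stand-in probes: database reachable, cache down.
print(deep_health(lambda: True, lambda: False))
# {'status': 503, 'failing': ['cache']}
```

A common compromise is to expose both endpoints and point the load balancer at the shallow one, while alerting (but not de-registering) on the deep one.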
Auto-Scaling Policies
Auto-scaling adjusts the number of instances based on demand, matching capacity to traffic without manual intervention.
Metric-Based Scaling
The most common approach triggers scaling actions when a metric crosses a threshold.
| Metric | Scale Up Threshold | Scale Down Threshold | Considerations |
|---|---|---|---|
| CPU utilization | Above 70% for 3 minutes | Below 30% for 10 minutes | Most common, works well for compute-bound workloads |
| Memory utilization | Above 80% for 3 minutes | Below 40% for 10 minutes | Important for memory-intensive applications |
| Request count | Above 1000 req/s per instance | Below 300 req/s per instance | Good for predictable request-bound workloads |
| Queue depth | Above 100 messages | Below 10 messages | Perfect for background job workers |
| Response time (P95) | Above 500ms | Below 100ms | User experience-focused scaling |
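Behind threshold triggers, target-tracking policies (the style AWS recommends) compute a desired capacity proportionally, so the per-instance metric returns to its target. A sketch with illustrative numbers:

```python
import math

def desired_capacity(current_instances: int, metric_value: float,
                     target_value: float,
                     min_instances: int = 2, max_instances: int = 20) -> int:
    """Target-tracking style calculation: scale the fleet so the
    per-instance metric lands back on target, then clamp to the
    configured min/max. Thresholds and limits here are illustrative."""
    desired = math.ceil(current_instances * metric_value / target_value)
    return max(min_instances, min(max_instances, desired))

# 4 instances averaging 90% CPU against a 70% target -> scale out to 6.
print(desired_capacity(4, 90, 70))   # 6
# Quiet period: 4 instances at 20% CPU -> the min limit stops scale-in at 2.
print(desired_capacity(4, 20, 70))   # 2
```

The min/max clamp is what implements the "set minimum and maximum limits" practice below: without it, a metrics outage reporting zero load could scale the fleet to nothing.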
Predictive Scaling
If your traffic follows predictable patterns (daily peaks, weekly cycles, seasonal events), predictive scaling provisions capacity before the traffic arrives. AWS Auto Scaling supports predictive scaling that learns from historical patterns and scales proactively.
Combine predictive and reactive: Use predictive scaling for known patterns (morning traffic ramp, Black Friday pre-provisioning) and reactive scaling for unexpected surges.
Scaling Best Practices
- Scale out fast, scale in slow -- add instances aggressively (1-2 minute cooldown) but remove them gradually (10-15 minute cooldown) to avoid flapping
- Use multiple metrics -- scale on CPU OR memory OR request count, using the first metric that triggers
- Set minimum and maximum limits -- prevent scaling to zero (no availability) or scaling indefinitely (cost explosion)
- Test scaling during load tests -- verify that auto-scaling triggers correctly and new instances serve traffic within the expected time frame
- Monitor scaling events -- alert on frequent scaling to detect configuration issues or underlying performance problems
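The "scale out fast, scale in slow" practice comes down to asymmetric cooldowns. A sketch of the gating logic (the 120s/600s windows are illustrative, matching the 1-2 minute and 10-15 minute guidance above):

```python
import time

class CooldownGate:
    """Allow scale-out again after a short window, but wait much longer
    before scale-in, so the fleet does not flap around the threshold."""

    def __init__(self, out_cooldown=120, in_cooldown=600, clock=time.monotonic):
        self.out_cooldown = out_cooldown
        self.in_cooldown = in_cooldown
        self.clock = clock
        self.last_action_at = float("-inf")   # no action taken yet

    def allow(self, direction: str) -> bool:
        cooldown = self.out_cooldown if direction == "out" else self.in_cooldown
        if self.clock() - self.last_action_at >= cooldown:
            self.last_action_at = self.clock()
            return True
        return False

# Fake clock so the behaviour is visible without actually waiting.
now = [0]
gate = CooldownGate(clock=lambda: now[0])
print(gate.allow("out"))  # True  (first action is always allowed)
now[0] = 130
print(gate.allow("out"))  # True  (past the 120s scale-out cooldown)
now[0] = 400
print(gate.allow("in"))   # False (still inside the 600s scale-in cooldown)
```

Managed auto-scalers (AWS ASG, Kubernetes HPA) expose the same idea as per-direction cooldown or stabilization-window settings rather than code.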
Database Scaling Strategies
Databases do not scale horizontally the same way application servers do. Write operations require consensus, and strong consistency limits distribution options.
Read Replicas
Read replicas copy data from the primary database and serve read queries. They scale read throughput linearly but do not help with write throughput.
Implementation considerations:
- Replication lag -- replicas are eventually consistent. Queries immediately after a write may not see the change. Use the primary for reads-after-writes.
- Connection routing -- your application needs logic to route reads to replicas and writes to the primary. ORMs and connection proxies (ProxySQL, PgBouncer) can automate this.
- Failover -- if the primary fails, a replica can be promoted. Automated failover (AWS RDS Multi-AZ, AWS Aurora) reduces downtime to seconds.
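The routing and replication-lag considerations combine into a small amount of logic. A sketch of a read/write router with read-after-write pinning (the names and the 2-second pin window are illustrative; libraries and proxies implement the same idea):

```python
import itertools
import time

class ConnectionRouter:
    """Route writes to the primary and reads to replicas, pinning reads
    back to the primary for a short window after a write so that
    reads-after-writes never observe replication lag."""

    def __init__(self, primary, replicas, pin_seconds=2.0, clock=time.monotonic):
        self.primary = primary
        self.replicas = itertools.cycle(replicas)  # round-robin the replicas
        self.pin_seconds = pin_seconds
        self.clock = clock
        self.last_write_at = float("-inf")

    def for_write(self):
        self.last_write_at = self.clock()
        return self.primary

    def for_read(self):
        # Recent writers read from the primary to see their own writes.
        if self.clock() - self.last_write_at < self.pin_seconds:
            return self.primary
        return next(self.replicas)

now = [0.0]
router = ConnectionRouter("primary", ["replica-1", "replica-2"],
                          clock=lambda: now[0])
print(router.for_read())   # replica-1 (no recent write)
print(router.for_write())  # primary
print(router.for_read())   # primary   (pinned right after the write)
now[0] = 5.0
print(router.for_read())   # replica-2 (pin expired, back to replicas)
```

In production the pin is usually tracked per session or per user rather than globally, otherwise one busy writer pins every reader to the primary.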
Table Partitioning
For write-heavy workloads on large tables, partitioning divides a table into smaller physical chunks while maintaining a single logical interface. For detailed partitioning strategies, see our database optimization guide.
Connection Pooling
Database connections are expensive to create and limited in number. Connection poolers like PgBouncer pool connections from many application instances into a smaller number of database connections.
Without pooling: 20 application instances x 20 connections each = 400 database connections (well past PostgreSQL's default max_connections of 100)
With PgBouncer: 20 application instances connect to PgBouncer, which maintains 50-100 connections to PostgreSQL, multiplexing requests efficiently.
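The sizing arithmetic can be written down directly. A transaction-mode pooler only needs backend connections for queries actually executing at once, not for every idle client (the 1.25 headroom factor below is an illustrative assumption, not a PgBouncer default):

```python
import math

def db_connections_without_pooler(app_instances: int,
                                  pool_per_instance: int) -> int:
    """Every app instance opens its own pool straight to the database."""
    return app_instances * pool_per_instance

def pooler_backend_pool(peak_concurrent_queries: int,
                        headroom: float = 1.25) -> int:
    """With a transaction-mode pooler in front, the database only sees
    the pooler's backend pool: peak concurrency plus some headroom."""
    return math.ceil(peak_concurrent_queries * headroom)

print(db_connections_without_pooler(20, 20))  # 400
print(pooler_backend_pool(60))                # 75
```

The caveat is that transaction-mode pooling breaks session-level features (prepared statements, advisory locks, SET commands), so the pool mode has to match how the application uses the database.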
Microservices Decomposition
When a monolith becomes too large for a single team to develop and deploy efficiently, microservices decomposition allows independent scaling of different components.
When to Decompose
Do not start with microservices. Start with a well-structured monolith and decompose when:
- Different components have vastly different scaling requirements (search needs 20 instances, checkout needs 5)
- Different teams need to deploy independently without coordinating release schedules
- A specific component needs a different technology stack (machine learning in Python, API in Node.js)
- The monolith's deploy time exceeds 30 minutes due to codebase size
What to Extract First
| Service | Why Extract | Scaling Benefit |
|---|---|---|
| Image/file processing | CPU-intensive, bursty | Scale workers independently, use spot instances |
| Search | Memory-intensive, read-heavy | Dedicated search cluster (Elasticsearch/Meilisearch) |
| Notification service | External API-dependent, latency-tolerant | Queue-based, independent scaling |
| Payment processing | Security-sensitive, compliance requirements | Isolated security boundary, independent auditing |
| Reporting/analytics | Data-intensive, scheduled | Run on cheaper instances during off-peak hours |
Frequently Asked Questions
How do I know when I need to scale?
Monitor four key metrics: CPU utilization (consistently above 70%), memory usage (above 80%), response time P95 (above your SLO), and error rate (above 0.1%). When any metric consistently breaches its threshold, you need to scale. Proactive monitoring with alerting catches these trends before users notice. See our monitoring guide.
Is auto-scaling or pre-provisioning more cost-effective?
Auto-scaling is more cost-effective for unpredictable traffic because you only pay for capacity when you need it. Pre-provisioning is better for predictable peaks (Black Friday, daily rushes) because auto-scaling takes 3-10 minutes to add capacity. The most cost-effective approach combines both: pre-provision for expected peaks, auto-scale for unexpected surges, and use reserved instances for your baseline capacity. See our cloud cost optimization guide.
Should I use containers (Docker/Kubernetes) or traditional VMs?
Containers start faster (seconds vs minutes), use resources more efficiently (higher density per host), and provide consistent environments across development and production. Kubernetes adds orchestration (auto-scaling, self-healing, rolling deploys) but significant operational complexity. Start with managed container services (AWS ECS, Google Cloud Run) before considering Kubernetes.
How do I handle database failover without data loss?
Use synchronous replication for zero data loss failover -- the primary does not acknowledge a write until the replica confirms it. This adds write latency (typically 1-5ms within the same region) but guarantees no data loss during failover. AWS RDS Multi-AZ and Aurora provide managed synchronous replication with automatic failover.
What's Next
Assess your current infrastructure against your growth projections. If you are running a single server, ensure your application is stateless and ready for horizontal scaling. If you already run multiple instances, optimize your load balancer configuration and implement auto-scaling policies.
For the complete performance engineering perspective, see our pillar guide on scaling your business platform. To optimize costs as you scale, read our guide on cloud cost optimization.
ECOSIRE designs and implements scalable infrastructure for business platforms on AWS and multi-cloud environments. Contact our DevOps team for an infrastructure scaling assessment.
Published by ECOSIRE -- helping businesses scale with AI-powered solutions across Odoo ERP, Shopify eCommerce, and OpenClaw AI.
Written by
ECOSIRE Research and Development Team
Building enterprise-grade digital products at ECOSIRE. Sharing insights on Odoo integrations, e-commerce automation, and AI business solutions.
Related Articles
Cost Optimization: Reducing Cloud Infrastructure Spend by 40%
Cut cloud costs by 30-40% with reserved instances, right-sizing, storage tiering, and data transfer optimization. Practical AWS cost reduction strategies.
Cloud Security Posture Management: AWS, Azure & GCP Best Practices
Secure your cloud infrastructure with CSPM best practices for AWS, Azure, and GCP covering IAM, encryption, network security, logging, and compliance automation.
eCommerce Scaling Case Study: 10x Revenue with Multi-Channel Integration
How a DTC brand scaled from $500K to $5M in 18 months by integrating Shopify with 8 marketplace channels and an Odoo ERP backend.