Production Monitoring and Alerting: The Complete Setup Guide
Industry estimates put the average cost of a production incident at $5,600 per minute of downtime. Companies with mature monitoring detect issues in under 5 minutes, while those without monitoring average 197 minutes to detection --- the difference between a minor blip and a customer-losing catastrophe.
This guide covers end-to-end production monitoring setup: what to measure, how to collect it, where to visualize it, and when to alert.
Key Takeaways
- The three pillars of observability (metrics, logs, traces) serve different purposes and all three are necessary
- Alert on symptoms (error rate, latency), not causes (CPU usage), to reduce noise by as much as 80%
- Runbooks attached to every alert ensure consistent incident response regardless of who is on call
- Start with 5 essential alerts and expand only when you understand the baseline
The Three Pillars of Observability
Metrics
Numeric measurements sampled over time. Metrics answer "what is happening right now?"
Application metrics:
- Request rate (requests per second)
- Error rate (5xx responses per second)
- Latency distribution (P50, P95, P99)
- Active sessions / concurrent users
Infrastructure metrics:
- CPU utilization per service
- Memory usage and garbage collection
- Disk I/O and available space
- Network throughput
Business metrics:
- Orders per minute
- Cart abandonment rate
- Revenue per hour
- API calls by endpoint
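Request and error rates are not stored as rates at all: applications export monotonically increasing counters, and the per-second figure is derived at query time. The sketch below illustrates that derivation in the spirit of PromQL's `rate()`; the type and function names are ours, not part of any Prometheus client API.

```typescript
// Sketch of how a per-second rate is derived from two scrapes of a
// monotonically increasing counter, similar in spirit to PromQL's rate().
// CounterSample and perSecondRate are illustrative names, not library APIs.
interface CounterSample {
  timestampMs: number; // when the sample was scraped
  value: number;       // cumulative count at that instant
}

function perSecondRate(earlier: CounterSample, later: CounterSample): number {
  const elapsedSec = (later.timestampMs - earlier.timestampMs) / 1000;
  if (elapsedSec <= 0) return 0;
  // A counter that went down was reset (e.g. the process restarted);
  // treat the later value as the increase since the reset.
  const increase =
    later.value >= earlier.value ? later.value - earlier.value : later.value;
  return increase / elapsedSec;
}

// 1,500 requests counted across a 15-second scrape interval -> 100 req/s
const rps = perSecondRate(
  { timestampMs: 0, value: 4000 },
  { timestampMs: 15000, value: 5500 },
);
console.log(rps); // 100
```

Handling counter resets is why dashboards should always graph `rate(...)` rather than subtracting raw counter values by hand.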
Logs
Timestamped, structured records of discrete events. Logs answer "why did it happen?"
```json
{
  "timestamp": "2026-03-16T14:32:01.234Z",
  "level": "error",
  "service": "api",
  "requestId": "req_abc123",
  "userId": "usr_456",
  "message": "Payment processing failed",
  "error": "Stripe API timeout after 30000ms",
  "endpoint": "POST /billing/checkout",
  "duration": 30142
}
```
Log best practices:
- Use structured JSON logging, not plain text
- Include correlation IDs (requestId) across services
- Log at appropriate levels (ERROR for failures, WARN for degradation, INFO for key events)
- Never log sensitive data (passwords, tokens, full credit card numbers)
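The practices above fit in a few lines. Here is a minimal structured-logging sketch for Node.js showing JSON output, a correlation ID, and field redaction; `logEvent` and `SENSITIVE_KEYS` are illustrative names, not from any logging library.

```typescript
// Minimal structured logger: JSON output, correlation IDs passed through,
// sensitive fields redacted before anything is written.
// All names here are illustrative, not from any logging library.
const SENSITIVE_KEYS = new Set(["password", "token", "authorization", "cardNumber"]);

function redact(fields: Record<string, unknown>): Record<string, unknown> {
  const safe: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    safe[key] = SENSITIVE_KEYS.has(key) ? "[REDACTED]" : value;
  }
  return safe;
}

function logEvent(
  level: "info" | "warn" | "error",
  message: string,
  fields: Record<string, unknown>,
): string {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    message,
    ...redact(fields),
  };
  const line = JSON.stringify(entry);
  console.log(line);
  return line;
}

logEvent("error", "Payment processing failed", {
  requestId: "req_abc123",
  cardNumber: "4242424242424242", // emitted as "[REDACTED]"
});
```

In production you would typically reach for an established structured logger (pino, winston) with a redaction option rather than rolling your own, but the principle is the same.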
Traces
End-to-end request paths through distributed systems. Traces answer "where is the bottleneck?"
A single user request to an eCommerce checkout might touch:
- Nginx (2ms) → Next.js frontend (50ms) → NestJS API (120ms) → PostgreSQL (45ms) → Stripe API (800ms) → email service (200ms)
Without tracing, you see "checkout takes 1.2 seconds." With tracing, you see "Stripe API accounts for 67% of checkout latency."
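The breakdown above is simple arithmetic over span durations. The toy sketch below mirrors the checkout example; `slowestSpanShare` is a hypothetical helper, not part of any tracing SDK.

```typescript
// Sketch: given per-span durations from one trace, find which span
// dominates end-to-end latency, as a tracing UI would.
// Span and slowestSpanShare are illustrative names, not a tracing API.
interface Span {
  name: string;
  durationMs: number;
}

function slowestSpanShare(spans: Span[]): { name: string; sharePct: number } {
  const total = spans.reduce((sum, s) => sum + s.durationMs, 0);
  const slowest = spans.reduce((a, b) => (b.durationMs > a.durationMs ? b : a));
  return {
    name: slowest.name,
    sharePct: Math.round((slowest.durationMs / total) * 100),
  };
}

// The checkout trace from the example above:
const checkout: Span[] = [
  { name: "nginx", durationMs: 2 },
  { name: "nextjs-frontend", durationMs: 50 },
  { name: "nestjs-api", durationMs: 120 },
  { name: "postgresql", durationMs: 45 },
  { name: "stripe-api", durationMs: 800 },
  { name: "email-service", durationMs: 200 },
];

console.log(slowestSpanShare(checkout)); // stripe-api dominates, roughly two-thirds
```

Real tracing backends (Jaeger, Tempo, OpenTelemetry collectors) do this across millions of traces, but the per-trace question they answer is exactly this one.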
Monitoring Stack Setup
Prometheus + Grafana (Self-Hosted)
```yaml
# docker-compose.monitoring.yml
services:
  prometheus:
    image: prom/prometheus:v2.50.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'

  grafana:
    image: grafana/grafana:10.3.0
    volumes:
      - grafana-data:/var/lib/grafana
    ports:
      - "3030:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
      - GF_USERS_ALLOW_SIGN_UP=false

  loki:
    image: grafana/loki:2.9.0
    volumes:
      - loki-data:/loki
    ports:
      - "3100:3100"

  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"

volumes:
  prometheus-data:
  grafana-data:
  loki-data:
```
Prometheus Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: "api"
    metrics_path: /metrics
    static_configs:
      - targets: ["api:3001"]
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
  - job_name: "postgres"
    static_configs:
      - targets: ["postgres-exporter:9187"]
  - job_name: "redis"
    static_configs:
      - targets: ["redis-exporter:9121"]
```
NestJS Application Metrics
Exposing Prometheus Metrics
```typescript
// metrics.module.ts
import { Module } from '@nestjs/common';
import {
  PrometheusModule,
  makeCounterProvider,
  makeHistogramProvider,
  makeGaugeProvider,
} from '@willsoto/nestjs-prometheus';

@Module({
  imports: [
    PrometheusModule.register({
      path: '/metrics',
      defaultMetrics: { enabled: true },
    }),
  ],
  providers: [
    makeCounterProvider({
      name: 'http_requests_total',
      help: 'Total HTTP requests',
      labelNames: ['method', 'path', 'status'],
    }),
    makeHistogramProvider({
      name: 'http_request_duration_seconds',
      help: 'HTTP request duration in seconds',
      labelNames: ['method', 'path'],
      buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
    }),
    makeGaugeProvider({
      name: 'active_connections',
      help: 'Number of active connections',
    }),
  ],
  exports: [PrometheusModule],
})
export class MetricsModule {}
```
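The `http_request_duration_seconds` histogram above is what powers P95 queries and alerts. As a rough illustration of how PromQL's `histogram_quantile()` recovers a percentile from cumulative buckets, here is a simplified pure-TypeScript sketch; the bucket data and function names are illustrative, and real Prometheus additionally handles the `+Inf` bucket and per-series aggregation.

```typescript
// Sketch of how a quantile (e.g. P95) is estimated from cumulative
// histogram buckets, approximating what PromQL's histogram_quantile() does:
// locate the bucket containing the target rank, then interpolate linearly
// within it. Simplified; Bucket and estimateQuantile are illustrative names.
interface Bucket {
  le: number;              // upper bound, as in the `buckets` config above
  cumulativeCount: number; // number of observations <= le
}

function estimateQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].cumulativeCount;
  const rank = q * total;
  let prevBound = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.cumulativeCount >= rank) {
      const inBucket = b.cumulativeCount - prevCount;
      if (inBucket === 0) return b.le;
      // Linear interpolation within the bucket, as Prometheus does.
      return prevBound + ((rank - prevCount) / inBucket) * (b.le - prevBound);
    }
    prevBound = b.le;
    prevCount = b.cumulativeCount;
  }
  return buckets[buckets.length - 1].le;
}

// 100 requests: 90 completed under 0.25s, all under 0.5s
const p95 = estimateQuantile(0.95, [
  { le: 0.1, cumulativeCount: 60 },
  { le: 0.25, cumulativeCount: 90 },
  { le: 0.5, cumulativeCount: 100 },
]);
console.log(`estimated P95: ${p95.toFixed(3)}s`); // estimated P95: 0.375s
```

This interpolation is also why bucket boundaries matter: a P95 that falls inside a wide bucket is estimated coarsely, so buckets should be dense around your latency targets.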
Alert Configuration
The Five Essential Alerts
Every production system needs these alerts from day one:
```yaml
# alerts/essential.yml
groups:
  - name: essential
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          runbook: "https://wiki.example.com/runbooks/service-down"

      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 5 minutes"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 2 seconds"

      - alert: DiskSpaceLow
        expr: |
          node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"} < 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk space below 20% on {{ $labels.instance }}"

      - alert: SSLCertExpiringSoon
        expr: |
          probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expires within 14 days"
```
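The `for:` clauses above mean an alert fires only after its expression has been continuously true for the stated duration, which suppresses transient spikes. A rough state-machine sketch of that behavior (illustrative names only, not Prometheus internals):

```typescript
// Sketch of Prometheus's `for:` semantics: an alert moves from inactive
// to pending when its condition first becomes true, and only fires once
// the condition has held for the full duration. A single false evaluation
// resets the timer. makeForEvaluator is an illustrative name.
function makeForEvaluator(forSeconds: number) {
  let pendingSince: number | null = null;
  return function evaluate(
    conditionTrue: boolean,
    nowSec: number,
  ): "inactive" | "pending" | "firing" {
    if (!conditionTrue) {
      pendingSince = null; // any dip resets the clock
      return "inactive";
    }
    if (pendingSince === null) pendingSince = nowSec;
    return nowSec - pendingSince >= forSeconds ? "firing" : "pending";
  };
}

// HighErrorRate with `for: 5m` (300s), evaluated on a 15s interval:
const evalHighErrorRate = makeForEvaluator(300);
console.log(evalHighErrorRate(true, 0));    // "pending" - condition just became true
console.log(evalHighErrorRate(true, 150));  // "pending" - only 150s elapsed
console.log(evalHighErrorRate(false, 165)); // "inactive" - a dip resets the timer
console.log(evalHighErrorRate(true, 180));  // "pending" - timer restarted
console.log(evalHighErrorRate(true, 480));  // "firing" - 300s of sustained breach
```

This is why `ServiceDown` uses a short `for: 1m` (an outage should page fast) while `DiskSpaceLow` tolerates 10 minutes of breach before alerting.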
Alert Routing
```yaml
# alertmanager.yml
global:
  slack_api_url: "${SLACK_WEBHOOK_URL}"

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      repeat_interval: 15m
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.summary }}'
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: "${PAGERDUTY_KEY}"
        severity: '{{ .GroupLabels.severity }}'
  - name: 'slack'
    slack_configs:
      - channel: '#alerts-warnings'
        title: '{{ .GroupLabels.alertname }}'
```
Alert Quality Rules
| Practice | Why |
|---|---|
| Alert on symptoms, not causes | "Error rate high" is actionable; "CPU at 80%" may not be |
| Every alert has a runbook | On-call engineers should not need to think at 3 AM |
| Alerts must be actionable | If no one can act on it, it is noise, not an alert |
| Tune thresholds after 2 weeks | Initial thresholds are guesses; adjust based on baselines |
| Review alert fatigue monthly | If alerts fire daily without action, raise thresholds or remove them |
Grafana Dashboards
Dashboard Hierarchy
- Overview dashboard: High-level health across all services. This is the first screen anyone looks at during an incident.
- Service dashboards: Detailed metrics for each service (API, web, workers).
- Infrastructure dashboards: Node-level metrics (CPU, memory, disk, network).
- Business dashboards: Revenue, orders, user activity.
The RED Method for Service Dashboards
For every service, display:
- Rate: Requests per second
- Errors: Error rate as a percentage
- Duration: Latency distribution (P50, P95, P99)
This provides instant visibility into service health without cognitive overload.
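The three RED numbers can be derived from raw request records with a few lines of arithmetic, which is all a dashboard panel ultimately does. The sketch below uses illustrative names (`RequestRecord`, `redSummary`) and a simple nearest-rank percentile, not any monitoring library's API.

```typescript
// Sketch: deriving Rate, Errors, and Duration for a service from raw
// request records over a time window. Names are illustrative.
interface RequestRecord {
  durationMs: number;
  status: number; // HTTP status code
}

function redSummary(requests: RequestRecord[], windowSeconds: number) {
  const sorted = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  const errors = requests.filter((r) => r.status >= 500).length;
  // Nearest-rank percentile over the sorted durations.
  const pct = (q: number) =>
    sorted[Math.min(sorted.length - 1, Math.floor(q * sorted.length))];
  return {
    ratePerSec: requests.length / windowSeconds,    // Rate
    errorRatePct: (errors / requests.length) * 100, // Errors
    p50Ms: pct(0.5),                                // Duration
    p95Ms: pct(0.95),
  };
}

// 4 requests over a 2-second window, one of them a 500:
console.log(
  redSummary(
    [
      { durationMs: 20, status: 200 },
      { durationMs: 35, status: 200 },
      { durationMs: 40, status: 500 },
      { durationMs: 900, status: 200 },
    ],
    2,
  ),
); // rate 2/s, 25% errors, P50 = 40ms
```

In practice these figures come from the counter and histogram metrics shown earlier rather than raw records, but seeing the arithmetic makes the dashboard panels easier to reason about.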
Error Tracking with Sentry
```typescript
// sentry.config.ts
import * as Sentry from '@sentry/nestjs';

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  tracesSampleRate: 0.1,
  profilesSampleRate: 0.1,
  integrations: [Sentry.postgresIntegration()],
  beforeSend(event) {
    // Strip sensitive data before events leave the server
    if (event.request?.headers) {
      delete event.request.headers['authorization'];
      delete event.request.headers['cookie'];
    }
    return event;
  },
});
```
Sentry provides:
- Automatic error grouping and deduplication
- Stack traces with source maps
- Release tracking (which deploy introduced the error)
- Performance monitoring (transaction traces)
Frequently Asked Questions
How much does a monitoring stack cost?
Self-hosted (Prometheus + Grafana + Loki): approximately $50-100/month for the hosting resources. Managed alternatives: Datadog starts at $15/host/month for infrastructure, plus $0.10/GB for logs. Sentry cloud is $26/month for the team plan. A reasonable starting budget for a small business is $100-200/month total.
What is the difference between monitoring and observability?
Monitoring tells you when something is wrong. Observability tells you why. Monitoring is about predefined dashboards and alerts for known failure modes. Observability is about the ability to ask arbitrary questions about your system's behavior using metrics, logs, and traces. You need both, but monitoring is the foundation.
How do we avoid alert fatigue?
Three rules: (1) every alert must require human action, (2) set thresholds based on actual baselines not theoretical ideals, (3) review and tune alerts monthly. If an alert fires more than once per week without requiring action, either fix the underlying issue or raise the threshold. Teams suffering from alert fatigue ignore all alerts, including the critical ones.
Should we monitor our ERP system differently?
ERP systems have unique monitoring requirements. Beyond standard web metrics, monitor: database connection pool usage, background job queue depth, integration sync status (Shopify, payment gateways), scheduled report execution time, and user session counts by module. ECOSIRE provides managed Odoo monitoring as part of our support packages.
What Comes Next
Monitoring is the eyes and ears of your production infrastructure. Pair it with CI/CD automation for deployment confidence and disaster recovery planning for resilience. For a comprehensive DevOps roadmap, see our DevOps guide for small businesses.
Contact ECOSIRE for monitoring setup and managed infrastructure services.
Published by ECOSIRE -- helping businesses see what matters in production.
Written by
ECOSIRE Research and Development Team
Building enterprise-grade digital products at ECOSIRE. Sharing insights on Odoo integrations, eCommerce automation, and AI-powered business solutions.