Production Monitoring and Alerting: The Complete Setup Guide

Set up production monitoring and alerting with Prometheus, Grafana, and Sentry. Covers metrics, logs, traces, alert policies, and incident response workflows.

ECOSIRE Research and Development Team
March 16, 2026 · 7 min read · 1.4k words


Part of our Performance & Scalability series



The average production incident costs $5,600 per minute of downtime. Companies with mature monitoring detect issues in under 5 minutes, while those without monitoring average 197 minutes to detection: the difference between a minor blip and a customer-losing catastrophe.

This guide covers end-to-end production monitoring setup: what to measure, how to collect it, where to visualize it, and when to alert.

Key Takeaways

  • The three pillars of observability (metrics, logs, traces) serve different purposes and all three are necessary
  • Alert on symptoms (error rate, latency), not causes (CPU usage), to reduce noise by 80%
  • Runbooks attached to every alert ensure consistent incident response regardless of who is on call
  • Start with 5 essential alerts and expand only when you understand the baseline

The Three Pillars of Observability

Metrics

Numeric measurements sampled over time. Metrics answer "what is happening right now?"

Application metrics:

  • Request rate (requests per second)
  • Error rate (5xx responses per second)
  • Latency distribution (P50, P95, P99)
  • Active sessions / concurrent users

Infrastructure metrics:

  • CPU utilization per service
  • Memory usage and garbage collection
  • Disk I/O and available space
  • Network throughput

Business metrics:

  • Orders per minute
  • Cart abandonment rate
  • Revenue per hour
  • API calls by endpoint

Logs

Timestamped, structured records of discrete events. Logs answer "why did it happen?"

{
  "timestamp": "2026-03-16T14:32:01.234Z",
  "level": "error",
  "service": "api",
  "requestId": "req_abc123",
  "userId": "usr_456",
  "message": "Payment processing failed",
  "error": "Stripe API timeout after 30000ms",
  "endpoint": "POST /billing/checkout",
  "duration": 30142
}

Log best practices:

  • Use structured JSON logging, not plain text
  • Include correlation IDs (requestId) across services
  • Log at appropriate levels (ERROR for failures, WARN for degradation, INFO for key events)
  • Never log sensitive data (passwords, tokens, full credit card numbers)
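The practices above can be sketched in a minimal structured logger. This is illustrative, not a specific library's API; field names like `requestId` follow the example entry above, and the sensitive-key list is an assumption you would tailor to your domain:

```typescript
// Minimal structured JSON logger sketch (illustrative, no library assumed).
type LogLevel = 'info' | 'warn' | 'error';

// Keys that must never reach stdout or a log shipper (assumed list).
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'cardNumber']);

// Replace sensitive values before serialization.
function redact(fields: Record<string, unknown>): Record<string, unknown> {
  const clean: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    clean[key] = SENSITIVE_KEYS.has(key) ? '[REDACTED]' : value;
  }
  return clean;
}

function log(level: LogLevel, message: string, fields: Record<string, unknown> = {}): string {
  const entry = JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    ...redact(fields), // correlation IDs pass through, secrets do not
  });
  console.log(entry);
  return entry;
}

// The requestId travels with every entry so the request can be traced across services.
log('error', 'Payment processing failed', {
  requestId: 'req_abc123',
  token: 'secret-value', // will appear as [REDACTED]
});
```

In production you would use an established structured logger rather than rolling your own, but the redact-before-serialize step is the part teams most often forget.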

Traces

End-to-end request paths through distributed systems. Traces answer "where is the bottleneck?"

A single user request to an eCommerce checkout might touch:

  Nginx (2ms) → Next.js frontend (50ms) → NestJS API (120ms) → PostgreSQL (45ms) → Stripe API (800ms) → email service (200ms)

Without tracing, you see "checkout takes 1.2 seconds." With tracing, you see "Stripe API accounts for 67% of checkout latency."
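The arithmetic a tracing backend does for you can be sketched directly. This toy example treats the spans above as sequential (real traces nest spans, so naive summing can double-count); service names and durations mirror the checkout example:

```typescript
// Sketch: find the dominant contributor among per-service span durations.
interface Span {
  service: string;
  durationMs: number;
}

function slowestShare(spans: Span[]): { service: string; sharePct: number } {
  const total = spans.reduce((sum, s) => sum + s.durationMs, 0);
  const slowest = spans.reduce((a, b) => (b.durationMs > a.durationMs ? b : a));
  // Percentage of total trace time spent in the slowest service.
  return { service: slowest.service, sharePct: Math.round((slowest.durationMs / total) * 100) };
}

// Durations from the checkout example above.
const trace: Span[] = [
  { service: 'nginx', durationMs: 2 },
  { service: 'next-frontend', durationMs: 50 },
  { service: 'nest-api', durationMs: 120 },
  { service: 'postgres', durationMs: 45 },
  { service: 'stripe', durationMs: 800 },
  { service: 'email', durationMs: 200 },
];

console.log(slowestShare(trace)); // stripe dominates the trace
```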


Monitoring Stack Setup

Prometheus + Grafana (Self-Hosted)

# docker-compose.monitoring.yml
services:
  prometheus:
    image: prom/prometheus:v2.50.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'

  grafana:
    image: grafana/grafana:10.3.0
    volumes:
      - grafana-data:/var/lib/grafana
    ports:
      - "3030:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
      - GF_USERS_ALLOW_SIGN_UP=false

  loki:
    image: grafana/loki:2.9.0
    volumes:
      - loki-data:/loki
    ports:
      - "3100:3100"

  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"

volumes:
  prometheus-data:
  grafana-data:
  loki-data:

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: "api"
    metrics_path: /metrics
    static_configs:
      - targets: ["api:3001"]

  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]

  - job_name: "postgres"
    static_configs:
      - targets: ["postgres-exporter:9187"]

  - job_name: "redis"
    static_configs:
      - targets: ["redis-exporter:9121"]

NestJS Application Metrics

Exposing Prometheus Metrics

// metrics.module.ts
import { Module } from '@nestjs/common';
import { PrometheusModule } from '@willsoto/nestjs-prometheus';
import {
  makeCounterProvider,
  makeHistogramProvider,
  makeGaugeProvider,
} from '@willsoto/nestjs-prometheus';

@Module({
  imports: [
    PrometheusModule.register({
      path: '/metrics',
      defaultMetrics: { enabled: true },
    }),
  ],
  providers: [
    makeCounterProvider({
      name: 'http_requests_total',
      help: 'Total HTTP requests',
      labelNames: ['method', 'path', 'status'],
    }),
    makeHistogramProvider({
      name: 'http_request_duration_seconds',
      help: 'HTTP request duration in seconds',
      labelNames: ['method', 'path'],
      buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
    }),
    makeGaugeProvider({
      name: 'active_connections',
      help: 'Number of active connections',
    }),
  ],
  exports: [PrometheusModule],
})
export class MetricsModule {}
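Prometheus histograms are cumulative: each observation increments every bucket whose upper bound (`le`) is at or above the observed value, and `histogram_quantile` later estimates percentiles from those counts. A minimal sketch of that bookkeeping, in pure TypeScript and independent of the NestJS module above, using the same bucket bounds:

```typescript
// Cumulative histogram sketch with the bucket bounds from the module above.
const BOUNDS = [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10];

class Histogram {
  // counts[i] pairs with BOUNDS[i]; the final slot is the +Inf bucket.
  counts: number[] = new Array(BOUNDS.length + 1).fill(0);
  sum = 0;

  observe(seconds: number): void {
    this.sum += seconds;
    // Cumulative: every bucket with le >= the observed value is incremented.
    for (let i = 0; i < BOUNDS.length; i++) {
      if (seconds <= BOUNDS[i]) this.counts[i]++;
    }
    this.counts[BOUNDS.length]++; // +Inf always counts
  }
}

const h = new Histogram();
[0.03, 0.3, 4].forEach((duration) => h.observe(duration));
// 0.03 lands in le=0.05 and above; 0.3 in le=0.5 and above; 4 in le=5 and above.
console.log(h.counts); // [0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
```

This cumulative shape is why the alert rules below can query `http_request_duration_seconds_bucket` with `rate()` and still get meaningful percentile estimates.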

Alert Configuration

The Five Essential Alerts

Every production system needs these alerts from day one:

# alerts/essential.yml
groups:
  - name: essential
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          runbook: "https://wiki.example.com/runbooks/service-down"

      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
          / rate(http_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 5 minutes"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 2 seconds"

      - alert: DiskSpaceLow
        expr: |
          node_filesystem_avail_bytes{mountpoint="/"}
          / node_filesystem_size_bytes{mountpoint="/"} < 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk space below 20% on {{ $labels.instance }}"

      - alert: SSLCertExpiringSoon
        expr: |
          probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expires within 14 days"

Alert Routing

# alertmanager.yml
global:
  slack_api_url: "${SLACK_WEBHOOK_URL}"

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      repeat_interval: 15m
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.summary }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: "${PAGERDUTY_KEY}"
        severity: '{{ .GroupLabels.severity }}'

  - name: 'slack'
    slack_configs:
      - channel: '#alerts-warnings'
        title: '{{ .GroupLabels.alertname }}'

Alert Quality Rules

  • Alert on symptoms, not causes: "Error rate high" is actionable; "CPU at 80%" may not be
  • Every alert has a runbook: On-call engineers should not need to think at 3 AM
  • Alerts must be actionable: If no one can act on it, it is noise, not an alert
  • Tune thresholds after 2 weeks: Initial thresholds are guesses; adjust based on baselines
  • Review alert fatigue monthly: If alerts fire daily without action, raise thresholds or remove them

Grafana Dashboards

Dashboard Hierarchy

  1. Overview dashboard: High-level health across all services. This is the first screen anyone looks at during an incident.
  2. Service dashboards: Detailed metrics for each service (API, web, workers).
  3. Infrastructure dashboards: Node-level metrics (CPU, memory, disk, network).
  4. Business dashboards: Revenue, orders, user activity.

The RED Method for Service Dashboards

For every service, display:

  • Rate: Requests per second
  • Errors: Error rate as a percentage
  • Duration: Latency distribution (P50, P95, P99)

This provides instant visibility into service health without cognitive overload.
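P50/P95/P99 summarize the latency distribution rather than the mean, which hides tail behavior. A quick nearest-rank sketch over raw samples shows what those numbers mean (illustrative only; Prometheus estimates percentiles from histogram buckets instead of raw samples):

```typescript
// Nearest-rank percentile over raw latency samples (milliseconds).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  // Rank of the smallest sample at or above the p-th percentile.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// 100 samples of 1..100 ms, so percentiles are easy to verify by eye.
const latencies = Array.from({ length: 100 }, (_, i) => i + 1);

console.log(percentile(latencies, 50)); // 50
console.log(percentile(latencies, 95)); // 95
console.log(percentile(latencies, 99)); // 99
```

With a distribution like this, a healthy-looking average can coexist with a P99 that is an order of magnitude worse, which is exactly what the Duration panel is there to expose.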


Error Tracking with Sentry

// sentry.config.ts
import * as Sentry from '@sentry/nestjs';

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  tracesSampleRate: 0.1,
  profilesSampleRate: 0.1,
  integrations: [
    Sentry.postgresIntegration(),
  ],
  beforeSend(event) {
    // Strip sensitive data
    if (event.request?.headers) {
      delete event.request.headers['authorization'];
      delete event.request.headers['cookie'];
    }
    return event;
  },
});

Sentry provides:

  • Automatic error grouping and deduplication
  • Stack traces with source maps
  • Release tracking (which deploy introduced the error)
  • Performance monitoring (transaction traces)

Frequently Asked Questions

How much does a monitoring stack cost?

Self-hosted (Prometheus + Grafana + Loki): approximately $50-100/month for the hosting resources. Managed alternatives: Datadog starts at $15/host/month for infrastructure, plus $0.10/GB for logs. Sentry cloud is $26/month for the team plan. A reasonable starting budget for a small business is $100-200/month total.

What is the difference between monitoring and observability?

Monitoring tells you when something is wrong. Observability tells you why. Monitoring is about predefined dashboards and alerts for known failure modes. Observability is about the ability to ask arbitrary questions about your system's behavior using metrics, logs, and traces. You need both, but monitoring is the foundation.

How do we avoid alert fatigue?

Three rules: (1) every alert must require human action, (2) set thresholds based on actual baselines not theoretical ideals, (3) review and tune alerts monthly. If an alert fires more than once per week without requiring action, either fix the underlying issue or raise the threshold. Teams suffering from alert fatigue ignore all alerts, including the critical ones.

Should we monitor our ERP system differently?

ERP systems have unique monitoring requirements. Beyond standard web metrics, monitor: database connection pool usage, background job queue depth, integration sync status (Shopify, payment gateways), scheduled report execution time, and user session counts by module. ECOSIRE provides managed Odoo monitoring as part of our support packages.


What Comes Next

Monitoring is the eyes and ears of your production infrastructure. Pair it with CI/CD automation for deployment confidence and disaster recovery planning for resilience. For a comprehensive DevOps roadmap, see our DevOps guide for small businesses.

Contact ECOSIRE for monitoring setup and managed infrastructure services.


Published by ECOSIRE -- helping businesses see what matters in production.


Written by

ECOSIRE Research and Development Team

Building enterprise-grade digital products at ECOSIRE. Sharing insights on Odoo integrations, eCommerce automation, and AI-powered business solutions.
