Production Monitoring and Alerting: The Complete Setup Guide
By one widely cited industry estimate, the average production incident costs $5,600 per minute of downtime. Companies with mature monitoring detect issues in under 5 minutes, while those without monitoring average 197 minutes to detection: the difference between a minor blip and a customer-losing catastrophe.
This guide covers end-to-end production monitoring setup: what to measure, how to collect it, where to visualize it, and when to alert.
Key Takeaways
- The three pillars of observability (metrics, logs, traces) serve different purposes and all three are necessary
- Alert on symptoms (error rate, latency) not causes (CPU usage) to reduce noise by 80%
- Runbooks attached to every alert ensure consistent incident response regardless of who is on call
- Start with 5 essential alerts and expand only when you understand the baseline
The Three Pillars of Observability
Metrics
Numeric measurements sampled over time. Metrics answer "what is happening right now?"
Application metrics:
- Request rate (requests per second)
- Error rate (5xx responses per second)
- Latency distribution (P50, P95, P99)
- Active sessions / concurrent users
Infrastructure metrics:
- CPU utilization per service
- Memory usage and garbage collection
- Disk I/O and available space
- Network throughput
Business metrics:
- Orders per minute
- Cart abandonment rate
- Revenue per hour
- API calls by endpoint
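The latency percentiles listed above (P50, P95, P99) are order statistics over a window of samples. A minimal sketch of the nearest-rank calculation, purely illustrative (production systems use histograms or sketches rather than sorting raw samples):

```typescript
// Toy percentile calculation over a window of latency samples (ms).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank method: smallest value covering at least p% of samples.
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const latencies = [12, 15, 18, 20, 22, 25, 30, 45, 120, 800];
console.log({
  p50: percentile(latencies, 50),
  p95: percentile(latencies, 95),
  p99: percentile(latencies, 99),
}); // → { p50: 22, p95: 800, p99: 800 }
```

Note how a single slow outlier dominates P95 and P99 while leaving P50 untouched: this is why dashboards show the full distribution, not just the average.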
Logs
Timestamped, structured records of discrete events. Logs answer "why did it happen?"
```json
{
  "timestamp": "2026-03-16T14:32:01.234Z",
  "level": "error",
  "service": "api",
  "requestId": "req_abc123",
  "userId": "usr_456",
  "message": "Payment processing failed",
  "error": "Stripe API timeout after 30000ms",
  "endpoint": "POST /billing/checkout",
  "duration": 30142
}
```
Log best practices:
- Use structured JSON logging, not plain text
- Include correlation IDs (requestId) across services
- Log at appropriate levels (ERROR for failures, WARN for degradation, INFO for key events)
- Never log sensitive data (passwords, tokens, full credit card numbers)
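The practices above fit in a few lines. A minimal sketch of a structured logger with key-based redaction, using only the standard library; the field names and the redaction list are illustrative, not a fixed schema:

```typescript
// Minimal structured JSON logger illustrating the best practices above.
const SENSITIVE_KEYS = new Set(['password', 'token', 'cardNumber', 'authorization']);

type Level = 'error' | 'warn' | 'info';

function log(level: Level, message: string, fields: Record<string, unknown> = {}): string {
  const entry: Record<string, unknown> = {
    timestamp: new Date().toISOString(),
    level,
    message,
  };
  for (const [key, value] of Object.entries(fields)) {
    // Never log sensitive data: mask known-sensitive keys outright.
    entry[key] = SENSITIVE_KEYS.has(key) ? '[REDACTED]' : value;
  }
  const line = JSON.stringify(entry);
  console.log(line); // one JSON object per line, ready for Loki/ELK ingestion
  return line;
}

log('error', 'Payment processing failed', {
  requestId: 'req_abc123',
  endpoint: 'POST /billing/checkout',
  token: 'sk_live_secret', // appears in the output as "[REDACTED]"
});
```

In a real service you would use a library such as pino or winston, which add levels, transports, and child loggers on top of the same one-JSON-object-per-line idea.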
Traces
End-to-end request paths through distributed systems. Traces answer "where is the bottleneck?"
A single user request to an eCommerce checkout might touch:
- Nginx (2ms) → Next.js frontend (50ms) → NestJS API (120ms) → PostgreSQL (45ms) → Stripe API (800ms) → email service (200ms)
Without tracing, you see "checkout takes 1.2 seconds." With tracing, you see "Stripe API accounts for 67% of checkout latency."
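In practice a tracing stack (OpenTelemetry, Jaeger, Tempo) records these spans automatically; the arithmetic a trace viewer performs to surface the bottleneck can be sketched with plain TypeScript (span names and durations taken from the checkout example above):

```typescript
// Toy trace analysis: given recorded spans, find the biggest contributor.
interface Span {
  name: string;
  durationMs: number;
}

function slowestSpanShare(spans: Span[]): { name: string; share: number } {
  const total = spans.reduce((sum, s) => sum + s.durationMs, 0);
  const slowest = spans.reduce((a, b) => (b.durationMs > a.durationMs ? b : a));
  // Share of total request latency, as a rounded percentage.
  return { name: slowest.name, share: Math.round((slowest.durationMs / total) * 100) };
}

const checkout: Span[] = [
  { name: 'nginx', durationMs: 2 },
  { name: 'frontend', durationMs: 50 },
  { name: 'api', durationMs: 120 },
  { name: 'postgres', durationMs: 45 },
  { name: 'stripe', durationMs: 800 },
  { name: 'email', durationMs: 200 },
];

console.log(slowestSpanShare(checkout)); // → { name: 'stripe', share: 66 }
```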
Monitoring Stack Setup
Prometheus + Grafana (Self-Hosted)
```yaml
# docker-compose.monitoring.yml
services:
  prometheus:
    image: prom/prometheus:v2.50.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'

  grafana:
    image: grafana/grafana:10.3.0
    volumes:
      - grafana-data:/var/lib/grafana
    ports:
      - "3030:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
      - GF_USERS_ALLOW_SIGN_UP=false

  loki:
    image: grafana/loki:2.9.0
    volumes:
      - loki-data:/loki
    ports:
      - "3100:3100"

  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"

volumes:
  prometheus-data:
  grafana-data:
  loki-data:
```
Prometheus Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: "api"
    metrics_path: /metrics
    static_configs:
      - targets: ["api:3001"]
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
  - job_name: "postgres"
    static_configs:
      - targets: ["postgres-exporter:9187"]
  - job_name: "redis"
    static_configs:
      - targets: ["redis-exporter:9121"]
```
NestJS Application Metrics
Exposing Prometheus Metrics
```typescript
// metrics.module.ts
import { Module } from '@nestjs/common';
import {
  PrometheusModule,
  makeCounterProvider,
  makeHistogramProvider,
  makeGaugeProvider,
} from '@willsoto/nestjs-prometheus';

@Module({
  imports: [
    PrometheusModule.register({
      path: '/metrics',
      defaultMetrics: { enabled: true },
    }),
  ],
  providers: [
    makeCounterProvider({
      name: 'http_requests_total',
      help: 'Total HTTP requests',
      labelNames: ['method', 'path', 'status'],
    }),
    makeHistogramProvider({
      name: 'http_request_duration_seconds',
      help: 'HTTP request duration in seconds',
      labelNames: ['method', 'path'],
      buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
    }),
    makeGaugeProvider({
      name: 'active_connections',
      help: 'Number of active connections',
    }),
  ],
  exports: [PrometheusModule],
})
export class MetricsModule {}
```
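What Prometheus actually scrapes from that /metrics endpoint is a plain-text exposition format. A toy renderer for counter samples, only to show the shape of the output (prom-client generates this for real; the sample values here are made up):

```typescript
// Toy renderer for the Prometheus text exposition format.
interface CounterSample {
  labels: Record<string, string>;
  value: number;
}

function renderCounter(name: string, help: string, samples: CounterSample[]): string {
  // Each metric is preceded by HELP and TYPE comment lines.
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const s of samples) {
    const labelStr = Object.entries(s.labels)
      .map(([k, v]) => `${k}="${v}"`)
      .join(',');
    lines.push(`${name}{${labelStr}} ${s.value}`);
  }
  return lines.join('\n');
}

console.log(
  renderCounter('http_requests_total', 'Total HTTP requests', [
    { labels: { method: 'GET', path: '/health', status: '200' }, value: 42 },
    { labels: { method: 'POST', path: '/billing/checkout', status: '500' }, value: 3 },
  ]),
);
```

Each unique label combination becomes its own time series, which is why label cardinality (e.g. never using raw user IDs as labels) matters for Prometheus memory usage.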
Alert Configuration
The Five Essential Alerts
Every production system needs these alerts from day one:
```yaml
# alerts/essential.yml
groups:
  - name: essential
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          runbook: "https://wiki.example.com/runbooks/service-down"

      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 5 minutes"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 2 seconds"

      - alert: DiskSpaceLow
        expr: |
          node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"} < 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk space below 20% on {{ $labels.instance }}"

      - alert: SSLCertExpiringSoon
        expr: |
          probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expires within 14 days"
```
Alert Routing
```yaml
# alertmanager.yml
global:
  slack_api_url: "${SLACK_WEBHOOK_URL}"

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      repeat_interval: 15m
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.summary }}'
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: "${PAGERDUTY_KEY}"
        severity: '{{ .GroupLabels.severity }}'
  - name: 'slack'
    slack_configs:
      - channel: '#alerts-warnings'
        title: '{{ .GroupLabels.alertname }}'
```
Alert Quality Rules
| Practice | Why |
|---|---|
| Alert on symptoms, not causes | "Error rate high" is actionable; "CPU at 80%" may not be |
| Every alert has a runbook | On-call engineers should not need to think at 3 AM |
| Alerts must be actionable | If no one can act on it, it is noise, not an alert |
| Tune thresholds after 2 weeks | Initial thresholds are guesses; adjust based on baselines |
| Review alert fatigue monthly | If alerts fire daily without action, raise thresholds or remove them |
Grafana Dashboards
Dashboard Hierarchy
- Overview dashboard: High-level health across all services. This is the first screen anyone looks at during an incident.
- Service dashboards: Detailed metrics for each service (API, web, workers).
- Infrastructure dashboards: Node-level metrics (CPU, memory, disk, network).
- Business dashboards: Revenue, orders, user activity.
The RED Method for Service Dashboards
For every service, display:
- Rate: Requests per second
- Errors: Error rate as a percentage
- Duration: Latency distribution (P50, P95, P99)
This provides instant visibility into service health without cognitive overload.
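Assuming the `http_requests_total` and `http_request_duration_seconds` metrics defined earlier, the three RED panels could be backed by PromQL queries along these lines (a sketch; adjust label selectors to your own job names):

```promql
# Rate: requests per second, per service
sum(rate(http_requests_total[5m])) by (job)

# Errors: 5xx responses as a fraction of all requests
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Duration: P95 latency from the histogram
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```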
Error Tracking with Sentry
```typescript
// sentry.config.ts
import * as Sentry from '@sentry/nestjs';

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  tracesSampleRate: 0.1,
  profilesSampleRate: 0.1,
  integrations: [Sentry.postgresIntegration()],
  beforeSend(event) {
    // Strip sensitive data before events leave the process
    if (event.request?.headers) {
      delete event.request.headers['authorization'];
      delete event.request.headers['cookie'];
    }
    return event;
  },
});
```
Sentry provides:
- Automatic error grouping and deduplication
- Stack traces with source maps
- Release tracking (which deploy introduced the error)
- Performance monitoring (transaction traces)
Frequently Asked Questions
How much does a monitoring stack cost?
Self-hosted (Prometheus + Grafana + Loki): approximately $50-100/month for the hosting resources. Managed alternatives: Datadog starts at $15/host/month for infrastructure, plus $0.10/GB for logs. Sentry cloud is $26/month for the team plan. A reasonable starting budget for a small business is $100-200/month total.
What is the difference between monitoring and observability?
Monitoring tells you when something is wrong. Observability tells you why. Monitoring is about predefined dashboards and alerts for known failure modes. Observability is about the ability to ask arbitrary questions about your system's behavior using metrics, logs, and traces. You need both, but monitoring is the foundation.
How do we avoid alert fatigue?
Three rules: (1) every alert must require human action, (2) set thresholds based on actual baselines not theoretical ideals, (3) review and tune alerts monthly. If an alert fires more than once per week without requiring action, either fix the underlying issue or raise the threshold. Teams suffering from alert fatigue ignore all alerts, including the critical ones.
Should we monitor our ERP system differently?
ERP systems have unique monitoring requirements. Beyond standard web metrics, monitor: database connection pool usage, background job queue depth, integration sync status (Shopify, payment gateways), scheduled report execution time, and user session counts by module. ECOSIRE provides managed Odoo monitoring as part of our support packages.
What Comes Next
Monitoring is the eyes and ears of your production infrastructure. Pair it with CI/CD automation for deployment confidence and disaster recovery planning for resilience. For a comprehensive DevOps roadmap, see our DevOps guide for small businesses.
Contact ECOSIRE for monitoring setup and managed infrastructure services.
Published by ECOSIRE -- helping businesses see what matters in production.
Written by
ECOSIRE Research and Development Team
Building enterprise-grade digital products at ECOSIRE. Sharing insights on Odoo integrations, e-commerce automation, and AI-powered business solutions.