Monitoring & Observability: APM, Logging & Alerting Best Practices

Build production observability with the three pillars: metrics, logs, and traces. Compare APM tools and design alerts that reduce noise and catch real issues.


ECOSIRE Research and Development Team


March 15, 2026 · 9 min read · 2.1k words


Part of the Performance & Scalability series

Read the complete guide


Companies with mature observability practices resolve incidents 69% faster, according to Splunk's State of Observability report. Monitoring tells you that something is broken. Observability tells you why it is broken and where to look. The difference between firefighting production issues for hours and resolving them in minutes comes down to how well you instrument your systems, structure your logs, and design your alerts.

Key Takeaways

  • The three pillars of observability -- metrics, logs, and traces -- each answer different questions and work together to provide complete system understanding
  • Alert on symptoms (user-facing impact) not causes (internal metrics) to reduce alert fatigue and catch novel failure modes
  • Structured JSON logging with correlation IDs enables searching across services and reconstructing request flows
  • SLOs (Service Level Objectives) transform monitoring from "is anything broken" to "are we meeting our commitments to users"

The Three Pillars of Observability

Observability is built on three complementary data types. Each pillar answers different questions about your system's behavior.

Metrics

Metrics are numerical measurements collected at regular intervals. They answer "what is happening" questions: How many requests per second? What is the average response time? How much memory is in use?

Characteristics:

  • Aggregated and compact -- millions of events compressed into time-series counters
  • Cheap to store -- fixed-size data regardless of traffic volume
  • Ideal for dashboards and alerting
  • Limited context -- you know response time increased but not which specific requests are slow

Key metric types:

  • Counter -- monotonically increasing value (total requests, total errors)
  • Gauge -- value that goes up and down (current CPU usage, active connections)
  • Histogram -- distribution of values (response time percentiles, payload sizes)
  • Summary -- pre-calculated percentiles on the client side
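Three of the four metric types above can be sketched as tiny in-memory classes. This is a toy illustration of the semantics, not a real client; in production you would use a library such as prom-client, which buckets histogram observations instead of storing each one.

```typescript
// Minimal in-memory sketches of counter, gauge, and histogram semantics.

class Counter {
  private value = 0;
  // Counters are monotonic: they only increase.
  inc(amount = 1): void {
    if (amount < 0) throw new Error("counters only go up");
    this.value += amount;
  }
  get(): number {
    return this.value;
  }
}

class Gauge {
  private value = 0;
  // Gauges move freely up and down (CPU usage, active connections).
  set(v: number): void {
    this.value = v;
  }
  get(): number {
    return this.value;
  }
}

class Histogram {
  private observations: number[] = [];
  observe(v: number): void {
    this.observations.push(v);
  }
  // Nearest-rank percentile over raw observations; real histogram
  // implementations bucket values to stay fixed-size.
  percentile(p: number): number {
    const sorted = [...this.observations].sort((a, b) => a - b);
    const idx = Math.ceil((p / 100) * sorted.length) - 1;
    return sorted[Math.max(0, idx)];
  }
}
```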

Logs

Logs are timestamped text records of discrete events. They answer "what happened" questions: What error message did the user see? What parameters were passed to the failed function? What was the state of the system when the problem occurred?

Characteristics:

  • Rich context -- arbitrary detail about individual events
  • Expensive at scale -- high-traffic systems generate gigabytes of logs per hour
  • Searchable -- find specific events with full-text search
  • Requires structure -- unstructured log lines are hard to parse and correlate

Traces

Traces follow a single request across multiple services. They answer "where is the time spent" questions: Which service call is slow? Where does the request path diverge? Which database query is the bottleneck?

Characteristics:

  • Show causality -- parent-child relationships between operations
  • Reveal distributed system behavior -- latency across service boundaries
  • Sampling required at scale -- tracing every request is expensive
  • Essential for microservices -- without traces, debugging multi-service flows is guesswork

Observability Tool Ecosystem

| Category | Open Source | Commercial | Cloud-Native |
|---|---|---|---|
| Metrics | Prometheus + Grafana | Datadog, New Relic | AWS CloudWatch, Google Cloud Monitoring |
| Logs | Loki, ELK Stack (Elasticsearch, Logstash, Kibana) | Datadog Logs, Splunk | AWS CloudWatch Logs, Google Cloud Logging |
| Traces | Jaeger, Zipkin | Datadog APM, New Relic | AWS X-Ray, Google Cloud Trace |
| All-in-one | Grafana Stack (Prometheus + Loki + Tempo) | Datadog, New Relic, Dynatrace | — |
| Error tracking | Sentry (open core) | Sentry, Bugsnag, Rollbar | — |
| Uptime monitoring | — | Better Uptime, Pingdom | AWS Route 53 health checks |

Choosing a Stack

For most growing businesses, we recommend starting with:

  1. Sentry for error tracking -- catches exceptions with full stack traces, source maps, and release tracking
  2. Prometheus + Grafana for metrics -- open source, battle-tested, extensive integration ecosystem
  3. Structured logging to a managed service -- Datadog Logs, AWS CloudWatch, or Grafana Loki depending on your cloud provider
  4. OpenTelemetry for instrumentation -- vendor-neutral standard that works with any backend

For teams that want a single vendor, Datadog provides the best all-in-one experience but at significant cost at scale. Grafana's open-source stack (Prometheus, Loki, Tempo) provides equivalent capabilities with lower licensing cost but higher operational burden.


Structured Logging Best Practices

Unstructured log lines like Error processing order 12345 for user [email protected] are human-readable but machine-hostile. Structured JSON logs enable searching, filtering, aggregating, and alerting on specific fields.

Log Structure

Every log entry should include:

| Field | Purpose | Example |
|---|---|---|
| timestamp | When the event occurred | 2026-03-15T14:30:00.123Z |
| level | Severity (debug, info, warn, error) | error |
| message | Human-readable description | Order processing failed |
| service | Which service generated the log | api-server |
| correlationId | Request tracing across services | req-abc123 |
| userId | Who was affected | usr-456 |
| duration | How long the operation took (ms) | 1523 |
| error.name | Error class | DatabaseConnectionError |
| error.stack | Stack trace (errors only) | ... |
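A minimal helper that emits one JSON object per line with the fields above might look like the following sketch. The field names mirror the table; the default service name is illustrative, and production code would normally use a library such as pino or winston instead.

```typescript
// Minimal structured-logging helper: one JSON object per log line.

interface LogFields {
  level: "debug" | "info" | "warn" | "error";
  message: string;
  correlationId?: string;
  userId?: string;
  duration?: number; // milliseconds
  error?: Error;
}

function logEntry(fields: LogFields, service = "api-server"): string {
  const entry: Record<string, unknown> = {
    timestamp: new Date().toISOString(),
    level: fields.level,
    message: fields.message,
    service,
    // JSON.stringify drops undefined values, so optional fields
    // are omitted cleanly when not provided.
    correlationId: fields.correlationId,
    userId: fields.userId,
    duration: fields.duration,
  };
  if (fields.error) {
    entry["error.name"] = fields.error.name;
    entry["error.stack"] = fields.error.stack;
  }
  return JSON.stringify(entry);
}
```

Every field becomes searchable and filterable in the log backend, which is exactly what the unstructured one-liner cannot offer.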

Correlation IDs

A correlation ID is a unique identifier generated at the start of each request and passed to every downstream service call, database query, and background job. When investigating an issue, searching by correlation ID shows every log entry related to that specific request across all services.

Implementation: Generate a UUID at the API gateway or load balancer, pass it in the X-Request-ID header, and include it in every log entry. In NestJS, use a middleware that extracts or generates the correlation ID and stores it in the async local storage context.
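A framework-free sketch of that pattern, using Node's built-in AsyncLocalStorage: the middleware signature below is Express/NestJS-style but kept generic, so adapt it to your framework's middleware interface.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

// Per-request context holding the correlation ID.
const requestContext = new AsyncLocalStorage<{ correlationId: string }>();

// Reuse an incoming X-Request-ID if the gateway already set one,
// otherwise generate a fresh UUID, then run the rest of the request
// inside the async context.
function correlationMiddleware(
  req: { headers: Record<string, string | undefined> },
  _res: unknown,
  next: () => void,
): void {
  const correlationId = req.headers["x-request-id"] ?? randomUUID();
  requestContext.run({ correlationId }, next);
}

// Any code running inside the request (loggers, DB wrappers, jobs)
// can read the ID without threading it through every call.
function currentCorrelationId(): string | undefined {
  return requestContext.getStore()?.correlationId;
}
```

The logger from the previous section can then call `currentCorrelationId()` instead of requiring callers to pass the ID explicitly.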

Log Levels

Use log levels consistently:

  • DEBUG -- detailed diagnostic information, disabled in production unless actively debugging
  • INFO -- significant business events (order placed, user registered, payment processed)
  • WARN -- unexpected situations that the system handled but should be investigated (retry succeeded, cache miss, deprecated API call)
  • ERROR -- failures that affected user experience (request failed, payment declined, external service unavailable)
  • FATAL -- the application cannot continue (database unreachable, missing required configuration)

Log Retention and Cost Management

Logs are the most expensive observability data to store. Implement tiered retention:

  • Hot storage (30 days) -- full-text searchable, fast queries, high cost
  • Warm storage (90 days) -- compressed, slower queries, moderate cost
  • Cold storage (1 year+) -- archived, query-on-demand, low cost
  • Debug logs -- do not store in production unless actively troubleshooting

Alerting Design

Bad alerting creates alert fatigue -- teams stop responding to alerts because most are false positives or low-priority noise. Good alerting surfaces genuine issues that require human intervention.

Alert on Symptoms, Not Causes

Symptom-based alert (good): "Error rate on /api/orders exceeded 1% for 5 minutes" -- this directly indicates user impact regardless of the underlying cause.

Cause-based alert (bad): "Database CPU exceeded 90%" -- this may or may not affect users. The database might handle 95% CPU just fine, or it might be at 50% CPU but completely deadlocked.

Cause-based metrics belong on dashboards for investigation, not in alerting rules.
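Expressed as a Prometheus alerting rule, the symptom-based alert above might look like this sketch. The metric and label names (`http_requests_total`, `route`, `status`) are illustrative assumptions; adjust them to your actual instrumentation.

```yaml
# Symptom-based alert: error rate on /api/orders above 1% for 5 minutes.
groups:
  - name: orders-slo
    rules:
      - alert: OrdersErrorRateHigh
        expr: |
          sum(rate(http_requests_total{route="/api/orders", status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{route="/api/orders"}[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate on /api/orders exceeded 1% for 5 minutes"
```

The `for: 5m` clause is what makes this a sustained violation rather than a transient spike, and Prometheus auto-resolves the alert when the expression stops being true.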

Alert Severity Levels

| Severity | Criteria | Response | Notification |
|---|---|---|---|
| Critical (P1) | Revenue-impacting, all users affected | Immediate response, wake engineers | PagerDuty phone call, Slack |
| High (P2) | Feature degraded, subset of users affected | Respond within 30 minutes | PagerDuty push, Slack |
| Medium (P3) | Performance degraded, no feature loss | Respond within 4 hours | Slack channel, email |
| Low (P4) | Anomaly detected, no user impact | Respond within 24 hours | Email, ticket |

Reducing Alert Noise

  1. Group related alerts -- if the database goes down, you get one "database unavailable" alert, not 50 alerts from every service that depends on it
  2. Require sustained violation -- "CPU above 90% for 5 minutes" not "CPU above 90% for 1 second" to avoid transient spikes
  3. Auto-resolve -- alerts should clear automatically when the condition resolves
  4. Weekly alert review -- review all alerts that fired, identify and fix or silence those that did not require human action
  5. On-call feedback loop -- after every on-call rotation, the engineer documents which alerts were actionable and which need tuning
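Points 2 and 3 above (sustained violation, auto-resolve) can be illustrated with a toy evaluator. This is a sketch of the semantics, not a replacement for your alerting backend's equivalent of Prometheus's `for:` clause.

```typescript
// Toy alert evaluator: fires only after the condition has held for a
// full window, and auto-resolves as soon as one sample clears.

class SustainedAlert {
  private violatingSince: number | null = null;
  firing = false;

  constructor(
    private readonly threshold: number,
    private readonly holdMs: number, // e.g. 5 minutes = 300_000
  ) {}

  // Feed one sample; returns whether the alert is currently firing.
  sample(value: number, nowMs: number): boolean {
    if (value <= this.threshold) {
      // Condition cleared: auto-resolve immediately.
      this.violatingSince = null;
      this.firing = false;
    } else {
      // Remember when the violation started; fire only once it has
      // persisted for the full hold window.
      this.violatingSince ??= nowMs;
      this.firing = nowMs - this.violatingSince >= this.holdMs;
    }
    return this.firing;
  }
}
```

A one-second spike above the threshold never fires, while five sustained minutes do, which is exactly the behavior point 2 asks for.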

SLOs: Service Level Objectives

SLOs transform monitoring from reactive ("something broke, fix it") to proactive ("we are consuming our error budget, let's investigate before users notice").

Defining SLOs

An SLO defines the target reliability for a service. It consists of:

  • SLI (Service Level Indicator) -- the metric being measured (request success rate, latency percentile)
  • Target -- the threshold that defines acceptable performance (99.9% success rate, P95 under 200ms)
  • Window -- the time period over which the target is evaluated (rolling 30 days)

Example SLOs for an eCommerce Platform

| Service | SLI | Target | Error Budget (30 days) |
|---|---|---|---|
| Product API | Successful responses (non-5xx) | 99.9% | 43 minutes of downtime |
| Checkout | Successful transactions | 99.95% | 21 minutes of downtime |
| Search | Results returned under 500ms | 99% | 7.2 hours of slow responses |
| Admin dashboard | Page loads under 3s | 95% | 36 hours of slow loads |

Error Budgets

The error budget is the inverse of your SLO target. A 99.9% SLO allows 0.1% errors -- approximately 43 minutes of downtime per month. When the error budget is exhausted, the team shifts focus from features to reliability work.
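The arithmetic behind those budget figures is simple enough to sketch: the budget is the non-target fraction of the window.

```typescript
// Error budget = (1 - SLO target) * window length.
// 99.9% over 30 days -> 0.1% of 43,200 minutes = 43.2 minutes.

function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return (1 - sloTarget) * windowMinutes;
}
```

This matches the ~43 minutes cited above for a 99.9% target, and the checkout SLO of 99.95% yields 21.6 minutes over the same window.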

Error budgets provide a shared language between engineering and product teams. Instead of debating whether a service is "reliable enough," both teams can see exactly how much error budget remains and make data-driven decisions about shipping new features versus investing in stability.


Frequently Asked Questions

How much does observability cost at scale?

Observability costs range from $10-50 per host per month for open-source stacks (Prometheus, Grafana, Loki) to $30-100+ per host for commercial solutions (Datadog, New Relic). The biggest cost driver is log volume -- optimize by sampling debug logs, compressing stored logs, and setting appropriate retention periods. For most businesses under 50 servers, the cost is $500-2,000 per month.

Should I use OpenTelemetry or vendor-specific agents?

Use OpenTelemetry. It is the industry standard for instrumentation, supported by every major observability vendor, and prevents vendor lock-in. You can switch backends (from Datadog to Grafana, for example) without re-instrumenting your code. Vendor-specific agents sometimes offer deeper integration, but the portability trade-off is not worth it.

How do I set up monitoring for a NestJS application?

In NestJS, use interceptors for request timing, exception filters for error tracking, and middleware for correlation ID propagation. Integrate Sentry with @sentry/nestjs for error tracking. Export Prometheus metrics with the prom-client library exposed on a /metrics endpoint. Use structured logging with nestjs-pino or winston configured for JSON output.

What is the difference between monitoring and observability?

Monitoring tells you when predefined things go wrong (CPU high, error rate up, disk full). Observability lets you ask new questions about system behavior without deploying new instrumentation. A system is observable when you can understand its internal state from its external outputs (metrics, logs, traces). In practice, good monitoring is a subset of observability.

How do I convince my team to invest in observability?

Track Mean Time to Resolution (MTTR) for production incidents before and after observability improvements. Teams with good observability typically reduce MTTR by 60-70%. Multiply the time saved by engineering cost to show ROI. Also track the number of incidents detected by monitoring versus by user reports -- proactive detection builds customer trust.


What's Next

Start with error tracking (Sentry) if you have nothing -- it provides the most immediate value by catching and alerting on production errors. Next, add structured logging with correlation IDs. Then implement metrics collection with Prometheus and Grafana dashboards. Finally, add distributed tracing when you have multiple services.

For the complete performance engineering context, see our pillar guide on scaling your business platform. To optimize the infrastructure your monitoring watches over, read our infrastructure scaling guide.

ECOSIRE implements observability stacks for business platforms running on Odoo ERP and custom architectures. Contact our DevOps team for a monitoring and observability assessment.


Published by ECOSIRE — helping businesses scale with AI-powered solutions across Odoo ERP, Shopify eCommerce, and OpenClaw AI.


Written by

ECOSIRE Research and Development Team

Building enterprise-grade digital products at ECOSIRE. Sharing insights on Odoo integration, eCommerce automation, and AI-powered business solutions.
