Part of our Performance & Scalability series
Monitoring & Observability: APM, Logging & Alerting Best Practices
Companies with mature observability practices resolve incidents 69% faster according to Splunk's State of Observability report. Monitoring tells you something is broken. Observability tells you why it is broken and where to look. The difference between firefighting every production issue for hours and resolving them in minutes comes down to how well you instrument your systems, structure your logs, and design your alerts.
Key Takeaways
- The three pillars of observability -- metrics, logs, and traces -- each answer different questions and work together to provide complete system understanding
- Alert on symptoms (user-facing impact), not causes (internal metrics), to reduce alert fatigue and catch novel failure modes
- Structured JSON logging with correlation IDs enables searching across services and reconstructing request flows
- SLOs (Service Level Objectives) transform monitoring from "is anything broken" to "are we meeting our commitments to users"
The Three Pillars of Observability
Observability is built on three complementary data types. Each pillar answers different questions about your system's behavior.
Metrics
Metrics are numerical measurements collected at regular intervals. They answer "what is happening" questions: How many requests per second? What is the average response time? How much memory is in use?
Characteristics:
- Aggregated and compact -- millions of events compressed into time-series counters
- Cheap to store -- fixed-size data regardless of traffic volume
- Ideal for dashboards and alerting
- Limited context -- you know response time increased but not which specific requests are slow
Key metric types:
- Counter -- monotonically increasing value (total requests, total errors)
- Gauge -- value that goes up and down (current CPU usage, active connections)
- Histogram -- distribution of values (response time percentiles, payload sizes)
- Summary -- pre-calculated percentiles on the client side
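To make these four types concrete, here is a minimal sketch using the prom-client library recommended later in this guide; the metric names, labels, and bucket boundaries are illustrative assumptions, not a prescribed schema.

```typescript
import { Counter, Gauge, Histogram, Summary } from 'prom-client';

// Counter: only ever increases (resets on process restart)
const requestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests received',
  labelNames: ['method', 'status'],
});

// Gauge: goes up and down with current state
const activeConnections = new Gauge({
  name: 'active_connections',
  help: 'Currently open connections',
});

// Histogram: buckets observations so percentiles can be computed server-side
const requestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5],
});

// Summary: percentiles pre-calculated in the client process
const payloadSize = new Summary({
  name: 'http_request_payload_bytes',
  help: 'Request payload size in bytes',
  percentiles: [0.5, 0.95, 0.99],
});

requestsTotal.inc({ method: 'GET', status: '200' });
activeConnections.set(42);
requestDuration.observe(0.137);
payloadSize.observe(2048);
```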
Logs
Logs are timestamped text records of discrete events. They answer "what happened" questions: What error message did the user see? What parameters were passed to the failed function? What was the state of the system when the problem occurred?
Characteristics:
- Rich context -- arbitrary detail about individual events
- Expensive at scale -- high-traffic systems generate gigabytes of logs per hour
- Searchable -- find specific events with full-text search
- Requires structure -- unstructured log lines are hard to parse and correlate
Traces
Traces follow a single request across multiple services. They answer "where is the time spent" questions: Which service call is slow? Where does the request path diverge? Which database query is the bottleneck?
Characteristics:
- Show causality -- parent-child relationships between operations
- Reveal distributed system behavior -- latency across service boundaries
- Sampling required at scale -- tracing every request is expensive
- Essential for microservices -- without traces, debugging multi-service flows is guesswork
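As a rough sketch of what manual trace instrumentation looks like with the vendor-neutral OpenTelemetry API recommended below -- the service name and the reserveInventory/chargePayment helpers are hypothetical:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('checkout-service');

// Hypothetical downstream calls, stubbed so the sketch is self-contained
async function reserveInventory(orderId: string): Promise<void> {}
async function chargePayment(orderId: string): Promise<void> {}

// Work done inside startActiveSpan becomes children of this span,
// which is what gives traces their parent-child causality.
async function processOrder(orderId: string): Promise<void> {
  await tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', orderId);
      await reserveInventory(orderId);
      await chargePayment(orderId);
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end(); // spans are exported once ended
    }
  });
}
```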
Observability Tool Ecosystem
| Category | Open Source | Commercial | Cloud-Native |
|---|---|---|---|
| Metrics | Prometheus + Grafana | Datadog, New Relic | AWS CloudWatch, Google Cloud Monitoring |
| Logs | Loki, ELK Stack (Elasticsearch, Logstash, Kibana) | Datadog Logs, Splunk | AWS CloudWatch Logs, Google Cloud Logging |
| Traces | Jaeger, Zipkin | Datadog APM, New Relic | AWS X-Ray, Google Cloud Trace |
| All-in-one | Grafana Stack (Prometheus + Loki + Tempo) | Datadog, New Relic, Dynatrace | — |
| Error tracking | Sentry (open core) | Sentry, Bugsnag, Rollbar | — |
| Uptime monitoring | — | Better Uptime, Pingdom | AWS Route 53 health checks |
Choosing a Stack
For most growing businesses, we recommend starting with:
- Sentry for error tracking -- catches exceptions with full stack traces, source maps, and release tracking
- Prometheus + Grafana for metrics -- open source, battle-tested, extensive integration ecosystem
- Structured logging to a managed service -- Datadog Logs, AWS CloudWatch, or Grafana Loki depending on your cloud provider
- OpenTelemetry for instrumentation -- vendor-neutral standard that works with any backend
For teams that want a single vendor, Datadog provides the best all-in-one experience but at significant cost at scale. Grafana's open-source stack (Prometheus, Loki, Tempo) provides equivalent capabilities with lower licensing cost but higher operational burden.
Structured Logging Best Practices
Unstructured log lines like Error processing order 12345 for user user@example.com are human-readable but machine-hostile. Structured JSON logs enable searching, filtering, aggregating, and alerting on specific fields.
Log Structure
Every log entry should include:
| Field | Purpose | Example |
|---|---|---|
| timestamp | When the event occurred | 2026-03-15T14:30:00.123Z |
| level | Severity (debug, info, warn, error) | error |
| message | Human-readable description | Order processing failed |
| service | Which service generated the log | api-server |
| correlationId | Request tracing across services | req-abc123 |
| userId | Who was affected | usr-456 |
| duration | How long the operation took | 1523 (ms) |
| error.name | Error class | DatabaseConnectionError |
| error.stack | Stack trace (errors only) | ... |
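As an example, a single entry carrying these fields could be emitted with pino (the JSON logger behind nestjs-pino, mentioned in the FAQ); the field values are illustrative:

```typescript
import pino from 'pino';

// "base" fields are attached to every entry from this logger
const logger = pino({ base: { service: 'api-server' } });

logger.error(
  {
    correlationId: 'req-abc123',
    userId: 'usr-456',
    duration: 1523, // ms
    err: new Error('connection pool exhausted'), // serialized with type/message/stack
  },
  'Order processing failed',
);
// Emits one JSON line, roughly:
// {"level":50,"time":1700000000000,"service":"api-server",
//  "correlationId":"req-abc123","userId":"usr-456","duration":1523,
//  "err":{"type":"Error","message":"connection pool exhausted","stack":"..."},
//  "msg":"Order processing failed"}
```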
Correlation IDs
A correlation ID is a unique identifier generated at the start of each request and passed to every downstream service call, database query, and background job. When investigating an issue, searching by correlation ID shows every log entry related to that specific request across all services.
Implementation: Generate a UUID at the API gateway or load balancer, pass it in the X-Request-ID header, and include it in every log entry. In NestJS, use a middleware that extracts or generates the correlation ID and stores it in the async local storage context.
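A minimal sketch of such a middleware, assuming NestJS on the default Express adapter; the class and store names are illustrative:

```typescript
import { randomUUID } from 'node:crypto';
import { AsyncLocalStorage } from 'node:async_hooks';
import { Injectable, NestMiddleware } from '@nestjs/common';
import { Request, Response, NextFunction } from 'express';

// Any code running inside this request can read the ID from the store
export const correlationStore = new AsyncLocalStorage<{ correlationId: string }>();

@Injectable()
export class CorrelationIdMiddleware implements NestMiddleware {
  use(req: Request, res: Response, next: NextFunction): void {
    // Reuse the inbound header when an upstream service already set it
    const correlationId =
      (req.headers['x-request-id'] as string | undefined) ?? randomUUID();
    res.setHeader('X-Request-ID', correlationId);
    correlationStore.run({ correlationId }, () => next());
  }
}
```

A logger mixin can then call correlationStore.getStore() to stamp the ID onto every entry without threading it through function arguments.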
Log Levels
Use log levels consistently:
- DEBUG -- detailed diagnostic information, disabled in production unless actively debugging
- INFO -- significant business events (order placed, user registered, payment processed)
- WARN -- unexpected situations that the system handled but should be investigated (retry succeeded, cache miss, deprecated API call)
- ERROR -- failures that affected user experience (request failed, payment declined, external service unavailable)
- FATAL -- the application cannot continue (database unreachable, missing required configuration)
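One common way to enforce this split is to derive the logger's level from the environment, sketched here with pino; the variable names are assumptions:

```typescript
import pino from 'pino';

// DEBUG stays off in production unless LOG_LEVEL explicitly overrides it
const logger = pino({
  level:
    process.env.LOG_LEVEL ??
    (process.env.NODE_ENV === 'production' ? 'info' : 'debug'),
});

logger.debug({ query: 'SELECT ...' }, 'Executing query'); // dropped in production
logger.info({ orderId: 'ord-789' }, 'Order placed');      // significant business event
logger.warn({ attempt: 2 }, 'Retry succeeded');           // handled, but worth a look
logger.error({ gateway: 'stripe' }, 'Payment declined');  // user-facing failure
```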
Log Retention and Cost Management
Logs are the most expensive observability data to store. Implement tiered retention:
- Hot storage (30 days) -- full-text searchable, fast queries, high cost
- Warm storage (90 days) -- compressed, slower queries, moderate cost
- Cold storage (1 year+) -- archived, query-on-demand, low cost
- Debug logs -- do not store in production unless actively troubleshooting
Alerting Design
Bad alerting creates alert fatigue -- teams stop responding to alerts because most are false positives or low-priority noise. Good alerting surfaces genuine issues that require human intervention.
Alert on Symptoms, Not Causes
Symptom-based alert (good): "Error rate on /api/orders exceeded 1% for 5 minutes" -- this directly indicates user impact regardless of the underlying cause.
Cause-based alert (bad): "Database CPU exceeded 90%" -- this may or may not affect users. The database might handle 95% CPU just fine, or it might be at 50% CPU but completely deadlocked.
Cause-based metrics belong on dashboards for investigation, not in alerting rules.
Alert Severity Levels
| Severity | Criteria | Response | Notification |
|---|---|---|---|
| Critical (P1) | Revenue-impacting, all users affected | Immediate response, wake engineers | PagerDuty phone call, Slack |
| High (P2) | Feature degraded, subset of users affected | Respond within 30 minutes | PagerDuty push, Slack |
| Medium (P3) | Performance degraded, no feature loss | Respond within 4 hours | Slack channel, email |
| Low (P4) | Anomaly detected, no user impact | Respond within 24 hours | Email, ticket |
Reducing Alert Noise
- Group related alerts -- if the database goes down, you get one "database unavailable" alert, not 50 alerts from every service that depends on it
- Require sustained violation -- "CPU above 90% for 5 minutes" not "CPU above 90% for 1 second" to avoid transient spikes
- Auto-resolve -- alerts should clear automatically when the condition resolves
- Weekly alert review -- review all alerts that fired, identify and fix or silence those that did not require human action
- On-call feedback loop -- after every on-call rotation, the engineer documents which alerts were actionable and which need tuning
SLOs: Service Level Objectives
SLOs transform monitoring from reactive ("something broke, fix it") to proactive ("we are consuming our error budget, let's investigate before users notice").
Defining SLOs
An SLO defines the target reliability for a service. It consists of:
- SLI (Service Level Indicator) -- the metric being measured (request success rate, latency percentile)
- Target -- the threshold that defines acceptable performance (99.9% success rate, P95 under 200ms)
- Window -- the time period over which the target is evaluated (rolling 30 days)
Example SLOs for an eCommerce Platform
| Service | SLI | Target | Error Budget (30 days) |
|---|---|---|---|
| Product API | Successful responses (non-5xx) | 99.9% | 43 minutes of downtime |
| Checkout | Successful transactions | 99.95% | 21 minutes of downtime |
| Search | Results returned under 500ms | 99% | 7.2 hours of slow responses |
| Admin dashboard | Page loads under 3s | 95% | 36 hours of slow loads |
Error Budgets
The error budget is the complement of your SLO target. A 99.9% SLO allows 0.1% of requests to fail -- approximately 43 minutes of downtime per 30-day window. When the error budget is exhausted, the team shifts focus from features to reliability work.
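The budget figures in the table above fall out of a one-line calculation; a quick sketch to sanity-check them:

```typescript
// Error budget: the share of a rolling window allowed to fail
function errorBudgetMinutes(slo: number, windowDays = 30): number {
  return windowDays * 24 * 60 * (1 - slo); // total minutes * allowed failure rate
}

errorBudgetMinutes(0.999);  // ≈ 43.2 min -- the "43 minutes" for the Product API
errorBudgetMinutes(0.9995); // ≈ 21.6 min -- Checkout
errorBudgetMinutes(0.99);   // 432 min, i.e. 7.2 hours -- Search
```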
Error budgets provide a shared language between engineering and product teams. Instead of debating whether a service is "reliable enough," both teams can see exactly how much error budget remains and make data-driven decisions about shipping new features versus investing in stability.
Frequently Asked Questions
How much does observability cost at scale?
Observability costs range from $10-50 per host per month for open-source stacks (Prometheus, Grafana, Loki) to $30-100+ per host for commercial solutions (Datadog, New Relic). The biggest cost driver is log volume -- optimize by sampling debug logs, compressing stored logs, and setting appropriate retention periods. For most businesses running fewer than 50 servers, the cost is $500-2,000 per month.
Should I use OpenTelemetry or vendor-specific agents?
Use OpenTelemetry. It is the industry standard for instrumentation, supported by every major observability vendor, and prevents vendor lock-in. You can switch backends (from Datadog to Grafana, for example) without re-instrumenting your code. Vendor-specific agents sometimes offer deeper integration, but that rarely justifies giving up portability.
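To illustrate how little backend-specific code OpenTelemetry leaves in an application, here is a typical Node.js bootstrap sketch; the collector URL is an assumption, and switching vendors means changing only the exporter:

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'api-server',
  // Point this at any OTLP-compatible backend: a local collector,
  // Grafana Tempo, Jaeger, a Datadog agent, etc.
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
  // Auto-instruments HTTP, Express, pg, redis, and other common libraries
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```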
How do I set up monitoring for a NestJS application?
In NestJS, use interceptors for request timing, exception filters for error tracking, and middleware for correlation ID propagation. Integrate Sentry with @sentry/nestjs for error tracking. Export Prometheus metrics with the prom-client library exposed on a /metrics endpoint. Use structured logging with nestjs-pino or winston configured for JSON output.
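For instance, a request-timing interceptor along these lines feeds a Prometheus histogram; the metric name and label set are illustrative assumptions:

```typescript
import { CallHandler, ExecutionContext, Injectable, NestInterceptor } from '@nestjs/common';
import { Observable } from 'rxjs';
import { tap } from 'rxjs/operators';
import { Histogram } from 'prom-client';

const httpDuration = new Histogram({
  name: 'http_server_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
});

@Injectable()
export class MetricsInterceptor implements NestInterceptor {
  intercept(context: ExecutionContext, next: CallHandler): Observable<unknown> {
    const startedAt = process.hrtime.bigint();
    const req = context.switchToHttp().getRequest();
    return next.handle().pipe(
      tap(() => {
        const res = context.switchToHttp().getResponse();
        const seconds = Number(process.hrtime.bigint() - startedAt) / 1e9;
        // Use the route pattern, not the raw URL, to keep label cardinality low
        httpDuration.observe(
          { method: req.method, route: req.route?.path ?? 'unknown', status: String(res.statusCode) },
          seconds,
        );
      }),
    );
  }
}
```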
What is the difference between monitoring and observability?
Monitoring tells you when predefined things go wrong (CPU high, error rate up, disk full). Observability lets you ask new questions about system behavior without deploying new instrumentation. A system is observable when you can understand its internal state from its external outputs (metrics, logs, traces). In practice, good monitoring is a subset of observability.
How do I convince my team to invest in observability?
Track Mean Time to Resolution (MTTR) for production incidents before and after observability improvements. Teams with good observability typically reduce MTTR by 60-70%. Multiply the time saved by engineering cost to show ROI. Also track the number of incidents detected by monitoring versus by user reports -- proactive detection builds customer trust.
What's Next
Start with error tracking (Sentry) if you have nothing -- it provides the most immediate value by catching and alerting on production errors. Next, add structured logging with correlation IDs. Then implement metrics collection with Prometheus and Grafana dashboards. Finally, add distributed tracing when you have multiple services.
For the complete performance engineering context, see our pillar guide on scaling your business platform. To optimize the infrastructure your monitoring watches over, read our infrastructure scaling guide.
ECOSIRE implements observability stacks for business platforms running on Odoo ERP and custom architectures. Contact our DevOps team for a monitoring and observability assessment.
Published by ECOSIRE — helping businesses scale with AI-powered solutions across Odoo ERP, Shopify eCommerce, and OpenClaw AI.
Author
ECOSIRE Research and Development Team
Building enterprise-grade digital products at ECOSIRE. Sharing insights on Odoo integration, eCommerce automation, and AI-powered business solutions.
Related Articles
Cost Optimization: Reducing Cloud Infrastructure Spend by 40%
Cut cloud costs by 30-40% with reserved instances, right-sizing, storage tiering, and data transfer optimization. Practical AWS cost reduction strategies.
Infrastructure Scaling: Horizontal vs Vertical, Load Balancing & Auto-Scaling
Scale your infrastructure with the right strategy. Compare horizontal vs vertical scaling, L4 vs L7 load balancers, and auto-scaling policies for production.
Integration Monitoring: Detecting Sync Failures Before They Cost Revenue
Build integration monitoring with health checks, error categorization, retry strategies, dead letter queues, and alerting for multi-channel eCommerce sync.
More from Performance & Scalability
API Performance: Rate Limiting, Pagination & Async Processing
Build high-performance APIs with rate limiting algorithms, cursor-based pagination, async job queues, and response compression best practices.
Caching Strategies: Redis, CDN & HTTP Caching for Web Applications
Implement multi-layer caching with Redis, CDN edge caching, and HTTP cache headers to reduce latency by 90% and cut infrastructure costs.
Core Web Vitals Optimization: LCP, FID & CLS for eCommerce Sites
Optimize Core Web Vitals for eCommerce. Improve LCP, INP, and CLS scores to boost SEO rankings and reduce cart abandonment by 24%.
Database Query Optimization: Indexes, Execution Plans & Partitioning
Optimize PostgreSQL performance with proper indexing, EXPLAIN ANALYZE reading, N+1 detection, and partitioning strategies for growing datasets.
Load Testing Your eCommerce Platform: Preparing for Black Friday Traffic
Prepare your eCommerce site for Black Friday with load testing strategies using k6, Artillery, and Locust. Learn traffic modeling and bottleneck identification.