# How to Implement Observability: Traces, Metrics, and Logs at Scale
Build a production observability stack. Covers OpenTelemetry instrumentation, Prometheus metrics, distributed tracing, log aggregation, and alerting strategies.
Monitoring tells you something is broken. Observability tells you why. In distributed systems, you can't debug with `console.log`. You need traces to follow requests across services, metrics to spot trends, and logs for the details.
## The Three Pillars
```text
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│    TRACES    │   │   METRICS    │   │     LOGS     │
│              │   │              │   │              │
│ Request flow │   │  Aggregated  │   │  Individual  │
│    across    │   │ time-series  │   │    event     │
│   services   │   │     data     │   │   records    │
│              │   │              │   │              │
│ "What path?" │   │"What trend?" │   │"What detail?"│
└──────────────┘   └──────────────┘   └──────────────┘
```
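Traces hang together across services because every hop forwards the same trace ID. OpenTelemetry handles this automatically via the W3C Trace Context `traceparent` header; the stdlib-only sketch below shows the header format (`version-traceid-spanid-flags`) so the mechanism isn't magic. The helper names are illustrative, not part of any SDK.

```python
import re
import secrets

def make_traceparent(trace_id=None):
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars, shared by every hop
    span_id = secrets.token_hex(8)                # 16 hex chars, new for each hop
    return f"00-{trace_id}-{span_id}-01"          # flags 01 = sampled

def parse_traceparent(header):
    """Return the header's parts, or None if it is malformed."""
    m = re.fullmatch(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if not m:
        return None
    return {"trace_id": m.group(1), "span_id": m.group(2), "flags": m.group(3)}

# Service A starts a trace; service B continues it with a fresh span ID
incoming = make_traceparent()
ctx = parse_traceparent(incoming)
outgoing = make_traceparent(trace_id=ctx["trace_id"])
assert parse_traceparent(outgoing)["trace_id"] == ctx["trace_id"]
```

Because only the span ID changes per hop while the trace ID is constant, the backend can reassemble the full request path from spans emitted by independent services.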
## Step 1: Instrument with OpenTelemetry

### 1.1 Node.js Auto-Instrumentation
```js
// tracing.js — load BEFORE your application code
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-http');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  serviceName: 'api-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  // NodeSDK expects a metric reader, not a bare exporter
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4318/v1/metrics',
    }),
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
});

sdk.start();
```

```bash
# Run your app with tracing
node --require ./tracing.js app.js
```
### 1.2 Python Auto-Instrumentation
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Configure trace provider
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)

# Auto-instrument frameworks
FlaskInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument()
RequestsInstrumentor().instrument()
```
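Under the hood, auto-instrumentation patches framework entry points so every call is wrapped in a span. The stdlib-only sketch below shows just the wrapping pattern; the real SDK records far more (context, attributes, links), and the names here are illustrative.

```python
import functools
import time

SPANS = []  # stand-in for the SDK's span processor/exporter pipeline

def instrument(name):
    """Wrap a function so every call records a span-like dict."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            span = {"name": name, "status": "OK"}
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                span["status"] = "ERROR"       # exceptions mark the span failed...
                span["exception"] = repr(exc)
                raise                          # ...but still propagate
            finally:
                span["duration_ms"] = (time.time() - start) * 1000
                SPANS.append(span)
        return wrapper
    return decorator

@instrument("GET /customers")
def handle_request():
    return {"customers": []}

handle_request()
print(SPANS[0]["name"], SPANS[0]["status"])  # GET /customers OK
```

This is why auto-instrumentation must load before your application code: the patches have to be in place before the framework's functions are imported and referenced.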
## Step 2: Deploy the Collector Stack
```yaml
# docker-compose.yml — Observability Stack
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    volumes:
      - ./otel-config.yaml:/etc/otel/config.yaml
    command: ["--config=/etc/otel/config.yaml"]

  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana

  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "16686:16686"  # UI
      - "14268:14268"  # Legacy collector (Thrift HTTP)

  loki:
    image: grafana/loki:latest
    ports: ["3100:3100"]

volumes:
  grafana_data:
```
### Collector Configuration
```yaml
# otel-config.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: "0.0.0.0:4317" }
      http: { endpoint: "0.0.0.0:4318" }

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  # The dedicated `jaeger` exporter was removed from recent collector
  # releases; Jaeger ingests OTLP natively, so export over OTLP instead.
  otlp/jaeger:
    endpoint: "jaeger:4317"
    tls: { insecure: true }
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```
## Step 3: Define Key Metrics (RED + USE)

### RED Method (Request-oriented)
| Metric | What It Measures | PromQL Example |
|---|---|---|
| Rate | Requests per second | `rate(http_requests_total[5m])` |
| Errors | Error rate percentage | `rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])` |
| Duration | Latency percentiles | `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))` |
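Conceptually, `rate()` is just the counter's delta divided by the window length, and the Errors expression above is the ratio of two such rates. A stdlib sketch with made-up sample values:

```python
def rate(sample_start, sample_end, window_seconds):
    """Per-second rate of a monotonic counter over a window
    (a simplification of what PromQL's rate() computes)."""
    return (sample_end - sample_start) / window_seconds

window = 300  # [5m]
total_rate = rate(10_000, 13_000, window)  # all requests: 10 req/s
error_rate = rate(400, 460, window)        # status=~"5.." requests: 0.2 req/s

error_ratio = error_rate / total_rate
print(f"{error_ratio:.1%}")  # 2.0%
```

Real `rate()` also handles counter resets and extrapolates to the window boundaries, but the delta-over-window intuition is what matters when reading dashboards.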
### USE Method (Resource-oriented)
| Metric | What It Measures | PromQL Example |
|---|---|---|
| Utilization | How busy is the resource? | `avg(rate(node_cpu_seconds_total{mode!="idle"}[5m]))` |
| Saturation | How much does it queue? | `node_load1 / count(node_cpu_seconds_total{mode="idle"})` |
| Errors | How often does it fail? | `rate(node_network_receive_errs_total[5m])` |
## Step 4: Build Custom Metrics
```python
import time

from flask import Flask, jsonify
from opentelemetry import metrics

app = Flask(__name__)
meter = metrics.get_meter("api-service")

# Counter — monotonically increasing (requests, errors)
request_counter = meter.create_counter(
    "api.requests",
    description="Total API requests",
    unit="1",
)

# Histogram — distribution of values (latency, sizes)
latency_histogram = meter.create_histogram(
    "api.latency",
    description="Request latency in milliseconds",
    unit="ms",
)

# Observable Gauge — current state (connections, queue depth).
# Callbacks yield Observations; `queue` is your app's shared queue.Queue.
def get_queue_depth(options):
    yield metrics.Observation(queue.qsize(), {"queue": "main"})

meter.create_observable_gauge(
    "api.queue_depth",
    callbacks=[get_queue_depth],
    description="Current queue depth",
)

# Usage in a request handler (`db` is your database client)
@app.route("/api/customers")
def list_customers():
    start = time.time()
    try:
        result = db.query("SELECT * FROM customers")
        request_counter.add(1, {"method": "GET", "endpoint": "/customers", "status": "200"})
        return jsonify(result)
    except Exception:
        request_counter.add(1, {"method": "GET", "endpoint": "/customers", "status": "500"})
        raise
    finally:
        latency_histogram.record(
            (time.time() - start) * 1000,
            {"method": "GET", "endpoint": "/customers"},
        )
```
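It helps to know what `histogram_quantile()` actually does with those histogram buckets: find the bucket where the target rank lands, then linearly interpolate inside it. Below is a simplified stdlib sketch of that idea (Prometheus additionally works on bucket *rates* and handles several edge cases this omits):

```python
def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs,
    ending with a float('inf') bucket, Prometheus-style."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # can't interpolate into the +Inf bucket
            # linear interpolation within the bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# cumulative request counts per latency bucket (seconds)
buckets = [(0.1, 600), (0.5, 900), (1.0, 980), (2.0, 1000), (float("inf"), 1000)]
print(histogram_quantile(0.99, buckets))  # 1.5 → p99 is ~1.5s
```

The interpolation is why bucket boundaries matter: a p99 reported as 1.5s really means "the 990th slowest of 1000 requests fell somewhere in the 1s-2s bucket". Choose bucket edges near your SLO thresholds.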
## Step 5: Configure Alerting

### Alert Rules (Prometheus)
```yaml
# alerts.yml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.service }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 2s for {{ $labels.service }}"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
```
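The `for:` clause means the expression must stay true for the whole duration before the alert fires; a brief blip resets the clock and the alert stays pending. A stdlib sketch of that pending-to-firing state machine (illustrative, not Prometheus's actual implementation):

```python
def alert_state(samples, threshold, for_steps):
    """samples: metric values at successive evaluation steps.
    Returns the alert's state after processing the samples."""
    breached = 0
    for value in samples:
        breached = breached + 1 if value > threshold else 0  # any dip resets the clock
        if breached > for_steps:
            return "firing"
    return "pending" if breached > 0 else "inactive"

# error ratio sampled once per minute, threshold 0.05, for: 5m
print(alert_state([0.02, 0.06, 0.07, 0.01, 0.06], 0.05, 5))  # pending (the dip reset it)
print(alert_state([0.06] * 6, 0.05, 5))                      # firing
```

This is the trade-off to tune: a longer `for:` suppresses flapping alerts but delays paging on real incidents.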
## Observability Checklist
- OpenTelemetry SDK integrated in all services
- Auto-instrumentation for HTTP, database, cache
- Collector deployed and receiving telemetry
- Prometheus scraping metrics
- Jaeger/Tempo receiving traces
- Loki/ELK aggregating logs
- Grafana dashboards for RED + USE metrics
- Alert rules for error rate, latency, and availability
- Trace correlation across service boundaries
- On-call rotation with escalation paths
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For infrastructure audits, visit garnetgrid.com.
:::