How to Implement Observability: Traces, Metrics, and Logs at Scale

Build a production observability stack. Covers OpenTelemetry instrumentation, Prometheus metrics, distributed tracing, log aggregation, and alerting strategies.

Monitoring tells you something is broken. Observability tells you why. In distributed systems, you can’t debug with console.log. You need traces to follow requests across services, metrics to spot trends, and logs for the details.


The Three Pillars

┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│    TRACES    │   │   METRICS    │   │     LOGS     │
│              │   │              │   │              │
│ Request flow │   │ Aggregated   │   │ Individual   │
│ across       │   │ time-series  │   │ event        │
│ services     │   │ data         │   │ records      │
│              │   │              │   │              │
│ "What path?" │   │ "What trend?"│   │"What detail?"│
└──────────────┘   └──────────────┘   └──────────────┘

Step 1: Instrument with OpenTelemetry

1.1 Node.js Auto-Instrumentation

// tracing.js — load BEFORE your application code
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-http');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  serviceName: 'api-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  // NodeSDK takes a metricReader, not a bare exporter
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4318/v1/metrics',
    }),
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
});

sdk.start();

# Run your app with tracing
node --require ./tracing.js app.js

1.2 Python Auto-Instrumentation

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Configure the trace provider with a batching OTLP exporter
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)

# Auto-instrument frameworks
FlaskInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument()
RequestsInstrumentor().instrument()

Step 2: Deploy the Collector Stack

# docker-compose.yml — Observability Stack
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
      - "4317:4317"   # gRPC
      - "4318:4318"   # HTTP
    volumes:
      - ./otel-config.yaml:/etc/otel/config.yaml
    command: ["--config=/etc/otel/config.yaml"]

  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # UI
      - "14268:14268"  # Legacy Thrift ingest (OTLP on 4317/4318 stays internal)

  loki:
    image: grafana/loki:latest
    ports: ["3100:3100"]

volumes:
  grafana_data:

Collector Configuration

# otel-config.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: "0.0.0.0:4317" }
      http: { endpoint: "0.0.0.0:4318" }

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  # The dedicated `jaeger` exporter was removed from recent collector
  # releases; send traces to Jaeger over OTLP instead
  otlp/jaeger:
    endpoint: "jaeger:4317"
    tls: { insecure: true }
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
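The compose file above mounts a ./prometheus.yml that is never shown. A minimal sketch, assuming the collector's Prometheus exporter stays on port 8889 as configured here:

```yaml
# prometheus.yml (path matches the compose mount)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "otel-collector"
    static_configs:
      - targets: ["otel-collector:8889"]
```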

Step 3: Define Key Metrics (RED + USE)

RED Method (Request-oriented)

  • Rate: requests per second
    rate(http_requests_total[5m])
  • Errors: fraction of requests that return 5xx
    rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
  • Duration: latency percentiles (here P99)
    histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
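Prometheus derives the Duration percentile from cumulative histogram buckets by linear interpolation. A stdlib sketch of roughly what histogram_quantile does (the bucket bounds and counts are made up):

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets,
    roughly as Prometheus's histogram_quantile does."""
    total = buckets[-1][1]          # count in the +Inf bucket = all observations
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):   # quantile falls in +Inf: return last finite bound
                return prev_bound
            # Linear interpolation within the matching bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Illustrative buckets: 50 requests under 0.1s, 90 under 0.5s, all 100 under 1s
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.9, buckets))  # 0.5
```

This is why percentiles from histograms are estimates: accuracy depends on how closely the bucket bounds bracket the true value.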

USE Method (Resource-oriented)

  • Utilization: how busy the resource is
    avg(rate(node_cpu_seconds_total{mode!="idle"}[5m]))
  • Saturation: how much work is queued waiting
    node_load1 / count(node_cpu_seconds_total{mode="idle"})
  • Errors: how often the resource fails
    rate(node_disk_io_time_weighted_seconds_total[5m])
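The saturation expression above is just load average divided by CPU count. The same signal can be read locally with the stdlib (Unix only; above 1.0 per core, work is queuing):

```python
import os

def cpu_saturation():
    """1-minute load average per CPU core; > 1.0 suggests a run-queue backlog."""
    load1, _, _ = os.getloadavg()  # Unix only
    return load1 / os.cpu_count()

print(f"saturation: {cpu_saturation():.2f}")
```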

Step 4: Build Custom Metrics

import time
from opentelemetry import metrics

meter = metrics.get_meter("api-service")

# Counter — monotonically increasing (requests, errors)
request_counter = meter.create_counter(
    "api.requests",
    description="Total API requests",
    unit="1"
)

# Histogram — distribution of values (latency, sizes)
latency_histogram = meter.create_histogram(
    "api.latency",
    description="Request latency in milliseconds",
    unit="ms"
)

# Observable Gauge — current state (connections, queue depth)
# Callbacks in the current metrics API take CallbackOptions and yield Observations
from opentelemetry.metrics import CallbackOptions, Observation

def get_queue_depth(options: CallbackOptions):
    # work_queue is the application's queue.Queue instance
    yield Observation(work_queue.qsize(), {"queue": "main"})

meter.create_observable_gauge(
    "api.queue_depth",
    callbacks=[get_queue_depth],
    description="Current queue depth"
)

# Usage in request handler
@app.route("/api/customers")
def list_customers():
    start = time.time()
    try:
        result = db.query("SELECT * FROM customers")
        request_counter.add(1, {"method": "GET", "endpoint": "/customers", "status": "200"})
        return jsonify(result)
    except Exception as e:
        request_counter.add(1, {"method": "GET", "endpoint": "/customers", "status": "500"})
        raise
    finally:
        latency_histogram.record(
            (time.time() - start) * 1000,
            {"method": "GET", "endpoint": "/customers"}
        )
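The try/except/finally pattern above gets repetitive across handlers. One way to factor it out is a decorator; this sketch duck-types the instruments so it runs standalone (the stub recorder stands in for the OpenTelemetry counter and histogram above):

```python
import time
from functools import wraps

class StubInstrument:
    """Stand-in for an OpenTelemetry counter/histogram; records calls in memory."""
    def __init__(self):
        self.points = []
    def add(self, value, attrs):        # counter-style API
        self.points.append((value, attrs))
    def record(self, value, attrs):     # histogram-style API
        self.points.append((value, attrs))

def track_red(counter, histogram, endpoint):
    """Record rate, errors, and duration for any handler."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            status = "200"
            try:
                return fn(*args, **kwargs)
            except Exception:
                status = "500"
                raise
            finally:
                counter.add(1, {"endpoint": endpoint, "status": status})
                histogram.record((time.time() - start) * 1000, {"endpoint": endpoint})
        return wrapper
    return decorator

requests_c, latency_h = StubInstrument(), StubInstrument()

@track_red(requests_c, latency_h, "/customers")
def list_customers():
    return ["ada", "grace"]

list_customers()
print(requests_c.points[0][1]["status"])  # 200
```

With the real instruments passed in place of the stubs, each route gets consistent RED metrics from a single decorator line.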

Step 5: Configure Alerting

Alert Rules (Prometheus)

# alerts.yml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
          / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.service }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 2s for {{ $labels.service }}"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
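Before deploying, rule files can be validated offline with promtool, which ships with Prometheus:

```shell
# Validate syntax and expressions in the rule file
promtool check rules alerts.yml

# Then reference it from prometheus.yml:
#   rule_files:
#     - alerts.yml
```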

Observability Checklist

  • OpenTelemetry SDK integrated in all services
  • Auto-instrumentation for HTTP, database, cache
  • Collector deployed and receiving telemetry
  • Prometheus scraping metrics
  • Jaeger/Tempo receiving traces
  • Loki/ELK aggregating logs
  • Grafana dashboards for RED + USE metrics
  • Alert rules for error rate, latency, and availability
  • Trace correlation across service boundaries
  • On-call rotation with escalation paths

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For infrastructure audits, visit garnetgrid.com.
:::