TL;DR

Distributed tracing answers "where did this request spend its time?" across services. Use OpenTelemetry (OTEL) for instrumentation and collection, with Tempo or Jaeger as the backend. Traces are most useful during latency investigations and postmortems; correlate them with Loki logs using trace IDs.

Concepts

Term | Definition
Trace | A tree of spans representing a single request through a distributed system
Span | One unit of work (e.g., an HTTP call or DB query); has a name, duration, attributes, and status
Trace ID | Globally unique ID propagated across service boundaries in HTTP headers (the W3C traceparent header)
Sampler | Decides which traces to record; head-based (decided at the start of the trace) or tail-based (decided after the trace completes)
Collector | The OTEL Collector receives spans from services, processes them, and exports them to backends
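
To make the Trace ID and Sampler rows concrete, here is a minimal sketch of how a traceparent header could be created, parsed, and passed on to a downstream call, using only the Python standard library. The function names are hypothetical, only version 00 of the header is handled, and head_sample is a simplified ratio check rather than the exact algorithm of any SDK. In practice an OpenTelemetry SDK does all of this for you.

```python
import re
import secrets

# W3C Trace Context, version 00:
#   00-<32 hex trace-id>-<16 hex parent span-id>-<2 hex flags>
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: random trace ID and span ID, sampled flag in bit 0."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 bytes -> 16 hex chars
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str):
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid according to the spec.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


def child_traceparent(incoming: str) -> str:
    """Header for an outgoing call: same trace ID and flags, fresh span ID."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        return new_traceparent()  # missing or broken context: start a new trace
    trace_id, _parent_span_id, sampled = parsed
    return f"00-{trace_id}-{secrets.token_hex(8)}-{'01' if sampled else '00'}"


def head_sample(trace_id: str, ratio: float) -> bool:
    """Head-based decision derived from the trace ID, so every service
    that sees the same trace ID reaches the same answer."""
    return int(trace_id[-16:], 16) < int(ratio * (1 << 64))
```

A service would call child_traceparent(request_headers["traceparent"]) before each outgoing HTTP call and send the result as the traceparent header. A tail-based sampler cannot be written this way, because it needs the finished trace before it decides.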

OpenTelemetry Collector

Deploy the OTEL Collector as a Deployment (gateway) or a DaemonSet (agent). In agent mode, apps send spans to the node-local Collector, which batches them and forwards them to Tempo or Jaeger. This decouples the app from the backend choice.

otel-collector-config.yaml (yaml)
receivers:
  otlp:
    protocols:
      grpc: {endpoint: 0.0.0.0:4317}
      http: {endpoint: 0.0.0.0:4318}
  # Also accept Jaeger format from older services
  jaeger:
    protocols:
      thrift_http: {endpoint: 0.0.0.0:14268}

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  # Drop internal health-check spans (high volume, low value)
  filter:
    spans:
      exclude:
        match_type: regexp
        attributes:
          - {key: http.target, value: "^/health.*"}

exporters:
  otlp:
    endpoint: tempo.monitoring.svc:4317
    tls: {insecure: true}
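  # Also export to Jaeger for teams still using it. The dedicated `jaeger`
  # exporter has been removed from recent Collector releases; Jaeger accepts
  # OTLP natively, so a second, named OTLP exporter is used instead.
  otlp/jaeger:
    endpoint: jaeger-collector.monitoring.svc:4317
    tls: {insecure: true}   # assumes plaintext in-cluster traffic, as for Tempo

service:
  pipelines:
    traces:
      receivers:  [otlp, jaeger]
      # Filter before batching so dropped spans never take up batch space
      processors: [filter, batch]
      exporters:  [otlp, otlp/jaeger]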
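
To see what the OTLP/HTTP receiver on port 4318 actually accepts, the following sketch builds a single-span export request in OTLP's JSON encoding and posts it to the /v1/traces path. It is only an illustration of the wire format: the URL assumes a Collector agent reachable on localhost, the function names and the "manual-example" scope name are hypothetical, and a real service would normally use an OpenTelemetry SDK exporter instead.

```python
import json
import secrets
import urllib.request

# Assumption: a Collector with the OTLP/HTTP receiver is reachable on localhost.
COLLECTOR_URL = "http://localhost:4318/v1/traces"


def build_otlp_payload(service: str, span_name: str,
                       start_ns: int, end_ns: int, attributes: dict) -> dict:
    """Build an OTLP/JSON trace export request that contains one span."""
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "manual-example"},  # hypothetical scope name
                "spans": [{
                    "traceId": secrets.token_hex(16),  # hex-encoded, 32 chars
                    "spanId": secrets.token_hex(8),    # hex-encoded, 16 chars
                    "name": span_name,
                    "kind": 2,  # SPAN_KIND_SERVER
                    # 64-bit integers are carried as strings in OTLP/JSON
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                    "attributes": [
                        {"key": key, "value": {"stringValue": str(value)}}
                        for key, value in attributes.items()
                    ],
                }],
            }],
        }]
    }


def send_span(payload: dict, url: str = COLLECTOR_URL) -> int:
    """POST the payload to the Collector and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

Calling send_span(build_otlp_payload("checkout", "GET /cart", start, end, {"http.target": "/cart"})) with nanosecond timestamps from time.time_ns() should produce a span that passes through the pipeline above, while a span whose http.target starts with /health would be dropped by the filter processor.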

Grafana Tempo

Tempo stores traces cheaply in object storage (S3, GCS) and integrates directly with Grafana; use TraceQL to query by service, duration, or attributes, and link traces to Loki logs via the shared trace ID.

traceql-examples.txt (bash)
# TraceQL: traces whose checkout spans average more than 500ms
{ .service.name = "checkout" } | avg(duration) > 500ms

# Traces with error spans from a specific service
{ .service.name = "payment" && status = error }

# Find traces longer than 2 seconds across all services
# (traceDuration is the whole trace; { duration > 2s } would match individual spans)
{ traceDuration > 2s }

# Traces with a specific HTTP status code
{ span.http.status_code = 500 }

# Tempo HTTP API (useful for scripting)
curl "http://tempo.monitoring.svc:3200/api/traces/<trace-id>"
curl "http://tempo.monitoring.svc:3200/api/search?tags=service.name%3Dcheckout&limit=20"
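
The same search endpoint also accepts a TraceQL query in its q parameter, which is convenient when scripting the queries above. The sketch below builds such a URL and fetches the results with the Python standard library. The base URL is taken from the curl examples; the function names are hypothetical, and the exact fields in each returned trace summary may differ between Tempo versions.

```python
import json
import urllib.parse
import urllib.request

# Same host and port as the curl examples above.
TEMPO_URL = "http://tempo.monitoring.svc:3200"


def search_url(traceql: str, limit: int = 20, base_url: str = TEMPO_URL) -> str:
    """Build a /api/search URL that carries a TraceQL query in the q parameter."""
    query = urllib.parse.urlencode({"q": traceql, "limit": limit})
    return f"{base_url}/api/search?{query}"


def search_traces(traceql: str, limit: int = 20) -> list:
    """Run the search and return the trace summaries from the JSON response."""
    with urllib.request.urlopen(search_url(traceql, limit), timeout=10) as response:
        body = json.load(response)
    # Tempo wraps the results in a top-level "traces" list.
    return body.get("traces", [])
```

For example, search_traces('{ .service.name = "payment" && status = error }', limit=5) should return up to five matching trace summaries, whose IDs can then be passed to the /api/traces/<trace-id> endpoint.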

Log–Trace Correlation

Include the traceId and spanId in every structured log line; Grafana can then auto-link from a Loki log entry to the corresponding Tempo trace, so responders can jump from a log line to its trace without searching for the ID by hand.

grafana-datasource-derived-fields.yaml (yaml)
# Grafana Loki datasource configuration: add a derived field to link trace IDs to Tempo
apiVersion: 1
datasources:
- name: Loki
  type: loki
  url: http://loki.monitoring.svc:3100
  jsonData:
    derivedFields:
    - matcherRegex: '"traceId":"([0-9a-f]+)"'   # extract traceId from JSON logs
      name: TraceID
      url: '$${__value.raw}'                     # query sent to Tempo (the trace ID); $$ escapes $ in provisioning files
      datasourceUid: tempo-uid                   # UID of your Tempo datasource
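
On the application side, the log lines have to match the matcherRegex exactly. Below is a minimal sketch of a Python logging formatter that emits compact JSON with traceId and spanId fields. The class name and the trace_id / span_id record attributes are hypothetical; with an OpenTelemetry SDK the IDs would normally come from the current span context.

```python
import json
import logging
import re

# Same pattern as matcherRegex in the Loki datasource configuration.
DERIVED_FIELD_RE = re.compile(r'"traceId":"([0-9a-f]+)"')


class TraceJsonFormatter(logging.Formatter):
    """Render each log record as compact JSON with traceId and spanId fields."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            # trace_id / span_id are hypothetical attributes passed via extra=
            "traceId": getattr(record, "trace_id", ""),
            "spanId": getattr(record, "span_id", ""),
        }
        # Compact separators matter: the json.dumps default writes
        # '"traceId": "..."' with a space, which the regex would not match.
        return json.dumps(entry, separators=(",", ":"))


def make_logger(name: str = "checkout") -> logging.Logger:
    """Create a logger that writes trace-aware JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(TraceJsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Usage would look like make_logger().info("payment failed", extra={"trace_id": trace_id, "span_id": span_id}). If your log library writes a space after the colon, adjust either the serializer or the regex so the two agree.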

SRE Use Cases

  • Latency investigation: a p99 latency alert fires → find the slowest traces in Tempo → identify the bottleneck span (slow DB query, downstream call).
  • Postmortem timeline: use trace waterfall to prove which service was slow during the incident window.
  • Error attribution: filter by status=error + service to find which operation is producing 5xx responses.
  • Dependency mapping: Tempo service graph view shows which services call each other and their error rates.
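
For the latency-investigation case, "identify the bottleneck span" usually means finding the span with the most self time, that is, its own duration minus the time spent in its direct children. The sketch below shows that calculation on a simplified span list. The dictionary keys (id, parent_id, name, duration_ms) are hypothetical and do not match the Tempo response format, and the calculation assumes children run sequentially.

```python
from collections import defaultdict


def self_times(spans: list) -> dict:
    """Map span ID -> self time in ms (own duration minus direct children)."""
    child_total = defaultdict(float)
    for span in spans:
        if span.get("parent_id") is not None:
            child_total[span["parent_id"]] += span["duration_ms"]
    # Clamp at zero: children that run in parallel can sum to more than the parent.
    return {
        span["id"]: max(span["duration_ms"] - child_total[span["id"]], 0.0)
        for span in spans
    }


def bottleneck(spans: list) -> dict:
    """Return the span with the largest self time."""
    times = self_times(spans)
    return max(spans, key=lambda span: times[span["id"]])
```

Given a 900 ms root span "POST /checkout" with two children, a 700 ms "SELECT orders" and a 120 ms "POST payment/charge", the root has only 80 ms of self time and bottleneck returns the "SELECT orders" span, which is the slow DB query from the example in the list above.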