Distributed Tracing — K8s SRE Reference

TL;DR

Distributed tracing answers "where did this request spend its time?" across services. Use OpenTelemetry (OTEL) for instrumentation and collection, with Tempo or Jaeger as the backend. Traces are most useful during latency investigations and postmortems; correlate them with Loki logs via trace IDs.

Concepts

Term	Definition
Trace	A tree of spans representing a single request through a distributed system
Span	One unit of work (e.g., an HTTP call, DB query); has name, duration, attributes, and status
Trace ID	Globally unique ID propagated in HTTP headers (`traceparent`) across service boundaries
Sampler	Decides which traces to record; head-based (at start) or tail-based (after completion)
Collector	The OTEL Collector receives spans from services, processes them, and exports them to backends
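
To make the Trace ID row concrete, here is a minimal sketch of building and parsing a `traceparent` value in the W3C Trace Context layout (`version-traceid-parentid-flags`). The helper names `new_traceparent`, `parse_traceparent`, and `child_traceparent` are hypothetical; in practice an OTEL SDK propagator does this for you.

```python
import re
import secrets
from typing import Optional

# Version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Build a traceparent value for a brand-new trace (hypothetical helper)."""
    trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)     # 8 random bytes -> 16 hex chars
    flags = "01" if sampled else "00"  # bit 0 is the "sampled" flag
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the header's fields, or None if it is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming: str) -> Optional[str]:
    """Keep the trace ID and sampled flag; use a fresh span ID for the outgoing call."""
    ctx = parse_traceparent(incoming)
    if ctx is None:
        return None
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{flags}"
```

A service that receives a request would call `child_traceparent` on the incoming header before calling downstream, so every hop shares one trace ID while each span gets its own ID.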

OpenTelemetry Collector

Deploy the OTEL Collector as a DaemonSet (agent, one per node) or as a Deployment (gateway). Apps send spans to the local agent, which batches them and forwards them to Tempo or Jaeger, so the app stays decoupled from the backend choice.

otel-collector-config.yaml

receivers:
  otlp:
    protocols:
      grpc: {endpoint: 0.0.0.0:4317}
      http: {endpoint: 0.0.0.0:4318}
  # Also accept Jaeger format from older services
  jaeger:
    protocols:
      thrift_http: {endpoint: 0.0.0.0:14268}

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  # Drop internal health-check spans (high volume, low value)
  filter:
    spans:
      exclude:
        match_type: regexp
        attributes:
          - {key: http.target, value: "^/health.*"}

exporters:
  otlp:
    endpoint: tempo.monitoring.svc:4317
    tls: {insecure: true}
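  # Also export to Jaeger for teams still using it.
  # Note: newer Collector releases no longer ship the dedicated jaeger exporter;
  # on those, use a second otlp exporter pointed at Jaeger's OTLP port instead.
  jaeger:
    endpoint: jaeger-collector.monitoring.svc:14250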

service:
  pipelines:
    traces:
      receivers:  [otlp, jaeger]
      processors: [filter, batch]  # filter first so dropped spans never occupy a batch
      exporters:  [otlp, jaeger]
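
As a rough mental model of what the filter and batch stages in this config do, the sketch below reimplements them in plain Python. It is not the Collector's code: `keep_span` and `Batcher` are hypothetical names, spans are plain dicts, and the timeout is simplified to "time since the first span entered the current batch".

```python
import re
import time
from typing import Callable, Dict, List

HEALTH_RE = re.compile(r"^/health.*")  # same pattern as the filter rule in the config


def keep_span(span: Dict) -> bool:
    """Mirror of the exclude rule: drop spans whose http.target matches."""
    target = span.get("attributes", {}).get("http.target", "")
    return HEALTH_RE.search(target) is None


class Batcher:
    """Toy stand-in for the batch processor: flush on size or on timeout."""

    def __init__(
        self,
        export: Callable[[List[Dict]], None],
        send_batch_size: int = 1024,
        timeout_s: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.export = export
        self.send_batch_size = send_batch_size
        self.timeout_s = timeout_s
        self.clock = clock
        self.buffer: List[Dict] = []
        self.first_seen = 0.0

    def add(self, span: Dict) -> None:
        if not self.buffer:
            self.first_seen = self.clock()  # start the timeout window
        self.buffer.append(span)
        if len(self.buffer) >= self.send_batch_size:
            self.flush()

    def tick(self) -> None:
        """Call periodically; sends a partial batch once the timeout has passed."""
        if self.buffer and self.clock() - self.first_seen >= self.timeout_s:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.export(self.buffer)
            self.buffer = []  # new list, so the exported batch is left intact
```

Running `keep_span` before `Batcher.add` corresponds to listing the filter ahead of the batch processor in the pipeline: health-check spans are discarded before they take up room in a batch.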

Grafana Tempo

Tempo stores traces cheaply in object storage (S3, GCS) and integrates directly with Grafana. Use TraceQL to query by service, duration, or attributes, and link traces to Loki logs via the shared trace ID.

traceql-examples.txt

# TraceQL: traces whose checkout spans average more than 500ms
{ .service.name = "checkout" } | avg(duration) > 500ms

# Traces with error spans from a specific service
{ .service.name = "payment" && status = error }

# Find traces longer than 2 seconds end to end, across all services
# (plain `duration` would match individual spans, not whole traces)
{ traceDuration > 2s }

# Traces with a specific HTTP status code
{ span.http.status_code = 500 }

# Tempo HTTP API (useful for scripting)
curl "http://tempo.monitoring.svc:3200/api/traces/<trace-id>"
curl "http://tempo.monitoring.svc:3200/api/search?tags=service.name%3Dcheckout&limit=20"
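
For scripting beyond one-off curl calls, the same two endpoints can be wrapped in a few stdlib helpers. This is a sketch under assumptions: `trace_url`, `search_url`, and `fetch_json` are hypothetical names, the base URL is copied from the curl examples, and `tags` is encoded as space-separated `key=value` pairs (values containing spaces would need quoting, which is not handled here).

```python
import json
from typing import Dict
from urllib.parse import quote, urlencode
from urllib.request import urlopen

TEMPO_BASE = "http://tempo.monitoring.svc:3200"  # same host as the curl examples


def trace_url(trace_id: str, base: str = TEMPO_BASE) -> str:
    """URL for fetching a single trace by ID."""
    return f"{base}/api/traces/{quote(trace_id, safe='')}"


def search_url(tags: Dict[str, str], limit: int = 20, base: str = TEMPO_BASE) -> str:
    """URL for a tag search; tags become space-separated key=value pairs."""
    logfmt = " ".join(f"{key}={value}" for key, value in tags.items())
    # quote_via=quote percent-encodes "=" as %3D and spaces as %20
    query = urlencode({"tags": logfmt, "limit": limit}, quote_via=quote)
    return f"{base}/api/search?{query}"


def fetch_json(url: str, timeout_s: float = 10.0):
    """GET a URL and decode the JSON body (needs network access to Tempo)."""
    with urlopen(url, timeout=timeout_s) as response:
        return json.load(response)
```

`search_url({"service.name": "checkout"})` reproduces the search URL from the second curl example, and `fetch_json(...)` on either URL returns the decoded response for further processing.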

Log–Trace Correlation

Include the traceId and spanId in every structured log line. Grafana can then link from a Loki log entry straight to the corresponding Tempo trace, which cuts context-switching during incidents.

grafana-datasource-derived-fields.yaml

# Grafana Loki datasource configuration: add a derived field to link trace IDs to Tempo
apiVersion: 1
datasources:
- name: Loki
  type: loki
  url: http://loki.monitoring.svc:3100
  jsonData:
    derivedFields:
    - matcherRegex: '"traceId":"([0-9a-f]+)"'   # extract traceId from JSON logs
      name: TraceID
      url: '$${__value.raw}'                     # query sent to Tempo; "$$" escapes "$" in provisioning files
      datasourceUid: tempo-uid                   # UID of your Tempo datasource
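
On the application side, the log line has to match that `matcherRegex` exactly. The sketch below shows one way to emit such lines with Python's standard `logging` module; `JsonTraceFormatter` and the `trace_id` / `span_id` record attributes are hypothetical names, and in a real service the IDs would come from the active span rather than being passed by hand.

```python
import json
import logging
import re

# Same pattern as matcherRegex in the datasource config.
MATCHER = re.compile(r'"traceId":"([0-9a-f]+)"')


class JsonTraceFormatter(logging.Formatter):
    """Render each record as compact JSON carrying traceId and spanId."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "msg": record.getMessage(),
            # trace_id / span_id are hypothetical attributes supplied via `extra=`
            "traceId": getattr(record, "trace_id", ""),
            "spanId": getattr(record, "span_id", ""),
        }
        # Compact separators matter: json.dumps defaults to ": " after keys,
        # which yields "traceId": "..." and would NOT match the regex.
        return json.dumps(payload, separators=(",", ":"))


def build_logger(name: str = "checkout") -> logging.Logger:
    """Logger that writes one JSON object per line to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonTraceFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Usage would look like `build_logger().error("charge failed", extra={"trace_id": "...", "span_id": "..."})`. If your logger emits JSON with a space after the colon, either switch to compact separators as here or loosen the regex to allow optional whitespace.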

SRE Use Cases

✓ Latency investigation: a p99 latency alert fires → find the slowest traces in Tempo → identify the bottleneck span (slow DB query, downstream call).
✓ Postmortem timeline: use trace waterfall to prove which service was slow during the incident window.
✓ Error attribution: filter by status=error + service to find which operation is producing 5xx responses.
✓ Dependency mapping: Tempo service graph view shows which services call each other and their error rates.
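
The "identify the bottleneck span" step in the latency workflow usually means finding the span with the most self time, that is, its duration minus the time covered by its direct children. Here is a minimal sketch of that calculation; the span dict fields (`span_id`, `parent_id`, `start_ms`, `end_ms`, `name`) are assumptions for illustration, not Tempo's response schema.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def self_times(spans: List[dict]) -> Dict[str, float]:
    """Self time per span: own duration minus time covered by direct children."""
    children = defaultdict(list)
    for span in spans:
        if span.get("parent_id"):
            children[span["parent_id"]].append(span)

    result: Dict[str, float] = {}
    for span in spans:
        # Clip each child to the parent's window, then merge overlaps so
        # parallel children are not double-counted.
        intervals = sorted(
            (max(c["start_ms"], span["start_ms"]), min(c["end_ms"], span["end_ms"]))
            for c in children[span["span_id"]]
        )
        covered, cursor = 0.0, span["start_ms"]
        for start, end in intervals:
            start = max(start, cursor)
            if end > start:
                covered += end - start
                cursor = end
        result[span["span_id"]] = (span["end_ms"] - span["start_ms"]) - covered
    return result


def bottleneck(spans: List[dict]) -> Tuple[str, float]:
    """Name and self time (ms) of the span that spent the most time itself."""
    times = self_times(spans)
    names = {span["span_id"]: span["name"] for span in spans}
    worst = max(times, key=times.get)
    return names[worst], times[worst]


example = [
    {"span_id": "a", "parent_id": None, "name": "GET /checkout", "start_ms": 0, "end_ms": 1000},
    {"span_id": "b", "parent_id": "a", "name": "cart GET /cart", "start_ms": 10, "end_ms": 110},
    {"span_id": "c", "parent_id": "a", "name": "payment POST /charge", "start_ms": 120, "end_ms": 980},
    {"span_id": "d", "parent_id": "c", "name": "SELECT orders", "start_ms": 130, "end_ms": 930},
]
```

In the sample data the root span lasts 1000 ms, but almost all of that is spent waiting on descendants; `bottleneck(example)` points at the database query inside the payment call rather than at the root or the payment span itself.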
