Distributed Tracing
Distributed tracing answers "where did this request spend its time?" across services. Instrument and collect with OpenTelemetry (OTEL), and store traces in Tempo or Jaeger. Traces are most useful during latency investigations and postmortems; correlate them with Loki logs using trace IDs.
Concepts
| Term | Definition |
|---|---|
| Trace | A tree of spans representing a single request through a distributed system |
| Span | One unit of work (e.g., an HTTP call, DB query); has name, duration, attributes, and status |
| Trace ID | Globally unique ID propagated in HTTP headers (traceparent) across service boundaries |
| Sampler | Decides which traces to record; head-based (at start) or tail-based (after completion) |
| Collector | The OTEL Collector receives spans from services, processes them, and exports them to backends |
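
To make the Trace ID row concrete, here is a minimal sketch of building and parsing a W3C `traceparent` header using only the Python standard library. The function names are hypothetical, and only the version `00` layout is handled; real services would normally let the OpenTelemetry SDK's propagator do this.

```python
import re
import secrets

# Version 00 layout: 00-<32 hex trace ID>-<16 hex parent span ID>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: random IDs plus the head sampler's decision."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str):
    """Return the header fields as a dict, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C Trace Context spec.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest bit is the sampled flag
    }


def child_traceparent(incoming: str) -> str:
    """Header for an outgoing call: same trace ID and sampling decision, new span ID."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        raise ValueError("malformed traceparent header")
    flags = "01" if parsed["sampled"] else "00"
    return f"00-{parsed['trace_id']}-{secrets.token_hex(8)}-{flags}"
```

The trace ID stays constant across every hop while each service substitutes its own span ID, which is what lets a backend reassemble the spans into one tree.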
OpenTelemetry Collector
Deploy the OTEL Collector as a Deployment (gateway) or a DaemonSet (agent). Apps send spans to the local agent, which batches them and forwards them to Tempo/Jaeger, decoupling the app from the backend choice.
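
The batch-and-forward step described above can be sketched as a small size-or-timeout buffer. This is an illustration of the idea only, not the Collector's actual implementation; the class name `SpanBatcher` and its methods are hypothetical, and the parameter names simply mirror the Collector's `send_batch_size` and `timeout` settings.

```python
import time
from typing import Callable, List, Optional


class SpanBatcher:
    """Buffer spans and export them when the batch is full or has waited too long."""

    def __init__(self, export: Callable[[List[dict]], None],
                 send_batch_size: int = 1024, timeout_s: float = 1.0,
                 clock: Callable[[], float] = time.monotonic):
        self._export = export
        self._size = send_batch_size
        self._timeout = timeout_s
        self._clock = clock  # injectable so the behaviour can be tested without sleeping
        self._buffer: List[dict] = []
        self._first_at: Optional[float] = None  # arrival time of the oldest buffered span

    def add(self, span: dict) -> None:
        if not self._buffer:
            self._first_at = self._clock()
        self._buffer.append(span)
        if len(self._buffer) >= self._size:
            self.flush()  # size trigger

    def tick(self) -> None:
        """Call periodically: flushes a partial batch once it is older than the timeout."""
        if self._buffer and self._clock() - self._first_at >= self._timeout:
            self.flush()  # time trigger

    def flush(self) -> None:
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._export(batch)
```

Batching trades a small delay (at most the timeout) for far fewer export calls, which is why the app can hand spans to the agent cheaply and let the agent absorb backend latency.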

    receivers:
      otlp:
        protocols:
          grpc: {endpoint: 0.0.0.0:4317}
          http: {endpoint: 0.0.0.0:4318}
      # Also accept Jaeger format from older services
      jaeger:
        protocols:
          thrift_http: {endpoint: 0.0.0.0:14268}
    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
      # Drop internal health-check spans (high volume, low value)
      filter:
        spans:
          exclude:
            match_type: regexp
            attributes:
              - {key: http.target, value: "^/health.*"}
    exporters:
      otlp:
        endpoint: tempo.monitoring.svc:4317
        tls: {insecure: true}
      # Also export to Jaeger for teams still using it.
      # Note: recent Collector releases removed the jaeger exporter;
      # newer Jaeger versions accept OTLP directly.
      jaeger:
        endpoint: jaeger-collector.monitoring.svc:14250
        tls: {insecure: true}  # plaintext in-cluster, as for Tempo above
    service:
      pipelines:
        traces:
          receivers: [otlp, jaeger]
          # Filter before batching so dropped spans are never buffered
          processors: [filter, batch]
          exporters: [otlp, jaeger]

Grafana Tempo
Tempo stores traces cheaply in object storage (S3, GCS) and integrates directly with Grafana; use TraceQL to query by service, duration, or attributes, and link traces to Loki logs via the shared trace ID.
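
For scripting, a TraceQL query can also be sent to Tempo's search endpoint through its `q` parameter. The sketch below builds the URL with the standard library and, optionally, fetches results. The helper names are hypothetical, and the `traces` key in the response is an assumption based on Tempo's documented search response, so check it against your Tempo version.

```python
import json
import urllib.parse
import urllib.request


def tempo_search_url(base_url: str, traceql: str, limit: int = 20) -> str:
    """Build a /api/search URL with the TraceQL query safely URL-encoded."""
    query = urllib.parse.urlencode({"q": traceql, "limit": limit})
    return f"{base_url.rstrip('/')}/api/search?{query}"


def search_traces(base_url: str, traceql: str, limit: int = 20,
                  timeout_s: float = 10.0) -> list:
    """Run the search and return the list of matching trace summaries."""
    url = tempo_search_url(base_url, traceql, limit)
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        # Assumed response shape: {"traces": [...], ...}
        return json.load(resp).get("traces", [])
```

Encoding the query with `urlencode` avoids hand-escaping braces, quotes, and spaces, which is the usual source of errors when TraceQL is pasted into a shell command.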

    # TraceQL: traces in a specific service whose average span duration exceeds 500ms
    { .service.name = "checkout" } | avg(duration) > 500ms
    # Traces with error spans from a specific service
    { .service.name = "payment" && status = error }
    # Spans longer than 2 seconds in any service (duration is per span, not per trace)
    { duration > 2s }
    # Spans with a specific HTTP status code
    { span.http.status_code = 500 }

    # Tempo HTTP API (useful for scripting)
    curl "http://tempo.monitoring.svc:3200/api/traces/<trace-id>"
    curl "http://tempo.monitoring.svc:3200/api/search?tags=service.name%3Dcheckout&limit=20"

Log–Trace Correlation
Include the traceId and spanId in every structured log line; Grafana can then auto-link from a Loki log entry to the corresponding Tempo trace, dramatically reducing context-switching during incidents.
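
As a minimal sketch of such a log line, the formatter below emits compact JSON with `traceId` and `spanId` fields using only the standard library. The logger name and the way the IDs are passed in (via `extra`) are assumptions for illustration; in an instrumented service they would normally come from the active span context.

```python
import io
import json
import logging
import re


class JsonTraceFormatter(logging.Formatter):
    """Render each record as one compact JSON object carrying trace context."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "traceId": getattr(record, "traceId", ""),
            "spanId": getattr(record, "spanId", ""),
        }
        # Compact separators: no space after ':' so a regex such as
        # "traceId":"([0-9a-f]+)" matches the output exactly.
        return json.dumps(payload, separators=(",", ":"))


stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonTraceFormatter())

logger = logging.getLogger("checkout")  # hypothetical service logger
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info(
    "payment authorized",
    extra={"traceId": "4bf92f3577b34da6a3ce929d0e0e4736", "spanId": "00f067aa0ba902b7"},
)
```

Note that `json.dumps` with default separators writes `"traceId": "..."` with a space after the colon, which a strict pattern without whitespace would not match; either emit compact JSON as here or allow `\s*` in the regex.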

    # Grafana Loki datasource configuration: add a derived field to link trace IDs to Tempo
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        url: http://loki.monitoring.svc:3100
        jsonData:
          derivedFields:
            - matcherRegex: '"traceId":"([0-9a-f]+)"'  # extract traceId from JSON logs
              name: TraceID
              url: '$${__value.raw}'  # the captured trace ID becomes the Tempo query
              datasourceUid: tempo-uid  # UID of your Tempo datasource

SRE Use Cases
- Latency investigation: a p99 latency alert fires → find the slowest traces in Tempo → identify the bottleneck span (slow DB query, downstream call).
- Postmortem timeline: use the trace waterfall to show which service was slow during the incident window.
- Error attribution: filter by status=error + service to find which operation is producing 5xx responses.
- Dependency mapping: Tempo service graph view shows which services call each other and their error rates.
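
The "identify the bottleneck span" step in the latency investigation can be sketched as a self-time calculation: a span's duration minus the time covered by its direct children. The `Span` shape and function names below are hypothetical and do not correspond to a Tempo API; the spans would come from whatever trace export you have.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: float
    end_ms: float


def self_times(spans: List[Span]) -> Dict[str, float]:
    """Self time per span: own duration minus time covered by direct children."""
    children = defaultdict(list)
    for span in spans:
        if span.parent_id is not None:
            children[span.parent_id].append(span)

    result = {}
    for span in spans:
        covered = 0.0
        cursor = span.start_ms  # end of the interval already counted
        # Walk children in start order, merging overlaps so that
        # parallel children are not double-counted.
        for child in sorted(children[span.span_id], key=lambda c: c.start_ms):
            lo = max(child.start_ms, cursor)
            hi = min(child.end_ms, span.end_ms)
            if hi > lo:
                covered += hi - lo
                cursor = hi
        result[span.span_id] = (span.end_ms - span.start_ms) - covered
    return result


def bottleneck(spans: List[Span]) -> Span:
    """Return the span with the largest self time."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id])
```

Ranking by self time rather than raw duration keeps the root span from always winning: the root is the longest span by definition, but the time is usually spent in one of its descendants.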