Thanos
TL;DR
Thanos extends Prometheus with long-term object storage, a global query view, and downsampling. The Sidecar uploads TSDB blocks to S3/GCS; Query fans out across multiple Prometheus (Sidecar) and Store Gateway endpoints; the Compactor compacts and downsamples blocks and enforces retention on the bucket.
Components
| Component | Role |
|---|---|
| Sidecar | Runs alongside Prometheus; uploads TSDB blocks to object storage |
| Query | Single PromQL endpoint aggregating all stores |
| Store Gateway | Serves historical blocks from object storage |
| Compactor | Downsamples and compacts blocks in object storage |
| Receiver | Alternative ingest path via Prometheus remote write; typically used instead of the Sidecar |
| Ruler | Global alerting/recording rules against Query |
Typical Flow
Sidecar uploads blocks to object storage; Query federates recent (sidecar) and historical (store gateway) data.
- Prometheus scrapes metrics locally (short retention, e.g. 15 days).
- Thanos Sidecar uploads completed blocks to S3/GCS.
- Thanos Query fans out to Sidecar (recent) + Store Gateway (historical).
- Grafana datasource points at Thanos Query for unified long-range dashboards.
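- Compactor compacts and downsamples old blocks so long-range queries stay cheap; storage cost is controlled by its retention settings, because downsampling adds blocks rather than replacing the raw ones.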
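The fan-out step in the flow above can be sketched as the argument list for a `thanos query` process. This is a minimal sketch, not a definitive deployment: the two service addresses are hypothetical placeholders, and the HTTP port is set to 9090 only to match the port-forward used later in these notes (the Thanos default HTTP port is 10902).

```bash
#!/usr/bin/env bash
# Sketch: assemble `thanos query` arguments that fan out to one Sidecar and one Store Gateway.
set -euo pipefail

# Hypothetical Store API (gRPC) addresses; 10901 is the default Thanos gRPC port.
SIDECAR_ENDPOINT="prometheus-thanos-sidecar.monitoring.svc:10901"
STORE_GATEWAY_ENDPOINT="thanos-store-gateway.monitoring.svc:10901"

query_args=(
  query
  --http-address=0.0.0.0:9090
  --grpc-address=0.0.0.0:10901
  "--endpoint=${SIDECAR_ENDPOINT}"
  "--endpoint=${STORE_GATEWAY_ENDPOINT}"
)

# Print the resulting command line instead of starting the process.
printf 'thanos %s\n' "${query_args[*]}"
```

Each `--endpoint` is one Store API target: the Sidecar answers for recent data still held by Prometheus, the Store Gateway for blocks already in the bucket.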
Helm Enable (kube-prometheus-stack)
```yaml
# values-thanos.yaml
prometheus:
  prometheusSpec:
    retention: 15d
    thanos:
      image: quay.io/thanos/thanos:v0.34.1
      objectStorageConfig:
        existingSecret:
          name: thanos-objstore
          key: thanos.yaml
  thanosService:
    enabled: true
  thanosServiceMonitor:
    enabled: true
```
```yaml
# thanos-objstore-secret.yaml
type: S3
config:
  bucket: client-prod-thanos
  endpoint: s3.us-east-1.amazonaws.com
  region: us-east-1
  # Use IRSA or access_key/secret_key per client policy.
```
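One way to apply the two files above is sketched below. The release name, the `prometheus-community` repo alias, and the `run` dry-run helper are assumptions for illustration, not part of the original notes; the secret name and key are taken from `values-thanos.yaml`. By default the script only prints the commands.

```bash
#!/usr/bin/env bash
# Sketch: create the object-storage secret, then roll out the chart with the Thanos values.
set -euo pipefail

NAMESPACE="monitoring"
RELEASE="kube-prometheus-stack"   # hypothetical release name
DRY_RUN="${DRY_RUN:-1}"           # default: print commands only; set DRY_RUN=0 to execute

run() {
  # Print the command, and execute it only when DRY_RUN=0.
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# The key name (thanos.yaml) must match objectStorageConfig.existingSecret.key.
run kubectl create secret generic thanos-objstore \
  --namespace "$NAMESPACE" \
  --from-file=thanos.yaml=thanos-objstore-secret.yaml

run helm upgrade --install "$RELEASE" prometheus-community/kube-prometheus-stack \
  --namespace "$NAMESPACE" \
  --values values-thanos.yaml
```

The secret is created first because the Sidecar container cannot start without its object-storage config, and it has to live in the same namespace as Prometheus.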
Commands
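```bash
# thanos.sh
kubectl get pods -n monitoring | grep thanos
# Service name and port depend on how Thanos Query was deployed;
# kube-prometheus-stack itself only adds the Sidecar.
kubectl port-forward -n monitoring svc/thanos-query 9090:9090
# Thanos Query UI: check Stores; it should list sidecar + store-gateway.
# Run the same PromQL as in Prometheus; extend the time range beyond local retention.
```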
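The long-range check can also be done against the HTTP API, since Thanos Query serves the Prometheus `/api/v1/query_range` endpoint. The following is a sketch that assumes the port-forward above is active; the 30-day window, the `sum(up)` expression, and the 1h step are illustrative choices only.

```bash
#!/usr/bin/env bash
# Sketch: range query over the last 30 days, longer than the 15d local retention.
set -euo pipefail

QUERY_URL="http://localhost:9090"   # the port-forwarded Thanos Query
RANGE_DAYS=30
STEP="1h"

end="$(date +%s)"
start="$(( end - RANGE_DAYS * 24 * 3600 ))"

# Prints the request; remove the leading `echo` to send it.
echo curl -sG "${QUERY_URL}/api/v1/query_range" \
  --data-urlencode "query=sum(up)" \
  --data-urlencode "start=${start}" \
  --data-urlencode "end=${end}" \
  --data-urlencode "step=${STEP}"
```

If the response only contains samples from the last 15 days, the Store Gateway is probably missing from the Stores list or no blocks have reached the bucket yet.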
Multi-Cluster Query
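Each cluster runs Prometheus + Sidecar uploading to a shared bucket (or per-cluster buckets). A central Thanos Query is pointed at every cluster's Store API endpoints, statically or via DNS discovery: Sidecars for recent data, Store Gateways for historical data.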
| Pattern | When |
|---|---|
| Shared bucket, external labels | Distinguish clusters via externalLabels on Prometheus |
| Central Query in mgmt cluster | Single Grafana datasource for all clusters |
| Per-cluster Query | Smaller setups; query locally only |
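For the shared-bucket pattern, each cluster needs its own external label. A minimal sketch of setting it per cluster through Helm follows; the cluster names (reused here as kube context names), the release name, and the repo alias are hypothetical. The script prints one `helm upgrade` command per cluster and does not run them.

```bash
#!/usr/bin/env bash
# Sketch: print a helm command per cluster that sets a distinct `cluster` external label.
set -euo pipefail

CLUSTERS=(prod-us-east-1 prod-eu-west-1)   # hypothetical cluster / kube context names

label_cmd() {
  # Emit the helm invocation for one cluster.
  local cluster="$1"
  printf 'helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack'
  printf ' --kube-context %s --namespace monitoring --reuse-values' "$cluster"
  printf ' --set prometheus.prometheusSpec.externalLabels.cluster=%s\n' "$cluster"
}

for c in "${CLUSTERS[@]}"; do
  label_cmd "$c"
done
```

After rollout, a query such as `count by (cluster) (up)` in Thanos Query should return one series per cluster.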
Gotchas
- externalLabels required: without a unique external label per Prometheus (for example a cluster label), series from multiple Prometheus instances collide.
- Compactor is a singleton: run exactly one Compactor per bucket; concurrent Compactors can produce overlapping blocks and corrupt data.
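- Query ≠ Prometheus: Query deduplicates HA replicas, can return partial responses when a store is unavailable, and can read downsampled data on long ranges, so results may differ from a single Prometheus; test long-range queries.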
- Block upload delay: the Sidecar uploads a block only after Prometheus completes it (about 2h). Data newer than that exists only in local Prometheus, where Query reads it through the Sidecar.
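The singleton and retention points can be made concrete with a sketch of a single `thanos compact` invocation. The retention values and data directory below are hypothetical and should follow client policy; the config file is the same bucket definition the Sidecar uses. The script only prints the command line.

```bash
#!/usr/bin/env bash
# Sketch: arguments for the one Compactor that owns the bucket.
set -euo pipefail

OBJSTORE_CONFIG="thanos-objstore-secret.yaml"   # same bucket config as the Sidecar

compact_args=(
  compact
  --wait
  --data-dir=/var/thanos/compact
  "--objstore.config-file=${OBJSTORE_CONFIG}"
  --retention.resolution-raw=30d
  --retention.resolution-5m=90d
  --retention.resolution-1h=1y
)

# Print the resulting command line instead of starting the process.
printf 'thanos %s\n' "${compact_args[*]}"
```

`--wait` keeps the process running between compaction passes; on Kubernetes this maps to a workload with exactly one replica per bucket.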