TL;DR

Thanos extends Prometheus with long-term object storage, a global query view, and downsampling. The Sidecar uploads blocks to S3/GCS, Query federates multiple Prometheus/Store endpoints, and the Compactor compacts and downsamples blocks to cut storage and long-range query cost.

Components

Component       Role
Sidecar         Runs alongside Prometheus; uploads TSDB blocks to object storage and serves recent data to Query
Query           Single PromQL endpoint aggregating all stores
Store Gateway   Serves historical blocks from object storage
Compactor       Compacts and downsamples blocks in object storage
Receiver        Alternative ingest path via Prometheus remote write; usually an alternative to the Sidecar rather than used with it
Ruler           Evaluates global alerting/recording rules against Query
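As a rough sketch of how the Store Gateway is wired, the pod-spec fragment below runs `thanos store` against the same objstore config the Sidecar uses. It assumes the `thanos-objstore` Secret (key `thanos.yaml`) referenced in the Helm values later on; the container name and mount paths are illustrative choices, while 10901/10902 are the Thanos default gRPC/HTTP ports.

```yaml
# Pod spec fragment (hypothetical names and paths); wrap in a Deployment or StatefulSet.
containers:
  - name: store-gateway
    image: quay.io/thanos/thanos:v0.34.1
    args:
      - store
      - --data-dir=/var/thanos/store                    # local cache (index headers, block metadata)
      - --objstore.config-file=/etc/thanos/thanos.yaml  # same bucket config as the Sidecar
      - --grpc-address=0.0.0.0:10901                    # StoreAPI that Thanos Query connects to
      - --http-address=0.0.0.0:10902                    # metrics and health probes
    volumeMounts:
      - name: objstore
        mountPath: /etc/thanos
        readOnly: true
      - name: data
        mountPath: /var/thanos/store
volumes:
  - name: objstore
    secret:
      secretName: thanos-objstore   # key thanos.yaml is mounted as /etc/thanos/thanos.yaml
  - name: data
    emptyDir: {}                    # cache only; blocks themselves stay in the bucket
```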

Typical Flow

[Diagram: Prometheus (local TSDB, ~15d), Thanos Sidecar, S3 / GCS (long-term blocks), Store Gateway (reads blocks), Thanos Query (global PromQL API), Grafana (long-range dashboards), Compactor (downsample, singleton).]

Sidecar uploads blocks to object storage; Query federates recent (sidecar) and historical (store gateway) data.

  1. Prometheus scrapes metrics locally (short retention, e.g. 15 days).
  2. Thanos Sidecar uploads completed blocks to S3/GCS.
  3. Thanos Query fans out to Sidecar (recent) + Store Gateway (historical).
  4. Grafana datasource points at Thanos Query for unified long-range dashboards.
  5. Compactor downsamples old data to reduce storage and query cost.
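Step 3 can be sketched as the container args of a Thanos Query deployment. This is a minimal sketch under assumptions: the Service names are hypothetical (substitute the Sidecar discovery Service your chart creates and your Store Gateway Service), and `prometheus_replica` is assumed to be the replica external label on HA Prometheus pairs.

```yaml
# Container fragment for Thanos Query (hypothetical Service names).
containers:
  - name: query
    image: quay.io/thanos/thanos:v0.34.1
    args:
      - query
      - --http-address=0.0.0.0:9090   # PromQL API + UI; the port Grafana points at
      - --grpc-address=0.0.0.0:10901
      # Recent data: Sidecar StoreAPI. dns+ resolves every A record, so all replicas are queried.
      - --endpoint=dns+prometheus-thanos-discovery.monitoring.svc:10901
      # Historical data: Store Gateway StoreAPI.
      - --endpoint=dns+thanos-store-gateway.monitoring.svc:10901
      # Deduplicate series that differ only by this label (HA Prometheus replicas).
      - --query.replica-label=prometheus_replica
```

Listing both endpoints is what lets a single query span recent (Sidecar) and historical (Store Gateway) data.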

Enable via Helm (kube-prometheus-stack)

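values-thanos.yaml:
```yaml
prometheus:
  prometheusSpec:
    retention: 15d
    thanos:
      image: quay.io/thanos/thanos:v0.34.1
      objectStorageConfig:
        existingSecret:
          name: thanos-objstore
          key: thanos.yaml

  # In kube-prometheus-stack these keys are nested under `prometheus:`.
  thanosService:
    enabled: true          # Service exposing the Sidecar's gRPC StoreAPI to Query
  thanosServiceMonitor:
    enabled: true          # scrape the Sidecar's own metrics
```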

thanos-objstore-secret.yaml (objstore config, stored under key thanos.yaml in Secret thanos-objstore):
```yaml
type: S3
config:
  bucket: client-prod-thanos
  endpoint: s3.us-east-1.amazonaws.com
  region: us-east-1
  # Use IRSA or access_key/secret_key per client policy.
```
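If the Secret is managed as a manifest rather than created from a file, a minimal sketch looks like this. It assumes the `monitoring` namespace used in the commands below; `metadata.name` and the `thanos.yaml` key must match `existingSecret` in the Helm values.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore       # must match existingSecret.name
  namespace: monitoring       # assumption: same namespace as Prometheus
type: Opaque
stringData:
  thanos.yaml: |              # must match existingSecret.key
    type: S3
    config:
      bucket: client-prod-thanos
      endpoint: s3.us-east-1.amazonaws.com
      region: us-east-1
```

Keeping credentials out of this manifest (IRSA) avoids committing keys alongside the config.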

Commands

thanos.sh:
```bash
kubectl get pods -n monitoring | grep thanos
kubectl port-forward -n monitoring svc/thanos-query 9090:9090

# Thanos Query UI (http://localhost:9090): the Stores page should list the sidecar and the store gateway.
# Run the same PromQL as in Prometheus, extending the time range beyond local retention.
```
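To make Thanos Query a permanent Grafana datasource (step 4 of the flow), a Grafana datasource provisioning file along these lines should work. It assumes Query is reachable in-cluster as Service `thanos-query` in `monitoring` on port 9090, matching the port-forward above; the datasource name is arbitrary.

```yaml
apiVersion: 1
datasources:
  - name: Thanos
    type: prometheus        # Thanos Query serves the Prometheus HTTP API
    access: proxy           # requests go through Grafana's backend
    url: http://thanos-query.monitoring.svc:9090
```

In kube-prometheus-stack the same entry can instead be supplied through `grafana.additionalDataSources`.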

Multi-Cluster Query

Each cluster runs Prometheus + Sidecar, uploading to a shared bucket (or per-cluster buckets). A central Thanos Query is pointed at every cluster's Store endpoints (Sidecars and Store Gateways).

Pattern                           When to use
Shared bucket + external labels   Distinguish clusters via a unique externalLabels value on each Prometheus
Central Query in mgmt cluster     One Grafana datasource for all clusters
Per-cluster Query                 Smaller setups; each cluster is queried locally only
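A minimal sketch of the shared-bucket pattern in kube-prometheus-stack values: give each cluster's Prometheus a distinct `cluster` external label. The value `prod-us-east-1` is hypothetical.

```yaml
prometheus:
  prometheusSpec:
    externalLabels:
      cluster: prod-us-east-1   # hypothetical; must be unique per cluster
```

The Sidecar records these labels in the metadata of every block it uploads, so Query and the Compactor can tell clusters apart even within one bucket.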

Gotchas

  • externalLabels required: without a unique cluster label, series from different Prometheus instances collide.
  • Compactor is a singleton: run only one Compactor per bucket; duplicates can corrupt data.
  • Query ≠ Prometheus: over long ranges Query may read downsampled (5m/1h) data and deduplicate HA replicas, so results (e.g. rate() over short windows) can differ; test long-range queries.
  • Block upload delay: the Sidecar uploads a block only after Prometheus completes it (~2h), so the newest data lives only in local Prometheus and reaches Query through the Sidecar.
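The Compactor singleton rule can be enforced at the Deployment level. A sketch under assumptions: the names, mount paths, and retention values are hypothetical, and it reuses the `thanos-objstore` Secret (key `thanos.yaml`) from the Helm values. `replicas: 1` plus the `Recreate` strategy keeps a rollout from briefly running two Compactors against the same bucket.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-compactor
  namespace: monitoring
spec:
  replicas: 1                # exactly one Compactor per bucket
  strategy:
    type: Recreate           # stop the old pod before starting the new one
  selector:
    matchLabels:
      app: thanos-compactor
  template:
    metadata:
      labels:
        app: thanos-compactor
    spec:
      containers:
        - name: compactor
          image: quay.io/thanos/thanos:v0.34.1
          args:
            - compact
            - --wait                                   # keep running and wait for new blocks
            - --data-dir=/var/thanos/compact           # scratch space for compaction
            - --objstore.config-file=/etc/thanos/thanos.yaml
            - --retention.resolution-raw=30d           # hypothetical retention values
            - --retention.resolution-5m=90d
            - --retention.resolution-1h=1y
          volumeMounts:
            - name: objstore
              mountPath: /etc/thanos
              readOnly: true
            - name: data
              mountPath: /var/thanos/compact
      volumes:
        - name: objstore
          secret:
            secretName: thanos-objstore
        - name: data
          emptyDir: {}       # a PersistentVolumeClaim is more typical for large buckets
```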