TL;DR

No metrics? Check scrape targets first. No data in Grafana? Check datasource and time range. Alert firing? Validate the PromQL in Prometheus before muting. Work top-down: stack health → scrape → query → alert routing.

First Steps

# first-steps.sh
kubectl get pods -n monitoring
kubectl get prometheus,alertmanager -n monitoring
kubectl get servicemonitor,podmonitor -A

# Port-forward Prometheus, then open http://localhost:9090 → Status → Targets and check that each target is UP.
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
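
With the port-forward above running, the same Targets view can be read from the Prometheus HTTP API. A minimal sketch, assuming curl and jq are installed locally; the function name down_targets and the PROM_URL variable are illustrative and not part of any tool:

```shell
# Hypothetical helper: list scrape targets that are not healthy.
# Assumes the port-forward to localhost:9090 is active and jq is installed.
PROM_URL="${PROM_URL:-http://localhost:9090}"

down_targets() {
  # /api/v1/targets returns each active target with its health and last scrape error.
  curl -s "${PROM_URL}/api/v1/targets?state=active" |
    jq -r '.data.activeTargets[]
           | select(.health != "up")
           | [.labels.job, .scrapeUrl, .lastError] | @tsv'
}
```

Calling down_targets prints one tab-separated line per unhealthy target (job, scrape URL, last error). No output means every active target reports UP.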

Symptom → Fix

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Target DOWN in Prometheus | Wrong port/path, pod not ready, NetworkPolicy | Check ServiceMonitor selector and pod labels; fetch /metrics from the Prometheus pod |
| ServiceMonitor exists, no scrape | Label mismatch with Prometheus selector | Add release: prometheus or whichever label the Prometheus CR selects on |
| Metric missing in Grafana | Not scraped, wrong query, time range | Query the raw metric in Prometheus (or Grafana Explore) first |
| kubectl top fails | metrics-server not installed or broken | kubectl get deployment metrics-server -n kube-system |
| Alert firing incorrectly | Bad PromQL, for: duration too short, label mismatch | Test the expr in Prometheus; check the for: duration |
| Alert not reaching PagerDuty/Slack | Alertmanager routing, silences, receiver config | Check Alertmanager UI → Status → Config |
| Thanos Query empty for old data | Sidecar upload failed, wrong bucket creds | Check sidecar logs; verify objstore secret |
| High cardinality / OOM Prometheus | Unbounded label values in custom metrics | Find the metric via TSDB stats; fix instrumentation |
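
For the "ServiceMonitor exists, no scrape" row, it helps to print the selector on the Prometheus CR next to the labels the ServiceMonitor carries. A rough sketch, assuming the Prometheus CR lives in the monitoring namespace; sm_selector_check is a hypothetical helper name:

```shell
# Hypothetical helper: show what the Prometheus CR selects on and what labels
# the ServiceMonitor actually has, so a mismatch is visible at a glance.
# Usage: sm_selector_check <servicemonitor> <namespace> [prometheus-namespace]
sm_selector_check() {
  local sm="$1" ns="$2" prom_ns="${3:-monitoring}"

  echo "Prometheus serviceMonitorSelector:"
  kubectl get prometheus -n "$prom_ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.serviceMonitorSelector}{"\n"}{end}'

  echo "ServiceMonitor labels:"
  kubectl get servicemonitor "$sm" -n "$ns" -o jsonpath='{.metadata.labels}{"\n"}'
}
```

For example, sm_selector_check web-api app. If the selector's matchLabels are not a subset of the ServiceMonitor labels, the ServiceMonitor is ignored. The CR's serviceMonitorNamespaceSelector is worth checking the same way when the ServiceMonitor lives outside the Prometheus namespace.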

Scrape Debugging

# scrape-debug.sh
# From Prometheus pod — can it reach the target?
kubectl exec -n monitoring prometheus-prometheus-0 -c prometheus -- \
  wget -qO- http://web-api.app.svc:8080/metrics | head

kubectl describe servicemonitor web-api -n app
kubectl get svc web-api -n app --show-labels
kubectl get endpoints web-api -n app

# TLS scrape issues — check ServiceMonitor tlsConfig and cert paths.
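
A frequent reason the endpoints look fine but nothing is scraped is that the ServiceMonitor names a port the Service does not expose, since ServiceMonitor endpoints reference Service ports by name. A minimal sketch of that check, assuming the ServiceMonitor and Service share a name as web-api does above; sm_port_check is a hypothetical helper:

```shell
# Hypothetical helper: verify every port a ServiceMonitor scrapes is a named
# port on the Service. Returns non-zero if any port name is missing.
# Usage: sm_port_check <name> <namespace>
sm_port_check() {
  local name="$1" ns="$2" sm_ports svc_ports p rc=0
  sm_ports="$(kubectl get servicemonitor "$name" -n "$ns" -o jsonpath='{.spec.endpoints[*].port}')"
  svc_ports="$(kubectl get svc "$name" -n "$ns" -o jsonpath='{.spec.ports[*].name}')"
  for p in $sm_ports; do
    case " $svc_ports " in
      *" $p "*) echo "ok: port '$p' exists on Service $name" ;;
      *)        echo "MISSING: port '$p' not found on Service $name (has: $svc_ports)"; rc=1 ;;
    esac
  done
  return "$rc"
}
```

Run it as sm_port_check web-api app. A MISSING line usually means the Service port is unnamed or named differently from the endpoint's port field.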

Resource Triage

| Signal | Check |
| --- | --- |
| High CPU | kubectl top, CPU throttling metric, HPA status, recent deploy |
| High memory / OOMKilled | Limits vs working set, kube_pod_container_status_last_terminated_reason |
| Restarts | kubectl describe pod, liveness probe failures, kubectl logs --previous |
| Disk pressure | Node conditions, PVC usage, emptyDir size |
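
Two of the checks above can be run as instant queries against the Prometheus HTTP API. This is a sketch under assumptions: Prometheus is port-forwarded to localhost:9090, kube-state-metrics and cAdvisor metrics are being scraped, and the helper names (promql, oom_killed, cpu_throttling) are made up for illustration:

```shell
# Hypothetical helpers: run instant PromQL queries via the Prometheus HTTP API.
# Assumes a port-forward to localhost:9090; override with PROM_URL.
PROM_URL="${PROM_URL:-http://localhost:9090}"

promql() {
  # --data-urlencode handles the braces, quotes and spaces in the expression.
  curl -s -G "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Containers whose last termination was an OOM kill (kube-state-metrics).
oom_killed() {
  promql 'kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1'
}

# Fraction of CFS periods in which each container was throttled over 5m (cAdvisor).
cpu_throttling() {
  promql 'sum by (namespace, pod, container) (rate(container_cpu_cfs_throttled_periods_total[5m]))
          / sum by (namespace, pod, container) (rate(container_cpu_cfs_periods_total[5m]))'
}
```

Both functions print the raw JSON response; pipe it through jq if it is installed. A throttling ratio that stays well above zero suggests the CPU limit is too low for the workload.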

Alertmanager Checks

# alertmanager.sh
kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093
# UI: /#/alerts (firing), /#/silences (active silences), /#/status

kubectl get secret alertmanager-prometheus-kube-prometheus-alertmanager \
  -n monitoring -o yaml   # Routing config is base64-encoded under .data; decode it if alerts are not delivered.
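
The Secret's data is base64-encoded, so the routing tree is not readable straight from -o yaml. A minimal sketch for decoding it, assuming the config sits under the alertmanager.yaml key (the key the Prometheus Operator expects); am_config is a hypothetical helper and the default secret name is the one used above:

```shell
# Hypothetical helper: print the decoded Alertmanager config from its Secret.
# Usage: am_config [secret-name] [namespace]
am_config() {
  local secret="${1:-alertmanager-prometheus-kube-prometheus-alertmanager}"
  local ns="${2:-monitoring}"
  kubectl get secret "$secret" -n "$ns" \
    -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d
}

# Example (labels are placeholders): save the config, then ask amtool which
# receiver an alert with these labels would be routed to.
#   am_config > /tmp/alertmanager.yaml
#   amtool config routes test --config.file=/tmp/alertmanager.yaml severity=critical
```

If the output is empty, the key name or secret name probably differs in your release; kubectl describe secret lists the keys that are present.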

Gotchas

  • Muting alerts: use Alertmanager silences with an expiry, and document the reason in the incident channel.
  • Recording rules: if a dashboard depends on a recording rule, check that its PrometheusRule CR is loaded.
  • Clock skew: scrape timestamps are off when node time is wrong; check NTP on the nodes.
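
A time-boxed silence, as the first gotcha recommends, can be created from the command line with amtool. A sketch under assumptions: amtool is installed locally, Alertmanager is port-forwarded to localhost:9093, and silence_alert is a hypothetical wrapper name:

```shell
# Hypothetical helper: create a silence that expires on its own.
# Usage: silence_alert <alertname> <duration> <reason>
AM_URL="${AM_URL:-http://localhost:9093}"

silence_alert() {
  local alertname="$1" duration="$2" reason="$3"
  amtool --alertmanager.url="$AM_URL" silence add \
    --duration="$duration" \
    --author="${USER:-oncall}" \
    --comment="$reason" \
    "alertname=${alertname}"
}
```

For example, silence_alert HighLatency 2h "Investigating, see incident channel" (the alert name is a placeholder). The comment is stored with the silence, so the reason stays visible under /#/silences until it expires.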