Monitoring Troubleshooting — K8s SRE Reference

TL;DR

No metrics? Check scrape targets first. No data in Grafana? Check datasource and time range. Alert firing? Validate the PromQL in Prometheus before muting. Work top-down: stack health → scrape → query → alert routing.

First Steps

bash first-steps.sh

kubectl get pods -n monitoring
kubectl get prometheus,alertmanager -n monitoring
kubectl get servicemonitor,podmonitor -A

# Port-forward Prometheus → Status → Targets (check UP/DOWN).
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090

Symptom → Fix

Symptom	Likely cause	Fix
Target DOWN in Prometheus	Wrong port/path, pod not ready, NetworkPolicy	Check ServiceMonitor selector, pod labels, curl from prometheus pod
ServiceMonitor exists, no scrape	Label mismatch with Prometheus selector	Add `release: prometheus` or matching label
Metric missing in Grafana	Not scraped, wrong query, time range	Query raw metric in Prometheus Explore first
`kubectl top` fails	metrics-server not installed or broken	`kubectl get deployment metrics-server -n kube-system`
Alert firing incorrectly	Bad PromQL, `for` too short, label mismatch	Test expr in Prometheus; check `for` duration
Alert not reaching PagerDuty/Slack	Alertmanager routing, silences, receiver config	Check Alertmanager UI → Status → Config
Thanos Query empty for old data	Sidecar upload failed, wrong bucket creds	Check sidecar logs; verify objstore secret
High cardinality / OOM Prometheus	Unbounded label values in custom metrics	Find metric via TSDB stats; fix instrumentation

Scrape Debugging

bash scrape-debug.sh

# From Prometheus pod — can it reach the target?
kubectl exec -n monitoring prometheus-prometheus-0 -c prometheus -- \
  wget -qO- http://web-api.app.svc:8080/metrics | head

kubectl describe servicemonitor web-api -n app
kubectl get svc web-api -n app --show-labels
kubectl get endpoints web-api -n app

# TLS scrape issues — check ServiceMonitor tlsConfig and cert paths.

Resource Triage

Signal	Check
High CPU	`kubectl top`, CPU throttling metric, HPA status, recent deploy
High memory / OOMKilled	Limits vs working set, `kube_pod_container_status_last_terminated_reason`
Restarts	`kubectl describe pod`, liveness probe failures, `--previous` logs
Disk pressure	Node conditions, PVC usage, emptyDir size

Alertmanager Checks

bash alertmanager.sh

kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093
# UI: /#/alerts (firing), /#/silences (active silences), /#/status

kubectl get secret alertmanager-prometheus-kube-prometheus-alertmanager \
  -n monitoring -o yaml   # Check routing config if alerts not delivered.

Gotchas

!Muting alerts — use Alertmanager silences with expiry; document why in the incident channel.
!Recording rules — if a dashboard depends on a recording rule, check the PrometheusRule CR is loaded.
!Clock skew — scrape timestamps off if node time is wrong; check NTP on nodes.