Monitoring Troubleshooting
TL;DR
No metrics? Check scrape targets first. No data in Grafana? Check datasource and time range. Alert firing? Validate the PromQL in Prometheus before muting. Work top-down: stack health → scrape → query → alert routing.
First Steps
bash
first-steps.sh
kubectl get pods -n monitoring
kubectl get prometheus,alertmanager -n monitoring
kubectl get servicemonitor,podmonitor -A
# Port-forward Prometheus, then open http://localhost:9090 and go to Status → Targets (check UP/DOWN).
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
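The Targets page can also be checked from the shell while the port-forward is running. This is a minimal sketch, assuming Prometheus answers on localhost:9090 and that the API returns compact JSON (no spaces around colons). `count_down_targets` is a hypothetical helper name, and a real JSON parser such as `jq` would be more robust if it is installed.

```bash
# Count scrape targets reported as "down" in the JSON read from stdin.
# Assumption: compact JSON, as returned by the Prometheus HTTP API.
count_down_targets() {
  # "|| true" keeps the pipeline healthy when grep finds no match.
  { grep -o '"health":"down"' || true; } | wc -l | tr -d ' '
}

# Usage (needs the port-forward above):
#   curl -s http://localhost:9090/api/v1/targets | count_down_targets
```

A result of 0 means no target is reported down; it does not prove that the target you expect is being scraped at all, so still confirm it is listed.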
Symptom → Fix
| Symptom | Likely cause | Fix |
|---|---|---|
| Target DOWN in Prometheus | Wrong port/path, pod not ready, NetworkPolicy | Check ServiceMonitor selector and pod labels; fetch /metrics from the Prometheus pod |
| ServiceMonitor exists, no scrape | Label mismatch with Prometheus selector | Add the label the Prometheus serviceMonitorSelector expects (e.g. release: prometheus) |
| Metric missing in Grafana | Not scraped, wrong query, time range | Query the raw metric in the Prometheus expression browser first |
| kubectl top fails | metrics-server not installed or broken | kubectl get deployment metrics-server -n kube-system |
| Alert firing incorrectly | Bad PromQL, for: duration too short, label mismatch | Test the expr in Prometheus; check the rule's for: duration |
| Alert not reaching PagerDuty/Slack | Alertmanager routing, silences, receiver config | Check Alertmanager UI → Status → Config |
| Thanos Query empty for old data | Sidecar upload failed, wrong bucket creds | Check sidecar logs; verify objstore secret |
| High cardinality / OOM Prometheus | Unbounded label values in custom metrics | Find metric via TSDB stats; fix instrumentation |
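The label-mismatch row is the easiest one to check mechanically. The sketch below is one possible way to do it: `has_label` is a hypothetical helper that tests whether a comma-separated label list (the format printed by `--show-labels`) contains an exact `key=value` pair. The `web-api` and `app` names and the `release=prometheus` label are the examples used on this page, not values read from your cluster.

```bash
# Return 0 if the comma-separated label list in $1 contains the exact pair in $2.
has_label() {
  case ",$1," in
    *",$2,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   # Labels the Prometheus CR selects ServiceMonitors by:
#   kubectl get prometheus -n monitoring \
#     -o jsonpath='{.items[*].spec.serviceMonitorSelector.matchLabels}'
#   # Labels actually present on the ServiceMonitor (last column):
#   labels=$(kubectl get servicemonitor web-api -n app --no-headers --show-labels | awk '{print $NF}')
#   has_label "$labels" "release=prometheus" || echo "ServiceMonitor is missing the selector label"
```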
Scrape Debugging
bash
scrape-debug.sh
# From the Prometheus pod: can it reach the target? (Pod name varies by install; list with kubectl get pods -n monitoring.)
kubectl exec -n monitoring prometheus-prometheus-0 -c prometheus -- \
  wget -qO- http://web-api.app.svc:8080/metrics | head
kubectl describe servicemonitor web-api -n app
kubectl get svc web-api -n app --show-labels
kubectl get endpoints web-api -n app
# TLS scrape issues — check ServiceMonitor tlsConfig and cert paths.
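Once the endpoint answers, the next question is whether it exposes the metric you are missing. A small sketch, assuming the target serves the Prometheus text exposition format: `metric_present` is a hypothetical helper, and `http_requests_total` is only an example metric name.

```bash
# Succeed if the exposition text on stdin contains a sample for metric $1.
# Matches "name{labels} value" or "name value"; "# HELP" and "# TYPE" lines do not match.
metric_present() {
  grep -Eq "^$1(\{| )"
}

# Usage:
#   kubectl exec -n monitoring prometheus-prometheus-0 -c prometheus -- \
#     wget -qO- http://web-api.app.svc:8080/metrics \
#     | metric_present http_requests_total && echo "metric is exposed"
```

If the metric is exposed but still missing in Prometheus, the problem is on the scrape side (selector, port name, relabeling) rather than in the application.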
Resource Triage
| Signal | Check |
|---|---|
| High CPU | kubectl top, CPU throttling metric, HPA status, recent deploy |
| High memory / OOMKilled | Limits vs working set, kube_pod_container_status_last_terminated_reason |
| Restarts | kubectl describe pod, liveness probe failures, kubectl logs --previous |
| Disk pressure | Node conditions, PVC usage, emptyDir size |
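The CPU and memory rows can be turned into instant queries against Prometheus. The sketch below assumes kube-state-metrics and cAdvisor metrics are being scraped and that Prometheus is port-forwarded to localhost:9090. The helper names, the `PROM_URL` variable, and the 5m rate window are illustrative choices.

```bash
PROM_URL="${PROM_URL:-http://localhost:9090}"

# PromQL: containers in namespace $1 whose last termination reason was OOMKilled.
oom_query() {
  printf 'kube_pod_container_status_last_terminated_reason{namespace="%s",reason="OOMKilled"} == 1' "$1"
}

# PromQL: fraction of CFS periods in which containers in namespace $1 were throttled.
throttle_query() {
  printf 'rate(container_cpu_cfs_throttled_periods_total{namespace="%s"}[5m]) / rate(container_cpu_cfs_periods_total{namespace="%s"}[5m])' "$1" "$1"
}

# Run an instant query; curl URL-encodes the expression.
prom_query() {
  curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Usage:
#   prom_query "$(oom_query app)"
#   prom_query "$(throttle_query app)"
```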
Alertmanager Checks
bash
alertmanager.sh
kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093
# UI: /#/alerts (firing), /#/silences (active silences), /#/status
# Check the routing config if alerts are not delivered (values under .data are base64-encoded).
kubectl get secret alertmanager-prometheus-kube-prometheus-alertmanager \
  -n monitoring -o yaml
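Secret values come back base64-encoded in `-o yaml` output, so the routing tree is not readable until it is decoded. A hedged sketch: the `alertmanager.yaml` key name is an assumption about how this secret is laid out (inspect the keys under `.data` first), and `decode_b64` is a hypothetical wrapper.

```bash
# Decode base64 text from stdin (Secret values under .data are base64-encoded).
decode_b64() {
  base64 -d
}

# Usage (key name alertmanager.yaml is an assumption; check the keys under .data first):
#   kubectl get secret alertmanager-prometheus-kube-prometheus-alertmanager -n monitoring \
#     -o jsonpath='{.data.alertmanager\.yaml}' | decode_b64
```

Compare the decoded `route` and `receivers` sections with what the Alertmanager UI shows under Status; if they differ, the running instance has not loaded the config you are looking at.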
Gotchas
- Muting alerts — use Alertmanager silences with expiry; document why in the incident channel.
- Recording rules — if a dashboard depends on a recording rule, check the PrometheusRule CR is loaded.
- Clock skew — scrape timestamps off if node time is wrong; check NTP on nodes.