TL;DR

Prometheus scrapes metrics on an interval and stores time series. On K8s, kube-prometheus-stack uses ServiceMonitor/PodMonitor CRs for discovery. Use kubectl top for quick checks, PromQL for trends, and PrometheusRule + Alertmanager for paging.

Metrics vs Logs vs Events vs Traces

Signal    Best for                                     Tool
Metrics   Trends, saturation, SLOs, alerting           Prometheus, Grafana
Logs      App errors, stack traces, request details    kubectl logs, Loki, CloudWatch
Events    K8s decisions: scheduling, probes, OOM       kubectl get events
Traces    Latency across services                      Jaeger, Tempo, Zipkin

Quick Checks

bash resource-usage.sh
# Requires metrics-server.
kubectl top nodes
kubectl top pods -A --sort-by=cpu
kubectl top pods -A --sort-by=memory

kubectl get events -A --sort-by=.lastTimestamp | tail -n 50
# Pods not in the Running phase (this also lists Completed pods, whose phase is Succeeded).
kubectl get pods -A --field-selector=status.phase!=Running

kube-prometheus-stack

Most client clusters use the prometheus-community Helm chart (kube-prometheus-stack). It installs Prometheus Operator, Alertmanager, Grafana, node-exporter, and kube-state-metrics.

Pod (:8080/metrics) -> Service (port: metrics) -> ServiceMonitor (selector + endpoint) -> Prometheus Operator -> Prometheus TSDB (scrape)

ServiceMonitor labels must match the Prometheus serviceMonitorSelector, or the scrape never starts.

Prometheus Operator discovers targets via ServiceMonitor CRs — not static scrape configs.

bash stack-checks.sh
kubectl get pods,svc -n monitoring
kubectl get prometheus,alertmanager -n monitoring
kubectl get servicemonitor,podmonitor,prometheusrule -A

kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093
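
With the port-forward to 9090 running, one way to confirm that targets are actually being scraped is to ask the Prometheus HTTP API for its active targets. This is a minimal sketch, assuming Prometheus is reachable at http://localhost:9090. The function name target_health is made up for illustration, and the grep-based parsing is a shortcut that avoids depending on jq.

```shell
# Count active scrape targets by health ("up", "down", "unknown").
# Assumes Prometheus is reachable at the base URL (default: the local port-forward).
target_health() {
  curl -s "${1:-http://localhost:9090}/api/v1/targets?state=active" \
    | grep -o '"health":"[a-z]*"' \
    | sort | uniq -c
}

# Usage:
#   target_health
#   target_health http://localhost:9090
```

A target that never appears in this list usually points to a selector or label mismatch rather than a failing scrape.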

ServiceMonitor

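Tells Prometheus Operator which Services to scrape. The ServiceMonitor's own metadata.labels must match the Prometheus CR's serviceMonitorSelector; spec.selector matches labels on the Service, and endpoints.port refers to the Service port by name.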

yaml servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: web-api
  namespace: app
  labels:
    release: prometheus   # Must match Prometheus serviceMonitorSelector.
spec:
  selector:
    matchLabels:
      app: web-api
  namespaceSelector:
    matchNames:
      - app
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
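
To check the label match in practice, it can help to print the selector on the Prometheus CR next to the labels on the ServiceMonitor. This is a rough sketch: the helper names prom_selector and sm_labels are hypothetical, and the monitoring namespace is assumed from the commands earlier in this sheet.

```shell
# Show each Prometheus CR's serviceMonitorSelector in a namespace (default: monitoring).
prom_selector() {
  kubectl get prometheus -n "${1:-monitoring}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.serviceMonitorSelector}{"\n"}{end}'
}

# Show the labels on one ServiceMonitor: sm_labels <name> <namespace>
sm_labels() {
  kubectl get servicemonitor "$1" -n "$2" --show-labels
}

# Usage (names taken from the manifest above):
#   prom_selector monitoring
#   sm_labels web-api app
```

The ServiceMonitor is only picked up if its labels satisfy the selector printed by the first command.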

PodMonitor

Scrape Pods directly when there's no Service — common for DaemonSets or hostNetwork pods.

yaml podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: node-exporter-custom
  namespace: monitoring
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: node-exporter
  podMetricsEndpoints:
    - port: metrics
      interval: 30s

PromQL Starters

promql queries.promql
# CPU usage by pod (cores).
sum(rate(container_cpu_usage_seconds_total{container!="",pod!=""}[5m])) by (namespace,pod)

# Memory working set by pod.
sum(container_memory_working_set_bytes{container!="",pod!=""}) by (namespace,pod)

# Container restarts over the last 15m (a count, not a per-second rate).
sum(increase(kube_pod_container_status_restarts_total[15m])) by (namespace,pod,container)

# Pending pods.
sum(kube_pod_status_phase{phase="Pending"}) by (namespace)

# HTTP 5xx rate (app must expose http_requests_total).
sum(rate(http_requests_total{status=~"5.."}[5m])) by (namespace,service)

# Node disk pressure.
kube_node_status_condition{condition="DiskPressure",status="true"}
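
These queries can also be run from a terminal through the Prometheus HTTP API instead of the UI. The helper below is a sketch under the assumption that Prometheus is port-forwarded to http://localhost:9090; the name promq is invented for illustration.

```shell
# Run an instant PromQL query: promq '<expression>' [base-url]
# curl -G sends the url-encoded data as a query string on a GET request.
promq() {
  curl -sG "${2:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Usage:
#   promq 'sum(kube_pod_status_phase{phase="Pending"}) by (namespace)'
```

Using --data-urlencode avoids hand-escaping the braces, quotes, and brackets that PromQL expressions contain.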

PrometheusRule & Alertmanager

yaml prometheusrule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: web-api-alerts
  namespace: app
  labels:
    release: prometheus
spec:
  groups:
    - name: web-api
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{job="web-api",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="web-api"}[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High 5xx rate on web-api"
            description: "Error rate above 5% for 5 minutes."

bash alertmanager.sh
# Check firing alerts in Prometheus UI: /alerts
# Or via Alertmanager UI: /#/alerts (port-forward 9093).

kubectl get prometheusrule -A
kubectl describe prometheusrule web-api-alerts -n app
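
Once a rule is loaded, its alerts can be listed from the command line as well as from the UI. This is a minimal sketch, assuming Prometheus is port-forwarded to http://localhost:9090; active_alerts is a made-up helper name, and the grep parsing counts pending and firing alerts together.

```shell
# Count active alerts (pending and firing) by alertname.
active_alerts() {
  curl -s "${1:-http://localhost:9090}/api/v1/alerts" \
    | grep -o '"alertname":"[^"]*"' \
    | sort | uniq -c
}

# Usage:
#   active_alerts
```

If HighErrorRate never shows up here while the expression returns data, the PrometheusRule labels are worth checking first.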

Container Logs

bash logs.sh
kubectl logs <pod> -n <ns> -c <container> --tail=200
kubectl logs <pod> -n <ns> -c <container> --previous   # Crashed container.
kubectl logs -n <ns> -l app=<label> --all-containers --tail=100

Gotchas

  • ServiceMonitor labels must match the Prometheus serviceMonitorSelector: a missing label means no scrape.
  • High cardinality: avoid unbounded label values (user IDs, URLs) in custom metrics.
  • rate() needs a range vector: always give rate() and increase() a range such as [5m].
  • Retention: Prometheus keeps 15 days of data by default (the Helm chart may set a different value); use Thanos for long-term storage.
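
For the retention gotcha, the configured value can be read from the Prometheus CR rather than guessed. This is a small sketch, assuming the stack runs in the monitoring namespace; prom_retention is a hypothetical helper name.

```shell
# Print the retention configured on each Prometheus CR in a namespace.
# An empty second column means spec.retention is not set on that CR.
prom_retention() {
  kubectl get prometheus -n "${1:-monitoring}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.retention}{"\n"}{end}'
}

# Usage:
#   prom_retention monitoring
```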