Alerting Design — K8s SRE Reference

TL;DR

Alert on user-visible symptoms first (for example, SLO burn rate), and add cause-based alerts only where the earlier warning is justified. Every page alert needs a runbook. Alerts that wake people up must be actionable, or they will be silenced and ignored.

Alert Design Principles

✓ Alert on symptoms, not causes. "5xx error rate > 1%" is a symptom; "pod restarted" is a cause. Symptoms page; causes create tickets.
✓ Every page alert must have a runbook. Link it in annotations.runbook_url. If you can't write a runbook for it, don't page on it.
✓ Alerts must be actionable. If there's nothing to do except wait, don't page; use a ticket alert instead.
✓ Tune thresholds with data. Start with loose thresholds and a long "for" duration so that only sustained, unambiguous problems fire; tighten once you understand the signal.
✓ Avoid alert storms. Use inhibit_rules in Alertmanager to suppress cause alerts when a symptom alert is already firing.
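
The TL;DR names SLO burn rate as the preferred symptom signal. A minimal sketch of a fast-burn page rule in plain Prometheus rule-group form, assuming a 99.9% availability SLO over 30 days (error budget 0.001) and the http_requests_total metric used elsewhere on this page. The alert name, the 14.4x factor (about 2% of the 30-day budget consumed in one hour), and the 1h/5m window pair are illustrative choices, not values taken from this reference:

```yaml
groups:
- name: my-service.slo
  rules:
  - alert: ErrorBudgetFastBurn   # hypothetical name
    # Fire only when both the long (1h) and the short (5m) window exceed
    # 14.4x the budgeted error ratio; the short window lets the alert
    # resolve quickly once the problem stops.
    expr: |
      (
        sum(rate(http_requests_total{job="my-service", code=~"5.."}[1h]))
          / sum(rate(http_requests_total{job="my-service"}[1h])) > (14.4 * 0.001)
      )
      and
      (
        sum(rate(http_requests_total{job="my-service", code=~"5.."}[5m]))
          / sum(rate(http_requests_total{job="my-service"}[5m])) > (14.4 * 0.001)
      )
    labels:
      severity: page
    annotations:
      summary: "my-service is burning its error budget at more than 14.4x"
      runbook_url: "https://wiki.example.com/runbooks/my-service-errors"
```

Compared with a fixed "error rate > 1%" threshold, a burn-rate rule ties the paging decision to the SLO itself, so the threshold changes only when the objective does.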

PrometheusRule Templates

These Kubernetes-native alert rules cover common SRE scenarios. Apply them with kubectl apply; the Prometheus Operator loads them automatically, provided the resource's labels match the rule selector of your Prometheus instance.

prometheus-rules.yaml (yaml)

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-service-alerts
  namespace: monitoring
  labels:
    prometheus: kube-prometheus   # must match your Prometheus selector
spec:
  groups:
  - name: my-service.availability
    rules:
    # Symptom: high error rate
    - alert: HighErrorRate
      expr: |
        sum(rate(http_requests_total{job="my-service", code=~"5.."}[5m]))
          / sum(rate(http_requests_total{job="my-service"}[5m])) > 0.01
      for: 5m
      labels:
        severity: page
        team: platform
      annotations:
        summary: "High 5xx error rate on my-service"
        description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes."
        runbook_url: "https://wiki.example.com/runbooks/my-service-errors"

    # Symptom: high latency
    - alert: HighP99Latency
      expr: |
        histogram_quantile(0.99, sum by(le)(
          rate(http_request_duration_seconds_bucket{job="my-service"}[5m])
        )) > 1.0
      for: 5m
      labels:
        severity: page
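      annotations:
        summary: "p99 latency > 1s on my-service"
        description: "p99 is {{ $value | humanizeDuration }}"
        # Hypothetical URL: page alerts need a runbook link
        runbook_url: "https://wiki.example.com/runbooks/my-service-latency"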

  - name: kubernetes.workloads
    rules:
    # Cause: pod CrashLoopBackOff — ticket severity
    - alert: PodCrashLooping
      expr: |
        rate(kube_pod_container_status_restarts_total[15m]) * 60 > 0
      for: 15m
      labels:
        severity: ticket
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash-looping"
        description: "Container {{ $labels.container }} restarting {{ $value | printf \"%.2f\" }} times/min"

    # Cause: fewer than half of the desired replicas are available
    - alert: DeploymentNotAvailable
      expr: |
        kube_deployment_status_replicas_available
          / kube_deployment_spec_replicas < 0.5
      for: 10m
      labels:
        severity: page
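      annotations:
        summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has < 50% pods available"
        # Hypothetical URL: page alerts need a runbook link
        runbook_url: "https://wiki.example.com/runbooks/deployment-not-available"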
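
The comment on the prometheus: kube-prometheus label refers to the rule selector on the Prometheus custom resource. A sketch of the matching side, assuming a Prometheus Operator installation; the resource name k8s is hypothetical, and your installation may select on a different label:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s               # hypothetical name
  namespace: monitoring
spec:
  # Only PrometheusRule objects carrying this label are loaded
  ruleSelector:
    matchLabels:
      prometheus: kube-prometheus
  # An empty selector matches every namespace; leave the field unset to
  # restrict discovery to the Prometheus object's own namespace
  ruleNamespaceSelector: {}
```

If a rule never shows up in Prometheus, comparing the PrometheusRule labels against this selector is a reasonable first check.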

Alertmanager Configuration

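Alertmanager routes alerts to the right receiver (PagerDuty, Slack, email) based on labels. Child routes are evaluated in order, and the first match wins unless continue: true is set. Use inhibit rules to suppress lower-level (cause) alerts while a related higher-level alert is firing, which reduces noise.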

alertmanager.yaml (yaml)

global:
  resolve_timeout: 5m

route:
  group_by: [alertname, namespace, team]
  group_wait:      30s   # wait before sending first notification (allows grouping)
  group_interval:  5m    # how often to send new alerts in an existing group
  repeat_interval: 4h    # how often to re-notify for a still-firing alert
  receiver: slack-general

  routes:
  # Page alerts go to PagerDuty immediately
  - match:
      severity: page
    receiver: pagerduty
    continue: false

  # Team-specific routing
  - match:
      team: platform
    receiver: slack-platform

receivers:
- name: slack-general
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/...'
    channel: '#alerts'
    title: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
    text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"   # double quotes so \n is a real newline

- name: pagerduty
  pagerduty_configs:
  - routing_key: <your-pagerduty-key>
    description: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

- name: slack-platform
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/...'
    channel: '#platform-alerts'

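inhibit_rules:
# Suppress PodCrashLooping tickets while DeploymentNotAvailable is paging in the same namespace
- source_match:
    severity: page
    alertname: DeploymentNotAvailable
  target_match:
    severity: ticket
    alertname: PodCrashLooping
  # List only labels that both alerts carry. PodCrashLooping has
  # namespace/pod/container but no deployment label, so requiring an equal
  # deployment value would keep this rule from ever applying.
  equal: [namespace]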
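
Alertmanager 0.22 and later deprecate match, source_match, and target_match in favor of matchers lists. A sketch of the page route and the inhibit rule in that syntax, assuming the receivers defined above. The equal list is limited to namespace here because the pod-level alert does not carry a deployment label:

```yaml
route:
  receiver: slack-general
  routes:
  - matchers:
    - severity="page"
    receiver: pagerduty

inhibit_rules:
- source_matchers:
  - severity="page"
  - alertname="DeploymentNotAvailable"
  target_matchers:
  - severity="ticket"
  - alertname="PodCrashLooping"
  equal: [namespace]
```

The matcher form also supports !=, =~, and !~, which the older map syntax could only express through separate match_re blocks.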

Testing Alerts

Use promtool to unit-test alert rules before deploying them. This catches silent failures where the query syntax is valid but the alert never fires as expected.

alert-testing.sh (bash)

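# Validate rule file syntax. promtool expects a plain rule file with a
# top-level "groups:" key, so extract .spec from the PrometheusRule manifest
# first (assumes mikefarah yq v4 is installed).
yq '.spec' prometheus-rules.yaml > my-rules.yaml
promtool check rules my-rules.yaml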

# Run unit tests
promtool test rules alert-tests.yaml

# Check Alertmanager config
amtool check-config alertmanager.yaml

# See active alerts (amtool needs --alertmanager.url=<url> or a config file that sets it)
amtool alert query                           # all active alerts
amtool alert query severity=page             # page alerts only

# Silence an alert during planned work
amtool silence add alertname=PodCrashLooping namespace=staging --duration=1h --comment="Scheduled maintenance"

# Test a PromQL expression before writing a rule
curl 'http://prometheus.monitoring.svc:9090/api/v1/query' \
  --data-urlencode 'query=rate(http_requests_total[5m])' | jq .

# Check Prometheus rule evaluation status
kubectl port-forward svc/prometheus-operated 9090 -n monitoring
# Open http://localhost:9090/rules in browser
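
The alert-tests.yaml file passed to promtool test rules is not shown on this page. A minimal sketch of what it might contain for the HighErrorRate rule, assuming my-rules.yaml holds the rule groups from this page in plain Prometheus format. The synthetic series values are illustrative, and the expected annotations must match the rendered templates exactly, so adjust them if your rule text differs:

```yaml
# alert-tests.yaml (illustrative)
rule_files:
  - my-rules.yaml

evaluation_interval: 1m

tests:
  # Case 1: a steady 10% error rate should page
  - interval: 1m
    input_series:
      # 6 errors/min and 54 successes/min
      - series: 'http_requests_total{job="my-service", code="500"}'
        values: '0+6x20'
      - series: 'http_requests_total{job="my-service", code="200"}'
        values: '0+54x20'
    alert_rule_test:
      # By 15m the condition has held for longer than the 5m "for" window
      - eval_time: 15m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: page
              team: platform
            exp_annotations:
              summary: "High 5xx error rate on my-service"
              description: "Error rate is 10% over the last 5 minutes."
              runbook_url: "https://wiki.example.com/runbooks/my-service-errors"

  # Case 2: no errors, so the alert must stay silent
  - interval: 1m
    input_series:
      - series: 'http_requests_total{job="my-service", code="500"}'
        values: '0+0x20'
      - series: 'http_requests_total{job="my-service", code="200"}'
        values: '0+60x20'
    alert_rule_test:
      - eval_time: 15m
        alertname: HighErrorRate
        exp_alerts: []
```

Pairing a firing case with a silent case guards against both failure modes: an alert that never fires and one that fires on healthy traffic.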
