Alerting Design
Alert on user-visible symptoms (such as SLO burn rate) first, and on causes only where earlier warning is justified. Every page alert needs a runbook. Alerts that wake people up must be actionable, or they will be silenced and ignored.
Alert Design Principles
- Alert on symptoms, not causes. "5xx error rate > 1%" is a symptom; "pod restarted" is a cause. Symptoms page; causes create tickets.
- Every page alert must have a runbook. Link it in annotations.runbook_url. If you can't write a runbook for it, don't page on it.
- Alerts must be actionable. If there's nothing to do except wait, don't page; use a ticket alert instead.
- Tune thresholds with data. Start with conservative thresholds and a long for: duration; tighten once you understand the signal.
- Avoid alert storms. Use inhibit_rules in Alertmanager to suppress cause alerts when a symptom alert is already firing.
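The SLO burn-rate idea mentioned in the introduction can be sketched as a rule group like the one below. This is a minimal sketch, not a definitive implementation: it assumes a 99.9% availability SLO over a 30-day window (error budget 0.001) and reuses the http_requests_total metric and my-service job from the templates in this document. The group name, alert name, and runbook URL are hypothetical. A burn rate of 14.4 corresponds to spending about 2% of a 30-day budget in one hour; the short 5m window is there so the alert stops firing soon after the errors stop.

```yaml
# Sketch of a fast-burn alert; add the group under spec.groups of a PrometheusRule.
# Assumptions: 99.9% SLO (budget = 0.001), 30-day window, hypothetical names and URL.
- name: my-service.slo
  rules:
    - alert: ErrorBudgetFastBurn
      expr: |
        (
          sum(rate(http_requests_total{job="my-service", code=~"5.."}[1h]))
          / sum(rate(http_requests_total{job="my-service"}[1h])) > (14.4 * 0.001)
        )
        and
        (
          sum(rate(http_requests_total{job="my-service", code=~"5.."}[5m]))
          / sum(rate(http_requests_total{job="my-service"}[5m])) > (14.4 * 0.001)
        )
      for: 2m
      labels:
        severity: page
      annotations:
        summary: "my-service is burning its error budget at more than 14.4x"
        runbook_url: "https://wiki.example.com/runbooks/my-service-slo-burn"  # hypothetical URL
```

Slower burn rates over longer windows are usually better suited to ticket severity than to a page; the thresholds here are starting points to tune against real traffic.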
PrometheusRule Templates
These Kubernetes-native alert rules cover common SRE scenarios. Apply them with kubectl apply; the Prometheus Operator loads them automatically, provided the metadata labels match the ruleSelector of your Prometheus resource.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-service-alerts
  namespace: monitoring
  labels:
    prometheus: kube-prometheus  # must match your Prometheus ruleSelector
spec:
  groups:
    - name: my-service.availability
      rules:
        # Symptom: high error rate
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{job="my-service", code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="my-service"}[5m])) > 0.01
          for: 5m
          labels:
            severity: page
            team: platform
          annotations:
            summary: "High 5xx error rate on my-service"
            description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes."
            runbook_url: "https://wiki.example.com/runbooks/my-service-errors"
        # Symptom: high latency
        - alert: HighP99Latency
          expr: |
            histogram_quantile(0.99, sum by(le)(
              rate(http_request_duration_seconds_bucket{job="my-service"}[5m])
            )) > 1.0
          for: 5m
          labels:
            severity: page
          annotations:
            summary: "p99 latency > 1s on my-service"
            description: "p99 is {{ $value | humanizeDuration }}"
            # Page alerts need a runbook; placeholder URL, replace with your own
            runbook_url: "https://wiki.example.com/runbooks/my-service-latency"
    - name: kubernetes.workloads
      rules:
        # Cause: pod CrashLoopBackOff, ticket severity
        - alert: PodCrashLooping
          expr: |
            rate(kube_pod_container_status_restarts_total[15m]) * 60 > 0
          for: 15m
          labels:
            severity: ticket
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash-looping"
            # %.2f rather than %.0f: restart rates are usually well below 1/min and would print as 0
            description: "Container {{ $labels.container }} restarting {{ $value | printf \"%.2f\" }} times/min"
        # Cause (paged as early warning): fewer than half of the desired replicas are available
        - alert: DeploymentNotAvailable
          expr: |
            kube_deployment_status_replicas_available
            / kube_deployment_spec_replicas < 0.5
          for: 10m
          labels:
            severity: page
          annotations:
            summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has < 50% pods available"
            # Page alerts need a runbook; placeholder URL, replace with your own
            runbook_url: "https://wiki.example.com/runbooks/deployment-not-available"

Alertmanager Configuration
Alertmanager routes alerts to the right receiver (PagerDuty, Slack, email) based on labels; use inhibit rules to suppress child alerts when a parent alert is firing to reduce noise.
global:
  resolve_timeout: 5m
route:
  group_by: [alertname, namespace, team]
  group_wait: 30s       # wait before sending the first notification (allows grouping)
  group_interval: 5m    # wait before notifying about new alerts added to an existing group
  repeat_interval: 4h   # how often to re-notify for a still-firing alert
  receiver: slack-general
  routes:
    # Page alerts go to PagerDuty
    - match:
        severity: page
      receiver: pagerduty
      continue: false
    # Team-specific routing
    - match:
        team: platform
      receiver: slack-platform
receivers:
  - name: slack-general
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'
        title: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
        # {{ "\n" }} emits a real newline; a bare \n in a single-quoted YAML string stays literal
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}'
  - name: pagerduty
    pagerduty_configs:
      - routing_key: '<your-pagerduty-key>'
        description: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
  - name: slack-platform
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#platform-alerts'
inhibit_rules:
  # Suppress CrashLoopBackOff alerts when a higher-level service alert is firing
  - source_match:
      severity: page
      alertname: DeploymentNotAvailable
    target_match:
      severity: ticket
      alertname: PodCrashLooping
    # List only labels that both alerts carry. By default PodCrashLooping has no
    # deployment label, so including it here would keep the rule from ever matching.
    equal: [namespace]

Testing Alerts
Use promtool to unit-test alert rules before deploying them. This catches silent failures where the query syntax is valid but the alert never fires as expected.
# promtool reads plain rule files (top-level groups:), not the PrometheusRule
# wrapper; my-rules.yaml is assumed to hold only the contents of spec.
# Validate rule file syntax
promtool check rules my-rules.yaml
# Run unit tests
promtool test rules alert-tests.yaml
# Check Alertmanager config
amtool check-config alertmanager.yaml
# The amtool commands below need --alertmanager.url=<url> unless it is set in amtool's config file
# See active alerts (from Alertmanager)
amtool alert query                 # all active alerts
amtool alert query severity=page   # page alerts only
# Silence an alert during planned work
amtool silence add alertname=PodCrashLooping namespace=staging --duration=1h --comment="Scheduled maintenance"
# Test a PromQL expression before writing a rule
curl -s 'http://prometheus.monitoring.svc:9090/api/v1/query' \
  --data-urlencode 'query=rate(http_requests_total[5m])' | jq .
# Check Prometheus rule evaluation status
kubectl port-forward svc/prometheus-operated 9090 -n monitoring
# Open http://localhost:9090/rules in browser
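
The alert-tests.yaml file passed to promtool test rules could look roughly like the sketch below. It assumes my-rules.yaml contains the HighErrorRate rule from the PrometheusRule template in plain rule-file form (top-level groups:). The synthetic series and their rates are made up for illustration: 6 errors and 54 successes per minute, which is a 10% error ratio. promtool compares labels and annotations exactly, so the expected annotations must match the rendered templates character for character.

```yaml
# alert-tests.yaml: sketch of a unit test for the HighErrorRate rule.
# Assumption: my-rules.yaml is a plain Prometheus rule file, not a PrometheusRule CRD.
rule_files:
  - my-rules.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Counters growing by a fixed amount per minute: 'start+stepxN'
      - series: 'http_requests_total{job="my-service", code="500"}'
        values: '0+6x20'
      - series: 'http_requests_total{job="my-service", code="200"}'
        values: '0+54x20'
    alert_rule_test:
      # At 15m the condition has held well past the rule's "for: 5m"
      - eval_time: 15m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: page
              team: platform
            exp_annotations:
              summary: "High 5xx error rate on my-service"
              description: "Error rate is 10% over the last 5 minutes."
              runbook_url: "https://wiki.example.com/runbooks/my-service-errors"
```

A second test case with an error ratio below 1% and an empty exp_alerts list is a useful complement, since it checks that the alert stays quiet when it should.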