Incident Response
Gather evidence before changing anything. Narrow the blast radius, restore service, communicate, then understand why. This page covers the repeatable steps from alert to post-incident review.
Incident Triage Flow
Follow this sequence every time — even when you think you know the answer. Panic-driven shortcuts cause cascading failures more often than the original incident.
First 5 Minutes
This script establishes your working context, locates unhealthy resources, and captures a baseline snapshot of events and logs. Set NS and APP to the affected namespace and app label, then run the remaining commands unchanged before touching anything.
# 1. Confirm context — never assume you're on the right cluster
kubectl config current-context
kubectl config view --minify -o jsonpath='{.contexts[0].context.namespace}'; echo
# 2. Quick cluster health
kubectl get nodes -o wide
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded
# 3. Scope to the impacted namespace / app
NS=<namespace>
APP=<app-label-value>
kubectl get deploy,sts,ds,svc,ingress,pvc -n "$NS" -o wide
kubectl get events -n "$NS" --sort-by=.lastTimestamp | tail -40
kubectl get pods -n "$NS" -l app="$APP" -o wide
# 4. Capture baseline logs (before any change)
kubectl logs -n "$NS" -l app="$APP" --tail=200 --all-containers --prefix 2>&1 | tee /tmp/incident-logs-$(date +%s).txt
# 5. Who owns it? (GitOps / Helm / Terraform)
kubectl get deploy "$APP" -n "$NS" -o jsonpath='{.metadata.annotations}' | python3 -m json.tool
Severity & Escalation
| Severity | Definition | Response time | Action |
|---|---|---|---|
| SEV1 | Complete outage, data loss risk, security breach | Immediate | Page incident commander, customer comms within 5 min |
| SEV2 | Significant degradation, major feature broken | < 15 min | Engage on-call, status page update |
| SEV3 | Minor degradation, workaround available | < 1 hour | Track in ticket, fix in business hours |
| SEV4 | Cosmetic issue, no user impact | Next sprint | Log and schedule |
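The table above can be encoded as a small lookup so that alerting or chat tooling reports the expected response window consistently. This is a minimal sketch, not part of any existing tool: the function name `sev_response_minutes` is hypothetical, "Immediate" is represented as 0 minutes by assumption, and SEV4 prints `none` because "Next sprint" has no timed window.

```shell
# Hypothetical helper: map a severity label to the response window
# (in minutes) taken from the severity table.
# Assumptions: "Immediate" is encoded as 0; SEV4 has no timed window.
sev_response_minutes() {
  case "$1" in
    SEV1) echo 0 ;;
    SEV2) echo 15 ;;
    SEV3) echo 60 ;;
    SEV4) echo none ;;
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `sev_response_minutes SEV2` prints `15`, and an unrecognised label returns a non-zero exit status so callers can fail loudly instead of guessing.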
Communication Templates
Use these templates for Slack / status page updates. Consistent, frequent communication reduces stakeholder interruptions so you can focus on fixing the issue.
# --- INCIDENT OPENED ---
:rotating_light: *[SEV2] Incident: <short title>*
- *Impact:* <what users see>
- *Affected:* <service / cluster / namespace>
- *Started:* <time in UTC>
- *IC:* <your name>
- *Status:* Investigating — next update in 15 min
# --- UPDATE (every 15–30 min) ---
:speech_balloon: *[SEV2] Update — <time UTC>*
- *Findings:* <what you found>
- *Action taken:* <what you changed and why>
- *Next step:* <what you're doing next>
- *ETA:* <estimate or "unknown">
# --- RESOLVED ---
:white_check_mark: *[SEV2] Resolved — <time UTC>*
- *Duration:* <start to resolve time>
- *Root cause:* <one-line summary>
- *Fix:* <what was done>
- *Follow-up:* postmortem scheduled for <date>
Rollback Decision Tree
Roll back first when the cause is unclear or the fix will take more than 15 minutes; an unnecessary rollback of a release that turns out to be healthy costs far less than a prolonged outage.
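The rule above reduces to a check on two inputs. The sketch below encodes it so the decision is explicit during an incident; `should_rollback` and its arguments are hypothetical names, and the 15-minute threshold is the one stated above.

```shell
# Hypothetical helper encoding the rollback rule.
# $1: "yes" if the cause is understood, anything else if it is unclear
# $2: estimated minutes needed to ship a forward fix
# Prints "rollback" or "fix-forward".
should_rollback() {
  if [ "$1" != "yes" ] || [ "$2" -gt 15 ]; then
    echo rollback
  else
    echo fix-forward
  fi
}

should_rollback no 5     # cause unclear: prints "rollback"
should_rollback yes 45   # fix too slow: prints "rollback"
should_rollback yes 10   # known cause, quick fix: prints "fix-forward"
```

Whichever branch applies, the commands that follow show how to carry out the rollback itself.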
# View rollout history
kubectl rollout history deploy/<app> -n <ns>
kubectl rollout history deploy/<app> -n <ns> --revision=3 # see revision detail
# Roll back to previous version
kubectl rollout undo deploy/<app> -n <ns>
# Roll back to a specific revision
kubectl rollout undo deploy/<app> -n <ns> --to-revision=4
# Watch rollback progress
kubectl rollout status deploy/<app> -n <ns> --timeout=5m
# Helm rollback (if app is Helm-managed)
helm history <release> -n <ns>
helm rollback <release> <revision> -n <ns> --wait
# ArgoCD: sync back to known-good commit
argocd app set <app> --revision <commit-sha>
argocd app sync <app>
Timeline Keeping
A real-time incident timeline doubles as the postmortem skeleton; paste every action + outcome into a shared doc as you go — you will forget details within an hour.
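A small helper can lower the cost of keeping the timeline current. This sketch appends one row in the table format used by the template below, stamped with the current UTC time. The function name `tl` and the `TIMELINE_FILE` variable are hypothetical, and it assumes entries contain no `|` characters, which would break the table.

```shell
# Hypothetical helper: append one row to a Markdown timeline file.
# Assumes fields contain no "|" characters.
TIMELINE_FILE="${TIMELINE_FILE:-/tmp/incident-timeline.md}"

tl() {
  # $1 = who, $2 = action / observation, $3 = outcome (optional)
  printf '| %s | %s | %s | %s |\n' \
    "$(date -u +%H:%M)" "$1" "$2" "${3:-}" >> "$TIMELINE_FILE"
}
```

A call such as `tl @you "Checked context, pods, events" "3 pods OOMKilled"` adds one row; paste the file into the shared doc as the incident progresses.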
# Incident Timeline — <title> — <date> UTC
| Time (UTC) | Who | Action / Observation | Outcome |
|------------|----------|---------------------------------------------------|----------------------|
| 14:02 | alert | PagerDuty fired: api-server 5xx > 10% | Incident opened |
| 14:05 | @you | Checked context, pods, events | 3 pods OOMKilled |
| 14:08 | @you | Increased memory limit from 256Mi to 512Mi | Pods restarted |
| 14:12 | @you | Error rate dropped to 0; rollout complete | Service restored |
| 14:15 | @you | Communicated resolved on #incidents | |
# Root Cause: memory limit set too low for peak traffic
# Contributing factor: no VPA or alert on memory utilisation
# Follow-up: add memory alert, increase limit, enable VPA
Postmortem Template
A blameless postmortem focuses on system failures, not people; the goal is to generate concrete follow-up actions with owners and deadlines.
# Postmortem — <Title>
## Summary
<1–2 sentence description of what happened, duration, and user impact.>
## Timeline (UTC)
<paste from incident timeline>
## Root Cause
<Specific technical reason. Avoid "human error" as a root cause.>
## Contributing Factors
- <Factor 1: e.g., no alerting on memory utilisation>
- <Factor 2: e.g., capacity not reviewed after traffic increase>
## Impact
- Services affected: <list>
- Users impacted: <count or %>
- Error budget consumed: <minutes / SLO math>
## What Went Well
- <Detection was fast>
- <On-call had runbook access>
## Action Items
| Action | Owner | Due | Priority |
|-------------------------------------|-----------|------------|----------|
| Add memory utilisation alert | @platform | 2026-06-01 | P1 |
| Enable VPA for api-service | @platform | 2026-06-07 | P2 |
| Review all service memory limits | @team | 2026-06-14 | P2 |
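For the "Error budget consumed" field, the arithmetic is: budget minutes = (1 - SLO) × window minutes, and consumed share = outage minutes ÷ budget minutes. The sketch below works that through with awk. The function names are hypothetical, and the 99.9% SLO and 30-day window are illustrative assumptions rather than values from this runbook; the 10-minute outage matches the sample timeline (14:02 to 14:12).

```shell
# Hypothetical helpers for the "Error budget consumed" field.

# $1 = SLO target in percent (e.g. 99.9), $2 = window in days
error_budget_minutes() {
  awk -v slo="$1" -v days="$2" \
    'BEGIN { printf "%.1f\n", (100 - slo) / 100 * days * 24 * 60 }'
}

# $1 = outage minutes, $2 = SLO target in percent, $3 = window in days
error_budget_consumed_pct() {
  awk -v outage="$1" -v slo="$2" -v days="$3" \
    'BEGIN { printf "%.1f\n", outage / ((100 - slo) / 100 * days * 24 * 60) * 100 }'
}

error_budget_minutes 99.9 30          # prints 43.2
error_budget_consumed_pct 10 99.9 30  # prints 23.1
```

Under these assumptions, a 10-minute outage uses roughly 23% of the monthly budget. Substitute the service's real SLO and window before quoting a figure in the postmortem.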