Incident Response — K8s SRE Reference

TL;DR

Gather evidence before changing anything. Narrow the blast radius, restore service, communicate, then understand why. This page covers the repeatable steps from alert to post-incident review.

Incident Triage Flow

Follow this sequence every time — even when you think you know the answer. Panic-driven shortcuts cause cascading failures more often than the original incident.

First 5 Minutes

This script establishes your working context, locates unhealthy resources, and captures a baseline snapshot of events and logs. Set NS and APP to real values first, then run every step in order before touching anything.

first-5-min.sh

# 1. Confirm context — never assume you're on the right cluster
kubectl config current-context
kubectl config view --minify -o jsonpath='{.contexts[0].context.namespace}'; echo

# 2. Quick cluster health
kubectl get nodes -o wide
# Phase filter only: CrashLoopBackOff pods report phase Running and will not show up here
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded

# 3. Scope to the impacted namespace / app
# Replace both placeholders; left unquoted, bash would parse <...> as a redirect
NS="<namespace>"
APP="<app-label-value>"

kubectl get deploy,sts,ds,svc,ingress,pvc -n "$NS" -o wide
kubectl get events -n "$NS" --sort-by=.lastTimestamp | tail -40
kubectl get pods -n "$NS" -l app="$APP" -o wide

# 4. Capture baseline logs (before any change)
kubectl logs -n "$NS" -l app="$APP" --tail=200 --all-containers --prefix 2>&1 | tee /tmp/incident-logs-$(date +%s).txt

# 5. Who owns it? (GitOps / Helm / Terraform)
# Assumes the Deployment is named after the app label; otherwise substitute its real name.
kubectl get deploy "$APP" -n "$NS" -o jsonpath='{.metadata.annotations}' | python3 -m json.tool
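Step 1 of the script relies on you reading the context name and noticing a mismatch. As a minimal sketch, that check can be made a hard guard. The function name `require_context` and the context name in the usage line are hypothetical, not part of the original script.

```bash
#!/usr/bin/env bash
# Hypothetical helper: refuse to continue unless kubectl points at the expected context.
# Usage: require_context <expected-context-name>
require_context() {
  local expected="$1" current
  # `kubectl config current-context` prints the active context name
  current="$(kubectl config current-context 2>/dev/null)" || {
    echo "cannot read current kubectl context" >&2
    return 2
  }
  if [ "$current" != "$expected" ]; then
    echo "context mismatch: on '$current', expected '$expected'" >&2
    return 1
  fi
  echo "context ok: $current"
}
```

A possible use is `require_context prod-eu-1 || exit 1` as the first line of any incident script, so a wrong cluster stops the run instead of producing misleading evidence.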

Severity & Escalation

Severity	Definition	Response time	Action
SEV1	Complete outage, data loss risk, security breach	Immediate	Page incident commander, customer comms within 5 min
SEV2	Significant degradation, major feature broken	< 15 min	Engage on-call, status page update
SEV3	Minor degradation, workaround available	< 1 hour	Track in ticket, fix in business hours
SEV4	Cosmetic issue, no user impact	Next sprint	Log and schedule

Communication Templates

Use these templates for Slack / status page updates. Consistent, frequent communication reduces stakeholder interruptions so you can focus on fixing the issue.

comms-templates.txt

# --- INCIDENT OPENED ---
:rotating_light: *[SEV2] Incident: <short title>*
- *Impact:* <what users see>
- *Affected:* <service / cluster / namespace>
- *Started:* <time in UTC>
- *IC:* <your name>
- *Status:* Investigating — next update in 15 min

# --- UPDATE (every 15–30 min) ---
:speech_balloon: *[SEV2] Update — <time UTC>*
- *Findings:* <what you found>
- *Action taken:* <what you changed and why>
- *Next step:* <what you're doing next>
- *ETA:* <estimate or "unknown">

# --- RESOLVED ---
:white_check_mark: *[SEV2] Resolved — <time UTC>*
- *Duration:* <start to resolve time>
- *Root cause:* <one-line summary>
- *Fix:* <what was done>
- *Follow-up:* postmortem scheduled for <date>
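Filling the template by hand under pressure invites missing fields. The sketch below renders the "INCIDENT OPENED" message from arguments. The function name `incident_opened` and its argument order are assumptions for illustration, and the start time defaults to the current UTC time when none is given.

```bash
#!/usr/bin/env bash
# Hypothetical helper: render the "INCIDENT OPENED" template.
# Usage: incident_opened <sev> <title> <impact> <affected> <ic> [started]
incident_opened() {
  local sev="$1" title="$2" impact="$3" affected="$4" ic="$5"
  # Default the start time to "now" in UTC if the caller did not pass one
  local started="${6:-$(date -u +'%Y-%m-%d %H:%M UTC')}"
  cat <<EOF
:rotating_light: *[${sev}] Incident: ${title}*
- *Impact:* ${impact}
- *Affected:* ${affected}
- *Started:* ${started}
- *IC:* ${ic}
- *Status:* Investigating, next update in 15 min
EOF
}
```

The output can be pasted into Slack as is. The same pattern should extend to the UPDATE and RESOLVED templates.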

Rollback Decision Tree

Roll back first when the cause is unclear or the fix will take more than 15 minutes. An unnecessary rollback to a known-good release costs far less than a prolonged outage.

rollback.sh

# View rollout history
kubectl rollout history deploy/<app> -n <ns>
kubectl rollout history deploy/<app> -n <ns> --revision=<revision>  # inspect one revision's pod template

# Roll back to previous version
kubectl rollout undo deploy/<app> -n <ns>

# Roll back to a specific revision
kubectl rollout undo deploy/<app> -n <ns> --to-revision=<revision>

# Watch rollback progress
kubectl rollout status deploy/<app> -n <ns> --timeout=5m

# Helm rollback (if app is Helm-managed)
helm history <release> -n <ns>
helm rollback <release> <revision> -n <ns> --wait

# ArgoCD: sync back to known-good commit
argocd app set <app> --revision <commit-sha>
argocd app sync <app>
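The rule at the top of this section (roll back when the cause is unclear or the fix will take more than 15 minutes) can be written down as a tiny decision function. This is only a sketch: the name `rollback_decision` and the "fix-forward" label for the other branch are assumptions, since the text does not name the alternative.

```bash
#!/usr/bin/env bash
# Hypothetical helper encoding the rollback rule.
# Usage: rollback_decision <cause_known: yes|no> <fix_eta_minutes>
rollback_decision() {
  local cause_known="$1" fix_eta_min="$2"
  # Roll back if the cause is unclear OR the fix needs more than 15 minutes
  if [ "$cause_known" != "yes" ] || [ "$fix_eta_min" -gt 15 ]; then
    echo "rollback"
  else
    echo "fix-forward"
  fi
}
```

For example, `rollback_decision no 5` yields `rollback` because the cause is unknown, while `rollback_decision yes 10` yields `fix-forward`.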

Timeline Keeping

A real-time incident timeline doubles as the postmortem skeleton; paste every action + outcome into a shared doc as you go — you will forget details within an hour.

timeline-template.md

# Incident Timeline — <title>  — <date> UTC

| Time (UTC) | Who      | Action / Observation                              | Outcome              |
|------------|----------|---------------------------------------------------|----------------------|
| 14:02      | alert    | PagerDuty fired: api-server 5xx > 10%            | Incident opened      |
| 14:05      | @you     | Checked context, pods, events                     | 3 pods OOMKilled     |
| 14:08      | @you     | Increased memory limit from 256Mi to 512Mi        | Pods restarted       |
| 14:12      | @you     | Error rate dropped to 0; rollout complete         | Service restored     |
| 14:15      | @you     | Communicated resolved on #incidents               |                      |

# Root Cause: memory limit set too low for peak traffic
# Contributing factor: no VPA or alert on memory utilisation
# Follow-up: add memory alert, increase limit, enable VPA
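Typing timestamps by hand is the first habit to slip during an incident. As a sketch, a one-line shell helper can append rows in the table format above with the current UTC time. The function name `tl` and the `TIMELINE_FILE` variable are hypothetical. A shared doc would still need the rows pasted in.

```bash
#!/usr/bin/env bash
# Hypothetical helper: append one row to a Markdown timeline, stamped with UTC time.
# Usage: tl <who> <action-or-observation> [outcome]
TIMELINE_FILE="${TIMELINE_FILE:-./incident-timeline.md}"

tl() {
  local who="$1" action="$2" outcome="${3:-}"
  # Columns match the template: Time (UTC) | Who | Action / Observation | Outcome
  printf '| %s | %s | %s | %s |\n' \
    "$(date -u +%H:%M)" "$who" "$action" "$outcome" >> "$TIMELINE_FILE"
}
```

A call such as `tl "@you" "Checked context, pods, events" "3 pods OOMKilled"` adds one row. The outcome can be omitted when it is not yet known.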

Postmortem Template

A blameless postmortem focuses on system failures, not people; the goal is to generate concrete follow-up actions with owners and deadlines.

postmortem.md

# Postmortem — <Title>

## Summary
<1–2 sentence description of what happened, duration, and user impact.>

## Timeline (UTC)
<paste from incident timeline>

## Root Cause
<Specific technical reason. Avoid "human error" as a root cause.>

## Contributing Factors
- <Factor 1: e.g., no alerting on memory utilisation>
- <Factor 2: e.g., capacity not reviewed after traffic increase>

## Impact
- Services affected: <list>
- Users impacted: <count or %>
- Error budget consumed: <minutes / SLO math>

## What Went Well
- <Detection was fast>
- <On-call had runbook access>

## Action Items
| Action                              | Owner     | Due        | Priority |
|-------------------------------------|-----------|------------|----------|
| Add memory utilisation alert        | @platform | 2026-06-01 | P1       |
| Enable VPA for api-service          | @platform | 2026-06-07 | P2       |
| Review all service memory limits    | @team     | 2026-06-14 | P2       |
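The "Error budget consumed" field in the Impact section is simple arithmetic once an SLO is fixed. The sketch below assumes a time-based availability SLO below 100%, with the budget being the downtime the SLO allows over the window. The function name `error_budget` and the example SLO of 99.9% over 30 days are illustrative assumptions, not values from this page.

```bash
#!/usr/bin/env bash
# Hypothetical helper: time-based error budget for an availability SLO.
# Usage: error_budget <slo_percent> <window_days> <downtime_minutes>
error_budget() {
  LC_ALL=C awk -v slo="$1" -v days="$2" -v down="$3" 'BEGIN {
    budget = days * 24 * 60 * (1 - slo / 100)   # allowed downtime in minutes
    printf "budget_min=%.1f consumed_pct=%.1f\n", budget, down / budget * 100
  }'
}
```

With the example timeline's 10 minutes from alert (14:02) to restoration (14:12), `error_budget 99.9 30 10` prints `budget_min=43.2 consumed_pct=23.1`, meaning that one incident used roughly a quarter of the month's budget under those assumptions.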