K8s SRE Reference
This site is a practical Kubernetes and platform SRE reference: short explanations, commands you can copy, annotated YAML, architecture diagrams, troubleshooting checks, and client-ready implementation examples.
Start Here
Use this site like an operations notebook. Open the page for the thing you are touching, skim the mental model, copy the safest command or manifest pattern, then validate with the troubleshooting section before making changes in a client cluster.
Learning Path & Field Workflow
Study order, from-scratch lab sequence, client-site workflow, and the content pattern each topic page should follow.
Incident Response
First-5-minutes triage, severity table, comms templates, rollback, and postmortem structure.
Linux Daily Commands
Essential bash, file, process, network, and user management commands for SRE day-to-day work.
Python SRE Scripts
Ready-to-use health check, Kubernetes reporting, log scanning, and Slack notification scripts.
Architecture & Control Plane
Cluster layout, apiserver, scheduler, controller-manager, etcd, kubelet, and HA control plane diagrams.
Deployments
Rollouts, rollbacks, probes, resources, update strategy, and production Deployment YAML.
Services & Load Balancing
ClusterIP, NodePort, LoadBalancer, endpoints, kube-proxy, and traffic troubleshooting.
RBAC
ServiceAccounts, Roles, ClusterRoles, bindings, and permission debugging with can-i checks.
Alerting Design
SLO burn-rate alerts, PrometheusRule templates, Alertmanager routing, and on-call testing.
AWS Ops (EKS / IAM)
EKS cluster management, IRSA setup, ECR operations, and IAM debugging commands.
Dockerfile Patterns
Multi-stage builds, distroless images, layer caching, and security hardening patterns.
Helm Charts
Chart layout, values, templates, helpers, releases, and safe upgrade patterns.
Client-Side Workflow
If you are still learning the topic, start with the Learning Path & Field Workflow. If you are already in the middle of client work, use this shorter checklist.
- Identify the layer: workload, network, storage, security, control plane, or CI/CD.
- Read the page TL;DR: confirm the object type and expected behavior.
- Copy commands carefully: replace placeholders like `<namespace>`, `<pod>`, and `<release>`.
- Prefer read-only checks first: use `get`, `describe`, `logs`, `events`, and `auth can-i` before applying changes.
- Apply changes through the client process: GitOps, Helm, Terraform, or change ticket depending on the environment.
# Confirm which cluster and namespace your kubectl context is pointing at.
kubectl config current-context
kubectl config view --minify --output 'jsonpath={..namespace}'; echo
# Quick cluster health. Use this before changing anything.
kubectl get nodes -o wide
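# Lists pods whose phase is not Running. Completed (Succeeded) pods also appear here,
# and crash-looping pods usually still report phase Running, so check the STATUS column too.
kubectl get pods --all-namespaces --field-selector='status.phase!=Running'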
kubectl get events --all-namespaces --sort-by=.lastTimestamp | tail -n 30
# Replace <namespace> with the app namespace you are investigating.
kubectl get deploy,sts,ds,svc,ingress,pvc -n <namespace>
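The "replace placeholders" and "read-only checks first" steps of the checklist can be wrapped in a small helper. The sketch below is illustrative and not part of this site's tooling: the function names `has_placeholder` and `triage_pod` are hypothetical, and it assumes `kubectl` is installed and already pointed at the intended context. It refuses to run while an argument still contains an unreplaced `<placeholder>`, then runs only commands that do not change cluster state.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names are illustrative and not part of kubectl.

# Return 0 if the argument still contains a <placeholder> token.
has_placeholder() {
  case "$1" in
    *'<'*'>'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Run read-only checks against one pod. Nothing here modifies the cluster.
# Usage: triage_pod NAMESPACE POD
triage_pod() {
  local ns="$1" pod="$2"
  if [ -z "$ns" ] || [ -z "$pod" ]; then
    echo "usage: triage_pod NAMESPACE POD" >&2
    return 2
  fi
  if has_placeholder "$ns" || has_placeholder "$pod"; then
    echo "refusing to run: replace placeholders first" >&2
    return 2
  fi
  kubectl get pod "$pod" -n "$ns" -o wide
  kubectl describe pod "$pod" -n "$ns"
  kubectl logs "$pod" -n "$ns" --all-containers --tail=50
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod" --sort-by=.lastTimestamp
  # Check your own permissions before planning a change.
  kubectl auth can-i patch deployments -n "$ns"
}
```

After sourcing the file, a call such as `triage_pod payments api-0` (both names are made up) prints the pod summary, description, recent logs, related events, and whether the current identity may patch Deployments in that namespace.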
Coverage
| Area | What It Covers | When You Need It |
|---|---|---|
| Linux | Daily commands, networking, performance, systemd & logs, storage & filesystems. | Node debugging, SSH sessions, system-level incident triage. |
| Python | Automation scripts, SRE utilities, API/JSON/YAML processing, subprocess & CLI wrapping. | Writing quick tools, health checks, or K8s automation without a full framework. |
| Core Kubernetes | Architecture, internals, worker nodes, pod lifecycle, namespaces, RBAC boundaries. | Explaining and debugging cluster behavior without guessing. |
| Workloads | Deployments, StatefulSets, DaemonSets, Jobs, ConfigMaps, Secrets, scheduling, autoscaling. | Most day-to-day app incidents and capacity decisions. |
| Networking | Services, Ingress, Gateway API, ExternalDNS, CoreDNS, CNIs, service mesh. | Network symptoms that cross team boundaries. |
| Security | RBAC, network policy, pod security, admission policy, image supply chain, secrets. | Strict client environments and audit requirements. |
| Operations | Incident response, SLOs, production readiness, capacity planning, upgrades, backup/DR. | On-call, maintenance windows, and pre-production reviews. |
| Monitoring | Prometheus, Grafana, Thanos, logging, distributed tracing, alerting design. | Observability setup and alert triage during incidents. |
| Cloud | AWS EKS/IAM/ECR, AKS, GKE, cost optimization. | Cloud-specific tooling for managed K8s platforms. |
| Containers | Dockerfile patterns, runtime debugging (crictl/nsenter), image troubleshooting. | Build pipeline issues, ImagePullBackOff, and container runtime failures. |
| GitOps & CI/CD | ArgoCD, Flux, GitHub Actions, GitLab CI, Jenkins, Git reference. | Production changes flowing through automation and code review. |
| Delivery | Helm charts, Terraform for AWS/Azure/GCP. | Provisioning infrastructure and deploying applications reproducibly. |
Safety Defaults
- Never run destructive commands from memory. Check the namespace, context, labels, and owner references first.
- Prefer `kubectl diff -f file.yaml` before `kubectl apply -f file.yaml` when applying direct manifests.
- In GitOps clusters, avoid manual drift unless it is an approved emergency action. Patch Git, Helm values, or Terraform instead.
- For production incidents, capture evidence before changing state: events, logs, rollout history, pod specs, and relevant metrics.
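Two of these defaults, diff before apply and capture evidence before changing state, can be partly encoded in shell helpers. This is a minimal sketch under stated assumptions: `diff_then_apply` and `capture_evidence` are hypothetical names, `kubectl` is assumed to be installed and authenticated, and the evidence directory layout is only an illustration. It relies on `kubectl diff` exiting with 0 when there are no differences, 1 when differences exist, and a higher code on error. Metrics are not captured here because their source depends on the client's monitoring stack.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; adapt them to the client's change process.

# Show the context, diff a manifest, and apply only after explicit confirmation.
# Usage: diff_then_apply FILE
diff_then_apply() {
  local file="$1" rc=0 answer
  echo "context: $(kubectl config current-context)"
  kubectl diff -f "$file" || rc=$?
  case "$rc" in
    0) echo "no differences; nothing to apply"; return 0 ;;
    1) ;;  # differences found, continue to the prompt
    *) echo "kubectl diff failed (exit $rc); not applying" >&2; return "$rc" ;;
  esac
  printf 'Apply %s? Type yes to continue: ' "$file"
  read -r answer
  if [ "$answer" = "yes" ]; then
    kubectl apply -f "$file"
  else
    echo "aborted; nothing applied"
    return 1
  fi
}

# Save events, rollout history, specs, and logs before changing state.
# Usage: capture_evidence NAMESPACE DEPLOYMENT [OUTPUT_DIR]
capture_evidence() {
  local ns="$1" deploy="$2"
  local dir="${3:-evidence-$(date -u +%Y%m%dT%H%M%SZ)}"
  mkdir -p "$dir" || return 1
  kubectl get events -n "$ns" --sort-by=.lastTimestamp > "$dir/events.txt"
  kubectl rollout history "deployment/$deploy" -n "$ns" > "$dir/rollout-history.txt"
  kubectl get "deployment/$deploy" -n "$ns" -o yaml > "$dir/deployment.yaml"
  kubectl get pods -n "$ns" -o yaml > "$dir/pods.yaml"
  kubectl logs "deployment/$deploy" -n "$ns" --all-containers --tail=200 > "$dir/logs.txt"
  echo "evidence saved to $dir"
}
```

The interactive prompt is a deliberate speed bump for direct-manifest workflows only. In GitOps clusters, the diff output is better attached to the pull request or change ticket than applied by hand.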