K8s SRE Reference
This site is a practical Kubernetes and platform SRE reference: short explanations, commands you can copy, annotated YAML, architecture diagrams, troubleshooting checks, and client-ready implementation examples.
Start Here
Use this site like an operations notebook. Open the page for the thing you are touching, skim the mental model, copy the safest command or manifest pattern, then validate with the troubleshooting section before making changes in a client cluster.
Learning Path & Field Workflow
Study order, from-scratch lab sequence, client-site workflow, and the content pattern each topic page should follow.
Architecture & Control Plane
Cluster layout, apiserver, scheduler, controller-manager, etcd, kubelet, and HA control plane diagrams.
Internals
How the API server, etcd, admission, watches, controllers, and reconciliation actually fit together.
Deployments
Rollouts, rollbacks, probes, resources, update strategy, and production Deployment YAML.
Services & Load Balancing
ClusterIP, NodePort, LoadBalancer, endpoints, kube-proxy, and traffic troubleshooting.
RBAC
ServiceAccounts, Roles, ClusterRoles, bindings, and permission debugging with can-i checks.
Helm Charts
Chart layout, values, templates, helpers, releases, and safe upgrade patterns.
Client-Side Workflow
If you are still learning the topic, start with the Learning Path & Field Workflow. If you are already in the middle of client work, use this shorter checklist.
- Identify the layer: workload, network, storage, security, control plane, or CI/CD.
- Read the page TL;DR: confirm the object type and expected behavior.
- Copy commands carefully: replace placeholders like `<namespace>`, `<pod>`, and `<release>`.
- Prefer read-only checks first: use `get`, `describe`, `logs`, `events`, and `auth can-i` before applying changes.
- Apply changes through the client process: GitOps, Helm, Terraform, or change ticket depending on the environment.
# Confirm which cluster and namespace your kubectl context is pointing at.
kubectl config current-context
kubectl config view --minify --output 'jsonpath={..namespace}'; echo
# Quick cluster health. Use this before changing anything.
kubectl get nodes -o wide
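# Lists pods whose phase is not Running. Completed Job pods (Succeeded) also match,
# and crash-looping pods usually stay in phase Running, so they will not appear here.
kubectl get pods --all-namespaces --field-selector=status.phase!=Running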
kubectl get events --all-namespaces --sort-by=.lastTimestamp | tail -n 30
# Replace <namespace> with the app namespace you are investigating.
kubectl get deploy,sts,ds,svc,ingress,pvc -n <namespace>
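The read-only step in the checklist also covers permissions. Here is a minimal sketch of a `kubectl auth can-i` sweep, assuming you only care about a few verbs on pods and deployments. The helper name `can_i_report` and the verb and resource lists are illustrative assumptions, not part of kubectl.

```shell
# Sketch: print which verbs the current identity may use in one namespace.
# can_i_report is a hypothetical helper; it only calls `kubectl auth can-i`.
# The ( ... ) body runs in a subshell so variables do not leak to the caller.
can_i_report() (
  ns="${1:?usage: can_i_report NAMESPACE}"
  for verb in get list patch delete; do
    for resource in pods deployments; do
      # `kubectl auth can-i` prints yes or no and exits non-zero on "no",
      # so keep the printed answer and ignore the exit status.
      answer=$(kubectl auth can-i "$verb" "$resource" -n "$ns" 2>/dev/null) || true
      printf '%-7s %-12s %s\n' "$verb" "$resource" "${answer:-unknown}"
    done
  done
)
```

Call it as `can_i_report payments`, where `payments` stands in for the app namespace. A `no` next to `patch` or `delete` is worth knowing before you plan a change, and it is often the point to stop and go through the client's access process.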
Coverage Plan
| Area | What It Will Cover | Why It Matters On Contract |
|---|---|---|
| Core | Architecture, internals, worker nodes, pod lifecycle, namespaces, RBAC boundaries. | Helps you explain and debug cluster behavior without guessing. |
| Workloads | Deployments, StatefulSets, DaemonSets, Jobs, ConfigMaps, Secrets, scheduling. | Most day-to-day app incidents are here. |
| Networking | Services, ingress, DNS, CNIs, service mesh, load balancers. | Network symptoms are common and usually cross team boundaries. |
| Security | RBAC, network policy, pod security, certificates, secret management. | Client environments often have strict controls and audit needs. |
| Delivery | Helm, ArgoCD, CircleCI, generic CI/CD, Terraform. | Production changes usually flow through code and automation. |
| Operations | Monitoring, draining, upgrades, operators, hardware maintenance, troubleshooting. | This is the incident and maintenance toolkit. |
Safety Defaults
- Never run destructive commands from memory. Check the namespace, context, labels, and owner references first.
- Prefer `kubectl diff -f file.yaml` before `kubectl apply -f file.yaml` when applying direct manifests.
- In GitOps clusters, avoid manual drift unless it is an approved emergency action. Patch Git, Helm values, or Terraform instead.
- For production incidents, capture evidence before changing state: events, logs, rollout history, pod specs, and relevant metrics.
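Two of these defaults are easy to wrap in small shell helpers. The sketch below is one possible shape, not a standard tool. The function names, the `APPLY` variable, and the output file layout are assumptions made for illustration. Metrics are left out because they depend on the client's monitoring stack.

```shell
# Sketch only: function names, the APPLY variable, and the file layout are
# illustrative assumptions. Only the kubectl subcommands are real.

# Save read-only evidence for one Deployment before changing state.
# The ( ... ) bodies run in a subshell so variables do not leak.
capture_evidence() (
  ns="${1:?namespace required}"
  deploy="${2:?deployment name required}"
  dir="${3:-evidence-$(date +%Y%m%dT%H%M%S)}"
  mkdir -p "$dir" || return 1
  kubectl get events -n "$ns" --sort-by=.lastTimestamp > "$dir/events.txt" 2>&1 || true
  kubectl rollout history "deployment/$deploy" -n "$ns" > "$dir/rollout-history.txt" 2>&1 || true
  kubectl get pods -n "$ns" -o yaml > "$dir/pods.yaml" 2>&1 || true
  # `kubectl logs deployment/NAME` reads from one pod of the Deployment.
  kubectl logs "deployment/$deploy" -n "$ns" --all-containers --tail=500 > "$dir/logs.txt" 2>&1 || true
  echo "$dir"
)

# Show the diff first; apply only when APPLY=yes is set explicitly.
diff_then_apply() (
  file="${1:?manifest file required}"
  rc=0
  kubectl diff -f "$file" || rc=$?
  case "$rc" in
    0) echo "no changes for $file"; return 0 ;;
    1) ;;  # exit status 1 means differences were found and printed above
    *) echo "kubectl diff failed (exit $rc)" >&2; return "$rc" ;;
  esac
  if [ "${APPLY:-no}" = "yes" ]; then
    kubectl apply -f "$file"
  else
    echo "diff only; rerun with APPLY=yes to apply $file"
  fi
)
```

A typical sequence would be `capture_evidence payments web`, then `diff_then_apply deployment.yaml` to review the diff, then `APPLY=yes diff_then_apply deployment.yaml` once the change is approved. The namespace, Deployment name, and file name here are placeholders. The opt-in variable keeps the default path read-only, which matches the rest of this list.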