K8s SRE Reference

Purpose

This site is a practical Kubernetes and platform SRE reference: short explanations, commands you can copy, annotated YAML, architecture diagrams, troubleshooting checks, and client-ready implementation examples.

Start Here

Use this site like an operations notebook. Open the page for the thing you are touching, skim the mental model, copy the safest command or manifest pattern, then validate with the troubleshooting section before making changes in a client cluster.

Learning Path & Field Workflow

Study order, from-scratch lab sequence, client-site workflow, and the content pattern each topic page should follow.

Architecture & Control Plane

Cluster layout, apiserver, scheduler, controller-manager, etcd, kubelet, and HA control plane diagrams.

Internals

How the API server, etcd, admission, watches, controllers, and reconciliation actually fit together.

Deployments

Rollouts, rollbacks, probes, resources, update strategy, and production Deployment YAML.

Services & Load Balancing

ClusterIP, NodePort, LoadBalancer, endpoints, kube-proxy, and traffic troubleshooting.

RBAC

ServiceAccounts, Roles, ClusterRoles, bindings, and permission debugging with can-i checks.

Helm Charts

Chart layout, values, templates, helpers, releases, and safe upgrade patterns.

Client-Side Workflow

If you are still learning the topic, start with the Learning Path & Field Workflow. If you are already in the middle of client work, use this shorter checklist.

Identify the layer: workload, network, storage, security, control plane, or CI/CD.
Read the page TL;DR: confirm the object type and expected behavior.
Copy commands carefully: replace placeholders like <namespace>, <pod>, and <release>.
Prefer read-only checks first: use get, describe, logs, events, and auth can-i before applying changes.
Apply changes through the client process: GitOps, Helm, Terraform, or change ticket depending on the environment.
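
The read-only step in the checklist can be sketched as a small shell function. This is a minimal sketch, not a standard tool: the function name readonly_checks is hypothetical, and it assumes kubectl is already pointed at the intended context.

```shell
# Hypothetical helper: read-only triage for one namespace.
# Every command below only reads state; none of them changes the cluster.
readonly_checks() {
  if [ -z "${1:-}" ]; then
    echo "usage: readonly_checks <namespace>" >&2
    return 2
  fi
  # Workloads and pods, with node placement.
  kubectl get deploy,sts,ds,pods -n "$1" -o wide
  # Recent events, oldest first.
  kubectl get events -n "$1" --sort-by=.lastTimestamp
  # What the current identity is allowed to do in this namespace.
  kubectl auth can-i --list -n "$1"
}
```

Call it as readonly_checks followed by the namespace, then follow up with kubectl describe and kubectl logs on whichever object looks wrong.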

bash first-look.sh

# Confirm which cluster and namespace your kubectl context is pointing at.
# An empty namespace line means the context falls back to "default".
kubectl config current-context
kubectl config view --minify --output 'jsonpath={..namespace}'; echo

# Quick cluster health. Use this before changing anything.
kubectl get nodes -o wide
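# Completed Job pods (phase Succeeded) are filtered out so only problem pods remain.
# Pods in CrashLoopBackOff still report phase Running, so they will not appear here.
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded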
kubectl get events --all-namespaces --sort-by=.lastTimestamp | tail -n 30

# Replace <namespace> with the app namespace you are investigating.
kubectl get deploy,sts,ds,svc,ingress,pvc -n <namespace>

Coverage Plan

Area	What It Will Cover	Why It Matters On Contract
Core	Architecture, internals, worker nodes, pod lifecycle, namespaces, RBAC boundaries.	Helps you explain and debug cluster behavior without guessing.
Workloads	Deployments, StatefulSets, DaemonSets, Jobs, ConfigMaps, Secrets, scheduling.	Most day-to-day app incidents are here.
Networking	Services, ingress, DNS, CNIs, service mesh, load balancers.	Network symptoms are common and usually cross team boundaries.
Security	RBAC, network policy, pod security, certificates, secret management.	Client environments often have strict controls and audit needs.
Delivery	Helm, ArgoCD, CircleCI, generic CI/CD, Terraform.	Production changes usually flow through code and automation.
Operations	Monitoring, draining, upgrades, operators, hardware maintenance, troubleshooting.	This is the incident and maintenance toolkit.

Safety Defaults

Never run destructive commands from memory. Check the namespace, context, labels, and owner references first.
Prefer kubectl diff -f file.yaml before kubectl apply -f file.yaml when applying direct manifests.
In GitOps clusters, avoid manual drift unless it is an approved emergency action. Patch Git, Helm values, or Terraform instead.
For production incidents, capture evidence before changing state: events, logs, rollout history, pod specs, and relevant metrics.
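
Two of these defaults, diffing before applying and capturing evidence before changing state, can be sketched as shell functions. This is a minimal sketch under stated assumptions: the function names preview_changes and capture_evidence and the output file names are hypothetical, the workload is assumed to be a Deployment, and kubectl top only returns data when metrics-server is installed.

```shell
# Hypothetical helper: show what would change, but never apply.
# kubectl diff exits 0 for no differences, 1 when differences exist,
# and greater than 1 when kubectl or diff itself failed.
preview_changes() {
  rc=0
  kubectl diff -f "$1" || rc=$?
  case "$rc" in
    0) echo "no differences; nothing to apply" ;;
    1) echo "differences found; review them, then run: kubectl apply -f $1" ;;
    *) echo "kubectl diff failed (exit $rc)" >&2; return "$rc" ;;
  esac
}

# Hypothetical helper: snapshot incident evidence into a directory.
# Arguments: <namespace> <deployment> [output-dir]
capture_evidence() {
  if [ -z "${1:-}" ] || [ -z "${2:-}" ]; then
    echo "usage: capture_evidence <namespace> <deployment> [output-dir]" >&2
    return 2
  fi
  evidence_dir="${3:-evidence-$(date +%Y%m%dT%H%M%S)}"
  mkdir -p "$evidence_dir"
  kubectl get events -n "$1" --sort-by=.lastTimestamp > "$evidence_dir/events.txt"
  kubectl rollout history "deployment/$2" -n "$1" > "$evidence_dir/rollout-history.txt"
  kubectl get pods -n "$1" -o yaml > "$evidence_dir/pods.yaml"
  # kubectl logs on a Deployment picks a single pod; collect per-pod logs if you need all of them.
  kubectl logs "deployment/$2" -n "$1" --all-containers --tail=500 > "$evidence_dir/logs.txt"
  # Metrics need metrics-server; tolerate failure so the rest of the capture is kept.
  kubectl top pods -n "$1" > "$evidence_dir/top.txt" 2>&1 || true
  echo "evidence written to $evidence_dir"
}
```

Both functions only read from the cluster. Keeping the apply step manual is deliberate: the diff output is meant to be reviewed by a person before any state changes.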