Purpose

This site is a practical Kubernetes and platform SRE reference: short explanations, commands you can copy, annotated YAML, architecture diagrams, troubleshooting checks, and client-ready implementation examples.

Start Here

Use this site like an operations notebook. Open the page for the thing you are touching, skim the mental model, copy the safest command or manifest pattern, then validate with the troubleshooting section before making changes in a client cluster.

Client-Side Workflow

If you are still learning the topic, start with the Learning Path & Field Workflow. If you are already in the middle of client work, use this shorter checklist.

  1. Identify the layer: workload, network, storage, security, control plane, or CI/CD.
  2. Read the page TL;DR: confirm the object type and expected behavior.
  3. Copy commands carefully: replace placeholders like <namespace>, <pod>, and <release>.
  4. Prefer read-only checks first: use get, describe, logs, events, and auth can-i before applying changes.
  5. Apply changes through the client process: GitOps, Helm, Terraform, or a change ticket, depending on the environment.
bash first-look.sh
# Confirm which cluster and namespace your kubectl context is pointing at.
# An empty second line means the context sets no namespace, so kubectl uses "default".
kubectl config current-context
kubectl config view --minify --output 'jsonpath={..namespace}'; echo

# Quick cluster health. Use this before changing anything.
kubectl get nodes -o wide
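# Pods that are neither Running nor Succeeded. Excluding Succeeded keeps completed Jobs out of the list.
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded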
kubectl get events --all-namespaces --sort-by=.lastTimestamp | tail -n 30

# Replace <namespace> with the app namespace you are investigating.
kubectl get deploy,sts,ds,svc,ingress,pvc -n <namespace>
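
The read-only checks from step 4 of the checklist (describe, logs, events, and auth can-i) could be wrapped along these lines. This is a minimal sketch, not part of first-look.sh: the function names inspect_pod and can_i are hypothetical, and the tail length of 100 lines is an arbitrary assumption.

```shell
# Read-only follow-up checks for one pod. Nothing here changes cluster state.
# inspect_pod and can_i are illustrative names, not existing tools.

inspect_pod() {
  local ns="$1" pod="$2"
  if [ -z "$ns" ] || [ -z "$pod" ]; then
    echo "usage: inspect_pod <namespace> <pod>" >&2
    return 2
  fi

  # Spec, status, conditions, and recent events for the pod.
  kubectl describe pod "$pod" -n "$ns"

  # Current logs, then logs from the previous container instance.
  # The second call fails harmlessly when the pod has never restarted.
  kubectl logs "$pod" -n "$ns" --all-containers --tail=100
  kubectl logs "$pod" -n "$ns" --all-containers --tail=100 --previous || true

  # Events that reference this pod only, oldest first.
  kubectl get events -n "$ns" --field-selector "involvedObject.name=$pod" --sort-by=.lastTimestamp
}

# Ask the API server whether the current identity may perform a verb.
# kubectl prints "yes" or "no" and exits non-zero for "no".
can_i() {
  local ns="$1" verb="$2" resource="$3"
  kubectl auth can-i "$verb" "$resource" -n "$ns"
}
```

After sourcing the functions, a call such as `inspect_pod <namespace> <pod>` followed by `can_i <namespace> patch deployments` shows what is happening and whether you are even allowed to change it, before any change is attempted.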

Coverage

| Area | What It Covers | When You Need It |
|------|----------------|------------------|
| Linux | Daily commands, networking, performance, systemd & logs, storage & filesystems. | Node debugging, SSH sessions, system-level incident triage. |
| Python | Automation scripts, SRE utilities, API/JSON/YAML processing, subprocess & CLI wrapping. | Writing quick tools, health checks, or K8s automation without a full framework. |
| Core Kubernetes | Architecture, internals, worker nodes, pod lifecycle, namespaces, RBAC boundaries. | Explaining and debugging cluster behavior without guessing. |
| Workloads | Deployments, StatefulSets, DaemonSets, Jobs, ConfigMaps, Secrets, scheduling, autoscaling. | Most day-to-day app incidents and capacity decisions. |
| Networking | Services, Ingress, Gateway API, ExternalDNS, CoreDNS, CNIs, service mesh. | Network symptoms that cross team boundaries. |
| Security | RBAC, network policy, pod security, admission policy, image supply chain, secrets. | Strict client environments and audit requirements. |
| Operations | Incident response, SLOs, production readiness, capacity planning, upgrades, backup/DR. | On-call, maintenance windows, and pre-production reviews. |
| Monitoring | Prometheus, Grafana, Thanos, logging, distributed tracing, alerting design. | Observability setup and alert triage during incidents. |
| Cloud | AWS EKS/IAM/ECR, AKS, GKE, cost optimisation. | Cloud-specific tooling for managed K8s platforms. |
| Containers | Dockerfile patterns, runtime debugging (crictl/nsenter), image troubleshooting. | Build pipeline issues, ImagePullBackOff, and container runtime failures. |
| GitOps & CI/CD | ArgoCD, Flux, GitHub Actions, GitLab CI, Jenkins, Git reference. | Production changes flowing through automation and code review. |
| Delivery | Helm charts, Terraform for AWS/Azure/GCP. | Provisioning infrastructure and deploying applications reproducibly. |

Safety Defaults

  • Never run destructive commands from memory. Check the namespace, context, labels, and owner references first.
  • Prefer kubectl diff -f file.yaml before kubectl apply -f file.yaml when applying direct manifests.
  • In GitOps clusters, avoid manual drift unless it is an approved emergency action. Patch Git, Helm values, or Terraform instead.
  • For production incidents, capture evidence before changing state: events, logs, rollout history, pod specs, and relevant metrics.
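
The defaults above can be turned into small shell guards. The sketch below is one possible shape, under stated assumptions: the function names require_context, diff_then_apply, and capture_evidence are hypothetical, the evidence directory naming is invented for illustration, and metrics capture is left out because it depends on the client's monitoring stack. It also assumes kubectl diff uses its default diff program, where exit code 0 means no differences, 1 means differences were found, and anything higher is an error.

```shell
# Refuse to continue unless kubectl points at the context you expect.
require_context() {
  local expected="$1" current
  current="$(kubectl config current-context)"
  if [ "$current" != "$expected" ]; then
    echo "refusing: current context is '$current', expected '$expected'" >&2
    return 1
  fi
}

# Show the diff first, then apply only after an explicit "yes".
diff_then_apply() {
  local file="$1" rc=0 answer
  kubectl diff -f "$file" || rc=$?
  case "$rc" in
    0) echo "no changes to apply"; return 0 ;;
    1) ;;  # differences found, fall through to the prompt
    *) echo "kubectl diff failed (exit $rc)" >&2; return "$rc" ;;
  esac
  printf 'Apply %s? Type yes to continue: ' "$file"
  read -r answer
  if [ "$answer" = "yes" ]; then
    kubectl apply -f "$file"
  else
    echo "skipped"
  fi
}

# Save read-only evidence for one Deployment before changing state.
capture_evidence() {
  local ns="$1" deploy="$2" dir
  dir="evidence-$ns-$deploy-$(date +%Y%m%dT%H%M%S)"
  mkdir -p "$dir"
  kubectl get events -n "$ns" --sort-by=.lastTimestamp > "$dir/events.txt"
  kubectl rollout history "deployment/$deploy" -n "$ns" > "$dir/rollout-history.txt"
  kubectl get pods -n "$ns" -o yaml > "$dir/pods.yaml"
  # Logs come from one pod that kubectl picks for the Deployment.
  kubectl logs "deployment/$deploy" -n "$ns" --all-containers --tail=500 > "$dir/logs.txt"
  echo "$dir"
}
```

A typical sequence would be `require_context <context> && capture_evidence <namespace> <deployment>` followed by `diff_then_apply file.yaml`. In GitOps clusters the apply step would normally be replaced by a commit, so only the context guard and the evidence capture carry over.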