Purpose

This site is a practical Kubernetes and platform SRE reference: short explanations, commands you can copy, annotated YAML, architecture diagrams, troubleshooting checks, and client-ready implementation examples.

Start Here

Use this site like an operations notebook. Open the page for the thing you are touching, skim the mental model, copy the safest command or manifest pattern, then validate with the troubleshooting section before making changes in a client cluster.

Client-Side Workflow

If you are still learning the topic, start with the Learning Path & Field Workflow. If you are already in the middle of client work, use this shorter checklist.

  1. Identify the layer: workload, network, storage, security, control plane, or CI/CD.
  2. Read the page TL;DR: confirm the object type and expected behavior.
  3. Copy commands carefully: replace placeholders like <namespace>, <pod>, and <release>.
  4. Prefer read-only checks first: use get, describe, logs, events, and auth can-i before applying changes.
  5. Apply changes through the client process: GitOps, Helm, Terraform, or change ticket depending on the environment.
bash first-look.sh
# Confirm which cluster and namespace your kubectl context is pointing at.
# An empty namespace line means the context has none set, so kubectl uses "default".
kubectl config current-context
kubectl config view --minify --output 'jsonpath={..namespace}'; echo

# Quick cluster health. Use this before changing anything.
kubectl get nodes -o wide
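# Excluding Succeeded keeps completed Job pods out of the list.
# Crash-looping pods usually still report phase Running, so also scan the STATUS column of a plain "get pods".
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded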
kubectl get events --all-namespaces --sort-by=.lastTimestamp | tail -n 30

# Replace <namespace> with the app namespace you are investigating.
kubectl get deploy,sts,ds,svc,ingress,pvc -n <namespace>
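
Steps 3 and 4 of the checklist above can be sketched as a small guard plus a set of read-only checks. This is a minimal sketch, not a site-provided tool: the function names has_placeholder and inspect_pod are hypothetical, and the pod and namespace values you pass in are your own.

```shell
# Sketch only: has_placeholder and inspect_pod are hypothetical helper names.

# Return 0 if any argument still contains an unreplaced <placeholder>.
has_placeholder() {
  local arg
  for arg in "$@"; do
    case "$arg" in
      *'<'*'>'*) return 0 ;;
    esac
  done
  return 1
}

# Run read-only checks against one pod. Nothing here changes cluster state.
inspect_pod() {
  local ns="$1" pod="$2"
  if has_placeholder "$ns" "$pod"; then
    echo "refusing to run: replace <namespace> and <pod> first" >&2
    return 1
  fi
  kubectl get pod "$pod" -n "$ns" -o wide
  kubectl describe pod "$pod" -n "$ns"
  kubectl logs "$pod" -n "$ns" --all-containers=true --tail=50
  kubectl get events -n "$ns" --sort-by=.lastTimestamp
  # Ask whether your current identity could patch Deployments in this
  # namespace. This only checks permissions; it does not patch anything.
  kubectl auth can-i patch deployments -n "$ns"
}
```

After sourcing the functions, a call looks like inspect_pod payments api-0, where payments and api-0 are made-up example names. The guard exists because a pasted command with a literal <namespace> otherwise fails with a confusing error, or the shell treats the angle brackets as redirection.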

Coverage Plan

| Area | What It Will Cover | Why It Matters On Contract |
| --- | --- | --- |
| Core | Architecture, internals, worker nodes, pod lifecycle, namespaces, RBAC boundaries. | Helps you explain and debug cluster behavior without guessing. |
| Workloads | Deployments, StatefulSets, DaemonSets, Jobs, ConfigMaps, Secrets, scheduling. | Most day-to-day app incidents are here. |
| Networking | Services, ingress, DNS, CNIs, service mesh, load balancers. | Network symptoms are common and usually cross team boundaries. |
| Security | RBAC, network policy, pod security, certificates, secret management. | Client environments often have strict controls and audit needs. |
| Delivery | Helm, ArgoCD, CircleCI, generic CI/CD, Terraform. | Production changes usually flow through code and automation. |
| Operations | Monitoring, draining, upgrades, operators, hardware maintenance, troubleshooting. | This is the incident and maintenance toolkit. |

Safety Defaults

  • Never run destructive commands from memory. Check the namespace, context, labels, and owner references first.
  • Prefer kubectl diff -f file.yaml before kubectl apply -f file.yaml when applying direct manifests.
  • In GitOps clusters, avoid manual drift unless it is an approved emergency action. Patch Git, Helm values, or Terraform instead.
  • For production incidents, capture evidence before changing state: events, logs, rollout history, pod specs, and relevant metrics.
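
Two of the defaults above, diff before apply and capture evidence before changing state, can be sketched as shell helpers. This is an illustrative sketch under stated assumptions: diff_then_apply and capture_evidence are hypothetical names, the evidence directory layout is made up, and the workload is assumed to be a Deployment. It relies on the documented kubectl diff exit codes (0 for no differences, 1 for differences found, greater than 1 for an error).

```shell
# Sketch only: both function names and the output layout are hypothetical.

# Show the diff for a manifest, then ask for confirmation before applying.
diff_then_apply() {
  local file="$1" rc=0 answer
  kubectl diff -f "$file" || rc=$?
  case "$rc" in
    0) echo "no changes to apply"; return 0 ;;
    1) ;;  # differences found: fall through to the prompt
    *) echo "kubectl diff failed (exit $rc)" >&2; return "$rc" ;;
  esac
  printf 'Apply %s to context %s? [y/N] ' "$file" "$(kubectl config current-context)"
  read -r answer || answer=n
  [ "$answer" = "y" ] || { echo "aborted"; return 1; }
  kubectl apply -f "$file"
}

# Save events, rollout history, pod specs, and recent logs for one
# Deployment into a timestamped directory, then print the directory name.
capture_evidence() {
  local ns="$1" deploy="$2" out
  out="evidence-$ns-$(date -u +%Y%m%dT%H%M%SZ)"
  mkdir -p "$out"
  kubectl get events -n "$ns" --sort-by=.lastTimestamp > "$out/events.txt"
  kubectl rollout history "deployment/$deploy" -n "$ns" > "$out/rollout-history.txt"
  kubectl get pods -n "$ns" -o yaml > "$out/pods.yaml"
  # Logs from one pod chosen by kubectl for this Deployment, last 500 lines.
  kubectl logs "deployment/$deploy" -n "$ns" --all-containers=true --tail=500 > "$out/logs.txt"
  # kubectl top needs metrics-server; tolerate its absence.
  kubectl top pods -n "$ns" > "$out/top-pods.txt" 2>&1 || true
  echo "$out"
}
```

In a GitOps cluster the diff_then_apply helper does not replace the client process; it is only relevant where direct manifests are an approved path. The evidence files are plain text so they can be attached to a change ticket or incident record.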