TL;DR

Troubleshoot from broad to narrow: API reachable, nodes Ready, system Pods healthy, target namespace events clean, Pod scheduled, containers started, app logs sane, resource pressure absent.

First Pass

bashfirst-pass.sh
kubectl cluster-info # Confirms the API server endpoint is reachable.
kubectl get nodes -o wide # Checks Ready status, versions, internal IPs, and node OS.
kubectl get pods -A --sort-by=.metadata.namespace # Finds failing system and app Pods.
kubectl get events -A --sort-by=.lastTimestamp | tail -50 # Recent warnings usually point at the failure.
kubectl top nodes # Requires metrics-server; shows CPU/memory pressure.
kubectl top pods -A --sort-by=memory # Finds noisy workloads.

Node Troubleshooting

bashnode-debug.sh
NODE=worker-1
kubectl describe node "$NODE" # Conditions, taints, allocatable resources, recent node events.
kubectl get pods -A --field-selector spec.nodeName="$NODE" # Workloads placed on this node.

# If you have node shell access:
sudo systemctl status kubelet --no-pager # kubelet health.
sudo journalctl -u kubelet -n 200 --no-pager # kubelet errors: CNI, image pull, cert, eviction.
sudo crictl ps -a # Containers known to the runtime.
sudo crictl logs <container-id> # Logs when kubectl logs is not enough.

Pod State Map

StatusLikely causeFirst check
PendingScheduling blocked, PVC pending, quota, node selector mismatch.kubectl describe pod
ContainerCreatingImage pull, CNI, CSI mount, Secret/ConfigMap missing.Events on Pod
CrashLoopBackOffApp exits after start, bad config, failed probe.Previous logs
ImagePullBackOffBad tag, auth, registry outage.Image name and pull secret
OOMKilledContainer exceeded memory limit.Limits, metrics, app memory

CrashLoopBackOff Deep Dive

SignalInterpretationMitigation
Restart count climbingProcess exits immediately or probe never passes.kubectl logs --previous for last crash reason.
probe failures in eventsLiveness/readiness path wrong.Align probe port/path with real listener.
exit code 137 / OOM scoreMemory spike at boot.Raise limit or defer heavy init.
panic stack in logsApp bug/config.Rollback image or fix env/Secret.
Crash only on subset of nodesArches, kernel feature, SELinux/AppArmor variance.Describe node labels + inspect security contexts.
bashcrashloop-debug.sh
POD=my-pod
NS=app
kubectl describe pod "$POD" -n "$NS"
kubectl logs "$POD" -n "$NS" --all-containers=true --tail=120
kubectl logs "$POD" -n "$NS" --all-containers=true --previous
kubectl get pod "$POD" -n "$NS" -o jsonpath='{.spec.containers[*].name}{"\n"}'

Node NotReady Playbook

EvidenceSubsystemEvidence command
Kubelet stopped posting statuskubelet/agent crash or disk-full.journalctl/kubelet svc on host.
CNI/network not readyCNI DaemonSet CrashLoop.kubectl logs -n kube-system -l k8s-app=calico-node pattern.
Runtime not readycontainerd/cri-o wedged.crictl info + systemd.
PIDPressure / DiskPressureHost saturated.df -h, inode usage, cgroup pids.max.
Unknown node heartbeatNetwork partition or API/etcd auth.Connectivity from node to apiserver LB.
bashnotready.sh
NODE=worker-1
kubectl describe node "$NODE"
kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name="$NODE"

OOM And Memory Pressure

SymptomLayerRemediation
Reason OOMKilledContainer crosses memory.limit.Raise realistic limit based on spikes; tune JVM/go env.
QoS evictionNode starving; Guaranteed last.Add nodes / drop noisy neighbors.
cgroup kills before limitMemoryBacked volumes or huge caches.inspect cat /sys/fs/cgroup/...memory.events on host.
metrics show low util but kills happenTransient spike shorter than scraping.Use eBPF/kubelet traces or profiling.
multiple pods restarted same nodeUnderlying hardware memory errors.ECC logs / BMC alerts.
bashoom-hunt.sh
kubectl describe pod crashing-pod -n app | grep -A3 -i oom
kubectl top pods -A --containers --sort-by=memory | tail -40

Image Pull Failures

Event fragmentUsually meansFix vector
401 UnauthorizedMissing/expired registry secret.Ensure imagePullSecrets referenced + IAM token TTL.
403 ForbiddenRepo IAM / network egress policy.Relax policy mirror image.
ImagePullBackOffBad tag/arch digests.Inspect manifest & node arch (kubectl debug node ...).
ErrImageNeverPullWrong imagePullPolicy on local builds.Use Never/IfNotPresent intentionally.
timeout / i/o deadlineFirewall or registry outage.circuit break reroute mirror.
bashimage-pull-debug.sh
kubectl describe pod my-pod -n app | grep -A8 Events
kubectl get secret registry-creds -n app -o yaml # compare type dockerconfigjson entries

Kubelet Eviction Matrix

ConditionSignalsOperational response
DiskPressurekubectl describe node; image GC thrash.Trim logs, enlarge root volume, move containerd dir.
MemoryPressureKernel reclaim + pod rank eviction.Relax limits / add RAM / drain hotspot.
PIDPressureRunaway fork bombs / short-lived Pods.Throttle workload; raise pids.limit carefully.
AllocatedTooManyPodsCheap small pods exhaust maxPods.Shard workloads or tweak kubelet flags (platform-specific).
Forced drain during incidentPreserves PDB-less chaos?Always pair with PDB and capacity math.
basheviction-tail.sh
kubectl get pods -n app -o json | jq '..|.reason?|select(.!=null)'
kubectl get events -n app --field-selector reason=Evicted --sort-by=.lastTimestamp

Deployments And Rollout Signals

When Pods look fine individually but Deployments stay degraded, widen the aperture to ReplicaSet ownership, PDB overlap, terminating Pods, or HPAs fighting manual scales.

ObservationInterpretationPivot command
Replicas < desiredInsufficient cluster capacity vs PDB clamp.kubectl describe pdb -n ...
Old ReplicaSet spikesRollback in progress.kubectl rollout history deploy/...
ProgressDeadlineExceededNew Pods never become Ready.Probe/logs on new RS template hash.
Termination stuckFinalizers/preStop hooks/long grace.--force delete only with policy clearance.
bashrollout-health.sh
kubectl describe deploy api -n app
kubectl rollout status deploy/api -n app --timeout=5m | cat
kubectl get rs -n app -o wide --show-labels
kubectl get hpa -n app

Quotas And Admission Blocks

Admission messageUnderlying objectRapid unblock
Forbidden: exceeded quotaResourceQuota on namespace CPU/mem/pods.Adjust quota or reschedule.
LimitRange min/max mismatchAdmission defaults rejecting manifest.Describe LimitRange objects.
Forbidden: forbidden: requests.storagePV/PVC quotas.Recycle PVCs/shrink StatefulSets temporarily.
Exceeded pod countcheap batch jobs.Garbage collect completed Pods.
bashquota-audit.sh
kubectl describe resourcequota -n app
kubectl describe limitrange -n app
kubectl get pods -n app --field-selector=status.phase=Failed

RBAC And ServiceAccounts

Permission errors bubble up from kube-apiserver audits as Forbidden messages on Pods or CSI sidecars mounting tokens.

bashrbac-pod.sh
kubectl auth can-i watch pods --as=system:serviceaccount:app:api-sa -n app
kubectl describe rolebinding -n app
kubectl get role,rolebinding -n app -o wide

PVCs Blocking Pod Start

Stateful workloads often wedge in ContainerCreating because CSI attach hangs or zoning mismatches propagate late.

Event hintInterpretationNarrow drill
waiting for PersistentVolumeClaimProvisioning backlog.kubectl describe pvc.
Multi-Attach errorRWO reused across nodes.Volumes Attachment objects.
permission denied mountingSecrets for cloud provider missing RBAC.Controller logs CSI.
Timed out waiting for volumesAZ drift topologies.storage.Align node+PVM zone annotations.
bashpvc-stuck.sh
kubectl get pvc,pv -n data
kubectl get volumeattachment

Container Output Streams

bashlogs.sh
kubectl logs deploy/web-api -n app --tail=100 # Logs from one current Pod behind the Deployment.
kubectl logs deploy/web-api -n app --all-containers=true --tail=100 # Include sidecars.
kubectl logs pod/web-api-abc123 -n app -c api --previous # Previous crashed container logs.
kubectl logs -n app -l app=web-api --since=30m --prefix=true # Aggregate by label for recent logs.
kubectl describe pod web-api-abc123 -n app # Events explain scheduling, pull, probe, and mount failures.

Control Plane Components

bashcontrol-plane.sh
kubectl get pods -n kube-system # Managed clusters expose component Pods differently.
kubectl get componentstatuses # Deprecated, but may exist in older clusters.
kubectl logs -n kube-system kube-apiserver-control-plane --tail=100 # kubeadm static Pod example.
kubectl logs -n kube-system kube-controller-manager-control-plane --tail=100
kubectl logs -n kube-system kube-scheduler-control-plane --tail=100
kubectl get --raw='/readyz?verbose' # API server readiness details if your RBAC allows it.

CronJob Saturation Signals

Batch bursts masquerade as cluster-wide slowness when API priority for Job creation competes with Service discovery watches.

Leading indicatorLikely cause
Pod count nearing namespace quota nightlyCronJobs without history limits pinned.
APIServer spikes at minute zeroThousands of deterministic schedules stacked.
Stateful pipeline delaysBackoffLimit exhaustion thrashing etcd.
Garbage collector backlogCompleted Pods never TTL-deleted.
  • Trim successfulJobsHistoryLimit/failedJobsHistoryLimit proactively.
  • Prefer ttlSecondsAfterFinished on Jobs when controllers support migration.
  • Shard noisy tenants into namespaces with independent quotas.
  • Alert when kubectl get jobs -n batch --field-selector status.successful=0 grows unbounded.
  • Throttle CI-driven Job storms when GitOps emits duplicate stamped manifests.
bashjobs-audit.sh
kubectl get cronjobs -A
kubectl get jobs -A --sort-by=.status.startTime | tail

Ephemeral Debugging

Use kubectl debug profiles when distroless images block shells or when copying binaries is unsafe.

ProfileWhenNote
generalInteractive shell beside target.Shares namespaces; beware side effects.
baselineCloser to distroless hardened.Minimal tooling.
restrictedHighly locked clusters.Needs PSA/compliance approvals.
copy-to-debugCopy Pod spec for cloning.Preserves env for reproducibility.
bashephemeral.sh
kubectl debug pod/api-XXXX -n app -it --image=busybox --target=api
kubectl debug node/worker-1 -it --image=ubuntu -- chroot /host bash

Incident Evidence Ordering

Preserve ordering so post-incident reviewers can correlate operator actions vs automated controllers.

Minute offsetArtifactCaptures
t-10kubectl describe node/pod tarballsteady-state snapshot.
t0Alerts + Grafana snapshot linkslatency/memory graphs.
t+5API audit slice if availableFlood of deletes/evictions?
t+30Workload controller logs aggregatedStatefulSet churn vs ReplicaSet churn.
close-outCHG/ticket linkage + rollback manifestsDemonstrate guardrails rebuilt.
bashbundle.sh
kubectl cluster-info dump --namespaces kube-system,app --output-directory=/tmp/k8s-dump
tar czf bundle.tgz /tmp/k8s-dump

API Server Stress Signals

SymptomWatch pathMitigate
dial tcp i/o timeoutapiserver overloaded or LB flapping.Kill expensive kubectl logs -f sessions; widen rate limits thoughtfully.
/etcdserver: timeoutCompaction/rebuild lag.Raise infra ticket; throttle writes temporarily.
Admission webhook latency highWebhook deployment scaling.kubectl logs deploy/webhook-validator pattern.
Repeated 429 from kubeletNode storm listing huge objects.List/watch scopes + reduce cardinally.
TLS handshake resetsStale kubeconfig / MTU MSS issues.Renew kubeconfig certs or adjust network path.

Observability should watch APF priority levels and etcd disk latency at the platform layer; escalate when tail latency climbs before hard timeouts appear at clients.

Event Mining Cheat Sheet

Namespaces can emit thousands of events; scope by object, severity, reason, or time window before grepping arbitrarily.

bashevent-queries.sh
# Recent warnings scoped to workloads that match a label selector.
kubectl get events -A \
  --field-selector type=Warning \
  --sort-by=.lastTimestamp | tail -n 80

# Follow a single object's timeline (replace kind/name/ns).
kubectl get events --namespace app \
  --field-selector involvedObject.kind=Pod,involvedObject.name=api-aaaa -w

# Group reasons to see dominating failure modes quickly.
kubectl get events -A --sort-by=.lastTimestamp -o yaml | grep ' reason:' | sort | uniq -c | tail

# Correlate DaemonSet turbulence with node maintenance.
kubectl get events -n kube-system --sort-by=.lastTimestamp | grep -i -E 'cni|cilium|calico|flannel|kube-proxy'

sysctl And Runtime Guards

ObservationInterpretationMitigating experiment
PodSecurity warningsPrivileged hooks missing CAPs.Relax baseline profile only under CAB.
too many open filesUlimit inside container low.Increase ulimit -n via workload / init.
transparent hugepage warnings DBlatency jitter on mmap heavy apps.Tune node sysctl DaemonSet deliberately.
SELinux AVC denialsHost policy rejecting volume mounts.Use audit2allow path or label volumes.
cgroups v1 vs v2 mismatchOlder agents assume v1 hierarchies.Bake agent matrix into node image changelog.
PID cgroup throttling batchThousands of kubectl exec sessions.Throttle automation using shared session pools.
inode exhaustionCrashLoop churn writing tiny files.Use tmpfs quotas + pruning.
hardware clock skewJWT issued in future failures.Fix NTP on control plane boundary.

Runtime-level findings should funnel back into image hardening backlog items so application teams stop papering gaps with permissive SCC/PodSecurity exceptions.

Exit code bucketUsually
1Unhandled exception or missing dependency.
137SIGKILL (OOM watchdog or kubelet kill).
143SIGTERM after exceeding grace budgets.
139SIGSEGV; native libs or cgroup limits.
126/127Shebang missing or interpreter absent in container.
Crash before PID1/bin/sh missing due to distroless drift.

Fragmentation And MTU

Symmetric connectivity failures that only reproduce on certain paths (kubectl exec works but Istio egress fails, or GRE tunnels to SD-WAN) commonly trace to MSS clamp drift after NIC driver upgrades.

Test vectorPassesFails interpretation
ICMP large ping from netshoot PodICMP okPrefer TCP MSS tests next.
curl --interface forcing Pod IP vs Service IPBoth passProbably app-level failure.
Tracing shows black hole after ~1400 byte payloadn/aClamp TCP MSS on CNI DaemonSet overlay.
Cross-AZ hops only failingn/aMTU mismatches regional jumbo configs.
MetalLB BGP session flapsHold timer adjustmentsBGP communities blackholing prematurely.
bashmtu_probe.sh
kubectl run mtu-probe --rm -it --image=nicolaka/netshoot -- \
  ping -s 8972 -c3 172.31.255.253 # ICMP payload probing for PMTUD black holes

Pager Handoff Anchors

Leaving the bridge for the next responder should reuse the same breadcrumbs this page drills on so knowledge does not dissipate.

  1. Current failing namespace list with blast radius annotated.
  2. Whether cordon/drain/eviction tooling was already exercised.
  3. Whether GitOps reconcile is paused and who authorized it.
  4. Most recent infra change correlated with outage clock.
  5. Links to Grafana/Loki queries already scoped.
  6. Outstanding risky mitigations awaiting CAB.
  7. List of flaky nodes even if Pods already evicted.
  8. Customer-visible SLO deltas if applicable.
  9. Known missing observability gaps discovered mid-incident.
  10. Follow-up bugs filed or pending creation.

Short acknowledgements (“we tried X”) prevent thrash loops for on-call rotations.

Pin links to dashboards instead of exporting PNGs whenever possible.

Prefer UTC timestamps aligned with Grafana panels when narrating timelines.

Normalize language: “hypothesis” vs “confirmed root cause” avoids mis-read postmortems.