Cluster, Nodes & Apps Troubleshooting

TL;DR

Troubleshoot from broad to narrow: API reachable, nodes Ready, system Pods healthy, target namespace events clean, Pod scheduled, containers started, app logs sane, resource pressure absent.

First Pass

bash: first-pass.sh

kubectl cluster-info # Confirms the API server endpoint is reachable.
kubectl get nodes -o wide # Checks Ready status, versions, internal IPs, and node OS.
kubectl get pods -A --sort-by=.metadata.namespace # Lists every Pod grouped by namespace; scan STATUS and RESTARTS for failures.
kubectl get events -A --sort-by=.lastTimestamp | tail -50 # Recent warnings usually point at the failure.
kubectl top nodes # Requires metrics-server; shows CPU/memory pressure.
kubectl top pods -A --sort-by=memory # Finds noisy workloads.
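
On a busy cluster the full Pod listing is long, so it can help to filter it down to rows that look unhealthy. The sketch below is one possible filter, not part of the original page: `filter_unhealthy_pods` is a hypothetical helper name, and it assumes the default column order of `kubectl get pods -A --no-headers` (NAMESPACE, NAME, READY, STATUS).

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get pods -A --no-headers` rows on stdin
# and prints only Pods that are neither Completed nor fully Ready and Running.
filter_unhealthy_pods() {
  awk '{
    split($3, ready, "/")   # READY column looks like "1/2"
    healthy = ($4 == "Completed") || ($4 == "Running" && ready[1] == ready[2])
    if (!healthy) print $1 "/" $2, $4, $3
  }'
}

# Intended use against a live cluster (not executed here):
#   kubectl get pods -A --no-headers | filter_unhealthy_pods
```

The filter only reads text, so it can be tried against a saved listing before pointing it at a cluster.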

Node Troubleshooting

bash: node-debug.sh

NODE=worker-1
kubectl describe node "$NODE" # Conditions, taints, allocatable resources, recent node events.
kubectl get pods -A --field-selector spec.nodeName="$NODE" # Workloads placed on this node.

# If you have node shell access:
sudo systemctl status kubelet --no-pager # kubelet health.
sudo journalctl -u kubelet -n 200 --no-pager # kubelet errors: CNI, image pull, cert, eviction.
sudo crictl ps -a # Containers known to the runtime.
sudo crictl logs <container-id> # Logs when kubectl logs is not enough.

Pod State Map

Status	Likely cause	First check
Pending	Scheduling blocked, PVC pending, quota, node selector mismatch.	`kubectl describe pod`
ContainerCreating	Image pull, CNI, CSI mount, Secret/ConfigMap missing.	Events on Pod
CrashLoopBackOff	App exits after start, bad config, failed liveness probe.	Previous logs (`kubectl logs --previous`)
ImagePullBackOff	Bad tag, auth, registry outage.	Image name and pull secret
OOMKilled	Container exceeded memory limit.	Limits, metrics, app memory

CrashLoopBackOff Deep Dive

Signal	Interpretation	Mitigation
Restart count climbing	Process exits immediately or probe never passes.	`kubectl logs --previous` for last crash reason.
Probe failures in events	Liveness or startup probe path/port wrong (readiness failures alone do not restart containers).	Align probe port/path with the real listener.
Exit code 137 / `OOMKilled`	Memory spike at boot.	Raise limit or defer heavy init.
Panic stack in logs	App bug/config.	Roll back the image or fix env/Secret.
Crash only on a subset of nodes	CPU architecture, kernel feature, or SELinux/AppArmor variance.	Compare node labels and inspect security contexts.

bash: crashloop-debug.sh

POD=my-pod
NS=app
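kubectl describe pod "$POD" -n "$NS" # Events, last state, exit code, probe config.
kubectl logs "$POD" -n "$NS" --all-containers=true --tail=120 # Output of the current attempt.
kubectl logs "$POD" -n "$NS" --all-containers=true --previous # Output of the attempt that crashed.
kubectl get pod "$POD" -n "$NS" -o jsonpath='{.spec.containers[*].name}{"\n"}' # Container names to use with -c.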

Node NotReady Playbook

Evidence	Likely subsystem	Where to look
`Kubelet stopped posting node status`	kubelet/agent crash or full disk.	`journalctl -u kubelet` and `systemctl status kubelet` on the host.
`CNI/network not ready`	CNI DaemonSet in CrashLoopBackOff.	`kubectl logs -n kube-system -l k8s-app=calico-node` pattern (adjust the label for your CNI).
`Runtime not ready`	containerd/CRI-O wedged.	`crictl info` plus the runtime's systemd unit.
`PIDPressure` / `DiskPressure`	Host saturated.	`df -h`, inode usage (`df -i`), cgroup `pids.max`.
Node status Unknown (missed heartbeats)	Network partition or API/etcd auth failure.	Connectivity from the node to the apiserver LB.

bash: notready.sh

NODE=worker-1
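kubectl describe node "$NODE" # Conditions and the message attached to NotReady.
kubectl get events -A --field-selector involvedObject.kind=Node,involvedObject.name="$NODE" # -A because Node events are not recorded in your current namespace.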

OOM And Memory Pressure

Symptom	Layer	Remediation
Reason `OOMKilled`	Container crosses `resources.limits.memory`.	Set a realistic limit based on observed spikes; tune runtime memory settings (for example JVM heap flags or Go's `GOMEMLIMIT`).
QoS eviction	Node starved of memory; BestEffort and Burstable Pods go first, Guaranteed last.	Add nodes or move noisy neighbors.
cgroup kills before limit	Memory-backed (tmpfs) volumes or huge caches counted against the cgroup.	Inspect `cat /sys/fs/cgroup/...memory.events` on the host.
Metrics show low utilization but kills happen	Transient spike shorter than the scrape interval.	Use eBPF/kubelet traces or profiling.
Multiple Pods restarted on the same node	Underlying hardware memory errors.	Check ECC logs and BMC alerts.

bash: oom-hunt.sh

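kubectl describe pod crashing-pod -n app | grep -A3 -i oom # Last state, reason, and exit code around the OOM.
kubectl top pods -A --containers --sort-by=memory | head -40 # Output is sorted highest first, so head shows the top consumers.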
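
To find every container whose last termination was an OOM kill, rather than grepping one Pod at a time, a jsonpath listing can be piped through a small filter. This is a sketch under assumptions: `only_oomkilled` is a hypothetical helper, and the row format (namespace, Pod, then `container=reason;` pairs, tab-separated) is defined by the jsonpath expression shown in the comment.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads tab-separated rows of the form
#   <namespace>\t<pod>\t<container>=<lastTerminationReason>;...
# and prints "<namespace>/<pod> <container>" for each OOMKilled container.
only_oomkilled() {
  awk -F'\t' '{
    n = split($3, pairs, ";")
    for (i = 1; i <= n; i++) {
      split(pairs[i], kv, "=")
      if (kv[2] == "OOMKilled") print $1 "/" $2, kv[1]
    }
  }'
}

# Intended use against a live cluster (not executed here):
#   kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{range .status.containerStatuses[*]}{.name}{"="}{.lastState.terminated.reason}{";"}{end}{"\n"}{end}' | only_oomkilled
```

Only the last termination is recorded in `lastState`, so a container that has since crashed for another reason will not appear.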

Image Pull Failures

Event fragment	Usually means	Fix vector
`401 Unauthorized`	Missing or expired registry secret.	Ensure `imagePullSecrets` is referenced and check the IAM token TTL.
`403 Forbidden`	Repository IAM or network egress policy.	Relax the policy or mirror the image.
`ImagePullBackOff`	Bad tag, or no image for the node's architecture.	Inspect the manifest and node arch (`kubectl debug node ...`).
`ErrImageNeverPull`	`imagePullPolicy: Never` but the image is not on the node.	Preload the image, or use `Never`/`IfNotPresent` only intentionally.
Timeout / i/o deadline	Firewall or registry outage.	Fail over to a registry mirror.

bash: image-pull-debug.sh

kubectl describe pod my-pod -n app | grep -A8 Events
kubectl get secret registry-creds -n app -o yaml # Confirm the type is kubernetes.io/dockerconfigjson and the entries match the registry host.

Kubelet Eviction Matrix

Condition	Signals	Operational response
DiskPressure	Condition in `kubectl describe node`; image GC thrash.	Trim logs, enlarge the root volume, or move the containerd directory.
MemoryPressure	Kernel reclaim plus ranked Pod eviction.	Right-size requests and limits, add RAM, or drain the hotspot node.
PIDPressure	Runaway fork bombs or many short-lived Pods.	Throttle the workload; raise `pids.limit` carefully.
AllocatedTooManyPods	Many small Pods exhaust the kubelet `maxPods` limit.	Shard workloads or tweak kubelet flags (platform-specific).
Forced drain during incident	Workloads without a PDB can lose every replica at once.	Always pair drains with PDBs and capacity math.

bash: eviction-tail.sh

kubectl get pods -n app -o json | jq '..|.reason?|select(.!=null)'
kubectl get events -n app --field-selector reason=Evicted --sort-by=.lastTimestamp

Deployments And Rollout Signals

When Pods look fine individually but Deployments stay degraded, widen the aperture to ReplicaSet ownership, PDB overlap, terminating Pods, or HPAs fighting manual scales.

Observation	Interpretation	Pivot command
Replicas < desired	Insufficient cluster capacity, or a PDB blocking evictions.	`kubectl describe pdb -n ...`
Old ReplicaSet scales back up	Rollback in progress.	`kubectl rollout history deploy/...`
ProgressDeadlineExceeded	New Pods never become Ready.	Probes and logs on Pods of the new ReplicaSet (match `pod-template-hash`).
Termination stuck	Finalizers, preStop hooks, or a long grace period.	`kubectl delete pod --force --grace-period=0` only with policy clearance.

bash: rollout-health.sh

kubectl describe deploy api -n app
kubectl rollout status deploy/api -n app --timeout=5m # Exits non-zero if the rollout does not finish in time.
kubectl get rs -n app -o wide --show-labels
kubectl get hpa -n app
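
To spot the "Replicas < desired" case across many Deployments at once, the READY column can be compared mechanically. A minimal sketch, assuming the default column order of `kubectl get deploy -A --no-headers` (NAMESPACE, NAME, READY); `degraded_deployments` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get deploy -A --no-headers` rows on stdin
# and prints Deployments whose ready replica count differs from the desired count.
degraded_deployments() {
  awk '{
    split($3, r, "/")   # READY column looks like "2/3"
    if (r[1] != r[2]) print $1 "/" $2, "ready=" r[1], "desired=" r[2]
  }'
}

# Intended use against a live cluster (not executed here):
#   kubectl get deploy -A --no-headers | degraded_deployments
```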

Quotas And Admission Blocks

Admission message	Underlying object	Rapid unblock
Forbidden: exceeded quota	ResourceQuota on namespace CPU/memory/Pods.	Adjust the quota or reschedule.
LimitRange min/max mismatch	LimitRange admission rejecting the manifest.	Describe LimitRange objects.
Forbidden: exceeded quota (`requests.storage`)	Storage quota on PVCs.	Recycle PVCs or shrink StatefulSets temporarily.
Exceeded Pod count	Many short-lived batch Jobs leaving Pods behind.	Garbage-collect completed Pods.

bash: quota-audit.sh

kubectl describe resourcequota -n app
kubectl describe limitrange -n app
kubectl get pods -n app --field-selector=status.phase=Failed

RBAC And ServiceAccounts

Permission errors surface as Forbidden responses from kube-apiserver (also visible in its audit logs) when Pods or CSI sidecars call the API with their mounted ServiceAccount tokens.

bash: rbac-pod.sh

kubectl auth can-i watch pods --as=system:serviceaccount:app:api-sa -n app
kubectl describe rolebinding -n app
kubectl get role,rolebinding -n app -o wide

PVCs Blocking Pod Start

Stateful workloads often wedge in ContainerCreating because a CSI attach hangs or a zone mismatch between node and volume only surfaces late.

Event hint	Interpretation	Narrow drill
waiting for PersistentVolumeClaim	Provisioning backlog.	`kubectl describe pvc`.
Multi-Attach error	RWO volume reused across nodes.	`VolumeAttachment` objects.
permission denied mounting	Cloud provider Secrets or RBAC missing.	CSI controller logs.
Timed out waiting for volumes	Availability-zone drift between node and volume topology.	Align node and PV zone labels.

bash: pvc-stuck.sh

kubectl get pvc,pv -n data
kubectl get volumeattachment

Container Output Streams

bash: logs.sh

kubectl logs deploy/web-api -n app --tail=100 # Logs from one current Pod behind the Deployment.
kubectl logs deploy/web-api -n app --all-containers=true --tail=100 # Include sidecars.
kubectl logs pod/web-api-abc123 -n app -c api --previous # Previous crashed container logs.
kubectl logs -n app -l app=web-api --since=30m --prefix=true # Aggregate by label; with -l only the last 10 lines per Pod are shown unless --tail is set.
kubectl describe pod web-api-abc123 -n app # Events explain scheduling, pull, probe, and mount failures.

Control Plane Components

bash: control-plane.sh

kubectl get pods -n kube-system # Managed clusters expose component Pods differently.
kubectl get componentstatuses # Deprecated, but may exist in older clusters.
kubectl logs -n kube-system kube-apiserver-control-plane --tail=100 # kubeadm static Pod example; the suffix is the control-plane node name.
kubectl logs -n kube-system kube-controller-manager-control-plane --tail=100
kubectl logs -n kube-system kube-scheduler-control-plane --tail=100
kubectl get --raw='/readyz?verbose' # API server readiness details if your RBAC allows it.

CronJob Saturation Signals

Batch bursts masquerade as cluster-wide slowness when API priority for Job creation competes with Service discovery watches.

Leading indicator	Likely cause
Pod count nearing namespace quota nightly	CronJobs without history limits pinned.
APIServer spikes at minute zero	Thousands of deterministic schedules stacked.
Stateful pipeline delays	BackoffLimit exhaustion thrashing etcd.
Garbage collector backlog	Completed Pods never TTL-deleted.

Trim successfulJobsHistoryLimit/failedJobsHistoryLimit proactively.
Prefer ttlSecondsAfterFinished on Jobs when controllers support migration.
Shard noisy tenants into namespaces with independent quotas.
Alert when the count from `kubectl get jobs -n batch --field-selector status.successful=0` grows unbounded.
Throttle CI-driven Job storms when GitOps emits duplicate stamped manifests.
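
The first two recommendations above can be applied to an existing CronJob with a merge patch. The sketch below only builds the patch body; `cronjob_retention_patch` is a hypothetical helper, its default values (1, 3, 3600) are illustrative rather than recommendations, and `nightly-report` in the usage comment is a made-up CronJob name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints a JSON merge patch that pins the CronJob history
# limits and sets a TTL on the Jobs it creates.
# Args: successful-history-limit, failed-history-limit, ttl-seconds.
cronjob_retention_patch() {
  local keep_ok=${1:-1} keep_failed=${2:-3} ttl=${3:-3600}
  printf '{"spec":{"successfulJobsHistoryLimit":%d,"failedJobsHistoryLimit":%d,"jobTemplate":{"spec":{"ttlSecondsAfterFinished":%d}}}}\n' \
    "$keep_ok" "$keep_failed" "$ttl"
}

# Intended use against a live cluster (not executed here):
#   kubectl patch cronjob nightly-report -n batch --type=merge \
#     -p "$(cronjob_retention_patch 1 3 3600)"
```

If the CronJob is managed by GitOps, the same fields belong in the source manifest, since a live patch would be reverted on the next reconcile.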

bash: jobs-audit.sh

kubectl get cronjobs -A
kubectl get jobs -A --sort-by=.status.startTime | tail

Ephemeral Debugging

Use kubectl debug profiles when distroless images block shells or when copying binaries is unsafe.

Profile	When	Note
general	Interactive shell beside target.	Shares namespaces; beware side effects.
baseline	Closer to distroless hardened.	Minimal tooling.
restricted	Highly locked clusters.	Needs PSA/compliance approvals.
`--copy-to` (a flag, not a profile)	Clone the Pod spec into a debug copy.	Preserves env for reproducibility.

bash: ephemeral.sh

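kubectl debug pod/api-XXXX -n app -it --image=busybox --target=api # Ephemeral container that targets the api container's process namespace.
kubectl debug node/worker-1 -it --image=ubuntu -- chroot /host bash # The node's root filesystem is mounted at /host in the debug Pod.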

Incident Evidence Ordering

Preserve ordering so post-incident reviewers can correlate operator actions vs automated controllers.

Minute offset	Artifact	Captures
t-10	`kubectl describe node/pod` tarball	steady-state snapshot.
t0	Alerts + Grafana snapshot links	latency/memory graphs.
t+5	API audit slice if available	Flood of deletes/evictions?
t+30	Workload controller logs aggregated	StatefulSet churn vs ReplicaSet churn.
close-out	CHG/ticket linkage + rollback manifests	Demonstrate guardrails rebuilt.

bash: bundle.sh

kubectl cluster-info dump --namespaces kube-system,app --output-directory=/tmp/k8s-dump
tar czf bundle.tgz -C /tmp k8s-dump # Archive with relative paths; dumped logs may contain sensitive data.
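
Because reviewers correlate artifacts by time, it may help to stamp each bundle with a UTC timestamp instead of reusing a fixed file name. A minimal sketch; `bundle_name` and the `k8s-bundle-` prefix are hypothetical naming choices, not an established convention.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints a bundle file name carrying a UTC timestamp,
# e.g. k8s-bundle-<scope>-<YYYYMMDD>T<HHMMSS>Z.tgz
bundle_name() {
  local scope=${1:-cluster}
  printf 'k8s-bundle-%s-%s.tgz\n' "$scope" "$(date -u +%Y%m%dT%H%M%SZ)"
}

# Intended use (not executed here):
#   tar czf "$(bundle_name app)" -C /tmp k8s-dump
```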

API Server Stress Signals

Symptom	Likely cause	Mitigate
`dial tcp i/o timeout`	apiserver overloaded or LB flapping.	Kill expensive `kubectl logs -f` sessions; widen rate limits thoughtfully.
`etcdserver: request timed out`	Compaction/rebuild lag.	Raise an infra ticket; throttle writes temporarily.
Admission webhook latency high	Webhook Deployment under-scaled.	`kubectl logs deploy/webhook-validator` pattern.
Repeated 429 from kubelet	Node storm listing huge objects.	Narrow list/watch scopes and reduce cardinality.
TLS handshake resets	Stale kubeconfig or MTU/MSS issues.	Renew kubeconfig certs or adjust the network path.

Observability should watch API Priority and Fairness (APF) priority levels and etcd disk latency at the platform layer; escalate when tail latency climbs, before hard timeouts appear at clients.
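
Where no dashboard exists yet, the API server's own metrics endpoint gives a rough view of which priority levels are shedding load. The sketch below sums the `apiserver_flowcontrol_rejected_requests_total` counter per priority level from Prometheus text on stdin; `apf_rejections_by_level` is a hypothetical helper, and reading `/metrics` assumes your RBAC allows it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads Prometheus text exposition on stdin and prints
# "<priority_level> <total rejected requests>", one line per level, sorted by name.
apf_rejections_by_level() {
  awk '/^apiserver_flowcontrol_rejected_requests_total\{/ {
    if (match($0, /priority_level="[^"]*"/)) {
      # Skip the 16 characters of priority_level=" and drop the closing quote.
      level = substr($0, RSTART + 16, RLENGTH - 17)
      total[level] += $NF
    }
  }
  END { for (l in total) print l, total[l] }' | sort
}

# Intended use against a live cluster (not executed here):
#   kubectl get --raw /metrics | apf_rejections_by_level
```

The counters are cumulative since API server start, so compare two samples taken a few minutes apart rather than reading one value in isolation.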

Event Mining Cheat Sheet

Namespaces can emit thousands of events; scope by object, severity, reason, or time window before grepping arbitrarily.

bash: event-queries.sh

# Recent Warning events across all namespaces.
kubectl get events -A \
  --field-selector type=Warning \
  --sort-by=.lastTimestamp | tail -n 80

# Follow a single object's timeline (replace kind/name/ns).
kubectl get events --namespace app \
  --field-selector involvedObject.kind=Pod,involvedObject.name=api-aaaa -w

# Group reasons to see dominating failure modes quickly.
kubectl get events -A -o jsonpath='{range .items[*]}{.reason}{"\n"}{end}' | sort | uniq -c | sort -rn | head

# Correlate DaemonSet turbulence with node maintenance.
kubectl get events -n kube-system --sort-by=.lastTimestamp | grep -i -E 'cni|cilium|calico|flannel|kube-proxy'

sysctl And Runtime Guards

Observation	Interpretation	Mitigating experiment
PodSecurity warnings	Privileged hooks missing capabilities.	Relax the baseline profile only under CAB approval.
`too many open files`	File-descriptor ulimit inside the container is low.	Increase `ulimit -n` via the workload or an init step.
Transparent hugepage warnings (databases)	Latency jitter on mmap-heavy apps.	Tune node settings via a sysctl DaemonSet, deliberately.
SELinux AVC denials	Host policy rejecting volume mounts.	Use the audit2allow path or label volumes.
cgroups v1 vs v2 mismatch	Older agents assume v1 hierarchies.	Bake the agent compatibility matrix into the node image changelog.
PID cgroup throttling (batch)	Thousands of `kubectl exec` sessions.	Throttle automation using shared session pools.
Inode exhaustion	CrashLoop churn writing tiny files.	Use tmpfs quotas plus pruning.
Hardware clock skew	Tokens rejected because the JWT appears issued in the future.	Fix NTP on the control plane boundary.

Runtime-level findings should funnel back into image hardening backlog items so application teams stop papering over gaps with permissive SCC/PodSecurity exceptions.

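Exit code	Usually
1	Unhandled exception or missing dependency.
137	SIGKILL (128 + 9): kernel OOM killer, or kubelet kill after the grace period expired.
143	SIGTERM (128 + 15): container was asked to stop; exceeding the grace period escalates to SIGKILL (137).
139	SIGSEGV (128 + 11): usually native library or memory-access bugs.
126/127	126: command found but not executable (permissions or bad shebang); 127: command or interpreter absent in the container.
Crash before PID 1	`/bin/sh` missing due to distroless drift.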
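
Exit codes above 128 follow the shell convention of 128 plus the signal number, which is how 137, 143, and 139 in the table are derived. The sketch below decodes a code accordingly; `explain_exit_code` is a hypothetical helper, and its wording for the non-signal cases is deliberately generic.

```bash
#!/usr/bin/env bash
# Hypothetical helper: turns a container exit code into a short explanation,
# using the 128 + signal-number convention for codes above 128.
explain_exit_code() {
  local code=$1
  case "$code" in
    0)   echo "clean exit" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    *)
      if [ "$code" -gt 128 ] && [ "$code" -lt 256 ]; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application-defined error"
      fi
      ;;
  esac
}

explain_exit_code 137   # terminated by signal 9  (SIGKILL)
explain_exit_code 143   # terminated by signal 15 (SIGTERM)
```

The code to feed it is reported under "Last State" in `kubectl describe pod`.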

Fragmentation And MTU

Connectivity failures that reproduce only on certain paths (for example, `kubectl exec` works but Istio egress fails, or only traffic over GRE tunnels to SD-WAN breaks) commonly trace to MSS clamp drift after NIC driver upgrades.

Test vector	Passing result	Interpretation or next step
ICMP large ping from netshoot Pod	ICMP ok	Prefer TCP MSS tests next.
`curl --interface` forcing Pod IP vs Service IP	Both pass	Probably app-level failure.
Tracing shows black hole after ~1400 byte payload	n/a	Clamp TCP MSS on the CNI DaemonSet overlay.
Only cross-AZ hops failing	n/a	MTU mismatch between regional jumbo-frame configs.
MetalLB BGP session flaps	Hold timer adjustments	BGP communities blackholing prematurely.

bash: mtu_probe.sh

kubectl run mtu-probe --rm -it --image=nicolaka/netshoot -- \
  ping -M do -s 8972 -c3 172.31.255.253 # 8972-byte payload + 28 header bytes = 9000; -M do sets Don't Fragment so PMTUD black holes surface
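
The 8972 in the probe is the largest ICMP payload that fits a 9000-byte MTU over IPv4: the MTU minus a 20-byte IPv4 header and an 8-byte ICMP header. The sketch below does that arithmetic for other candidate MTUs; `icmp_payload_for_mtu` is a hypothetical helper, and the MTU list is only an example set (1450 is a common VXLAN overlay value).

```bash
#!/usr/bin/env bash
# Hypothetical helper: largest ICMP echo payload that fits a given IPv4 MTU
# (MTU - 20 bytes IPv4 header - 8 bytes ICMP header).
icmp_payload_for_mtu() {
  echo $(( $1 - 28 ))
}

# Print the ping size to try for a few example MTUs.
for mtu in 9000 1500 1450 1400; do
  echo "MTU $mtu -> ping -s $(icmp_payload_for_mtu "$mtu")"
done
```

Stepping down through the list until the ping succeeds brackets the effective path MTU.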

Pager Handoff Anchors

When handing the incident bridge to the next responder, reuse the same breadcrumbs this page drills on so knowledge does not dissipate.

Current failing namespace list with blast radius annotated.
Whether cordon/drain/eviction tooling was already exercised.
Whether GitOps reconcile is paused and who authorized it.
Most recent infra change correlated with outage clock.
Links to Grafana/Loki queries already scoped.
Outstanding risky mitigations awaiting CAB.
List of flaky nodes even if Pods already evicted.
Customer-visible SLO deltas if applicable.
Observability gaps discovered mid-incident.
Follow-up bugs filed or pending creation.

Short acknowledgements (“we tried X”) prevent thrash loops for on-call rotations.

Pin links to dashboards instead of exporting PNGs whenever possible.

Prefer UTC timestamps aligned with Grafana panels when narrating timelines.

Normalize language: “hypothesis” vs “confirmed root cause” avoids misread postmortems.