Cluster, Nodes & Apps Troubleshooting
Troubleshoot from broad to narrow: API reachable, nodes Ready, system Pods healthy, target namespace events clean, Pod scheduled, containers started, app logs sane, resource pressure absent.
First Pass
kubectl cluster-info # Confirms the API server endpoint is reachable.
kubectl get nodes -o wide # Checks Ready status, versions, internal IPs, and node OS.
kubectl get pods -A --sort-by=.metadata.namespace # Finds failing system and app Pods.
kubectl get events -A --sort-by=.lastTimestamp | tail -50 # Recent warnings usually point at the failure.
kubectl top nodes # Requires metrics-server; shows CPU/memory pressure.
kubectl top pods -A --sort-by=memory # Finds noisy workloads.
Node Troubleshooting
NODE=worker-1
kubectl describe node "$NODE" # Conditions, taints, allocatable resources, recent node events.
kubectl get pods -A --field-selector spec.nodeName="$NODE" # Workloads placed on this node.
# If you have node shell access:
sudo systemctl status kubelet --no-pager # kubelet health.
sudo journalctl -u kubelet -n 200 --no-pager # kubelet errors: CNI, image pull, cert, eviction.
sudo crictl ps -a # Containers known to the runtime.
sudo crictl logs <container-id> # Logs when kubectl logs is not enough.
Pod State Map
| Status | Likely cause | First check |
|---|---|---|
| Pending | Scheduling blocked, PVC pending, quota, node selector mismatch. | kubectl describe pod |
| ContainerCreating | Image pull, CNI, CSI mount, Secret/ConfigMap missing. | Events on Pod |
| CrashLoopBackOff | App exits after start, bad config, failed probe. | Previous logs |
| ImagePullBackOff | Bad tag, auth, registry outage. | Image name and pull secret |
| OOMKilled | Container exceeded memory limit. | Limits, metrics, app memory |
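To see which of the states in this table dominate before drilling into a single Pod, the STATUS column can be tallied. The following is a minimal sketch, not a definitive tool: the function name pod_status_summary is hypothetical, and it assumes the default kubectl column layout for -A output (NAMESPACE NAME READY STATUS RESTARTS AGE).

```shell
# Tally Pod statuses from `kubectl get pods -A --no-headers` output on stdin.
# Assumed columns with -A: NAMESPACE NAME READY STATUS RESTARTS AGE.
# Running and Completed are skipped so only suspicious states remain.
pod_status_summary() {
  awk '$4 != "Running" && $4 != "Completed" { count[$4]++ }
       END { for (s in count) printf "%d %s\n", count[s], s }' |
    sort -k1,1nr -k2,2
}

# Usage (assumes kubectl access to the cluster):
#   kubectl get pods -A --no-headers | pod_status_summary
```

The most frequent status is printed first, which suggests which row of the table to start from.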
CrashLoopBackOff Deep Dive
| Signal | Interpretation | Mitigation |
|---|---|---|
| Restart count climbing | Process exits immediately or probe never passes. | kubectl logs --previous for last crash reason. |
| probe failures in events | Liveness/readiness path wrong. | Align probe port/path with real listener. |
| exit code 137 / OOM score | Memory spike at boot. | Raise limit or defer heavy init. |
| panic stack in logs | App bug/config. | Rollback image or fix env/Secret. |
| Crash only on subset of nodes | Arches, kernel feature, SELinux/AppArmor variance. | Describe node labels + inspect security contexts. |
POD=my-pod
NS=app
kubectl describe pod "$POD" -n "$NS"
kubectl logs "$POD" -n "$NS" --all-containers=true --tail=120
kubectl logs "$POD" -n "$NS" --all-containers=true --previous
kubectl get pod "$POD" -n "$NS" -o jsonpath='{.spec.containers[*].name}{"\n"}'
Node NotReady Playbook
| Evidence | Subsystem | Evidence command |
|---|---|---|
| Kubelet stopped posting status | kubelet/agent crash or disk-full. | journalctl/kubelet svc on host. |
| CNI/network not ready | CNI DaemonSet CrashLoop. | kubectl logs -n kube-system -l k8s-app=calico-node pattern. |
| Runtime not ready | containerd/cri-o wedged. | crictl info + systemd. |
| PIDPressure / DiskPressure | Host saturated. | df -h, inode usage, cgroup pids.max. |
| Unknown node heartbeat | Network partition or API/etcd auth. | Connectivity from node to apiserver LB. |
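When several nodes are affected, it can help to list every node that is not Ready before picking one to describe. This is a hedged sketch: not_ready_nodes is a hypothetical helper, and it assumes the default column layout of kubectl get nodes (NAME STATUS ROLES AGE VERSION), where STATUS may carry suffixes such as Ready,SchedulingDisabled.

```shell
# Print the names of nodes whose primary status is not "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# A cordoned but healthy node ("Ready,SchedulingDisabled") is not reported.
not_ready_nodes() {
  awk '{ split($2, parts, ","); if (parts[1] != "Ready") print $1 }'
}

# Usage (assumes kubectl access to the cluster):
#   kubectl get nodes --no-headers | not_ready_nodes | while read -r node; do
#     kubectl describe node "$node" | sed -n '/^Conditions:/,/^Addresses:/p'
#   done
```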
NODE=worker-1
kubectl describe node "$NODE"
kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name="$NODE"
OOM And Memory Pressure
| Symptom | Layer | Remediation |
|---|---|---|
| Reason OOMKilled | Container crosses memory.limit. | Raise realistic limit based on spikes; tune JVM/Go env. |
| QoS eviction | Node starving; Guaranteed last. | Add nodes / drop noisy neighbors. |
| cgroup kills before limit | Memory-backed volumes or huge caches counted against the container. | Inspect /sys/fs/cgroup/...memory.events on host. |
| metrics show low util but kills happen | Transient spike shorter than scraping. | Use eBPF/kubelet traces or profiling. |
| multiple pods restarted same node | Underlying hardware memory errors. | ECC logs / BMC alerts. |
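To separate OOMKilled restarts from other crash reasons across a namespace, the last termination reason of each container can be listed and filtered. A minimal sketch, assuming the custom-columns output shown in the usage comment (reasons comma-separated per Pod, <none> when a container has no previous termination); oom_killed_pods is a hypothetical helper name.

```shell
# Print Pod names whose containers were last terminated with reason OOMKilled.
# Reads two-column lines on stdin: <pod-name> <comma-separated reasons>.
oom_killed_pods() {
  awk '$2 ~ /(^|,)OOMKilled(,|$)/ { print $1 }'
}

# Usage (assumes kubectl access; namespace "app" as in the examples above):
#   kubectl get pods -n app --no-headers \
#     -o custom-columns='NAME:.metadata.name,LAST:.status.containerStatuses[*].lastState.terminated.reason' \
#     | oom_killed_pods
```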
kubectl describe pod crashing-pod -n app | grep -A3 -i oom
kubectl top pods -A --containers --sort-by=memory | head -40 # Sorted highest first, so head shows the top consumers.
Image Pull Failures
| Event fragment | Usually means | Fix vector |
|---|---|---|
| 401 Unauthorized | Missing/expired registry secret. | Ensure imagePullSecrets is referenced and check IAM token TTL. |
| 403 Forbidden | Repo IAM / network egress policy. | Relax the policy or mirror the image. |
| ImagePullBackOff | Bad tag or architecture/digest mismatch. | Inspect manifest & node arch (kubectl debug node ...). |
| ErrImageNeverPull | imagePullPolicy is Never but the image is absent from the node. | Use Never/IfNotPresent intentionally. |
| timeout / i/o deadline | Firewall or registry outage. | Circuit-break and reroute pulls to a mirror. |
kubectl describe pod my-pod -n app | grep -A8 Events
kubectl get secret registry-creds -n app -o yaml # Check that type is kubernetes.io/dockerconfigjson and compare its registry entries.
Kubelet Eviction Matrix
| Condition | Signals | Operational response |
|---|---|---|
| DiskPressure | kubectl describe node; image GC thrash. | Trim logs, enlarge root volume, move containerd dir. |
| MemoryPressure | Kernel reclaim + pod rank eviction. | Relax limits / add RAM / drain hotspot. |
| PIDPressure | Runaway fork bombs / short-lived Pods. | Throttle workload; raise pids.limit carefully. |
| AllocatedTooManyPods | Cheap small pods exhaust maxPods. | Shard workloads or tweak kubelet flags (platform-specific). |
| Forced drain during incident | Workloads without a PDB can lose every replica at once. | Always pair with PDB and capacity math. |
kubectl get pods -n app -o json | jq '..|.reason?|select(.!=null)'
kubectl get events -n app --field-selector reason=Evicted --sort-by=.lastTimestamp
Deployments And Rollout Signals
When Pods look fine individually but Deployments stay degraded, widen the aperture to ReplicaSet ownership, PDB overlap, terminating Pods, or HPAs fighting manual scales.
| Observation | Interpretation | Pivot command |
|---|---|---|
| Replicas < desired | Insufficient cluster capacity vs PDB clamp. | kubectl describe pdb -n ... |
| Old ReplicaSet spikes | Rollback in progress. | kubectl rollout history deploy/... |
| ProgressDeadlineExceeded | New Pods never become Ready. | Probe/logs on new RS template hash. |
| Termination stuck | Finalizers/preStop hooks/long grace. | --force delete only with policy clearance. |
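The first row of this table (replicas below desired) can be spotted across a namespace by comparing the two halves of the READY column. This is a sketch under the assumption that kubectl get deploy prints NAME READY UP-TO-DATE AVAILABLE AGE with READY formatted as ready/desired; degraded_deployments is a hypothetical helper.

```shell
# List Deployments whose ready replica count is below the desired count.
# Reads `kubectl get deploy --no-headers` output on stdin.
degraded_deployments() {
  awk '{ split($2, r, "/")
         if (r[1] + 0 < r[2] + 0)
           printf "%s ready=%d desired=%d\n", $1, r[1], r[2] }'
}

# Usage (assumes kubectl access to the cluster):
#   kubectl get deploy -n app --no-headers | degraded_deployments
```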
kubectl describe deploy api -n app
kubectl rollout status deploy/api -n app --timeout=5m | cat
kubectl get rs -n app -o wide --show-labels
kubectl get hpa -n app
Quotas And Admission Blocks
| Admission message | Underlying object | Rapid unblock |
|---|---|---|
| Forbidden: exceeded quota | ResourceQuota on namespace CPU/mem/pods. | Adjust quota or reschedule. |
| LimitRange min/max mismatch | Admission defaults rejecting manifest. | Describe LimitRange objects. |
| Forbidden: exceeded quota (requests.storage) | PV/PVC quotas. | Recycle PVCs/shrink StatefulSets temporarily. |
| Exceeded pod count | Many cheap batch Jobs. | Garbage collect completed Pods. |
kubectl describe resourcequota -n app
kubectl describe limitrange -n app
kubectl get pods -n app --field-selector=status.phase=Failed
RBAC And ServiceAccounts
Permission errors surface as Forbidden responses from kube-apiserver; look for them in audit logs and in the logs of Pods or CSI sidecars that authenticate with mounted ServiceAccount tokens.
kubectl auth can-i watch pods --as=system:serviceaccount:app:api-sa -n app
kubectl describe rolebinding -n app
kubectl get role,rolebinding -n app -o wide
PVCs Blocking Pod Start
Stateful workloads often wedge in ContainerCreating because CSI attach hangs or zoning mismatches propagate late.
| Event hint | Interpretation | Narrow drill |
|---|---|---|
| waiting for PersistentVolumeClaim | Provisioning backlog. | kubectl describe pvc. |
| Multi-Attach error | RWO reused across nodes. | VolumeAttachment objects. |
| permission denied mounting | Secrets for cloud provider missing RBAC. | CSI controller logs. |
| Timed out waiting for volumes | AZ drift in storage topology. | Align node and PV zone annotations. |
kubectl get pvc,pv -n data
kubectl get volumeattachment
Container Output Streams
kubectl logs deploy/web-api -n app --tail=100 # Logs from one current Pod behind the Deployment.
kubectl logs deploy/web-api -n app --all-containers=true --tail=100 # Include sidecars.
kubectl logs pod/web-api-abc123 -n app -c api --previous # Previous crashed container logs.
kubectl logs -n app -l app=web-api --since=30m --prefix=true # Aggregate by label for recent logs.
kubectl describe pod web-api-abc123 -n app # Events explain scheduling, pull, probe, and mount failures.
Control Plane Components
kubectl get pods -n kube-system # Managed clusters expose component Pods differently.
kubectl get componentstatuses # Deprecated, but may exist in older clusters.
kubectl logs -n kube-system kube-apiserver-control-plane --tail=100 # kubeadm static Pod example.
kubectl logs -n kube-system kube-controller-manager-control-plane --tail=100
kubectl logs -n kube-system kube-scheduler-control-plane --tail=100
kubectl get --raw='/readyz?verbose' # API server readiness details if your RBAC allows it.
CronJob Saturation Signals
Batch bursts masquerade as cluster-wide slowness when API priority for Job creation competes with Service discovery watches.
| Leading indicator | Likely cause |
|---|---|
| Pod count nearing namespace quota nightly | CronJobs without history limits pinned. |
| APIServer spikes at minute zero | Thousands of deterministic schedules stacked. |
| Stateful pipeline delays | BackoffLimit exhaustion thrashing etcd. |
| Garbage collector backlog | Completed Pods never TTL-deleted. |
- Trim successfulJobsHistoryLimit/failedJobsHistoryLimit proactively.
- Prefer ttlSecondsAfterFinished on Jobs when controllers support migration.
- Shard noisy tenants into namespaces with independent quotas.
- Alert when kubectl get jobs -n batch --field-selector status.successful=0 grows unbounded.
- Throttle CI-driven Job storms when GitOps emits duplicate stamped manifests.
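Trimming the history limits from the first bullet can be done with a merge patch. The sketch below only builds the patch body so it can be reviewed before it is applied; cronjob_history_patch and the CronJob name nightly-report are hypothetical, and the fallback values 3 and 1 mirror the Kubernetes defaults for these two fields.

```shell
# Build a JSON merge patch that sets both CronJob history limits.
# $1: successful Jobs to keep (falls back to 3)
# $2: failed Jobs to keep (falls back to 1)
cronjob_history_patch() {
  printf '{"spec":{"successfulJobsHistoryLimit":%d,"failedJobsHistoryLimit":%d}}\n' \
    "${1:-3}" "${2:-1}"
}

# Usage (assumes kubectl access and a CronJob named nightly-report in batch):
#   kubectl patch cronjob nightly-report -n batch --type=merge \
#     -p "$(cronjob_history_patch 2 1)"
```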
kubectl get cronjobs -A
kubectl get jobs -A --sort-by=.status.startTime | tail
Ephemeral Debugging
Use kubectl debug profiles when distroless images block shells or when copying binaries is unsafe.
| Profile | When | Note |
|---|---|---|
| general | Interactive shell beside target. | Shares namespaces; beware side effects. |
| baseline | Closer to distroless hardened. | Minimal tooling. |
| restricted | Highly locked clusters. | Needs PSA/compliance approvals. |
| --copy-to (a flag, not a profile) | Clone the Pod spec into a debug copy. | Preserves env for reproducibility. |
kubectl debug pod/api-XXXX -n app -it --image=busybox --target=api
kubectl debug node/worker-1 -it --image=ubuntu -- chroot /host bash
Incident Evidence Ordering
Preserve ordering so post-incident reviewers can correlate operator actions vs automated controllers.
| Minute offset | Artifact | Captures |
|---|---|---|
| t-10 | kubectl describe node/pod tarball | steady-state snapshot. |
| t0 | Alerts + Grafana snapshot links | latency/memory graphs. |
| t+5 | API audit slice if available | Flood of deletes/evictions? |
| t+30 | Workload controller logs aggregated | StatefulSet churn vs ReplicaSet churn. |
| close-out | CHG/ticket linkage + rollback manifests | Demonstrate guardrails rebuilt. |
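One way to keep artifacts orderable is to stamp each bundle name with a UTC timestamp at capture time. A minimal sketch: bundle_name and the k8s-evidence prefix are hypothetical conventions, not something this page prescribes.

```shell
# Print an evidence bundle file name that sorts chronologically.
# $1: short incident label (hypothetical, for example a ticket id).
bundle_name() {
  printf 'k8s-evidence-%s-%s.tgz\n' "$1" "$(date -u +%Y%m%dT%H%M%SZ)"
}

# Usage (INC-1234 is a made-up ticket id):
#   tar czf "$(bundle_name INC-1234)" /tmp/k8s-dump
```

Because the timestamp is UTC and fixed-width, a plain directory listing shows the bundles in capture order.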
kubectl cluster-info dump --namespaces kube-system,app --output-directory=/tmp/k8s-dump
tar czf bundle.tgz /tmp/k8s-dump
API Server Stress Signals
| Symptom | Watch path | Mitigate |
|---|---|---|
| dial tcp i/o timeout | apiserver overloaded or LB flapping. | Kill expensive kubectl logs -f sessions; widen rate limits thoughtfully. |
| /etcdserver: timeout | Compaction/rebuild lag. | Raise infra ticket; throttle writes temporarily. |
| Admission webhook latency high | Webhook deployment scaling. | kubectl logs deploy/webhook-validator pattern. |
| Repeated 429 from kubelet | Node storm listing huge objects. | Narrow list/watch scopes and reduce cardinality. |
| TLS handshake resets | Stale kubeconfig / MTU MSS issues. | Renew kubeconfig certs or adjust network path. |
Observability should watch APF priority levels and etcd disk latency at the platform layer; escalate when tail latency climbs before hard timeouts appear at clients.
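Where no dashboard exists yet, APF rejections can be spot-checked from the API server's own metrics endpoint. The sketch below sums the apiserver_flowcontrol_rejected_requests_total counter per priority_level label; it assumes that metric is exposed by your Kubernetes version and that your RBAC allows reading /metrics. apf_rejections_by_level is a hypothetical helper.

```shell
# Sum APF rejected-request counters per priority level.
# Reads Prometheus text-format metrics on stdin.
apf_rejections_by_level() {
  awk '/^apiserver_flowcontrol_rejected_requests_total[{]/ {
         if (match($0, /priority_level="[^"]*"/)) {
           # Skip the 16-character prefix priority_level=" and the closing quote.
           level = substr($0, RSTART + 16, RLENGTH - 17)
           total[level] += $NF
         }
       }
       END { for (l in total) printf "%s %d\n", l, total[l] }' | sort
}

# Usage (assumes kubectl access and permission to read /metrics):
#   kubectl get --raw /metrics | apf_rejections_by_level
```

Counters are cumulative since API server start, so compare two samples taken a few minutes apart rather than reading a single value.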
Event Mining Cheat Sheet
Namespaces can emit thousands of events; scope by object, severity, reason, or time window before grepping arbitrarily.
# Recent Warning events across all namespaces.
kubectl get events -A \
--field-selector type=Warning \
--sort-by=.lastTimestamp | tail -n 80
# Follow a single object's timeline (replace kind/name/ns).
kubectl get events --namespace app \
--field-selector involvedObject.kind=Pod,involvedObject.name=api-aaaa -w
# Group reasons to see dominating failure modes quickly.
kubectl get events -A --sort-by=.lastTimestamp -o yaml | grep ' reason:' | sort | uniq -c | tail
# Correlate DaemonSet turbulence with node maintenance.
kubectl get events -n kube-system --sort-by=.lastTimestamp | grep -i -E 'cni|cilium|calico|flannel|kube-proxy'
sysctl And Runtime Guards
| Observation | Interpretation | Mitigating experiment |
|---|---|---|
| PodSecurity warnings | Privileged hooks missing CAPs. | Relax baseline profile only under CAB. |
| too many open files | Ulimit inside container low. | Increase ulimit -n via workload / init. |
| Transparent hugepage warnings from databases | Latency jitter on mmap-heavy apps. | Tune node sysctl DaemonSet deliberately. |
| SELinux AVC denials | Host policy rejecting volume mounts. | Use audit2allow path or label volumes. |
| cgroups v1 vs v2 mismatch | Older agents assume v1 hierarchies. | Bake agent matrix into node image changelog. |
| PID cgroup throttling batch | Thousands of kubectl exec sessions. | Throttle automation using shared session pools. |
| inode exhaustion | CrashLoop churn writing tiny files. | Use tmpfs quotas + pruning. |
| Hardware clock skew | JWT "issued in the future" failures. | Fix NTP on control plane boundary. |
Runtime-level findings should funnel back into image hardening backlog items so application teams stop papering gaps with permissive SCC/PodSecurity exceptions.
| Exit code bucket | Usually |
|---|---|
| 1 | Unhandled exception or missing dependency. |
| 137 | SIGKILL (kernel OOM killer or kubelet kill). |
| 143 | SIGTERM after exceeding grace budgets. |
| 139 | SIGSEGV; native libs or cgroup limits. |
| 126/127 | 126: command found but not executable; 127: command or interpreter not found in container. |
| Crash before PID1 | /bin/sh missing due to distroless drift. |
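The 137, 143, and 139 rows follow the shell convention that an exit code above 128 is 128 plus the number of the terminating signal. A small sketch of that arithmetic, with explain_exit_code as a hypothetical helper that names only the three signals from this table:

```shell
# Translate a container exit code into a short explanation.
# Codes above 128 are read as 128 + signal number.
explain_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  name=SIGKILL ;;
      11) name=SIGSEGV ;;
      15) name=SIGTERM ;;
      *)  name="signal $sig" ;;
    esac
    printf '%d: killed by %s\n' "$code" "$name"
  else
    printf '%d: process exited on its own\n' "$code"
  fi
}

# Usage (assumes kubectl access; Pod and namespace as in the examples above):
#   explain_exit_code "$(kubectl get pod my-pod -n app \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```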
Fragmentation And MTU
Connectivity failures that only reproduce on certain paths (kubectl exec works but Istio egress fails, or GRE tunnels to SD-WAN) commonly trace to MSS clamp drift after NIC driver upgrades.
| Test vector | Passes | Fails interpretation |
|---|---|---|
| ICMP large ping from netshoot Pod | ICMP ok | Prefer TCP MSS tests next. |
| curl --interface forcing Pod IP vs Service IP | Both pass | Probably app-level failure. |
| Tracing shows black hole after ~1400 byte payload | n/a | Clamp TCP MSS on CNI DaemonSet overlay. |
| Cross-AZ hops only failing | n/a | MTU mismatches regional jumbo configs. |
| MetalLB BGP session flaps | Hold timer adjustments | BGP communities blackholing prematurely. |
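Probe sizes for the tests in this table follow from header arithmetic: an ICMP echo payload is the path MTU minus 28 bytes (20 for the IPv4 header, 8 for ICMP), and the TCP MSS is the MTU minus 40 (20 IPv4, 20 TCP). The sketch below assumes IPv4 with no IP or TCP options and no overlay encapsulation overhead; both function names are hypothetical.

```shell
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet.
icmp_payload_for_mtu() {
  echo $(( $1 - 28 ))   # 20-byte IPv4 header + 8-byte ICMP header
}

# TCP MSS that matches a given MTU.
tcp_mss_for_mtu() {
  echo $(( $1 - 40 ))   # 20-byte IPv4 header + 20-byte TCP header
}

# Usage:
#   icmp_payload_for_mtu 9000   # jumbo-frame path
#   tcp_mss_for_mtu 1500        # standard Ethernet path
```

A 9000-byte jumbo MTU gives the 8972-byte payload used in the ping probe on this page; overlay networks subtract their own encapsulation overhead on top of these numbers.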
kubectl run mtu-probe --rm -it --image=nicolaka/netshoot -- \
ping -M do -s 8972 -c3 172.31.255.253 # ICMP payload probing for PMTUD black holes; -M do sets Don't Fragment and assumes iputils ping.
Pager Handoff Anchors
When handing the bridge to the next responder, reuse the breadcrumbs this page drills on so knowledge does not dissipate.
- Current failing namespace list with blast radius annotated.
- Whether cordon/drain/eviction tooling was already exercised.
- Whether GitOps reconcile is paused and who authorized it.
- Most recent infra change correlated with outage clock.
- Links to Grafana/Loki queries already scoped.
- Outstanding risky mitigations awaiting CAB.
- List of flaky nodes even if Pods already evicted.
- Customer-visible SLO deltas if applicable.
- Observability gaps discovered mid-incident.
- Follow-up bugs filed or pending creation.
Short acknowledgements (“we tried X”) prevent thrash loops for on-call rotations.
Pin links to dashboards instead of exporting PNGs whenever possible.
Prefer UTC timestamps aligned with Grafana panels when narrating timelines.
Normalize language: “hypothesis” vs “confirmed root cause” avoids mis-read postmortems.