Troubleshooting & Operations Knowledge

What is your first pass when an app is down?

Check namespace events, Deployment status, ReplicaSets, Pods, readiness, Service endpoints, logs, and recent changes. Then narrow the failure down to one stage: scheduling, image pull, container start, readiness, networking, storage, or application behavior.
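The first pass above can be sketched as one read-only shell function. The name first_pass and its two arguments (namespace and Deployment name) are assumptions made for illustration. It uses only standard kubectl subcommands and is a starting point, not a complete runbook.

```shell
# first_pass NAMESPACE DEPLOYMENT
# Hypothetical helper: prints a read-only snapshot of an app that is down.
first_pass() {
  # Events sorted oldest to newest, so the latest failure reason is at the bottom.
  kubectl get events -n "$1" --sort-by=.metadata.creationTimestamp

  # Desired vs. ready replicas, then a short wait to see whether a rollout is stuck.
  kubectl get deployment "$2" -n "$1" -o wide
  kubectl rollout status "deployment/$2" -n "$1" --timeout=10s

  # ReplicaSets and Pods: look for Pending, CrashLoopBackOff, ImagePullBackOff.
  kubectl get replicasets,pods -n "$1" -o wide

  # A Service whose EndpointSlices list no ready addresses has no usable backends.
  kubectl get services,endpointslices -n "$1"
}
```

Example: first_pass shop web. Logs and recent changes still need a separate look once the failing stage is known.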

What does CrashLoopBackOff mean?

CrashLoopBackOff means a container starts, then exits or crashes, and the kubelet restarts it with an exponentially increasing backoff delay, capped at five minutes. Check previous logs, command and args, environment, mounted config, permissions, and app startup behavior.

How do you view logs of a crashing Pod?

Use kubectl logs <pod> --previous to see logs from the last crashed container instance. If the Pod has multiple containers, add -c <container>. Pair logs with kubectl describe pod <pod> for events and restart reason.
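As a small sketch, the two commands above can be wrapped in one function. The name crash_logs is hypothetical, the container argument is optional, and the Pod is assumed to live in the current context's namespace.

```shell
# crash_logs POD [CONTAINER]
# Hypothetical helper: previous-instance logs, then events and restart reason.
crash_logs() {
  if [ -n "${2:-}" ]; then
    # Multi-container Pod: name the container explicitly.
    kubectl logs "$1" --previous -c "$2"
  else
    kubectl logs "$1" --previous
  fi

  # Events, last state, exit code, and restart count for the same Pod.
  kubectl describe pod "$1"
}
```

Example: crash_logs web-0 app.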

What does BackOff restarting failed container mean?

It means the container repeatedly fails to start or stay running, so kubelet is backing off restarts. It is the event-level form of the same pattern commonly seen as CrashLoopBackOff.

What does ImagePullBackOff indicate?

Kubernetes cannot pull the container image. Common causes include a wrong image name, wrong tag, missing registry credentials, private registry auth failure, network egress problems, registry rate limits, or unsupported node architecture.

How do you debug ImagePullBackOff?

Describe the Pod and read the exact pull event. Check image name, tag, registry hostname, pull secret, ServiceAccount imagePullSecrets, registry auth, network egress, rate limits, and whether the image exists for the node architecture.
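A minimal sketch of those checks, assuming the Pod is in the current namespace. The function name pull_debug is hypothetical. ServiceAccount pull secrets are normally copied into the Pod spec at admission time, so reading the Pod spec usually shows what kubelet will actually use.

```shell
# pull_debug POD
# Hypothetical helper: shows what kubelet was asked to pull and why it failed.
pull_debug() {
  # Image references exactly as written in the Pod spec (name, tag, registry host).
  kubectl get pod "$1" -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'

  # Pull secrets attached to the Pod; empty output means none are set.
  kubectl get pod "$1" -o jsonpath='{.spec.imagePullSecrets[*].name}{"\n"}'

  # Events for this Pod only; the pull event carries the registry's exact error.
  kubectl get events --field-selector "involvedObject.kind=Pod,involvedObject.name=$1" --sort-by=.metadata.creationTimestamp
}
```

Example: pull_debug web-0.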

What does ErrImageNeverPull mean?

ErrImageNeverPull means the container sets imagePullPolicy: Never and the image is not present on the node, so kubelet refuses to pull it from a registry. Preload the image on the node, or change the policy to IfNotPresent or Always.

How do you debug a Pod stuck in Pending?

Describe the Pod and read scheduler events. Common causes include insufficient CPU or memory, node selectors, taints without tolerations, affinity mismatch, unbound PVCs, quota limits, missing runtime class, or unavailable nodes.
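One way to line up the scheduler's complaint with what the Pod asks for and what the nodes offer is sketched here. The function name pending_debug is hypothetical, and the Pod is assumed to be in the current namespace.

```shell
# pending_debug POD
# Hypothetical helper: scheduler events, Pod constraints, and node capacity.
pending_debug() {
  # FailedScheduling events state which predicate rejected which nodes.
  kubectl get events --field-selector "involvedObject.name=$1,reason=FailedScheduling"

  # Constraints the Pod carries: node selector and tolerations.
  kubectl get pod "$1" -o jsonpath='{.spec.nodeSelector}{"\n"}{.spec.tolerations}{"\n"}'

  # What each node can offer: allocatable CPU and memory, plus taint keys.
  kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory,TAINTS:.spec.taints[*].key'
}
```

Example: pending_debug web-0. Unbound PVCs and quota limits need their own checks with kubectl get pvc and kubectl describe resourcequota.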

How do you check events for a Pod?

Use kubectl describe pod <pod> for Pod-specific events and state. For a timeline across namespaces, use kubectl get events -A --sort-by=.metadata.creationTimestamp. Events often explain scheduling, image pull, probe, and volume failures.

How do you check cluster-wide events?

Use kubectl get events -A --sort-by=.metadata.creationTimestamp. Sorting by .lastTimestamp is also common, but that field can be empty on events recorded through the newer events.k8s.io API, so .metadata.creationTimestamp is the more reliable key. Use events as hints, then verify with object status and logs.

What does a node in NotReady state mean?

NotReady means the node's Ready condition is False or Unknown: either kubelet reports itself unhealthy, or the control plane has stopped receiving its status updates. Causes include kubelet failure, network partition, API server reachability problems, disk pressure, memory pressure, runtime failure, CNI problems, or certificate issues.
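To see which condition is failing, a short sketch like the following can help. The function name node_conditions is an assumption; the jsonpath fields are part of the standard Node status.

```shell
# node_conditions NODE
# Hypothetical helper: one line per node condition, then the full description.
node_conditions() {
  # Prints lines such as Ready=False or MemoryPressure=True with the reported reason.
  kubectl get node "$1" -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\t"}{.reason}{"\n"}{end}'

  # Taints, allocated resources, and recent node events.
  kubectl describe node "$1"
}
```

Example: node_conditions worker-1. A Ready status of Unknown usually points at connectivity or a stopped kubelet rather than at resource pressure.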

How do you check kubelet logs?

Use journalctl -u kubelet -f to follow live logs on a systemd node. For a focused window, use journalctl -u kubelet --since '30 minutes ago'. Kubelet logs explain node registration, Pod startup, CNI, CSI, and runtime issues.

What does OOMKilled mean?

OOMKilled means the kernel OOM killer terminated a process in the container, either because the container exceeded its memory limit or because the node itself ran short of memory. The container typically exits with code 137. Check container limits, memory usage, app heap settings, node pressure, and previous logs.
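A hedged sketch for confirming an OOM kill and comparing it with the configured limit. The function name oom_check is hypothetical, and the last command assumes metrics-server is installed, since kubectl top fails without it.

```shell
# oom_check POD
# Hypothetical helper: last termination reason, memory limits, and live usage.
oom_check() {
  # Reason and exit code of the previous container instance (OOMKilled, 137).
  kubectl get pod "$1" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'

  # Memory limit per container; an empty value means no limit is set.
  kubectl get pod "$1" -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.limits.memory}{"\n"}{end}'

  # Current usage per container (assumption: metrics-server is available).
  kubectl top pod "$1" --containers
}
```

Example: oom_check web-0.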

What does NodePressure such as DiskPressure or MemoryPressure mean?

Node pressure means the node is low on a critical resource. DiskPressure or MemoryPressure can trigger Pod eviction, image garbage collection, scheduling avoidance, and degraded node readiness. Check node conditions, kubelet logs, and resource usage.

How do you debug DNS issues inside a Pod?

Use kubectl exec -it <pod> -- nslookup kubernetes.default or a debug Pod with DNS tools. Check /etc/resolv.conf, kube-dns Service, CoreDNS Pods and logs, NetworkPolicy, NodeLocal DNS, and CNI connectivity.
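The checks above can be strung together roughly as follows. The function name dns_check is hypothetical. The label k8s-app=kube-dns is the one kubeadm-style clusters put on CoreDNS Pods and is an assumption here, as is the presence of cat and nslookup in the Pod's image.

```shell
# dns_check POD [NAME]
# Hypothetical helper: resolver config and a lookup from inside the Pod,
# then the cluster DNS Service, Pods, and recent logs.
dns_check() {
  # Nameserver and search domains the Pod actually uses.
  kubectl exec "$1" -- cat /etc/resolv.conf

  # Lookup from inside the Pod; defaults to the API server's Service name.
  kubectl exec "$1" -- nslookup "${2:-kubernetes.default}"

  # The ClusterIP here should match the nameserver in resolv.conf
  # (unless NodeLocal DNS is in use).
  kubectl get service kube-dns -n kube-system

  # CoreDNS Pods and their recent logs (label is an assumption, see above).
  kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
  kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
}
```

Example: dns_check web-0 my-svc.my-ns.svc.cluster.local.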

How do you test network connectivity between Pods?

Use kubectl exec -it <pod> -- curl <target-ip>:<port> or test the Service DNS name and port. If the image lacks curl, use a temporary debug image. Compare Pod IP, Service IP, DNS name, and endpoint readiness.

What causes a Pod to be stuck in Terminating?

Common causes include finalizers that are not removed, a node that is unreachable, a mounted volume still detaching or unmounting, long terminationGracePeriodSeconds, or a process ignoring SIGTERM.
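To tell these causes apart, it helps to read the deletion metadata directly. This is a sketch with a hypothetical function name, terminating_check, and it assumes the Pod is in the current namespace.

```shell
# terminating_check POD
# Hypothetical helper: shows why a Pod deletion has not completed yet.
terminating_check() {
  # When deletion was requested, remaining finalizers, grace period, and node.
  kubectl get pod "$1" -o jsonpath='{.metadata.deletionTimestamp}{"\t"}{.metadata.finalizers}{"\t"}{.spec.terminationGracePeriodSeconds}{"\t"}{.spec.nodeName}{"\n"}'

  # If the node in the last column is NotReady, kubelet cannot confirm the stop.
  kubectl get nodes

  # Volume unmount and detach problems show up in the Pod's events.
  kubectl describe pod "$1"
}
```

Example: terminating_check web-0. A non-empty finalizers list means some controller still has cleanup to finish before the object can go away.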

How do you force delete a stuck Pod?

Use kubectl delete pod <pod> --force --grace-period=0 only after understanding the risk. Force deletion removes the API object quickly, but the process may still run on an unreachable node until kubelet or the node recovers.

What causes a Service to have no endpoints?

Common causes include selector mismatch, Pods not Ready, wrong namespace, readiness probe failure, or no matching Pods. Check the Service selector, Pod labels, readiness, and EndpointSlices.
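A minimal sketch for comparing the selector with the Pods it should match. The function name endpoints_check is hypothetical, and the Service and Pods are assumed to be in the current namespace. The label kubernetes.io/service-name is the standard one linking EndpointSlices to their Service.

```shell
# endpoints_check SERVICE
# Hypothetical helper: selector, candidate Pods, and resulting EndpointSlices.
endpoints_check() {
  # The label selector the Service uses to pick Pods.
  kubectl get service "$1" -o jsonpath='{.spec.selector}{"\n"}'

  # Pod labels and READY column; compare the labels against the selector above.
  kubectl get pods --show-labels -o wide

  # EndpointSlices owned by the Service; no addresses means no ready backends.
  kubectl get endpointslices -l "kubernetes.io/service-name=$1"
}
```

Example: endpoints_check web.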

How do you debug a failing liveness probe?

Check container logs, Pod events, and the probe configuration: path, port, scheme, initialDelaySeconds, timeoutSeconds, periodSeconds, and failureThreshold. Confirm the probe checks local process health, not a slow external dependency.

How do you check if a node can reach the API server?

From the node, test the API endpoint with a command such as curl -k https://<api-server-ip>:6443/healthz. On current clusters /livez and /readyz are the preferred paths, since /healthz is deprecated. Also check DNS, routes, firewall rules, proxy settings, certificates, kubelet logs, and whether the configured kubelet server URL is correct.

What causes a PVC to be stuck in Pending?

Common causes include no matching PV, wrong access mode, missing StorageClass, provisioner failure, quota limits, zone or region mismatch, or delayed binding with WaitForFirstConsumer until a Pod is scheduled.
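These causes can be checked in roughly this order. The function name pvc_check is hypothetical, and the claim is assumed to be in the current namespace.

```shell
# pvc_check PVC
# Hypothetical helper: claim spec, available classes and volumes, and events.
pvc_check() {
  # Phase, requested StorageClass, access modes, and size of the claim.
  kubectl get pvc "$1" -o jsonpath='{.status.phase}{"\t"}{.spec.storageClassName}{"\t"}{.spec.accessModes}{"\t"}{.spec.resources.requests.storage}{"\n"}'

  # The VOLUMEBINDINGMODE column shows WaitForFirstConsumer where it applies.
  kubectl get storageclass

  # Existing volumes, for statically provisioned setups.
  kubectl get pv

  # Provisioner errors and binding messages for this claim only.
  kubectl get events --field-selector "involvedObject.kind=PersistentVolumeClaim,involvedObject.name=$1"
}
```

Example: pvc_check data-web-0. With WaitForFirstConsumer, a Pending claim is expected until a Pod that uses it is scheduled.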

What should you do before draining a node?

Cordon the node, check DaemonSets and local storage, inspect PodDisruptionBudgets, ensure enough capacity elsewhere, communicate impact, then drain with appropriate flags. Watch replacement Pods and uncordon only after node health is verified.
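The sequence above might look like this as a script. The function name drain_node is hypothetical, and the flags are one reasonable choice, not the only one: --delete-emptydir-data discards emptyDir contents, and the 300-second timeout is an arbitrary assumption. Uncordon is left as a manual step so that node health is verified first.

```shell
# drain_node NODE
# Hypothetical helper: cordon, show what will move, then drain.
drain_node() {
  # Stop new Pods from landing on the node; abort if that fails.
  kubectl cordon "$1" || return 1

  # Pods currently on the node, including DaemonSet Pods and emptyDir users.
  kubectl get pods -A -o wide --field-selector "spec.nodeName=$1"

  # PodDisruptionBudgets that may block evictions.
  kubectl get poddisruptionbudgets -A

  # Evict workloads; DaemonSet Pods are skipped, emptyDir data is deleted.
  kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data --timeout=300s || return 1

  printf 'Drained %s. Verify node health, then run: kubectl uncordon %s\n' "$1" "$1"
}
```

Example: drain_node worker-1, followed by watching replacement Pods with kubectl get pods -A -o wide --watch.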

Why can kubectl drain hang?

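PDBs may block eviction, finalizers may block deletion, or replacement Pods may not schedule, so evictions keep retrying. Unmanaged Pods and Pods using emptyDir storage make drain refuse to proceed unless the matching flags are set. The drain output and Pod events usually explain the blocker.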

What belongs in a useful handoff after an incident?

Timeline, user impact, detected symptoms, changed resources, commands run, current status, temporary mitigations, risks, owner, next checks, and rollback path. It should let another operator continue without re-discovering the facts.
