Practical Scenarios Knowledge
Practice full troubleshooting paths across control plane, workloads, networking, storage, autoscaling, DNS, ingress, and upgrade failures.
Use these scenarios to build ordered thinking: gather evidence, isolate the failing layer, make the smallest safe change, and verify recovery.
Questions
Your cluster API server is intermittently timing out. How do you troubleshoot?
Check API server logs, control-plane CPU and memory, API server request metrics, etcd health and response time with etcdctl endpoint health and etcdctl endpoint status, network latency between the API server and etcd, and load balancer health checks. The root cause is often etcd slowness, control-plane overload, or an unhealthy API load balancer path.
A Deployment rollout is stuck. What do you check first?
Start with kubectl describe deployment and rollout conditions. Then inspect new ReplicaSet Pods for readiness failures, Pending PVCs, image pull failures, scheduling failures, and rollout settings such as maxUnavailable and maxSurge that may block progress.
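To reason about whether maxSurge and maxUnavailable leave the rollout any room to move, the bounds can be worked out by hand. The following is a minimal sketch, assuming the documented rounding rules (maxSurge percentages round up, maxUnavailable percentages round down); the function names are hypothetical and are not part of kubectl or any client library.

```python
def resolve(value, replicas, round_up):
    """Turn an int or a percentage string such as "25%" into a Pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer arithmetic avoids floating-point rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return the Pod-count limits a RollingUpdate must stay within."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        # Assumption: if both resolve to zero, one unavailable Pod is
        # allowed so that the rollout can still make progress.
        unavailable = 1
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


print(rollout_bounds(10))                 # default 25% / 25%
print(rollout_bounds(3, max_surge=0, max_unavailable=1))
```

If new Pods never become Ready, the rollout stops once the available count reaches min_available_pods, and if the cluster cannot schedule the extra surge Pods, it stops before any old Pod is removed.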
Pods cannot reach the external internet. What do you check?
Check node routing tables, NAT or masquerade rules on nodes, cloud route tables, CNI egress behavior, NetworkPolicy egress rules, firewall or security group rules, and proxy settings. Test DNS resolution and raw IP connectivity separately so that a DNS failure is not mistaken for a routing failure.
A node is NotReady. What is your step-by-step approach?
Run kubectl describe node, inspect node conditions and events, check kubelet logs, check disk and memory pressure, verify container runtime health, verify CNI health, test API server connectivity from the node, restart kubelet if appropriate, and recycle the node if it is cloud-managed and not worth repairing in place.
A StatefulSet Pod is stuck terminating. What do you do?
Check finalizers, node reachability, mounted PVC or CSI detach state, the termination grace period, and whether the process is ignoring SIGTERM. If the Pod is being removed by a node drain, also check whether a PodDisruptionBudget is blocking the eviction. Force delete only if the storage and application consistency risks are understood.
The cluster has high etcd latency. What do you check?
Check disk IOPS and fsync latency, because etcd needs fast storage. Also check network latency between control-plane nodes, large objects in etcd such as huge ConfigMaps or custom resources, database size, compaction and defrag schedule, leader changes, and control-plane resource pressure.
A Service has no endpoints. What is the likely root cause?
Most often the Service selector does not match the Pod labels, the matching Pods are not Ready because their readiness probes are failing, or the Service and the Pods are in different namespaces (a Service only selects Pods in its own namespace). Compare Service selectors, Pod labels, Pod readiness, and EndpointSlices.
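The comparison of selectors, labels, namespace, and readiness can be sketched as a small classifier. This is an illustrative model only: the dictionaries are hand-written stand-ins for what kubectl get -o json would return, and explain_endpoints is a hypothetical helper, not a Kubernetes API.

```python
def selector_matches(selector, labels):
    """True when every key/value in the selector is present on the Pod."""
    return all(labels.get(key) == value for key, value in selector.items())


def explain_endpoints(service, pods):
    """Map each Pod name to the reason it is or is not a ready endpoint."""
    result = {}
    for pod in pods:
        if not service["selector"]:
            # A Service without a selector gets no endpoints automatically.
            result[pod["name"]] = "service has no selector"
        elif pod["namespace"] != service["namespace"]:
            result[pod["name"]] = "different namespace"
        elif not selector_matches(service["selector"], pod["labels"]):
            result[pod["name"]] = "labels do not match"
        elif not pod["ready"]:
            result[pod["name"]] = "not Ready"
        else:
            result[pod["name"]] = "endpoint"
    return result


service = {"namespace": "shop", "selector": {"app": "web", "tier": "frontend"}}
pods = [
    {"name": "web-1", "namespace": "shop", "labels": {"app": "web", "tier": "frontend"}, "ready": True},
    {"name": "web-2", "namespace": "shop", "labels": {"app": "web"}, "ready": True},
    {"name": "web-3", "namespace": "shop", "labels": {"app": "web", "tier": "frontend"}, "ready": False},
    {"name": "web-4", "namespace": "staging", "labels": {"app": "web", "tier": "frontend"}, "ready": True},
]
for name, reason in explain_endpoints(service, pods).items():
    print(name, "->", reason)
```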
An HPA is not scaling. What do you check?
Check metrics-server or custom metrics availability, kubectl top pods, HPA status conditions, target metric names, whether the containers set CPU or memory requests (utilization targets are calculated against requests), current utilization, min and max replicas, and whether another controller or manual process is fighting over the replica count.
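When the HPA status looks healthy but the replica count does not move, it can help to redo the calculation by hand. The sketch below assumes the documented formula, desired = ceil(current replicas x current metric / target metric), and a default tolerance of 10%; it ignores stabilization windows and scaling policies, and the function names are hypothetical.

```python
import math


def cpu_utilization(usage_millicores, request_millicores):
    """Utilization as a percentage of the request, or None without a request."""
    if not request_millicores:
        # Without a request there is nothing to measure utilization against.
        return None
    return 100 * usage_millicores / request_millicores


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas, tolerance=0.1):
    """Replica count the formula would ask for, clamped to min and max."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas      # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(cpu_utilization(450, 500))               # percent of request
print(desired_replicas(4, 90, 50, 2, 10))      # well above target
print(desired_replicas(4, 52, 50, 2, 10))      # inside the tolerance band
```

A None from cpu_utilization corresponds to the case where the HPA cannot compute a utilization metric because the container has no request set.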
A Pod is stuck in ContainerCreating. What do you check?
Describe the Pod and inspect events. Common causes include CNI setup failure, volume mount or attach issues, image pull delay, missing Secrets or ConfigMaps, node disk pressure, runtime problems, sandbox creation failures, and kubelet errors.
A cluster upgrade failed mid-way. What do you do?
Stop and capture the current state. Check kubeadm upgrade plan, component versions, control-plane Pod status, kubelet logs, and API health. Roll back kubelet or kubectl packages only if needed, restore an etcd snapshot only if control-plane state is corrupted, then re-run the failed upgrade step once the blocker is cleared.
A NetworkPolicy is blocking traffic unexpectedly. How do you debug?
Check whether a default deny policy exists, inspect both ingress and egress rules, verify namespace selectors and Pod labels, confirm required ports and protocols, allow DNS if needed, and use kubectl exec with curl or nc to test from the real source Pod.
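The way a default deny policy combines with allow rules is easier to see in a small model. This is a simplified sketch, assuming all Pods are in one namespace and rules use only equality-based podSelector labels and numeric ports; the data layout and the ingress_allowed function are hypothetical and much simpler than the real NetworkPolicy schema.

```python
def matches(selector, labels):
    """An empty selector matches every Pod."""
    return all(labels.get(key) == value for key, value in selector.items())


def ingress_allowed(dest_labels, source_labels, port, policies):
    """Evaluate ingress to a Pod under additive, allow-list semantics."""
    selecting = [
        p for p in policies
        if "Ingress" in p["policy_types"] and matches(p["pod_selector"], dest_labels)
    ]
    if not selecting:
        return True     # no policy selects the Pod, so it is not isolated
    for policy in selecting:
        for rule in policy["ingress"]:
            from_ok = rule.get("from") is None or any(
                matches(sel, source_labels) for sel in rule["from"])
            port_ok = rule.get("ports") is None or port in rule["ports"]
            if from_ok and port_ok:
                return True     # any single matching rule is enough
    return False


default_deny = {"pod_selector": {}, "policy_types": ["Ingress"], "ingress": []}
allow_web = {
    "pod_selector": {"app": "api"},
    "policy_types": ["Ingress"],
    "ingress": [{"from": [{"app": "web"}], "ports": [8080]}],
}
api, web, batch = {"app": "api"}, {"app": "web"}, {"app": "batch"}
print(ingress_allowed(api, web, 8080, [default_deny]))
print(ingress_allowed(api, web, 8080, [default_deny, allow_web]))
print(ingress_allowed(api, batch, 8080, [default_deny, allow_web]))
```

The model shows why a wrong label or port on one side of the rule looks the same as "blocked by default deny": the traffic is refused unless some rule matches both the source and the port.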
Cluster autoscaler is not adding nodes. What do you check?
Check cloud provider permissions, autoscaler logs, node group max size, pending Pods with Unschedulable events, resource requests, taints and tolerations, affinity rules, quota limits, and whether Pods request resources that no node group can satisfy.
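A quick way to test the "no node group can satisfy this Pod" case is to compare the Pod's requests and tolerations against each node group. The sketch below is an illustration with hand-written data: it assumes one allocatable figure per node group, treats taints and tolerations as plain strings that must match exactly, and ignores affinity and quota. None of the names come from the cluster autoscaler itself.

```python
def check_node_groups(pod_request, pod_tolerations, node_groups):
    """Map each node group name to "ok" or the first reason it cannot help."""
    verdicts = {}
    for group in node_groups:
        alloc = group["allocatable"]
        untolerated = set(group["taints"]) - set(pod_tolerations)
        if group["size"] >= group["max_size"]:
            verdicts[group["name"]] = "at max size"
        elif untolerated:
            verdicts[group["name"]] = "untolerated taint"
        elif pod_request["cpu_m"] > alloc["cpu_m"]:
            verdicts[group["name"]] = "cpu request larger than a node"
        elif pod_request["memory_mi"] > alloc["memory_mi"]:
            verdicts[group["name"]] = "memory request larger than a node"
        else:
            verdicts[group["name"]] = "ok"
    return verdicts


node_groups = [
    {"name": "general", "allocatable": {"cpu_m": 3900, "memory_mi": 15000},
     "taints": [], "size": 5, "max_size": 5},
    {"name": "highmem", "allocatable": {"cpu_m": 7900, "memory_mi": 62000},
     "taints": ["dedicated=highmem:NoSchedule"], "size": 1, "max_size": 4},
    {"name": "small", "allocatable": {"cpu_m": 1900, "memory_mi": 7000},
     "taints": [], "size": 2, "max_size": 10},
]
pod = {"cpu_m": 2000, "memory_mi": 8000}
print(check_node_groups(pod, [], node_groups))
```

If no group comes back as "ok", adding nodes would not make the Pod schedulable, so the fix is in the Pod spec or the node group definitions rather than in the autoscaler.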
A Pod is OOMKilled repeatedly. What do you do?
Check the logs of the previous container instance (kubectl logs --previous), memory requests and limits, actual usage, memory leaks, app heap settings, and node memory pressure. Mitigate by reducing app memory usage, increasing limits where justified, using VPA recommendations to right-size requests, or splitting workload responsibilities.
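Comparing actual usage against the limit means converting Kubernetes memory quantities to bytes first. This is a minimal sketch that handles only plain byte counts and the common binary (Ki, Mi, Gi, Ti) and decimal (k, M, G, T) suffixes; the helper names are hypothetical and the parser does not cover every quantity form Kubernetes accepts.

```python
SUFFIXES = {
    "Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3, "Ti": 1024 ** 4,
    "k": 1000, "M": 1000 ** 2, "G": 1000 ** 3, "T": 1000 ** 4,
}


def parse_memory(quantity):
    """Convert a quantity such as "512Mi" or "1G" to a number of bytes."""
    for suffix, factor in SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)    # no suffix: already a byte count


def limit_usage_ratio(usage, limit):
    """Fraction of the memory limit currently in use."""
    return parse_memory(usage) / parse_memory(limit)


print(parse_memory("512Mi"))
print(limit_usage_ratio("768Mi", "1Gi"))
```

A ratio that climbs steadily toward 1.0 between restarts points at a leak or an undersized limit, while a sudden jump suggests a burst that heap settings or request batching should bound.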
An Ingress is returning 502 errors. What do you check?
Check backend Service endpoints, Pod readiness, targetPort, app listener port, Ingress controller logs, path and host rules, TLS certificate validity, upstream timeout settings, and whether the controller can reach backend Pods.
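A mismatch between targetPort and the port the app listens on is a common source of 502s, and it can be checked mechanically. The sketch below assumes the documented behavior that targetPort defaults to the Service port and that a named targetPort is looked up among the Pod's container port names; the dictionaries are hand-written stand-ins and resolve_target_port is a hypothetical helper.

```python
def resolve_target_port(service_port, container_ports):
    """Return the numeric port traffic is sent to, or None if unresolvable."""
    target = service_port.get("targetPort", service_port["port"])
    if isinstance(target, int):
        return target
    for port in container_ports:
        if port.get("name") == target:
            return port["containerPort"]
    return None     # the named port does not exist on the Pod


container_ports = [{"name": "http", "containerPort": 8080}]
app_listens_on = 8080   # assumption: taken from the app's own configuration

for svc_port in (
    {"port": 80, "targetPort": "http"},
    {"port": 80, "targetPort": 8000},
    {"port": 80},
    {"port": 80, "targetPort": "web"},
):
    resolved = resolve_target_port(svc_port, container_ports)
    print(svc_port, "->", resolved, "match" if resolved == app_listens_on else "MISMATCH")
```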
A PVC is stuck in Pending. What do you check?
Check that the StorageClass exists, the access mode is supported, the requested size is valid, the provisioner is healthy, quota is available, and the zone or region matches the scheduled Pod. Also check volumeBindingMode: with WaitForFirstConsumer the PVC stays Pending until a Pod that uses it is scheduled. Describe both the PVC and the Pod.
The cluster has high DNS latency. What do you check?
Check CoreDNS CPU and memory, CoreDNS logs, upstream DNS latency, NodeLocal DNS cache health if used, kube-dns Service endpoints, NetworkPolicy blocking DNS, noisy clients, search path amplification, and CNI packet loss.
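Search path amplification is easiest to see by listing the names a resolver may try for one lookup. The sketch below models the usual resolver behavior with a search list and an ndots option: names with fewer dots than ndots go through the search list first, and a trailing dot skips the search list entirely. The search domains and ndots value shown are assumptions that mirror a typical Pod resolv.conf, and candidate_queries is a hypothetical helper.

```python
def candidate_queries(name, search, ndots=5):
    """List the fully qualified names a resolver may try, in order."""
    if name.endswith("."):
        return [name]                   # already absolute: one query only
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute] + expanded    # try as-is first
    return expanded + [absolute]        # walk the search list first


search = ["shop.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for query in candidate_queries("api.example.com", search):
    print(query)
```

The resolver stops at the first successful answer, so the list is a worst case; an external name such as api.example.com can still cost several failed lookups before the real one, and each candidate may be queried for both A and AAAA records. Using a trailing dot or lowering ndots for chatty clients removes the extra queries.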
A Pod cannot reach another Pod in a different namespace. What do you check?
Check NetworkPolicy in both namespaces, CNI routing, Pod IP reachability, Service DNS name, namespace-qualified Service name, target Pod readiness, port listeners, and whether egress or ingress policy allows the path.
A Deployment keeps rolling back automatically. Why might that happen?
Kubernetes does not roll back a Deployment on its own, so look for a higher-level deploy tool or pipeline that triggers a rollback when the new ReplicaSet fails health checks. Common causes include readiness probe failure, crash loops, failed post-deploy checks, insufficient capacity, or a progressive delivery controller marking the rollout unhealthy.
The API server has high load. What do you check?
Check request rate, long-running watches, too many watch clients, controllers in tight reconcile loops, large or numerous custom resources, excessive kubectl polling, admission webhook latency, metrics scraping frequency, audit logging volume, and etcd latency.
An application needs zero downtime during upgrades. What do you configure?
Use a RollingUpdate strategy, correct readiness probes, graceful shutdown hooks, terminationGracePeriodSeconds, PodDisruptionBudgets, enough replicas and capacity, proper resource requests, and compatibility between old and new versions during the rollout window.
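Graceful shutdown only works if the preStop hook and the application's own drain both fit inside terminationGracePeriodSeconds, because the grace period covers both. The check below is a minimal sketch with hypothetical names; the timings are values you would measure or read from your own manifests.

```python
def shutdown_budget(termination_grace_period_s, pre_stop_s, app_drain_s):
    """Compare the time shutdown needs with the time the Pod is given."""
    needed = pre_stop_s + app_drain_s
    return {
        "needed_s": needed,
        "spare_s": termination_grace_period_s - needed,
        # True means the container may be killed before it finishes draining.
        "sigkill_risk": needed > termination_grace_period_s,
    }


print(shutdown_budget(30, pre_stop_s=10, app_drain_s=15))
print(shutdown_budget(30, pre_stop_s=15, app_drain_s=20))
```

When sigkill_risk is true, in-flight requests can be cut off during a rollout even though the strategy, probes, and replica count are all correct.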
See also
- Debugging and Nodes — Troubleshooting & Operations Knowledge
- CI/CD and GitOps — DevOps & CI/CD Scenarios Knowledge
- SRE and Platform — SRE & Platform Scenarios Knowledge
- crictl and kubectl debug — Container Runtime Debugging
- DNS and Service Discovery — CoreDNS
- Runbook and nvidia-smi — GPU Diagnostics Runbook