Bare-Metal & Cloud GPU Diagnostics Runbook
TL;DR
When GPU workloads fail, triage in order: Pod events → node allocatable → device plugin / operator pods → nvidia-smi on the node. Use a privileged debug pod on Kubernetes; on EKS use SSM to the EC2 instance when exec to host is blocked. Correlate with DCGM XID metrics.
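The "node allocatable" step in this order can be checked without scanning full `kubectl describe` output. The following is a minimal sketch, assuming `kubectl` is configured for the cluster; `node_gpu_allocatable` and `check_node_gpus` are hypothetical helper names introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch: report how many GPUs a node advertises as allocatable.
# Helper names are hypothetical; only kubectl itself is a real tool.

node_gpu_allocatable() {
  local node="$1" count
  # Dots inside the resource name must be escaped in kubectl's jsonpath.
  count="$(kubectl get node "$node" \
    -o jsonpath='{.status.allocatable.nvidia\.com/gpu}' 2>/dev/null)"
  # An empty result means the node does not advertise the resource at all.
  echo "${count:-0}"
}

check_node_gpus() {
  local node="$1" count
  count="$(node_gpu_allocatable "$node")"
  if [ "$count" -gt 0 ] 2>/dev/null; then
    echo "OK: $node advertises $count allocatable GPU(s)"
  else
    echo "WARN: $node advertises no allocatable GPUs; check device plugin / operator pods next"
    return 1
  fi
}
```

Calling `check_node_gpus gpu-node-1` returns non-zero when the count is zero or missing, so the next triage step can be chained with `||`.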
First-pass triage
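gpu-first-pass.sh (bash)
POD=my-training-job-0
NS=ml
NODE=gpu-node-1  # replace with the suspect GPU node
# 1. Pod events: scheduling failures, sandbox errors
kubectl describe pod "$POD" -n "$NS" | grep -A12 Events
# 2. Nodes labelled as having GPUs
kubectl get nodes -l nvidia.com/gpu.present=true -o wide
# 3. Allocatable resources; -A10 so the nvidia.com/gpu line is not cut off
kubectl describe node "$NODE" | grep -A10 "Allocatable"
# 4. GPU Operator and device plugin pods
kubectl get pods -n gpu-operator -o wide

Run nvidia-smi via debug pod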
Preferred on managed clusters where SSH is disabled:
nvidia-smi-debug-pod.yaml (YAML)
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-check
  namespace: default
spec:
  restartPolicy: Never
  nodeSelector:
    kubernetes.io/hostname: gpu-node-1 # Pin to suspect node
  tolerations:
    - key: sku
      operator: Equal
      value: gpu
      effect: NoSchedule
  containers:
    - name: nvidia-smi
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1

nvidia-smi-logs.sh (bash)
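# Create the pod, wait for nvidia-smi to finish, read its output, then clean up
kubectl apply -f nvidia-smi-debug-pod.yaml
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/nvidia-smi-check -n default --timeout=120s
kubectl logs nvidia-smi-check -n default
kubectl delete pod nvidia-smi-check -n default

Host-level checks (bare-metal / break-glass)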
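When reading kernel driver messages on the host, it can help to reduce them to a count per XID code before correlating with DCGM. The sketch below assumes the driver logs XIDs in the form `NVRM: Xid (PCI:<address>): <code>, ...`; `extract_xids` is a hypothetical helper name, and the pattern may need adjusting for a given driver version.

```shell
#!/usr/bin/env bash
# Sketch: summarize XID codes found in kernel log lines read from stdin.
# Prints one "<xid> <occurrences>" pair per line, sorted by XID number.
# Assumes the log format "NVRM: Xid (PCI:...): <code>, ...".

extract_xids() {
  grep -i 'NVRM: Xid' \
    | sed -n 's/.*Xid ([^)]*): \([0-9][0-9]*\).*/\1/p' \
    | sort -n \
    | uniq -c \
    | awk '{print $2 " " $1}'
}
```

On a GPU worker this could be fed from the kernel log, for example `sudo dmesg | extract_xids`. No output means no matching lines were found in the current log buffer.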
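host-gpu-check.sh (bash)
# On the GPU worker (SSM on EKS, SSH on-prem)
nvidia-smi
nvidia-smi -q | grep -A3 "Product Name"
# NVIDIA kernel driver messages: filter first, then keep the 50 most recent (may need sudo)
dmesg | grep -i nvrm | tail -50
# Persistence daemon status; the error is ignored if the unit is not installed
systemctl status nvidia-persistenced 2>/dev/null || true

Symptom → cause → fix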
| Symptom | Likely cause | Action |
|---|---|---|
| Insufficient nvidia.com/gpu | No free GPUs or wrong node pool | Scale the GPU NodePool; check Pod GPU requests |
| FailedCreatePodSandBox | Container toolkit or driver fault | Restart GPU Operator components; verify the AMI |
| Pod runs but hits CUDA OOM | VRAM exceeded | Raise the GPU limit or reduce batch size; check DCGM FB_USED |
| nvidia-smi not found in app container | Image lacks CUDA user-space | Use an NVIDIA base image; don't debug from a minimal distroless image alone |
| XID errors in metrics/logs | Hardware or driver fault | Cordon the node; replace the instance; open a vendor support case |
| GPU count 0 on node | Device plugin not registered | Check GPU Operator and device plugin logs |
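The table can also be applied mechanically to an event or log message. The following is a rough sketch, not a complete classifier: `suggest_gpu_action` is a hypothetical helper, and the substrings it matches (for example `out of memory`) are assumptions about how these symptoms typically appear in messages.

```shell
#!/usr/bin/env bash
# Sketch: map a pod event or log message to the action column of the table.
# The matched substrings are assumptions; extend them for your own workloads.

suggest_gpu_action() {
  local msg
  # Lowercase the input so matching is case-insensitive.
  msg="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$msg" in
    *"insufficient nvidia.com/gpu"*)
      echo "Scale GPU NodePool; check requests" ;;
    *failedcreatepodsandbox*)
      echo "Restart GPU Operator components; verify AMI" ;;
    *"out of memory"*|*out_of_memory*)
      echo "Raise limit or reduce batch size; check DCGM FB_USED" ;;
    *nvidia-smi*"not found"*)
      echo "Use NVIDIA base image" ;;
    *xid*)
      echo "Cordon node; replace instance; open vendor support" ;;
    *)
      echo "No table match; continue with operator logs and host checks" ;;
  esac
}
```

For example, `suggest_gpu_action "$(kubectl describe pod "$POD" -n "$NS")"` would check the whole describe output against the table rows in order and print the first matching action.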
Device plugin & operator logs
operator-logs.sh (bash)
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset --tail=100
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset --tail=100

Diagnostic decision flow
Figure 1: Stop once the root cause is identified; avoid rebooting the entire cluster for one bad GPU node.
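Isolating one suspect node, rather than restarting anything cluster-wide, can be sketched as follows. This assumes `kubectl` access with permission to cordon nodes; `isolate_gpu_node` is a hypothetical helper name. It deliberately stops short of draining, so running workloads are only listed, not evicted.

```shell
#!/usr/bin/env bash
# Sketch: take a single suspect GPU node out of scheduling and show what
# is still running on it. Nothing is evicted or rebooted.

isolate_gpu_node() {
  local node="$1"
  if [ -z "$node" ]; then
    echo "usage: isolate_gpu_node <node-name>" >&2
    return 2
  fi
  # Mark the node unschedulable so no new GPU pods land on it.
  kubectl cordon "$node" || return 1
  # List pods still on the node so owners can decide when to move them.
  kubectl get pods --all-namespaces \
    --field-selector "spec.nodeName=$node" -o wide
}
```

Once the node is replaced or repaired, `kubectl uncordon <node-name>` returns it to scheduling.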