TL;DR

When GPU workloads fail, triage in order: Pod events → node allocatable → device plugin / operator pods → nvidia-smi on the node. Use a privileged debug pod on Kubernetes; on EKS, use SSM to reach the EC2 instance when exec to the host is blocked. Correlate findings with DCGM XID metrics.

First-pass triage

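gpu-first-pass.sh (bash)
POD=my-training-job-0
NS=ml
NODE=gpu-node-1   # Suspect GPU node; replace with your node name
# Scheduling and runtime events for the failing pod
kubectl describe pod "$POD" -n "$NS" | grep -A12 Events
# Nodes labelled as having a GPU
kubectl get nodes -l nvidia.com/gpu.present=true -o wide
# nvidia.com/gpu is usually listed after cpu, storage, hugepages, and memory
kubectl describe node "$NODE" | grep -A10 "Allocatable"
# Health of the GPU Operator components
kubectl get pods -n gpu-operator -o wide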
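
To check the "node allocatable" step across all GPU nodes at once, a small filter can help. The following is a minimal sketch, assuming the nodes carry the `nvidia.com/gpu.present=true` label used above; the function names `list_gpu_allocatable` and `zero_gpu_nodes` are hypothetical helpers, not part of kubectl or the GPU Operator.

```shell
#!/usr/bin/env bash
# Sketch: flag GPU-labelled nodes that advertise no allocatable nvidia.com/gpu.
# Function names are illustrative only.

# Reads "NODE COUNT" pairs on stdin; prints nodes whose count is 0 or "<none>".
zero_gpu_nodes() {
  awk '$2 == "0" || $2 == "<none>" { print $1 }'
}

# Lists labelled GPU nodes with their allocatable GPU count (needs cluster access).
# The backslash escapes the dot inside the resource name nvidia.com/gpu.
list_gpu_allocatable() {
  kubectl get nodes -l nvidia.com/gpu.present=true --no-headers \
    -o 'custom-columns=NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}
```

Usage: `list_gpu_allocatable | zero_gpu_nodes`. Any node it prints is labelled as a GPU node but exposes no schedulable GPUs, which points at the device plugin rather than at the workload.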

Run nvidia-smi via debug pod

Preferred on managed clusters where SSH is disabled:

nvidia-smi-debug-pod.yaml (yaml)
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-check
  namespace: default
spec:
  restartPolicy: Never
  nodeSelector:
    kubernetes.io/hostname: gpu-node-1   # Pin to suspect node
  tolerations:
    - key: sku
      operator: Equal
      value: gpu
      effect: NoSchedule
  containers:
    - name: nvidia-smi
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
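
nvidia-smi-logs.sh (bash)
kubectl apply -f nvidia-smi-debug-pod.yaml       # Create the debug pod
kubectl logs nvidia-smi-check -n default         # Run once the pod has completed
kubectl delete pod nvidia-smi-check -n default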
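
The debug pod's log can also be checked automatically. The sketch below assumes the log contains either the usual nvidia-smi banner (which includes the text "Driver Version") or one of two failure messages that nvidia-smi prints itself; `nvidia_smi_log_ok` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: decide whether nvidia-smi output on stdin looks healthy.
# Returns 0 for a healthy run, 1 otherwise.
nvidia_smi_log_ok() {
  local log
  log="$(cat)"
  # Failure messages printed by nvidia-smi when the driver or devices are missing.
  if printf '%s\n' "$log" | grep -qiE 'NVIDIA-SMI has failed|No devices were found'; then
    return 1
  fi
  # A healthy run prints a banner that includes the driver version.
  printf '%s\n' "$log" | grep -q 'Driver Version'
}
```

Usage: `kubectl logs nvidia-smi-check -n default | nvidia_smi_log_ok && echo "GPU visible" || echo "GPU not usable"`. An empty log is treated as a failure, which covers the case where the container never started.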

Host-level checks (bare-metal / break-glass)

host-gpu-check.sh (bash)
# On the GPU worker (SSM on EKS, SSH on-prem)
nvidia-smi
nvidia-smi -q | grep -A3 "Product Name"
dmesg | grep -i nvrm | tail -50   # Last 50 NVIDIA kernel driver messages
systemctl status nvidia-persistenced 2>/dev/null || true
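
XID events show up in the kernel log as `NVRM: Xid` lines. The sketch below pulls out the PCI address and XID code so repeated faults on one GPU stand out. It assumes the common line shape `NVRM: Xid (PCI:<address>): <code>, ...`; the exact format may vary between driver versions, and `extract_xids` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print "<pci-address> <xid-code>" for each NVRM Xid line on stdin.
# Lines that do not match the assumed format are silently skipped.
extract_xids() {
  sed -n -E 's/.*NVRM: Xid \(([^)]*)\): ([0-9]+),.*/\1 \2/p'
}
```

Usage: `dmesg | extract_xids | sort | uniq -c`. A high count for one PCI address suggests a single faulty GPU rather than a node-wide driver problem.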

Symptom → cause → fix

| Symptom | Likely cause | Action |
|---|---|---|
| Insufficient nvidia.com/gpu | No free GPUs or wrong pool | Scale GPU NodePool; check requests |
| FailedCreatePodSandBox | Container toolkit / driver | Restart GPU Operator components; verify AMI |
| Pod runs but CUDA OOM | VRAM exceeded | Raise limit or smaller batch; check DCGM FB_USED |
| nvidia-smi not found in app container | Image lacks CUDA user-space | Use NVIDIA base image; don't debug with minimal distroless only |
| XID errors in metrics/logs | Hardware/driver fault | Cordon node; replace instance; open vendor support |
| GPU count 0 on node | Device plugin not registered | GPU Operator logs |
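
The table can be turned into a rough first-pass classifier for event or log text. This is only a sketch: the match strings for CUDA OOM and for a missing nvidia-smi binary are assumptions about typical messages, not an exhaustive list, and `suggest_gpu_action` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: map event or log text on stdin to the action from the symptom table.
# Patterns are checked in order; the first match wins.
suggest_gpu_action() {
  local text
  text="$(cat)"
  case "$text" in
    *"Insufficient nvidia.com/gpu"*)
      echo "Scale GPU NodePool; check requests" ;;
    *FailedCreatePodSandBox*)
      echo "Restart GPU Operator components; verify AMI" ;;
    *"out of memory"*|*OUT_OF_MEMORY*)   # assumed CUDA OOM wording
      echo "Raise limit or smaller batch; check DCGM FB_USED" ;;
    *nvidia-smi*"not found"*)            # assumed runtime error wording
      echo "Use NVIDIA base image" ;;
    *Xid*|*XID*)
      echo "Cordon node; replace instance; open vendor support" ;;
    *)
      echo "No known symptom matched" ;;
  esac
}
```

Usage: `kubectl describe pod "$POD" -n "$NS" | suggest_gpu_action`. Treat the result as a hint for where to look next, not as a diagnosis.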

Device plugin & operator logs

operator-logs.sh (bash)
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset --tail=100
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset --tail=100
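
Before reading logs, it can save time to list only the operator pods that are actually unhealthy. The sketch below filters `kubectl get pods --no-headers` output, assuming the default column order (NAME, READY, STATUS, ...); `unhealthy_pods` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print "NAME STATUS READY" for pods that are not healthy.
# Completed pods (for example one-shot validators) are skipped.
# A Running pod still counts as unhealthy if not all containers are ready.
unhealthy_pods() {
  awk '
    $3 == "Completed" { next }
    {
      split($2, ready, "/")
      if ($3 != "Running" || ready[1] != ready[2]) print $1, $3, $2
    }
  '
}
```

Usage: `kubectl get pods -n gpu-operator --no-headers | unhealthy_pods`. For pods in a crash loop, adding `--previous` to `kubectl logs` shows the output of the last terminated container.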

Diagnostic decision flow

Pod not healthy? → describe pod → node allocatable → gpu-operator pods → nvidia-smi on node → Replace / cordon node

Figure 1: Stop when the root cause is identified; avoid rebooting the entire cluster for one bad GPU node.
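
The final step of the flow, isolating one bad node, might be sketched as follows. Draining is an addition beyond the cordon step named in the text and will evict running jobs, so the sketch only prints the commands unless `DRY_RUN=0` is set; `run` and `quarantine_node` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: cordon and drain a single suspect GPU node.
# Commands are echoed by default; set DRY_RUN=0 to execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

quarantine_node() {
  local node="$1"
  # Stop new pods from being scheduled onto the node.
  run kubectl cordon "$node"
  # Evict existing pods; DaemonSet pods (such as the device plugin) stay in place.
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
}
```

Usage: `quarantine_node gpu-node-1` prints the two commands for review; `DRY_RUN=0 quarantine_node gpu-node-1` runs them. Replacing the instance itself depends on the platform and is left out here.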

See also