GPU Diagnostics Runbook — K8s SRE Reference

TL;DR

When GPU workloads fail, triage in this order: Pod events → node allocatable → device plugin / operator pods → nvidia-smi on the node. On Kubernetes, run nvidia-smi from a debug pod pinned to the suspect node; on EKS, connect to the EC2 instance over SSM when host access from the cluster is blocked. Correlate what you find with DCGM XID metrics.

First-pass triage

gpu-first-pass.sh (bash)

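POD=my-training-job-0
NS=ml
NODE=gpu-node-1   # suspect GPU node; replace with the real node name
# Scheduling and runtime events for the failing pod
kubectl describe pod "$POD" -n "$NS" | grep -A12 Events
# Nodes labeled as having a GPU
kubectl get nodes -l nvidia.com/gpu.present=true -o wide
# Allocatable should list nvidia.com/gpu; -A10 so the GPU line is not cut off
kubectl describe node "$NODE" | grep -A10 "Allocatable"
# Operator, device plugin, and driver pods
kubectl get pods -n gpu-operator -o wide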
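
To compare allocatable GPUs across all GPU nodes at once, the per-node describe can be replaced by a single table. The sketch below is one way to do it, not part of the original runbook: `flag_gpu_less_nodes` is a hypothetical helper that reads a two-column `NODE GPU` table on stdin and prints the nodes that advertise no GPU.

```shell
# flag_gpu_less_nodes (hypothetical helper): read a "NODE GPU" table on stdin,
# skip the header row, and print nodes whose allocatable GPU count is 0 or
# missing (kubectl prints <none> for a missing field).
flag_gpu_less_nodes() {
  awk 'NR > 1 && ($2 == "0" || $2 == "<none>") { print $1 }'
}

# Usage against a live cluster (assumes the nvidia.com/gpu.present label is set):
#   kubectl get nodes -l nvidia.com/gpu.present=true \
#     -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#     | flag_gpu_less_nodes
```

Any node this prints is labeled as a GPU node but has no schedulable GPU, which usually points at the device plugin rather than at the workload.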

Run nvidia-smi via debug pod

Preferred on managed clusters where SSH is disabled:

nvidia-smi-debug-pod.yaml (yaml)

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-check
  namespace: default
spec:
  restartPolicy: Never
  nodeSelector:
    kubernetes.io/hostname: gpu-node-1   # Pin to suspect node
  tolerations:
    - key: sku                           # Example taint; match the taint on your GPU node pool
      operator: Equal
      value: gpu
      effect: NoSchedule
  containers:
    - name: nvidia-smi
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1

nvidia-smi-logs.sh (bash)

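kubectl apply -f nvidia-smi-debug-pod.yaml
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/nvidia-smi-check -n default --timeout=120s
kubectl logs nvidia-smi-check -n default
kubectl delete pod nvidia-smi-check -n default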
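
The default nvidia-smi table is awkward to scan when a node has many GPUs. As a rough sketch, assuming the debug pod's command is changed to `nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv,noheader,nounits`, the CSV rows can be filtered for GPUs that are close to full. `flag_full_gpus` and its 90% default threshold are illustrative choices, not values from this runbook.

```shell
# flag_full_gpus (hypothetical helper): read CSV rows of
#   index, name, memory.used, memory.total   (MiB, no header, no units)
# on stdin and print GPUs whose memory use is at or above the threshold
# percentage given as $1 (default 90).
flag_full_gpus() {
  local threshold="${1:-90}"
  awk -F', *' -v t="$threshold" '
    $4 > 0 {
      pct = 100 * $3 / $4
      if (pct >= t) printf "GPU %s (%s): %.0f%% used\n", $1, $2, pct
    }'
}

# Usage, once the pod has completed:
#   kubectl logs nvidia-smi-check -n default | flag_full_gpus 90
```

A GPU that sits near 100% while the workload reports CUDA OOM suggests the job itself is exhausting VRAM rather than the node being unhealthy.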

Host-level checks (bare-metal / break-glass)

host-gpu-check.sh (bash)

# On the GPU worker (SSM on EKS, SSH on-prem)
nvidia-smi
nvidia-smi -q | grep -A3 "Product Name"
dmesg | grep -iE "nvrm|xid" | tail -50   # NVIDIA kernel driver messages; filter first, then keep the latest 50 (may need sudo)
systemctl status nvidia-persistenced 2>/dev/null || true
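
When the kernel log contains many NVRM lines, it helps to reduce them to the affected PCI address and XID code before looking the code up. The helper below is a minimal sketch that assumes the common driver log shape `NVRM: Xid (PCI:<address>): <code>, ...`; `extract_xids` is a hypothetical name, and lines in any other shape are silently skipped.

```shell
# extract_xids (hypothetical helper): read kernel log lines on stdin and print
# "<pci-address> <xid-code>" for every line that looks like
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, ...
extract_xids() {
  sed -n 's/.*NVRM: Xid (PCI:\([^)]*\)): \([0-9][0-9]*\),.*/\1 \2/p'
}

# Usage on the GPU worker:
#   dmesg | extract_xids | sort | uniq -c
```

Counting repeats per address and code shows whether a single GPU is failing repeatedly or several GPUs on the node are affected.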

Symptom → cause → fix

Symptom	Likely cause	Action
`Insufficient nvidia.com/gpu`	No free GPUs, or pod targets the wrong node pool	Scale the GPU NodePool; check the pod's GPU requests
`FailedCreatePodSandBox`	Container toolkit or driver problem	Restart GPU Operator components; verify the node AMI
Pod runs but hits CUDA OOM	VRAM exceeded	Reduce batch size or use a GPU with more memory; check DCGM_FI_DEV_FB_USED
`nvidia-smi` not found in app container	Image lacks CUDA user-space tools	Use an NVIDIA base image; don't rely on a minimal distroless image for debugging
XID errors in metrics/logs	Hardware or driver fault	Cordon the node; replace the instance; open a vendor support case
GPU count 0 on node	Device plugin not registered	Check GPU Operator and device plugin logs
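
The table can also be applied mechanically to event or log text. The function below is an illustrative sketch only: `classify_gpu_symptom` and its short action labels are hypothetical, and the substring patterns are assumptions about typical message wording rather than an exhaustive match.

```shell
# classify_gpu_symptom (hypothetical helper): map one line of event or log text
# to a short action label that mirrors the rows of the symptom table.
classify_gpu_symptom() {
  local text="$1"
  case "$text" in
    *"Insufficient nvidia.com/gpu"*) echo "scale-gpu-pool" ;;
    *FailedCreatePodSandBox*)        echo "check-toolkit-driver" ;;
    *"out of memory"*|*OOM*)         echo "reduce-vram-usage" ;;
    *nvidia-smi*"not found"*)        echo "use-nvidia-base-image" ;;
    *Xid*|*XID*)                     echo "cordon-and-replace" ;;
    *)                               echo "unknown" ;;
  esac
}

# Usage: classify the most recent event message for a pod
#   classify_gpu_symptom "$(kubectl get events -n ml \
#     --field-selector involvedObject.name=my-training-job-0 \
#     -o jsonpath='{.items[-1:].message}')"
```

An `unknown` result means the text matched none of the rows; fall back to the manual triage order in that case.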

Device plugin & operator logs

operator-logs.sh (bash)

kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset --tail=100
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset --tail=100

Diagnostic decision flow

Figure 1: Stop once the root cause is identified; avoid rebooting the entire cluster for one bad GPU node.
