NVIDIA GPU Operator Stack — K8s SRE Reference

TL;DR

The GPU Operator is a controller that installs and reconciles NVIDIA stack components on each GPU node: driver (or pre-installed driver mode), container toolkit, device plugin, DCGM exporter, GPU Feature Discovery, and optional MIG manager. On EKS, install via Helm after picking GPU instance types; verify nvidia.com/gpu allocatable before scheduling training pods.

Stack architecture

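The operator does not replace the kubelet. It ensures the required daemonsets exist so that the device plugin can register nvidia.com/gpu with the kubelet as an extended resource, which the scheduler then uses to place pods.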

Figure 1 — Reconciliation order on a GPU worker; failure at any layer blocks healthy nvidia.com/gpu capacity.

Component map

Component	Purpose	SRE signal
Driver daemonset	Kernel module / driver user-space	Node NotReady, driver pod CrashLoop
Device plugin	Advertises `nvidia.com/gpu`	Pods Pending “Insufficient nvidia.com/gpu”
DCGM Exporter	GPU metrics for Prometheus	See telemetry pillar
MIG Manager	Partitions A100/H100 into MIG slices	See scheduling & MIG
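The device plugin signal in the table can be checked from scheduler events. The following is a minimal sketch, assuming kubectl is on PATH and pointed at the target cluster; the function name `pods_blocked_on_gpu` is hypothetical and not part of any NVIDIA tooling.

```shell
# Sketch: list pods whose scheduling failed for lack of nvidia.com/gpu.
# pods_blocked_on_gpu is an illustrative helper name, not an official tool.
pods_blocked_on_gpu() {
  # FailedScheduling events carry the scheduler's explanation in .message.
  kubectl get events --all-namespaces \
    --field-selector reason=FailedScheduling \
    -o custom-columns='NS:.involvedObject.namespace,POD:.involvedObject.name,MSG:.message' \
    --no-headers |
    awk '/Insufficient nvidia\.com\/gpu/ { print $1 "/" $2 }' |
    sort -u
}
```

Calling `pods_blocked_on_gpu` prints one `namespace/pod` entry per affected pod. An empty result means no recent scheduling failure mentioned GPU capacity; events expire, so it does not prove that capacity is healthy.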

Helm install shape (EKS)

gpu-operator-helm.sh (bash)

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
kubectl create namespace gpu-operator --dry-run=client -o yaml | kubectl apply -f -

helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  -f gpu-operator-values.yaml \
  --wait

gpu-operator-values.yaml (yaml)

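# EKS with AL2023 AMIs: GPU-optimized AMIs often ship a pre-installed driver.
# In that case set driver.enabled to false so the operator does not deploy its own driver daemonset.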
driver:
  enabled: true
  # useOpenKernelModules: true  # Uncomment when matching NVIDIA docs for your AMI

devicePlugin:
  enabled: true

dcgmExporter:
  enabled: true
  serviceMonitor:
    enabled: true  # Requires Prometheus Operator CRDs

migManager:
  enabled: false   # Enable when using MIG profiles on A100/H100 nodes

gfd:
  enabled: true    # GPU Feature Discovery: labels nodes with GPU properties (product, memory, MIG capability) for label-based scheduling
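The `serviceMonitor` setting above depends on the Prometheus Operator CRDs being installed. A minimal sketch of a pre-flight check follows, assuming kubectl access to the cluster; the helper names `servicemonitor_crd_present` and `dcgm_servicemonitor_flag` are hypothetical.

```shell
# Sketch: only enable the DCGM exporter ServiceMonitor when its CRD exists.
# Both function names are illustrative, not part of the chart or kubectl.
servicemonitor_crd_present() {
  # The Prometheus Operator registers this CRD name.
  kubectl get crd servicemonitors.monitoring.coreos.com >/dev/null 2>&1
}

dcgm_servicemonitor_flag() {
  # Prints a helm --set argument that matches what the cluster can accept.
  if servicemonitor_crd_present; then
    echo "--set dcgmExporter.serviceMonitor.enabled=true"
  else
    echo "--set dcgmExporter.serviceMonitor.enabled=false"
  fi
}
```

One way to use it is to append `$(dcgm_servicemonitor_flag)` to the `helm upgrade --install` command. Values passed with `--set` take precedence over the values file, so the install does not fail on a cluster without the CRD.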

Verify on the node

verify-gpu-operator.sh (bash)

kubectl get pods -n gpu-operator
kubectl get nodes -o json | jq '.items[] | {name:.metadata.name, gpu:(.status.allocatable["nvidia.com/gpu"]//"0")}'
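For use in CI or a runbook, the same check can be reduced to a list of GPU nodes that advertise no capacity. This is a sketch that avoids the jq dependency. It assumes the operator has labelled GPU nodes with `nvidia.com/gpu.present=true`; confirm that label on your cluster, or pass a different selector. The function name `gpu_nodes_missing_capacity` is hypothetical.

```shell
# Sketch: print GPU-labelled nodes whose allocatable nvidia.com/gpu is absent or 0.
# The default label selector is an assumption; override it as the first argument.
gpu_nodes_missing_capacity() {
  selector="${1:-nvidia.com/gpu.present=true}"
  kubectl get nodes -l "$selector" \
    -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
    --no-headers |
    # kubectl prints <none> when the field is missing from the node status.
    awk '$2 == "<none>" || $2 == "0" { print $1 }'
}
```

An empty result means every selected node reports at least one allocatable GPU. Any node name printed is a candidate for inspecting the driver and device plugin pods on that node.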

⚠ Version skew: Confirm in the GPU Operator release notes that the chart version you deploy supports your Kubernetes minor version and container runtime. Pin the Helm chart version in Git, as you would for other production Helm assets.
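Pinning can be made explicit with Helm's `--version` flag. The sketch below wraps the install command from this page; the function name `install_pinned_gpu_operator` is hypothetical, and the chart version is deliberately left as a required argument rather than guessed.

```shell
# Sketch: refuse to install unless a chart version is supplied explicitly.
# install_pinned_gpu_operator is an illustrative wrapper, not an official script.
install_pinned_gpu_operator() {
  chart_version="${1:-}"
  if [ -z "$chart_version" ]; then
    echo "usage: install_pinned_gpu_operator <chart-version>" >&2
    return 2
  fi
  helm upgrade --install gpu-operator nvidia/gpu-operator \
    -n gpu-operator \
    -f gpu-operator-values.yaml \
    --version "$chart_version" \
    --wait
}
```

`helm search repo nvidia/gpu-operator --versions` lists the chart versions available from the configured repository. Commit the chosen version next to the values file so that upgrades go through review.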

Above the operator (footnote)

Once nvidia.com/gpu is allocatable, ML teams may deploy Kubeflow or NVIDIA Training Operator to submit Jobs. Those projects are not part of this operator install—treat them as application-layer consumers.
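Before handing the cluster to application-layer consumers, a single pod that requests one nvidia.com/gpu can confirm end-to-end scheduling. The sketch below only renders a manifest. The pod name `gpu-smoke-test`, the function name `render_gpu_smoke_pod`, and the default image are assumptions; substitute an image you have verified and are allowed to pull.

```shell
# Sketch: render a one-shot pod manifest that requests a single GPU.
# The default image is an assumption; pass your own as the first argument.
render_gpu_smoke_pod() {
  image="${1:-nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04}"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: ${image}
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}
```

Apply it with `render_gpu_smoke_pod | kubectl apply -f -`, then check the result with `kubectl get pod gpu-smoke-test`. A pod that stays Pending points back to the device plugin signal in the component map.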
