TL;DR

The GPU Operator is a controller that installs and reconciles the NVIDIA stack on each GPU node: the driver (or a pre-installed driver, if the AMI provides one), container toolkit, device plugin, DCGM exporter, GPU Feature Discovery, and the optional MIG manager. On EKS, pick GPU instance types first, then install via Helm, and verify that nvidia.com/gpu is allocatable before scheduling training pods.

Stack architecture

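The operator does not replace the kubelet. It keeps the node-level daemons running so that the device plugin can register GPUs with the kubelet, which then advertises them as the nvidia.com/gpu extended resource that the scheduler places pods against.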

[Diagram: stack layers, top to bottom]
GPU Operator (control)
NVIDIA Driver (or use pre-installed)
NVIDIA Container Toolkit
Device Plugin → nvidia.com/gpu
DCGM Exporter
GPU Feature Discovery
MIG Manager (optional)
GPU workload pods (CUDA)

Figure 1 — Reconciliation order on a GPU worker; failure at any layer blocks healthy nvidia.com/gpu capacity.

Component map

Component | Purpose | SRE signal
Driver daemonset | Kernel module / driver user-space | Node NotReady, driver pod CrashLoop
Device plugin | Advertises nvidia.com/gpu | Pods Pending “Insufficient nvidia.com/gpu”
DCGM Exporter | GPU metrics for Prometheus | See telemetry pillar
MIG Manager | Partitions A100/H100 into MIG slices | See scheduling & MIG
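
The signals in the table usually show up first as a DaemonSet whose ready count lags its desired count. The following is a minimal sketch, assuming the gpu-operator namespace used in this guide; the function name not_ready_daemonsets is hypothetical, and it only filters text from stdin so it can be tried against canned input before pointing it at a cluster.

```shell
# Print DaemonSets whose ready pod count differs from the desired count.
# Input: lines of "NAME DESIRED READY" (whitespace separated) on stdin.
not_ready_daemonsets() {
  awk 'NF >= 3 && $2 != $3 { printf "%s ready=%s desired=%s\n", $1, $3, $2 }'
}

# Usage against a live cluster (requires kubectl access):
# kubectl get daemonsets -n gpu-operator --no-headers \
#   -o custom-columns='NAME:.metadata.name,DESIRED:.status.desiredNumberScheduled,READY:.status.numberReady' \
#   | not_ready_daemonsets
```

An empty result means every DaemonSet in the namespace reports as many ready pods as it wants; any line printed names the layer to inspect first.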

Helm install shape (EKS)

gpu-operator-helm.sh (bash)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
kubectl create namespace gpu-operator --dry-run=client -o yaml | kubectl apply -f -

helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  -f gpu-operator-values.yaml \
  --wait
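
gpu-operator-values.yaml (yaml)
# driver.enabled: true lets the operator install and manage the driver.
# On EKS GPU-optimized AMIs (for example AL2023) that ship a pre-installed
# driver, set this to false instead.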
driver:
  enabled: true
  # useOpenKernelModules: true  # Uncomment when matching NVIDIA docs for your AMI

devicePlugin:
  enabled: true

dcgmExporter:
  enabled: true
  serviceMonitor:
    enabled: true  # Requires Prometheus Operator CRDs

migManager:
  enabled: false   # Enable when using MIG profiles on A100/H100 nodes

gfd:
  enabled: true    # GPU Feature Discovery: labels nodes with GPU properties (model, memory, MIG capability) for nodeSelector/affinity rules
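
When the node AMI already ships the NVIDIA driver (and possibly the container toolkit), the operator's own driver and toolkit containers are typically switched off so they do not collide with the host install. The sketch below builds those overrides in shell. driver.enabled and toolkit.enabled are chart values; the function name and its two arguments are hypothetical. Check NVIDIA's documentation for your AMI to confirm what is actually pre-installed.

```shell
# Emit --set overrides for components the AMI already provides.
# $1: "true" if the AMI ships the NVIDIA driver
# $2: "true" if the AMI ships the NVIDIA Container Toolkit
preinstalled_overrides() {
  local flags=""
  if [ "$1" = "true" ]; then flags="--set driver.enabled=false"; fi
  if [ "$2" = "true" ]; then flags="${flags:+$flags }--set toolkit.enabled=false"; fi
  printf '%s\n' "$flags"
}

# Usage (word splitting of the output is intentional here):
# helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator \
#   -f gpu-operator-values.yaml $(preinstalled_overrides true true) --wait
```

Values passed with --set take precedence over the values file, so the file can stay the shared default while per-cluster differences live in the install script.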

Verify on the node

verify-gpu-operator.sh (bash)
kubectl get pods -n gpu-operator
kubectl get nodes -o json | jq '.items[] | {name:.metadata.name, gpu:(.status.allocatable["nvidia.com/gpu"]//"0")}'
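
Allocatable capacity shows that the device plugin registered GPUs, but it does not prove a container can reach them. A minimal smoke-test sketch follows: a one-shot pod that requests one GPU and runs nvidia-smi. The function name, the pod name gpu-smoke-test, and the image argument are assumptions for illustration; supply a CUDA base image that matches your driver, and add tolerations if your GPU nodes are tainted.

```shell
# Print a one-shot Pod manifest that requests a single GPU and runs nvidia-smi.
# $1: CUDA base image to use (assumption: choose a tag compatible with your driver).
gpu_smoke_pod_manifest() {
  local image="$1"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: ${image}
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}

# Usage against a live cluster:
# gpu_smoke_pod_manifest "<your-cuda-image>" | kubectl apply -f -
# kubectl logs gpu-smoke-test   # once the pod has completed
```

If the pod stays Pending, the device plugin layer is the first suspect; if it starts but nvidia-smi fails, look at the driver and container toolkit layers instead.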

Version skew: check the GPU Operator release notes to confirm that the chart version you deploy supports your Kubernetes minor version and container runtime. Pin the Helm chart version in Git, as you would for any other production Helm asset.
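
One way to enforce pinning is to have the install script refuse to run without an explicit chart version. This is a sketch under stated assumptions: the function name pinned_install_args and the file chart-version.txt are hypothetical, and the version itself should come from whatever your Git repository tracks.

```shell
# Refuse to proceed unless an explicit chart version is supplied.
# $1: chart version string, e.g. read from a file tracked in Git.
pinned_install_args() {
  local version="$1"
  if [ -z "$version" ]; then
    echo "error: chart version is not pinned" >&2
    return 1
  fi
  printf -- '--version %s\n' "$version"
}

# Usage:
# helm search repo nvidia/gpu-operator --versions   # list available chart versions
# args="$(pinned_install_args "$(cat chart-version.txt)")" || exit 1
# helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator \
#   -f gpu-operator-values.yaml $args --wait
```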

Above the operator (footnote)

Once nvidia.com/gpu is allocatable, ML teams may deploy Kubeflow or the Kubeflow Training Operator to submit Jobs. Those projects are not part of this operator install; treat them as application-layer consumers.

See also