NVIDIA GPU Operator Stack Architecture
The GPU Operator is a controller that installs and reconciles NVIDIA stack components on each GPU node: driver (or pre-installed driver mode), container toolkit, device plugin, DCGM exporter, GPU Feature Discovery, and optional MIG manager. On EKS, install via Helm after picking GPU instance types; verify nvidia.com/gpu allocatable before scheduling training pods.
Stack architecture
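The operator does not replace the kubelet. It ensures the required daemons run on every GPU node so that the device plugin can register nvidia.com/gpu with the kubelet; the kubelet then advertises it as an extended resource, and the scheduler uses that capacity to place pods.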
Figure 1 — Reconciliation order on a GPU worker; failure at any layer blocks healthy nvidia.com/gpu capacity.
Component map
| Component | Purpose | SRE signal |
|---|---|---|
| Driver daemonset | Kernel module / driver user-space | Node NotReady, driver pod in CrashLoopBackOff |
| Device plugin | Advertises nvidia.com/gpu | Pods Pending “Insufficient nvidia.com/gpu” |
| DCGM Exporter | GPU metrics for Prometheus | See telemetry pillar |
| MIG Manager | Partitions A100/H100 into MIG slices | See scheduling & MIG |
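Most of the signals in the table surface first in the operator namespace's pod list. As a rough triage aid, the sketch below filters that list down to pods that are not healthy. `unhealthy_pods` is a hypothetical helper name, not part of the operator; it assumes the default `kubectl get pods` column layout (NAME, READY, STATUS, ...) and the gpu-operator namespace used in this document.

```shell
# Hypothetical helper (not shipped with the GPU Operator).
# Reads the default `kubectl get pods` table on stdin and prints
# "<pod-name> <status>" for every pod that is neither Running nor Completed.
unhealthy_pods() {
  # NR > 1 skips the header row; $1 is NAME, $3 is STATUS.
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage against a live cluster:
#   kubectl get pods -n gpu-operator | unhealthy_pods
```

An empty result only means no pod is in an obviously bad state; it does not prove that nvidia.com/gpu capacity is being advertised, so the allocatable check later in this section is still needed.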
Helm install shape (EKS)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
kubectl create namespace gpu-operator --dry-run=client -o yaml | kubectl apply -f -
helm upgrade --install gpu-operator nvidia/gpu-operator \
-n gpu-operator \
-f gpu-operator-values.yaml \
--wait

# gpu-operator-values.yaml
# EKS with AL2023 AMIs: often use pre-installed driver on GPU-optimized AMI
driver:
  enabled: true # Set to false when the AMI already ships the NVIDIA driver (pre-installed driver mode)
  # useOpenKernelModules: true # Uncomment when matching NVIDIA docs for your AMI
devicePlugin:
  enabled: true
dcgmExporter:
  enabled: true
  serviceMonitor:
    enabled: true # Requires Prometheus Operator CRDs
migManager:
  enabled: false # Enable when using MIG profiles on A100/H100 nodes
gfd:
  enabled: true # GPU Feature Discovery: node labels for GPU-aware scheduling

Verify on the node
kubectl get pods -n gpu-operator
kubectl get nodes -o json | jq '.items[] | {name:.metadata.name, gpu:(.status.allocatable["nvidia.com/gpu"]//"0")}'

Above the operator (footnote)
Once nvidia.com/gpu is allocatable, ML teams may deploy Kubeflow or the Kubeflow Training Operator to submit training Jobs. Those projects are not part of this operator install; treat them as application-layer consumers.
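Because those consumers depend on allocatable GPU capacity, a handoff script can gate on it before any training workload is deployed. The following is a minimal sketch under stated assumptions: `has_allocatable_gpu` is a hypothetical helper name, and it expects "node-name count" lines on stdin, which `kubectl` can produce with a custom-columns output (nodes without the resource show `<none>`).

```shell
# Hypothetical gate (not part of the GPU Operator or Kubeflow).
# Reads "<node-name> <gpu-count>" lines on stdin and exits 0 only if
# at least one node reports a positive nvidia.com/gpu count.
has_allocatable_gpu() {
  # Non-numeric values such as "<none>" are ignored by the regex guard.
  awk '$2 ~ /^[0-9]+$/ && $2 > 0 { found = 1 } END { exit (found ? 0 : 1) }'
}

# Usage against a live cluster:
#   kubectl get nodes --no-headers \
#     -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#     | has_allocatable_gpu && echo "GPU capacity is advertised"
```

The check only confirms that capacity is advertised; it says nothing about whether those GPUs are already claimed by other pods.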