GPU Infrastructure & MLOps on Kubernetes
Production GPU clusters on Kubernetes are a platform stack: pick the right cloud instance, install the NVIDIA GPU Operator, schedule with extended resources and taints, scale GPU nodes with Karpenter v1 on EKS, observe with DCGM Exporter, and triage with nvidia-smi and device-plugin runbooks. AWS/EKS is the primary path; GKE and AKS matrices are included for comparison.
What this series covers (and what it does not)
In scope (cluster platform layer): drivers, device plugin, MIG, node provisioning, Prometheus GPU metrics, and host-level GPU diagnostics.
Out of scope (footnote only): Kubeflow, NVIDIA Training Operator, and MPI Operator are workload orchestrators that run on top of GPU-ready nodes. They assume the platform layer described here already exposes the nvidia.com/gpu resource and reports healthy DCGM telemetry. Install them after this stack is green, not instead of it.
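One way to check the "already exposes nvidia.com/gpu" assumption before installing a workload orchestrator is to look at each node's allocatable resources. The sketch below is a minimal illustration, not part of the series' tooling: the helper names `gpu_nodes` and `fetch_node_list` are hypothetical, and the sample node data is hand-written to stand in for real `kubectl get nodes -o json` output.

```python
import json
import subprocess

GPU_RESOURCE = "nvidia.com/gpu"


def gpu_nodes(node_list: dict) -> dict:
    """Map node name -> allocatable GPU count for nodes advertising at least one GPU."""
    result = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Extended resources are reported as whole-number strings, e.g. "4".
        count = int(allocatable.get(GPU_RESOURCE, "0"))
        if count > 0:
            result[node["metadata"]["name"]] = count
    return result


def fetch_node_list() -> dict:
    """Read the live node list through kubectl (requires cluster access)."""
    completed = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)


if __name__ == "__main__":
    # Hand-written sample data (an assumption), shaped like a NodeList.
    sample = {
        "items": [
            {
                "metadata": {"name": "gpu-node-a"},
                "status": {"allocatable": {"cpu": "8", GPU_RESOURCE: "4"}},
            },
            {
                "metadata": {"name": "cpu-node-b"},
                "status": {"allocatable": {"cpu": "4"}},
            },
        ]
    }
    print(json.dumps(gpu_nodes(sample)))
```

Against a real cluster, `gpu_nodes(fetch_node_list())` returning an empty mapping suggests the device plugin is not yet advertising GPUs, which is the layer this series addresses.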
Six-pillar reading order
Figure 1 — Dependency order for standing up GPU capacity on EKS and comparable managed clusters.
Prerequisites on the cluster
- Managed Kubernetes with GPU-capable instance types available within your account quota (see the instance matrix).
- Node IAM permissions and driver policy on EKS (or GKE/AKS) that allow the GPU Operator to install NVIDIA drivers.
- Working knowledge of taints and tolerations before you dedicate node pools to GPU workloads.
- Prometheus, if you enable DCGM ServiceMonitor scraping (ServiceMonitor is a Prometheus Operator custom resource, so the operator must be installed too).
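To make the taints-and-tolerations prerequisite concrete, here is a minimal sketch that builds a Pod manifest requesting the `nvidia.com/gpu` extended resource and tolerating a GPU-pool taint. Several details are assumptions for illustration only: the helper name `gpu_pod_manifest`, the taint key `nvidia.com/gpu` with effect `NoSchedule` (a common convention, not something this series has fixed), and the placeholder image name.

```python
import json


def gpu_pod_manifest(name, image, gpus=1, taint_key="nvidia.com/gpu"):
    """Build a Pod manifest that requests GPUs and tolerates a GPU-pool taint."""
    if gpus < 1:
        raise ValueError("gpus must be a positive integer")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            # Without a matching toleration, a NoSchedule taint on the GPU pool
            # keeps this Pod off those nodes.
            "tolerations": [
                {"key": taint_key, "operator": "Exists", "effect": "NoSchedule"}
            ],
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    "command": ["nvidia-smi"],
                    # Extended resources are whole units set under limits.
                    "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # The image is a placeholder (assumption); substitute a CUDA-capable image.
    manifest = gpu_pod_manifest("gpu-smoke-test", "example.com/cuda-base:latest")
    print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON manifests, the printed output can be piped to `kubectl apply -f -` as a smoke test. The toleration only permits scheduling onto tainted nodes; it is the `nvidia.com/gpu` limit that actually steers the Pod to a node with a free GPU.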