GPU & MLOps on Kubernetes — K8s SRE Reference

TL;DR

Production GPU clusters on Kubernetes are a platform stack: pick the right cloud instance, install the NVIDIA GPU Operator, schedule with extended resources and taints, scale GPU nodes with Karpenter v1 on EKS, observe with DCGM Exporter, and triage with nvidia-smi and device-plugin runbooks. AWS/EKS is the primary path; GKE and AKS matrices are included for comparison.

What this series covers (and what it does not)

In scope — cluster platform layer: drivers, device plugin, MIG, node provisioning, Prometheus GPU metrics, and host-level GPU diagnostics.

Out of scope (footnote only): Kubeflow, the Kubeflow Training Operator, and the MPI Operator are workload orchestrators that run on top of GPU-ready nodes. They assume the platform below already exposes the nvidia.com/gpu extended resource and healthy DCGM telemetry. Install them after this stack is green, not instead of it.
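One way to confirm that nodes actually expose nvidia.com/gpu before installing any workload orchestrator is to inspect each node's allocatable resources. The sketch below is illustrative, not part of the series tooling: the function name gpu_nodes and the sample node names are hypothetical, and it assumes input shaped like the output of `kubectl get nodes -o json`.

```python
import json

# Extended resource name advertised by the NVIDIA device plugin.
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_nodes(node_list: dict) -> dict:
    """Return {node name: allocatable GPU count} for nodes exposing GPUs.

    `node_list` is assumed to have the shape of `kubectl get nodes -o json`:
    a dict with an "items" list of Node objects.
    """
    found = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Allocatable quantities are serialized as strings, e.g. "4".
        count = int(allocatable.get(GPU_RESOURCE, "0"))
        if count > 0:
            found[node["metadata"]["name"]] = count
    return found


# Hypothetical sample: one GPU node and one CPU-only node.
SAMPLE = {
    "items": [
        {
            "metadata": {"name": "gpu-node-a"},
            "status": {"allocatable": {"cpu": "48", GPU_RESOURCE: "4"}},
        },
        {
            "metadata": {"name": "cpu-node-b"},
            "status": {"allocatable": {"cpu": "16"}},
        },
    ]
}

print(json.dumps(gpu_nodes(SAMPLE), indent=2))
```

Against a live cluster, the same function could be fed the parsed output of `kubectl get nodes -o json`. An empty result suggests the device plugin has not registered GPUs yet.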

Analogy: GPU Operator + Karpenter = the highway; Training Operator/Kubeflow = the trucks. SREs own the highway.

Six-pillar reading order

Figure 1 — Dependency order for standing up GPU capacity on EKS and comparable managed clusters.

Prerequisites on the cluster

A managed Kubernetes cluster with GPU-capable instance types covered by your account quota (see the instance matrix).
Node IAM permissions on EKS (or the GKE/AKS equivalent) and a driver policy that allow the GPU Operator to install NVIDIA drivers.
Working familiarity with taints and tolerations before you dedicate GPU node pools.
Prometheus if you enable DCGM ServiceMonitor scraping.
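To make the taints-and-tolerations prerequisite concrete, here is a minimal sketch of a Pod manifest that requests the nvidia.com/gpu extended resource and tolerates a GPU-pool taint. It rests on assumptions: the taint key nvidia.com/gpu with effect NoSchedule is a common convention rather than something this series prescribes, and the function name, Pod name, and image are hypothetical placeholders.

```python
import json


def gpu_pod_manifest(name, image, gpus=1, taint_key="nvidia.com/gpu"):
    """Build a Pod manifest that can schedule onto a tainted GPU node pool.

    Assumption: GPU nodes carry a taint `<taint_key>:NoSchedule`; adjust the
    key and effect to match however your node pools are actually tainted.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            # Without a matching toleration the scheduler keeps the Pod off
            # tainted GPU nodes.
            "tolerations": [
                {"key": taint_key, "operator": "Exists", "effect": "NoSchedule"}
            ],
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resources are requested under limits and must
                    # be whole numbers.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Hypothetical Pod name and image, for illustration only.
manifest = gpu_pod_manifest("gpu-smoke-test", "example.com/cuda-smoke-test:latest")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`. The toleration only permits scheduling onto tainted nodes; it is the nvidia.com/gpu limit that actually steers the Pod to a node with a free GPU.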
