TL;DR

Match workload to GPU generation (inference L4/T4 vs training A100/H100), then pick a cloud SKU with enough GPU memory, NVLink/PCIe topology, and network (EFA, high-bandwidth NICs). On EKS, standardize on families like g5, p4d, p5; validate quota and AMI compatibility before enabling GPU Operator.

NVIDIA GPU generations (quick matrix)

| GPU | VRAM | Typical use | K8s note |
|---|---|---|---|
| L4 / T4 | 16–24 GB | Inference, smaller fine-tune | Cost-efficient; time-slicing common |
| A10 / A100 | 24–80 GB | Training & large inference | MIG on A100 for sharing |
| H100 / H200 | 80 GB+ | LLM training, HPC | p5 on AWS; driver ≥ 535 stacks |

AWS EKS instance matrix (primary)

These are the families SRE teams most often wire into EKS node groups or Karpenter GPU NodePools.

| Instance family | GPU | GPUs / node | When to choose |
|---|---|---|---|
| g5.xlarge – g5.48xlarge | NVIDIA A10G | 1–8 | General inference, CV, mid-size training |
| g6* (region-dependent) | L4 / newer | 1–8 | Efficient inference; check regional availability |
| p4d.24xlarge | A100 40GB | 8 | Multi-GPU training; EFA networking |
| p5.48xlarge | H100 | 8 | Large-model training; highest cost |
| p3* (legacy) | V100 | 1–8 | Avoid greenfield; driver/support sunset paths |
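```bash
# aws-gpu-quota.sh
# Confirm GPU instance quota in the target region before Karpenter scales.
# Values are vCPU limits, not instance counts. This filter matches the G and VT
# quota; P-family instances (p4d, p5) are covered by a separate "Running On-Demand P instances" quota.
aws service-quotas list-service-quotas --service-code ec2 --region us-east-1 \
  --query "Quotas[?contains(QuotaName, 'Running On-Demand G')].[QuotaName,Value]" --output table
```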
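EC2 On-Demand quotas are expressed in vCPUs rather than instances, so a non-zero value can still be too small for a single large GPU node. The sketch below turns a quota value into a node count; `max_instances` is a hypothetical helper name, not part of the AWS CLI, and the vCPU figure in the example is the published size for that instance type.

```bash
# Hypothetical helper: how many instances of one size fit under a vCPU quota.
# usage: max_instances <quota_vcpus> <vcpus_per_instance>
max_instances() {
  local quota=${1%.*}      # Service Quotas reports values like "192.0"; drop the fraction
  local per_instance=$2
  echo $(( quota / per_instance ))
}

# Example: a 192 vCPU quota against p4d.24xlarge (96 vCPUs each).
max_instances 192.0 96     # prints 2
```

The same 192 vCPU quota would cover only one g5.48xlarge or p5.48xlarge (192 vCPUs each), which is worth knowing before setting NodePool limits.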

GKE & AKS comparison columns

| Cloud | Example GPU SKUs | Pool model | Docs anchor |
|---|---|---|---|
| GKE | a2-highgpu-8g, a3-highgpu-8g, g2-standard-* | GPU node pools + GKE Accelerator | GKE Deep Dive |
| AKS | Standard_NC*, Standard_ND*, NCads_A100_v4 | GPU node pools + driver extensions | AKS Deep Dive |
| EKS | g5, p4d, p5 | MNG or Karpenter + GPU Operator | EKS Deep Dive |

Host topology matters

[Diagram: an EC2 GPU node (e.g. p4d.24xlarge) with GPU 0–3 and GPU 4–7 joined in an NVSwitch/NVLink domain, plus an EFA NIC. Multi-GPU jobs need NCCL-aware topology; single-GPU pods often ignore NUMA.]

Figure 1 — Multi-GPU instances expose topology that affects distributed training—not just raw GPU count.
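One way to see this topology on a running node is the interconnect matrix from `nvidia-smi topo -m`. The wrapper below is a minimal sketch; the function name `show_gpu_topology` is hypothetical, and it assumes you are on the GPU node or inside a pod that has the NVIDIA driver tools mounted.

```bash
# Print the GPU interconnect matrix, or explain why it is unavailable.
show_gpu_topology() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: run this on the GPU node or inside a GPU pod" >&2
    return 1
  fi
  # In the matrix, NV# means an NVLink path between two GPUs;
  # PIX/PXB/PHB/NODE/SYS are progressively longer PCIe/NUMA paths.
  nvidia-smi topo -m
}
```

Calling `show_gpu_topology` on a multi-GPU node should show NV# entries between GPU pairs when NVLink is present; a matrix of only PCIe-style entries suggests collective-heavy training will be limited by PCIe bandwidth.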

Selection checklist

  • Enough GPU memory for model + optimizer states (not just parameter count).
  • Spot / preemptible only for fault-tolerant batch; use taints + checkpointing.
  • MIG-capable SKUs if you need fractional GPUs on one card.
  • Align AMI (AL2023, Ubuntu) with GPU Operator driver bundles.
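The first item in the checklist can be made concrete with a rough sizing rule. The sketch below assumes mixed-precision training with Adam at roughly 16 bytes per parameter (2 for fp16 weights, 2 for gradients, 12 for fp32 master weights and the two optimizer moments) and excludes activations and framework overhead; `train_mem_gb` is a hypothetical helper, and it only accepts whole billions of parameters.

```bash
# Hypothetical helper: rough GPU memory for model state, in decimal GB.
# usage: train_mem_gb <params_in_billions> [bytes_per_param]
train_mem_gb() {
  local params_b=$1
  local bytes_per_param=${2:-16}   # assumption: mixed-precision Adam, no activations
  # 1e9 params x N bytes/param = N GB per billion params, so the 1e9 factors cancel.
  echo $(( params_b * bytes_per_param ))
}

train_mem_gb 7      # prints 112: training state for a 7B model
train_mem_gb 7 2    # prints 14: fp16 weights only, as for inference
```

Under these assumptions a 7B model needs about 112 GB of model state to train, more than one 80 GB card before activations are counted, while its fp16 weights alone fit in a 24 GB inference GPU.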

Symptom → cause

| Signal | Likely cause |
|---|---|
| Karpenter never launches GPU nodes | EC2 quota = 0 for G/VT instances |
| Wrong GPU count in nvidia-smi | Mixed instance type or bare-metal profile mismatch |
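To catch the second symptom early, the GPU count a node reports can be compared against what the instance type should provide. This is a minimal sketch: `check_gpu_count` is a hypothetical helper that reads `nvidia-smi -L` output on stdin and counts only top-level `GPU N:` lines, so indented MIG device lines are ignored.

```bash
# Hypothetical helper: compare the number of GPUs listed on stdin with an expected count.
# usage: nvidia-smi -L | check_gpu_count <expected>
check_gpu_count() {
  local expected=$1 actual
  # grep -c exits non-zero on zero matches but still prints 0, hence "|| true".
  actual=$(grep -c '^GPU [0-9]' || true)
  if [ "$actual" -eq "$expected" ]; then
    echo "ok: $actual GPUs"
  else
    echo "mismatch: expected $expected, found $actual"
    return 1
  fi
}
```

On a p4d.24xlarge node, for example, `nvidia-smi -L | check_gpu_count 8` should report `ok: 8 GPUs`. From the cluster side, `kubectl get nodes -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'` shows what the device plugin advertises per node, which helps tell a driver problem from a scheduling one.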

See also