GPU Hardware & Instance Matrices

TL;DR

Match the workload to a GPU generation (L4/T4 for inference, A100/H100 for training), then pick a cloud SKU with enough GPU memory, a suitable NVLink/PCIe topology, and adequate networking (EFA or other high-bandwidth NICs). On EKS, standardize on families such as g5, p4d, and p5, and validate quota and AMI compatibility before enabling the GPU Operator.

NVIDIA GPU generations (quick matrix)

GPU	VRAM	Typical use	K8s note
L4 / T4	16–24 GB	Inference, smaller fine-tune	Cost-efficient; time-slicing common
A10 / A100	24–80 GB	Training & large inference	MIG on A100 for sharing
H100 / H200	80 GB+	LLM training, HPC	`p5` on AWS; use a driver stack on branch 535 or newer
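
To make the driver note in the matrix concrete, here is a minimal sketch of a check you could run on a GPU node. The helper names `driver_at_least` and `check_node_driver` are hypothetical, and the 535 default is taken from the table above rather than from any vendor requirement verified here; adjust it to whatever your stack needs.

```shell
# driver_at_least MIN_MAJOR VERSION
# Succeeds when the major number of VERSION is >= MIN_MAJOR.
driver_at_least() {
  min_major="$1"
  major="${2%%.*}"   # "535.104.05" -> "535"
  [ "$major" -ge "$min_major" ]
}

# check_node_driver [MIN_MAJOR]
# Run on a GPU node; reads the installed driver version from nvidia-smi.
# MIN_MAJOR defaults to 535 (an assumption taken from the matrix above).
check_node_driver() {
  min="${1:-535}"
  ver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  if driver_at_least "$min" "$ver"; then
    echo "driver $ver meets the $min branch"
  else
    echo "driver $ver is older than the $min branch" >&2
    return 1
  fi
}
```

On a node, `check_node_driver` with no arguments checks against 535; pass another major version to change the threshold.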

AWS EKS instance matrix (primary)

These are the families SRE teams most often wire into EKS node groups or Karpenter GPU NodePools.

Instance family	GPU	GPUs / node	When to choose
`g5.xlarge` – `g5.48xlarge`	NVIDIA A10G	1–8	General inference, CV, mid-size training
`g6*` (region-dependent)	L4 / newer	1–8	Efficient inference; check regional availability
`p4d.24xlarge`	A100 40GB	8	Multi-GPU training; EFA networking
`p5.48xlarge`	H100	8	Large-model training; highest cost
`p3*` (legacy)	V100	1–8	Avoid for greenfield; drivers and support are on a sunset path
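
The rows above can be cross-checked against what EC2 itself reports. The following is a sketch, assuming AWS CLI v2 with credentials configured; the function names `gpu_specs` and `gpu_offerings` are hypothetical wrappers, and the region and instance types in the usage note are placeholders.

```shell
# gpu_specs REGION TYPE [TYPE...]
# Prints GPU model, GPU count, and total GPU memory (MiB) per instance type.
gpu_specs() {
  region="$1"; shift
  aws ec2 describe-instance-types --region "$region" \
    --instance-types "$@" \
    --query 'InstanceTypes[].[InstanceType,GpuInfo.Gpus[0].Name,GpuInfo.Gpus[0].Count,GpuInfo.TotalGpuMemoryInMiB]' \
    --output table
}

# gpu_offerings REGION TYPE
# Lists the Availability Zones in REGION that offer TYPE
# (useful for families marked region-dependent, such as g6).
gpu_offerings() {
  region="$1"; itype="$2"
  aws ec2 describe-instance-type-offerings --region "$region" \
    --location-type availability-zone \
    --filters "Name=instance-type,Values=$itype" \
    --query 'InstanceTypeOfferings[].[InstanceType,Location]' \
    --output text
}
```

For example, `gpu_specs us-east-1 g5.xlarge p4d.24xlarge p5.48xlarge` compares three candidates, and `gpu_offerings us-east-1 g6.xlarge` shows where that type can actually launch.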

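aws-gpu-quota.sh

#!/usr/bin/env bash
# Confirm GPU instance quota in the target region before Karpenter scales.
# EC2 quota values are vCPU counts, not instance counts.
# "G and VT" covers g5/g6; "P" covers p4d/p5.
aws service-quotas list-service-quotas --service-code ec2 --region us-east-1 \
  --query "Quotas[?contains(QuotaName, 'Running On-Demand G') || contains(QuotaName, 'Running On-Demand P')].[QuotaName,Value]" --output table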

GKE & AKS comparison columns

Cloud	Example GPU SKUs	Pool model	Docs anchor
GKE	`a2-highgpu-8g`, `a3-highgpu-8g`, `g2-standard-*`	GPU node pools + GKE Accelerator	GKE Deep Dive
AKS	`Standard_NC`, `Standard_ND`, `NCads_A100_v4`	GPU node pools + driver extensions	AKS Deep Dive
EKS	`g5`, `p4d`, `p5`	MNG or Karpenter + GPU Operator	EKS Deep Dive

Host topology matters

Figure 1 — Multi-GPU instances expose topology that affects distributed training—not just raw GPU count.
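
One way to see this topology on a running node is `nvidia-smi topo -m`, which prints a GPU-to-GPU connectivity matrix. The wrapper below is a sketch; the function name `show_gpu_topology` and the `NVIDIA_SMI` override variable are hypothetical conveniences, not part of any NVIDIA tooling.

```shell
# show_gpu_topology
# Prints the GPU interconnect matrix for the current node.
# In the output, NV# entries indicate NVLink connections; entries such as
# PIX, PXB, PHB, NODE, and SYS describe progressively longer PCIe/CPU paths.
# NVIDIA_SMI (hypothetical) lets you point at a non-default binary.
show_gpu_topology() {
  smi="${NVIDIA_SMI:-nvidia-smi}"
  if ! command -v "$smi" >/dev/null 2>&1; then
    echo "$smi not found; run this on a GPU node" >&2
    return 1
  fi
  "$smi" topo -m
}
```

Comparing the output across a `g5` and a `p4d` node shows why GPU count alone does not predict distributed-training throughput.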

Selection checklist

Enough GPU memory for model + optimizer states (not just parameter count).
Spot / preemptible only for fault-tolerant batch; use taints + checkpointing.
MIG-capable SKUs if you need fractional GPUs on one card.
Align AMI (AL2023, Ubuntu) with GPU Operator driver bundles.
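
The first checklist item can be turned into rough arithmetic. The sketch below assumes a common rule of thumb of about 16 bytes per parameter for mixed-precision training with Adam (2 for fp16 weights, 2 for fp16 gradients, 12 for the fp32 master copy plus two optimizer moments). Activations, buffers, and fragmentation are not included, so treat the result as a lower bound. The function name `estimate_train_gib` is hypothetical.

```shell
# estimate_train_gib PARAMS_BILLIONS [BYTES_PER_PARAM]
# Rough GPU memory in GiB for weights + gradients + optimizer state.
# BYTES_PER_PARAM defaults to 16 (mixed-precision Adam rule of thumb);
# pass 2 for fp16/bf16 weights-only inference.
estimate_train_gib() {
  params_b="$1"
  bytes_per_param="${2:-16}"
  awk -v p="$params_b" -v b="$bytes_per_param" \
    'BEGIN { printf "%.1f\n", p * 1e9 * b / 1073741824 }'
}

# estimate_train_gib 7     prints 104.3 (training a 7B model)
# estimate_train_gib 7 2   prints 13.0  (fp16 weights only)
```

Under these assumptions a 7B-parameter model needs more than a single 80 GB card for full training unless state is sharded or offloaded, while its fp16 weights alone fit on a 16–24 GB inference GPU.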

Symptom → cause

Signal	Likely cause
Karpenter never launches GPU nodes	EC2 vCPU quota is 0 (or too low) for G/VT or P instances
Wrong GPU count in `nvidia-smi`	Mixed instance type or bare-metal profile mismatch
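
For the GPU-count symptom, it helps to compare what the scheduler believes each node offers with what the node itself reports. This is a sketch that assumes the NVIDIA device plugin (or GPU Operator) advertises the `nvidia.com/gpu` resource and that nodes carry the standard `node.kubernetes.io/instance-type` label; the function name `gpu_allocatable` is hypothetical.

```shell
# gpu_allocatable
# Lists each node with its instance type and allocatable GPU count
# as seen by the Kubernetes scheduler.
gpu_allocatable() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,TYPE:.metadata.labels.node\.kubernetes\.io/instance-type,GPUS:.status.allocatable.nvidia\.com/gpu'
}

# On the node itself, count the GPUs the driver sees:
#   nvidia-smi -L | wc -l
```

If the `GPUS` column disagrees with the node-local count, or the `TYPE` column shows an unexpected instance type in the pool, the causes in the table above are the first things to check.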
