Hardware Fundamentals & Cloud Instance Matrices
TL;DR
Match the workload to a GPU generation (L4/T4 for inference, A100/H100 for training), then pick a cloud SKU with enough GPU memory, a suitable NVLink/PCIe topology, and adequate networking (EFA or other high-bandwidth NICs). On EKS, standardize on families such as g5, p4d, and p5, and validate quota and AMI compatibility before enabling the GPU Operator.
NVIDIA GPU generations (quick matrix)
| GPU | VRAM | Typical use | K8s note |
|---|---|---|---|
| L4 / T4 | 16–24 GB | Inference, smaller fine-tune | Cost-efficient; time-slicing common |
| A10 / A100 | 24–80 GB | Training & large inference | MIG on A100 for sharing |
| H100 / H200 | 80 GB+ | LLM training, HPC | p5 on AWS; use a driver stack ≥ 535 |
AWS EKS instance matrix (primary)
These are the families SRE teams most often wire into EKS node groups or Karpenter GPU NodePools.
| Instance family | GPU | GPUs / node | When to choose |
|---|---|---|---|
| g5.xlarge – g5.48xlarge | NVIDIA A10G | 1–8 | General inference, CV, mid-size training |
| g6* (region-dependent) | L4 (g6), L40S (g6e) | 1–8 | Efficient inference; check regional availability |
| p4d.24xlarge | A100 40GB | 8 | Multi-GPU training; EFA networking |
| p5.48xlarge | H100 | 8 | Large-model training; highest cost |
| p3* (legacy) | V100 | 1–8 | Avoid for greenfield; driver/support sunset paths |
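The GPU model, count, and memory in the matrix above can be cross-checked against what EC2 reports for a region. The following is a minimal sketch, assuming AWS CLI v2 with credentials that allow `ec2:DescribeInstanceTypes`; the function name `gpu_specs` is hypothetical and only wraps the real `describe-instance-types` call.

```bash
# Print GPU model, GPU count, and total GPU memory (MiB) for the given instance types.
# gpu_specs is an illustrative helper name, not part of any AWS tooling.
gpu_specs() {   # $1 = region, remaining args = instance types
  local region=$1
  shift
  aws ec2 describe-instance-types --region "$region" --instance-types "$@" \
    --query 'InstanceTypes[].[InstanceType, GpuInfo.Gpus[0].Name, GpuInfo.Gpus[0].Count, GpuInfo.TotalGpuMemoryInMiB]' \
    --output table
}

# Usage (requires AWS credentials):
#   gpu_specs us-east-1 g5.xlarge g5.48xlarge p4d.24xlarge p5.48xlarge
```

An instance type that is not offered in the chosen region is rejected by the API, which doubles as a quick regional availability check.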
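aws-gpu-quota.sh
```bash
# Confirm GPU instance quotas in the target region before Karpenter scales.
# EC2 On-Demand quotas are expressed in vCPUs, not instance counts.
# G/VT covers g5 and g6; P covers p4d and p5.
aws service-quotas list-service-quotas --service-code ec2 --region us-east-1 \
  --query "Quotas[?contains(QuotaName, 'Running On-Demand G') || contains(QuotaName, 'Running On-Demand P')].[QuotaName,Value]" \
  --output table
```
GKE & AKS comparison columns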
| Cloud | Example GPU SKUs | Pool model | Docs anchor |
|---|---|---|---|
| GKE | a2-highgpu-8g, a3-highgpu-8g, g2-standard-* | GPU node pools + GKE Accelerator | GKE Deep Dive |
| AKS | Standard_NC*, Standard_ND*, NCads_A100_v4 | GPU node pools + driver extensions | AKS Deep Dive |
| EKS | g5, p4d, p5 | MNG or Karpenter + GPU Operator | EKS Deep Dive |
Host topology matters
Figure 1 — Multi-GPU instances expose topology that affects distributed training—not just raw GPU count.
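One way to see the topology on a running node is the matrix printed by `nvidia-smi topo -m`, where an `NV<n>` entry marks a GPU pair connected by n bonded NVLinks and entries such as `PIX`, `PHB`, or `SYS` mark PCIe or cross-socket paths. The sketch below is a hypothetical helper (`gpu_interconnect` is not an NVIDIA tool) that classifies that matrix read from stdin.

```bash
# Classify the GPU interconnect from `nvidia-smi topo -m` output read on stdin.
# Prints "nvlink" if any GPU pair shows an NV<n> entry, otherwise "pcie-only".
# gpu_interconnect is an illustrative name; only the NV<n> notation comes from nvidia-smi.
gpu_interconnect() {
  if grep -Eq '(^|[[:space:]])NV[0-9]+([[:space:]]|$)'; then
    echo "nvlink"
  else
    echo "pcie-only"
  fi
}

# Usage on a GPU node (requires the NVIDIA driver):
#   nvidia-smi topo -m | gpu_interconnect
```

A `pcie-only` result on a multi-GPU node suggests that all-reduce traffic will cross PCIe, which is worth knowing before sizing a distributed training job.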
Selection checklist
- Enough GPU memory for model + optimizer states (not just parameter count).
- Spot / preemptible only for fault-tolerant batch; use taints + checkpointing.
- MIG-capable SKUs if you need fractional GPUs on one card.
- Align AMI (AL2023, Ubuntu) with GPU Operator driver bundles.
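The first checklist item can be made concrete with rule-of-thumb arithmetic. This sketch assumes fp16/bf16 weights (2 bytes per parameter) and mixed-precision Adam (about 16 bytes per parameter: 2 for weights, 2 for gradients, 12 for fp32 master weights, momentum, and variance). Those byte counts are assumptions, not figures from this guide, and activations, KV cache, and framework overhead are excluded, so treat the results as lower bounds. The function names are hypothetical.

```bash
# Rule-of-thumb GPU memory estimates in decimal GB, from a parameter count in billions.
# Assumed byte counts: 2 B/param for fp16 weights, 16 B/param for mixed-precision Adam state.
# Activations and runtime overhead are NOT included.

inference_weights_gb() {   # $1 = parameters in billions (integer)
  echo $(( $1 * 2 ))
}

train_state_gb() {         # $1 = parameters in billions (integer)
  echo $(( $1 * 16 ))
}

min_gpus() {               # $1 = required GB, $2 = GB per GPU; ceiling division
  echo $(( ($1 + $2 - 1) / $2 ))
}

need=$(train_state_gb 7)
echo "7B training state: ${need} GB, at least $(min_gpus "$need" 80) x 80 GB GPUs"
```

Under these assumptions a 7B-parameter model needs roughly 14 GB for inference weights but about 112 GB of training state, which is why parameter count alone understates the memory requirement.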
Symptom → cause
| Signal | Likely cause |
|---|---|
| Karpenter never launches GPU nodes | EC2 quota = 0 for G/VT instances |
| Wrong GPU count in nvidia-smi | Mixed instance type or bare-metal profile mismatch |
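The quota symptom is easier to reason about once the quota is converted into node counts, because EC2 On-Demand quotas are measured in vCPUs rather than instances. A minimal sketch, with hypothetical helper names and a lookup table limited to four sizes mentioned in this guide:

```bash
# vCPUs per instance size for a few sizes from the matrix above.
# vcpus_for and max_nodes are illustrative helper names.
vcpus_for() {
  case "$1" in
    g5.xlarge)    echo 4 ;;
    g5.48xlarge)  echo 192 ;;
    p4d.24xlarge) echo 96 ;;
    p5.48xlarge)  echo 192 ;;
    *) echo "unknown instance type: $1" >&2; return 1 ;;
  esac
}

# Number of nodes of a given type that fit under a vCPU quota (floor division).
max_nodes() {   # $1 = quota in vCPUs (e.g. 64 or 64.0), $2 = instance type
  local quota=${1%.*}   # the CLI reports quota values as floats; drop the fraction
  local per_node
  per_node=$(vcpus_for "$2") || return 1
  echo $(( quota / per_node ))
}

max_nodes 0 g5.xlarge        # zero quota: no GPU node can launch
max_nodes 64 g5.48xlarge     # a non-zero quota can still be too small for one large node
max_nodes 192 p4d.24xlarge   # room for two p4d.24xlarge nodes
```

A result of 0 for the instance types a NodePool allows is consistent with the "never launches GPU nodes" signal, even when the quota itself is not zero.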