Dynamic GPU Autoscaling — Karpenter v1
TL;DR
On EKS, a Karpenter v1 NodePool plus EC2NodeClass provisions GPU nodes when pending pods request nvidia.com/gpu. Constrain instance categories (g, p), apply a GPU taint, tag subnets and security groups for discovery, and cap aggregate CPU/GPU via NodePool limits. Generic HPA behavior is covered under workloads autoscaling.
Scale-out flow
Figure 1 — Node launch is only the first step; GPU Operator must become healthy before pods leave Pending.
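To make the flow concrete, here is a minimal sketch of a pod that would sit in Pending until a node from the gpu-workloads pool is ready. It requests one nvidia.com/gpu and tolerates the sku=gpu:NoSchedule taint used by the NodePool in this section. The pod name and container image are assumptions for illustration, not part of the original setup.

```yaml
# Hypothetical smoke-test pod; name and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
    - key: sku              # must match the NodePool taint sku=gpu:NoSchedule
      operator: Equal
      value: gpu
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed image tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # extended resource that drives GPU node provisioning
```

Without the toleration the pod cannot schedule onto the tainted GPU nodes. Without the nvidia.com/gpu limit nothing asks for a GPU, so Karpenter has no reason to launch one for this pod.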
Karpenter v1 production spec
karpenter-gpu-nodepool.yaml (yaml)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-workloads
spec:
  limits:
    cpu: 1000 # Cap total vCPU from this pool; adjust per quota
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  template:
    spec:
      expireAfter: 720h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodeclass
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["g", "p"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"] # Spot for fault-tolerant training only
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      taints:
        - key: sku
          value: gpu
          effect: NoSchedule
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-nodeclass
spec:
  role: KarpenterNodeRole-prod-cluster
  amiFamily: AL2023
  amiSelectorTerms:
    - alias: al2023@latest # Prefer EKS GPU-optimized AMIs when available in your pipeline
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-cluster
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 200Gi # Large images + checkpoint scratch
        volumeType: gp3
        deleteOnTermination: true
Operate & debug
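karpenter-gpu-ops.sh (bash)
# Controller logs: provisioning decisions and launch errors
kubectl logs -n kube-system deployment/karpenter --tail=100
# List pools and the NodeClaims Karpenter has created
kubectl get nodepools,nodeclaims
# Inspect the GPU pool's limits and status conditions
kubectl describe nodepool gpu-workloads
Gotchas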
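- Quota: GPU instance launches can fail with VcpuLimitExceeded when the account's EC2 vCPU quota for that family is exhausted; see instance matrix.
- Startup latency: GPU node boot plus driver install can exceed the startup times that default cluster-autoscaler settings expect.
- Spot reclaim: Training jobs need checkpointing and a PDB strategy.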
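As one possible starting point for the PDB part of the Spot strategy, the sketch below limits how many training pods can be evicted at once during voluntary disruptions such as consolidation drains. The resource name and the app: trainer label are hypothetical; match them to your own workload labels.

```yaml
# Hypothetical PDB for a training workload; name and label are placeholders.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: trainer-pdb
spec:
  maxUnavailable: 1        # evict at most one matching pod at a time
  selector:
    matchLabels:
      app: trainer         # assumed label on the training pods
```

A PDB only governs voluntary evictions. It cannot stop the Spot reclaim itself, so checkpointing remains the real safety net. For pods that should not be moved by consolidation at all, the karpenter.sh/do-not-disrupt: "true" pod annotation may be worth evaluating alongside the PDB.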