GPU Karpenter Autoscaling — K8s SRE Reference

TL;DR

On EKS, a Karpenter v1 NodePool plus EC2NodeClass provisions GPU nodes when pending pods request nvidia.com/gpu. Constrain instance categories (g, p), apply GPU taints, tag subnets and security groups for discovery, and cap aggregate CPU/GPU via NodePool limits. Pod-level scaling (HPA) stays with workload autoscaling; Karpenter only adds and removes nodes.

Scale-out flow

Figure 1 — Node launch is only the first step; GPU Operator must become healthy before pods leave Pending.
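As a minimal sketch of what starts this flow, the pod below requests one GPU and tolerates the sku=gpu taint used by the NodePool in this reference. The pod name and the container image tag are assumptions for illustration, not part of the original setup; substitute an image your cluster can pull.

```yaml
# Hypothetical smoke-test pod: stays Pending until a GPU node is ready.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test            # illustrative name
spec:
  restartPolicy: Never
  tolerations:
    - key: sku                    # matches the NodePool taint sku=gpu:NoSchedule
      operator: Equal
      value: gpu
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed image tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1       # extended resource that makes this a GPU pod
```

The pod cannot schedule until a node advertises nvidia.com/gpu as allocatable, which is why a node in Ready state alone does not end the Pending phase.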

Karpenter v1 production spec

YAML: karpenter-gpu-nodepool.yaml

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-workloads
spec:
  limits:
    cpu: 1000          # Cap total vCPU from this pool — adjust per quota
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  template:
    spec:
      expireAfter: 720h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodeclass
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["g", "p"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]  # Spot for fault-tolerant training only
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      taints:
        - key: sku
          value: gpu
          effect: NoSchedule
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-nodeclass
spec:
  role: KarpenterNodeRole-prod-cluster
  amiFamily: AL2023
  amiSelectorTerms:
    - alias: al2023@latest  # Pin a specific AMI version in production; prefer EKS GPU-optimized AMIs when available in your pipeline
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-cluster
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 200Gi       # Large images + checkpoint scratch
        volumeType: gp3
        deleteOnTermination: true
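The NodePool above caps only vCPU. If you also want a direct ceiling on GPU count, a sketch of the limits block with an nvidia.com/gpu entry could look like the fragment below. The value 16 is a hypothetical placeholder, not a recommendation; size it against your EC2 quota and budget.

```yaml
# Fragment only: merge into spec.limits of the gpu-workloads NodePool.
spec:
  limits:
    cpu: 1000              # existing vCPU cap from the spec above
    nvidia.com/gpu: 16     # hypothetical cap on total GPUs across the pool
```

Karpenter stops launching further capacity from the pool once a limit is reached, so pods beyond the cap stay Pending rather than triggering new nodes.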

Operate & debug

Bash: karpenter-gpu-ops.sh

kubectl logs -n kube-system deployment/karpenter --tail=100
kubectl get nodepools,nodeclaims
kubectl describe nodepool gpu-workloads

Gotchas

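Quota: launches of GPU instance families can fail with VcpuLimitExceeded when the account's vCPU quota is exhausted; see instance matrix.
Startup latency: GPU node boot plus driver install can take longer than the node-ready times that default cluster-autoscaler settings assume, so pods stay Pending longer.
Spot reclaim: training jobs need checkpointing plus a PDB strategy.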
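For the Spot reclaim point, one possible shape of a PDB strategy is sketched below: a training Job whose pods carry the karpenter.sh/do-not-disrupt annotation, plus a PodDisruptionBudget selecting them. The Job name, label, and image are hypothetical. These controls guard against voluntary disruption such as consolidation; they cannot stop an actual Spot reclaim, so checkpointing is still required.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-example                    # hypothetical name
spec:
  backoffLimit: 4                        # allow retries after an interruption
  template:
    metadata:
      labels:
        app: train-example
      annotations:
        karpenter.sh/do-not-disrupt: "true"   # ask Karpenter not to voluntarily disrupt this pod's node
    spec:
      restartPolicy: OnFailure
      tolerations:
        - key: sku
          operator: Equal
          value: gpu
          effect: NoSchedule
      containers:
        - name: trainer
          image: registry.example.com/trainer:latest   # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: 1
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: train-example-pdb
spec:
  minAvailable: 1                        # integer form, since Job pods have no scale subresource
  selector:
    matchLabels:
      app: train-example
```

With a single training pod, minAvailable: 1 blocks every voluntary eviction, so remove or relax the PDB when you need to drain the node deliberately.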
