TL;DR

GPU pods request nvidia.com/gpu (or MIG-specific resources). Isolate GPU nodes with taints and matching tolerations. Use node selectors or GPU Feature Discovery (GFD) labels to target hardware profiles. Enable MIG only when the operator partitions cards and the device plugin exposes slice resources; otherwise schedule whole GPUs only.

Extended resources

After the device plugin registers GPUs, each node reports nvidia.com/gpu under its capacity and allocatable fields, and pods can request it like any other resource:

gpu-pod-resources.yaml
resources:
  limits:
    nvidia.com/gpu: 1   # Whole GPU — scheduler counts integer GPUs
  requests:
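    nvidia.com/gpu: 1

Extended resources cannot be overcommitted: nvidia.com/gpu quantities must be whole integers, and when both requests and limits are set they must be equal (setting only limits is enough, since requests default to the limit). Time-slicing configs do not relax this; they make the device plugin advertise several schedulable replicas per physical GPU, and pods still request integers.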
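
Sharing through time-slicing is configured on the device plugin side, not in the pod spec. The following is a minimal sketch, assuming the NVIDIA GPU Operator's ConfigMap-based time-slicing setup. The ConfigMap name, namespace, data key, and replica count are illustrative assumptions, so check them against your operator version:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config   # hypothetical name
  namespace: gpu-operator     # assumed operator namespace
data:
  any: |-
    # Device plugin config; the key "any" is an assumed, arbitrary config name
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4       # each physical GPU is advertised as 4 nvidia.com/gpu
```

With a config like this, a node with one physical GPU would report four allocatable nvidia.com/gpu units. Pods still request whole integers, and the replicas share GPU memory without isolation.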

Taints & tolerations

Taint GPU nodes so generic workloads cannot land on expensive hardware; only pods that carry a matching toleration are admitted. Pair this with the scheduling & taints patterns.

gpu-taint-toleration.yaml
# Node (often applied by Karpenter NodePool or MNG launch template)
spec:
  taints:
    - key: sku
      value: gpu
      effect: NoSchedule

---
# Pod
spec:
  tolerations:
    - key: sku
      operator: Equal
      value: gpu
      effect: NoSchedule
  nodeSelector:
    nvidia.com/gpu.present: "true"  # From GPU Feature Discovery
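
When a workload needs a specific card rather than any GPU, node affinity on a GFD label is one option. This is a sketch that assumes GFD publishes nvidia.com/gpu.product on your nodes. The product strings are examples only, so list the real values with kubectl get nodes --show-labels before relying on them:

```yaml
# Pod spec fragment; combine with the toleration shown above
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                  - NVIDIA-A100-SXM4-40GB   # example value; verify on your nodes
                  - NVIDIA-A100-SXM4-80GB   # example value; verify on your nodes
```

The In operator lets one pod spec accept several acceptable SKUs, which a plain nodeSelector cannot express.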

Scheduling flow

Pending Pod → Scheduler → Filter: taint OK → Score: free GPU → GPU node

Figure 1 — Scheduler must pass taint/toleration gates before considering nvidia.com/gpu free capacity.
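
Putting the gates together, a complete pod that should pass both the taint check and the GPU capacity check might look like the sketch below. The pod name and container image tag are assumptions chosen for illustration; the taint key and GFD label match the snippets in this section:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test            # hypothetical name
spec:
  restartPolicy: Never
  tolerations:
    - key: sku                    # matches the sku=gpu:NoSchedule node taint
      operator: Equal
      value: gpu
      effect: NoSchedule
  nodeSelector:
    nvidia.com/gpu.present: "true"   # only nodes labeled as having a GPU
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed image tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1       # requests default to this limit
```

If this pod stays Pending, its scheduling events usually indicate which gate rejected it.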

Multi-Instance GPU (MIG)

MIG splits one physical GPU into hardware-isolated instances. It requires MIG Manager and MIG profiles configured on the node. Pods then request MIG-specific resources, whose names depend on the profile, e.g.:

mig-pod-snippet.yaml
resources:
  limits:
    nvidia.com/mig-1g.5gb: 1  # Example — verify allocatable keys on your node
  requests:
    nvidia.com/mig-1g.5gb: 1

| Mode | Pros | Cons |
|---|---|---|
| Whole GPU | Simple; max performance | Low utilization for small models |
| MIG | Hard isolation between tenants | Ops overhead; not all SKUs support MIG |
| Time-slicing | Share one GPU among many pods | No memory isolation; noisy-neighbor risk |
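
On the node side, the GPU Operator's MIG Manager is commonly driven by a node label that names the desired layout. The fragment below is a sketch under two assumptions: that the operator's MIG config includes a layout called all-1g.5gb, and that the device plugin runs with the mixed MIG strategy so slices appear as nvidia.com/mig-* resources. The node name is hypothetical:

```yaml
# Node metadata fragment (usually set with kubectl label rather than applied as a manifest)
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1                       # hypothetical node name
  labels:
    nvidia.com/mig.config: all-1g.5gb    # requested MIG layout; must exist in the MIG Manager config
```

Once the layout is applied, the node's allocatable list should show the matching nvidia.com/mig-1g.5gb key, which is the name pods request.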

Scheduling failures

| Event | Fix |
|---|---|
| Insufficient nvidia.com/gpu | Scale GPU pool via Karpenter or reduce requests |
| Did not tolerate taint sku=gpu | Add toleration or remove stray taint |
| 0 allocatable GPU on node | Fix device plugin / driver |
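
To tell these cases apart, it helps to compare what the node reports with what pods request. A healthy GPU node looks roughly like the excerpt below, taken from the shape of kubectl get node <name> -o yaml output. The counts are illustrative and not from a real cluster:

```yaml
# Node status excerpt; counts are illustrative
status:
  capacity:
    nvidia.com/gpu: "4"      # GPUs the device plugin registered
  allocatable:
    nvidia.com/gpu: "4"      # GPUs available to pods; "0" or a missing key points at the plugin or driver
```

If the key is present but every unit is already claimed by running pods, the failure is a capacity problem rather than a driver problem.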

See also