Scheduling, Affinity & Autoscaling

TL;DR

The scheduler places Pods on nodes based on resource requests, taints/tolerations, node selectors, affinity, topology spread, and volume constraints. Autoscaling adjusts replicas or node capacity, but it only works reliably when requests, probes, and disruption budgets are sane.

Scheduler Flow

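For each unscheduled Pod, kube-scheduler filters out the nodes that cannot run it, scores the remaining feasible nodes, and binds the Pod to the highest-scoring one. If no node passes filtering, the Pod stays Pending.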

Requests, Limits, And QoS

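CPU and memory requests drive scheduling: a Pod only fits on a node whose unreserved allocatable capacity covers the Pod's total requests. Limits drive runtime enforcement. Missing requests can cause bad placement, noisy neighbors, and poor autoscaling behavior, because the scheduler reserves nothing for the Pod and HPA utilization targets are computed against requests.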

yaml requests-limits.yaml

resources:
  requests:
    cpu: 250m # Scheduler reserves this amount on a node.
    memory: 256Mi # Used for scheduling and QoS classification.
  limits:
    cpu: "1" # CPU is throttled when the container tries to use more than this.
    memory: 512Mi # Container is OOM-killed if it uses more than this.

QoS Class	When	Eviction Priority
Guaranteed	Every container sets CPU and memory requests equal to its limits.	Most protected; evicted last.
Burstable	At least one container sets a request or limit, but the Pod does not meet the Guaranteed criteria.	Middle.
BestEffort	No container sets any request or limit.	Evicted first.
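
To make the Guaranteed row concrete, here is a minimal sketch of a Pod whose single container sets requests equal to limits for both CPU and memory. The Pod name `qos-guaranteed-demo` and the `nginx:1.27` image are placeholders chosen for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed-demo # Hypothetical name.
spec:
  containers:
    - name: app
      image: nginx:1.27 # Placeholder image; substitute your own.
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
        limits:
          cpu: 500m # Equal to the request.
          memory: 512Mi # Equal to the request.
```

Once the Pod exists, `kubectl get pod qos-guaranteed-demo -o jsonpath='{.status.qosClass}'` reports the class Kubernetes assigned. Raising either limit above its request would turn the Pod into Burstable.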

Node Selection

yaml node-selector-affinity.yaml

spec:
  nodeSelector:
    nodepool: apps # Simple hard requirement: node must have this label.
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-east-1a", "us-east-1b"] # Hard zone constraint.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          preference:
            matchExpressions:
              - key: node.kubernetes.io/instance-type # Well-known label set by cloud providers.
                operator: In
                values: ["m6i.large"] # Soft preference.

Taints And Tolerations

Taints repel Pods from nodes. Tolerations allow Pods to run on tainted nodes; they do not force placement by themselves.

bash taints.sh

# Add a NoSchedule taint to reserve a node for special workloads.
kubectl taint nodes <node-name> dedicated=gpu:NoSchedule

# Remove the taint.
kubectl taint nodes <node-name> dedicated=gpu:NoSchedule-

# Show node taints.
kubectl describe node <node-name> | grep -i taints

Effect	Scheduler	Already running Pods
NoSchedule	Hard block without matching toleration.	Unaffected; Pods already on the node keep running.
PreferNoSchedule	Soft avoid; scheduler may still pick the node if no better node fits.	Unaffected.
NoExecute	Blocks new Pods lacking toleration.	Evicted immediately without a matching toleration; with one, evicted after `tolerationSeconds` if set, otherwise never.
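
For NoExecute, the toleration decides how long a running Pod may stay once the taint appears. The fragment below is a sketch using the built-in `node.kubernetes.io/not-ready` taint; the 60-second value is an arbitrary choice for illustration.

```yaml
tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists # Match the taint regardless of its value.
    effect: NoExecute
    tolerationSeconds: 60 # Stay bound for 60s after the taint appears, then get evicted.
```

Leaving out `tolerationSeconds` keeps the Pod bound for as long as the taint remains.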

yaml toleration.yaml

tolerations:
  - key: dedicated
    operator: Equal
    value: gpu
    effect: NoSchedule # Allows scheduling onto nodes with dedicated=gpu:NoSchedule.
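
Because a toleration only permits placement, dedicating nodes usually also needs something that attracts the Pod. A minimal sketch, assuming the GPU nodes additionally carry a `dedicated=gpu` label (an assumption; the taint command above does not add labels):

```yaml
spec:
  nodeSelector:
    dedicated: gpu # Assumed node label; add with: kubectl label nodes <node-name> dedicated=gpu
  tolerations:
    - key: dedicated
      operator: Equal
      value: gpu
      effect: NoSchedule # Permits scheduling onto the tainted nodes.
```

The taint keeps other workloads off the GPU nodes, and the selector keeps this workload off everything else.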

Pod Affinity And Anti-Affinity

yaml pod-anti-affinity.yaml

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: web-api
          topologyKey: kubernetes.io/hostname # Prefer spreading replicas across nodes.

yaml pod-affinity-required.yaml

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            tier: cache # Co-locate api Pod with cache Pods.
        topologyKey: kubernetes.io/hostname

Topology Spread

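topologySpreadConstraints count the Pods matching labelSelector in each topology domain (the set of nodes sharing a value for topologyKey) and cap the difference between the fullest and emptiest domain at maxSkew. Use DoNotSchedule when imbalance is unacceptable, for example when losing one zone must not take out most replicas; the Pod then stays Pending rather than violate the constraint. ScheduleAnyway keeps the Pod schedulable and only biases the scheduler toward balance. Make sure every domain has schedulable capacity, otherwise a DoNotSchedule constraint cannot be satisfied and new Pods stay Pending.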

`whenUnsatisfiable`	Use when
DoNotSchedule	Balance across zones is a hard requirement and every zone has spare capacity, e.g. stateless multi-AZ workloads.
ScheduleAnyway	Staying schedulable matters more than balance; Pods may pack into one zone during bursts.

yaml topology-spread.yaml

topologySpreadConstraints:
  - maxSkew: 1 # Max allowed difference in matching Pod count between domains.
    topologyKey: topology.kubernetes.io/zone # Spread across zones.
    whenUnsatisfiable: ScheduleAnyway # Use DoNotSchedule for hard enforcement.
    labelSelector:
      matchLabels:
        app: web-api
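
Constraints can be combined, and a Pod must satisfy all of them. The sketch below is one possible pairing, not a recommendation for every workload: a hard zone constraint plus a soft per-node constraint, reusing the `app: web-api` label from the example above.

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule # Hard: Pod stays Pending rather than unbalance zones.
    labelSelector:
      matchLabels:
        app: web-api
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway # Soft: prefer spreading across nodes.
    labelSelector:
      matchLabels:
        app: web-api
```

With three zones and five replicas, a 2/2/1 distribution satisfies the zone constraint (skew 2 - 1 = 1), while 3/1/1 does not (skew 3 - 1 = 2).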

Workload Autoscaling

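HPA scales replicas; VPA recommends or changes requests; Cluster Autoscaler/Karpenter adjusts node capacity. Resource-metric HPAs (CPU, memory) need metrics-server; custom and external metrics need a metrics adapter. Utilization targets are percentages of the containers' requests, so the target Pods must set requests.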
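
As an illustration of the "recommends" side of VPA, here is a sketch of a recommendation-only VerticalPodAutoscaler. It assumes the VPA CRDs and controllers are installed in the cluster (they are not part of core Kubernetes) and targets a Deployment named `web-api` in namespace `app`.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-api
  namespace: app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  updatePolicy:
    updateMode: "Off" # Only publish recommendations; never evict or resize Pods.
```

With the CRD installed, `kubectl describe vpa web-api -n app` shows the recommended requests, which can then be applied by hand.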

yaml hpa.yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api
  namespace: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api # HPA modifies this Deployment's replica count.
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
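
The replica count the HPA aims for follows the scaling rule documented for the HPA controller. As a worked example, assume (hypothetically) 4 current replicas averaging 90% CPU utilization of their requests, against the 60% target above:

```latex
\[
\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times
  \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
\]
\[
\left\lceil 4 \times \frac{90}{60} \right\rceil = \lceil 6 \rceil = 6
\]
```

The result is then clamped to the range between minReplicas and maxReplicas (3 and 20 here). The controller also skips scaling when the ratio is within its tolerance of 1.0, which is 0.1 by default.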

Debug Unschedulable Pods

bash scheduling-debug.sh

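# Read the Events section for FailedScheduling reasons.
kubectl describe pod <pod> -n <namespace>
# Recent scheduling-related events in the namespace.
kubectl get events -n <namespace> --sort-by=.lastTimestamp | grep -i -E 'schedul|taint|affinity|insufficient|volume'
# Compare node labels against nodeSelector and affinity rules.
kubectl get nodes --show-labels
# Check taints, allocatable capacity, and already-reserved requests per node.
kubectl describe nodes | grep -E 'Name:|Taints:|Allocatable:|Allocated resources:' -A10
# Actual usage (requires metrics-server); the scheduler itself looks at requests, not usage.
kubectl top nodes
kubectl top pods -n <namespace>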