TL;DR

The scheduler places Pods on nodes based on resource requests, taints/tolerations, node selectors, affinity, topology spread, and volume constraints. Autoscaling adjusts replicas or node capacity, but it only works reliably when requests, probes, and disruption budgets are sane.

Scheduler Flow

Pending Pod -> Filter (fit constraints) -> Score (rank nodes) -> Bind (assign node)

Filter rejects nodes that cannot run the Pod; score chooses the best remaining node.

Scheduling: filter eligible nodes, score them, then bind the Pod.
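The filter, score, bind sequence can be sketched in a few lines of Python. This is a toy model, not the kube-scheduler's implementation: the `Node` and `Pod` classes, the field names, and the "most free CPU wins" scoring rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_m: int   # unreserved CPU in millicores (hypothetical field)
    free_mem_mi: int  # unreserved memory in MiB (hypothetical field)
    labels: dict = field(default_factory=dict)


@dataclass
class Pod:
    name: str
    cpu_m: int        # CPU request in millicores
    mem_mi: int       # memory request in MiB
    node_selector: dict = field(default_factory=dict)


def filter_nodes(pod, nodes):
    """Filter: keep only nodes that satisfy every hard constraint."""
    return [
        n for n in nodes
        if n.free_cpu_m >= pod.cpu_m
        and n.free_mem_mi >= pod.mem_mi
        and all(n.labels.get(k) == v for k, v in pod.node_selector.items())
    ]


def score_node(pod, node):
    """Score: toy rule that prefers the node left with the most free CPU."""
    return node.free_cpu_m - pod.cpu_m


def schedule(pod, nodes):
    """Return the chosen node name, or None if the Pod would stay Pending."""
    feasible = filter_nodes(pod, nodes)
    if not feasible:
        return None
    best = max(feasible, key=lambda n: score_node(pod, n))
    # Bind: reserve the requested resources on the chosen node.
    best.free_cpu_m -= pod.cpu_m
    best.free_mem_mi -= pod.mem_mi
    return best.name


nodes = [
    Node("node-a", free_cpu_m=500, free_mem_mi=1024, labels={"nodepool": "apps"}),
    Node("node-b", free_cpu_m=2000, free_mem_mi=4096, labels={"nodepool": "apps"}),
    Node("node-c", free_cpu_m=4000, free_mem_mi=8192, labels={"nodepool": "batch"}),
]
pod = Pod("web-api", cpu_m=250, mem_mi=256, node_selector={"nodepool": "apps"})
chosen = schedule(pod, nodes)
print(chosen)  # node-b
```

Here node-c has the most free CPU but is filtered out by the node selector, so scoring only ever compares node-a and node-b.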

Requests, Limits, And QoS

CPU and memory requests drive scheduling. Limits drive runtime enforcement. Missing requests can cause bad placement, noisy neighbors, and poor autoscaling behavior.

requests-limits.yaml (yaml)
resources:
  requests:
    cpu: 250m # Scheduler reserves this amount on a node.
    memory: 256Mi # Used for scheduling and QoS classification.
  limits:
    cpu: "1" # Container can be throttled above this.
    memory: 512Mi # Container can be OOMKilled above this.

| QoS Class | When | Eviction Priority |
| --- | --- | --- |
| Guaranteed | Every container sets CPU and memory requests equal to its limits. | Most protected. |
| Burstable | At least one container sets a request or limit, but the Guaranteed criteria are not met. | Middle. |
| BestEffort | No container sets any request or limit. | Evicted first. |
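The classification rules in the table can be expressed as a small function. This is a minimal sketch, assuming each container is given as its `resources` dict and that quantities are already normalized strings (so `"1"` and `"1000m"` would compare as different here); `qos_class` is a hypothetical helper, not a Kubernetes API.

```python
def qos_class(containers):
    """Classify a Pod from a list of per-container `resources` dicts."""
    any_set = False
    all_guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # Kubernetes defaults a missing request to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


# The manifest above: requests are lower than limits.
example = [{
    "requests": {"cpu": "250m", "memory": "256Mi"},
    "limits": {"cpu": "1", "memory": "512Mi"},
}]
print(qos_class(example))  # Burstable
```

Raising the requests to match the limits (`cpu: "1"`, `memory: 512Mi`) would move the same Pod to Guaranteed.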

Node Selection

node-selector-affinity.yaml (yaml)
spec:
  nodeSelector:
    nodepool: apps # Simple hard requirement: node must have this label.
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-east-1a", "us-east-1b"] # Hard zone constraint.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          preference:
            matchExpressions:
              - key: instance-type
                operator: In
                values: ["m6i.large"] # Soft preference.
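To make the difference between hard and soft node rules concrete, here is a hedged Python sketch of how the manifest above would be evaluated against a node's labels. It models only the `In`, `NotIn`, and `Exists` operators; `matches`, `eligible`, and `preference_score` are hypothetical helper names, and the real scheduler normalizes preference weights together with other scoring plugins.

```python
def matches(expressions, labels):
    """True if every matchExpression holds (expressions in a term are ANDed)."""
    for e in expressions:
        value = labels.get(e["key"])
        op = e["operator"]
        if op == "In" and value not in e["values"]:
            return False
        if op == "NotIn" and value in e["values"]:
            return False
        if op == "Exists" and e["key"] not in labels:
            return False
    return True


def eligible(labels, node_selector, required_terms):
    """Hard rules: nodeSelector AND at least one required term must match."""
    selector_ok = all(labels.get(k) == v for k, v in node_selector.items())
    affinity_ok = not required_terms or any(
        matches(t["matchExpressions"], labels) for t in required_terms
    )
    return selector_ok and affinity_ok


def preference_score(labels, preferred):
    """Soft rules: sum the weights of the preferences the node satisfies."""
    return sum(
        p["weight"] for p in preferred
        if matches(p["preference"]["matchExpressions"], labels)
    )


node_selector = {"nodepool": "apps"}
required = [{"matchExpressions": [
    {"key": "topology.kubernetes.io/zone", "operator": "In",
     "values": ["us-east-1a", "us-east-1b"]},
]}]
preferred = [{"weight": 50, "preference": {"matchExpressions": [
    {"key": "instance-type", "operator": "In", "values": ["m6i.large"]},
]}}]

node = {"nodepool": "apps", "topology.kubernetes.io/zone": "us-east-1a",
        "instance-type": "m5.large"}
print(eligible(node, node_selector, required))  # True
print(preference_score(node, preferred))        # 0
```

The node passes both hard rules, so it stays a candidate; it simply earns no bonus from the soft instance-type preference.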

Taints And Tolerations

Taints repel Pods from nodes. Tolerations allow Pods to run on tainted nodes; they do not force placement by themselves.

taints.sh (bash)
# Add a NoSchedule taint to reserve a node for special workloads.
kubectl taint nodes <node-name> dedicated=gpu:NoSchedule

# Remove the taint.
kubectl taint nodes <node-name> dedicated=gpu:NoSchedule-

# Show node taints.
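kubectl describe node <node-name> | grep -i taints

| Effect | Scheduler | Already running Pods |
| --- | --- | --- |
| NoSchedule | Hard block: Pods without a matching toleration are not scheduled. | Unaffected; they keep running. |
| PreferNoSchedule | Soft avoid; the scheduler may still pick the node if no better node is available. | Unaffected. |
| NoExecute | Blocks new Pods lacking a matching toleration. | Pods without a matching toleration are evicted immediately; Pods whose toleration sets tolerationSeconds are evicted after that delay. |

toleration.yaml (yaml)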
tolerations:
  - key: dedicated
    operator: Equal
    value: gpu
    effect: NoSchedule # Allows scheduling onto nodes with dedicated=gpu:NoSchedule.
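The matching rule between a toleration and a taint can be sketched as follows. This is an illustrative model, assuming taints and tolerations are plain dicts shaped like the YAML above; `tolerates` and `can_schedule` are hypothetical helper names.

```python
def tolerates(toleration, taint):
    """Return True if one toleration matches one taint."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")  # Equal is the default
    if operator == "Exists":
        # An empty key with Exists tolerates every taint key.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def can_schedule(tolerations, taints):
    """Every hard taint (NoSchedule, NoExecute) must be tolerated."""
    hard = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


gpu_taint = {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}
gpu_toleration = {"key": "dedicated", "operator": "Equal",
                  "value": "gpu", "effect": "NoSchedule"}

print(can_schedule([gpu_toleration], [gpu_taint]))  # True
print(can_schedule([], [gpu_taint]))                # False
```

Note that `can_schedule` only says the node is allowed. To actually steer the Pod onto the tainted node, pair the toleration with a node selector or node affinity.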

Pod Affinity And Anti-Affinity

pod-anti-affinity.yaml (yaml)
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: web-api
          topologyKey: kubernetes.io/hostname # Prefer spreading replicas across nodes.

pod-affinity-required.yaml (yaml)
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            tier: cache # Co-locate api Pod with cache Pods.
        topologyKey: kubernetes.io/hostname

Topology Spread

topologySpreadConstraints count matching Pods per topology domain (topologyKey). Use DoNotSchedule when uneven placement would risk an outage: Pods stay Pending rather than violate maxSkew. ScheduleAnyway keeps Pods schedulable and only biases scoring toward balance. Make sure every domain has enough schedulable capacity, or the maxSkew constraint cannot be satisfied.

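| whenUnsatisfiable | Use when |
| --- | --- |
| DoNotSchedule | Multi-AZ balance is a hard requirement, typically for stateless workloads with headroom in every zone. |
| ScheduleAnyway | Staying schedulable matters more than balance; bursts may pack one zone, so operators track skew through metrics. |

topology-spread.yaml (yaml)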
topologySpreadConstraints:
  - maxSkew: 1 # Max allowed difference in matching Pod count between a domain and the least-populated domain.
    topologyKey: topology.kubernetes.io/zone # Spread across zones.
    whenUnsatisfiable: ScheduleAnyway # Use DoNotSchedule for hard enforcement.
    labelSelector:
      matchLabels:
        app: web-api
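The skew arithmetic is easy to check by hand. The sketch below assumes the per-zone counts of matching Pods are already known and that every eligible zone appears in the dict, including zones with zero Pods; `skew_if_placed` and `allowed_zones` are hypothetical helper names.

```python
def skew_if_placed(pods_per_zone, zone):
    """Skew for `zone` if one more matching Pod lands there."""
    global_min = min(pods_per_zone.values())
    return pods_per_zone[zone] + 1 - global_min


def allowed_zones(pods_per_zone, max_skew):
    """Zones that a DoNotSchedule constraint would still accept."""
    return sorted(
        z for z in pods_per_zone
        if skew_if_placed(pods_per_zone, z) <= max_skew
    )


counts = {"us-east-1a": 2, "us-east-1b": 2, "us-east-1c": 1}
print(allowed_zones(counts, max_skew=1))  # ['us-east-1c']
```

With maxSkew 1, only us-east-1c is acceptable. If that zone has no schedulable capacity, DoNotSchedule leaves the Pod Pending, while ScheduleAnyway would place it in one of the other zones and accept the imbalance.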

Workload Autoscaling

HPA scales replicas; VPA recommends or changes requests; Cluster Autoscaler/Karpenter adjusts node capacity. HPA needs metrics-server or another metrics source.

hpa.yaml (yaml)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api
  namespace: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api # HPA modifies this Deployment's replica count.
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60 # Target average CPU usage as a percentage of the Pods' CPU requests.
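For a Utilization target, the HPA compares average usage against the Pods' CPU requests, which is why missing or unrealistic requests distort scaling. The core calculation is desired = ceil(current replicas x current utilization / target utilization), skipped when the ratio is within a tolerance (0.1 by default). The sketch below is a simplified model that ignores stabilization windows, scaling policies, and not-yet-ready Pods; `desired_replicas` is a hypothetical helper name.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas, max_replicas, tolerance=0.1):
    """Simplified HPA replica calculation for a Utilization target."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# Values from the manifest above: target 60%, 3 to 20 replicas.
print(desired_replicas(4, 90, 60, 3, 20))  # 6
print(desired_replicas(4, 63, 60, 3, 20))  # 4
print(desired_replicas(3, 10, 60, 3, 20))  # 3
```

The last call shows the floor at work: the raw calculation asks for 1 replica, but minReplicas keeps the Deployment at 3.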

Debug Unschedulable Pods

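scheduling-debug.sh (bash)
# Start with the Pod's Events section; FailedScheduling lists the constraints that rejected nodes.
kubectl describe pod <pod> -n <namespace>
# Scheduling-related events in the namespace, oldest first.
kubectl get events -n <namespace> --sort-by=.lastTimestamp | grep -i -E 'schedul|taint|affinity|insufficient|volume'
# Compare node labels, taints, and allocatable capacity against the Pod's constraints.
kubectl get nodes --show-labels
kubectl describe nodes | grep -A10 -E 'Name:|Taints:|Allocatable:|Allocated resources:'
# Actual usage (requires metrics-server); the scheduler itself looks at requests, not usage.
kubectl top nodes
kubectl top pods -n <namespace>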