Scheduling, Affinity & Autoscaling
The scheduler places Pods on nodes based on resource requests, taints/tolerations, node selectors, affinity, topology spread, and volume constraints. Autoscaling adjusts replica counts or node capacity, but it only works reliably when requests, probes, and disruption budgets are set realistically.
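As a minimal sketch of the disruption-budget piece, the PodDisruptionBudget below keeps at least two Pods available during voluntary disruptions such as node drains, including drains performed when nodes are scaled down. The name, namespace, label, and `minAvailable` value are illustrative assumptions that mirror the `web-api` example used later in this section; they are not recommended values.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-api        # Assumed name, mirrors the example Deployment.
  namespace: app       # Assumed namespace.
spec:
  minAvailable: 2      # Illustrative; evictions are refused if they would drop below this.
  selector:
    matchLabels:
      app: web-api     # Must match the Pod labels of the workload.
```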
Scheduler Flow
For each pending Pod, the scheduler filters out nodes that cannot run it, scores the remaining nodes, then binds the Pod to the highest-scoring node.
Requests, Limits, And QoS
CPU and memory requests drive scheduling. Limits drive runtime enforcement. Missing requests can cause bad placement, noisy neighbors, and poor autoscaling behavior.
resources:
  requests:
    cpu: 250m # Scheduler reserves this amount on a node.
    memory: 256Mi # Used for scheduling and QoS classification.
  limits:
    cpu: "1" # Container can be throttled above this.
    memory: 512Mi # Container can be OOMKilled above this.

| QoS Class | When | Eviction Priority |
|---|---|---|
| Guaranteed | Every container sets CPU and memory requests and limits, and each request equals its limit. | Most protected. |
| Burstable | At least one container sets a request or limit, but the Pod does not meet the Guaranteed criteria. | Middle. |
| BestEffort | No requests or limits. | Evicted first. |
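To illustrate the Guaranteed row, a container spec along the following lines sets requests equal to limits for both CPU and memory. The values are placeholders, and every container in the Pod would need the same pattern for the Pod to be classified as Guaranteed.

```yaml
resources:
  requests:
    cpu: 500m      # Placeholder value.
    memory: 512Mi  # Placeholder value.
  limits:
    cpu: 500m      # Equal to the CPU request.
    memory: 512Mi  # Equal to the memory request.
```

The class Kubernetes actually assigned can be read back with `kubectl get pod <pod> -n <namespace> -o jsonpath='{.status.qosClass}'`.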
Node Selection
spec:
  nodeSelector:
    nodepool: apps # Simple hard requirement: node must have this label.
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-east-1a", "us-east-1b"] # Hard zone constraint.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          preference:
            matchExpressions:
              - key: node.kubernetes.io/instance-type # Well-known label; a bare "instance-type" label would have to be set manually.
                operator: In
                values: ["m6i.large"] # Soft preference.

Taints And Tolerations
Taints repel Pods from nodes. Tolerations allow Pods to run on tainted nodes; they do not force placement by themselves.
# Add a NoSchedule taint to reserve a node for special workloads.
kubectl taint nodes <node-name> dedicated=gpu:NoSchedule
# Remove the taint.
kubectl taint nodes <node-name> dedicated=gpu:NoSchedule-
# Show node taints.
kubectl describe node <node-name> | grep -i taints

| Effect | Scheduler | Already running Pods |
|---|---|---|
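| NoSchedule | Hard block without matching toleration. | Unaffected; Pods already on the node keep running. |
| PreferNoSchedule | Soft avoid; scheduler may still pick the node when no better node fits. | Unaffected. |
| NoExecute | Blocks new Pods lacking toleration. | Pods without a matching toleration are evicted immediately. With a matching toleration, they are evicted after tolerationSeconds if it is set and stay indefinitely if it is not. |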
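Because a toleration only permits placement, dedicating nodes usually pairs it with a node selector or node affinity. The fragment below is a sketch under assumptions: the `dedicated: gpu` node label is hypothetical and would have to be applied to the nodes separately from the taint, and the 300-second value is illustrative. The second toleration shows the `tolerationSeconds` behavior from the NoExecute row.

```yaml
spec:
  nodeSelector:
    dedicated: gpu                 # Hypothetical node label; attracts the Pod to the reserved nodes.
  tolerations:
    - key: dedicated
      operator: Equal
      value: gpu
      effect: NoSchedule           # Permits scheduling onto dedicated=gpu:NoSchedule nodes.
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300       # Stay bound for up to 300s after the taint appears, then be evicted.
```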
tolerations:
  - key: dedicated
    operator: Equal
    value: gpu
    effect: NoSchedule # Allows scheduling onto nodes with dedicated=gpu:NoSchedule.

Pod Affinity And Anti-Affinity
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: web-api
          topologyKey: kubernetes.io/hostname # Prefer spreading replicas across nodes.

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            tier: cache # Co-locate api Pod with cache Pods.
        topologyKey: kubernetes.io/hostname

Topology Spread
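topologySpreadConstraints count matching Pods per topology domain (topologyKey) and limit the difference between domains to maxSkew. Use DoNotSchedule when balance across failure domains is a hard requirement, for example to survive a zone outage; a Pod that cannot satisfy the constraint stays Pending. ScheduleAnyway keeps the cluster schedulable while still biasing placement toward balance. With DoNotSchedule, make sure every domain has nodes with free capacity, otherwise the constraint cannot be satisfied.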
| whenUnsatisfiable | Use when |
|---|---|
| DoNotSchedule | Hard multi-AZ balance for stateless workloads with headroom. |
| ScheduleAnyway | Staying schedulable matters more than balance; burst traffic may pack one zone, so skew has to be watched through monitoring. |
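As a worked illustration of how `maxSkew` is evaluated, the sketch below uses a hard zone constraint and walks through the arithmetic in comments. The three zones and their Pod counts are made-up numbers, and the `app: web-api` label is reused from the other examples in this section.

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule   # Pod stays Pending rather than exceed maxSkew.
    labelSelector:
      matchLabels:
        app: web-api
# Assumed current counts of matching Pods: zone-a=3, zone-b=2, zone-c=2.
# Skew of a domain = its count minus the smallest count among eligible domains.
# Placing the next Pod in zone-a gives 4 - 2 = 2, which exceeds maxSkew, so zone-a is filtered out.
# Placing it in zone-b gives counts 3/3/2, skew 3 - 2 = 1, which is allowed.
```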
topologySpreadConstraints:
  - maxSkew: 1 # Difference allowed between topology domains.
    topologyKey: topology.kubernetes.io/zone # Spread across zones.
    whenUnsatisfiable: ScheduleAnyway # Use DoNotSchedule for hard enforcement.
    labelSelector:
      matchLabels:
        app: web-api

Workload Autoscaling
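HPA scales replicas; VPA recommends or changes requests; Cluster Autoscaler or Karpenter adjusts node capacity. For CPU and memory (Resource) metrics, HPA needs metrics-server; custom and external metrics need a metrics adapter. Utilization targets are measured as a percentage of the Pods' requests, so the target workload must set requests.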
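For the VPA part, a recommendation-only object is a low-risk way to see suggested requests before changing anything. This is a sketch that assumes the VPA components and CRDs are installed in the cluster (they are not part of core Kubernetes); the object name is hypothetical, and the target is the same `web-api` Deployment used in the HPA manifest.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-api-vpa      # Hypothetical name.
  namespace: app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  updatePolicy:
    updateMode: "Off"    # Only compute recommendations; do not evict or resize Pods.
```

`kubectl describe vpa web-api-vpa -n app` lists the recommendations once enough usage data exists. Recommendation-only mode is chosen here because letting VPA actively change CPU or memory requests while an HPA scales on the same resource metric is generally discouraged.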
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api
  namespace: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api # HPA modifies this Deployment's replica count.
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60 # Percent of requested CPU, averaged across Pods.

Debug Unschedulable Pods
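# Read the Events section for the scheduler's FailedScheduling reason.
kubectl describe pod <pod> -n <namespace>
# Scheduling-related events, oldest first.
kubectl get events -n <namespace> --sort-by=.lastTimestamp | grep -i -E 'schedul|taint|affinity|insufficient|volume'
# Check node labels against nodeSelector and affinity rules.
kubectl get nodes --show-labels
# Compare taints and allocatable capacity with what is already requested.
kubectl describe nodes | grep -E -A10 'Name:|Taints:|Allocatable:|Allocated resources:'
# Actual usage (requires metrics-server).
kubectl top nodes
kubectl top pods -n <namespace>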