Cluster Maintenance Knowledge — K8s SRE Reference

Practice

Practice safe node maintenance, version upgrades, snapshots, certificate health, autoscaling, node pressure, and scheduling controls.

Maintenance work should be deliberate: protect workloads, preserve cluster state, make one node or version change at a time, and verify before returning capacity.

Questions

What is the purpose of draining a node?

Draining safely evicts non-DaemonSet Pods before maintenance such as upgrades, reboots, kernel patches, hardware work, or runtime changes. It helps move workloads elsewhere while respecting controllers and PodDisruptionBudgets.
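
How much room a PodDisruptionBudget leaves for a drain can be sketched with simple arithmetic. This is a minimal illustration that assumes integer minAvailable or maxUnavailable values only (percentages are left out), and the function name is hypothetical rather than part of any Kubernetes client.

```python
from typing import Optional


def allowed_disruptions(expected_pods: int, healthy_pods: int,
                        min_available: Optional[int] = None,
                        max_unavailable: Optional[int] = None) -> int:
    """Estimate how many Pods a budget lets an eviction take down right now."""
    # A budget sets exactly one of the two fields.
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = max(expected_pods - max_unavailable, 0)
    # Evictions are refused once healthy Pods would drop below the floor.
    return max(healthy_pods - desired_healthy, 0)
```

With three healthy replicas and minAvailable 2 the result is 1, so a drain can evict one Pod at a time. If one replica is already unhealthy the result is 0 and the drain waits.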

How do you drain a node?

Use kubectl drain node01 --ignore-daemonsets --delete-emptydir-data when deleting emptyDir data is acceptable. Before running it, check PodDisruptionBudgets, local storage, replacement capacity, and critical workloads.
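
The drain command above can be assembled in a small helper so the risky flag is always an explicit choice. This is only a sketch: build_drain_command is a hypothetical name, and it builds the argument list without running anything. The --timeout and --dry-run=client flags are standard kubectl drain options added here for illustration.

```python
from typing import List, Optional


def build_drain_command(node: str, delete_emptydir_data: bool = False,
                        timeout: Optional[str] = None,
                        dry_run: bool = True) -> List[str]:
    """Return the kubectl drain argv for a node without executing it."""
    if not node:
        raise ValueError("node name is required")
    cmd = ["kubectl", "drain", node, "--ignore-daemonsets"]
    if delete_emptydir_data:
        # Only opt in when losing emptyDir contents is acceptable.
        cmd.append("--delete-emptydir-data")
    if timeout:
        cmd.append(f"--timeout={timeout}")
    if dry_run:
        # Preview which Pods would be evicted before doing it for real.
        cmd.append("--dry-run=client")
    return cmd
```

Defaulting to a dry run is a design choice in this sketch: the real eviction only happens when the caller passes dry_run=False.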

What is the difference between cordon and drain?

Cordon marks a node unschedulable so new Pods do not land there. Drain cordons the node and evicts existing evictable Pods. Drain is the stronger maintenance action.

How do you uncordon a node?

Use kubectl uncordon node01 after the maintenance work is complete and the node is healthy. Then watch new Pods schedule and become Ready.

What is the purpose of kubeadm upgrade?

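kubeadm upgrade helps upgrade kubeadm-managed clusters in a controlled, version-compatible sequence. Run kubeadm upgrade plan and kubeadm upgrade apply on the first control-plane node, then kubeadm upgrade node on any remaining control-plane nodes. Workers are upgraded one at a time: upgrade the kubeadm package, run kubeadm upgrade node, drain the node, update the kubelet and kubectl packages, restart kubelet, and uncordon.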

How do you check available Kubernetes upgrade versions?

Use kubeadm upgrade plan on a control-plane node. It checks the current version, available target versions, component config, and the upgrade path supported by kubeadm.

What is the recommended Kubernetes version skew?

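Keep components within the supported Kubernetes version skew. The kubelet must never be newer than kube-apiserver, and it may lag behind by only a limited number of minor versions: three in recent releases (1.28 and later), two in older ones. kubeadm also cannot skip minor versions, so upgrade one minor version at a time. Always verify the official skew policy for the target version before changing production clusters.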
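
A skew check reduces to comparing minor versions. The sketch below is an illustration with hypothetical function names. The allowed lag is a parameter rather than a constant because it depends on the release, so take the value from the official skew policy.

```python
from typing import Tuple


def parse_minor(version: str) -> Tuple[int, int]:
    """Split a version such as 'v1.30.2' into (major, minor)."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def kubelet_skew_ok(apiserver: str, kubelet: str, max_minor_lag: int) -> bool:
    """True if the kubelet is not newer than the API server and not too old."""
    api_major, api_minor = parse_minor(apiserver)
    kl_major, kl_minor = parse_minor(kubelet)
    if api_major != kl_major:
        return False
    lag = api_minor - kl_minor
    # A negative lag means the kubelet is newer than the API server.
    return 0 <= lag <= max_minor_lag
```

For example, kubelet_skew_ok("v1.30.2", "v1.29.6", 3) is True, while a v1.31.0 kubelet against a v1.30.2 API server fails the check because the kubelet is newer.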

What is the purpose of etcd snapshots?

etcd snapshots back up Kubernetes cluster state for recovery. They are critical for self-managed control planes because etcd holds objects, RBAC, Secrets, workload definitions, and controller state.
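
Taking a snapshot usually means calling etcdctl snapshot save with the endpoint and TLS material for the local etcd member. The helper below only builds that command and does not run it. The endpoint and certificate paths are assumptions based on common kubeadm defaults, and the function name is hypothetical, so adjust both for your cluster.

```python
from typing import List


def build_snapshot_save(snapshot_path: str,
                        endpoint: str = "https://127.0.0.1:2379",
                        pki_dir: str = "/etc/kubernetes/pki/etcd") -> List[str]:
    """Return the etcdctl argv that saves a snapshot to snapshot_path."""
    if not snapshot_path:
        raise ValueError("snapshot_path is required")
    return [
        "etcdctl",
        f"--endpoints={endpoint}",
        # Assumed kubeadm-style certificate layout; verify before use.
        f"--cacert={pki_dir}/ca.crt",
        f"--cert={pki_dir}/server.crt",
        f"--key={pki_dir}/server.key",
        "snapshot", "save", snapshot_path,
    ]
```

Store the resulting file off the control-plane node, since a snapshot that lives only on the failed machine cannot help with recovery.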

How do you restore an etcd snapshot at a high level?

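Stop the API server and etcd static Pods (on kubeadm clusters, by moving their manifests out of the static Pod manifest directory), then restore the snapshot into a new data directory with etcdctl snapshot restore. On newer etcd releases use etcdutl snapshot restore, because the etcdctl subcommand is deprecated there. Point etcd at the restored data directory, adjust the static Pod manifests if needed, then restart the control plane and validate cluster state.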
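
The restore step itself can be sketched as a command builder. This is an illustration under assumptions: the function name and example paths are hypothetical, the tool defaults to etcdctl as in the text above, and newer etcd releases expect etcdutl instead. It refuses to restore into the live data directory so the original data stays available for comparison.

```python
from typing import List


def build_snapshot_restore(snapshot_path: str, restore_dir: str,
                           live_dir: str = "/var/lib/etcd",
                           tool: str = "etcdctl") -> List[str]:
    """Return the argv that restores a snapshot into a fresh data directory."""
    if tool not in ("etcdctl", "etcdutl"):
        raise ValueError("tool must be etcdctl or etcdutl")
    # Restore beside the live data, never on top of it.
    if restore_dir.rstrip("/") == live_dir.rstrip("/"):
        raise ValueError("restore_dir must differ from the live data directory")
    return [tool, "snapshot", "restore", snapshot_path,
            f"--data-dir={restore_dir}"]
```

After the restore, the etcd static Pod manifest still has to be pointed at the new directory before the control plane is restarted.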

What is the purpose of node taints during maintenance?

Taints repel Pods from nodes that should not accept normal workloads. During maintenance, a taint can prevent accidental scheduling while allowing only Pods with matching tolerations.
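
The matching rule between taints and tolerations can be sketched in a few lines. This is a simplified model of the scheduler's behavior, not its actual code. The class and function names are hypothetical, and the maintenance taint key in the usage note is an example rather than a reserved key.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str            # NoSchedule, PreferNoSchedule, or NoExecute
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""          # empty key with Exists matches every taint key
    operator: str = "Equal"
    value: str = ""
    effect: str = ""       # empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """True if this toleration matches the taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if not tol.key and tol.operator != "Exists":
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    return tol.operator == "Equal" and tol.value == taint.value


def blocks_scheduling(taints: Iterable[Taint],
                      tolerations: Iterable[Toleration]) -> bool:
    """True if any hard taint is left untolerated."""
    tols = list(tolerations)
    return any(
        t.effect in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, t) for tol in tols)
        for t in taints
    )
```

A node tainted with Taint("maintenance", "NoSchedule", "true") blocks a Pod with no tolerations, but admits one that carries Toleration(key="maintenance", operator="Exists").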

How do you view node conditions?

Use kubectl describe node <node> and inspect Conditions, Allocated resources, Events, labels, taints, capacity, and allocatable. Conditions such as Ready, DiskPressure, MemoryPressure, PIDPressure, and NetworkUnavailable point to different layers: kubelet health, disk, memory, process IDs, and node networking.
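
The same conditions are available in machine-readable form from kubectl get node <node> -o json. The sketch below assumes that JSON has already been parsed into a dictionary, and the function name is hypothetical. It treats Ready as healthy when True and every other condition as healthy when False, which holds for the built-in pressure and network conditions.

```python
from typing import List


def unhealthy_conditions(node: dict) -> List[str]:
    """List node conditions whose status differs from the healthy value."""
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        ctype, status = cond.get("type"), cond.get("status")
        # Ready should be True; pressure and network conditions should be False.
        expected = "True" if ctype == "Ready" else "False"
        if status != expected:
            problems.append(f"{ctype}={status}")
    return problems
```

An empty list means no condition is flagged. A result such as ["Ready=Unknown"] usually means the kubelet has stopped reporting.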

What causes DiskPressure on a node?

DiskPressure means node disk or image filesystem space is low. Causes include too many images, large container logs, full ephemeral storage, hostPath usage, failed garbage collection, or application writes to node-backed storage.

What is an eviction threshold?

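An eviction threshold is a kubelet setting that defines when Pods should be evicted because a node resource is running low. Thresholds are expressed against eviction signals such as memory.available, nodefs.available, imagefs.available, or pid.available. A hard threshold triggers eviction immediately, while a soft threshold triggers it only after the signal has stayed below the limit for a configured grace period.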
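
A threshold is written as a signal, a less-than sign, and a quantity or percentage, for example memory.available<100Mi or nodefs.available<10%. The sketch below evaluates such an expression against observed values. It is a simplified illustration with hypothetical function names that handles only percentages, plain byte counts, and the Ki, Mi, and Gi suffixes.

```python
_UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def threshold_bytes(value: str, capacity: int) -> int:
    """Convert a threshold such as '100Mi' or '10%' into bytes."""
    if value.endswith("%"):
        # Percentages are relative to the total capacity of the resource.
        return int(capacity * float(value[:-1]) / 100)
    for suffix, factor in _UNITS.items():
        if value.endswith(suffix):
            return int(float(value[:-len(suffix)]) * factor)
    return int(value)


def threshold_crossed(expr: str, available: int, capacity: int) -> bool:
    """True if the available amount is below the threshold in expr."""
    _signal, value = expr.split("<", 1)
    return available < threshold_bytes(value, capacity)
```

For a node with 8 GiB of memory and 50 MiB available, threshold_crossed("memory.available<100Mi", 50 * 1024 ** 2, 8 * 1024 ** 3) is True, so the kubelet would start evicting Pods under a hard threshold of that value.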

How do you manually delete a failed node from the cluster?

Use kubectl delete node <node> after confirming the machine is gone or intentionally removed. Also clean up cloud instances, load balancer attachments, volumes, and replacement capacity as needed.

What is the purpose of kubelet certificate rotation?

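Kubelet certificate rotation automatically renews the kubelet's TLS client certificate as it nears expiry, so the kubelet can keep authenticating to the API server without manual certificate replacement. Serving certificate rotation is a separate kubelet setting (serverTLSBootstrap), and the certificate signing requests it creates typically need approval by an operator or an external approver.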

How do you check certificate expiration in a kubeadm cluster?

Use kubeadm certs check-expiration on a control-plane node. Review API server, controller-manager, scheduler, front-proxy, and etcd certificate dates before upgrades or planned maintenance windows.
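
Once the expiry dates are known, flagging the certificates that need attention before a maintenance window is a simple date comparison. The sketch below assumes the dates have already been collected into a dictionary. It does not parse kubeadm output, and the function name and 30-day default are illustrative choices.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def expiring_soon(expirations: Dict[str, datetime], now: datetime,
                  warn_days: int = 30) -> List[str]:
    """Return certificate names that expire within warn_days of now."""
    cutoff = now + timedelta(days=warn_days)
    return sorted(name for name, expires in expirations.items()
                  if expires <= cutoff)


if __name__ == "__main__":
    now = datetime(2025, 1, 1, tzinfo=timezone.utc)
    certs = {
        "apiserver": datetime(2025, 1, 15, tzinfo=timezone.utc),
        "etcd-server": datetime(2025, 6, 1, tzinfo=timezone.utc),
    }
    print(expiring_soon(certs, now))
```

Already-expired certificates are included in the result because their expiry also falls before the cutoff.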

What is the purpose of the cluster autoscaler?

Cluster autoscaler adds nodes when Pods cannot schedule due to capacity and removes underused nodes when their workloads can be safely moved elsewhere. It reacts to Pods that are unschedulable given their resource requests, not to observed CPU or memory usage.

What is the difference between horizontal and vertical autoscaling?

Horizontal autoscaling changes the number of Pods, such as HPA increasing replicas. Vertical autoscaling changes CPU and memory requests or limits for Pods, often through VPA recommendations or updates.
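
The replica count an HPA aims for follows a documented ratio: desired replicas equal the current replicas multiplied by the current metric value over the target value, rounded up. The sketch below applies that formula with a tolerance band. The function name is hypothetical and the 0.1 tolerance mirrors the commonly cited default, so treat it as an assumption for your cluster.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Apply the HPA scaling ratio, skipping changes inside the tolerance."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Small deviations from the target do not trigger scaling.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)
```

With 4 replicas at 200m CPU against a 100m target the result is 8. At 105m the ratio falls inside the tolerance and the count stays at 4.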

What is the purpose of node labels in maintenance?

Node labels group nodes for targeted scheduling, selection, and operational workflows. They help identify node pools, zones, hardware types, maintenance batches, ownership, or workload classes.

How do you safely reboot a node during active service?

Cordon the node, drain it while respecting disruption budgets, reboot, verify kubelet, runtime, CNI, CSI, node conditions, and system Pods, then uncordon. Watch workloads reschedule and become Ready before moving to the next node.
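
The rule of finishing one node before starting the next can be sketched as a loop that stops at the first node that fails verification. The maintain and is_healthy callbacks are hypothetical placeholders for the real cordon, drain, reboot, check, and uncordon steps. This is an outline of the control flow, not a complete tool.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def rolling_maintenance(nodes: Iterable[str],
                        maintain: Callable[[str], None],
                        is_healthy: Callable[[str], bool]
                        ) -> Tuple[List[str], Optional[str]]:
    """Maintain nodes one at a time and stop at the first unhealthy node."""
    done: List[str] = []
    for node in nodes:
        maintain(node)          # e.g. cordon, drain, reboot, uncordon
        if not is_healthy(node):
            # Halt so a bad change never reaches the remaining nodes.
            return done, node
        done.append(node)
    return done, None
```

The function returns the nodes that completed plus the node that failed, if any, so an operator can investigate before capacity shrinks further.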
