Hardware Maintenance — K8s SRE Reference

TL;DR

For node hardware work, protect workloads first, then service the machine. Cordon, inspect, drain, perform maintenance, validate kubelet/runtime/CNI/CSI, uncordon, and watch workload recovery.

Maintenance Checklist

Step	Checkpoint	Rollback
Stakeholders	Maintenance window communicated; on-call aligned.	Defer or widen window.
Capacity	Remaining nodes can absorb evicted Pods (CPU/mem/disk quotas).	Scale nodes or reschedule work.
Data	emptyDir tolerated or copied; StatefulSets/Volumes reachable from other nodes.	Snapshot or delay drain.
PDB coverage	Drain will not deadlock on minAvailable/maxUnavailable.	Temporarily relax the PDB (policy-gated).
Ingress/backends	DaemonSets/load paths healthy when node goes away.	Prep alternate paths.
Evidence	Save `kubectl describe node` and pod list snapshots.	Compare post-maintenance.
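The Evidence row can be scripted so the same files exist before and after the work. This is a minimal sketch, not a prescribed tool: `snapshot_node` and the `./evidence` default directory are hypothetical names, and it assumes `kubectl` is already pointed at the right cluster.

```shell
# Sketch: capture pre-maintenance evidence for one node.
# snapshot_node and the ./evidence default are illustrative names only.
snapshot_node() {
  local node="$1"
  local dir="${2:-./evidence}"
  local stamp out
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  out="$dir/$node-$stamp"
  mkdir -p "$out" || return 1
  # Same commands as the checklist and preflight, redirected to files.
  kubectl describe node "$node" > "$out/describe-node.txt" || return 1
  kubectl get pods -A --field-selector "spec.nodeName=$node" -o wide > "$out/pods.txt" || return 1
  kubectl get pdb -A > "$out/pdb.txt" || return 1
  # Print the directory so the caller can diff against a later snapshot.
  printf '%s\n' "$out"
}
```

Run `snapshot_node worker-1` before the drain and again after uncordon, then `diff -r` the two directories.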

Bare Metal vs. VM Steps

Concern	Bare metal	Virtual machine
Power/boot	IPMI/iDRAC/iLO for cold boot; BMC credentials ready.	Hypervisor console or cloud “reset instance” APIs.
Networking	Verify switch port, VLAN, NIC firmware after card swap.	vNIC attachment, SR-IOV/MTU quirks, security-group drift.
Disk	RAID rebuild time; SMART after drive swap.	Volume detach/attach, device rename on attach.
Certificates	Hostname stable; kubeadm/kubelet certs on disk.	Golden image clocks; DHCP vs static IPs may change SANs.
Isolation	Physical access windows; PDU steps.	Prefer cordon-before-snapshot workflows for fast rollback clones.

On VMs, prefer pausing workloads before taking hypervisor snapshots if your storage subsystem supports quiescing; either way, Kubernetes-level cordon/drain remains the portability layer.

Preflight

preflight.sh (bash)

NODE=worker-1

kubectl get node "$NODE" -o wide
kubectl describe node "$NODE" # Conditions, taints, capacity, allocatable, events.
kubectl get pods -A --field-selector spec.nodeName="$NODE" -o wide
kubectl get pdb -A
kubectl top node "$NODE" # Requires metrics-server.

# Confirm the cluster has enough capacity before removing this node.
kubectl get nodes
kubectl get pods -A | grep -E 'Pending|CrashLoopBackOff|ImagePullBackOff'
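`kubectl get pdb -A` shows every budget, but the ones that matter for a drain are those that currently allow zero disruptions. The sketch below filters for them. `blocking_pdbs` is a hypothetical helper name; it assumes the `policy/v1` status field `disruptionsAllowed` is populated. A budget listed here only blocks the drain if it selects Pods on the node being serviced, so treat the output as a list to review rather than a verdict.

```shell
# Sketch: list PDBs that currently allow no voluntary disruptions.
# blocking_pdbs is an illustrative helper, not a kubectl subcommand.
blocking_pdbs() {
  kubectl get pdb -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.disruptionsAllowed}{"\n"}{end}' |
    awk -F'\t' '$3 == 0 { print $1 "/" $2 }'
}
```

Each output line is `namespace/name`. An empty result means no budget is at zero right now.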

Maintenance Flow

node-maintenance.sh (bash)

NODE=worker-1

# From an admin workstation:
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=20m

# On the node, after workload evacuation:
sudo systemctl stop kubelet
sudo systemctl stop containerd # Or whichever container runtime the cluster uses.

# Perform hardware/vendor work here: firmware, disk, memory, NIC, BIOS, reboot.
sudo reboot

# After node returns:
sudo systemctl status containerd --no-pager
sudo systemctl status kubelet --no-pager
journalctl -u kubelet -n 100 --no-pager

# Back on the admin workstation, once the node reports Ready:
kubectl get node "$NODE"
kubectl uncordon "$NODE"
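The last two commands above can be gated so the node is only uncordoned once it actually reports Ready. A minimal sketch, assuming `kubectl wait` is available in your client version; `uncordon_when_ready` and the 10m default timeout are illustrative choices, not values taken from this reference.

```shell
# Sketch: uncordon only after the node reports Ready.
# uncordon_when_ready and the 10m default are illustrative choices.
uncordon_when_ready() {
  local node="$1"
  local timeout="${2:-10m}"
  if kubectl wait --for=condition=Ready "node/$node" --timeout="$timeout"; then
    kubectl uncordon "$node"
  else
    # Leave the node cordoned so no new Pods land on a broken machine.
    echo "node $node not Ready within $timeout; leaving it cordoned" >&2
    return 1
  fi
}
```

Usage: `uncordon_when_ready "$NODE" 15m`. A non-zero exit leaves the node cordoned for investigation.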

Post-Maintenance Checks

post-checks.sh (bash)

NODE=worker-1

kubectl get node "$NODE" -o wide
kubectl describe node "$NODE" | grep -A8 Conditions
kubectl get pods -A --field-selector spec.nodeName="$NODE" -o wide
kubectl get events -A --sort-by=.lastTimestamp | tail -80

# Node-local checks if you have SSH:
sudo crictl info # Runtime answers over the CRI socket.
sudo crictl ps # Containers are running.
ip addr # NICs came back.
df -h # Disk pressure risk.
free -h # Memory visible after maintenance.
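Reading the Conditions block by eye is easy to get wrong, so a script can flag only the conditions in a bad state. This sketch treats `Ready` as healthy when `True` and every other condition (MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable) as healthy when not `True`. `unhealthy_conditions` is a hypothetical helper name.

```shell
# Sketch: print node conditions that are in an unhealthy state.
# unhealthy_conditions is an illustrative helper, not a kubectl feature.
unhealthy_conditions() {
  local node="$1"
  kubectl get node "$node" -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' |
    awk -F= '($1 == "Ready" && $2 != "True") || ($1 != "Ready" && $2 == "True") { print }'
}
```

No output means every reported condition looks healthy. A line such as `DiskPressure=True` points at the matching row of the failure map.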

Failure Map

Symptom	Likely area	First move
Node NotReady after reboot	kubelet, runtime, CNI, cert, network.	Check kubelet logs.
Pods stuck ContainerCreating	CNI/CSI/image pull.	Describe Pod events.
Volumes will not attach	CSI, stale attachment, zone.	Check VolumeAttachment and CSI logs.
Node has DiskPressure	Logs/images/ephemeral storage.	Check disk usage and kubelet eviction.
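The first moves in the table can be wrapped in one helper so on-call staff run the same command for the same symptom. This is a sketch under stated assumptions: `triage_node` and its symptom keywords are hypothetical, it covers only the kubectl side, and kubelet or CSI logs still have to be read on the node or from the driver's Pods.

```shell
# Sketch: run a kubectl "first move" for a symptom from the failure map.
# triage_node and the symptom keywords are illustrative names.
triage_node() {
  local symptom="$1"
  local node="$2"
  case "$symptom" in
    not-ready)
      # Conditions and recent events; kubelet logs still need node access.
      kubectl describe node "$node"
      ;;
    container-creating)
      # Pods scheduled to this node that are not in the Running phase.
      kubectl get pods -A -o wide \
        --field-selector "spec.nodeName=$node,status.phase!=Running"
      ;;
    volume-attach)
      # Cluster-wide attachment objects; look for stale entries for this node.
      kubectl get volumeattachment
      ;;
    disk-pressure)
      kubectl describe node "$node" | grep -A8 Conditions
      ;;
    *)
      echo "unknown symptom: $symptom" >&2
      return 2
      ;;
  esac
}
```

Example: `triage_node container-creating worker-1`, then `kubectl describe pod` on whichever Pod it lists to read its events.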