Hardware Maintenance
TL;DR
For node hardware work, protect workloads first, then service the machine. Cordon, inspect, drain, perform maintenance, validate kubelet/runtime/CNI/CSI, uncordon, and watch workload recovery.
Maintenance Checklist
| Step | Checkpoint | Rollback |
|---|---|---|
| Stakeholders | Maintenance window communicated; on-call aligned. | Defer or widen window. |
| Capacity | Remaining nodes can absorb evicted Pods (CPU/mem/disk quotas). | Scale nodes or reschedule work. |
| Data | emptyDir tolerated or copied; StatefulSets/Volumes reachable from other nodes. | Snapshot or delay drain. |
| PDB coverage | Drain will not deadlock on minAvailable/maxUnavailable. | Temporarily relax the PDB (policy-gated). |
| Ingress/backends | DaemonSets/load paths healthy when node goes away. | Prep alternate paths. |
| Evidence | Save kubectl describe node and pod list snapshots. | Compare post-maintenance. |
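The Evidence row can be sketched as a small helper that writes both snapshots into one directory, so the same files can be captured again after maintenance and diffed. The function name `snapshot_node` and the file names are hypothetical, chosen only for illustration; the two `kubectl` commands are the ones used elsewhere in this runbook.

```shell
#!/usr/bin/env bash
# Sketch: capture pre-maintenance evidence for one node.
# snapshot_node and the output file names are illustrative assumptions.
snapshot_node() {
  local node="$1" outdir="$2"
  mkdir -p "$outdir" || return 1
  # Conditions, taints, capacity, allocatable, and recent events.
  kubectl describe node "$node" > "$outdir/describe-node.txt" || return 1
  # Every Pod currently scheduled on the node, across all namespaces.
  kubectl get pods -A --field-selector "spec.nodeName=$node" -o wide \
    > "$outdir/pods.txt" || return 1
  # Print the directory so callers can record where the evidence went.
  echo "$outdir"
}
```

A possible usage is `snapshot_node "$NODE" "evidence/$NODE-before"` ahead of the drain and `snapshot_node "$NODE" "evidence/$NODE-after"` once the node is back, followed by `diff -u` on the two `pods.txt` files.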
Bare Metal vs. VM Steps
| Concern | Bare metal | Virtual machine |
|---|---|---|
| Power/boot | IPMI/iDRAC/iLO for cold boot; BMC credentials ready. | Hypervisor console or cloud “reset instance” APIs. |
| Networking | Verify switch port, VLAN, NIC firmware after card swap. | vNIC attachment, SR-IOV/MTU quirks, security-group drift. |
| Disk | RAID rebuild time; SMART after drive swap. | Volume detach/attach, device rename on attach. |
| Certificates | Hostname stable; kubeadm/kubelet certs on disk. | Clock skew in golden images; DHCP vs static IPs may change SANs. |
| Isolation | Physical access windows; PDU steps. | Prefer cordon-before-snapshot workflows for fast rollback clones. |
On VMs, if your storage subsystem supports quiescing, pause workloads before taking hypervisor snapshots. Either way, Kubernetes-level cordon and drain remain the portable layer across platforms.
Preflight
preflight.sh (bash)
NODE=worker-1
kubectl get node "$NODE" -o wide
kubectl describe node "$NODE" # Conditions, taints, capacity, allocatable, events.
kubectl get pods -A --field-selector spec.nodeName="$NODE" -o wide
kubectl get pdb -A
kubectl top node "$NODE" # Requires metrics-server.
# Confirm the cluster has enough capacity before removing this node.
kubectl get nodes
kubectl get pods -A | grep -E 'Pending|CrashLoopBackOff|ImagePullBackOff'
Maintenance Flow
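Because a drain can stall on a PodDisruptionBudget, it may be worth checking for budgets with no headroom before starting this flow. The sketch below lists every PDB whose `status.disruptionsAllowed` is currently 0. The helper name `blocking_pdbs` is hypothetical, and the check is cluster-wide, so it is deliberately conservative: a listed PDB only matters if it selects Pods on the node being drained.

```shell
#!/usr/bin/env bash
# Sketch: list PodDisruptionBudgets that currently allow zero disruptions.
# blocking_pdbs is an illustrative helper name, not part of kubectl.
blocking_pdbs() {
  kubectl get pdb -A --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed' |
    awk '$3 == "0" { print $1 "/" $2 }'   # Emit namespace/name for each blocked budget.
}
```

If the function prints anything, review those budgets against the Pods on the node before running `kubectl drain`; empty output means no budget is at zero headroom right now.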
node-maintenance.sh (bash)
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=20m
# On the node, after workload evacuation:
sudo systemctl stop kubelet
sudo systemctl stop containerd # Or whichever container runtime this node uses.
# Perform hardware/vendor work here: firmware, disk, memory, NIC, BIOS, reboot.
sudo reboot
# After node returns:
sudo systemctl status containerd --no-pager
sudo systemctl status kubelet --no-pager
journalctl -u kubelet -n 100 --no-pager
kubectl get node "$NODE" # Expect Ready,SchedulingDisabled before uncordoning.
kubectl uncordon "$NODE"
Post-Maintenance Checks
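The maintenance flow uncordons right after inspecting the node by eye. One way to make that step conditional is to gate it on the node's Ready condition with `kubectl wait`. This is a minimal sketch, not a definitive procedure: the function name `ready_then_uncordon` and the 10-minute default timeout are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: uncordon a node only after it reports Ready.
# ready_then_uncordon and the default timeout are illustrative assumptions.
ready_then_uncordon() {
  local node="$1" timeout="${2:-10m}"
  # kubectl wait blocks until the condition holds or the timeout expires.
  if kubectl wait --for=condition=Ready "node/$node" --timeout="$timeout"; then
    kubectl uncordon "$node"
  else
    echo "node $node not Ready after $timeout; leaving it cordoned" >&2
    return 1
  fi
}
```

Called as `ready_then_uncordon "$NODE" 15m`, it leaves the node cordoned on failure, so the kubelet and runtime logs can be inspected before any workload is scheduled back.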
post-checks.sh (bash)
kubectl get node "$NODE" -o wide
kubectl describe node "$NODE" | grep -A8 Conditions
kubectl get pods -A --field-selector spec.nodeName="$NODE" -o wide
kubectl get events -A --sort-by=.lastTimestamp | tail -80
# Node-local checks if you have SSH:
sudo crictl info # Runtime responds over CRI (crictl usually needs root).
sudo crictl ps # Containers are running.
ip addr # NICs came back.
df -h # Disk pressure risk.
free -h # Memory visible after maintenance.
Failure Map
| Symptom | Likely area | First move |
|---|---|---|
| Node NotReady after reboot | kubelet, runtime, CNI, cert, network. | Check kubelet logs. |
| Pods stuck ContainerCreating | CNI/CSI/image pull. | Describe Pod events. |
| Volumes will not attach | CSI, stale attachment, zone. | Check VolumeAttachment and CSI logs. |
| Node has DiskPressure | Logs/images/ephemeral storage. | Check disk usage and kubelet eviction. |
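As a first triage step for the symptoms above, the node's conditions can be reduced to only the ones that look wrong. The sketch below assumes the usual convention that `Ready` should be `True` and every other condition (for example `DiskPressure` or `MemoryPressure`) should be `False`. The helper name `unhealthy_conditions` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print node conditions that deviate from the healthy state.
# unhealthy_conditions is an illustrative helper name.
unhealthy_conditions() {
  local node="$1"
  kubectl get node "$node" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' |
    awk -F= '
      NF == 0 { next }                                   # Skip blank lines.
      $1 == "Ready" && $2 != "True"  { print $1 "=" $2; next }
      $1 != "Ready" && $2 != "False" { print $1 "=" $2 } # Pressure or Unknown states.
    '
}
```

Empty output suggests the conditions are healthy. A line such as `DiskPressure=True` points at the matching row of the failure map, and `Ready=Unknown` usually means the kubelet has stopped reporting, so its logs are the next place to look.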