TL;DR

For node hardware work, protect workloads first, then service the machine. Cordon, inspect, drain, perform maintenance, validate kubelet/runtime/CNI/CSI, uncordon, and watch workload recovery.

Maintenance Checklist

| Step | Checkpoint | Rollback |
| --- | --- | --- |
| Stakeholders | Maintenance window communicated; on-call aligned. | Defer or widen window. |
| Capacity | Remaining nodes can absorb evicted Pods (CPU/mem/disk quotas). | Scale nodes or reschedule work. |
| Data | emptyDir tolerated or copied; StatefulSets/Volumes reachable from other nodes. | Snapshot or delay drain. |
| PDB coverage | Drain will not deadlock on minAvailable/maxUnavailable. | Temporary PDB relax (policy-gated). |
| Ingress/backends | DaemonSets/load paths healthy when node goes away. | Prep alternate paths. |
| Evidence | Save kubectl describe node and pod list snapshots. | Compare post-maintenance. |
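
The Evidence row can be scripted so the same files exist before and after the work. This is a minimal sketch, not a fixed procedure: the function name `snapshot_node` and the output file names are hypothetical, and it assumes `kubectl` is already pointed at the right cluster.

```shell
# Hypothetical helper: save pre- or post-maintenance evidence for one node.
# Usage: snapshot_node <node> <output-dir>
snapshot_node() {
  node=$1
  out=$2
  mkdir -p "$out" || return 1
  # Conditions, taints, capacity, allocatable, and recent events.
  kubectl describe node "$node" > "$out/describe-node.txt" || return 1
  # Every Pod currently scheduled on the node.
  kubectl get pods -A --field-selector "spec.nodeName=$node" -o wide \
    > "$out/pods-on-node.txt" || return 1
  # PDB state at the time of the snapshot.
  kubectl get pdb -A > "$out/pdb.txt" || return 1
}
```

For example, run `snapshot_node worker-1 evidence/worker-1-pre` before the drain and `snapshot_node worker-1 evidence/worker-1-post` after uncordon, then compare the two directories with `diff -ru`.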

Bare Metal Vs VM Steps

| Concern | Bare metal | Virtual machine |
| --- | --- | --- |
| Power/boot | IPMI/iDRAC/iLO for cold boot; BMC credentials ready. | Hypervisor console or cloud “reset instance” APIs. |
| Networking | Verify switch port, VLAN, NIC firmware after card swap. | vNIC attachment, SR-IOV/MTU quirks, security-group drift. |
| Disk | RAID rebuild time; SMART after drive swap. | Volume detach/attach, device rename on attach. |
| Certificates | Hostname stable; kubeadm/kubelet certs on disk. | Golden image clocks; DHCP vs static IPs may change SANs. |
| Isolation | Physical access windows; PDU steps. | Prefer cordon-before-snapshot workflows for fast rollback clones. |

On VMs, if your storage subsystem supports quiescing, pause workloads before taking hypervisor snapshots; either way, Kubernetes-level cordon/drain remains the portability layer.

Preflight

```bash
# preflight.sh
NODE=worker-1

kubectl get node "$NODE" -o wide
kubectl describe node "$NODE" # Conditions, taints, capacity, allocatable, events.
kubectl get pods -A --field-selector spec.nodeName="$NODE" -o wide
kubectl get pdb -A
kubectl top node "$NODE" # Requires metrics-server.

# Confirm the cluster has enough capacity before removing this node.
kubectl get nodes
kubectl get pods -A | grep -E 'Pending|CrashLoopBackOff|ImagePullBackOff'
```
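
The PDB listing in the preflight commands can be narrowed to the budgets most likely to stall a drain. The sketch below is one possible filter, under the assumption that a PDB reporting `status.disruptionsAllowed` of 0 is worth reviewing first; the function name `blocking_pdbs` is hypothetical. It is deliberately coarse: it lists every such PDB cluster-wide, including ones whose Pods do not run on the node being drained.

```shell
# Hypothetical helper: print namespace/name of every PDB that currently
# allows zero voluntary disruptions. Evictions of Pods covered by these
# budgets are refused until the budget allows at least one disruption.
blocking_pdbs() {
  kubectl get pdb -A --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed' |
    awk '$3 == 0 { print $1 "/" $2 }'
}
```

An empty result suggests no budget is fully exhausted right now; any line printed is a candidate for the policy-gated PDB review in the checklist.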

Maintenance Flow

```bash
# node-maintenance.sh
NODE=worker-1

# From an admin workstation: stop new scheduling, then evict workloads.
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=20m

# On the node, after workload evacuation:
sudo systemctl stop kubelet
sudo systemctl stop containerd # Or whichever container runtime this node runs.

# Perform hardware/vendor work here: firmware, disk, memory, NIC, BIOS, reboot.
sudo reboot

# On the node, after it returns:
sudo systemctl status containerd --no-pager
sudo systemctl status kubelet --no-pager
journalctl -u kubelet -n 100 --no-pager

# From the admin workstation: confirm STATUS is Ready before uncordoning.
kubectl get node "$NODE"
kubectl uncordon "$NODE"
```
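
The last two commands of the maintenance flow can be gated so the node is only uncordoned once it reports Ready. This is a minimal sketch; the function name `uncordon_when_ready` and the 10-minute default timeout are assumptions, not values from the flow above.

```shell
# Hypothetical helper: uncordon a node only after it reports Ready.
# Usage: uncordon_when_ready <node> [timeout, default 10m]
uncordon_when_ready() {
  node=$1
  timeout=${2:-10m}
  # kubectl wait blocks until the Ready condition is True or the timeout hits.
  if kubectl wait --for=condition=Ready "node/$node" --timeout="$timeout"; then
    kubectl uncordon "$node"
  else
    echo "node $node not Ready after $timeout; leaving it cordoned" >&2
    return 1
  fi
}
```

Leaving the node cordoned on failure keeps the scheduler away from it while you work through the kubelet and runtime logs.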

Post-Maintenance Checks

```bash
# post-checks.sh
NODE=worker-1

kubectl get node "$NODE" -o wide
kubectl describe node "$NODE" | grep -A8 Conditions
kubectl get pods -A --field-selector spec.nodeName="$NODE" -o wide
kubectl get events -A --sort-by=.lastTimestamp | tail -80

# Node-local checks if you have SSH:
sudo crictl info # Runtime can talk to CRI.
sudo crictl ps # Containers are running.
ip addr # NICs came back.
df -h # Disk pressure risk.
free -h # Memory visible after maintenance.
```
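
The Conditions check above can also be turned into a pass/fail test for scripts. The sketch below assumes a healthy node reports `Ready=True` and every other condition (such as MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable) as `False`; the function name `node_conditions_ok` is hypothetical.

```shell
# Hypothetical helper: exit 0 only if Ready is True and every other
# node condition is False; print each condition that looks unhealthy.
node_conditions_ok() {
  node=$1
  kubectl get node "$node" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' |
    awk -F= '
      NF < 2 { next }
      $1 == "Ready" { ready = ($2 == "True"); next }
      $2 != "False" { print "unhealthy condition: " $0; bad = 1 }
      END {
        if (!ready) print "Ready condition is not True"
        exit (bad || !ready)
      }
    '
}
```

A non-zero exit here is a cue to go to the kubelet logs before looking at individual workloads.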

Failure Map

| Symptom | Likely area | First move |
| --- | --- | --- |
| Node NotReady after reboot | kubelet, runtime, CNI, cert, network. | Check kubelet logs. |
| Pods stuck ContainerCreating | CNI/CSI/image pull. | Describe Pod events. |
| Volumes will not attach | CSI, stale attachment, zone. | Check VolumeAttachment and CSI logs. |
| Node has DiskPressure | Logs/images/ephemeral storage. | Check disk usage and kubelet eviction. |
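
For the "Volumes will not attach" row, a first look at VolumeAttachment objects can be narrowed to one node. This is only a sketch of that first move: the function name `node_volume_attachments` is hypothetical, and interpreting the result (for example, an attachment still pinned to the drained node) still needs the CSI driver logs.

```shell
# Hypothetical helper: list VolumeAttachment objects that reference a node,
# with their reported attach status.
# Usage: node_volume_attachments <node>
node_volume_attachments() {
  node=$1
  kubectl get volumeattachment --no-headers \
    -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName,ATTACHED:.status.attached' |
    awk -v n="$node" '$2 == n { print $1 " attached=" $3 }'
}
```

Running it against both the maintained node and the node where the Pod is now scheduled shows whether a volume is still recorded against the old location.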