Kubernetes Upgrades — K8s SRE Reference

TL;DR

Upgrade one minor version at a time, control plane first and then workers, and always take an etcd backup before starting. Managed services (EKS/AKS/GKE) run the control plane upgrade for you once you trigger it, but you still own node group upgrades.

Version Skew Rules

Kubernetes has strict version compatibility rules that govern what you can and cannot skip during an upgrade.

Component	Max skew from API server	Rule
kubelet	-3 minor versions	Never ahead of API server; at most 3 minor versions behind (2 if the API server is older than 1.28)
kube-proxy	-3 minor versions	Same as kubelet
kubectl	±1 minor version	kubectl can be one version ahead or behind the server
Upgrade path	One minor version at a time	1.27 → 1.28 → 1.29; never skip a minor version
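
The rules in the table can be turned into a quick pre-flight check. The following is a minimal sketch, not an official tool: the function names (`minor_of`, `kubelet_skew_ok`, `upgrade_step_ok`) and the `MAX_SKEW` variable are hypothetical, and the code assumes every version is in the 1.x series, so only the minor number is compared.

```shell
# Extract the minor number from a version such as "v1.30.2" or "1.30".
minor_of() {
  echo "$1" | sed -E 's/^v?[0-9]+\.([0-9]+).*$/\1/'
}

# Succeeds if a kubelet at version $2 may run against an API server at $1:
# never newer than the server, at most MAX_SKEW (default 3) minors older.
kubelet_skew_ok() {
  local server kubelet
  server=$(minor_of "$1")
  kubelet=$(minor_of "$2")
  [ "$kubelet" -le "$server" ] && [ $((server - kubelet)) -le "${MAX_SKEW:-3}" ]
}

# Succeeds if going from $1 to $2 is a supported step:
# same minor (patch-level change) or exactly one minor ahead.
upgrade_step_ok() {
  local from to
  from=$(minor_of "$1")
  to=$(minor_of "$2")
  [ "$to" -eq "$from" ] || [ "$to" -eq $((from + 1)) ]
}
```

For example, `upgrade_step_ok 1.27 1.29 || echo "unsupported: go through 1.28 first"` prints the warning, while `kubelet_skew_ok v1.30.0 v1.27.4` succeeds. Set `MAX_SKEW=2` when the API server is older than 1.28.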

Pre-upgrade Checklist

Run through this before any production upgrade; skipping steps is how upgrades cause unplanned downtime.

pre-upgrade.sh (bash)

# 1. Confirm current version
kubectl version   # --short was removed in kubectl 1.28; the short output is now the default
kubectl get nodes -o wide | awk '{print $1, $5}'

# 2. Read the changelog for your target version
# https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.XX.md

# 3. Check for deprecated APIs in use (before upgrading)
# The API server records requests made to deprecated API versions in this metric:
kubectl get --raw /metrics | grep '^apiserver_requested_deprecated_apis'
# pluto scans live objects and Helm releases (github.com/FairwindsOps/pluto)
pluto detect-all-in-cluster

# 4. Backup etcd (run on a control-plane node; the target directory must already exist)
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd/pre-upgrade-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

# 5. Check all nodes are Ready and no critical pods are degraded
kubectl get nodes
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# 6. Check PodDisruptionBudgets — ensure PDBs allow at least one eviction
kubectl get pdb -A
kubectl get pdb -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}: allowed={.status.disruptionsAllowed}{"\n"}{end}'
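
The output of the jsonpath query in step 6 is easy to misread on a large cluster, so it can help to filter it down to the PDBs that would actually block a drain. This is a minimal sketch under the assumption that the input lines look exactly like `namespace/name: allowed=N`; the function name `blocked_pdbs` is hypothetical.

```shell
# Reads "namespace/name: allowed=N" lines on stdin and prints the PDBs that
# currently allow zero evictions. Exit status is non-zero if any are found.
blocked_pdbs() {
  awk -F'allowed=' '
    NF == 2 && $2 == "0" { sub(/: allowed=.*/, ""); print; found = 1 }
    END { exit found }
  '
}
```

Pipe the step 6 command into it, for example `kubectl get pdb -A -o jsonpath='...' | blocked_pdbs || echo "fix these PDBs before draining"`.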

kubeadm Upgrade (Self-managed)

kubeadm handles the control plane components; upgrade one control-plane node at a time, then drain and upgrade each worker node.

kubeadm-upgrade.sh (bash)

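TARGET_VERSION=1.30.0-1.1   # replace with your target; list candidates with: apt-cache madison kubeadm
# Packages from pkgs.k8s.io use a -1.1 style revision (the retired apt.kubernetes.io repo used -00).
# pkgs.k8s.io has one repository per minor version: point
# /etc/apt/sources.list.d/kubernetes.list at the target minor before running apt-get update.

# ─── On the first control-plane node ───
# Unhold and upgrade kubeadm
sudo apt-mark unhold kubeadm
sudo apt-get update
sudo apt-get install -y kubeadm="$TARGET_VERSION"
sudo apt-mark hold kubeadm

# Plan and review changes
sudo kubeadm upgrade plan

# Apply the upgrade
sudo kubeadm upgrade apply v1.30.0  # target version without the package revision suffix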

# Upgrade kubelet and kubectl on THIS node (drain it first if it runs workloads, uncordon afterwards)
sudo apt-mark unhold kubelet kubectl
sudo apt-get install -y kubelet=$TARGET_VERSION kubectl=$TARGET_VERSION
sudo apt-mark hold kubelet kubectl
sudo systemctl daemon-reload
sudo systemctl restart kubelet

# ─── On additional control-plane nodes ───
# Upgrade the kubeadm package as above first, then:
sudo kubeadm upgrade node
# Then upgrade kubelet and kubectl as above

# ─── On each worker node ───
# (run from a machine with kubectl access to the cluster)
NODE=worker-1
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --grace-period=60

# SSH to the worker node, then (set TARGET_VERSION in this shell as well):
sudo apt-mark unhold kubeadm kubelet kubectl
sudo apt-get update
sudo apt-get install -y kubeadm=$TARGET_VERSION kubelet=$TARGET_VERSION kubectl=$TARGET_VERSION
sudo apt-mark hold kubeadm kubelet kubectl
sudo kubeadm upgrade node
sudo systemctl daemon-reload
sudo systemctl restart kubelet

# Back on the admin machine:
kubectl uncordon "$NODE"
kubectl get nodes   # confirm node shows new version
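
The script above uses two forms of the same version: the apt package version (with a revision suffix) and the `vX.Y.Z` form passed to `kubeadm upgrade apply`. A small helper can derive one from the other so they cannot drift apart. This is a sketch only; the function names `pkg_to_k8s_version` and `pkg_repo_minor` are hypothetical, and the second one assumes you use the per-minor pkgs.k8s.io repositories.

```shell
# Convert a Debian package version such as "1.30.0-1.1" (or legacy "1.30.0-00")
# into the form kubeadm expects, e.g. "v1.30.0".
pkg_to_k8s_version() {
  echo "v${1%%-*}"
}

# Derive the minor-version label, e.g. "v1.30", which is the path component
# that selects the matching pkgs.k8s.io repository.
pkg_repo_minor() {
  echo "v$(echo "${1%%-*}" | cut -d. -f1,2)"
}
```

With these defined, the apply step can be written as `sudo kubeadm upgrade apply "$(pkg_to_k8s_version "$TARGET_VERSION")"`.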

EKS Upgrades (Managed)

EKS upgrades the control plane and worker node groups separately; always upgrade the control plane first, then the managed add-ons, then each managed node group.

eks-upgrade.sh (bash)

CLUSTER=my-cluster
REGION=us-east-1
export AWS_DEFAULT_REGION="$REGION"   # commands below that omit --region use this
TARGET=1.30

# 1. Upgrade control plane (AWS-managed; takes ~10-15 min)
aws eks update-cluster-version \
  --name "$CLUSTER" --region "$REGION" \
  --kubernetes-version "$TARGET"

# Wait for completion
aws eks wait cluster-active --name "$CLUSTER" --region "$REGION"
aws eks describe-cluster --name "$CLUSTER" --query "cluster.version" --output text

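# 2. Upgrade managed add-ons (vpc-cni, coredns, kube-proxy)
# Assumes the first listed version is the newest. OVERWRITE resets customised
# add-on settings to the defaults; use PRESERVE to keep them.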
for ADDON in vpc-cni coredns kube-proxy; do
  LATEST=$(aws eks describe-addon-versions \
    --addon-name "$ADDON" --kubernetes-version "$TARGET" \
    --query "addons[0].addonVersions[0].addonVersion" --output text)
  aws eks update-addon --cluster-name "$CLUSTER" --addon-name "$ADDON" \
    --addon-version "$LATEST" --resolve-conflicts OVERWRITE
done

# 3. Upgrade managed node groups (replace AMI via rolling update)
aws eks update-nodegroup-version \
  --cluster-name "$CLUSTER" \
  --nodegroup-name workers \
  --kubernetes-version "$TARGET"

# Monitor node group status
aws eks describe-nodegroup --cluster-name "$CLUSTER" \
  --nodegroup-name workers --query "nodegroup.status"
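
Step 3 above upgrades a single node group named `workers`. Clusters often have several, so the same two calls can be wrapped in a loop. The sketch below is one possible shape, not an AWS-provided workflow: the function name `upgrade_all_nodegroups` is hypothetical, and it assumes the AWS CLI's default region points at the cluster and that upgrading groups strictly one after another is acceptable.

```shell
# Upgrade every managed node group in a cluster, one at a time.
# Usage: upgrade_all_nodegroups <cluster> <target-version>
upgrade_all_nodegroups() {
  local cluster="$1" target="$2" ng
  for ng in $(aws eks list-nodegroups --cluster-name "$cluster" \
                --query 'nodegroups[]' --output text); do
    echo "Upgrading node group: $ng"
    aws eks update-nodegroup-version --cluster-name "$cluster" \
      --nodegroup-name "$ng" --kubernetes-version "$target" || return 1
    # Block until the rolling update finishes before starting the next group
    aws eks wait nodegroup-active --cluster-name "$cluster" \
      --nodegroup-name "$ng" || return 1
  done
}
```

Call it as `upgrade_all_nodegroups "$CLUSTER" "$TARGET"`. The waiter gives up after a fixed number of polls, so a very large node group may need the wait repeated.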

AKS & GKE Upgrades (Summary)

aks-gke-upgrade.sh (bash)

# ─── AKS ───
# Upgrade control plane
az aks upgrade --resource-group my-rg --name my-cluster --kubernetes-version 1.30 --control-plane-only
# Upgrade a node pool
az aks nodepool upgrade --resource-group my-rg --cluster-name my-cluster \
  --name nodepool1 --kubernetes-version 1.30 --no-wait

# ─── GKE ───
# Upgrade control plane (can set to auto-upgrade in channel)
gcloud container clusters upgrade my-cluster --master --cluster-version 1.30 --zone us-central1-a
# Upgrade node pool
gcloud container clusters upgrade my-cluster \
  --node-pool default-pool --cluster-version 1.30 --zone us-central1-a

Post-upgrade Validation

Run these checks immediately after each upgrade phase (control plane, then each node group) before proceeding.

post-upgrade-checks.sh (bash)

kubectl get nodes -o wide       # all nodes Ready, correct version
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded  # no stuck pods
kubectl get --raw='/readyz?verbose'   # API server readiness checks (componentstatus is deprecated since 1.19)

# System pod health
kubectl get pods -n kube-system

# Check coredns and kube-proxy
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl get ds -n kube-system kube-proxy

# Validate DNS resolution
kubectl run dns-test --image=busybox:1.35 --restart=Never --rm -it -- \
  nslookup kubernetes.default.svc.cluster.local

# Check all Deployments are at desired replicas (the READY column is "ready/desired")
kubectl get deployments -A --no-headers | awk '{ split($3, r, "/"); if (r[1] != r[2]) print }'
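
Reading the node list by eye works for a handful of nodes; for larger clusters the "all nodes Ready, correct version" check can be scripted. This is a minimal sketch that assumes the default column order of `kubectl get nodes --no-headers` (NAME STATUS ROLES AGE VERSION); the function name `nodes_not_upgraded` is hypothetical.

```shell
# Reads `kubectl get nodes --no-headers` output on stdin and prints every node
# whose status is not exactly "Ready" or whose version does not start with the
# expected prefix. Pass the prefix with a trailing dot (e.g. "v1.30.") so that
# "v1.3" cannot match "v1.30". Exit status is non-zero if any node is printed.
nodes_not_upgraded() {
  awk -v want="$1" '
    $2 != "Ready" || index($5, want) != 1 { print $1, $2, $5; bad = 1 }
    END { exit bad }
  '
}
```

For example, `kubectl get nodes --no-headers | nodes_not_upgraded v1.30. && echo "all nodes upgraded"`. A node that is still cordoned shows `Ready,SchedulingDisabled` and is reported, which is a useful reminder to uncordon it.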

Gotchas

Never skip minor versions. 1.27 → 1.29 is unsupported and may leave etcd data or components in an inconsistent state.
Deprecated APIs: check for removed APIs with pluto before upgrading; for example, batch/v1beta1 CronJob was removed in 1.25.
PodDisruptionBudgets: PDBs that block all evictions will cause kubectl drain to hang. Check .status.disruptionsAllowed first.
Add-on compatibility: CNI plugins, CSI drivers, Ingress controllers, and cert-manager all publish Kubernetes version compatibility matrices, so upgrade them as well.