Backup & Disaster Recovery
Back up etcd for cluster state, use Velero for namespaced workloads and PVC data, and schedule regular restore drills. An untested backup is not a backup. Always verify the backup and the restore procedure.
Backup Strategy
| What to protect | Tool | RPO target | Notes |
|---|---|---|---|
| Cluster state (all resources) | etcd snapshot | < 1 hour | Covers Deployments, Services, Secrets, CRDs, RBAC |
| Namespaced workloads | Velero backup | < 1 hour | Also backs up PVCs if configured |
| Persistent volumes | Velero + CSI snapshots | RPO depends on schedule | Use cloud-native snapshots for databases |
| Helm releases | Git (values + chart) | Git commit | Store Helm values in version control |
| Secrets | External Secrets / SOPS | Git commit | Never back up raw K8s Secret objects to S3 |
etcd Backup (Self-managed Clusters)
etcd stores all Kubernetes cluster state; a snapshot captures the full current state and is the single most important backup artifact for self-managed (kubeadm) clusters.
SNAP=/var/backups/etcd/snapshot-$(date +%Y%m%d-%H%M%S).db
# Take snapshot using etcdctl (paths are typical for a kubeadm cluster)
ETCDCTL_API=3 etcdctl snapshot save "$SNAP" \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key
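# Verify the snapshot is healthy before uploading
# (on etcd 3.5+, etcdutl snapshot status is the preferred form; the etcdctl subcommand is deprecated)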
ETCDCTL_API=3 etcdctl snapshot status "$SNAP" --write-out=table
# Upload to S3 (the snapshot contains every Secret in the cluster; use an encrypted, access-restricted bucket)
aws s3 cp "$SNAP" s3://my-etcd-backups/cluster-name/
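The three steps above can be chained so that a failed snapshot or a failed integrity check never reaches the bucket. The following is a minimal sketch rather than a hardened implementation: backup_etcd, BACKUP_DIR, S3_URI, and PKI are illustrative names, and the paths and bucket are the same kubeadm-style placeholders used above.

```shell
#!/usr/bin/env bash
# Sketch of a cron-friendly wrapper around the snapshot, verify, and upload steps.
# BACKUP_DIR, S3_URI, PKI and backup_etcd are illustrative names (assumptions).
BACKUP_DIR="${BACKUP_DIR:-/var/backups/etcd}"
S3_URI="${S3_URI:-s3://my-etcd-backups/cluster-name/}"
PKI="${PKI:-/etc/kubernetes/pki/etcd}"

backup_etcd() {
  local snap
  snap="$BACKUP_DIR/snapshot-$(date +%Y%m%d-%H%M%S).db"
  mkdir -p "$BACKUP_DIR" || return 1

  # Step 1: take the snapshot; stop here if it fails.
  ETCDCTL_API=3 etcdctl snapshot save "$snap" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert="$PKI/ca.crt" \
    --cert="$PKI/healthcheck-client.crt" \
    --key="$PKI/healthcheck-client.key" >&2 || return 1

  # Step 2: integrity check; a failure here means nothing is uploaded.
  ETCDCTL_API=3 etcdctl snapshot status "$snap" --write-out=table >&2 || return 1

  # Step 3: upload only a snapshot that passed the check.
  aws s3 cp "$snap" "$S3_URI" >&2 || return 1

  # Print the snapshot path on stdout so callers can log or prune it.
  printf '%s\n' "$snap"
}
```

When the function is called from cron or a Kubernetes CronJob, a non-zero exit status signals that no verified snapshot was uploaded, which makes the failure visible to alerting.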
# Automate via a CronJob pinned to the control plane (or a DaemonSet on control-plane nodes)
# Typical schedule: every 30 min to 1 hour for production clusters
etcd Restore
Restore is a destructive operation that replaces all current cluster state; only use on a completely failed cluster or in a dedicated DR environment.
# DESTRUCTIVE: only run on a failed cluster or DR environment
# Stop kube-apiserver, etcd, and the other control-plane static pods first:
mkdir -p /tmp/k8s-manifests
mv /etc/kubernetes/manifests/*.yaml /tmp/k8s-manifests/
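# Restore snapshot to a new data directory
# ("master" is a placeholder: --name and the peer URLs must match this member's values in etcd.yaml;
# on etcd 3.5+, etcdutl snapshot restore is the preferred form)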
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
--data-dir=/var/lib/etcd-restore \
--initial-cluster="master=https://127.0.0.1:2380" \
--initial-advertise-peer-urls="https://127.0.0.1:2380" \
--name=master
# Point etcd to the new data dir
# Edit /tmp/k8s-manifests/etcd.yaml (the manifests are still in the temporary directory at this point):
# - in the volume named etcd-data, set hostPath.path: /var/lib/etcd-restore (was /var/lib/etcd)
# Restore kube-apiserver and other manifests
mv /tmp/k8s-manifests/*.yaml /etc/kubernetes/manifests/
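After the manifests are moved back, the kubelet needs some time to restart the static pods, so the first verification commands may fail even when the restore succeeded. A small retry helper can separate "still starting" from "broken". This is a sketch: retry_until is a hypothetical helper name, and the attempt count and delay in the example are arbitrary.

```shell
#!/usr/bin/env bash
# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (commented out so the helper can be sourced without a cluster):
# wait up to about 5 minutes for the API server readiness endpoint.
#   retry_until 30 10 kubectl get --raw=/readyz
```

If the helper returns non-zero, the static pod logs on the control-plane node are the next place to look.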
# Verify cluster came back
kubectl get nodes
kubectl get pods -A
Velero: Workload Backup
Velero backs up Kubernetes resources and optionally PVC data using CSI snapshots or restic/kopia; it is the standard tool for namespace-level backup and cross-cluster migration.
# Install Velero (AWS example)
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.9.0 \
--bucket my-velero-backups \
--backup-location-config region=us-east-1 \
--snapshot-location-config region=us-east-1 \
--secret-file ./credentials-velero
# On-demand backup of a namespace
velero backup create ns-backup-$(date +%Y%m%d) \
--include-namespaces my-app \
--include-cluster-resources=false \
--wait
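# Backup with PVC data (volume snapshots; CSI snapshots additionally require Velero's EnableCSI feature)
# aws-us-east-1 must be an existing VolumeSnapshotLocation (velero install creates one named "default")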
velero backup create full-backup \
--include-namespaces my-app \
--snapshot-volumes \
--volume-snapshot-locations aws-us-east-1
# Scheduled backup (daily, retain 7 days)
velero schedule create daily-backup \
--schedule="0 2 * * *" \
--include-namespaces my-app \
--ttl 168h
# List and inspect backups
velero backup get
velero backup describe ns-backup-20260523 --details
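For unattended runs, the phase recorded on the Backup resource can be checked from a script instead of reading the describe output by eye. The sketch below assumes Velero is installed in the velero namespace and that kubectl has access to it; backup_completed is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# backup_completed NAME [NAMESPACE]
# Hypothetical helper: succeeds only if the Velero Backup resource reports phase "Completed".
# Assumes Velero's default install namespace, "velero", unless NAMESPACE is given.
backup_completed() {
  local name="$1" ns="${2:-velero}" phase
  # Read .status.phase from the Backup custom resource.
  phase=$(kubectl -n "$ns" get backups.velero.io "$name" \
    -o jsonpath='{.status.phase}') || return 1
  printf 'backup %s phase: %s\n' "$name" "${phase:-unknown}" >&2
  # Phases such as PartiallyFailed, Failed, or InProgress count as not completed.
  [ "$phase" = "Completed" ]
}

# Example (commented out so the helper can be sourced without a cluster):
#   backup_completed ns-backup-20260523 || echo "backup needs attention" >&2
```

Treating PartiallyFailed as a failure is a deliberate choice here: a backup that skipped some resources or volumes should be inspected before anyone relies on it.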
# Restore to same or different namespace
velero restore create --from-backup ns-backup-20260523
velero restore create --from-backup ns-backup-20260523 \
--namespace-mappings my-app:my-app-restore
velero restore get
velero restore describe <restore-name> --details
PVC Data Backup Patterns
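For databases, prefer application-level dumps (mysqldump, pg_dump) scheduled as Kubernetes CronJobs rather than relying solely on volume snapshots: a dump is logically consistent and portable across storage backends and clusters, whereas a snapshot of a live volume may capture the database mid-write.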
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: data
spec:
  schedule: "0 1 * * *"        # 01:00 UTC daily
  concurrencyPolicy: Forbid    # don't start a new job if previous is still running
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pg-backup
              image: postgres:16
              command:
                - /bin/bash
                - -c
                - |
                  # pipefail makes the job fail when pg_dump fails, not only when gzip fails
                  set -euo pipefail
                  BACKUP_FILE="pg-backup-$(date +%Y%m%d-%H%M%S).sql.gz"
                  pg_dump -h postgres-svc -U "$PGUSER" "$PGDATABASE" | gzip \
                    > "/backup/$BACKUP_FILE"
                  echo "Backup complete: $BACKUP_FILE"
              env:
                - name: PGUSER
                  valueFrom:
                    secretKeyRef: {name: postgres-secret, key: username}
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef: {name: postgres-secret, key: password}
                - name: PGDATABASE
                  value: mydb
              volumeMounts:
                - name: backup-pvc
                  mountPath: /backup
          volumes:
            - name: backup-pvc
              persistentVolumeClaim:
                claimName: postgres-backup-pvc
DR Drill Checklist
Run this checklist quarterly; an untested restore procedure will fail at the worst possible moment.
- Verify etcd snapshot exists and is within RPO window.
- Restore etcd snapshot to a staging cluster and confirm kubectl get nodes and kubectl get pods -A work.
- Run velero backup describe --details and confirm PVC snapshots are present.
- Restore a Velero backup to a test namespace and verify the application starts and data is intact.
- Restore a database dump to a test DB and validate row counts / schema.
- Document RTO (time to restore) measured during the drill.
- Update runbook with any discovered gaps.
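The first checklist item lends itself to automation. A minimal sketch, assuming snapshots are kept as .db files in a local directory such as /var/backups/etcd and that a find with -mmin support (GNU or BSD) is available; newest_snapshot and within_rpo are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking that the latest etcd snapshot is inside the RPO window.

# newest_snapshot DIR: print the most recently modified *.db file in DIR (empty line if none).
newest_snapshot() {
  local dir="$1" newest="" f
  for f in "$dir"/*.db; do
    [ -e "$f" ] || continue          # the glob did not match anything
    if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
      newest="$f"
    fi
  done
  printf '%s\n' "$newest"
}

# within_rpo FILE MAX_AGE_MINUTES: succeed if FILE was modified less than MAX_AGE_MINUTES ago.
within_rpo() {
  local file="$1" max_age_min="$2"
  [ -n "$(find "$file" -maxdepth 0 -mmin "-$max_age_min" -print 2>/dev/null)" ]
}

# Example (commented out; 60 minutes matches the "< 1 hour" RPO target):
#   snap=$(newest_snapshot /var/backups/etcd)
#   if [ -z "$snap" ] || ! within_rpo "$snap" 60; then
#     echo "etcd snapshot missing or older than RPO" >&2
#   fi
```

The same check could be pointed at the S3 bucket instead of local disk, but that depends on how the bucket is listed and is left out of this sketch.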