Backup & Disaster Recovery — K8s SRE Reference

TL;DR

Back up etcd for cluster state, use Velero for namespaced workloads and PVC data, and schedule regular restore drills. An untested backup is not a backup: verify both the backup artifact and the restore procedure.

Backup Strategy

What to protect	Tool	RPO target	Notes
Cluster state (all resources)	etcd snapshot	< 1 hour	Covers Deployments, Services, Secrets, CRDs, RBAC
Namespaced workloads	Velero backup	< 1 hour	Also backs up PVC data if volume snapshots or file-system backup are configured
Persistent volumes	Velero + CSI snapshots	RPO depends on schedule	Use cloud-native snapshots for databases
Helm releases	Git (values + chart)	Git commit	Store Helm values in version control
Secrets	External Secrets / SOPS	Git commit	Never copy raw K8s Secret objects to S3 unencrypted; etcd snapshots contain Secrets too, so encrypt and restrict that bucket

etcd Backup (Self-managed Clusters)

etcd stores all Kubernetes cluster state; a snapshot captures the full current state and is the single most important backup artifact for self-managed (kubeadm) clusters.

bash: etcd-backup.sh

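set -euo pipefail   # abort on the first failed step so a bad snapshot is never uploaded

SNAP_DIR=/var/backups/etcd
SNAP="$SNAP_DIR/snapshot-$(date +%Y%m%d-%H%M%S).db"
mkdir -p "$SNAP_DIR"   # make sure the target directory exists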

# Take snapshot using etcdctl (paths are typical for a kubeadm cluster)
ETCDCTL_API=3 etcdctl snapshot save "$SNAP" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

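# Sanity-check the snapshot (hash, revision, key count, size) before uploading
# On etcd 3.5+ the preferred form is: etcdutl snapshot status "$SNAP" --write-out=table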
ETCDCTL_API=3 etcdctl snapshot status "$SNAP" --write-out=table

# Upload to S3
aws s3 cp "$SNAP" s3://my-etcd-backups/cluster-name/

# Automate via a CronJob pinned to control-plane nodes (nodeSelector + toleration,
# hostNetwork, hostPath mounts for the certs and backup dir) or a systemd timer on the host
# Typical schedule: every 30 min to 1 hour for production clusters
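
A snapshot every 30 to 60 minutes fills the local backup directory quickly, so some form of retention is usually needed alongside the backup script. The following is a minimal sketch, not part of the original script: `prune_snapshots` is a hypothetical helper name, and it assumes snapshots keep the `snapshot-YYYYMMDD-HHMMSS.db` naming used above, so that sorting by name is the same as sorting by age.

```bash
#!/usr/bin/env bash
set -euo pipefail

# prune_snapshots DIR KEEP
# Delete all but the KEEP newest snapshot-*.db files in DIR.
# Assumption: file names end in YYYYMMDD-HHMMSS, so name order == age order.
prune_snapshots() {
  local dir=$1 keep=$2
  local -a snaps=()

  shopt -s nullglob                 # an empty directory yields an empty array
  snaps=("$dir"/snapshot-*.db)      # glob results are sorted, oldest name first
  shopt -u nullglob

  local excess=$(( ${#snaps[@]} - keep ))
  local i
  for (( i = 0; i < excess; i++ )); do
    rm -f -- "${snaps[i]}"          # remove the oldest files first
  done
}
```

Calling `prune_snapshots /var/backups/etcd 48` after a successful upload would keep roughly one day of local snapshots at a 30-minute schedule. Retention in the S3 bucket is better handled by a bucket lifecycle rule than by this script.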

etcd Restore

Restore is a destructive operation that replaces all current cluster state; only use on a completely failed cluster or in a dedicated DR environment.

bash: etcd-restore.sh

# DESTRUCTIVE — only run on a failed cluster or DR environment
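# Stop kube-apiserver and other control-plane static pods first
# (the kubelet stops a static pod once its manifest leaves the directory):
mkdir -p /tmp/k8s-manifests
mv /etc/kubernetes/manifests/*.yaml /tmp/k8s-manifests/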

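# Restore snapshot to a new data directory
# --name and the peer URLs below are placeholders; they should match the values in
# etcd.yaml (kubeadm normally uses the node hostname and the node IP).
# On etcd 3.5+ the preferred binary for this step is etcdutl (same subcommand and flags).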
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --data-dir=/var/lib/etcd-restore \
  --initial-cluster="master=https://127.0.0.1:2380" \
  --initial-advertise-peer-urls="https://127.0.0.1:2380" \
  --name=master

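# Point etcd to the new data dir before re-enabling the manifests
# Edit /tmp/k8s-manifests/etcd.yaml (the manifests were moved there above):
#   - in the volume named etcd-data, set hostPath.path: /var/lib/etcd-restore  (was /var/lib/etcd)
#   - leave --data-dir and the container mountPath unchanged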

# Restore kube-apiserver and other manifests
mv /tmp/k8s-manifests/*.yaml /etc/kubernetes/manifests/

# Verify cluster came back
kubectl get nodes
kubectl get pods -A
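
Because the restore is destructive, it helps to check a couple of preconditions before stopping the control plane. The sketch below is an illustration only: `restore_preflight` is a hypothetical helper, and it conservatively treats any existing target path as an error, on the assumption that the restore should always go to a fresh data directory as in the script above.

```bash
#!/usr/bin/env bash
set -euo pipefail

# restore_preflight SNAPSHOT DATA_DIR
# Returns 0 only if SNAPSHOT is a non-empty file and DATA_DIR does not exist yet.
restore_preflight() {
  local snap=$1 data_dir=$2

  if [ ! -s "$snap" ]; then
    echo "snapshot missing or empty: $snap" >&2
    return 1
  fi
  if [ -e "$data_dir" ]; then
    # Conservative: never restore on top of an existing directory.
    echo "target data dir already exists: $data_dir" >&2
    return 1
  fi
  echo "preflight ok: $snap -> $data_dir"
}
```

A typical call would be `restore_preflight snapshot.db /var/lib/etcd-restore || exit 1` as the first line of the restore script. It does not replace `etcdctl snapshot status`, which checks the snapshot contents rather than just the file.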

Velero — Workload Backup

Velero backs up Kubernetes resources and, optionally, PVC data using CSI snapshots or file-system backup (Kopia, or the older restic uploader); it is a widely used tool for namespace-level backup and cross-cluster migration.

bash: velero.sh

# Install Velero (AWS example)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket my-velero-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --secret-file ./credentials-velero

# On-demand backup of a namespace
velero backup create ns-backup-$(date +%Y%m%d) \
  --include-namespaces my-app \
  --include-cluster-resources=false \
  --wait

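# Backup with PVC data via native volume snapshots (EBS in this AWS example)
# CSI VolumeSnapshots additionally require a CSI-enabled install (--features=EnableCSI)
# The snapshot location name must already exist: check with `velero snapshot-location get`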
velero backup create full-backup \
  --include-namespaces my-app \
  --snapshot-volumes \
  --volume-snapshot-locations aws-us-east-1

# Scheduled backup (daily, retain 7 days)
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --include-namespaces my-app \
  --ttl 168h

# List and inspect backups
velero backup get
velero backup describe ns-backup-20260523 --details

# Restore to same or different namespace
velero restore create --from-backup ns-backup-20260523
velero restore create --from-backup ns-backup-20260523 \
  --namespace-mappings my-app:my-app-restore

velero restore get
velero restore describe <restore-name> --details
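
The scheduled backup above expresses its 7-day retention as `--ttl 168h`, i.e. days multiplied by 24. When retention is configured in days elsewhere (for example in a values file), a small conversion avoids hand-computed TTLs. This is a sketch only; `retention_ttl` is a hypothetical helper name, not a Velero feature.

```bash
#!/usr/bin/env bash
set -euo pipefail

# retention_ttl DAYS
# Print the hour-based duration string for a retention period given in days.
retention_ttl() {
  local days=$1
  case $days in
    ''|*[!0-9]*)
      echo "days must be a non-negative integer: '$days'" >&2
      return 1
      ;;
  esac
  # 10# forces base 10 so that inputs such as "08" are not parsed as octal.
  echo "$(( 10#$days * 24 ))h"
}
```

It could be used as `velero schedule create daily-backup --schedule="0 2 * * *" --include-namespaces my-app --ttl "$(retention_ttl 7)"`, which yields the same `168h` as the example above.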

PVC Data Backup Patterns

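For databases, prefer application-level dumps (pg_dump, or mysqldump with --single-transaction on InnoDB) scheduled as Kubernetes CronJobs rather than relying solely on volume snapshots: a dump is transactionally consistent and portable across storage backends, whereas a snapshot of a running database is at best crash-consistent.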

yaml: postgres-backup-cronjob.yaml

apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: data
spec:
  schedule: "0 1 * * *"          # 01:00 UTC daily
  concurrencyPolicy: Forbid       # don't start a new job if previous is still running
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: pg-backup
            image: postgres:16
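            command:
            - /bin/bash                   # bash rather than sh, so pipefail is available
            - -c
            - |
              # Without pipefail a failed pg_dump is masked by gzip's exit code 0
              set -euo pipefail
              BACKUP_FILE="pg-backup-$(date +%Y%m%d-%H%M%S).sql.gz"
              pg_dump -h postgres-svc -U "$PGUSER" "$PGDATABASE" | gzip \
                > "/backup/$BACKUP_FILE"
              echo "Backup complete: $BACKUP_FILE"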
            env:
            - name: PGUSER
              valueFrom:
                secretKeyRef: {name: postgres-secret, key: username}
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef: {name: postgres-secret, key: password}
            - name: PGDATABASE
              value: mydb
            volumeMounts:
            - name: backup-pvc
              mountPath: /backup
          volumes:
          - name: backup-pvc
            persistentVolumeClaim:
              claimName: postgres-backup-pvc

DR Drill Checklist

Run this checklist quarterly; an untested restore procedure will fail at the worst possible moment.

✓ Verify that an etcd snapshot exists and is within the RPO window.
✓ Restore the etcd snapshot to a staging cluster and confirm that `kubectl get nodes` and `kubectl get pods -A` work.
✓ Run `velero backup describe <backup-name> --details` and confirm PVC snapshots are present.
✓ Restore a Velero backup to a test namespace and verify the application starts and data is intact.
✓ Restore a database dump to a test DB and validate row counts / schema.
✓ Document the RTO (time to restore) measured during the drill.
✓ Update the runbook with any discovered gaps.
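
The first checklist item, confirming that a snapshot exists within the RPO window, is easy to automate between drills. The following is a minimal sketch under stated assumptions: `snapshot_within_rpo` is a hypothetical helper, it looks at local files named `snapshot-*.db` as produced by the backup script in this reference, and it uses file modification time as a stand-in for snapshot time.

```bash
#!/usr/bin/env bash
set -euo pipefail

# snapshot_within_rpo DIR MAX_AGE_MINUTES
# Returns 0 if DIR contains at least one snapshot-*.db file modified
# less than MAX_AGE_MINUTES minutes ago, non-zero otherwise.
snapshot_within_rpo() {
  local dir=$1 max_age=$2 hits

  # Capture find's output instead of piping into grep -q, which could
  # turn an early grep exit into a pipeline failure under pipefail.
  hits=$(find "$dir" -maxdepth 1 -name 'snapshot-*.db' -mmin "-$max_age" 2>/dev/null) || return 1
  [ -n "$hits" ]
}
```

With the RPO target of under one hour from the strategy table, a monitoring job could run `snapshot_within_rpo /var/backups/etcd 60 || echo "RPO violated"` and alert on the message. The same idea applies to the S3 copy, but that requires listing the bucket rather than a local directory.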