Storage Troubleshooting — K8s SRE Reference

TL;DR

Start with read-only evidence: PVC status, Pod events, PV details, StorageClass, VolumeAttachment, and CSI controller/node logs. Do not delete PVCs, PVs, VolumeAttachments, or cloud disks until you know reclaim policy, backup status, owner, and data criticality.

First Five Minutes

Storage incidents are easy to make worse. Capture the state before trying fixes. Your goal is to classify the failure: provisioning, scheduling/topology, attach, mount, permissions, expansion, backend health, or data lifecycle.

storage-first-look.sh (bash)

# Replace the placeholder values before running.
NS="<namespace>"
POD="<pod>"
PVC="<pvc>"

kubectl get pod "$POD" -n "$NS" -o wide
kubectl describe pod "$POD" -n "$NS"
kubectl get pvc "$PVC" -n "$NS" -o wide
kubectl describe pvc "$PVC" -n "$NS"
kubectl get pv
kubectl get storageclass
kubectl get events -n "$NS" --sort-by=.lastTimestamp | tail -n 80
kubectl get volumeattachment -o wide
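
Because the goal is to capture state before trying fixes, the same read-only commands can be wrapped so their output is written to disk. The following is a minimal sketch under stated assumptions, not a definitive tool: the function name `capture_storage_state`, the `OUT_ROOT` directory variable, and the `KUBECTL` override are hypothetical names introduced here for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: capture_storage_state, OUT_ROOT, and KUBECTL are hypothetical
# names introduced for this example; they are not kubectl features.

KUBECTL="${KUBECTL:-kubectl}"

capture_storage_state() {
  local ns="$1" pod="$2" pvc="$3"
  local out
  out="${OUT_ROOT:-.}/storage-evidence-$(date -u +%Y%m%dT%H%M%SZ)"
  mkdir -p "$out" || return 1

  # Read-only commands; a failing command leaves its error text in the file
  # instead of aborting the capture.
  "$KUBECTL" get pod "$pod" -n "$ns" -o yaml > "$out/pod.yaml" 2>&1 || true
  "$KUBECTL" describe pod "$pod" -n "$ns" > "$out/pod-describe.txt" 2>&1 || true
  "$KUBECTL" get pvc "$pvc" -n "$ns" -o yaml > "$out/pvc.yaml" 2>&1 || true
  "$KUBECTL" get pv -o yaml > "$out/pv.yaml" 2>&1 || true
  "$KUBECTL" get storageclass -o yaml > "$out/storageclass.yaml" 2>&1 || true
  "$KUBECTL" get volumeattachment -o yaml > "$out/volumeattachment.yaml" 2>&1 || true
  "$KUBECTL" get events -n "$ns" --sort-by=.lastTimestamp > "$out/events.txt" 2>&1 || true

  # Print the directory so the caller can attach it to the incident record.
  echo "$out"
}

# Example call (placeholder values are illustrative):
#   capture_storage_state my-namespace my-pod my-pvc
```

Pod manifests can contain environment values, so treat the captured directory as sensitive.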

Figure: storage triage path from claim to backend.

PVC Pending

A Pending PVC means Kubernetes has not bound the claim to a PV. This can be normal with WaitForFirstConsumer until a Pod is scheduled, or it can indicate a broken provisioner, missing StorageClass, quota issue, or topology constraint.

pvc-pending.sh (bash)

# Replace the placeholder values before running.
NS="<namespace>"
PVC="<pvc>"

kubectl describe pvc "$PVC" -n "$NS"
SC=$(kubectl get pvc "$PVC" -n "$NS" -o jsonpath='{.spec.storageClassName}')
kubectl get storageclass "$SC" -o yaml
kubectl get resourcequota -n "$NS"
kubectl get events -n "$NS" --field-selector involvedObject.kind=PersistentVolumeClaim --sort-by=.lastTimestamp

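# CSI controller/provisioner names and namespaces vary by driver; search first.
kubectl get pods -A | grep -Ei 'csi|provisioner|storage'
# kube-system is only an example namespace; use the one found above.
kubectl logs -n kube-system "deploy/<csi-controller-deployment>" --all-containers --tail=100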
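
The distinction between a normal and an abnormal Pending state can be written down as a small decision helper. This is an illustrative sketch only: `pvc_pending_hint` is a hypothetical function, and its two inputs are assumed to be read by hand from the StorageClass YAML (`volumeBindingMode`) and from the Pod's node assignment.

```bash
#!/usr/bin/env bash
# Sketch only: pvc_pending_hint is a hypothetical helper, not a kubectl feature.
# $1: StorageClass volumeBindingMode, or "" if the StorageClass was not found.
# $2: "yes" if a Pod using the claim has been assigned a node, otherwise "no".
pvc_pending_hint() {
  local binding_mode="$1" pod_scheduled="$2"

  if [ -z "$binding_mode" ]; then
    echo "storageclass-missing: verify the class name on the PVC and that the class exists"
  elif [ "$binding_mode" = "WaitForFirstConsumer" ] && [ "$pod_scheduled" != "yes" ]; then
    echo "expected-wait: no Pod using the claim is scheduled yet; check why the Pod is unscheduled"
  else
    echo "investigate: check PVC events, resource quota, and CSI provisioner logs"
  fi
}

# Example call (values are illustrative):
#   pvc_pending_hint WaitForFirstConsumer no
```

The helper only encodes the reasoning in this section; it does not replace reading the PVC events.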

ContainerCreating / FailedMount

If the PVC is Bound but the Pod is stuck in ContainerCreating, focus on attach and mount. Pod events usually indicate whether the failure is in attach, mount, the filesystem, permissions, or a timeout.

failed-mount.sh (bash)

# Replace the placeholder values before running.
NS="<namespace>"
POD="<pod>"

kubectl describe pod "$POD" -n "$NS" | sed -n '/Events/,$p'
kubectl get pod "$POD" -n "$NS" -o wide
NODE=$(kubectl get pod "$POD" -n "$NS" -o jsonpath='{.spec.nodeName}')
kubectl describe node "$NODE" | grep -i -E 'attach|mount|volume|disk|pressure'

# CSI node plugin on the same node.
kubectl get pods -A -o wide | grep -Ei 'csi|storage' | grep "$NODE"
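
Sorting an event message into one of those categories can be sketched as a simple pattern match. The function `classify_mount_event` and the patterns below are assumptions chosen for illustration; real messages differ by driver and Kubernetes version, so treat an `unknown` result as a prompt to read the event in full.

```bash
#!/usr/bin/env bash
# Sketch only: classify_mount_event is a hypothetical helper. The patterns are
# illustrative and will not cover every driver's wording.
classify_mount_event() {
  local msg
  msg="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"

  # More specific causes are tested before the generic attach/mount prefixes.
  case "$msg" in
    *multi-attach*)                                          echo "multi-attach" ;;
    *"permission denied"*|*"access denied"*)                 echo "permission" ;;
    *"timed out"*|*timeout*|*"deadline exceeded"*|*deadlineexceeded*) echo "timeout" ;;
    *"wrong fs type"*|*"bad superblock"*|*fsck*)             echo "filesystem" ;;
    *attachvolume*|*failedattachvolume*)                     echo "attach" ;;
    *mountvolume*|*failedmount*)                             echo "mount" ;;
    *)                                                       echo "unknown" ;;
  esac
}

# Example call (the message text is illustrative):
#   classify_mount_event 'MountVolume.SetUp failed for volume "data": permission denied'
```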

Multi-Attach And Stale VolumeAttachment

RWO block volumes can usually attach to one node at a time. If a Pod moves to a new node while the old attachment is still present, you may see Multi-Attach errors.

volumeattachment-checks.sh (bash)

kubectl get volumeattachment -o wide
kubectl describe volumeattachment "<volumeattachment-name>"

kubectl get pod -A -o wide | grep "<pvc-or-app-name>"
kubectl get pvc -A -o wide
kubectl describe pv "<pv-name>" | grep -E 'Claim|StorageClass|Reclaim|VolumeHandle|Node Affinity'

Be careful: Do not delete VolumeAttachment objects or force-detach cloud disks unless the client runbook confirms the old node is gone or safe. Forced detach can corrupt filesystems for some workloads.
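
To find candidate stale attachments without changing anything, the VolumeAttachment list can be compared against the current node list. The sketch below is an assumption-laden illustration: `attachments_on_missing_nodes` is a hypothetical helper, and it only reports attachments whose node name is absent from the supplied list. A missing Node object does not prove the disk is detached at the backend, so the output is a list to investigate, not a list to delete.

```bash
#!/usr/bin/env bash
# Sketch only: attachments_on_missing_nodes is a hypothetical, read-only helper.
# $1:    file with one known node name per line.
# stdin: lines of "NAME PV NODE ATTACHED" (whitespace separated, no header).
# Prints NAME, PV, and NODE (tab separated) for attachments on unknown nodes.
attachments_on_missing_nodes() {
  local nodes_file="$1"
  awk -v nodes="$nodes_file" '
    BEGIN {
      # Load the known node names before reading any attachment lines.
      while ((getline line < nodes) > 0) {
        split(line, f, " ")
        known[f[1]] = 1
      }
    }
    NF >= 3 && !($3 in known) { print $1 "\t" $2 "\t" $3 }
  '
}

# Example usage (read-only):
#   kubectl get nodes --no-headers -o custom-columns=NAME:.metadata.name > nodes.txt
#   kubectl get volumeattachment --no-headers \
#     -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached \
#     | attachments_on_missing_nodes nodes.txt
```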

CSI Driver Health

CSI usually has controller Pods for provisioning/attach/resize/snapshot and node Pods for mount/unmount. Names vary by vendor, so search by csi, driver name, or platform docs.

csi-health.sh (bash)

kubectl get csidriver
kubectl get pods -A | grep -Ei 'csi|ebs|efs|gce|azure|ceph|longhorn|rook|nfs'
kubectl get daemonset,deploy -A | grep -Ei 'csi|storage|longhorn|rook'

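# Example patterns; adjust the namespace and workload names for the client's driver.
kubectl logs -n kube-system "deploy/<csi-controller>" --all-containers --tail=200
# Logs for a DaemonSet target come from a single Pod, which may not be on the affected node.
kubectl logs -n kube-system "daemonset/<csi-node>" --all-containers --tail=200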

Zone And Topology Problems

Zonal disks must attach to nodes in compatible zones. WaitForFirstConsumer helps avoid provisioning a disk in a zone where no scheduled Pod can run, but node selectors, affinity, taints, and capacity can still cause conflicts.

topology-storage.sh (bash)

kubectl get nodes -L topology.kubernetes.io/zone
kubectl describe pv "<pv-name>" | sed -n '/Node Affinity/,$p'
kubectl describe pod "<pod>" -n "<namespace>" | grep -i -E 'node-selector|affinity|taint|toleration|zone|volume'
kubectl get storageclass "<storageclass>" -o yaml | grep volumeBindingMode
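
Once the node's zone and the zones listed in the PV's Node Affinity are known, the compatibility check itself is a membership test. The following sketch assumes both values were read by hand from the commands above; `zone_compatible` is a hypothetical helper, and some CSI drivers use their own topology label key rather than `topology.kubernetes.io/zone`, so confirm which key the PV actually references.

```bash
#!/usr/bin/env bash
# Sketch only: zone_compatible is a hypothetical helper.
# $1:   zone of the node the Pod is (or would be) scheduled on.
# $2..: zones allowed by the PV's Node Affinity (none means no zone constraint).
# Returns 0 when the node zone is allowed, 1 on a mismatch.
zone_compatible() {
  local node_zone="$1" z
  shift

  if [ "$#" -eq 0 ]; then
    echo "no-zone-constraint"
    return 0
  fi
  for z in "$@"; do
    if [ "$z" = "$node_zone" ]; then
      echo "compatible"
      return 0
    fi
  done
  echo "mismatch"
  return 1
}

# Example call (zone names are illustrative):
#   zone_compatible us-east-1c us-east-1a us-east-1b
```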

Permission Denied Inside Mounted Volume

If the Pod starts but the app cannot write, the issue is often filesystem ownership, securityContext, NFS export permissions, root-squash behavior, SELinux/AppArmor, or read-only mounts.

volume-permissions.sh (bash)

kubectl exec -n "<namespace>" "<pod>" -- id
kubectl exec -n "<namespace>" "<pod>" -- mount | grep "<mount-path>"
kubectl exec -n "<namespace>" "<pod>" -- ls -la "<mount-path>"
kubectl get pod "<pod>" -n "<namespace>" -o yaml | grep -nE 'fsGroup|runAsUser|readOnly|volumeMounts|securityContext'
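
The UID/GID part of that diagnosis follows the standard Unix permission rule: owner bits apply if the process UID owns the directory, otherwise group bits if any of its GIDs match, otherwise the "other" bits. The sketch below encodes only that rule. `unix_write_allowed` is a hypothetical helper, its inputs are assumed to be copied from `id` and `ls -la` (or `stat`, if the image has it), and it deliberately ignores ACLs, SELinux/AppArmor, read-only mounts, and NFS root squash.

```bash
#!/usr/bin/env bash
# Sketch only: unix_write_allowed is a hypothetical helper that checks the
# classic owner/group/other write bits and nothing else.
# $1: process UID            $2: space-separated process GIDs
# $3: directory owner UID    $4: directory owner GID
# $5: octal mode, for example 755 or 2775
unix_write_allowed() {
  local uid="$1" gids="$2" owner_uid="$3" owner_gid="$4" mode="$5"
  local bits=$(( 8#$mode )) g

  if [ "$uid" -eq 0 ]; then
    # Root bypasses mode bits locally; NFS root squash can still deny access.
    echo "allowed:root"
    return 0
  fi
  if [ "$uid" -eq "$owner_uid" ]; then
    if (( bits & 8#200 )); then echo "allowed:owner"; return 0; fi
    echo "denied:owner"
    return 1
  fi
  for g in $gids; do
    if [ "$g" -eq "$owner_gid" ]; then
      if (( bits & 8#020 )); then echo "allowed:group"; return 0; fi
      echo "denied:group"
      return 1
    fi
  done
  if (( bits & 8#002 )); then echo "allowed:other"; return 0; fi
  echo "denied:other"
  return 1
}

# Example call (values are illustrative):
#   unix_write_allowed 1000 "1000 2000" 0 2000 2775
```

A `denied:other` result for a non-root user on a root-owned directory is the pattern that an `fsGroup` setting is commonly used to address, since it adds a supplemental group to the Pod's processes.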

Reclaim Policy And Data-Loss Investigation

If someone deleted a PVC or PV, immediately determine whether the backend storage was deleted, retained, snapshotted, or orphaned. Do not recreate objects blindly; you may accidentally bind to the wrong volume or overwrite data.

reclaim-investigation.sh (bash)

kubectl get pv -o custom-columns=NAME:.metadata.name,CLAIM:.spec.claimRef.name,RECLAIM:.spec.persistentVolumeReclaimPolicy,STATUS:.status.phase,STORAGECLASS:.spec.storageClassName
kubectl get pvc -A -o wide
kubectl get volumesnapshot -A 2>/dev/null
kubectl get events -A --sort-by=.lastTimestamp | grep -i -E 'deleted|persistentvolume|persistentvolumeclaim|snapshot'

# If the PV still exists, inspect the backend handle before touching anything.
kubectl describe pv "<pv-name>" | grep -E 'VolumeHandle|Reclaim Policy|Claim|StorageClass|Status'
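
To see at a glance which volumes would lose their backend disk if the PV or PVC were deleted, the custom-columns listing above can be filtered for the `Delete` reclaim policy. This is a small sketch, assuming the exact column order used in that command (NAME, CLAIM, RECLAIM, STATUS, STORAGECLASS) and a header row; `pvs_with_delete_policy` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Sketch only: pvs_with_delete_policy is a hypothetical, read-only filter.
# stdin: output of the custom-columns "kubectl get pv" command in this section,
#        including its header row (NAME CLAIM RECLAIM STATUS STORAGECLASS).
# Prints NAME, CLAIM, and STATUS (tab separated) for PVs whose policy is Delete.
pvs_with_delete_policy() {
  awk 'NR > 1 && $3 == "Delete" { print $1 "\t" $2 "\t" $4 }'
}

# Example usage (read-only):
#   kubectl get pv -o custom-columns=NAME:.metadata.name,CLAIM:.spec.claimRef.name,RECLAIM:.spec.persistentVolumeReclaimPolicy,STATUS:.status.phase,STORAGECLASS:.spec.storageClassName \
#     | pvs_with_delete_policy
```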

NFS / CIFS Failure Cues

Symptom	Likely Cause	Check
Mount timeout	Firewall, route, server down, DNS issue.	Node reachability to server and port.
Permission denied	Export ACL, root squash, UID/GID mismatch, CIFS credentials.	Export policy and Pod user/group.
Stale file handle	NFS server/export changed under mounted clients.	Server logs and remount plan.
Slow writes	Network latency, server saturation, sync settings.	Storage metrics and node network.

MySQL / StatefulSet Incident Example

For MySQL, treat storage symptoms as data-risk symptoms. First establish whether the PVC is bound, which PV/backend disk it points to, whether backups exist, and whether the Pod is failing before or after mounting the volume.

mysql-storage-incident.sh (bash)

kubectl get sts,pod,svc,pvc -n data -l app=mysql -o wide
kubectl describe pod mysql-0 -n data
kubectl logs mysql-0 -n data --previous --tail=100
kubectl describe pvc data-mysql-0 -n data
kubectl get pv
kubectl get cronjob,job -n data | grep -i backup

Safe Vs Dangerous Commands

Usually Safe Read-Only	Dangerous Without Runbook
`kubectl get/describe pod,pvc,pv,storageclass,volumeattachment`	`kubectl delete pvc`
`kubectl get events`	`kubectl delete pv`
`kubectl logs` for CSI components	Force-detach cloud disk
Inspect snapshots/backups	Patch PV claimRef without recovery plan
Check StorageClass reclaim policy	Recreate StatefulSet with different claim names

Symptom To Cause

Symptom	Likely Cause	Check First
PVC Pending	StorageClass missing, quota, CSI provisioner, WaitForFirstConsumer.	PVC events and StorageClass.
FailedAttachVolume	Cloud/backend attach issue, stale attachment, wrong zone.	VolumeAttachment and CSI controller logs.
FailedMount	CSI node issue, filesystem, credentials, NFS/CIFS permissions.	Pod events and CSI node logs.
Multi-Attach error	RWO volume attached to another node.	Old Pod/node and VolumeAttachment.
Permission denied	UID/GID, fsGroup, export policy, read-only mount.	Pod securityContext and filesystem ownership.
Expansion stuck	StorageClass/driver unsupported, filesystem resize pending.	PVC conditions and resizer logs.
Data missing	Wrong PVC, deleted backend, empty re-created volume, app initialized new data dir.	PV handle, audit trail, backups.

Before Deleting Anything

1. Identify owner: app team, platform team, Helm release, ArgoCD app, Terraform, or storage operator.
2. Check reclaim policy: Delete may remove the backend disk when PVC/PV is deleted.
3. Confirm backup: find latest successful backup or snapshot and verify restore path.
4. Capture handles: PV name, PVC name, VolumeHandle, node, zone, cloud disk ID.
5. Use client runbook: storage recovery often has platform-specific steps.
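
The checklist above can be enforced mechanically before any destructive step. The following is a minimal sketch rather than a prescribed workflow: `predelete_check` and the five variable names are hypothetical, and the function only verifies that each item has been filled in, not that the values are correct.

```bash
#!/usr/bin/env bash
# Sketch only: predelete_check and its variable names are hypothetical.
# Each variable should hold the evidence gathered for the matching checklist item.
# Returns 0 only when every item is non-empty.
predelete_check() {
  local missing=0 item
  for item in OWNER RECLAIM_POLICY BACKUP_VERIFIED VOLUME_HANDLE RUNBOOK; do
    # ${!item:-} reads the variable whose name is stored in $item.
    if [ -z "${!item:-}" ]; then
      echo "missing: $item"
      missing=1
    fi
  done
  if [ "$missing" -eq 0 ]; then
    echo "checklist complete"
  fi
  return "$missing"
}

# Example usage (values are illustrative):
#   OWNER="platform-team" RECLAIM_POLICY="Retain" BACKUP_VERIFIED="snapshot restored in test" \
#   VOLUME_HANDLE="<volume-handle>" RUNBOOK="<runbook-reference>" predelete_check
```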
