TL;DR

Start with read-only evidence: PVC status, Pod events, PV details, StorageClass, VolumeAttachment, and CSI controller/node logs. Do not delete PVCs, PVs, VolumeAttachments, or cloud disks until you know reclaim policy, backup status, owner, and data criticality.

First Five Minutes

Storage incidents are easy to make worse. Capture the state before trying fixes. Your goal is to classify the failure: provisioning, scheduling/topology, attach, mount, permissions, expansion, backend health, or data lifecycle.

storage-first-look.sh (bash)
NS="<namespace>"   # replace the placeholders before running
POD="<pod>"
PVC="<pvc>"

kubectl get pod "$POD" -n "$NS" -o wide
kubectl describe pod "$POD" -n "$NS"
kubectl get pvc "$PVC" -n "$NS" -o wide
kubectl describe pvc "$PVC" -n "$NS"
kubectl get pv
kubectl get storageclass
kubectl get events -n "$NS" --sort-by=.lastTimestamp | tail -n 80
kubectl get volumeattachment -o wide

Figure: Storage triage path from claim to backend: PVC → PV / StorageClass → CSI controller → CSI node → Pod events · kubelet · backend storage.

Provisioning problems start at PVC/StorageClass; attach/mount problems show in Pod events and CSI logs.

PVC Pending

A Pending PVC means Kubernetes has not bound the claim to a PV. This can be normal with WaitForFirstConsumer until a Pod is scheduled, or it can indicate a broken provisioner, missing StorageClass, quota issue, or topology constraint.

pvc-pending.sh (bash)
NS="<namespace>"   # replace the placeholders before running
PVC="<pvc>"

kubectl describe pvc "$PVC" -n "$NS"
SC=$(kubectl get pvc "$PVC" -n "$NS" -o jsonpath='{.spec.storageClassName}')
kubectl get storageclass "$SC" -o yaml
kubectl get resourcequota -n "$NS"
kubectl get events -n "$NS" --field-selector involvedObject.kind=PersistentVolumeClaim --sort-by=.lastTimestamp

# CSI controller/provisioner examples vary by driver; search first.
kubectl get pods -A | grep -Ei 'csi|provisioner|storage'
kubectl logs -n kube-system deploy/<csi-controller-deployment> --tail=100
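
One way to separate the normal WaitForFirstConsumer wait from a real provisioning failure is to read the claim's StorageClass and its binding mode. The sketch below is a hypothetical helper (pvc_pending_hint is an illustrative name, not a kubectl feature). It assumes kubectl already points at the right cluster and only reads objects.

```shell
# Hypothetical read-only helper: hint at why a PVC may be Pending.
# Usage: pvc_pending_hint <namespace> <pvc>
pvc_pending_hint() {
  ns=$1
  pvc=$2

  # StorageClass named on the claim (empty if none was set or defaulted).
  sc=$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.spec.storageClassName}')
  if [ -z "$sc" ]; then
    echo "no-storageclass: claim names no StorageClass; check for a default class or a matching static PV"
    return 0
  fi

  # volumeBindingMode is a top-level field of the StorageClass object.
  if ! mode=$(kubectl get storageclass "$sc" -o jsonpath='{.volumeBindingMode}' 2>/dev/null); then
    echo "storageclass-missing: $sc not found"
    return 0
  fi

  case "$mode" in
    WaitForFirstConsumer)
      echo "wait-for-consumer: $sc binds only after a Pod using the claim is scheduled" ;;
    *)
      echo "immediate: $sc should provision right away; check PVC events and provisioner logs" ;;
  esac
}
```

A wait-for-consumer result on a claim that no Pod references yet is expected. The same result with a Pod stuck in Pending points toward scheduling or topology rather than the provisioner.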

ContainerCreating / FailedMount

If the PVC is Bound but the Pod is stuck in ContainerCreating, focus on attach and mount. Pod events usually show whether the failure is an attach, mount, filesystem, permission, or timeout problem.

failed-mount.sh (bash)
NS="<namespace>"   # replace the placeholders before running
POD="<pod>"

kubectl describe pod "$POD" -n "$NS" | sed -n '/Events/,$p'
kubectl get pod "$POD" -n "$NS" -o wide
NODE=$(kubectl get pod "$POD" -n "$NS" -o jsonpath='{.spec.nodeName}')
kubectl describe node "$NODE" | grep -i -E 'attach|mount|volume|disk|pressure'

# CSI node plugin on the same node.
kubectl get pods -A -o wide | grep -Ei 'csi|storage' | grep "$NODE"
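
Grepping the wide Pod listing works, but a field selector narrows the search to the affected node before any text matching. The helper below is a sketch under one loud assumption: the CSI node plugin Pod has "csi" somewhere in its name. The function name csi_pods_on_node is hypothetical.

```shell
# Hypothetical helper: list CSI-looking Pods scheduled on one node.
# Assumption: the driver's node Pod name contains "csi" (case-insensitive).
# Usage: csi_pods_on_node <node-name>
csi_pods_on_node() {
  node=$1
  # spec.nodeName is a supported field selector for Pods.
  kubectl get pods -A --no-headers \
    --field-selector "spec.nodeName=$node" \
    -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name |
    awk 'tolower($2) ~ /csi/ { print $1 "/" $2 }'
}

# Example follow-up, using a namespace/name pair from the output:
#   kubectl logs -n <namespace> <csi-node-pod> --all-containers --tail=200
```

Reading logs from the Pod on the failing node avoids looking at a healthy replica on another node.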

Multi-Attach And Stale VolumeAttachment

ReadWriteOnce (RWO) block volumes can usually attach to only one node at a time. If a Pod is rescheduled to a new node while the old attachment still exists, the new Pod can fail with a Multi-Attach error.

volumeattachment-checks.sh (bash)
kubectl get volumeattachment -o wide
kubectl describe volumeattachment <volumeattachment-name>

kubectl get pod -A -o wide | grep <pvc-or-app-name>
kubectl get pvc -A -o wide
kubectl describe pv <pv-name> | grep -E 'Claim|StorageClass|Reclaim|VolumeHandle|Node Affinity'

Be careful: Do not delete VolumeAttachment objects or force-detach cloud disks unless the client runbook confirms the old node is gone or safe. Forced detach can corrupt filesystems for some workloads.
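
To see whether one volume is recorded against more than one node, it can help to filter VolumeAttachment objects by PV name. This is a read-only sketch; attachments_for_pv is a hypothetical name, and the field paths are those of the storage.k8s.io/v1 VolumeAttachment object.

```shell
# Hypothetical read-only helper: show which nodes a PV is attached to.
# Prints: <volumeattachment-name> <node> <attached true|false>
# Usage: attachments_for_pv <pv-name>
attachments_for_pv() {
  pv=$1
  kubectl get volumeattachment --no-headers \
    -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached |
    awk -v pv="$pv" '$2 == pv { print $1, $3, $4 }'
}
```

Two rows for an RWO volume suggest the old attachment has not been cleaned up yet. Confirm the state of the old node before acting on that.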

CSI Driver Health

A CSI driver usually runs controller Pods for provisioning, attach, resize, and snapshot operations, and node Pods for mount and unmount. Names vary by vendor, so search by csi, the driver name, or the platform docs.

csi-health.sh (bash)
kubectl get csidriver
kubectl get pods -A | grep -Ei 'csi|ebs|efs|gce|azure|ceph|longhorn|rook|nfs'
kubectl get daemonset,deploy -A | grep -Ei 'csi|storage|longhorn|rook'

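# Example patterns; adjust the namespace and names for the client's driver.
# Note: logs on a Deployment or DaemonSet read from one Pod chosen by kubectl,
# so for a node-specific problem target the CSI node Pod on that node instead.
kubectl logs -n kube-system deploy/<csi-controller> --all-containers --tail=200
kubectl logs -n kube-system daemonset/<csi-node> --all-containers --tail=200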

Zone And Topology Problems

Zonal disks must attach to nodes in compatible zones. WaitForFirstConsumer helps avoid provisioning a disk in a zone where no scheduled Pod can run, but node selectors, affinity, taints, and capacity can still cause conflicts.

topology-storage.sh (bash)
kubectl get nodes -L topology.kubernetes.io/zone
kubectl describe pv <pv-name> | sed -n '/Node Affinity/,$p'
kubectl describe pod <pod> -n <namespace> | grep -i -E 'node-selector|affinity|taint|toleration|zone|volume'
kubectl get storageclass <storageclass> -o yaml | grep volumeBindingMode
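
The zone comparison can also be scripted. The sketch below assumes the driver records zones on the PV under the standard topology.kubernetes.io/zone key. Some CSI drivers use their own topology key, in which case the result is "unknown" and the key needs adjusting. check_pv_node_zone is a hypothetical name.

```shell
# Hypothetical helper: compare a PV's zone affinity with a node's zone label.
# Assumption: zones are published under topology.kubernetes.io/zone.
# Usage: check_pv_node_zone <pv-name> <node-name>
check_pv_node_zone() {
  pv=$1
  node=$2

  # Zones allowed by the PV's node affinity (space-separated if several).
  pv_zones=$(kubectl get pv "$pv" -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[*].matchExpressions[?(@.key=="topology.kubernetes.io/zone")].values[*]}')
  # Zone label on the node; dots in the label key are escaped for jsonpath.
  node_zone=$(kubectl get node "$node" -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}')

  if [ -z "$pv_zones" ]; then
    echo "unknown: $pv has no affinity on topology.kubernetes.io/zone"
    return 0
  fi
  for zone in $pv_zones; do
    if [ "$zone" = "$node_zone" ]; then
      echo "ok: node $node is in zone $zone"
      return 0
    fi
  done
  echo "mismatch: PV zones [$pv_zones], node zone [$node_zone]"
  return 1
}
```

A mismatch means the volume cannot attach to that node. Check the Pod's node selectors, affinity, and tolerations before touching the volume.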

Permission Denied Inside Mounted Volume

If the Pod starts but the app cannot write, the issue is often filesystem ownership, securityContext, NFS export permissions, root-squash behavior, SELinux/AppArmor, or read-only mounts.

volume-permissions.sh (bash)
kubectl exec -n <namespace> <pod> -- id
kubectl exec -n <namespace> <pod> -- mount | grep <mount-path>
kubectl exec -n <namespace> <pod> -- ls -la <mount-path>
kubectl get pod <pod> -n <namespace> -o yaml | grep -n -E 'fsGroup|runAsUser|readOnly|volumeMounts|securityContext'

Reclaim Policy And Data-Loss Investigation

If someone deleted a PVC or PV, immediately determine whether the backend storage was deleted, retained, snapshotted, or orphaned. Do not recreate objects blindly; you may accidentally bind to the wrong volume or overwrite data.

reclaim-investigation.sh (bash)
kubectl get pv -o custom-columns=NAME:.metadata.name,CLAIM:.spec.claimRef.name,RECLAIM:.spec.persistentVolumeReclaimPolicy,STATUS:.status.phase,STORAGECLASS:.spec.storageClassName
kubectl get pvc -A -o wide
kubectl get volumesnapshot -A 2>/dev/null
kubectl get events -A --sort-by=.lastTimestamp | grep -i -E 'deleted|persistentvolume|persistentvolumeclaim|snapshot'

# If PV still exists, inspect backend handle before touching anything.
kubectl describe pv <pv-name> | grep -E 'VolumeHandle|Reclaim Policy|Claim|StorageClass|Status'
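
Because a Delete reclaim policy can remove the backend disk along with the PV, it is useful to know up front which volumes carry it. The following read-only sketch filters the PV list by policy; pvs_with_delete_policy is a hypothetical name.

```shell
# Hypothetical read-only helper: list PVs whose reclaim policy is Delete.
# Prints: <pv-name> <phase> <claim-namespace>/<claim-name>
pvs_with_delete_policy() {
  kubectl get pv --no-headers \
    -o custom-columns=NAME:.metadata.name,RECLAIM:.spec.persistentVolumeReclaimPolicy,STATUS:.status.phase,CLAIMNS:.spec.claimRef.namespace,CLAIM:.spec.claimRef.name |
    awk '$2 == "Delete" { print $1, $3, $4 "/" $5 }'
}
```

Any PV in this list that backs critical data deserves a verified backup or snapshot before any PVC or PV operation.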

NFS / CIFS Failure Cues

| Symptom | Likely Cause | Check |
| --- | --- | --- |
| Mount timeout | Firewall, route, server down, DNS issue. | Node reachability to server and port. |
| Permission denied | Export ACL, root squash, UID/GID mismatch, CIFS credentials. | Export policy and Pod user/group. |
| Stale file handle | NFS server/export changed under mounted clients. | Server logs and remount plan. |
| Slow writes | Network latency, server saturation, sync settings. | Storage metrics and node network. |

MySQL / StatefulSet Incident Example

For MySQL, treat storage symptoms as data-risk symptoms. First establish whether the PVC is bound, which PV/backend disk it points to, whether backups exist, and whether the Pod is failing before or after mounting the volume.

mysql-storage-incident.sh (bash)
kubectl get sts,pod,svc,pvc -n data -l app=mysql -o wide
kubectl describe pod mysql-0 -n data
kubectl logs mysql-0 -n data --previous --tail=100
kubectl describe pvc data-mysql-0 -n data
kubectl get pv
kubectl get cronjob,job -n data | grep -i backup

Safe Vs Dangerous Commands

| Usually Safe Read-Only | Dangerous Without Runbook |
| --- | --- |
| kubectl get/describe pod,pvc,pv,storageclass,volumeattachment | kubectl delete pvc |
| kubectl get events | kubectl delete pv |
| kubectl logs for CSI components | Force-detach cloud disk |
| Inspect snapshots/backups | Patch PV claimRef without recovery plan |
| Check StorageClass reclaim policy | Recreate StatefulSet with different claim names |

Symptom To Cause

| Symptom | Likely Cause | Check First |
| --- | --- | --- |
| PVC Pending | StorageClass missing, quota, CSI provisioner, WaitForFirstConsumer. | PVC events and StorageClass. |
| FailedAttachVolume | Cloud/backend attach issue, stale attachment, wrong zone. | VolumeAttachment and CSI controller logs. |
| FailedMount | CSI node issue, filesystem, credentials, NFS/CIFS permissions. | Pod events and CSI node logs. |
| Multi-Attach error | RWO volume attached to another node. | Old Pod/node and VolumeAttachment. |
| Permission denied | UID/GID, fsGroup, export policy, read-only mount. | Pod securityContext and filesystem ownership. |
| Expansion stuck | StorageClass/driver unsupported, filesystem resize pending. | PVC conditions and resizer logs. |
| Data missing | Wrong PVC, deleted backend, empty re-created volume, app initialized new data dir. | PV handle, audit trail, backups. |
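
For the "Expansion stuck" row, the PVC conditions and the StorageClass can be read together. The sketch below is a hypothetical helper (pvc_expansion_status is an illustrative name) that assumes a dynamically provisioned claim with a StorageClass. It reports requested versus actual capacity and the condition types on the claim.

```shell
# Hypothetical read-only helper: summarize why a PVC expansion may be stuck.
# Usage: pvc_expansion_status <namespace> <pvc>
pvc_expansion_status() {
  ns=$1
  pvc=$2

  sc=$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.spec.storageClassName}')
  requested=$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.spec.resources.requests.storage}')
  actual=$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.status.capacity.storage}')
  conditions=$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.status.conditions[*].type}')
  # allowVolumeExpansion is a top-level boolean on the StorageClass.
  allow=$(kubectl get storageclass "$sc" -o jsonpath='{.allowVolumeExpansion}')

  echo "requested=$requested actual=$actual conditions=${conditions:-none}"
  if [ "$allow" != "true" ]; then
    echo "blocked: StorageClass $sc does not set allowVolumeExpansion: true"
  elif [ "$requested" = "$actual" ]; then
    echo "done: capacity matches the request"
  else
    case " $conditions " in
      *" FileSystemResizePending "*)
        echo "pending: filesystem resize is waiting on the node; it usually finishes once a Pod mounts the volume" ;;
      *)
        echo "check: see PVC events and the driver's resizer logs" ;;
    esac
  fi
}
```

The "blocked" branch covers the StorageClass side only. A driver that lacks expansion support will usually surface in PVC events or the resizer logs instead.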

Before Deleting Anything

  1. Identify owner: app team, platform team, Helm release, ArgoCD app, Terraform, or storage operator.
  2. Check reclaim policy: Delete may remove the backend disk when PVC/PV is deleted.
  3. Confirm backup: find latest successful backup or snapshot and verify restore path.
  4. Capture handles: PV name, PVC name, VolumeHandle, node, zone, cloud disk ID.
  5. Use client runbook: storage recovery often has platform-specific steps.
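
The "capture handles" step can be scripted so the identifiers land in the incident notes in one line. This is a minimal sketch assuming a CSI-provisioned PV, where the backend ID lives under .spec.csi.volumeHandle. In-tree or statically defined PVs keep their disk ID in other fields. record_volume_handles is a hypothetical name.

```shell
# Hypothetical read-only helper: print identifiers worth recording before
# any destructive step. Assumption: the PV is CSI-backed.
# Usage: record_volume_handles <namespace> <pvc>
record_volume_handles() {
  ns=$1
  pvc=$2

  # .spec.volumeName is empty until the claim is bound.
  pv=$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.spec.volumeName}')
  if [ -z "$pv" ]; then
    echo "pvc=$ns/$pvc pv=<unbound>"
    return 0
  fi

  policy=$(kubectl get pv "$pv" -o jsonpath='{.spec.persistentVolumeReclaimPolicy}')
  handle=$(kubectl get pv "$pv" -o jsonpath='{.spec.csi.volumeHandle}')
  echo "pvc=$ns/$pvc pv=$pv reclaim=$policy volumeHandle=${handle:-<none>}"
}
```

For example, `record_volume_handles data data-mysql-0 >> incident-notes.txt` appends the line to a local notes file. The file name is illustrative.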