TL;DR

Operational playbooks for integration faults across CSI storage, CNI networking, Ingress / cloud load balancers, and cluster autoscaling. Works on EKS, GKE, AKS, and hybrid clusters—the Kubernetes objects and symptoms are the same; only the cloud API names differ.

Jump to a failure domain:

1. Storage Interface (CSI) Disruption

Focus: persistent volumes stuck in transition on any cloud or bare-metal CSI driver (AWS EBS, Azure Disk, GCP PD, Ceph RBD, etc.).

Multi-Attach deadlock (RWO volumes)

Scenario: A node dies suddenly or a Spot/Preemptible VM is reclaimed. The scheduler places the Stateful pod elsewhere, but it stays in ContainerCreating with a Multi-Attach error—the backend still believes the volume is attached to the dead node.

bashcsi-multi-attach-triage.sh
# 1) Locate the stuck cloud attachment object
kubectl get volumeattachments | grep <pvc-name>

# 2) Confirm scheduled node vs attachment node
kubectl get pod <pod-name> -n <ns> -o wide
kubectl describe pod <pod-name> -n <ns> | grep -A6 Events
Safe recovery: Delete the VolumeAttachment only after confirming the old node is terminated (not merely NotReady)—otherwise you risk filesystem corruption on shared block storage.
bashvolumeattachment-clear.sh
kubectl delete volumeattachment <attachment-id>

On AWS also verify the volume is not stuck in attaching via EC2/EBS console; on Azure/GCP use the respective disk attachment APIs if the Kubernetes object delete alone is insufficient.

Cross-zone topology conflict

Scenario: Pod Pending with 1 node(s) had volume node affinity conflict. Zone-bound block storage (EBS, Azure Disk, PD) lives in Zone A; the autoscaler added nodes in Zone B.

Fix: Align compute with storage AZ—topology spread, node selectors, or dedicated node pools per zone.

yamlzone-aligned-stateful.yaml
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: orders-db
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-east-1a"]  # Must match PV / disk AZ

More storage patterns: Storage Troubleshooting.

2. Network Interface (CNI) & Core Routing

IP address exhaustion

Scenario: Pods stuck in ContainerCreating or FailedCreatePodSandBox.

Core issue: Host-routing CNIs (AWS VPC CNI, Azure CNI with VNet IPs) consume first-class subnet IPs. Dense fleets empty the pool quickly.

bashcni-logs.sh
# Replace daemonset name: aws-node, azure-cns, calico-node, cilium, etc.
kubectl logs -n kube-system daemonset/<cni-daemonset-name> --tail=100

Platform alternatives: Overlay networks (Cilium, Calico) isolate pods on an internal CIDR—often one host IP per node. AWS-specific tuning: AWS VPC CNI.

iptables latency at scale

Scenario: Heavy inter-service timeouts as endpoint counts grow into the thousands.

Root cause: kube-proxy in default iptables mode does O(N) rule scans. Large Services degrade connection setup latency.

bashkube-proxy-mode.sh
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode

Resolution blueprint: migrate to ipvs (hash-based lookups) or eBPF dataplane (Cilium kube-proxy replacement). Plan change windows—Dataplane shifts affect every Service path.

3. Cloud Load Balancer & Ingress Controller

Dangling / empty Ingress ADDRESS

Scenario: kubectl get ingress shows an empty ADDRESS field indefinitely.

Triad check (provider-agnostic):

  • Subnet / network tagging: Missing discoverability tags (e.g. AWS kubernetes.io/role/elb, GKE subnet delegation, AKS outbound rules).
  • Controller cloud permissions: Ingress controller SA lacks IAM/RBAC to create load balancers (AWS LB Controller IRSA, GKE SA, AKS cloud provider identity).
  • Webhook blocks: Admission webhooks for the controller cannot be reached—corporate firewall or SG rules between API server and webhook Service.
bashingress-controller-logs.sh
kubectl logs -n <ingress-namespace> deployment/<controller-deployment-name> --tail=50 -f

EKS ALB specifics: EKS — AWS Load Balancer Controller. Generic Service/LB behavior: Services & Load Balancers.

4. Dynamic Compute Provisioning (Autoscaler Failures)

Unschedulable pod — no new capacity

Scenario: Pods Pending but node count flatlines.

bashautoscaler-triage.sh
kubectl describe pod <pod-name> -n <ns> | grep -A 10 Events

# Cluster Autoscaler, Karpenter, or GKE/AKS autoscaler deployment
kubectl logs -n <scaling-namespace> deployment/<autoscaler-or-karpenter> --tail=100

Common universal causes:

  • Resource request imbalance: Pod requests 64Gi RAM; max instance type allows 32Gi.
  • Cloud quotas: API errors like VcpuLimitExceeded / InstanceLimitExceeded in scaler logs.
  • Unresolvable taints: Pod missing toleration for GPU/isolated pools (Karpenter GPU example).
  • Zone + volume affinity: Scaler adds nodes where disks cannot attach—see CSI zone conflict.

Quick symptom matrix

LayerKubernetes signalFirst command
CSIMulti-Attach, FailedMountkubectl get volumeattachments
CNIFailedCreatePodSandBoxCNI DaemonSet logs
IngressEmpty ADDRESS, 502/504Controller deployment logs
ScalerPending + Insufficient *kubectl describe pod + scaler logs

See also