Cloud Infrastructure Failures — K8s SRE Reference

TL;DR

Operational playbooks for integration faults across CSI storage, CNI networking, Ingress / cloud load balancers, and cluster autoscaling. Works on EKS, GKE, AKS, and hybrid clusters—the Kubernetes objects and symptoms are the same; only the cloud API names differ.

Jump to a failure domain:

1. Storage Interface (CSI) Disruption

Focus: persistent volumes stuck in transition on any cloud or bare-metal CSI driver (AWS EBS, Azure Disk, GCP PD, Ceph RBD, etc.).

Multi-Attach deadlock (RWO volumes)

Scenario: A node dies suddenly or a Spot/Preemptible VM is reclaimed. The scheduler places the Stateful pod elsewhere, but it stays in ContainerCreating with a Multi-Attach error—the backend still believes the volume is attached to the dead node.

bashcsi-multi-attach-triage.sh

# 1) Locate the stuck cloud attachment object
kubectl get volumeattachments | grep <pvc-name>

# 2) Confirm scheduled node vs attachment node
kubectl get pod <pod-name> -n <ns> -o wide
kubectl describe pod <pod-name> -n <ns> | grep -A6 Events

⚠

Safe recovery: Delete the VolumeAttachment only after confirming the old node is terminated (not merely NotReady)—otherwise you risk filesystem corruption on shared block storage.

bashvolumeattachment-clear.sh

kubectl delete volumeattachment <attachment-id>

On AWS also verify the volume is not stuck in attaching via EC2/EBS console; on Azure/GCP use the respective disk attachment APIs if the Kubernetes object delete alone is insufficient.

Cross-zone topology conflict

Scenario: Pod Pending with 1 node(s) had volume node affinity conflict. Zone-bound block storage (EBS, Azure Disk, PD) lives in Zone A; the autoscaler added nodes in Zone B.

Fix: Align compute with storage AZ—topology spread, node selectors, or dedicated node pools per zone.

yamlzone-aligned-stateful.yaml

spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: orders-db
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-east-1a"]  # Must match PV / disk AZ

More storage patterns: Storage Troubleshooting.

2. Network Interface (CNI) & Core Routing

IP address exhaustion

Scenario: Pods stuck in ContainerCreating or FailedCreatePodSandBox.

Core issue: Host-routing CNIs (AWS VPC CNI, Azure CNI with VNet IPs) consume first-class subnet IPs. Dense fleets empty the pool quickly.

bashcni-logs.sh

# Replace daemonset name: aws-node, azure-cns, calico-node, cilium, etc.
kubectl logs -n kube-system daemonset/<cni-daemonset-name> --tail=100

Platform alternatives: Overlay networks (Cilium, Calico) isolate pods on an internal CIDR—often one host IP per node. AWS-specific tuning: AWS VPC CNI.

iptables latency at scale

Scenario: Heavy inter-service timeouts as endpoint counts grow into the thousands.

Root cause: kube-proxy in default iptables mode does O(N) rule scans. Large Services degrade connection setup latency.

bashkube-proxy-mode.sh

kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode

Resolution blueprint: migrate to ipvs (hash-based lookups) or eBPF dataplane (Cilium kube-proxy replacement). Plan change windows—Dataplane shifts affect every Service path.

3. Cloud Load Balancer & Ingress Controller

Dangling / empty Ingress ADDRESS

Scenario: kubectl get ingress shows an empty ADDRESS field indefinitely.

Triad check (provider-agnostic):

Subnet / network tagging: Missing discoverability tags (e.g. AWS kubernetes.io/role/elb, GKE subnet delegation, AKS outbound rules).
Controller cloud permissions: Ingress controller SA lacks IAM/RBAC to create load balancers (AWS LB Controller IRSA, GKE SA, AKS cloud provider identity).
Webhook blocks: Admission webhooks for the controller cannot be reached—corporate firewall or SG rules between API server and webhook Service.

bashingress-controller-logs.sh

kubectl logs -n <ingress-namespace> deployment/<controller-deployment-name> --tail=50 -f

EKS ALB specifics: EKS — AWS Load Balancer Controller. Generic Service/LB behavior: Services & Load Balancers.

4. Dynamic Compute Provisioning (Autoscaler Failures)

Unschedulable pod — no new capacity

Scenario: Pods Pending but node count flatlines.

bashautoscaler-triage.sh

kubectl describe pod <pod-name> -n <ns> | grep -A 10 Events

# Cluster Autoscaler, Karpenter, or GKE/AKS autoscaler deployment
kubectl logs -n <scaling-namespace> deployment/<autoscaler-or-karpenter> --tail=100

Common universal causes:

Resource request imbalance: Pod requests 64Gi RAM; max instance type allows 32Gi.
Cloud quotas: API errors like VcpuLimitExceeded / InstanceLimitExceeded in scaler logs.
Unresolvable taints: Pod missing toleration for GPU/isolated pools (Karpenter GPU example).
Zone + volume affinity: Scaler adds nodes where disks cannot attach—see CSI zone conflict.

Quick symptom matrix

Layer	Kubernetes signal	First command
CSI	Multi-Attach, FailedMount	`kubectl get volumeattachments`
CNI	FailedCreatePodSandBox	CNI DaemonSet logs
Ingress	Empty ADDRESS, 502/504	Controller deployment logs
Scaler	Pending + Insufficient *	`kubectl describe pod` + scaler logs

Cloud Infrastructure Triage Engine

1. Storage Interface (CSI) Disruption

Multi-Attach deadlock (RWO volumes)

Cross-zone topology conflict

2. Network Interface (CNI) & Core Routing

IP address exhaustion

iptables latency at scale

3. Cloud Load Balancer & Ingress Controller

Dangling / empty Ingress ADDRESS

4. Dynamic Compute Provisioning (Autoscaler Failures)

Unschedulable pod — no new capacity

Quick symptom matrix

See also

1. Storage Interface (CSI) Disruption

Multi-Attach deadlock (RWO volumes)

Cross-zone topology conflict

2. Network Interface (CNI) & Core Routing

IP address exhaustion

iptables latency at scale

3. Cloud Load Balancer & Ingress Controller

Dangling / empty Ingress ADDRESS

4. Dynamic Compute Provisioning (Autoscaler Failures)

Unschedulable pod — no new capacity

Quick symptom matrix

See also

Related Pages