Cloud Infrastructure Triage Engine
Operational playbooks for integration faults across CSI storage, CNI networking, Ingress / cloud load balancers, and cluster autoscaling. Works on EKS, GKE, AKS, and hybrid clusters—the Kubernetes objects and symptoms are the same; only the cloud API names differ.
1. Storage Interface (CSI) Disruption
Focus: persistent volumes stuck in transition on any cloud or bare-metal CSI driver (AWS EBS, Azure Disk, GCP PD, Ceph RBD, etc.).
Multi-Attach deadlock (RWO volumes)
Scenario: A node dies suddenly or a Spot/Preemptible VM is reclaimed. The scheduler places the StatefulSet pod on another node, but it stays in ContainerCreating with a Multi-Attach error because the storage backend still believes the volume is attached to the dead node.
# 1) Locate the stuck cloud attachment object
kubectl get volumeattachments | grep <pvc-name>
# 2) Confirm scheduled node vs attachment node
kubectl get pod <pod-name> -n <ns> -o wide
kubectl describe pod <pod-name> -n <ns> | grep -A6 Events
Fix: Delete the VolumeAttachment only after confirming the old node is terminated (not merely NotReady); otherwise you risk filesystem corruption on shared block storage.
kubectl delete volumeattachment <attachment-id>
On AWS, also verify in the EC2/EBS console that the volume is not stuck in the attaching state; on Azure/GCP, use the respective disk attachment APIs if deleting the Kubernetes object alone is insufficient.
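Before deleting anything, it can help to narrow the list to VolumeAttachments whose node no longer exists as a Kubernetes Node object. The following is a minimal sketch, assuming a hypothetical helper named `stale_attachments` (it is not part of kubectl) that reads `NAME NODE PV` rows on stdin. A missing Node object is only a hint, so the VM's terminated state should still be confirmed with the cloud provider.

```shell
# Hypothetical helper (not part of kubectl): print VolumeAttachment rows whose
# node is absent from the list of existing node names passed as arguments.
# stdin: one "NAME NODE PV" row per line.
stale_attachments() {
  existing=" $* "
  while read -r name node pv; do
    [ -n "$name" ] || continue            # skip blank lines
    case "$existing" in
      *" $node "*) ;;                     # Node object still exists: leave it alone
      *) printf '%s %s %s\n' "$name" "$node" "$pv" ;;
    esac
  done
}

# Example wiring against a live cluster (assumes kubectl is configured):
#   kubectl get volumeattachments --no-headers \
#     -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,PV:.spec.source.persistentVolumeName \
#     | stale_attachments $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}')
```

Rows printed by the helper are candidates for review, not an automatic delete list.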
Cross-zone topology conflict
Scenario: Pod Pending with 1 node(s) had volume node affinity conflict. Zone-bound block storage (EBS, Azure Disk, PD) lives in Zone A; the autoscaler added nodes in Zone B.
Fix: Align compute with storage AZ—topology spread, node selectors, or dedicated node pools per zone.
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: orders-db
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-east-1a"] # Must match PV / disk AZ
More storage patterns: Storage Troubleshooting.
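To confirm the zone mismatch before changing scheduling rules, one option is to compare the zone the PV is pinned to with the zones of the current nodes. This is a sketch under assumptions: `nodes_in_zone` is a hypothetical helper, and the PV is assumed to expose its zone through `spec.nodeAffinity` (some CSI drivers use a driver-specific topology key rather than `topology.kubernetes.io/zone`).

```shell
# Hypothetical helper: print the nodes that sit in the given zone.
# $1 = zone the PV is pinned to; stdin = "NODE ZONE" rows.
nodes_in_zone() {
  awk -v zone="$1" '$2 == zone { print $1 }'
}

# Example wiring (assumes kubectl is configured):
#   # 1) Zone(s) recorded on the PV
#   kubectl get pv <pv-name> \
#     -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[*].matchExpressions[*].values[*]}'
#   # 2) Nodes available in that zone
#   kubectl get nodes --no-headers \
#     -o 'custom-columns=NODE:.metadata.name,ZONE:.metadata.labels.topology\.kubernetes\.io/zone' \
#     | nodes_in_zone us-east-1a
```

An empty result from the second step suggests that no node in the PV's zone exists, which matches the volume node affinity conflict described above.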
2. Network Interface (CNI) & Core Routing
IP address exhaustion
Scenario: Pods stuck in ContainerCreating or FailedCreatePodSandBox.
Core issue: Host-routing CNIs (AWS VPC CNI, Azure CNI with VNet IPs) give every pod a first-class IP from the node subnet, so dense fleets drain the pool quickly.
# Replace daemonset name: aws-node, azure-cns, calico-node, cilium, etc.
kubectl logs -n kube-system daemonset/<cni-daemonset-name> --tail=100
Platform alternatives: Overlay networks (Cilium, Calico) place pods on an internal CIDR, so each node often consumes only one host IP. AWS-specific tuning: AWS VPC CNI.
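The pool arithmetic behind the exhaustion scenario can be sketched as follows. The helper names are hypothetical, and the default of five reserved addresses per subnet is an assumption that matches what AWS and Azure document; other platforms differ. Warm IP pools and prefix delegation change the real numbers, so treat the result as a rough upper bound on headroom.

```shell
# Usable addresses in an IPv4 subnet of the given prefix length.
# $1 = prefix length (e.g. 24)
# $2 = addresses reserved by the cloud per subnet (assumed default: 5)
subnet_usable_ips() {
  prefix=$1
  reserved=${2:-5}
  echo $(( (1 << (32 - prefix)) - reserved ))
}

# Rough headroom in host-routing mode: usable IPs minus one per node
# and one per pod.
# $1 = prefix length, $2 = node count, $3 = pod count
ip_headroom() {
  usable=$(subnet_usable_ips "$1")
  echo $(( usable - $2 - $3 ))
}

# Example: a /24 subnet hosting 10 nodes and 200 pods
#   ip_headroom 24 10 200
```

A headroom value near zero or below it is consistent with pods failing sandbox creation for lack of addresses.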
iptables latency at scale
Scenario: Heavy inter-service timeouts as endpoint counts grow into the thousands.
Root cause: kube-proxy in its default iptables mode evaluates rules sequentially, an O(N) scan, so connection setup latency degrades as Services and endpoints multiply.
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode
Resolution blueprint: migrate to ipvs (hash-based lookups) or an eBPF dataplane (Cilium kube-proxy replacement). Plan change windows: dataplane shifts affect every Service path.
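The grep above can return an empty `mode: ""` value, which means kube-proxy falls back to its platform default. A small sketch that normalizes this, assuming a hypothetical helper named `kube_proxy_mode` and assuming the ConfigMap embeds a KubeProxyConfiguration with a `mode:` line (the ConfigMap name and layout vary by distribution):

```shell
# Hypothetical helper: read kube-proxy config YAML on stdin and print the
# effective proxy mode. An empty or missing "mode" is reported as iptables,
# the default named in the text above.
kube_proxy_mode() {
  mode=$(sed -n 's/^[[:space:]]*mode:[[:space:]]*//p' | head -n 1 | tr -d "\"'")
  echo "${mode:-iptables}"
}

# Example wiring (assumes kubectl is configured):
#   kubectl get configmap kube-proxy -n kube-system -o yaml | kube_proxy_mode
```

The pattern only matches lines that begin with `mode:` after indentation, so keys such as `detectLocalMode` are ignored.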
3. Cloud Load Balancer & Ingress Controller
Dangling / empty Ingress ADDRESS
Scenario: kubectl get ingress shows an empty ADDRESS field indefinitely.
Triad check (provider-agnostic):
- Subnet / network tagging: Missing discoverability tags (e.g. AWS kubernetes.io/role/elb, GKE subnet delegation, AKS outbound rules).
- Controller cloud permissions: Ingress controller SA lacks IAM/RBAC to create load balancers (AWS LB Controller IRSA, GKE SA, AKS cloud provider identity).
- Webhook blocks: Admission webhooks for the controller cannot be reached—corporate firewall or SG rules between API server and webhook Service.
kubectl logs -n <ingress-namespace> deployment/<controller-deployment-name> --tail=50 -f
EKS ALB specifics: EKS - AWS Load Balancer Controller. Generic Service/LB behavior: Services & Load Balancers.
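To find every Ingress affected by the empty ADDRESS symptom rather than checking one at a time, a filter along these lines may help. It is a sketch: `ingresses_without_address` is a hypothetical helper, and it relies on kubectl printing `<none>` for missing fields in custom-columns output.

```shell
# Hypothetical helper: print "namespace/name" for Ingresses that have neither
# a load balancer hostname nor an IP.
# stdin: "NS NAME HOST IP" rows, with "<none>" marking a missing value.
ingresses_without_address() {
  awk '$3 == "<none>" && $4 == "<none>" { print $1 "/" $2 }'
}

# Example wiring (assumes kubectl is configured):
#   kubectl get ingress -A --no-headers \
#     -o 'custom-columns=NS:.metadata.namespace,NAME:.metadata.name,HOST:.status.loadBalancer.ingress[*].hostname,IP:.status.loadBalancer.ingress[*].ip' \
#     | ingresses_without_address
```

Both columns are checked because some providers report a hostname and others report an IP.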
4. Dynamic Compute Provisioning (Autoscaler Failures)
Unschedulable pod — no new capacity
Scenario: Pods Pending but node count flatlines.
kubectl describe pod <pod-name> -n <ns> | grep -A 10 Events
# Cluster Autoscaler, Karpenter, or GKE/AKS autoscaler deployment
kubectl logs -n <scaling-namespace> deployment/<autoscaler-or-karpenter> --tail=100
Common universal causes:
- Resource request imbalance: Pod requests 64Gi RAM; max instance type allows 32Gi.
- Cloud quotas: API errors like VcpuLimitExceeded / InstanceLimitExceeded in scaler logs.
- Unresolvable taints: Pod missing toleration for GPU/isolated pools (Karpenter GPU example).
- Zone + volume affinity: Scaler adds nodes where disks cannot attach—see CSI zone conflict.
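The causes listed above can often be told apart from the scheduler's FailedScheduling message alone. The sketch below maps a message to one of those causes; `classify_scheduling_message` is a hypothetical helper, and the substrings are assumptions based on common scheduler wording that may differ between Kubernetes versions. Quota errors appear in scaler logs rather than pod events, so they are not covered here.

```shell
# Hypothetical helper: map a FailedScheduling message to a likely cause.
# Only the first matching pattern is reported, even if the message lists
# several reasons.
classify_scheduling_message() {
  case "$1" in
    *"volume node affinity conflict"*)          echo "zone-volume-affinity" ;;
    *"untolerated taint"*|*"didn't tolerate"*)  echo "taint" ;;
    *Insufficient*)                             echo "insufficient-resources" ;;
    *)                                          echo "unknown" ;;
  esac
}

# Example wiring (assumes kubectl is configured):
#   kubectl get events -n <ns> --field-selector reason=FailedScheduling \
#     -o jsonpath='{range .items[*]}{.message}{"\n"}{end}' \
#     | while read -r msg; do classify_scheduling_message "$msg"; done
```

An "unknown" result means the message needs to be read in full alongside the scaler logs.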
Quick symptom matrix
| Layer | Kubernetes signal | First command |
|---|---|---|
| CSI | Multi-Attach, FailedMount | kubectl get volumeattachments |
| CNI | FailedCreatePodSandBox | CNI DaemonSet logs |
| Ingress | Empty ADDRESS, 502/504 | Controller deployment logs |
| Scaler | Pending + Insufficient * | kubectl describe pod + scaler logs |