Amazon EKS Deep Dive
EKS runs the Kubernetes control plane as a managed AWS service; you operate data plane compute (often managed node groups or Karpenter), VPC networking and security groups, IAM (IRSA for pods, instance profiles for nodes), cluster add-ons, and the AWS Load Balancer Controller. Automate foundations with IaC (Terraform); keep YAML for add-ons tuned to AWS limits and upgrade windows.
Architecture & Trust Boundaries
Unlike kubeadm, you never SSH to control plane nodes: AWS scales and patches the API server and etcd. Your responsibility is subnets, IAM, add-ons, workloads, and change windows for Kubernetes version upgrades (EKS platform versions within a minor are rolled out by AWS).
Mental split: AWS runs the apiserver/etcd stack; your VPC and IAM wire workers and cloud integrations.
Creating & Accessing Clusters
# Typical flow after Terraform or eksctl provisioning.
aws eks update-kubeconfig --name prod-platform --region us-east-1
# Kubernetes minor and EKS platform version are separate values; kubectl reports only the former.
kubectl version -o yaml
aws eks describe-cluster --name prod-platform --region us-east-1 \
  --query 'cluster.{kubernetes:version,platform:platformVersion}'
kubectl get nodes -o wide
# STS caller identity confirms which IAM principal your kubeconfig wrapper uses.
aws sts get-caller-identity
Compute: Node Groups & Alternatives
| Model | You manage | Operators like it when… |
|---|---|---|
| EKS managed node groups (MNG) | AMI family, sizing, subnets, IAM instance profile attached by EKS/LT | You want AWS to roll AMI patches with defined disruption budgets. |
| Self-managed Auto Scaling Groups | bootstrap script, AMI build, patching cadence | You need custom AMIs or launch templates beyond MNG ergonomics. |
| Fargate profiles | pod sizing, subnets, selectors only | Burst/low-ops workloads; no DaemonSet-heavy suites. |
| Karpenter / Cluster Autoscaler | scaling rules, interruption handling, quotas | Rapid elasticity and bin-packing; pair with interruption awareness. |
# Terraform / eksctl equivalents set this; illustrative node labels/taints shape.
labels:
  workload: general
  topology.kubernetes.io/zone: "${AZ}" # Often set automatically from subnet.
taints:
  - key: "nvidia.com/gpu"
    value: "shared"
    effect: "NoSchedule"
IRSA: Pod IAM Without Static Keys
Map a Kubernetes ServiceAccount to an IAM role backed by your cluster OIDC issuer. Provision the IAM role and its trust policy via the Terraform IRSA pattern, then annotate the ServiceAccount in manifests or Helm values.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-sqs-consumer
  namespace: payments
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/prod-app-sqs
    eks.amazonaws.com/sts-regional-endpoints: "true" # Helps when STS global endpoint is flaky.
Detailed trust boundaries and SG patterns: AWS IAM & Security Groups.
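The trust policy behind that annotation is where namespace and ServiceAccount names are pinned. As a minimal sketch (not the Terraform module's actual output), the function below prints a trust policy for one ServiceAccount. The function name and the issuer ID are hypothetical; the account ID, namespace, and ServiceAccount name reuse the manifest above.

```shell
# Print an IAM trust policy that lets exactly one ServiceAccount assume a role
# through the cluster's OIDC provider (the IRSA pattern).
# Arguments: account id, OIDC issuer host/path (without https://), namespace, SA name.
irsa_trust_policy() {
  account_id="$1"; oidc="$2"; namespace="$3"; sa="$4"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${account_id}:oidc-provider/${oidc}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${oidc}:aud": "sts.amazonaws.com",
          "${oidc}:sub": "system:serviceaccount:${namespace}:${sa}"
        }
      }
    }
  ]
}
EOF
}

# Hypothetical issuer ID; the real value (minus the https:// prefix) comes from:
#   aws eks describe-cluster --name prod-platform \
#     --query 'cluster.identity.oidc.issuer' --output text
irsa_trust_policy 123456789012 \
  "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE" \
  payments app-sqs-consumer
```

Because the `sub` condition names both the namespace and the ServiceAccount, renaming either one without updating the role breaks the mapping.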
EKS Add-ons & Versioning
AWS distributes tested versions of VPC CNI, CoreDNS, kube-proxy, CSI drivers, Pod Identity Agent, etc. Decide who owns Helm vs EKS-managed add-ons to avoid duplication.
| Add-on domain | Examples | Notes |
|---|---|---|
| Networking / DNS | vpc-cni, kube-proxy, CoreDNS | Align versions with Kubernetes platform; plan upgrades with cluster lifecycle. |
| Identity | IAM Pod Identity Agent (optional alternative to IRSA) | Pick one dominant pod-AWS pattern org-wide. |
| Storage | EBS CSI, EFS CSI | Separate IAM/IRSA roles per driver; KMS for encryption contexts. |
| Ingress / external cloud | AWS Load Balancer Controller (Helm usual) | Needs IRSA permissions to manage ELBv2; interacts with subnets tagged for ELB (Terraform snippet). |
aws eks list-addons --cluster-name prod-platform --region us-east-1
aws eks describe-addon --cluster-name prod-platform --addon-name vpc-cni --region us-east-1
AWS Load Balancer Controller
Implements Ingress (with Gateway API support in progress) on AWS Elastic Load Balancing. It depends on subnets tagged per scheme (public or internal), an IRSA IAM policy, optional WAF integrations, and the chosen target type (ip vs instance). See Kubernetes Service nuances in Services & Load Balancers.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  rules:
    - host: web.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 80
Cluster Autoscaler Basics
Cluster Autoscaler watches for Pods stuck in Pending, evaluates scheduler constraints, scales ASGs/MNGs within their min/max, and drains nodes gracefully on scale-down. It is separate from the Horizontal Pod Autoscaler, which scales pods rather than nodes, and it sits alongside Karpenter as the node-scaling option that different organizations standardize on.
# RBAC-heavy component; confirm deployment args match ASG/MNG discovery tags and cloud provider.
kubectl -n kube-system logs deploy/cluster-autoscaler --tail=200
# Pods waiting on topology or resources: CA scales out only when scheduling genuinely fails.
kubectl get events -A --sort-by=.lastTimestamp | tail -40
- IAM: the controller needs describe and terminate permissions across EC2 and Auto Scaling per AWS docs (often granted via IRSA).
- Each node group exposes min/max/desired caps — CA cannot exceed AWS ASG boundaries.
- Cluster-wide upgrades happen control-plane-first; cordon/drain node groups thoughtfully.
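Two of the points above lend themselves to a quick sanity check: the discovery tags the autoscaler is told to look for, and how much room an ASG has left before its max. The sketch below assumes the commonly documented auto-discovery tag keys; both helper names are made up for illustration.

```shell
# Build the Cluster Autoscaler auto-discovery flag for a cluster name.
# ASGs (including those behind managed node groups) must carry both tag keys.
ca_discovery_flag() {
  cluster="$1"
  echo "--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/${cluster}"
}

# Nodes the autoscaler can still add before it hits the ASG max.
asg_headroom() {
  desired="$1"; max="$2"
  echo $(( max - desired ))
}

ca_discovery_flag prod-platform
asg_headroom 7 10   # prints 3
```

A headroom of zero means Pending pods will stay Pending regardless of what the autoscaler logs say, so it is worth checking before digging into scheduler constraints.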
Helm Shape: AWS Load Balancer Controller
Below is a representative values.yaml fragment—pin chart versions in your pipeline the same way you pin Terraform providers. IRSA role ARNs must exist before helm upgrade applies.
clusterName: prod-platform
region: us-east-1
vpcId: vpc-0123456789abcdef0
serviceAccount:
  create: true
  name: aws-load-balancer-controller
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/prod-alb-controller
enableServiceMutatorWebhook: true
ingressClassConfig:
  default: true
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    memory: 512Mi
nodeSelector: {}
tolerations: []
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                  - aws-load-balancer-controller
          topologyKey: kubernetes.io/hostname
defaultTags:
  Environment: prod
  Cluster: prod-platform
logLevel: info
enableShield: false
enableWaf: false
enableWafv2: false
VPC CNI & Prefix Delegation
Prefix delegation increases IP density per ENI—critical for dense pod counts before hitting ENI quotas. Coordinate warm pool settings with application burst patterns; mis-tuned settings show up as FailedCreatePodSandBox events while Services still appear healthy at the control plane.
| Setting family | Why it matters |
|---|---|
| WARM_PREFIX_TARGET | Balances pre-allocated prefixes vs cold attach latency during scale-out. |
| ENABLE_PREFIX_DELEGATION | Must align with subnet sizing and routable address expectations. |
| Security groups for Pods | Each additional SG rule compounds the effect on effective throughput; review together with the security group page. |
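To make the density gain concrete, the sketch below applies the arithmetic commonly used for EKS max-pods estimates: one address per ENI is reserved as its primary IP, and with prefix delegation each remaining slot holds a /28 prefix (16 addresses). The function name is hypothetical, and the ENI and per-ENI IP counts are inputs you would look up for your instance type.

```shell
# Estimate the pod IP ceiling for a node.
#   without prefix delegation: enis * (ips_per_eni - 1) + 2
#   with prefix delegation:    enis * ((ips_per_eni - 1) * 16) + 2
max_pod_ips() {
  enis="$1"; ips_per_eni="$2"; prefix_delegation="${3:-false}"
  slots=$(( ips_per_eni - 1 ))      # one IP per ENI is the ENI's primary address
  if [ "$prefix_delegation" = "true" ]; then
    slots=$(( slots * 16 ))         # each slot holds a /28 prefix instead of one IP
  fi
  echo $(( enis * slots + 2 ))      # +2 for host-network pods such as aws-node and kube-proxy
}

# Example: an instance type with 3 ENIs and 10 IPv4 addresses per ENI.
max_pod_ips 3 10 false   # prints 29
max_pod_ips 3 10 true    # prints 434
```

The second figure is an address ceiling, not a recommended kubelet setting: AWS guidance caps max-pods well below it (110 is the usual figure for smaller instances), and subnets need enough contiguous /28 blocks to back the prefixes.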
Fargate & Windows Footnotes
- Fargate: DaemonSets (CNI logging, node-exporter patterns) do not run on Fargate; move node-level observability into sidecars within your Deployments or adopt Fargate-aware agents.
- Windows nodes: use separate MNG pools and distinct tolerations on workloads; patch cycles differ from Linux AMIs.
- Cluster Autoscaler still needs IAM access to each ASG: even if workloads run on ephemeral Fargate, static MNG pools may coexist.
Operational Checklists
| Area | SRE checks |
|---|---|
| VPC hygiene | Subnet tagging for ELBs, NAT path for private pulls, SG rules between control-plane ENIs & workers. |
| Admission & API | An unreachable API server is often an IAM auth or STS partition issue; webhook latency causes cascading failures. |
| Add-on drift | In-cluster Helm vs eksctl-managed vs EKS add-on — unify ownership. |
| Costs | Monitor idle MNG GPU nodes, orphaned ELBs/TargetGroups across namespaces. |
VPC, Subnets & Routing
Worker nodes commonly live on private subnets with NAT gateways for egress. Internet-facing load balancers are placed in subnets tagged kubernetes.io/role/elb, and internal ones in subnets tagged kubernetes.io/role/internal-elb (Terraform sample tags). Cross-AZ SG rules and NACL pitfalls still apply: when NodePort or hostNetwork patterns appear during incidents, correlate with our Services guidance before blindly editing SG ingress.
| Decision | Recommendation |
|---|---|
| Single vs multi NAT | Prefer one NAT GW per AZ for HA data-plane egress paths; beware cost vs blast radius trade-offs. |
| IPv6-enabled VPC | Supports dual-stack Services and newer networking features; regression-test CNI & prefix delegation. |
| Restricted outbound | Allow ECR, STS, APIs your IRSA workloads require; egress proxy requires trust bundle injection on nodes. |
| Hybrid cloud routes | Routes learned over BGP/TGW must not overlap the Pod CIDR; overlap produces silent half-open TCP sessions. |
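Since the subnet role tags decide where load balancers land, a small helper can keep the scheme-to-tag mapping in one place. This is an illustrative sketch: the function names and subnet ID are placeholders, and the second function only prints the tagging command so it can be reviewed before anyone runs it.

```shell
# Map a load balancer scheme to the subnet role tag used for subnet discovery.
elb_role_tag() {
  case "$1" in
    internet-facing) echo "kubernetes.io/role/elb" ;;
    internal)        echo "kubernetes.io/role/internal-elb" ;;
    *) echo "unknown scheme: $1" >&2; return 1 ;;
  esac
}

# Print (do not run) the tagging command for a subnet.
print_tag_command() {
  subnet_id="$1"; scheme="$2"
  tag_key="$(elb_role_tag "$scheme")" || return 1
  echo "aws ec2 create-tags --resources ${subnet_id} --tags Key=${tag_key},Value=1"
}

print_tag_command subnet-0123456789abcdef0 internal
```

In practice these tags usually live in the Terraform subnet definitions; a printed command like this is mainly useful for one-off repairs during an incident.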
# Compare AWS subnet tags consumed by CCM / LB controller automation.
aws ec2 describe-subnets \
  --filters "Name=tag:kubernetes.io/cluster/prod-platform,Values=owned" \
  --query 'Subnets[*].{ID:SubnetId,AZ:AvailabilityZone,Name:Tags[?Key==`Name`].Value|[0]}'
APIServer Authorization & Access Entries
| Mechanism | When it appears | Operational note |
|---|---|---|
| aws-auth ConfigMap (legacy) | Older clusters that bridge IAM principals to Kubernetes RBAC through a ConfigMap | A malformed edit can lock out every engineer at once; prefer Git-reviewed changes. |
| EKS access entries API | IAM principal binds to Kubernetes groups or access policies such as cluster admin | Cleaner audit trails; aligns with SCP-governed principals. |
| Policy webhooks (admission) | Open Policy Agent / Kyverno / custom webhooks | Added latency can escalate into cluster-wide outages; watch apiserver request latency and etcd watch lag. |
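For the access entries row, the flow is two API calls: register the principal, then attach an access policy with a scope. The sketch below prints those calls for review rather than running them. It assumes the cluster's authentication mode already allows access entries, and the function name and role ARN are hypothetical.

```shell
# Print the two CLI calls that grant an IAM principal read-only, cluster-wide access.
grant_view_access() {
  cluster="$1"; region="$2"; principal_arn="$3"
  echo "aws eks create-access-entry --cluster-name ${cluster} --region ${region} --principal-arn ${principal_arn}"
  echo "aws eks associate-access-policy --cluster-name ${cluster} --region ${region} --principal-arn ${principal_arn} --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy --access-scope type=cluster"
}

grant_view_access prod-platform us-east-1 arn:aws:iam::123456789012:role/oncall-readonly
```

Unlike an aws-auth edit, each call touches a single principal, so a mistake affects one entry instead of the whole mapping.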
# Validate effective RBAC independent of IAM wrapper (after kubeconfig merges).
kubectl auth can-i list secrets -n kube-system
kubectl auth can-i create pods --as=system:serviceaccount:default:debugger
Kubernetes & Platform Upgrades
Advance one minor Kubernetes version per maintenance window whenever possible; skip versions only when AWS publishes explicit exemption guidance. Rotate node groups progressively: bootstrap node groups on the new AMI, cordon and drain older nodes while honoring PodDisruptionBudgets, and shrink old ASGs only after DaemonSets report healthy replacements.
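The one-minor-per-window rule is easy to enforce as a pipeline gate before any upgrade call is made. A minimal sketch, with hypothetical function names, assuming versions are written as major.minor:

```shell
# Extract the minor component from a version such as 1.30.
minor_of() { echo "$1" | cut -d. -f2; }

# Succeed only if the target is the same minor or exactly one minor ahead.
is_single_minor_step() {
  current="$1"; target="$2"
  [ "$(echo "$current" | cut -d. -f1)" = "$(echo "$target" | cut -d. -f1)" ] || return 1
  step=$(( $(minor_of "$target") - $(minor_of "$current") ))
  [ "$step" -ge 0 ] && [ "$step" -le 1 ]
}

# The current version would normally come from:
#   aws eks describe-cluster --name prod-platform --query 'cluster.version' --output text
if is_single_minor_step 1.29 1.30; then echo "ok: 1.29 -> 1.30"; fi
if ! is_single_minor_step 1.28 1.30; then echo "refuse: 1.28 -> 1.30 skips a minor"; fi
```

The same gate can be pointed at node group versions to confirm that kubelets never drift ahead of the control plane.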
kubectl get apiservice | grep False
kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration
kubectl get pods -A | grep -vE 'Running|Completed' || true
kubectl describe nodes | grep -iE 'pressure|Kubelet' || true
NLB Shape For Service Kind LoadBalancer
Some teams prefer a Service of type LoadBalancer with NLB annotations while others standardize purely on Ingress. Keep annotations consistent cluster-wide (nlb-target-type, health probes, cross-zone).
apiVersion: v1
kind: Service
metadata:
  name: edge-tcp
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
spec:
  type: LoadBalancer
  selector:
    app: tcp-proxy
  ports:
    - name: tcp5443
      port: 5443
      targetPort: 5443
Troubleshooting Matrix
| Signal | Hypothesis path | Evidence commands |
|---|---|---|
| nodes NotReady flood | APIServer outage, cgroup pressure, IMDS hops after IRSA regressions | journalctl -u kubelet via SSM/log aggregation; STS CloudTrail anomalies. |
| Pods wedged Pending | Insufficient ASG caps, selectors, DaemonSet starvation | kubectl describe pod + CA logs (Autoscaler). |
| Image pulls fail sporadically | ECR DENY from node role, STS throttling downstream of IRSA | ECR + STS metrics; widen node egress temporarily for triage. |
| Ingress timeouts only from internet | wrong ALB subnets, TG unhealthy, MTU path issues via VPN | kubectl describe ingress + AWS LB health tab. |
| Webhook TLS errors | Expired serving certs behind cert-manager outage | kubectl get apiservice, apiserver aggregated logs filter. |
Rolling Control Plane Upgrades (Shape)
#!/usr/bin/env bash
set -euo pipefail
CLUSTER="${CLUSTER:-prod-platform}"
REGION="${REGION:-us-east-1}"
# 1) advance control plane version after change window approval
aws eks update-cluster-version \
--name "$CLUSTER" --region "$REGION" \
--kubernetes-version "${TARGET_MINOR:-1.30}"
# 2) wait until ACTIVE between dependent steps — poll with backoff externally
until aws eks describe-cluster --name "$CLUSTER" --region "$REGION" \
--query 'cluster.status' --output text | grep -qx ACTIVE; do
echo "waiting control plane converge..."
sleep 30
done
# 3) refresh node AMI / kubelet per nodegroup name from IaC outputs
NODEGROUP=$(aws eks list-nodegroups --cluster-name "$CLUSTER" --region "$REGION" \
--query 'nodegroups[0]' --output text)
echo "planned rolling update targeting $NODEGROUP"
# 4) reconcile addons after nodes healthy — ensure compatibility matrix consulted
aws eks list-addons --cluster-name "$CLUSTER" --region "$REGION"
# Document manual verification gates (Ingress smoke, STS IRSA workloads) before declaring complete.
AWS Quotas & Limits To Track
- ENI and per-ENI IP limits per instance type cap pod density when prefix delegation is disabled.
- ELB quotas per region—large ingress churn during testing exhausts quotas quickly.
- EC2 Auto Scaling API rate limits amplified by aggressive Cluster Autoscaler loops.
- Security group rule counts including cross-referenced LB + node SG combos.
- IAM roles per account when each micro-service demands unique IRSA role.
- Route53 ChangeResourceRecordSets throttling mirrored by ExternalDNS logs.
- CloudWatch Logs ingestion spikes when apiserver audit logging is verbose.
- EKS addon API throttling surfaced as Terraform apply retries needing backoff.
- STS regional endpoint throughput during massive rollout events.
- EBS BurstBalance alarms when log-heavy nodes share gp2 pools.
- Target group Attachment limits per LB complicate multi-namespace ingress designs.
- API Discovery publish QPS spikes around CRD churn during Helm upgrades.
- WAF ACL association limits pairing with controllers toggling shields.
- Cross-AZ NAT Gateway bandwidth costs mistaken as application latency regressions.
- KMS requests per second when many pods concurrently decrypt envelopes.
- Service Quotas uplift tickets should reference FinOps stakeholder approval paths.
Surface limits early in sizing reviews alongside Terraform-driven IaC manifests so limits become code-reviewed constants.
Gotchas
- VPC mismatch: wrong subnets → nodes never join or LBs provision in the wrong SG.
- IRSA annotation typo: a subtle namespace/SA mismatch → the SDK silently falls back to the node role, so the pod runs with permissions nobody intended.
- Security group sprawl: default cluster SG edits can break apiserver/worker signaling — track changes carefully.
- Add-on duplication: two CoreDNS controllers or vpc-cni versions cause hard-to-debug iptables/IPAM errors.
- CA vs PDB: aggressive PodDisruptionBudgets can block scale-down for long periods.
- Ingress LB pending: usually IAM/IRSA on the LB controller pod or missing subnet tags; correlate with Events.
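The IRSA fallback in the list above is easy to miss because nothing errors out. One way to catch it is to compare the role a pod actually runs as with the role it is supposed to use. The sketch below parses an STS assumed-role ARN; the function names and the `deploy/app` target are hypothetical, and it assumes the AWS CLI is available inside the container.

```shell
# Extract the role name from an STS assumed-role ARN such as
#   arn:aws:sts::123456789012:assumed-role/prod-app-sqs/session-name
assumed_role_name() {
  echo "$1" | awk -F/ '/:assumed-role\// { print $2 }'
}

# Compare the role a pod actually runs as against the role it should run as.
check_irsa_role() {
  caller_arn="$1"; expected_role="$2"
  actual="$(assumed_role_name "$caller_arn")"
  if [ "$actual" = "$expected_role" ]; then
    echo "ok: pod is using ${expected_role}"
  else
    echo "mismatch: pod is using '${actual}', expected '${expected_role}'"
    return 1
  fi
}

# In a real check the ARN would come from inside the pod, for example:
#   kubectl -n payments exec deploy/app -- \
#     aws sts get-caller-identity --query Arn --output text
check_irsa_role "arn:aws:sts::123456789012:assumed-role/prod-app-sqs/session-1" prod-app-sqs
```

If the reported role is the node instance role, check the ServiceAccount annotation and the trust policy's namespace and ServiceAccount names first.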