AWS IAM & Security Groups — K8s SRE Reference

TL;DR

Use IRSA or Pod Identity for fine-grained pod AWS credentials; reserve node instance profiles for kubelet, AWS CNI, and agent needs; lock down security groups deliberately for apiserver-to-node traffic, workloads, load balancers, and ephemeral pod ENIs (VPC CNI). Model everything in IaC (Terraform), and check for SG/IAM regressions first when Services or Ingress stop reconciling.

IRSA Trust Flow (OIDC)

IRSA binds a Kubernetes ServiceAccount to an IAM role via the cluster OIDC issuer; STS mints scoped credentials inside the Pod.

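Automate IAM role wiring with the iam-role-for-service-accounts-eks submodule of terraform-aws-modules/iam (see the Terraform IRSA section). Always scope the Federated trust for sts:AssumeRoleWithWebIdentity with conditions on the issuer's :aud and :sub keys; the :sub value should name the exact namespace and ServiceAccount, in the form system:serviceaccount:NAMESPACE:NAME.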
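The scoping rule above can be sketched in Python. This is a minimal illustration, not what the Terraform module emits verbatim: the helper name `irsa_trust_policy` and the sample account ID and issuer are hypothetical.

```python
import json


def irsa_trust_policy(account_id, issuer_url, namespace, service_account):
    """Return an IRSA trust policy scoped to one namespace/ServiceAccount pair."""
    # IAM references the issuer without the https:// scheme or a trailing slash.
    issuer = issuer_url.strip()
    if issuer.startswith("https://"):
        issuer = issuer[len("https://"):]
    issuer = issuer.rstrip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{issuer}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Both claims are pinned: audience and exact subject.
                        f"{issuer}:aud": "sts.amazonaws.com",
                        f"{issuer}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                    }
                },
            }
        ],
    }


def render(policy):
    """Pretty-print a policy dict for review or for pasting into IaC."""
    return json.dumps(policy, indent=2)
```

Calling `render(irsa_trust_policy("123456789012", "https://oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE/", "payments", "app-sqs-consumer"))` prints a document whose condition keys carry no scheme and no trailing slash.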

Node Instance Roles Vs Pod Roles

Principal	Attached IAM	Typical permissions
EC2 / MNG instance profile	IAM role referenced by the launch template	Pull from ECR, a limited set of Describe calls for ENIs/ASGs, KMS decrypt for node volumes.
Cluster addons (LB controller, CSI, CA)	Dedicated IAM roles via IRSA	ELBv2 mutate, AttachVolume/ebs CSI, DescribeAutoScalingGroups for CA.
Workload pods	Explicit IRSA annotations	SQS, Dynamo, Secrets Manager — never widen node profile to mimic app IAM.

whoami-pod.sh (bash)

# Exec into the problematic pod and confirm AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE are set.
kubectl -n payments exec deploy/app -- env | grep '^AWS_' || true

# get-caller-identity shows the identity the SDK will actually use (requires the AWS CLI in the image).
kubectl -n payments exec deploy/app -- aws sts get-caller-identity
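The `env | grep '^AWS_'` output can be read mechanically. The sketch below is an assumption-laden helper (the name `credential_source` and its return labels are hypothetical) that classifies which credential path the variables point to. Static keys are checked first because SDK default chains typically prefer them over web identity.

```python
def credential_source(env_output):
    """Classify the AWS credential path implied by a pod's environment variables."""
    names = {line.split("=", 1)[0] for line in env_output.splitlines() if "=" in line}
    # Static keys usually win in SDK default chains, so they are checked first.
    if {"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"} <= names:
        return "static-keys"
    # IRSA: the EKS webhook injects the role ARN and a projected token path.
    if {"AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE"} <= names:
        return "irsa"
    # Pod Identity: credentials come from the node-local agent endpoint.
    if {"AWS_CONTAINER_CREDENTIALS_FULL_URI", "AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE"} <= names:
        return "pod-identity"
    # Nothing injected: the SDK would fall through to the node role if IMDS is reachable.
    return "node-role-fallback"
```

A result of `node-role-fallback` on a workload that should have its own role usually means the ServiceAccount annotation or association is missing.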

Security Groups — Nodes, API, LB, Pod ENIs

Boundary	Usually attached to…	Things to nail
Cluster / control-plane SG	Managed ENIs bridging api → nodes	Preserve AWS-managed rules for apiserver/kubelet chatter; minimize human edits.
Node SG	Workers (and sometimes managed ENIs)	Expose only needed app ports intra-VPC; allow control-plane inbound on 443/10250 patterns per AWS baseline.
LB SG	ALB / NLB created via controller	Ingress health checks need correct target groups; correlate with subnets tagged kubernetes.io/role/elb.
Pod SG (VPC CNI feature)	Pods with dedicated SG refs	Understand IP prefix/SG interplay; egress to AWS APIs still follows the node subnet's routing (NAT or VPC endpoints).

sg-survey-shape.yaml (yaml)

# Pod template fragment pinning a workload to one node group so its ENIs/SGs can be inspected.
# Illustrative workflow; SG IDs differ per VPC.
metadata:
  annotations: {}
spec:
  nodeSelector:
    eks.amazonaws.com/nodegroup: ingress-heavy
# Cross-check actual ENIs via AWS CLI / console when debugging SG egress drops.

💡 LB pending: when a Service of type LoadBalancer hangs, correlate SG rules, subnet tagging, and LB controller IAM; see Services & Load Balancers for kube-level checks.

Common EKS IAM Mistakes

Symptom	Root cause pitfall	Rapid remediation
`AccessDenied` from pod calling AWS APIs	IAM policy missing kms:Decrypt, wrong resource ARN, or missing IRSA annotations	Inspect pod SA annotations + CloudTrail `AssumeRoleWithWebIdentity`; fix trust `:sub` string mismatches.
Creds silently too powerful	Workload falls back to the node IAM role	Annotate the SA, block pod access to IMDS where safe (e.g. IMDSv2 hop limit 1), tighten node policy.
Webhook / aws-auth breakage	Stale aws-auth mapping or SSO role rename	Use EKS Access Entries where available; reconcile aws-auth carefully.
STS throttling globally	Huge fleets assume pod roles without caching	Regional STS endpoints (`eks.amazonaws.com/sts-regional-endpoints` ServiceAccount annotation); SDK credential caching and timeouts.
IAM role quotas	Many micro-services, each churning its own IAM role	Reuse roles with tighter policy partitioning or attribute-based conditions.
Broken OIDC discovery	Provider deleted while IRSA workloads are still rolling out	Freeze deploys until the provider is restored; then restart workloads stuck on STS errors.

KMS CMKs & Secrets

Encrypt Kubernetes Secrets at rest in etcd with an AWS KMS customer managed key referenced in the cluster config; platform teams own the key policy that lets the EKS principal use it. Pods rarely talk to KMS unless application logic requires decrypt; when they do, pair the IRSA principal with a restrictive key policy.

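trail-assumerole.sh (bash)

# CloudTrail events delivered to CloudWatch Logs: AssumeRole* calls are logged by sts.amazonaws.com, not iam.amazonaws.com.
# CloudTrail/Default is an example log group name; use the one your trail writes to.
aws logs filter-log-events \
  --log-group-name CloudTrail/Default \
  --filter-pattern "AssumeRole"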

IAM Trust Relationship Shape (Conceptual)

Generated JSON below matches what Terraform IAM modules emit — double-check issuer host matches your cluster’s OIDC URL (no stray trailing slashes).

trust-policy-shape.json (text)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::ACCOUNT_ID:oidc-provider/oidc.eks.REGION.amazonaws.com/id/CLUSTER_ID"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.REGION.amazonaws.com/id/CLUSTER_ID:aud": "sts.amazonaws.com",
          "oidc.eks.REGION.amazonaws.com/id/CLUSTER_ID:sub": "system:serviceaccount:payments:app-sqs-consumer"
        }
      }
    }
  ]
}
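A copied trust policy can be checked against the cluster's issuer before it is applied. The following Python sketch is one possible lint under stated assumptions: `trust_policy_problems` is a hypothetical helper, and it only looks at the Federated principal and at the presence of `:aud` and `:sub` keys under StringEquals or StringLike.

```python
def _as_list(value):
    """IAM allows a single value or a list in most fields; normalize to a list."""
    return value if isinstance(value, list) else [value]


def trust_policy_problems(policy, issuer_url):
    """Return human-readable problems for web-identity statements in a trust policy."""
    issuer = issuer_url.strip()
    if issuer.startswith("https://"):
        issuer = issuer[len("https://"):]
    issuer = issuer.rstrip("/")  # stray trailing slashes break the match

    problems = []
    for index, stmt in enumerate(_as_list(policy.get("Statement", []))):
        if "sts:AssumeRoleWithWebIdentity" not in _as_list(stmt.get("Action", [])):
            continue
        principal = stmt.get("Principal", {})
        federated = principal.get("Federated", "") if isinstance(principal, dict) else ""
        if not str(federated).endswith(":oidc-provider/" + issuer):
            problems.append(f"statement {index}: Federated principal does not match {issuer}")
        condition = stmt.get("Condition", {})
        keys = set(condition.get("StringEquals", {})) | set(condition.get("StringLike", {}))
        for claim in ("aud", "sub"):
            if f"{issuer}:{claim}" not in keys:
                problems.append(f"statement {index}: no condition on {issuer}:{claim}")
    return problems
```

An empty list means the shape matches the issuer; it does not prove the `:sub` value names the right ServiceAccount, which still needs a human check.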

Translating Humans Into apiserver Principals

Mechanism	Use when…	Pitfalls
`aws-auth`	Legacy clusters still mapping IAM roles → Kubernetes groups	A syntax error can lock every admin out until an emergency SSM/fix.
Access Entries	New default for IAM SSO roles / break-glass users	Stale entries after IdP renames; audit quarterly.
Separate admin role per env	Finance wants tight SCP boundaries, prod vs dev	Copy/pasted SCP statements missing `kms:` encryption contexts.

aws-auth-backup.sh (bash)

# Snapshot before edits — restores must be deterministic.
kubectl -n kube-system get configmap aws-auth -o yaml > /tmp/aws-auth-backup.yaml

aws-auth ConfigMap Shape (Danger Zone)

Prefer Access Entries/EKS APIs for greenfield installs; the ConfigMap shape below is retained purely as a reference for navigating brownfield outages.

aws-auth-illustrative.yaml (yaml)

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::123456789012:role/sso-infra-platform
      username: sso-infra-{{SessionName}}
      groups:
        - system:masters  # illustrative — tighten in real environments
        - platform-crds-maintainers
    - rolearn: arn:aws:iam::123456789012:role/eks-worker-node-role
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
    - rolearn: arn:aws:iam::123456789012:role/legacy-batch
      username: batch-role
      groups:
        - batch-readonly-ns
  mapUsers: |
    - userarn: arn:aws:iam::123456789012:user/break-glass-vault
      username: break-glass
      groups:
        - emergency-cluster-admin-temporary

Every change here should go through peer review that cross-checks SSO group rename tickets against the Kubernetes groups being mapped.
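Part of that review can be automated before a merge. The sketch below is a hypothetical pre-merge lint (`lint_map_roles` is not an AWS or Kubernetes API) that takes already-parsed `mapRoles` entries as a list of dicts. It flags role ARNs that include a path, which aws-auth does not accept, plus `system:masters` grants as a house rule.

```python
def lint_map_roles(entries):
    """Return findings for parsed aws-auth mapRoles entries (a list of dicts)."""
    findings = []
    for index, entry in enumerate(entries):
        arn = entry.get("rolearn", "")
        # Expected shape: arn:PARTITION:iam::ACCOUNT_ID:role/NAME
        parts = arn.split(":", 5)
        if len(parts) != 6 or not parts[5].startswith("role/"):
            findings.append(f"entry {index}: rolearn is missing or not an IAM role ARN")
        elif parts[5].count("/") > 1:
            # aws-auth expects the role ARN without its IAM path.
            findings.append(f"entry {index}: rolearn contains a path; use role/NAME only")
        groups = entry.get("groups", [])
        if not isinstance(groups, list):
            findings.append(f"entry {index}: groups must be a list")
        elif "system:masters" in groups:
            findings.append(f"entry {index}: grants system:masters")
    return findings
```

Parsing the ConfigMap's `mapRoles` string into that list is left to whichever YAML library the pipeline already uses.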

Network Paths Pods Use For AWS APIs

AssumeRole calls reach STS either through NAT or through interface VPC endpoints. Private clusters often mandate STS and ECR VPC endpoints, with private DNS resolving to those endpoint ENIs instead of relying on a 0.0.0.0/0 internet route; misroutes appear as flaky IRSA timeouts during rollout storms.

Endpoint	Consumed by…	Operational signal
com.amazonaws.region.sts	IRSA, Pod Identity, External Secrets	STS call timeouts when the endpoint SG blocks 443 from the node SG.
ECR dkr/api	Image pulls without internet egress	403 versus throttling differentiated by AWS support metrics.
Elastic Load Balancing / EC2 APIs	Controllers creating LoadBalancer backends	Insufficient ec2:* permissions surface only as Ingress Events.
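One quick signal for the STS row is whether the regional hostname resolves to private addresses from inside the VPC, which is what an interface endpoint with private DNS produces. The helpers below are a hedged sketch with hypothetical names; `resolve` performs a real DNS lookup and is meant to be run from a node or pod.

```python
import ipaddress
import socket


def sts_regional_host(region):
    """Regional STS hostname, e.g. sts.us-east-1.amazonaws.com."""
    return f"sts.{region}.amazonaws.com"


def resolve(host):
    """Resolve a hostname to a sorted list of unique IP address strings."""
    infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def resolves_privately(addresses):
    """True when at least one address exists and all of them are private."""
    return bool(addresses) and all(
        ipaddress.ip_address(address).is_private for address in addresses
    )
```

From a pod in a private cluster, `resolves_privately(resolve(sts_regional_host("us-east-1")))` returning False suggests traffic is heading for the public endpoint via NAT rather than the interface endpoint.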

SCPs & Permission Boundaries For Platform Roles

Artifact	Impact on EKS tooling
Explicit deny on `iam:PassRole`	Breaks CD pipelines handing controller roles unless exception path.
Regional restrictions	Secrets replication + multi-region STS assumptions fail.
Boundary on node role	Autoscaler terminate calls fail even when the inline policy permits them.
Wildcard deny on unmanaged ARNs	New IRSA roles may be blocked silently until exception tickets land.

Cross-link this table from the documentation for clusters built via Terraform; execution roles there must be reconciled with organization guardrails ahead of merges.

Ingress / LB Security Group Hygiene

Concern	Recommendation
Health check flap	Ensure the node SG allows ingress from the LB SG on targetPort, even when health checks originate from AWS IPs.
Sticky `0.0.0.0/0`	Replace allowances left over from iterative debugging with automated, tightened CIDRs.
NLB preserving client IP vs proxy protocol	Align `externalTrafficPolicy` semantics with LB type per Services.
Cross-account ACM	Validate SAN + trust chain controllers expect; Ingress secret mismatch blocks listener creation.
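The first two rows can be checked against the `IpPermissions` list that `aws ec2 describe-security-groups` returns for a node SG. This is a minimal sketch with hypothetical helper names; it ignores prefix lists and rule descriptions.

```python
def allows_from_sg(ip_permissions, source_sg, port, protocol="tcp"):
    """True if any ingress rule admits source_sg on the given port and protocol."""
    for perm in ip_permissions:
        proto = perm.get("IpProtocol")
        if proto not in (protocol, "-1"):
            continue
        # "-1" means all protocols and ports; otherwise the port must be in range.
        if proto != "-1" and not (perm.get("FromPort", 0) <= port <= perm.get("ToPort", 65535)):
            continue
        if any(pair.get("GroupId") == source_sg for pair in perm.get("UserIdGroupPairs", [])):
            return True
    return False


def open_to_world(ip_permissions):
    """Return (protocol, from_port, to_port) for rules open to 0.0.0.0/0 or ::/0."""
    found = []
    for perm in ip_permissions:
        v4 = any(r.get("CidrIp") == "0.0.0.0/0" for r in perm.get("IpRanges", []))
        v6 = any(r.get("CidrIpv6") == "::/0" for r in perm.get("Ipv6Ranges", []))
        if v4 or v6:
            found.append((perm.get("IpProtocol"), perm.get("FromPort"), perm.get("ToPort")))
    return found
```

Running `open_to_world` in CI against every node and LB SG is one way to catch debugging allowances before they become permanent.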

Audit & Forensics Cheat Sheet

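cis-audit-shape.sh (bash)

# Inventory roles whose trust policy references an EKS OIDC provider (IRSA roles).
# to_string() is needed because the CLI returns AssumeRolePolicyDocument as an object, not a string.
aws iam list-roles \
  --query "Roles[?contains(to_string(AssumeRolePolicyDocument), 'oidc-provider/oidc.eks')].[RoleName]" \
  --output table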

aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRoleWithWebIdentity \
  --max-results 10
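The JSON that `lookup-events` prints can be summarized per role. In that output, each item under `Events` carries a `CloudTrailEvent` field holding the full event as a JSON string. The Python sketch below (helper name hypothetical) counts successes and failures per requested role ARN, treating any event with an `errorCode` as a failure.

```python
import json
from collections import Counter


def web_identity_outcomes(lookup_output):
    """Count (role ARN, outcome) pairs from `aws cloudtrail lookup-events` JSON output."""
    counts = Counter()
    for event in lookup_output.get("Events", []):
        detail = json.loads(event["CloudTrailEvent"])  # embedded JSON string
        params = detail.get("requestParameters") or {}
        role = params.get("roleArn", "unknown")
        outcome = "failure" if detail.get("errorCode") else "success"
        counts[(role, outcome)] += 1
    return counts
```

Feeding it `json.load(sys.stdin)` from the CLI command with `--output json` gives a quick per-role failure tally during an incident.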

Narrow IAM Policy Attachment Example

Use condition keys to avoid bucket-wide exposures when mapping IRSA workload roles authored via Terraform modules.

iam-inline-policy-s3-read.json (text)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListPrefix",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::prod-artifacts",
      "Condition": {
        "StringLike": { "s3:prefix": ["releases/*"] }
      }
    },
    {
      "Sid": "ObjectRW",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::prod-artifacts/releases/*"
    }
  ]
}
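A small lint can keep workload policies in this narrow shape. The sketch below uses assumed house rules, not AWS-defined ones: it flags wildcard actions, a `*` resource, and `s3:ListBucket` granted without an `s3:prefix` condition. The function name `broad_statements` is hypothetical.

```python
def _as_list(value):
    """IAM allows a single value or a list in most fields; normalize to a list."""
    return value if isinstance(value, list) else [value]


def broad_statements(policy):
    """Return (Sid, reason) pairs for Allow statements that look too wide."""
    findings = []
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        sid = stmt.get("Sid", "<no Sid>")
        actions = _as_list(stmt.get("Action", []))
        for action in actions:
            if action == "*" or action.endswith(":*"):
                findings.append((sid, f"wildcard action {action}"))
        if "*" in _as_list(stmt.get("Resource", [])):
            findings.append((sid, "resource *"))
        # Any condition operator (StringLike, StringEquals, ...) may carry s3:prefix.
        has_prefix = any("s3:prefix" in keys for keys in stmt.get("Condition", {}).values())
        if "s3:ListBucket" in actions and not has_prefix:
            findings.append((sid, "s3:ListBucket without an s3:prefix condition"))
    return findings
```

Applied to the policy above, the function returns an empty list; widening either statement to `s3:*` or `"Resource": "*"` produces a finding.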

IAM Pod Identity Vs IRSA Snapshot

Topic	IRSA (classic)	Pod Identity agent
Trust source	OIDC federation on role	Agent-mediated short-lived creds
Debugging muscle memory	Mature community examples	Newer playbook; align versions with EKS add-ons
Rotation	SDK refresh via web identity file	Agent watches ServiceAccount mapping objects
Migration	Default for most brownfield clusters	Plan cutover windows; avoid dual annotate

Regardless of path, never store long-lived access keys in Secrets—breaks rotation and contradicts IaC auditing.
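For comparison with the OIDC trust shape, a Pod Identity role trusts the `pods.eks.amazonaws.com` service principal with `sts:AssumeRole` and `sts:TagSession`. The Python sketch below shows that shape and a hypothetical classifier, `trust_kinds`, for spotting roles that trust both paths during a migration.

```python
def _as_list(value):
    """IAM allows a single value or a list in most fields; normalize to a list."""
    return value if isinstance(value, list) else [value]


# Trust policy shape for a role used through EKS Pod Identity.
POD_IDENTITY_TRUST = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "pods.eks.amazonaws.com"},
            "Action": ["sts:AssumeRole", "sts:TagSession"],
        }
    ],
}


def trust_kinds(policy):
    """Return the set of pod credential paths ("irsa", "pod-identity") a role trusts."""
    kinds = set()
    for stmt in _as_list(policy.get("Statement", [])):
        principal = stmt.get("Principal", {})
        if stmt.get("Effect") != "Allow" or not isinstance(principal, dict):
            continue
        if "pods.eks.amazonaws.com" in _as_list(principal.get("Service", [])):
            kinds.add("pod-identity")
        if any(":oidc-provider/" in str(f) for f in _as_list(principal.get("Federated", []))):
            kinds.add("irsa")
    return kinds
```

A role reporting both kinds is not wrong in itself, but it is worth tracking so the OIDC statement is removed once cutover completes.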

Cross-Account Resource Policies

When pods in account A pull from ECR or decrypt CMKs in account B, BOTH identity policy (IRSA role) and resource policy (ECR repo, KMS key) must acknowledge the foreign principal. Symptoms look like IAM “allowed in policy simulator” yet runtime 403.

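ecr-resource-policy-check.sh (bash)

aws ecr get-repository-policy \
  --repository-name platform/base-image \
  --region us-east-1

# get-key-policy needs a key ID or key ARN, not an alias, so resolve the alias first.
KEY_ID="$(aws kms describe-key --key-id alias/prod-etcd --query 'KeyMetadata.KeyId' --output text)"
aws kms get-key-policy \
  --key-id "$KEY_ID" \
  --policy-name default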
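Both commands return the resource policy as a JSON string (`policyText` for ECR, `Policy` for KMS). The Python sketch below is a rough first-pass check, with a hypothetical helper name, for whether that policy names the foreign role, its account root, or a wildcard principal in an Allow statement. It deliberately ignores Conditions and Deny statements, so a True result is necessary but not sufficient.

```python
import json


def _as_list(value):
    """IAM allows a single value or a list in most fields; normalize to a list."""
    return value if isinstance(value, list) else [value]


def acknowledges(policy_text, role_arn):
    """True if an Allow statement names the role, its account, or '*' as principal."""
    parts = role_arn.split(":")
    partition, account_id = parts[1], parts[4]
    accepted = {
        "*",
        role_arn,
        account_id,  # bare account IDs are valid AWS principals
        f"arn:{partition}:iam::{account_id}:root",
    }
    for stmt in _as_list(json.loads(policy_text).get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal", {})
        if isinstance(principal, str):
            named = [principal]
        else:
            named = _as_list(principal.get("AWS", []))
        if accepted & set(named):
            return True
    return False
```

When this returns False for a role that the policy simulator reports as allowed, the missing resource-policy entry is the likely cause of the runtime 403.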

Break-Glass Roles & Session Controls

Time-limited break-glass role with mandatory MFA for human actions on node groups or IRSA remediation.
CloudTrail data events on security-sensitive S3/Terraform state buckets, correlated with EKS incident timestamps.
Session policies and short session durations for support vendor roles, to limit lateral movement potential.

IAM Policy Simulation Snippets

simulate-principal-shape.sh (bash)

# Human IAM role: deterministic checks before widening prod privileges.
# Fail fast if the role under test has not been provided.
: "${ROLE_ARN:?set ROLE_ARN to the role being simulated}"

aws iam simulate-principal-policy \
  --policy-source-arn "$ROLE_ARN" \
  --action-names ecs:DescribeServices \
  --resource-arns "arn:aws:ecs:REGION:123456789012:service/cluster/svc"

# SCP overlay awareness: review org-level policies separately from the organization management account.
AWS_PROFILE=management-org aws organizations list-policies --filter SERVICE_CONTROL_POLICY

# Quick deny hunting for node autoscaler regressions: spot missing autoscaling verbs.
ACTIONS=(
  ec2:DescribeLaunchTemplateVersions
  autoscaling:SetDesiredCapacity
  autoscaling:DescribeAutoScalingGroups
  autoscaling:TerminateInstanceInAutoScalingGroup
)
for verb in "${ACTIONS[@]}"; do
  echo "checking $verb"
  # Print only the action name and its decision (allowed / implicitDeny / explicitDeny).
  aws iam simulate-principal-policy --policy-source-arn "$ROLE_ARN" \
    --action-names "$verb" \
    --query 'EvaluationResults[].[EvalActionName,EvalDecision]' --output text || true
done
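When the simulation runs in CI, its JSON output can be reduced to just the denied actions. A minimal Python sketch, assuming the standard `EvaluationResults` output of `simulate-principal-policy`; the helper name is hypothetical.

```python
def denied_actions(simulation_output):
    """Map each non-allowed action to its decision (implicitDeny or explicitDeny)."""
    return {
        result["EvalActionName"]: result["EvalDecision"]
        for result in simulation_output.get("EvaluationResults", [])
        if result["EvalDecision"] != "allowed"
    }
```

An `implicitDeny` usually points to a missing Allow, while an `explicitDeny` points to a Deny statement or boundary that needs an exception.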

Operational Metrics To Dashboard

STS AssumeRole success vs failure rate, tagged by IAM role ARN.
ECR image pull durations, split by repo and node AZ.
ELB idle-timeout connection resets, correlated with customer HTTP 504 reports.
VPC Flow Logs REJECT records for security group denies, as saved CloudWatch Logs Insights queries.
IRSA JWT audience mismatch occurrences, scraped from controller logs.
Cross-account KMS Decrypt throttling events, alerting to FinOps dashboards.
apiserver aggregated admission webhook latency percentiles against budget.
Node kubelet IAM credential provider plugin failures during hybrid Pod Identity adoption.
NLB unhealthy target counts, matched against cluster upgrade windows.
Route53 RRSet change backlog length for noisy ExternalDNS deploys.
Cluster Autoscaler scale-down loops stuck on PDB-blocked drains.
Pod Security label coverage, as a percentage of namespaces.
WAF blocked requests compared with application 4xx rates.
CloudTrail anomaly detection suppressions and their reasons, for audit trail completeness.
Permission boundary denies per CI pipeline role, to tie IaC regressions to the owning Terraform CI.
Ingress controller reconcile queue depth, alerting ahead of AWS API storms.

SCP Illustration (Non-Authoritative)

Coordinate with your organization's SCP guidance, and simulate in the management account before activating.

scp-illustrative-deny-shape.json (text)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyDestructiveIamWithoutElevation",
      "Effect": "Deny",
      "Action": ["iam:*"],
      "Resource": "*",
      "Condition": {
        "StringNotEqualsIfExists": { "iam:PassedToService": "" },
        "BoolIfExists": { "aws:MultiFactorAuthPresent": "false" }
      }
    },
    {
      "Sid": "RequireApprovedPathsForNetworking",
      "Effect": "Deny",
      "Action": ["ec2:AuthorizeSecurityGroupIngress"],
      "Resource": "*",
      "Condition": {
        "ArnNotLikeIfExists": { "ec2:Vpc": "arn:aws:ec2:REGION:123456789012:vpc/enforced-training-vpc*" }
      }
    }
  ]
}

Gotchas

Widening the node IAM role masks missing IRSA and creates blast-radius incidents.
Copy-paste trust policies from another cluster without updating issuer URL → silent failures.
Removing default cluster SG egress blocks image pulls hitting private registries over unexpected paths.
IAM policy simulator does not fully emulate IRSA JWT claims; rely on STS + CloudTrail trails.
Overlapping LB SG rules + stray 0.0.0.0/0 exposures often come from iterative debugging; audit regularly.
Pod SG per Pod without IP prefix tuning can exhaust ENI/SG quotas in large fleets.
