Terraform From Scratch — K8s SRE Reference

TL;DR

Terraform owns durable cloud infrastructure — VPC, EKS, IAM, DNS, load balancers. App Deployments belong in Helm or ArgoCD. Always plan in CI, apply through approved pipelines, and treat remote state like production admin access.

Scope & Ownership

On most K8s SRE contracts, Terraform provisions the platform layer. Workloads and day-to-day releases are managed elsewhere. Draw the boundary early to avoid two tools fighting over the same Kubernetes object.

Platform layers and tool ownership — never let two tools manage the same Kubernetes object.

Layer	Tool	Examples
Cloud foundation	Terraform	VPC, subnets, EKS, IAM roles, Route53, ACM certs
Cluster add-ons	Terraform or Helm	AWS Load Balancer Controller, ExternalDNS, metrics-server
App workloads	Helm / ArgoCD	Deployments, Services, Ingress, HPA, ConfigMaps

Project Layout

Most client repos split reusable modules from environment-specific roots. Each environment gets its own state file.

text repo-tree.txt

platform-infra/
  modules/
    vpc/              # Reusable VPC module.
    eks/              # Reusable EKS module.
  envs/
    dev/
      main.tf         # Calls modules; wires outputs to inputs.
      variables.tf
      terraform.tfvars
      backend.tf      # Remote state config for dev.
    prod/
      main.tf
      terraform.tfvars
      backend.tf      # Separate state file — never share with dev.

Daily Workflow

Run locally for exploration; run plan/apply in CI when the client requires audit trails and approvals.

bash terraform-workflow.sh

terraform fmt -recursive          # Format all .tf files before commit.
terraform init                      # Download providers; configure backend.
terraform validate                  # Static check — no cloud API calls.
terraform plan -out=tfplan          # Save reviewable plan artifact.
terraform show tfplan               # Read plan before apply.
terraform apply tfplan              # Apply exactly what was reviewed.

# State inspection and drift.
terraform state list
terraform state show aws_eks_cluster.this
terraform plan -refresh-only         # Detect drift without changing anything.
terraform output                    # Print module outputs (cluster name, endpoint, etc.).

# Recovery and targeted ops — use sparingly in prod.
terraform force-unlock LOCK_ID      # Only after confirming no apply is running.
terraform apply -replace=aws_eks_node_group.general  # Force recreate one resource.
terraform apply -target=module.eks   # Emergency scope — document why.

Remote State Backend

Remote state is the source of truth for what Terraform manages. Lock it, encrypt it, and restrict access to platform engineers and the CI role that runs plan and apply.

hcl backend.tf

terraform {
  backend "s3" {
    bucket         = "client-prod-terraform-state"
    key            = "eks/prod/terraform.tfstate"  # One key per env/module root.
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"               # Prevents concurrent applies.
    encrypt        = true
  }
}
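
The backend block assumes the bucket and lock table already exist. A minimal sketch of bootstrapping them from a separate root might look like the following. The resource labels and the choice of KMS encryption are assumptions; the bucket and table names are taken from the backend example. Newer Terraform releases also offer S3-native locking, which this sketch does not cover.

```hcl
# Hypothetical bootstrap root, applied once before any environment uses the backend.
resource "aws_s3_bucket" "tf_state" {
  bucket = "client-prod-terraform-state"
}

# Versioning lets you recover an earlier state file after a bad write.
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Default encryption at rest (KMS is an assumption; AES256 also works).
resource "aws_s3_bucket_server_side_encryption_configuration" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "tf_state" {
  bucket                  = aws_s3_bucket.tf_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "tf_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # Attribute name the S3 backend expects.

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Keeping this root's own state separate from the environments it serves avoids a chicken-and-egg problem on first apply.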

Variables & Outputs

Pass values through terraform.tfvars per environment. Wire module outputs into downstream modules — VPC IDs into EKS, cluster OIDC issuer into IRSA roles.

hcl envs/prod/main.tf

variable "environment" {
  type = string
}

variable "cluster_version" {
  type    = string
  default = "1.30"
}

module "vpc" {
  source      = "../../modules/vpc"
  environment = var.environment
  cidr        = "10.0.0.0/16"
}

module "eks" {
  source          = "../../modules/eks"
  cluster_name    = "${var.environment}-platform"
  cluster_version = var.cluster_version
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnets
}

output "cluster_name" {
  value = module.eks.cluster_name
}

output "oidc_provider_arn" {
  value       = module.eks.oidc_provider_arn
  description = "Pass to IRSA role modules."
}
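
As a sketch of the per-environment values described above, the fragment below shows a possible terraform.tfvars for prod and an optional validation rule on the existing variable. The allowed environment names (including "staging") are hypothetical.

```hcl
# envs/prod/terraform.tfvars (illustrative values)
environment     = "prod"
cluster_version = "1.30"

# envs/prod/main.tf: the same variable block with an optional validation rule added.
variable "environment" {
  type = string

  validation {
    # Reject typos such as "prd" at plan time instead of creating misnamed resources.
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of: dev, staging, prod."
  }
}
```

Terraform loads terraform.tfvars from the working directory automatically, so running from envs/prod picks up the prod values without extra flags.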

Environments

Pattern	When to use	State
Directory per env (`envs/dev`, `envs/prod`)	Most common in enterprise repos; clearest blast radius.	Separate backend key per directory.
Workspaces (`terraform workspace select prod`)	Small teams, identical config with different tfvars.	One backend, workspace-prefixed state.
Terragrunt	Many accounts/regions with DRY backend and provider config.	Generated backend blocks per leaf module.
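
For the workspace pattern, a minimal sketch of varying one setting by workspace could look like this. The map name, sizes, and fallback value are hypothetical.

```hcl
locals {
  # Hypothetical per-workspace sizing. terraform.workspace is "default"
  # until another workspace is selected.
  node_desired_size = {
    dev  = 2
    prod = 4
  }

  environment  = terraform.workspace
  desired_size = lookup(local.node_desired_size, terraform.workspace, 2)
}

output "environment" {
  value = local.environment
}

output "desired_size" {
  value = local.desired_size
}
```

The fallback in lookup keeps an unexpected workspace name from failing the plan, at the cost of silently using the smallest size; a stricter variant would index the map directly and let the plan fail.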

VPC & EKS Modules

Worker nodes normally run in private subnets and reach the internet through NAT gateways. Internet-facing load balancers need public subnets; internal load balancers go in private subnets. Tag both subnet types so the AWS Load Balancer Controller can discover them.

hcl vpc.tf

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "prod-platform"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway     = true
  single_nat_gateway     = false
  one_nat_gateway_per_az = true  # One NAT per AZ for HA.

  public_subnet_tags = {
    "kubernetes.io/role/elb" = "1"
  }
  private_subnet_tags = {
    "kubernetes.io/role/internal-elb" = "1"
  }
}

This EKS module consumes the VPC outputs and creates the managed control plane plus a node group. Because it references module.vpc outputs, Terraform creates the VPC first and the worker nodes land in the intended private subnets.

hcl eks.tf

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "prod-platform"
  cluster_version = "1.30"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  enable_irsa = true  # Creates the cluster OIDC provider that IRSA roles trust.

  eks_managed_node_groups = {
    general = {
      instance_types = ["m6i.large"]
      min_size       = 3
      desired_size   = 4
      max_size       = 8
      labels = { workload = "general" }
    }
  }
}
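
If the repo's own modules/eks wraps the public module, it has to re-export the values that downstream roots consume. A minimal sketch, assuming the wrapper's inner module block is named "eks":

```hcl
# modules/eks/outputs.tf (hypothetical wrapper around terraform-aws-modules/eks/aws)
output "cluster_name" {
  value       = module.eks.cluster_name
  description = "EKS cluster name."
}

output "cluster_endpoint" {
  value       = module.eks.cluster_endpoint
  description = "Kubernetes API server endpoint."
}

output "oidc_provider_arn" {
  value       = module.eks.oidc_provider_arn
  description = "OIDC provider ARN, consumed by IRSA role modules."
}
```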

IRSA (IAM Roles for Service Accounts)

IRSA lets a Kubernetes ServiceAccount assume an AWS IAM role via OIDC. This is the standard pattern for ExternalDNS, AWS LB Controller, and any pod that calls AWS APIs.

hcl irsa-external-dns.tf

module "external_dns_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "~> 5.0"

  role_name = "prod-external-dns"

  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["kube-system:external-dns"]
    }
  }

  attach_external_dns_policy = true
}

# Helm chart values reference the role ARN:
# serviceAccount.annotations.eks.amazonaws.com/role-arn: <role_arn>
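
To make the OIDC mechanism concrete, here is a sketch of the kind of trust policy an IRSA role carries. It is for illustration only, since the module above generates a roughly equivalent policy; the local and data source names are hypothetical.

```hcl
locals {
  # Issuer URL without the https:// prefix, as exposed by the EKS module.
  oidc_issuer = module.eks.oidc_provider
}

data "aws_iam_policy_document" "external_dns_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [module.eks.oidc_provider_arn]
    }

    # Only this namespace:serviceaccount pair may assume the role.
    condition {
      test     = "StringEquals"
      variable = "${local.oidc_issuer}:sub"
      values   = ["system:serviceaccount:kube-system:external-dns"]
    }

    condition {
      test     = "StringEquals"
      variable = "${local.oidc_issuer}:aud"
      values   = ["sts.amazonaws.com"]
    }
  }
}

# Expose the module's role ARN so Helm values can reference it.
output "external_dns_role_arn" {
  value = module.external_dns_irsa.iam_role_arn
}
```

A mismatch between the ServiceAccount name in the chart and the `sub` condition is a common reason a pod fails to assume its role.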

Terraform in CI/CD

Plan on every PR; apply only from protected branches. Use OIDC to assume an IAM role — no long-lived AWS keys in CI secrets.

yaml .github/workflows/terraform.yaml

name: terraform
on:
  pull_request:
    paths: ["envs/**", "modules/**"]
  push:
    branches: [main]
    paths: ["envs/**", "modules/**"]

permissions:
  id-token: write   # Required for OIDC.
  contents: read

jobs:
  plan:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: envs/prod
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-terraform
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      - run: terraform validate
      - run: terraform plan -input=false -out=tfplan
      - uses: actions/upload-artifact@v4  # Keep the plan for review and for the apply job.
        with:
          name: tfplan
          path: envs/prod/tfplan          # Plan files can contain secrets; restrict artifact access.

  apply:
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    needs: plan
    runs-on: ubuntu-latest
    environment: production  # Manual approval if the environment has required reviewers.
    defaults:
      run:
        working-directory: envs/prod
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-terraform
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      - uses: actions/download-artifact@v4  # Fetch the plan saved by the plan job.
        with:
          name: tfplan
          path: envs/prod
      - run: terraform apply -input=false tfplan  # Apply exactly the saved plan.
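
The workflow relies on an IAM role that trusts GitHub's OIDC tokens. A minimal sketch of that role follows, assuming the GitHub OIDC provider is already registered in the account and using a hypothetical repository name (example-org/platform-infra).

```hcl
# Assumes the GitHub OIDC provider already exists in this AWS account.
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

data "aws_iam_policy_document" "github_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # The sub claim differs per trigger: PR runs, branch pushes, and jobs
    # that declare an environment each present a different value.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values = [
        "repo:example-org/platform-infra:pull_request",
        "repo:example-org/platform-infra:ref:refs/heads/main",
        "repo:example-org/platform-infra:environment:production",
      ]
    }
  }
}

resource "aws_iam_role" "github_terraform" {
  name               = "github-terraform"
  assume_role_policy = data.aws_iam_policy_document.github_trust.json
}
```

Some teams go further and split this into a read-only role for plan and a separate write role for apply, so a pull request can never hold apply permissions.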

Import & Brownfield

When adopting Terraform on existing infrastructure, write the resource block first, then import. Use moved blocks (TF 1.1+) to refactor without destroy/recreate.

bash import.sh

# Classic import — write matching resource block first.
terraform import aws_iam_role.external_dns external-dns-prod
terraform plan   # Expect diffs; tune attributes until plan is clean.

# TF 1.5+ import block (declarative); put it in a .tf file:
# import {
#   to = aws_iam_role.external_dns
#   id = "external-dns-prod"
# }

# Refactor without changing real infrastructure (TF 1.1+ moved block):
# moved {
#   from = aws_iam_role.old
#   to   = module.dns.aws_iam_role.external_dns
# }

# CLI equivalent of the moved block; use one approach, not both.
terraform state mv aws_iam_role.old module.dns.aws_iam_role.external_dns
terraform state rm aws_instance.decommissioned  # Removes from state only; does NOT destroy.
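
The "resource block first, then import" step could be sketched in HCL as follows. The trust policy is a placeholder assumption and must be replaced with the live role's actual policy before the plan will come back clean.

```hcl
# Resource block written first; tune attributes until the plan shows no changes.
resource "aws_iam_role" "external_dns" {
  name = "external-dns-prod"

  # Placeholder trust policy (assumption); copy the real one from the existing role.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

# Declarative import (Terraform 1.5+); can be deleted after a successful apply.
import {
  to = aws_iam_role.external_dns
  id = "external-dns-prod"
}
```

With an import block the import shows up in the plan output, so it goes through the same review as any other change.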

Troubleshooting

Symptom	Likely cause	Fix
`Error acquiring the state lock`	Stale lock or crashed apply	Confirm no running apply, then `terraform force-unlock ID`
Plan shows unexpected destroys	Renamed resource without `moved` block	Add `moved` block or `state mv`
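Provider auth failure in CI	Wrong role, or OIDC trust policy does not match the token's `sub` claim	Check that the IAM trust policy matches the repo, branch, or environment
Drift on every plan	Manual console changes or autoscaler edits	`plan -refresh-only`; add `lifecycle { ignore_changes = [...] }` for fields another controller owns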
Helm and TF conflict on same K8s object	Split ownership unclear	Pick one owner; remove from the other tool
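
For the autoscaler drift case in the table, a minimal sketch of ignore_changes on a standalone node group might look like this. The variable names are hypothetical, and the sizes mirror the EKS example.

```hcl
variable "cluster_name" {
  type = string
}

variable "node_role_arn" {
  type = string
}

variable "private_subnet_ids" {
  type = list(string)
}

resource "aws_eks_node_group" "general" {
  cluster_name    = var.cluster_name
  node_group_name = "general"
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.private_subnet_ids

  scaling_config {
    min_size     = 3
    desired_size = 4
    max_size     = 8
  }

  lifecycle {
    # The autoscaler owns desired_size after creation, so Terraform stops
    # reporting it as drift on every plan.
    ignore_changes = [scaling_config[0].desired_size]
  }
}
```

Keep the ignored list narrow: ignoring a whole block hides real configuration drift along with the noise.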

Gotchas

!Never commit .tfstate or .terraform/ to Git — remote backend only.
!State files contain secrets. Restrict S3/DynamoDB access like production admin.
!Pin provider versions in required_providers and commit .terraform.lock.hcl; unpinned providers can pull a new major version on init and change plans unexpectedly.
!Do not run apply from your laptop if the client requires CI-based approvals.
!When Terraform and Helm/ArgoCD both manage the same Kubernetes object, ownership conflicts are likely.
!Use -target only for emergencies; it plans only the targeted resources and their dependencies, so the rest of the configuration can be left out of sync with state.
!terraform state rm removes tracking only; the real resource stays running until you destroy it manually.
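
A minimal sketch of the version pinning mentioned above. The version constraints are illustrative assumptions, not requirements from this page.

```hcl
# versions.tf
terraform {
  required_version = ">= 1.5.0" # Assumed floor; import blocks need 1.5+.

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # Allows 5.x updates, blocks the next major version.
    }
  }
}
```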

💡 Provider boundary: Use the AWS provider for cloud resources, and the Kubernetes/Helm providers only when the client centralizes add-on installs in Terraform. Most teams install add-ons via Helm and keep Terraform for AWS-layer resources.
