Terraform From Scratch
Terraform owns durable cloud infrastructure — VPC, EKS, IAM, DNS, load balancers. App Deployments belong in Helm or ArgoCD. Always plan in CI, apply through approved pipelines, and treat remote state like production admin access.
Scope & Ownership
On most K8s SRE contracts, Terraform provisions the platform layer. Workloads and day-to-day releases are managed elsewhere. Draw the boundary early to avoid two tools fighting over the same Kubernetes object.
Platform layers and tool ownership — never let two tools manage the same Kubernetes object.
| Layer | Tool | Examples |
|---|---|---|
| Cloud foundation | Terraform | VPC, subnets, EKS, IAM roles, Route53, ACM certs |
| Cluster add-ons | Terraform or Helm | AWS Load Balancer Controller, ExternalDNS, metrics-server |
| App workloads | Helm / ArgoCD | Deployments, Services, Ingress, HPA, ConfigMaps |
Project Layout
Most client repos split reusable modules from environment-specific roots. Each environment gets its own state file.
platform-infra/
  modules/
    vpc/                 # Reusable VPC module.
    eks/                 # Reusable EKS module.
  envs/
    dev/
      main.tf            # Calls modules; wires outputs to inputs.
      variables.tf
      terraform.tfvars
      backend.tf         # Remote state config for dev.
    prod/
      main.tf
      terraform.tfvars
      backend.tf         # Separate state file; never share with dev.
Daily Workflow
Run locally for exploration; run plan/apply in CI when the client requires audit trails and approvals.
terraform fmt -recursive # Format all .tf files before commit.
terraform init # Download providers; configure backend.
terraform validate # Static check — no cloud API calls.
terraform plan -out=tfplan # Save reviewable plan artifact.
terraform show tfplan # Read plan before apply.
terraform apply tfplan # Apply exactly what was reviewed.
# State inspection and drift.
terraform state list
terraform state show aws_eks_cluster.this
terraform plan -refresh-only # Detect drift without changing anything.
terraform output # Print module outputs (cluster name, endpoint, etc.).
# Recovery and targeted ops — use sparingly in prod.
terraform force-unlock LOCK_ID # Only after confirming no apply is running.
terraform apply -replace=aws_eks_node_group.general # Force recreate one resource.
terraform apply -target=module.eks # Emergency scope — document why.
Remote State Backend
Remote state is the source of truth for what Terraform manages. Lock it, encrypt it, and restrict access to platform engineers only.
terraform {
  backend "s3" {
    bucket         = "client-prod-terraform-state"
    key            = "eks/prod/terraform.tfstate" # One key per env/module root.
    region         = "us-east-1"
    dynamodb_table = "terraform-locks" # Prevents concurrent applies.
    encrypt        = true
  }
}
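The backend block assumes the bucket and lock table already exist. A minimal sketch of how they might be bootstrapped from a separate root follows. The bucket and table names mirror the backend block; the versioning, KMS encryption, public access block, and prevent_destroy settings are assumptions added for illustration, not part of the original setup.

```hcl
# Hypothetical bootstrap root, applied once before any environment uses the backend.
# Names mirror the backend block; all other settings are assumptions.

resource "aws_s3_bucket" "tf_state" {
  bucket = "client-prod-terraform-state"

  lifecycle {
    prevent_destroy = true # Guard against accidental deletion of state.
  }
}

resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  versioning_configuration {
    status = "Enabled" # Keeps earlier state versions for recovery.
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "tf_state" {
  bucket                  = aws_s3_bucket.tf_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "tf_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # The S3 backend expects this exact attribute name.

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Keeping these resources in their own root avoids a chicken-and-egg problem: the state bucket cannot store the state that creates it until it exists.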
Variables & Outputs
Pass values through terraform.tfvars per environment. Wire module outputs into downstream modules — VPC IDs into EKS, cluster OIDC issuer into IRSA roles.
variable "environment" {
  type = string
}

variable "cluster_version" {
  type    = string
  default = "1.30"
}

module "vpc" {
  source      = "../../modules/vpc"
  environment = var.environment
  cidr        = "10.0.0.0/16"
}

module "eks" {
  source          = "../../modules/eks"
  cluster_name    = "${var.environment}-platform"
  cluster_version = var.cluster_version
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnets
}

output "cluster_name" {
  value = module.eks.cluster_name
}

output "oidc_provider_arn" {
  value       = module.eks.oidc_provider_arn
  description = "Pass to IRSA role modules."
}
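Because each environment supplies its own terraform.tfvars, a typo in a value can land in the wrong place unnoticed. One possible safeguard is a validation block on the variable. The sketch below is a stricter variant of the environment declaration, assuming only dev and prod exist; the allowed list and the tfvars values are illustrative.

```hcl
# variables.tf: stricter variant of the environment declaration (illustrative).
variable "environment" {
  type        = string
  description = "Environment name; used as a prefix in resource names."

  validation {
    # Assumed set of environments; extend the list as the repo grows.
    condition     = contains(["dev", "prod"], var.environment)
    error_message = "environment must be one of: dev, prod."
  }
}

# envs/dev/terraform.tfvars would then hold, for example:
#   environment     = "dev"
#   cluster_version = "1.30"
```

With this in place, a plan run with an unexpected environment value fails during variable validation rather than creating misnamed resources.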
Environments
| Pattern | When to use | State |
|---|---|---|
| Directory per env (envs/dev, envs/prod) | Most common in enterprise repos; clearest blast radius. | Separate backend key per directory. |
| Workspaces (terraform workspace select prod) | Small teams, identical config with different tfvars. | One backend, workspace-prefixed state. |
| Terragrunt | Many accounts/regions with DRY backend and provider config. | Generated backend blocks per leaf module. |
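For the workspace pattern, per-environment differences are usually keyed off terraform.workspace. The sketch below assumes workspaces named dev and prod; the sizing map and output names are hypothetical.

```hcl
# Assumes workspaces named "dev" and "prod" exist; sizes are illustrative.
locals {
  env = terraform.workspace

  node_sizes = {
    dev  = { min = 1, desired = 2, max = 3 }
    prod = { min = 3, desired = 4, max = 8 }
  }

  # Indexing fails if the current workspace has no entry (for example "default"),
  # which stops an accidental run in the wrong workspace.
  sizing = local.node_sizes[local.env]
}

output "workspace_cluster_name" {
  value = "${local.env}-platform"
}

output "node_sizing" {
  value = local.sizing
}
```

The trade-off is that all environments share one configuration and one backend, so a mistake in shared code reaches every workspace; the directory-per-env pattern keeps that blast radius smaller.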
VPC & EKS Modules
EKS worker nodes normally run in private subnets with NAT egress. Internet-facing load balancers need public subnets; internal load balancers use private subnets. Tag both so the AWS Load Balancer Controller can discover them.
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "prod-platform"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway = true
  single_nat_gateway = false # One NAT per AZ for HA.

  public_subnet_tags = {
    "kubernetes.io/role/elb" = "1"
  }

  private_subnet_tags = {
    "kubernetes.io/role/internal-elb" = "1"
  }
}
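The subnet CIDRs in the VPC module call are written out by hand. As a sketch, the same ranges can be derived from the VPC CIDR with the built-in cidrsubnet function, which keeps them consistent if the base range changes. The local names here are hypothetical.

```hcl
locals {
  vpc_cidr = "10.0.0.0/16"
  azs      = ["us-east-1a", "us-east-1b", "us-east-1c"]

  # A /16 plus 8 new bits gives /24 networks; the third argument picks the third octet.
  private_subnets = [for i, az in local.azs : cidrsubnet(local.vpc_cidr, 8, i + 1)]
  public_subnets  = [for i, az in local.azs : cidrsubnet(local.vpc_cidr, 8, i + 101)]
}

output "private_subnets" {
  value = local.private_subnets # 10.0.1.0/24, 10.0.2.0/24, 10.0.3.0/24
}

output "public_subnets" {
  value = local.public_subnets # 10.0.101.0/24, 10.0.102.0/24, 10.0.103.0/24
}
```

These locals could be passed to the module as private_subnets and public_subnets in place of the literal lists.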
This EKS module consumes the VPC outputs and creates the managed control plane plus a node group. Because it references module.vpc outputs, Terraform creates the VPC first, and worker nodes land in the intended private subnets.
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "prod-platform"
  cluster_version = "1.30"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  enable_irsa = true # Required for pod-level AWS permissions.

  eks_managed_node_groups = {
    general = {
      instance_types = ["m6i.large"]
      min_size       = 3
      desired_size   = 4
      max_size       = 8
      labels         = { workload = "general" }
    }
  }
}
IRSA (IAM Roles for Service Accounts)
IRSA lets a Kubernetes ServiceAccount assume an AWS IAM role via OIDC. This is the standard pattern for ExternalDNS, AWS LB Controller, and any pod that calls AWS APIs.
module "external_dns_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "~> 5.0"

  role_name = "prod-external-dns"

  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["kube-system:external-dns"]
    }
  }

  attach_external_dns_policy = true
}

# Helm chart values reference the role ARN:
# serviceAccount.annotations.eks.amazonaws.com/role-arn: <role_arn>
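To make the OIDC mechanism concrete, here is a hand-written sketch of the kind of trust policy an IRSA role carries. It approximates what the module sets up, not its actual implementation. The variables and the role name are hypothetical; in the layout above, the inputs would come from the EKS module outputs.

```hcl
# Inputs are hypothetical; in this guide's layout they would come from module.eks.
variable "oidc_provider_arn" {
  type = string
}

variable "oidc_issuer_host" {
  type        = string
  description = "Cluster OIDC issuer URL without the https:// prefix."
}

data "aws_iam_policy_document" "external_dns_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.oidc_provider_arn]
    }

    # Only this namespace and ServiceAccount pair may assume the role.
    condition {
      test     = "StringEquals"
      variable = "${var.oidc_issuer_host}:sub"
      values   = ["system:serviceaccount:kube-system:external-dns"]
    }

    condition {
      test     = "StringEquals"
      variable = "${var.oidc_issuer_host}:aud"
      values   = ["sts.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "external_dns" {
  name               = "prod-external-dns-manual" # Hypothetical name.
  assume_role_policy = data.aws_iam_policy_document.external_dns_trust.json
}

output "external_dns_role_arn" {
  value = aws_iam_role.external_dns.arn
}
```

The sub condition is what ties the role to a single ServiceAccount. If the chart deploys under a different namespace or account name, the pod's AssumeRoleWithWebIdentity call is denied.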
Terraform in CI/CD
Plan on every PR; apply only from protected branches. Use OIDC to assume an IAM role — no long-lived AWS keys in CI secrets.
name: terraform
on:
  pull_request:
    paths: ["envs/**"]
  push:
    branches: [main]
    paths: ["envs/**"]
permissions:
  id-token: write # Required for OIDC.
  contents: read
jobs:
  plan:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: envs/prod
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-terraform
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      - run: terraform validate
      - run: terraform plan -input=false -out=tfplan
      - uses: actions/upload-artifact@v4
        if: github.event_name == 'pull_request'
        with:
          name: tfplan
          path: envs/prod/tfplan
  apply:
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    needs: plan
    runs-on: ubuntu-latest
    environment: production # Requires manual approval in GitHub.
    defaults:
      run:
        working-directory: envs/prod
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-terraform
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      - run: terraform apply -input=false -auto-approve
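The workflow depends on an IAM role whose trust policy accepts GitHub's OIDC tokens. A minimal sketch of that trust policy follows, assuming the IAM OIDC provider for token.actions.githubusercontent.com already exists. The variable names are hypothetical and the repository is a placeholder.

```hcl
variable "github_oidc_provider_arn" {
  type        = string
  description = "ARN of an existing IAM OIDC provider for token.actions.githubusercontent.com."
}

variable "github_repo" {
  type        = string
  description = "Repository in org/name form (placeholder)."
}

data "aws_iam_policy_document" "github_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.github_oidc_provider_arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # The sub claim depends on how the job was triggered, so each case
    # used by the workflow needs its own entry.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values = [
        "repo:${var.github_repo}:pull_request",           # plan on PRs
        "repo:${var.github_repo}:ref:refs/heads/main",    # plan on push to main
        "repo:${var.github_repo}:environment:production", # apply job bound to the environment
      ]
    }
  }
}

resource "aws_iam_role" "github_terraform" {
  name               = "github-terraform"
  assume_role_policy = data.aws_iam_policy_document.github_trust.json
}
```

A job that sets environment: production presents an environment-based subject rather than a branch ref, which is a common cause of auth failures when the trust policy only lists the branch. Sharing one role between plan and apply is a simplification here; a read-only plan role and a separate apply role would narrow what a pull request can reach.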
Import & Brownfield
When adopting Terraform on existing infrastructure, write the resource block first, then import. Use moved blocks (TF 1.1+) to refactor without destroy/recreate.
# Classic import — write matching resource block first.
terraform import aws_iam_role.external_dns external-dns-prod
terraform plan # Expect diffs; tune attributes until plan is clean.
# TF 1.5+ import block (declarative alternative to the command above).
# import {
#   to = aws_iam_role.external_dns
#   id = "external-dns-prod"
# }
# Refactor without changing real infrastructure: add a moved block or run state mv, not both.
# moved {
#   from = aws_iam_role.old
#   to   = module.dns.aws_iam_role.external_dns
# }
terraform state mv aws_iam_role.old module.dns.aws_iam_role.external_dns
terraform state rm aws_instance.decommissioned # Removes from state only — does NOT destroy.
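The same three operations can be expressed declaratively so they go through plan review instead of being run by hand against state. This is a fragment, not a complete configuration: it assumes the matching resource blocks exist at the target addresses, and the three blocks are independent examples mirroring the commands above rather than one change to apply together.

```hcl
# TF 1.5+: adopt an existing role on the next apply.
# Assumes resource "aws_iam_role" "external_dns" is declared in this root.
import {
  to = aws_iam_role.external_dns
  id = "external-dns-prod"
}

# TF 1.1+: record a rename or move so the plan shows no destroy/create.
# Assumes module "dns" declares aws_iam_role.external_dns.
moved {
  from = aws_iam_role.old
  to   = module.dns.aws_iam_role.external_dns
}

# TF 1.7+: stop tracking a resource without destroying it.
# The resource block itself must be deleted from the configuration.
removed {
  from = aws_instance.decommissioned

  lifecycle {
    destroy = false
  }
}
```

Unlike terraform state mv and terraform state rm, these blocks show up in the plan output, so a reviewer can confirm the effect before anything touches state.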
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Error acquiring the state lock | Stale lock or crashed apply | Confirm no running apply, then terraform force-unlock ID |
| Plan shows unexpected destroys | Renamed resource without moved block | Add moved block or state mv |
| Provider auth failure in CI | Wrong role, expired OIDC trust | Check IAM trust policy matches repo/branch |
| Drift on every plan | Manual console changes or autoscaler edits | plan -refresh-only; add lifecycle { ignore_changes } |
| Helm and TF conflict on same K8s object | Split ownership unclear | Pick one owner; remove from the other tool |
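For the drift-on-every-plan row, a common case is an autoscaler changing the node count that Terraform also sets. A sketch of the ignore_changes fix on a standalone node group follows; the variables are hypothetical stand-ins for values that would come from the cluster and VPC.

```hcl
# Hypothetical inputs; in practice these come from the cluster and VPC.
variable "cluster_name" {
  type = string
}

variable "node_role_arn" {
  type = string
}

variable "subnet_ids" {
  type = list(string)
}

resource "aws_eks_node_group" "general" {
  cluster_name    = var.cluster_name
  node_group_name = "general"
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.subnet_ids
  instance_types  = ["m6i.large"]

  scaling_config {
    min_size     = 3
    desired_size = 4
    max_size     = 8
  }

  lifecycle {
    # An autoscaler adjusts desired_size at runtime; ignoring it stops
    # each plan from trying to reset the node count.
    ignore_changes = [scaling_config[0].desired_size]
  }
}
```

Keep the ignore list as narrow as possible: ignoring the whole scaling_config would also hide deliberate changes to min_size and max_size.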
Gotchas
- Never commit .tfstate or .terraform/ to Git; use a remote backend only.
- State files contain secrets. Restrict S3/DynamoDB access like production admin.
- Pin provider versions in required_providers; an unpinned provider can pick up a breaking release and change plans with no code change.
- Do not run apply from your laptop if the client requires CI-based approvals.
- When Terraform and Helm/ArgoCD both manage the same Kubernetes object, ownership conflicts are likely.
- Use -target only for emergencies; it applies only part of the configuration, so the rest can fall out of step with state.
- terraform state rm removes tracking only; the real resource stays running until you destroy it manually.
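A minimal sketch of the version pinning mentioned in the list above. The version floor and the constraint are assumptions chosen to match features used elsewhere in this guide, not requirements taken from a specific client repo.

```hcl
terraform {
  required_version = ">= 1.5.0" # Assumed floor; import blocks need 1.5+.

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # Allows 5.x releases, blocks a jump to 6.0.
    }
  }
}
```

The .terraform.lock.hcl file that terraform init writes records the exact provider versions selected. It lives outside the .terraform/ directory and is meant to be committed, so CI and laptops resolve the same builds.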