Terraform Troubleshooting — K8s SRE Reference

TL;DR

Before fixing Terraform, capture evidence: the exact command, workspace and root module path, backend key, plan output, state address, cloud resource ID, and who last applied. Never run `force-unlock`, `state rm`, `import`, `apply -replace`, or `destroy` until you know whether another apply is running and which tool or repo owns the resource.

First Five Minutes

Run these commands first when a Terraform issue is reported. They confirm the working directory, Terraform version, workspace, providers, validation status, and visible state before you touch locks or apply changes.

terraform-first-look.sh (bash)

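pwd                              # confirm you are in the intended root module
terraform version
terraform init -reconfigure      # re-reads backend config and ignores cached backend settings
terraform workspace show
terraform providers              # needs modules installed, so run it after init
terraform validate
terraform plan -no-color         # holds the state lock while it runs
terraform state list | head -50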

Symptom To Cause

Symptom	Likely cause	First check
State lock error	Another apply or crashed job	CI run history, lock table/blob metadata
Plan wants to destroy many resources	Wrong workspace/backend/vars/provider account	Current account, backend key, tfvars
Provider auth failure	Expired token, wrong OIDC role, missing env vars	Cloud identity command and CI role
Resource already exists	Existing resource was never imported into state	Write the resource block, then run `terraform import`
Perpetual diff	Cloud default, provider bug, ignored field needed	Provider docs and live resource values
Dependency cycle	Over-coupled module references	Graph dependencies and outputs
Plan recreates EKS/AKS/GKE	Immutable field changed	Check exact attribute forcing replacement

State Lock

Use this when Terraform says it cannot acquire the state lock. Confirm no CI job or engineer is actively applying before using force-unlock, because unlocking a real apply can corrupt state.

lock-check.sh (bash)

# Do not force-unlock first. Confirm no pipeline or human apply is active.
# Re-running plan reproduces the lock error, which prints the lock ID, holder, and creation time.
terraform plan

# If the lock is stale and approved:
terraform force-unlock LOCK_ID
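
To decide whether a lock is stale, it helps to read the holder and creation time out of the error text rather than copying them by eye. The sketch below assumes the error contains the usual "Lock Info:" block with `ID:`, `Who:`, and `Created:` lines; `lock_field` is a hypothetical helper name, not a Terraform command.

```bash
#!/usr/bin/env bash
# Sketch: pull one field out of the "Lock Info" block that Terraform prints
# when it cannot acquire the state lock. lock_field is a hypothetical helper.

# lock_field FIELD: reads the error text on stdin and prints the value of the
# first line whose first word is "FIELD:" (for example ID, Who, or Created).
lock_field() {
  awk -v key="$1:" '$1 == key { $1 = ""; sub(/^[ \t]+/, ""); print; exit }'
}

# Example (not run here): capture the error once, then read fields from it.
#   terraform plan -no-color 2>&1 | tee lock-error.txt
#   lock_field ID      < lock-error.txt
#   lock_field Who     < lock-error.txt
#   lock_field Created < lock-error.txt
```

Compare the `Who` and `Created` values with CI run history before deciding the lock is stale.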

Wrong Account Or Backend

Use these checks when a plan looks wildly wrong, especially if it wants to destroy many resources. Most scary Terraform plans come from the wrong cloud account, project, subscription, workspace, backend key, or tfvars file.

identity-check.sh (bash)

# AWS
aws sts get-caller-identity

# Azure
az account show

# GCP
gcloud auth list
gcloud config get-value project

# Terraform context
terraform workspace show
terraform state pull | head   # top of the state shows terraform_version, serial, and lineage
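
The identity checks above can be turned into a guard that refuses to plan when the context is not the one you expect. This is a minimal sketch: `require_match` is a hypothetical helper, and the expected values (`EXPECTED_ACCOUNT`, `EXPECTED_WORKSPACE`) are placeholders you would supply per environment.

```bash
#!/usr/bin/env bash
# Sketch: fail fast when the live context differs from the expected one.
# require_match is a hypothetical helper, not part of any CLI.

# require_match LABEL EXPECTED ACTUAL: prints "ok" on a match,
# prints a mismatch message to stderr and returns 1 otherwise.
require_match() {
  if [ "$2" != "$3" ]; then
    echo "MISMATCH $1: expected '$2', got '$3'" >&2
    return 1
  fi
  echo "ok $1: $3"
}

# Example (not run here; EXPECTED_* are placeholders you define):
#   require_match aws-account "$EXPECTED_ACCOUNT" \
#     "$(aws sts get-caller-identity --query Account --output text)" &&
#   require_match workspace "$EXPECTED_WORKSPACE" "$(terraform workspace show)" &&
#   terraform plan -no-color
```

The same pattern works for Azure with `az account show --query id -o tsv` and for GCP with `gcloud config get-value project`.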

Dangerous Commands

Command	Risk	Safer first step
`terraform destroy`	Deletes managed infrastructure	Use plan review and approval
`terraform state rm`	Orphans resource from Terraform	Understand why state/code disagree
`terraform import`	Can attach wrong real resource to code	Verify resource ID and address
`terraform force-unlock`	Can corrupt concurrent apply	Check active CI/human runs
`terraform apply -replace`	Recreates resource, may cause outage	Confirm replacement blast radius
`terraform apply -target`	Can skip dependencies	Use only for documented emergency scope

Drift Investigation

Use refresh-only planning to see how live cloud resources differ from state without applying a functional change. After that, decide whether to update Terraform code, revert the manual change, or formally accept the drift.

drift.sh (bash)

terraform plan -refresh-only -out=refresh.tfplan
terraform show refresh.tfplan

# If drift is expected, update Terraform code.
# If drift is emergency manual work, decide whether to keep it or revert it.
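
For scheduled drift checks, a plan can be run with `-detailed-exitcode`, which Terraform documents as 0 for no changes, 1 for an error, and 2 for changes present. The sketch below maps those codes to labels; `describe_plan_exit` is a hypothetical helper name, and it is worth confirming on your Terraform version that a refresh-only plan reports exit code 2 when it detects drift.

```bash
#!/usr/bin/env bash
# Sketch: interpret the exit status of `terraform plan -detailed-exitcode`.
# describe_plan_exit is a hypothetical helper, not a Terraform command.

# describe_plan_exit CODE: prints a short label for the given exit code.
describe_plan_exit() {
  case "$1" in
    0) echo "no-changes" ;;        # plan succeeded, nothing differs
    2) echo "changes-present" ;;   # plan succeeded, differences found
    *) echo "error" ;;             # 1 (or anything else): the plan itself failed
  esac
}

# Example (not run here):
#   terraform plan -refresh-only -detailed-exitcode -no-color > drift.txt 2>&1
#   describe_plan_exit "$?"
```

Treat "error" as a separate alert from "changes-present"; a failed plan says nothing about drift.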

Import Recovery

Use import recovery when a real resource exists but Terraform does not track it, or when state was lost for a known object. Verify the resource ID and Terraform address carefully before importing.

import-recovery.sh (bash)

# Confirm the address is not already tracked and that the resource block exists in code.
terraform state list | rg app_lb
terraform import 'module.network.aws_security_group.app_lb' sg-0123456789abcdef0
terraform plan

# Tune code until plan does not try to replace the imported object.
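
A substring search can match neighbouring addresses such as `app_lb_internal`, so an exact whole-line check is a safer gate before importing. A minimal sketch, assuming `terraform state list` prints one address per line; `address_in_state` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: exact-match check of a resource address against state.
# address_in_state is a hypothetical helper, not a Terraform command.

# address_in_state ADDRESS: reads `terraform state list` output on stdin and
# succeeds only if one line equals ADDRESS exactly (fixed string, whole line).
address_in_state() {
  grep -Fxq -- "$1"
}

# Example (not run here):
#   addr='module.network.aws_security_group.app_lb'
#   if terraform state list | address_in_state "$addr"; then
#     echo "already tracked: do not import $addr"
#   else
#     echo "not in state: verify the cloud resource ID, then import"
#   fi
```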

Provider Errors

Error	Check
AWS `InvalidClientTokenId`	Expired/wrong AWS credentials, STS denied, wrong profile
Azure authorization failed	Missing role assignment, wrong subscription, stale login
GCP 403	API disabled, service account missing IAM role, wrong project
Kubernetes provider connection refused	Cluster not created yet, kubeconfig missing, private endpoint unreachable
Helm provider timeout	Chart resources unhealthy; inspect Kubernetes events and Helm release

Before Applying In Production

1. Confirm the backend key, workspace, cloud account, region, and tfvars are correct.
2. Read the plan for destroy/replace lines, not just the summary count.
3. Check whether the resource is owned by Terraform, Helm, ArgoCD, a cloud console process, or another repo.
4. Take extra care with VPCs, subnets, IAM roles, DNS zones, state buckets, cluster resources, and databases.
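
Step 2 can be partly automated by listing only the resource headers that the plan marks for destroy or replace. The sketch below greps the human-readable plan text, so it assumes the usual "will be destroyed" and "must be replaced" wording; `destructive_lines` is a hypothetical helper name. Parsing the output of `terraform show -json` is the more robust route if the wording changes between versions.

```bash
#!/usr/bin/env bash
# Sketch: surface destroy/replace actions from `terraform plan -no-color` text.
# destructive_lines is a hypothetical helper, not a Terraform command.

# destructive_lines: reads plan text on stdin and prints the "# <address> ..."
# header lines for resources that will be destroyed or replaced.
destructive_lines() {
  grep -E '^ *# .* (will be destroyed|must be replaced|will be replaced)' || true
}

# Example (not run here):
#   terraform plan -no-color | tee plan.txt
#   destructive_lines < plan.txt
```

An empty result is not approval to apply; it only means the text scan found no destroy or replace headers.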