Terraform Troubleshooting
Before fixing Terraform, capture evidence: exact command, workspace/root path, backend key, plan output, state address, cloud resource ID, and who last applied. Never force-unlock, state rm, import, replace, or destroy until you know whether another apply is running and what owns the resource.
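The evidence list above can be partly collected with a small helper. This is a minimal sketch, not a complete capture: the function name `capture_evidence` and the default file name `tf-evidence.txt` are illustrative choices, and it records only what the local CLI can report. Backend key, cloud resource ID, and who last applied still have to be gathered from your backend and CI history.

```shell
# Sketch: write basic Terraform context to a file before changing anything.
# capture_evidence and tf-evidence.txt are hypothetical names for illustration.
capture_evidence() {
  out="${1:-tf-evidence.txt}"
  {
    echo "## captured: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "## user: $(id -un)"
    echo "## directory: $(pwd)"
    # Each command is read-only; a failure is recorded instead of aborting.
    for cmd in "terraform version" "terraform workspace show" "terraform providers"; do
      echo "## $cmd"
      $cmd 2>&1 || echo "(command failed with exit $?)"
    done
  } > "$out"
}
```

Run `capture_evidence incident-notes.txt` from the root module directory and attach the file to the incident or ticket.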
First Five Minutes
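Run these commands first when a Terraform issue is reported. They confirm the working directory, Terraform version, workspace, providers, validation status, and visible state before you touch locks or apply changes. Note that terraform init -reconfigure discards the locally saved backend configuration without migrating state, so read the backend settings it reports before trusting the plan that follows.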
pwd
terraform version
terraform workspace show
terraform providers
terraform init -reconfigure
terraform validate
terraform plan -no-color
terraform state list | head -50
Symptom To Cause
| Symptom | Likely cause | First check |
|---|---|---|
| State lock error | Another apply or crashed job | CI run history, lock table/blob metadata |
| Plan wants to destroy many resources | Wrong workspace/backend/vars/provider account | Current account, backend key, tfvars |
| Provider auth failure | Expired token, wrong OIDC role, missing env vars | Cloud identity command and CI role |
| Resource already exists | Existing resource not yet imported into state | Write the resource block, then run terraform import |
| Perpetual diff | Cloud default, provider bug, ignored field needed | Provider docs and live resource values |
| Dependency cycle | Over-coupled module references | Graph dependencies and outputs |
| Plan recreates EKS/AKS/GKE | Immutable field changed | Check exact attribute forcing replacement |
State Lock
Use this when Terraform says it cannot acquire the state lock. Confirm no CI job or engineer is actively applying before using force-unlock, because unlocking a real apply can corrupt state.
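When the lock error has been saved to a file or CI log, the lock details can be pulled out for the incident record before anyone decides whether to unlock. The sketch below assumes the error contains a "Lock Info:" block with fields such as "ID:" and "Who:", which is how recent Terraform versions print it; the function name `lock_info_field` is illustrative, and the exact layout may differ between versions and backends.

```shell
# Sketch: read saved Terraform error output on stdin and print one field
# from the "Lock Info:" block (for example ID or Who).
# lock_info_field is a hypothetical helper name.
lock_info_field() {
  awk -v key="$1:" '
    /Lock Info:/ { in_lock = 1; next }
    # Scan every field so an optional leading box-drawing character is tolerated.
    in_lock {
      for (i = 1; i < NF; i++) {
        if ($i == key) { print $(i + 1); exit }
      }
    }
  '
}
```

For example, `lock_info_field Who < lock-error.txt` shows who holds the lock, and `lock_info_field ID < lock-error.txt` gives the value to pass to force-unlock once the unlock is approved.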
# Do not force-unlock first. Confirm no pipeline or human apply is active.
# Re-running plan reproduces the error and prints the Lock Info block (ID, Who, Created).
terraform plan
# If the lock is stale and approved:
terraform force-unlock LOCK_ID
Wrong Account Or Backend
Use these checks when a plan looks wildly wrong, especially if it wants to destroy many resources. Most scary Terraform plans come from the wrong cloud account, project, subscription, workspace, backend key, or tfvars file.
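One way to make the account check hard to skip is a small guard that compares the account you expect with the one the CLI reports and refuses to continue on a mismatch. This is a sketch under assumptions: `assert_same_account` and `EXPECTED_ACCOUNT_ID` are hypothetical names, and the expected value must come from your own environment configuration.

```shell
# Sketch: fail loudly when the active cloud account is not the expected one.
# Returns 0 on match, 1 on mismatch or when either value is empty.
assert_same_account() {
  expected="$1"
  actual="$2"
  if [ -z "$expected" ] || [ -z "$actual" ]; then
    echo "refusing to continue: expected or actual account is empty" >&2
    return 1
  fi
  if [ "$expected" != "$actual" ]; then
    echo "account mismatch: expected $expected, got $actual" >&2
    return 1
  fi
  echo "account ok: $actual"
}

# Example use with AWS (EXPECTED_ACCOUNT_ID is a hypothetical variable):
# assert_same_account "$EXPECTED_ACCOUNT_ID" \
#   "$(aws sts get-caller-identity --query Account --output text)" || exit 1
```

The same guard works for other clouds by changing the second argument, for example `az account show --query id -o tsv` for an Azure subscription or `gcloud config get-value project` for a GCP project.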
# AWS
aws sts get-caller-identity
# Azure
az account show
# GCP
gcloud auth list
gcloud config get-value project
# Terraform context
terraform workspace show
terraform state pull | head
Dangerous Commands
| Command | Risk | Safer first step |
|---|---|---|
| terraform destroy | Deletes managed infrastructure | Use plan review and approval |
| terraform state rm | Orphans resource from Terraform | Understand why state/code disagree |
| terraform import | Can attach wrong real resource to code | Verify resource ID and address |
| terraform force-unlock | Can corrupt concurrent apply | Check active CI/human runs |
| terraform apply -replace | Recreates resource, may cause outage | Confirm replacement blast radius |
| terraform apply -target | Can skip dependencies | Use only for documented emergency scope |
Drift Investigation
Use refresh-only planning to see how live cloud resources differ from state without applying a functional change. After that, decide whether to update Terraform code, revert the manual change, or formally accept the drift.
terraform plan -refresh-only -out=refresh.tfplan
terraform show refresh.tfplan
# If drift is expected, update Terraform code.
# If drift is emergency manual work, decide whether to keep it or revert it.
Import Recovery
Use import recovery when a real resource exists but Terraform does not track it, or when state was lost for a known object. Verify the resource ID and Terraform address carefully before importing.
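Before importing, it helps to confirm that the exact address is not already tracked, because a substring search can confuse similarly named resources (for instance an address ending in app_lb and another ending in app_lb_internal). The helper below is a minimal sketch; the name `address_in_state` is illustrative and it only inspects text piped to it.

```shell
# Sketch: read `terraform state list` output on stdin and succeed only if
# the exact resource address is present (fixed-string, whole-line match).
# address_in_state is a hypothetical helper name.
address_in_state() {
  grep -Fxq -- "$1"
}

# Example use:
# if terraform state list | address_in_state 'module.network.aws_security_group.app_lb'; then
#   echo "already in state; do not import again" >&2
# fi
```

A whole-line match also handles indexed addresses such as those ending in [0], since the brackets are treated as literal characters.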
terraform state list | rg app_lb
terraform import 'module.network.aws_security_group.app_lb' sg-0123456789abcdef0
terraform plan
# Tune code until plan does not try to replace the imported object.
Provider Errors
| Error | Check |
|---|---|
| AWS InvalidClientTokenId | Expired/wrong AWS credentials, STS denied, wrong profile |
| Azure authorization failed | Missing role assignment, wrong subscription, stale login |
| GCP 403 | API disabled, service account missing IAM role, wrong project |
| Kubernetes provider connection refused | Cluster not created yet, kubeconfig missing, private endpoint unreachable |
| Helm provider timeout | Chart resources unhealthy; inspect Kubernetes events and Helm release |
Before Applying In Production
- Confirm the backend key, workspace, cloud account, region, and tfvars are correct.
- Read the plan for destroy/replace lines, not just the summary count.
- Check whether the resource is owned by Terraform, Helm, ArgoCD, a cloud console process, or another repo.
- Take extra care with VPCs, subnets, IAM roles, DNS zones, state buckets, cluster resources, and databases.
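Reading the plan for destroy and replace lines can be made less error-prone by filtering the plan text for the resource headers that announce them. This sketch assumes the human-readable plan marks such resources with "will be destroyed" or "must be replaced", which matches current Terraform output but is not a stable interface; the function name `destructive_lines` is illustrative, and the JSON plan from terraform show -json is the more robust source if you automate this.

```shell
# Sketch: read `terraform plan -no-color` text on stdin and print the
# resource header lines that announce a destroy or a replacement.
# destructive_lines is a hypothetical helper name.
destructive_lines() {
  # "|| true" keeps a plan with no destructive lines from looking like a failure.
  grep -E '^[[:space:]]*# .*(will be destroyed|must be replaced)' || true
}

# Example use:
# terraform plan -no-color | destructive_lines
```

An empty result means no matching headers were found, not that the plan is safe; the full plan still needs review before approval.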