Pod Lifecycle
A Pod moves from API creation to scheduling, image pull, container start, readiness, running, termination, and cleanup. Most production symptoms map to one stage: Pending, ContainerCreating, ImagePullBackOff, CrashLoopBackOff, failed probes, or stuck termination.
Mental Model
A Pod is the smallest deployable unit in Kubernetes. It wraps one or more containers that share network, storage volumes, and lifecycle. Controllers such as Deployments, StatefulSets, DaemonSets, and Jobs usually create Pods for you; you debug the Pod to understand what the controller is failing to run.
[Figure: Pod lifecycle stages and the common failure states attached to each stage.]
Phases And Conditions
Pod phase is a high-level summary; conditions and container statuses carry the useful detail. Before changing anything, read the kubectl describe output, the Pod's events, and the container statuses.
| Signal | Meaning | Where To Look |
|---|---|---|
| Pending | Pod accepted by API but not fully running. Could be unscheduled or preparing. | Scheduler events, node selectors, taints, PVCs, image pulls, CNI. |
| Running | At least one container is running or starting. | Readiness, restarts, logs, conditions. |
| Succeeded | All containers exited successfully. Normal for Jobs. | Job completion, logs, exit code 0. |
| Failed | All containers stopped and at least one failed. | Exit codes, termination reason, previous logs. |
| Unknown | API server cannot determine Pod state, usually node communication issue. | Node readiness, kubelet, network partition. |
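The gap between phase and detail can be sketched in a few lines of Python. This is a minimal illustration, not an official tool: it assumes the Pod has already been loaded into a dict from the JSON that kubectl get pod prints with -o json, and the helper name summarize_pod and the sample Pod are hypothetical.

```python
def summarize_pod(pod: dict) -> dict:
    """Collect the phase plus the detail that the phase alone hides."""
    status = pod.get("status", {})
    container_statuses = status.get("containerStatuses", [])
    return {
        "phase": status.get("phase", "Unknown"),
        # Conditions that are not True explain why a Pod is not Ready.
        "failing_conditions": {
            c["type"]: c.get("reason", "")
            for c in status.get("conditions", [])
            if c.get("status") != "True"
        },
        # A waiting reason such as CrashLoopBackOff never shows up in the phase.
        "waiting": {
            cs["name"]: cs["state"]["waiting"].get("reason", "")
            for cs in container_statuses
            if "waiting" in cs.get("state", {})
        },
        "restarts": {cs["name"]: cs.get("restartCount", 0) for cs in container_statuses},
    }


# Hypothetical Pod, trimmed to the fields the helper reads.
sample_pod = {
    "status": {
        "phase": "Running",
        "conditions": [
            {"type": "PodScheduled", "status": "True"},
            {"type": "Ready", "status": "False", "reason": "ContainersNotReady"},
        ],
        "containerStatuses": [
            {"name": "app", "restartCount": 7,
             "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
        ],
    }
}

summary = summarize_pod(sample_pod)
print(summary["phase"], summary["failing_conditions"], summary["waiting"])
```

The sample shows why the phase is not enough: a crash-looping Pod still reports the Running phase, and only the Ready condition and the waiting reason reveal the problem.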
First-Look Pod Debugging
# Replace <namespace> with the application namespace.
kubectl get pods -n <namespace> -o wide
# Show the Pod's events, conditions, containers, volumes, node, and failure messages.
kubectl describe pod <pod-name> -n <namespace>
# Show container statuses including waiting/running/terminated reasons.
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{range .status.containerStatuses[*]}{.name} ready={.ready} restarts={.restartCount} state={.state}{"\n"}{end}'
# Get current container logs. Add -c <container-name> for multi-container Pods.
kubectl logs <pod-name> -n <namespace> --tail=100
# Get logs from the previous crashed container instance.
kubectl logs <pod-name> -n <namespace> --previous --tail=100
# Watch Pod state while testing a fix or rollout.
kubectl get pod <pod-name> -n <namespace> --watch
Pending Pods
A Pending Pod may not have a node yet, or it may have a node but kubelet is still preparing it. The first question is: did the scheduler bind it to a node?
# If NODE is empty, the scheduler has not placed the Pod.
kubectl get pod <pod-name> -n <namespace> -o wide
# Read scheduler messages such as insufficient CPU, taints, affinity, or PVC binding.
kubectl describe pod <pod-name> -n <namespace>
# Check node capacity and pressure.
kubectl get nodes -o wide
kubectl describe nodes | grep -E 'Name:|Taints:|Allocatable:|Allocated resources:' -A8
# Check whether an unbound PVC is blocking scheduling.
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
- Unschedulable: look for insufficient CPU/memory, taints without tolerations, nodeSelector mismatch, affinity rules, topology spread, or unbound PVCs.
- Node assigned but still Pending: look for image pull, CNI, volume mount, or runtime sandbox errors in Pod events.
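The two cases above can be told apart programmatically. The following is a rough sketch under the assumption that the Pod is available as a dict parsed from kubectl JSON output; pending_stage is a hypothetical helper name, and the scheduler message in the example is illustrative.

```python
def pending_stage(pod: dict) -> str:
    """Say whether a Pending Pod is waiting on the scheduler or on the kubelet."""
    status = pod.get("status", {})
    if status.get("phase") != "Pending":
        return "not pending"
    # spec.nodeName is filled in once the scheduler binds the Pod to a node.
    if pod.get("spec", {}).get("nodeName"):
        return "node assigned: check Pod events for image pull, CNI, volume, or sandbox errors"
    # While unscheduled, the PodScheduled condition carries the scheduler's reason.
    for cond in status.get("conditions", []):
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            return f"unscheduled: {cond.get('message', 'no scheduler message')}"
    return "unscheduled: no scheduler message yet"


# Hypothetical examples, trimmed to the fields the helper reads.
unscheduled_pod = {
    "spec": {},
    "status": {
        "phase": "Pending",
        "conditions": [
            {"type": "PodScheduled", "status": "False", "reason": "Unschedulable",
             "message": "0/3 nodes are available: 3 Insufficient cpu."},
        ],
    },
}
assigned_pod = {"spec": {"nodeName": "node-a"}, "status": {"phase": "Pending"}}

print(pending_stage(unscheduled_pod))
print(pending_stage(assigned_pod))
```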
ContainerCreating
ContainerCreating means the kubelet is preparing the Pod sandbox or its containers. Common causes include a slow image pull, CNI failure, volume mount failure, a missing Secret or ConfigMap, or container runtime problems.
# Events usually name the failed step: FailedMount, FailedCreatePodSandBox, FailedPull, etc.
kubectl describe pod <pod-name> -n <namespace>
# Check referenced ConfigMaps and Secrets exist in the same namespace as the Pod.
kubectl get configmap,secret -n <namespace>
# Check CNI and kube-proxy style components in kube-system.
kubectl get pods -n kube-system -o wide | grep -E 'calico|cilium|flannel|aws-node|azure|kube-proxy'
# Check the node that owns this Pod.
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeName}{"\n"}'
kubectl describe node <node-name>
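To know which ConfigMaps and Secrets to look for, it can help to list the names a Pod spec actually references. The sketch below assumes the Pod is a dict parsed from kubectl JSON output; referenced_config is a hypothetical helper, and it deliberately ignores projected volumes and optional references.

```python
def referenced_config(pod: dict) -> dict:
    """List the ConfigMap and Secret names that a Pod spec refers to."""
    spec = pod.get("spec", {})
    configmaps, secrets = set(), set()
    for vol in spec.get("volumes", []):
        if "configMap" in vol:
            configmaps.add(vol["configMap"]["name"])
        if "secret" in vol:
            secrets.add(vol["secret"]["secretName"])
    for container in spec.get("initContainers", []) + spec.get("containers", []):
        for source in container.get("envFrom", []):
            if "configMapRef" in source:
                configmaps.add(source["configMapRef"]["name"])
            if "secretRef" in source:
                secrets.add(source["secretRef"]["name"])
        for env in container.get("env", []):
            ref = env.get("valueFrom", {})
            if "configMapKeyRef" in ref:
                configmaps.add(ref["configMapKeyRef"]["name"])
            if "secretKeyRef" in ref:
                secrets.add(ref["secretKeyRef"]["name"])
    for pull_secret in spec.get("imagePullSecrets", []):
        secrets.add(pull_secret["name"])
    return {"configmaps": sorted(configmaps), "secrets": sorted(secrets)}


# Hypothetical Pod spec; every name here is made up for illustration.
sample_pod = {
    "spec": {
        "imagePullSecrets": [{"name": "registry-creds"}],
        "volumes": [
            {"name": "config", "configMap": {"name": "web-config"}},
            {"name": "tls", "secret": {"secretName": "web-tls"}},
        ],
        "containers": [
            {
                "name": "app",
                "envFrom": [{"configMapRef": {"name": "web-env"}}],
                "env": [
                    {"name": "ENVIRONMENT", "value": "production"},
                    {"name": "DB_PASSWORD",
                     "valueFrom": {"secretKeyRef": {"name": "db-creds", "key": "password"}}},
                ],
            }
        ],
    }
}

print(referenced_config(sample_pod))
```

Comparing this list against the output of the kubectl get configmap,secret command above narrows a missing-object problem to a specific name.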
ImagePullBackOff
ImagePullBackOff and ErrImagePull mean the kubelet could not pull the image. ErrImagePull is the failed attempt itself; ImagePullBackOff is the kubelet waiting progressively longer between retries, up to a five-minute cap.
| Event Message | Likely Cause | Fix Direction |
|---|---|---|
| not found | Wrong image name or tag. | Correct image reference in Deployment/Helm values. |
| unauthorized | Missing or wrong registry credentials. | Check imagePullSecrets and registry Secret. |
| i/o timeout | Node cannot reach registry. | Check egress, proxy, DNS, firewall, private endpoint. |
| rate limit | Registry throttling. | Use authenticated pulls or mirrored registry. |
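The table lends itself to a simple lookup. The sketch below matches the table's substrings against an event message; classify_pull_error is a hypothetical helper, the sample messages are illustrative, and real registry errors vary in wording, so treat a match as a hint rather than a diagnosis.

```python
# (substring, likely cause, fix direction), taken from the table above.
PULL_ERROR_HINTS = [
    ("not found", "Wrong image name or tag",
     "Correct image reference in Deployment/Helm values"),
    ("unauthorized", "Missing or wrong registry credentials",
     "Check imagePullSecrets and registry Secret"),
    ("i/o timeout", "Node cannot reach registry",
     "Check egress, proxy, DNS, firewall, private endpoint"),
    ("rate limit", "Registry throttling",
     "Use authenticated pulls or mirrored registry"),
]


def classify_pull_error(message: str) -> tuple:
    """Return (likely cause, fix direction) for an image pull event message."""
    lowered = message.lower()
    for needle, cause, fix in PULL_ERROR_HINTS:
        if needle in lowered:
            return cause, fix
    return "Unrecognized", "Read the full event message in the Pod's events"


# Illustrative messages; the exact text depends on the registry and runtime.
for msg in [
    "manifest for team/app:9.9.9 not found: manifest unknown",
    "unauthorized: authentication required",
    "dial tcp 10.0.0.1:443: i/o timeout",
    "toomanyrequests: You have reached your pull rate limit.",
]:
    print(classify_pull_error(msg)[0])
```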
apiVersion: v1
kind: Secret
metadata:
  name: registry-creds # Referenced by imagePullSecrets in the Pod spec.
  namespace: app # Must be in the same namespace as the Pod.
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: BASE64_ENCODED_DOCKER_CONFIG_JSON # Usually created by kubectl create secret docker-registry.
---
apiVersion: v1
kind: Pod
metadata:
  name: private-image-demo
  namespace: app
spec:
  imagePullSecrets:
    - name: registry-creds # Kubelet uses this Secret when pulling private images.
  containers:
    - name: app
      image: private-registry.example.com/team/app:1.0.0 # Replace with approved registry/image/tag.
      ports:
        - containerPort: 8080
CrashLoopBackOff
CrashLoopBackOff means the container starts, exits, and is restarted by the kubelet with an increasing delay between attempts. The root cause is usually inside the container or in a dependency it needs at startup.
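As a rough illustration of the increasing delay, the sketch below models the restart backoff as a doubling sequence with a cap. The defaults (10 seconds initial delay, 300 seconds cap) are the values described in the upstream Kubernetes documentation and are assumptions here; a given cluster version or configuration may differ, and crashloop_delays is a hypothetical helper.

```python
def crashloop_delays(restarts: int, base: int = 10, cap: int = 300) -> list:
    """Delay in seconds before each of the first `restarts` restart attempts."""
    # Each delay doubles the previous one until it reaches the cap.
    return [min(base * 2 ** n, cap) for n in range(restarts)]


delays = crashloop_delays(7)
print(delays)       # [10, 20, 40, 80, 160, 300, 300]
print(sum(delays))  # 910
```

Under these assumptions, a Pod showing seven restarts has already spent about fifteen minutes backing off, which is why a fix can appear to take several minutes to show up.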
# Current logs may be short if the process exits quickly.
kubectl logs <pod-name> -n <namespace> -c <container-name> --tail=100
# Previous logs are usually the key for CrashLoopBackOff.
kubectl logs <pod-name> -n <namespace> -c <container-name> --previous --tail=200
# Show exit code and reason from the last terminated container.
kubectl get pod <pod-name> -n <namespace> \
-o jsonpath='{range .status.containerStatuses[*]}{.name} lastState={.lastState}{"\n"}{end}'
# Common signal: exitCode 137 often means the process was killed, frequently OOMKilled.
kubectl describe pod <pod-name> -n <namespace> | grep -A8 -E 'Last State|Reason|Exit Code|OOMKilled'
- Exit code 1: application error, bad config, missing env var, failed migration, dependency unavailable.
- Exit code 137: killed by the system, often memory limit exceeded. Check OOMKilled and memory limits.
- Exit code 143: SIGTERM. Often normal during rollout or drain if the app exits cleanly.
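Exit codes above 128 follow the shell convention of 128 plus the signal number, which is where 137 (128 + 9, SIGKILL) and 143 (128 + 15, SIGTERM) come from. The helper below is a small sketch of that arithmetic; describe_exit_code is a hypothetical name, and the signal table uses Linux signal numbers and covers only a few common signals.

```python
# Linux signal numbers for signals commonly seen in container exits.
SIGNAL_NAMES = {2: "SIGINT", 6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a short human-readable hint."""
    if code == 0:
        return "exited successfully"
    if code > 128:
        signum = code - 128
        name = SIGNAL_NAMES.get(signum, "unknown signal")
        return f"killed by {name} (signal {signum})"
    return "application error (exit code chosen by the process)"


for code in (0, 1, 137, 143):
    print(code, describe_exit_code(code))
```

A 137 only says the process received SIGKILL; the OOMKilled reason in the container's last state is what confirms a memory limit was the cause.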
Startup, Readiness, And Liveness Probes
Probes tell kubelet how to manage traffic and restarts. A bad probe can create an outage even when the application is mostly healthy.
| Probe | Purpose | Failure Effect |
|---|---|---|
startupProbe |
Gives slow-starting apps time before liveness checks begin. | Container is killed if startup never succeeds. |
readinessProbe |
Controls whether the Pod receives Service traffic. | Pod stays running but is removed from endpoints. |
livenessProbe |
Detects a stuck process that should be restarted. | Container is restarted by kubelet. |
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      terminationGracePeriodSeconds: 30 # Time allowed after SIGTERM before SIGKILL.
      containers:
        - name: web
          image: nginx:1.27 # Placeholder: stock nginx answers 404 on the probe paths below; use an image that serves them.
          ports:
            - containerPort: 80
          startupProbe:
            httpGet:
              path: /healthz/startup # Use an endpoint that succeeds only after startup is complete.
              port: 80
            failureThreshold: 30 # 30 failures * 2s = up to 60s startup allowance.
            periodSeconds: 2
          readinessProbe:
            httpGet:
              path: /ready # Use dependency-aware readiness, not just process-alive.
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 2
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /healthz/live # Keep this lightweight; avoid deep dependency checks.
              port: 80
            initialDelaySeconds: 20
            periodSeconds: 10
            timeoutSeconds: 2
            failureThreshold: 3
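The timing implied by these probe settings is simple arithmetic, sketched below. The helper worst_case_seconds is hypothetical; it gives a rough upper bound, counted from when the probe begins running, on how long a probe that never succeeds takes to trigger its failure effect. It ignores timeoutSeconds and uses the Kubernetes defaults (initialDelaySeconds 0, periodSeconds 10, failureThreshold 3) for any field that is not set.

```python
def worst_case_seconds(probe: dict) -> int:
    """Approximate seconds until a never-succeeding probe triggers its failure effect."""
    initial_delay = probe.get("initialDelaySeconds", 0)
    period = probe.get("periodSeconds", 10)
    threshold = probe.get("failureThreshold", 3)
    return initial_delay + threshold * period


# Values copied from the Deployment above.
probes = {
    "startupProbe": {"failureThreshold": 30, "periodSeconds": 2},
    "readinessProbe": {"initialDelaySeconds": 5, "periodSeconds": 10, "failureThreshold": 3},
    "livenessProbe": {"initialDelaySeconds": 20, "periodSeconds": 10, "failureThreshold": 3},
}

for name, probe in probes.items():
    print(name, worst_case_seconds(probe))
```

If an application can take longer to start than the startup figure, raise failureThreshold or periodSeconds on the startup probe rather than loosening the liveness probe.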
Termination And Graceful Shutdown
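When a Pod is deleted, Kubernetes marks it as terminating and starts two things in parallel: the Pod is removed from Service endpoints, and the kubelet runs any preStop hook and then sends SIGTERM to the containers. If a process is still running when terminationGracePeriodSeconds expires, the kubelet sends SIGKILL. The grace period covers the preStop hook as well as the time after SIGTERM, and because endpoint removal is not instantaneous, a Pod can still receive requests briefly after shutdown begins. Apps should stop accepting new work, finish in-flight requests, and exit cleanly.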
apiVersion: v1
kind: Pod
metadata:
  name: graceful-demo
  namespace: app
spec:
  terminationGracePeriodSeconds: 45 # Match this to app shutdown time and load balancer drain behavior.
  containers:
    - name: app
      image: nginx:1.27
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/sh
              - -c
              - sleep 10 # Gives endpoint removal/load balancer drain a short buffer before process exit.
      ports:
        - containerPort: 80
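On the application side, graceful shutdown usually means catching SIGTERM, refusing new work, and letting the current unit of work finish. The following is a minimal sketch of that pattern in Python, not a prescription for any particular framework; the names shutting_down, on_sigterm, install_handler, and run_worker are all hypothetical.

```python
import signal
import threading

# Set once SIGTERM arrives; the worker loop checks it between jobs.
shutting_down = threading.Event()


def on_sigterm(signum, frame):
    """Signal handler: only record the request, do no heavy work here."""
    shutting_down.set()


def install_handler():
    """Register the handler; call this once from the main thread at startup."""
    signal.signal(signal.SIGTERM, on_sigterm)


def run_worker(jobs):
    """Run callables in order, stopping before the next job once shutdown is requested."""
    results = []
    for job in jobs:
        if shutting_down.is_set():
            break  # Stop accepting new work.
        results.append(job())  # A job that has started always runs to completion.
    return results
```

The whole shutdown path, including any preStop hook, has to fit inside terminationGracePeriodSeconds, otherwise the process is killed before it finishes draining.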
# Check if a stuck terminating Pod has finalizers.
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.metadata.finalizers}{"\n"}'
# Check deletion timestamp and grace period.
kubectl get pod <pod-name> -n <namespace> -o jsonpath='deleting={.metadata.deletionTimestamp} grace={.metadata.deletionGracePeriodSeconds}{"\n"}'
# Do not force delete until you understand storage/network side effects.
# Force delete removes the API object; it does not guarantee the process stopped instantly on the node.
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force
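The finalizer and timestamp checks above can be combined into one report. This sketch assumes the Pod is a dict parsed from kubectl JSON output and that metadata.deletionTimestamp holds the time by which the Pod is expected to be gone (the deletion request time plus the grace period), formatted as an RFC 3339 UTC timestamp with whole seconds. The helper name termination_report is hypothetical.

```python
from datetime import datetime, timezone


def termination_report(pod: dict, now: datetime = None) -> dict:
    """Report whether a Pod is being deleted and how far past its deadline it is."""
    meta = pod.get("metadata", {})
    stamp = meta.get("deletionTimestamp")
    if not stamp:
        return {"deleting": False}
    deadline = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return {
        "deleting": True,
        # Zero while the Pod is still inside its grace period.
        "overdue_seconds": max(0, int((now - deadline).total_seconds())),
        # Any finalizer listed here blocks removal until its owner clears it.
        "finalizers": meta.get("finalizers", []),
    }


# Hypothetical Pod metadata; the finalizer name is made up.
stuck_pod = {
    "metadata": {
        "deletionTimestamp": "2024-05-01T12:00:30Z",
        "deletionGracePeriodSeconds": 30,
        "finalizers": ["example.com/cleanup"],
    }
}
check_time = datetime(2024, 5, 1, 12, 5, 30, tzinfo=timezone.utc)
print(termination_report(stuck_pod, now=check_time))
```

A Pod that is overdue with a non-empty finalizer list is waiting on whichever controller owns the finalizer; one that is overdue with no finalizers points toward node or kubelet health instead.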
Production Pod Spec Pattern
In real systems this usually lives inside a Deployment, StatefulSet, Job, or CronJob. The important lifecycle controls are the same.
apiVersion: v1
kind: Pod
metadata:
  name: lifecycle-demo
  namespace: app
  labels:
    app: lifecycle-demo # Service selectors and troubleshooting commands can use this label.
spec:
  restartPolicy: Always # Always for long-running app Pods; Jobs usually use Never or OnFailure.
  terminationGracePeriodSeconds: 30
  containers:
    - name: app
      image: nginx:1.27
      imagePullPolicy: IfNotPresent # Use Always for mutable tags; prefer immutable tags or digests in prod.
      ports:
        - name: http
          containerPort: 80
      env:
        - name: ENVIRONMENT
          value: production # Example inline env var; Secrets/ConfigMaps are better for shared config.
      readinessProbe:
        httpGet:
          path: /
          port: http
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /
          port: http
        initialDelaySeconds: 20
        periodSeconds: 10
      resources:
        requests:
          cpu: 100m # Scheduler reserves this amount.
          memory: 128Mi
        limits:
          cpu: 500m # Container is throttled above this CPU limit.
          memory: 256Mi # Container can be OOMKilled above this memory limit.
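To reason about the gap between requests and limits, the quantities need to be turned into numbers. The sketch below handles only the forms used in this manifest (millicore or decimal CPU values, and Ki/Mi/Gi or plain-byte memory values); the full Kubernetes quantity syntax has more suffixes, and the helper names are hypothetical.

```python
def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '100m' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '128Mi' to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # No suffix: already plain bytes.


def burst_ratio(resources: dict, name: str) -> float:
    """How many times larger the limit is than the request for 'cpu' or 'memory'."""
    parse = cpu_millicores if name == "cpu" else memory_bytes
    return parse(resources["limits"][name]) / parse(resources["requests"][name])


# Values copied from the manifest above.
resources = {
    "requests": {"cpu": "100m", "memory": "128Mi"},
    "limits": {"cpu": "500m", "memory": "256Mi"},
}
print(burst_ratio(resources, "cpu"), burst_ratio(resources, "memory"))
```

A large ratio means the container may use far more than the scheduler reserved for it, which is convenient for bursts but makes node pressure harder to predict.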
Quick Runbook By Symptom
- Pending: describe Pod, check scheduler events, resources, taints, affinity, PVCs, and node availability.
- ContainerCreating: inspect events for CNI, volume, Secret, ConfigMap, image, or runtime failures.
- ImagePullBackOff: verify image name/tag, registry credentials, imagePullSecrets, DNS, egress, and registry limits.
- CrashLoopBackOff: read previous logs, exit code, env/config, OOMKilled status, and dependency availability.
- Running but no traffic: check readiness, Service selector, Endpoints/EndpointSlices, NetworkPolicy, and Ingress/LB health checks.
- Stuck Terminating: check finalizers, volume detach, node health, grace period, and whether force delete is safe.
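The runbook can be driven from a Pod's status. The function below is a rough triage sketch that maps a Pod dict (parsed from kubectl JSON output) to one of the symptom names above; triage is a hypothetical helper, the check order is a judgment call, and a Pod with a deletion timestamp is only genuinely stuck if it outlives its grace period.

```python
def triage(pod: dict) -> str:
    """Map a Pod to the runbook symptom that most likely applies."""
    if pod.get("metadata", {}).get("deletionTimestamp"):
        return "Stuck Terminating"
    status = pod.get("status", {})
    statuses = status.get("initContainerStatuses", []) + status.get("containerStatuses", [])
    # Waiting reasons are more specific than the phase, so check them first.
    reasons = {cs.get("state", {}).get("waiting", {}).get("reason") for cs in statuses}
    if reasons & {"ImagePullBackOff", "ErrImagePull"}:
        return "ImagePullBackOff"
    if "CrashLoopBackOff" in reasons:
        return "CrashLoopBackOff"
    if "ContainerCreating" in reasons:
        return "ContainerCreating"
    phase = status.get("phase")
    if phase == "Pending":
        return "Pending"
    ready = any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in status.get("conditions", [])
    )
    if phase == "Running" and not ready:
        return "Running but no traffic"
    return "no runbook match"


# Hypothetical Pod, trimmed to the fields the helper reads.
crashing_pod = {
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {"name": "app", "state": {"waiting": {"reason": "CrashLoopBackOff"}}}
        ],
    }
}
print(triage(crashing_pod))
```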