Kubernetes Internals
Kubernetes is a reconciliation system. Users submit desired state to the API server, admission controllers mutate and validate it, etcd stores it, controllers watch for changes, and node agents make the real world match the desired state.
Mental Model
Think of Kubernetes as a database-backed API with many controllers attached. The API server is the front door, etcd is the durable state store, and every controller is a loop that watches objects and takes action when actual state differs from desired state.
Desired state
What you declare through YAML manifests, Helm, Terraform, Argo CD, or kubectl.
Observed state
What the cluster reports back through status fields, events, metrics, and logs.
Reconciliation
The repeated loop that compares desired and observed state, then corrects drift.
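The three terms above can be tied together in a few lines. This is a minimal sketch, not how any real controller is written: the `reconcile` function and the replica-count dictionaries are hypothetical stand-ins for desired state, observed state, and the corrective actions a controller would issue.

```python
def reconcile(desired: dict, observed: dict) -> list:
    """Compare desired and observed replica counts and return corrective actions."""
    actions = []
    for name, want in sorted(desired.items()):
        have = observed.get(name, 0)
        if have < want:
            actions.append(f"create {want - have} pod(s) for {name}")
        elif have > want:
            actions.append(f"delete {have - want} pod(s) for {name}")
    # Anything observed but no longer desired is drift that must be removed.
    for name in sorted(set(observed) - set(desired)):
        actions.append(f"delete all pods for {name}")
    return actions


desired = {"web": 3, "worker": 2}                  # what was asked for
observed = {"web": 1, "worker": 2, "old-job": 1}   # what the cluster reports

for action in reconcile(desired, observed):
    print(action)
```

Running the function again after the actions are applied would return an empty list, which is the steady state a reconciliation loop converges to.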
API Request Path
Almost every Kubernetes operation follows the same control-plane path. If a create, update, or delete fails, walk this path in order.
Figure: API write path and reconciliation flow. The API server is the only component that should directly read/write etcd.
What Happens During kubectl apply
- The client sends an HTTPS request to the API server.
- The API server authenticates who you are.
- RBAC authorizes whether you can perform that verb on that resource.
- Admission controllers mutate or validate the object.
- The object is persisted in etcd.
- Controllers and kubelets observe changes through watches and reconcile.
# Show the active kubeconfig context so you do not inspect the wrong cluster.
kubectl config current-context
# Confirm the API server endpoint used by the current context.
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'; echo
# Ask the API server what versions and resources it serves.
kubectl version
kubectl api-resources | head -n 30
# Check if your current identity can perform a specific action.
# Replace <namespace> with the target namespace before running this.
kubectl auth can-i create deployments -n <namespace>
kubectl auth can-i patch deployments -n <namespace>
kubectl auth can-i get secrets -n <namespace>
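The ordering of the request path listed above matters when debugging, because each stage can stop the request before the next one runs. The following toy model is a sketch under stated assumptions: `ToyApiServer`, its `rbac` table, and the result strings are hypothetical and only illustrate the order of checks, not the real API server's behavior or status codes.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    user: str       # empty string models a missing or invalid credential
    verb: str
    resource: str
    obj: dict


@dataclass
class ToyApiServer:
    # Hypothetical RBAC table: user -> allowed (verb, resource) pairs.
    rbac: dict
    # Plain dict standing in for etcd.
    store: dict = field(default_factory=dict)

    def handle(self, req: Request) -> str:
        # 1. Authentication: who is calling?
        if not req.user:
            return "unauthenticated"
        # 2. Authorization: may this user perform this verb on this resource?
        if (req.verb, req.resource) not in self.rbac.get(req.user, set()):
            return "forbidden"
        # 3. Mutating admission: apply a default to a copy of the object.
        obj = dict(req.obj)
        obj.setdefault("replicas", 1)
        # 4. Validating admission: reject objects that break a rule.
        if obj["replicas"] < 0:
            return "rejected by admission"
        # 5. Persistence: only now does the object reach the store.
        self.store[obj["name"]] = obj
        return "persisted"


server = ToyApiServer(rbac={"alice": {("create", "deployments")}})
print(server.handle(Request("", "create", "deployments", {"name": "web"})))
print(server.handle(Request("bob", "create", "deployments", {"name": "web"})))
print(server.handle(Request("alice", "create", "deployments", {"name": "web", "replicas": -1})))
print(server.handle(Request("alice", "create", "deployments", {"name": "web"})))
print(server.store)
```

Only the last request reaches the store, and it arrives with the defaulted `replicas` field, which is why a persisted object can differ from the file that was applied.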
etcd
etcd is the strongly consistent key-value database that stores Kubernetes cluster state. If etcd is unhealthy, the cluster may still run existing workloads for a while, but new scheduling, updates, leader elections, and control-plane writes become unreliable or unavailable.
| Symptom | Likely Area | First Checks |
|---|---|---|
| Writes fail or hang | API server to etcd path | API server health, etcd leader, disk latency, quorum. |
| Objects disappear after restart | etcd persistence | etcd data dir, disk mount, backup/restore process. |
| High API latency | etcd or API overload | apiserver request metrics, etcd fsync latency, watch count. |
| Control plane loses quorum | etcd cluster membership | member list, leader status, network between control-plane nodes. |
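The quorum row in the table is easier to reason about with the arithmetic written out. etcd uses Raft, where a write needs a majority of members; the helper names below are illustrative only.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given member count."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority still exists."""
    return members - quorum(members)


for n in (1, 3, 4, 5):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
```

A four-member cluster tolerates the same single failure as a three-member one, which is why odd member counts are the usual choice.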
# For kubeadm-style clusters where control-plane components run as static pods.
# Managed Kubernetes providers usually hide these pods from you.
kubectl get pods -n kube-system -l component=etcd -o wide
kubectl get pods -n kube-system -l component=kube-apiserver -o wide
# Look for restart loops, failed probes, certificate errors, or disk errors.
kubectl describe pod -n kube-system -l component=etcd
kubectl logs -n kube-system -l component=etcd --tail=100
# Read recent control-plane events. Events often reveal probe failures or node pressure.
kubectl get events -n kube-system --sort-by=.lastTimestamp | tail -n 40
etcd Snapshot Pattern
This is the shape of an etcd snapshot command for self-managed clusters. Treat certificate paths and endpoints as environment-specific.
# Run only on a self-managed control-plane node with etcdctl installed.
# ETCDCTL_API=3 selects the v3 API used by modern Kubernetes clusters.
export ETCDCTL_API=3
# Save a point-in-time backup of etcd state.
# Replace certificate paths if the cluster does not use kubeadm defaults.
etcdctl snapshot save /var/backups/etcd-snapshot.db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
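# Verify the snapshot file before trusting it.
# etcd 3.5 deprecates this subcommand in etcdctl; newer releases provide it as etcdutl snapshot status.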
etcdctl snapshot status /var/backups/etcd-snapshot.db --write-out=table
Admission Control
Admission happens after authentication and authorization but before persistence to etcd. Mutating admission runs first and can change the object. Validating admission runs afterwards, sees the final object, and can reject it. This is where policies, defaulting, image rules, sidecar injection, and pod security checks often happen.
# Example only: requires ValidatingAdmissionPolicy support in the cluster.
# The policy is enforced only once a ValidatingAdmissionPolicyBinding references it.
# Purpose: reject Pods that do not define CPU and memory limits.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-container-resource-limits
spec:
  failurePolicy: Fail # Fail closed: reject if this policy cannot be evaluated.
  matchConstraints:
    resourceRules:
      - apiGroups: [""] # Empty string means the core API group.
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods"]
  validations:
    - expression: "object.spec.containers.all(c, has(c.resources) && has(c.resources.limits) && has(c.resources.limits.cpu) && has(c.resources.limits.memory))"
      message: "Every container must set CPU and memory limits."
# List admission webhooks. These are common sources of failed creates and updates.
kubectl get mutatingwebhookconfigurations
kubectl get validatingwebhookconfigurations
# Inspect a specific webhook for timeoutSeconds, failurePolicy, namespaceSelector, and service name.
kubectl describe validatingwebhookconfiguration <webhook-name>
# Server-side dry run sends the file through the API server and admission chain for validation.
# Nothing is persisted, even when validation succeeds.
kubectl apply --dry-run=server -f <file.yaml>
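To check a manifest against the same rule before it reaches the cluster, the CEL expression in the policy above can be mirrored locally. This is a rough Python equivalent, assuming the Pod has already been parsed into a plain dictionary; `containers_have_limits` is a hypothetical helper, not part of any Kubernetes client.

```python
def containers_have_limits(pod: dict) -> bool:
    """Return True when every container sets both CPU and memory limits."""
    containers = pod.get("spec", {}).get("containers", [])
    return all(
        "cpu" in c.get("resources", {}).get("limits", {})
        and "memory" in c.get("resources", {}).get("limits", {})
        for c in containers
    )


compliant = {"spec": {"containers": [
    {"name": "web", "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}}},
]}}
missing_memory = {"spec": {"containers": [
    {"name": "web", "resources": {"limits": {"cpu": "500m"}}},
    {"name": "sidecar"},
]}}

print(containers_have_limits(compliant))
print(containers_have_limits(missing_memory))
```

Like the policy itself, this sketch only inspects `spec.containers`; init containers and ephemeral containers would need their own checks.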
Watches And Controllers
Controllers do not constantly scan the whole cluster. They watch API objects, receive change notifications, enqueue work, and reconcile resources. The Deployment controller creates ReplicaSets; the ReplicaSet controller creates Pods; the scheduler binds Pods to Nodes; the kubelet starts containers and reports status.
# Replace <namespace>, <deployment>, and <app-label> with the app you are tracing.
kubectl get deployment <deployment> -n <namespace> -o wide
# Show the ReplicaSets created by the Deployment.
kubectl get rs -n <namespace> -l app=<app-label> -o wide
# Show Pods owned by those ReplicaSets.
kubectl get pods -n <namespace> -l app=<app-label> -o wide
# Describe reveals conditions, events, rollout strategy, and failure messages.
kubectl describe deployment <deployment> -n <namespace>
# Watch status changes as the controllers reconcile.
kubectl get deploy,rs,pods -n <namespace> -l app=<app-label> --watch
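The "watch, enqueue, reconcile" pattern described above can be sketched with a deduplicating queue. This is a simplified illustration, not the client-go implementation: `WorkQueue` and the event tuples are hypothetical, and the point is only that several notifications for one object collapse into a single reconcile of its current state.

```python
from collections import deque


class WorkQueue:
    """FIFO queue that holds each object key at most once while it is pending."""

    def __init__(self):
        self._queue = deque()
        self._pending = set()

    def add(self, key: str) -> None:
        # A key that is already waiting is not queued a second time.
        if key not in self._pending:
            self._pending.add(key)
            self._queue.append(key)

    def get(self) -> str:
        key = self._queue.popleft()
        self._pending.discard(key)
        return key

    def __len__(self) -> int:
        return len(self._queue)


# Hypothetical watch notifications as (event type, namespace/name) pairs.
events = [
    ("ADDED", "demo/web"),
    ("MODIFIED", "demo/web"),
    ("MODIFIED", "demo/api"),
    ("MODIFIED", "demo/web"),
]

queue = WorkQueue()
for _event_type, key in events:
    queue.add(key)  # only the key is queued; the event payload is dropped

reconciled = []
while len(queue):
    reconciled.append(queue.get())  # a real controller would re-read the object here

print(reconciled)  # ['demo/web', 'demo/api']
```

Queuing keys rather than events keeps the loop level-triggered: the controller acts on what the object looks like now, not on the history of how it got there.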
Resource Versions
Every Kubernetes object has metadata such as uid, generation, and resourceVersion. Controllers use these fields to detect changes, avoid stale updates, and know whether status has caught up to spec.
| Field | Meaning | SRE Use |
|---|---|---|
| metadata.uid | Stable identity for this specific object instance. | Useful when names are reused after delete/recreate. |
| metadata.resourceVersion | Opaque version string that changes on every write; used by watches and optimistic concurrency. | Useful for understanding watch behavior, not usually manually edited. |
| metadata.generation | Increments when desired spec changes. | Compare with observedGeneration to see if a controller caught up. |
| status.observedGeneration | The latest generation the controller has processed. | If lower than generation, the controller has not reconciled the latest spec. |
# Show key metadata and controller progress for a Deployment.
kubectl get deploy <deployment> -n <namespace> \
-o jsonpath='uid={.metadata.uid} generation={.metadata.generation} observed={.status.observedGeneration} resourceVersion={.metadata.resourceVersion}{"\n"}'
# Print condition types, statuses, and reasons as comma-separated columns.
kubectl get deploy <deployment> -n <namespace> \
-o custom-columns='TYPE:.status.conditions[*].type,STATUS:.status.conditions[*].status,REASON:.status.conditions[*].reason'
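How resourceVersion and generation behave on writes can be illustrated with a small in-memory store. This is a sketch under simplifying assumptions: `ToyStore` and `Conflict` are hypothetical, status is updated through the same call as spec (real clusters typically use a status subresource), and resourceVersion is modeled as a counter even though clients should treat it as opaque.

```python
import copy


class Conflict(Exception):
    """Raised when an update carries a stale resourceVersion."""


class ToyStore:
    def __init__(self):
        self._objects = {}
        self._revision = 0

    def create(self, name: str, spec: dict) -> dict:
        self._revision += 1
        obj = {"name": name, "spec": dict(spec), "status": {},
               "generation": 1, "resourceVersion": str(self._revision)}
        self._objects[name] = obj
        return copy.deepcopy(obj)

    def update(self, obj: dict) -> dict:
        current = self._objects[obj["name"]]
        # Optimistic concurrency: the writer must have seen the latest version.
        if obj["resourceVersion"] != current["resourceVersion"]:
            raise Conflict(f"{obj['name']}: stale resourceVersion {obj['resourceVersion']}")
        new = copy.deepcopy(obj)
        # generation moves only when the desired spec changes.
        spec_changed = new["spec"] != current["spec"]
        new["generation"] = current["generation"] + (1 if spec_changed else 0)
        # resourceVersion moves on every successful write.
        self._revision += 1
        new["resourceVersion"] = str(self._revision)
        self._objects[obj["name"]] = new
        return copy.deepcopy(new)


store = ToyStore()
first = store.create("web", {"replicas": 3})
stale_copy = copy.deepcopy(first)

first["spec"]["replicas"] = 5
second = store.update(first)          # spec change: generation 1 -> 2

try:
    store.update(stale_copy)          # still carries the old resourceVersion
except Conflict as err:
    print("conflict:", err)

second["status"]["observedGeneration"] = second["generation"]
third = store.update(second)          # status-only write: generation stays at 2
print(third["generation"], third["status"]["observedGeneration"], third["resourceVersion"])
```

The rejected write is the case where a client must re-read the object and retry, and the final line shows the caught-up state where observedGeneration equals generation.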
Debugging The Control Plane
Use these checks when the symptom looks bigger than one workload: failed API writes, many namespaces affected, stuck scheduling, admission timeouts, or cluster-wide delays.
# 1. Basic reachability. If this fails, check kubeconfig, VPN, DNS, or API endpoint.
kubectl cluster-info
kubectl get --raw='/readyz?verbose'
# 2. Control-plane pods for kubeadm/self-managed clusters.
kubectl get pods -n kube-system -o wide | grep -E 'kube-apiserver|kube-controller-manager|kube-scheduler|etcd'
# 3. APIService health. Broken aggregated APIs can make kubectl slow or noisy.
kubectl get apiservice | grep -v True
# 4. Webhook health. Look for services that no longer exist or very short timeouts.
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
# 5. Recent kube-system events.
kubectl get events -n kube-system --sort-by=.lastTimestamp | tail -n 50
Common Failure Patterns
- Admission webhook timeout: object create/update hangs or fails because the webhook service is down, DNS is broken, or the call times out, and the webhook's failurePolicy: Fail turns that error into a rejection.
- RBAC denial: API request reaches the server but is rejected with Forbidden. Use kubectl auth can-i with the same user or service account.
- Controller lag: spec changed but status does not move. Check controller-manager health, events, and observedGeneration.
- etcd pressure: many unrelated writes slow down, leader changes increase, or apiserver request latency rises.
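The webhook pattern in the list above comes down to what happens when the webhook cannot be reached at all. The decision can be sketched as follows; `admission_outcome` and its result strings are hypothetical, and `None` stands for any call failure such as a timeout, a DNS error, or a missing Service.

```python
from typing import Optional


def admission_outcome(webhook_allowed: Optional[bool], failure_policy: str) -> str:
    """Decide a request's fate from the webhook's answer and its failurePolicy."""
    if webhook_allowed is True:
        return "admitted"
    if webhook_allowed is False:
        return "rejected by webhook"
    # No answer at all: the configured failurePolicy decides.
    if failure_policy == "Fail":
        return "rejected: webhook unavailable"
    if failure_policy == "Ignore":
        return "admitted: webhook skipped"
    raise ValueError(f"unknown failurePolicy: {failure_policy}")


for answer in (True, False, None):
    for policy in ("Fail", "Ignore"):
        print(answer, policy, "->", admission_outcome(answer, policy))
```

The trade-off is visible in the two `None` rows: Fail protects the policy at the cost of availability, while Ignore keeps writes flowing but lets unchecked objects through.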
Manifest Anatomy
This small Deployment shows the fields Kubernetes internals care about most: identity, desired state, selectors, pod template, and status produced by controllers.
apiVersion: apps/v1 # API group and version handled by the API server.
kind: Deployment # Resource type watched by the Deployment controller.
metadata:
  name: web-demo # Object name inside the namespace.
  namespace: demo # Namespace boundary for this object.
  labels:
    app: web-demo # Labels help humans, selectors, and tooling find this object.
spec:
  replicas: 3 # Desired number of Pods. Controller reconciles actual count to this.
  selector:
    matchLabels:
      app: web-demo # Must match template labels; changing this later is not allowed.
  strategy:
    type: RollingUpdate # Update Pods gradually instead of deleting all at once.
    rollingUpdate:
      maxUnavailable: 1 # At most one replica can be unavailable during rollout.
      maxSurge: 1 # At most one extra replica can exist during rollout.
  template:
    metadata:
      labels:
        app: web-demo # ReplicaSet uses this label to manage Pods.
    spec:
      containers:
        - name: web
          image: nginx:1.27 # Replace with your approved application image.
          ports:
            - containerPort: 80 # Informational for humans and some tooling.
          readinessProbe:
            httpGet:
              path: / # Endpoint that proves the Pod can receive traffic.
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            requests:
              cpu: 100m # Scheduler uses requests to place Pods on Nodes.
              memory: 128Mi
            limits:
              cpu: 500m # Kubelet/container runtime enforces limits.
              memory: 256Mi
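The rollout settings in the manifest translate into concrete bounds on how many Pods can exist and how many must stay available. The sketch below works through that arithmetic; `rollout_bounds` and `resolve` are hypothetical helpers, and the rounding follows the documented Deployment behavior where a percentage maxSurge rounds up and a percentage maxUnavailable rounds down. Edge cases such as both values resolving to zero are not modeled.

```python
def resolve(value, replicas: int, round_up: bool) -> int:
    """Turn an absolute number or a percentage string into a Pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = int(value[:-1]) * replicas
        # Integer math avoids floating-point surprises in the rounding.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas: int, max_unavailable, max_surge) -> dict:
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    surge = resolve(max_surge, replicas, round_up=True)
    return {
        "min_available": replicas - unavailable,  # floor on available Pods
        "max_total": replicas + surge,            # ceiling on old plus new Pods
    }


# Values from the manifest: replicas 3, maxUnavailable 1, maxSurge 1.
print(rollout_bounds(3, 1, 1))
# Percentage form on a larger Deployment.
print(rollout_bounds(10, "25%", "25%"))
```

For the manifest's values, the rollout never drops below two available Pods and never runs more than four in total.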