Kubernetes Internals — K8s SRE Reference

TL;DR

Kubernetes is a reconciliation system. Users submit desired state to the API server, admission controls validate or mutate it, etcd stores it, controllers watch for changes, and node agents make the real world match the desired state.

Mental Model

Think of Kubernetes as a database-backed API with many controllers attached. The API server is the front door, etcd is the durable state store, and every controller is a loop that watches objects and takes action when actual state differs from desired state.

Desired state

What you ask for through YAML manifests, Helm, Terraform, Argo CD, or kubectl.

Observed state

What the cluster reports back through status fields, events, metrics, and logs.

Reconciliation

The repeated loop that compares desired and observed state, then corrects drift.
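
The loop can be sketched for a single number, a replica count. This is an illustrative model only: `reconcile` is a hypothetical helper, and real controllers compare whole API objects rather than two integers.

```shell
# reconcile DESIRED OBSERVED
# Prints the corrective action a controller would take for a replica count.
# Hypothetical helper for illustration; not part of any Kubernetes tooling.
reconcile() (
  desired=$1
  observed=$2
  if [ "$observed" -lt "$desired" ]; then
    echo "create $((desired - observed))"
  elif [ "$observed" -gt "$desired" ]; then
    echo "delete $((observed - desired))"
  else
    echo "noop"
  fi
)

# One pass of the loop: 3 replicas wanted, 1 running.
reconcile 3 1   # prints: create 2
```

Running the same function again after the action has taken effect prints `noop`, which is the property that makes repeated reconciliation safe.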

API Request Path

Almost every Kubernetes operation follows the same control-plane path. If a create, update, or delete fails, walk this path in order.

Figure: API write path and reconciliation flow. The API server is the only component that should read from or write to etcd directly.

What Happens During `kubectl apply`

1. The client sends an HTTPS request to the API server.
2. The API server authenticates who you are.
3. Authorization, usually RBAC, checks whether you may perform that verb on that resource.
4. Admission controllers mutate or validate the object.
5. The object is persisted in etcd.
6. Controllers and kubelets observe the change through watches and reconcile.
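
When a request fails, the HTTP status code is a useful first hint about which step rejected it. The sketch below maps common codes to a likely stage. `classify_status` is a hypothetical helper and the mapping is a triage heuristic, not an exhaustive rule: a 403, for example, can come from RBAC or from an admission plugin.

```shell
# classify_status HTTP_CODE
# Prints the stage of the API request path most likely responsible.
# Hypothetical triage helper; the mapping is a first guess, not a guarantee.
classify_status() (
  case "$1" in
    401) echo "authentication" ;;
    403) echo "authorization or admission denial" ;;
    409) echo "conflict with stored object" ;;
    422) echo "object validation" ;;
    429) echo "API server throttling" ;;
    5??) echo "API server, webhook, or etcd backend" ;;
    *)   echo "unclassified" ;;
  esac
)

classify_status 403   # prints: authorization or admission denial
```

kubectl reports the response code at higher verbosity, for example with `-v=6`, so the code can be fed into a helper like this during triage.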

bash trace-api-access.sh

# Show the active kubeconfig context so you do not inspect the wrong cluster.
kubectl config current-context

# Confirm the API server endpoint used by the current context.
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'; echo

# Ask the API server what versions and resources it serves.
kubectl version
kubectl api-resources | head -n 30

# Check if your current identity can perform a specific action.
# Replace <namespace> with the target namespace before running this.
kubectl auth can-i create deployments -n <namespace>
kubectl auth can-i patch deployments -n <namespace>
kubectl auth can-i get secrets -n <namespace>

etcd

etcd is the strongly consistent key-value database that stores Kubernetes cluster state. If etcd is unhealthy, the cluster may still run existing workloads for a while, but new scheduling, updates, leader elections, and control-plane writes become unreliable or unavailable.

Production caution: Do not exec into etcd or run snapshot commands in a client cluster unless you are following that client's runbook. Managed clusters such as EKS, GKE, and AKS usually do not expose etcd directly.

Symptom	Likely Area	First Checks
Writes fail or hang	API server to etcd path	API server health, etcd leader, disk latency, quorum.
Objects disappear after restart	etcd persistence	etcd data dir, disk mount, backup/restore process.
High API latency	etcd or API overload	apiserver request metrics, etcd fsync latency, watch count.
Control plane loses quorum	etcd cluster membership	member list, leader status, network between control-plane nodes.
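
The quorum row above follows from simple arithmetic: an etcd cluster needs a majority of its members, floor(n/2) + 1, to accept writes. A minimal sketch, where `etcd_quorum` is a hypothetical helper name:

```shell
# etcd_quorum MEMBERS
# Prints the majority needed for writes and how many members may fail.
# Hypothetical helper for illustration.
etcd_quorum() (
  members=$1
  quorum=$((members / 2 + 1))
  echo "members=$members quorum=$quorum tolerated_failures=$((members - quorum))"
)

etcd_quorum 3   # prints: members=3 quorum=2 tolerated_failures=1
etcd_quorum 4   # prints: members=4 quorum=3 tolerated_failures=1
```

A fourth member raises the quorum without raising the number of failures the cluster survives, which is why odd member counts are the usual choice.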

bash etcd-control-plane-checks.sh

# For kubeadm-style clusters where control-plane components run as static pods.
# Managed Kubernetes providers usually hide these pods from you.
kubectl get pods -n kube-system -l component=etcd -o wide
kubectl get pods -n kube-system -l component=kube-apiserver -o wide

# Look for restart loops, failed probes, certificate errors, or disk errors.
kubectl describe pod -n kube-system -l component=etcd
kubectl logs -n kube-system -l component=etcd --tail=100

# Read recent control-plane events. Events often reveal probe failures or node pressure.
kubectl get events -n kube-system --sort-by=.lastTimestamp | tail -n 40

etcd Snapshot Pattern

This is the shape of an etcd snapshot command for self-managed clusters. Treat certificate paths and endpoints as environment-specific.

bash snapshot-example.sh

# Run only on a self-managed control-plane node with etcdctl installed.
# ETCDCTL_API=3 selects the v3 API used by modern Kubernetes clusters.
export ETCDCTL_API=3

# Save a point-in-time backup of etcd state.
# Replace certificate paths if the cluster does not use kubeadm defaults.
etcdctl snapshot save /var/backups/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

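# Verify the snapshot file before trusting it.
# On etcd 3.5 and later, `etcdutl snapshot status` is the preferred form; the etcdctl subcommand is deprecated there.
etcdctl snapshot status /var/backups/etcd-snapshot.db --write-out=table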

Admission Control

Admission happens after authentication and authorization but before persistence to etcd. Mutating admission can change the object. Validating admission can reject it. This is where policies, defaulting, image rules, sidecar injection, and pod security checks often happen.

yaml validating-admission-policy.yaml

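# Example only: requires admissionregistration.k8s.io/v1 ValidatingAdmissionPolicy (stable in Kubernetes 1.30 and later).
# A policy is only enforced once a ValidatingAdmissionPolicyBinding references it.
# Purpose: reject Pods that do not define CPU and memory limits.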
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-container-resource-limits
spec:
  failurePolicy: Fail # Fail closed: reject if this policy cannot be evaluated.
  matchConstraints:
    resourceRules:
      - apiGroups: [""] # Empty string means the core API group.
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods"]
  validations:
    - expression: "object.spec.containers.all(c, has(c.resources) && has(c.resources.limits) && has(c.resources.limits.cpu) && has(c.resources.limits.memory))"
      message: "Every container must set CPU and memory limits."
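
The policy above does nothing until a binding selects it. The sketch below writes a minimal binding manifest to a file. The binding name and the `write_binding` helper are assumptions made for illustration. With no `matchResources` set, the binding applies wherever the policy matches, including system namespaces, so treat it as a starting shape rather than something to apply as-is.

```shell
# write_binding FILE
# Writes a minimal ValidatingAdmissionPolicyBinding for the policy named
# require-container-resource-limits. The binding name is an assumption.
write_binding() {
  cat > "$1" <<'EOF'
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-container-resource-limits-binding
spec:
  policyName: require-container-resource-limits
  validationActions: ["Deny"]
EOF
}

# Write the manifest to a temporary file and show it.
out=$(mktemp)
write_binding "$out"
cat "$out"
```

Starting with `validationActions: ["Warn"]` or `["Audit"]` instead of `["Deny"]` is a lower-risk way to see what the policy would reject before it blocks anything.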

bash admission-debug.sh

# List admission webhooks. These are common sources of failed creates and updates.
kubectl get mutatingwebhookconfigurations
kubectl get validatingwebhookconfigurations

# Inspect a specific webhook for timeoutSeconds, failurePolicy, namespaceSelector, and service name.
kubectl describe validatingwebhookconfiguration <webhook-name>

# Server-side dry run asks the API server and admission chain to validate the file.
# It does not persist the object if validation succeeds.
kubectl apply --dry-run=server -f <file.yaml>

Watches And Controllers

Controllers do not constantly scan the whole cluster. They watch API objects, receive change notifications, enqueue work, and reconcile resources. A Deployment controller creates ReplicaSets; a ReplicaSet controller creates Pods; the scheduler binds Pods to Nodes; kubelet starts containers and reports status.
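
The "enqueue work" step is worth a small sketch. Controllers typically enqueue an object key such as `namespace/name` rather than the event itself, so several rapid changes to one object collapse into a single reconcile. The function below loosely models that coalescing; `coalesce_events` is a hypothetical name, and a real work queue also handles retries and rate limiting.

```shell
# coalesce_events
# Reads one "namespace/name" key per line from stdin and prints each key
# once, in first-seen order. Loose model of work-queue deduplication.
coalesce_events() {
  awk '!seen[$0]++'
}

# Four change notifications, two distinct objects.
printf '%s\n' demo/web-demo demo/web-demo demo/api demo/web-demo | coalesce_events
# prints:
# demo/web-demo
# demo/api
```

Because the reconcile step reads current state for the key, it does not matter which of the coalesced events triggered it.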

bash follow-reconciliation.sh

# Replace <namespace>, <deployment>, and <app-label> with the app you are tracing.
kubectl get deployment <deployment> -n <namespace> -o wide

# Show the ReplicaSets created by the Deployment.
kubectl get rs -n <namespace> -l app=<app-label> -o wide

# Show Pods owned by those ReplicaSets.
kubectl get pods -n <namespace> -l app=<app-label> -o wide

# Describe reveals conditions, events, rollout strategy, and failure messages.
kubectl describe deployment <deployment> -n <namespace>

# Watch status changes as the controllers reconcile.
kubectl get deploy,rs,pods -n <namespace> -l app=<app-label> --watch

Resource Versions

Every Kubernetes object has metadata such as uid, generation, and resourceVersion. Controllers use these fields to detect changes, avoid stale updates, and know whether status has caught up to spec.

Field	Meaning	SRE Use
`metadata.uid`	Stable identity for this specific object instance.	Useful when names are reused after delete/recreate.
`metadata.resourceVersion`	Opaque version of the object's stored state, used by watches and optimistic concurrency.	Useful for understanding watch behavior and 409 Conflict errors; treat it as opaque and do not edit it by hand.
`metadata.generation`	Increments when desired spec changes.	Compare with observedGeneration to see if a controller caught up.
`status.observedGeneration`	The latest generation the controller has processed.	If lower than generation, the controller has not reconciled the latest spec.
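
The generation comparison in the last two rows can be written as a small check. `controller_status` is a hypothetical helper; it treats a missing observedGeneration as 0, which is an assumption about objects whose status has not been populated yet.

```shell
# controller_status GENERATION [OBSERVED_GENERATION]
# Prints whether the controller has processed the latest spec.
# Hypothetical helper; an empty or missing observed value counts as 0.
controller_status() (
  generation=$1
  observed=${2:-0}
  if [ "$observed" -ge "$generation" ]; then
    echo "reconciled"
  else
    echo "lagging by $((generation - observed))"
  fi
)

controller_status 7 7   # prints: reconciled
controller_status 7 5   # prints: lagging by 2
```

A matching generation only means the controller has seen the latest spec; the rollout itself may still be in progress, so conditions remain the place to check for completion.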

bash metadata-check.sh

# Show key metadata and controller progress for a Deployment.
kubectl get deploy <deployment> -n <namespace> \
  -o jsonpath='uid={.metadata.uid} generation={.metadata.generation} observed={.status.observedGeneration} resourceVersion={.metadata.resourceVersion}{"\n"}'

# Print conditions in a readable table.
kubectl get deploy <deployment> -n <namespace> \
  -o custom-columns='TYPE:.status.conditions[*].type,STATUS:.status.conditions[*].status,REASON:.status.conditions[*].reason'

Debugging The Control Plane

Use these checks when the symptom looks bigger than one workload: failed API writes, many namespaces affected, stuck scheduling, admission timeouts, or cluster-wide delays.

bash control-plane-triage.sh

# 1. Basic reachability. If this fails, check kubeconfig, VPN, DNS, or API endpoint.
kubectl cluster-info
kubectl get --raw='/readyz?verbose'

# 2. Control-plane pods for kubeadm/self-managed clusters.
kubectl get pods -n kube-system -o wide | grep -E 'kube-apiserver|kube-controller-manager|kube-scheduler|etcd'

# 3. APIService health. Broken aggregated APIs can make kubectl slow or noisy.
kubectl get apiservice | grep -v True

# 4. Webhook health. Look for services that no longer exist or very short timeouts.
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations

# 5. Recent kube-system events.
kubectl get events -n kube-system --sort-by=.lastTimestamp | tail -n 50

Common Failure Patterns

Admission webhook timeout: object create/update hangs or fails because the webhook service is down or DNS is broken and the webhook uses failurePolicy: Fail.
RBAC denial: the API request reaches the server but is rejected with Forbidden. Use kubectl auth can-i with the same user or service account.
Controller lag: spec changed but status does not move. Check controller-manager health, events, and observedGeneration.
etcd pressure: many unrelated writes slow down, leader changes increase, or apiserver request latency rises.

Manifest Anatomy

This small Deployment shows the fields Kubernetes internals care about most: identity, desired state, selectors, pod template, and status produced by controllers.

yaml deployment-internals-demo.yaml

apiVersion: apps/v1 # API group and version handled by the API server.
kind: Deployment # Resource type watched by the Deployment controller.
metadata:
  name: web-demo # Object name inside the namespace.
  namespace: demo # Namespace boundary for this object.
  labels:
    app: web-demo # Labels help humans, selectors, and tooling find this object.
spec:
  replicas: 3 # Desired number of Pods. Controller reconciles actual count to this.
  selector:
    matchLabels:
      app: web-demo # Must match template labels; changing this later is not allowed.
  strategy:
    type: RollingUpdate # Update Pods gradually instead of deleting all at once.
    rollingUpdate:
      maxUnavailable: 1 # At most one replica can be unavailable during rollout.
      maxSurge: 1 # At most one extra replica can exist during rollout.
  template:
    metadata:
      labels:
        app: web-demo # ReplicaSet uses this label to manage Pods.
    spec:
      containers:
        - name: web
          image: nginx:1.27 # Replace with your approved application image.
          ports:
            - containerPort: 80 # Informational for humans and some tooling.
          readinessProbe:
            httpGet:
              path: / # Endpoint that proves the Pod can receive traffic.
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            requests:
              cpu: 100m # Scheduler uses requests to place Pods on Nodes.
              memory: 128Mi
            limits:
              cpu: 500m # Kubelet/container runtime enforces limits.
              memory: 256Mi
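
The rollout settings in this manifest translate into two bounds on Pod count during an update. A minimal sketch, assuming absolute numbers rather than percentages; `rollout_bounds` is a hypothetical helper:

```shell
# rollout_bounds REPLICAS MAX_UNAVAILABLE MAX_SURGE
# Prints the fewest available and the most total Pods allowed mid-rollout.
# Hypothetical helper; handles absolute values only, not percentages.
rollout_bounds() (
  replicas=$1
  max_unavailable=$2
  max_surge=$3
  echo "min_available=$((replicas - max_unavailable)) max_total=$((replicas + max_surge))"
)

# Values from the manifest above: replicas 3, maxUnavailable 1, maxSurge 1.
rollout_bounds 3 1 1   # prints: min_available=2 max_total=4
```

The upper bound matters for capacity planning: the namespace quota and the nodes need room for the surge Pods, or the rollout stalls.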
