Kubernetes Architecture — K8s SRE Reference

TL;DR

Kubernetes is a distributed system split into a Control Plane (the brain — manages state, scheduling, and the API) and Worker Nodes (the muscle — run your actual workloads via kubelet + container runtime). Everything communicates through the API server; all desired state lives in etcd.

High-Level Architecture

The diagram below shows a production-grade multi-node cluster. The control plane components are typically co-located on dedicated nodes (running as static pods on kubeadm clusters, or hidden behind a provider-managed endpoint in offerings like EKS), while worker nodes can number in the thousands.

Figure 1 — Kubernetes cluster: Control Plane components and two Worker Nodes. Dashed lines = internal API server fan-out. Solid blue = kubelet communication.

Control Plane Components

The control plane is the set of processes that make global decisions about the cluster — scheduling, detecting and responding to cluster events. In production, control plane components are replicated across multiple nodes (odd number: 3 or 5) for HA.

kube-apiserver

The front door to the cluster. All communication (internal and external) goes through it. Validates and processes REST requests, then writes to etcd. The only component that talks directly to etcd.

etcd

Distributed, consistent key-value store. Everything in Kubernetes — pods, services, configmaps, secrets, RBAC rules — is persisted here. Backing this up is critical.

kube-scheduler

Watches for newly created pods with no assigned node. Selects a node based on resource requirements, affinity rules, taints/tolerations, and available capacity.

controller-manager

Runs all built-in controllers as goroutines in a single binary — Node, Deployment, ReplicaSet, Endpoints, ServiceAccount, and more. Each controller reconciles actual vs desired state.

cloud-controller-manager

Separates cloud-specific logic from core Kubernetes. Manages Node lifecycle, Route configuration, and LoadBalancer provisioning via the cloud provider's API (AWS, GCP, Azure).

Worker Node Components

Worker nodes do the actual work — they run your containers. Each node runs three critical processes:

Component	Role	Port	Notes
`kubelet`	Node agent. Receives PodSpecs from apiserver, ensures containers are running and healthy. Reports node status back.	10250	Also exposes `/metrics` for Prometheus
`kube-proxy`	Network proxy. Maintains iptables / ipvs rules so Service ClusterIPs route to correct pod endpoints.	10256	Can be replaced by Cilium in eBPF mode
Container Runtime	Actually pulls images and runs containers (OCI spec). Talks to kubelet via CRI (Container Runtime Interface).	—	containerd (default), CRI-O

What Happens When You Run `kubectl apply`

Understanding this flow is essential for debugging and for interviews:

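kubectl converts your manifest to JSON and, for a Deployment that does not exist yet, sends a POST /apis/apps/v1/namespaces/default/deployments request to the API server over TLS (an existing object is updated with PATCH instead).
kube-apiserver authenticates (cert / token / OIDC), authorizes (RBAC), and runs admission control: mutating webhooks first, then schema validation, then validating webhooks (e.g. ValidatingWebhookConfiguration).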
The validated object is persisted to etcd. At this point the API server returns 201 Created to kubectl.
Deployment controller (inside controller-manager) watches the API server through an informer (it never reads etcd directly) and detects the new Deployment. It creates a ReplicaSet.
ReplicaSet controller creates Pod objects (still unscheduled — nodeName: "").
kube-scheduler watches for pending pods, scores nodes, and writes the chosen nodeName back to the pod via the API server.
kubelet on the target node watches its assigned pods, calls containerd via CRI to pull the image and start the container.
kubelet reports container status back to the API server; kubectl get pods now shows Running.
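
The request path in the first step follows a fixed pattern: core-group resources live under /api, and named API groups such as apps live under /apis/<group>. The sketch below builds that path; the function name api_path is hypothetical and only illustrates the URL layout.

```shell
# Build the REST collection path for a namespaced resource.
# Core-group resources (empty group) live under /api;
# named groups (apps, batch, ...) live under /apis/<group>.
api_path() {
  group=$1 version=$2 namespace=$3 resource=$4
  if [ -z "$group" ]; then
    echo "/api/${version}/namespaces/${namespace}/${resource}"
  else
    echo "/apis/${group}/${version}/namespaces/${namespace}/${resource}"
  fi
}

# Usage against a live cluster (not run here):
#   kubectl get --raw "$(api_path apps v1 default deployments)"
```

Reading the same path back with kubectl get --raw is a quick way to see exactly what the API server stored after admission.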

💡

Interview tip: This end-to-end flow is one of the most common SRE/platform interview questions. Be able to walk through it from kubectl to a running container; mentioning informers, the watch mechanism, and admission controllers sets you apart.

Key Commands

Inspect Control Plane Health

bash

# Check control plane component status
kubectl get componentstatuses        # deprecated since v1.19; mainly useful on older clusters
kubectl get pods -n kube-system      # control plane pods plus system add-ons

# API server health
kubectl get --raw /healthz
kubectl get --raw /readyz
kubectl get --raw /livez

# etcd health (from inside the etcd pod)
etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Check node conditions (-A8 leaves room for the header rows plus all conditions, including Ready)
kubectl describe node <node-name> | grep -A8 Conditions
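
The /readyz and /livez endpoints also accept a verbose query parameter that lists each individual check, prefixed with [+] when it passes and [-] when it fails. A minimal sketch for pulling out only the failing checks; the function name failed_checks is hypothetical, and whether the listing arrives on stdout or stderr when a check fails is an assumption worth verifying for your kubectl version.

```shell
# Print the names of failing checks from verbose /readyz or /livez output.
# Reads the check listing on stdin; failing checks start with "[-]".
failed_checks() {
  sed -n 's/^\[-\]\([^ ]*\).*/\1/p'
}

# Usage against a live cluster (not run here; 2>&1 is a precaution
# in case the listing is printed on stderr when a check fails):
#   kubectl get --raw '/readyz?verbose' 2>&1 | failed_checks
```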

Inspect the Cluster

bash

# Get cluster info
kubectl cluster-info
kubectl cluster-info dump > cluster-dump.txt   # full diagnostic dump

# List nodes with resource info
kubectl get nodes -o wide
kubectl top nodes                               # requires metrics-server

# View all resources in a namespace
kubectl get all -n <namespace>

# Get events (sorted by time — great for debugging)
kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

# Force delete stuck namespace (use with caution)
kubectl get namespace <ns> -o json \
  | tr -d "\n" | sed "s/\"finalizers\": \[[^]]\+\]/\"finalizers\": []/" \
  | kubectl replace --raw /api/v1/namespaces/<ns>/finalize -f -
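
To see what the force-delete pipeline actually changes before pointing it at a cluster, the text transformation can be tried on a sample object. This is a sketch of the same idea using sed -E for portability; the function name strip_finalizers and the sample JSON are hypothetical.

```shell
# Flatten a namespace object to one line and empty its first non-empty
# finalizers array. Reads JSON on stdin, writes the edited JSON to stdout.
strip_finalizers() {
  tr -d '\n' | sed -E 's/"finalizers": \[[^]]+\]/"finalizers": []/'
}

# Dry run on a sample object (no cluster needed):
printf '{"spec": {"finalizers": ["kubernetes"]}}\n' | strip_finalizers
```

Only the first match is rewritten, so an object carrying both metadata.finalizers and spec.finalizers should be inspected by hand before it is sent to the finalize endpoint.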

etcd — Backup & Restore

etcd holds all cluster state. Losing it without a backup means losing your entire cluster configuration. In production, automate this daily.

⚠️

Critical: etcd backup is your disaster recovery plan, especially on on-prem clusters. Managed offerings like EKS handle this for you, but you should verify the retention policy.

bash etcd-backup.sh

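#!/bin/bash
# etcd snapshot backup (run on a control plane node)
set -euo pipefail   # stop on the first failed command or unset variable

ETCD_CERTS="/etc/kubernetes/pki/etcd"
BACKUP_PATH="/backup/etcd-$(date +%Y%m%d-%H%M%S).db"

# Take the snapshot from the local etcd member
etcdctl snapshot save "$BACKUP_PATH" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert="${ETCD_CERTS}/ca.crt" \
  --cert="${ETCD_CERTS}/server.crt" \
  --key="${ETCD_CERTS}/server.key"

# Verify the snapshot (hash, revision, key count, size)
etcdctl snapshot status "$BACKUP_PATH" --write-out=table

# Restore is a separate, manual procedure: never run it as part of the backup job.
# Run it on all control plane nodes, each with its own --name and peer URL,
# then point etcd at the new data dir and restart etcd.
# Note: on etcd 3.5+ the status and restore subcommands are deprecated in favor of etcdutl.
#
# etcdctl snapshot restore "$BACKUP_PATH" \
#   --data-dir=/var/lib/etcd-restore \
#   --name=master \
#   --initial-cluster=master=https://<node-ip>:2380 \
#   --initial-advertise-peer-urls=https://<node-ip>:2380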
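
A daily backup job accumulates snapshots, so it usually needs a retention step as well. A minimal sketch, assuming the /backup/etcd-<timestamp>.db naming used above (timestamps sort oldest first); the function name prune_backups and the keep count are hypothetical.

```shell
# Keep only the newest $2 snapshots in directory $1.
# Relies on the etcd-YYYYMMDD-HHMMSS.db naming: glob expansion is sorted,
# so the oldest snapshot always comes first.
prune_backups() {
  dir=$1
  keep=$2
  set -- "$dir"/etcd-*.db      # reuse the function's positional parameters as the file list
  [ -e "$1" ] || return 0      # no snapshots at all: nothing to prune
  while [ "$#" -gt "$keep" ]; do
    rm -f -- "$1"              # delete the oldest remaining snapshot
    shift
  done
}

# Usage (not run here):
#   prune_backups /backup 7
```

Pruning after a successful snapshot save, rather than before, avoids ending up with fewer good backups than intended when a save fails.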

High-Availability Control Plane

Production clusters run an odd number of control plane nodes (3 or 5) behind a load balancer. etcd uses the Raft consensus algorithm: a cluster of n members needs a quorum of floor(n/2)+1 to elect a leader and commit writes, so 3 members tolerate one failure and 5 tolerate two.
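
The quorum arithmetic is easy to check by hand. The helper names quorum and fault_tolerance below are hypothetical; shell integer division already rounds down, which is the behavior the formula needs.

```shell
# Majority needed for an etcd cluster of $1 members: floor(n/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while a quorum still exists.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Compare cluster sizes side by side.
for n in 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

A 4-member cluster tolerates the same single failure as a 3-member one while needing a larger quorum, which is why even sizes add cost without adding resilience.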

Figure 2 — HA Control Plane with 3 nodes. etcd Raft leader (blue) replicates to followers. All apiservers are active — the LB distributes kubectl traffic.

Common Gotchas

🔴
etcd disk I/O is your bottleneck. etcd is extremely sensitive to slow disk. Use SSDs, monitor etcd_disk_wal_fsync_duration_seconds in Prometheus. High latency here causes API server timeouts and cascading failures.
🔴
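Clock skew causes hard-to-diagnose control plane failures. Raft election timeouts run on each member's local timer rather than on wall-clock agreement, but etcd warns when the clock difference between peers grows too large, and skew also breaks TLS certificate validity and token expiry checks. Run NTP (chrony) on all control plane nodes.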
🟡
kube-scheduler runs active-passive. Only one scheduler instance is active at a time (leader election). If the leader crashes, pods stay Pending until a standby acquires the lease. This is not a silent failure, but expect a delay of roughly one lease duration (15s by default).
🟡
Admission webhook timeouts block deploys. If a ValidatingWebhookConfiguration or MutatingWebhookConfiguration endpoint is down and failurePolicy is Fail, every request matching that webhook's rules is rejected, cluster-wide unless the rules are scoped with a namespaceSelector.
🟢
The API server is stateless. It keeps no durable state of its own: everything is persisted in etcd, and its in-memory watch cache is rebuilt on startup. You can restart it freely, and it recovers quickly. etcd is what you can't afford to lose.
