Kubernetes Architecture & Control Plane
Kubernetes is a distributed system split into a Control Plane (the brain — manages state, scheduling, and the API) and Worker Nodes (the muscle — run your actual workloads via kubelet + container runtime). Everything communicates through the API server; all desired state lives in etcd.
High-Level Architecture
The diagram below shows a production-grade multi-node cluster. The control plane components are typically co-located on dedicated nodes (kubeadm runs them as static pods) or hosted by the provider in managed offerings like EKS, while worker nodes can number in the thousands.
Figure 1 — Kubernetes cluster: Control Plane components and two Worker Nodes. Dashed lines = internal API server fan-out. Solid blue = kubelet communication.
Control Plane Components
The control plane is the set of processes that make global decisions about the cluster — scheduling, detecting and responding to cluster events. In production, control plane components are replicated across multiple nodes (odd number: 3 or 5) for HA.
kube-apiserver
The front door to the cluster. All communication (internal and external) goes through it. Validates and processes REST requests, then writes to etcd. The only component that talks directly to etcd.
etcd
Distributed, consistent key-value store. Everything in Kubernetes — pods, services, configmaps, secrets, RBAC rules — is persisted here. Backing this up is critical.
kube-scheduler
Watches for newly created pods with no assigned node. Selects a node based on resource requirements, affinity rules, taints/tolerations, and available capacity.
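The filter-then-score idea can be sketched in a few lines of shell. This is a toy model, not kube-scheduler's actual algorithm: the `pick_node` function, its input format (one `name free_cpu_millicores free_mem_mib` line per node), and the leftover-capacity score are all assumptions made for illustration, and affinity rules and taints are ignored.

```shell
#!/usr/bin/env bash
# Toy filter-and-score pass (illustrative only; not kube-scheduler's real logic).
# stdin: one node per line as "<name> <free_cpu_millicores> <free_mem_mib>"
# args:  requested CPU (millicores) and memory (MiB) for the pod
pick_node() {
  local req_cpu=$1 req_mem=$2
  awk -v c="$req_cpu" -v m="$req_mem" '
    $2 >= c && $3 >= m {                 # filter: the pod must fit on the node
      score = ($2 - c) + ($3 - m)        # score: crude "leftover capacity" sum
      if (best == "" || score > best_score) { best = $1; best_score = score }
    }
    END { if (best == "") exit 1; print best }   # exit 1 means the pod stays Pending
  '
}
```

For example, `printf 'node-a 500 1024\nnode-b 2000 4096\n' | pick_node 250 512` prints `node-b`, because both nodes pass the filter and node-b has more capacity left over.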
kube-controller-manager
Runs all built-in controllers as goroutines in a single binary — Node, Deployment, ReplicaSet, Endpoints, ServiceAccount, and more. Each controller reconciles actual vs desired state.
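A single reconcile step can be pictured as a comparison between two numbers. The sketch below is a simplified model of what a ReplicaSet-style controller decides on each pass. The `reconcile` function and its output strings are hypothetical, and real controllers work on API objects rather than bare counts.

```shell
#!/usr/bin/env bash
# One reconcile step for a replica count (toy model of a controller loop).
# Prints the action needed to move actual state toward desired state.
reconcile() {
  local desired=$1 actual=$2
  if (( actual < desired )); then
    echo "create $(( desired - actual ))"
  elif (( actual > desired )); then
    echo "delete $(( actual - desired ))"
  else
    echo "noop"
  fi
}
```

Calling `reconcile 3 1` prints `create 2`. A real controller repeats this comparison every time a watched object changes.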
cloud-controller-manager
Separates cloud-specific logic from core Kubernetes. Manages Node lifecycle, Route configuration, and LoadBalancer provisioning via the cloud provider's API (AWS, GCP, Azure).
Worker Node Components
Worker nodes do the actual work — they run your containers. Each node runs three critical processes:
| Component | Role | Port | Notes |
|---|---|---|---|
| kubelet | Node agent. Receives PodSpecs from apiserver, ensures containers are running and healthy. Reports node status back. | 10250 | Also exposes /metrics for Prometheus |
| kube-proxy | Network proxy. Maintains iptables / ipvs rules so Service ClusterIPs route to correct pod endpoints. | 10256 | Can be replaced by Cilium in eBPF mode |
| Container Runtime | Actually pulls images and runs containers (OCI spec). Talks to kubelet via CRI (Container Runtime Interface). | n/a | containerd (default), CRI-O |
What Happens When You Run kubectl apply
Understanding this flow is essential for debugging and for interviews:
- kubectl serializes your manifest to JSON and, for a Deployment that does not exist yet, sends a POST /apis/apps/v1/namespaces/default/deployments request to the API server over TLS.
- kube-apiserver authenticates (cert / token / OIDC), authorizes (RBAC), and runs admission controllers (e.g. webhooks registered via ValidatingWebhookConfiguration).
- The validated object is persisted to etcd. At this point the API server returns 201 Created to kubectl.
- Deployment controller (inside controller-manager) watches the API server through an informer (it never reads etcd directly) and detects the new Deployment. It creates a ReplicaSet.
- ReplicaSet controller creates Pod objects (still unscheduled: nodeName is empty).
- kube-scheduler watches for pending pods, filters and scores nodes, and writes the chosen nodeName back to the pod via the API server.
- kubelet on the target node watches its assigned pods, calls containerd via CRI to pull the image and start the container.
- kubelet reports container status back to the API server; kubectl get pods now shows Running.
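The chain of objects produced by this flow can be inspected after the fact. The helper below is a minimal sketch, assuming kubectl is configured against a cluster. `trace_rollout` is a hypothetical name, and it lists every ReplicaSet and pod in the namespace rather than filtering by the Deployment's labels.

```shell
#!/usr/bin/env bash
# Walk the object chain created by applying a Deployment.
# trace_rollout is a hypothetical helper; it needs kubectl access to a cluster.
trace_rollout() {
  local name=$1 ns=${2:-default}
  kubectl get deployment "$name" -n "$ns" -o wide      # object persisted by the API server
  kubectl get replicasets -n "$ns" -o wide             # created by the Deployment controller
  kubectl get pods -n "$ns" -o wide                    # NODE column shows the scheduler's choice
  kubectl get events -n "$ns" --sort-by=.metadata.creationTimestamp   # scheduling, image pull, start
}
```

To watch the first step itself, running the apply with a high verbosity flag such as `-v=8` makes kubectl log the HTTP requests it sends to the API server.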
Key Commands
Inspect Control Plane Health
# Check control plane component status
kubectl get componentstatuses # deprecated but still works on older clusters
kubectl get pods -n kube-system # control plane pods, plus add-ons such as CoreDNS and kube-proxy
# API server health
kubectl get --raw /healthz
kubectl get --raw /readyz
kubectl get --raw /livez
# etcd health (from inside the etcd pod)
etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Check node conditions
kubectl describe node <node-name> | grep -A5 Conditions
Inspect the Cluster
# Get cluster info
kubectl cluster-info
kubectl cluster-info dump > cluster-dump.txt # full diagnostic dump
# List nodes with resource info
kubectl get nodes -o wide
kubectl top nodes # requires metrics-server
# View all resources in a namespace
kubectl get all -n <namespace>
# Get events (sorted by time — great for debugging)
kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>
# Force delete stuck namespace (use with caution)
kubectl get namespace <ns> -o json \
| tr -d "\n" | sed "s/\"finalizers\": \[[^]]\+\]/\"finalizers\": []/" \
| kubectl replace --raw /api/v1/namespaces/<ns>/finalize -f -
etcd — Backup & Restore
etcd holds all cluster state. Losing it without a backup means losing your entire cluster configuration. In production, automate this daily.
#!/bin/bash
# etcd snapshot backup
ETCD_CERTS="/etc/kubernetes/pki/etcd"
BACKUP_PATH="/backup/etcd-$(date +%Y%m%d-%H%M%S).db"
etcdctl snapshot save "$BACKUP_PATH" \
--endpoints=https://127.0.0.1:2379 \
--cacert="${ETCD_CERTS}/ca.crt" \
--cert="${ETCD_CERTS}/server.crt" \
--key="${ETCD_CERTS}/server.key"
etcdctl snapshot status "$BACKUP_PATH" --write-out=table
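# Restore (run on every control plane node with that node's own --name and peer URL,
# then point etcd at the new --data-dir and restart it).
# For a multi-member cluster, --initial-cluster must list every member.
# On etcd 3.5+, "etcdutl snapshot restore" is the preferred form of this command.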
etcdctl snapshot restore "$BACKUP_PATH" \
--data-dir=/var/lib/etcd-restore \
--name=master \
--initial-cluster=master=https://<node-ip>:2380 \
--initial-advertise-peer-urls=https://<node-ip>:2380
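A daily backup job also needs some form of retention, or the backup volume eventually fills up. The following is one possible sketch. The `prune_snapshots` name and the 7-day default are assumptions, and the `etcd-*.db` pattern simply mirrors the BACKUP_PATH naming used above.

```shell
#!/usr/bin/env bash
# Delete etcd snapshots older than a retention window.
# prune_snapshots and the 7-day default are assumptions; the etcd-*.db pattern
# matches the BACKUP_PATH naming used in the backup script.
prune_snapshots() {
  local dir=$1 keep_days=${2:-7}
  # -mtime +N matches files whose age, counted in whole days, is greater than N
  find "$dir" -maxdepth 1 -type f -name 'etcd-*.db' -mtime +"$keep_days" -delete
}
```

Calling `prune_snapshots /backup 7` after each successful snapshot keeps roughly the last week of backups and leaves any other files in the directory untouched.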
High-Availability Control Plane
Production clusters run an odd number of control plane nodes (3 or 5) behind a load balancer. etcd uses the Raft consensus algorithm: a cluster of n members needs a quorum of floor(n/2)+1 to elect a leader and commit writes, so 3 members tolerate 1 failure and 5 tolerate 2.
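The quorum arithmetic is easy to check by hand or in a few lines of shell. The function names below are illustrative, and bash integer division supplies the floor.

```shell
#!/usr/bin/env bash
# Raft quorum arithmetic for an n-member etcd cluster (integer division floors).
quorum()          { echo $(( $1 / 2 + 1 )); }
fault_tolerance() { echo $(( ($1 - 1) / 2 )); }

# Print a small table for common cluster sizes.
quorum_table() {
  local n
  for n in 1 2 3 4 5; do
    printf '%d members: quorum %d, tolerates %d failure(s)\n' \
      "$n" "$(quorum "$n")" "$(fault_tolerance "$n")"
  done
}
```

Running `quorum_table` shows that 4 members tolerate only 1 failure, the same as 3. This is why even-sized clusters add cost without adding resilience.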
Figure 2 — HA Control Plane with 3 nodes. etcd Raft leader (blue) replicates to followers. All apiservers are active — the LB distributes kubectl traffic.
Common Gotchas
- etcd disk I/O is your bottleneck. etcd is extremely sensitive to slow disk. Use SSDs and monitor etcd_disk_wal_fsync_duration_seconds in Prometheus. High latency here causes API server timeouts and cascading failures.
- Clock skew destabilizes control plane nodes. Raft elections run on local timers rather than synchronized wall clocks, but skew still breaks TLS certificate validity checks and token expiry, and etcd logs warnings when peer clocks drift apart. Run NTP (chrony) on all control plane nodes.
- kube-scheduler runs active-passive. Only one scheduler instance is active at a time (leader election). If the leader crashes, there is a brief period where pods won't be scheduled until a standby takes over. It is not a silent failure, but expect delays.
- Admission webhook outages block deploys. If a ValidatingWebhookConfiguration or MutatingWebhookConfiguration endpoint is down or times out and its failurePolicy is Fail, every request that matches the webhook's rules is rejected, cluster-wide.
- The API server is stateless. It reads/writes etcd and keeps only an in-memory cache. You can restart it freely and it recovers quickly. etcd is what you can't afford to lose.
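To see which webhooks could block requests in this way, the registered configurations can be listed together with their failure policy and timeout. This is a minimal sketch, assuming kubectl access to the cluster. `audit_webhooks` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# List admission webhooks with their failure policy and timeout.
# audit_webhooks is a hypothetical helper; it needs kubectl access to a cluster.
audit_webhooks() {
  local kind
  for kind in validatingwebhookconfigurations mutatingwebhookconfigurations; do
    kubectl get "$kind" \
      -o custom-columns='NAME:.metadata.name,POLICY:.webhooks[*].failurePolicy,TIMEOUT:.webhooks[*].timeoutSeconds'
  done
}
```

Any entry showing `Fail` is worth checking for a healthy backing Service and a narrow set of rules, since an outage there rejects every matching request.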