Services & Networking Troubleshooting
Follow the path: client Pod → DNS → Service → EndpointSlice → target Pod IP/port → app listener → NetworkPolicy/CNI → Ingress or LoadBalancer.
Symptom Decision Tree
Narrow dataplane failures by comparing Pod IP versus Service VIP versus north-south path.
Service Path Checks
NS=app
SVC=web-api
kubectl get svc "$SVC" -n "$NS" -o wide # Service type, ClusterIP, ports, selector.
kubectl describe svc "$SVC" -n "$NS" # Events and endpoint summary.
kubectl get endpoints "$SVC" -n "$NS" -o wide # Legacy endpoint object.
kubectl get endpointslice -n "$NS" -l kubernetes.io/service-name="$SVC" -o wide # Modern endpoint source.
kubectl get pods -n "$NS" -l app=web-api -o wide # Verify selector finds ready Pods.DNS Debug Pod
kubectl run netshoot -n app --rm -it --image=nicolaka/netshoot -- bash
# Inside the debug Pod:
dig web-api.app.svc.cluster.local # DNS should return the Service ClusterIP.
curl -sv http://web-api.app.svc.cluster.local:8080/health # Tests DNS + Service + backend app.
curl -sv http://10.42.3.21:8080/health # Tests direct Pod IP; bypasses Service routing.
nc -vz web-api.app.svc.cluster.local 8080 # Quick TCP connectivity test.| Failure mode | Interpretation | Mitigate |
|---|---|---|
NXDOMAIN for fqdn | Name typo or Namespace mismatch. | Prefer fully-qualified service DNS names during incidents. |
| timeouts to CoreDNS | CPU throttle or iptables-loop. | kubectl top pod -n kube-system -l k8s-app=kube-dns. |
| intermittent 5xx only via Service | Stale conntrack or differing paths. | CNI/kube-proxy resets. |
| hosts file overrides | Privileged sidecars mutated /etc/resolv.conf. | inspect projected resolv automation. |
Pods using huge ndots search | external DNS accidentally traversing cluster search domains. | Tune dnsConfig explicitly on latency-sensitive Pods. |
kubectl logs -n kube-system deploy/coredns --tail=200
kubectl exec -it deploy/misc-tools -n app -- cat /etc/resolv.conf # confirm nameserver + search domains
# Stub upstream resolver health (clusters vary).
kubectl get cm coredns -n kube-system -o yamlCNI Dataplane
East-west breakage before Services misbehave hints at encapsulated overlay drops or eBPF attachment failures.
| Signal | Means | Evidence |
|---|---|---|
WIREGUARD_TUNNEL_DOWN style logs | Overlay peer lost. | CNI daemon logs. |
| XDP_PROG_LOAD errors | Kernel too old/new mismatch. | cilium status / Calico equivalents. |
| iptables race after reboot | kube-proxy resync backlog. | iptables-save counters ballooning. |
| VXLAN mtu black holes | Jumbo mismatched across racks. | ICMP sweep from netshoot. |
| Stale ARP caches | Metal L2 moved MAC without GARP. | Tcpdump on uplink interfaces. |
kubectl get pods -n kube-system -o wide | grep -iE 'cilium|calico|weave|cni|kube-proxy'
# Inspect programming lag on dataplane-heavy nodes only if SSH allowed.
kubectl get nodes worker-1 -o jsonpath='{.metadata.labels}'Kube-Proxy Operating Modes
| Mode | Operational note |
|---|---|
| iptables | Large Services hit rule-list latency; shard Services or migrate IPVS/eBPF modes. |
| IPVS | Masquerading nuances with externalTrafficPolicy=Local. |
| eBPF (KPNG/Cilium) | Debugging differs; prefer Hubble flows over kube-proxy logs. |
| Disabled (some clouds) | Redirects handled by hyperscaler CCM— treat NodePort issues as provider tickets sooner. |
NetworkPolicy Checks
kubectl get networkpolicy -A # Any policy can change traffic behavior in its namespace.
kubectl describe networkpolicy allow-frontend-to-api -n app
kubectl get pod web-api-abc123 -n app --show-labels # Labels must match policy podSelector.
kubectl get namespace app --show-labels # Namespace labels matter for namespaceSelector.
# Temporarily test with a controlled allow rule only if incident policy allows it.
# Prefer reviewing existing policies before adding broad allow-all rules.Ingress And LoadBalancer
kubectl get ingress -A
kubectl describe ingress web-api -n app # Rules, TLS secret, class, and events.
kubectl get svc -n ingress-nginx # LoadBalancer Service for the controller.
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=100
# Test host routing without relying on public DNS.
curl -sv -H 'Host: api.example.com' http://<load-balancer-ip>/health| Signal | Interpretation | Mitigating action |
|---|---|---|
| 502/504 from Ingress only | No healthy upstream Endpoints matching backend. | Check Service name/port + readiness. |
| Certificate verify failed | Secret stale or mismatched SAN. | kubectl describe certificate if cert-manager-backed. |
| GRPC failing through HTTP Ingress | Needs backend protocol annotation or Gateway API GRPCRoute. | Migrate manifests. |
| WebSocket resets | timeouts without upgrade headers forwarded. | Tune upstream keepalive/timeouts globally. |
| Regional LB partial outage | Stale target group registration. | Recycle controller Pods after node repairs. |
kubectl get gateways.gateway.networking.k8s.io -A 2>/dev/null
kubectl describe httproute web -n app 2>/dev/nullFailure Map
| Symptom | Likely issue | Check |
|---|---|---|
| Service has no endpoints | Selector mismatch or Pods not Ready. | kubectl get pod --show-labels |
| DNS fails | CoreDNS down, wrong namespace/name, search path issue. | kubectl logs -n kube-system deploy/coredns |
| Direct Pod works, Service fails | Service port/targetPort mismatch or kube-proxy/CNI issue. | Service YAML and EndpointSlice ports |
| Inside cluster works, outside fails | Ingress, LB, DNS, firewall, TLS, or security group. | Ingress/LB events and controller logs |
| Only some clients fail | NetworkPolicy, node routing, zone, or source allowlist. | Policy selectors and source node |
| Hairpin failures | Hairpin NAT disabled on bridge CNI combos. | Prefer cluster DNS + Service VIP over node IP looping. |
| Latency cliffs at scale | iptables rule explosions / conntrack table pressure. | Observe kube-proxy metrics or migrate dataplane modes. |
Packet Capture Ethics
Prefer Hubble/Skupper-style metadata before resorting to tcpdump on production nodes; scrub customer payloads and obtain legal approval when inspecting payload-bearing flows.
kubectl debug node/<node> -it --image=nicolaka/netshoot -- tcpdump -ni eth0 tcp port 443 -c40IPv4/IPv6 Dual-Stack Quirks
| Scenario | Interpretation |
|---|---|
| Pods resolving AAAA ahead of broken v6 egress | Prefer single-stack Services temporarily. |
| Ingress emits only IPv4 LB | Upstream DNS still publishes AAAA to clients. |
| Uptime checks pass IPv4-only | Synthetic monitors must mirror customer paths. |
Headless Services And Stateful Routing
Headless clusters (clusterIP: None) return Pod A records straight from Endpoints enabling peer discovery. When Pods appear “healthy” behind a normal Service but flaky headless lookups race, suspects include readiness gating emitting endpoints too early versus application-level bootstrap ordering.
kubectl get svc kafka -n data -o wide
dig kafka.data.svc.cluster.local +tcp +short # expect multiple Pod IPs behind headless svcexternalTrafficPolicy And Source IP
| Setting | Semantics | Operational trade-off |
|---|---|---|
Cluster default | SNAT hides client IPs from Pods. | Simpler routing; harder forensics inside apps. |
Local | Preserves original source when possible. | Ensure health checks terminate on Nodes with Pods or LB marks them out. |
| BGP MetalLB combos | Symmetric return requirements. | Otherwise partial connectivity. |
Service Mesh Edge Cases
Envoy-linked meshes shift failures from iptables DNAT chains to xDS staleness symptoms—401 loops when rotated certs or clusters lose RDS subscription.
| Message | Interpretation |
|---|---|
| Upstream overflow (UO) | Cluster-wide circuit opens after backend brownout. |
| x509 unknown authority | istio-ca root mismatch mid rotation. |
| HTTP 426 Upgrade Required | Protocol sniffing mis-detected HTTP2. |
| Blackhole cluster | VirtualService host mismatch. |
WireGuard / Encapsulation Math
Each overlay header reduces effective MSS; document expected overhead per CNI release so MTU playbooks stay accurate after upgrades.
- Geneve header adds ~50B depending on options—validate with vendor tables.
- IP-in-IP double encapsulation can trigger PMTU black holes between regions.
- Userspace tunnels (some dev clusters) cap throughput and appear as “random latency spikes”.
- Kernel GSO/GRO toggles sometimes regress on specific NIC firmware pairings.
- Keep a living doc of jumbo capabilities per rack-ID for bare metal estates.
Managed Load Balancer Checks
Hyperscaler LBs reconcile slower than Pods—always compare controller events with cloud console target health timelines.
kubectl describe svc public-api -n edge | grep -i event -A40
kubectl get endpointslices -n edge -o wide # ensure healthy ports match LB listenerWindows Node Caveats
Windows HNS policies diverge from Linux netstack—NetworkPolicy support depends on platform version; treat cross-OS Service connectivity as a first-class test matrix item.
| Symptom | Likely cause |
|---|---|
| Pods on Windows cannot reach linux Pod IP directly | Incomplete overlay peer programming. |
| DNS works but ICMP fails | Frequently expected—prefer TCP probes. |
| Host-gateway mode routing loops | Misconfigured CNI binaries across versions. |
Session Affinity Debugging
sessionAffinity: ClientIP leverages conntrack but breaks when Pods churn frequently; annotate expectations for testers.
kubectl get svc cart -n shop -o yaml | grep -i sessionCorporate Egress And HTTP Proxies
Clusters forced through explicit proxies need HTTP_PROXY variables synchronized across init containers; missing CA bundles surface as TLS verify errors indistinguishable from remote outages.
- Confirm
NO_PROXYexcludes in-cluster Kubernetes API hostnames. - Validate CONNECT tunnel MTU behaves identically across AZs.
- Rotate proxy credentials centrally; avoid bespoke Secret copies per Deployment.
- Watch for Envoy-based sidecars doubling proxy chaining latency.
CDN / WAF Layers
When Ingress returns 200 internally but CDN serves stale 503, capture X-Request-Id correlations across CDN vendor logs versus upstream Ingress logs rather than chasing kube-proxy first.
Choosing Probes For Network Tickets
Blend L3/L4/L7 evidence so you do not mistake ICMP success for working HTTP/2 ALPN handshakes.
| Probe | Covers | Blind spots |
|---|---|---|
ping | Basic L3 reachability. | Blocked intentionally on many clouds. |
nc -vz | TCP port open. | TLS application errors still possible. |
openssl s_client | Cert chain + SNI path. | Does not prove HTTP routing rules. |
curl -v --http2 | Full request path. | Hide differences when Host header missing. |
grpcurl | RPC services. | Requires proto reflection or imports. |
eBPF Map Pressure
Cilium and similar CNIs expose map fill levels—sudden policy import storms can exhaust LRU maps and silently drop flows.
- Watch for
map entry update failedwarnings in agent logs during bulk NetworkPolicy applies. - Correlate with GitOps reconcilers re-applying identical policies in tight loops.
- Resize maps only with vendor guidance; arbitrary sysctl bumps may destabilize nodes.
- Prefer incremental policy changes over wholesale namespace dumps.
- Validate kernel lockdown settings when loading BPF programs after secure-boot enablement.
Egress SNAT Pools
Centralized NAT gateways exhaust ephemeral ports faster than kube-proxy metrics reveal—track five-tuple reuse patterns when API fan-out spikes.
# On Linux node with privileges (use sparingly):
sudo conntrack -S | headLatency Budget Checklist
East-west microseconds matter when autoscaled fleets chatter; treat each hop as borrowed time.
- Quantify DNS lookup + TCP handshake + TLS (if applicable) independently.
- subtract queueing delays visible in Ingress controller histograms.
- compare client-side timings with Envoy upstream_rq_time equivalents when meshed.
- ensure synthetic monitors match regional customer populations, not just nearest AWS region.
- document expected additional RTT when traffic hairpins through security inspection VPCs.
- record baseline when cluster healthy to spot slow regression during minor CNI upgrades.
Service Defaults And Headless Edge Cases
| Field | Default gotcha |
|---|---|
publishNotReadyAddresses | Traffic hits bootstrapping Pods—use only with aware clients. |
externalName | DNS CNAME only; easy to mis-read as proxy. |
loadBalancerClass | Wrong class leaves Service pending forever on multi-controller clusters. |
IPv6 dual-stack ipFamilies | Ordering matters for legacy clients. |
Closing Network Incidents
Before resolving the ticket, capture the delta between initial symptoms and final fix so future regressions can bisect faster.
- Paste exact working
curlinvocation with redacted tokens. - Note whether change was config-only or required node reboots.
- List monitoring gaps that delayed detection.
- Confirm canaries still cover the failure class post-fix.
Link follow-up tasks to the originating SLO burn if customer impact occurred; network incidents often require cross-team backlog grooming.
Traceroute Hygiene
| Tool | When to trust |
|---|---|
mtr | Persistent loss on specific hop. |
tracert (Windows) | Limited ICMP may hide real path. |
paris-traceroute | ECMP fan-out investigations. |
| Cloud reachability analyzers | Symmetric path vs on-prem last mile. |
| Service mesh tap | L7 attributes while L3 looks clean. |
| Synthetic global agents | Geo-specific DNS responses. |
kubectl port-forward | Debug without exposing Service. |
| Sidecar admin ports | Envoy /clusters stale hosts. |
| LB access logs | Rejected TLS handshakes counted. |
| PCAP on hypervisor | East-west invisible to Pods. |
Document which hop produced each artifact so future shifts do not re-litigate network vs app ownership.
- Tag PCAP files with cluster, namespace, and directional arrow (client→Ingress vs Pod→Pod).
- Mention clock skew when stitching multi-system logs.
- Redact customer headers before uploading to ticket systems.
One-Line Smoke Tests
kubectl run netcheck --rm -i --restart=Never --image=curlimages/curl -- \
curl -sS -o /dev/null -w "%{http_code}\n" http://kubernetes.default.svc.cluster.local
# Expect 403/401 from apiserver — proves Service routing + DNS alive.