Services & Networking Troubleshooting

TL;DR

Follow the path: client Pod → DNS → Service → EndpointSlice → target Pod IP/port → app listener → NetworkPolicy/CNI → Ingress or LoadBalancer.

Symptom Decision Tree

Narrow dataplane failures by comparing Pod IP versus Service VIP versus north-south path.

Service Path Checks

bashservice-path.sh

NS=app
SVC=web-api

kubectl get svc "$SVC" -n "$NS" -o wide # Service type, ClusterIP, ports, selector.
kubectl describe svc "$SVC" -n "$NS" # Events and endpoint summary.
kubectl get endpoints "$SVC" -n "$NS" -o wide # Legacy endpoint object.
kubectl get endpointslice -n "$NS" -l kubernetes.io/service-name="$SVC" -o wide # Modern endpoint source.
kubectl get pods -n "$NS" -l app=web-api -o wide # Verify selector finds ready Pods.

DNS Debug Pod

bashdns-debug.sh

kubectl run netshoot -n app --rm -it --image=nicolaka/netshoot -- bash

# Inside the debug Pod:
dig web-api.app.svc.cluster.local # DNS should return the Service ClusterIP.
curl -sv http://web-api.app.svc.cluster.local:8080/health # Tests DNS + Service + backend app.
curl -sv http://10.42.3.21:8080/health # Tests direct Pod IP; bypasses Service routing.
nc -vz web-api.app.svc.cluster.local 8080 # Quick TCP connectivity test.

Failure mode	Interpretation	Mitigate
`NXDOMAIN` for fqdn	Name typo or Namespace mismatch.	Prefer fully-qualified service DNS names during incidents.
timeouts to CoreDNS	CPU throttle or iptables-loop.	`kubectl top pod -n kube-system -l k8s-app=kube-dns`.
intermittent 5xx only via Service	Stale conntrack or differing paths.	CNI/kube-proxy resets.
hosts file overrides	Privileged sidecars mutated /etc/resolv.conf.	inspect projected resolv automation.
Pods using huge `ndots` search	external DNS accidentally traversing cluster search domains.	Tune dnsConfig explicitly on latency-sensitive Pods.

bashcoredns-trace.sh

kubectl logs -n kube-system deploy/coredns --tail=200
kubectl exec -it deploy/misc-tools -n app -- cat /etc/resolv.conf # confirm nameserver + search domains

# Stub upstream resolver health (clusters vary).
kubectl get cm coredns -n kube-system -o yaml

CNI Dataplane

East-west breakage before Services misbehave hints at encapsulated overlay drops or eBPF attachment failures.

Signal	Means	Evidence
`WIREGUARD_TUNNEL_DOWN` style logs	Overlay peer lost.	CNI daemon logs.
XDP_PROG_LOAD errors	Kernel too old/new mismatch.	`cilium status` / Calico equivalents.
iptables race after reboot	kube-proxy resync backlog.	iptables-save counters ballooning.
VXLAN mtu black holes	Jumbo mismatched across racks.	ICMP sweep from netshoot.
Stale ARP caches	Metal L2 moved MAC without GARP.	Tcpdump on uplink interfaces.

bashcni-health.sh

kubectl get pods -n kube-system -o wide | grep -iE 'cilium|calico|weave|cni|kube-proxy'

# Inspect programming lag on dataplane-heavy nodes only if SSH allowed.
kubectl get nodes worker-1 -o jsonpath='{.metadata.labels}'

Kube-Proxy Operating Modes

Mode	Operational note
iptables	Large Services hit rule-list latency; shard Services or migrate IPVS/eBPF modes.
IPVS	Masquerading nuances with externalTrafficPolicy=Local.
eBPF (KPNG/Cilium)	Debugging differs; prefer Hubble flows over kube-proxy logs.
Disabled (some clouds)	Redirects handled by hyperscaler CCM— treat NodePort issues as provider tickets sooner.

NetworkPolicy Checks

bashnetwork-policy-debug.sh

kubectl get networkpolicy -A # Any policy can change traffic behavior in its namespace.
kubectl describe networkpolicy allow-frontend-to-api -n app
kubectl get pod web-api-abc123 -n app --show-labels # Labels must match policy podSelector.
kubectl get namespace app --show-labels # Namespace labels matter for namespaceSelector.

# Temporarily test with a controlled allow rule only if incident policy allows it.
# Prefer reviewing existing policies before adding broad allow-all rules.

Ingress And LoadBalancer

bashingress-lb.sh

kubectl get ingress -A
kubectl describe ingress web-api -n app # Rules, TLS secret, class, and events.
kubectl get svc -n ingress-nginx # LoadBalancer Service for the controller.
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=100

# Test host routing without relying on public DNS.
curl -sv -H 'Host: api.example.com' http://<load-balancer-ip>/health

Signal	Interpretation	Mitigating action
502/504 from Ingress only	No healthy upstream Endpoints matching backend.	Check Service name/port + readiness.
Certificate verify failed	Secret stale or mismatched SAN.	`kubectl describe certificate` if cert-manager-backed.
GRPC failing through HTTP Ingress	Needs backend protocol annotation or Gateway API GRPCRoute.	Migrate manifests.
WebSocket resets	timeouts without upgrade headers forwarded.	Tune upstream keepalive/timeouts globally.
Regional LB partial outage	Stale target group registration.	Recycle controller Pods after node repairs.

bashgateway-api.sh

kubectl get gateways.gateway.networking.k8s.io -A 2>/dev/null
kubectl describe httproute web -n app 2>/dev/null

Failure Map

Symptom	Likely issue	Check
Service has no endpoints	Selector mismatch or Pods not Ready.	`kubectl get pod --show-labels`
DNS fails	CoreDNS down, wrong namespace/name, search path issue.	`kubectl logs -n kube-system deploy/coredns`
Direct Pod works, Service fails	Service port/targetPort mismatch or kube-proxy/CNI issue.	Service YAML and EndpointSlice ports
Inside cluster works, outside fails	Ingress, LB, DNS, firewall, TLS, or security group.	Ingress/LB events and controller logs
Only some clients fail	NetworkPolicy, node routing, zone, or source allowlist.	Policy selectors and source node
Hairpin failures	Hairpin NAT disabled on bridge CNI combos.	Prefer cluster DNS + Service VIP over node IP looping.
Latency cliffs at scale	iptables rule explosions / conntrack table pressure.	Observe kube-proxy metrics or migrate dataplane modes.

Packet Capture Ethics

Prefer Hubble/Skupper-style metadata before resorting to tcpdump on production nodes; scrub customer payloads and obtain legal approval when inspecting payload-bearing flows.

bashephemeral-tcpdump.yaml

kubectl debug node/<node> -it --image=nicolaka/netshoot -- tcpdump -ni eth0 tcp port 443 -c40

IPv4/IPv6 Dual-Stack Quirks

Scenario	Interpretation
Pods resolving AAAA ahead of broken v6 egress	Prefer single-stack Services temporarily.
Ingress emits only IPv4 LB	Upstream DNS still publishes AAAA to clients.
Uptime checks pass IPv4-only	Synthetic monitors must mirror customer paths.

Headless Services And Stateful Routing

Headless clusters (clusterIP: None) return Pod A records straight from Endpoints enabling peer discovery. When Pods appear “healthy” behind a normal Service but flaky headless lookups race, suspects include readiness gating emitting endpoints too early versus application-level bootstrap ordering.

bashheadless-dns.sh

kubectl get svc kafka -n data -o wide
dig kafka.data.svc.cluster.local +tcp +short # expect multiple Pod IPs behind headless svc

externalTrafficPolicy And Source IP

Setting	Semantics	Operational trade-off
`Cluster` default	SNAT hides client IPs from Pods.	Simpler routing; harder forensics inside apps.
`Local`	Preserves original source when possible.	Ensure health checks terminate on Nodes with Pods or LB marks them out.
BGP MetalLB combos	Symmetric return requirements.	Otherwise partial connectivity.

Service Mesh Edge Cases

Envoy-linked meshes shift failures from iptables DNAT chains to xDS staleness symptoms—401 loops when rotated certs or clusters lose RDS subscription.

Message	Interpretation
Upstream overflow (UO)	Cluster-wide circuit opens after backend brownout.
x509 unknown authority	istio-ca root mismatch mid rotation.
HTTP 426 Upgrade Required	Protocol sniffing mis-detected HTTP2.
Blackhole cluster	VirtualService host mismatch.

WireGuard / Encapsulation Math

Each overlay header reduces effective MSS; document expected overhead per CNI release so MTU playbooks stay accurate after upgrades.

Geneve header adds ~50B depending on options—validate with vendor tables.
IP-in-IP double encapsulation can trigger PMTU black holes between regions.
Userspace tunnels (some dev clusters) cap throughput and appear as “random latency spikes”.
Kernel GSO/GRO toggles sometimes regress on specific NIC firmware pairings.
Keep a living doc of jumbo capabilities per rack-ID for bare metal estates.

Managed Load Balancer Checks

Hyperscaler LBs reconcile slower than Pods—always compare controller events with cloud console target health timelines.

bashlb-events.sh

kubectl describe svc public-api -n edge | grep -i event -A40
kubectl get endpointslices -n edge -o wide # ensure healthy ports match LB listener

Windows Node Caveats

Windows HNS policies diverge from Linux netstack—NetworkPolicy support depends on platform version; treat cross-OS Service connectivity as a first-class test matrix item.

Symptom	Likely cause
Pods on Windows cannot reach linux Pod IP directly	Incomplete overlay peer programming.
DNS works but ICMP fails	Frequently expected—prefer TCP probes.
Host-gateway mode routing loops	Misconfigured CNI binaries across versions.

Session Affinity Debugging

sessionAffinity: ClientIP leverages conntrack but breaks when Pods churn frequently; annotate expectations for testers.

bashaffinity-check.sh

kubectl get svc cart -n shop -o yaml | grep -i session

Corporate Egress And HTTP Proxies

Clusters forced through explicit proxies need HTTP_PROXY variables synchronized across init containers; missing CA bundles surface as TLS verify errors indistinguishable from remote outages.

Confirm NO_PROXY excludes in-cluster Kubernetes API hostnames.
Validate CONNECT tunnel MTU behaves identically across AZs.
Rotate proxy credentials centrally; avoid bespoke Secret copies per Deployment.
Watch for Envoy-based sidecars doubling proxy chaining latency.

CDN / WAF Layers

When Ingress returns 200 internally but CDN serves stale 503, capture X-Request-Id correlations across CDN vendor logs versus upstream Ingress logs rather than chasing kube-proxy first.

Choosing Probes For Network Tickets

Blend L3/L4/L7 evidence so you do not mistake ICMP success for working HTTP/2 ALPN handshakes.

Probe	Covers	Blind spots
`ping`	Basic L3 reachability.	Blocked intentionally on many clouds.
`nc -vz`	TCP port open.	TLS application errors still possible.
`openssl s_client`	Cert chain + SNI path.	Does not prove HTTP routing rules.
`curl -v --http2`	Full request path.	Hide differences when Host header missing.
`grpcurl`	RPC services.	Requires proto reflection or imports.

eBPF Map Pressure

Cilium and similar CNIs expose map fill levels—sudden policy import storms can exhaust LRU maps and silently drop flows.

Watch for map entry update failed warnings in agent logs during bulk NetworkPolicy applies.
Correlate with GitOps reconcilers re-applying identical policies in tight loops.
Resize maps only with vendor guidance; arbitrary sysctl bumps may destabilize nodes.
Prefer incremental policy changes over wholesale namespace dumps.
Validate kernel lockdown settings when loading BPF programs after secure-boot enablement.

Egress SNAT Pools

Centralized NAT gateways exhaust ephemeral ports faster than kube-proxy metrics reveal—track five-tuple reuse patterns when API fan-out spikes.

bashconntrack-hint.sh

# On Linux node with privileges (use sparingly):
sudo conntrack -S | head

Latency Budget Checklist

East-west microseconds matter when autoscaled fleets chatter; treat each hop as borrowed time.

Quantify DNS lookup + TCP handshake + TLS (if applicable) independently.
subtract queueing delays visible in Ingress controller histograms.
compare client-side timings with Envoy upstream_rq_time equivalents when meshed.
ensure synthetic monitors match regional customer populations, not just nearest AWS region.
document expected additional RTT when traffic hairpins through security inspection VPCs.
record baseline when cluster healthy to spot slow regression during minor CNI upgrades.

Service Defaults And Headless Edge Cases

Field	Default gotcha
`publishNotReadyAddresses`	Traffic hits bootstrapping Pods—use only with aware clients.
`externalName`	DNS CNAME only; easy to mis-read as proxy.
`loadBalancerClass`	Wrong class leaves Service pending forever on multi-controller clusters.
IPv6 dual-stack `ipFamilies`	Ordering matters for legacy clients.

Closing Network Incidents

Before resolving the ticket, capture the delta between initial symptoms and final fix so future regressions can bisect faster.

Paste exact working curl invocation with redacted tokens.
Note whether change was config-only or required node reboots.
List monitoring gaps that delayed detection.
Confirm canaries still cover the failure class post-fix.

Link follow-up tasks to the originating SLO burn if customer impact occurred; network incidents often require cross-team backlog grooming.

Traceroute Hygiene

Tool	When to trust
`mtr`	Persistent loss on specific hop.
`tracert` (Windows)	Limited ICMP may hide real path.
`paris-traceroute`	ECMP fan-out investigations.
Cloud reachability analyzers	Symmetric path vs on-prem last mile.
Service mesh tap	L7 attributes while L3 looks clean.
Synthetic global agents	Geo-specific DNS responses.
`kubectl port-forward`	Debug without exposing Service.
Sidecar admin ports	Envoy `/clusters` stale hosts.
LB access logs	Rejected TLS handshakes counted.
PCAP on hypervisor	East-west invisible to Pods.

Document which hop produced each artifact so future shifts do not re-litigate network vs app ownership.

Tag PCAP files with cluster, namespace, and directional arrow (client→Ingress vs Pod→Pod).
Mention clock skew when stitching multi-system logs.
Redact customer headers before uploading to ticket systems.

One-Line Smoke Tests

bashsmoke-net.sh

kubectl run netcheck --rm -i --restart=Never --image=curlimages/curl -- \
  curl -sS -o /dev/null -w "%{http_code}\n" http://kubernetes.default.svc.cluster.local
# Expect 403/401 from apiserver — proves Service routing + DNS alive.