SRE & Platform Scenarios Knowledge

Pods across multiple nodes intermittently fail DNS lookups. What do you check?

Check CoreDNS logs and resource saturation, kube-dns Service endpoints, NodeLocal DNS cache health if used, NetworkPolicy rules for UDP/TCP 53, kubelet cluster DNS configuration, CNI packet loss, and test with nslookup from affected Pods on different nodes.

A Deployment rollout is stuck because new Pods never become Ready. What do you check?

Use kubectl rollout status and kubectl describe deployment, then inspect new Pods for readiness probe failures, dependency failures, config or env mistakes, image issues, resource pressure, and rollout constraints such as maxUnavailable or maxSurge.

A StatefulSet Pod is stuck Terminating for 30 minutes. What do you do?

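Check finalizers, node reachability (an unreachable kubelet cannot confirm termination), CSI volume detach status, mounted PVC state, and application shutdown behavior such as preStop hooks and the termination grace period. A PodDisruptionBudget gates evictions rather than a Pod that is already Terminating, so it is rarely the cause. Force deletion (kubectl delete pod --grace-period=0 --force) only after the data consistency risk is understood.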

Cluster autoscaler is not adding nodes even though Pods are Pending. What do you check?

Check autoscaler logs, cloud provider permissions, node group max size, quota, Unschedulable events on the Pending Pods, CPU or memory requests larger than any node type can offer, taints, affinity, topology spread constraints, and whether any node group can satisfy the Pod shape.

A node frequently goes NotReady during peak traffic. What do you check?

Check kubelet restarts, node CPU saturation, memory and disk pressure, kernel logs, container runtime health, NIC saturation, CNI errors, API connectivity, and node-level dashboards. Add node monitoring and consider larger nodes or workload isolation.

A Pod cannot reach another Pod across namespaces. What do you check?

Validate CNI health, direct Pod IP reachability, namespace-qualified Service DNS, NetworkPolicies in both namespaces, Pod labels, target port listeners, node routes, and whether egress or ingress is blocked.

A Service has no endpoints even though Pods are running. What is likely wrong?

Check Service selector versus Pod labels, Pod readiness, readiness gates, namespace assumptions, target workload health, and EndpointSlices. Running Pods do not become Service endpoints unless they match and are Ready.
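
The matching rule above can be sketched in a few lines. This is a simplified model, not the Kubernetes implementation: it covers only equality-based selectors, and the function names and label values are hypothetical.

```python
def selector_matches(selector, pod_labels):
    """Equality-based match: every selector key/value must appear on the Pod."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


def receives_service_traffic(selector, pod_labels, pod_ready):
    """A Pod is routed to only if the Service has a selector, it matches, and the Pod is Ready."""
    # A Service without a selector does not get endpoints created for it automatically.
    return bool(selector) and selector_matches(selector, pod_labels) and pod_ready


# Hypothetical example: the Service selects version=v2 but the Pod lacks that label.
service_selector = {"app": "web", "version": "v2"}
pod_labels = {"app": "web", "tier": "frontend"}
print(receives_service_traffic(service_selector, pod_labels, pod_ready=True))
```

A single missing or misspelled label is enough to leave the Service with no endpoints, which is why comparing the selector and the Pod labels side by side is usually the first check.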

A cluster upgrade failed mid-way. What do you do?

Capture current state, check component versions and (on kubeadm-managed clusters) the output of kubeadm upgrade plan, inspect control-plane static Pods and kubelet logs, validate API and etcd health, restore etcd from snapshot if state is corrupted, correct version skew, then re-run the failed upgrade step.

A Pod is OOMKilled repeatedly. What do you do?

Check logs from the previous container instance (kubectl logs --previous), memory limits, actual usage, heap settings, leak patterns, and node pressure. Raise limits only when justified, right-size requests, use profiling, and consider VPA recommendations or app fixes.

A CNI plugin upgrade caused a cluster-wide network outage. What do you do?

Roll back the CNI DaemonSet or release, validate Pod CIDR and node route consistency, inspect CNI logs, restore previous config, test a canary node before broad rollout, and require staged rollout plus rollback plans for future CNI changes.

A critical microservice is experiencing cascading failures. How do you respond?

Reduce blast radius with rate limits, circuit breakers, retry budgets, load shedding, and queue buffering. Identify the upstream latency or error source, stop unbounded retries, and protect dependencies while recovering user-facing paths.
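
As one illustration of the circuit-breaker idea, here is a minimal sketch. The class, its thresholds, and the injectable clock are assumptions made for the example; production libraries usually add a proper half-open state with limited trial calls.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures, then allows a trial call after a cool-down."""

    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cool-down, let a call through to test the dependency.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # (Re)open and restart the cool-down timer.
            self.opened_at = self.clock()
```

Callers check allow() before each request and fail fast when it returns False, which stops a struggling dependency from being hit with more load while it recovers.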

Latency spikes happen every day at the same time. What do you investigate?

Correlate spikes with cron jobs, backups, batch workloads, compaction, autoscaling events, GC cycles, cache expiry, database maintenance, and noisy neighbors. Use dashboards by time window and compare p50, p95, p99, and saturation.

A service becomes slow after a new release. What is your approach?

Compare golden signals before and after release, inspect CPU and memory profiles, check DB query plans, dependency latency, error logs, feature flags, and config changes. Roll back or disable the change if user impact is growing.

A distributed job scheduler is creating duplicate jobs. What do you fix?

Verify leader election, add idempotency keys, enforce deduplication at job creation and execution, use transactional state updates, and make handlers safe to retry. Observability should show job lifecycle and duplicate suppression.

A message queue is growing uncontrollably. What do you check?

Check producer rate, consumer lag, slow consumers, poison messages, downstream latency, retry storms, partition imbalance, and batch size. Scale consumers when safe, add DLQ handling, and apply backpressure to producers.
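
Whether scaling consumers is enough can be estimated with simple arithmetic. The sketch below assumes steady rates, which real queues rarely have, and the numbers are hypothetical.

```python
def seconds_to_drain(backlog, consume_per_s, produce_per_s):
    """Time to clear the backlog at steady rates, or None if it will never drain."""
    if backlog <= 0:
        return 0.0
    net_rate = consume_per_s - produce_per_s
    if net_rate <= 0:
        return None  # consumers are not keeping up; the queue keeps growing
    return backlog / net_rate


# Hypothetical: 12,000 queued messages, consumers at 150/s, producers at 100/s.
print(seconds_to_drain(12_000, consume_per_s=150, produce_per_s=100))
```

A None result means adding DLQ handling or fixing a poison message will not help on its own: either consumer throughput has to rise or producers need backpressure.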

A service randomly times out calling another service. What do you check?

Check p99 latency, connection pool exhaustion, DNS latency, load balancer health, timeout values, retry behavior, TCP resets, Pod restarts, and dependency saturation. Keep per-call timeouts shorter than the caller's overall deadline and keep retries bounded.
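
One way to combine timeouts and bounded retries is sketched below. The function name and parameters are hypothetical, and it assumes the wrapped operation accepts a timeout argument and raises TimeoutError when that timeout is exceeded.

```python
import time


def call_with_deadline(op, deadline_s, per_try_timeout_s, max_attempts=3,
                       backoff_s=0.1, clock=time.monotonic, sleep=time.sleep):
    """Retry op a bounded number of times without exceeding the overall deadline."""
    start = clock()
    last_error = None
    for attempt in range(max_attempts):
        remaining = deadline_s - (clock() - start)
        if remaining <= 0:
            break
        try:
            # Never give a single attempt more time than the caller has left.
            return op(timeout=min(per_try_timeout_s, remaining))
        except TimeoutError as exc:
            last_error = exc
            if attempt + 1 < max_attempts:
                remaining = max(0.0, deadline_s - (clock() - start))
                sleep(min(backoff_s * 2 ** attempt, remaining))
    raise TimeoutError("deadline exceeded or attempts exhausted") from last_error
```

Both limits matter: the attempt cap prevents retry storms against a saturated dependency, and the deadline check keeps the total time within what the user is willing to wait.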

A distributed cache cluster is causing stale reads. What do you change?

Review cache invalidation, TTLs, write-through or write-behind behavior, consistent hashing, replication lag, and read-after-write expectations. Add versioning or explicit invalidation for data that cannot tolerate staleness.

A database failover caused five minutes of downtime. What do you improve?

Tune failover thresholds, reduce election timeouts carefully, verify client retry and reconnect logic, shorten DNS or endpoint update delays, rehearse failovers, and measure recovery time under realistic traffic.

A service is CPU-throttled despite low average CPU usage. What do you check?

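Inspect cgroup throttling metrics, CPU limits, bursty workload behavior, and request/limit settings. CPU limits are enforced as a quota per scheduling period (100 ms by default), so a short burst can exhaust the quota and be throttled even when average usage looks low. Removing or raising CPU limits often helps latency-sensitive services while keeping requests for scheduling.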
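
The throttling metrics can be read directly from the container's cgroup. The sketch below parses the nr_periods and nr_throttled counters found in a cgroup v2 cpu.stat file; the sample text is made up for illustration, and the file location depends on the runtime and node setup.

```python
def throttle_ratio(cpu_stat_text):
    """Fraction of enforcement periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0  # no CPU limit set, or no periods elapsed yet
    return stats.get("nr_throttled", 0) / periods


# Hypothetical cpu.stat contents.
sample = """usage_usec 4200000
nr_periods 200
nr_throttled 50
throttled_usec 1350000
"""
print(throttle_ratio(sample))
```

Because these counters are cumulative, a meaningful signal comes from the change between two reads over a known interval rather than from a single snapshot.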

A distributed system shows clock skew issues. What do you check?

Ensure chrony or ntpd is running, monitor drift on all nodes, validate cloud time sync, set alerts on skew thresholds, and review systems that depend on timestamps such as leases, TLS, logs, and distributed transactions.

You receive a high 5xx alert. What do you check first?

Check error rate, latency, traffic volume, saturation, recent deploys, upstream dependencies, logs, and traces. Confirm whether impact is global or scoped by route, zone, version, tenant, or dependency.

Logs show intermittent connection reset errors. What do you investigate?

Check load balancer idle timeouts, keepalive settings, Pod restarts, upstream resets, connection pool reuse, ingress timeouts, node pressure, and whether one side closes connections earlier than the other expects.

A dashboard shows a sudden drop in traffic. What do you validate?

First validate the monitoring pipeline: scrape health, metric names, dashboard query, and time range. Then check load balancer health, DNS, routing, ingress, upstream outages, recent deploys, and real request logs.

Alerts are too noisy. How do you improve them?

Move toward SLO-based alerts, group related alerts, deduplicate symptoms, remove static thresholds that do not require action, add inhibition rules, and require each alert to have an owner and runbook.

A service has high p99 latency but normal p50. What do you check?

Investigate tail causes: GC pauses, lock contention, cold caches, noisy neighbors, slow dependencies, connection pool starvation, uneven load balancing, large payloads, and outlier nodes or zones.
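
A small worked example shows how a few slow requests leave p50 untouched while dominating p99. It uses the nearest-rank percentile definition and invented latency samples; monitoring systems typically estimate percentiles from histograms instead.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical: 98 fast requests and 2 slow outliers, in milliseconds.
latencies_ms = [20] * 98 + [900, 1200]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 99))
```

Only 2% of requests are slow here, so the median does not move at all, yet the p99 lands on an outlier. Breaking the same percentiles down by node, zone, or dependency is what narrows the cause.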

node-exporter stopped reporting metrics. What do you check?

Check the DaemonSet Pod, node health, Prometheus target page, scrape config, firewall or NetworkPolicy, TLS or auth settings, service discovery labels, and whether the exporter port is listening.

A spike in Pod restarts is observed. What do you check?

Check OOMKilled events, liveness probe failures, image pull problems, node pressure, recent deploys, config changes, dependency outages, and whether restarts concentrate on a node, namespace, workload, or version.

A service has high errors but logs show nothing. What do you improve?

Confirm logs are emitted and collected, add structured logging, request IDs, error logs at boundaries, distributed tracing, and metrics by route and status. Some errors may occur at ingress or client layers before app logs exist.

A dashboard shows zero traffic but logs show requests. What do you check?

Check Prometheus scrape failures, exporter health, metric name or label changes, dashboard query filters, recording rules, dropped samples, and whether logs and metrics are using the same service identity.

A synthetic probe fails but real traffic is fine. What do you check?

Validate probe path, auth requirements, DNS path, firewall allowlists, probe source IPs, TLS/SNI, headers, expected status code, and whether the synthetic request matches real user behavior.

A CI pipeline takes 45 minutes. How do you optimize it?

Add dependency caching, parallelize tests, split slow suites, use incremental builds, right-size runners, containerize build steps, avoid repeated downloads, and publish timing breakdowns so bottlenecks stay visible.

A deployment caused an outage due to bad config. What guardrails do you add?

Add schema validation, config tests, canaries, feature flags, progressive rollout, automated rollback, preflight checks, and ownership for config changes. Treat config as deployable code with review and validation.
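
A preflight check of this kind can be very small. The sketch below validates a config dictionary before rollout; the keys timeout_ms and log_level and their allowed ranges are hypothetical, and a real setup would more likely use a schema language or library.

```python
ALLOWED_KEYS = {"timeout_ms", "log_level"}
ALLOWED_LOG_LEVELS = {"debug", "info", "warn", "error"}


def validate_config(cfg):
    """Return a list of problems; an empty list means the config passed."""
    errors = []
    timeout = cfg.get("timeout_ms")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(timeout, int) or isinstance(timeout, bool) or not 1 <= timeout <= 60_000:
        errors.append("timeout_ms must be an integer between 1 and 60000")
    if cfg.get("log_level") not in ALLOWED_LOG_LEVELS:
        errors.append("log_level must be one of debug, info, warn, error")
    unknown = set(cfg) - ALLOWED_KEYS
    if unknown:
        errors.append(f"unknown keys: {sorted(unknown)}")
    return errors


print(validate_config({"timeout_ms": "500", "log_level": "info", "timout_ms": 1}))
```

Rejecting unknown keys catches typos that would otherwise be silently ignored and fall back to a default in production.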

Terraform apply accidentally destroyed resources. What controls prevent repeats?

Use plan gates, mandatory review, state locking, drift detection, least-privilege deploy roles, deletion protection, policy-as-code checks, separate workspaces, and backups for state and critical resources.

A VM autoscaling group keeps flapping. What do you tune?

Increase cooldowns, adjust thresholds, use longer evaluation windows, inspect load patterns, check health check sensitivity, and avoid scaling on noisy single metrics. Stabilize with hysteresis.
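
Hysteresis combined with an evaluation window might look like the following sketch. The thresholds and function name are assumptions for illustration, not any cloud provider's policy format.

```python
def scaling_decision(window, scale_out_above=75.0, scale_in_below=40.0):
    """Act only when every sample in the evaluation window agrees."""
    if not window:
        return "hold"
    if all(value > scale_out_above for value in window):
        return "scale_out"
    if all(value < scale_in_below for value in window):
        return "scale_in"
    # Between the two thresholds, or mixed samples: do nothing.
    return "hold"


# Hypothetical CPU percentages over three evaluation intervals.
print(scaling_decision([80, 85, 90]))
print(scaling_decision([80, 30, 90]))
```

The gap between the scale-out and scale-in thresholds is what prevents flapping: a group that has just scaled out will sit in the hold band instead of immediately qualifying for scale-in.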

A build agent pool is overloaded. What do you do?

Add autoscaling, use ephemeral agents, split queues by workload type, cap concurrency for heavy jobs, cache dependencies, and observe queue wait time as a first-class capacity metric.

A secret was accidentally committed to Git. What is the response?

Rotate the secret immediately, revoke old credentials, scan for use, purge history where required, add secret scanning, pre-commit hooks, protected variables, and move secrets to a managed secret store.

A deployment pipeline fails intermittently. What do you check?

Check flaky tests, network timeouts, artifact registry reliability, runner capacity, credentials expiry, concurrency limits, dependency downloads, and whether retries hide a real ordering or state problem.

A container image is 3 GB. How do you reduce it?

Use multi-stage builds, slim or distroless runtime images, remove build tools and caches, clean package manager metadata, copy only needed artifacts, and inspect layers to find large files.

A job runs twice after a pipeline retry. How do you make it safe?

Add idempotency keys, deduplication logic, transactional operations, external job state, and retry-safe handlers. A retry should either no-op safely or resume work without duplicate side effects.
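
The idempotency-key pattern can be sketched as follows. The class is hypothetical, and the in-memory dictionary stands in for a durable store with an atomic insert-if-absent operation, which a real system would need to survive restarts and concurrent runners.

```python
class IdempotentRunner:
    """Runs a handler at most once per idempotency key and replays the stored result."""

    def __init__(self):
        self._results = {}  # stand-in for durable external job state

    def run_once(self, idempotency_key, handler):
        if idempotency_key in self._results:
            # Retry of a completed job: no-op and return the earlier result.
            return self._results[idempotency_key]
        result = handler()
        self._results[idempotency_key] = result
        return result


# Hypothetical usage: the key is derived from the pipeline run and step.
charges = []
runner = IdempotentRunner()
for _ in range(2):  # the second iteration simulates a pipeline retry
    runner.run_once("pipeline-123/deploy", lambda: charges.append("charged") or "done")
print(len(charges))
```

The key should identify the logical unit of work rather than the attempt, so that a retried pipeline run maps to the same key as the original.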

A GitOps controller keeps reverting manual changes. What do you do?

Treat Git as the source of truth, find the owning Application or Kustomization, commit desired changes to Git, educate teams on drift behavior, and restrict direct production edits except through documented emergency paths.

A service violates its SLO for the month. What is the response?

Analyze burn rate, identify top regressions, pause risky changes if needed, prioritize reliability work, communicate impact, and focus fixes on the largest contributors to user-visible failure.

Error budget is burning too fast. What do you change?

Throttle releases, add canaries, improve rollback automation, reduce noisy dependencies, fix top error sources, and use error budget policy to shift engineering effort toward reliability.
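
To judge how fast is "too fast", burn rate is commonly defined as the observed error ratio divided by the error ratio the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. A short sketch with hypothetical numbers and a 30-day window assumed:

```python
def burn_rate(bad_events, total_events, slo):
    """Observed error ratio relative to the error ratio the SLO allows."""
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / (1.0 - slo)


def days_until_budget_exhausted(rate, window_days=30):
    """How long a full budget lasts if the current burn rate continues."""
    return float("inf") if rate <= 0 else window_days / rate


# Hypothetical: 50 failed requests out of 10,000 against a 99.9% SLO.
rate = burn_rate(50, 10_000, slo=0.999)
print(round(rate, 2), round(days_until_budget_exhausted(rate), 1))
```

Expressing the problem this way makes the policy decision concrete: a sustained burn rate well above 1 is the signal to throttle releases and redirect effort toward the top error sources.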

A team wants 99.999% availability. How do you evaluate it?

Explain cost, complexity, dependency budgets, operational load, and diminishing returns. Validate whether user needs justify the target and whether dependencies, architecture, and team practices can support it.
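
Turning the target into allowed downtime makes the cost discussion concrete. The helper below is plain arithmetic over a 365-day year; the function name is illustrative.

```python
def downtime_budget_minutes(availability_pct, days=365):
    """Minutes of allowed downtime over the period for a given availability target."""
    return (1 - availability_pct / 100) * days * 24 * 60


for target in (99.9, 99.99, 99.999):
    print(target, round(downtime_budget_minutes(target), 2))
```

At 99.999% the yearly budget is a little over five minutes, which is less than most human-driven incident responses take, so the target effectively requires automated detection and failover.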

A service has a 99.9% target but depends on a 99% upstream. What is the issue?

Availability compounds across dependencies. With a hard dependency on a 99% upstream, user-facing availability cannot exceed roughly 99%, so a 99.9% target is unrealistic unless the design uses caching, fallback, graceful degradation, or redundancy.
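
The compounding can be worked through numerically. The sketch assumes failures are independent, which real systems only approximate, and the function names are illustrative.

```python
import math


def serial_availability(*components):
    """All components must be up: availabilities multiply."""
    return math.prod(components)


def redundant_availability(single, replicas):
    """At least one of several independent replicas must be up."""
    return 1.0 - (1.0 - single) ** replicas


# A 99.9% service with a hard dependency on a 99% upstream.
print(round(serial_availability(0.999, 0.99), 5))
# The same upstream deployed as two independent replicas.
print(round(redundant_availability(0.99, 2), 5))
```

Under the independence assumption, the serial path drops below 99%, while two redundant upstream replicas lift the dependency to about 99.99%, which is the kind of change needed before a 99.9% target becomes plausible.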

A service is available but has poor latency. What do you add?

Add latency objectives, optimize hot paths, reduce tail latency, profile slow requests, tune dependencies, and alert on user-visible latency rather than availability alone.

A team wants to alert on every error. What do you recommend?

Alert on user impact and error budget burn, not every individual error. Use logs and dashboards for investigation, alerts for action, and group alerts by symptoms that require human response.

A service has no clear owner. What do you fix?

Define service ownership, escalation paths, on-call rotation, runbooks, dashboards, SLOs, and deployment responsibility. Ownership should be visible in the service catalog and in alert routing.

On-call load is too high. How do you reduce it?

Fix noisy alerts, automate common remediation, improve runbooks, remove non-actionable pages, invest in reliability work, rotate fairly, and track toil and sleep-impact metrics.

A post-incident review blames individuals. How do you redirect it?

Keep it blameless and systems-focused. Identify contributing conditions, weak signals, missing guardrails, and process gaps. Actions should improve the system rather than punish the responder.

A service has no runbooks. What should be created?

Create runbooks with symptoms, dashboards, commands, rollback steps, dependency maps, escalation paths, common failure modes, and automation links. Keep them close to alerts and review them after incidents.
