DevOps & CI/CD Scenarios Knowledge

A CI pipeline takes 45 minutes to run. How do you optimize it?

Introduce build caching, parallelize independent test suites, use incremental builds, containerize build steps for consistency, split slow jobs, right-size runners, and offload heavy work to remote build systems when appropriate.
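As a rough sketch of parallelizing and splitting slow jobs, recorded suite durations can be balanced across a fixed number of parallel jobs with a greedy longest-first assignment. The suite names, durations, and the function name shard_suites are hypothetical; real CI systems often provide their own test-splitting features.

```python
import heapq


def shard_suites(durations, shard_count):
    """Assign suites to shards, longest suite first, always to the lightest shard."""
    # Heap entries are (total_seconds_so_far, shard_index).
    heap = [(0.0, index) for index in range(shard_count)]
    heapq.heapify(heap)
    shards = [[] for _ in range(shard_count)]
    # Sort by duration descending, then by name so the result is deterministic.
    for name, seconds in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        total, index = heapq.heappop(heap)
        shards[index].append(name)
        heapq.heappush(heap, (total + seconds, index))
    return shards


def shard_totals(durations, shards):
    """Return the summed duration of each shard."""
    return [sum(durations[name] for name in shard) for shard in shards]
```

The wall-clock time of the parallel stage is roughly the largest shard total, so that is the number to watch after re-sharding.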

Builds fail intermittently with network timeout errors. What do you do?

Add bounded retries with backoff, increase timeouts only where justified, use local mirrors or dependency proxies, stabilize artifact registry connectivity, and monitor network failure rate separately from test failures.
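A minimal sketch of bounded retries with exponential backoff and jitter follows. The function name, the default limits, and the choice of retryable exceptions are assumptions for illustration; only errors that are plausibly transient should be retried.

```python
import random
import time


def retry_with_backoff(operation, attempts=4, base_delay=1.0, max_delay=30.0,
                       retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call operation() up to `attempts` times, sleeping between failures."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # Bounded: surface the error after the last attempt.
            # The delay ceiling doubles each attempt and is capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter spreads retries out so parallel jobs do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
```

Passing sleep as a parameter keeps the helper testable without real waiting.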

A pipeline fails only on the main branch. What do you check?

Check branch-specific environment variables, protected secrets, pipeline conditions, deployment rules, required approvals, and merge results. Validate whether CI config differs after merge or whether main has stricter gates.

A pipeline step is flaky and fails randomly. How do you handle it?

Isolate the flaky step, capture logs and seed values, add retries with backoff as a short-term shield, containerize the environment, remove nondeterministic dependencies, and track flake rate until the root cause is fixed.
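One way to track flake rate, sketched under the assumption that run history is available as (commit, passed) pairs for a single step: a commit counts as flaky when the same code both passed and failed. The data shape and function name are hypothetical.

```python
from collections import defaultdict


def flake_rate(runs):
    """Return the fraction of commits that saw both a pass and a failure."""
    outcomes = defaultdict(set)
    for commit, passed in runs:
        outcomes[commit].add(bool(passed))
    if not outcomes:
        return 0.0
    # A commit with both True and False outcomes failed without a code change.
    flaky = sum(1 for seen in outcomes.values() if seen == {True, False})
    return flaky / len(outcomes)
```

Commits that only ever failed are treated as real failures rather than flakes, which keeps the metric focused on nondeterminism.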

A pipeline is blocked waiting for manual approval. What should change?

Keep approvals for high-risk changes, but automate low-risk approvals with policy-as-code, progressive delivery, canary checks, environment protection rules, and clear ownership. Manual gates should be intentional, not default friction.

A pipeline fails due to missing environment variables. What guardrails help?

Use a centralized secret and variable manager, validate required variables at pipeline start, avoid inline secrets, document environment contracts, and fail fast with a clear error before any deploy step.
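A fail-fast check at pipeline start might look like the sketch below. The variable names in the usage comment and the function names are placeholders; the real contract comes from the documented environment requirements.

```python
import os


def missing_variables(required, environ=None):
    """Return the required variable names that are unset or blank, sorted."""
    env = os.environ if environ is None else environ
    return sorted(name for name in required if not env.get(name, "").strip())


def require_variables(required, environ=None):
    """Exit with a clear message before any deploy step can run."""
    missing = missing_variables(required, environ)
    if missing:
        # Only names are printed, never values, so secrets do not leak into logs.
        raise SystemExit("Missing required variables: " + ", ".join(missing))


# Example first step of a job (names are hypothetical):
# require_variables(["DEPLOY_ENV", "REGISTRY_URL", "API_TOKEN"])
```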

A pipeline step consumes too much CPU. What do you check?

Profile the step, identify hot tasks, optimize code or test selection, set runner resource classes deliberately, split work into parallel jobs, and offload heavy tasks to scalable runners or remote build infrastructure.

A pipeline fails because of dependency version drift. What prevents it?

Pin versions, use lock files, cache dependencies, control base images, run scheduled dependency updates, and make dependency changes explicit through review rather than accidental fresh installs.
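As an illustration of making drift visible, a CI check can flag dependencies that are not pinned to an exact version. This sketch assumes a pip-style requirements file and a deliberately simple pattern; other ecosystems would rely on their own lock files instead.

```python
import re

# Matches "name==version" or "name[extra]==version" at the start of a line.
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?==[^=\s]+")


def unpinned_requirements(text):
    """Return requirement lines that are not pinned with '=='."""
    unpinned = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # Drop comments and whitespace.
        if not line or line.startswith("-"):  # Skip blanks and options like -r.
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned
```

Failing the build when this list is non-empty turns an accidental fresh install into an explicit, reviewable change.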

A pipeline is slow because Docker images are large. How do you improve it?

Use multi-stage builds, slim or distroless base images, remove build artifacts, order layers for cache reuse, use registry layer caching, and keep build dependencies out of runtime images.

A pipeline fails after moving to a new CI system. What is the migration strategy?

Run both systems in parallel for a while, compare environment variables and secrets, validate runner images and permissions, migrate one workflow at a time, and use feature flags or branch-scoped rollout for high-risk steps.

A deployment caused an outage due to bad config. What do you add?

Add config schema validation, preflight checks, canary releases, automated rollback, feature flags, review rules, and tests that exercise config combinations before rollout.
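Config schema validation can start very small. The sketch below is a hand-rolled check with a hypothetical schema format (key mapped to expected type); in practice a schema library or JSON Schema would likely replace it.

```python
def validate_config(config, schema):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for key, expected in schema.items():
        if key not in config:
            errors.append(f"missing key: {key}")
            continue
        value = config[key]
        # bool is a subclass of int in Python, so reject it for non-bool fields.
        wrong_type = not isinstance(value, expected) or (
            isinstance(value, bool) and expected is not bool
        )
        if wrong_type:
            errors.append(f"{key}: unexpected type {type(value).__name__}")
    # Unknown keys are often typos of real keys, so report them too.
    for key in config:
        if key not in schema:
            errors.append(f"unknown key: {key}")
    return errors
```

Running this as a preflight step and blocking the rollout on any error catches typos such as a misspelled key before they reach production.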

A blue/green deployment failed to switch traffic. What do you check?

Check load balancer rules, target health checks, DNS TTL, readiness probes, service selectors, ingress or gateway routing, and whether both environments are serving compatible versions.

A canary deployment never reaches full rollout. What do you inspect?

Check canary metrics, error budget burn, rollout thresholds, analysis templates, sample size, route weights, logs, and whether the canary is genuinely receiving representative traffic.
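The interaction between sample size and thresholds can be sketched as a simple promotion gate. The thresholds, parameter names, and verdict strings here are assumptions, not the behavior of any particular rollout tool.

```python
def canary_verdict(canary_errors, canary_requests, baseline_errors, baseline_requests,
                   min_requests=500, max_ratio=1.5, min_error_rate=0.001):
    """Return 'inconclusive', 'promote', or 'abort' for one analysis window."""
    # Too little traffic: the error rate is noise, so no decision is made.
    if canary_requests < min_requests:
        return "inconclusive"
    canary_rate = canary_errors / canary_requests
    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    # Allow some headroom over baseline, with a floor so a near-zero
    # baseline does not make every single error fatal.
    allowed = max(baseline_rate * max_ratio, min_error_rate)
    return "promote" if canary_rate <= allowed else "abort"
```

A canary whose route weight never delivers min_requests per window stays inconclusive indefinitely, which is one common reason a rollout stalls without ever failing.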

A deployment works in staging but fails in production. What do you compare?

Compare secrets, ConfigMaps, feature flags, quotas, node types, resource limits, data shape, external dependencies, network policy, ingress settings, and version skew between environments.

A rollback fails because the previous version is incompatible. What prevents this?

Use backward-compatible schema changes, expand-and-contract migrations, versioned APIs, decoupled database migrations, compatibility tests, and rollback drills that include data and config state.

A deployment pipeline deploys to the wrong environment. What controls help?

Use environment protection rules, explicit environment selection, approvals for sensitive targets, separate credentials, GitOps environment repos, clear naming, and guardrails that verify cluster identity before deployment.

A deployment is stuck waiting for Pods to become Ready. What do you check?

Check readiness probes, dependency connectivity, database access, misconfigured ports, config and secrets, resource pressure, startup time, and Pod events. Readiness should represent actual ability to serve traffic.

A deployment triggers too many restarts. What do you check?

Inspect logs, liveness probe settings, startup probes, resource limits, dependency failures, configuration changes, and node pressure. Bad probes can amplify a slow startup into a restart loop.

A deployment causes a memory leak. What is the response?

Roll back or stop rollout if impact is growing, gather memory profiles, compare with canary metrics, set safe resource limits, add alerts, and fix the leak before resuming rollout.

A deployment breaks only under high load. What should be added?

Add performance and load tests to CI, simulate peak traffic, test autoscaling behavior, measure saturation and tail latency, and ensure the canary receives enough load to reveal regressions.

Terraform apply accidentally destroys resources. What controls prevent repeats?

Use terraform plan gates, mandatory reviews, state locking, drift detection, deletion protection, least-privilege apply roles, policy checks, and backups for state and critical resources.

Terraform state becomes corrupted. How do you recover?

Restore from remote backend versioning if available, avoid manual edits unless absolutely necessary, re-import resources, verify state addresses, enable locking, and document the recovery steps.

Terraform apply is slow. How do you improve it?

Split monolithic state into smaller, separately applied root modules or workspaces, reduce provider calls, avoid unnecessary data sources, use targeted plans only with care, and parallelize independent stacks through orchestration.

A module update breaks downstream modules. What prevents this?

Use semantic versioning, pin module versions, publish changelogs, test modules in isolation, run integration tests for downstream consumers, and avoid breaking input or output contracts without a major version.
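A release check can compare a module's old and new interface and confirm the version bump matches. This is a sketch: the interface dictionaries are a hypothetical format that would have to be extracted from the module, and it assumes plain MAJOR.MINOR.PATCH versions at 1.0.0 or later.

```python
def parse_version(text):
    """Parse 'v1.2.3' or '1.2.3' into a tuple of integers."""
    major, minor, patch = (int(part) for part in text.lstrip("v").split("."))
    return (major, minor, patch)


def is_breaking(old, new):
    """True if consumers written against `old` could break on `new`."""
    old_inputs = old["required_inputs"] | old["optional_inputs"]
    new_inputs = new["required_inputs"] | new["optional_inputs"]
    removed_inputs = old_inputs - new_inputs
    # Covers brand-new required inputs and optional inputs made required.
    newly_required = new["required_inputs"] - old["required_inputs"]
    removed_outputs = old["outputs"] - new["outputs"]
    return bool(removed_inputs or newly_required or removed_outputs)


def bump_is_sufficient(old_version, new_version, breaking):
    """Breaking changes need a higher major version; others just a higher version."""
    old_v, new_v = parse_version(old_version), parse_version(new_version)
    if breaking:
        return new_v[0] > old_v[0]
    return new_v > old_v
```

Type changes on existing inputs or outputs are not detected by this set-based comparison and would still need review or integration tests.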

Ansible playbooks take too long to run. How do you optimize them?

Enable SSH pipelining, use async tasks where safe, reduce fact gathering, cache facts, batch hosts deliberately, avoid shell loops when modules exist, and remove unnecessary serial bottlenecks.

Configuration drift is detected in production. What do you do?

Identify whether the drift came from an emergency fix, an accidental change, or an unmanaged resource. Reconcile through Git or IaC, import legitimate changes, alert on future drift, and restrict manual access where possible.

A cloud resource was created manually outside IaC. What is the safe path?

Import it into IaC or recreate it through IaC, add policy-as-code controls, restrict manual permissions, tag ownership, and make future manual creation visible through drift detection.

IaC pipelines fail due to API rate limits. How do you reduce failures?

Batch operations, add retries with backoff, tune provider throttling controls, reduce parallelism if needed, cache lookups, split high-churn stacks, and coordinate deploy windows.

A secret is accidentally committed to Git. What is the response?

Rotate the secret immediately, revoke old credentials, scan for use, and purge history where required. Then add secret scanners, pre-commit hooks, and protected variables, and move secrets to managed storage.

A VM autoscaling group keeps flapping. What do you tune?

Increase cooldowns, widen thresholds, use longer evaluation windows, inspect load patterns, tune health checks, and avoid scaling decisions on a noisy single metric.
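The effect of wider thresholds and cooldowns can be seen in a toy simulation. Everything here is a simplification with assumed names and numbers: the metric is a utilization percentage per tick and does not react to group size, which a real group would.

```python
def count_scaling_actions(loads, scale_out_at, scale_in_at, cooldown,
                          start=2, minimum=1, maximum=10):
    """Count how many times the group changes size over the load series."""
    size = start
    last_change = None
    actions = 0
    for tick, load in enumerate(loads):
        # Ignore the metric entirely while the cooldown is active.
        if last_change is not None and tick - last_change < cooldown:
            continue
        if load > scale_out_at and size < maximum:
            size += 1
        elif load < scale_in_at and size > minimum:
            size -= 1
        else:
            continue  # Inside the dead band between thresholds: do nothing.
        last_change = tick
        actions += 1
    return actions
```

With a load that oscillates between 68 and 72, thresholds of 70 and 69 trigger an action on every tick, a four-tick cooldown cuts that sharply, and thresholds of 80 and 50 remove the flapping altogether.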

A GitOps controller keeps reverting manual changes. What do you do?

Find the owning GitOps object, commit intended changes to Git, educate teams on source-of-truth behavior, and restrict direct production edits except documented emergency workflows.

GitOps sync loops are too slow. What do you tune?

Use webhooks, reduce sync interval carefully, optimize manifest generation, split large apps, reduce repo size, and check controller CPU, memory, and API server throttling.

GitOps deploys broken YAML. What should block it earlier?

Add YAML linting, schema validation, server-side dry runs, kubeconform checks (kubeval is no longer maintained), OPA or Kyverno policy checks, and CI gates before manifests reach the GitOps controller.

GitOps fails due to merge conflicts. What process helps?

Use environment branches or directories consistently, require PR reviews, reduce concurrent edits to the same files, automate image tag updates safely, and use clear ownership for shared overlays.

GitOps deploys too frequently. How do you control cadence?

Batch low-risk changes, use release branches, deployment windows, progressive delivery, and policy rules that separate build frequency from production rollout frequency.

GitOps fails due to missing secrets. What pattern helps?

Use sealed secrets, external secret operators, environment-specific secret stores, clear bootstrap ordering, and validation that referenced secret names exist before rollout.
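Validating that referenced secret names exist can be sketched as a walk over already-parsed manifests. The sketch assumes manifests are plain dictionaries and only looks at the secretKeyRef, secretRef, and secretName fields; the list of existing secret names would come from the cluster or secret store and is supplied by the caller here.

```python
def referenced_secrets(node):
    """Collect Secret names referenced anywhere inside a manifest structure."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key in ("secretKeyRef", "secretRef") and isinstance(value, dict) and "name" in value:
                found.add(value["name"])
            elif key == "secretName" and isinstance(value, str):
                found.add(value)
            else:
                found |= referenced_secrets(value)
    elif isinstance(node, list):
        for item in node:
            found |= referenced_secrets(item)
    return found


def missing_secrets(manifests, existing_names):
    """Return referenced Secret names that are not in existing_names, sorted."""
    return sorted(referenced_secrets(manifests) - set(existing_names))
```

Other reference styles, such as imagePullSecrets, are not covered and would need their own rules.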

GitOps deploys to the wrong cluster. What prevents it?

Use cluster selectors, environment labels, separate repos or folders per environment, explicit project permissions, unique cluster names, and checks that verify destination cluster identity.

GitOps fails due to large manifests. What do you clean up?

Split manifests into modules, use Kustomize overlays or Helm values, reduce duplication, avoid committing generated noise, and keep CRDs and app resources organized by lifecycle.

GitOps rollback fails. What should be true beforehand?

Manifests and images should be versioned, history should be retained, database changes should be backward compatible, and rollback should be tested as a normal release operation.

GitOps sync fails due to CRD version mismatch. What is the order of operations?

Upgrade CRDs first, validate API versions, confirm controllers support the new CRDs, then roll out custom resources. Avoid applying resources before their CRDs and controllers are ready.
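The ordering can be expressed as a sort over parsed manifests: CRDs first, then built-in resources such as the controller's Deployment, then custom resources whose API group is defined by one of those CRDs. This is a sketch with an assumed function name; it only orders the apply and does not wait for CRDs or controllers to become ready.

```python
def apply_order(manifests):
    """Return manifests sorted into CRDs, other resources, then custom resources."""
    crd_groups = {
        m["spec"]["group"]
        for m in manifests
        if m.get("kind") == "CustomResourceDefinition"
    }

    def phase(manifest):
        if manifest.get("kind") == "CustomResourceDefinition":
            return 0
        api_version = manifest.get("apiVersion", "")
        # Core resources use "v1" with no group; others use "group/version".
        group = api_version.split("/")[0] if "/" in api_version else ""
        return 2 if group in crd_groups else 1

    # sorted() is stable, so the original order is kept within each phase.
    return sorted(manifests, key=phase)
```

GitOps controllers typically offer their own ordering mechanisms, which would be the place to encode these phases in a real setup.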

CI runners are overloaded. What do you do?

Autoscale runners, use ephemeral runners, separate heavy and light workloads, distribute queues, cap concurrency for expensive jobs, and measure queue time as a capacity signal.

Artifact storage is growing uncontrollably. What controls help?

Enable retention policies, prune old images and artifacts, use lifecycle rules, deduplicate layers, tag releases clearly, and keep only artifacts required for rollback and audit.
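A retention policy can be sketched as a pure function that decides what to prune. The artifact fields (name, created, release) and the default limits are hypothetical; registries and artifact stores usually offer built-in lifecycle rules that should be preferred where available.

```python
from datetime import datetime, timedelta, timezone


def artifacts_to_prune(artifacts, now, keep_latest=5, max_age_days=30):
    """Return names of artifacts that are old, not recent, and not releases."""
    newest_first = sorted(artifacts, key=lambda a: a["created"], reverse=True)
    cutoff = now - timedelta(days=max_age_days)
    prune = []
    for index, artifact in enumerate(newest_first):
        # Always keep tagged releases and the most recent builds for rollback.
        if artifact.get("release") or index < keep_latest:
            continue
        if artifact["created"] < cutoff:
            prune.append(artifact["name"])
    return prune
```

Returning a list instead of deleting directly makes it easy to run the policy in a dry-run mode first.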

A container registry is slow. What do you check?

Check registry saturation, network path, regional distance, authentication latency, layer size, cache hit rate, and whether regional mirrors or pull-through caches can reduce latency.

A build agent pool is unbalanced. What improves distribution?

Use weighted queues, labels, autoscaling, workload sharding, runner affinity, and separate pools for CPU-heavy, memory-heavy, and privileged jobs.

A deployment pipeline is too complex. How do you simplify it?

Remove redundant stages, modularize repeated logic, standardize templates, separate build from deploy, make environment promotion explicit, and keep each stage's purpose clear.

A service has no rollback strategy. What do you implement?

Use versioned artifacts, immutable image tags, GitOps rollbacks, backward-compatible database changes, automated rollback triggers, canary analysis, and documented manual rollback steps.

A team pushes directly to main. What governance controls help?

Enforce branch protection, PR reviews, required CI checks, CODEOWNERS, signed commits if needed, and protected deploy environments.

A production incident was caused by a missing test. What changes?

Add a regression test for the failure, improve coverage around the class of bug, add contract or integration tests where needed, and ensure the test runs before deploy.

A pipeline deploys untested code. What quality gates should exist?

Require unit tests, integration tests, security scans, policy checks, artifact signing, config validation, and environment-specific deploy gates before production rollout.

A team wants to deploy 20 times per day safely. What platform capabilities are needed?

Use small changes, progressive delivery with canary releases, feature flags, automated rollback, strong observability, fast tests, clear service ownership, and a culture that treats rollback as normal.
