TL;DR

Use this checklist when onboarding a new service or preparing for a high-traffic event. It covers the minimum bar an application must meet before it can be considered production-ready on Kubernetes.

Workload Configuration

  • Resource requests and limits are set on every container. A pod only gets Guaranteed QoS when every container's requests equal its limits for both CPU and memory; a pod with no requests or limits at all is BestEffort and does not belong in production.
  • Liveness and readiness probes are configured and tested. The liveness failureThreshold should be high enough to avoid kill-loops during startup.
  • Startup probe is used for slow-starting containers to prevent premature liveness kills.
  • Replicas ≥ 2 for stateless services; at least 3 for stateful quorum workloads. A single replica turns every node drain into an outage.
  • PodDisruptionBudget is defined to protect availability during voluntary disruptions (upgrades, drains).
  • Pod topology spread or anti-affinity spreads replicas across zones so that the loss of one zone does not take down the whole service.
  • Graceful shutdown: the app handles SIGTERM, drains in-flight requests, and terminationGracePeriodSeconds is long enough.
  • Image tag is pinned to a digest or specific version; never use :latest in production.

Reliability Patterns

These YAML snippets illustrate the key workload properties for reliability. Merge the first fragment into your Deployment spec; the PodDisruptionBudget after the --- separator is a separate object.

reliability-patterns.yaml
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0          # never take a pod down before a new one is ready
      maxSurge: 1

  template:
    spec:
      terminationGracePeriodSeconds: 60

      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-service

      containers:
      - name: my-service
        image: my-service:1.2.3@sha256:abc123  # tag plus pinned digest; abc123 is a placeholder for the full 64-hex-character digest
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            memory: 256Mi   # no CPU limit to avoid throttling; use VPA for right-sizing

        startupProbe:
          httpGet: {path: /healthz, port: 8080}
          failureThreshold: 30
          periodSeconds: 5       # 30 * 5s = 150s max startup time

        readinessProbe:
          httpGet: {path: /readyz, port: 8080}
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 3

        livenessProbe:
          httpGet: {path: /healthz, port: 8080}
          periodSeconds: 10
          failureThreshold: 6    # 60 seconds of failures before restart

---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  minAvailable: 2               # always keep at least 2 pods up during drains
  selector:
    matchLabels:
      app: my-service
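
The manifest above bounds shutdown time with terminationGracePeriodSeconds, but the SIGTERM handling itself lives in the application. One common complement, shown here as a sketch and not as something this checklist prescribes, is a preStop hook that sleeps briefly so that endpoint removal can propagate before the container receives SIGTERM. The 10-second value and the presence of a sleep binary in the image are assumptions.

```yaml
spec:
  template:
    spec:
      # The grace period covers the preStop hook plus the application's own drain time.
      terminationGracePeriodSeconds: 60
      containers:
      - name: my-service
        lifecycle:
          preStop:
            exec:
              # Assumption: the image ships a sleep binary; 10s is an illustrative value.
              command: ["sleep", "10"]
```

Because the hook runs inside the grace period, the sleep plus the application's drain time should stay comfortably below terminationGracePeriodSeconds, or the kubelet will send SIGKILL before draining finishes.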

Security Checklist

  • Non-root user: runAsNonRoot: true and runAsUser set to a non-zero UID.
  • Read-only root filesystem: readOnlyRootFilesystem: true; mount writable emptyDir volumes only where needed.
  • Drop all capabilities: capabilities: {drop: [ALL]}; add back only the minimum required.
  • No privilege escalation: allowPrivilegeEscalation: false.
  • Secrets not in env vars: use volume mounts or External Secrets Operator rather than literal values in env blocks.
  • Network policy: default-deny ingress and egress with explicit allow-list for the service's actual communication paths.
  • Image scan: container images are scanned for CVEs in CI; critical vulnerabilities block the pipeline.
  • RBAC: service accounts have the minimum required permissions; no cluster-admin for application pods.

Observability Checklist

  • Structured logs (JSON) written to stdout/stderr. Log level configurable at runtime via env var.
  • Prometheus metrics exposed on /metrics; ServiceMonitor or PodMonitor configured for scraping.
  • SLO metrics instrumented: request count, error count, and latency histogram (p50/p95/p99).
  • Distributed tracing context is propagated across service calls (trace ID and span ID, for example via the W3C traceparent header); spans are exported to an OpenTelemetry Collector.
  • Dashboards: Grafana dashboard exists covering request rate, error rate, latency, and saturation (USE/RED method).
  • Alerts: SLO burn-rate alerts configured; on-call runbook linked in alert annotations.
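
As an illustration of the scraping and alerting items, here is a sketch that assumes the Prometheus Operator CRDs are installed. The Service port name metrics, the metric http_requests_total with its app and code labels, the 99.9% SLO, and the runbook URL are assumptions, not details from this checklist. The 14.4 factor is the commonly used fast-burn threshold: it corresponds to spending 2% of a 30-day error budget in one hour.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service
spec:
  selector:
    matchLabels:
      app: my-service
  endpoints:
  - port: metrics              # assumed name of the Service port that serves /metrics
    path: /metrics
    interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-service-slo
spec:
  groups:
  - name: my-service-slo
    rules:
    - alert: MyServiceFastErrorBudgetBurn
      # Fires when the error ratio exceeds 14.4x the budget of an assumed 99.9% SLO
      # over both a long (1h) and a short (5m) window.
      expr: |
        (
          sum(rate(http_requests_total{app="my-service",code=~"5.."}[1h]))
          /
          sum(rate(http_requests_total{app="my-service"}[1h]))
        ) > (14.4 * 0.001)
        and
        (
          sum(rate(http_requests_total{app="my-service",code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{app="my-service"}[5m]))
        ) > (14.4 * 0.001)
      for: 2m
      labels:
        severity: page
      annotations:
        summary: my-service is burning its error budget too fast
        runbook_url: https://runbooks.example.com/my-service   # hypothetical URL
```

The short window keeps the alert from lingering after the error rate has recovered; the long window keeps brief spikes from paging anyone.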

Operations Checklist

  • Runbook exists and is linked from alert definitions. Runbook covers: first checks, escalation, rollback.
  • On-call rotation is set up; alerting rules route to the correct team.
  • Horizontal autoscaling (HPA) configured with appropriate min/max replicas and scaling metrics.
  • Deployment is GitOps-managed (Argo CD / Flux); no kubectl apply from laptops in production.
  • Rollback is documented and tested: helm rollback or kubectl rollout undo verified to restore service.
  • Persistent data is backed up; restore procedure is documented and drilled.
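
For the autoscaling item, a minimal HorizontalPodAutoscaler might look like the sketch below. The replica bounds, the 70% CPU target, and the 5-minute scale-down window are illustrative assumptions; the Deployment name my-service follows the reliability snippet.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 3                 # keeps the PodDisruptionBudget's minAvailable: 2 satisfiable
  maxReplicas: 10                # illustrative ceiling
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # percentage of the container's CPU request
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before scaling down to avoid flapping
```

CPU utilization here is measured against the container's CPU request, so the request value directly shapes scaling behavior. When an HPA owns the replica count, it usually helps to omit spec.replicas from the Git-managed Deployment so that a GitOps sync does not reset the autoscaler's decision.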