Production Readiness — K8s SRE Reference

TL;DR

Use this checklist when onboarding a new service or preparing for a high-traffic event. It covers the minimum bar an application must meet before it can be considered production-ready on Kubernetes.

Workload Configuration

✓ Resource requests and limits are set on every container (a CPU limit may be deliberately omitted to avoid throttling). Pods run as Burstable or Guaranteed QoS; Guaranteed requires requests equal to limits for both CPU and memory. No BestEffort pods in production.
✓ Liveness and readiness probes are configured and tested. Set the liveness failureThreshold high enough that a slow start or brief stall does not trigger a restart loop.
✓ Startup probe is used for slow-starting containers to prevent premature liveness kills.
✓ Replicas ≥ 2 for stateless services; at least 3 for stateful quorum workloads. A single replica turns every node drain or pod restart into downtime.
✓ PodDisruptionBudget is defined to protect availability during voluntary disruptions (upgrades, drains).
✓ Pod topology spread or anti-affinity spreads replicas across zones so that losing one zone does not take out every replica.
✓ Graceful shutdown: the app handles SIGTERM, drains in-flight requests, and terminationGracePeriodSeconds is long enough to cover the drain.
✓ Image tag is pinned to a digest or a specific version; never :latest in production.
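
The graceful-shutdown item above can be sketched as a pod template fragment. This is a minimal illustration, not a definitive configuration: the 10-second delay is an arbitrary value, and the hook assumes the container image ships a `sleep` binary.

```yaml
# Pod template fragment: graceful shutdown sketch.
spec:
  template:
    spec:
      # Must cover the preStop delay plus the app's own drain time.
      terminationGracePeriodSeconds: 60
      containers:
      - name: my-service
        lifecycle:
          preStop:
            exec:
              # Assumption: the image contains a `sleep` binary.
              # SIGTERM is sent only after this hook returns, which gives
              # endpoint removal time to propagate to load balancers.
              command: ["sleep", "10"]
```

The grace period starts counting when termination begins, so the preStop delay and the application's drain time both have to fit inside terminationGracePeriodSeconds.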

Reliability Patterns

These YAML snippets illustrate the key workload properties for reliability; combine them in your Deployment spec.

reliability-patterns.yaml

spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0          # never take a pod down before a new one is ready
      maxSurge: 1

  template:
    spec:
      terminationGracePeriodSeconds: 60

      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-service

      containers:
      - name: my-service
        image: my-service:1.2.3@sha256:abc123  # pinned digest; abc123 stands in for the full 64-character sha256
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            memory: 256Mi   # no CPU limit to avoid throttling; use VPA for right-sizing

        startupProbe:
          httpGet: {path: /healthz, port: 8080}
          failureThreshold: 30
          periodSeconds: 5       # 30 * 5s = 150s max startup time

        readinessProbe:
          httpGet: {path: /readyz, port: 8080}
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 3

        livenessProbe:
          httpGet: {path: /healthz, port: 8080}
          periodSeconds: 10
          failureThreshold: 6    # 60 seconds of failures before restart

---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  minAvailable: 2               # always keep at least 2 pods up during drains
  selector:
    matchLabels:
      app: my-service

Security Checklist

✓ Non-root user: runAsNonRoot: true and runAsUser set to a non-zero UID.
✓ Read-only root filesystem: readOnlyRootFilesystem: true; mount writable emptyDir volumes only where needed.
✓ Drop all capabilities: capabilities: {drop: [ALL]}; add back only the minimum required.
✓ No privilege escalation: allowPrivilegeEscalation: false.
✓ Secrets not in env vars: use volume mounts or External Secrets Operator rather than literal values in env blocks.
✓ Network policy: default-deny ingress and egress with explicit allow-list for the service's actual communication paths.
✓ Image scan: container images are scanned for CVEs in CI; critical vulnerabilities block the pipeline.
✓ RBAC: service accounts have the minimum required permissions; no cluster-admin for application pods.
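
The pod-level items above might translate into a fragment like the following, paired with a namespace-wide default-deny policy. It is a sketch under assumptions: the UID, the Secret name my-service-secrets, the mount paths, and the policy name are hypothetical placeholders.

```yaml
# Pod template fragment (place under a Deployment's spec.template).
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001                  # hypothetical non-zero UID
  containers:
  - name: my-service
    image: my-service:1.2.3
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
    volumeMounts:
    - name: tmp                       # writable scratch space only where needed
      mountPath: /tmp
    - name: app-secrets               # secrets as files, not env vars
      mountPath: /etc/secrets
      readOnly: true
  volumes:
  - name: tmp
    emptyDir: {}
  - name: app-secrets
    secret:
      secretName: my-service-secrets  # hypothetical Secret name
---
# Default-deny for every pod in the namespace; add allow rules separately.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
```

A default-deny egress policy also blocks DNS lookups, so the allow-list usually needs a rule for the cluster DNS service alongside the service's own communication paths.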

Observability Checklist

✓ Structured logs (JSON) written to stdout/stderr. Log level configurable at runtime via env var.
✓ Prometheus metrics exposed on /metrics; ServiceMonitor or PodMonitor configured for scraping.
✓ SLO metrics instrumented: request count, error count, and a latency histogram from which p50/p95/p99 can be derived.
✓ Distributed tracing context (trace ID, span ID) is propagated across service calls; spans are exported to an OpenTelemetry Collector.
✓ Dashboards: Grafana dashboard exists covering request rate, error rate, latency, and saturation (RED and USE methods).
✓ Alerts: SLO burn-rate alerts configured; on-call runbook linked in alert annotations.
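
The scraping and burn-rate items above could look roughly like this with the Prometheus Operator. Everything service-specific is an assumption for illustration: the http-metrics port name, the http_requests_total metric with app and code labels, the 99.9% SLO target, and the runbook URL.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service
  labels:
    app: my-service
spec:
  selector:
    matchLabels:
      app: my-service
  endpoints:
  - port: http-metrics        # hypothetical named port on the Service
    path: /metrics
    interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-service-slo
spec:
  groups:
  - name: my-service-slo
    rules:
    - alert: MyServiceErrorBudgetFastBurn
      # Assumed 99.9% SLO (error budget 0.001). Fires when the error ratio
      # exceeds 14.4x the budget over both a 1h and a 5m window.
      expr: |
        (
          sum(rate(http_requests_total{app="my-service",code=~"5.."}[1h]))
          /
          sum(rate(http_requests_total{app="my-service"}[1h]))
        ) > (14.4 * 0.001)
        and
        (
          sum(rate(http_requests_total{app="my-service",code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{app="my-service"}[5m]))
        ) > (14.4 * 0.001)
      labels:
        severity: page
      annotations:
        summary: my-service is burning its error budget quickly
        runbook_url: https://example.com/runbooks/my-service   # placeholder URL
```

Requiring both windows to exceed the threshold keeps the alert from firing on a short spike and lets it clear soon after the error rate recovers.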

Operations Checklist

✓ Runbook exists and is linked from alert definitions. Runbook covers: first checks, escalation, rollback.
✓ On-call rotation is set up; alerting rules route to the correct team.
✓ Horizontal autoscaling (HPA) configured with appropriate min/max replicas and scaling metrics.
✓ Deployment is GitOps-managed (ArgoCD / Flux); no kubectl apply from laptops in production.
✓ Rollback is documented and tested: helm rollback or kubectl rollout undo verified to restore service.
✓ Persistent data is backed up; restore procedure is documented and drilled.
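
The autoscaling item above can be sketched with an autoscaling/v2 HorizontalPodAutoscaler. The replica bounds and the 70% CPU target are illustrative assumptions, not recommendations; derive real values from load tests.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 3                # assumed floor; keep it above the PDB's minAvailable
  maxReplicas: 10               # assumed ceiling
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # percentage of the CPU request, not the limit
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before scaling down
```

CPU utilization is computed against each container's CPU request, so this kind of target only works when requests are set as the Workload Configuration checklist requires.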
