Production Readiness
TL;DR
Use this checklist when onboarding a new service or preparing for a high-traffic event. It covers the minimum bar an application must meet before it can be considered production-ready on Kubernetes.
Workload Configuration
- Resource requests and limits are set on every container. Guaranteed QoS requires limits equal to requests for both CPU and memory; BestEffort pods (no requests or limits at all) do not belong in production.
- Liveness and readiness probes are configured and tested. Set the liveness `failureThreshold` high enough that a slow start does not trigger a restart loop.
- Startup probe is used for slow-starting containers to prevent premature liveness kills.
- Replicas ≥ 2 for stateless services; at least 3 for stateful quorum workloads. With a single replica, every node drain or pod restart is an outage.
- PodDisruptionBudget is defined to protect availability during voluntary disruptions (upgrades, drains).
- Pod topology spread constraints or anti-affinity spread replicas across zones so that a single-zone outage cannot take down every replica.
- Graceful shutdown: the app handles SIGTERM, drains in-flight requests, and `terminationGracePeriodSeconds` is long enough to finish draining.
- Image tag is pinned to a digest or specific version; never use `:latest` in production.
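
The graceful-shutdown item can be sketched as a pod template fragment. This is a minimal illustration, not a prescribed setup: the `preStop` sleep is an assumption (the checklist does not mandate it), the 10-second and 60-second values are arbitrary, and the container image is assumed to include a `sleep` binary.

```yaml
# Fragment of a Deployment pod template (spec.template.spec); values are illustrative.
terminationGracePeriodSeconds: 60   # must cover the preStop sleep plus worst-case drain time
containers:
  - name: my-service
    image: my-service:1.2.3
    lifecycle:
      preStop:
        exec:
          # Assumption: the image ships a `sleep` binary. The pause gives
          # load balancers and kube-proxy time to stop routing to this pod
          # before the kubelet sends SIGTERM to the container.
          command: ["sleep", "10"]
```

The grace period countdown starts when termination begins, so the `preStop` delay and the application's own drain time both have to fit inside `terminationGracePeriodSeconds`.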
Reliability Patterns
These YAML snippets illustrate the key workload properties for reliability; combine them in your Deployment spec.
```yaml
# reliability-patterns.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never take a pod down before a new one is ready
      maxSurge: 1
  template:
    metadata:
      labels:
        app: my-service   # must match the selector, spread constraint, and PDB
    spec:
      terminationGracePeriodSeconds: 60
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: my-service
      containers:
        - name: my-service
          image: my-service:1.2.3@sha256:abc123   # pinned digest (placeholder value)
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              memory: 256Mi   # no CPU limit to avoid throttling; use VPA for right-sizing
          startupProbe:
            httpGet: {path: /healthz, port: 8080}
            failureThreshold: 30
            periodSeconds: 5   # 30 * 5s = 150s max startup time
          readinessProbe:
            httpGet: {path: /readyz, port: 8080}
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet: {path: /healthz, port: 8080}
            periodSeconds: 10
            failureThreshold: 6   # 60 seconds of failures before restart
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  minAvailable: 2   # always keep at least 2 pods up during drains
  selector:
    matchLabels:
      app: my-service
```

Security Checklist
- Non-root user: `runAsNonRoot: true` and `runAsUser` set to a non-zero UID.
- Read-only root filesystem: `readOnlyRootFilesystem: true`; mount writable `emptyDir` volumes only where needed.
- Drop all capabilities: `capabilities: {drop: [ALL]}`; add back only the minimum required.
- No privilege escalation: `allowPrivilegeEscalation: false`.
- Secrets not in env vars: use volume mounts or External Secrets Operator rather than literal values in `env` blocks.
- Network policy: default-deny ingress and egress with explicit allow-list for the service's actual communication paths.
- Image scan: container images are scanned for CVEs in CI; critical vulnerabilities block the pipeline.
- RBAC: service accounts have the minimum required permissions; no cluster-admin for application pods.
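
The pod-level and network items above could look roughly like the following. Treat it as a sketch under stated assumptions: the UID `10001`, the mount paths, and the Secret name `my-service-secrets` are hypothetical, and the first document is a pod template fragment rather than a complete manifest.

```yaml
# Pod template fragment (spec.template.spec of a Deployment).
securityContext:
  runAsNonRoot: true
  runAsUser: 10001            # any non-zero UID; 10001 is an arbitrary choice
containers:
  - name: my-service
    image: my-service:1.2.3
    securityContext:
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]         # add back individual capabilities only if required
    volumeMounts:
      - name: tmp             # writable scratch space on an otherwise read-only filesystem
        mountPath: /tmp
      - name: app-secrets     # secrets arrive as files, not env vars
        mountPath: /etc/secrets
        readOnly: true
volumes:
  - name: tmp
    emptyDir: {}
  - name: app-secrets
    secret:
      secretName: my-service-secrets   # hypothetical Secret name
---
# Default-deny policy for the namespace it is applied in.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}             # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```

With egress denied by default, DNS lookups are blocked too, so the explicit allow-list policies normally need a rule permitting traffic to the cluster DNS service alongside the service's real dependencies.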
Observability Checklist
- Structured logs (JSON) written to stdout/stderr. Log level configurable at runtime via env var.
- Prometheus metrics exposed on `/metrics`; ServiceMonitor or PodMonitor configured for scraping.
- SLO metrics instrumented: request count, error count, and latency histogram (p50/p95/p99).
- Distributed tracing headers propagated (trace-id, span-id); spans exported to OpenTelemetry collector.
- Dashboards: Grafana dashboard exists covering request rate, error rate, latency, and saturation (USE/RED method).
- Alerts: SLO burn-rate alerts configured; on-call runbook linked in alert annotations.
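
As one possible shape for the scraping and alerting items, the sketch below uses the Prometheus Operator `ServiceMonitor` and `PrometheusRule` resources. Several things here are assumptions rather than anything the checklist specifies: a 99.9% availability SLO (error budget 0.001), a counter named `http_requests_total` with a `code` label, a Service port named `http-metrics`, and a placeholder runbook URL. The 14.4 factor is the burn rate that would consume 2% of a 30-day budget in one hour.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service
spec:
  selector:
    matchLabels:
      app: my-service
  endpoints:
    - port: http-metrics      # hypothetical named port on the Service
      path: /metrics
      interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-service-slo
spec:
  groups:
    - name: my-service-slo-burn
      rules:
        - alert: MyServiceErrorBudgetFastBurn
          # Fires only when both the long (1h) and short (5m) windows exceed
          # 14.4x the allowed error ratio, which limits flapping on brief spikes.
          expr: |
            (
              sum(rate(http_requests_total{job="my-service",code=~"5.."}[1h]))
              /
              sum(rate(http_requests_total{job="my-service"}[1h]))
            ) > (14.4 * 0.001)
            and
            (
              sum(rate(http_requests_total{job="my-service",code=~"5.."}[5m]))
              /
              sum(rate(http_requests_total{job="my-service"}[5m]))
            ) > (14.4 * 0.001)
          labels:
            severity: page
          annotations:
            summary: my-service is burning its error budget 14.4x too fast
            runbook_url: https://runbooks.example.com/my-service   # placeholder URL
```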
Operations Checklist
- Runbook exists and is linked from alert definitions. Runbook covers: first checks, escalation, rollback.
- On-call rotation is set up; alerting rules route to the correct team.
- Horizontal autoscaling (HPA) configured with appropriate min/max replicas and scaling metrics.
- Deployment is GitOps-managed (Argo CD / Flux); no `kubectl apply` from laptops in production.
- Rollback is documented and tested: `helm rollback` or `kubectl rollout undo` verified to restore service.
- Persistent data is backed up; restore procedure is documented and drilled.
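
For the autoscaling item, a minimal `HorizontalPodAutoscaler` might look like this. The replica bounds and the 70% CPU target are illustrative assumptions, and the target Deployment name `my-service` is hypothetical; suitable values depend on the service's load profile.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 3              # keep the floor above the PodDisruptionBudget's minAvailable
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percentage of the container CPU request
```

CPU utilization is measured against the CPU request, so this only works when requests are set on every container. When an HPA owns the replica count, it is common to leave `replicas` out of the Git-managed Deployment so the GitOps controller and the autoscaler do not overwrite each other.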