DaemonSets & Jobs — K8s SRE Reference

TL;DR

DaemonSets run one Pod per matching node, usually for node agents such as log collectors, CNI, storage plugins, and monitoring agents. Jobs run work to completion. CronJobs create Jobs on a schedule.

When To Use Each

Workload	Purpose	Examples	Key Debug Signal
DaemonSet	Run a Pod on every matching node.	Fluent Bit, node-exporter, CNI, CSI node plugin.	Desired/current/ready Pod counts versus the number of matching nodes.
Job	Run finite work until successful completion.	Migration, batch import, one-time maintenance.	Completions, failed Pods, backoffLimit.
CronJob	Create Jobs on a schedule.	Backups, reports, periodic cleanup.	lastScheduleTime, missed schedules, Job history.

DaemonSets

DaemonSets are ideal for node-local agents. When a new node joins, the DaemonSet controller creates a Pod there if the node satisfies the Pod template's node selector and affinity rules and the Pod tolerates the node's taints.


daemonset-node-agent.yaml (YAML)

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-log-agent
  namespace: observability
spec:
  selector:
    matchLabels:
      app: node-log-agent
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1 # Update one node agent at a time to preserve coverage.
  template:
    metadata:
      labels:
        app: node-log-agent
    spec:
      serviceAccountName: node-log-agent
      tolerations:
        - operator: Exists # Tolerates every taint (any key, any effect), so agents also run on control-plane and other tainted nodes.
      nodeSelector:
        kubernetes.io/os: linux # Avoid scheduling Linux agent on Windows nodes.
      containers:
        - name: agent
          image: fluent/fluent-bit:3.1
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true # Log agents usually read host logs.
      volumes:
        - name: varlog
          hostPath:
            path: /var/log # Host path on every node.
            type: Directory
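
The maxUnavailable setting in the manifest above bounds how many agents are replaced at once, which in turn sets the minimum number of update waves. A rough sketch of that arithmetic, assuming an integer maxUnavailable (percentages are not handled); rollout_batches is a hypothetical helper name, not a kubectl feature:

```shell
# rollout_batches NODES MAX_UNAVAILABLE
# Prints the minimum number of update waves for a RollingUpdate DaemonSet.
# Assumption: MAX_UNAVAILABLE is a positive integer, not a percentage.
rollout_batches() {
  nodes=$1
  max_unavailable=$2
  if [ "$max_unavailable" -lt 1 ]; then
    echo "max_unavailable must be >= 1" >&2
    return 1
  fi
  # Ceiling division: (a + b - 1) / b
  echo $(( (nodes + max_unavailable - 1) / max_unavailable ))
}
```

For example, rollout_batches 120 1 prints 120. Raising maxUnavailable shortens the rollout at the cost of more nodes without an agent at any moment.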

DaemonSet Operations

daemonset-ops.sh (Bash)

# Inspect desired/current/ready counts.
kubectl get daemonset -n <namespace> -o wide
kubectl describe daemonset <daemonset> -n <namespace>

# Show where DaemonSet Pods are running.
kubectl get pods -n <namespace> -l app=<app-label> -o wide

# Watch a DaemonSet rollout.
kubectl rollout status daemonset/<daemonset> -n <namespace>

# Restart all DaemonSet Pods through a template annotation update.
kubectl rollout restart daemonset/<daemonset> -n <namespace>
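
When the desired and ready counts disagree, it can help to list exactly which nodes lack an agent Pod. A minimal sketch, assuming two plain-text files with one node name per line; nodes_without_pod and the file names are hypothetical:

```shell
# nodes_without_pod NODES_FILE POD_NODES_FILE
# Prints every node name in NODES_FILE that does not appear in POD_NODES_FILE.
# Both files hold one node name per line.
nodes_without_pod() {
  # -v invert, -x whole-line match, -F fixed strings, -f read patterns from file.
  # grep exits 1 when it prints nothing; treat that as success (no node is missing).
  grep -vxF -f "$2" "$1" || [ "$?" -eq 1 ]
}

# Example wiring against a live cluster (placeholders as in the commands above):
#   kubectl get nodes -o name | sed 's|^node/||' > nodes.txt
#   kubectl get pods -n <namespace> -l app=<app-label> \
#     -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' > pod-nodes.txt
#   nodes_without_pod nodes.txt pod-nodes.txt
```

The node list should be limited to nodes the DaemonSet is meant to match, for example by adding -l kubernetes.io/os=linux to the kubectl get nodes call; otherwise intentionally excluded nodes show up as missing.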

Jobs

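A Job creates Pods and tracks successful completions. Use Jobs for tasks that should finish, not long-running services. When a Pod fails, the Job controller creates a replacement Pod, with an increasing back-off delay between attempts, until backoffLimit is exceeded; at that point the Job is marked Failed.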
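
The Kubernetes documentation describes the retry delay as exponential (10s, 20s, 40s and so on), capped at six minutes. A small sketch of that schedule, useful for estimating how long a failing Job takes to exhaust its retries; job_backoff_delay is a hypothetical helper and the real controller timing is only approximated:

```shell
# job_backoff_delay RETRY
# Prints the approximate delay in seconds before the RETRY-th replacement Pod
# (RETRY starts at 1), following the documented 10s, 20s, 40s ... pattern
# capped at six minutes (360 seconds).
job_backoff_delay() {
  retry=$1
  delay=10
  i=1
  # Double the delay once per earlier retry; stop early once the cap is reached.
  while [ "$i" -lt "$retry" ] && [ "$delay" -lt 360 ]; do
    delay=$(( delay * 2 ))
    i=$(( i + 1 ))
  done
  if [ "$delay" -gt 360 ]; then
    delay=360
  fi
  echo "$delay"
}
```

With a backoffLimit of 2, the two retries wait roughly 10 and 20 seconds, so a quickly failing Job is marked Failed well within a minute of its first failure.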

database-migration-job.yaml (YAML)

apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration-202605
  namespace: app
spec:
  completions: 1 # Number of successful Pods required.
  parallelism: 1 # Number of Pods allowed to run at the same time.
  backoffLimit: 2 # Retry failed Pods twice before marking Job failed.
  activeDeadlineSeconds: 1800 # Hard timeout for the whole Job.
  ttlSecondsAfterFinished: 86400 # Delete the finished Job and its Pods after 1 day (TTL-after-finished controller, stable since Kubernetes v1.23).
  template:
    metadata:
      labels:
        job-type: migration
    spec:
      restartPolicy: Never # Jobs accept only Never or OnFailure; Never keeps failed Pods around for log inspection.
      containers:
        - name: migrate
          image: registry.example.com/platform/app-migrations:2026.05.0
          command: ["./migrate"]
          args: ["--safe"] # App-specific migration flag.
          envFrom:
            - secretRef:
                name: app-database-credentials

Job Operations

job-ops.sh (Bash)

kubectl get jobs -n <namespace>
kubectl describe job <job-name> -n <namespace>

# Find Pods created by a Job.
kubectl get pods -n <namespace> -l job-name=<job-name> -o wide

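# Read logs from one of the Job's Pods (kubectl picks a single Pod for job/<job-name>).
kubectl logs -n <namespace> job/<job-name> --all-containers=true

# Read logs from every Pod the Job created, including failed attempts.
kubectl logs -n <namespace> -l job-name=<job-name> --all-containers=true --prefix --tail=-1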

# Delete and recreate a Job when a rerun is safe (the Pod template of an existing Job cannot be changed in place).
kubectl delete job <job-name> -n <namespace>
kubectl apply -f <job-manifest>.yaml
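
In scripts, kubectl wait --for=condition=complete keeps waiting until its timeout when the Job fails instead of completing, so it is often safer to check both outcomes. A minimal sketch that classifies a Job from its condition types; job_state is a hypothetical helper, and the jsonpath wiring in the comment is an untested assumption about how the input would be produced:

```shell
# job_state
# Reads the Job condition types whose status is "True" from stdin
# (one per line) and prints complete, failed, or running.
job_state() {
  state=running
  while IFS= read -r condition; do
    case "$condition" in
      Complete) state=complete ;;
      Failed)   state=failed ;;
    esac
  done
  echo "$state"
}

# Example wiring (placeholders as in the commands above):
#   kubectl get job <job-name> -n <namespace> \
#     -o jsonpath='{range .status.conditions[?(@.status=="True")]}{.type}{"\n"}{end}' \
#     | job_state
```

Any other condition type is ignored, so a Job with no terminal condition yet is reported as running.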

CronJobs

A CronJob creates Jobs based on a cron schedule. Pay close attention to concurrency policy, missed schedules, timezone expectations, and history cleanup.

backup-cronjob.yaml (YAML)

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
  namespace: app
spec:
  schedule: "0 2 * * *" # Daily at 02:00 in the kube-controller-manager's local time zone unless spec.timeZone is set.
  concurrencyPolicy: Forbid # Do not start a new backup if the previous one is still running.
  startingDeadlineSeconds: 900 # Skip a run that cannot start within 15 minutes of its scheduled time.
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 1
      activeDeadlineSeconds: 3600
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              image: registry.example.com/platform/backup:1.0.0
              command: ["/bin/sh", "-c"]
              args:
                - ./backup.sh # Script should be idempotent and safe to retry.
              envFrom:
                - secretRef:
                    name: backup-credentials
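
The startingDeadlineSeconds check in the manifest above comes down to comparing the lateness of a run with the deadline. A sketch of that comparison using epoch seconds; within_starting_deadline is a hypothetical helper, and treating a run that is exactly at the deadline as still eligible is an assumption about the controller's boundary behavior:

```shell
# within_starting_deadline SCHEDULED_EPOCH NOW_EPOCH DEADLINE_SECONDS
# Exits 0 if a run scheduled at SCHEDULED_EPOCH may still be started at
# NOW_EPOCH under the given startingDeadlineSeconds, and 1 otherwise.
within_starting_deadline() {
  scheduled=$1
  now=$2
  deadline=$3
  # Lateness in seconds must not exceed the deadline.
  [ $(( now - scheduled )) -le "$deadline" ]
}
```

For example, within_starting_deadline 1767232800 1767233400 900 succeeds because the run is only 600 seconds late; at 901 seconds late it would be counted as missed.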

CronJob Operations

cronjob-ops.sh (Bash)

kubectl get cronjobs -n <namespace>
kubectl describe cronjob <cronjob-name> -n <namespace>

# Create a one-off Job from a CronJob template for manual testing.
kubectl create job <manual-job-name> -n <namespace> --from=cronjob/<cronjob-name>

# Suspend and resume a CronJob.
kubectl patch cronjob <cronjob-name> -n <namespace> -p '{"spec":{"suspend":true}}'
kubectl patch cronjob <cronjob-name> -n <namespace> -p '{"spec":{"suspend":false}}'
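
Manual runs need a Job name that does not collide with an earlier manual run. One possible naming sketch, assuming a caller-supplied suffix such as a timestamp and a 63-character budget (the DNS label limit, which also applies to the job-name label value); manual_job_name is a hypothetical helper:

```shell
# manual_job_name CRONJOB SUFFIX
# Prints "<cronjob>-manual-<suffix>", truncating the CronJob part so the
# whole name stays within 63 characters.
# Assumption: SUFFIX is short (well under 50 characters).
manual_job_name() {
  suffix="-manual-$2"
  max_prefix=$(( 63 - ${#suffix} ))
  # Keep only as much of the CronJob name as fits next to the suffix.
  prefix=$(printf '%s' "$1" | cut -c "1-$max_prefix")
  echo "${prefix}${suffix}"
}

# Example wiring (placeholders as in the commands above):
#   kubectl create job "$(manual_job_name <cronjob-name> "$(date +%Y%m%d%H%M%S)")" \
#     -n <namespace> --from=cronjob/<cronjob-name>
```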

Troubleshooting

DaemonSet missing nodes: check node selectors, taints/tolerations, affinity, OS labels, and free node resources. Cordoning alone does not keep DaemonSet Pods off a node, because they automatically tolerate the unschedulable taint.
DaemonSet unavailable: inspect Pod events for image pull errors, hostPath problems, Pod Security restrictions on privileged or host access, CNI failures, or resource pressure.
Job keeps retrying: read failed Pod logs, check the exit code, validate the command and args, and review backoffLimit.
CronJob did not run: check the schedule, the suspend flag, controller health, a missed starting deadline, and the concurrency policy.
Duplicate work risk: Jobs and CronJobs should be idempotent because retries and manual reruns happen during incidents.
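
When checking the exit code of a failed Job Pod, the usual shell conventions apply: 126 and 127 point at the command itself, and values above 128 mean the process died from signal (code minus 128). A small lookup sketch; explain_exit_code is a hypothetical helper, and the wording of each explanation is illustrative:

```shell
# explain_exit_code CODE
# Prints a short interpretation of a container exit code.
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "success" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    137) echo "killed by SIGKILL (signal 9), often the OOM killer" ;;
    143) echo "terminated by SIGTERM (signal 15)" ;;
    *)
      if [ "$code" -gt 128 ]; then
        echo "terminated by signal $(( code - 128 ))"
      else
        echo "application error"
      fi
      ;;
  esac
}

# Example wiring (placeholders as in the commands above):
#   kubectl get pods -n <namespace> -l job-name=<job-name> \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.terminated.exitCode}{"\n"}{end}'
```

An exit code of 127 on the first attempt usually means the command or args in the manifest are wrong, so retries will not help until the manifest is fixed.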