DCGM Telemetry & GPU Alerts — K8s SRE Reference

TL;DR

DCGM Exporter (installed by the NVIDIA GPU Operator) exposes GPU utilization, memory, temperature, power, and XID errors to Prometheus. Enable the ServiceMonitor, build dashboards for GPU saturation, and alert on hardware errors before users open tickets.

Metrics pipeline

Figure 1 — Per-node DaemonSet scrapes DCGM; cluster Prometheus aggregates by node, GPU index, and pod (when exported).

Vital Prometheus metrics

Metric (DCGM exporter family)	Why SREs care
`DCGM_FI_DEV_GPU_UTIL`	Detect idle expensive nodes vs saturated GPUs
`DCGM_FI_DEV_FB_USED` / `DCGM_FI_DEV_FB_FREE`	VRAM pressure; early sign of out-of-memory failures in CUDA workloads
`DCGM_FI_DEV_GPU_TEMP`	Thermal throttling early warning
`DCGM_FI_DEV_POWER_USAGE`	Power cap / rack limits
`DCGM_FI_DEV_XID_ERRORS`	Driver/hardware faults — page immediately
`DCGM_FI_DEV_ECC_*`	Memory reliability on datacenter GPUs

Exact metric names vary slightly by DCGM exporter version, so validate them against /metrics on a live pod after install.
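
The two framebuffer metrics in the table can be combined into a per-GPU VRAM pressure ratio. The rule below is a minimal sketch in the same rule-file layout as the example alert rules in this reference. The group name `gpu-memory`, the alert name `GPUMemoryPressure`, the 0.9 threshold, and the 10m hold are illustrative assumptions, and the `node` label assumes a relabeling like the one in the ServiceMonitor.

```yaml
groups:
  - name: gpu-memory                # hypothetical group name
    rules:
      - alert: GPUMemoryPressure    # hypothetical alert name
        # Fraction of framebuffer in use per GPU. Both series carry the same
        # labels, so the division matches one-to-one.
        expr: |
          DCGM_FI_DEV_FB_USED
            / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.9
        for: 10m                    # assumed hold time; tune per workload
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.node }} above 90% VRAM"
```

Some exporter versions may report reserved framebuffer memory separately, in which case used + free slightly undercounts the total; check /metrics before relying on the ratio.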

ServiceMonitor shape

dcgm-servicemonitor.yaml

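apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
spec:
  # Must match the labels on the DCGM exporter Service.
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    # Must match the named port on that Service; verify it after install.
    - port: metrics
      interval: 15s
      # Copy the Kubernetes node name into a "node" label used by the alert rules.
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node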

Example alert rules

gpu-prometheus-rules.yaml

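groups:
  - name: gpu-hardware
    rules:
      # DCGM_FI_DEV_XID_ERRORS is a gauge holding the most recent XID code
      # (0 when none), so test for a nonzero value; increase() is meant for counters.
      - alert: GPUXidError
        expr: DCGM_FI_DEV_XID_ERRORS > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPU XID error on {{ $labels.node }} GPU {{ $labels.gpu }}"
      - alert: GPUHighTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 10m
        labels:
          severity: warning
      # count() without "by" drops every label, so the right-hand side is matched
      # with "on ()" as a cluster-wide gate. It counts all Pending pods, not only
      # pods that request GPUs.
      - alert: GPUNodeIdleWhilePending
        expr: |
          (avg by (node) (DCGM_FI_DEV_GPU_UTIL) < 5)
          and on () (count(kube_pod_status_phase{phase="Pending"} == 1) > 0)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "GPU node {{ $labels.node }} underutilized while pods are Pending: check scheduling or quota"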
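
The file above is a plain Prometheus rule file. Because the ServiceMonitor implies a Prometheus Operator setup, the same groups would typically be loaded through a PrometheusRule object instead. The following is a minimal sketch wrapping one of the rules; the object name `gpu-hardware-rules` and the `release: kube-prometheus-stack` label are assumptions, and the label must match whatever `ruleSelector` your Prometheus instance uses.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-hardware-rules           # hypothetical name
  namespace: gpu-operator
  labels:
    release: kube-prometheus-stack   # assumption: match your Prometheus ruleSelector
spec:
  # Same structure as the "groups:" key of a plain rule file.
  groups:
    - name: gpu-hardware
      rules:
        - alert: GPUHighTemperature
          expr: DCGM_FI_DEV_GPU_TEMP > 85
          for: 10m
          labels:
            severity: warning
```

The remaining rules can be added under `spec.groups[].rules` unchanged.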

For alert design patterns, see Alerting design.
