TL;DR

DCGM Exporter (installed by the GPU Operator) exposes GPU utilization, memory, temperature, power, and XID errors to Prometheus. Enable ServiceMonitor, chart dashboards for GPU saturation, and alert on hardware errors before users open tickets.

Metrics pipeline

GPU hardware → DCGM Exporter → Prometheus → Alertmanager / Grafana

Figure 1 — Per-node DaemonSet scrapes DCGM; cluster Prometheus aggregates by node, GPU index, and pod (when exported).

Vital Prometheus metrics

Metric (DCGM exporter family) | Why SREs care
DCGM_FI_DEV_GPU_UTIL | Detect idle expensive nodes vs saturated GPUs
DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE | VRAM pressure; OOM in CUDA workloads
DCGM_FI_DEV_GPU_TEMP | Thermal throttling early warning
DCGM_FI_DEV_POWER_USAGE | Power cap / rack limits
DCGM_FI_DEV_XID_ERRORS | Driver/hardware faults; page immediately
DCGM_FI_DEV_ECC_* | Memory reliability on datacenter GPUs

Exact metric names vary slightly by DCGM exporter version—validate against /metrics on a live pod after install.
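
The rows above can be turned into dashboard-friendly series with Prometheus recording rules. The following is a minimal sketch, not a definitive rule set: the group name and the recorded series names (gpu:fb_used:ratio, node:gpu_util:avg) are hypothetical, and the node label is assumed to come from a relabeling like the one in the ServiceMonitor in this article. Some exporter versions also report reserved framebuffer memory separately, so treat the ratio as an approximation until checked against /metrics.

```yaml
groups:
  - name: gpu-saturation-recording   # hypothetical group name
    rules:
      # Fraction of framebuffer (VRAM) in use per GPU, 0.0 to 1.0.
      # Assumes FB_USED and FB_FREE carry identical label sets.
      - record: gpu:fb_used:ratio
        expr: |
          DCGM_FI_DEV_FB_USED
          / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)
      # Mean GPU utilization (percent) across all GPUs on a node.
      # Assumes a "node" label is attached at scrape time.
      - record: node:gpu_util:avg
        expr: avg by (node) (DCGM_FI_DEV_GPU_UTIL)
```

Precomputing these keeps Grafana panels cheap and gives alert expressions a stable name to reference.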

ServiceMonitor shape

dcgm-servicemonitor.yaml (YAML)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node
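
If the GPU Operator was installed from its Helm chart, the chart may be able to create an equivalent ServiceMonitor for you. The values fragment below is a sketch that assumes the dcgmExporter.serviceMonitor keys present in recent chart versions; confirm them with helm show values for the version you run. The release label is an assumption about how your Prometheus instance selects ServiceMonitors.

```yaml
# Fragment of a values file for the GPU Operator Helm chart (key names assumed;
# verify against the chart version you have installed).
dcgmExporter:
  enabled: true
  serviceMonitor:
    enabled: true          # have the chart create the ServiceMonitor
    interval: 15s          # same scrape interval as the manifest above
    additionalLabels:
      release: kube-prometheus-stack   # assumption: matches your Prometheus serviceMonitorSelector
```

Use one approach or the other; applying both a chart-managed and a hand-written ServiceMonitor against the same Service would scrape every exporter twice.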

Example alert rules

gpu-prometheus-rules.yaml (YAML)
groups:
  - name: gpu-hardware
    rules:
      - alert: GPUXidError
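        # DCGM_FI_DEV_XID_ERRORS is a gauge holding the most recent XID code,
        # not a counter, so compare it to 0 instead of using increase().
        expr: DCGM_FI_DEV_XID_ERRORS > 0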
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPU XID error on {{ $labels.node }} GPU {{ $labels.gpu }}"
      - alert: GPUHighTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 10m
        labels:
          severity: warning
      - alert: GPUNodeIdleWhilePending
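        # count() without "by" returns one series with no labels, so the match
        # must be "on ()"; "on (node)" would never match and the alert would never fire.
        # Note: this counts every Pending pod in the cluster, not only GPU pods.
        expr: |
          (avg by (node) (DCGM_FI_DEV_GPU_UTIL) < 5)
          and on () (count(kube_pod_status_phase{phase="Pending"} == 1) > 0)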
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "GPU node underutilized while GPU pods are Pending — scheduling or quota issue"
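
Because the scrape side uses a ServiceMonitor, the cluster is presumably running the Prometheus Operator, which loads rules from PrometheusRule objects rather than from a bare rules file. A minimal sketch of wrapping one of the rules above follows; the object name and the release label are assumptions and must match whatever ruleSelector your Prometheus resource uses.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-hardware-rules            # hypothetical name
  namespace: gpu-operator
  labels:
    release: kube-prometheus-stack    # assumption: matches your Prometheus ruleSelector
spec:
  # The content of spec.groups uses the same schema as a plain rules file.
  groups:
    - name: gpu-hardware
      rules:
        - alert: GPUHighTemperature
          expr: DCGM_FI_DEV_GPU_TEMP > 85
          for: 10m
          labels:
            severity: warning
```

The remaining rules from the file drop into spec.groups unchanged, indented two extra levels.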

Alert design patterns: Alerting design.

See also