Telemetry: DCGM Exporter & Prometheus Alerts
TL;DR
DCGM Exporter (installed by the GPU Operator) exposes GPU utilization, memory, temperature, power, and XID errors to Prometheus. Enable the ServiceMonitor, build dashboards for GPU saturation, and alert on hardware errors before users open tickets.
Metrics pipeline
Figure 1 — Per-node DaemonSet scrapes DCGM; cluster Prometheus aggregates by node, GPU index, and pod (when exported).
Vital Prometheus metrics
| Metric (DCGM exporter family) | Why SREs care |
|---|---|
| DCGM_FI_DEV_GPU_UTIL | Detect idle expensive nodes vs saturated GPUs |
| DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE | VRAM pressure; OOM in CUDA workloads |
| DCGM_FI_DEV_GPU_TEMP | Thermal throttling early warning |
| DCGM_FI_DEV_POWER_USAGE | Power cap / rack limits |
| DCGM_FI_DEV_XID_ERRORS | Driver/hardware faults; page immediately |
| DCGM_FI_DEV_ECC_* | Memory reliability on datacenter GPUs |
Exact metric names vary slightly between DCGM exporter versions; validate them against /metrics on a live exporter pod after install.
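As a minimal sketch of that validation step, the snippet below parses Prometheus text exposition output and reports which of the metric names from the table are absent. The function names, the `EXPECTED` list, and the `SAMPLE` payload are hypothetical illustrations, not part of DCGM exporter; in practice the text would come from the exporter's /metrics endpoint (for example through a port-forward), and the ECC family is left out because its exact names are not spelled out above.

```python
import re

# Metric names taken from the table above; adjust to your exporter version.
EXPECTED = [
    "DCGM_FI_DEV_GPU_UTIL",
    "DCGM_FI_DEV_FB_USED",
    "DCGM_FI_DEV_FB_FREE",
    "DCGM_FI_DEV_GPU_TEMP",
    "DCGM_FI_DEV_POWER_USAGE",
    "DCGM_FI_DEV_XID_ERRORS",
]

# A metric name is the leading identifier of a sample line.
_NAME = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)")


def metric_names(exposition_text):
    """Return the set of metric names found in Prometheus text format."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(exposition_text, expected=EXPECTED):
    """Return the expected metric names that the payload does not expose."""
    present = metric_names(exposition_text)
    return [name for name in expected if name not in present]


# Hypothetical sample payload, for illustration only.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 97
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 61
DCGM_FI_DEV_FB_USED{gpu="0",Hostname="node-a"} 30000
"""

print(missing_metrics(SAMPLE))
```

Any name this reports as missing is a metric that dashboards or alert rules would silently never see, so it is worth running once after each exporter upgrade.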
ServiceMonitor shape
dcgm-servicemonitor.yaml

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: metrics  # must match the port name on the exporter Service
      interval: 15s
      relabelings:
        # Copy the node each exporter pod runs on into a "node" label.
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node
```

Example alert rules
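gpu-prometheus-rules.yaml

```yaml
groups:
  - name: gpu-hardware
    rules:
      - alert: GPUXidError
        # The exporter typically reports the last XID code as a gauge, so
        # confirm increase() behaves as intended on your exporter version.
        expr: increase(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPU XID error on {{ $labels.node }} GPU {{ $labels.gpu }}"
      - alert: GPUHighTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 10m
        labels:
          severity: warning
      - alert: GPUNodeIdleWhilePending
        # count() drops every label, so match on an empty label list;
        # "and on (node)" would never match and the alert could not fire.
        # This counts all Pending pods; narrow the selector to GPU workloads
        # if your labels allow it.
        expr: |
          (avg by (node) (DCGM_FI_DEV_GPU_UTIL) < 5)
          and on () (count(kube_pod_status_phase{phase="Pending"} == 1) > 0)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "GPU node underutilized while GPU pods are Pending: scheduling or quota issue"
```

Alert design patterns: Alerting design.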
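The intent behind the GPUNodeIdleWhilePending rule can be sketched outside PromQL: average utilization per node, then flag nodes under the threshold only when at least one pod is Pending. The function name, the tuple layout, and the sample values below are hypothetical, and the sketch ignores the `for:` hold duration that Prometheus applies before an alert fires.

```python
from collections import defaultdict


def idle_nodes_while_pending(gpu_util_samples, pending_pods, idle_threshold=5.0):
    """Return nodes whose average GPU utilization is below the threshold,
    but only if at least one pod is Pending cluster-wide.

    gpu_util_samples: iterable of (node, gpu_index, util_percent) tuples.
    """
    # No Pending pods means idle GPUs are not blocking anyone.
    if pending_pods <= 0:
        return []

    # Group utilization readings by node (mirrors "avg by (node)").
    per_node = defaultdict(list)
    for node, _gpu, util in gpu_util_samples:
        per_node[node].append(util)

    return sorted(
        node
        for node, utils in per_node.items()
        if sum(utils) / len(utils) < idle_threshold
    )


# Hypothetical snapshot of per-GPU utilization, for illustration only.
samples = [
    ("node-a", "0", 2.0), ("node-a", "1", 4.0),   # average 3.0, idle
    ("node-b", "0", 95.0), ("node-b", "1", 1.0),  # average 48.0, busy
]

print(idle_nodes_while_pending(samples, pending_pods=3))
```

Averaging per node means a single idle GPU on an otherwise busy node does not trigger the condition, which matches the node-level framing of the alert.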