SLOs & Error Budgets
An SLO is a reliability target expressed as a percentage over a time window. The error budget is the complement of that target (1 − SLO): the amount of unreliability you can spend within the window. When the budget runs out, stop shipping features and fix reliability. Multi-window burn-rate alerts written in PromQL give early warning before the budget is exhausted.
Concepts
| Term | Definition | Example |
|---|---|---|
| SLI | Service Level Indicator — a measurable metric of user experience | % of HTTP requests that succeed in < 500ms |
| SLO | Service Level Objective — target value for an SLI over a window | 99.9% of requests succeed over a 30-day rolling window |
| SLA | Service Level Agreement — contractual commitment (often SLO − margin) | 99.5% monthly uptime |
| Error Budget | 1 − SLO target; how much unreliability you're allowed per window | 0.1% of requests = ~43 min downtime per month |
Error Budget Math
The error budget is a finite resource that incidents, risky releases, and planned maintenance all draw from; knowing the numbers lets you have data-driven conversations about release velocity.
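As a rough sketch of the arithmetic in this section, the Python below computes the downtime budget for a time-based SLO and the unspent fraction of a request-based budget. The function names `downtime_budget_seconds` and `budget_remaining` are illustrative, not part of any library, and the 30-day window is an assumption carried over from the examples here.

```python
def downtime_budget_seconds(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in seconds for a time-based SLO over the window."""
    window_seconds = window_days * 24 * 60 * 60
    return (1 - slo) * window_seconds


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    The result goes negative once more requests have failed than the
    budget allows.
    """
    allowed_failures = (1 - slo) * total_requests
    return (allowed_failures - failed_requests) / allowed_failures


if __name__ == "__main__":
    # Print the 30-day downtime budget for a few common targets.
    for target in (0.99, 0.995, 0.999, 0.9995, 0.9999):
        minutes = downtime_budget_seconds(target) / 60
        print(f"{target:.2%} -> {minutes:.1f} minutes")
    # Unspent budget for 500 failures out of 1,000,000 requests at 99.9%.
    print(f"remaining: {budget_remaining(0.999, 1_000_000, 500):.0%}")
```

Passing the SLO as a fraction (0.999 rather than 99.9) keeps the budget expression identical to `1 - SLO`.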
# Error budget for request-based SLO
SLO=99.9%
Error Budget = 1 - SLO = 0.1%
# Over 30 days (2,592,000 seconds), allowed downtime:
0.001 * 2592000 = 2592 seconds ≈ 43.2 minutes
# Error budget remaining after an incident:
Total requests in window: 1,000,000
Failed requests this month: 500
Error rate = 500 / 1,000,000 = 0.05%
Budget allowed = 0.1% = 1000 bad requests
Budget remaining = (1000 - 500) / 1000 = 50%
# Common SLO targets and their 30-day downtime budgets:
# 99.0% → 7.2 hours
# 99.5% → 3.6 hours
# 99.9% → 43.2 minutes
# 99.95% → 21.6 minutes
# 99.99% → 4.3 minutes
PromQL: SLI Expressions
These queries compute the SLI (good-request ratio) for availability and latency; adapt the metric names to your application's instrumentation.
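To make the good-request ratio concrete, here is a minimal Python sketch that computes both SLIs from raw counts. The helper names and the input shapes (a dict of counts keyed by status code, a dict of cumulative bucket counts keyed by `le`) are assumptions for illustration; treating zero traffic as a perfect SLI is also an assumption, and Prometheus itself would return no data or NaN in that case.

```python
def availability_sli(requests_by_code: dict[str, float]) -> float:
    """Good-request ratio: every response that is not a 5xx counts as good."""
    total = sum(requests_by_code.values())
    if total == 0:
        return 1.0  # assumption: no traffic is treated as meeting the SLI
    good = sum(
        count
        for code, count in requests_by_code.items()
        if not code.startswith("5")  # same intent as the code!~"5.." matcher
    )
    return good / total


def latency_sli(cumulative_buckets: dict[str, float], threshold: str, total: float) -> float:
    """Fraction of requests that completed at or under `threshold` seconds.

    Prometheus histogram buckets are cumulative, so the bucket whose `le`
    label equals the threshold already includes every faster request.
    """
    if total == 0:
        return 1.0  # assumption: no traffic is treated as meeting the SLI
    return cumulative_buckets[threshold] / total
```

Because the buckets are cumulative, the latency threshold must match an existing `le` boundary; an SLO at 500 ms needs a `0.5` bucket in the histogram.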
# Availability SLI: fraction of successful HTTP requests (non-5xx)
sum(rate(http_requests_total{code!~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
# Latency SLI: fraction of requests completing under 500ms
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/ sum(rate(http_request_duration_seconds_count[5m]))
# 30-day availability (rolling window)
1 - (
  sum(increase(http_requests_total{code=~"5.."}[30d]))
  / sum(increase(http_requests_total[30d]))
)
# Error budget remaining (fraction)
(
  (1 - 0.999)  # SLO = 99.9%, error budget = 0.1%
  - (1 - (
      sum(rate(http_requests_total{code!~"5.."}[30d]))
      / sum(rate(http_requests_total[30d]))
  ))
) / (1 - 0.999)
Burn-rate Alerting
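Multi-window burn-rate alerts pair a long window with a short one. Fast-burn rules (short windows, high threshold) catch severe outages within minutes; slow-burn rules (long windows, low threshold) catch creeping error rates that would otherwise drain the budget unnoticed. You need both to catch all failure modes.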
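The logic behind a burn-rate rule can be sketched in a few lines of Python. Burn rate is the observed error ratio divided by the error budget, so a burn rate of 1 spends exactly one full budget over the SLO window. The function names below are hypothetical and the 30-day window is an assumption matching the examples in this document.

```python
from datetime import timedelta


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than budget-neutral the service is failing."""
    return error_ratio / (1 - slo)


def time_to_exhaustion(rate: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Time until a full budget is spent if the burn rate stays constant."""
    return window / rate


def should_alert(long_ratio: float, short_ratio: float, slo: float, threshold: float) -> bool:
    """Multi-window check: both windows must exceed the burn-rate threshold.

    The long window shows that a meaningful share of budget is already gone;
    the short window shows the problem is still happening, so the alert
    clears soon after the errors stop.
    """
    return (
        burn_rate(long_ratio, slo) > threshold
        and burn_rate(short_ratio, slo) > threshold
    )
```

Requiring both windows is what keeps a brief spike from paging (the long window stays below threshold) and a resolved incident from lingering (the short window drops first).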
groups:
  - name: slo.availability
    rules:
      # Fast burn: consuming budget 14x faster than sustainable → page immediately
      # sum() aggregates across series so the ratio is service-wide; without it the
      # division matches label sets one-to-one and 5xx series divide by themselves.
      - alert: AvailabilityBurnRateCritical
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (14 * 0.001)
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > (14 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Error budget burning fast (14x rate)"
          description: "At this rate the 30-day budget exhausts in ~2 days."
      # Slow burn: consuming budget 3x faster → ticket within 4 hours
      - alert: AvailabilityBurnRateWarning
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[6h]))
            / sum(rate(http_requests_total[6h]))
          ) > (3 * 0.001)
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[30m]))
            / sum(rate(http_requests_total[30m]))
          ) > (3 * 0.001)
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Error budget burning slowly (3x rate)"
Pyrra / Sloth SLO Definition
Sloth and Pyrra generate the recording rules and multi-window burn-rate alerts for you from a single SLO spec; use one of these instead of handwriting PromQL alerts. The spec below uses Sloth's Kubernetes CRD.
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: api-availability
  namespace: monitoring
spec:
  service: api
  labels:
    team: platform
  slos:
    - name: availability
      objective: 99.9
      description: "99.9% of API requests succeed (non-5xx)"
      sli:
        events:
          # The Kubernetes CRD uses camelCase keys; Sloth's standalone
          # prometheus/v1 spec uses snake_case (error_query, total_query).
          errorQuery: sum(rate(http_requests_total{code=~"5.."}[{{.window}}]))
          totalQuery: sum(rate(http_requests_total[{{.window}}]))
      alerting:
        name: APIAvailability
        pageAlert: {labels: {severity: page}}
        ticketAlert: {labels: {severity: ticket}}