SLOs & Error Budgets — K8s SRE Reference

TL;DR

An SLO is a reliability target expressed as a percentage over a time window. The error budget is the complement of that target (1 − SLO): the amount of unreliability you can spend within the window. When the budget runs out, pause feature releases and prioritize reliability work. Multi-window burn-rate alerts written in PromQL warn you before the budget is exhausted.

Concepts

Term	Definition	Example
SLI	Service Level Indicator — a measurable metric of user experience	% of HTTP requests that succeed in < 500ms
SLO	Service Level Objective — target value for an SLI over a window	99.9% of requests succeed over a 30-day rolling window
SLA	Service Level Agreement — contractual commitment (often SLO − margin)	99.5% monthly uptime
Error Budget	1 − SLO target; how much unreliability you're allowed per window	0.1% of requests = ~43 min downtime per month

Error Budget Math

The error budget is a finite resource consumed by incidents, risky releases, and planned maintenance alike; knowing the numbers lets you have data-driven conversations about release velocity.

error-budget-math.txt

# Error budget for request-based SLO
SLO=99.9%
Error Budget = 1 - SLO = 0.1%

# Over 30 days (2,592,000 seconds), allowed downtime:
0.001 * 2592000 = 2592 seconds ≈ 43.2 minutes

# Error budget remaining after an incident (request-based):
Total requests in window:   1,000,000
Failed requests in window:        500
Error rate       = 500 / 1,000,000 = 0.05%
Budget allowed   = 0.1% of 1,000,000 = 1,000 bad requests
Budget remaining = (1,000 - 500) / 1,000 = 50%

# Common SLO targets and their 30-day downtime budgets:
# 99.0%  → 7.2  hours
# 99.5%  → 3.6  hours
# 99.9%  → 43.2 minutes
# 99.95% → 21.6 minutes
# 99.99% →  4.3 minutes
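
The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not part of any tool; the function names `downtime_budget_seconds` and `budget_remaining` are hypothetical, and the SLO is assumed to be passed as a fraction (0.999 rather than 99.9).

```python
def downtime_budget_seconds(slo: float, window_days: float = 30) -> float:
    """Allowed downtime in seconds for an SLO given as a fraction (e.g. 0.999)."""
    return (1 - slo) * window_days * 24 * 60 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    Goes negative once more requests have failed than the budget allows.
    """
    allowed_failures = (1 - slo) * total_requests
    return (allowed_failures - failed_requests) / allowed_failures


if __name__ == "__main__":
    # Reproduce the table of common targets for a 30-day window.
    for slo in (0.99, 0.995, 0.999, 0.9995, 0.9999):
        minutes = downtime_budget_seconds(slo) / 60
        print(f"{slo:.2%} -> {minutes:.1f} minutes")

    # The worked incident example: 500 failures out of 1,000,000 requests.
    print(f"remaining: {budget_remaining(0.999, 1_000_000, 500):.0%}")
```

Expressing the remaining budget as a fraction of the budget itself (rather than as a raw error rate) keeps it comparable across services with different SLO targets.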

PromQL — SLI Expressions

These queries compute the SLI (good-request ratio) for availability and latency; adapt the metric names to your application's instrumentation.

sli-queries.promql

# Availability SLI: fraction of successful HTTP requests (non-5xx)
sum(rate(http_requests_total{code!~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Latency SLI: fraction of requests completing under 500ms
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
  / sum(rate(http_request_duration_seconds_count[5m]))

# 30-day availability (rolling window)
1 - (
  sum(increase(http_requests_total{code=~"5.."}[30d]))
  / sum(increase(http_requests_total[30d]))
)

# Error budget remaining (fraction of the budget; negative once overspent)
(
  (1 - 0.999)  # SLO = 99.9%, error budget = 0.1%
  - (1 - (
      sum(rate(http_requests_total{code!~"5.."}[30d]))
      / sum(rate(http_requests_total[30d]))
    ))
) / (1 - 0.999)
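
The latency SLI works because Prometheus histogram buckets are cumulative: the `le="0.5"` bucket already counts every request that finished in 0.5 s or less. A small Python sketch of the same ratio, using a hypothetical `latency_sli` helper and made-up bucket counts, may make that clearer. It assumes the SLO threshold matches an existing bucket bound, which the PromQL query requires as well.

```python
def latency_sli(cumulative_buckets: dict[str, float], threshold: str = "0.5") -> float:
    """Fraction of requests that completed at or under `threshold` seconds.

    `cumulative_buckets` maps a bucket's `le` label to its cumulative count,
    the way Prometheus exposes histogram buckets. The "+Inf" bucket equals
    the total number of observations.
    """
    if threshold not in cumulative_buckets:
        raise KeyError(f"no bucket with le={threshold!r}; pick an SLO threshold that is a bucket bound")
    total = cumulative_buckets["+Inf"]
    if total == 0:
        return float("nan")  # no traffic in the window, so the ratio is undefined
    return cumulative_buckets[threshold] / total


if __name__ == "__main__":
    # Hypothetical counts: 950 of 1000 requests finished within 0.5 s.
    buckets = {"0.1": 700, "0.5": 950, "1": 990, "+Inf": 1000}
    print(latency_sli(buckets))
```

If the threshold you care about is not a bucket bound, the usual fix is to add that bound to the histogram's bucket configuration rather than to interpolate.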

Burn-rate Alerting

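Multi-window burn-rate alerts catch fast burns (severe outages) with short windows and high thresholds, and slow burns (creeping error rates) with long windows and lower thresholds; you need both to cover all failure modes. Each alert also requires a shorter confirmation window to exceed the same threshold, so it stops firing soon after the errors stop.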

burn-rate-alerts.yaml

groups:
- name: slo.availability
  rules:
  # 0.001 is the error budget for a 99.9% SLO; each threshold is burn rate x budget.
  # Both sides of every ratio are wrapped in sum() so the division compares the
  # service-wide error rate with the service-wide request rate.

  # Fast burn: consuming budget 14x faster → page immediately
  - alert: AvailabilityBurnRateCritical
    expr: |
      (
        sum(rate(http_requests_total{code=~"5.."}[1h]))
        / sum(rate(http_requests_total[1h]))
      ) > (14 * 0.001)
      and
      (
        sum(rate(http_requests_total{code=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m]))
      ) > (14 * 0.001)
    for: 2m
    labels:
      severity: page
    annotations:
      summary: "Error budget burning fast (14x rate)"
      description: "At this rate the 30-day budget exhausts in ~2 days."

  # Slow burn: consuming budget 3x faster → ticket within 4 hours
  - alert: AvailabilityBurnRateWarning
    expr: |
      (
        sum(rate(http_requests_total{code=~"5.."}[6h]))
        / sum(rate(http_requests_total[6h]))
      ) > (3 * 0.001)
      and
      (
        sum(rate(http_requests_total{code=~"5.."}[30m]))
        / sum(rate(http_requests_total[30m]))
      ) > (3 * 0.001)
    for: 15m
    labels:
      severity: ticket
    annotations:
      summary: "Error budget burning slowly (3x rate)"
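
To sanity-check thresholds like these, it helps to compute what a given burn rate means in wall-clock terms. The sketch below is illustrative only; `burn_rate`, `hours_to_exhaustion`, and `should_alert` are hypothetical helper names, and the SLO is assumed to be a fraction over a 30-day window.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo)


def hours_to_exhaustion(rate: float, window_days: float = 30) -> float:
    """Hours until a full budget is spent if the burn rate stays constant."""
    return window_days * 24 / rate


def should_alert(long_ratio: float, short_ratio: float, slo: float, factor: float) -> bool:
    """Multi-window condition: both windows must exceed factor x error budget."""
    threshold = factor * (1 - slo)
    return long_ratio > threshold and short_ratio > threshold


if __name__ == "__main__":
    for factor in (14, 3):
        hours = hours_to_exhaustion(factor)
        print(f"{factor}x burn -> budget gone in {hours:.1f} h ({hours / 24:.1f} days)")

    # 2% errors over both the long and the short window trips the 14x alert;
    # if the short window has recovered, the alert condition is no longer met.
    print(should_alert(0.02, 0.02, slo=0.999, factor=14))
    print(should_alert(0.02, 0.0005, slo=0.999, factor=14))
```

A sustained 14x burn spends a 30-day budget in 720 / 14 ≈ 51 hours, and a 3x burn in 10 days, which is why the first is treated as a page and the second as a ticket.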

Pyrra / Sloth SLO Definition

Sloth and Pyrra generate the recording rules and multi-window burn-rate alerts for you from a single SLO spec; use one of these instead of handwriting PromQL alerts.

slo-sloth.yaml

apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: api-availability
  namespace: monitoring
spec:
  service: api
  labels:
    team: platform
  slos:
  - name: availability
    objective: 99.9
    description: "99.9% of API requests succeed (non-5xx)"
    sli:
      events:
        # The Kubernetes CRD uses camelCase field names
        # (snake_case belongs to Sloth's standalone prometheus/v1 spec).
        errorQuery: sum(rate(http_requests_total{code=~"5.."}[{{.window}}]))
        totalQuery: sum(rate(http_requests_total[{{.window}}]))
    alerting:
      name: APIAvailability
      pageAlert:   {labels: {severity: page}}
      ticketAlert: {labels: {severity: ticket}}
