---
id: sm-slo-implementation
name: "slo-implementation"
url: https://skills.yangsir.net/skill/sm-slo-implementation
author: wshobson
domain: ai-system-observability-sre
tags: ["slo-(service-level-objectives)", "sre", "reliability-engineering", "monitoring", "incident-management"]
install_count: 6300
rating: 4.40 (69 reviews)
github: https://github.com/wshobson/agents
---

# slo-implementation

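> Master the implementation of Service Level Objectives (SLOs), using intelligent automation and multi-agent orchestration to ensure system reliability and meet user expectations.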

**Stats**: 6,300 installs · 4.4/5 (69 reviews)


## Readme

# SLO Implementation

Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.

## Purpose

Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.

## When to Use

- Define service reliability targets

- Measure user-perceived reliability

- Implement error budgets

- Create SLO-based alerts

- Track reliability goals

## SLI/SLO/SLA Hierarchy

```
SLA (Service Level Agreement)
  ↓ Contract with customers
SLO (Service Level Objective)
  ↓ Internal reliability target
SLI (Service Level Indicator)
  ↓ Actual measurement

```

## Defining SLIs

### Common SLI Types

#### 1. Availability SLI

```
# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))

```

#### 2. Latency SLI

```
# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))

```

#### 3. Durability SLI

```
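# Successful writes / Total writes over the SLO window
# increase() handles counter resets and bounds the ratio to the last 28 days
sum(increase(storage_writes_successful_total[28d]))
/
sum(increase(storage_writes_total[28d]))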

```

**Reference:** See `references/slo-definitions.md`
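
The availability ratio above can be checked by hand against raw request counts. The following is a minimal sketch, assuming a hypothetical `requests_by_status` mapping of status code to request count; it mirrors the `status!~"5.."` matcher, which PromQL anchors at both ends.

```python
import re

def availability_sli(requests_by_status: dict[str, int]) -> float:
    """Good requests / total requests, treating any 5xx status as a failure."""
    total = sum(requests_by_status.values())
    if total == 0:
        # Assumption: no traffic counts as fully available.
        return 1.0
    good = sum(
        count
        for status, count in requests_by_status.items()
        if not re.fullmatch(r"5..", status)  # same meaning as status!~"5.."
    )
    return good / total

# Hypothetical sample counts for one window
sample = {"200": 9990, "404": 5, "500": 3, "503": 2}
print(f"availability SLI: {availability_sli(sample):.4f}")
```

Note that 4xx responses count as good here, as they do in the PromQL expression; whether that matches user-perceived reliability depends on the service.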

## Setting SLO Targets

### Availability SLO Examples

| SLO % | Downtime/Month | Downtime/Year |
|---|---|---|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43.2 minutes | 8.76 hours |
| 99.95% | 21.6 minutes | 4.38 hours |
| 99.99% | 4.32 minutes | 52.56 minutes |
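
The downtime figures follow directly from the SLO percentage and the window length. A minimal sketch of the arithmetic, assuming a 30-day month and a 365-day year (the function name is illustrative):

```python
MINUTES_PER_DAY = 24 * 60

def allowed_downtime_minutes(slo_percent: float, window_days: float) -> float:
    """Minutes of full downtime a window can absorb without breaching the SLO."""
    return (1 - slo_percent / 100) * window_days * MINUTES_PER_DAY

for slo in (99.0, 99.9, 99.95, 99.99):
    per_month = allowed_downtime_minutes(slo, 30)   # assumes a 30-day month
    per_year = allowed_downtime_minutes(slo, 365)   # assumes a 365-day year
    print(f"{slo}%: {per_month:.2f} min/month, {per_year / 60:.2f} h/year")
```

With a 28-day window, as used in the queries elsewhere in this document, the allowances are slightly smaller than the monthly column.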

### Choose Appropriate SLOs

**Consider:**

- User expectations

- Business requirements

- Current performance

- Cost of reliability

- Competitor benchmarks

**Example SLOs:**

```
slos:
  - name: api_availability
    target: 99.9
    window: 28d
    sli: |
      sum(rate(http_requests_total{status!~"5.."}[28d]))
      /
      sum(rate(http_requests_total[28d]))

  # 99% of requests must complete within 500 ms
  - name: api_latency_500ms
    target: 99
    window: 28d
    sli: |
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
      /
      sum(rate(http_request_duration_seconds_count[28d]))

```

## Error Budget Calculation

### Error Budget Formula

```
Error Budget = 1 - SLO Target

```

**Example:**

- SLO: 99.9% availability

- Error Budget: 0.1% = 43.2 minutes/month

- Current Error: 0.05% = 21.6 minutes/month

- Remaining Budget: 50%
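
The remaining-budget figure in the example can be reproduced from the SLI and the target alone. A minimal sketch, with an illustrative function name:

```python
def error_budget_remaining(sli: float, target: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = 1 - target   # e.g. 0.001 for a 99.9% SLO
    spent = 1 - sli       # observed error ratio over the same window
    return (budget - spent) / budget

# 99.9% target with a measured SLI of 99.95%, as in the example above
remaining = error_budget_remaining(sli=0.9995, target=0.999)
print(f"remaining budget: {remaining:.0%}")
```

This is algebraically the same as `(sli - target) / (1 - target)`, so an SLI exactly at the target leaves 0% and anything below it goes negative.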

### Error Budget Policy

```
error_budget_policy:
  - remaining_budget: 100%
    action: Normal development velocity
  - remaining_budget: 50%
    action: Consider postponing risky changes
  - remaining_budget: 10%
    action: Freeze non-critical changes
  - remaining_budget: 0%
    action: Feature freeze, focus on reliability

```
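
One way to apply the policy programmatically is a threshold lookup. The sketch below assumes each `remaining_budget` value is an "at or below" threshold, which the policy does not state explicitly; the function name is hypothetical.

```python
def policy_action(remaining_percent: float) -> str:
    """Map remaining error budget (percent) to the policy action.

    Assumption: each policy entry applies at or below its threshold,
    and anything above 50% counts as normal velocity.
    """
    if remaining_percent <= 0:
        return "Feature freeze, focus on reliability"
    if remaining_percent <= 10:
        return "Freeze non-critical changes"
    if remaining_percent <= 50:
        return "Consider postponing risky changes"
    return "Normal development velocity"

for remaining in (100, 65, 50, 8, -5):
    print(f"{remaining:>4}% remaining -> {policy_action(remaining)}")
```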

**Reference:** See `references/error-budget.md`

## SLO Implementation

### Prometheus Recording Rules

```
# SLI Recording Rules
groups:
  - name: sli_rules
    interval: 30s
    rules:
      # Availability SLI
      - record: sli:http_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[28d]))
          /
          sum(rate(http_requests_total[28d]))

      # Latency SLI (requests < 500ms)
      - record: sli:http_latency:ratio
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
          /
          sum(rate(http_request_duration_seconds_count[28d]))

  - name: slo_rules
    interval: 5m
    rules:
      # SLO compliance (1 = meeting SLO, 0 = violating)
      - record: slo:http_availability:compliance
        expr: sli:http_availability:ratio >= bool 0.999

      - record: slo:http_latency:compliance
        expr: sli:http_latency:ratio >= bool 0.99

      # Error budget remaining (percentage)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100

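      # Error budget burn rate, one rule per window referenced by the alerts
      - record: slo:http_availability:burn_rate_5m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_30m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_1h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_6h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          )) / (1 - 0.999)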

```

### SLO Alerting Rules

```
groups:
  - name: slo_alerts
    interval: 1m
    rules:
      # Fast burn: 14.4x rate, 1 hour window
      # Consumes 2% error budget in 1 hour
      - alert: SLOErrorBudgetBurnFast
        expr: |
          slo:http_availability:burn_rate_1h > 14.4
          and
          slo:http_availability:burn_rate_5m > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Slow burn: 6x rate, 6 hour window
      # Consumes 5% error budget in 6 hours
      - alert: SLOErrorBudgetBurnSlow
        expr: |
          slo:http_availability:burn_rate_6h > 6
          and
          slo:http_availability:burn_rate_30m > 6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Error budget exhausted
      - alert: SLOErrorBudgetExhausted
        expr: slo:http_availability:error_budget_remaining < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO error budget exhausted"
          description: "Error budget remaining: {{ $value }}%"

```
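
The 14.4x and 6x thresholds come from deciding how much of the budget a sustained burn may consume within the alert window. A minimal sketch of that arithmetic, assuming a 30-day SLO window (the "2%" and "5%" comments above match 30 days; with the 28-day window used elsewhere the fractions are slightly higher). Function names are illustrative.

```python
def budget_consumed(burn_rate: float, alert_window_hours: float,
                    slo_window_days: float = 30) -> float:
    """Fraction of the error budget spent if burn_rate holds for the alert window."""
    return burn_rate * alert_window_hours / (slo_window_days * 24)

def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_days: float = 30) -> float:
    """Burn rate that spends budget_fraction of the budget within the alert window."""
    return budget_fraction * slo_window_days * 24 / alert_window_hours

print(f"fast burn: {budget_consumed(14.4, 1):.1%} of budget in 1 hour")
print(f"slow burn: {budget_consumed(6, 6):.1%} of budget in 6 hours")
print(f"threshold for 2% in 1 hour: {burn_rate_threshold(0.02, 1):.1f}x")
```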

## SLO Dashboard

**Grafana Dashboard Structure:**

```
┌────────────────────────────────────┐
│ SLO Compliance (Current)           │
│ ✓ 99.95% (Target: 99.9%)          │
├────────────────────────────────────┤
│ Error Budget Remaining: 65%        │
│ ████████░░ 65%                     │
├────────────────────────────────────┤
│ SLI Trend (28 days)                │
│ [Time series graph]                │
├────────────────────────────────────┤
│ Burn Rate Analysis                 │
│ [Burn rate by time window]         │
└────────────────────────────────────┘

```

**Example Queries:**

```
# Current SLI, as a percentage
sli:http_availability:ratio * 100

# Error budget remaining (percentage)
slo:http_availability:error_budget_remaining

# Days until error budget exhausted, assuming the 28-day average burn rate continues
# = remaining fraction * window days / burn rate,
#   where burn rate = (1 - SLI) / (1 - 0.999)
(slo:http_availability:error_budget_remaining / 100)
*
28
/
((1 - sli:http_availability:ratio) / (1 - 0.999))

```
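
The days-to-exhaustion query is a projection: remaining budget fraction times the window length, divided by the burn rate. A minimal sketch of the same calculation, with a hypothetical function name and explicit handling of the edge cases the PromQL leaves undefined:

```python
def days_until_exhausted(remaining_fraction: float, burn_rate: float,
                         window_days: float = 28) -> float:
    """Project how many days the remaining budget lasts at a constant burn rate.

    A burn rate of 1.0 spends exactly one full budget per window.
    """
    if remaining_fraction <= 0:
        return 0.0              # budget already spent
    if burn_rate <= 0:
        return float("inf")     # no errors: budget never runs out
    return remaining_fraction * window_days / burn_rate

# 65% of the budget left while burning at twice the sustainable rate
print(f"{days_until_exhausted(0.65, 2.0):.1f} days")
```

Which burn rate to feed in is a design choice: a 28-day average gives a stable estimate, while a short-window rate reacts faster to an ongoing incident.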

## Multi-Window Burn Rate Alerts

```
# Combination of short and long windows reduces false positives
rules:
  - alert: SLOBurnRateHigh
    expr: |
      (
        slo:http_availability:burn_rate_1h > 14.4
        and
        slo:http_availability:burn_rate_5m > 14.4
      )
      or
      (
        slo:http_availability:burn_rate_6h > 6
        and
        slo:http_availability:burn_rate_30m > 6
      )
    labels:
      severity: critical

```
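
The alert expression above fires only when a long window and its paired short window both exceed the threshold. A minimal sketch of that logic outside Prometheus, assuming a hypothetical `burn_rates` mapping keyed by window name:

```python
def should_alert(burn_rates: dict[str, float]) -> bool:
    """Mirror the multi-window rule: both windows of a pair must exceed the threshold."""
    fast = burn_rates["1h"] > 14.4 and burn_rates["5m"] > 14.4
    slow = burn_rates["6h"] > 6 and burn_rates["30m"] > 6
    return fast or slow

# A brief spike: the 5m window is hot, but the 1h window has not caught up
spike = {"5m": 30.0, "30m": 5.0, "1h": 3.0, "6h": 0.8}
# A sustained incident: long and short windows agree
incident = {"5m": 20.0, "30m": 18.0, "1h": 16.0, "6h": 4.0}

print("spike alerts:", should_alert(spike))
print("incident alerts:", should_alert(incident))
```

The long window shows that a meaningful share of budget has been spent; the short window confirms the problem is still happening, so the alert clears soon after recovery.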

## SLO Review Process

### Weekly Review

- Current SLO compliance

- Error budget status

- Trend analysis

- Incident impact

### Monthly Review

- SLO achievement

- Error budget usage

- Incident postmortems

- SLO adjustments

### Quarterly Review

- SLO relevance

- Target adjustments

- Process improvements

- Tooling enhancements

## Best Practices

- **Start with user-facing services**

- **Use multiple SLIs** (availability, latency, etc.)

- **Set achievable SLOs** (don't aim for 100%)

- **Implement multi-window alerts** to reduce noise

- **Track error budget** consistently

- **Review SLOs regularly**

- **Document SLO decisions**

- **Align with business goals**

- **Automate SLO reporting**

- **Use SLOs for prioritization**

## Related Skills

- `prometheus-configuration` - For metric collection

- `grafana-dashboards` - For SLO visualization

**Repository**: [wshobson/agents](https://github.com/wshobson/agents) · 31.5K GitHub stars · 2.9K weekly installs · First seen Jan 20, 2026

---
*Source: https://skills.yangsir.net/skill/sm-slo-implementation*
*Markdown mirror: https://skills.yangsir.net/api/skill/sm-slo-implementation/markdown*