S

slo-implementation

by @wshobsonv
4.4(69)

掌握服务水平目标(SLO)的实施,通过智能自动化和多智能体编排,确保系统可靠性,满足用户期望。

slo-(service-level-objectives)srereliability-engineeringmonitoringincident-managementGitHub
安装方式
npx skills add wshobson/agents --skill slo-implementation
compare_arrows

Before / After 效果对比

1
使用前

缺乏明确的服务等级目标(SLO),服务质量难以衡量,导致客户满意度下降,SLA(服务等级协议)难以达成。

使用后

实施SLO(Service Level Objectives)模式,定义清晰的服务质量指标,并自动化监控和报告,提升服务可靠性。

SKILL.md

slo-implementation

SLO Implementation

Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.

Purpose

Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.

When to Use

  • Define service reliability targets

  • Measure user-perceived reliability

  • Implement error budgets

  • Create SLO-based alerts

  • Track reliability goals

SLI/SLO/SLA Hierarchy

SLA (Service Level Agreement)
  ↓ Contract with customers
SLO (Service Level Objective)
  ↓ Internal reliability target
SLI (Service Level Indicator)
  ↓ Actual measurement

Defining SLIs

Common SLI Types

1. Availability SLI

# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))

2. Latency SLI

# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))

3. Durability SLI

# Successful writes / Total writes
sum(storage_writes_successful_total)
/
sum(storage_writes_total)

Reference: See references/slo-definitions.md

Setting SLO Targets

Availability SLO Examples

SLO % Downtime/Month Downtime/Year

99% 7.2 hours 3.65 days

99.9% 43.2 minutes 8.76 hours

99.95% 21.6 minutes 4.38 hours

99.99% 4.32 minutes 52.56 minutes

Choose Appropriate SLOs

Consider:

  • User expectations

  • Business requirements

  • Current performance

  • Cost of reliability

  • Competitor benchmarks

Example SLOs:

slos:
  - name: api_availability
    target: 99.9
    window: 28d
    sli: |
      sum(rate(http_requests_total{status!~"5.."}[28d]))
      /
      sum(rate(http_requests_total[28d]))

  - name: api_latency_p95
    target: 99
    window: 28d
    sli: |
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
      /
      sum(rate(http_request_duration_seconds_count[28d]))

Error Budget Calculation

Error Budget Formula

Error Budget = 1 - SLO Target

Example:

  • SLO: 99.9% availability

  • Error Budget: 0.1% = 43.2 minutes/month

  • Current Error: 0.05% = 21.6 minutes/month

  • Remaining Budget: 50%

Error Budget Policy

error_budget_policy:
  - remaining_budget: 100%
    action: Normal development velocity
  - remaining_budget: 50%
    action: Consider postponing risky changes
  - remaining_budget: 10%
    action: Freeze non-critical changes
  - remaining_budget: 0%
    action: Feature freeze, focus on reliability

Reference: See references/error-budget.md

SLO Implementation

Prometheus Recording Rules

# SLI Recording Rules
groups:
  - name: sli_rules
    interval: 30s
    rules:
      # Availability SLI
      - record: sli:http_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[28d]))
          /
          sum(rate(http_requests_total[28d]))

      # Latency SLI (requests < 500ms)
      - record: sli:http_latency:ratio
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
          /
          sum(rate(http_request_duration_seconds_count[28d]))

  - name: slo_rules
    interval: 5m
    rules:
      # SLO compliance (1 = meeting SLO, 0 = violating)
      - record: slo:http_availability:compliance
        expr: sli:http_availability:ratio >= bool 0.999

      - record: slo:http_latency:compliance
        expr: sli:http_latency:ratio >= bool 0.99

      # Error budget remaining (percentage)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100

      # Error budget burn rate
      - record: slo:http_availability:burn_rate_5m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )) / (1 - 0.999)

SLO Alerting Rules

groups:
  - name: slo_alerts
    interval: 1m
    rules:
      # Fast burn: 14.4x rate, 1 hour window
      # Consumes 2% error budget in 1 hour
      - alert: SLOErrorBudgetBurnFast
        expr: |
          slo:http_availability:burn_rate_1h > 14.4
          and
          slo:http_availability:burn_rate_5m > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Slow burn: 6x rate, 6 hour window
      # Consumes 5% error budget in 6 hours
      - alert: SLOErrorBudgetBurnSlow
        expr: |
          slo:http_availability:burn_rate_6h > 6
          and
          slo:http_availability:burn_rate_30m > 6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Error budget exhausted
      - alert: SLOErrorBudgetExhausted
        expr: slo:http_availability:error_budget_remaining < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO error budget exhausted"
          description: "Error budget remaining: {{ $value }}%"

SLO Dashboard

Grafana Dashboard Structure:

┌────────────────────────────────────┐
│ SLO Compliance (Current)           │
│ ✓ 99.95% (Target: 99.9%)          │
├────────────────────────────────────┤
│ Error Budget Remaining: 65%        │
│ ████████░░ 65%                     │
├────────────────────────────────────┤
│ SLI Trend (28 days)                │
│ [Time series graph]                │
├────────────────────────────────────┤
│ Burn Rate Analysis                 │
│ [Burn rate by time window]         │
└────────────────────────────────────┘

Example Queries:

# Current SLO compliance
sli:http_availability:ratio * 100

# Error budget remaining
slo:http_availability:error_budget_remaining

# Days until error budget exhausted (at current burn rate)
(slo:http_availability:error_budget_remaining / 100)
*
28
/
(1 - sli:http_availability:ratio) * (1 - 0.999)

Multi-Window Burn Rate Alerts

# Combination of short and long windows reduces false positives
rules:
  - alert: SLOBurnRateHigh
    expr: |
      (
        slo:http_availability:burn_rate_1h > 14.4
        and
        slo:http_availability:burn_rate_5m > 14.4
      )
      or
      (
        slo:http_availability:burn_rate_6h > 6
        and
        slo:http_availability:burn_rate_30m > 6
      )
    labels:
      severity: critical

SLO Review Process

Weekly Review

  • Current SLO compliance

  • Error budget status

  • Trend analysis

  • Incident impact

Monthly Review

  • SLO achievement

  • Error budget usage

  • Incident postmortems

  • SLO adjustments

Quarterly Review

  • SLO relevance

  • Target adjustments

  • Process improvements

  • Tooling enhancements

Best Practices

  • Start with user-facing services

  • Use multiple SLIs (availability, latency, etc.)

  • Set achievable SLOs (don't aim for 100%)

  • Implement multi-window alerts to reduce noise

  • Track error budget consistently

  • Review SLOs regularly

  • Document SLO decisions

  • Align with business goals

  • Automate SLO reporting

  • Use SLOs for prioritization

Related Skills

  • prometheus-configuration - For metric collection

  • grafana-dashboards - For SLO visualization

Weekly Installs2.9KRepositorywshobson/agentsGitHub Stars31.5KFirst SeenJan 20, 2026Security AuditsGen Agent Trust HubPassSocketPassSnykPassInstalled onclaude-code2.3Kopencode2.1Kgemini-cli2.1Kcursor2.0Kcodex2.0Kgithub-copilot1.7K

用户评价 (0)

发表评价

效果
易用性
文档
兼容性

暂无评价

统计数据

安装量6.3K
评分4.4 / 5.0
版本
更新日期2026年5月23日
对比案例1 组

用户评分

4.4(69)
5
59%
4
41%
3
0%
2
0%
1
0%

为此 Skill 评分

0.0

兼容平台

🔧Claude Code
🔧OpenClaw
🔧OpenCode
🔧Codex
🔧Gemini CLI
🔧GitHub Copilot
🔧Amp
🔧Kimi CLI

时间线

创建2026年3月17日
最后更新2026年5月23日