M

monitoring-expert

by @jeffallanv
4.4(37)

Configures monitoring systems and implements structured logging to ensure stable system operation and rapid fault localization.

monitoringalertingobservabilityprometheusgrafanaelk-stackGitHub
Installation
npx skills add jeffallan/claude-skills --skill monitoring-expert
compare_arrows

Before / After Comparison

1
Before

Lack of an effective monitoring system and log management makes system failures difficult to detect and pinpoint. This results in prolonged service outages, poor user experience, an overwhelmed operations team, and impacts business continuity.

After

Establish a comprehensive monitoring system and structured log pipeline to provide real-time alerts for potential issues. This enables rapid fault localization, reduces recovery time, ensures high system availability, and significantly boosts operational efficiency.

SKILL.md

Monitoring Expert

Observability and performance specialist implementing comprehensive monitoring, alerting, tracing, and performance testing systems.

Core Workflow

  1. Assess — Identify what needs monitoring (SLIs, critical paths, business metrics)
  2. Instrument — Add logging, metrics, and traces to the application (see examples below)
  3. Collect — Configure aggregation and storage (Prometheus scrape, log shipper, OTLP endpoint); verify data arrives before proceeding
  4. Visualize — Build dashboards using RED (Rate/Errors/Duration) or USE (Utilization/Saturation/Errors) methods
  5. Alert — Define threshold and anomaly alerts on critical paths; validate no false-positive flood before shipping

Quick-Start Examples

Structured Logging (Node.js / Pino)

import pino from 'pino';

const logger = pino({ level: 'info' });

// Good — structured fields, includes correlation ID
logger.info({ requestId: req.id, userId: req.user.id, durationMs: elapsed }, 'order.created');

// Bad — string interpolation, no correlation
console.log(`Order created for user ${userId}`);

Prometheus Metrics (Node.js)

import { Counter, Histogram, register } from 'prom-client';

const httpRequests = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency',
  labelNames: ['method', 'route'],
  buckets: [0.05, 0.1, 0.3, 0.5, 1, 2, 5],
});

// Instrument a route
app.use((req, res, next) => {
  const end = httpDuration.startTimer({ method: req.method, route: req.path });
  res.on('finish', () => {
    httpRequests.inc({ method: req.method, route: req.path, status: res.statusCode });
    end();
  });
  next();
});

// Expose scrape endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

OpenTelemetry Tracing (Node.js)

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { trace } from '@opentelemetry/api';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: 'http://jaeger:4318/v1/traces' }),
});
sdk.start();

// Manual span around a critical operation
const tracer = trace.getTracer('order-service');
async function processOrder(orderId) {
  const span = tracer.startSpan('order.process');
  span.setAttribute('order.id', orderId);
  try {
    const result = await db.saveOrder(orderId);
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (err) {
    span.recordException(err);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw err;
  } finally {
    span.end();
  }
}

Prometheus Alerting Rule

groups:
  - name: api.rules
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
          / rate(http_requests_total[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% on {{ $labels.route }}"

k6 Load Test

import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '1m', target: 50 },   // ramp up
    { duration: '5m', target: 50 },   // sustained load
    { duration: '1m', target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],  // 95th percentile < 500 ms
    http_req_failed:   ['rate<0.01'],  // error rate < 1%
  },
};

export default function () {
  const res = http.get('https://api.example.com/orders');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}

Reference Guide

Load detailed guidance based on context:

TopicReferenceLoad When
Loggingreferences/structured-logging.mdPino, JSON logging
Metricsreferences/prometheus-metrics.mdCounter, Histogram, Gauge
Tracingreferences/opentelemetry.mdOpenTelemetry, spans
Alertingreferences/alerting-rules.mdPrometheus alerts
Dashboardsreferences/dashboards.mdRED/USE method, Grafana
Performance Testingreferences/performance-testing.mdLoad testing, k6, Artillery, benchmarks
Profilingreferences/application-profiling.mdCPU/memory profiling, bottlenecks
Capacity Planningreferences/capacity-planning.mdScaling, forecasting, budgets

Constraints

MUST DO

  • Use structured logging (JSON)
  • Include request IDs for correlation
  • Set up alerts for critical paths
  • Monitor business metrics, not just technical
  • Use appropriate metric types (counter/gauge/histogram)
  • Implement health check endpoints

MUST NOT DO

  • Log sensitive data (passwords, tokens, PII)
  • Alert on every error (alert fatigue)
  • Use string interpolation in logs (use structured fields)
  • Skip correlation IDs in distributed systems

User Reviews (0)

Write a Review

Effect
Usability
Docs
Compatibility

No reviews yet

Statistics

Installs2.6K
Rating4.4 / 5.0
Version
Updated2026年5月21日
Comparisons1

User Rating

4.4(37)
5
14%
4
46%
3
35%
2
5%
1
0%

Rate this Skill

0.0

Compatible Platforms

🔧Claude Code
🔧OpenClaw
🔧OpenCode
🔧Codex
🔧Gemini CLI
🔧GitHub Copilot
🔧Amp
🔧Kimi CLI

Timeline

Created2026年3月16日
Last Updated2026年5月21日