machine-learning-engineer
Provides machine learning engineering expertise, focusing on model deployment, production serving infrastructure, and real-time inference systems.

npx skills add 404kidwiz/claude-supercode-skills --skill machine-learning-engineer

Before / After Comparison

Before: Deploying machine learning models is difficult, and performance in production is unstable. Real-time inference systems suffer from high latency, making it hard to meet business requirements.

After: With expert ML engineering support, model deployment becomes more efficient: you can build stable real-time inference systems and keep AI services running at high performance.
Machine Learning Engineer
Purpose
Provides ML engineering expertise specializing in model deployment, production serving infrastructure, and real-time inference systems. Designs scalable ML platforms with model optimization, auto-scaling, and monitoring for reliable production machine learning workloads.
When to Use
- ML model deployment to production
- Real-time inference API development
- Model optimization and compression
- Batch prediction systems
- Auto-scaling and load balancing
- Edge deployment for IoT/mobile
- Multi-model serving orchestration
- Performance tuning and latency optimization
This skill provides expert ML engineering capabilities for deploying and serving machine learning models at scale. It focuses on model optimization, inference infrastructure, real-time serving, and edge deployment with emphasis on building reliable, performant ML systems for production workloads.
What This Skill Does
This skill deploys ML models to production with comprehensive infrastructure. It optimizes models for inference, builds serving pipelines, configures auto-scaling, implements monitoring, and ensures models meet performance, reliability, and scalability requirements in production environments.
ML Deployment Components
- Model optimization and compression
- Serving infrastructure (REST/gRPC APIs, batch jobs)
- Load balancing and request routing
- Auto-scaling and resource management
- Real-time and batch prediction systems
- Monitoring, logging, and observability
- Edge deployment and model compression
- A/B testing and canary deployments
Core Capabilities
Model Deployment Pipelines
- CI/CD integration for ML models
- Automated testing and validation
- Model performance benchmarking
- Security scanning and vulnerability assessment
- Container building and registry management
- Progressive rollout and blue-green deployment
Serving Infrastructure
- Load balancer configuration (NGINX, HAProxy)
- Request routing and model caching
- Connection pooling and health checking
- Graceful shutdown and resource allocation
- Multi-region deployment and failover
- Container orchestration (Kubernetes, ECS)
Model Optimization
- Quantization (FP32, FP16, INT8, INT4)
- Model pruning and sparsification
- Knowledge distillation techniques
- ONNX and TensorRT conversion
- Graph optimization and operator fusion
- Memory optimization and throughput tuning
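The INT8 quantization listed above can be illustrated with a minimal pure-Python sketch: symmetric quantization maps each float weight to an 8-bit integer plus a single scale factor. Production pipelines would use framework tooling such as ONNX Runtime's quantization utilities; `quantize_int8` and `dequantize` here are illustrative helpers, not a library API.

```python
# Symmetric INT8 post-training quantization of a weight vector.

def quantize_int8(weights):
    """Map float weights to int8 values plus one scale factor."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # guard all-zero weights
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.82, -1.27, 0.03, 0.5]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Round-trip error per weight is bounded by scale / 2.
```

Because the round-trip error per weight is bounded by half the scale, INT8 usually costs little accuracy for well-scaled weights, which is why it is the default target for edge deployment.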
Real-time Inference
- Request preprocessing and validation
- Model prediction execution
- Response formatting and error handling
- Timeout management and circuit breaking
- Request batching and response caching
- Streaming predictions and async processing
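Request batching, mentioned above, is commonly implemented as micro-batching: requests accumulate briefly so that one model call serves many of them. A minimal asyncio sketch, with `run_model` as a hypothetical stand-in for a real batched predict call:

```python
import asyncio

class MicroBatcher:
    """Collect up to max_batch requests (waiting at most max_wait
    seconds) and serve them with a single batched model call."""

    def __init__(self, run_model, max_batch=8, max_wait=0.01):
        self.run_model = run_model
        self.max_batch = max_batch
        self.max_wait = max_wait
        self.queue = asyncio.Queue()
        self.worker = None

    async def predict(self, features):
        if self.worker is None:  # lazy-start the batching worker
            self.worker = asyncio.ensure_future(self._serve())
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((features, future))
        return await future

    async def _serve(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block for the first request
            deadline = loop.time() + self.max_wait
            while len(batch) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            results = self.run_model([features for features, _ in batch])
            for (_, future), result in zip(batch, results):
                future.set_result(result)

async def demo():
    batcher = MicroBatcher(lambda rows: [sum(row) for row in rows])
    out = await asyncio.gather(*(batcher.predict([i, i]) for i in range(4)))
    batcher.worker.cancel()
    return out
```

The `max_wait` knob trades a small latency penalty per request for much higher GPU utilization; real serving frameworks (e.g. Triton's dynamic batching) expose the same two parameters.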
Batch Prediction Systems
- Job scheduling and orchestration
- Data partitioning and parallel processing
- Progress tracking and error handling
- Result aggregation and storage
- Cost optimization and resource management
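The partitioning, parallel-processing, and aggregation steps above can be sketched as follows; `score_chunk` is a hypothetical stand-in for a real model call, and a thread pool is used since inference libraries typically release the GIL.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(records, chunk_size):
    """Split records into fixed-size chunks, preserving order."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def score_chunk(chunk):
    # Stand-in "model": score = sum of features (illustration only).
    return [sum(features) for features in chunk]

def batch_predict(records, chunk_size=1000, workers=4):
    chunks = partition(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves chunk order, so results line up with inputs.
        results = pool.map(score_chunk, chunks)
    return [score for chunk_scores in results for score in chunk_scores]

preds = batch_predict([[1, 2], [3, 4], [5, 6]], chunk_size=2)
# preds == [3, 7, 11]
```

Because `map` preserves chunk order, result aggregation is a flat concatenation; failed chunks can be retried individually without rerunning the whole job.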
Auto-scaling Strategies
- Metric-based scaling (CPU, GPU, request rate)
- Scale-up and scale-down policies
- Warm-up periods and predictive scaling
- Cost controls and regional distribution
- Traffic prediction and capacity planning
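Metric-based scaling usually reduces to one replica formula; the sketch below mirrors the Kubernetes HPA calculation (desired = ceil(current * metric / target)) with min/max clamps and a tolerance band. Parameter names and defaults are illustrative.

```python
import math

def desired_replicas(current, metric, target, lo=2, hi=50, tolerance=0.1):
    """HPA-style replica count from an observed per-pod metric
    (e.g. requests per second) against a per-pod target."""
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:  # within tolerance: avoid flapping
        return current
    return max(lo, min(hi, math.ceil(current * ratio)))

# 4 pods observing 900 RPS/pod against a 500 RPS/pod target -> 8 pods.
assert desired_replicas(4, 900, 500) == 8
```

The tolerance band is what prevents scale flapping on small metric fluctuations; cost controls map directly onto the `hi` clamp.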
Multi-model Serving
- Model routing and version management
- A/B testing and traffic splitting
- Ensemble serving and model cascading
- Fallback strategies and performance isolation
- Shadow mode testing and validation
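Traffic splitting for A/B tests is often implemented by hashing a stable request key into the unit interval and routing by cumulative weight, so a given user always sees the same model version. A minimal sketch with hypothetical model names:

```python
import hashlib

def route(user_id, weights):
    """weights: e.g. {"model-v1": 0.9, "model-v2": 0.1}; must sum to 1."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    cumulative = 0.0
    for model, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return model
    return model  # guard against float rounding at the boundary

weights = {"model-v1": 0.9, "model-v2": 0.1}
assert route("user-42", weights) == route("user-42", weights)  # sticky
```

Hash-based assignment needs no session store, and ramping a canary is just a weight change; the same users stay in the treatment group as the split widens.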
Edge Deployment
- Model compression for edge devices
- Hardware optimization and power efficiency
- Offline capability and update mechanisms
- Telemetry collection and security hardening
- Resource constraints and optimization
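One common update mechanism for edge devices is a chunk-level delta update: hash fixed-size chunks of the model artifact and ship only the chunks that changed. A toy sketch (4-byte chunks for readability; real systems would use far larger chunks plus a signed manifest):

```python
import hashlib

CHUNK = 4  # bytes; illustrative only

def chunk_hashes(blob):
    chunks = [blob[i:i + CHUNK] for i in range(0, len(blob), CHUNK)]
    return [hashlib.sha256(c).hexdigest() for c in chunks]

def delta(old_blob, new_blob):
    """Return {chunk_index: new_bytes} for every changed or added chunk."""
    old_h = chunk_hashes(old_blob)
    new_h = chunk_hashes(new_blob)
    return {
        i: new_blob[i * CHUNK:(i + 1) * CHUNK]
        for i, h in enumerate(new_h)
        if i >= len(old_h) or h != old_h[i]
    }

old = b"AAAABBBBCCCC"
new = b"AAAAXXXXCCCC"
patch = delta(old, new)
# Only chunk 1 changed, so only 4 of 12 bytes need to be transferred.
assert patch == {1: b"XXXX"}
```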
Tool Restrictions
- Read: Access model artifacts, infrastructure configs, and monitoring data
- Write/Edit: Create deployment configs, serving code, and optimization scripts
- Bash: Execute deployment commands, monitoring setup, and performance tests
- Glob/Grep: Search codebases for model integration and serving endpoints
Integration with Other Skills
- ml-engineer: Model optimization and training pipeline integration
- mlops-engineer: Infrastructure and platform setup
- data-engineer: Data pipelines and feature stores
- devops-engineer: CI/CD and deployment automation
- cloud-architect: Cloud infrastructure and architecture
- sre-engineer: Reliability and availability
- performance-engineer: Performance profiling and optimization
- ai-engineer: Model selection and integration
Example Interactions
Scenario 1: Real-time Inference API Deployment
User: "Deploy our ML model as a real-time API with auto-scaling"
Interaction:
- Skill analyzes model characteristics and requirements
- Implements serving infrastructure:
  - Optimizes model with ONNX conversion (60% size reduction)
  - Creates FastAPI/gRPC serving endpoints
  - Configures GPU auto-scaling based on request rate
  - Implements request batching for throughput
  - Sets up monitoring and alerting
- Deploys to Kubernetes with horizontal pod autoscaler
- Achieves <50ms P99 latency and 2000+ RPS throughput
Scenario 2: Multi-model Serving Platform
User: "Build a platform to serve 50+ models with intelligent routing"
Interaction:
- Skill designs multi-model architecture:
  - Model registry and version management
  - Intelligent routing based on request type
  - Specialist models for different use cases
  - Fallback and circuit breaking
  - Cost optimization with smaller models for simple queries
- Implements serving framework with:
  - Model loading and unloading
  - Request queuing and load balancing
  - A/B testing and traffic splitting
  - Ensemble serving for critical paths
- Deploys with comprehensive monitoring and cost tracking
Scenario 3: Edge Deployment for IoT
User: "Deploy ML model to edge devices with limited resources"
Interaction:
- Skill analyzes device constraints and requirements
- Optimizes model for edge:
  - Quantizes to INT8 (4x size reduction)
  - Prunes and compresses model
  - Implements ONNX Runtime for efficient inference
  - Adds offline capability and local caching
- Creates deployment package:
  - Edge-optimized inference runtime
  - Update mechanism with delta updates
  - Telemetry collection and monitoring
  - Security hardening and encryption
- Tests on target hardware and validates performance
Best Practices
- Performance: Target <100ms P99 latency for real-time inference
- Reliability: Implement graceful degradation and fallback models
- Monitoring: Track latency, throughput, error rates, and resource usage
- Testing: Conduct load testing and validate against production traffic patterns
- Security: Implement authentication, encryption, and model security
- Documentation: Document all deployment configurations and operational procedures
- Cost: Optimize resource usage and implement auto-scaling for cost efficiency
Examples
Example 1: Real-Time Inference API for Production
Scenario: Deploy a fraud detection model as a real-time API with auto-scaling.
Deployment Approach:
- Model Optimization: Converted model to ONNX (60% size reduction)
- Serving Framework: Built FastAPI endpoints with async processing
- Infrastructure: Kubernetes deployment with Horizontal Pod Autoscaler
- Monitoring: Integrated Prometheus metrics and Grafana dashboards
Configuration:

```python
# FastAPI serving with an ONNX-optimized model
from typing import List

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI

app = FastAPI()
session = ort.InferenceSession("model.onnx")

@app.post("/predict")
async def predict(features: List[float]):
    input_tensor = np.array([features], dtype=np.float32)
    outputs = session.run(None, {"input": input_tensor})
    return {"prediction": outputs[0].tolist()}
```
Performance Results:

| Metric       | Value     |
|--------------|-----------|
| P99 Latency  | 45ms      |
| Throughput   | 2,500 RPS |
| Availability | 99.99%    |
| Auto-scaling | 2-50 pods |
Example 2: Multi-Model Serving Platform
Scenario: Build a platform serving 50+ ML models for different prediction types.
Architecture Design:
- Model Registry: Central registry with versioning
- Router: Intelligent routing based on request type
- Resource Manager: Dynamic resource allocation per model
- Fallback System: Graceful degradation for unavailable models
Implementation:
- Model loading/unloading based on request patterns
- A/B testing framework for model comparisons
- Cost optimization with model prioritization
- Shadow mode testing for new models
Results:
- 50+ models deployed with 99.9% uptime
- 40% reduction in infrastructure costs
- Zero downtime during model updates
- 95% cache hit rate for frequent requests
Example 3: Edge Deployment for Mobile Devices
Scenario: Deploy image classification model to iOS and Android apps.
Edge Optimization:
- Model Compression: Quantized to INT8 (4x size reduction)
- Runtime Selection: CoreML for iOS, TFLite for Android
- On-Device Caching: Intelligent model caching and updates
- Privacy Compliance: All processing on-device
Performance Metrics:

| Model     | Model Size | Inference Time | Accuracy |
|-----------|------------|----------------|----------|
| Original  | 25 MB      | 150ms          | 94.2%    |
| Optimized | 6 MB       | 35ms           | 93.8%    |
Results:
- 80% reduction in app download size
- 4x faster inference on device
- Offline capability with local inference
- GDPR compliant (no data leaves device)
Best Practices
Model Optimization
- Quantization: Start with FP16, move to INT8 for edge
- Pruning: Remove unnecessary weights for efficiency
- Distillation: Transfer knowledge to smaller models
- ONNX Export: Standard format for cross-platform deployment
- Benchmarking: Always test on target hardware
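The distillation point above hinges on training the small student model against the teacher's temperature-softened probabilities rather than hard labels. A minimal sketch of the softmax-with-temperature step (the logit values are made up for illustration):

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax; higher temperature flattens the
    distribution, exposing the teacher's relative confidence across
    non-argmax classes ("dark knowledge")."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [6.0, 2.0, 1.0]
hard = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=4.0)
# Softening shifts probability mass toward the runner-up classes,
# which is the signal the student learns from.
assert soft[1] > hard[1]
```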
Production Serving
- Health Checks: Implement /health and /ready endpoints
- Graceful Degradation: Fallback to simpler models or heuristics
- Circuit Breakers: Prevent cascade failures
- Rate Limiting: Protect against abuse and overuse
- Caching: Cache predictions for identical inputs
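The circuit-breaker practice above can be sketched as a small state machine: consecutive failures open the circuit, calls fail fast to a fallback while it is open, and a probe is allowed after a cooldown. A minimal illustration, not a production implementation:

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()      # open: fail fast, skip the backend
            self.opened_at = None      # half-open: allow one probe call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the circuit
                self.failures = 0
            return fallback()
        self.failures = 0              # success resets the failure count
        return result

def flaky():
    raise RuntimeError("model backend down")

breaker = CircuitBreaker(max_failures=2)
out = [breaker.call(flaky, lambda: "fallback") for _ in range(3)]
# After two failures the circuit opens; the third call never hits flaky().
assert out == ["fallback"] * 3 and breaker.opened_at is not None
```

Failing fast while the circuit is open is what prevents a slow or dead model backend from exhausting upstream connection pools and cascading the outage.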
Monitoring and Observability
- Latency Tracking: Monitor P50, P95, P99 latencies
- Error Rates: Track failures and error types
- Prediction Distribution: Alert on distribution shifts
- Resource Usage: CPU, GPU, memory monitoring
- Business Metrics: Track model impact on KPIs
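Latency percentile tracking can be sketched with a sliding window and a nearest-rank percentile; production systems typically use histogram-based metrics (e.g. Prometheus) rather than storing raw samples, but the percentile math is the same idea.

```python
import math
from collections import deque

class LatencyTracker:
    def __init__(self, window=1000):
        # Bounded window: old samples fall off automatically.
        self.samples = deque(maxlen=window)

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def percentile(self, p):
        """Nearest-rank percentile, p in (0, 100]."""
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(len(ordered) * p / 100.0))
        return ordered[rank - 1]

tracker = LatencyTracker()
for ms in range(1, 101):          # latencies 1..100 ms
    tracker.record(ms)
assert tracker.percentile(50) == 50
assert tracker.percentile(99) == 99
```

Tracking P99 alongside P50 matters because averages hide tail latency, and it is the tail that breaches SLAs under load.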
Security and Compliance
- Model Security: Protect model weights and artifacts
- Input Validation: Sanitize all prediction inputs
- Output Filtering: Prevent sensitive data exposure
- Audit Logging: Log all prediction requests
- Compliance: Meet industry regulations (HIPAA, GDPR)
Anti-Patterns
Model Deployment Anti-Patterns
- Manual Deployment: Deploying models without automation; implement CI/CD for models
- No Versioning: Replacing models without tracking versions; maintain model version history
- Hotfix Culture: Making urgent model changes without testing; require validation before deployment
- Black Box Deployment: Deploying models without explainability; implement model interpretability
Performance Anti-Patterns
- No Baselines: Deploying without performance benchmarks; establish performance baselines
- Over-Optimization: Tuning beyond practical benefit; focus on customer-impacting metrics
- Ignore Latency: Focusing only on accuracy while ignoring latency; optimize for real-world use cases
- Resource Waste: Over-provisioning infrastructure; right-size resources based on actual load
Monitoring Anti-Patterns
- Silent Failures: Models failing without detection; implement comprehensive health checks
- Metric Overload: Monitoring too many metrics; focus on actionable metrics
- Data Drift Blindness: Not detecting model degradation; monitor input data distribution
- Alert Fatigue: Too many alerts causing ignored warnings; tune alert thresholds
Scalability Anti-Patterns
- No Load Testing: Deploying without performance testing; test with production-like traffic
- Single Point of Failure: No redundancy in serving infrastructure; implement failover
- No Autoscaling: Manual capacity management; implement automatic scaling
- Stateful Design: Inference that requires server-side state; design stateless inference
Output Format
This skill delivers:
- Complete model serving infrastructure (Docker, Kubernetes configs)
- Production deployment pipelines and CI/CD workflows
- Real-time and batch prediction APIs
- Model optimization artifacts and configurations
- Auto-scaling policies and infrastructure as code
- Monitoring dashboards and alert configurations
- Performance benchmarks and load test reports
All outputs include:
- Detailed architecture documentation
- Deployment scripts and configurations
- Performance metrics and SLA validations
- Security hardening guidelines
- Operational runbooks and troubleshooting guides
- Cost analysis and optimization recommendations