
machine-learning-engineer

by @404kidwizv
Rating: 4.8 (21 reviews)

Provides machine learning engineering expertise, focusing on model deployment, production serving infrastructure, and real-time inference systems.

Tags: Machine Learning, Model Deployment, MLOps, Data Science, Python ML Libraries
Installation
npx skills add 404kidwiz/claude-supercode-skills --skill machine-learning-engineer

Before / After Comparison

Before

Deploying machine learning models is difficult, and their performance in production is unstable. Real-time inference systems suffer from high latency, making it hard to meet business requirements.

After

Gain professional machine learning engineering support for more efficient model deployment. Build stable real-time inference systems that keep AI services running at high performance.

SKILL.md


Machine Learning Engineer

Purpose

Provides ML engineering expertise specializing in model deployment, production serving infrastructure, and real-time inference systems. Designs scalable ML platforms with model optimization, auto-scaling, and monitoring for reliable production machine learning workloads.

When to Use

  • ML model deployment to production

  • Real-time inference API development

  • Model optimization and compression

  • Batch prediction systems

  • Auto-scaling and load balancing

  • Edge deployment for IoT/mobile

  • Multi-model serving orchestration

  • Performance tuning and latency optimization

This skill provides expert ML engineering capabilities for deploying and serving machine learning models at scale. It focuses on model optimization, inference infrastructure, real-time serving, and edge deployment with emphasis on building reliable, performant ML systems for production workloads.


What This Skill Does

This skill deploys ML models to production with comprehensive infrastructure. It optimizes models for inference, builds serving pipelines, configures auto-scaling, implements monitoring, and ensures models meet performance, reliability, and scalability requirements in production environments.

ML Deployment Components

  • Model optimization and compression

  • Serving infrastructure (REST/gRPC APIs, batch jobs)

  • Load balancing and request routing

  • Auto-scaling and resource management

  • Real-time and batch prediction systems

  • Monitoring, logging, and observability

  • Edge deployment and model compression

  • A/B testing and canary deployments

Core Capabilities

Model Deployment Pipelines

  • CI/CD integration for ML models

  • Automated testing and validation

  • Model performance benchmarking

  • Security scanning and vulnerability assessment

  • Container building and registry management

  • Progressive rollout and blue-green deployment
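
The progressive-rollout step above can be sketched as a simple gate: advance canary traffic stage by stage and roll back if the observed error rate exceeds a budget. This is a minimal pure-Python illustration; the stage percentages and error budget are hypothetical defaults, not part of this skill's output.

```python
def progressive_rollout(observed_error_rates, stages=(1, 5, 25, 50, 100),
                        max_error_rate=0.01):
    """Advance canary traffic through `stages`, consulting the observed
    error rate after each step. Returns (final_traffic_pct, promoted).
    A failed gate rolls traffic back to 0% immediately."""
    traffic = 0
    for pct, err in zip(stages, observed_error_rates):
        if err > max_error_rate:
            return 0, False          # gate failed: roll back
        traffic = pct
    return traffic, traffic == stages[-1]
```

In a real pipeline the error rates would come from live monitoring between stages rather than a precomputed list.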

Serving Infrastructure

  • Load balancer configuration (NGINX, HAProxy)

  • Request routing and model caching

  • Connection pooling and health checking

  • Graceful shutdown and resource allocation

  • Multi-region deployment and failover

  • Container orchestration (Kubernetes, ECS)

Model Optimization

  • Quantization (FP32, FP16, INT8, INT4)

  • Model pruning and sparsification

  • Knowledge distillation techniques

  • ONNX and TensorRT conversion

  • Graph optimization and operator fusion

  • Memory optimization and throughput tuning
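
To illustrate what INT8 quantization actually does, here is a minimal affine (scale/zero-point) quantizer in pure Python. Real deployments would use a toolkit such as ONNX Runtime or TensorRT quantization; this sketch only shows the underlying arithmetic.

```python
def quantize_int8(values):
    """Affine (asymmetric) INT8 quantization: map floats in [min, max]
    onto the signed range [-128, 127] via a scale and zero point."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0            # guard against constant input
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats; error is bounded by the scale."""
    return [(qi - zero_point) * scale for qi in q]
```

The round trip loses at most one quantization step per value, which is why calibration ranges matter so much for INT8 accuracy.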

Real-time Inference

  • Request preprocessing and validation

  • Model prediction execution

  • Response formatting and error handling

  • Timeout management and circuit breaking

  • Request batching and response caching

  • Streaming predictions and async processing
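
Timeout management and circuit breaking can be illustrated with a small pure-Python circuit breaker. This is a didactic sketch, not a production implementation (no thread safety, no success window for the half-open state):

```python
import time

class CircuitBreaker:
    """After `max_failures` consecutive failures the circuit opens and
    calls fail fast until `reset_timeout` seconds elapse; then a single
    trial call is allowed (half-open)."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None            # half-open: allow one trial
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

In serving code, the fast failure is the point where a fallback model or cached response would be returned instead.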

Batch Prediction Systems

  • Job scheduling and orchestration

  • Data partitioning and parallel processing

  • Progress tracking and error handling

  • Result aggregation and storage

  • Cost optimization and resource management
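
The partitioning and parallel-processing steps above can be sketched with the standard library; the `predict_chunk` callable is a hypothetical stand-in for the real model-scoring function.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(items, n_parts):
    """Split items into n_parts roughly equal, order-preserving chunks."""
    size, rem = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < rem else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def batch_predict(items, predict_chunk, workers=4):
    """Score partitions in parallel and aggregate results in input order."""
    chunks = [c for c in partition(items, workers) if c]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(predict_chunk, chunks)   # map preserves order
    return [y for chunk_out in results for y in chunk_out]
```

A production batch system would add checkpointing and per-chunk error handling on top of this skeleton, and typically process pools or a distributed scheduler rather than threads.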

Auto-scaling Strategies

  • Metric-based scaling (CPU, GPU, request rate)

  • Scale-up and scale-down policies

  • Warm-up periods and predictive scaling

  • Cost controls and regional distribution

  • Traffic prediction and capacity planning
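
Metric-based scaling typically follows the proportional rule used by the Kubernetes Horizontal Pod Autoscaler, desiredReplicas = ceil(currentReplicas × observed / target), clamped to replica bounds. A minimal sketch; the tolerance band and bounds here are illustrative defaults:

```python
import math

def desired_replicas(current, observed_metric, target_metric,
                     min_replicas=2, max_replicas=50, tolerance=0.1):
    """HPA-style scaling: scale proportionally to metric pressure,
    skipping changes inside the tolerance band to avoid flapping."""
    ratio = observed_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current                      # within tolerance: no change
    desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

Warm-up periods and predictive scaling then layer on top of this rule, pre-scaling before predicted load rather than reacting to it.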

Multi-model Serving

  • Model routing and version management

  • A/B testing and traffic splitting

  • Ensemble serving and model cascading

  • Fallback strategies and performance isolation

  • Shadow mode testing and validation
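
A/B traffic splitting is often implemented with deterministic hashing so that a given request or user ID always lands on the same variant. A minimal sketch; the variant names and weights are hypothetical:

```python
import hashlib

def assign_variant(request_id, variants):
    """Deterministically route a request to a model variant according to
    a weighted traffic split, e.g. {"model_a": 90, "model_b": 10}."""
    total = sum(variants.values())
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    for name, weight in sorted(variants.items()):
        if bucket < weight:
            return name
        bucket -= weight
    raise ValueError("empty variant map")   # unreachable for valid weights
```

Because the assignment is a pure function of the ID, a user sees a consistent model across requests, which keeps experiment metrics clean.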

Edge Deployment

  • Model compression for edge devices

  • Hardware optimization and power efficiency

  • Offline capability and update mechanisms

  • Telemetry collection and security hardening

  • Resource constraints and optimization

Tool Restrictions

  • Read: Access model artifacts, infrastructure configs, and monitoring data

  • Write/Edit: Create deployment configs, serving code, and optimization scripts

  • Bash: Execute deployment commands, monitoring setup, and performance tests

  • Glob/Grep: Search codebases for model integration and serving endpoints

Integration with Other Skills

  • ml-engineer: Model optimization and training pipeline integration

  • mlops-engineer: Infrastructure and platform setup

  • data-engineer: Data pipelines and feature stores

  • devops-engineer: CI/CD and deployment automation

  • cloud-architect: Cloud infrastructure and architecture

  • sre-engineer: Reliability and availability

  • performance-engineer: Performance profiling and optimization

  • ai-engineer: Model selection and integration

Example Interactions

Scenario 1: Real-time Inference API Deployment

User: "Deploy our ML model as a real-time API with auto-scaling"

Interaction:

  • Skill analyzes model characteristics and requirements

  • Implements serving infrastructure:

      • Optimizes model with ONNX conversion (60% size reduction)

      • Creates FastAPI/gRPC serving endpoints

      • Configures GPU auto-scaling based on request rate

      • Implements request batching for throughput

      • Sets up monitoring and alerting

  • Deploys to Kubernetes with horizontal pod autoscaler

  • Achieves <50ms P99 latency and 2000+ RPS throughput

Scenario 2: Multi-model Serving Platform

User: "Build a platform to serve 50+ models with intelligent routing"

Interaction:

  • Skill designs multi-model architecture:

      • Model registry and version management

      • Intelligent routing based on request type

      • Specialist models for different use cases

      • Fallback and circuit breaking

      • Cost optimization with smaller models for simple queries

  • Implements serving framework with:

      • Model loading and unloading

      • Request queuing and load balancing

      • A/B testing and traffic splitting

      • Ensemble serving for critical paths

  • Deploys with comprehensive monitoring and cost tracking

Scenario 3: Edge Deployment for IoT

User: "Deploy ML model to edge devices with limited resources"

Interaction:

  • Skill analyzes device constraints and requirements

  • Optimizes model for edge:

      • Quantizes to INT8 (4x size reduction)

      • Prunes and compresses model

      • Implements ONNX Runtime for efficient inference

      • Adds offline capability and local caching

  • Creates deployment package:

      • Edge-optimized inference runtime

      • Update mechanism with delta updates

      • Telemetry collection and monitoring

      • Security hardening and encryption

  • Tests on target hardware and validates performance

Best Practices

  • Performance: Target <100ms P99 latency for real-time inference

  • Reliability: Implement graceful degradation and fallback models

  • Monitoring: Track latency, throughput, error rates, and resource usage

  • Testing: Conduct load testing and validate against production traffic patterns

  • Security: Implement authentication, encryption, and model security

  • Documentation: Document all deployment configurations and operational procedures

  • Cost: Optimize resource usage and implement auto-scaling for cost efficiency

Examples

Example 1: Real-Time Inference API for Production

Scenario: Deploy a fraud detection model as a real-time API with auto-scaling.

Deployment Approach:

  • Model Optimization: Converted model to ONNX (60% size reduction)

  • Serving Framework: Built FastAPI endpoints with async processing

  • Infrastructure: Kubernetes deployment with Horizontal Pod Autoscaler

  • Monitoring: Integrated Prometheus metrics and Grafana dashboards

Configuration:

# FastAPI serving with optimization
from typing import List

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI

app = FastAPI()
session = ort.InferenceSession("model.onnx")

@app.post("/predict")
async def predict(features: List[float]):
    # ONNX Runtime expects a float32 tensor with a batch dimension
    input_tensor = np.array([features], dtype=np.float32)
    outputs = session.run(None, {"input": input_tensor})
    return {"prediction": outputs[0].tolist()}

Performance Results:

Metric        Value
P99 Latency   45ms
Throughput    2,500 RPS
Availability  99.99%
Auto-scaling  2-50 pods

Example 2: Multi-Model Serving Platform

Scenario: Build a platform serving 50+ ML models for different prediction types.

Architecture Design:

  • Model Registry: Central registry with versioning

  • Router: Intelligent routing based on request type

  • Resource Manager: Dynamic resource allocation per model

  • Fallback System: Graceful degradation for unavailable models

Implementation:

  • Model loading/unloading based on request patterns

  • A/B testing framework for model comparisons

  • Cost optimization with model prioritization

  • Shadow mode testing for new models

Results:

  • 50+ models deployed with 99.9% uptime

  • 40% reduction in infrastructure costs

  • Zero downtime during model updates

  • 95% cache hit rate for frequent requests

Example 3: Edge Deployment for Mobile Devices

Scenario: Deploy image classification model to iOS and Android apps.

Edge Optimization:

  • Model Compression: Quantized to INT8 (4x size reduction)

  • Runtime Selection: CoreML for iOS, TFLite for Android

  • On-Device Caching: Intelligent model caching and updates

  • Privacy Compliance: All processing on-device

Performance Metrics:

Variant    Model Size  Inference Time  Accuracy
Original   25 MB       150ms           94.2%
Optimized  6 MB        35ms            93.8%

Results:

  • 80% reduction in app download size

  • 4x faster inference on device

  • Offline capability with local inference

  • GDPR compliant (no data leaves device)

Best Practices

Model Optimization

  • Quantization: Start with FP16, move to INT8 for edge

  • Pruning: Remove unnecessary weights for efficiency

  • Distillation: Transfer knowledge to smaller models

  • ONNX Export: Standard format for cross-platform deployment

  • Benchmarking: Always test on target hardware

Production Serving

  • Health Checks: Implement /health and /ready endpoints

  • Graceful Degradation: Fallback to simpler models or heuristics

  • Circuit Breakers: Prevent cascade failures

  • Rate Limiting: Protect against abuse and overuse

  • Caching: Cache predictions for identical inputs
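
The caching practice above ("cache predictions for identical inputs") can be sketched as a small LRU cache with a hit counter for monitoring; `model_fn` is a placeholder for the real predictor.

```python
from collections import OrderedDict

class PredictionCache:
    """LRU cache keyed on the (hashable) feature tuple, with hit/miss
    counters so the cache hit rate can be exported as a metric."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.store = OrderedDict()
        self.hits = self.misses = 0

    def predict(self, features, model_fn):
        key = tuple(features)
        if key in self.store:
            self.hits += 1
            self.store.move_to_end(key)       # mark as recently used
            return self.store[key]
        self.misses += 1
        value = model_fn(features)
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)    # evict least recently used
        return value
```

For non-deterministic or time-sensitive models, entries would also need a TTL so stale predictions are not served indefinitely.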

Monitoring and Observability

  • Latency Tracking: Monitor P50, P95, P99 latencies

  • Error Rates: Track failures and error types

  • Prediction Distribution: Alert on distribution shifts

  • Resource Usage: CPU, GPU, memory monitoring

  • Business Metrics: Track model impact on KPIs
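
P50/P95/P99 tracking reduces to computing percentiles over a window of latency samples. A minimal nearest-rank sketch; production systems typically use histograms or streaming sketches (e.g. t-digest) rather than sorting raw samples:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of observations are at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Example latency window (milliseconds) with one slow outlier
latencies_ms = [12, 15, 11, 240, 14, 13, 16, 18, 17, 19]
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
# p50 == 15; the single 240 ms outlier dominates both p95 and p99
```

This is also why averaging latency is misleading: the mean here is dragged up by one request, while the tail percentiles show what slow requests actually experience.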

Security and Compliance

  • Model Security: Protect model weights and artifacts

  • Input Validation: Sanitize all prediction inputs

  • Output Filtering: Prevent sensitive data exposure

  • Audit Logging: Log all prediction requests

  • Compliance: Meet industry regulations (HIPAA, GDPR)

Anti-Patterns

Model Deployment Anti-Patterns

  • Manual Deployment: Deploying models without automation - implement CI/CD for models

  • No Versioning: Replacing models without tracking versions - maintain model version history

  • Hotfix Culture: Making urgent model changes without testing - require validation before deployment

  • Black Box Deployment: Deploying models without explainability - implement model interpretability

Performance Anti-Patterns

  • No Baselines: Deploying without performance benchmarks - establish performance baselines

  • Over-Optimization: Tuning beyond practical benefit - focus on customer-impacting metrics

  • Ignore Latency: Focusing only on accuracy, ignoring latency - optimize for real-world use cases

  • Resource Waste: Over-provisioning infrastructure - right-size resources based on actual load

Monitoring Anti-Patterns

  • Silent Failures: Models failing without detection - implement comprehensive health checks

  • Metric Overload: Monitoring too many metrics - focus on actionable metrics

  • Data Drift Blindness: Not detecting model degradation - monitor input data distribution

  • Alert Fatigue: Too many alerts causing ignored warnings - tune alert thresholds

Scalability Anti-Patterns

  • No Load Testing: Deploying without performance testing - test with production-like traffic

  • Single Point of Failure: No redundancy in serving infrastructure - implement failover

  • No Autoscaling: Manual capacity management - implement automatic scaling

  • Stateful Design: Inference that requires state - design stateless inference

Output Format

This skill delivers:

  • Complete model serving infrastructure (Docker, Kubernetes configs)

  • Production deployment pipelines and CI/CD workflows

  • Real-time and batch prediction APIs

  • Model optimization artifacts and configurations

  • Auto-scaling policies and infrastructure as code

  • Monitoring dashboards and alert configurations

  • Performance benchmarks and load test reports

All outputs include:

  • Detailed architecture documentation

  • Deployment scripts and configurations

  • Performance metrics and SLA validations

  • Security hardening guidelines

  • Operational runbooks and troubleshooting guides

  • Cost analysis and optimization recommendations


Compatible Platforms

  • Claude Code

  • OpenClaw

  • OpenCode

  • Codex

  • Gemini CLI

  • GitHub Copilot

  • Amp

  • Kimi CLI

Timeline

Created: March 17, 2026
Last Updated: March 17, 2026