airunway-aks-setup

by @microsoftv
Rating: 4.7 / 5.0 (120)

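This skill guides users through deploying AI Runway on Azure Kubernetes Service (AKS), from a bare cluster to a running AI model. It covers cluster verification, controller installation, GPU assessment, inference provider setup, and the first model deployment, streamlining the path to serving AI models on AKS.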

Tags: aks, ai-runway, gpu, model-deployment, kubernetes
Installation
git clone https://github.com/microsoft/azure-skills.git
Before / After Comparison

Before

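Manually setting up AI Runway on AKS and deploying an AI model is complex and time-consuming. It involves configuring Kubernetes resources by hand, installing the controller, assessing GPU compatibility, and choosing a suitable inference provider. This often takes hours or even days and is error-prone.

After

With this skill, users get an automated, guided flow that deploys AI Runway to AKS and runs a first AI model. It automates the tedious manual steps, which significantly reduces setup time and configuration errors and gets AI models online faster.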

SKILL.md

AI Runway AKS Setup

This skill walks users from a bare Kubernetes cluster to a running AI model deployment. Follow each step in sequence unless the user provides skip-to-step N to resume from a specific phase.

Cost awareness: GPU node pools incur significant compute charges (A100-80GB can cost $3–5+/hr). Confirm the user understands cost implications before provisioning GPU resources.
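To make the cost conversation concrete, a rough range can be computed from the hourly figure quoted above. This is a minimal sketch: `estimate_gpu_cost` is a hypothetical helper, the node count and hours are example inputs, and the default rates simply reuse the $3 to $5 per hour figure from this document rather than current Azure pricing (the "+" in the source means the upper bound can be exceeded).

```python
from typing import Tuple


def estimate_gpu_cost(nodes: int, hours: float,
                      low_rate: float = 3.0,
                      high_rate: float = 5.0) -> Tuple[float, float]:
    """Return a (low, high) USD estimate for running GPU nodes.

    The default rates are the per-node hourly range quoted in this
    document, not live pricing.
    """
    if nodes < 0 or hours < 0:
        raise ValueError("nodes and hours must be non-negative")
    node_hours = nodes * hours
    return (node_hours * low_rate, node_hours * high_rate)


# Example: two GPU nodes left running for one day.
low, high = estimate_gpu_cost(nodes=2, hours=24)
print(f"Estimated cost: ${low:.2f} to ${high:.2f}")
```

Showing the user a number like this before provisioning is one way to confirm they understand the cost implications.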

Prerequisites

This skill assumes an AKS cluster already exists. If the user does not have a cluster, hand off to the azure-kubernetes skill first to provision one (with a GPU node pool unless CPU-only inference is acceptable), then return here.

Quick Reference

| Property | Value |
|---|---|
| Best for | End-to-end AI Runway onboarding on AKS |
| CLI tools | `kubectl`, `make`, `curl` |
| MCP tools | None |
| Related skills | `azure-kubernetes` (cluster setup), `azure-diagnostics` (troubleshooting) |

When to Use This Skill

Use this skill when the user wants to:

  • Set up AI Runway on an existing AKS cluster from scratch
  • Install the AI Runway controller and CRDs
  • Assess GPU hardware compatibility for model deployment
  • Choose and install an inference provider (KAITO, Dynamo, KubeRay)
  • Deploy their first AI model to AKS via AI Runway
  • Resume a partially-complete AI Runway setup from a specific step

MCP Tools

This skill uses no MCP tools. All cluster operations are performed directly via kubectl and make.

Rules

  1. Execute steps in sequence — load the reference for each step as you reach it
  2. Report cluster state at each step: ✓ healthy, ✗ missing/failed
  3. Ask for user confirmation before any install or deployment action
  4. If a step is already complete, report status and skip to the next step
  5. If the user provides skip-to-step N, start at step N; assume prior steps are complete
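The rules above amount to a small control loop, which can be sketched as follows. This is an illustration under stated assumptions, not the skill's actual implementation: the `Step` type, `run_steps`, and the wording of the status lines are hypothetical, and each step's check and action are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Step:
    number: int
    name: str
    is_complete: Callable[[], bool]  # read-only check of cluster state
    apply: Callable[[], None]        # install or deployment action


def run_steps(steps: Iterable[Step], start_at: int = 1,
              confirm: Callable[[Step], bool] = lambda step: False) -> List[str]:
    """Walk the steps in order and return one status line per step."""
    report: List[str] = []
    for step in sorted(steps, key=lambda s: s.number):
        label = f"step {step.number} ({step.name})"
        if step.number < start_at:
            # Rule 5: with skip-to-step N, earlier steps are assumed complete.
            report.append(f"✓ {label}: assumed complete")
            continue
        if step.is_complete():
            # Rule 4: already done, so report status and move on.
            report.append(f"✓ {label}: already complete")
            continue
        if not confirm(step):
            # Rule 3: never install or deploy without user confirmation.
            report.append(f"✗ {label}: missing, not confirmed")
            break
        step.apply()
        if step.is_complete():
            report.append(f"✓ {label}: healthy")
        else:
            report.append(f"✗ {label}: failed")
            break
    return report
```

The default `confirm` callback declines every action, so a caller has to opt in explicitly before anything is installed.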

Steps

| # | Step | Reference |
|---|---|---|
| 1 | Cluster Verification: context check, node inventory, GPU detection | step-1-verify.md |
| 2 | Controller Installation: CRD + controller deployment | step-2-controller.md |
| 3 | GPU Assessment: detect GPU models, flag dtype/attention constraints | step-3-gpu.md |
| 4 | Provider Setup: recommend and install inference provider | step-4-provider.md |
| 5 | First Deployment: pick a model, deploy, verify Ready | step-5-deploy.md |
| 6 | Summary: recap, smoke test, next steps | step-6-summary.md |
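The node inventory and GPU detection in step 1 could look roughly like the sketch below, which parses the output of `kubectl get nodes -o json`. It is not taken from step-1-verify.md. It assumes GPUs are exposed through the `nvidia.com/gpu` allocatable resource (NVIDIA device plugin) and that the `nvidia.com/gpu.product` label is present (set by NVIDIA GPU Feature Discovery when installed); clusters without those components will report no GPUs or an "unknown" model. The function names are hypothetical.

```python
import json
import subprocess
from typing import Dict, Tuple

GPU_RESOURCE = "nvidia.com/gpu"              # assumes the NVIDIA device plugin
GPU_PRODUCT_LABEL = "nvidia.com/gpu.product"  # assumes GPU Feature Discovery


def gpu_inventory(node_list: dict) -> Dict[str, Tuple[str, int]]:
    """Map node name -> (GPU product, allocatable GPU count).

    `node_list` is the parsed JSON from `kubectl get nodes -o json`.
    Nodes without allocatable GPUs are omitted.
    """
    inventory: Dict[str, Tuple[str, int]] = {}
    for node in node_list.get("items", []):
        meta = node.get("metadata", {})
        allocatable = node.get("status", {}).get("allocatable", {})
        count = int(allocatable.get(GPU_RESOURCE, "0"))
        if count > 0:
            product = meta.get("labels", {}).get(GPU_PRODUCT_LABEL, "unknown")
            inventory[meta.get("name", "<unnamed>")] = (product, count)
    return inventory


def fetch_nodes() -> dict:
    """Run kubectl against the current context and return the parsed node list."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Calling `gpu_inventory(fetch_nodes())` against a connected cluster yields the per-node GPU summary; an empty result suggests either a CPU-only cluster or a missing device plugin.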

Error Handling

| Error / Symptom | Likely Cause | Remediation |
|---|---|---|
| No kubeconfig context | Not connected to a cluster | Run `az aks get-credentials` or equivalent |
| Controller in CrashLoopBackOff | Config or RBAC issue | `kubectl logs -n airunway-system -l control-plane=controller-manager --previous` |
| Provider not ready | Image pull or RBAC issue | `kubectl logs <pod-name> -n <namespace>` for the provider pod |
| ModelDeployment stuck in Pending | GPU scheduling failure or provider not ready | Check events with `kubectl describe modeldeployment <name> -n <namespace>` |
| bfloat16 errors at inference | T4 or V100 lacks bfloat16 support | Add `--dtype float16` to serving args |

For full error handling and rollback procedures, see troubleshooting.md.
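The bfloat16 remediation can be applied proactively once the GPU model is known. The sketch below is illustrative only: `lacks_bfloat16` and `with_dtype_override` are hypothetical helpers, the GPU product strings are example values, and how serving args are actually passed to a deployment is not specified here.

```python
import re
from typing import List


def lacks_bfloat16(gpu_product: str) -> bool:
    """True for the GPU families this document lists as lacking bfloat16 (T4, V100)."""
    tokens = re.split(r"[^A-Z0-9]+", gpu_product.upper())
    return any(tok == "T4" or tok.startswith("V100") for tok in tokens)


def with_dtype_override(serving_args: List[str], gpu_product: str) -> List[str]:
    """Return a copy of the serving args, adding --dtype float16 when needed.

    An existing --dtype setting is left untouched so a user's explicit
    choice is never overridden.
    """
    if lacks_bfloat16(gpu_product) and "--dtype" not in serving_args:
        return [*serving_args, "--dtype", "float16"]
    return list(serving_args)


# Example with placeholder GPU names:
print(with_dtype_override([], "Tesla-T4"))
print(with_dtype_override([], "NVIDIA-A100-SXM4-80GB"))
```

Matching on whole tokens rather than raw substrings avoids flagging unrelated product names that merely contain the characters "T4".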

Statistics

Installs: 89.4K
Rating: 4.7 / 5.0
Version: not listed
Updated: May 23, 2026
Comparison cases: 1

Compatible Platforms

claude-code

Timeline

Created: May 8, 2026
Last updated: May 23, 2026