airunway-aks-setup
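This skill guides users through deploying AI Runway on Azure Kubernetes Service (AKS), from a bare cluster to a running AI model. It covers cluster verification, controller installation, GPU assessment, inference provider setup, and the first model deployment, streamlining the rollout of AI models on AKS.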
git clone https://github.com/microsoft/azure-skills.git
Before / After Comparison
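Before: Setting up AI Runway on AKS and deploying AI models by hand is complex and time-consuming. You must manually configure Kubernetes resources, install the controller, assess GPU compatibility, and choose a suitable inference provider, which typically takes hours or even days and is error-prone.
After: With this skill, users get an automated, guided process for deploying AI Runway to AKS and running their first AI model. It automates the tedious manual steps, significantly reducing setup time and configuration errors so that AI models go live faster.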
AI Runway AKS Setup
This skill walks users from a bare Kubernetes cluster to a running AI model deployment. Follow each step in sequence unless the user provides skip-to-step N to resume from a specific phase.
Cost awareness: GPU node pools incur significant compute charges (A100-80GB can cost $3–5+/hr). Confirm the user understands cost implications before provisioning GPU resources.
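To make the cost conversation concrete, the quoted range can be turned into a rough estimate before asking for confirmation. The sketch below is illustrative only: the function name `estimate_gpu_cost` is hypothetical, the $3 and $5 hourly figures are taken from the note above and treated as per-node rates (an assumption), and real pricing depends on region and VM SKU and can exceed the upper figure.

```shell
# Print a rough lower and upper cost bound, in whole dollars, for running
# GPU nodes. Uses the $3/hr and $5/hr figures quoted above as per-node
# rates (an assumption); actual pricing may be higher.
estimate_gpu_cost() {  # usage: estimate_gpu_cost NODES HOURS
  echo "$(( $1 * $2 * 3 )) $(( $1 * $2 * 5 ))"
}

# Example: two GPU nodes left running for 24 hours.
# estimate_gpu_cost 2 24
```

Showing the user both bounds, rather than a single number, keeps the open-ended "+" in the quoted range visible.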
Prerequisites
This skill assumes an AKS cluster already exists. If the user does not have a cluster, hand off to the azure-kubernetes skill first to provision one (with a GPU node pool unless CPU-only inference is acceptable), then return here.
Quick Reference
| Property | Value |
|---|---|
| Best for | End-to-end AI Runway onboarding on AKS |
| CLI tools | kubectl, make, curl |
| MCP tools | None |
| Related skills | azure-kubernetes (cluster setup), azure-diagnostics (troubleshooting) |
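Before starting, it may help to confirm that the CLI tools listed in the table are available. The following is a minimal sketch; the helper name `missing_tools` is hypothetical and not part of the skill.

```shell
# Print the name of each given CLI tool that is not on PATH, one per line.
# Prints nothing when every tool is available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example: check the three tools from the Quick Reference table.
# missing_tools kubectl make curl
```

An empty result means all tools were found; any names printed should be installed before continuing.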
When to Use This Skill
Use this skill when the user wants to:
- Set up AI Runway on an existing AKS cluster from scratch
- Install the AI Runway controller and CRDs
- Assess GPU hardware compatibility for model deployment
- Choose and install an inference provider (KAITO, Dynamo, KubeRay)
- Deploy their first AI model to AKS via AI Runway
- Resume a partially-complete AI Runway setup from a specific step
MCP Tools
This skill uses no MCP tools. All cluster operations are performed directly via kubectl and make.
Rules
- Execute steps in sequence — load the reference for each step as you reach it
- Report cluster state at each step: ✓ healthy, ✗ missing/failed
- Ask for user confirmation before any install or deployment action
- If a step is already complete, report status and skip to the next step
- If the user provides skip-to-step N, start at step N; assume prior steps are complete
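The resume and status-reporting rules above can be sketched as two small helpers. This is an illustration under assumptions: the names `start_step` and `report` are hypothetical, and passing `skip-to-step` and `N` as two separate arguments is just one possible way to receive the request. The valid range of 1 to 6 comes from the steps table.

```shell
# Resolve the step to start from. With no arguments, setup starts at step 1.
# With "skip-to-step N", print N if it is an integer from 1 to 6.
# Returns 1 (and prints nothing) for any other input.
start_step() {
  if [ "$#" -eq 0 ]; then
    echo 1
    return 0
  fi
  if [ "$#" -ne 2 ] || [ "$1" != "skip-to-step" ]; then
    return 1
  fi
  case "$2" in
    [1-6]) echo "$2" ;;
    *) return 1 ;;
  esac
}

# Print a status line using the healthy / missing-or-failed markers.
report() {  # usage: report ok|fail "message"
  if [ "$1" = "ok" ]; then
    printf '✓ %s\n' "$2"
  else
    printf '✗ %s\n' "$2"
  fi
}
```

Rejecting out-of-range values up front avoids silently skipping the whole flow when the user mistypes a step number.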
Steps
| # | Step | Reference |
|---|---|---|
| 1 | Cluster Verification — context check, node inventory, GPU detection | step-1-verify.md |
| 2 | Controller Installation — CRD + controller deployment | step-2-controller.md |
| 3 | GPU Assessment — detect GPU models, flag dtype/attention constraints | step-3-gpu.md |
| 4 | Provider Setup — recommend and install inference provider | step-4-provider.md |
| 5 | First Deployment — pick a model, deploy, verify Ready | step-5-deploy.md |
| 6 | Summary — recap, smoke test, next steps | step-6-summary.md |
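As a rough illustration of what the node inventory and GPU detection in step 1 might look like, the sketch below lists each node with its allocatable GPU count. It is not taken from step-1-verify.md: the helper names are hypothetical, and it assumes GPU nodes advertise the `nvidia.com/gpu` resource, which normally requires the NVIDIA device plugin to be running.

```shell
# List each node as "name<TAB>count", where count is the allocatable
# nvidia.com/gpu value. The count is empty for nodes without that resource.
node_gpu_inventory() {
  kubectl get nodes -o \
    jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
}

# Read that inventory on stdin and print how many nodes have at least one GPU.
count_gpu_nodes() {
  awk -F '\t' '$2 + 0 > 0 { n++ } END { print n + 0 }'
}

# Example (assumes kubectl already points at the target cluster):
#   kubectl config current-context
#   node_gpu_inventory | count_gpu_nodes
```

A count of 0 suggests either a CPU-only cluster or a GPU node pool whose device plugin is not yet reporting resources.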
Error Handling
| Error / Symptom | Likely Cause | Remediation |
|---|---|---|
| No kubeconfig context | Not connected to a cluster | Run az aks get-credentials or equivalent |
| Controller in CrashLoopBackOff | Config or RBAC issue | kubectl logs -n airunway-system -l control-plane=controller-manager --previous |
| Provider not ready | Image pull or RBAC issue | kubectl logs <pod-name> -n <namespace> for the provider pod |
| ModelDeployment stuck in Pending | GPU scheduling failure or provider not ready | Check the events reported by kubectl describe modeldeployment <name> -n <namespace> |
| bfloat16 errors at inference | T4 or V100 lacks bfloat16 support | Add --dtype float16 to serving args |
For full error handling and rollback procedures, see troubleshooting.md.
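The bfloat16 remediation in the table can be expressed as a small lookup. This is a sketch under assumptions: the helper name `dtype_arg_for_gpu` is hypothetical, and matching on substrings of a GPU product name (for example, a value read from a node label) is an assumption about how the model is identified. Only the two GPU models named in the table are covered.

```shell
# Print the extra serving argument for a GPU model, or nothing if none
# is needed. Covers only T4 and V100, as listed in the table above.
dtype_arg_for_gpu() {
  case "$1" in
    *T4*|*V100*) echo "--dtype float16" ;;
    *) ;;
  esac
}

# Example:
# dtype_arg_for_gpu "Tesla-T4"
```

Applying the flag proactively during deployment avoids hitting the error at inference time.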