--- id: ssh2-videoagent-audio-studio name: "videoagent-audio-studio" url: https://skills.yangsir.net/skill/ssh2-videoagent-audio-studio author: pexoai domain: multimedia tags: ["audio-editing", "sound-design", "voiceover-production", "audio-mixing", "video-production"] install_count: 3600 rating: 4.40 (128 reviews) github: https://github.com/pexoai/pexo-skills --- # videoagent-audio-studio > 提供统一的音频处理工作室，简化多音频API管理，提升多媒体内容创作效率和质量。 **Stats**: 3,600 installs · 4.4/5 (128 reviews) ## Before / After 对比 ### VideoAgent音频工作室 ## Readme # 🎙️ VideoAgent Audio Studio **Use when:** User asks to generate speech, narrate text, create a voice-over, compose music, or produce a sound effect. VideoAgent Audio Studio is a smart audio dispatcher. It analyzes your request and routes it to the best available model — ElevenLabs for speech and music, fal.ai for fast SFX — and returns a ready-to-use audio URL. --- ## Quick Reference | Request Type | Best Model | Latency | |---|---|---| | Narrate text / Voice-over | `elevenlabs-tts-v3` | ~3s | | Low-latency TTS (real-time) | `elevenlabs-tts-turbo` | <1s | | Background music | `cassetteai-music` | ~15s | | Sound effect | `elevenlabs-sfx` | ~5s | | Clone a voice from audio | `elevenlabs-voice-clone` | ~10s | --- ## How to Use ### 1. Start the AudioMind server (once per session) ```bash bash {baseDir}/tools/start_server.sh ``` This starts the ElevenLabs MCP server on port 8124. The skill uses it for all audio generation. ### 2. Route the request Analyze the user's request and call the appropriate tool via the MCP server: **Text-to-Speech (TTS)** When user asks to "narrate", "read aloud", "say", or "create a voice-over": ``` Use MCP tool: text_to_speech text: "" voice_id: "JBFqnCBsd6RMkjVDRZzb" # Default: "George" (professional, neutral) model_id: "eleven_multilingual_v2" # Use "eleven_turbo_v2_5" for low latency ``` **Music Generation** When user asks to "compose", "create background music", or "make a soundtrack": ``` Use MCP tool: text_to_sound_effects (via cassetteai-music on fal.ai) prompt: "" duration_seconds: ``` **Sound Effect (SFX)** When user asks for a specific sound (e.g., "a door creaking", "rain on a window"): ``` Use MCP tool: text_to_sound_effects text: "" duration_seconds: <1-22> ``` **Voice Cloning** When user provides an audio sample and wants to clone the voice: ``` Use MCP tool: voice_add name: "" files: [""] ``` --- ## Example Conversations **User:** "Voice this text for me: Welcome to our product launch" ``` → Route to: text_to_speech text: "Welcome to our product launch" voice_id: "JBFqnCBsd6RMkjVDRZzb" model_id: "eleven_multilingual_v2" ``` > 🎙️ Voiceover done! [Listen here](audio_url) --- **User:** "Generate 60 seconds of relaxing background music for a podcast" ``` → Route to: cassetteai-music (fal.ai) prompt: "relaxing lo-fi background music for a podcast, gentle piano and soft beats, 60 seconds" duration_seconds: 60 ``` > 🎵 Background music ready! [Listen here](audio_url) --- **User:** "Generate a sci-fi style door opening sound effect" ``` → Route to: text_to_sound_effects text: "a futuristic sci-fi door sliding open with a hydraulic hiss" duration_seconds: 3 ``` --- ## Setup ### Required Set `ELEVENLABS_API_KEY` in `~/.openclaw/openclaw.json`: ```json { "skills": { "entries": { "videoagent-audio-studio": { "enabled": true, "env": { "ELEVENLABS_API_KEY": "your_elevenlabs_key_here" } } } } } ``` Get your key at [elevenlabs.io/app/settings/api-keys](https://elevenlabs.io/app/settings/api-keys). ### Optional (for fal.ai music & SFX models) ```json "FAL_KEY": "your_fal_key_here" ``` Get your key at [fal.ai/dashboard/keys](https://fal.ai/dashboard/keys). --- ## Self-Hosting the Proxy The `cli.js` connects to a hosted proxy by default. If you want full control — or need to serve users in regions where `vercel.app` is blocked — you can deploy your own instance from the `proxy/` directory. ### Quick Deploy (Vercel) ```bash cd proxy npm install vercel --prod ``` ### Environment Variables Set these in your Vercel project (Dashboard → Settings → Environment Variables): | Variable | Required For | Where to Get | |---|---|---| | `ELEVENLABS_API_KEY` | TTS, SFX, Voice Clone | [elevenlabs.io/app/settings/api-keys](https://elevenlabs.io/app/settings/api-keys) | | `FAL_KEY` | Music generation | [fal.ai/dashboard/keys](https://fal.ai/dashboard/keys) | | `VALID_PRO_KEYS` | (Optional) Restrict access | Comma-separated list of allowed client keys | ### Point cli.js to Your Proxy ```bash export AUDIOMIND_PROXY_URL="https://your-domain.com/api/audio" ``` Or set it in `~/.openclaw/openclaw.json`: ```json { "skills": { "entries": { "videoagent-audio-studio": { "env": { "AUDIOMIND_PROXY_URL": "https://your-domain.com/api/audio" } } } } } ``` ### Custom Domain (Recommended) If your users are in mainland China, bind a custom domain in Vercel Dashboard → Settings → Domains to avoid DNS issues with `vercel.app`. --- ## Model Reference | Model ID | Type | Provider | Notes | |---|---|---|---| | `eleven_multilingual_v2` | TTS | ElevenLabs | Best quality, supports 29 languages | | `eleven_turbo_v2_5` | TTS | ElevenLabs | Ultra-low latency, ideal for real-time | | `eleven_monolingual_v1` | TTS | ElevenLabs | English only, fastest | | `cassetteai-music` | Music | fal.ai | Reliable, fast music generation | | `elevenlabs-sfx` | SFX | ElevenLabs | High-quality sound effects (up to 22s) | | `elevenlabs-voice-clone` | Clone | ElevenLabs | Clone any voice from a short audio sample | --- ## Changelog ### v3.0.0 - **Simplified routing table**: Removed unstable/offline models from the main reference. The skill now only surfaces models that reliably work. - **Clearer use-case triggers**: Added "Use when" section so the agent activates this skill at the right moment. - **Unified setup**: Single `ELEVENLABS_API_KEY` is all you need to get started. `FAL_KEY` is now optional. - **Removed polling complexity**: Music generation now uses `cassetteai-music` by default, which completes synchronously. ### v2.1.0 - Added async workflow for long-running music generation tasks. - Added `cassetteai-music` as a stable alternative for music generation. ### v2.0.0 - Migrated to ElevenLabs MCP server architecture. - Added voice cloning support. ### v1.0.0 - Initial release with TTS, music, and SFX routing. --- *Source: https://skills.yangsir.net/skill/ssh2-videoagent-audio-studio* *Markdown mirror: https://skills.yangsir.net/api/skill/ssh2-videoagent-audio-studio/markdown*