speak-tts
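Gives AI agents real-time voice conversation, with local text-to-speech, voice cloning, and audio generation, optimized in particular for performance on Apple Silicon devices.
npx skills add emzod/speak --skill speak-tts
Before / After comparison
Before: Agents could communicate only through text, so users had to read everything on screen. This lacked immersion and immediacy, and it was inefficient when users needed information quickly or were multitasking.
After: With the `speak-tts` skill, the agent converts text to speech output in real time, providing a more natural and efficient interaction. Users can listen while doing other things, which significantly improved user satisfaction and work efficiency.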
speak - Talk to your Claude!
Give your agent the ability to speak to you in real time. Local text-to-speech, voice cloning, and audio generation on Apple Silicon.
Prerequisites
| Requirement | Check | Install |
| --- | --- | --- |
| Apple Silicon Mac | uname -m → arm64 | Intel not supported |
| macOS 12.0+ | sw_vers | |
| sox | which sox | brew install sox |
| ffmpeg | which ffmpeg | brew install ffmpeg |
| poppler (PDF) | which pdftotext | brew install poppler |
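The checks in the table can be run in one pass. The following is a minimal sketch, not part of the skill itself: `check_tool` and `check_arch` are hypothetical helper names, and the macOS version check (`sw_vers`) is left out.

```shell
#!/bin/sh
# Hypothetical helper: report whether a command is on PATH.
# $1 = command name, $2 = Homebrew package to suggest if it is missing.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1 (try: brew install $2)"
    fi
}

# Hypothetical helper: report whether an architecture string is Apple Silicon.
check_arch() {
    if [ "$1" = "arm64" ]; then
        echo "ok: Apple Silicon"
    else
        echo "unsupported: $1"
    fi
}

check_arch "$(uname -m)"
check_tool sox sox
check_tool ffmpeg ffmpeg
check_tool pdftotext poppler
```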
Input Sources
| Source | Example |
| --- | --- |
| Text file | speak article.txt |
| Markdown | speak doc.md |
| Direct string | speak "Hello" |
| Clipboard | pbpaste \| speak |
| Stdin | cat file.txt \| speak |
Web Articles
lynx -dump -nolist "https://example.com/article" | speak --output article.wav
Converting Formats
| Format | Convert Command |
| --- | --- |
| PDF | pdftotext doc.pdf doc.txt |
| DOCX | textutil -convert txt doc.docx |
| HTML | pandoc -f html -t plain doc.html > doc.txt |
Output Modes
| Goal | Command |
| --- | --- |
| Save for later | speak text.txt --output file.wav |
| Listen now (streaming) | speak text.txt --stream |
| Listen now (complete) | speak text.txt --play |
| Both | speak text.txt --stream --output file.wav |
Default Behavior
speak article.txt # → ~/Audio/speak/article.wav (no playback)
speak "Hello" # → ~/Audio/speak/speak_<timestamp>.wav
Directory Auto-Creation
| Directory | Auto-Created? |
| --- | --- |
| ~/Audio/speak/ | ✓ Yes |
| ~/.chatter/voices/ | ✗ No |
| Custom directories | ✗ No |
Always create custom directories first:
mkdir -p ~/.chatter/voices/
mkdir -p ~/Audio/custom/
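Because custom directories are not created automatically, a small wrapper can create the parent directory of any output path before generating audio. This is a sketch only; `ensure_parent_dir` is a hypothetical helper, not a `speak` feature.

```shell
# Hypothetical helper: create the parent directory of an output path.
# The output file itself is not created.
ensure_parent_dir() {
    mkdir -p "$(dirname "$1")"
}

# Usage sketch (requires speak; not run here):
#   out="$HOME/Audio/custom/notes.wav"
#   ensure_parent_dir "$out"
#   speak notes.txt --output "$out"
```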
Voice Cloning
Voice cloning generates speech that matches your vocal characteristics (pitch, tone, cadence) from a short recording.
Quality Expectations
- Output captures general voice characteristics but is not a perfect replica
- Quality depends heavily on sample quality
- 15-25 seconds is optimal (10s minimum, 30s maximum)
Recording Your Voice
Using QuickTime:
- Open QuickTime Player → File → New Audio Recording
- Record 20 seconds of clear speech
- File → Export As → Audio Only (.m4a)
- Convert to WAV (see below)
Using sox (command line):
# -d = use default microphone
# Recording starts immediately and stops after 25 seconds
sox -d -r 24000 -c 1 ~/.chatter/voices/my_voice.wav trim 0 25
Converting to Required Format
Voice samples MUST be: WAV, 24000 Hz, mono, 10-30 seconds.
# From MP3
ffmpeg -i voice.mp3 -ar 24000 -ac 1 voice.wav
# From M4A (QuickTime)
ffmpeg -i voice.m4a -ar 24000 -ac 1 voice.wav
# Trim to 25 seconds
ffmpeg -i long.wav -t 25 -ar 24000 -ac 1 trimmed.wav
# Check sample properties
ffprobe -i voice.wav 2>&1 | grep -E "Duration|Stream"
# Should show: Duration ~15-25s, 24000 Hz, mono
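The stated requirements (24000 Hz, mono, 10-30 seconds) can also be checked programmatically instead of by reading the `ffprobe` output. The sketch below assumes a hypothetical helper, `check_sample_specs`, that takes the three values as arguments. The commented usage shows one way to read them with `ffprobe`.

```shell
# Hypothetical helper: check sample rate (Hz), channel count, and duration (s)
# against the stated requirements: 24000 Hz, mono, 10-30 seconds.
check_sample_specs() {
    rate=$1; channels=$2; duration=$3
    [ "$rate" = "24000" ] || { echo "bad sample rate: $rate"; return 1; }
    [ "$channels" = "1" ] || { echo "not mono: $channels channels"; return 1; }
    # awk handles fractional durations such as 19.84
    awk -v d="$duration" 'BEGIN { exit !(d >= 10 && d <= 30) }' \
        || { echo "bad duration: ${duration}s"; return 1; }
    echo "ok"
}

# Usage sketch (requires ffprobe; not run here):
#   rate=$(ffprobe -v error -select_streams a:0 -show_entries stream=sample_rate -of csv=p=0 voice.wav)
#   channels=$(ffprobe -v error -select_streams a:0 -show_entries stream=channels -of csv=p=0 voice.wav)
#   duration=$(ffprobe -v error -show_entries format=duration -of csv=p=0 voice.wav)
#   check_sample_specs "$rate" "$channels" "$duration"
```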
Using Your Voice
# Create directory
mkdir -p ~/.chatter/voices/
# Move sample
mv voice.wav ~/.chatter/voices/my_voice.wav
# Test
speak "Testing my voice" --voice ~/.chatter/voices/my_voice.wav --stream
# Use for content
speak notes.txt --voice ~/.chatter/voices/my_voice.wav --output presentation.wav
Path requirements:
- ✓ Works: ~/.chatter/voices/my_voice.wav (tilde expanded by shell)
- ✓ Works: /Users/name/.chatter/voices/my_voice.wav
- ✗ Fails: my_voice.wav (relative path)
- ✗ Fails: ./voices/my_voice.wav (relative path)
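Since relative paths fail, one option is to convert a path to absolute form before passing it to `--voice`. A minimal sketch, assuming a hypothetical helper named `abs_path`:

```shell
# Hypothetical helper: turn a relative path into an absolute one.
# Paths that already start with "/" are returned unchanged.
# The result is not normalized ("./x" becomes "$PWD/./x", which is still valid).
# A quoted "~/..." is not expanded by the shell and is treated as relative here.
abs_path() {
    case "$1" in
        /*) printf '%s\n' "$1" ;;
        *)  printf '%s\n' "$PWD/$1" ;;
    esac
}

# Usage sketch (requires speak; not run here):
#   speak "Testing my voice" --voice "$(abs_path voices/my_voice.wav)" --stream
```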
Voice Sample Tips
| Good Sample | Bad Sample |
| --- | --- |
| Quiet room | Background noise |
| Natural pace | Rushed or monotone |
| Clear diction | Mumbling |
| Varied content | Repetitive phrases |
Default Voice
When --voice is omitted, a built-in default voice is used:
speak "Hello world" --stream # Uses default voice
Emotion Tags
Tags produce audible effects (actual sounds), not spoken words:
speak "[sigh] Monday again." --stream
# Output: (sigh sound) "Monday again."
| Tag | Effect |
| --- | --- |
| [laugh] | Laughter |
| [chuckle] | Light chuckle |
| [sigh] | Sighing |
| [gasp] | Gasping |
| [groan] | Groaning |
| [clear throat] | Throat clearing |
| [cough] | Coughing |
| [crying] | Crying |
| [singing] | Sung speech |
NOT supported: [pause], [whisper] (ignored)
For pauses: Use punctuation: "Wait... let me think."
Batch Processing
mkdir -p ~/Audio/book/
speak ch01.txt ch02.txt ch03.txt --output-dir ~/Audio/book/
# Creates: ch01.wav, ch02.wav, ch03.wav
# With auto-chunking (for long files)
speak chapters/*.txt --output-dir ~/Audio/book/ --auto-chunk
# Skip completed files
speak chapters/*.txt --output-dir ~/Audio/book/ --skip-existing
Auto-Chunk Behavior
When using --auto-chunk with batch processing:
- Each input file is chunked independently
- Chunks are generated and automatically concatenated per file
- Final output: one .wav per input file (e.g., ch01.wav)
- Intermediate chunks deleted (unless --keep-chunks)
You don't need to manually concatenate chunks — only concatenate final chapter files.
Concatenating Audio
# Explicit order (recommended)
speak concat ch01.wav ch02.wav ch03.wav --output book.wav
# Glob pattern (REQUIRES zero-padded filenames)
speak concat audiobook/*.wav --output book.wav
Zero-Padding Rules
Critical for correct concatenation order:
| Files | Correct | Wrong |
| --- | --- | --- |
| 1-9 | 01, 02, ..., 09 | 1, 2, ..., 9 |
| 10-99 | 01, 02, ..., 99 | 1, 10, 2, ... |
| 100+ | 001, 002, ..., 999 | 1, 100, 2, ... |
Why: Shell glob expansion sorts names lexicographically, not numerically. Unpadded names expand as 1, 10, 2, while zero-padded names expand as 01, 02, 10.
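The ordering problem is easy to reproduce with `sort`, which here orders names the same way a glob does. The sketch below also shows one way to generate padded names with `printf`. `padded_name` is a hypothetical helper, and the number passed to it should not have leading zeros.

```shell
# Hypothetical helper: build a zero-padded file name.
# $1 = prefix, $2 = number (no leading zeros), $3 = width, $4 = extension
padded_name() {
    printf "%s%0${3}d.%s\n" "$1" "$2" "$4"
}

# Unpadded names sort in the wrong order:
printf '%s\n' ch2.wav ch10.wav ch1.wav | LC_ALL=C sort
# prints ch1.wav, ch10.wav, ch2.wav (one per line)

# Padded names sort in chapter order:
for n in 2 10 1; do padded_name ch "$n" 2 wav; done | LC_ALL=C sort
# prints ch01.wav, ch02.wav, ch10.wav (one per line)
```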
PDF to Audiobook (Complete Workflow)
Step 1: Find Chapter Boundaries
# Preview table of contents
pdftotext -f 1 -l 5 textbook.pdf toc.txt
cat toc.txt # Note chapter page numbers
# Or search for "Chapter" markers
pdftotext textbook.pdf - | grep -n "Chapter"
Step 2: Extract Chapters (Zero-Padded!)
# For 100-page book with ~10 chapters
pdftotext -f 1 -l 12 -layout textbook.pdf ch01.txt
pdftotext -f 13 -l 25 -layout textbook.pdf ch02.txt
pdftotext -f 26 -l 38 -layout textbook.pdf ch03.txt
# ... continue for all chapters
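For books with many chapters, the extraction commands can be generated from a list of page ranges rather than typed by hand. This is a sketch under assumptions: `chapter_commands` is a hypothetical helper, and it only prints the `pdftotext` commands so they can be reviewed before running them.

```shell
# Hypothetical helper: read "first last" page ranges on stdin and print one
# pdftotext command per chapter, numbering outputs ch01.txt, ch02.txt, ...
chapter_commands() {
    pdf=$1
    n=1
    while read -r first last; do
        printf 'pdftotext -f %s -l %s -layout %s ch%02d.txt\n' \
            "$first" "$last" "$pdf" "$n"
        n=$((n + 1))
    done
}

chapter_commands textbook.pdf <<'EOF'
1 12
13 25
26 38
EOF
```

Once the printed commands look right, the same output can be piped to `sh` to run them.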
Step 3: Estimate Time
speak --estimate ch*.txt
# Shows: total audio duration, generation time, storage needed
# Quick estimates:
# 1 page ≈ 2 min audio ≈ 1 min generation
# 100 pages ≈ 200 min audio ≈ 100 min generation ≈ 500 MB
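The quick estimates can be turned into a small calculation for any page count. The sketch below uses the rules of thumb above plus the ~2.5 MB/min storage figure from the Performance section. `estimate_pages` is a hypothetical helper and does not replace `speak --estimate`.

```shell
# Hypothetical helper: rough estimate from a page count, using
# 2 min audio and 1 min generation per page, and ~2.5 MB per audio minute.
estimate_pages() {
    pages=$1
    audio_min=$((pages * 2))
    gen_min=$pages
    storage_mb=$((audio_min * 5 / 2))   # 2.5 MB/min in integer arithmetic
    echo "audio=${audio_min}min generation=${gen_min}min storage=${storage_mb}MB"
}

estimate_pages 100   # prints: audio=200min generation=100min storage=500MB
```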
Step 4: Generate Audio
mkdir -p audiobook/
speak ch01.txt ch02.txt ch03.txt --output-dir audiobook/ --auto-chunk
# Creates: audiobook/ch01.wav, audiobook/ch02.wav, audiobook/ch03.wav
Step 5: Concatenate
speak concat audiobook/ch01.wav audiobook/ch02.wav audiobook/ch03.wav --output complete_audiobook.wav
# Or with glob (only if zero-padded):
speak concat audiobook/ch*.wav --output complete_audiobook.wav
PDF Troubleshooting
| Issue | Solution |
| --- | --- |
| Empty/garbled text | Scanned PDF; use OCR: brew install tesseract |
| Wrong encoding | Try: pdftotext -enc UTF-8 doc.pdf |
| Check word count | pdftotext doc.pdf - \| wc -w (should be >100) |
Multi-Voice Content
mkdir -p podcast/scripts podcast/wav
echo "Welcome to the show." > podcast/scripts/01_host.txt
echo "Thanks for having me." > podcast/scripts/02_guest.txt
speak podcast/scripts/01_host.txt --voice ~/.chatter/voices/host.wav --output podcast/wav/01.wav
speak podcast/scripts/02_guest.txt --voice ~/.chatter/voices/guest.wav --output podcast/wav/02.wav
speak concat podcast/wav/01.wav podcast/wav/02.wav --output podcast.wav
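With more than a few lines of dialogue, the per-file commands above can be driven by the file names. The sketch below assumes scripts are named NN_speaker.txt, as in the example, and that each speaker has a sample at ~/.chatter/voices/speaker.wav. `speaker_of` and `index_of` are hypothetical helpers.

```shell
# Hypothetical helpers for scripts named NN_speaker.txt (e.g. 01_host.txt).
speaker_of() {   # 01_host.txt -> host
    base=$(basename "$1" .txt)
    printf '%s\n' "${base#*_}"
}

index_of() {     # 01_host.txt -> 01
    base=$(basename "$1" .txt)
    printf '%s\n' "${base%%_*}"
}

# Usage sketch (requires speak and the voice samples; not run here):
#   for f in podcast/scripts/*.txt; do
#       speak "$f" --voice "$HOME/.chatter/voices/$(speaker_of "$f").wav" \
#           --output "podcast/wav/$(index_of "$f").wav"
#   done
#   speak concat podcast/wav/*.wav --output podcast.wav
```

The glob in the final `concat` relies on the zero-padded NN prefix to keep the lines in order.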
Options Reference
| Option | Description | Default |
| --- | --- | --- |
| --stream | Stream as it generates | false |
| --play | Play after complete | false |
| --output <path> | Output file | ~/Audio/speak/ |
| --output-dir <dir> | Batch output directory | |
| --voice <path> | Voice sample (full path) | default |
| --timeout <sec> | Timeout per file | 300 |
| --auto-chunk | Split long documents | false |
| --chunk-size <n> | Chars per chunk | 6000 |
| --resume <file> | Resume from manifest | |
| --keep-chunks | Keep intermediate files | false |
| --skip-existing | Skip if output exists | false |
| --estimate | Show duration estimate | false |
| --dry-run | Preview only | false |
| --quiet | Suppress output | false |
Commands
| Command | Description |
| --- | --- |
| speak setup | Set up environment |
| speak health | Check system status |
| speak models | List TTS models |
| speak concat | Concatenate audio |
| speak daemon kill | Stop TTS server |
| speak config | Show configuration |
Performance
| Metric | Value |
| --- | --- |
| Cold start | ~4-8s |
| Warm start | ~3-8s |
| Speed | 0.3-0.5x RTF (faster than real-time) |
| Storage | ~2.5 MB/min, ~150 MB/hour |
Resume Capability
For interrupted long generations:
# Single file with auto-chunk — use --resume
speak long.txt --auto-chunk --output book.wav
# If interrupted, manifest saved at ~/Audio/speak/manifest.json
speak --resume ~/Audio/speak/manifest.json
# Batch processing — use --skip-existing
speak ch*.txt --output-dir audiobook/ --auto-chunk
# If interrupted, re-run same command:
speak ch*.txt --output-dir audiobook/ --auto-chunk --skip-existing
Common Errors
| Error | Cause | Solution |
| --- | --- | --- |
| "Voice file not found" | Relative path | Use full path: ~/.chatter/voices/x.wav |
| "Invalid WAV format" | Wrong specs | Convert: ffmpeg -i in.wav -ar 24000 -ac 1 out.wav |
| "Voice sample too short" | <10 seconds | Record 15-25 seconds |
| "Output directory doesn't exist" | Not created | mkdir -p dirname/ |
| "sox not found" | Not installed | brew install sox |
| Scrambled concat order | Non-zero-padded | Use 01, 02, not 1, 2 |
| Timeout | >5 min generation | Use --auto-chunk or --timeout 600 |
| "Server not running" | Stale daemon | speak daemon kill && speak health |
Setup
speak "test" # Auto-setup on first run (downloads model ~500MB)
speak setup # Or manual setup
speak health # Verify everything works
Server Management
Server auto-starts and shuts down after 1 hour idle.
speak health # Check status
speak daemon kill # Stop manually
Weekly Installs: 684
Repository: emzod/speak
GitHub Stars: 6
First Seen: Jan 27, 2026
Security Audits: Gen Agent Trust Hub (Warn), Socket (Pass), Snyk (Warn)
Installed on: github-copilot (649), gemini-cli (639), opencode (638), codex (635), cursor (630), cline (616)