speak-tts

by @emzodv
4.8(26)

Equips AI agents with real-time voice conversation capabilities, supporting local text-to-speech, voice cloning, and audio generation, with optimized performance for Apple Silicon devices.

Tags: Text-to-Speech, TTS, Speech Synthesis, Voice AI, Natural Language Generation
Installation
npx skills add emzod/speak --skill speak-tts
Before / After Comparison

Before

Agents could communicate only through on-screen text, which users had to stop and read. That slowed interaction, especially when users needed information quickly or were multitasking.

After

With the `speak-tts` skill, agents can convert text to speech in real time, so users can listen to responses while they work on other tasks.

SKILL.md

speak-tts

speak - Talk to your Claude!

Give your agent the ability to speak to you in real time. Local text-to-speech, voice cloning, and audio generation on Apple Silicon.

Prerequisites

| Requirement | Check | Install |
|---|---|---|
| Apple Silicon Mac | `uname -m` → `arm64` | Intel not supported |
| macOS 12.0+ | `sw_vers` | |
| sox | `which sox` | `brew install sox` |
| ffmpeg | `which ffmpeg` | `brew install ffmpeg` |
| poppler (PDF) | `which pdftotext` | `brew install poppler` |
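
The checks above can be scripted. The following is a minimal sketch, not part of the skill itself; `missing_tools` is a hypothetical helper name, and it only reports which of the named commands are absent from `PATH`.

```shell
# Hypothetical helper (not provided by speak): print each command
# from the argument list that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example usage:
#   [ "$(uname -m)" = "arm64" ] || echo "Intel Macs are not supported"
#   missing_tools sox ffmpeg pdftotext   # prints nothing if all are installed
```

Any name it prints can be installed with the matching `brew install` command from the table.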

Input Sources

| Source | Example |
|---|---|
| Text file | `speak article.txt` |
| Markdown | `speak doc.md` |
| Direct string | `speak "Hello"` |
| Clipboard | `pbpaste \| speak` |
| Stdin | `cat file.txt \| speak` |

Web Articles

lynx -dump -nolist "https://example.com/article" | speak --output article.wav

Converting Formats

| Format | Convert Command |
|---|---|
| PDF | `pdftotext doc.pdf doc.txt` |
| DOCX | `textutil -convert txt doc.docx` |
| HTML | `pandoc -f html -t plain doc.html > doc.txt` |

Output Modes

| Goal | Command |
|---|---|
| Save for later | `speak text.txt --output file.wav` |
| Listen now (streaming) | `speak text.txt --stream` |
| Listen now (complete) | `speak text.txt --play` |
| Both | `speak text.txt --stream --output file.wav` |

Default Behavior

speak article.txt          # → ~/Audio/speak/article.wav (no playback)
speak "Hello"              # → ~/Audio/speak/speak_<timestamp>.wav

Directory Auto-Creation

| Directory | Auto-Created? |
|---|---|
| `~/Audio/speak/` | ✓ Yes |
| `~/.chatter/voices/` | ✗ No |
| Custom directories | ✗ No |

Always create custom directories first:

mkdir -p ~/.chatter/voices/
mkdir -p ~/Audio/custom/
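
Because custom directories are not created automatically, a script can create the parent directory of an output path before calling `speak`. This is a small sketch under that assumption; `ensure_output_dir` is a hypothetical helper name, not a `speak` command.

```shell
# Hypothetical helper: create the directory that will contain the
# given output file. The file itself is not created.
ensure_output_dir() {
  dir=$(dirname "$1")
  mkdir -p "$dir"
}

# Example usage:
#   ensure_output_dir ~/Audio/custom/talk.wav
#   speak notes.txt --output ~/Audio/custom/talk.wav
```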

Voice Cloning

Voice cloning generates speech that matches your vocal characteristics (pitch, tone, cadence) from a short recording.

Quality Expectations

  • Output captures general voice characteristics but is not a perfect replica

  • Quality depends heavily on sample quality

  • 15-25 seconds is optimal (10s minimum, 30s maximum)

Recording Your Voice

Using QuickTime:

  • Open QuickTime Player → File → New Audio Recording

  • Record 20 seconds of clear speech

  • File → Export As → Audio Only (.m4a)

  • Convert to WAV (see below)

Using sox (command line):

# -d = use default microphone
# Recording starts immediately and stops after 25 seconds
sox -d -r 24000 -c 1 ~/.chatter/voices/my_voice.wav trim 0 25

Converting to Required Format

Voice samples MUST be: WAV, 24000 Hz, mono, 10-30 seconds.

# From MP3
ffmpeg -i voice.mp3 -ar 24000 -ac 1 voice.wav

# From M4A (QuickTime)
ffmpeg -i voice.m4a -ar 24000 -ac 1 voice.wav

# Trim to 25 seconds
ffmpeg -i long.wav -t 25 -ar 24000 -ac 1 trimmed.wav

# Check sample properties
ffprobe -i voice.wav 2>&1 | grep -E "Duration|Stream"
# Should show: Duration ~15-25s, 24000 Hz, mono
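
The format rule above (24000 Hz, mono, 10-30 seconds) can be checked mechanically once the three values are known. The sketch below assumes you pass them in yourself; `sample_ok` is a hypothetical helper and does not read the file or verify that the container is WAV.

```shell
# Hypothetical helper: check sample rate, channel count, and duration
# against the documented requirements. Prints "ok" or the first problem.
sample_ok() {
  rate=$1
  channels=$2
  duration=$3
  if [ "$rate" != "24000" ]; then
    echo "sample rate is $rate, expected 24000"
    return 1
  fi
  if [ "$channels" != "1" ]; then
    echo "channels is $channels, expected 1 (mono)"
    return 1
  fi
  # awk handles fractional durations such as 20.5
  if ! awk -v d="$duration" 'BEGIN { exit !(d >= 10 && d <= 30) }'; then
    echo "duration ${duration}s is outside 10-30s"
    return 1
  fi
  echo "ok"
}

# Example usage:
#   sample_ok 24000 1 20.5
```

The rate, channel count, and duration can be read from the `ffprobe` output shown above.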

Using Your Voice

# Create directory
mkdir -p ~/.chatter/voices/

# Move sample
mv voice.wav ~/.chatter/voices/my_voice.wav

# Test
speak "Testing my voice" --voice ~/.chatter/voices/my_voice.wav --stream

# Use for content
speak notes.txt --voice ~/.chatter/voices/my_voice.wav --output presentation.wav

Path requirements:

  • ✓ Works: ~/.chatter/voices/my_voice.wav (tilde expanded by shell)

  • ✓ Works: /Users/name/.chatter/voices/my_voice.wav

  • ✗ Fails: my_voice.wav (relative path)

  • ✗ Fails: ./voices/my_voice.wav (relative path)
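
Since relative paths fail, a script that receives a voice path from elsewhere can normalize it first. A minimal sketch, assuming the path is relative to the current directory; `to_abs` is a hypothetical helper name.

```shell
# Hypothetical helper: return the path unchanged if it is already
# absolute, otherwise prefix it with the current working directory.
to_abs() {
  case "$1" in
    /*) printf '%s\n' "$1" ;;
    *)  printf '%s/%s\n' "$PWD" "${1#./}" ;;
  esac
}

# Example usage:
#   speak "Testing my voice" --voice "$(to_abs my_voice.wav)" --stream
```

A leading `~` never reaches this function unexpanded as long as the path is unquoted on the command line, because the shell expands it first.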

Voice Sample Tips

| Good Sample | Bad Sample |
|---|---|
| Quiet room | Background noise |
| Natural pace | Rushed or monotone |
| Clear diction | Mumbling |
| Varied content | Repetitive phrases |

Default Voice

When --voice is omitted, a built-in default voice is used:

speak "Hello world" --stream  # Uses default voice

Emotion Tags

Tags produce audible effects (actual sounds), not spoken words:

speak "[sigh] Monday again." --stream
# Output: (sigh sound) "Monday again."

| Tag | Effect |
|---|---|
| `[laugh]` | Laughter |
| `[chuckle]` | Light chuckle |
| `[sigh]` | Sighing |
| `[gasp]` | Gasping |
| `[groan]` | Groaning |
| `[clear throat]` | Throat clearing |
| `[cough]` | Coughing |
| `[crying]` | Crying |
| `[singing]` | Sung speech |

NOT supported: [pause], [whisper] (ignored)

For pauses: Use punctuation: "Wait... let me think."
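
Because unsupported tags are silently ignored, it can help to scan text for them before generating audio. The sketch below is an illustration only; `unsupported_tags` is a hypothetical helper, and it treats every bracketed phrase that is not one of the nine tags listed above as unsupported.

```shell
# The nine tags listed in the table above, one per line.
supported_tags='[laugh]
[chuckle]
[sigh]
[gasp]
[groan]
[clear throat]
[cough]
[crying]
[singing]'

# Hypothetical helper: print every [bracketed] tag in the text
# that is not in the supported list, one per line.
unsupported_tags() {
  printf '%s\n' "$1" \
    | grep -o '[[][^]]*[]]' \
    | grep -vxF -e "$supported_tags" || true
}

# Example usage:
#   unsupported_tags "$(cat script.txt)"
```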

Batch Processing

mkdir -p ~/Audio/book/
speak ch01.txt ch02.txt ch03.txt --output-dir ~/Audio/book/
# Creates: ch01.wav, ch02.wav, ch03.wav

# With auto-chunking (for long files)
speak chapters/*.txt --output-dir ~/Audio/book/ --auto-chunk

# Skip completed files
speak chapters/*.txt --output-dir ~/Audio/book/ --skip-existing

Auto-Chunk Behavior

When using --auto-chunk with batch processing:

  • Each input file is chunked independently

  • Chunks are generated and automatically concatenated per file

  • Final output: one .wav per input file (e.g., ch01.wav)

  • Intermediate chunks deleted (unless --keep-chunks)

You don't need to manually concatenate chunks — only concatenate final chapter files.
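
For a rough idea of how many chunks a file will produce, divide its character count by the chunk size and round up. This sketch assumes chunks are cut at exactly `--chunk-size` characters (6000 by default, per the Options Reference); the tool may split elsewhere, so treat the result as an estimate. `estimate_chunks` is a hypothetical helper name.

```shell
# Hypothetical helper: ceiling division of character count by chunk size.
# Usage: estimate_chunks <chars> [chunk_size]
estimate_chunks() {
  chars=$1
  size=${2:-6000}
  echo $(( (chars + size - 1) / size ))
}

# Example usage:
#   estimate_chunks $(wc -m < long.txt)
```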

Concatenating Audio

# Explicit order (recommended)
speak concat ch01.wav ch02.wav ch03.wav --output book.wav

# Glob pattern (REQUIRES zero-padded filenames)
speak concat audiobook/*.wav --output book.wav

Zero-Padding Rules

Critical for correct concatenation order:

| Files | Correct | Wrong |
|---|---|---|
| 1-9 | 01, 02, ..., 09 | 1, 2, ..., 9 |
| 10-99 | 01, 02, ..., 99 | 1, 10, 2, ... |
| 100+ | 001, 002, ..., 999 | 1, 100, 2, ... |

Why: Shell glob expansion sorts names lexicographically, not numerically. Unpadded names expand in the order 1, 10, 2, whereas zero-padded names expand as 01, 02, 10.
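
The ordering problem is easy to reproduce with `sort`, which uses the same lexicographic comparison in the C locale. The sketch below also shows one way to build two-digit names with `printf`; `padded_name` is a hypothetical helper, and it expects an unpadded number as input.

```shell
# Lexicographic order puts ch10.wav before ch2.wav.
printf '%s\n' ch1.wav ch2.wav ch10.wav | LC_ALL=C sort
# → ch1.wav, ch10.wav, ch2.wav

# Hypothetical helper: build a two-digit, zero-padded file name.
# Usage: padded_name <prefix> <number> <extension>
padded_name() {
  printf '%s%02d.%s\n' "$1" "$2" "$3"
}

# Example usage:
#   mv ch3.wav "$(padded_name ch 3 wav)"   # renames to ch03.wav
```

For 100 or more files, `%03d` in place of `%02d` gives three-digit names.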

PDF to Audiobook (Complete Workflow)

Step 1: Find Chapter Boundaries

# Preview table of contents
pdftotext -f 1 -l 5 textbook.pdf toc.txt
cat toc.txt  # Note chapter page numbers

# Or search for "Chapter" markers
pdftotext textbook.pdf - | grep -n "Chapter"

Step 2: Extract Chapters (Zero-Padded!)

# For 100-page book with ~10 chapters
pdftotext -f 1 -l 12 -layout textbook.pdf ch01.txt
pdftotext -f 13 -l 25 -layout textbook.pdf ch02.txt
pdftotext -f 26 -l 38 -layout textbook.pdf ch03.txt
# ... continue for all chapters
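
For books with many chapters, the commands above can be generated from a list of page ranges instead of typed by hand. This is a sketch only; `chapter_commands` is a hypothetical helper that prints the `pdftotext` commands rather than running them, so they can be reviewed first.

```shell
# Hypothetical helper: read "first last" page ranges from stdin and
# print one pdftotext command per chapter with a zero-padded file name.
# Usage: chapter_commands <pdf> < ranges.txt
chapter_commands() {
  pdf=$1
  n=0
  while read -r first last; do
    n=$((n + 1))
    printf 'pdftotext -f %d -l %d -layout %s ch%02d.txt\n' \
      "$first" "$last" "$pdf" "$n"
  done
}

# Example usage:
#   printf '1 12\n13 25\n26 38\n' | chapter_commands textbook.pdf
```

Piping the output to `sh` runs the commands, provided the PDF name contains no spaces or shell metacharacters.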

Step 3: Estimate Time

speak --estimate ch*.txt
# Shows: total audio duration, generation time, storage needed

# Quick estimates:
# 1 page ≈ 2 min audio ≈ 1 min generation
# 100 pages ≈ 200 min audio ≈ 100 min generation ≈ 500 MB
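
The quick estimates can be turned into a small calculator. The sketch below applies the same rules of thumb (2 min of audio and 1 min of generation per page, about 2.5 MB per minute of audio); `estimate_pages` is a hypothetical helper, and real figures depend on page density and hardware.

```shell
# Hypothetical helper: rough audiobook estimate from a page count.
estimate_pages() {
  pages=$1
  audio_min=$((pages * 2))            # 1 page ≈ 2 min audio
  generation_min=$((pages * 1))       # 1 page ≈ 1 min generation
  storage_mb=$((audio_min * 5 / 2))   # ≈ 2.5 MB per minute of audio
  echo "audio_min=$audio_min generation_min=$generation_min storage_mb=$storage_mb"
}

# Example usage:
#   estimate_pages 100
#   → audio_min=200 generation_min=100 storage_mb=500
```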

Step 4: Generate Audio

mkdir -p audiobook/
speak ch01.txt ch02.txt ch03.txt --output-dir audiobook/ --auto-chunk
# Creates: audiobook/ch01.wav, audiobook/ch02.wav, audiobook/ch03.wav

Step 5: Concatenate

speak concat audiobook/ch01.wav audiobook/ch02.wav audiobook/ch03.wav --output complete_audiobook.wav
# Or with glob (only if zero-padded):
speak concat audiobook/ch*.wav --output complete_audiobook.wav

PDF Troubleshooting

| Issue | Solution |
|---|---|
| Empty/garbled text | Scanned PDF: use OCR (`brew install tesseract`) |
| Wrong encoding | Try: `pdftotext -enc UTF-8 doc.pdf` |
| Check word count | `pdftotext doc.pdf - \| wc -w` (should be >100) |

Multi-Voice Content

mkdir -p podcast/scripts podcast/wav

echo "Welcome to the show." > podcast/scripts/01_host.txt
echo "Thanks for having me." > podcast/scripts/02_guest.txt

speak podcast/scripts/01_host.txt --voice ~/.chatter/voices/host.wav --output podcast/wav/01.wav
speak podcast/scripts/02_guest.txt --voice ~/.chatter/voices/guest.wav --output podcast/wav/02.wav

speak concat podcast/wav/01.wav podcast/wav/02.wav --output podcast.wav
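
With more than a few segments, the per-segment command can be derived from the script file name. This sketch assumes the naming pattern used above (`NN_role.txt`, with a matching `role.wav` in `~/.chatter/voices/`); `segment_command` is a hypothetical helper that prints the command instead of running it.

```shell
# Hypothetical helper: derive the speak command for one script file
# named NN_role.txt, e.g. podcast/scripts/01_host.txt.
segment_command() {
  script=$1
  base=$(basename "$script" .txt)   # e.g. 01_host
  num=${base%%_*}                   # e.g. 01
  role=${base#*_}                   # e.g. host
  printf 'speak %s --voice %s/.chatter/voices/%s.wav --output podcast/wav/%s.wav\n' \
    "$script" "$HOME" "$role" "$num"
}

# Example usage:
#   for f in podcast/scripts/*.txt; do segment_command "$f"; done
```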

Options Reference

| Option | Description | Default |
|---|---|---|
| `--stream` | Stream as it generates | false |
| `--play` | Play after complete | false |
| `--output <path>` | Output file | `~/Audio/speak/` |
| `--output-dir <dir>` | Batch output directory | |
| `--voice <path>` | Voice sample (full path) | default |
| `--timeout <sec>` | Timeout per file | 300 |
| `--auto-chunk` | Split long documents | false |
| `--chunk-size <n>` | Chars per chunk | 6000 |
| `--resume <file>` | Resume from manifest | |
| `--keep-chunks` | Keep intermediate files | false |
| `--skip-existing` | Skip if output exists | false |
| `--estimate` | Show duration estimate | false |
| `--dry-run` | Preview only | false |
| `--quiet` | Suppress output | false |

Commands

| Command | Description |
|---|---|
| `speak setup` | Set up environment |
| `speak health` | Check system status |
| `speak models` | List TTS models |
| `speak concat` | Concatenate audio |
| `speak daemon kill` | Stop TTS server |
| `speak config` | Show configuration |

Performance

| Metric | Value |
|---|---|
| Cold start | ~4-8s |
| Warm start | ~3-8s |
| Speed | 0.3-0.5x RTF (faster than real-time) |
| Storage | ~2.5 MB/min, ~150 MB/hour |

Resume Capability

For interrupted long generations:

# Single file with auto-chunk — use --resume
speak long.txt --auto-chunk --output book.wav
# If interrupted, manifest saved at ~/Audio/speak/manifest.json
speak --resume ~/Audio/speak/manifest.json

# Batch processing — use --skip-existing
speak ch*.txt --output-dir audiobook/ --auto-chunk
# If interrupted, re-run same command:
speak ch*.txt --output-dir audiobook/ --auto-chunk --skip-existing
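
For the single-file case, a wrapper can choose between resuming and starting fresh. The sketch below assumes that a manifest file left on disk means the previous run was interrupted, which this document does not confirm; `resume_or_start` is a hypothetical helper that prints the command to run.

```shell
# Hypothetical helper: print a resume command if the manifest exists,
# otherwise print the command for a fresh auto-chunked run.
# Usage: resume_or_start <manifest> <input> <output>
resume_or_start() {
  manifest=$1
  input=$2
  output=$3
  if [ -f "$manifest" ]; then
    echo "speak --resume $manifest"
  else
    echo "speak $input --auto-chunk --output $output"
  fi
}

# Example usage:
#   resume_or_start ~/Audio/speak/manifest.json long.txt book.wav
```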

Common Errors

| Error | Cause | Solution |
|---|---|---|
| "Voice file not found" | Relative path | Use full path: `~/.chatter/voices/x.wav` |
| "Invalid WAV format" | Wrong specs | Convert: `ffmpeg -i in.wav -ar 24000 -ac 1 out.wav` |
| "Voice sample too short" | <10 seconds | Record 15-25 seconds |
| "Output directory doesn't exist" | Not created | `mkdir -p dirname/` |
| "sox not found" | Not installed | `brew install sox` |
| Scrambled concat order | Non-zero-padded | Use 01, 02, not 1, 2 |
| Timeout | >5 min generation | Use `--auto-chunk` or `--timeout 600` |
| "Server not running" | Stale daemon | `speak daemon kill && speak health` |

Setup

speak "test"     # Auto-setup on first run (downloads model ~500MB)
speak setup      # Or manual setup
speak health     # Verify everything works

Server Management

Server auto-starts and shuts down after 1 hour idle.

speak health        # Check status
speak daemon kill   # Stop manually

Weekly Installs: 684
Repository: emzod/speak
GitHub Stars: 6
First Seen: Jan 27, 2026
Security Audits: Gen Agent Trust Hub (Warn), Socket (Pass), Snyk (Warn)
Installed on: github-copilot 649, gemini-cli 639, opencode 638, codex 635, cursor 630, cline 616

Statistics

Installs: 551
Rating: 4.8 / 5.0
Updated: March 17, 2026
Comparisons: 1

Compatible Platforms

  • Claude Code

  • OpenClaw

  • OpenCode

  • Codex

  • Gemini CLI

  • GitHub Copilot

  • Amp

  • Kimi CLI

Timeline

Created: March 17, 2026
Last Updated: March 17, 2026