---
id: ssh2-chat-with-anyone
name: "chat-with-anyone"
url: https://skills.yangsir.net/skill/ssh2-chat-with-anyone
author: noizai
domain: persona
tags: ["natural-language-processing", "chatbots", "api-integration", "real-time-communication", "conversational-ai"]
install_count: 2000
rating: 4.30 (48 reviews)
github: https://github.com/noizai/skills
---

# chat-with-anyone

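> Hold a conversation in the voice of a real person or a fictional character, using voice samples found online. An AI agent skill for improving productivity and automation.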

**Stats**: 2,000 installs · 4.3/5 (48 reviews)

## Before / After Comparison

### Simulating voice conversations with real or fictional characters

## Readme

# Chat with Anyone

Clone a real person's voice from online video, or design a voice from a photo, then roleplay as that person with TTS.

## Prerequisites

- `youtube-downloader` skill installed (Workflow A)
- `tts` skill installed
- `ffmpeg` on PATH
- Noiz API key configured: `python3 skills/tts/scripts/tts.py config --set-api-key YOUR_KEY`
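
The checklist above can be partly automated before starting either workflow. The sketch below is an illustration, not part of the skill: `missing_prerequisites` is a hypothetical helper, and the script paths are copied from the commands in this README on the assumption that the skills live under `skills/`. It does not check the Noiz API key, because this README does not say where the key is stored.

```python
import shutil
from pathlib import Path

# Script paths as they appear in this README's commands; adjust to your layout.
REQUIRED_SCRIPTS = [
    "skills/youtube-downloader/scripts/download_video.py",  # needed for Workflow A
    "skills/tts/scripts/tts.py",
]


def missing_prerequisites(root=".", which=shutil.which):
    """Return a list of problems; an empty list means the basics are in place."""
    problems = []
    if which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    for rel in REQUIRED_SCRIPTS:
        if not (Path(root) / rel).is_file():
            problems.append(f"missing script: {rel}")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```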

## Mode Selection

- **User names a person** (real or fictional) --> Workflow A
- **User provides an image**, person is unrecognizable --> Workflow B
- **User provides an image**, person is a recognizable public figure --> Workflow A (real voice is more authentic)
- **Multiple people in image** --> Ask which person first
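
The rules above amount to a small decision function. The following is a minimal sketch: `select_workflow` and its return labels are hypothetical names, and the final fallback (no name and no image) is an assumption that this README does not cover.

```python
def select_workflow(named_person=False, has_image=False,
                    recognizable=False, people_in_image=1):
    """Map the Mode Selection rules to the next step."""
    if has_image and people_in_image > 1:
        return "ask_which_person"       # resolve the target person first
    if named_person:
        return "A"                      # voice cloned from online video
    if has_image:
        # A recognizable public figure gets a real voice; otherwise design one.
        return "A" if recognizable else "B"
    return "ask_for_name_or_image"      # assumption: not specified in this README
```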

---

## Workflow A: Name-based (voice from online video)

Track progress with this checklist:

```
- [ ] A1. Disambiguate character
- [ ] A2. Find reference video
- [ ] A3. Download audio + subtitles
- [ ] A4. Extract best reference segment
- [ ] A5. Generate speech
```

### A1. Disambiguate Character

If the name is ambiguous (e.g. "US President", "Spider-Man actor"), ask the user to specify the exact person before proceeding.

### A2. Find a Reference Video

Use web search to find a YouTube video of the person speaking clearly. Best candidates: interviews, speeches, press conferences. Avoid videos with heavy background music.

Search queries to try:
- `{CHARACTER_NAME} interview` / `{CHARACTER_NAME} 采访`
- `{CHARACTER_NAME} speech` / `{CHARACTER_NAME} 演讲`
- `{CHARACTER_NAME} press conference`
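
The query patterns above can be generated from the character name. This is a small sketch; `reference_video_queries` is a hypothetical helper, and the suffix list simply mirrors the patterns listed here.

```python
def reference_video_queries(character_name):
    """Build the search queries listed above, in the same order."""
    name = character_name.strip()
    suffixes = ["interview", "采访", "speech", "演讲", "press conference"]
    return [f"{name} {suffix}" for suffix in suffixes]


if __name__ == "__main__":
    for query in reference_video_queries("Donald Trump"):
        print(query)
```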

### A3. Download Audio and Subtitles

```bash
python3 skills/youtube-downloader/scripts/download_video.py "{VIDEO_URL}" \
  -o "tmp/chat_with_anyone/{CHARACTER_NAME}" --audio-only --subtitles
```

After download, list the output directory to identify the audio file and SRT subtitle file:

```bash
ls tmp/chat_with_anyone/{CHARACTER_NAME}/
```

Expected output: a `.mp3` audio file and one or more `.srt` subtitle files.

**If no subtitle files appear**: try a different video that has auto-generated captions, or add `--sub-lang en,zh-Hans` to request specific languages.
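
Locating the downloaded files can also be done in code rather than by reading `ls` output. The sketch below assumes the output matches the description above (one `.mp3` file and zero or more `.srt` files in the output directory); `find_downloads` is a hypothetical helper.

```python
from pathlib import Path


def find_downloads(out_dir):
    """Return (audio_path_or_None, list_of_srt_paths) for a download directory."""
    out = Path(out_dir)
    audio = sorted(out.glob("*.mp3"))
    subtitles = sorted(out.glob("*.srt"))
    return (audio[0] if audio else None), subtitles
```

An empty subtitle list is the signal to retry with `--sub-lang` or pick another video.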

### A4. Extract Best Reference Segment

Use the automated extraction script. It parses the SRT, finds the densest 3 to 12 second speech window, and extracts that window as a WAV file:

```bash
python3 skills/chat-with-anyone/scripts/extract_ref_segment.py \
  --srt "tmp/chat_with_anyone/{CHARACTER_NAME}/{SRT_FILE}" \
  --audio "tmp/chat_with_anyone/{CHARACTER_NAME}/{AUDIO_FILE}" \
  -o "tmp/chat_with_anyone/{CHARACTER_NAME}/ref.wav"
```

The script prints the selected time range and saves the reference WAV. Verify the output exists and is non-empty before proceeding.
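
One way to do that verification is with Python's standard `wave` module. This sketch assumes `ref.wav` is a PCM WAV file, which `wave` can read; other encodings (for example float WAV) raise `wave.Error` and are reported here as unreadable. `wav_duration_seconds` is a hypothetical helper.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration in seconds, or None if the file is missing,
    empty, or not readable as a PCM WAV file."""
    p = Path(path)
    if not p.is_file() or p.stat().st_size == 0:
        return None
    try:
        with wave.open(str(p), "rb") as w:
            frames, rate = w.getnframes(), w.getframerate()
    except (wave.Error, EOFError):
        return None
    return frames / rate if rate else None
```

A result of `None`, or a duration far outside the 3 to 12 second range, suggests re-running the extraction.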

**If the script reports no suitable segment**: try `--min-duration 2` for shorter clips, or download a different video.
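
To make the idea of a "densest speech window" concrete, here is a rough sketch of one way such a search could work. It is not the actual logic of `extract_ref_segment.py`, which this README does not show: the density measure (fraction of the window covered by subtitle cues), the function names, and the tie-breaking rule are all assumptions.

```python
import re

# Matches an SRT timing line such as "00:00:10,000 --> 00:00:12,500".
_TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def parse_srt_cues(text):
    """Return sorted (start_s, end_s) pairs for every timing line in SRT text."""
    cues = []
    for m in _TIMING.finditer(text):
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        if end > start:
            cues.append((start, end))
    return sorted(cues)


def densest_window(cues, min_dur=3.0, max_dur=12.0):
    """Among runs of consecutive cues spanning min_dur..max_dur seconds, pick
    the one with the highest share of time covered by speech.
    Ties go to the longer window. Returns (start, end) or None."""
    best, best_key = None, None
    for i, (start, _) in enumerate(cues):
        spoken = 0.0
        for cue_start, cue_end in cues[i:]:
            span = cue_end - start
            if span > max_dur:
                break
            spoken += cue_end - cue_start
            if span >= min_dur:
                # Overlapping auto-captions can push coverage above 1; cap it.
                key = (min(1.0, spoken / span), span)
                if best_key is None or key > best_key:
                    best, best_key = (start, cue_end), key
    return best
```

In this sketch, lowering `min_dur` plays the same role as `--min-duration 2`: it admits shorter runs when no run of at least three seconds exists.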

### A5. Generate Speech and Roleplay

Write a response in character, then synthesize it:

```bash
python3 skills/tts/scripts/tts.py \
  -t "{RESPONSE_TEXT}" \
  --ref-audio "tmp/chat_with_anyone/{CHARACTER_NAME}/ref.wav" \
  -o "tmp/chat_with_anyone/{CHARACTER_NAME}/reply.wav"
```

Present the generated audio file to the user along with the text. For subsequent messages, reuse the same `--ref-audio` path.
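
Response text often contains quotes or other characters that break shell quoting when pasted into the command above. A minimal sketch of a safer invocation passes the arguments as a list, so no shell is involved. `tts_command` and `synthesize` are hypothetical helpers, and only the flags shown in this README (`-t`, `--ref-audio`, `-o`) are used.

```python
import subprocess

TTS_SCRIPT = "skills/tts/scripts/tts.py"  # path as used in this README


def tts_command(text, ref_audio, out_path):
    """Build the argument list for tts.py without any shell quoting."""
    return ["python3", TTS_SCRIPT, "-t", text,
            "--ref-audio", ref_audio, "-o", out_path]


def synthesize(text, ref_audio, out_path):
    """Run tts.py and raise CalledProcessError if it exits with an error."""
    subprocess.run(tts_command(text, ref_audio, out_path), check=True)
```

Calling `synthesize` with the same `ref_audio` path on every turn keeps the voice consistent across the conversation.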

---

## Workflow B: Image-based (voice from photo)

Track progress with this checklist:

```
- [ ] B1. Analyze image
- [ ] B2. Design voice
- [ ] B3. Preview (optional)
- [ ] B4. Generate speech
```

### B1. Analyze the Image

Use your vision capability to examine the image:

1. **If the person is a recognizable public figure** --> switch to Workflow A for authentic voice.
2. **If unrecognizable**, produce a voice description covering:
   - Gender (male / female)
   - Approximate age (e.g. "around 30 years old")
   - Apparent demeanor (e.g. cheerful, authoritative, gentle)
   - Contextual cues (e.g. suit --> professional tone; athletic outfit --> energetic)
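
The four attributes above can be combined into a single description string. This is only a sketch: `build_voice_description` is a hypothetical helper, and the phrasing it produces is an assumption, since this README does not specify what wording the voice-design script responds to best.

```python
def build_voice_description(gender, age, demeanor, cues=()):
    """Join the observed attributes into one comma-separated description."""
    parts = [f"A {gender} voice", f"around {age} years old", demeanor]
    parts.extend(cues)
    # Drop empty entries so missing attributes do not leave stray commas.
    return ", ".join(p.strip() for p in parts if p and p.strip())


if __name__ == "__main__":
    print(build_voice_description("female", 25, "gentle and warm", ["friendly tone"]))
```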

### B2. Design the Voice

Pass both the image and the description to the voice-design script:

```bash
python3 skills/chat-with-anyone/scripts/voice_design.py \
  --picture "{IMAGE_PATH}" \
  --voice-description "{VOICE_DESCRIPTION}" \
  -o "tmp/chat_with_anyone/voice_design"
```

The script outputs:
- Detected voice features (printed to stdout)
- Preview audio files in the output directory
- `voice_id.txt` containing the best voice ID

Read the voice ID:

```bash
cat tmp/chat_with_anyone/voice_design/voice_id.txt
```
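
When the workflow is driven from Python, the ID can be read the same way. The sketch below assumes `voice_id.txt` contains just the ID, possibly followed by a newline; `read_voice_id` is a hypothetical helper.

```python
from pathlib import Path


def read_voice_id(out_dir="tmp/chat_with_anyone/voice_design"):
    """Return the voice ID written by the voice-design step."""
    path = Path(out_dir) / "voice_id.txt"
    voice_id = path.read_text(encoding="utf-8").strip()
    if not voice_id:
        raise ValueError(f"{path} is empty; re-run the voice design step")
    return voice_id
```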

### B3. Preview (Optional)

Present the preview audio files from the output directory so the user can hear the voice. If unsatisfied, re-run B2 with adjusted `--voice-description` or `--guidance-scale`.

### B4. Generate Speech and Roleplay

```bash
python3 skills/tts/scripts/tts.py \
  -t "{RESPONSE_TEXT}" \
  --voice-id "{VOICE_ID}" \
  -o "tmp/chat_with_anyone/voice_design/reply.wav"
```

For subsequent messages, keep using the same `--voice-id` for consistency.

---

## Example: Name-based

**User**: I want to chat with Trump. Have him tell me a bedtime story.

**Agent steps**:
1. Character: Donald Trump. No disambiguation needed.
2. Search `Donald Trump speech youtube`, find a clear speech video.
3. Download:
   `python3 skills/youtube-downloader/scripts/download_video.py "https://youtube.com/watch?v=..." -o tmp/chat_with_anyone/trump --audio-only --subtitles`
4. Extract reference:
   `python3 skills/chat-with-anyone/scripts/extract_ref_segment.py --srt "tmp/chat_with_anyone/trump/....srt" --audio "tmp/chat_with_anyone/trump/....mp3" -o "tmp/chat_with_anyone/trump/ref.wav"`
5. Generate TTS in Trump's style:
   `python3 skills/tts/scripts/tts.py -t "Let me tell you a tremendous bedtime story..." --ref-audio "tmp/chat_with_anyone/trump/ref.wav" -o "tmp/chat_with_anyone/trump/reply.wav"`
6. Present `reply.wav` and the story text to the user.

## Example: Image-based

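**User**: [uploads photo.jpg] I want to chat with the person in this picture. (The user writes in Chinese, so the generated reply is in Chinese.)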

**Agent steps**:
1. Vision analysis: unrecognizable young woman, ~25, casual sweater, warm smile.
2. Design voice:
   `python3 skills/chat-with-anyone/scripts/voice_design.py --picture "photo.jpg" --voice-description "A young Chinese woman around 25, gentle and warm voice, friendly tone" -o "tmp/chat_with_anyone/voice_design"`
3. Read voice ID from `tmp/chat_with_anyone/voice_design/voice_id.txt`.
4. Generate TTS:
   `python3 skills/tts/scripts/tts.py -t "你好呀！很高兴认识你！" --voice-id "{VOICE_ID}" -o "tmp/chat_with_anyone/voice_design/reply.wav"`
5. Present the audio and continue the roleplay with the same `--voice-id`.

## Troubleshooting

| Problem | Solution |
|---------|----------|
| Download fails or video unavailable | Try a different video URL; some regions/videos are restricted |
| No SRT subtitle files | Re-download with `--sub-lang en,zh-Hans`; if still none, try a different video with auto-captions |
| `extract_ref_segment.py` finds no suitable window | Use `--min-duration 2` for shorter clips, or try a different video |
| Voice design returns error | Check Noiz API key; ensure image is a clear photo of a person |
| TTS output sounds wrong | For Workflow A, try a different reference video; for Workflow B, adjust `--voice-description` |


---
*Source: https://skills.yangsir.net/skill/ssh2-chat-with-anyone*