baoyu-youtube-transcript
自动下载 YouTube 视频的字幕和转录文本,支持手动和自动生成的字幕,无需 API 密钥或浏览器,直接调用 YouTube InnerTube API
npx skills add jimliu/baoyu-skills --skill baoyu-youtube-transcriptBefore / After 效果对比
1 组手动完成自动下载 YouTube 视频相关任务,需要反复操作和确认,整个过程大约需要115小时,容易出错且效率低下
使用该 Skill 自动化处理,智能分析和执行,6小时内完成全部工作,准确率高且流程标准化
description SKILL.md
baoyu-youtube-transcript
YouTube Transcript
Downloads transcripts (subtitles/captions) from YouTube videos. Works with both manually created and auto-generated transcripts. No API key or browser required — uses YouTube's InnerTube API directly.
Fetches video metadata and cover image on first run, caches raw data for fast re-formatting.
Script Directory
Scripts in scripts/ subdirectory. {baseDir} = this SKILL.md's directory path. Resolve ${BUN_X} runtime: if bun installed → bun; if npx available → npx -y bun; else suggest installing bun. Replace {baseDir} and ${BUN_X} with actual values.
Script Purpose
scripts/main.ts
Transcript download CLI
Usage
# Default: markdown with timestamps (English)
${BUN_X} {baseDir}/scripts/main.ts <youtube-url-or-id>
# Specify languages (priority order)
${BUN_X} {baseDir}/scripts/main.ts <url> --languages zh,en,ja
# Without timestamps
${BUN_X} {baseDir}/scripts/main.ts <url> --no-timestamps
# With chapter segmentation
${BUN_X} {baseDir}/scripts/main.ts <url> --chapters
# With speaker identification (requires AI post-processing)
${BUN_X} {baseDir}/scripts/main.ts <url> --speakers
# SRT subtitle file
${BUN_X} {baseDir}/scripts/main.ts <url> --format srt
# Translate transcript
${BUN_X} {baseDir}/scripts/main.ts <url> --translate zh-Hans
# List available transcripts
${BUN_X} {baseDir}/scripts/main.ts <url> --list
# Force re-fetch (ignore cache)
${BUN_X} {baseDir}/scripts/main.ts <url> --refresh
Options
Option Description Default
<url-or-id>
YouTube URL or video ID (multiple allowed)
Required
--languages <codes>
Language codes, comma-separated, in priority order
en
--format <fmt>
Output format: text, srt
text
--translate <code>
Translate to specified language code
--list
List available transcripts instead of fetching
--timestamps
Include [HH:MM:SS → HH:MM:SS] timestamps per paragraph
on
--no-timestamps
Disable timestamps
--chapters
Chapter segmentation from video description
--speakers
Raw transcript with metadata for speaker identification
--exclude-generated
Skip auto-generated transcripts
--exclude-manually-created
Skip manually created transcripts
--refresh
Force re-fetch, ignore cached data
-o, --output <path>
Save to specific file path
auto-generated
--output-dir <dir>
Base output directory
youtube-transcript
Input Formats
Accepts any of these as video input:
-
Full URL:
https://www.youtube.com/watch?v=dQw4w9WgXcQ -
Short URL:
https://youtu.be/dQw4w9WgXcQ -
Embed URL:
https://www.youtube.com/embed/dQw4w9WgXcQ -
Shorts URL:
https://www.youtube.com/shorts/dQw4w9WgXcQ -
Video ID:
dQw4w9WgXcQ
Output Formats
Format Extension Description
text
.md
Markdown with frontmatter, natural paragraphs, optional timestamps/chapters/speakers
srt
.srt
SubRip subtitle format for video players
Output Directory
youtube-transcript/
├── .index.json # Video ID → directory path mapping (for cache lookup)
└── {channel-slug}/{title-full-slug}/
├── meta.json # Video metadata (title, channel, description, duration, chapters, etc.)
├── transcript-raw.json # Raw transcript snippets from YouTube API (cached)
├── transcript-sentences.json # Sentence-segmented transcript (split by punctuation, merged across snippets)
├── imgs/
│ └── cover.jpg # Video thumbnail
├── transcript.md # Markdown transcript (generated from sentences)
└── transcript.srt # SRT subtitle (generated from raw snippets, if --format srt)
-
{channel-slug}: Channel name in kebab-case -
{title-full-slug}: Full video title in kebab-case
The --list mode outputs to stdout only (no file saved).
Caching
On first fetch, the script saves:
-
meta.json— video metadata, chapters, cover image path, language info -
transcript-raw.json— raw transcript snippets from YouTube API ({ text, start, duration }[]) -
transcript-sentences.json— sentence-segmented transcript ({ text, start: "HH:mm:ss", end: "HH:mm:ss" }[]), split by sentence-ending punctuation (.?!…。?!etc.), timestamps proportionally allocated by character length, CJK-aware text merging -
imgs/cover.jpg— video thumbnail
Subsequent runs for the same video use cached data (no network calls). Use --refresh to force re-fetch. If a different language is requested, the cache is automatically refreshed.
SRT output (--format srt) is generated from transcript-raw.json. Text/markdown output uses transcript-sentences.json for natural sentence boundaries.
Workflow
When user provides a YouTube URL and wants the transcript:
-
Run with
--listfirst if the user hasn't specified a language, to show available options -
Default: run with
--chapters --speakersfor the richest output (chapters + speaker identification) -
The script auto-saves cached data + output file and prints the file path
-
For
--speakersmode: after the script saves the raw file, follow the speaker identification workflow below to post-process with speaker labels
When user only wants a cover image or metadata, running the script with any option will also cache meta.json and imgs/cover.jpg.
When re-formatting the same video (e.g., first text then SRT), the cached data is reused — no re-fetch needed.
Chapter & Speaker Workflow
Chapters (--chapters)
The script parses chapter timestamps from the video description (e.g., 0:00 Introduction), segments the transcript by chapter boundaries, groups snippets into readable paragraphs, and saves as .md with a Table of Contents. No further processing needed.
If no chapter timestamps exist in the description, the transcript is output as grouped paragraphs without chapter headings.
Speaker Identification (--speakers)
Speaker identification requires AI processing. The script outputs a raw .md file containing:
-
YAML frontmatter with video metadata (title, channel, date, cover, language)
-
Video description (for speaker name extraction)
-
Chapter list from description (if available)
-
Raw transcript in SRT format (pre-computed start/end timestamps, token-efficient)
After the script saves the raw file, spawn a sub-agent (use a cheaper model like Sonnet for cost efficiency) to process speaker identification:
-
Read the saved
.mdfile -
Read the prompt template at
{baseDir}/prompts/speaker-transcript.md -
Process the raw transcript following the prompt:
Identify speakers using video metadata (title → guest, channel → host, description → names)
-
Detect speaker turns from conversation flow, question-answer patterns, and contextual cues
-
Segment into chapters (use description chapters if available, else create from topic shifts)
-
Format with
**Speaker Name:**labels, paragraph grouping (2-4 sentences), and[HH:MM:SS → HH:MM:SS]timestamps -
Overwrite the
.mdfile with the processed transcript (keep the YAML frontmatter)
When --speakers is used, --chapters is implied — the processed output always includes chapter segmentation.
Error Cases
Error Meaning
Transcripts disabled Video has no captions at all
No transcript found Requested language not available
Video unavailable Video deleted, private, or region-locked
IP blocked Too many requests, try again later
Age restricted Video requires login for age verification
Weekly Installs417Repositoryjimliu/baoyu-skillsGitHub Stars9.9KFirst SeenTodaySecurity AuditsGen Agent Trust HubPassSocketPassSnykWarnInstalled ongemini-cli393opencode393kimi-cli391amp391codex391github-copilot391
forum用户评价 (0)
发表评价
暂无评价,来写第一条吧
统计数据
用户评分
为此 Skill 评分