tavily-crawl
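Crawls an entire website, extracts the content you need from multiple pages, and can save the results for later data analysis and information gathering.
npx skills add tavily-ai/skills --skill tavily-crawl
Before / After comparison
Before: you had to visit many pages of a site by hand and copy their content, or write a complex scraper script. Both were slow and required considerable technical skill.
After: the Tavily Crawl skill crawls the whole site and extracts content from multiple pages, which greatly simplifies data collection.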
tavily crawl
Crawl a website and extract content from multiple pages. Supports saving each page as a local markdown file.
Before running any command
If tvly is not found on PATH, install it first:
curl -fsSL https://cli.tavily.com/install.sh | bash && tvly login
Do not skip this step or fall back to other tools.
See tavily-cli for alternative install methods and auth options.
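The PATH check described above can be scripted. This is a minimal sketch assuming a POSIX shell. The helper name `require_cmd` is hypothetical and not part of the Tavily CLI. It only reports whether a command is available and leaves installation to the command shown above.

```shell
# Hypothetical helper (not part of tvly): succeed if the named command
# is available in the current shell, otherwise print a hint and fail.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "$1 not found on PATH; install it before continuing" >&2
  return 1
}

# Usage: stop early when tvly is missing.
# require_cmd tvly || exit 1
```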
When to use
- You need content from many pages on a site (e.g., all /docs/)
- You want to download documentation for offline use
- Step 4 in the workflow: search → extract → map → crawl → research
Quick start
# Basic crawl
tvly crawl "https://docs.example.com" --json
# Save each page as a markdown file
tvly crawl "https://docs.example.com" --output-dir ./docs/
# Deeper crawl with limits
tvly crawl "https://docs.example.com" --max-depth 2 --limit 50 --json
# Filter to specific paths
tvly crawl "https://example.com" --select-paths "/api/.*,/guides/.*" --exclude-paths "/blog/.*" --json
# Semantic focus (returns relevant chunks, not full pages)
tvly crawl "https://docs.example.com" --instructions "Find authentication docs" --chunks-per-source 3 --json
Options
| Option | Description |
| --- | --- |
| --max-depth | Levels deep (1-5, default: 1) |
| --max-breadth | Links per page (default: 20) |
| --limit | Total pages cap (default: 50) |
| --instructions | Natural language guidance for semantic focus |
| --chunks-per-source | Chunks per page (1-5, requires --instructions) |
| --extract-depth | basic (default) or advanced |
| --format | markdown (default) or text |
| --select-paths | Comma-separated regex patterns to include |
| --exclude-paths | Comma-separated regex patterns to exclude |
| --select-domains | Comma-separated regex for domains to include |
| --exclude-domains | Comma-separated regex for domains to exclude |
| --allow-external / --no-external | Include external links (default: allow) |
| --include-images | Include images |
| --timeout | Max wait (10-150 seconds) |
| -o, --output | Save JSON output to file |
| --output-dir | Save each page as a .md file in directory |
| --json | Structured JSON output |
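Several options above have documented ranges or dependencies, and a script that builds crawl commands may want to check them before calling the CLI. The sketch below is one way to do that in a POSIX shell. The function names `in_range` and `check_crawl_opts` are hypothetical, and how tvly itself reacts to out-of-range values is not covered by this document.

```shell
# Hypothetical pre-flight check (not part of tvly) for the constraints
# listed in the options table: --max-depth 1-5, --chunks-per-source 1-5
# (requires --instructions), --timeout 10-150 seconds.

# in_range VALUE MIN MAX: succeed if VALUE is an integer within [MIN, MAX].
in_range() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# check_crawl_opts MAX_DEPTH CHUNKS INSTRUCTIONS TIMEOUT
# Pass an empty string for any option you do not intend to set.
check_crawl_opts() {
  depth="$1"; chunks="$2"; instructions="$3"; wait_secs="$4"
  if [ -n "$depth" ] && ! in_range "$depth" 1 5; then
    echo "--max-depth must be 1-5" >&2; return 1
  fi
  if [ -n "$chunks" ]; then
    if [ -z "$instructions" ]; then
      echo "--chunks-per-source requires --instructions" >&2; return 1
    fi
    if ! in_range "$chunks" 1 5; then
      echo "--chunks-per-source must be 1-5" >&2; return 1
    fi
  fi
  if [ -n "$wait_secs" ] && ! in_range "$wait_secs" 10 150; then
    echo "--timeout must be 10-150" >&2; return 1
  fi
  return 0
}

# Usage:
# check_crawl_opts 2 3 "Find authentication docs" 60 && echo "options look valid"
```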
Crawl for context vs. data collection
For agentic use (feeding results to an LLM):
Always combine --instructions with --chunks-per-source. The crawl then returns only the relevant chunks instead of full pages, which keeps the results from overflowing the model's context.
tvly crawl "https://docs.example.com" --instructions "API authentication" --chunks-per-source 3 --json
For data collection (saving to files):
Use --output-dir without --chunks-per-source to get full pages as markdown files.
tvly crawl "https://docs.example.com" --max-depth 2 --output-dir ./docs/
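After a data-collection crawl, it can be useful to check how many pages were actually saved, for example to compare against the --limit you set. This is a minimal sketch assuming a POSIX shell with `find`. The helper name `count_md_files` is hypothetical, and it relies only on the documented fact that pages are saved as .md files in the output directory.

```shell
# Hypothetical helper (not part of tvly): count the .md files under a
# directory, such as the one passed to --output-dir.
count_md_files() {
  find "$1" -type f -name '*.md' | wc -l | tr -d ' '
}

# Usage after a crawl that wrote to ./docs/:
# count_md_files ./docs/
```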
Tips
- Start conservative (--max-depth 1, --limit 20) and scale up.
- Use --select-paths to focus on the section you need.
- Use map first to understand site structure before a full crawl.
- Always set --limit to prevent runaway crawls.
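The first and last tips can be folded into a small wrapper so that every crawl carries an explicit depth and page cap. This is a sketch under stated assumptions: the function name `safe_crawl` and the environment variables `CRAWL_MAX_DEPTH` and `CRAWL_LIMIT` are hypothetical, and only flags documented above are passed to tvly.

```shell
# Hypothetical wrapper (not part of tvly): always pass the conservative
# --max-depth and --limit values from the tips, unless overridden through
# the CRAWL_MAX_DEPTH / CRAWL_LIMIT environment variables (assumed names).
safe_crawl() {
  url="$1"
  shift
  tvly crawl "$url" \
    --max-depth "${CRAWL_MAX_DEPTH:-1}" \
    --limit "${CRAWL_LIMIT:-20}" \
    "$@"
}

# Usage:
# safe_crawl "https://docs.example.com" --json
# CRAWL_LIMIT=50 safe_crawl "https://docs.example.com" --output-dir ./docs/
```

Any extra arguments are forwarded unchanged, so avoid passing --max-depth or --limit a second time; raise the defaults through the environment variables instead.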
See also
- tavily-map: discover URLs before deciding to crawl
- tavily-extract: extract individual pages
- tavily-search: find pages when you don't have a URL