C

crawl

by @tavily-aiv
4.5(4)

このスキルは、ウェブサイトのクロールに使用され、複数のページからコンテンツを抽出します。ドキュメント、ナレッジベース、サイト全体のコンテンツ抽出に適しており、OAuth認証をサポートします。

Web CrawlingData ExtractionWeb ScrapingData CollectionPython ScrapyGitHub
インストール方法
npx skills add tavily-ai/skills --skill crawl
compare_arrows

Before / After 効果比較

1
使用前

ウェブサイトコンテンツの手動コピー&ペーストや、複雑なクローラースクリプトの作成は非効率的で、アンチスクレイピングメカニズムに阻まれやすく、データ取得が不完全になります。

使用後

`crawl` スキルを活用することで、複数のウェブページからコンテンツを自動的に抽出し、効率的かつ正確にナレッジベースやデータセットを構築し、大幅な人件費を削減します。

description SKILL.md

crawl

Crawl Skill

Crawl websites to extract content from multiple pages. Ideal for documentation, knowledge bases, and site-wide content extraction.

Authentication

The script uses OAuth via the Tavily MCP server. No manual setup required - on first run, it will:

  • Check for existing tokens in ~/.mcp-auth/

  • If none found, automatically open your browser for OAuth authentication

Note: You must have an existing Tavily account. The OAuth flow only supports login — account creation is not available through this flow. Sign up at tavily.com first if you don't have an account.

Alternative: API Key

If you prefer using an API key, get one at https://tavily.com and add to ~/.claude/settings.json:

{
  "env": {
    "TAVILY_API_KEY": "tvly-your-api-key-here"
  }
}

Quick Start

Using the Script

./scripts/crawl.sh '<json>' [output_dir]

Examples:

# Basic crawl
./scripts/crawl.sh '{"url": "https://docs.example.com"}'

# Deeper crawl with limits
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2, "limit": 50}'

# Save to files
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2}' ./docs

# Focused crawl with path filters
./scripts/crawl.sh '{"url": "https://example.com", "max_depth": 2, "select_paths": ["/docs/.*", "/api/.*"], "exclude_paths": ["/blog/.*"]}'

# With semantic instructions (for agentic use)
./scripts/crawl.sh '{"url": "https://docs.example.com", "instructions": "Find API documentation", "chunks_per_source": 3}'

When output_dir is provided, each crawled page is saved as a separate markdown file.

Basic Crawl

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 1,
    "limit": 20
  }'

Focused Crawl with Instructions

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find API documentation and code examples",
    "chunks_per_source": 3,
    "select_paths": ["/docs/.*", "/api/.*"]
  }'

API Reference

Endpoint

POST https://api.tavily.com/crawl

Headers

Header Value

Authorization Bearer <TAVILY_API_KEY>

Content-Type application/json

Request Body

Field Type Default Description

url string Required Root URL to begin crawling

max_depth integer 1 Levels deep to crawl (1-5)

max_breadth integer 20 Links per page

limit integer 50 Total pages cap

instructions string null Natural language guidance for focus

chunks_per_source integer 3 Chunks per page (1-5, requires instructions)

extract_depth string "basic" basic or advanced

format string "markdown" markdown or text

select_paths array null Regex patterns to include

exclude_paths array null Regex patterns to exclude

allow_external boolean true Include external domain links

timeout float 150 Max wait (10-150 seconds)

Response Format

{
  "base_url": "https://docs.example.com",
  "results": [
    {
      "url": "https://docs.example.com/page",
      "raw_content": "# Page Title\n\nContent..."
    }
  ],
  "response_time": 12.5
}

Depth vs Performance

Depth Typical Pages Time

1 10-50 Seconds

2 50-500 Minutes

3 500-5000 Many minutes

Start with max_depth=1 and increase only if needed.

Crawl for Context vs Data Collection

For agentic use (feeding results into context): Always use instructions + chunks_per_source. This returns only relevant chunks instead of full pages, preventing context window explosion.

For data collection (saving to files): Omit chunks_per_source to get full page content.

Examples

For Context: Agentic Research (Recommended)

Use when feeding crawl results into an LLM context:

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find API documentation and authentication guides",
    "chunks_per_source": 3
  }'

Returns only the most relevant chunks (max 500 chars each) per page - fits in context without overwhelming it.

For Context: Targeted Technical Docs

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com",
    "max_depth": 2,
    "instructions": "Find all documentation about authentication and security",
    "chunks_per_source": 3,
    "select_paths": ["/docs/.*", "/api/.*"]
  }'

For Data Collection: Full Page Archive

Use when saving content to files for later processing:

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com/blog",
    "max_depth": 2,
    "max_breadth": 50,
    "select_paths": ["/blog/.*"],
    "exclude_paths": ["/blog/tag/.*", "/blog/category/.*"]
  }'

Returns full page content - use the script with output_dir to save as markdown files.

Map API (URL Discovery)

Use map instead of crawl when you only need URLs, not content:

curl --request POST \
  --url https://api.tavily.com/map \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find all API docs and guides"
  }'

Returns URLs only (faster than crawl):

{
  "base_url": "https://docs.example.com",
  "results": [
    "https://docs.example.com/api/auth",
    "https://docs.example.com/guides/quickstart"
  ]
}

Tips

  • Always use chunks_per_source for agentic workflows - prevents context explosion when feeding results to LLMs

  • Omit chunks_per_source only for data collection - when saving full pages to files

  • Start conservative (max_depth=1, limit=20) and scale up

  • Use path patterns to focus on relevant sections

  • Use Map first to understand site structure before full crawl

  • Always set a limit to prevent runaway crawls

Weekly Installs3.5KRepositorytavily-ai/skillsGitHub Stars95First SeenJan 25, 2026Security AuditsGen Agent Trust HubFailSocketPassSnykWarnInstalled onopencode3.2Kgemini-cli3.1Kcodex3.1Kgithub-copilot3.0Kkimi-cli2.9Kamp2.9K

forumユーザーレビュー (0)

レビューを書く

効果
使いやすさ
ドキュメント
互換性

レビューなし

統計データ

インストール数292
評価4.5 / 5.0
バージョン
更新日2026年3月17日
比較事例1 件

ユーザー評価

4.5(4)
5
0%
4
0%
3
0%
2
0%
1
0%

この Skill を評価

0.0

対応プラットフォーム

🔧Claude Code
🔧OpenClaw
🔧OpenCode
🔧Codex
🔧Gemini CLI
🔧GitHub Copilot
🔧Amp
🔧Kimi CLI

タイムライン

作成2026年3月17日
最終更新2026年3月17日