C

crawl

by @tavily-aiv
4.5(4)

This skill is used for website crawling, extracting content from multiple pages, suitable for documents, knowledge bases, and full-site content extraction, supporting OAuth authentication.

Web CrawlingData ExtractionWeb ScrapingData CollectionPython ScrapyGitHub
Installation
npx skills add tavily-ai/skills --skill crawl
compare_arrows

Before / After Comparison

1
Before

Manually copying and pasting website content, or writing complex web scraping scripts, is inefficient, easily blocked by anti-scraping mechanisms, and results in incomplete data acquisition.

After

Utilizing the `crawl` skill automates content extraction from multiple web pages, efficiently and accurately building knowledge bases or datasets, saving significant human effort.

description SKILL.md

crawl

Crawl Skill

Crawl websites to extract content from multiple pages. Ideal for documentation, knowledge bases, and site-wide content extraction.

Authentication

The script uses OAuth via the Tavily MCP server. No manual setup required - on first run, it will:

  • Check for existing tokens in ~/.mcp-auth/

  • If none found, automatically open your browser for OAuth authentication

Note: You must have an existing Tavily account. The OAuth flow only supports login — account creation is not available through this flow. Sign up at tavily.com first if you don't have an account.

Alternative: API Key

If you prefer using an API key, get one at https://tavily.com and add to ~/.claude/settings.json:

{
  "env": {
    "TAVILY_API_KEY": "tvly-your-api-key-here"
  }
}

Quick Start

Using the Script

./scripts/crawl.sh '<json>' [output_dir]

Examples:

# Basic crawl
./scripts/crawl.sh '{"url": "https://docs.example.com"}'

# Deeper crawl with limits
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2, "limit": 50}'

# Save to files
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2}' ./docs

# Focused crawl with path filters
./scripts/crawl.sh '{"url": "https://example.com", "max_depth": 2, "select_paths": ["/docs/.*", "/api/.*"], "exclude_paths": ["/blog/.*"]}'

# With semantic instructions (for agentic use)
./scripts/crawl.sh '{"url": "https://docs.example.com", "instructions": "Find API documentation", "chunks_per_source": 3}'

When output_dir is provided, each crawled page is saved as a separate markdown file.

Basic Crawl

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 1,
    "limit": 20
  }'

Focused Crawl with Instructions

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find API documentation and code examples",
    "chunks_per_source": 3,
    "select_paths": ["/docs/.*", "/api/.*"]
  }'

API Reference

Endpoint

POST https://api.tavily.com/crawl

Headers

Header Value

Authorization Bearer <TAVILY_API_KEY>

Content-Type application/json

Request Body

Field Type Default Description

url string Required Root URL to begin crawling

max_depth integer 1 Levels deep to crawl (1-5)

max_breadth integer 20 Links per page

limit integer 50 Total pages cap

instructions string null Natural language guidance for focus

chunks_per_source integer 3 Chunks per page (1-5, requires instructions)

extract_depth string "basic" basic or advanced

format string "markdown" markdown or text

select_paths array null Regex patterns to include

exclude_paths array null Regex patterns to exclude

allow_external boolean true Include external domain links

timeout float 150 Max wait (10-150 seconds)

Response Format

{
  "base_url": "https://docs.example.com",
  "results": [
    {
      "url": "https://docs.example.com/page",
      "raw_content": "# Page Title\n\nContent..."
    }
  ],
  "response_time": 12.5
}

Depth vs Performance

Depth Typical Pages Time

1 10-50 Seconds

2 50-500 Minutes

3 500-5000 Many minutes

Start with max_depth=1 and increase only if needed.

Crawl for Context vs Data Collection

For agentic use (feeding results into context): Always use instructions + chunks_per_source. This returns only relevant chunks instead of full pages, preventing context window explosion.

For data collection (saving to files): Omit chunks_per_source to get full page content.

Examples

For Context: Agentic Research (Recommended)

Use when feeding crawl results into an LLM context:

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find API documentation and authentication guides",
    "chunks_per_source": 3
  }'

Returns only the most relevant chunks (max 500 chars each) per page - fits in context without overwhelming it.

For Context: Targeted Technical Docs

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com",
    "max_depth": 2,
    "instructions": "Find all documentation about authentication and security",
    "chunks_per_source": 3,
    "select_paths": ["/docs/.*", "/api/.*"]
  }'

For Data Collection: Full Page Archive

Use when saving content to files for later processing:

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com/blog",
    "max_depth": 2,
    "max_breadth": 50,
    "select_paths": ["/blog/.*"],
    "exclude_paths": ["/blog/tag/.*", "/blog/category/.*"]
  }'

Returns full page content - use the script with output_dir to save as markdown files.

Map API (URL Discovery)

Use map instead of crawl when you only need URLs, not content:

curl --request POST \
  --url https://api.tavily.com/map \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find all API docs and guides"
  }'

Returns URLs only (faster than crawl):

{
  "base_url": "https://docs.example.com",
  "results": [
    "https://docs.example.com/api/auth",
    "https://docs.example.com/guides/quickstart"
  ]
}

Tips

  • Always use chunks_per_source for agentic workflows - prevents context explosion when feeding results to LLMs

  • Omit chunks_per_source only for data collection - when saving full pages to files

  • Start conservative (max_depth=1, limit=20) and scale up

  • Use path patterns to focus on relevant sections

  • Use Map first to understand site structure before full crawl

  • Always set a limit to prevent runaway crawls

Weekly Installs3.5KRepositorytavily-ai/skillsGitHub Stars95First SeenJan 25, 2026Security AuditsGen Agent Trust HubFailSocketPassSnykWarnInstalled onopencode3.2Kgemini-cli3.1Kcodex3.1Kgithub-copilot3.0Kkimi-cli2.9Kamp2.9K

forumUser Reviews (0)

Write a Review

Effect
Usability
Docs
Compatibility

No reviews yet

Statistics

Installs292
Rating4.5 / 5.0
Version
Updated2026年3月17日
Comparisons1

User Rating

4.5(4)
5
0%
4
0%
3
0%
2
0%
1
0%

Rate this Skill

0.0

Compatible Platforms

🔧Claude Code
🔧OpenClaw
🔧OpenCode
🔧Codex
🔧Gemini CLI
🔧GitHub Copilot
🔧Amp
🔧Kimi CLI

Timeline

Created2026年3月17日
Last Updated2026年3月17日