data-scraper-agent
Build production-grade AI-driven data scraping agents, supporting scheduled runs, LLM data augmentation, database storage, and continuous optimization, usable for scraping any public data source.
npx skills add affaan-m/everything-claude-code --skill data-scraper-agent
Before / After Comparison
Before: Building a production-grade AI-driven data collection agent by hand means repetitive operations and confirmations. The whole process takes approximately 89 hours, is error-prone, and is inefficient.
After: This Skill automates the work, analyzing and executing it for you, and completes everything within 10 hours with high accuracy and a standardized process.
Data Scraper Agent
Build a production-ready, AI-powered data collection agent for any public data source. Runs on a schedule, enriches results with a free LLM, stores to a database, and improves over time.
Stack: Python · Gemini Flash (free) · GitHub Actions (free) · Notion / Sheets / Supabase
When to Activate
- User wants to scrape or monitor any public website or API
- User says "build a bot that checks...", "monitor X for me", "collect data from..."
- User wants to track jobs, prices, news, repos, sports scores, events, listings
- User asks how to automate data collection without paying for hosting
- User wants an agent that gets smarter over time based on their decisions
Core Concepts
The Three Layers
Every data scraper agent has three layers:
COLLECT   →   ENRICH        →   STORE
   │             │                │
Scraper       AI (LLM)         Database
runs on       scores/          Notion /
schedule      summarises       Sheets /
              & classifies     Supabase
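The three layers can be read as three functions composed in order. The following is a minimal sketch of that idea, not the Skill's actual orchestrator; `run_pipeline` and the `fake_*` stand-ins are hypothetical names used only for illustration.

```python
from typing import Callable

Item = dict


def run_pipeline(
    collect: Callable[[], list[Item]],
    enrich: Callable[[list[Item]], list[Item]],
    store: Callable[[list[Item]], int],
) -> int:
    """Run COLLECT, ENRICH, STORE in order and return how many items were stored."""
    items = collect()
    if not items:
        return 0  # nothing scraped, so skip the AI and storage calls entirely
    return store(enrich(items))


# Stand-in layers for illustration only (assumptions, not real scrapers or databases).
def fake_collect() -> list[Item]:
    return [{"name": "Example", "url": "https://example.com/1"}]


def fake_enrich(items: list[Item]) -> list[Item]:
    return [{**item, "ai_score": 50} for item in items]


saved: list[Item] = []


def fake_store(items: list[Item]) -> int:
    saved.extend(items)
    return len(items)


stored_count = run_pipeline(fake_collect, fake_enrich, fake_store)
```

Keeping each layer behind a plain function makes it straightforward to swap the storage backend or disable enrichment without touching the other two layers.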
Free Stack
Layer | Tool | Why
Scraping | requests + BeautifulSoup | No cost, covers 80% of public sites
JS-rendered sites | playwright (free) | When HTML scraping fails
AI enrichment | Gemini Flash via REST API | 500 req/day, 1M tokens/day (free)
Storage | Notion API | Free tier, great UI for review
Schedule | GitHub Actions cron | Free for public repos
Learning | JSON feedback file in repo | Zero infra, persists in git
AI Model Fallback Chain
Build agents to auto-fallback across Gemini models on quota exhaustion:
gemini-2.0-flash-lite (30 RPM) →
gemini-2.0-flash (15 RPM) →
gemini-2.5-flash (10 RPM) →
gemini-flash-lite-latest (fallback)
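The fallback idea can be tried without any network access. This sketch assumes a hypothetical `send(model)` callable that returns an HTTP status code and a parsed body; `model_order`, `call_with_fallback`, and `fake_send` are illustrative names, and the statuses treated as "try the next model" (429 and 404) follow the client shown in Step 4.

```python
from typing import Callable, Optional

MODEL_FALLBACK = [
    "gemini-2.0-flash-lite",
    "gemini-2.0-flash",
    "gemini-2.5-flash",
    "gemini-flash-lite-latest",
]


def model_order(preferred: str = "") -> list[str]:
    """Put the preferred model first, then the rest of the chain without duplicates."""
    if not preferred:
        return list(MODEL_FALLBACK)
    return [preferred] + [m for m in MODEL_FALLBACK if m != preferred]


def call_with_fallback(
    send: Callable[[str], tuple[int, Optional[dict]]],
    preferred: str = "",
) -> tuple[Optional[str], dict]:
    """Try each model in turn; move on only for 429 (quota) or 404 (unknown model)."""
    for model in model_order(preferred):
        status, body = send(model)
        if status == 200:
            return model, body or {}
        if status in (429, 404):
            continue
        break  # any other status is not a quota problem, so stop retrying
    return None, {}


# Hypothetical transport: the first two models are out of quota.
def fake_send(model: str) -> tuple[int, Optional[dict]]:
    if model in ("gemini-2.0-flash-lite", "gemini-2.0-flash"):
        return 429, None
    return 200, {"analyses": []}


used_model, result = call_with_fallback(fake_send)
```

Injecting the transport as a parameter is a design choice made here for testability; the real client calls `requests.post` directly.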
Batch API Calls for Efficiency
Never call the LLM once per item. Always batch:
# BAD: 33 API calls for 33 items
for item in items:
    result = call_ai(item)  # 33 calls → hits rate limit

# GOOD: 7 API calls for 33 items (batch size 5)
for batch in chunks(items, size=5):
    results = call_ai(batch)  # 7 calls → stays within free tier
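The snippet above calls a `chunks` helper that is not defined anywhere in this document. One possible definition, offered as an assumption about what it is meant to do, is:

```python
from typing import Iterator


def chunks(items: list, size: int = 5) -> Iterator[list]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


items = [f"item-{n}" for n in range(33)]
batches = list(chunks(items, size=5))
```

With 33 items and a batch size of 5 this produces six full batches and one final batch of three, which matches the "7 API calls" figure quoted above.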
Workflow
Step 1: Understand the Goal
Ask the user:
- What to collect: "What data source? URL / API / RSS / public endpoint?"
- What to extract: "What fields matter? Title, price, URL, date, score?"
- How to store: "Where should results go? Notion, Google Sheets, Supabase, or local file?"
- How to enrich: "Do you want AI to score, summarise, classify, or match each item?"
- Frequency: "How often should it run? Every hour, daily, weekly?"
Common examples to prompt:
- Job boards → score relevance to resume
- Product prices → alert on drops
- GitHub repos → summarise new releases
- News feeds → classify by topic + sentiment
- Sports results → extract stats to tracker
- Events calendar → filter by interest
Step 2: Design the Agent Architecture
Generate this directory structure for the user:
my-agent/
├── config.yaml              # User customises this (keywords, filters, preferences)
├── profile/
│   └── context.md           # User context the AI uses (resume, interests, criteria)
├── scraper/
│   ├── __init__.py
│   ├── main.py              # Orchestrator: scrape → enrich → store
│   ├── filters.py           # Rule-based pre-filter (fast, before AI)
│   └── sources/
│       ├── __init__.py
│       └── source_name.py   # One file per data source
├── ai/
│   ├── __init__.py
│   ├── client.py            # Gemini REST client with model fallback
│   ├── pipeline.py          # Batch AI analysis
│   ├── jd_fetcher.py        # Fetch full content from URLs (optional)
│   └── memory.py            # Learn from user feedback
├── storage/
│   ├── __init__.py
│   └── notion_sync.py       # Or sheets_sync.py / supabase_sync.py
├── data/
│   └── feedback.json        # User decision history (auto-updated)
├── .env.example
├── setup.py                 # One-time DB/schema creation
├── enrich_existing.py       # Backfill AI scores on old rows
├── requirements.txt
└── .github/
    └── workflows/
        └── scraper.yml      # GitHub Actions schedule
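The contents of `config.yaml` are not shown in this document. The sketch below lists only the keys that the later code reads through `config.get(...)` (`ai.model`, `ai.rate_limit_seconds`, `ai.min_score`, `ai.batch_size`, `priorities`, `storage.provider`), with the same defaults that code falls back to. The `with_defaults` helper and the sample priority are hypothetical.

```python
import copy

# Keys mirror what ai/pipeline.py and scraper/main.py look up; values are their fallbacks.
DEFAULT_CONFIG = {
    "ai": {
        "model": "gemini-2.5-flash",
        "rate_limit_seconds": 7.0,
        "min_score": 0,
        "batch_size": 5,
    },
    "priorities": [],
    "storage": {"provider": "notion"},
}


def with_defaults(user_config: dict) -> dict:
    """Overlay user settings on the defaults, merging one level deep."""
    merged = copy.deepcopy(DEFAULT_CONFIG)
    for key, value in (user_config or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key].update(value)  # keep sibling defaults inside the section
        else:
            merged[key] = value
    return merged


# Example: what yaml.safe_load might return for a small user file (hypothetical values).
config = with_defaults({"ai": {"batch_size": 10}, "priorities": ["remote roles"]})
```

Resolving defaults once at load time would let the rest of the code index the config directly instead of repeating `.get(..., default)` at each call site.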
Step 3: Build the Scraper Source
Template for any data source:
# scraper/sources/my_source.py
"""
[Source Name]: scrapes [what] from [where].
Method: [REST API / HTML scraping / RSS feed]
"""
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timezone
from scraper.filters import is_relevant

HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; research-bot/1.0)",
}


def fetch() -> list[dict]:
    """
    Returns a list of items with consistent schema.
    Each item must have at minimum: name, url, date_found.
    """
    results = []
    # ---- REST API source ----
    resp = requests.get("https://api.example.com/items", headers=HEADERS, timeout=15)
    if resp.status_code == 200:
        for item in resp.json().get("results", []):
            if not is_relevant(item.get("title", "")):
                continue
            results.append(_normalise(item))
    return results


def _normalise(raw: dict) -> dict:
    """Convert raw API/HTML data to the standard schema."""
    return {
        "name": raw.get("title", ""),
        "url": raw.get("link", ""),
        "source": "MySource",
        "date_found": datetime.now(timezone.utc).date().isoformat(),
        # add domain-specific fields here
    }
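The template imports `is_relevant` from `scraper/filters.py`, which is described only as a fast rule-based pre-filter. A minimal sketch of such a filter, assuming simple include and exclude keyword lists (the keywords here are hypothetical and would normally come from `config.yaml`), might look like this:

```python
# scraper/filters.py (sketch; keyword lists are illustrative assumptions)
INCLUDE_KEYWORDS = ("python", "data")
EXCLUDE_KEYWORDS = ("senior", "sponsored")


def is_relevant(title: str, include=INCLUDE_KEYWORDS, exclude=EXCLUDE_KEYWORDS) -> bool:
    """Cheap keyword check that runs before any AI call is spent on an item."""
    text = title.lower()
    if any(word in text for word in exclude):
        return False  # an exclude keyword always wins
    if not include:
        return True   # no include list means everything not excluded passes
    return any(word in text for word in include)
```

Substring matching is deliberately crude: it costs nothing and only needs to discard obvious misses so that fewer items reach the rate-limited LLM.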
HTML scraping pattern:
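from urllib.parse import urljoin

soup = BeautifulSoup(resp.text, "lxml")
for card in soup.select("[class*='listing']"):
    heading = card.select_one("h2, h3")
    anchor = card.select_one("a[href]")
    if heading is None or anchor is None:
        continue  # skip cards that lack a title or a link
    title = heading.get_text(strip=True)
    # urljoin resolves relative links and leaves absolute ones untouched
    link = urljoin("https://example.com", anchor["href"])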
RSS feed pattern:
import xml.etree.ElementTree as ET

root = ET.fromstring(resp.text)
for item in root.findall(".//item"):
    title = item.findtext("title", "")
    link = item.findtext("link", "")
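To see the RSS pattern end to end without a network call, the sketch below parses an inline sample feed into the `name` / `url` schema used by the template. The feed content, the `parse_rss` name, and the extra `published` field are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>https://example.com/posts/1</link>
      <pubDate>Mon, 06 Jan 2025 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/posts/2</link>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text: str) -> list[dict]:
    """Turn RSS <item> elements into dicts, skipping entries without a link."""
    root = ET.fromstring(xml_text)
    items = []
    for node in root.findall(".//item"):
        link = (node.findtext("link") or "").strip()
        if not link:
            continue  # the URL is the dedup key later on, so it is required
        items.append({
            "name": (node.findtext("title") or "").strip(),
            "url": link,
            "published": (node.findtext("pubDate") or "").strip(),
        })
    return items


feed_items = parse_rss(SAMPLE_RSS)
```

Atom feeds use namespaced `<entry>` elements instead of `<item>`, so they would need a different lookup than the one shown here.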
Step 4: Build the Gemini AI Client
# ai/client.py
import os, json, time, requests

_last_call = 0.0  # timestamp of the previous request, used for client-side throttling

MODEL_FALLBACK = [
    "gemini-2.0-flash-lite",
    "gemini-2.0-flash",
    "gemini-2.5-flash",
    "gemini-flash-lite-latest",
]


def generate(prompt: str, model: str = "", rate_limit: float = 7.0) -> dict:
    """Call Gemini with auto-fallback on 429. Returns parsed JSON or {}."""
    global _last_call
    api_key = os.environ.get("GEMINI_API_KEY", "")
    if not api_key:
        return {}
    # Wait until at least rate_limit seconds have passed since the last call
    elapsed = time.time() - _last_call
    if elapsed < rate_limit:
        time.sleep(rate_limit - elapsed)
    # Preferred model first, then the remaining fallbacks
    models = [model] + [m for m in MODEL_FALLBACK if m != model] if model else MODEL_FALLBACK
    _last_call = time.time()
    for m in models:
        url = f"https://generativelanguage.googleapis.com/v1beta/models/{m}:generateContent?key={api_key}"
        payload = {
            "contents": [{"parts": [{"text": prompt}]}],
            "generationConfig": {
                "responseMimeType": "application/json",
                "temperature": 0.3,
                "maxOutputTokens": 2048,
            },
        }
        try:
            resp = requests.post(url, json=payload, timeout=30)
            if resp.status_code == 200:
                return _parse(resp)
            if resp.status_code in (429, 404):
                # Quota exhausted or model unavailable: try the next model
                time.sleep(1)
                continue
            return {}
        except requests.RequestException:
            return {}
    return {}


def _parse(resp) -> dict:
    try:
        text = (
            resp.json()
            .get("candidates", [{}])[0]
            .get("content", {})
            .get("parts", [{}])[0]
            .get("text", "")
            .strip()
        )
        # Strip a Markdown code fence if the model wrapped its JSON in one
        if text.startswith("```"):
            text = text.split("\n", 1)[-1].rsplit("```", 1)[0]
        data = json.loads(text)
        return data if isinstance(data, dict) else {}
    except (ValueError, KeyError, IndexError, AttributeError, TypeError):
        # Covers invalid JSON plus empty or unexpectedly shaped responses
        return {}
Step 5: Build the AI Pipeline (Batch)
# ai/pipeline.py
import json
import yaml
from pathlib import Path
from ai.client import generate


def analyse_batch(items: list[dict], context: str = "", preference_prompt: str = "") -> list[dict]:
    """Analyse items in batches. Returns items enriched with AI fields."""
    config = yaml.safe_load((Path(__file__).parent.parent / "config.yaml").read_text())
    model = config.get("ai", {}).get("model", "gemini-2.5-flash")
    rate_limit = config.get("ai", {}).get("rate_limit_seconds", 7.0)
    min_score = config.get("ai", {}).get("min_score", 0)
    batch_size = config.get("ai", {}).get("batch_size", 5)
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    print(f" [AI] {len(items)} items → {len(batches)} API calls")
    enriched = []
    for i, batch in enumerate(batches):
        print(f" [AI] Batch {i + 1}/{len(batches)}...")
        prompt = _build_prompt(batch, context, preference_prompt, config)
        result = generate(prompt, model=model, rate_limit=rate_limit)
        analyses = result.get("analyses", [])
        for j, item in enumerate(batch):
            # Analyses are matched to items by position within the batch
            ai = analyses[j] if j < len(analyses) else {}
            if isinstance(ai, dict) and ai:
                try:
                    score = max(0, min(100, int(ai.get("score", 0))))
                except (TypeError, ValueError):
                    score = 0  # the model returned a non-numeric score
                if min_score and score < min_score:
                    continue
                enriched.append({
                    **item,
                    "ai_score": score,
                    "ai_summary": ai.get("summary", ""),
                    "ai_notes": ai.get("notes", ""),
                })
            else:
                # No usable analysis for this item: keep it unscored
                enriched.append(item)
    return enriched


def _build_prompt(batch, context, preference_prompt, config):
    priorities = config.get("priorities", [])
    # Keys starting with "_" are treated as internal and left out of the prompt
    items_text = "\n\n".join(
        f"Item {i+1}: {json.dumps({k: v for k, v in item.items() if not k.startswith('_')})}"
        for i, item in enumerate(batch)
    )
    return f"""Analyse these {len(batch)} items and return a JSON object.
# Items
{items_text}
# User Context
{context[:800] if context else "Not provided"}
# User Priorities
{chr(10).join(f"- {p}" for p in priorities)}
{preference_prompt}
# Instructions
Return: {{"analyses": [{{"score": <0-100>, "summary": "<2 sentences>", "notes": "<why this matches or doesn't>"}} for each item in order]}}
Be concise. Score 90+=excellent match, 70-89=good, 50-69=ok, <50=weak."""
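Because the model's reply is free-form JSON, the step that pairs analyses with batch items is where malformed output tends to surface. The sketch below isolates that step under the assumption that scores may arrive as strings, floats, or nonsense, and that the reply may be shorter than the batch. `clamp_score` and `merge_analyses` are hypothetical helpers, not part of the pipeline above.

```python
def clamp_score(value, default: int = 0) -> int:
    """Coerce a model-supplied score to an int in the range 0..100."""
    try:
        score = int(float(value))
    except (TypeError, ValueError, OverflowError):
        return default  # None, text such as "high", NaN, or infinity
    return max(0, min(100, score))


def merge_analyses(batch: list[dict], analyses: list) -> list[dict]:
    """Attach each analysis to the item at the same position in the batch."""
    merged = []
    for index, item in enumerate(batch):
        ai = analyses[index] if index < len(analyses) else None
        if not isinstance(ai, dict):
            merged.append(item)  # keep the item unscored rather than dropping it
            continue
        merged.append({
            **item,
            "ai_score": clamp_score(ai.get("score")),
            "ai_summary": str(ai.get("summary", "")),
        })
    return merged


merged = merge_analyses(
    [{"name": "a"}, {"name": "b"}, {"name": "c"}],
    [{"score": "95", "summary": "x"}, "oops"],  # second entry malformed, third missing
)
```

Positional matching only works if the prompt asks for results "in order", as the instructions in `_build_prompt` do.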
Step 6: Build the Feedback Learning System
# ai/memory.py
"""Learn from user decisions to improve future scoring."""
import json
from pathlib import Path

FEEDBACK_PATH = Path(__file__).parent.parent / "data" / "feedback.json"


def load_feedback() -> dict:
    if FEEDBACK_PATH.exists():
        try:
            return json.loads(FEEDBACK_PATH.read_text())
        except (json.JSONDecodeError, OSError):
            pass
    return {"positive": [], "negative": []}


def save_feedback(fb: dict):
    FEEDBACK_PATH.parent.mkdir(parents=True, exist_ok=True)
    FEEDBACK_PATH.write_text(json.dumps(fb, indent=2))


def build_preference_prompt(feedback: dict, max_examples: int = 15) -> str:
    """Convert feedback history into a prompt bias section."""
    lines = []
    if feedback.get("positive"):
        lines.append("# Items the user LIKED (positive signal):")
        for e in feedback["positive"][-max_examples:]:
            lines.append(f"- {e}")
    if feedback.get("negative"):
        lines.append("\n# Items the user SKIPPED/REJECTED (negative signal):")
        for e in feedback["negative"][-max_examples:]:
            lines.append(f"- {e}")
    if lines:
        lines.append("\nUse these patterns to bias scoring on new items.")
    return "\n".join(lines)
Integration with your storage layer: after each run, query your DB for items with positive/negative status and call save_feedback() with the extracted patterns.
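How the "extracted patterns" are derived is left open here. One simple reading is to record the names of items whose status the user changed. The sketch below assumes rows have already been fetched from the database as dicts with `name` and `status` keys; the status labels and the `record_decisions` helper are hypothetical, since the only status this document defines is "New".

```python
# Hypothetical status labels; adjust to whatever your database actually uses.
POSITIVE_STATUSES = {"Saved", "Applied"}
NEGATIVE_STATUSES = {"Skipped", "Rejected"}


def record_decisions(rows: list[dict], feedback: dict, max_kept: int = 200) -> dict:
    """Return a new feedback dict with the user's latest decisions appended."""
    positive = list(feedback.get("positive", []))
    negative = list(feedback.get("negative", []))
    for row in rows:
        name = (row.get("name") or "").strip()
        status = row.get("status", "")
        if not name:
            continue
        if status in POSITIVE_STATUSES and name not in positive:
            positive.append(name)
        elif status in NEGATIVE_STATUSES and name not in negative:
            negative.append(name)
    # Cap the history so feedback.json does not grow without bound
    return {"positive": positive[-max_kept:], "negative": negative[-max_kept:]}


rows = [
    {"name": "A", "status": "Saved"},
    {"name": "B", "status": "Rejected"},
    {"name": "C", "status": "New"},    # untouched items carry no signal
    {"name": "A", "status": "Saved"},  # duplicates are recorded once
]
updated = record_decisions(rows, {"positive": [], "negative": []})
```

The returned dict has the same shape that `save_feedback()` writes and `build_preference_prompt()` reads, so it could be passed to either one directly.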
Step 7: Build Storage (Notion example)
# storage/notion_sync.py
import os
from notion_client import Client
from notion_client.errors import APIResponseError

_client = None


def get_client():
    global _client
    if _client is None:
        _client = Client(auth=os.environ["NOTION_TOKEN"])
    return _client


def get_existing_urls(db_id: str) -> set[str]:
    """Fetch all URLs already stored, used for deduplication."""
    client, seen, cursor = get_client(), set(), None
    while True:
        # Notion returns at most 100 rows per page, so follow the cursor until done
        resp = client.databases.query(database_id=db_id, page_size=100, **{"start_cursor": cursor} if cursor else {})
        for page in resp["results"]:
            url = page["properties"].get("URL", {}).get("url", "")
            if url:
                seen.add(url)
        if not resp["has_more"]:
            break
        cursor = resp["next_cursor"]
    return seen


def push_item(db_id: str, item: dict) -> bool:
    """Push one item to Notion. Returns True on success."""
    props = {
        "Name": {"title": [{"text": {"content": item.get("name", "")[:100]}}]},
        "URL": {"url": item.get("url")},
        "Source": {"select": {"name": item.get("source", "Unknown")}},
        "Date Found": {"date": {"start": item.get("date_found")}},
        "Status": {"select": {"name": "New"}},
    }
    # AI fields
    if item.get("ai_score") is not None:
        props["AI Score"] = {"number": item["ai_score"]}
    if item.get("ai_summary"):
        props["Summary"] = {"rich_text": [{"text": {"content": item["ai_summary"][:2000]}}]}
    if item.get("ai_notes"):
        props["Notes"] = {"rich_text": [{"text": {"content": item["ai_notes"][:2000]}}]}
    try:
        get_client().pages.create(parent={"database_id": db_id}, properties=props)
        return True
    except APIResponseError as e:
        print(f"[notion] Push failed: {e}")
        return False


def sync(db_id: str, items: list[dict]) -> tuple[int, int]:
    """Push new items only. Returns (added, skipped)."""
    existing = get_existing_urls(db_id)
    added = skipped = 0
    for item in items:
        url = item.get("url")
        if url in existing:
            skipped += 1
            continue
        if push_item(db_id, item):
            added += 1
            if url:
                existing.add(url)  # also dedupe within this run
        else:
            skipped += 1
    return added, skipped
Step 8: Orchestrate in main.py
# scraper/main.py
import os, sys, yaml
from pathlib import Path
from dotenv import load_dotenv

load_dotenv()

from scraper.sources import my_source  # add your sources
# NOTE: This example uses Notion. If storage.provider is "sheets" or "supabase",
# replace this import with storage.sheets_sync or storage.supabase_sync and update
# the env var and sync() call accordingly.
from storage.notion_sync import sync

SOURCES = [
    ("My Source", my_source.fetch),
]


def ai_enabled():
    return bool(os.environ.get("GEMINI_API_KEY"))


def main():
    config = yaml.safe_load((Path(__file__).parent.parent / "config.yaml").read_text())
    provider = config.get("storage", {}).get("provider", "notion")
    # Resolve the storage target identifier from env based on provider
    if provider == "notion":
        db_id = os.environ.get("NOTION_DATABASE_ID")
        if not db_id:
            print("ERROR: NOTION_DATABASE_ID not set")
            sys.exit(1)
    else:
        # Extend here for sheets (SHEET_ID) or supabase (SUPABASE_TABLE) etc.
        print(f"ERROR: provider '{provider}' not yet wired in main.py")
        sys.exit(1)
config = yaml.safe_load((Path(_
...