pdf-processing

by @davila7v
4.6(36)

An AI agent skill that extracts text and tables from PDF documents with the pdfplumber library, and merges or splits PDF files with pypdf.

Tags: PDF Parsing, Document Data Extraction, Data Preprocessing, Text Mining, Document Automation
Installation
npx skills add davila7/claude-code-templates --skill pdf-processing

Before / After Comparison

Before

Traditional PDF text extraction tools often produce messy formatting and misaligned table data that has to be cleaned up by hand. That cleanup is slow, error-prone, and an obstacle to large-scale data analysis.

After

With the pdfplumber library, text and tables can be pulled out of PDF documents in structured form (strings per page, rows and cells per table). This reduces manual cleanup and makes the extraction step easy to automate ahead of data analysis.

SKILL.md

PDF Processing

Quick start

Use pdfplumber to extract text from PDFs:

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    text = pdf.pages[0].extract_text()
    print(text)

Extracting tables

Extract tables from PDFs with automatic detection:

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()

    for table in tables:
        for row in table:
            print(row)
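
Each table returned by extract_tables() is a list of rows, and each row is a list of cell values in which empty cells may be None. A minimal sketch of turning one such table into dictionaries keyed by its header row is shown here. table_to_records is a hypothetical helper, not part of pdfplumber, and it assumes the first row of the table holds the column names.

```python
from typing import Dict, List, Optional

# Shape of one table as returned by page.extract_tables(): rows of cells.
Table = List[List[Optional[str]]]


def table_to_records(table: Table) -> List[Dict[str, str]]:
    """Convert a table into a list of dicts keyed by the header row.

    Hypothetical helper: assumes the first row contains column names.
    """
    if not table:
        return []

    # Replace empty header cells with a positional placeholder name.
    header = [
        (cell or "").strip() or f"column_{i}"
        for i, cell in enumerate(table[0])
    ]

    records = []
    for row in table[1:]:
        # Normalize None cells to empty strings.
        cells = [(cell or "").strip() for cell in row]
        # Pad short rows so every record has every column.
        cells += [""] * (len(header) - len(cells))
        records.append(dict(zip(header, cells)))
    return records
```

The records can then be passed to csv.DictWriter or loaded into a data frame, which is usually easier than working with positional rows.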

Extracting all pages

Iterate over every page to collect the text of a multi-page document:

import pdfplumber

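with pdfplumber.open("document.pdf") as pdf:
    # extract_text() may return None or "" for pages without a text layer
    page_texts = [page.extract_text() or "" for page in pdf.pages]

# Joining once avoids repeated string concatenation in the loop
full_text = "\n\n".join(page_texts)
print(full_text)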

Form filling

For PDF form filling, see FORMS.md for the complete guide including field analysis and validation.

Merging PDFs

Combine multiple PDF files:

from pypdf import PdfWriter

# PdfMerger is deprecated in recent pypdf releases;
# PdfWriter.append() does the same job.
writer = PdfWriter()

for path in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    writer.append(path)

writer.write("merged.pdf")
writer.close()

Splitting PDFs

Extract specific pages or ranges:

from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

# Extract pages 2-5; reader.pages is zero-based, so these are indices 1-4
for page_num in range(1, 5):
    writer.add_page(reader.pages[page_num])

with open("output.pdf", "wb") as output:
    writer.write(output)
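
Because pypdf indexes pages from zero while people usually count from one, off-by-one errors are easy to make when selecting ranges. The following is a minimal sketch of a converter from a human-readable specification such as "1,3-5" to zero-based indices. parse_page_ranges is a hypothetical helper and the specification syntax is an assumption made for illustration, not something pypdf defines.

```python
from typing import List


def parse_page_ranges(spec: str, total_pages: int) -> List[int]:
    """Turn a 1-based spec like "1,3-5" into zero-based page indices.

    Hypothetical helper: raises ValueError for pages outside the document.
    """
    indices: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            start, end = int(first), int(last)
        else:
            start = end = int(part)
        if start < 1 or end < start or end > total_pages:
            raise ValueError(f"invalid page range: {part!r}")
        # Shift to zero-based; range() excludes its upper bound.
        indices.extend(range(start - 1, end))
    return indices
```

With a reader open, the result could be used as `for i in parse_page_ranges("2-5", len(reader.pages)): writer.add_page(reader.pages[i])`.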

Available packages

  • pdfplumber - Text and table extraction (recommended)

  • pypdf - PDF manipulation, merging, splitting

  • pdf2image - Convert PDFs to images (requires poppler)

  • pytesseract - OCR for scanned PDFs (requires tesseract)
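
For scanned PDFs, where pdfplumber finds no text layer, the last two packages can be combined. The sketch below is one possible arrangement and not part of the skill itself: needs_ocr and ocr_pdf are hypothetical helper names, the 20-character threshold is an arbitrary assumption, and the OCR step only works when poppler and tesseract are installed on the system.

```python
from typing import List, Optional


def needs_ocr(text: Optional[str], min_chars: int = 20) -> bool:
    """Heuristic: treat a page as scanned if it yields almost no text.

    The min_chars threshold is an assumption; tune it for your documents.
    """
    return text is None or len(text.strip()) < min_chars


def ocr_pdf(path: str, dpi: int = 300) -> List[str]:
    """Render every page to an image and run Tesseract on each one."""
    # Imported lazily because both packages depend on external binaries
    # (poppler for pdf2image, tesseract for pytesseract).
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(path, dpi=dpi)
    return [pytesseract.image_to_string(image) for image in images]
```

A typical flow would call extract_text() first and fall back to ocr_pdf() only when needs_ocr() returns True, since OCR is much slower than reading an embedded text layer.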

Common patterns

Extract and save text:

import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
    # extract_text() may return None or "" for pages without a text layer
    text = "\n\n".join(page.extract_text() or "" for page in pdf.pages)

with open("output.txt", "w", encoding="utf-8") as f:
    f.write(text)

Extract tables to CSV:

import pdfplumber
import csv

with pdfplumber.open("tables.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()

    # All tables from the page are written back to back into one file
    with open("output.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for table in tables:
            writer.writerows(table)

Error handling

Handle common PDF issues:

import pdfplumber

try:
    with pdfplumber.open("document.pdf") as pdf:
        if len(pdf.pages) == 0:
            print("PDF has no pages")
        else:
            text = pdf.pages[0].extract_text()
            if text is None or text.strip() == "":
                print("Page contains no extractable text (might be scanned)")
            else:
                print(text)
except Exception as e:
    print(f"Error processing PDF: {e}")

Performance tips

  • Process pages in batches for large PDFs

  • Use multiprocessing for multiple files

  • Extract only needed pages rather than entire document

  • Close PDF objects after use
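
The tips above can be sketched as follows. This is an illustration under stated assumptions rather than a tuned implementation: page_batches, extract_pages, extract_first_page, and extract_many are hypothetical helper names, and the worker count of 4 is arbitrary.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Iterator, List, Sequence


def page_batches(total_pages: int, batch_size: int) -> Iterator[range]:
    """Yield ranges of zero-based page indices, batch_size pages at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, total_pages, batch_size):
        yield range(start, min(start + batch_size, total_pages))


def extract_pages(path: str, page_indices: Sequence[int]) -> List[str]:
    """Extract text from only the requested pages of one PDF."""
    import pdfplumber  # lazy import so page_batches works without it

    # The context manager closes the PDF when the block exits.
    with pdfplumber.open(path) as pdf:
        return [pdf.pages[i].extract_text() or "" for i in page_indices]


def extract_first_page(path: str) -> str:
    """Top-level worker function so it can be sent to other processes."""
    return extract_pages(path, [0])[0]


def extract_many(paths: Sequence[str], workers: int = 4) -> List[str]:
    """Process several files in parallel, one file per task."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_first_page, paths))
```

When calling extract_many from a script, place the call under an `if __name__ == "__main__":` guard, which process-based parallelism requires on platforms that spawn new interpreters.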

Weekly Installs: 0
Repository: davila7/claude-…emplates
GitHub Stars: 23.1K
First Seen: Jan 1, 1970
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)

Statistics

Installs: 1.0K
Rating: 4.6 / 5.0
Version:
Updated: March 17, 2026
Comparisons: 1

Compatible Platforms

🔧Claude Code
🔧OpenClaw
🔧OpenCode
🔧Codex
🔧Gemini CLI
🔧GitHub Copilot
🔧Amp
🔧Kimi CLI

Timeline

Created: March 17, 2026
Last Updated: March 17, 2026