ocr-document-processor
by @dkyazzentwatwa, v1.0.0
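Extracts text from images and scanned PDFs using OCR, with support for multiple languages, table detection, structured output, and batch processing.
Installation
npx skills add dkyazzentwatwa/chatgpt-skills --skill ocr-document-processor
Before / After Comparison (1 case)
Before
Manually extracting text from images and scanned PDFs is slow, laborious, and error-prone. Multilingual content and tabular data make the work harder still, and the manual approach cannot keep up with batch workloads.
After
This skill uses OCR to extract text from images and scanned PDFs efficiently. It supports multiple languages, table detection, and structured output, and it can process documents in batches, which greatly speeds up data processing.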
SKILL.md
name: ocr-document-processor
description: Extract text from images and scanned PDFs using OCR. Supports 100+ languages, table detection, structured output (markdown/JSON), and batch processing.
OCR Document Processor
Extract text from images, scanned PDFs, and photographs using Optical Character Recognition (OCR). Supports multiple languages, structured output formats, and intelligent document parsing.
Core Capabilities
- Image OCR: Extract text from PNG, JPEG, TIFF, BMP images
- PDF OCR: Process scanned PDFs page by page
- Multi-language: Support for 100+ languages
- Structured Output: Plain text, Markdown, JSON, or HTML
- Table Detection: Extract tabular data to CSV/JSON
- Batch Processing: Process multiple documents at once
- Quality Assessment: Confidence scoring for OCR results
Quick Start
from scripts.ocr_processor import OCRProcessor
# Simple text extraction
processor = OCRProcessor("document.png")
text = processor.extract_text()
print(text)
# Extract to structured format
result = processor.extract_structured()
print(result['text'])
print(result['confidence'])
print(result['blocks']) # Text blocks with positions
Core Workflow
1. Basic Text Extraction
from scripts.ocr_processor import OCRProcessor
# From image
processor = OCRProcessor("scan.png")
text = processor.extract_text()
# From PDF
processor = OCRProcessor("scanned.pdf")
text = processor.extract_text() # All pages
# Specific pages
text = processor.extract_text(pages=[1, 2, 3])
2. Structured Extraction
# Get detailed results
result = processor.extract_structured()
# Result contains:
# - text: Full extracted text
# - blocks: Text blocks with bounding boxes
# - lines: Individual lines
# - words: Individual words with confidence
# - confidence: Overall confidence score
# - language: Detected language
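The `blocks` entries carry bounding boxes, which makes it possible to restore a rough reading order after extraction. The following is a minimal sketch, assuming each block is a dict with a `bbox` of `[x, y, width, height]` (the layout shown in the JSON export section); the `reading_order` helper and the sample blocks are hypothetical and not part of the skill.

```python
def reading_order(blocks):
    """Sort blocks top-to-bottom, then left-to-right.

    Assumes bbox = [x, y, width, height] with the origin at the top-left.
    """
    return sorted(blocks, key=lambda b: (b["bbox"][1], b["bbox"][0]))


# Hypothetical blocks in the order an OCR engine might return them.
blocks = [
    {"text": "Footer", "bbox": [50, 700, 400, 20]},
    {"text": "Right column", "bbox": [320, 100, 250, 300]},
    {"text": "Title", "bbox": [50, 20, 500, 40]},
    {"text": "Left column", "bbox": [50, 100, 250, 300]},
]

ordered = [b["text"] for b in reading_order(blocks)]
print(ordered)
```

This naive sort treats blocks as being on the same row only when their `y` values match exactly; skewed scans would likely need the `y` values bucketed first.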
3. Export Formats
# Export to Markdown
processor.export_markdown("output.md")
# Export to JSON
processor.export_json("output.json")
# Export to searchable PDF
processor.export_searchable_pdf("searchable.pdf")
# Export to HTML
processor.export_html("output.html")
Language Support
# Specify language for better accuracy
processor = OCRProcessor("german_doc.png", lang='deu')
# Multiple languages
processor = OCRProcessor("mixed_doc.png", lang='eng+fra+deu')
# Auto-detect language
processor = OCRProcessor("document.png", lang='auto')
Supported Languages (Common)
| Code | Language | Code | Language |
|---|---|---|---|
| eng | English | fra | French |
| deu | German | spa | Spanish |
| ita | Italian | por | Portuguese |
| rus | Russian | chi_sim | Chinese (Simplified) |
| chi_tra | Chinese (Traditional) | jpn | Japanese |
| kor | Korean | ara | Arabic |
| hin | Hindi | nld | Dutch |
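Multi-language strings such as `'eng+fra+deu'` are easy to mistype. The sketch below builds one from individual codes and rejects anything outside the table above. The `build_lang` helper and the `KNOWN_CODES` set are illustrative assumptions, not part of the skill, and the set covers only the common codes listed here.

```python
# Codes copied from the "Supported Languages (Common)" table.
KNOWN_CODES = {
    "eng", "fra", "deu", "spa", "ita", "por", "rus",
    "chi_sim", "chi_tra", "jpn", "kor", "ara", "hin", "nld",
}


def build_lang(*codes: str) -> str:
    """Join language codes with '+', dropping duplicates but keeping order."""
    if not codes:
        raise ValueError("at least one language code is required")
    unknown = [c for c in codes if c not in KNOWN_CODES]
    if unknown:
        raise ValueError(f"unknown language code(s): {', '.join(unknown)}")
    return "+".join(dict.fromkeys(codes))


print(build_lang("eng", "fra", "deu"))
```

The resulting string can then be passed as the `lang` argument shown above.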
Image Preprocessing
Preprocessing improves OCR accuracy on low-quality images.
# Enable preprocessing
processor = OCRProcessor("noisy_scan.png")
processor.preprocess(
    deskew=True,     # Fix rotation
    denoise=True,    # Remove noise
    threshold=True,  # Binarize image
    contrast=1.5     # Enhance contrast
)
text = processor.extract_text()
Available Preprocessing Options
| Option | Description | Default |
|---|---|---|
| deskew | Correct skewed/rotated images | False |
| denoise | Remove noise and artifacts | False |
| threshold | Convert to black/white | False |
| threshold_method | 'otsu', 'adaptive', 'simple' | 'otsu' |
| contrast | Contrast factor (1.0 = no change) | 1.0 |
| sharpen | Sharpen factor (0 = none) | 0 |
| scale | Upscale factor for small text | 1.0 |
| remove_shadows | Remove shadow artifacts | False |
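To make the default `'otsu'` threshold method less of a black box, here is a pure-Python sketch of Otsu's method on a flat list of 8-bit grayscale values. It picks the threshold that maximizes the between-class variance of the two resulting pixel groups. This only illustrates the general technique; the source does not show how the skill implements it, and the `otsu_threshold` and `binarize` names are assumptions.

```python
def otsu_threshold(pixels: list[int]) -> int:
    """Return the threshold t in 0..255 that maximizes between-class variance.

    Pixels <= t form one class, pixels > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * count for i, count in enumerate(hist))

    weight_low = 0      # number of pixels at or below t
    sum_low = 0         # sum of their intensities
    best_t, best_var = 0, -1.0

    for t in range(256):
        weight_low += hist[t]
        if weight_low == 0:
            continue                    # no pixels in the low class yet
        weight_high = total - weight_low
        if weight_high == 0:
            break                       # every pixel is in the low class
        sum_low += t * hist[t]
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        between_var = weight_low * weight_high * (mean_low - mean_high) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels: list[int], t: int) -> list[int]:
    """Map each pixel to 0 (at or below t) or 255 (above t)."""
    return [255 if p > t else 0 for p in pixels]


# Hypothetical image: dark ink (10) on a light page (200).
pixels = [10] * 50 + [200] * 50
t = otsu_threshold(pixels)
print(t, binarize([10, 200], t))
```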
Table Extraction
# Extract tables from document
tables = processor.extract_tables()
# Each table is a list of rows
for table in tables:
    for row in table:
        print(row)
# Export tables to CSV
processor.export_tables_csv("tables/")
# Export to JSON
processor.export_tables_json("tables.json")
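Since each table is described as a list of rows, it can also be serialized with the standard library when custom handling is needed. The sketch below assumes every row is a list of strings; `table_to_csv` is a hypothetical helper, not one of the skill's export methods.

```python
import csv
import io


def table_to_csv(table: list[list[str]]) -> str:
    """Serialize one table (a list of rows) to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(table)
    return buf.getvalue()


# Hypothetical table in the list-of-rows shape described above.
table = [
    ["Item", "Qty", "Price"],
    ["Widget, large", "2", "9.50"],
]
csv_text = table_to_csv(table)
print(csv_text)
```

The `csv` module quotes cells that contain commas, which matters for OCR output where a cell may hold free text.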
PDF Processing
Multi-Page PDFs
# Process all pages
processor = OCRProcessor("document.pdf")
full_text = processor.extract_text()
# Process specific pages
page_3 = processor.extract_text(pages=[3])
# Get per-page results
results = processor.extract_by_page()
for page_num, text in results.items():
    print(f"Page {page_num}: {len(text)} characters")
Create Searchable PDF
# Convert scanned PDF to searchable PDF
processor = OCRProcessor("scanned.pdf")
processor.export_searchable_pdf("searchable.pdf")
Batch Processing
from scripts.ocr_processor import batch_ocr
# Process directory of images
results = batch_ocr(
    input_dir="scans/",
    output_dir="extracted/",
    output_format="markdown",
    lang="eng",
    recursive=True
)
print(f"Processed: {results['success']} files")
print(f"Failed: {results['failed']} files")
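The source does not show how `batch_ocr` works internally. As a rough mental model, a batch run can be sketched as two steps: find the input files (optionally recursing into subdirectories), then process each one while tallying successes and failures. Everything below (`find_inputs`, `run_batch`, `fake_ocr`, the suffix list) is a hypothetical illustration, with a stand-in function in place of real OCR.

```python
import tempfile
from pathlib import Path

# Assumed set of input types, based on the formats listed under Core Capabilities.
INPUT_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".pdf"}


def find_inputs(input_dir, recursive=True):
    """Return supported files under input_dir, sorted for stable ordering."""
    pattern = "**/*" if recursive else "*"
    return sorted(
        p for p in Path(input_dir).glob(pattern)
        if p.is_file() and p.suffix.lower() in INPUT_SUFFIXES
    )


def run_batch(paths, process):
    """Call process(path) for each path; one failure does not stop the batch."""
    results = {"success": 0, "failed": 0, "errors": {}}
    for path in paths:
        try:
            process(path)
            results["success"] += 1
        except Exception as exc:  # record the error and keep going
            results["failed"] += 1
            results["errors"][str(path)] = str(exc)
    return results


def fake_ocr(path):
    """Stand-in for real OCR: pretend PDFs fail so the tally shows both cases."""
    if path.suffix.lower() == ".pdf":
        raise ValueError("simulated OCR failure")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.png").write_bytes(b"")
    (root / "sub" / "b.PDF").write_bytes(b"")
    (root / "notes.txt").write_text("not an image")
    found = [p.name for p in find_inputs(root)]
    summary = run_batch(find_inputs(root), fake_ocr)

print(found, summary["success"], summary["failed"])
```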
Receipt/Document Parsing
Receipt Extraction
# Parse receipt structure
processor = OCRProcessor("receipt.jpg")
receipt_data = processor.parse_receipt()
# Returns structured data:
# - vendor: Store name
# - date: Transaction date
# - items: List of items with prices
# - subtotal: Subtotal amount
# - tax: Tax amount
# - total: Total amount
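Parsed receipt amounts are worth sanity-checking, because OCR often misreads digits. The sketch below pulls `subtotal`, `tax`, and `total` out of raw OCR text with a regular expression and checks that they add up. It is a hypothetical illustration working on plain text; it does not reflect how `parse_receipt()` is implemented, and the sample receipt is invented.

```python
import re
from decimal import Decimal

# A label at the start of the line, then any non-digits, then an amount like 6.21.
AMOUNT_LINE = re.compile(r"^(subtotal|tax|total)\b[^\d]*(\d+\.\d{2})\s*$", re.IGNORECASE)


def parse_amounts(ocr_text: str) -> dict:
    """Collect labelled amounts from receipt text, keyed by lowercase label."""
    amounts = {}
    for line in ocr_text.splitlines():
        match = AMOUNT_LINE.match(line.strip())
        if match:
            amounts[match.group(1).lower()] = Decimal(match.group(2))
    return amounts


def totals_consistent(amounts: dict) -> bool:
    """True when subtotal + tax equals total; False if any field is missing."""
    try:
        return amounts["subtotal"] + amounts["tax"] == amounts["total"]
    except KeyError:
        return False


# Hypothetical OCR output for a small receipt.
receipt_text = """CORNER MARKET
2026-03-16
Coffee 3.50
Bagel 2.25
Subtotal 5.75
Tax 0.46
TOTAL $6.21"""

amounts = parse_amounts(receipt_text)
print(amounts, totals_consistent(amounts))
```

Using `Decimal` rather than `float` avoids rounding surprises when comparing currency values.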
Business Card Parsing
# Extract business card info
processor = OCRProcessor("card.jpg")
contact = processor.parse_business_card()
# Returns:
# - name: Person's name
# - title: Job title
# - company: Company name
# - email: Email addresses
# - phone: Phone numbers
# - address: Physical address
# - website: Website URLs
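Fields such as email, phone, and website have recognizable shapes, so they can be located in raw OCR text with regular expressions. The following is a simplified sketch under that assumption. The patterns are deliberately loose, the sample card is invented, and nothing here is taken from the skill's `parse_business_card()` implementation.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Optional "+", a digit, at least 7 digits/separators, and a closing digit.
PHONE = re.compile(r"\+?\d[\d ().-]{7,}\d")
WEBSITE = re.compile(r"(?:https?://|www\.)\S+")


def extract_contact_fields(ocr_text: str) -> dict:
    """Find email addresses, phone numbers, and website URLs in card text."""
    return {
        "email": EMAIL.findall(ocr_text),
        "phone": PHONE.findall(ocr_text),
        "website": WEBSITE.findall(ocr_text),
    }


# Hypothetical OCR output for a business card.
card_text = """Jane Doe
Senior Engineer
Example Corp
jane.doe@example.com
+1 (555) 010-2030
www.example.com"""

fields = extract_contact_fields(card_text)
print(fields)
```

Name, title, and company have no fixed pattern and generally need layout or position cues, which this sketch does not attempt.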
Configuration
processor = OCRProcessor("document.png")
# Configure OCR settings
processor.config.update({
    'psm': 3,              # Page segmentation mode
    'oem': 3,              # OCR engine mode
    'dpi': 300,            # DPI for processing
    'timeout': 30,         # Timeout in seconds
    'min_confidence': 60,  # Minimum word confidence
})
Page Segmentation Modes (PSM)
| Mode | Description |
|---|---|
| 0 | Orientation and script detection only |
| 1 | Automatic page segmentation with OSD |
| 3 | Fully automatic page segmentation (default) |
| 4 | Assume single column of text |
| 6 | Assume single uniform block of text |
| 7 | Treat image as single text line |
| 8 | Treat image as single word |
| 11 | Sparse text. Find as much text as possible |
| 12 | Sparse text with OSD |
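The `psm`, `oem`, and `dpi` settings correspond to the Tesseract command-line options `--psm`, `--oem`, and `--dpi`. When calling Tesseract through pytesseract directly, such options are passed as a single `config` string. The sketch below builds that string with basic range checks. Whether the skill assembles its configuration this way is an assumption, and `tesseract_config` is a hypothetical helper.

```python
def tesseract_config(psm: int = 3, oem: int = 3, dpi=None) -> str:
    """Build a Tesseract option string such as '--psm 6 --oem 3 --dpi 300'."""
    if psm not in range(0, 14):   # Tesseract defines PSM values 0 through 13
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    if oem not in range(0, 4):    # OEM values 0 through 3
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    parts = [f"--psm {psm}", f"--oem {oem}"]
    if dpi is not None:
        parts.append(f"--dpi {dpi}")
    return " ".join(parts)


# Single uniform block of text, default engine, 300 DPI.
print(tesseract_config(psm=6, oem=3, dpi=300))
```

As a rule of thumb from the table above, mode 6 suits a cropped paragraph, mode 7 a single line such as a serial number, and mode 11 scattered labels on a diagram.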
Quality Assessment
# Get confidence scores
result = processor.extract_structured()
# Overall confidence (0-100)
print(f"Confidence: {result['confidence']}%")
# Per-word confidence
for word in result['words']:
    print(f"{word['text']}: {word['confidence']}%")
# Filter low-confidence words
high_conf_words = [w for w in result['words'] if w['confidence'] > 80]
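Beyond filtering, it can help to summarize how much of a page fell below the confidence threshold, for example to decide whether to retry with preprocessing. The sketch below assumes words are dicts with `text` and `confidence` keys, as in the loop above. It also skips entries with a negative confidence, since Tesseract's raw data output uses -1 for boxes that are not words; whether the skill already removes those is not stated. `summarize_confidence` and the sample data are hypothetical.

```python
def summarize_confidence(words, min_confidence=60):
    """Report which scored words fall below min_confidence and their share."""
    scored = [w for w in words if w["confidence"] >= 0]
    low = [w["text"] for w in scored if w["confidence"] < min_confidence]
    share = len(low) / len(scored) if scored else 0.0
    return {"low_words": low, "low_share": share}


# Hypothetical per-word results.
words = [
    {"text": "Invoice", "confidence": 96},
    {"text": "", "confidence": -1},       # non-word box, ignored
    {"text": "t0tal", "confidence": 45},
    {"text": "due", "confidence": 88},
]

summary = summarize_confidence(words)
print(summary["low_words"], round(summary["low_share"], 2))
```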
Output Formats
Markdown Export
processor.export_markdown("output.md")
Output includes:
- Document title (if detected)
- Structured headings
- Paragraphs
- Tables (as Markdown tables)
- Page breaks for multi-page docs
JSON Export
processor.export_json("output.json")
Output structure:
{
  "source": "document.pdf",
  "pages": 5,
  "language": "eng",
  "confidence": 92.5,
  "text": "Full extracted text...",
  "blocks": [
    {
      "type": "paragraph",
      "text": "Block text...",
      "bbox": [x, y, width, height],
      "confidence": 95.2
    }
  ],
  "tables": [...]
}
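An export with this structure can be read back with the standard `json` module. The sketch below loads a small, hypothetical export (with concrete numbers in place of the `x, y, width, height` placeholders) and converts each `bbox` to corner coordinates, the form many drawing and cropping APIs expect. `bbox_to_corners` is an illustrative helper, not part of the skill.

```python
import json

# Hypothetical export, trimmed to the fields described above.
exported = """
{
  "source": "document.pdf",
  "pages": 1,
  "language": "eng",
  "confidence": 92.5,
  "text": "Hello world",
  "blocks": [
    {"type": "paragraph", "text": "Hello world",
     "bbox": [40, 60, 200, 24], "confidence": 95.2}
  ],
  "tables": []
}
"""


def bbox_to_corners(bbox):
    """Convert [x, y, width, height] to (left, top, right, bottom)."""
    x, y, width, height = bbox
    return (x, y, x + width, y + height)


doc = json.loads(exported)
corners = [bbox_to_corners(block["bbox"]) for block in doc["blocks"]]
print(doc["source"], corners)
```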
HTML Export
processor.export_html("output.html")
Creates styled HTML with:
- Preserved layout approximation
- Highlighted low-confidence regions
- Embedded images (optional)
- Print-friendly styling
CLI Usage
# Basic extraction
python ocr_processor.py image.png -o output.txt
# Extract to markdown
python ocr_processor.py document.pdf -o output.md --format markdown
# Specify language
python ocr_processor.py german.png --lang deu
# Batch processing
python ocr_processor.py scans/ -o extracted/ --batch
# With preprocessing
python ocr_processor.py noisy.png --preprocess --deskew --denoise
Error Handling
from scripts.ocr_processor import OCRProcessor, OCRError
try:
    processor = OCRProcessor("document.png")
    text = processor.extract_text()
except OCRError as e:
    print(f"OCR failed: {e}")
except FileNotFoundError:
    print("File not found")
Performance Tips
- Image Quality: Higher resolution (300+ DPI) improves accuracy
- Preprocessing: Use for low-quality scans
- Language: Specifying language improves speed and accuracy
- PSM Mode: Choose appropriate mode for document type
- Large Files: Process PDFs page by page for memory efficiency
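The 300 DPI guideline can be turned into a value for the `scale` preprocessing option: divide the target DPI by the scan's actual DPI, and never go below 1.0. This is a back-of-the-envelope sketch; `scale_for_dpi` is a hypothetical helper, and the source does not say how the skill itself chooses a scale.

```python
def scale_for_dpi(source_dpi: float, target_dpi: float = 300.0) -> float:
    """Return the upscale factor needed to reach target_dpi (minimum 1.0)."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    return max(1.0, target_dpi / source_dpi)


print(scale_for_dpi(150))  # a 150 DPI scan needs 2x upscaling
print(scale_for_dpi(600))  # already above target, so no upscaling
```

Upscaling cannot recover detail the scanner never captured, so rescanning at a higher resolution is preferable where possible.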
Limitations
- Handwritten text: Limited accuracy
- Complex layouts: May lose structure
- Very low quality: Preprocessing helps but has limits
- Non-Latin scripts: Require specific language packs
Dependencies
pytesseract>=0.3.10
Pillow>=10.0.0
PyMuPDF>=1.23.0
opencv-python>=4.8.0
numpy>=1.24.0
System Requirements
- Tesseract OCR engine must be installed
- Language data files for non-English languages
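Both requirements can be checked up front. The sketch below looks for a `tesseract` executable on `PATH` and lists the installed language data using the `tesseract --list-langs` command. The helper names are assumptions, and the parser expects the usual output shape of one header line followed by one language code per line.

```python
import shutil
import subprocess


def tesseract_available() -> bool:
    """True if a `tesseract` executable can be found on PATH."""
    return shutil.which("tesseract") is not None


def parse_langs(listing: str) -> list[str]:
    """Parse `tesseract --list-langs` output: skip the header, keep the codes."""
    lines = [line.strip() for line in listing.splitlines() if line.strip()]
    return lines[1:]


def installed_langs() -> list[str]:
    """Ask Tesseract which language data files are installed."""
    proc = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=False,
    )
    # Some builds print the listing to stderr rather than stdout.
    return parse_langs(proc.stdout or proc.stderr)


if __name__ == "__main__":
    if tesseract_available():
        print("Installed language data:", installed_langs())
    else:
        print("Tesseract not found on PATH; install it before using the skill.")
```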
Statistics
Installs: 0
Rating: 0.0 / 5.0
Version: 1.0.0
Updated: March 16, 2026
Compatible Platforms
- Claude Code
- OpenClaw
- OpenCode
- Codex
- Gemini CLI
- GitHub Copilot
- Amp
- Kimi CLI
Timeline
Created: March 16, 2026
Last updated: March 16, 2026