pdf-ocr-extraction
Extract text from scanned documents and image-based PDFs, with batch processing, turning non-digital documents into searchable, editable text
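npx skills add claude-office-skills/skills --skill pdf-ocr-extraction
Before / After Comparison
Before: Typing out the text of scanned documents by hand, or running files one at a time through a paid OCR service; a 100-page document takes 4-5 hours.
After: Automatic batch recognition extracts all text in the documents and preserves basic formatting; a 100-page document is processed in 10 minutes.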
PDF OCR Extraction
Extract text from scanned documents and image-based PDFs using OCR technology.
Overview
This skill helps you:
- Extract text from scanned documents
- Make image PDFs searchable
- Digitize paper documents
- Process handwritten text (limited)
- Batch process multiple documents
How to Use
Basic OCR
"Extract text from this scanned PDF"
"OCR this document image"
"Make this PDF searchable"
With Options
"Extract text from pages 1-10, English language"
"OCR this document, preserve layout"
"Extract and output as structured data"
Document Types
OCR Quality by Document Type
| Document Type | Expected Quality | Tips |
|---------------|------------------|------|
| Typed documents | ⭐⭐⭐⭐⭐ 95%+ | Best results |
| Printed books | ⭐⭐⭐⭐ 90%+ | Watch for aged or faded pages |
| Forms | ⭐⭐⭐⭐ 85%+ | Checkboxes may need manual review |
| Tables/Data | ⭐⭐⭐ 80%+ | Structure may need fixing |
| Handwritten (neat) | ⭐⭐ 60-80% | Variable results |
| Handwritten (cursive) | ⭐ 30-60% | Often needs manual review |
| Mixed content | ⭐⭐⭐ 75%+ | Depends on complexity |
Output Formats
Plain Text Extraction
## OCR Result: [Document Name]
**Pages Processed**: [X]
**Language**: [Detected/Specified]
**Confidence**: [X]%
---
[Extracted text content here]
---
### Notes
- [Any issues or uncertainties]
- [Characters that may be incorrect]
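As a rough sketch of how the plain-text template above could be filled in programmatically, the helper below formats a result into the same layout. The function name `render_plain_text_report` and its parameters are illustrative assumptions and are not defined by this skill.

```python
def render_plain_text_report(name, pages, language, confidence, text, notes=()):
    """Format an OCR result using the plain-text report layout shown above."""
    lines = [
        f"## OCR Result: {name}",
        f"**Pages Processed**: {pages}",
        f"**Language**: {language}",
        f"**Confidence**: {confidence:.0f}%",
        "---",
        text.strip(),
        "---",
    ]
    # The Notes section is only emitted when there is something to report.
    if notes:
        lines.append("### Notes")
        lines.extend(f"- {note}" for note in notes)
    return "\n".join(lines)


report = render_plain_text_report(
    name="invoice.pdf",
    pages=2,
    language="English",
    confidence=93.4,
    text="Invoice #1001\nTotal due: 42.00",
    notes=["Page 2 is slightly skewed"],
)
print(report)
```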
Structured Extraction
## OCR Extraction: [Document Name]
### Document Info
| Field | Value |
|-------|-------|
| Title | [Extracted or inferred] |
| Date | [If found] |
| Author | [If found] |
### Content by Section
#### [Header 1]
[Content under this header]
#### [Header 2]
[Content under this header]
### Tables Found
| Column 1 | Column 2 | Column 3 |
|----------|----------|----------|
| [Data] | [Data] | [Data] |
### Uncertain Text
| Page | Original | Confidence | Possible |
|------|----------|------------|----------|
| 3 | "teh" | 70% | "the" |
| 5 | "l0ve" | 65% | "love" |
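One way the Uncertain Text table above might be populated is by filtering word-level confidences. The sketch below assumes the OCR engine returns a dictionary with parallel `text` and `conf` lists, which is the shape `pytesseract.image_to_data(..., output_type=Output.DICT)` produces; the function name and the 80% cutoff are assumptions chosen for illustration. It does not attempt to suggest corrections for the "Possible" column.

```python
def uncertain_words(data, page, threshold=80.0):
    """Collect words whose confidence falls below `threshold` (an assumed cutoff).

    `data` is a dict with parallel "text" and "conf" lists. Entries with empty
    text or a negative confidence mark layout blocks rather than words, so
    they are skipped.
    """
    rows = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # some engine versions report confidences as strings
        if not text.strip() or conf < 0:
            continue
        if conf < threshold:
            rows.append({"page": page, "original": text, "confidence": conf})
    return rows


sample = {"text": ["", "teh", "quick", "l0ve"], "conf": ["-1", 70, 96.5, "65"]}
rows = uncertain_words(sample, page=3)
for row in rows:
    print(f'| {row["page"]} | "{row["original"]}" | {row["confidence"]:.0f}% |')
```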
Searchable PDF Output
## OCR to Searchable PDF
**Source**: [filename.pdf]
**Output**: [filename_searchable.pdf]
### Processing Summary
| Metric | Value |
|--------|-------|
| Pages | [X] |
| Words extracted | [Y] |
| Average confidence | [Z]% |
| Processing time | [T] seconds |
### Quality Report
- [X] pages with 95%+ confidence
- [Y] pages with 80-94% confidence
- [Z] pages with <80% confidence (review recommended)
### Searchability
✅ Document is now text-searchable
✅ Original images preserved
✅ Text layer added behind images
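The three Quality Report buckets above can be derived from per-page confidence values. This is a minimal sketch; the function name `quality_report` is hypothetical, and it assumes confidences are percentages on a 0-100 scale.

```python
def quality_report(page_confidences):
    """Count pages per confidence bucket, using the report's thresholds."""
    buckets = {"95+": 0, "80-94": 0, "<80": 0}
    for conf in page_confidences:
        if conf >= 95:
            buckets["95+"] += 1
        elif conf >= 80:
            buckets["80-94"] += 1
        else:
            buckets["<80"] += 1  # review recommended
    return buckets


confidences = [97, 95, 88, 79.9, 60]
print(quality_report(confidences))
print(f"Average confidence: {sum(confidences) / len(confidences):.1f}%")
```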
Pre-Processing Tips
Image Quality Checklist
Before OCR, ensure:
- Resolution: 300 DPI minimum (600 for small text)
- Contrast: Clear black text on white background
- Alignment: Document is straight (not skewed)
- Completeness: No cut-off edges
- Cleanliness: No stains, marks, or shadows
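The resolution item in the checklist above can be checked before running OCR when the physical page size is known. The sketch below computes the effective DPI from pixel dimensions and page size in inches; the function names are illustrative, and taking the smaller of the two axes is a conservative assumption.

```python
def effective_dpi(width_px, height_px, width_in, height_in):
    """Return the lower of the horizontal and vertical pixel densities."""
    return min(width_px / width_in, height_px / height_in)


def meets_resolution(width_px, height_px, width_in, height_in, small_text=False):
    """Apply the checklist: 300 DPI minimum, 600 DPI for small text."""
    required = 600 if small_text else 300
    return effective_dpi(width_px, height_px, width_in, height_in) >= required


# A US Letter page (8.5 x 11 in) scanned at 2550 x 3300 pixels.
print(effective_dpi(2550, 3300, 8.5, 11))
print(meets_resolution(2550, 3300, 8.5, 11))
print(meets_resolution(2550, 3300, 8.5, 11, small_text=True))
```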
Common Pre-Processing Steps
| Issue | Solution |
|-------|----------|
| Low resolution | Upscale image first |
| Skewed/rotated | Auto-deskew |
| Poor contrast | Adjust levels/threshold |
| Noise/specks | Apply noise reduction |
| Shadows | Flatten lighting |
| Color document | Convert to grayscale |
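Two of the steps in the table above, grayscale conversion and thresholding, are simple enough to sketch without an imaging library. The example below works on flat lists of pixel values and picks the threshold with Otsu's method, which is one common choice and not necessarily what this skill uses. In practice a library such as Pillow or OpenCV would do this far faster; all function names here are illustrative.

```python
def to_grayscale(rgb_pixels):
    """Convert (r, g, b) tuples to 0-255 luma values (ITU-R BT.601 weights)."""
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in rgb_pixels]


def otsu_threshold(gray_pixels):
    """Return the level that maximizes between-class variance (Otsu's method)."""
    histogram = [0] * 256
    for value in gray_pixels:
        histogram[value] += 1
    total = len(gray_pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    background_count, background_sum = 0, 0
    for level in range(256):
        background_count += histogram[level]
        background_sum += level * histogram[level]
        if background_count == 0:
            continue  # no pixels at or below this level yet
        foreground_count = total - background_count
        if foreground_count == 0:
            break  # every pixel is already in the background class
        mean_bg = background_sum / background_count
        mean_fg = (total_sum - background_sum) / foreground_count
        variance = background_count * foreground_count * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(gray_pixels, threshold):
    """Map pixels at or below the threshold to black (0), the rest to white (255)."""
    return [0 if value <= threshold else 255 for value in gray_pixels]


gray = [10, 12, 11, 200, 210, 205]  # dark ink followed by light paper
print(binarize(gray, otsu_threshold(gray)))
```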
Language Support
Supported Languages
- Excellent: English, Spanish, French, German, Italian
- Good: Chinese (Simplified/Traditional), Japanese, Korean
- Moderate: Arabic, Hebrew (RTL support), Hindi
- Basic: Many others with varying quality
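If the OCR backend happens to be Tesseract (an assumption; the skill does not say which engine it uses), the languages named above map to Tesseract's traineddata codes, and several languages can be combined with `+`. The helper name `tesseract_lang` is hypothetical.

```python
# Tesseract traineddata codes for the languages named in the list above.
TESSERACT_CODES = {
    "english": "eng",
    "spanish": "spa",
    "french": "fra",
    "german": "deu",
    "italian": "ita",
    "chinese (simplified)": "chi_sim",
    "chinese (traditional)": "chi_tra",
    "japanese": "jpn",
    "korean": "kor",
    "arabic": "ara",
    "hebrew": "heb",
    "hindi": "hin",
}


def tesseract_lang(primary, *secondary):
    """Build a Tesseract language string such as "eng+chi_sim"."""
    codes = []
    for name in (primary, *secondary):
        key = name.strip().lower()
        if key not in TESSERACT_CODES:
            raise ValueError(f"no known Tesseract code for {name!r}")
        code = TESSERACT_CODES[key]
        if code not in codes:  # keep order, drop duplicates
            codes.append(code)
    return "+".join(codes)


print(tesseract_lang("English", "Chinese (Simplified)"))
```

The resulting string can be passed as the `lang` argument of `pytesseract.image_to_string` or to the `-l` option of the `tesseract` command line, provided the matching language data is installed.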
Multi-Language Documents
"OCR this document, detect language automatically"
"Extract text, primary: English, secondary: Chinese"
Handling Specific Content
Forms and Checkboxes
## Form Extraction: [Form Name]
### Field Values
| Field | Value | Confidence |
|-------|-------|------------|
| Name | John Smith | 98% |
| Date | 01/15/2026 | 95% |
| Address | 123 Main St | 92% |
### Checkboxes
| Question | Checked |
|----------|---------|
| Option A | ☑️ Yes |
| Option B | ☐ No |
| Option C | ☑️ Yes |
### Signature
[Signature detected on page X - cannot extract text]
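Checkbox states like those in the template above are usually decided from pixels, not from recognized text. The sketch below shows one simple heuristic, assuming the interior of each box has already been located and binarized: a box counts as checked when enough of its interior is dark. The function name and the 20% fill threshold are assumptions and would need tuning on real forms.

```python
def checkbox_checked(region, fill_threshold=0.2):
    """Decide whether a checkbox is ticked from its binarized interior.

    `region` is a 2D list where 1 marks a dark pixel and 0 a light one.
    The threshold is an illustrative guess, not a calibrated value.
    """
    total = sum(len(row) for row in region)
    if total == 0:
        raise ValueError("empty checkbox region")
    dark = sum(sum(row) for row in region)
    return dark / total >= fill_threshold


empty_box = [[0, 0, 0, 0] for _ in range(4)]
crossed_box = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
]
print(checkbox_checked(empty_box), checkbox_checked(crossed_box))
```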
Tables
## Table Extraction
### Table 1 (Page 2)
| Header A | Header B | Header C |
|----------|----------|----------|
| Value 1 | Value 2 | Value 3 |
| Value 4 | Value 5 | Value 6 |
**Table confidence**: 85%
**Note**: Column 3 may have alignment issues
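Once cell values have been recovered, emitting the Markdown table shown above is mechanical. The helper below is a small sketch (the name `to_markdown_table` is hypothetical) that pads short rows and truncates long ones, so a misaligned row still renders as a valid table row.

```python
def to_markdown_table(headers, rows):
    """Render headers and rows as a Markdown pipe table."""
    def format_row(cells):
        # Escape pipes so cell text cannot break the table structure.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    width = len(headers)
    lines = [format_row(headers), "|" + "|".join("---" for _ in headers) + "|"]
    for row in rows:
        cells = list(row)[:width]
        cells += [""] * (width - len(cells))  # pad rows that lost a column
        lines.append(format_row(cells))
    return "\n".join(lines)


table = to_markdown_table(
    ["Header A", "Header B"],
    [["Value 1", "Value 2"], ["Value 4"]],
)
print(table)
```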
Handwritten Text
## Handwritten Text Extraction
**Legibility Assessment**: [Good/Fair/Poor]
**Recommended**: Manual review
### Extracted Text (Confidence: 65%)
[Extracted text with uncertain words marked]
### Uncertain Words
| Original | Best Guess | Alternatives |
|----------|------------|--------------|
| [image] | "meeting" | "meeting", "meaning" |
| [image] | "Tuesday" | "Tuesday", "Thursday" |
⚠️ **Low confidence extraction - please verify manually**
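The "uncertain words marked" convention in the template above could be implemented by wrapping low-confidence words in a visible marker. This sketch assumes word-level `(word, confidence)` pairs are available; the `[word?]` marker style and the 80% cutoff are assumptions.

```python
def mark_uncertain(words, threshold=80.0):
    """Join words into text, wrapping low-confidence ones as "[word?]"."""
    return " ".join(
        word if conf >= threshold else f"[{word}?]" for word, conf in words
    )


print(mark_uncertain([("See", 92), ("you", 90), ("Tuesday", 55)]))
```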
Batch Processing
Batch OCR Job
## Batch OCR Processing
**Folder**: [Path]
**Total Documents**: [X]
**Status**: [In Progress/Complete]
### Results
| File | Pages | Confidence | Status |
|------|-------|------------|--------|
| doc1.pdf | 5 | 96% | ✅ Complete |
| doc2.pdf | 12 | 88% | ✅ Complete |
| doc3.pdf | 3 | 72% | ⚠️ Review |
| doc4.pdf | 8 | - | ❌ Failed |
### Issues
- doc3.pdf: Pages 2-3 have handwriting
- doc4.pdf: File corrupted
### Summary
- Successful: [X]
- Need Review: [Y]
- Failed: [Z]
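The status column and summary counts in the batch report above follow from a simple rule. The sketch below assumes a document is marked for review when its confidence is under 80% (consistent with the quality report thresholds in this document) and failed when no confidence is available; the `DocResult` type and function names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocResult:
    file: str
    pages: int
    confidence: Optional[float]  # None when processing failed


def status_of(result, review_below=80.0):
    """Classify a single document as Complete, Review, or Failed."""
    if result.confidence is None:
        return "Failed"
    if result.confidence < review_below:
        return "Review"
    return "Complete"


def summarize(results, review_below=80.0):
    """Count documents per status for the batch summary."""
    summary = {"Complete": 0, "Review": 0, "Failed": 0}
    for result in results:
        summary[status_of(result, review_below)] += 1
    return summary


batch = [
    DocResult("doc1.pdf", 5, 96),
    DocResult("doc2.pdf", 12, 88),
    DocResult("doc3.pdf", 3, 72),
    DocResult("doc4.pdf", 8, None),
]
print(summarize(batch))
```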
Tool Recommendations
Cloud Services
- Google Cloud Vision (excellent accuracy)
- Amazon Textract (good for forms)
- Azure Computer Vision (balanced)
- Adobe Acrobat (integrated)
Desktop Software
- ABBYY FineReader (best accuracy)
- Adobe Acrobat Pro (reliable)
- Readiris (good value)
- Tesseract (free, open source)
Programming Libraries
- pytesseract (Python + Tesseract)
- EasyOCR (Python, multi-language)
- PaddleOCR (Python, good for Asian languages)
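For orientation, a minimal pytesseract call might look like the sketch below. It assumes the Tesseract binary, the `pytesseract` package, and Pillow are installed; the wrapper name `ocr_image` is hypothetical. The third-party imports are done inside the function so that a missing file is reported before any OCR dependency is loaded.

```python
from pathlib import Path


def ocr_image(path, lang="eng"):
    """Run Tesseract on a single image file and return the recognized text."""
    image_path = Path(path)
    if not image_path.is_file():
        raise FileNotFoundError(f"no such image: {image_path}")

    # Imported lazily: both packages are third-party and need a local
    # Tesseract installation to do any real work.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang)
```

A call such as `ocr_image("scan_page1.png", lang="eng")` would return the page text as a string; the file name here is only a placeholder.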
Limitations
- Cannot guarantee 100% accuracy
- Handwritten text has low accuracy
- Very small text may not extract well
- Decorative fonts are problematic
- Background images reduce quality
- Cannot read text in complex graphics
- Processing time increases with pages