doc-pipeline
Build document processing pipelines: chain extract, transform, and convert operations into reusable workflows, with data flowing automatically between stages.
npx skills add claude-office-skills/skills --skill doc-pipeline
Before / After Comparison
Before: processing a batch of documents requires several manual steps: download, format conversion, data extraction, cleaning, and loading. Each step is performed independently, is error-prone, and cannot be reused; handling 100 documents takes a full day.
After: define the document pipeline once, then run the full workflow automatically on batches. Data flows seamlessly between the extract, transform, and load stages, with parallel processing and error retries; 100 documents finish in 10 minutes with consistent quality.
SKILL.md
Doc Pipeline Skill
Overview
This skill enables building document processing pipelines: chain multiple operations (extract, transform, convert) into reusable workflows, with data flowing between stages.
How to Use
- Describe what you want to accomplish
- Provide any required input data or files
- I'll execute the appropriate operations
Example prompts:
- "PDF → Extract Text → Translate → Generate DOCX"
- "Image → OCR → Summarize → Create Report"
- "Excel → Analyze → Generate Charts → Create PPT"
- "Multiple inputs → Merge → Format → Output"
Domain Knowledge
Pipeline Architecture
```
  Stage 1        Stage 2        Stage 3       Stage 4
┌─────────┐   ┌───────────┐   ┌─────────┐   ┌────────┐
│ Extract │ → │ Transform │ → │   AI    │ → │ Output │
│   PDF   │   │   Data    │   │ Analyze │   │  DOCX  │
└─────────┘   └───────────┘   └─────────┘   └────────┘
     │              │              │             │
     └──────────────┴──────────────┴─────────────┘
                      Data Flow
```
Pipeline DSL (Domain Specific Language)
```yaml
# pipeline.yaml
name: contract-review-pipeline
description: Extract, analyze, and report on contracts
stages:
  - name: extract
    operation: pdf-extraction
    input: $input_file
    output: $extracted_text
  - name: analyze
    operation: ai-analyze
    input: $extracted_text
    prompt: "Review this contract for risks..."
    output: $analysis
  - name: report
    operation: docx-generation
    input: $analysis
    template: templates/review_report.docx
    output: $output_file
```
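One way such a stage list could be interpreted after parsing: look each `operation` name up in a registry and resolve `$variable` references against a shared context dict. The `OPERATIONS` registry, `run_stages` helper, and the string-tagging operations below are hypothetical illustrations, not part of the skill; a real implementation would plug in actual PDF/AI/DOCX operations.

```python
from typing import Any, Callable

# Hypothetical operation registry; names match the YAML above, bodies
# are stand-ins that tag the data so the flow is visible.
OPERATIONS: dict[str, Callable[[Any], Any]] = {
    "pdf-extraction": lambda text: f"extracted:{text}",
    "ai-analyze": lambda text: f"analysis:{text}",
    "docx-generation": lambda text: f"docx:{text}",
}

def run_stages(stages: list[dict], context: dict) -> dict:
    """Execute parsed stage definitions, resolving $var references."""
    for stage in stages:
        op = OPERATIONS[stage["operation"]]
        value = context[stage["input"].lstrip("$")]   # read input variable
        context[stage["output"].lstrip("$")] = op(value)  # bind output variable
    return context

# The stage list as it would look after YAML parsing
stages = [
    {"name": "extract", "operation": "pdf-extraction",
     "input": "$input_file", "output": "$extracted_text"},
    {"name": "analyze", "operation": "ai-analyze",
     "input": "$extracted_text", "output": "$analysis"},
    {"name": "report", "operation": "docx-generation",
     "input": "$analysis", "output": "$output_file"},
]
ctx = run_stages(stages, {"input_file": "contract.pdf"})
print(ctx["output_file"])  # docx:analysis:extracted:contract.pdf
```

Each stage's output variable becomes available as an input to later stages, which is what "data flowing between stages" means concretely.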
Python Implementation
```python
from typing import Callable, Any
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    operation: Callable

class Pipeline:
    def __init__(self, name: str):
        self.name = name
        self.stages: list[Stage] = []

    def add_stage(self, name: str, operation: Callable):
        self.stages.append(Stage(name, operation))
        return self  # Fluent API

    def run(self, input_data: Any) -> Any:
        data = input_data
        for stage in self.stages:
            print(f"Running stage: {stage.name}")
            data = stage.operation(data)
        return data

# Example usage
pipeline = Pipeline("contract-review")
pipeline.add_stage("extract", extract_pdf_text)
pipeline.add_stage("analyze", analyze_with_ai)
pipeline.add_stage("generate", create_docx_report)
result = pipeline.run("/path/to/contract.pdf")
```
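Because `add_stage` returns `self`, stages can also be chained fluently. A self-contained sketch: the `Pipeline` class is repeated in condensed form so the snippet runs on its own, and the lambdas are dummy stages standing in for real PDF/AI/DOCX functions like `extract_pdf_text`.

```python
from typing import Any, Callable
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    operation: Callable

class Pipeline:
    def __init__(self, name: str):
        self.name = name
        self.stages: list[Stage] = []

    def add_stage(self, name: str, operation: Callable) -> "Pipeline":
        self.stages.append(Stage(name, operation))
        return self  # enables fluent chaining

    def run(self, input_data: Any) -> Any:
        data = input_data
        for stage in self.stages:
            data = stage.operation(data)
        return data

# Dummy stages; each consumes the previous stage's output
result = (
    Pipeline("contract-review")
    .add_stage("extract", lambda path: f"text of {path}")
    .add_stage("analyze", str.upper)
    .add_stage("generate", lambda text: {"report": text})
    .run("/path/to/contract.pdf")
)
print(result)  # {'report': 'TEXT OF /PATH/TO/CONTRACT.PDF'}
```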
Advanced: Conditional Pipelines
```python
class ConditionalPipeline(Pipeline):
    def add_conditional_stage(self, name: str, condition: Callable,
                              if_true: Callable, if_false: Callable):
        def conditional_op(data):
            if condition(data):
                return if_true(data)
            return if_false(data)
        return self.add_stage(name, conditional_op)

# Usage
pipeline.add_conditional_stage(
    "ocr_if_needed",
    condition=lambda d: d.get("has_images"),
    if_true=run_ocr,
    if_false=lambda d: d
)
```
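The branching shape inside `add_conditional_stage` can be exercised on its own, without the pipeline machinery. In this sketch, `make_conditional_op` mirrors that logic, and `run_ocr` is a simulated stand-in for a real OCR stage:

```python
from typing import Any, Callable

def make_conditional_op(condition: Callable, if_true: Callable,
                        if_false: Callable) -> Callable:
    """Same branching shape as ConditionalPipeline.add_conditional_stage."""
    def conditional_op(data: Any) -> Any:
        return if_true(data) if condition(data) else if_false(data)
    return conditional_op

def run_ocr(d: dict) -> dict:
    # Hypothetical OCR stage: adds extracted text to the document dict
    return {**d, "text": "ocr'd text"}

op = make_conditional_op(
    condition=lambda d: d.get("has_images"),
    if_true=run_ocr,
    if_false=lambda d: d,  # pass the document through unchanged
)

print(op({"has_images": True}))   # {'has_images': True, 'text': "ocr'd text"}
print(op({"has_images": False}))  # {'has_images': False}
```

Documents with images get the extra OCR stage; everything else flows through untouched, so one pipeline handles both cases.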
Best Practices
- Keep stages focused (single responsibility)
- Use intermediate outputs for debugging
- Implement stage-level error handling
- Make pipelines configurable via YAML/JSON
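One way to implement the stage-level error handling recommended above: wrap each operation with a retry helper before registering it. The retry count, delay, and the flaky stage below are illustrative assumptions, not part of the skill.

```python
import time
from functools import wraps
from typing import Any, Callable

def with_retry(operation: Callable, retries: int = 3,
               delay: float = 0.0) -> Callable:
    """Wrap a stage operation so transient failures are retried."""
    @wraps(operation)
    def wrapper(data: Any) -> Any:
        last_exc = None
        for _attempt in range(retries):
            try:
                return operation(data)
            except Exception as exc:
                last_exc = exc
                time.sleep(delay)  # backoff between attempts
        raise RuntimeError(f"stage failed after {retries} attempts") from last_exc
    return wrapper

# A flaky stage that fails twice before succeeding
calls = {"n": 0}
def flaky(data):
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient failure")
    return data + " processed"

safe_stage = with_retry(flaky, retries=3)
print(safe_stage("doc"))  # doc processed
```

The wrapped function has the same one-argument signature as any other stage, so it can be passed straight to `add_stage` without changing the pipeline code.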
Installation
```shell
# Install required dependencies
pip install python-docx openpyxl python-pptx reportlab jinja2
```
Resources
- Weekly Installs: 303
- Repository: claude-office-s…s/skills
- GitHub Stars: 35
- First Seen: Mar 5, 2026
- Security Audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Pass
- Installed on: claude-code (256), opencode (120), github-copilot (119), kimi-cli (117), amp (117), cline (117)