---
id: daily-doc-pipeline
name: "doc-pipeline"
url: https://skills.yangsir.net/skill/daily-doc-pipeline
author: claude-office-skills
domain: data-ai
tags: ["data-extraction", "automation", "information-retrieval", "workflow", "data-analysis"]
install_count: 1800
rating: 4.30 (20 reviews)
github: https://github.com/claude-office-skills/skills
---

# doc-pipeline

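> Build document processing pipelines that chain operations such as extraction, transformation, and conversion into reusable workflows, with data flowing automatically between stages.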

**Stats**: 1,800 installs · 4.3/5 (20 reviews)

## Before / After Comparison

### Document Processing Efficiency

**Before**:

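Processing a batch of documents means running several steps by hand: download, format conversion, data extraction, cleaning, and loading into storage. Each step is a separate operation, which is error-prone and not reusable, so processing 100 documents takes a full day.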

**After**:

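Define the document processing pipeline once, then run the whole flow automatically in batches. Data moves between the extract, transform, and load stages without manual handoffs, with support for parallel processing and error retries. 100 documents finish in 10 minutes with controlled quality.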

| Metric | Before | After | Change |
|---|---|---|---|
| Batch document processing time | 480 min / 100 docs | 10 min / 100 docs | -98% |
| Processing error rate | 8% | 1% | -88% |

## Readme

# Doc Pipeline Skill

## Overview

This skill builds document processing pipelines: it chains multiple operations (extract, transform, convert) into reusable workflows, with each stage's output flowing into the next stage.

## How to Use

- Describe what you want to accomplish

- Provide any required input data or files

- I'll execute the appropriate operations

**Example prompts:**

- "PDF → Extract Text → Translate → Generate DOCX"

- "Image → OCR → Summarize → Create Report"

- "Excel → Analyze → Generate Charts → Create PPT"

- "Multiple inputs → Merge → Format → Output"

## Domain Knowledge

### Pipeline Architecture

```
  Stage 1       Stage 2       Stage 3       Stage 4
┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│ Extract │ → │Transform│ → │   AI    │ → │ Output  │
│   PDF   │   │  Data   │   │ Analyze │   │  DOCX   │
└────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘
     │             │             │             │
     └─────────────┴─────────────┴─────────────┘
                     Data Flow
```

### Pipeline DSL (Domain Specific Language)

```
# pipeline.yaml
name: contract-review-pipeline
description: Extract, analyze, and report on contracts

stages:
  - name: extract
    operation: pdf-extraction
    input: $input_file
    output: $extracted_text
    
  - name: analyze
    operation: ai-analyze
    input: $extracted_text
    prompt: "Review this contract for risks..."
    output: $analysis
    
  - name: report
    operation: docx-generation
    input: $analysis
    template: templates/review_report.docx
    output: $output_file

```
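
A minimal sketch of how a spec like the one above could be executed, assuming it has already been parsed into a Python dict (for example by a YAML loader). The `run_spec` function, the `OPERATIONS` registry, and the stub operations are hypothetical names introduced for illustration; the skill's actual runner may work differently. Each stage reads its `input` variable from a shared context and writes its result under its `output` variable.

```python
from typing import Any, Callable

# Spec as it might look after parsing pipeline.yaml into a dict (assumed shape).
SPEC = {
    "name": "contract-review-pipeline",
    "stages": [
        {"name": "extract", "operation": "pdf-extraction",
         "input": "$input_file", "output": "$extracted_text"},
        {"name": "analyze", "operation": "ai-analyze",
         "input": "$extracted_text",
         "prompt": "Review this contract for risks...",
         "output": "$analysis"},
        {"name": "report", "operation": "docx-generation",
         "input": "$analysis",
         "template": "templates/review_report.docx",
         "output": "$output_file"},
    ],
}

Operation = Callable[..., Any]
RESERVED_KEYS = {"name", "operation", "input", "output"}


def run_spec(spec: dict, operations: dict[str, Operation],
             variables: dict[str, Any]) -> dict[str, Any]:
    """Run stages in order, passing data through named $variables."""
    context = dict(variables)  # Copy so the caller's dict is not mutated
    for stage in spec["stages"]:
        operation = operations[stage["operation"]]
        value = context[stage["input"].lstrip("$")]
        # Extra keys (prompt, template, ...) are forwarded as keyword options
        options = {k: v for k, v in stage.items() if k not in RESERVED_KEYS}
        context[stage["output"].lstrip("$")] = operation(value, **options)
    return context


# Hypothetical stubs standing in for real extraction, analysis, and generation.
OPERATIONS = {
    "pdf-extraction": lambda path: f"text of {path}",
    "ai-analyze": lambda text, prompt: {"prompt": prompt, "input": text},
    "docx-generation": lambda analysis, template: f"report built from {template}",
}

result = run_spec(SPEC, OPERATIONS, {"input_file": "contract.pdf"})
print(result["output_file"])
```

Keeping every variable in one context dict means intermediate values such as `extracted_text` stay available after the run, which helps when inspecting a stage that produced unexpected output.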

### Python Implementation

```
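from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Stage:
    """A named step that receives the previous stage's output."""
    name: str
    operation: Callable[[Any], Any]


class Pipeline:
    def __init__(self, name: str):
        self.name = name
        self.stages: list[Stage] = []

    def add_stage(self, name: str, operation: Callable[[Any], Any]) -> "Pipeline":
        self.stages.append(Stage(name, operation))
        return self  # Fluent API: allows chained add_stage() calls

    def run(self, input_data: Any) -> Any:
        data = input_data
        for stage in self.stages:
            print(f"Running stage: {stage.name}")
            data = stage.operation(data)  # Each stage's output feeds the next
        return data


# Placeholder operations (hypothetical stubs; replace with real implementations)
def extract_pdf_text(path: str) -> str:
    return f"text extracted from {path}"


def analyze_with_ai(text: str) -> dict:
    return {"summary": text, "risks": []}


def create_docx_report(analysis: dict) -> str:
    return "review_report.docx"  # Path of the generated report


# Example usage
pipeline = Pipeline("contract-review")
pipeline.add_stage("extract", extract_pdf_text)
pipeline.add_stage("analyze", analyze_with_ai)
pipeline.add_stage("generate", create_docx_report)

result = pipeline.run("/path/to/contract.pdf")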
```

### Advanced: Conditional Pipelines

```
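# Reuses Pipeline, Callable, and Any from the Python Implementation block
class ConditionalPipeline(Pipeline):
    def add_conditional_stage(self, name: str, condition: Callable[[Any], bool],
                              if_true: Callable[[Any], Any],
                              if_false: Callable[[Any], Any]) -> "ConditionalPipeline":
        def conditional_op(data):
            # Route the data through exactly one branch
            if condition(data):
                return if_true(data)
            return if_false(data)
        return self.add_stage(name, conditional_op)


# Placeholder OCR step (hypothetical stub; replace with a real OCR call)
def run_ocr(doc: dict) -> dict:
    return {**doc, "text": "<recognized text>", "has_images": False}


# Usage: stage data is assumed to be a dict carrying a "has_images" flag.
# The pipeline must be a ConditionalPipeline, not a plain Pipeline.
pipeline = ConditionalPipeline("ocr-if-needed")
pipeline.add_conditional_stage(
    "ocr_if_needed",
    condition=lambda d: d.get("has_images", False),
    if_true=run_ocr,
    if_false=lambda d: d,  # Pass the data through unchanged
)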
```

## Best Practices

- **Keep stages focused (single responsibility)**

- **Use intermediate outputs for debugging**

- **Implement stage-level error handling**

- **Make pipelines configurable via YAML/JSON**
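
The practices on intermediate outputs and stage-level error handling can be sketched together, assuming stages are plain callables run in order. `StageError` and `run_with_trace` are hypothetical names, not part of the skill: the runner records each stage's output for later inspection and wraps any failure with the name of the stage that raised it.

```python
from typing import Any, Callable


class StageError(Exception):
    """Raised when a stage fails; carries the stage name and outputs so far."""

    def __init__(self, stage_name: str, trace: dict[str, Any], cause: Exception):
        super().__init__(f"stage '{stage_name}' failed: {cause}")
        self.stage_name = stage_name
        self.trace = trace


def run_with_trace(stages: list[tuple[str, Callable[[Any], Any]]],
                   input_data: Any) -> tuple[Any, dict[str, Any]]:
    """Run stages in order and keep every intermediate output by stage name."""
    trace: dict[str, Any] = {}
    data = input_data
    for name, operation in stages:
        try:
            data = operation(data)
        except Exception as exc:
            # Report which stage failed and what earlier stages produced
            raise StageError(name, trace, exc) from exc
        trace[name] = data
    return data, trace


# Toy stages used only to demonstrate the runner
stages = [
    ("strip", str.strip),
    ("split", str.split),
    ("count", len),
]

result, trace = run_with_trace(stages, "  three short words ")
print(result)          # 3
print(trace["split"])  # ['three', 'short', 'words']

try:
    run_with_trace(stages, None)  # str.strip rejects None with a TypeError
except StageError as err:
    print(f"failed in stage: {err.stage_name}")
```

Because the trace is attached to the exception, a failed run still exposes the outputs of the stages that completed, so debugging can start from the last good intermediate result.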

## Installation

```
# Install required dependencies
pip install python-docx openpyxl python-pptx reportlab jinja2

```

## Resources

- [Custom Repository](https://github.com/claude-office-skills/skills)

- [Claude Office Skills Hub](https://github.com/claude-office-skills/skills)

**Weekly installs**: 303 · **Repository**: [claude-office-s…s/skills](https://github.com/claude-office-skills/skills) · **GitHub stars**: 35 · **First seen**: Mar 5, 2026

**Security audits**: [Gen Agent Trust Hub](/claude-office-skills/skills/doc-pipeline/security/agent-trust-hub): Pass · [Socket](/claude-office-skills/skills/doc-pipeline/security/socket): Pass · [Snyk](/claude-office-skills/skills/doc-pipeline/security/snyk): Pass

**Installed on**: claude-code 256 · opencode 120 · github-copilot 119 · kimi-cli 117 · amp 117 · cline 117

---
*Source: https://skills.yangsir.net/skill/daily-doc-pipeline*
*Markdown mirror: https://skills.yangsir.net/api/skill/daily-doc-pipeline/markdown*