liteparse
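Parse unstructured documents such as PDF, DOCX, PPTX, and XLSX locally: fast and lightweight, with no cloud dependencies and no LLM required.
npx skills add run-llama/llamaparse-agent-skills --skill liteparse
Before / After comparison
Before: Text is copied and pasted by hand from PDF or Word documents, complex formatting needs repeated adjustment, and organizing the content of a 50-page report takes 2 hours.
After: A document is uploaded in one click and structured text is extracted automatically, preserving paragraph formatting and table structure; a 50-page report is parsed in 2 minutes and output as editable Markdown.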
LiteParse Skill
Parse unstructured documents (PDF, DOCX, PPTX, XLSX, images, and more) locally with LiteParse: fast, lightweight, no cloud dependencies or LLM required.
Initial Setup
When this skill is invoked, respond with:
I'm ready to use LiteParse to parse files locally. Before we begin, please confirm that:
- `@llamaindex/liteparse` is installed globally (`npm i -g @llamaindex/liteparse`)
- The `lit` CLI command is available in your terminal
If both are set, please provide:
1. One or more files to parse (PDF, DOCX, PPTX, XLSX, images, etc.)
2. Any specific options: output format (json/text), page ranges, OCR preferences, DPI, etc.
3. What you'd like to do with the parsed content.
I will produce the appropriate `lit` CLI command or TypeScript script, and once approved, report the results.
Then wait for the user's input.
Step 0 — Install LiteParse (if needed)
If liteparse is not yet installed, install it globally:
npm i -g @llamaindex/liteparse
Verify installation:
lit --version
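The same check can be scripted before any parsing is attempted. The following is a minimal sketch, assuming a Node.js runtime; `isLitAvailable` is a hypothetical helper name, not part of LiteParse, and it relies only on the `lit --version` command shown above.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical helper: returns true when `<command> --version` runs and exits with status 0.
// A missing binary is reported through `result.error` (with a null status),
// so the function returns false instead of throwing.
function isLitAvailable(command: string = "lit"): boolean {
  const result = spawnSync(command, ["--version"], { encoding: "utf8" });
  return result.error === undefined && result.status === 0;
}

console.log(
  isLitAvailable()
    ? "lit is installed"
    : "lit not found; install it with: npm i -g @llamaindex/liteparse"
);
```

On Windows, npm's command shims may need `shell: true` in the `spawnSync` options; that case is not handled in this sketch.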
For Office document support (DOCX, PPTX, XLSX), LibreOffice is required:
# macOS
brew install --cask libreoffice
# Ubuntu/Debian
apt-get install libreoffice
For image parsing, ImageMagick is required:
# macOS
brew install imagemagick
# Ubuntu/Debian
apt-get install imagemagick
Step 1 — Produce the CLI Command or Script
Parse a Single File
# Basic text extraction
lit parse document.pdf
# JSON output saved to a file
lit parse document.pdf --format json -o output.json
# Specific page range
lit parse document.pdf --target-pages "1-5,10,15-20"
# Disable OCR (faster, text-only PDFs)
lit parse document.pdf --no-ocr
# Use an external HTTP OCR server for higher accuracy
lit parse document.pdf --ocr-server-url http://localhost:8828/ocr
# Higher DPI for better quality
lit parse document.pdf --dpi 300
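When the command is assembled from user preferences rather than typed by hand, a small builder keeps the flags consistent. This is a sketch only: `ParseOptions` and `buildParseArgs` are hypothetical names introduced for illustration, and the only flags emitted are the ones listed in the commands above.

```typescript
// Hypothetical option bag; the field names are illustrative, not a LiteParse API.
interface ParseOptions {
  format?: "json" | "text";
  output?: string;        // maps to -o
  targetPages?: string;   // e.g. "1-5,10,15-20"
  ocr?: boolean;          // false adds --no-ocr
  ocrServerUrl?: string;
  dpi?: number;
}

// Builds the argument list for `lit parse`, using only flags shown in this section.
function buildParseArgs(file: string, opts: ParseOptions = {}): string[] {
  const args: string[] = ["parse", file];
  if (opts.format !== undefined) args.push("--format", opts.format);
  if (opts.output !== undefined) args.push("-o", opts.output);
  if (opts.targetPages !== undefined) args.push("--target-pages", opts.targetPages);
  if (opts.ocr === false) args.push("--no-ocr");
  if (opts.ocrServerUrl !== undefined) args.push("--ocr-server-url", opts.ocrServerUrl);
  if (opts.dpi !== undefined) args.push("--dpi", String(opts.dpi));
  return args;
}

const exampleArgs = buildParseArgs("document.pdf", { format: "json", output: "output.json" });
console.log(["lit", ...exampleArgs].join(" "));
// prints: lit parse document.pdf --format json -o output.json
```

Returning an argument array rather than a single string means it can be passed to `spawnSync("lit", args)` from `node:child_process` without shell quoting issues for file names that contain spaces.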
Batch Parse a Directory
lit batch-parse ./input-directory ./output-directory
# Only process PDFs, recursively
lit batch-parse ./input ./output --extension .pdf --recursive
Generate Page Screenshots
Screenshots are useful for LLM agents that need to see visual layout.
# All pages
lit screenshot document.pdf -o ./screenshots
# Specific pages
lit screenshot document.pdf --pages "1,3,5" -o ./screenshots
# High-DPI PNG
lit screenshot document.pdf --dpi 300 --format png -o ./screenshots
# Page range
lit screenshot document.pdf --pages "1-10" -o ./screenshots
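Page specs such as "1,3,5" or "1-10" can be validated before they are handed to `--pages` or `--target-pages`. The sketch below assumes pages are 1-based and ranges are inclusive; the CLI's own parsing rules are not documented here, so treat `expandPageSpec` as a hypothetical pre-check rather than a mirror of LiteParse's behavior.

```typescript
// Expands a page spec such as "1-5,10,15-20" into a sorted list of unique page numbers.
// Assumption: pages are 1-based and ranges are inclusive on both ends.
function expandPageSpec(spec: string): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    const token = part.trim();
    const match = /^(\d+)(?:-(\d+))?$/.exec(token);
    if (match === null) {
      throw new Error(`Invalid page token: "${token}"`);
    }
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range: "${token}"`);
    }
    for (let page = start; page <= end; page++) {
      pages.add(page);
    }
  }
  return Array.from(pages).sort((a, b) => a - b);
}

console.log(expandPageSpec("1-3,5").join(","));
// prints: 1,2,3,5
```

A reversed range such as "5-1" or a stray token such as "a" throws, which surfaces typos before a long render starts.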
Step 3 — Key Options Reference
OCR Options
| Option | Description |
| --- | --- |
| (default) | Tesseract.js: zero setup, built-in |
| `--ocr-language fra` | Set OCR language (ISO code) |
| `--ocr-server-url <url>` | Use external HTTP OCR server (EasyOCR, PaddleOCR, custom) |
| `--no-ocr` | Disable OCR entirely |
Output Options
| Option | Description |
| --- | --- |
| `--format json` | Structured JSON with bounding boxes |
| `--format text` | Plain text (default) |
| `-o <file>` | Save output to file |
Performance / Quality Options
| Option | Description |
| --- | --- |
| `--dpi <n>` | Rendering DPI (default: 150; use 300 for high quality) |
| `--max-pages <n>` | Limit pages parsed |
| `--target-pages <pages>` | Parse specific pages (e.g. "1-5,10") |
| `--no-precise-bbox` | Disable precise bounding boxes (faster) |
| `--skip-diagonal-text` | Ignore rotated/diagonal text |
| `--preserve-small-text` | Keep very small text that would otherwise be dropped |
Step 4 — Using a Config File
For repeated use with consistent options, generate a liteparse.config.json:
{
  "ocrLanguage": "en",
  "ocrEnabled": true,
  "maxPages": 1000,
  "dpi": 150,
  "outputFormat": "json",
  "preciseBoundingBox": true,
  "skipDiagonalText": false,
  "preserveVerySmallText": false
}
For an HTTP OCR server:
{
  "ocrServerUrl": "http://localhost:8828/ocr",
  "ocrLanguage": "en",
  "outputFormat": "json"
}
Use with:
lit parse document.pdf --config liteparse.config.json
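If the config file is generated from a script, typing it catches misspelled keys at compile time. This is a sketch under assumptions: the `LiteParseConfig` interface is hypothetical and simply mirrors the keys in the two JSON examples above, and the positive-integer checks on `dpi` and `maxPages` are a guess at sensible bounds, not documented LiteParse validation.

```typescript
// Hypothetical type mirroring the keys from the liteparse.config.json examples above.
interface LiteParseConfig {
  ocrLanguage?: string;
  ocrEnabled?: boolean;
  ocrServerUrl?: string;
  maxPages?: number;
  dpi?: number;
  outputFormat?: "json" | "text";
  preciseBoundingBox?: boolean;
  skipDiagonalText?: boolean;
  preserveVerySmallText?: boolean;
}

// Assumption: numeric settings should be positive integers.
function assertPositiveInteger(name: string, value: number | undefined): void {
  if (value !== undefined && (!Number.isInteger(value) || value <= 0)) {
    throw new Error(`${name} must be a positive integer, got ${value}`);
  }
}

// Serializes the config as pretty-printed JSON, ready to be written to liteparse.config.json.
function renderConfig(config: LiteParseConfig): string {
  assertPositiveInteger("dpi", config.dpi);
  assertPositiveInteger("maxPages", config.maxPages);
  return JSON.stringify(config, null, 2) + "\n";
}

const exampleConfigText = renderConfig({
  ocrServerUrl: "http://localhost:8828/ocr",
  ocrLanguage: "en",
  outputFormat: "json",
});
console.log(exampleConfigText);
```

The returned string can be saved with `writeFileSync("liteparse.config.json", text)` from `node:fs` and then passed to `lit parse` via `--config`.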
Step 5 — HTTP OCR Server API (Advanced)
If the user wants to plug in a custom OCR backend, the server must implement:
- Endpoint: `POST /ocr`
- Accepts: `file` (multipart) and `language` (string) parameters
- Returns:
{
  "results": [
    { "text": "Hello", "bbox": [x1, y1, x2, y2], "confidence": 0.98 }
  ]
}
Ready-to-use wrappers exist for EasyOCR and PaddleOCR in the LiteParse repo.
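When writing a custom backend, it can help to check its responses against the shape above before pointing LiteParse at it. The following sketch is an assumption-laden validator: `OcrResult`, `OcrResponse`, and the two type guards are hypothetical names, the checks cover only the three fields shown in the example response, and the sample coordinates are made up.

```typescript
// Hypothetical types derived from the example response above.
interface OcrResult {
  text: string;
  bbox: [number, number, number, number]; // [x1, y1, x2, y2]
  confidence: number;
}

interface OcrResponse {
  results: OcrResult[];
}

// Checks a single entry: string text, four finite numbers in bbox, numeric confidence.
function isOcrResult(value: unknown): value is OcrResult {
  if (typeof value !== "object" || value === null) return false;
  const r = value as Record<string, unknown>;
  return (
    typeof r.text === "string" &&
    Array.isArray(r.bbox) &&
    r.bbox.length === 4 &&
    r.bbox.every((n: unknown) => typeof n === "number" && Number.isFinite(n)) &&
    typeof r.confidence === "number"
  );
}

// Checks the envelope: an object whose `results` is an array of valid entries.
function isOcrResponse(value: unknown): value is OcrResponse {
  if (typeof value !== "object" || value === null) return false;
  const r = value as Record<string, unknown>;
  return Array.isArray(r.results) && r.results.every((item: unknown) => isOcrResult(item));
}

// Sample payload with made-up coordinates.
const sampleResponse: unknown = JSON.parse(
  '{"results":[{"text":"Hello","bbox":[10,20,110,40],"confidence":0.98}]}'
);
console.log(isOcrResponse(sampleResponse));
// prints: true
```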
Supported Input Formats
| Category | Formats |
| --- | --- |
| PDF | .pdf |
| Word | .doc, .docx, .docm, .odt, .rtf |
| PowerPoint | .ppt, .pptx, .pptm, .odp |
| Spreadsheets | .xls, .xlsx, .xlsm, .ods, .csv, .tsv |
| Images | .jpg, .jpeg, .png, .gif, .bmp, .tiff, .webp, .svg |
Office documents require LibreOffice; images require ImageMagick. LiteParse auto-converts these formats to PDF before parsing.
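A quick lookup can tell a user which extra tool a given file needs before parsing starts. This sketch assumes that every extension in the Word, PowerPoint, and Spreadsheets rows counts as an "Office document" for the LibreOffice requirement; `requiredDependency` is a hypothetical helper, not part of LiteParse.

```typescript
type Dependency = "none" | "LibreOffice" | "ImageMagick";

// Assumption: all Word, PowerPoint, and Spreadsheets extensions go through LibreOffice.
const OFFICE_EXTENSIONS: string[] = [
  ".doc", ".docx", ".docm", ".odt", ".rtf",
  ".ppt", ".pptx", ".pptm", ".odp",
  ".xls", ".xlsx", ".xlsm", ".ods", ".csv", ".tsv",
];

const IMAGE_EXTENSIONS: string[] = [
  ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp", ".svg",
];

// Returns the extra tool a file needs, "none" for PDFs,
// or undefined when the extension is not in the supported list.
function requiredDependency(fileName: string): Dependency | undefined {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0) return undefined;
  const ext = fileName.slice(dot).toLowerCase();
  if (ext === ".pdf") return "none";
  if (OFFICE_EXTENSIONS.indexOf(ext) !== -1) return "LibreOffice";
  if (IMAGE_EXTENSIONS.indexOf(ext) !== -1) return "ImageMagick";
  return undefined;
}

console.log(requiredDependency("slides.PPTX"));
// prints: LibreOffice
```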