nemo-curator
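NVIDIA's GPU-accelerated data curation tool for preparing high-quality training data for large language models (LLMs) and streamlining the data preprocessing pipeline.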
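npx skills add davila7/claude-code-templates --skill nemo-curator

Before / After comparison

Before: Without NeMo Curator, deduplicating, filtering, and formatting large-scale LLM training data (such as Common Crawl) typically relied on CPUs. Processing was slow, taking days or even weeks, and multi-modal data was hard to handle.

After: With the NeMo Curator skill, data processing is GPU-accelerated. Deduplication runs 16× faster than on CPU and multi-modal datasets can be curated efficiently, which significantly shortens the LLM training data preparation cycle and improves data quality.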
SKILL.md
# NeMo Curator - GPU-Accelerated Data Curation

NVIDIA's toolkit for preparing high-quality training data for LLMs.

## When to use NeMo Curator

**Use NeMo Curator when:**

- Preparing LLM training data from web scrapes (Common Crawl)
- Need fast deduplication (16× faster than CPU)
- Curating multi-modal datasets (text, images, video, audio)
- Filtering low-quality or toxic content
- Scaling data processing across GPU cluster

**Performance:**

- 16× faster fuzzy deduplication (8TB RedPajama v2)
- 40% lower TCO vs CPU alternatives
- Near-linear scaling across GPU nodes

**Use alternatives instead:**

- datatrove: CPU-based, open-source data processing
- dolma: Allen AI's data toolkit
- Ray Data: General ML data processing (no curation focus)

## Quick start

### Installation

```bash
# Text curation (CUDA 12)
uv pip install "nemo-curator[text_cuda12]"

# All modalities
uv pip install "nemo-curator[all_cuda12]"

# CPU-only (slower)
uv pip install "nemo-curator[cpu]"
```

### Basic text curation pipeline

```python
from nemo_curator import ScoreFilter, Modify
from nemo_curator.datasets import DocumentDataset
import pandas as pd

# Load data
df = pd.DataFrame({"text": ["Good document", "Bad doc", "Excellent text"]})
dataset = DocumentDataset(df)

# Quality filtering
def quality_score(doc):
    return len(doc["text"].split()) > 5  # Filter short docs

filtered = ScoreFilter(quality_score)(dataset)

# Deduplication
from nemo_curator.modules import ExactDuplicates
deduped = ExactDuplicates()(filtered)

# Save
deduped.to_parquet("curated_data/")
```

## Data curation pipeline

### Stage 1: Quality filtering

```python
from nemo_curator.filters import (
    WordCountFilter,
    RepeatedLinesFilter,
    UrlRatioFilter,
    NonAlphaNumericFilter
)

# Apply 30+ heuristic filters
from nemo_curator import ScoreFilter

# Word count filter
dataset = dataset.filter(WordCountFilter(min_words=50, max_words=100000))

# Remove repetitive content
dataset = dataset.filter(RepeatedLinesFilter(max_repeated_line_fraction=0.3))

# URL ratio filter
dataset = dataset.filter(UrlRatioFilter(max_url_ratio=0.2))
```

### Stage 2: Deduplication

**Exact deduplication:**

```python
from nemo_curator.modules import ExactDuplicates

# Remove exact duplicates
deduped = ExactDuplicates(id_field="id", text_field="text")(dataset)
```

**Fuzzy deduplication** (16× faster on GPU):

```python
from nemo_curator.modules import FuzzyDuplicates

# MinHash + LSH deduplication
fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,  # MinHash parameters
    num_buckets=20,
    hash_method="md5"
)

deduped = fuzzy_dedup(dataset)
```

**Semantic deduplication:**

```python
from nemo_curator.modules import SemanticDuplicates

# Embedding-based deduplication
semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    threshold=0.8  # Cosine similarity threshold
)

deduped = semantic_dedup(dataset)
```

### Stage 3: PII redaction

```python
from nemo_curator.modules import Modify
from nemo_curator.modifiers import PIIRedactor

# Redact personally identifiable information
pii_redactor = PIIRedactor(
    supported_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON", "LOCATION"],
    anonymize_action="replace"  # or "redact"
)

redacted = Modify(pii_redactor)(dataset)
```

### Stage 4: Classifier filtering

```python
from nemo_curator.classifiers import QualityClassifier

# Quality classification
quality_clf = QualityClassifier(
    model_path="nvidia/quality-classifier-deberta",
    batch_size=256,
    device="cuda"
)

# Filter low-quality documents
high_quality = dataset.filter(lambda doc: quality_clf(doc["text"]) > 0.5)
```

## GPU acceleration

### GPU vs CPU performance

| Operation | CPU (16 cores) | GPU (A100) | Speedup |
|---|---|---|---|
| Fuzzy dedup (8TB) | 120 hours | 7.5 hours | 16× |
| Exact dedup (1TB) | 8 hours | 0.5 hours | 16× |
| Quality filtering | 2 hours | 0.2 hours | 10× |

### Multi-GPU scaling

```python
from nemo_curator import get_client
import dask_cuda

# Initialize GPU cluster
client = get_client(cluster_type="gpu", n_workers=8)

# Process with 8 GPUs
deduped = FuzzyDuplicates(...)(dataset)
```

## Multi-modal curation

### Image curation

```python
from nemo_curator.image import (
    AestheticFilter,
    NSFWFilter,
    CLIPEmbedder
)

# Aesthetic scoring
aesthetic_filter = AestheticFilter(threshold=5.0)
filtered_images = aesthetic_filter(image_dataset)

# NSFW detection
nsfw_filter = NSFWFilter(threshold=0.9)
safe_images = nsfw_filter(filtered_images)

# Generate CLIP embeddings
clip_embedder = CLIPEmbedder(model="openai/clip-vit-base-patch32")
image_embeddings = clip_embedder(safe_images)
```

### Video curation

```python
from nemo_curator.video import (
    SceneDetector,
    ClipExtractor,
    InternVideo2Embedder
)

# Detect scenes
scene_detector = SceneDetector(threshold=27.0)
scenes = scene_detector(video_dataset)

# Extract clips
clip_extractor = ClipExtractor(min_duration=2.0, max_duration=10.0)
clips = clip_extractor(scenes)

# Generate embeddings
video_embedder = InternVideo2Embedder()
video_embeddings = video_embedder(clips)
```

### Audio curation

```python
from nemo_curator.audio import (
    ASRInference,
    WERFilter,
    DurationFilter
)

# ASR transcription
asr = ASRInference(model="nvidia/stt_en_fastconformer_hybrid_large_pc")
transcribed = asr(audio_dataset)

# Filter by WER (word error rate)
wer_filter = WERFilter(max_wer=0.3)
high_quality_audio = wer_filter(transcribed)

# Duration filtering
duration_filter = DurationFilter(min_duration=1.0, max_duration=30.0)
filtered_audio = duration_filter(high_quality_audio)
```

## Common patterns

### Web scrape curation (Common Crawl)

```python
from nemo_curator import ScoreFilter, Modify
from nemo_curator.filters import *
from nemo_curator.modules import *
from nemo_curator.datasets import DocumentDataset

# Load Common Crawl data
dataset = DocumentDataset.read_parquet("common_crawl/.parquet")

# Pipeline
pipeline = [
    # 1. Quality filtering
    WordCountFilter(min_words=100, max_words=50000),
    RepeatedLinesFilter(max_repeated_line_fraction=0.2),
    SymbolToWordRatioFilter(max_symbol_to_word_ratio=0.3),
    UrlRatioFilter(max_url_ratio=0.3),

    # 2. Language filtering
    LanguageIdentificationFilter(target_languages=["en"]),

    # 3. Deduplication
    ExactDuplicates(id_field="id", text_field="text"),
    FuzzyDuplicates(id_field="id", text_field="text", num_hashes=260),

    # 4. PII redaction
    PIIRedactor(),

    # 5. NSFW filtering
    NSFWClassifier(threshold=0.8)
]

# Execute
for stage in pipeline:
    dataset = stage(dataset)

# Save
dataset.to_parquet("curated_common_crawl/")
```

### Distributed processing

```python
from nemo_curator import get_client
from dask_cuda import LocalCUDACluster

# Multi-GPU cluster
cluster = LocalCUDACluster(n_workers=8)
client = get_client(cluster=cluster)

# Process large dataset
dataset = DocumentDataset.read_parquet("s3://large_dataset/.parquet")
deduped = FuzzyDuplicates(...)(dataset)

# Cleanup
client.close()
cluster.close()
```

## Performance benchmarks

### Fuzzy deduplication (8TB RedPajama v2)

- CPU (256 cores): 120 hours
- GPU (8× A100): 7.5 hours
- Speedup: 16×

### Exact deduplication (1TB)

- CPU (64 cores): 8 hours
- GPU (4× A100): 0.5 hours
- Speedup: 16×

### Quality filtering (100GB)

- CPU (32 cores): 2 hours
- GPU (2× A100): 0.2 hours
- Speedup: 10×

## Cost comparison

**CPU-based curation** (AWS c5.18xlarge × 10):

- Cost: $3.60/hour × 10 = $36/hour
- Time for 8TB: 120 hours
- Total: $4,320

**GPU-based curation** (AWS p4d.24xlarge × 2):

- Cost: $32.77/hour × 2 = $65.54/hour
- Time for 8TB: 7.5 hours
- Total: $491.55

**Savings:** 89% reduction ($3,828 saved)

## Supported data formats

- Input: Parquet, JSONL, CSV
- Output: Parquet (recommended), JSONL
- WebDataset: TAR archives for multi-modal

## Use cases

**Production deployments:**

- NVIDIA used NeMo Curator to prepare Nemotron-4 training data
- Open-source datasets curated: RedPajama v2, The Pile

## References

- Filtering Guide - 30+ quality filters, heuristics
- Deduplication Guide - Exact, fuzzy, semantic methods

## Resources

- GitHub: https://github.com/NVIDIA/NeMo-Curator ⭐ 500+
- Docs: https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/
- Version: 0.4.0+
- License: Apache 2.0

Listing details:

- Weekly Installs: 172
- Repository: davila7/claude-…emplates
- GitHub Stars: 23.0K
- First Seen: Jan 21, 2026
- Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Warn)
- Installed on: opencode 137, claude-code 136, gemini-cli 130, cursor 122, codex 117, antigravity 107
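To make the Stage 1 heuristics (word count, repeated lines, URL ratio) concrete, here is a minimal pure-Python sketch of what such checks might compute. It is not NeMo Curator's implementation. The function names, the simplified URL regex, and the exact definition of each ratio are assumptions for illustration; only the thresholds are taken from the Stage 1 example.

```python
import re
from collections import Counter

# Assumed, simplified URL pattern (illustrative only).
URL_RE = re.compile(r"https?://\S+")


def word_count_ok(text, min_words=50, max_words=100_000):
    """Keep documents whose whitespace-separated word count is in range."""
    n = len(text.split())
    return min_words <= n <= max_words


def repeated_line_fraction(text):
    """Fraction of non-empty lines that duplicate an earlier line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(lines)


def url_ratio(text):
    """Fraction of characters that belong to URLs."""
    if not text:
        return 0.0
    url_chars = sum(len(m) for m in URL_RE.findall(text))
    return url_chars / len(text)


def passes_heuristics(text):
    """Combine the three checks with the Stage 1 thresholds."""
    return (
        word_count_ok(text)
        and repeated_line_fraction(text) <= 0.3
        and url_ratio(text) <= 0.2
    )
```

Each check is a cheap, per-document function, which is why heuristics like these are usually run before the more expensive deduplication and classifier stages.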
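The fuzzy deduplication stage is described as MinHash + LSH with `num_hashes=260` and `num_buckets=20`. The sketch below illustrates that general technique in plain Python on the CPU, assuming the 260 hashes are split evenly into 20 bands of 13. It does not reproduce the library's GPU implementation, and the character 5-gram shingling, the seeded MD5 hash, and all function names are assumptions.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 260   # signature length, as in the FuzzyDuplicates example
NUM_BUCKETS = 20   # LSH bands; 260 / 20 = 13 hashes per band (assumed split)


def shingles(text, k=5):
    """Set of character k-grams after whitespace/case normalization."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, shingle):
    """Seeded 64-bit hash derived from MD5 (one 'hash function' per seed)."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(text):
    """For each seeded hash function, keep the minimum over all shingles."""
    sh = shingles(text)
    return [min(_hash(seed, s) for s in sh) for seed in range(NUM_HASHES)]


def candidate_pairs(docs):
    """Documents that agree on every hash in at least one band are candidates."""
    rows = NUM_HASHES // NUM_BUCKETS
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(text)
        for band in range(NUM_BUCKETS):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog",
    "c": "completely different content about gpu clusters",
}
pairs = candidate_pairs(docs)
```

Banding trades precision for recall: more bands with fewer rows each flag more near-duplicates as candidates, so candidate pairs are normally verified (for example by Jaccard similarity) before anything is dropped.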
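The cost comparison can be re-derived from the listed hourly rates, instance counts, and run times. The short sketch below redoes that arithmetic; the prices are the ones quoted in the cost comparison section (not verified against current AWS pricing), and the helper name `curation_cost` is hypothetical.

```python
def curation_cost(hourly_rate_usd, instances, hours):
    """Total cost of running `instances` machines for `hours` hours."""
    return hourly_rate_usd * instances * hours


# Figures quoted in the cost comparison section.
cpu_total = curation_cost(3.60, 10, 120)    # c5.18xlarge x 10, 120 hours
gpu_total = curation_cost(32.77, 2, 7.5)    # p4d.24xlarge x 2, 7.5 hours

savings = cpu_total - gpu_total
savings_pct = 100 * savings / cpu_total

print(f"CPU: ${cpu_total:,.2f}")
print(f"GPU: ${gpu_total:,.2f}")
print(f"Saved: ${savings:,.2f} ({savings_pct:.0f}%)")
```

The GPU fleet costs more per hour ($65.54 vs $36), so the saving comes entirely from the shorter run time.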