
nemo-curator

by @davila7v
4.4 (121 ratings)

NVIDIA's GPU-accelerated data curation tool for preparing high-quality training data for Large Language Models (LLMs) and optimizing data preprocessing workflows.

Tags: NVIDIA NeMo · LLM Data Curation · Dataset Management · Data Preprocessing · AI Model Training · GitHub
Installation
npx skills add davila7/claude-code-templates --skill nemo-curator

Before / After Comparison

Before

Before NeMo Curator, processing large-scale LLM training data (such as Common Crawl) for deduplication, filtering, and formatting typically relied on CPUs. This was slow, often taking days or even weeks, and handled multimodal data poorly.

After

With NeMo Curator, data processing is GPU-accelerated, achieving 16× faster deduplication than CPUs. It also curates multimodal datasets efficiently, significantly shortening the LLM training-data preparation cycle and improving data quality.

SKILL.md

nemo-curator

# NeMo Curator - GPU-Accelerated Data Curation

NVIDIA's toolkit for preparing high-quality training data for LLMs.

## When to use NeMo Curator

Use NeMo Curator when:

- Preparing LLM training data from web scrapes (Common Crawl)
- You need fast deduplication (16× faster than CPU)
- Curating multi-modal datasets (text, images, video, audio)
- Filtering low-quality or toxic content
- Scaling data processing across a GPU cluster

Performance:

- 16× faster fuzzy deduplication (8TB RedPajama v2)
- 40% lower TCO vs CPU alternatives
- Near-linear scaling across GPU nodes

Use alternatives instead:

- datatrove: CPU-based, open-source data processing
- dolma: Allen AI's data toolkit
- Ray Data: general ML data processing (no curation focus)

## Quick start

### Installation

```bash
# Text curation (CUDA 12)
uv pip install "nemo-curator[text_cuda12]"

# All modalities
uv pip install "nemo-curator[all_cuda12]"

# CPU-only (slower)
uv pip install "nemo-curator[cpu]"
```

### Basic text curation pipeline

```python
from nemo_curator import ScoreFilter, Modify
from nemo_curator.datasets import DocumentDataset
import pandas as pd

# Load data
df = pd.DataFrame({"text": ["Good document", "Bad doc", "Excellent text"]})
dataset = DocumentDataset(df)

# Quality filtering: filter out short docs
def quality_score(doc):
    return len(doc["text"].split()) > 5

filtered = ScoreFilter(quality_score)(dataset)

# Deduplication
from nemo_curator.modules import ExactDuplicates
deduped = ExactDuplicates()(filtered)

# Save
deduped.to_parquet("curated_data/")
```

## Data curation pipeline

### Stage 1: Quality filtering

```python
from nemo_curator.filters import (
    WordCountFilter,
    RepeatedLinesFilter,
    UrlRatioFilter,
    NonAlphaNumericFilter,
)

# Apply heuristic filters (30+ available)
from nemo_curator import ScoreFilter

# Word count filter
dataset = dataset.filter(WordCountFilter(min_words=50, max_words=100000))

# Remove repetitive content
dataset = dataset.filter(RepeatedLinesFilter(max_repeated_line_fraction=0.3))

# URL ratio filter
dataset = dataset.filter(UrlRatioFilter(max_url_ratio=0.2))
```
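Beyond the built-in heuristics, you can add project-specific rules as custom filters. The sketch below is a minimal example, assuming the `DocumentFilter` base class with `score_document`/`keep_document` hooks that NeMo Curator's filter classes follow; `MeanWordLengthFilter` is a hypothetical name invented for this illustration, so verify the base-class API against your installed version.

```python
from nemo_curator import ScoreFilter
from nemo_curator.filters import DocumentFilter

class MeanWordLengthFilter(DocumentFilter):
    """Hypothetical example: keep documents whose mean word length is
    plausible for natural text; extreme averages often indicate junk."""

    def __init__(self, min_mean: float = 3.0, max_mean: float = 10.0):
        super().__init__()
        self._min = min_mean
        self._max = max_mean

    def score_document(self, text: str) -> float:
        # Score = average word length of the document
        words = text.split()
        return sum(len(w) for w in words) / len(words) if words else 0.0

    def keep_document(self, score: float) -> bool:
        # Keep documents whose score falls inside the configured range
        return self._min <= score <= self._max

# Applied like any built-in filter
dataset = ScoreFilter(MeanWordLengthFilter(), text_field="text")(dataset)
```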
### Stage 2: Deduplication

Exact deduplication:

```python
from nemo_curator.modules import ExactDuplicates

# Remove exact duplicates
deduped = ExactDuplicates(id_field="id", text_field="text")(dataset)
```

Fuzzy deduplication (16× faster on GPU):

```python
from nemo_curator.modules import FuzzyDuplicates

# MinHash + LSH deduplication
fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,  # MinHash parameters
    num_buckets=20,
    hash_method="md5",
)
deduped = fuzzy_dedup(dataset)
```

Semantic deduplication:

```python
from nemo_curator.modules import SemanticDuplicates

# Embedding-based deduplication
semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    threshold=0.8,  # Cosine similarity threshold
)
deduped = semantic_dedup(dataset)
```
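For intuition about what `FuzzyDuplicates` computes: MinHash signatures approximate the Jaccard similarity between documents' shingle sets, and LSH buckets candidates that share signature fragments. The standard-library sketch below illustrates only the MinHash estimate; it is not NeMo Curator's GPU implementation, and the helper names (`shingles`, `minhash_signature`, `estimated_jaccard`) are invented for this example.

```python
import hashlib

def shingles(text: str, n: int = 5) -> set[str]:
    """Character n-gram shingles of a document."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash_signature(text: str, num_hashes: int = 260) -> list[int]:
    """One minimum per seeded hash function; similar documents agree
    on many signature positions."""
    doc_shingles = shingles(text)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in doc_shingles
        ))
    return signature

def estimated_jaccard(a: str, b: str) -> float:
    """Fraction of matching signature positions estimates Jaccard similarity."""
    sig_a, sig_b = minhash_signature(a), minhash_signature(b)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Near-duplicates score close to 1.0; unrelated text scores near 0.0
print(estimated_jaccard(
    "NeMo Curator accelerates data curation on GPUs.",
    "NeMo Curator accelerates data curation on the GPU.",
))
```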
### Stage 3: PII redaction

```python
from nemo_curator.modules import Modify
from nemo_curator.modifiers import PIIRedactor

# Redact personally identifiable information
pii_redactor = PIIRedactor(
    supported_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON", "LOCATION"],
    anonymize_action="replace",  # or "redact"
)
redacted = Modify(pii_redactor)(dataset)
```

### Stage 4: Classifier filtering

```python
from nemo_curator.classifiers import QualityClassifier

# Quality classification
quality_clf = QualityClassifier(
    model_path="nvidia/quality-classifier-deberta",
    batch_size=256,
    device="cuda",
)

# Filter low-quality documents
high_quality = dataset.filter(lambda doc: quality_clf(doc["text"]) > 0.5)
```

## GPU acceleration

### GPU vs CPU performance

| Operation | CPU (16 cores) | GPU (A100) | Speedup |
|---|---|---|---|
| Fuzzy dedup (8TB) | 120 hours | 7.5 hours | 16× |
| Exact dedup (1TB) | 8 hours | 0.5 hours | 16× |
| Quality filtering | 2 hours | 0.2 hours | 10× |

### Multi-GPU scaling

```python
from nemo_curator import get_client
import dask_cuda

# Initialize GPU cluster
client = get_client(cluster_type="gpu", n_workers=8)

# Process with 8 GPUs
deduped = FuzzyDuplicates(...)(dataset)
```

## Multi-modal curation

### Image curation

```python
from nemo_curator.image import (
    AestheticFilter,
    NSFWFilter,
    CLIPEmbedder,
)

# Aesthetic scoring
aesthetic_filter = AestheticFilter(threshold=5.0)
filtered_images = aesthetic_filter(image_dataset)

# NSFW detection
nsfw_filter = NSFWFilter(threshold=0.9)
safe_images = nsfw_filter(filtered_images)

# Generate CLIP embeddings
clip_embedder = CLIPEmbedder(model="openai/clip-vit-base-patch32")
image_embeddings = clip_embedder(safe_images)
```

### Video curation

```python
from nemo_curator.video import (
    SceneDetector,
    ClipExtractor,
    InternVideo2Embedder,
)

# Detect scenes
scene_detector = SceneDetector(threshold=27.0)
scenes = scene_detector(video_dataset)

# Extract clips
clip_extractor = ClipExtractor(min_duration=2.0, max_duration=10.0)
clips = clip_extractor(scenes)

# Generate embeddings
video_embedder = InternVideo2Embedder()
video_embeddings = video_embedder(clips)
```

### Audio curation

```python
from nemo_curator.audio import (
    ASRInference,
    WERFilter,
    DurationFilter,
)

# ASR transcription
asr = ASRInference(model="nvidia/stt_en_fastconformer_hybrid_large_pc")
transcribed = asr(audio_dataset)

# Filter by WER (word error rate)
wer_filter = WERFilter(max_wer=0.3)
high_quality_audio = wer_filter(transcribed)

# Duration filtering
duration_filter = DurationFilter(min_duration=1.0, max_duration=30.0)
filtered_audio = duration_filter(high_quality_audio)
```

## Common patterns

### Web scrape curation (Common Crawl)

```python
from nemo_curator import ScoreFilter, Modify
from nemo_curator.filters import *
from nemo_curator.modules import *
from nemo_curator.datasets import DocumentDataset

# Load Common Crawl data
dataset = DocumentDataset.read_parquet("common_crawl/*.parquet")

# Pipeline
pipeline = [
    # 1. Quality filtering
    WordCountFilter(min_words=100, max_words=50000),
    RepeatedLinesFilter(max_repeated_line_fraction=0.2),
    SymbolToWordRatioFilter(max_symbol_to_word_ratio=0.3),
    UrlRatioFilter(max_url_ratio=0.3),
    # 2. Language filtering
    LanguageIdentificationFilter(target_languages=["en"]),
    # 3. Deduplication
    ExactDuplicates(id_field="id", text_field="text"),
    FuzzyDuplicates(id_field="id", text_field="text", num_hashes=260),
    # 4. PII redaction
    PIIRedactor(),
    # 5. NSFW filtering
    NSFWClassifier(threshold=0.8),
]

# Execute
for stage in pipeline:
    dataset = stage(dataset)

# Save
dataset.to_parquet("curated_common_crawl/")
```

### Distributed processing

```python
from nemo_curator import get_client
from dask_cuda import LocalCUDACluster

# Multi-GPU cluster
cluster = LocalCUDACluster(n_workers=8)
client = get_client(cluster=cluster)

# Process large dataset
dataset = DocumentDataset.read_parquet("s3://large_dataset/*.parquet")
deduped = FuzzyDuplicates(...)(dataset)

# Cleanup
client.close()
cluster.close()
```

## Performance benchmarks

Fuzzy deduplication (8TB RedPajama v2):

- CPU (256 cores): 120 hours
- GPU (8× A100): 7.5 hours
- Speedup: 16×

Exact deduplication (1TB):

- CPU (64 cores): 8 hours
- GPU (4× A100): 0.5 hours
- Speedup: 16×

Quality filtering (100GB):

- CPU (32 cores): 2 hours
- GPU (2× A100): 0.2 hours
- Speedup: 10×

## Cost comparison

CPU-based curation (AWS c5.18xlarge × 10):

- Cost: $3.60/hour × 10 = $36/hour
- Time for 8TB: 120 hours
- Total: $4,320

GPU-based curation (AWS p4d.24xlarge × 2):

- Cost: $32.77/hour × 2 = $65.54/hour
- Time for 8TB: 7.5 hours
- Total: $491.55

Savings: 89% reduction ($3,828 saved)

## Supported data formats

- Input: Parquet, JSONL, CSV
- Output: Parquet (recommended), JSONL
- WebDataset: TAR archives for multi-modal

(A minimal read/write sketch follows the Resources list below.)

## Use cases

- Production deployments: NVIDIA used NeMo Curator to prepare Nemotron-4 training data
- Open-source datasets curated: RedPajama v2, The Pile

## References

- Filtering Guide - 30+ quality filters, heuristics
- Deduplication Guide - exact, fuzzy, and semantic methods

## Resources

- GitHub: https://github.com/NVIDIA/NeMo-Curator ⭐ 500+
- Docs: https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/
- Version: 0.4.0+
- License: Apache 2.0
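As referenced in the Supported data formats section, a minimal read/write round trip might look like the sketch below. It assumes `DocumentDataset` exposes a `read_json` reader alongside the `read_parquet`/`to_parquet` calls shown earlier; check the reader names against your installed version.

```python
from nemo_curator.datasets import DocumentDataset

# Read JSONL input (one JSON object per line, with a "text" field).
# Assumption: read_json mirrors the read_parquet reader used above.
dataset = DocumentDataset.read_json("raw_data/")

# ... run filtering, deduplication, and redaction stages here ...

# Write Parquet, the recommended output format.
dataset.to_parquet("curated_data/")
```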

User Reviews (0)

No reviews yet

Statistics

Installs: 3.0K (172 weekly)
Rating: 4.4 / 5.0
Repository: davila7/claude-code-templates (GitHub stars: 23.0K)
First Seen: Jan 21, 2026
Updated: April 27, 2026
Comparisons: 1
Security Audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Warn
Installed on: opencode (137), claude-code (136), gemini-cli (130), cursor (122), codex (117), antigravity (107)

User Rating

4.4 (121 ratings)

5 stars: 27%
4 stars: 51%
3 stars: 20%
2 stars: 2%
1 star: 0%


Compatible Platforms

🔧Claude Code
🔧OpenClaw
🔧OpenCode
🔧Codex
🔧Gemini CLI
🔧GitHub Copilot
🔧Amp
🔧Kimi CLI

Timeline

Created: March 17, 2026
Last Updated: April 27, 2026