
# training-llms-megatron

by @davila7
Rating: 4.8 (57 ratings)

Uses Megatron-Core for large-scale LLM training, supporting models from 2B to 462B parameters while maximizing H100 GPU utilization.

Tags: LLM Training · Megatron-LM · Deep Learning · Distributed Training · GPU Optimization · GitHub
## Installation

```bash
npx skills add davila7/claude-code-templates --skill training-llms-megatron
```

## Before / After Comparison

### Before

Training large-scale LLMs is hampered by low resource utilization and poor scalability, making it difficult to support models from 2B to 462B parameters and driving up training time and cost.

### After

With Megatron-Core, large-scale LLM training fully leverages H100 GPUs and efficiently scales to ultra-large models, significantly shortening training cycles and reducing cost.

## SKILL.md


**Megatron-Core - Large-Scale LLM Training**

### Quick start

Megatron-Core trains LLMs from 2B to 462B parameters with up to 47% Model FLOP Utilization (MFU) on H100 GPUs through advanced parallelism strategies.

Installation:

```bash
# Docker (recommended)
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:25.04-py3

# Or pip
pip install megatron-core
```

Simple distributed training:

```bash
# Train with 2 GPUs using data parallelism
torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py

# Or LLaMA-3 8B training
./examples/llama/train_llama3_8b_fp8.sh
```

### Common workflows

#### Workflow 1: Train LLaMA-style model with 3D parallelism

Copy this checklist:

LLaMA Training Setup:

- [ ] Step 1: Choose parallelism configuration
- [ ] Step 2: Configure training hyperparameters
- [ ] Step 3: Launch distributed training
- [ ] Step 4: Monitor performance metrics

**Step 1: Choose parallelism configuration**

Model size determines the parallelism strategy:

| Model Size | GPUs | Tensor Parallel | Pipeline Parallel | Data Parallel | Context Parallel |
|------------|------|-----------------|-------------------|---------------|------------------|
| 7B         | 8    | 1               | 1                 | 8             | 1                |
| 13B        | 8    | 2               | 1                 | 4             | 1                |
| 70B        | 64   | 4               | 4                 | 4             | 1                |
| 405B       | 128  | 8               | 8                 | 1             | 2                |

**Step 2: Configure training hyperparameters**

```bash
#!/bin/bash
# train_llama_70b.sh

GPUS_PER_NODE=8
NNODES=8        # 64 GPUs total
TP=4            # Tensor parallel
PP=4            # Pipeline parallel
CP=1            # Context parallel

# LLaMA 70B configuration
MODEL_SIZE=70   # Billion parameters
HIDDEN_SIZE=8192
NUM_LAYERS=80
NUM_HEADS=64
SEQ_LENGTH=4096

# Training hyperparameters
MICRO_BATCH=1
GLOBAL_BATCH=1024
LR=3e-4

torchrun \
  --nproc_per_node=$GPUS_PER_NODE \
  --nnodes=$NNODES \
  pretrain_gpt.py \
  --tensor-model-parallel-size $TP \
  --pipeline-model-parallel-size $PP \
  --context-parallel-size $CP \
  --sequence-parallel \
  --num-layers $NUM_LAYERS \
  --hidden-size $HIDDEN_SIZE \
  --num-attention-heads $NUM_HEADS \
  --seq-length $SEQ_LENGTH \
  --max-position-embeddings $SEQ_LENGTH \
  --micro-batch-size $MICRO_BATCH \
  --global-batch-size $GLOBAL_BATCH \
  --lr $LR \
  --train-iters 100000 \
  --lr-decay-style cosine \
  --lr-warmup-iters 2000 \
  --weight-decay 0.1 \
  --clip-grad 1.0 \
  --bf16 \
  --use-mcore-models \
  --transformer-impl transformer_engine \
  --data-path /path/to/data \
  --vocab-file /path/to/vocab.json \
  --merge-file /path/to/merges.txt
```

**Step 3: Launch distributed training**

```bash
# Single node (8 GPUs)
bash train_llama_70b.sh

# Multi-node with SLURM
sbatch --nodes=8 --gpus-per-node=8 train_llama_70b.sh
```

**Step 4: Monitor performance metrics**

Key metrics to track:

- Model FLOP Utilization (MFU): target >40% on H100
- Throughput: tokens/sec/GPU
- Memory usage: <80GB per GPU for a 70B model
- Loss: should decrease steadily

A quick way to estimate MFU from the throughput in your logs is sketched below.
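The MFU target above can be sanity-checked directly from logged throughput. Below is a minimal sketch of a hypothetical helper (not part of Megatron-LM), assuming the standard ~6 FLOPs per parameter per token approximation for transformer training and an H100 SXM dense BF16 peak of roughly 989 TFLOP/s; both constants are assumptions, so treat the output as an estimate.

```bash
#!/usr/bin/env bash
# mfu_estimate.sh -- back-of-the-envelope MFU from logged throughput.
# Hypothetical helper, NOT part of Megatron-LM. Assumes:
#   * training FLOPs/token ~= 6 * parameter count (standard approximation)
#   * H100 SXM dense BF16 peak ~= 989 TFLOP/s per GPU

PARAMS_B=${1:?usage: mfu_estimate.sh <params_in_billions> <tokens_per_sec_per_gpu>}
TOK_PER_SEC=${2:?usage: mfu_estimate.sh <params_in_billions> <tokens_per_sec_per_gpu>}
PEAK_TFLOPS=989

awk -v p="$PARAMS_B" -v t="$TOK_PER_SEC" -v peak="$PEAK_TFLOPS" 'BEGIN {
  # achieved TFLOP/s/GPU = 6 * (p * 1e9 params) * (t tokens/s) / 1e12
  achieved = 6 * p * 1e9 * t / 1e12
  printf "achieved: %.1f TFLOP/s/GPU\n", achieved
  printf "MFU:      %.1f%%\n", 100 * achieved / peak
}'
```

For example, `bash mfu_estimate.sh 70 940` reports about 39.9% MFU, right at the >40% target for a 70B run; a much lower number suggests revisiting micro-batch size or parallelism degrees (see Workflow 3).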
#### Workflow 2: Configure Mixture of Experts (MoE) training

For sparse MoE models like Mixtral. Copy this checklist:

MoE Training:

- [ ] Step 1: Configure expert parallelism
- [ ] Step 2: Set MoE hyperparameters
- [ ] Step 3: Launch training with EP

**Step 1: Configure expert parallelism**

```bash
# Mixtral 8x7B example
TENSOR_PARALLEL=2
PIPELINE_PARALLEL=1
EXPERT_PARALLEL=4    # Split 8 experts across 4 GPUs
DATA_PARALLEL=4

TOTAL_GPUS=$((TENSOR_PARALLEL * PIPELINE_PARALLEL * EXPERT_PARALLEL * DATA_PARALLEL))
# = 2 * 1 * 4 * 4 = 32 GPUs
```

**Step 2: Set MoE hyperparameters**

```bash
torchrun \
  --nproc_per_node=8 \
  pretrain_gpt.py \
  --tensor-model-parallel-size 2 \
  --pipeline-model-parallel-size 1 \
  --expert-model-parallel-size 4 \
  --num-experts 8 \
  --moe-router-topk 2 \
  --moe-router-load-balancing-type aux_loss \
  --moe-aux-loss-coeff 0.01 \
  --hidden-size 4096 \
  --num-layers 32 \
  --num-attention-heads 32 \
  --seq-length 4096 \
  --max-position-embeddings 4096 \
  --bf16 \
  --use-mcore-models \
  --transformer-impl transformer_engine \
  --data-path /path/to/data \
  --vocab-file /path/to/vocab.json \
  --merge-file /path/to/merges.txt
```

**Step 3: Launch training with EP**

Expert parallelism distributes different experts across GPUs, reducing memory while maintaining capacity:

- Without EP: 8 experts × 7B = 56B expert parameters per GPU
- With EP=4: 2 experts × 7B = 14B expert parameters per GPU
- Savings: 75% reduction in expert memory

#### Workflow 3: Optimize for maximum throughput

Achieve 47% MFU on H100. Copy this checklist:

Performance Optimization:

- [ ] Step 1: Enable Flash Attention
- [ ] Step 2: Use FP8 precision (H100)
- [ ] Step 3: Optimize micro-batch size
- [ ] Step 4: Tune parallelism degrees

**Step 1: Enable optimizations**

```bash
--use-mcore-models                      # Use Megatron Core models
--transformer-impl transformer_engine   # Use Transformer Engine
--sequence-parallel                     # Reduce activation memory (use with TP)
```

**Step 2: Use FP8 precision (H100 only)**

```bash
--fp8-hybrid   # FP8 mixed precision training
# Transformer Engine handles FP8 automatically
```

Result: 1.5-2x speedup on H100 vs BF16. A quick capability check is sketched below.
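Since `--fp8-hybrid` only pays off on FP8-capable silicon, a quick hardware check before launching avoids a failed run or a silent fallback. The sketch below is a hypothetical pre-flight script (not part of Megatron-LM) and assumes a driver recent enough to expose the `compute_cap` query field to `nvidia-smi`:

```bash
#!/usr/bin/env bash
# fp8_check.sh -- confirm local GPUs support FP8 before enabling it.
# Hypothetical pre-flight check, NOT part of Megatron-LM.
# FP8 needs compute capability >= 8.9 (Ada); Hopper is 9.0, Blackwell 10.0.

nvidia-smi --query-gpu=index,compute_cap --format=csv,noheader |
while IFS=', ' read -r idx cap; do
  # numeric compare via awk, since caps like "8.9" are floats
  if awk -v c="$cap" 'BEGIN { exit !(c >= 8.9) }'; then
    echo "GPU $idx (compute capability $cap): FP8 supported"
  else
    echo "GPU $idx (compute capability $cap): no FP8 -- train with --bf16 instead"
  fi
done
```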
**Step 3: Optimize micro-batch size**

Find the largest micro-batch that fits in memory:

```bash
# Start with 1, increase until OOM
for MBS in 1 2 4 8; do
  echo "Testing micro-batch-size=$MBS"
  torchrun ... \
    --micro-batch-size $MBS
done
```

Typical values:

- 7B model: 4-8
- 70B model: 1-2
- 405B model: 1

**Step 4: Tune parallelism degrees**

Rules of thumb:

- Tensor Parallel: use ≤8 (limited by NVLink within a node)
- Pipeline Parallel: use for >70B models
- Context Parallel: use for sequences >8K tokens
- Data Parallel: fill remaining GPUs

Example 405B on 128 H100s:

- TP=8 (1 node)
- PP=8 (across nodes)
- CP=2 (long sequences)
- DP=1
- Total = 8 × 8 × 2 × 1 = 128 GPUs

A pre-flight check for these divisibility rules is sketched below.
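A mismatched layout usually only surfaces as an assertion several minutes into startup, so a cheap local check is worth running first. The following is a minimal sketch of a hypothetical pre-flight script; the divisibility rules it encodes (world size = TP × PP × CP × DP, layers divisible by PP, attention heads divisible by TP, global batch divisible by micro-batch × DP) are assumptions inferred from the configurations in this guide, not an exhaustive list of Megatron's checks.

```bash
#!/usr/bin/env bash
# check_parallelism.sh -- pre-flight check for a 3D-parallel layout.
# Hypothetical helper, NOT part of Megatron-LM; the divisibility rules
# below are assumptions based on the configurations in this guide.

WORLD_SIZE=${WORLD_SIZE:-64}    # total GPUs
TP=${TP:-4} PP=${PP:-4} CP=${CP:-1}
NUM_LAYERS=${NUM_LAYERS:-80}
NUM_HEADS=${NUM_HEADS:-64}
MICRO_BATCH=${MICRO_BATCH:-1}
GLOBAL_BATCH=${GLOBAL_BATCH:-1024}

fail() { echo "FAIL: $1" >&2; exit 1; }

(( WORLD_SIZE % (TP * PP * CP) == 0 )) \
  || fail "WORLD_SIZE must be divisible by TP*PP*CP"
DP=$(( WORLD_SIZE / (TP * PP * CP) ))   # data parallel fills the rest

(( NUM_LAYERS % PP == 0 )) \
  || fail "NUM_LAYERS must split evenly across $PP pipeline stages"
(( NUM_HEADS % TP == 0 )) \
  || fail "NUM_HEADS must split evenly across $TP tensor ranks"
(( GLOBAL_BATCH % (MICRO_BATCH * DP) == 0 )) \
  || fail "GLOBAL_BATCH must be a multiple of MICRO_BATCH * DP"

echo "OK: TP=$TP PP=$PP CP=$CP DP=$DP on $WORLD_SIZE GPUs"
```

With the defaults it prints `OK: TP=4 PP=4 CP=1 DP=4 on 64 GPUs`, matching the 70B row in Workflow 1; `WORLD_SIZE=128 TP=8 PP=8 CP=2 bash check_parallelism.sh` reproduces the 405B layout with DP=1.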
### When to use vs alternatives

Use Megatron-Core when:

- Training models >10B parameters
- You need maximum efficiency (target >40% MFU)
- Using NVIDIA GPUs (A100, H100)
- Running production training at scale
- You want fine-grained parallelism control

Use alternatives instead:

- PyTorch FSDP: models <70B, simpler API, PyTorch-native
- DeepSpeed: easier setup, good for <100B models
- HuggingFace Accelerate: prototyping, simpler workflows
- LitGPT: educational, single-file implementations

### Common issues

**Issue: Low GPU utilization (<30% MFU)**

Causes: micro-batch too small, too much parallelism overhead, not using Flash Attention.

Fixes:

```bash
# Increase micro-batch
--micro-batch-size 4             # Was 1

# Enable optimizations
--use-flash-attn
--sequence-parallel

# Reduce TP if >8
--tensor-model-parallel-size 4   # Was 16
```

**Issue: Out of memory**

Reduce memory with:

```bash
--tensor-model-parallel-size 2   # Split model across GPUs
--recompute-granularity full     # Gradient checkpointing
--recompute-method block         # Checkpoint transformer blocks
--recompute-num-layers 1         # Checkpoint every layer
```

Or use CPU/NVMe offloading:

```bash
--cpu-optimizer              # Offload optimizer to CPU
--cpu-optimizer-type ADAM    # CPU Adam variant
```

**Issue: Training slower than expected**

Check:

- Network bottleneck: ensure InfiniBand/NVLink is enabled
- Pipeline bubbles: use an interleaved pipeline schedule (`--num-layers-per-virtual-pipeline-stage 2`)
- Data loading: use a fast data loader (`--dataloader-type cyclic`)

**Issue: Diverging loss**

Stabilize training:

```bash
--lr-warmup-iters 2000    # Longer warmup
--clip-grad 1.0           # Gradient clipping
--init-method-std 0.006   # Smaller init
--attention-dropout 0.0   # No dropout in attention
--hidden-dropout 0.0      # No dropout in FFN
```

### Advanced topics

- Parallelism strategies: see references/parallelism-guide.md for a detailed comparison of TP/PP/DP/CP/EP with performance analysis and when to use each.
- Performance benchmarks: see references/benchmarks.md for MFU numbers across model sizes and GPU configurations.
- Production configurations: see references/production-examples.md for real-world setups from LLaMA 3 405B, Nemotron-4 340B, and DeepSeek-V3 671B.
- Training recipes: see references/training-recipes.md for complete hyperparameter configurations for GPT/LLaMA/Mixtral architectures.

### Hardware requirements

- GPU: NVIDIA Ampere or newer (A100, H100, B200); Turing works but is slower; FP8 requires Hopper/Ada/Blackwell
- Network: InfiniBand or 400Gb+ Ethernet for multi-node training
- Memory per GPU:
  - 7B model: 40GB+
  - 70B model: 80GB (with TP=4)
  - 405B model: 80GB (with TP=8, PP=8)
- Storage: fast NVMe for checkpoints (1TB+ for 70B+ models)

### Resources

- Docs: https://docs.nvidia.com/megatron-core/
- GitHub: https://github.com/NVIDIA/Megatron-LM
- Papers: "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism" (2019); "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM" (2021)
- NeMo Framework: https://docs.nvidia.com/nemo-framework/ (built on Megatron-Core)

## Statistics

- Weekly Installs: 178
- Total Installs: 1.3K
- Rating: 4.8 / 5.0
- Repository: davila7/claude-code-templates
- GitHub Stars: 23.0K
- First Seen: Jan 21, 2026
- Comparisons: 1
- Security Audits: Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Pass
- Installed on: opencode (145), claude-code (143), gemini-cli (137), cursor (129), codex (126), github-copilot (116)



## Compatible Platforms

- Claude Code
- OpenClaw
- OpenCode
- Codex
- Gemini CLI
- GitHub Copilot
- Amp
- Kimi CLI

## Timeline

- Created: March 17, 2026
- Last Updated: March 17, 2026