Distributed Training

May 26, 2026 · about 1 min read

Distributed training techniques and large-scale model training.

📚 Existing Docs

Megatron Parallel - Megatron parallelism strategies
NCCL Test - NCCL communication tests

🔧 Topic Overview

1. Parallelism Strategies

Data Parallelism

DistributedDataParallel (DDP) - PyTorch data parallelism
Horovod - cross-framework distributed training
Parameter Server - parameter server architecture
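
As a rough illustration of what the data-parallel entries above have in common, the sketch below simulates averaging per-worker gradients in plain Python. The function name `average_gradients` and the toy numbers are assumptions made for illustration. They are not the API of DDP, Horovod, or any parameter server; real systems typically perform this step with a collective such as all-reduce.

```python
from typing import List


def average_gradients(worker_grads: List[List[float]]) -> List[float]:
    """Element-wise mean of per-worker gradient vectors.

    Hypothetical helper: it simulates, on one process, the averaging
    that a data-parallel step performs across workers.
    """
    if not worker_grads:
        raise ValueError("need at least one worker gradient")
    n_workers = len(worker_grads)
    n_params = len(worker_grads[0])
    if any(len(g) != n_params for g in worker_grads):
        raise ValueError("all workers must report the same number of gradients")
    return [sum(g[i] for g in worker_grads) / n_workers for i in range(n_params)]


# Two simulated workers, each holding gradients for the same two parameters.
grads = [[1.0, 2.0], [3.0, 4.0]]
averaged = average_gradients(grads)
print(averaged)
```

Every worker then applies the same averaged gradient, which keeps the model replicas in sync.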

Model Parallelism

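Pipeline Parallelism - partitions model layers into stages across devices
Tensor Parallelism - splits individual weight tensors across devices
Sequence Parallelism - splits activations along the sequence dimension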
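
To make the tensor-parallel idea more concrete, here is a minimal single-process sketch of a column-split matrix multiply. The helper names (`matmul`, `split_columns`, `column_parallel_matmul`) are hypothetical and the "devices" are only simulated with list slices; this is not how Megatron-LM or any specific library implements it.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix multiply, used as the reference result."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split the weight matrix into `parts` equal column shards."""
    n_cols = len(w[0])
    if n_cols % parts != 0:
        raise ValueError("column count must be divisible by the number of shards")
    width = n_cols // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]


def column_parallel_matmul(x: Matrix, w: Matrix, parts: int) -> Matrix:
    """Multiply x by each column shard, then concatenate the partial outputs."""
    # Each partial product would run on its own device in a real setup.
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    # Concatenating along columns plays the role of an all-gather.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


x = [[1.0, 2.0]]
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]
result = column_parallel_matmul(x, w, parts=2)
print(result)
```

The sharded result matches the unsharded product, which is the property that lets each device hold only a slice of the weights.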

Advanced Techniques

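3D Parallelism - combines data, tensor, and pipeline parallelism
Expert Parallelism - distributes Mixture-of-Experts (MoE) experts across devices
Gradient Compression - reduces gradient communication volume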
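
One common form of gradient compression is top-k sparsification, where only the largest-magnitude entries are sent. The sketch below assumes that variant; the page itself does not say which scheme it has in mind, and the names `topk_compress` and `decompress` are hypothetical.

```python
from typing import List, Tuple


def topk_compress(grad: List[float], k: int) -> List[Tuple[int, float]]:
    """Keep the k largest-magnitude entries as (index, value) pairs."""
    if not 0 <= k <= len(grad):
        raise ValueError("k must be between 0 and len(grad)")
    # Rank indices by absolute value, largest first, and keep the top k.
    top = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    return [(i, grad[i]) for i in sorted(top)]


def decompress(pairs: List[Tuple[int, float]], size: int) -> List[float]:
    """Rebuild a dense gradient, filling dropped entries with zero."""
    dense = [0.0] * size
    for index, value in pairs:
        dense[index] = value
    return dense


grad = [0.1, -3.0, 0.5, 2.0]
sparse = topk_compress(grad, k=2)
restored = decompress(sparse, len(grad))
print(sparse, restored)
```

Only the index and value pairs would cross the network here. Practical schemes usually also accumulate the dropped residual locally, which this sketch omits.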

2. Large Model Training

Training Systems

DeepSpeed - Microsoft's distributed training library
FairScale - Facebook's library for scalable training
Megatron-LM - NVIDIA's large-model training framework

Communication Optimization

NCCL - NVIDIA Collective Communications Library
Gloo - Facebook's collective communications library
MPI - Message Passing Interface

3. Cluster Management

Kubernetes for ML - deploying machine learning workloads on Kubernetes (K8s)
Slurm - job scheduling system
Ray - distributed computing framework
