Simulator Modeling Guide: Memory and Throughput Formulas
1. The Five Components of GPU Memory
| Component | Lifetime | Typical share | Scope |
|---|---|---|---|
| Weights | Resident | 30-80% | per-rank (sharded by TP/PP/EP) |
| KV Cache | Grows and is freed with requests | 10-60% | per-rank (sharded by TP/CP) |
| Activations | Transient during the forward pass | 1-15% | per-rank |
| Workspace | Temporary kernel buffers | 1-5% | per-rank |
| Communication Buffer | NCCL/DeepEP double buffering | 0.5-3% | per-rank |
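Common modeling errors:
- KV cache is often reported as a "per-request, per-layer" size, but the simulator needs the per-rank peak = Sum(KV of all concurrent requests) / TP_size
- In EP mode each rank holds only n_experts / EP_size routed experts, but the shared expert and attention are not sharded by EP
- The activation peak occurs at the largest prefill chunk, not during decode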
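The first two pitfalls above can be written down as a minimal sketch. The function names `per_rank_kv_peak` and `local_routed_experts` are hypothetical helpers introduced here for illustration, and the sketch assumes KV is split evenly across TP ranks, as the formula above states.

```python
def per_rank_kv_peak(per_request_kv_bytes, tp_size):
    """Peak KV bytes on one rank: sum over all concurrent requests, divided by TP.

    per_request_kv_bytes: iterable with the full KV footprint of each in-flight request.
    """
    return sum(per_request_kv_bytes) / tp_size


def local_routed_experts(n_experts, ep_size):
    """Routed experts held by one rank under EP.

    The shared expert and attention weights are not EP-sharded,
    so they must be accounted for separately.
    """
    return n_experts // ep_size
```

A simulator that stores only a per-request, per-layer figure has to multiply by the layer count and sum over the live batch before it can call a helper like this.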
2. Weights Memory [per-rank]
def weights_memory_per_rank(config, tp, ep, pp):
    pp_layers = config['n_layers'] // pp

    # Attention (per layer, TP-sharded)
    attn_params_per_layer = (
        config['dim'] * config['q_lora_rank'] +                                  # wq_a
        config['q_lora_rank'] * config['n_heads'] * config['head_dim'] // tp +   # wq_b
        config['dim'] * config['head_dim'] +                                     # wkv
        config['n_heads'] * config['head_dim'] // tp * config['o_lora_rank'] +   # wo_a
        config['o_groups'] * config['o_lora_rank'] // tp * config['dim']         # wo_b
    )

    # MoE experts (EP-sharded)
    n_local_experts = config['n_routed_experts'] // ep
    expert_params = 3 * config['dim'] * config['moe_inter_dim']  # w1 + w2 + w3

    # Shared expert (TP-sharded, not EP)
    shared_expert_params = 3 * config['dim'] * config['moe_inter_dim'] // tp

    # Bytes by precision
    attn_bytes = attn_params_per_layer * bytes_per_elem(config['attn_dtype'])
    expert_bytes = n_local_experts * expert_params * bytes_per_elem(config['expert_dtype'])
    shared_bytes = shared_expert_params * bytes_per_elem(config['shared_expert_dtype'])

    # Scale overhead
    attn_scale = scale_overhead(attn_params_per_layer, config['attn_dtype'])
    expert_scale = scale_overhead(n_local_experts * expert_params, config['expert_dtype'])

    # Small components (gate, norms, embeddings)
    gate_bytes = config['n_routed_experts'] * config['dim'] * 4  # FP32
    norm_bytes = 4 * config['dim'] * 4                           # 4 norms, FP32
    embed_bytes = config['vocab_size'] * config['dim'] // tp * 2 # BF16

    # HC (Hyper-Connection) parameters
    hc = config['hc_mult']  # 4
    hc_per_layer = (2 + hc) * hc * config['dim'] * hc * 4 * 2  # attn + ffn

    per_layer = (attn_bytes + expert_bytes + shared_bytes
                 + attn_scale + expert_scale + gate_bytes + norm_bytes + hc_per_layer)
    total = pp_layers * per_layer + embed_bytes
    return total


def bytes_per_elem(dtype):
    return {'fp4': 0.5, 'fp8': 1, 'bf16': 2, 'fp32': 4}[dtype]


def scale_overhead(n_params, dtype):
    if dtype == 'fp4':
        return n_params // 32 * 1           # per-32 E8M0 scale
    elif dtype == 'fp8':
        return n_params // (128 * 128) * 2  # per-128x128 block, 2D
    return 0
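To see how much the scale tensors add on top of the raw payload, the two helpers above can be folded into a single per-parameter figure. This is a minimal sketch: `effective_bytes_per_param` is a hypothetical name, and it reuses the block sizes and scale widths from `scale_overhead` as given (1 byte per 32 FP4 elements, 2 bytes per 128x128 FP8 block).

```python
def effective_bytes_per_param(dtype):
    """Payload bytes plus amortized scale bytes for one parameter."""
    payload = {'fp4': 0.5, 'fp8': 1, 'bf16': 2, 'fp32': 4}[dtype]
    if dtype == 'fp4':
        scale = 1 / 32           # one 1-byte E8M0 scale per 32 elements
    elif dtype == 'fp8':
        scale = 2 / (128 * 128)  # one 2-byte scale per 128x128 block
    else:
        scale = 0
    return payload + scale


for dtype in ('fp4', 'fp8', 'bf16'):
    print(dtype, effective_bytes_per_param(dtype))
```

Under these assumptions the FP4 scales add 1/32 byte to a 0.5-byte payload, i.e. 6.25% on top of the weights, while the FP8 block scales are negligible.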
3. KV Cache Memory [per-rank, per-request]
Heterogeneous CSA/HCA KV cache (see CSA/HCA Attention for details):
def kv_cache_per_request(config, seq_len, tp=1, cp=1):
    head_dim = config['head_dim']                                 # 512
    nope_dim = head_dim - config['rope_head_dim']                 # 448
    bytes_per_entry = nope_dim * 1 + config['rope_head_dim'] * 2  # FP8 + BF16 = 576

    total_entries = 0
    for ratio in config['compress_ratios']:
        win = config['window_size']  # 128
        if ratio == 0:
            total_entries += win
        elif ratio == 4:
            total_entries += win + seq_len // 4
        else:
            total_entries += win + seq_len // ratio
    kv_bytes = total_entries * bytes_per_entry // cp

    # Indexer (FP4, CSA layers only)
    n_csa = sum(1 for r in config['compress_ratios'] if r == 4)
    indexer_bytes = n_csa * (seq_len // 4) * config['index_head_dim'] * 0.5
    return kv_bytes + indexer_bytes
Example: V4-Pro at 1M tokens (1 GB = 10^9 bytes):
| Component | Calculation | Size |
|---|---|---|
| CSA (29 layers) | 29 x 250,128 x 576 | ~4.18 GB |
| HCA (31 layers) | 31 x 7,940 x 576 | ~0.14 GB |
| SWA (1 layer) | 1 x 128 x 576 | ~0.00007 GB |
| Indexer (29 layers) | 29 x 250,000 x 128 x 0.5 | ~0.46 GB |
| Total per-request | Sum of the rows above | ~4.78 GB |
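The rows of this table can be recomputed with a few lines of arithmetic. This is a sketch under stated assumptions: the HCA compression ratio of 128 is inferred from the 7,940 entries in the table (128 + 1,000,000 // 128) rather than stated in the text, the rope dimension of 64 follows from 512 - 448, and sizes are in decimal GB (10^9 bytes).

```python
SEQ_LEN = 1_000_000
WINDOW = 128
BYTES_PER_ENTRY = 448 * 1 + 64 * 2  # FP8 nope part + BF16 rope part = 576

# (label, layer count, compress ratio); ratio 0 means sliding window only.
# The ratio 128 for HCA is an inference from the table, not a documented value.
LAYER_GROUPS = [('CSA', 29, 4), ('HCA', 31, 128), ('SWA', 1, 0)]


def entries_per_layer(ratio, seq_len=SEQ_LEN, window=WINDOW):
    """KV entries kept by one layer: the window plus the compressed history."""
    return window if ratio == 0 else window + seq_len // ratio


kv_bytes = {label: n * entries_per_layer(ratio) * BYTES_PER_ENTRY
            for label, n, ratio in LAYER_GROUPS}
indexer_bytes = 29 * (SEQ_LEN // 4) * 128 * 0.5  # FP4 indexer, CSA layers only
total_bytes = sum(kv_bytes.values()) + indexer_bytes

for label, size in kv_bytes.items():
    print(f"{label}: {size / 1e9:.5f} GB")
print(f"Indexer: {indexer_bytes / 1e9:.3f} GB")
print(f"Total: {total_bytes / 1e9:.2f} GB")
```

The CSA layers account for nearly all of the footprint, which is why the compression ratio of those layers is the first parameter worth exposing in a simulator.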
4. Activation / Workspace Memory [per-rank]
Peak activation occurs in the largest prefill chunk. With an HC multiplier of 4, the residual stream takes 4x its usual space.
Decode (bs=B, seq=1):
    hc_residual       = B x hc(4) x dim x 4 bytes        (FP32)
    q_buffer          = B x (n_heads/tp) x head_dim x 2  (BF16)
    moe_workspace     = B x dim x 4                      (FP32 accumulator)
    shared_expert_buf = B x inter_dim x 2 x 4            (gate+up, FP32)
    peak = hc_residual + max(q_buffer, moe_workspace + shared_expert_buf)
Prefill (chunk_size=C):
    activation_peak = B x C x dim x hc_mult x 4           (HC residual, FP32)
                    + B x C x (n_heads/tp) x head_dim x 2 (Q buffer)
                    + B x C x inter_dim x 2 x 4           (MoE gate+up, FP32)
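The decode and prefill formulas above translate directly into code. The following is a minimal sketch; the function names are hypothetical, and it assumes `n_heads` is divisible by `tp` so that integer division is exact.

```python
def decode_activation_peak(B, dim, n_heads, head_dim, inter_dim, tp, hc_mult=4):
    """Per-rank activation peak for one decode step (seq=1), in bytes."""
    hc_residual = B * hc_mult * dim * 4               # FP32 residual stream
    q_buffer = B * (n_heads // tp) * head_dim * 2     # BF16
    moe_workspace = B * dim * 4                       # FP32 accumulator
    shared_expert_buf = B * inter_dim * 2 * 4         # gate + up, FP32
    # Attention and MoE buffers are not live at the same time, hence max().
    return hc_residual + max(q_buffer, moe_workspace + shared_expert_buf)


def prefill_activation_peak(B, C, dim, n_heads, head_dim, inter_dim, tp, hc_mult=4):
    """Per-rank activation peak for one prefill chunk of C tokens, in bytes."""
    return (B * C * dim * hc_mult * 4                 # HC residual, FP32
            + B * C * (n_heads // tp) * head_dim * 2  # Q buffer
            + B * C * inter_dim * 2 * 4)              # MoE gate + up, FP32
```

Note that the prefill formula sums all three terms while the decode formula takes a maximum, so the prefill figure is the more conservative of the two.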
5. MoE Workspace Modeling
See MoE Inference for the execution flow.
EP all-to-all dispatch buffer [per-rank]:
    = max_tokens_per_rank x dim x bytes x 2 (double-buffer)
    where max_tokens_per_rank = batch x seq x n_activated_experts / EP_size
    (capped by DeepEP ElasticBuffer config)
Grouped GEMM workspace [per-rank]:
    = max_tokens_per_expert x inter_dim x 4 (FP32 accumulator)
    where max_tokens_per_expert = batch x seq x load_factor / n_local_experts
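Both workspace formulas can be sketched as small helpers. The names below are hypothetical, `max_tokens_cap` stands in for whatever limit the DeepEP buffer configuration imposes (its actual parameter name is not given in this page), and rounding token counts up with `math.ceil` is an assumption made here to stay on the safe side.

```python
import math


def dispatch_buffer_bytes(batch, seq, n_activated_experts, ep_size,
                          dim, bytes_per_elem, max_tokens_cap=None):
    """Per-rank EP all-to-all dispatch buffer, double-buffered."""
    tokens = math.ceil(batch * seq * n_activated_experts / ep_size)
    if max_tokens_cap is not None:
        tokens = min(tokens, max_tokens_cap)  # hypothetical cap from buffer config
    return tokens * dim * bytes_per_elem * 2


def grouped_gemm_workspace_bytes(batch, seq, load_factor, n_local_experts, inter_dim):
    """Per-rank grouped GEMM workspace with an FP32 accumulator."""
    tokens = math.ceil(batch * seq * load_factor / n_local_experts)
    return tokens * inter_dim * 4
```

Here `load_factor` is the calibration knob: a value of 1.0 models perfectly uniform routing, and larger values model the hottest expert receiving more than its share.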
6. FP4 Capability Matrix
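See FP4/FP8 Quantization for details.
CAPABILITY_MATRIX = {
    # (checkpoint_format, hardware) -> (runtime_format, mem_multiplier, speed_vs_fp8)
    ('fp4', 'B200'):         ('fp4_native_mma', 1.0, 2.0),        # theoretical
    ('fp4', 'B200_current'): ('fp4_cast_fp8', 1.0, 1.0),          # what V4 actually does
    ('fp4', 'H100'):         ('fp4_to_fp8_preexpand', 2.0, 1.0),
    ('fp4', 'H200'):         ('fp4_to_fp8_preexpand', 2.0, 1.0),
    # FP8 checkpoints run natively on B200, H100 and H200
    ('fp8', 'B200'):         ('fp8_native', 1.0, 1.0),
    ('fp8', 'H100'):         ('fp8_native', 1.0, 1.0),
    ('fp8', 'H200'):         ('fp8_native', 1.0, 1.0),
}
When H100/H200 run an FP4 checkpoint, expert memory doubles because the weights are pre-expanded to FP8. The simulator must include this factor.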
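A simulator would typically apply the memory multiplier at the point where expert bytes are computed. The sketch below copies the FP4 rows of the matrix so that it runs on its own; `runtime_expert_bytes` is a hypothetical helper name.

```python
# FP4 rows of the capability matrix, repeated here so the example is self-contained.
CAPABILITY_MATRIX = {
    ('fp4', 'B200'): ('fp4_native_mma', 1.0, 2.0),
    ('fp4', 'B200_current'): ('fp4_cast_fp8', 1.0, 1.0),
    ('fp4', 'H100'): ('fp4_to_fp8_preexpand', 2.0, 1.0),
    ('fp4', 'H200'): ('fp4_to_fp8_preexpand', 2.0, 1.0),
}


def runtime_expert_bytes(checkpoint_expert_bytes, checkpoint_format, hardware):
    """Expert bytes resident at runtime, after applying the memory multiplier."""
    _runtime_format, mem_multiplier, _speed = CAPABILITY_MATRIX[(checkpoint_format, hardware)]
    return checkpoint_expert_bytes * mem_multiplier
```

Looking up an unsupported pair raises `KeyError`, which is arguably preferable to silently assuming a multiplier of 1.0.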
7. MTP Overhead Modeling
The MTP block reuses the parameters of the main model's last layer, so its cost should not be modeled by simply multiplying by (nextn+1).
def mtp_overhead(config, batch_size, seq_len_or_1):
    dim = config['dim']
    vocab = config['vocab_size']                # 129,280
    nextn = config['num_nextn_predict_layers']  # 1

    # Extra params (small): enorm + hnorm + eh_proj
    mtp_weight_bytes = (2 * dim + 2 * dim * dim) * 1  # FP8

    # Extra activation
    mtp_activation = batch_size * seq_len_or_1 * 2 * dim * 2  # BF16

    # Logits buffer (persists until verify complete)
    logits_buffer = batch_size * seq_len_or_1 * vocab * 4 * nextn  # FP32

    # Extra KV for speculative tokens (tiny).
    # kv_per_token is a helper that is not defined on this page.
    extra_kv = nextn * kv_per_token(config) * config['n_layers']

    return mtp_weight_bytes + mtp_activation + logits_buffer + extra_kv
Key point: during training the MTP overhead is (nextn+1) x total_activation, but during inference the MTP block runs sequentially after the main model, so the activation space can be reused.
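The training versus inference distinction can be made explicit in the simulator with a mode switch. This is a minimal sketch: `mtp_activation_budget` is a hypothetical name, and the inference branch assumes full reuse of the main model's activation space, which is the idealized case described above.

```python
def mtp_activation_budget(total_activation_bytes, nextn, mode):
    """Activation bytes to reserve when MTP is enabled.

    training:  every MTP depth keeps its own activations, so (nextn + 1) copies.
    inference: MTP runs after the main model and reuses its activation space
               (assumes complete reuse; a real allocator may keep some extra).
    """
    if mode == 'training':
        return (nextn + 1) * total_activation_bytes
    if mode == 'inference':
        return total_activation_bytes
    raise ValueError(f"unknown mode: {mode!r}")
```

With nextn = 1 the two modes differ by a factor of two, which is the size of the error made by carrying the training formula over to inference.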
8. Per-rank vs Global/Cluster
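| Quantity | Per-rank | Global/Cluster Total |
|---|---|---|
| Weights | Bytes actually stored on the GPU | Sum(all ranks) = total model parameters x bytes |
| KV cache | KV pool capacity of a single GPU | Sum(all ranks) x TP |
| Activation peak | Single-GPU peak during max_chunk prefill | Not meaningful (peaks occur in sync and do not add up) |
| Throughput | Single-GPU tokens/s | Sum over all DP replicas |
| Max batch | Limited by a single GPU | = per-rank max_batch x DP_size |
Common errors:
- Sizing single-GPU memory from "1.6T params" (divide by EP x TP x PP)
- Estimating single-GPU KV from the global KV (divide by TP)
- Summing activations across ranks (not meaningful)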
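The aggregation rules in this section are simple, but keeping them in named helpers makes it harder to mix scopes by accident. A minimal sketch with hypothetical function names:

```python
def cluster_throughput(replica_tokens_per_s):
    """Cluster throughput: the sum over all DP replicas."""
    return sum(replica_tokens_per_s)


def global_max_batch(per_rank_max_batch, dp_size):
    """Global max batch: the per-rank limit times the number of DP replicas."""
    return per_rank_max_batch * dp_size


def per_rank_kv_from_global(global_kv_bytes, tp_size):
    """Per-rank KV estimate from a global figure: divide by TP."""
    return global_kv_bytes / tp_size
```

There is deliberately no helper for summing activation peaks, since that quantity has no meaning at cluster scope.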
9. Configuration Options vs. Calibration Table
Make these configuration options (they change with the deployment):
- tp, ep, pp, dp, cp
- batch_size, max_seq_len, chunk_size
- kv_quant_bits, expert_dtype, attn_dtype
- block_size (PagedAttention)
- prefix_cache_hit_ratio
- Hardware: hbm_capacity, hbm_bandwidth, flops_fp8, flops_fp4, nvlink_bw, pcie_bw
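One way to carry these options through a simulator is a pair of frozen dataclasses. This is a sketch only: the class names `HardwareSpec` and `DeploymentConfig` are hypothetical, the field names are taken from the list above, and the choice of types (and of units for the hardware fields) is an assumption left to the implementer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareSpec:
    hbm_capacity: float
    hbm_bandwidth: float
    flops_fp8: float
    flops_fp4: float
    nvlink_bw: float
    pcie_bw: float


@dataclass(frozen=True)
class DeploymentConfig:
    tp: int
    ep: int
    pp: int
    dp: int
    cp: int
    batch_size: int
    max_seq_len: int
    chunk_size: int
    kv_quant_bits: int
    expert_dtype: str
    attn_dtype: str
    block_size: int  # PagedAttention block size
    prefix_cache_hit_ratio: float
    hardware: HardwareSpec

    def __post_init__(self):
        # Basic sanity checks; a ratio outside [0, 1] is almost certainly a unit mix-up.
        if not 0.0 <= self.prefix_cache_hit_ratio <= 1.0:
            raise ValueError("prefix_cache_hit_ratio must be in [0, 1]")
        if min(self.tp, self.ep, self.pp, self.dp, self.cp) < 1:
            raise ValueError("parallel degrees must be >= 1")
```

Freezing the dataclasses keeps a configuration immutable for the duration of a simulation run, so every formula sees the same deployment.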
Make these calibration tables (they require profiling):
- kernel_efficiency: achieved vs. theoretical peak (50-80%)
- all_to_all_latency(msg_size, ep_size): nonlinear, must be measured
- grouped_gemm_efficiency(n_experts, tokens_per_expert): drops when load is imbalanced
- sparse_attn_overhead: extra cost ratio from the CSA indexer and sparse gather
- prefix_cache_hit_rate(workload): workload-dependent
- ep_load_balance_factor: actual vs. ideal uniform distribution
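A calibration table is usually a set of measured points plus a lookup rule. The sketch below uses piecewise-linear interpolation between measurements and clamps outside the measured range; both choices are assumptions made for illustration, and the table contents must come from profiling on the target cluster. The function names and the table layout (a dict from `ep_size` to sorted `(msg_size, latency)` pairs) are hypothetical.

```python
from bisect import bisect_left


def interpolate(points, x):
    """Piecewise-linear lookup over (x, y) points sorted by x; clamps outside the range."""
    xs = [p[0] for p in points]
    if x <= xs[0]:
        return points[0][1]
    if x >= xs[-1]:
        return points[-1][1]
    i = bisect_left(xs, x)  # first index with xs[i] >= x
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def all_to_all_latency(table, msg_size, ep_size):
    """Look up measured all-to-all latency for a message size at a given EP size."""
    return interpolate(table[ep_size], msg_size)
```

Because the underlying curve is nonlinear, the measured points should be dense around the message sizes the workload actually produces.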
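10. Formulas That Must Not Be Hard-Coded
- Arithmetic intensity crossover: compute_bound if FLOPs/Byte > hardware_oi; the OI threshold varies by GPU model
- Expert load imbalance: top-k routing is uniform in theory but skewed in practice, so grouped GEMM efficiency needs profiling
- Overlap efficiency: the actual fraction of DeepEP communication-compute overlap depends on the kernel launch pattern
- KV cache fragmentation: PagedAttention utilization depends on the seq_len distribution
- CSA indexer FLOPs: seq/4 x 128 x 64 in theory; in practice there is additional Hadamard and FP4 quantization overhead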
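Two of the items above have simple closed forms once the hardware and workload numbers are passed in as parameters rather than baked into the code. A minimal sketch with hypothetical function names; the block-utilization formula assumes each sequence occupies whole blocks and wastes only the tail of its last block.

```python
import math


def is_compute_bound(flops, bytes_moved, peak_flops, mem_bandwidth):
    """Roofline test: compare arithmetic intensity with the hardware ridge point."""
    hardware_oi = peak_flops / mem_bandwidth  # FLOPs per byte at the ridge
    return flops / bytes_moved > hardware_oi


def paged_kv_utilization(seq_lens, block_size):
    """Fraction of allocated KV slots actually used under block-granular allocation."""
    allocated = sum(math.ceil(s / block_size) * block_size for s in seq_lens)
    return sum(seq_lens) / allocated
```

Feeding `paged_kv_utilization` a sample of real sequence lengths gives a workload-specific fragmentation factor instead of a guessed constant.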
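11. Related Topics
- For how MoE experts are sharded within the weights, see MoE Inference
- For the specific effect of FP4/FP8 precision on memory, see FP4/FP8 Quantization
- For the per-layer CSA/HCA KV cache calculation, see CSA/HCA Attention
- For the constraints that framework choice places on configuration parameters, see Inference Framework Comparison 2026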