H20 Batch Attribution Analysis Report Summary

June 1, 2026 · about 2 min read

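Overview

This report summarizes a batch attribution analysis of 10 H20 GPU kernel profiles. The inputs are the 10 most recent completed records in the awp_profiles table matching gpu_model LIKE '%H20%', analyzed with zhouyi-cli analyze --gpu-type nvidia.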
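The selection rule can be sketched as a query against an in-memory SQLite table. Only the table name awp_profiles and the gpu_model filter come from the report; the id, status and created_at columns, the sample rows and the helper name latest_completed_h20 are assumptions made for illustration, and the real schema may differ.

```python
import sqlite3

# Hypothetical schema: only the table name and gpu_model come from the report;
# id, status and created_at are assumed column names.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE awp_profiles ("
    "id INTEGER PRIMARY KEY, gpu_model TEXT, status TEXT, created_at TEXT)"
)

# Synthetic rows: odd ids are H20, ids divisible by 3 are marked failed.
rows = [
    (
        i,
        "NVIDIA H20" if i % 2 else "NVIDIA A100",
        "completed" if i % 3 else "failed",
        f"2026-05-{i:02d}",
    )
    for i in range(1, 31)
]
conn.executemany("INSERT INTO awp_profiles VALUES (?, ?, ?, ?)", rows)


def latest_completed_h20(conn, limit=10):
    """Return ids of the newest completed H20 profiles, newest first."""
    query = (
        "SELECT id FROM awp_profiles "
        "WHERE gpu_model LIKE '%H20%' AND status = 'completed' "
        "ORDER BY created_at DESC LIMIT ?"
    )
    return [row[0] for row in conn.execute(query, (limit,))]


selected = latest_completed_h20(conn)
```

Ordering by a timestamp column rather than by id is a deliberate choice in this sketch, since ids are not guaranteed to follow completion order.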

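Key conclusions

Item	Conclusion
Health	All 10 profiles are unhealthy. Median compute% is about 2.5%; the maximum is only 18.1%
Common root cause	Blocking on `cudaMemcpyAsync` + `cudaStreamSynchronize` (a typical H2D dataloader feeding bottleneck)
MFU analysis	Unavailable for all profiles (aggregator bug: gpu_type=nvidia is not supported)
Throttling risk	None of the three R10 signal classes is triggered, so Workflow 15 is not needed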

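Cluster analysis

Cluster A (3 profiles, compute ~18%)

Characteristics: 500+ small idle windows on the order of 100 ms each; the top window is ~179 ms
Assessment: a real inference/training trace, but H2D transfers are not pipelined, so every batch waits for its data
Root cause: the DataLoader loads synchronously; the CPU issues cudaMemcpyAsync and then immediately blocks in cudaStreamSynchronize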

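Cluster B (5 profiles, compute 2-3%)

Characteristics: 60-76 large idle windows on the order of a second each; the top window is ~948 ms
Assessment: a workload is running, but the gaps between batches are very large; likely first-token inference, a very small batch, or debug mode
To confirm: whether the inference framework is running in debug mode and whether batch_size is 1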

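Cluster B+ (2 profiles, invalid)

Profile #9: 63 s trace with a single idle window of 13.5 s
Profile #10: 17 s trace; the whole trace contains only 1 idle window, lasting 17 s
Assessment: invalid profiles, either setup-only or captured with a misconfigured profiling interval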

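Common root-cause mechanism

The CPU calls cudaMemcpyAsync to start the H2D copy
→ it then immediately blocks in cudaStreamSynchronize until the data arrives
→ the GPU has no work until the H2D copy completes → idle

The large gap between self-time (< 50 us) and window duration (179 ms to 17148 ms) confirms that the CPU spends most of its time waiting in the synchronization primitive for the GPU to finish.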
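A back-of-the-envelope model may help make the mechanism concrete. The sketch below compares a synchronous step, where compute waits for the copy, with an idealized pipelined step where loading overlaps compute. The function names, the 41 ms / 9 ms split (chosen only to resemble cluster A's ~18% compute) and the assumption that loading scales perfectly across workers are all illustrative, not measured values from the profiles.

```python
def sync_step_ms(load_ms, compute_ms):
    """Per-batch wall time when compute starts only after the load finishes."""
    return load_ms + compute_ms


def pipelined_step_ms(load_ms, compute_ms, workers=1):
    """Steady-state per-batch wall time when loading overlaps compute.

    Assumes (optimistically) that loading throughput scales linearly
    with the number of workers.
    """
    return max(load_ms / workers, compute_ms)


def compute_fraction(step_ms, compute_ms):
    """Share of the step during which the GPU is doing compute."""
    return compute_ms / step_ms


# Hypothetical per-batch costs resembling cluster A (compute ~18%).
load_ms, compute_ms = 41.0, 9.0

sync_share = compute_fraction(sync_step_ms(load_ms, compute_ms), compute_ms)
overlap_share = compute_fraction(
    pipelined_step_ms(load_ms, compute_ms, workers=4), compute_ms
)

# Lower bound on window duration / self-time, using the report's figures
# (self-time < 50 us, windows from 179 ms to 17148 ms).
min_wait_ratio = 179_000 / 50
max_wait_ratio = 17_148_000 / 50
```

In this model, overlap alone helps little when loading dominates the step; the gain comes from overlap combined with parallel loading, which is why the two are usually recommended together.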

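Optimization recommendations

Cluster A (real workload)

Measure	Expected gain (30 s trace baseline)
DataLoader `num_workers > 0` + `pin_memory=True`	Removes 70-90% of the small idle windows, about 15000 ms
`cudaMemcpyAsync` + `cudaStreamWaitEvent` in place of `cudaStreamSynchronize`	Saves about 2000 ms more
CUDA Graph capture	Small gain, since cluster A's launch% is already low (< 1.2%)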
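The first measure in the table could look roughly like the following in PyTorch. This is a minimal sketch, assuming the workload uses torch.utils.data.DataLoader and that the dataset yields (inputs, targets) tensor pairs; the helper names loader_kwargs and train_epoch, the worker count and the batch size are hypothetical and would need tuning per workload.

```python
def loader_kwargs(num_workers=4, prefetch_factor=2):
    """DataLoader keyword arguments for background loading into pinned memory."""
    kwargs = {"num_workers": num_workers, "pin_memory": True}
    if num_workers > 0:
        # These two options are only accepted when worker processes are used.
        kwargs["persistent_workers"] = True
        kwargs["prefetch_factor"] = prefetch_factor
    return kwargs


def train_epoch(dataset, step_fn, batch_size=32, device="cuda"):
    """Run step_fn on every batch, copying batches to the GPU asynchronously.

    Assumes each batch is an (inputs, targets) pair of tensors.
    """
    # Imported lazily so loader_kwargs can be used without PyTorch installed.
    from torch.utils.data import DataLoader

    loader = DataLoader(dataset, batch_size=batch_size, **loader_kwargs())
    for inputs, targets in loader:
        # non_blocking=True only overlaps with compute when the source
        # tensor lives in pinned (page-locked) host memory.
        inputs = inputs.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)
        step_fn(inputs, targets)
```

Worker processes prepare upcoming batches while the GPU computes, and pinned memory lets the H2D copy proceed without the CPU blocking on it, which targets the small per-batch idle windows seen in cluster A.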

Cluster B+ (invalid profiles)

Re-profile with the capture window placed after the setup phase
Recommend that the aggregator add an is_effective_profile validity field
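One possible shape for such a validity check is sketched below. The report only proposes the field name is_effective_profile; the ProfileSummary type, the two thresholds and the 30 s trace lengths used for clusters A and B are assumptions picked so that the B+ profiles are rejected while A-like and B-like profiles pass.

```python
from dataclasses import dataclass


@dataclass
class ProfileSummary:
    """Hypothetical per-profile summary; field names are illustrative."""
    trace_s: float     # total trace duration in seconds
    top_idle_s: float  # duration of the largest idle window in seconds


def is_effective_profile(p, max_top_idle_s=10.0, max_idle_share=0.5):
    """Heuristic: reject traces dominated by one very long idle window.

    Both thresholds are assumed values, not taken from the aggregator.
    """
    if p.trace_s <= 0:
        return False
    if p.top_idle_s >= max_top_idle_s:
        return False
    return p.top_idle_s / p.trace_s < max_idle_share


# Top idle windows and B+ trace lengths come from the report;
# the 30 s trace length for clusters A and B is assumed.
cluster_a = ProfileSummary(trace_s=30.0, top_idle_s=0.179)
cluster_b = ProfileSummary(trace_s=30.0, top_idle_s=0.948)
profile_9 = ProfileSummary(trace_s=63.0, top_idle_s=13.5)
profile_10 = ProfileSummary(trace_s=17.0, top_idle_s=17.0)

verdicts = [
    is_effective_profile(p)
    for p in (cluster_a, cluster_b, profile_9, profile_10)
]
```

A check based on the single largest idle window is only one option; a production rule would likely also consider kernel counts or total compute time.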

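Aggregator bug feedback

analyze_mfu does not support nvidia/amd GPU types (no MFU/MBU for any non-default GPU)
critical_path_top_kernels reports total_dur_us/avg_dur_us truncated to zero
analyze_interval ignores the time_ranges input for the Top1 idle window
Recommend adding an is_effective_profile field to filter out invalid profiles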
