AMD MI308X Single-GPU Profile Analysis Summary (kernel_launch_bound Dominant)

June 1, 2026

Profile Overview

Field	Value
profile_id	`20260526061917368_53495378fdf81669`
GPU	AMD Instinct MI308X-OAM (single GPU)
Trace duration	20479.51 ms
Workflow	workflow_1 (zhouyi)

E2E Bound Breakdown

Category	Share of trace	Absolute time (ms)
kernel_launch_bound	55.4%	11345
cpu_bound (+ sync)	30.4%	6226
compute	10.1%	2068
memory_copy	3.9%	799
kernel_queue	0.2%	41
communication	0.0%	0

Total GPU idle time is 17217 ms (84.1% of the trace), and 100% of it is attributed to the kernel_launch_bound and cpu_bound categories.
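As a quick sanity check, the shares in the table can be recomputed from the absolute times and the trace duration. The sketch below only uses numbers reported above; the variable names are illustrative and not part of any profiler API.

```python
# Recompute the E2E bound shares from the reported absolute times.
TRACE_MS = 20479.51  # trace duration from the profile overview

bound_ms = {
    "kernel_launch_bound": 11345,
    "cpu_bound": 6226,
    "compute": 2068,
    "memory_copy": 799,
    "kernel_queue": 41,
    "communication": 0,
}

# Share of the trace per category, rounded to one decimal like the table.
shares = {name: round(100 * ms / TRACE_MS, 1) for name, ms in bound_ms.items()}

# Share of the trace during which the GPU was idle.
GPU_IDLE_MS = 17217
idle_share = round(100 * GPU_IDLE_MS / TRACE_MS, 1)

for name, pct in shares.items():
    print(f"{name:22s} {pct:5.1f}%")
print(f"{'gpu_idle':22s} {idle_share:5.1f}%")
```

The category times sum to 20479 ms, which matches the trace duration to within rounding.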

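Bottleneck Analysis

Bottleneck 1: Kernel Launch Gap (55.4%)

Root cause: the inference/training main loop issues kernels serially from the CPU without graph capture or kernel fusion, so the GPU sits idle in the gaps between launches.

9 kernel_launch_bound windows, 10236 ms in total
Top idle windows: 2434 ms and 2323 ms; the CPU stack in both is hipMemcpyAsync + hipEventSynchronize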
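To illustrate why serial launches leave the GPU idle, here is a toy timeline model in Python. It is a sketch under simplified assumptions, not a model of this workload: the function `simulate_serial_launch` is hypothetical, and the kernel count, kernel duration, and launch cost are made-up values that do not come from the profile.

```python
def simulate_serial_launch(n_kernels, kernel_us, launch_us):
    """Toy timeline: the CPU spends launch_us per launch, and the GPU runs
    each kernel for kernel_us once that kernel has been launched.
    Returns (total_us, gpu_idle_us)."""
    gpu_free_at = 0.0  # time at which the GPU finishes its current work
    gpu_idle = 0.0
    for i in range(n_kernels):
        launched_at = (i + 1) * launch_us      # CPU finishes issuing launch i
        start = max(launched_at, gpu_free_at)  # GPU cannot start before the launch
        gpu_idle += start - gpu_free_at        # gap with nothing to run
        gpu_free_at = start + kernel_us
    return gpu_free_at, gpu_idle


# Hypothetical numbers, NOT taken from the profile.
serial_total, serial_idle = simulate_serial_launch(1000, kernel_us=5.0, launch_us=20.0)

# Graph replay approximated as a single launch that runs all kernels back to back.
graph_total, graph_idle = simulate_serial_launch(1, kernel_us=5.0 * 1000, launch_us=20.0)

print(f"serial: total={serial_total} us, idle={serial_idle} us")
print(f"graph:  total={graph_total} us, idle={graph_idle} us")
```

In this model the GPU is idle whenever the per-launch CPU cost exceeds the kernel duration, which is the pattern a launch-bound trace shows. Collapsing many launches into one is the effect graph capture aims for.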

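Bottleneck 2: CPU-Bound Synchronization Wait (30.4%)

Root cause: the CPU explicitly waits for GPU events to complete via hipEventSynchronize, so the next wave of kernels is issued only after the GPU has finished and the CPU has caught up.

3 cpu_bound windows, 6217 ms in total
Top-1 window: 3939.79 ms (63% of this category); the CPU stack's self_time is only 56 us, yet the wall time is 3.9 s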
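The gap between a tiny self_time and a multi-second wall time is the signature of a blocking wait: the thread is parked and consumes almost no CPU. The Python sketch below reproduces that signature with the standard library only. A `threading.Timer` stands in for the GPU completing its work; this is an analogy, not HIP code, and `measure_blocking_wait` is a hypothetical helper.

```python
import threading
import time


def measure_blocking_wait(delay_s):
    """Block on an event that is signalled after delay_s seconds.
    Returns (wall_seconds, cpu_seconds) spent in the wait."""
    done = threading.Event()
    # Stand-in for the GPU signalling completion after delay_s seconds.
    timer = threading.Timer(delay_s, done.set)

    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    timer.start()
    done.wait()  # analogous to a blocking synchronize call
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start

    timer.join()
    return wall, cpu


wall_s, cpu_s = measure_blocking_wait(0.2)
print(f"wall time: {wall_s:.3f} s, CPU time: {cpu_s:.6f} s")
```

The wall time tracks the delay while the CPU time stays far below it, which is why a sampling profiler reports almost no self time for the waiting frame.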

Hot Kernels

The top kernels on the critical path are all MemCopy, and 75% of them are MemCopy (0 bytes), which act as synchronization placeholders:

MemCopy (0 bytes) x 55
MemCopy (13074432 bytes) x 5

No GEMM/Conv kernel makes the top list: even when the GPU is busy, it is mostly moving data (MEMORY-related share: 76.1%).

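Optimization Recommendations

Measure	Expected benefit
HIP Graph capture (compile the step body into a graph)	Removes 50-70% of launch overhead, about 5700-7900 ms (28-39% of trace duration)
Kernel fusion (fuse elementwise + activation)	Saves an additional 1000-1500 ms
Remove 0-byte MemCopy	Small direct saving, but a large indirect gain from breaking the synchronization dependency chain
Replace `hipEventSynchronize` with `hipStreamWaitEvent`	Removes about 50% of the top-1 cpu_bound window, about 2000 ms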
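The millisecond estimates in the table follow from the measurements reported earlier. The sketch below redoes that arithmetic; the 50-70% and 50% reduction factors are the table's own estimates, and the variable names are illustrative.

```python
# Reproduce the expected-benefit estimates from the reported measurements.
TRACE_MS = 20479.51            # trace duration
LAUNCH_BOUND_MS = 11345        # kernel_launch_bound time
TOP1_CPU_WINDOW_MS = 3939.79   # largest cpu_bound window

# HIP Graph capture: assumed to remove 50-70% of the launch-bound time.
graph_low_ms = 0.5 * LAUNCH_BOUND_MS
graph_high_ms = 0.7 * LAUNCH_BOUND_MS
graph_low_pct = 100 * graph_low_ms / TRACE_MS
graph_high_pct = 100 * graph_high_ms / TRACE_MS

# hipStreamWaitEvent: assumed to remove about half of the top-1 cpu_bound window.
wait_event_ms = 0.5 * TOP1_CPU_WINDOW_MS

print(f"graph capture: {graph_low_ms:.0f}-{graph_high_ms:.0f} ms "
      f"({graph_low_pct:.0f}-{graph_high_pct:.0f}% of trace)")
print(f"stream wait event: {wait_event_ms:.0f} ms")
```

The results round to the 5700-7900 ms (28-39%) and roughly 2000 ms figures in the table.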

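Data Reliability

E2E Bound / kernel breakdown / idle window attribution: reliable
MFU/MBU: unavailable (the aggregator does not support AMD GPUs)
analyze_interval fine-grained analysis: not produced (fell back to top_windows[].cpu_top_functions)
R10 frequency throttling: none of the three signal types triggered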
