Critical Path of AI Trace

2026年1月4日 · 约 6 分钟阅读

在 GPU 性能分析中，关键路径（Critical Path） 回答了一个最根本的问题：端到端延迟由哪条事件链决定？ 找出关键路径后，才能知道优化哪个环节能真正缩短总时间。

1. 问题定义与理论基础

1.1 问题目标

针对 GPU trace（Chrome trace 风格），从所有事件中找出 端到端关键路径，并对这条路径上的时间进行 原因归因（CPU/GPU 计算/通信/内核间延迟/Launch Overhead 等）。

1.2 关键路径定义

在一个带权有向图（DAG）上，从任意起点到任意终点，使得 路径总权重（时间）最大 的那条路径：

$\text{CP} = \arg\max_{P \in \mathcal{P}} \sum_{e \in P} \text{duration}(e)$

其中 $\mathcal{P}$ 是所有从源点到汇点的路径集合。

1.3 为什么需要 DAG？

GPU 执行不是单线程的——多个 CUDA stream 并行运行，stream 之间有显式同步，CPU 和 GPU 之间通过 launch 依赖关联。这些关系天然构成一个 有向无环图（DAG）：

节点（Node）：每个 trace event（kernel、memcpy、NCCL、cuda_runtime call 等）
边（Edge）：事件间的依赖关系
权重：节点的 duration（微秒）

2. 业界工具的关键路径实践

不同工具对”关键路径”问题的切入角度各不相同。理解它们的差异有助于选择合适的分析手段。

2.1 PyTorch HTA — 自动化关键路径计算

PyTorch HTA（Holistic Trace Analyzer）是业界少数提供 自动关键路径计算 的工具。

核心流程：

Chrome Trace → DAG 构建 → 拓扑排序 → 最长路径 DP → 时间归因

DAG 构建策略：

边类型	依赖来源	示例
Launch	CPU `cuda_runtime` → GPU kernel（通过 correlation ID 关联）	`aten::mm` → `volta_fp16_sgemm`
Stream Order	同一 stream 上相邻 kernel 隐式串行	kernel A → kernel B（同 stream 7）
Sync	`cudaEventRecord` + `cudaStreamWaitEvent` 显式同步	stream 0 等待 stream 1
NCCL	跨 rank 的 collective 操作依赖	all_reduce rank 0 → all_reduce rank 1

时间归因分类：

类别	条件	含义
Compute-bound	GPU 计算 kernel 在关键路径上	矩阵乘、卷积主导端到端时间
Memory-bound	Memcpy/Memset 在关键路径上	数据移动是瓶颈
Communication-bound	NCCL kernel 在关键路径上	跨卡/跨节点通信主导
Launch Overhead	`cudaLaunchKernel` 结束到 GPU kernel 启动的间隙	太多小 kernel
Idle/Gap	关键路径上无活跃 kernel 的时间段	CPU 没跟上，GPU 空等

2.2 NVIDIA Nsight Systems (NSYS) — 可视化关键路径

NSYS 不做数学意义上的关键路径计算，但它提供的 统一时间线视图 是最直观的”肉眼找关键路径”方式：

CPU Thread 0  |████████████████████████████████████████████████████████████████████|
CUDA Stream 0 |    [kernel_A]     [kernel_B]          [kernel_C]                   |
CUDA Stream 1 |         [kernel_D]     [kernel_E]     [kernel_F]                   |
NCCL          |              [all_reduce_rank0]                                    |

NSYS 的能力边界：

✅ 多 stream/CPU thread/NCCL 的统一时间线
✅ NVTX marker 标注应用级阶段
✅ CUDA graph 执行可视化
❌ 不自动计算关键路径数值
❌ 不做时间归因分类

典型工作流： NSYS 找到”哪个 kernel 最重要” → NCU 分析”这个 kernel 为什么慢” → HTA 做量化关键路径计算。

2.3 NVIDIA Nsight Compute (NCU) — 逐 Kernel 瓶颈定位

NCU 关注的是 单个 kernel 内部 的性能瓶颈，而非全局关键路径：

指标	说明	判断依据
SpeedOfLight	理论 vs 实际吞吐对比	内存带宽利用率 < 50% → memory-bound
Scheduler Stats	Warp 调度停滞原因	long scoreboarding → 等内存；short scoreboarding → 指令依赖
Workload Distribution	SM 利用率分布	不均衡 → 负载分配问题
Source Counters	映射到 SASS/CUDA C++ 源码行	精确定位热点代码行

NCU 与关键路径的关系： NCU 回答 “why”（为什么这个 kernel 慢），HTA/NSYS 回答 “which”（哪个 kernel 最关键）。两者互补。

2.4 Chrome Trace / Kineto — 数据基础

PyTorch Profiler 底层使用 Kineto 收集 trace，输出 Chrome trace JSON 格式。这是所有上层分析的数据源。

核心字段：

{
  "name": "aten::mm",
  "cat": "gpu",
  "ph": "X",
  "ts": 1700000000000000,
  "dur": 500,
  "pid": 0,
  "tid": 7,
  "args": {
    "External id": 123,
    "stream": 7,
    "correlation": 456,
    "cbid": 128
  }
}

字段	用途
`correlation`	链接 CPU runtime call 与对应 GPU kernel
`stream`	标识 CUDA stream
`External id`	链接 op 层级父子关系（forward/backward）
`ts/dur`	微秒级时间戳（单调时钟）

事件类别：

Category	含义	示例
`cpu_op`	PyTorch CPU 算子	`aten::mm`, `aten::addmm`
`cuda_runtime`	CUDA Runtime API 调用	`cudaLaunchKernel`, `cudaMemcpyAsync`
`gpu`	GPU kernel 执行	`volta_fp16_sgemm_128x64_nn`
`nccl`	集合通信	`ncclKernel_AllReduce_RING`
`Memcpy_*`	内存传输	`Memcpy_H2D`, `Memcpy_D2H`

3. 关键路径算法实现

3.1 整体流程

flowchart LR
    A[解析 Chrome Trace] --> B[构建 DAG]
    B --> C[拓扑排序]
    C --> D[正向传播 EST]
    D --> E[反向标记关键路径]
    E --> F[时间归因分类]
    F --> G[可视化输出]

3.2 DAG 构建

class CPNode:
    def __init__(self, event_id, name, duration, node_type):
        self.event_id = event_id
        self.name = name
        self.duration = duration
        self.node_type = node_type  # COMPUTE, MEMORY, COMM, LAUNCH, IDLE
        self.predecessors = []  # [(node, edge_type), ...]
        self.successors = []
        self.earliest_start = 0
        self.earliest_finish = 0
        self.on_critical_path = False

def build_dag(trace_events, correlation_map):
    nodes = {}

    # Step 1: Create nodes for all events
    for event in trace_events:
        node = CPNode(
            event_id=event['id'],
            name=event['name'],
            duration=event['dur'],
            node_type=classify_kernel(event['name'])
        )
        nodes[event['id']] = node

    # Step 2: Build edges
    for event in trace_events:
        node = nodes[event['id']]

        # Type 1: CPU -> GPU launch dependency (via correlation ID)
        if event.get('args', {}).get('correlation'):
            corr_id = event['args']['correlation']
            if corr_id in correlation_map:
                cpu_node = correlation_map[corr_id]
                cpu_node.add_successor(node, edge_type='launch')

        # Type 2: Sequential within same stream
        prev = previous_kernel_on_same_stream(event)
        if prev:
            nodes[prev['id']].add_successor(node, edge_type='stream_order')

        # Type 3: Stream synchronization
        sync_deps = find_sync_dependencies(event)
        for dep in sync_deps:
            nodes[dep['id']].add_successor(node, edge_type='sync')

        # Type 4: NCCL collective dependencies
        nccl_deps = find_nccl_dependencies(event)
        for dep in nccl_deps:
            nodes[dep['id']].add_successor(node, edge_type='nccl')

    return nodes

3.3 最长路径计算（拓扑排序 + DP）

def compute_critical_path(nodes):
    # Topological sort (Kahn's algorithm)
    topo_order = topological_sort(nodes)

    # Forward pass: compute earliest start/finish
    for node in topo_order:
        if node.predecessors:
            node.earliest_start = max(
                pred.earliest_finish for pred, _ in node.predecessors
            )
        else:
            node.earliest_start = 0
        node.earliest_finish = node.earliest_start + node.duration

    # Find sink node (maximum earliest_finish = total execution time)
    sink = max(topo_order, key=lambda n: n.earliest_finish)
    total_time = sink.earliest_finish

    # Backward pass: mark critical path nodes
    current = sink
    while current.predecessors:
        current.on_critical_path = True
        # The predecessor that determines current.earliest_start
        critical_pred = max(
            current.predecessors,
            key=lambda p: p[0].earliest_finish
        )[0]
        if critical_pred.earliest_finish == current.earliest_start:
            current = critical_pred
        else:
            break
    current.on_critical_path = True

    return total_time, [n for n in topo_order if n.on_critical_path]

3.4 时间归因

关键路径上的每个节点按以下规则分类：

def classify_kernel(name):
    name_lower = name.lower()
    if name_lower.startswith('nccl'):
        return 'communication'
    if name_lower.startswith(('memcpy', 'memcp_async')):
        return 'memory'
    if name_lower.startswith('memset'):
        return 'memory'
    if any(x in name_lower for x in ['gemm', 'matmul', 'conv', 'linear']):
        return 'compute'
    return 'compute'  # default assumption

归因统计汇总：

关键路径总时长: 1250ms
├── Compute-bound:    780ms  (62.4%)  ← 主要瓶颈
├── Communication:    250ms  (20.0%)
├── Memory-bound:     120ms   (9.6%)
├── Launch Overhead:   60ms   (4.8%)
└── Idle/Gap:          40ms   (3.2%)

4. 工具对比总结

维度	HTA	NSYS	NCU	Kineto
关键路径计算	✅ 自动	❌ 手动观察	❌ 不支持	❌ 仅数据采集
时间归因	✅ 5 类	❌	⚠️ 单 kernel	❌
多 Stream	✅	✅	❌	✅
NCCL/分布式	✅ 多 rank	✅	❌	✅
逐 Kernel 细节	❌	⚠️ 有限	✅ 完整	⚠️ 有限
适用场景	训练迭代分析	全局时序排查	Kernel 级调优	数据采集基础

5. 业界最佳实践

5.1 NVIDIA 推荐工作流

NSYS 先跑一次，看全局时间线，识别最耗时的阶段
HTA 计算关键路径，量化各阶段占比
对关键路径上的 top-N kernel，用 NCU 逐一分析瓶颈原因
根据归因结果选择优化方向

5.2 常见优化策略

瓶颈类型	优化手段
Compute-bound	Kernel fusion、混合精度、更大 batch、Tensor Cores
Memory-bound	内存布局优化（NHWC）、coalesced access、shared memory tiling
Communication-bound	通信/计算重叠、gradient bucketing、NVLink 拓扑感知放置
Launch Overhead	CUDA Graph、kernel fusion、减少小 op 数量
Host Wait (Idle)	DataLoader 优化（num_workers、pin_memory）、异步预取

5.3 PyTorch Profiling 建议

import torch.profiler

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    with_stack=True,
    experimental_config=torch.profiler._ExperimentalConfig(verbose=True),
) as prof:
    # Skip warmup iteration
    train_step()
    # Profile actual iterations
    for _ in range(num_iterations):
        train_step()

# Export for HTA / custom analysis
prof.export_chrome_trace("trace.json")

使用 torch.cuda.synchronize() 确保 profiling 边界准确
分布式训练时同时采集所有 rank 的 trace
跳过 warmup iteration，只分析稳态迭代

6. 可视化方案

关键路径可视化的目标是让读者一眼看出 “时间花在哪” 和 “哪条链最慢”。

6.1 时间线高亮

标准 Gantt 时间线上，将关键路径节点标红：

Stream 0  |░░░░[████████][  idle  ][██████][gap][████████████]
Stream 1  |  ░░[████████████][████][idle][████████][██████]
            ↑       ↑             ↑          ↑
          start   non-CP      non-CP       end
                  █ = critical path nodes (red)
                  ░ = non-critical (gray)

6.2 瀑布图（Waterfall）

按归因类别展示累计贡献：

Compute     ████████████████████████████████████████ 62.4%
Comm        ████████████ 20.0%
Memory      ██████ 9.6%
Launch      ███ 4.8%
Idle        ██ 3.2%
            0ms        500ms       1000ms      1250ms

6.3 与现有工具的兼容

输出标准 Chrome trace 格式，用 color 字段区分关键路径状态
可导入 Perfetto / Chrome devtools 查看
也可用 Mermaid gantt 在文档中静态展示

参考资源

← 被以下页面引用(7)

Cprof C++ Profiling 核心技术ai-systems · concept
GPU Trace 时间分解与通信计算重叠分析ai-systems · concept
NVTX 原理分析ai-systems · concept
Temporal Breakdown 计算分析ai-systems · concept
HTA 算法原理与实现ai-systems · entity
Agentic AWP：规模化 Profiling 驱动的 GPU 效率 Breakdown 与能力体系ai-systems · source-summary
Agentic Infra：LLM 推理性能优化与 GPU 利用率提升ai-systems · source-summary

修改历史9 次提交

feat(wiki): enforce lifecycle metadata and search aliases
xiaocheng·刚刚·8098d0c
feat(wiki): connect core topics and add reading series
xiaocheng·6 小时前·e947096
docs: refresh wiki indexes and embeddings
xiaocheng·07-01·41f3817
fix(wiki): clean all lint errors to enable strict CI (PR-3)
xiaocheng·05-25·75375ef
docs(profiling): expand critical path article with HTA/NCU/NSYS industry practices
xiaocheng·04-11·926e22c
docs(profiling): add HTA algorithm analysis and update critical path metadata
xiaocheng·01-18·61ce6a0
refactor: reorganize documentation structure and update Navbar component
xiaocheng·01-17·2fb8f42
chore(project): clean up obsolete configuration and build artifacts
xiaocheng·01-16·3574bd3
feat: add profiling docs for AI tracing and critical paths
xiaocheng·01-15·fd52202

Critical Path of AI Trace

1. 问题定义与理论基础

1.1 问题目标

1.2 关键路径定义

1.3 为什么需要 DAG？

2. 业界工具的关键路径实践

2.1 PyTorch HTA — 自动化关键路径计算

2.2 NVIDIA Nsight Systems (NSYS) — 可视化关键路径

2.3 NVIDIA Nsight Compute (NCU) — 逐 Kernel 瓶颈定位

2.4 Chrome Trace / Kineto — 数据基础

3. 关键路径算法实现

3.1 整体流程

3.2 DAG 构建

3.3 最长路径计算（拓扑排序 + DP）

3.4 时间归因

4. 工具对比总结

5. 业界最佳实践

5.1 NVIDIA 推荐工作流

5.2 常见优化策略

5.3 PyTorch Profiling 建议

6. 可视化方案

6.1 时间线高亮

6.2 瀑布图（Waterfall）

6.3 与现有工具的兼容

参考资源

相关页面

← 被以下页面引用(7)

目录 25

Critical Path of AI Trace

1. 问题定义与理论基础

1.1 问题目标

1.2 关键路径定义

1.3 为什么需要 DAG？

2. 业界工具的关键路径实践

2.1 PyTorch HTA — 自动化关键路径计算

2.2 NVIDIA Nsight Systems (NSYS) — 可视化关键路径

2.3 NVIDIA Nsight Compute (NCU) — 逐 Kernel 瓶颈定位

2.4 Chrome Trace / Kineto — 数据基础

3. 关键路径算法实现

3.1 整体流程

3.2 DAG 构建

3.3 最长路径计算（拓扑排序 + DP）

3.4 时间归因

4. 工具对比总结

5. 业界最佳实践

5.1 NVIDIA 推荐工作流

5.2 常见优化策略

5.3 PyTorch Profiling 建议

6. 可视化方案

6.1 时间线高亮

6.2 瀑布图（Waterfall）

6.3 与现有工具的兼容

参考资源

相关页面

← 被以下页面引用(7)

相关阅读