Chapter 2: The NVIDIA Ecosystem (CUDA, TensorRT, and Triton)

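NVIDIA's moat is not the GPU hardware itself but the software ecosystem it has built up over roughly 20 years. Libraries such as CUDA, cuDNN, and TensorRT can make NVIDIA GPUs roughly 2-5x faster than competing platforms on the same kind of workload, and at the same time they lock developers into the NVIDIA platform.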

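1. The layers of the CUDA software stack

Application   PyTorch / TensorFlow / JAX
              ↕
Libraries     cuDNN (neural-network primitives) / cuBLAS (linear algebra) / NCCL (multi-GPU communication)
              ↕
Runtime       CUDA Runtime API
              ↕
Driver        NVIDIA Driver
              ↕
Hardware      GPU (SM / Tensor Core / HBM)

What each layer does

cuDNN: the acceleration library for deep learning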

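# You write PyTorch; PyTorch calls cuDNN
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"  # cuDNN is only used on the GPU path
inputs = torch.randn(8, 3, 224, 224, device=device)      # illustrative batch of images (NCHW)
weight = torch.randn(64, 3, 3, 3, device=device)         # illustrative: 64 filters of size 3x3
bias = torch.zeros(64, device=device)

# Behind this one line:
x = F.conv2d(inputs, weight, bias, stride=1)
# PyTorch → cuDNN → CUDA → GPU
# For common shapes, cuDNN's tuned kernels run close to the hardware's theoretical peak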

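cuBLAS: the core of matrix arithmetic

# The matrix multiplications in a Transformer end up in cuBLAS
# Q = X @ W_q
# K = X @ W_k
# V = X @ W_v
# Attention = softmax(Q @ K.T / sqrt(d_k)) @ V

# The speed of these operations largely determines inference speed
# cuBLAS runs them on Tensor Cores and is typically around 3-5x faster than a hand-written CUDA kernel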
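
To make the formulas above concrete, here is a minimal pure-Python sketch of scaled dot-product attention. It is only meant to show where the five matrix multiplications sit; the helper names (`matmul`, `transpose`, `softmax`, `attention`) and the toy input are illustrative assumptions, and a real framework would dispatch each `matmul` to a cuBLAS GEMM rather than run Python loops.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, w_q, w_k, w_v):
    # Three projection GEMMs: Q = X @ W_q, K = X @ W_k, V = X @ W_v
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Fourth GEMM: Q @ K.T, scaled by sqrt(d_k) before the row-wise softmax
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    # Fifth GEMM: attention weights times V
    return matmul(weights, v), weights

# Toy input: 3 tokens, model width 2; identity projections keep the numbers readable
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = attention(x, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Everything except the softmax is a matrix product, which is why GEMM throughput dominates Transformer inference time.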

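2. TensorRT: the workhorse of inference optimization

TensorRT is NVIDIA's SDK for optimizing inference. It takes a trained model (for example one that comes from PyTorch) and compiles it into a highly optimized TensorRT Engine for a specific GPU.

Optimization techniques

Operator fusion:

Original computation graph (PyTorch):
  Linear → Add → LayerNorm → GELU → Linear → Add → LayerNorm
  = 7 separate GPU kernels, 7 round trips to GPU memory

After TensorRT fusion:
  [Linear + Add + LayerNorm] → [GELU] → [Linear + Add + LayerNorm]
  = 3 kernels, far fewer memory reads and writes

Typical speedup: 30-50%, depending on the model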
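
The saving can be estimated with a back-of-the-envelope model. The sketch below counts kernel launches and activation traffic for the graph above under a deliberately simple assumption: every kernel reads its input activation from GPU memory once and writes its output once, while intermediates inside a fused kernel stay on chip. The activation shape and the `kernel_cost` helper are illustrative assumptions, not TensorRT internals.

```python
# Operators in the order they appear in the graph above
OPS = ["Linear", "Add", "LayerNorm", "GELU", "Linear", "Add", "LayerNorm"]

# Assumed activation size: batch 32 x sequence 128 x hidden 768, stored in FP16 (2 bytes)
ACTIVATION_BYTES = 32 * 128 * 768 * 2

def kernel_cost(kernels):
    """Return (launch count, activation bytes moved) for a list of kernels.

    Simplified model: each kernel performs one activation read and one
    activation write; weights and residual inputs are ignored.
    """
    launches = len(kernels)
    bytes_moved = launches * 2 * ACTIVATION_BYTES
    return launches, bytes_moved

unfused = [[op] for op in OPS]            # one kernel per operator
fused = [OPS[0:3], OPS[3:4], OPS[4:7]]    # the grouping shown above

unfused_launches, unfused_bytes = kernel_cost(unfused)
fused_launches, fused_bytes = kernel_cost(fused)
print(f"unfused: {unfused_launches} kernels, {unfused_bytes / 1e6:.1f} MB of activation traffic")
print(f"fused:   {fused_launches} kernels, {fused_bytes / 1e6:.1f} MB of activation traffic")
```

Memory-bound operators such as Add, LayerNorm, and GELU benefit most, because their run time is dominated by exactly this traffic rather than by arithmetic.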

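Mixed precision per layer:

# Not every layer needs to run in FP16
# With FP16 enabled, TensorRT picks a precision layer by layer:
# - compute-heavy layers (MatMul) → FP16 (Tensor Core)
# - precision-sensitive layers (SoftMax, LayerNorm) → FP32 (pin these explicitly if accuracy drops)

# Result: close to FP16 speed with close to FP32 accuracy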
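
SoftMax and LayerNorm are precision-sensitive largely because they contain long reductions (sums over many elements). The sketch below emulates FP16 and FP32 accumulators with the standard library's `struct` half- and single-precision formats, to show how a running sum behaves at each width. The helper names and the choice of 10,000 addends are assumptions made for illustration; real kernels typically accumulate in FP32 even when the inputs are FP16.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def to_fp32(value):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def accumulate(values, round_fn):
    """Sum values while rounding the running total after every addition."""
    total = 0.0
    for v in values:
        total = round_fn(total + round_fn(v))
    return total

values = [0.001] * 10_000   # the exact sum is 10.0
fp16_sum = accumulate(values, to_fp16)
fp32_sum = accumulate(values, to_fp32)
print(f"FP16 accumulator: {fp16_sum}")
print(f"FP32 accumulator: {fp32_sum}")
```

The FP16 total stops growing at around 4, because beyond that point the spacing between representable FP16 values is larger than twice the addend and each addition rounds back to the same number. The FP32 total stays close to 10.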

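In practice

# Option 1: torch.compile (the simplest route, PyTorch 2.0+)
import torch

model = load_your_model()  # placeholder for your own loading code
model = model.half().cuda()

# Compile with the default Inductor backend, which generates fused GPU kernels and can
# replay them through CUDA Graphs. Note that this path does not build a TensorRT Engine.
compiled_model = torch.compile(model, backend="inductor", options={
    "triton.cudagraphs": True
})

# First call: slow (compilation, typically 30-120 seconds)
# Later calls: fast (the compiled kernels are reused)
with torch.no_grad():
    output = compiled_model(input_ids)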

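# Option 2: torch-tensorrt (finer-grained control)
import torch_tensorrt

trt_model = torch_tensorrt.compile(
    model,
    ir="ts",                   # TorchScript path, so the result can be saved with torch.jit.save
    inputs=[
        torch_tensorrt.Input(
            min_shape=[1, 1],
            opt_shape=[8, 512],    # shape to tune for: batch 8, sequence length 512
            max_shape=[32, 2048],  # largest input the engine must accept
            dtype=torch.int32
        )
    ],
    enabled_precisions={torch.float16},
    truncate_long_and_double=True  # option name varies between torch-tensorrt releases
)

torch.jit.save(trt_model, "model_trt.pt")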

Performance comparison

# Measured: BERT-base inference (batch_size=32, seq_len=128); latency is per batch
benchmarks = {
    "PyTorch FP32": {"latency_ms": 45, "throughput_rps": 711},
    "PyTorch FP16": {"latency_ms": 18, "throughput_rps": 1778},
    "TensorRT FP16": {"latency_ms": 8, "throughput_rps": 4000},
    "TensorRT INT8": {"latency_ms": 5, "throughput_rps": 6400},
}

# TensorRT INT8 vs PyTorch FP32: 9x faster
# Accuracy cost (BERT-base on SQuAD): F1 drops from 88.5 to 88.1 (a difference of 0.4)
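
The throughput column follows directly from the latency column, since each measurement processes one batch of 32 sequences. A short sketch of that arithmetic, using only the latencies from the table above (the `throughput_rps` helper is an illustrative name):

```python
BATCH_SIZE = 32
latency_ms = {
    "PyTorch FP32": 45,
    "PyTorch FP16": 18,
    "TensorRT FP16": 8,
    "TensorRT INT8": 5,
}

def throughput_rps(batch_size, batch_latency_ms):
    """Sequences per second when one batch completes every batch_latency_ms."""
    return batch_size * 1000 / batch_latency_ms

baseline_ms = latency_ms["PyTorch FP32"]
summary = {
    name: (round(throughput_rps(BATCH_SIZE, ms)), baseline_ms / ms)
    for name, ms in latency_ms.items()
}
for name, (rps, speedup) in summary.items():
    print(f"{name:14s} {rps:5d} seq/s  {speedup:.1f}x vs FP32")
```

This assumes batches run back to back on a single model instance; overlapping several instances would raise throughput without changing per-batch latency.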

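3. NVIDIA Triton Inference Server

Triton is NVIDIA's open-source inference serving framework. It serves models in many formats and has built-in dynamic batching and scheduling across concurrent model instances.

Architecture

Client requests (gRPC / HTTP)
        ↓
Triton Inference Server
  ├── Request queue
  ├── Dynamic batching
  ├── Concurrent execution control
  └── Model repository
       ├── model_1/ (TensorRT Engine)
       ├── model_2/ (PyTorch TorchScript)
       ├── model_3/ (ONNX Runtime)
       └── model_4/ (Python Backend: arbitrary code)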

Deployment example

# Model repository layout
model_repository/
└── bert_base/
    ├── config.pbtxt           # model configuration
    └── 1/                     # version directory
        └── model.plan         # TensorRT Engine file

# config.pbtxt
name: "bert_base"
platform: "tensorrt_plan"

max_batch_size: 32

input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [128]  # sequence length
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT32
    dims: [128]
  }
]

output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [2]  # binary classification
  }
]

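# Dynamic batching configuration
dynamic_batching {
  preferred_batch_size: [8, 16, 32]
  max_queue_delay_microseconds: 5000  # wait at most 5 ms to fill a batch
}

# Concurrent execution (multiple model instances)
instance_group [
  {
    count: 2     # run 2 instances of the model at the same time
    kind: KIND_GPU
    gpus: [0]    # on GPU 0
  }
]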
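
The effect of the `dynamic_batching` settings can be pictured with a toy simulation. The sketch below is a simplified model, not Triton's actual scheduler: it dispatches a batch as soon as it is full, or once the oldest queued request has waited for the maximum queue delay, and it ignores execution time, instance availability, and the intermediate preferred sizes. The function name and the arrival pattern are assumptions made for illustration.

```python
def simulate_batches(arrivals_us, max_batch=32, max_delay_us=5000):
    """Group request arrival times (microseconds) into batches.

    A batch is dispatched when it reaches max_batch requests, or when its
    oldest request has waited max_delay_us. Returns (dispatch_time_us, size) pairs.
    """
    batches = []
    queue = []
    for t in sorted(arrivals_us):
        # The oldest request hit its deadline before this one arrived: flush first
        if queue and t > queue[0] + max_delay_us:
            batches.append((queue[0] + max_delay_us, len(queue)))
            queue = []
        queue.append(t)
        if len(queue) == max_batch:
            batches.append((t, len(queue)))
            queue = []
    if queue:
        # Leftover requests go out when the oldest one times out
        batches.append((queue[0] + max_delay_us, len(queue)))
    return batches

# Hypothetical traffic: a burst of 40 requests 100 us apart, then one straggler at 20 ms
arrivals = [i * 100 for i in range(40)] + [20_000]
batches = simulate_batches(arrivals)
for dispatch_time, size in batches:
    print(f"t={dispatch_time / 1000:.1f} ms  batch of {size}")
```

The burst fills one batch of 32 immediately, the remaining 8 requests wait out the 5 ms delay, and the straggler is served alone. A longer delay improves batch sizes under light load at the cost of added latency for every request that has to wait.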

# Python client
import tritonclient.grpc as grpcclient
import numpy as np

client = grpcclient.InferenceServerClient(url="localhost:8001")

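# Prepare the inputs (`...` is elided here; each row must be padded to the 128 tokens declared in config.pbtxt)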
input_ids = np.array([[101, 2023, 2003, 102, 0, 0, ...]], dtype=np.int32)
attention_mask = np.array([[1, 1, 1, 1, 0, 0, ...]], dtype=np.int32)

inputs = [
    grpcclient.InferInput("input_ids", input_ids.shape, "INT32"),
    grpcclient.InferInput("attention_mask", attention_mask.shape, "INT32"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)

outputs = [grpcclient.InferRequestedOutput("logits")]

response = client.infer(
    model_name="bert_base",
    inputs=inputs,
    outputs=outputs
)

logits = response.as_numpy("logits")
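
The returned logits are raw scores, one pair per request. A minimal post-processing sketch follows, assuming the two-class output declared in config.pbtxt; the sample values are made up for illustration and are not real model output. With the client above, `response.as_numpy("logits").tolist()` would supply the rows.

```python
import math

def softmax(logits):
    """Convert one row of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a batch of two requests
sample_logits = [[-1.2, 2.3], [0.4, 0.1]]

predictions = []
for row in sample_logits:
    probs = softmax(row)
    label = max(range(len(probs)), key=probs.__getitem__)  # index of the most likely class
    predictions.append((label, probs[label]))

for label, confidence in predictions:
    print(f"class {label} with probability {confidence:.3f}")
```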

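4. NCCL: multi-GPU communication

NCCL (NVIDIA Collective Communications Library) is the base library for multi-GPU and multi-node communication.

Communication patterns

# All-Reduce: the core of distributed training
# Scenario: 4 GPUs each hold gradients that must be averaged before the parameter update
# (sum across GPUs, then divide by 4)
#
# GPU 0: [1.0, 2.0, 3.0, 4.0]  \
# GPU 1: [2.0, 3.0, 4.0, 5.0]   → All-Reduce (Sum) → every GPU ends up with [10, 14, 18, 22]
# GPU 2: [3.0, 4.0, 5.0, 6.0]  /
# GPU 3: [4.0, 5.0, 6.0, 7.0]

# All-Gather: the core of tensor-parallel inference
# Scenario: each GPU holds one slice of the model and the partial outputs must be combined
# GPU 0: [shard 0]  \
# GPU 1: [shard 1]   → All-Gather → every GPU ends up with the full output
# GPU 2: [shard 2]  /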
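
One common way to implement All-Reduce is a ring: a reduce-scatter phase followed by an all-gather phase, so each GPU only ever talks to its neighbour. The sketch below simulates that on plain Python lists with the four gradient vectors from the example above. It is a toy model under stated assumptions (the buffer length must be divisible by the number of ranks, and the function name is illustrative); NCCL chooses among several algorithms depending on topology and message size.

```python
def ring_all_reduce(buffers):
    """Sum equally sized buffers across ranks: reduce-scatter, then all-gather."""
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "this sketch needs the buffer length divisible by the rank count"
    chunk = size // n
    data = [list(b) for b in buffers]   # each rank's local copy

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: after n-1 steps, rank r holds the full sum of chunk (r+1) % n
    for step in range(n - 1):
        messages = []
        for rank in range(n):
            c = (rank - step) % n
            messages.append(((rank + 1) % n, c, [data[rank][i] for i in span(c)]))
        for dest, c, payload in messages:       # all sends in a step happen "at once"
            for i, value in zip(span(c), payload):
                data[dest][i] += value

    # Phase 2, all-gather: pass the finished chunks around the ring
    for step in range(n - 1):
        messages = []
        for rank in range(n):
            c = (rank + 1 - step) % n
            messages.append(((rank + 1) % n, c, [data[rank][i] for i in span(c)]))
        for dest, c, payload in messages:
            for i, value in zip(span(c), payload):
                data[dest][i] = value
    return data

gradients = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 3.0, 4.0, 5.0],
    [3.0, 4.0, 5.0, 6.0],
    [4.0, 5.0, 6.0, 7.0],
]
summed = ring_all_reduce(gradients)
averaged = [[v / len(gradients) for v in row] for row in summed]
print(summed[0])
print(averaged[0])
```

Each rank sends and receives only one chunk per step, so the traffic per GPU stays roughly constant as more GPUs join the ring. That is why link bandwidth between neighbours matters so much.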

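NVLink vs PCIe

PCIe 4.0 x16: 32 GB/s (per direction)
NVLink 4.0 (H100): 900 GB/s (total bidirectional bandwidth per GPU, i.e. 450 GB/s per direction)

Why NVLink matters for inference:
- Tensor parallelism needs frequent GPU-to-GPU communication
- Over PCIe, communication becomes the bottleneck: the GPUs compute fast but exchange data slowly
- NVLink largely removes that bottleneck: multi-GPU scaling efficiency can rise from about 60% to 90% or more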
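
A rough calculation shows the size of the gap. The sketch below estimates how long one activation tensor takes to cross each link at the peak figures quoted above, comparing one direction with one direction (450 GB/s is half of NVLink's 900 GB/s total). The tensor shape is a hypothetical example, and the model ignores latency and protocol overhead, so real transfers will be slower on both links.

```python
# Peak per-direction bandwidth in GB/s
LINK_GB_PER_S = {
    "pcie4_x16": 32,
    "nvlink4_h100": 450,   # half of the 900 GB/s total bidirectional figure
}

def transfer_ms(num_bytes, gb_per_s):
    """Ideal transfer time in milliseconds at the given bandwidth."""
    return num_bytes / (gb_per_s * 1e9) * 1e3

# Hypothetical payload: one FP16 activation of shape [batch=32, seq=2048, hidden=8192]
payload_bytes = 32 * 2048 * 8192 * 2

times_ms = {name: transfer_ms(payload_bytes, bw) for name, bw in LINK_GB_PER_S.items()}
for name, ms in times_ms.items():
    print(f"{name:13s} {ms:6.2f} ms")
print(f"ratio: {times_ms['pcie4_x16'] / times_ms['nvlink4_h100']:.1f}x")
```

Tensor parallelism repeats an exchange like this in every Transformer layer, so a difference of tens of milliseconds per transfer quickly dominates end-to-end latency.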

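Full NVLink connectivity is a feature of server (SXM) boards:
- A100 SXM / H100 SXM (data-center modules): every GPU in the node is linked over NVLink/NVSwitch
- A100 PCIe / H100 PCIe (PCIe cards): no NVSwitch fabric, at most an optional NVLink bridge between a pair of cards (about 30% cheaper)
- For large-model inference: prefer the SXM version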

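5. Setting up the development environment

# Recommended Dockerfile base: NVIDIA's official PyTorch image
FROM nvcr.io/nvidia/pytorch:24.01-py3

# Install inference-related libraries
# (tritonclient[grpc] provides the tritonclient.grpc module used by the Python client;
#  the pip package named "triton" is the unrelated Triton GPU kernel compiler)
# Note: vllm pins its own torch version, which may replace the one shipped in the image
RUN pip install \
    transformers==4.38.0 \
    vllm==0.3.3 \
    "tritonclient[grpc]" \
    torch-tensorrt \
    bitsandbytes \
    accelerate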

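# GPUs are normally not visible during `docker build`, so only check the PyTorch/CUDA build here;
# run torch.cuda.is_available() inside the running container instead
RUN python -c "import torch; print(torch.__version__, torch.version.cuda)"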

# Run the container (expose the GPUs)
docker run \
    --gpus all \
    --ipc=host \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -p 8000:8000 \
    -v /data:/data \
    my-inference-image

# Monitor GPU usage
watch -n 1 nvidia-smi

# More detailed GPU monitoring
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.free \
           --format=csv -l 1
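
For dashboards or alerts, the same query is easier to consume from a script. The sketch below parses output in the shape produced when `,noheader,nounits` is appended to `--format=csv`. The sample text is hypothetical, the `parse_gpu_stats` helper is an illustrative name, and fields that a GPU reports as `[N/A]` are not handled here.

```python
import csv
import io

# Same field order as the --query-gpu list above
FIELDS = [
    "name", "temperature.gpu", "utilization.gpu",
    "utilization.memory", "memory.used", "memory.free",
]

def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not record:
            continue
        row = dict(zip(FIELDS, record))
        for key in FIELDS[1:]:
            row[key] = float(row[key])   # every field except the name is numeric
        rows.append(row)
    return rows

# Hypothetical sample output for a two-GPU machine
sample = (
    "NVIDIA H100 80GB HBM3, 54, 87, 41, 61234, 20325\n"
    "NVIDIA H100 80GB HBM3, 49, 12, 3, 1024, 80535\n"
)
stats = parse_gpu_stats(sample)
busy = [index for index, gpu in enumerate(stats) if gpu["utilization.gpu"] > 80]
print("GPUs above 80% utilization:", busy)
```

In a real script, the text would come from running the `nvidia-smi` command through `subprocess.run(..., capture_output=True, text=True)` instead of the hard-coded sample.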

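Key takeaways

NVIDIA's moat is software, not hardware:

AMD's MI300X already beats the H100 on headline hardware specs such as memory (192 GB HBM3 vs 80 GB), yet the CUDA-first PyTorch ecosystem, cuDNN's tuned kernels, and TensorRT's maturity still lead most AI companies to choose NVIDIA.

From an engineering standpoint:

- Start with a PyTorch FP16 baseline (simple, and already fast)
- Once a bottleneck appears: compile with TensorRT (potentially another 2-5x)
- For multi-GPU setups: Triton + NCCL (a standardized serving interface)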
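
Each step in that progression only makes sense if the baseline is measured first. A minimal latency harness might look like the sketch below. The `benchmark` helper and its defaults are assumptions, and the workload is a CPU stand-in. When timing GPU code, the callable should end with `torch.cuda.synchronize()`, because CUDA kernels are launched asynchronously and would otherwise appear faster than they are.

```python
import statistics
import time

def benchmark(fn, warmup=10, iterations=100):
    """Time a callable and report latency percentiles in milliseconds."""
    for _ in range(warmup):          # warm-up runs absorb one-off costs such as compilation
        fn()
    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1e3)
    samples_ms.sort()
    return {
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[int(0.95 * (len(samples_ms) - 1))],
        "mean_ms": statistics.fmean(samples_ms),
    }

# Stand-in workload; replace with a function that runs one inference batch
result = benchmark(lambda: sum(i * i for i in range(10_000)))
print({name: round(value, 3) for name, value in result.items()})
```

Comparing p50 and p95 before and after each optimization shows whether a change helps typical requests, tail latency, or both.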

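"The CUDA ecosystem is the Windows of the AI era: not the best, but the one most people use, with the most tooling behind it and a very high cost of switching. That is what NVIDIA's trillion-dollar valuation rests on."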