# The Transformer Deep Learning Architecture

> Attention Is All You Need
>
> Archived: 2026-04-22

---
## Basic Information

| Item | Details |
|------|------|
| Paper | Attention Is All You Need |
| Authors | Google Brain / Google Research |
| Published | June 2017 (arXiv); NeurIPS 2017 |
| Paper link | https://arxiv.org/abs/1706.03762 |
| Citations | 100,000+ |
| Core idea | Entirely attention-based; dispenses with recurrence and convolution |

---
## Core Innovations

### Problems with Earlier Architectures

| Architecture | Problem |
|------|------|
| RNN/LSTM | Sequential computation prevents parallelism; long-range dependencies are hard to learn |
| CNN | Limited receptive field; distant dependencies require stacking many layers |
| Encoder-Decoder + Attention | Still relies on an RNN encoder |

### The Transformer's Breakthrough

> **"Attention Is All You Need"** — attention alone, with recurrence and convolution removed entirely

- ✅ **Parallelism**: training speed improves dramatically
- ✅ **Long-range dependencies**: O(1) path length between any two positions
- ✅ **Scalability**: easy to stack deeper

---
## Model Architecture

### Overall Structure

```
┌─────────────────────────────────────────────────────────────┐
│                        Transformer                          │
├─────────────────────────────────────────────────────────────┤
│              Input Embedding + Positional Encoding          │
├─────────────────────────────────────────────────────────────┤
│   ┌─────────────────┐        ┌─────────────────┐            │
│   │     Encoder     │        │     Decoder     │            │
│   │  (N=6 layers)   │        │  (N=6 layers)   │            │
│   │                 │        │                 │            │
│   │  Multi-Head     │        │  Masked Multi-  │            │
│   │  Self-Attention │────▶│  Head Self-     │            │
│   │                 │        │  Attention      │            │
│   │  Feed-Forward   │        │                 │            │
│   │  Network        │        │ Encoder-Decoder │            │
│   │                 │        │  Attention      │            │
│   │                 │        │                 │            │
│   │                 │        │  Feed-Forward   │            │
│   │                 │        │  Network        │            │
│   └─────────────────┘        └─────────────────┘            │
├─────────────────────────────────────────────────────────────┤
│                  Output Linear + Softmax                    │
└─────────────────────────────────────────────────────────────┘
```
### Encoder Structure

Each layer contains two sublayers:

1. **Multi-Head Self-Attention**
2. **Position-wise Feed-Forward Network**

Each sublayer is wrapped in a residual connection followed by LayerNorm:

```
Output = LayerNorm(x + Sublayer(x))
```
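The residual-plus-LayerNorm pattern above can be sketched as a minimal post-LN encoder layer. This sketch uses PyTorch's built-in `nn.MultiheadAttention` for brevity (its internals differ slightly from the from-scratch version later in this note); the class name `EncoderLayer` and all hyperparameter defaults are illustrative, taken from the base-model configuration:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: two sublayers, each computed as
    LayerNorm(x + Sublayer(x)) (post-LN, as in the original paper)."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # Sublayer 1: multi-head self-attention + residual + LayerNorm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.drop(attn_out))
        # Sublayer 2: position-wise FFN + residual + LayerNorm
        return self.norm2(x + self.drop(self.ffn(x)))

x = torch.randn(2, 10, 512)     # (batch, seq_len, d_model)
print(EncoderLayer()(x).shape)  # torch.Size([2, 10, 512])
```

The input and output shapes match, which is what lets N=6 of these layers be stacked directly.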
### Decoder Structure

Each layer contains three sublayers:

1. **Masked Multi-Head Self-Attention**
2. **Multi-Head Encoder-Decoder Attention**
3. **Position-wise Feed-Forward Network**

---
## Core Components

### 1. Scaled Dot-Product Attention

```
Attention(Q, K, V) = softmax(QK^T / √d_k) V
```

| Step | Description |
|------|------|
| 1 | Compute QK^T (dot products) |
| 2 | Divide by √d_k (scaling) |
| 3 | Apply softmax to obtain attention weights |
| 4 | Multiply by V to obtain the output |

**Why scale?**
- When d_k is large, the dot products can grow large in magnitude
- Large logits push softmax into its gradient-saturated region
- Dividing by √d_k keeps the values numerically stable
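The four steps in the table can be traced with a tiny standalone function (shapes and tensor values are illustrative):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # steps 1-2: QK^T / sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)            # step 3: attention weights
    return weights @ V                                 # step 4: weighted sum of values

Q = torch.randn(1, 4, 64)  # (batch, seq_len, d_k)
K = torch.randn(1, 4, 64)
V = torch.randn(1, 4, 64)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # torch.Size([1, 4, 64])
```

Each output row is a convex combination of the rows of V, since the softmax weights in each row sum to 1.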
### 2. Multi-Head Attention

```
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
```

| Parameter | Value |
|------|-----|
| h (number of attention heads) | 8 |
| d_model | 512 |
| d_k = d_v | 64 |
| Total parameter count | similar to single-head attention |

**Benefits of multi-head attention**:
- Lets the model jointly attend to different representation subspaces at different positions
- Captures multiple kinds of dependency at once
### 3. Positional Encoding

With no recurrence or convolution, position information must be injected explicitly:

```
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
```

**Why the sinusoidal version was chosen**:
- It handles sequences of arbitrary length
- It lets the model learn to attend by relative position
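The formula pair above can be computed vectorized for all positions at once (the function name `positional_encoding` is illustrative; the `exp`/`log` form is the standard numerically stable way to compute the 10000^(2i/d_model) divisors):

```python
import math
import torch

def positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); odd dims use cos."""
    pos = torch.arange(max_len).unsqueeze(1).float()  # (max_len, 1)
    # 1 / 10000^(2i/d_model) for i = 0, 1, ..., d_model/2 - 1
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)  # odd dimensions
    return pe

pe = positional_encoding(100, 512)
print(pe.shape)  # torch.Size([100, 512])
```

At position 0 every sine dimension is 0 and every cosine dimension is 1; all entries stay bounded in [-1, 1], matching the scale of the embeddings they are added to.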
### 4. Feed-Forward Network

```
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
```

| Parameter | Value |
|------|-----|
| Input/output dimension | d_model = 512 |
| Inner dimension | d_ff = 2048 |
| Activation | ReLU |
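This formula maps directly to two linear layers with a ReLU in between, applied independently at each position (the class name `FeedForward` is illustrative):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied position-wise."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)  # expand: 512 -> 2048
        self.w2 = nn.Linear(d_ff, d_model)  # project back: 2048 -> 512

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))

x = torch.randn(2, 10, 512)
print(FeedForward()(x).shape)  # torch.Size([2, 10, 512])
```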
---

## Training Details

### Hyperparameters

| Parameter | Value |
|------|-----|
| Encoder/Decoder layers | 6 |
| Attention heads | 8 |
| Model dimension | 512 |
| FFN inner dimension | 2048 |
| Dropout | 0.1 |
| Batch size | ~25,000 tokens/batch |
### Optimizer

- **Adam** (β_1 = 0.9, β_2 = 0.98, ε = 10^-9)
- **Learning rate schedule**: warmup_steps = 4000

```
lr = d_model^(-0.5) * min(step^(-0.5), step * warmup^(-1.5))
```
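The schedule is a one-line function: linear warmup to a peak at `warmup` steps, then inverse-square-root decay (the function name is illustrative):

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """lr = d_model^(-0.5) * min(step^(-0.5), step * warmup^(-1.5))."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Rises linearly during warmup, peaks at step 4000 (~7.0e-4), then decays:
print(transformer_lr(100))     # warmup region
print(transformer_lr(4000))    # peak, ~0.000699
print(transformer_lr(100000))  # decay region
```

The same function can be passed to `torch.optim.lr_scheduler.LambdaLR` as a multiplicative factor if the optimizer's base learning rate is set to 1.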
### Regularization

- **Label smoothing** (ε_ls = 0.1)
- **Dropout** (0.1)
- **Residual dropout** (applied to each sublayer output before the residual addition)
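In recent PyTorch versions (≥ 1.10), label smoothing with ε_ls = 0.1 is available directly on the loss function; the paper implemented it via a smoothed target distribution, and this built-in is an equivalent shortcut (tensor shapes below are illustrative):

```python
import torch
import torch.nn as nn

# Each target keeps 0.9 probability mass; the remaining 0.1
# is spread uniformly over the other vocabulary entries.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(8, 10000)           # (batch, vocab_size)
targets = torch.randint(0, 10000, (8,))  # gold token ids
loss = criterion(logits, targets)
print(loss.item())  # a positive scalar
```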
---

## Performance Comparison

| Model | WMT 2014 En-De | WMT 2014 En-Fr |
|------|----------------|----------------|
| Transformer (Base) | 27.3 BLEU | 38.1 BLEU |
| Transformer (Big) | 28.4 BLEU | 41.8 BLEU |
| Previous best | 26.4 BLEU | 40.7 BLEU |

**Training cost**: 8 P100 GPUs for 3.5 days (big model)
---

## The Three Uses of Attention

| Type | Location | Q source | K/V source |
|------|------|--------|----------|
| **Encoder self-attention** | Encoder | Output of the previous encoder layer | Output of the previous encoder layer |
| **Decoder self-attention** | Decoder | Output of the previous decoder layer (masked) | Output of the previous decoder layer |
| **Encoder-decoder attention** | Decoder | Output of the previous decoder layer | Encoder output |
### Masking in the Decoder

**Purpose**: prevent each position from attending to future positions (preserving the autoregressive property)

```
Set all illegal connections to -∞ so that softmax assigns them zero weight
```
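This masking can be sketched with a lower-triangular boolean mask (the helper name `causal_mask` is illustrative):

```python
import torch

def causal_mask(size):
    """Lower-triangular mask: position i may attend to positions <= i."""
    return torch.tril(torch.ones(size, size, dtype=torch.bool))

mask = causal_mask(4)

# Scores at masked (False) positions are set to -inf,
# so softmax assigns them zero weight:
scores = torch.randn(4, 4)
scores = scores.masked_fill(~mask, float('-inf'))
weights = torch.softmax(scores, dim=-1)
print(weights[0])  # first row: all weight on position 0
```

Row i of `weights` is nonzero only at columns 0..i, which is exactly the autoregressive constraint.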
---

## Comparison with Other Layer Types

| Layer type | Maximum path length | Per-layer complexity | Minimum sequential operations |
|------|-------------|-----------|---------------|
| **Self-attention (Transformer)** | O(1) | O(n²·d) | O(1) |
| Recurrent (RNN) | O(n) | O(n·d²) | O(n) |
| Convolutional (kernel k) | O(log_k(n)) | O(k·n·d²) | O(1) |
---

## Historical Development

### Timeline

| Year | Model | Key innovation |
|------|------|---------|
| 2017 | Transformer | Attention replaces RNNs |
| 2018 | BERT | Encoder-only, bidirectional pre-training |
| 2018 | GPT | Decoder-only, generative pre-training |
| 2019 | GPT-2 | Larger model, more data |
| 2020 | GPT-3 | 175B parameters, few-shot learning |
| 2020 | T5 | Unified encoder-decoder framework |
| 2021 | ViT | Transformers for images |
| 2022 | ChatGPT | RLHF alignment |
| 2023 | GPT-4 | Multimodal, stronger reasoning |
| 2023 | LLaMA | Open-weight LLM foundation |
| 2024 | Sora | Transformer + diffusion |

### Variant Architectures

| Architecture | Improvement |
|------|--------|
| **BERT** | Encoder-only, bidirectional understanding |
| **GPT family** | Decoder-only, generative |
| **T5** | Unified encoder-decoder |
| **Longformer** | Long-sequence processing |
| **Reformer** | Efficient attention |
| **FlashAttention** | IO-aware efficient attention |
| **MoE (Mixture of Experts)** | Sparse activation |
---

## Formula Summary

### 1. Scaled Dot-Product Attention
```
Attention(Q, K, V) = softmax(QK^T / √d_k)V
```

### 2. Multi-Head Attention
```
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
```

### 3. Feed-Forward Network
```
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
```

### 4. Positional Encoding
```
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
```

### 5. Residual Connection + LayerNorm
```
Output = LayerNorm(x + Sublayer(x))
```
---

## Code Example (PyTorch)

```python
import math

import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Projections for queries, keys, values, and the final output
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # (batch, heads, seq, d_k) @ (batch, heads, d_k, seq) -> (batch, heads, seq, seq)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            # Masked positions get a large negative score -> ~0 after softmax
            scores = scores.masked_fill(mask == 0, -1e9)
        weights = torch.softmax(scores, dim=-1)
        return torch.matmul(weights, V)

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)
        # Project, then split into heads: (batch, heads, seq, d_k)
        Q = self.W_q(Q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(K).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(V).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        out = self.scaled_dot_product_attention(Q, K, V, mask)
        # Recombine heads: (batch, seq, d_model)
        out = out.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
        return self.W_o(out)
```
---

## References

| Resource | Link |
|------|------|
| Original paper | https://arxiv.org/abs/1706.03762 |
| HTML version | https://arxiv.org/html/1706.03762v7 |
| PyTorch implementation | Attention Is All You Need |
---

## Fundamentals Index

Other fundamentals notes:
- [[INDEX_基础知识]] - index of the fundamentals knowledge base
---

*Compiled by: knowledge base maintainer | Source: Google Brain | Archived: 2026-04-22*