Backpropagation Through Time in RNNs

Gradient Analysis of Recurrent Neural Networks

The forward pass of a recurrent neural network (RNN) is: \[ \begin{aligned} &\mathbf{h}_t = \phi(\mathbf{x}_t\mathbf{W}_{xh} + \mathbf{h}_{t-1}\mathbf{W}_{hh} + \mathbf{b}_h)\\ &\mathbf{o}_t = \mathbf{h}_t\mathbf{W}_{o} + \mathbf{b}_o \end{aligned} \] That is, \(\mathbf{h}_t\) is a function of \(\mathbf{x}_t\) and \(\mathbf{h}_{t-1}\). Stacking the parameters as \[ \mathbf{W}_h = \left[ \begin{matrix} \mathbf{W}_{xh}\\ \mathbf{W}_{hh}\\ \mathbf{b}_h \end{matrix} \right] \] we can write \[ \mathbf{h}_t = \phi\big([ \begin{matrix} \mathbf{x}_t&\mathbf{h}_{t-1}&\mathbf{1} \end{matrix}]\cdot \mathbf{W}_h\big) \]
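As a quick numerical sanity check of the stacked-weight form, here is a small sketch of my own (the sizes and variable names are arbitrary and not part of the original post):

import torch

# Check that phi([x_t, h_{t-1}, 1] @ W_h), with W_h = [W_xh; W_hh; b_h] stacked
# row-wise, reproduces phi(x_t @ W_xh + h_{t-1} @ W_hh + b_h).
torch.manual_seed(0)
batch, d, h = 4, 3, 5                                  # arbitrary sizes
x_t = torch.randn(batch, d)
h_prev = torch.randn(batch, h)
W_xh, W_hh, b_h = torch.randn(d, h), torch.randn(h, h), torch.randn(1, h)

h_t = torch.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)     # original form

W_h = torch.cat([W_xh, W_hh, b_h], dim=0)              # stacked (d + h + 1, h) matrix
inp = torch.cat([x_t, h_prev, torch.ones(batch, 1)], dim=1)
h_t_stacked = torch.tanh(inp @ W_h)                    # stacked form

print(torch.allclose(h_t, h_t_stacked))                # True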

More compactly, the forward pass can be written as: \[ \begin{aligned} \mathbf{h}_t = f(\mathbf{x}_t, \mathbf{h}_{t-1}, \mathbf{W}_h)\\ \mathbf{o}_t = g(\mathbf{h}_t, \mathbf{W}_o) \end{aligned} \] where \(f\) and \(g\) are the transformations of the hidden layer and the output layer, respectively.

The loss function is: \[ L(\mathbf{x}_1, \cdots, \mathbf{x}_T, \mathbf{y}_1, \cdots, \mathbf{y}_T, \mathbf{W}_h, \mathbf{W}_o) = \frac{1}{T}\sum_{t=1}^T l(\mathbf{y}_t, \mathbf{o}_t) \] For backpropagation: \[ \begin{aligned} \frac{\partial L}{\partial \mathbf{W}_h} &= \frac{1}{T}\sum_{t=1}^T\frac{\partial l(\mathbf{y}_t, \mathbf{o}_t)}{\partial \mathbf{W}_h}\\ &= \frac{1}{T}\sum_{t=1}^T\frac{\partial l(\mathbf{y}_t, \mathbf{o}_t)}{\partial \mathbf{o}_t}\frac{\partial g(\mathbf{h}_t, \mathbf{W}_o)}{\partial \mathbf{h}_t}\frac{\partial \mathbf{h}_t}{\partial \mathbf{W}_h} \end{aligned} \] The first two factors of this product are easy to compute; the third factor, \(\partial \mathbf{h}_t / \partial \mathbf{W}_h\), is the troublesome one. Since \(\mathbf{h}_t = f(\mathbf{x}_t, \mathbf{h}_{t-1}, \mathbf{W}_h)\), and \(\mathbf{h}_{t-1}\) in turn depends on \(\mathbf{W}_h\) and \(\mathbf{h}_{t-2}\), and so on, we have to track the effect of \(\mathbf{W}_h\) on \(\mathbf{h}_t\) recurrently. By the chain rule, \[ \frac{\partial \mathbf{h}_t}{\partial \mathbf{W}_h} = \frac{\partial f(\mathbf{x}_t, \mathbf{h}_{t-1}, \mathbf{W}_h)}{\partial \mathbf{W}_h} + \frac{\partial f(\mathbf{x}_t, \mathbf{h}_{t-1}, \mathbf{W}_h)}{\partial \mathbf{h}_{t-1}}\frac{\partial \mathbf{h}_{t-1}}{\partial \mathbf{W}_h} \] Expanding \(\partial \mathbf{h}_{t-1}/\partial \mathbf{W}_h\), \(\partial \mathbf{h}_{t-2}/\partial \mathbf{W}_h\), … in the same way gives a sum whose term for time step \(i\) (with \(1 \le i \le t-1\)) is \[ \begin{aligned} &\frac{\partial f(\mathbf{x}_t, \mathbf{h}_{t-1}, \mathbf{W}_h)}{\partial \mathbf{h}_{t-1}}\times \cdots \times \frac{\partial f(\mathbf{x}_{i+1}, \mathbf{h}_{i}, \mathbf{W}_h)}{\partial \mathbf{h}_{i}}\frac{\partial f(\mathbf{x}_{i}, \mathbf{h}_{i-1}, \mathbf{W}_h)}{\partial \mathbf{W}_h}\\ &= \Bigg(\prod_{j=i+1}^{t}\frac{\partial f(\mathbf{x}_j, \mathbf{h}_{j-1}, \mathbf{W}_h)}{\partial \mathbf{h}_{j-1}}\Bigg)\frac{\partial f(\mathbf{x}_{i}, \mathbf{h}_{i-1}, \mathbf{W}_h)}{\partial \mathbf{W}_h} \end{aligned} \] so that \[ \frac{\partial \mathbf{h}_t}{\partial \mathbf{W}_h} = \frac{\partial f(\mathbf{x}_t, \mathbf{h}_{t-1}, \mathbf{W}_h)}{\partial \mathbf{W}_h} + \sum_{i=1}^{t-1}\Bigg(\prod_{j=i+1}^{t}\frac{\partial f(\mathbf{x}_j, \mathbf{h}_{j-1}, \mathbf{W}_h)}{\partial \mathbf{h}_{j-1}}\Bigg)\frac{\partial f(\mathbf{x}_{i}, \mathbf{h}_{i-1}, \mathbf{W}_h)}{\partial \mathbf{W}_h} \] Although we can compute \(\partial \mathbf{h}_t / \partial \mathbf{W}_h\) recursively with the chain rule in this way, the chain becomes very long when \(t\) is large and the computation becomes expensive.
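The recursion above can be checked against autograd on a toy scalar RNN. The sketch below is my own illustration (scalar weights, tanh activation), not part of the original derivation:

import torch

# Toy scalar RNN: h_t = tanh(x_t * w_x + h_{t-1} * w_h + b).
# Accumulate dh_T/dw_h with the recursion
#   g_t = (df/dw_h) + (df/dh_{t-1}) * g_{t-1}
# and compare against autograd.
torch.manual_seed(0)
T = 10
x = torch.randn(T)
w_x = torch.tensor(0.5)
w_h = torch.tensor(0.8, requires_grad=True)
b = torch.tensor(0.1)

h = torch.tensor(0.0)
g = 0.0                          # running value of dh_t/dw_h
for t in range(T):
    h_prev = h
    h = torch.tanh(x[t] * w_x + h_prev * w_h + b)
    dtanh = 1.0 - h.item() ** 2  # derivative of tanh at the pre-activation
    # "direct" term through w_h plus the term that flows through h_{t-1}
    g = dtanh * h_prev.item() + dtanh * w_h.item() * g

h.backward()                     # autograd computes dh_T/dw_h
print(g, w_h.grad.item())        # the two numbers agree up to floating-point error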

Backpropagation in Recurrent Neural Networks

Full Computation

For a very long sequence, backpropagating through the full computational graph to update the parameters is very slow, and the gradient may explode, since a small change in the initial conditions can lead to a disproportionately large change in the outcome. This is highly undesirable for the model we want to estimate, so in practice this approach is almost never used.
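To see why, note that the factor \(\partial \mathbf{h}_t/\partial \mathbf{h}_{t-k}\) appearing in the expansion is a product of \(k\) per-step Jacobians. The rough sketch below, which uses a plain linear recurrence purely for illustration (my own simplification, not the tanh RNN above), shows that the norm of such a product grows or shrinks exponentially with \(k\):

import torch

# For a linear recurrence h_t = h_{t-1} @ W_hh, the factor dh_t/dh_{t-k} is
# W_hh^k, whose norm explodes or vanishes exponentially in k.
torch.manual_seed(0)
n = 64
for scale in (0.5, 1.5):                       # spectral radius roughly below / above 1
    W_hh = torch.randn(n, n) / n ** 0.5 * scale
    J = torch.eye(n)
    for k in range(1, 101):
        J = J @ W_hh                           # accumulate the k-step Jacobian
        if k in (1, 10, 50, 100):
            print(f'scale={scale}, k={k:3d}, ||W_hh^k||_F = {J.norm().item():.3e}')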

Truncating Time Steps

With truncated backpropagation through time (Truncated Backpropagation Through Time, TBPTT), we truncate the sum in the expression for \(\partial \mathbf{h}_t / \partial \mathbf{W}_h\) after \(\tau\) steps: \[ \frac{\partial \mathbf{h}_t}{\partial \mathbf{W}_h} \approx \frac{\partial f(\mathbf{x}_t, \mathbf{h}_{t-1}, \mathbf{W}_h)}{\partial \mathbf{W}_h} + \sum_{i=t-\tau}^{t-1}\Bigg(\prod_{j=i+1}^{t}\frac{\partial f(\mathbf{x}_j, \mathbf{h}_{j-1}, \mathbf{W}_h)}{\partial \mathbf{h}_{j-1}}\Bigg)\frac{\partial f(\mathbf{x}_{i}, \mathbf{h}_{i-1}, \mathbf{W}_h)}{\partial \mathbf{W}_h} \] That is, the expansion is stopped \(\tau\) steps back and all earlier contributions are discarded. In practice this works quite well, and it is commonly referred to as truncated backpropagation through time. As a consequence, the model focuses primarily on short-term influence rather than long-term influence. This is actually desirable, because it biases the estimate toward simpler and more stable models.

Implementing truncated BPTT:

Unlike simply cutting a long sequence into many independent fixed-length pieces, truncated BPTT splits the sequence into consecutive short segments. After one segment has been fed through the model to produce outputs and a final hidden state, that hidden state is passed on as the initial hidden state of the next segment, but no gradient is propagated through it between segments.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from collections import Counter
import numpy as np

class TBPTTDataset(Dataset):
    def __init__(self, data, block_size):
        self.data = data
        self.block_size = block_size  # size of each block

    def __len__(self):
        return (len(self.data) - 1) // self.block_size

    def __getitem__(self, index):
        start = index * self.block_size
        end = start + self.block_size
        return (
            torch.tensor(self.data[start: end], dtype=torch.long),
            torch.tensor(self.data[start + 1: end + 1], dtype=torch.long)
        )

class WordRNN(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.RNN(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden):
        x = self.embedding(x)
        out, hidden = self.rnn(x, hidden)
        out = self.fc(out)
        return out, hidden

    def init_hidden(self, batch_size):
        return torch.zeros(self.num_layers, batch_size, self.hidden_size)

def build_vocab(tokenized_text, vocab_limit):
    word_count = Counter(tokenized_text)
    # reserve two slots for the special tokens
    vocab = sorted(word_count, key=word_count.get, reverse=True)[: vocab_limit - 2]
    vocab = ['<PAD>', '<UNK>'] + vocab
    word_to_idx = {word: i for i, word in enumerate(vocab)}
    return word_to_idx, vocab

text = ''
with open('./data1.txt', 'r', encoding='utf-8') as f:
    text = f.read()
with open('./data2.txt', 'r', encoding='utf-8') as f:
    text += f.read()
with open('./data3.txt', 'r', encoding='utf-8') as f:
    text += f.read()


# Character-level modeling
tokenized_text = list(text)
print(f'Total time steps: {len(tokenized_text)}')
word_to_idx, vocab = build_vocab(tokenized_text, 20000)
vocab_size = len(vocab)
idx_to_word = {i: word for i, word in enumerate(vocab)}

data = [word_to_idx[word] if word in word_to_idx else word_to_idx['<UNK>'] for word in tokenized_text]

batch_size = 256
# split the data into blocks
truncated_length = 1000
dataset = TBPTTDataset(data, truncated_length)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)

num_epochs = 30
loss_function = nn.CrossEntropyLoss(ignore_index=word_to_idx['<PAD>'])
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = WordRNN(vocab_size, embed_size=256, hidden_size=512, num_layers=3).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)


# Training
for epoch in range(num_epochs):
    for batch, (block_input, block_target) in enumerate(dataloader):
        block_input, block_target = block_input.to(device), block_target.to(device)
        seq_length = 20
        hidden = model.init_hidden(batch_size).to(device)  # initialize the hidden state
        num_steps = truncated_length // seq_length
        block_loss = 0
        optimizer.zero_grad()

        # Truncated BPTT: the gradient is cut every seq_length time steps;
        # the hidden state is carried forward, but no gradient flows through it
        for step in range(num_steps):
            start = step * seq_length
            end = start + seq_length

            # current inputs and targets
            x = block_input[:, start: end]
            y = block_target[:, start: end]

            # detach the hidden state from the previous segment's graph
            hidden = hidden.detach()

            output, hidden = model(x, hidden)

            loss = loss_function(output.reshape(-1, vocab_size), y.reshape(-1))

            # block_loss is the average loss over the block; loss is the average
            # loss of one truncated segment
            block_loss += loss.item()

            # backward pass; gradients accumulate across segments because
            # zero_grad is only called once per block
            loss.backward()

        # update the parameters once per block
        nn.utils.clip_grad_norm_(model.parameters(), 5)
        optimizer.step()
        block_loss /= num_steps

        # log training progress
        if batch % 100 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Batch {batch}, Loss: {block_loss:.4f}')


# ----------------- Text generation -----------------
def generate_text(model, start_text, length=50, temperature=1.0):
    model.eval()
    words = list(start_text)  # character-level tokenization, matching the training data
    hidden = model.init_hidden(1).to(device)

    # feed the prompt through the model
    for word in words:
        if word not in word_to_idx:
            word = '<UNK>'
        x = torch.tensor([[word_to_idx[word]]]).to(device)
        _, hidden = model(x, hidden)

    # generate new text
    for _ in range(length):
        # forward pass
        output, hidden = model(x, hidden)

        # temperature sampling
        probs = torch.softmax(output.squeeze() / temperature, dim=0).detach().cpu().numpy()
        probs = probs.astype(np.float64)
        probs /= probs.sum()  # renormalize so np.random.choice accepts the distribution
        next_idx = np.random.choice(vocab_size, p=probs)

        # map the sampled index back to a token
        next_word = idx_to_word[next_idx] if next_idx in idx_to_word else '<UNK>'
        words.append(next_word)

        # prepare the next input
        x = torch.tensor([[next_idx]]).to(device)

    # join the generated tokens
    return ''.join(words) if len(''.join(words)) == len(words) else ' '.join(words)


# Example
print(generate_text(model, "你好", length=500, temperature=0.8))

Randomized Truncation

Stochastic Truncated Backpropagation Through Time (STBPTT) adds randomness on top of ordinary truncated BPTT, with the aim of improving generalization and training efficiency. The usual formulation replaces \(\partial \mathbf{h}_{t-1}/\partial \mathbf{W}_h\) in the recursion with \(\xi_t\,\partial \mathbf{h}_{t-1}/\partial \mathbf{W}_h\), where \(\xi_t\) is a random variable taking the value \(0\) with probability \(1-\pi\) and \(\pi^{-1}\) with probability \(\pi\), so that \(E[\xi_t]=1\) and the truncated gradient is unbiased in expectation: sometimes the gradient is cut early, sometimes it is allowed to flow much further back.
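A minimal PyTorch sketch of this idea, applied at the segment boundaries of the detach-based training loop above (the helper name randomized_truncate and the parameter pi are my own hypothetical choices, not code from the post):

import torch

def randomized_truncate(hidden, pi=0.5):
    # Randomized truncation sketch: with probability 1 - pi the gradient path
    # through `hidden` is cut entirely; with probability pi it is kept but its
    # backward contribution is scaled by 1/pi, so the expected gradient matches
    # the untruncated one (E[xi] = 1).
    if torch.rand(()) < pi:
        # forward value is unchanged; gradient through `hidden` is scaled by 1/pi
        return hidden / pi - hidden.detach() * (1.0 - pi) / pi
    return hidden.detach()       # cut the gradient path

In the training loop above, the line hidden = hidden.detach() would then be replaced by hidden = randomized_truncate(hidden, pi), so that some segment boundaries cut the gradient while others let it flow further back with a compensating 1/pi scale.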

Comparing Strategies

The first row uses randomized truncation, which partitions the text into segments of varying lengths.

The second row uses regular truncation, which breaks the text into subsequences of the same length.

The third row uses full backpropagation through time, which results in a computationally infeasible expression.

Although randomized truncation is theoretically appealing, in practice it is unlikely to work better than regular truncation, for several reasons. First, the effect of an observation after backpropagating over a number of past time steps is already sufficient to capture the dependencies that matter in practice. Second, the increased variance counteracts the fact that the gradient becomes more accurate as more time steps are included. Third, we actually want models with only a short range of interactions, so the mild regularization effect of regular truncated backpropagation through time is precisely what the model needs.

