Multi-GPU Training of Recurrent Neural Networks

With DataParallel, the input data is split along dim=0 by default, and the outputs of the replicas are concatenated back along dim=0.
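
As a quick illustration, the following minimal sketch prints the shape each replica actually receives. ShapeProbe is a hypothetical probe module used only for this demonstration, and the snippet assumes a machine with more than one GPU.

import torch
import torch.nn as nn

class ShapeProbe(nn.Module):
    # A do-nothing module that only reports the shape of the chunk each replica receives.
    def forward(self, x):
        print('replica input shape:', x.shape)
        return x

if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    probe = nn.DataParallel(ShapeProbe().cuda())
    x = torch.randn(256, 35, 512, device='cuda:0')  # (batch, seq, feature)
    out = probe(x)
    # each replica prints (256 / n, 35, 512); the gathered output is (256, 35, 512) again
    print('gathered output shape:', out.shape)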

If we pass in a hidden state the usual way, a problem arises: hidden is treated as ordinary input data, so the program splits it along dim=0 before handing it to each replica. For example, if hidden has shape (num_layers, batch_size, hidden_size) and there are \(n\) GPUs, the model on each GPU receives a hidden state of shape (num_layers / n, batch_size, hidden_size), which is clearly wrong.

Likewise, since hidden's shape involves the batch size batch_size, note that the input x and the target y are also split along dim=0. With batch_first=True set on the recurrent network, x and y are token-index tensors of shape (batch_size, seq_length) (they are fed to an embedding layer and a cross-entropy loss in the code below); after the split, each GPU receives x and y of shape (batch_size / n, seq_length). The mismatch is shown in the sketch after this paragraph.
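
The numbers below are assumed example values, and chunk(n, dim=0) is used only to mimic the dim=0 split that DataParallel performs.

import torch

num_layers, batch_size, hidden_size, seq_length, n = 4, 256, 512, 35, 2  # assumed example values

hidden = torch.zeros(num_layers, batch_size, hidden_size)
x = torch.randint(0, 10000, (batch_size, seq_length))

# DataParallel would split both tensors along dim=0:
print(hidden.chunk(n, dim=0)[0].shape)  # torch.Size([2, 256, 512]): the layers get divided, not the batch
print(x.chunk(n, dim=0)[0].shape)       # torch.Size([128, 35]): the batch is divided as intended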

To sum up, hidden should be created with shape (num_layers * n, batch_size / n, hidden_size). After DataParallel splits it along dim=0, each replica receives a tensor of shape (num_layers, batch_size / n, hidden_size), which matches the (batch_size / n, seq_length) input chunk on that GPU. Note also that DataParallel gathers the hidden state returned by each replica along dim=0, so the hidden returned by the wrapped model has the same packed shape (num_layers * n, batch_size / n, hidden_size) and can be passed straight back in at the next step.
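
Under this packing the per-replica shapes line up, as the following small sketch shows (example values assumed, with chunk again standing in for DataParallel's split).

import torch

num_layers, batch_size, hidden_size, n = 3, 256, 512, 4  # assumed example values

# packed hidden state handed to the DataParallel-wrapped model
hidden = torch.zeros(num_layers * n, batch_size // n, hidden_size)

# what each replica receives after the dim=0 split
per_gpu_hidden = hidden.chunk(n, dim=0)[0]
print(per_gpu_hidden.shape)  # torch.Size([3, 64, 512]) == (num_layers, batch_size / n, hidden_size)

The full training code then looks like this: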

import torch
import torch.nn as nn

class GRU(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers=1):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.gru = nn.GRU(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden):
        x = self.embedding(x)
        out, hidden = self.gru(x, hidden)
        out = self.fc(out)
        return out, hidden

def init_hidden(num_layers, batch_size, hidden_size):
    # int() allows passing batch_size / n directly
    num_layers = int(num_layers)
    batch_size = int(batch_size)
    hidden_size = int(hidden_size)
    return torch.zeros(num_layers, batch_size, hidden_size)

vocab_size = 10000
embed_size = 256
hidden_size = 512
num_layers = 3
batch_size = 256  # note: batch_size must be an integer multiple of the number of GPUs
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
n = torch.cuda.device_count() if torch.cuda.is_available() else 1

model = GRU(vocab_size, embed_size, hidden_size, num_layers).to(device)
model = nn.DataParallel(model)

# ... code omitted ...

for epoch in range(num_epochs):
    for batch, (block_input, block_target) in enumerate(dataloader):
        block_input, block_target = block_input.to(device), block_target.to(device)
        seq_length = 35
        # packed hidden state: (num_layers * n, batch_size / n, hidden_size)
        hidden = init_hidden(num_layers * n, batch_size / n, hidden_size)

        # truncated_length is the full length of each block (defined in the omitted code)
        num_steps = truncated_length // seq_length
        block_loss = 0
        optimizer.zero_grad()

        for step in range(num_steps):
            start = step * seq_length
            end = start + seq_length

            # current input and target slices
            x = block_input[:, start: end]
            y = block_target[:, start: end]

            # detach the hidden state so gradients do not flow across truncation boundaries
            hidden = hidden.detach()

            output, hidden = model(x, hidden)

            loss = loss_function(output.reshape(-1, vocab_size), y.reshape(-1))

            # block_loss is the average loss over the block;
            # loss is the average loss of one truncated step
            block_loss += loss.item()

            # backward pass; gradients accumulate across the truncated steps
            loss.backward(retain_graph=True)

        # clip gradients and update parameters
        nn.utils.clip_grad_norm_(model.parameters(), 5)
        optimizer.step()
        block_loss /= num_steps

        # print training progress
        if batch % 100 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Batch {batch}, Loss: {block_loss:.4f}')

# ... code omitted ...
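
To confirm that the packed shape is preserved across calls, a quick sanity check can be run once the model has been wrapped in nn.DataParallel. This sketch reuses the GRU class, init_hidden, and the hyperparameters defined above; it assumes batch_size is divisible by the number of GPUs.

x = torch.randint(0, vocab_size, (batch_size, 35), device=device)  # dummy token indices
hidden = init_hidden(num_layers * n, batch_size / n, hidden_size)
output, hidden = model(x, hidden)
print(output.shape)  # (batch_size, 35, vocab_size): outputs gathered along dim=0
print(hidden.shape)  # (num_layers * n, batch_size / n, hidden_size): ready for the next step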
