Forward Propagation and Backward Propagation in Neural Networks

Note: the images in this note were lost during a server migration. They were originally included as visual aids, but they are no longer essential to the content.

What is forward propagation, and what is backward propagation?

Forward propagation and backward propagation

Forward propagation

  • Forward propagation is also called the forward pass; the English term is Forward propagation.
  • Forward propagation can be thought of as recursion in the forward direction: starting from the input layer, the computation proceeds layer by layer through the neurons until the final prediction is produced.

Backward propagation

  • Backward propagation is also called backprop; the English terms are Backpropagation or Backward propagation.
  • Backward propagation can be thought of as recursion in the reverse direction: starting from the result at the output layer, and guided by the loss function and the cost function, gradients are propagated backward layer by layer to optimize the parameters by gradient descent.

The training process of a neural network

In one complete iteration, the network first performs a full forward pass to obtain its predictions, uses the predictions and the true values to compute the loss function and the cost function, and then performs a backward pass, descending the gradient layer by layer to optimize the weights and biases.

After many such iterations, we obtain a reasonably good neural network model.
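To make the loop concrete, here is a deliberately tiny, self-contained sketch (a one-weight model on one made-up sample, not the network derived below) showing the four steps of an iteration in order: forward pass, loss, backward pass, update.

```python
# A deliberately tiny illustration of one training iteration repeated many times.
# Model: y_hat = w * x, loss L = (y_hat - y)^2, so dL/dw = 2 * (y_hat - y) * x.
x, y = 2.0, 10.0          # one made-up training sample
w, alpha = 0.0, 0.05      # initial weight and learning rate (assumptions)
for _ in range(100):
    y_hat = w * x                 # forward propagation
    loss = (y_hat - y) ** 2       # loss
    dw = 2 * (y_hat - y) * x      # backward propagation (chain rule)
    w = w - alpha * dw            # gradient descent step
print(round(w, 3))                # ~5.0, since the data satisfies y = 5 * x
```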

Considering a single neuron

Forward propagation

(Figure: a single neuron receiving four inputs — image lost)

As the (now missing) figure showed, this neuron receives the outputs of four neurons in the previous layer and outputs the value of its activation function.

Suppose this neuron is in layer \(l\); the four upstream neurons are then in layer \(l-1\), and their outputs, from top to bottom, are \(a_{1}^{[l-1]},a_{2}^{[l-1]},a_{3}^{[l-1]},a_{4}^{[l-1]}\). In \(a_{j}^{[i]}\), the superscript \([i]\) denotes layer \(i\) and the subscript \(j\) denotes the \(j\)-th neuron from the top.

For this neuron, the result of the linear part is: \[ z^{[l]} = w_1a_1^{[l-1]}+w_2a_2^{[l-1]}+w_3a_3^{[l-1]}+w_4a_4^{[l-1]}+b^{[l]} \] and the result of the nonlinear part is: \[ a^{[l]} = g(z^{[l]}) \] where \(g\) is the activation function.

Since this neuron is the only (and therefore the first) neuron in its layer, we give it the subscript 1: \[ \left\{ \begin{aligned} &z_1^{[l]} = w_{1,1}a_1^{[l-1]}+w_{1,2}a_2^{[l-1]}+w_{1,3}a_3^{[l-1]}+w_{1,4}a_4^{[l-1]}+b_1^{[l]} \\\ &a_1^{[l]} = g(z_1^{[l]}) \end{aligned} \right. \] where \(w_{i,j}\) denotes the \(j\)-th weight of the \(i\)-th neuron in the layer.

Vectorization:

The four neurons in the previous layer each produce one output, four in total; we collect them in a column vector: \[ a^{[l-1]} = \left[ \begin{matrix} a_1^{[l-1]} \\\ a_2^{[l-1]} \\\ a_3^{[l-1]} \\\ a_4^{[l-1]} \end{matrix} \right] \] where \(a^{[i]}\) denotes the column vector of all outputs of layer \(i\).

Then, for this neuron: \[ \left\{ \begin{aligned} &z_1^{[l]} = w_1^{[l]T}a^{[l-1]}+b_1^{[l]} \\\ &a_1^{[l]} = g(z_1^{[l]}) \end{aligned} \right. \] where \(w_i^{[l]}\) denotes the column vector of all weights of the \(i\)-th neuron in the layer, i.e.: \[ w_i^{[l]} = \left[ \begin{matrix} w_{i,1} \\\ w_{i,2} \\\ \vdots \\\ w_{i,n} \end{matrix} \right] \] For this neuron, \[ w_1^{[l]} = \left[ \begin{matrix} w_{1,1} \\\ w_{1,2} \\\ w_{1,3} \\\ w_{1,4} \end{matrix} \right] \]
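As a minimal numpy sketch of this single-neuron forward step (the input values, the weights, the bias, and the choice of sigmoid as \(g\) are all made-up assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Outputs of the 4 neurons in layer l-1, as the column vector a^{[l-1]}.
a_prev = np.array([[0.5], [0.1], [-0.3], [0.8]])
# Weight column vector w_1^{[l]} and bias b_1^{[l]} (made-up values).
w1 = np.array([[0.2], [-0.4], [0.1], [0.3]])
b1 = 0.05

z1 = np.dot(w1.T, a_prev) + b1   # z_1^{[l]} = w_1^{[l]T} a^{[l-1]} + b_1^{[l]}
a1 = sigmoid(z1)                 # a_1^{[l]} = g(z_1^{[l]}), here g = sigmoid
print(z1.shape, a1)              # (1, 1), a single scalar activation
```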

Backward propagation

For this neuron, suppose the final loss function is \(L = L(\hat y, y)\).

We can then use the chain rule to find the derivatives of the loss with respect to the weights and the bias: \[ \begin{aligned} \frac{\partial L}{\partial z_1^{[l]}} &= \frac{\partial L}{\partial a_1^{[l]}} \times \frac{d a_1^{[l]}}{d z_1^{[l]}} \\\ &= \frac{\partial L}{\partial a_1^{[l]}} \times g'(z_1^{[l]}) \end{aligned} \]

\[ \begin{aligned} \frac{\partial L}{\partial w_{1,j}^{[l]}} &= \frac{\partial L}{\partial z_1^{[l]}} \times \frac{\partial z_1^{[l]}}{\partial w_{1,j}^{[l]}} \\\ &= \frac{\partial L}{\partial z_1^{[l]}} \times \frac{\partial(w_{1,1}a_1^{[l-1]}+w_{1,2}a_2^{[l-1]}+w_{1,3}a_3^{[l-1]}+w_{1,4}a_4^{[l-1]}+b_1^{[l]})}{\partial w_{1,j}^{[l]}} \\\ &= \frac{\partial L}{\partial z_1^{[l]}} \times a_j^{[l-1]} \end{aligned} \]

\[ \begin{aligned} \frac{\partial L}{\partial b_1^{[l]}} &= \frac{\partial L}{\partial z_1^{[l]}} \times \frac{\partial z_1^{[l]}}{\partial b_1^{[l]}} \\\ &= \frac{\partial L}{\partial z_1^{[l]}} \times \frac{\partial(w_{1,1}a_1^{[l-1]}+w_{1,2}a_2^{[l-1]}+w_{1,3}a_3^{[l-1]}+w_{1,4}a_4^{[l-1]}+b_1^{[l]})}{\partial b_1^{[l]}} \\\ &= \frac{\partial L}{\partial z_1^{[l]}} \end{aligned} \]
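Continuing the single-neuron sketch above, the three formulas in numpy (sigmoid is again assumed as \(g\), so \(g'(z) = g(z)(1-g(z))\); the upstream gradient dL_da1 is a made-up stand-in for \(\partial L/\partial a_1^{[l]}\)):

```python
# Continuing the forward sketch above; dL_da1 = dL/da_1^{[l]} is assumed
# to have been computed already (e.g. passed back from layer l+1).
dL_da1 = 0.7                                # made-up upstream gradient
g_prime = sigmoid(z1) * (1 - sigmoid(z1))   # g'(z_1^{[l]}) for sigmoid
dz1 = dL_da1 * g_prime                      # dL/dz_1^{[l]}
dw1 = dz1 * a_prev                          # dL/dw_{1,j}^{[l]} = dz1 * a_j^{[l-1]}
db1 = dz1                                   # dL/db_1^{[l]} = dz1
```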

Vectorization: \[ \frac{\partial L}{\partial w_i^{[l]}} = \left[ \begin{matrix} \frac{\partial L}{\partial w_{i,1}^{[l]}} \\\ \frac{\partial L}{\partial w_{i,2}^{[l]}} \\\ \vdots \\\ \frac{\partial L}{\partial w_{i,n}^{[l]}} \end{matrix} \right] \] If this neuron is not in the output layer:

(Figure: a single neuron whose output feeds four neurons in the next layer — image lost)

As the figure showed, this neuron's output \(a_1^{[l]}\) is passed to four neurons in the next layer, from which we can compute \(\frac{\partial L}{\partial z_1^{[l+1]}},\frac{\partial L}{\partial z_2^{[l+1]}},\frac{\partial L}{\partial z_3^{[l+1]}},\frac{\partial L}{\partial z_4^{[l+1]}}\). So what, exactly, is \(\frac{\partial L}{\partial a_1^{[l]}}\)?

The outputs of those four downstream neurons are \(a_1^{[l+1]},a_2^{[l+1]},a_3^{[l+1]},a_4^{[l+1]}\), and we can regard the loss as a function of these four quantities: \[ L = f_1(a_1^{[l+1]})+f_2(a_2^{[l+1]})+f_3(a_3^{[l+1]})+f_4(a_4^{[l+1]}) \] Then: \[ \begin{aligned} \frac{\partial L}{\partial a_1^{[l]}} &= \frac{\partial f_1(a_1^{[l+1]})}{\partial a_1^{[l]}}+\frac{\partial f_2(a_2^{[l+1]})}{\partial a_1^{[l]}}+\frac{\partial f_3(a_3^{[l+1]})}{\partial a_1^{[l]}}+\frac{\partial f_4(a_4^{[l+1]})}{\partial a_1^{[l]}}\\\ &= \frac{\partial L}{\partial z_1^{[l+1]}}\cdot\frac{\partial z_1^{[l+1]}}{\partial a_1^{[l]}}+\frac{\partial L}{\partial z_2^{[l+1]}}\cdot\frac{\partial z_2^{[l+1]}}{\partial a_1^{[l]}}+\frac{\partial L}{\partial z_3^{[l+1]}}\cdot\frac{\partial z_3^{[l+1]}}{\partial a_1^{[l]}}+\frac{\partial L}{\partial z_4^{[l+1]}}\cdot\frac{\partial z_4^{[l+1]}}{\partial a_1^{[l]}} \end{aligned} \] We can therefore state:

The partial derivative of the loss with respect to a parameter of a neuron in this layer is the sum, over all neurons in the next layer, of the partial derivative of the loss with respect to that neuron's linear result \(z\), times the partial derivative of that \(z\) with respect to the parameter in question.
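We can sanity-check this claim numerically. The sketch below builds a made-up loss from four downstream linear results and compares the analytic sum of contributions against a finite-difference estimate:

```python
import numpy as np

# Numerical check (with made-up numbers) that dL/da equals the sum of
# contributions through every downstream z_i, as stated above.
rng = np.random.default_rng(0)
w_next = rng.normal(size=(4, 1))   # the weights multiplying a_1^{[l]} in layer l+1
b_next = rng.normal(size=(4, 1))

def loss_from_a(a):                # made-up loss: sum of squares of downstream z_i
    z_next = w_next * a + b_next
    return float(np.sum(z_next ** 2))

a = 0.3
z_next = w_next * a + b_next
# Analytic: dL/da = sum_i (dL/dz_i) * (dz_i/da) = sum_i 2*z_i * w_i
analytic = float(np.sum(2 * z_next * w_next))
# Finite-difference estimate:
eps = 1e-6
numeric = (loss_from_a(a + eps) - loss_from_a(a - eps)) / (2 * eps)
print(np.isclose(analytic, numeric))  # True
```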

Multiple neurons, one sample

Forward propagation

(Figure: a layer of neurons between a 4-neuron layer and a 3-neuron layer — image lost)

Suppose the middle layer of neurons is layer \(l\); the 4 neurons before it are then in layer \(l-1\), and the 3 neurons after it are in layer \(l+1\).

From the earlier result: \[ \left\{ \begin{aligned} &z_i^{[l]} = w_i^{[l]T}a^{[l-1]}+b_i^{[l]} \\\ &a_i^{[l]} = g(z_i^{[l]}) \end{aligned} \right. \] where \(i\) indexes the \(i\)-th neuron in the layer.

Vectorization:

Let \(n^{[l]}\) denote the number of neurons in layer \(l\).

\[ z^{[l]} = \left[ \begin{matrix} z_1^{[l]} \\\ z_2^{[l]} \\\ \vdots \\\ z_{n^{[l]}}^{[l]} \end{matrix} \right] \] where \(z^{[l]}\) is the column vector of the linear results of all neurons in the layer.

\[ a^{[l]} = g(z^{[l]}) \] that is: \[ a^{[l]} = \left[ \begin{matrix} a_1^{[l]}\\\ a_2^{[l]}\\\ \vdots\\\ a_{n^{[l]}}^{[l]} \end{matrix} \right] \]

\[ W^{[l]} = \left[ \begin{matrix} w_{1,1}^{[l]}&w_{1,2}^{[l]}&\cdots&w_{1,n^{[l-1]}}^{[l]}\\\ w_{2,1}^{[l]}&w_{2,2}^{[l]}&\cdots&w_{2,n^{[l-1]}}^{[l]}\\\ \vdots&\vdots&&\vdots\\\ w_{n^{[l]},1}^{[l]}&w_{n^{[l]},2}^{[l]}&\cdots&w_{n^{[l]},n^{[l-1]}}^{[l]} \end{matrix} \right] \]

That is: \[ W^{[l]} = \left[ \begin{matrix} w_1^{[l]T} \\\ w_2^{[l]T} \\\ \vdots \\\ w_{n^{[l]}}^{[l]T} \end{matrix} \right] \] and \[ b^{[l]} = \left[ \begin{matrix} b_1^{[l]} \\\ b_2^{[l]} \\\ \vdots\\\ b_{n^{[l]}}^{[l]} \end{matrix} \right] \] Then:

\[ z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]} = \left[ \begin{matrix} w_1^{[l]T}a^{[l-1]}+b_1^{[l]}\\\ w_2^{[l]T}a^{[l-1]}+b_2^{[l]}\\\ \vdots\\\ w_{n^{[l]}}^{[l]T}a^{[l-1]}+b_{n^{[l]}}^{[l]} \end{matrix} \right] = \left[ \begin{matrix} z_1^{[l]} \\\ z_2^{[l]} \\\ \vdots \\\ z_{n^{[l]}}^{[l]} \end{matrix} \right] \]

\[ a^{[l]} = g(z^{[l]}) = \left[ \begin{matrix} a_1^{[l]}\\\ a_2^{[l]}\\\ \vdots\\\ a_{n^{[l]}}^{[l]} \end{matrix} \right] \]
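A minimal numpy sketch of this vectorized forward step for one layer (the layer sizes and all values below are made-up assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_prev, n_l = 4, 3                      # n^{[l-1]} = 4, n^{[l]} = 3 (assumed)
rng = np.random.default_rng(1)
a_prev = rng.normal(size=(n_prev, 1))   # a^{[l-1]}: column vector, shape (4, 1)
W = rng.normal(size=(n_l, n_prev))      # W^{[l]}: each row is one w_i^{[l]T}
b = rng.normal(size=(n_l, 1))           # b^{[l]}: shape (3, 1)

z = np.dot(W, a_prev) + b               # z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}
a = sigmoid(z)                          # a^{[l]} = g(z^{[l]})
print(z.shape, a.shape)                 # (3, 1) (3, 1)
```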

Backward propagation

Suppose the loss for this sample after the forward pass is \(L = L(\hat y, y)\).

Then, \[ \frac{\partial L}{\partial z^{[l]}} = \left[ \begin{matrix} \frac{\partial L}{\partial z_1^{[l]}}\\\ \frac{\partial L}{\partial z_2^{[l]}}\\\ \vdots\\\ \frac{\partial L}{\partial z_{n^{[l]}}^{[l]}} \end{matrix} \right] = \left[ \begin{matrix} \frac{\partial L}{\partial a_1^{[l]}} \times g'(z_1^{[l]})\\\ \frac{\partial L}{\partial a_2^{[l]}} \times g'(z_2^{[l]})\\\ \vdots\\\ \frac{\partial L}{\partial a_{n^{[l]}}^{[l]}} \times g'(z_{n^{[l]}}^{[l]}) \end{matrix} \right] \] Using numpy's element-wise multiplication of arrays (not matrix multiplication): \[ \frac{\partial L}{\partial z^{[l]}} = \left[ \begin{matrix} \frac{\partial L}{\partial a_1^{[l]}}\\\ \frac{\partial L}{\partial a_2^{[l]}}\\\ \vdots\\\ \frac{\partial L}{\partial a_{n^{[l]}}^{[l]}} \end{matrix} \right]* g'(z^{[l]}) = \frac{\partial L}{\partial a^{[l]}} * g'(z^{[l]}) \]

where, \[ g'(z^{[l]}) = \left[ \begin{matrix} g'(z_1^{[l]})\\\ g'(z_2^{[l]})\\\ \vdots\\\ g'(z_{n^{[l]}}^{[l]}) \end{matrix} \right] = \left[ \begin{aligned} \frac{da_1^{[l]}}{dz_1^{[l]}}\\\ \frac{da_2^{[l]}}{dz_2^{[l]}}\\\ \vdots\\\ \frac{da_{n^{[l]}}^{[l]}}{dz_{n^{[l]}}^{[l]}} \end{aligned} \right] \]
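For concreteness, here is how \(g'(z^{[l]})\) can be computed element-wise for two common activations (a sketch; a function like sigmoid_derivative below would play the role of the g_derivative helper used in the summary code at the end of this note):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):      # g'(z) = g(z)(1 - g(z)) for sigmoid
    s = sigmoid(z)
    return s * (1 - s)

def tanh_derivative(z):         # g'(z) = 1 - tanh(z)^2 for tanh
    return 1 - np.tanh(z) ** 2

z = np.array([[0.5], [-1.2], [2.0]])
print(sigmoid_derivative(z))    # element-wise, same shape as z
```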

If layer \(l\) is not the output layer, then \[ z^{[l+1]} = W^{[l+1]}a^{[l]} + b^{[l+1]} = \left[ \begin{matrix} w_1^{[l+1]T}a^{[l]}+b_1^{[l+1]}\\\ w_2^{[l+1]T}a^{[l]}+b_2^{[l+1]}\\\ \vdots\\\ w_{n^{[l+1]}}^{[l+1]T}a^{[l]}+b_{n^{[l+1]}}^{[l+1]} \end{matrix} \right] = \left[ \begin{matrix} z_1^{[l+1]} \\\ z_2^{[l+1]} \\\ \vdots \\\ z_{n^{[l+1]}}^{[l+1]} \end{matrix} \right] \] and so: \[ \frac{\partial L}{\partial a^{[l]}} = \left[ \begin{matrix} \frac{\partial L}{\partial a_1^{[l]}}\\\ \frac{\partial L}{\partial a_2^{[l]}}\\\ \vdots\\\ \frac{\partial L}{\partial a_{n^{[l]}}^{[l]}} \end{matrix} \right] = \left[ \begin{matrix} \frac{\partial L}{\partial z^{[l+1]}}\cdot\frac{\partial z^{[l+1]}}{\partial a_1^{[l]}}\\\ \frac{\partial L}{\partial z^{[l+1]}}\cdot\frac{\partial z^{[l+1]}}{\partial a_2^{[l]}}\\\ \vdots\\\ \frac{\partial L}{\partial z^{[l+1]}}\cdot\frac{\partial z^{[l+1]}}{\partial a_{n^{[l]}}^{[l]}} \end{matrix} \right] = \left[ \begin{matrix} \frac{\partial L}{\partial z_1^{[l+1]}}\cdot\frac{\partial z_1^{[l+1]}}{\partial a_1^{[l]}}+\frac{\partial L}{\partial z_2^{[l+1]}}\cdot\frac{\partial z_2^{[l+1]}}{\partial a_1^{[l]}}+\cdots+\frac{\partial L}{\partial z_{n^{[l+1]}}^{[l+1]}}\cdot\frac{\partial z_{n^{[l+1]}}^{[l+1]}}{\partial a_1^{[l]}}\\\ \frac{\partial L}{\partial z_1^{[l+1]}}\cdot\frac{\partial z_1^{[l+1]}}{\partial a_2^{[l]}}+\frac{\partial L}{\partial z_2^{[l+1]}}\cdot\frac{\partial z_2^{[l+1]}}{\partial a_2^{[l]}}+\cdots+\frac{\partial L}{\partial z_{n^{[l+1]}}^{[l+1]}}\cdot\frac{\partial z_{n^{[l+1]}}^{[l+1]}}{\partial a_2^{[l]}}\\\ \vdots\\\ \frac{\partial L}{\partial z_1^{[l+1]}}\cdot\frac{\partial z_1^{[l+1]}}{\partial a_{n^{[l]}}^{[l]}}+\frac{\partial L}{\partial z_2^{[l+1]}}\cdot\frac{\partial z_2^{[l+1]}}{\partial a_{n^{[l]}}^{[l]}}+\cdots+\frac{\partial L}{\partial z_{n^{[l+1]}}^{[l+1]}}\cdot\frac{\partial z_{n^{[l+1]}}^{[l+1]}}{\partial a_{n^{[l]}}^{[l]}} \end{matrix} \right] \]

That is: \[ \begin{aligned} \frac{\partial L}{\partial a^{[l]}} &= \left[ \begin{matrix} \frac{\partial L}{\partial z_1^{[l+1]}}\cdot w_{1,1}^{[l+1]}+\frac{\partial L}{\partial z_2^{[l+1]}}\cdot w_{2,1}^{[l+1]}+\cdots+\frac{\partial L}{\partial z_{n^{[l+1]}}^{[l+1]}}\cdot w_{n^{[l+1]},1}^{[l+1]}\\\ \frac{\partial L}{\partial z_1^{[l+1]}}\cdot w_{1,2}^{[l+1]}+\frac{\partial L}{\partial z_2^{[l+1]}}\cdot w_{2,2}^{[l+1]}+\cdots+\frac{\partial L}{\partial z_{n^{[l+1]}}^{[l+1]}}\cdot w_{n^{[l+1]},2}^{[l+1]}\\\ \vdots\\\ \frac{\partial L}{\partial z_1^{[l+1]}}\cdot w_{1,n^{[l]}}^{[l+1]}+\frac{\partial L}{\partial z_2^{[l+1]}}\cdot w_{2,n^{[l]}}^{[l+1]}+\cdots+\frac{\partial L}{\partial z_{n^{[l+1]}}^{[l+1]}}\cdot w_{n^{[l+1]},n^{[l]}}^{[l+1]} \end{matrix} \right] \\\ &= \left[ \begin{matrix} w_{1,1}^{[l+1]}&w_{2,1}^{[l+1]}&\cdots&w_{n^{[l+1]},1}^{[l+1]}\\\ w_{1,2}^{[l+1]}&w_{2,2}^{[l+1]}&\cdots&w_{n^{[l+1]},2}^{[l+1]}\\\ \vdots&\vdots&&\vdots\\\ w_{1,n^{[l]}}^{[l+1]}&w_{2,n^{[l]}}^{[l+1]}&\cdots&w_{n^{[l+1]},n^{[l]}}^{[l+1]} \end{matrix} \right]\cdot\left[ \begin{matrix} \frac{\partial L}{\partial z_1^{[l+1]}}\\\ \frac{\partial L}{\partial z_2^{[l+1]}}\\\ \vdots\\\ \frac{\partial L}{\partial z_{n^{[l+1]}}^{[l+1]}} \end{matrix} \right] \\\ &= W^{[l+1]T}\cdot\frac{\partial L}{\partial z^{[l+1]}} \end{aligned} \]

Therefore: \[ \frac{\partial L}{\partial z^{[l]}} = np.dot(W^{[l+1]T},\frac{\partial L}{\partial z^{[l+1]}}) * g'(z^{[l]}) \]

where np.dot(a,b) is numpy's matrix multiplication of \(a\) and \(b\), and \(*\) is numpy's element-wise multiplication of arrays.
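A two-line illustration of the difference, with made-up arrays:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[10.0, 20.0], [30.0, 40.0]])
print(np.dot(A, B))   # matrix product:       [[ 70. 100.] [150. 220.]]
print(A * B)          # element-wise product: [[ 10.  40.] [ 90. 160.]]
```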

Next we compute \(\frac{\partial L}{\partial W^{[l]}}\) and \(\frac{\partial L}{\partial b^{[l]}}\): \[ \begin{aligned} \frac{\partial L}{\partial W^{[l]}} &= \left[ \begin{matrix} \frac{\partial L}{\partial w_{1,1}^{[l]}}&\frac{\partial L}{\partial w_{1,2}^{[l]}}&\cdots&\frac{\partial L}{\partial w_{1,n^{[l-1]}}^{[l]}}\\\ \frac{\partial L}{\partial w_{2,1}^{[l]}}&\frac{\partial L}{\partial w_{2,2}^{[l]}}&\cdots&\frac{\partial L}{\partial w_{2,n^{[l-1]}}^{[l]}}\\\ \vdots&\vdots&&\vdots\\\ \frac{\partial L}{\partial w_{n^{[l]},1}^{[l]}}&\frac{\partial L}{\partial w_{n^{[l]},2}^{[l]}}&\cdots&\frac{\partial L}{\partial w_{n^{[l]},n^{[l-1]}}^{[l]}} \end{matrix} \right]\\\ &= \left[ \begin{matrix} \frac{\partial L}{\partial z_1^{[l]}}\cdot a_1^{[l-1]}&\frac{\partial L}{\partial z_1^{[l]}}\cdot a_2^{[l-1]}&\cdots&\frac{\partial L}{\partial z_1^{[l]}}\cdot a_{n^{[l-1]}}^{[l-1]}\\\ \frac{\partial L}{\partial z_2^{[l]}}\cdot a_1^{[l-1]}&\frac{\partial L}{\partial z_2^{[l]}}\cdot a_2^{[l-1]}&\cdots&\frac{\partial L}{\partial z_2^{[l]}}\cdot a_{n^{[l-1]}}^{[l-1]}\\\ \vdots&\vdots&&\vdots\\\ \frac{\partial L}{\partial z_{n^{[l]}}^{[l]}}\cdot a_1^{[l-1]}&\frac{\partial L}{\partial z_{n^{[l]}}^{[l]}}\cdot a_2^{[l-1]}&\cdots&\frac{\partial L}{\partial z_{n^{[l]}}^{[l]}}\cdot a_{n^{[l-1]}}^{[l-1]} \end{matrix} \right]\\\ &= \left[ \begin{matrix} \frac{\partial L}{\partial z_1^{[l]}}\\\ \frac{\partial L}{\partial z_2^{[l]}}\\\ \vdots\\\ \frac{\partial L}{\partial z_{n^{[l]}}^{[l]}} \end{matrix} \right]\cdot\left[ \begin{matrix} a_1^{[l-1]}&a_2^{[l-1]}&\cdots&a_{n^{[l-1]}}^{[l-1]} \end{matrix} \right] \\\ &= \frac{\partial L}{\partial z^{[l]}}\cdot a^{[l-1]T} \end{aligned} \]

\[ \frac{\partial L}{\partial b^{[l]}} = \left[ \begin{matrix} \frac{\partial L}{\partial b_1^{[l]}}\\\ \frac{\partial L}{\partial b_2^{[l]}}\\\ \vdots\\\ \frac{\partial L}{\partial b_{n^{[l]}}^{[l]}} \end{matrix} \right] = \left[ \begin{matrix} \frac{\partial L}{\partial z_1^{[l]}}\\\ \frac{\partial L}{\partial z_2^{[l]}}\\\ \vdots\\\ \frac{\partial L}{\partial z_{n^{[l]}}^{[l]}} \end{matrix} \right] = \frac{\partial L}{\partial z^{[l]}} \]
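Putting the three single-sample formulas together in numpy (a self-contained sketch; the layer sizes and the values of \(z^{[l]}\), \(W^{[l+1]}\), and \(\partial L/\partial z^{[l+1]}\) are made-up stand-ins):

```python
import numpy as np

rng = np.random.default_rng(2)
n_prev, n_l, n_next = 4, 3, 2            # assumed layer sizes
a_prev = rng.normal(size=(n_prev, 1))    # a^{[l-1]} from the forward pass
z = rng.normal(size=(n_l, 1))            # z^{[l]} from the forward pass
W_next = rng.normal(size=(n_next, n_l))  # W^{[l+1]}
dz_next = rng.normal(size=(n_next, 1))   # dL/dz^{[l+1]}, assumed given

def sigmoid_derivative(z):               # assuming g = sigmoid
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

# dL/dz^{[l]} = W^{[l+1]T} . dL/dz^{[l+1]} * g'(z^{[l]})
dz = np.dot(W_next.T, dz_next) * sigmoid_derivative(z)
dW = np.dot(dz, a_prev.T)                # dL/dW^{[l]} = dz . a^{[l-1]T}
db = dz                                  # dL/db^{[l]} = dz
print(dz.shape, dW.shape, db.shape)      # (3, 1) (3, 4) (3, 1)
```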

Multiple neurons, multiple samples

Suppose there are \(m\) samples, and for the \(i\)-th sample the relevant quantities are \(z^{[l](i)},a^{[l](i)}\): the linear result and the activation of layer \(l\) for sample \(i\).

where, \[ z^{[l](i)} = \left[ \begin{matrix} z_1^{[l](i)}\\\ z_2^{[l](i)}\\\ \vdots\\\ z_{n^{[l]}}^{[l](i)} \end{matrix} \right] \]

\[ a^{[l](i)} = g(z^{[l](i)}) = \left[ \begin{matrix} a_1^{[l](i)}\\\ a_2^{[l](i)}\\\ \vdots\\\ a_{n^{[l]}}^{[l](i)} \end{matrix} \right] \]

Forward propagation

To process many samples at once with matrices and cut down on loops, let: \[ Z^{[l]} = [\begin{matrix} z^{[l](1)}&z^{[l](2)}&\cdots&z^{[l](m)} \end{matrix}] \]

\[ A^{[l]} = [\begin{matrix} a^{[l](1)}&a^{[l](2)}&\cdots&a^{[l](m)} \end{matrix}] \]

where \(Z^{[l]}\) and \(A^{[l]}\) are both \(n^{[l]}\times m\) matrices.

Let the loss for the \(i\)-th sample be \(L^{(i)} = L(\hat y^{(i)},y^{(i)})\).

Then: \[ Z^{[l]} = W^{[l]}A^{[l-1]}+b^{[l]} \]

\[ A^{[l]} = g(Z^{[l]}) \]

Note: here \(b^{[l]}\) is an \(n^{[l]}\times 1\) matrix in python, while \(W^{[l]}A^{[l-1]}\) is \(n^{[l]}\times m\). This works because python automatically broadcasts \(b^{[l]}\) to \([\begin{matrix} b^{[l]}&b^{[l]}&\cdots&b^{[l]} \end{matrix}]\), i.e. expands it horizontally into an \(n^{[l]}\times m\) matrix. In the derivations below, \(b^{[l]}\) is always an \(n^{[l]}\times 1\) matrix that broadcasting expands to \(n^{[l]}\times m\).
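A quick demonstration of this broadcasting behavior (shapes chosen arbitrarily):

```python
import numpy as np

n_l, m = 3, 5
WA = np.zeros((n_l, m))                 # stands in for W^{[l]} A^{[l-1]}, shape (3, 5)
b = np.array([[1.0], [2.0], [3.0]])     # b^{[l]}, shape (3, 1)
Z = WA + b                              # b is broadcast across the m columns
print(Z.shape)                          # (3, 5): every column equals b
```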

Backward propagation

Let the matrix of losses be: \[ L = [\begin{matrix} L^{(1)}&L^{(2)}&\cdots&L^{(m)} \end{matrix}] \]

Then: \[ \begin{aligned} \frac{\partial L}{\partial Z^{[l]}} &= [\begin{matrix} \frac{\partial L^{(1)}}{\partial z^{[l](1)}}&\frac{\partial L^{(2)}}{\partial z^{[l](2)}}&\cdots&\frac{\partial L^{(m)}}{\partial z^{[l](m)}} \end{matrix}]\\\ &= [\begin{matrix} \frac{\partial L^{(1)}}{\partial a^{[l](1)}}*g'(z^{[l](1)})&\frac{\partial L^{(2)}}{\partial a^{[l](2)}}*g'(z^{[l](2)})&\cdots&\frac{\partial L^{(m)}}{\partial a^{[l](m)}}*g'(z^{[l](m)}) \end{matrix}] \\\ &= [\begin{matrix} \frac{\partial L^{(1)}}{\partial a^{[l](1)}}&\frac{\partial L^{(2)}}{\partial a^{[l](2)}}&\cdots&\frac{\partial L^{(m)}}{\partial a^{[l](m)}} \end{matrix}] * g'(Z^{[l]}) \\\ &= \frac{\partial L}{\partial A^{[l]}} * g'(Z^{[l]}) \end{aligned} \]

If layer \(l\) is not the output layer, then: \[ \begin{aligned} \frac{\partial L}{\partial A^{[l]}} &= [\begin{matrix} \frac{\partial L^{(1)}}{\partial a^{[l](1)}}&\frac{\partial L^{(2)}}{\partial a^{[l](2)}}&\cdots&\frac{\partial L^{(m)}}{\partial a^{[l](m)}} \end{matrix}]\\\ &= [\begin{matrix} W^{[l+1]T}\cdot\frac{\partial L^{(1)}}{\partial z^{[l+1](1)}}&W^{[l+1]T}\cdot\frac{\partial L^{(2)}}{\partial z^{[l+1](2)}}&\cdots&W^{[l+1]T}\cdot\frac{\partial L^{(m)}}{\partial z^{[l+1](m)}} \end{matrix}]\\\ &= W^{[l+1]T}\cdot[\begin{matrix} \frac{\partial L^{(1)}}{\partial z^{[l+1](1)}}&\frac{\partial L^{(2)}}{\partial z^{[l+1](2)}}&\cdots&\frac{\partial L^{(m)}}{\partial z^{[l+1](m)}} \end{matrix}]\\\ &= W^{[l+1]T}\cdot\frac{\partial L}{\partial Z^{[l+1]}} \end{aligned} \]

Therefore: \[ \frac{\partial L}{\partial Z^{[l]}} = np.dot(W^{[l+1]T},\frac{\partial L}{\partial Z^{[l+1]}}) * g'(Z^{[l]}) \]

Next we find the gradients \(\frac{\partial J}{\partial W^{[l]}}\) and \(\frac{\partial J}{\partial b^{[l]}}\).

where \(J\) is the cost function, related to the loss functions by: \[ J = \frac{1}{m} \times \displaystyle\sum_{i=1}^{m}{L^{(i)}} \]

\[ \begin{aligned} \frac{\partial J}{\partial W^{[l]}} &= \frac{1}{m}\Bigg(\frac{\partial L^{(1)}}{\partial W^{[l]}}+\frac{\partial L^{(2)}}{\partial W^{[l]}}+\cdots+\frac{\partial L^{(m)}}{\partial W^{[l]}}\Bigg)\\\ &= \frac{1}{m}\Bigg(\frac{\partial L^{(1)}}{\partial z^{[l](1)}}\cdot a^{[l-1](1)T}+\frac{\partial L^{(2)}}{\partial z^{[l](2)}}\cdot a^{[l-1](2)T}+\cdots+\frac{\partial L^{(m)}}{\partial z^{[l](m)}}\cdot a^{[l-1](m)T}\Bigg)\\\ &= \frac{1}{m}\Bigg([\begin{matrix} \frac{\partial L^{(1)}}{\partial z^{[l](1)}}&\frac{\partial L^{(2)}}{\partial z^{[l](2)}}&\cdots&\frac{\partial L^{(m)}}{\partial z^{[l](m)}} \end{matrix}]\cdot\left[ \begin{matrix} a^{[l-1](1)}\\\ a^{[l-1](2)}\\\ \vdots\\\ a^{[l-1](m)} \end{matrix} \right]\Bigg)\\\ &= \frac{1}{m}\Bigg(\frac{\partial L}{\partial Z^{[l]}}\cdot A^{[l-1]T}\Bigg) \end{aligned} \]

\[ \begin{aligned} \frac{\partial J}{\partial b^{[l]}} &= \frac{1}{m}\Bigg(\frac{\partial L^{(1)}}{\partial z^{[l](1)}}+\frac{\partial L^{(2)}}{\partial z^{[l](2)}}+\cdots+\frac{\partial L^{(m)}}{\partial z^{[l](m)}}\Bigg)\\\ &= \frac{1}{m}\times np.sum(\frac{\partial L}{\partial Z^{[l]}}, axis=1, keepdims=True) \end{aligned} \]

Gradient descent: \[ \begin{aligned} &W^{[l]} = W^{[l]} - \alpha\times\frac{\partial J}{\partial W^{[l]}}\\\ &b^{[l]} = b^{[l]} - \alpha\times\frac{\partial J}{\partial b^{[l]}} \end{aligned} \]

Conclusion

For convenience of implementation in python, let: \[ \left\{ \begin{aligned} &W[l] = W^{[l]}\\\ &b[l] = b^{[l]}\\\ &Z[l] = Z^{[l]}\\\ &A[l] = A^{[l]}\\\ &dZ[l] = \frac{\partial L}{\partial Z^{[l]}}\\\ &dW[l] = \frac{\partial J}{\partial W^{[l]}}\\\ &db[l] = \frac{\partial J}{\partial b^{[l]}} \end{aligned} \right. \] where:

| Matrix | Dimensions |
| --- | --- |
| \(W^{[l]}\) | \(n^{[l]}\times n^{[l-1]}\) |
| \(b^{[l]}\) | \(n^{[l]}\times 1\) |
| \(Z^{[l]}\) | \(n^{[l]}\times m\) |
| \(A^{[l]}\) | \(n^{[l]}\times m\) |
| \(\frac{\partial L}{\partial Z^{[l]}}\) | \(n^{[l]}\times m\) |
| \(\frac{\partial J}{\partial W^{[l]}}\) | \(n^{[l]}\times n^{[l-1]}\) |
| \(\frac{\partial J}{\partial b^{[l]}}\) | \(n^{[l]}\times 1\) |

Then:

Forward propagation

\[ \left\{ \begin{aligned} &Z^{[l]} = W^{[l]}A^{[l-1]}+b^{[l]} \\\ &A^{[l]} = g(Z^{[l]}) \end{aligned} \right. \]

```python
Z[l] = np.dot(W[l], A[l-1]) + b[l]
A[l] = g(Z[l])
```

Backward propagation

\[ \left\{ \begin{aligned} &\frac{\partial L}{\partial Z^{[l]}} = W^{[l+1]T}\cdot\frac{\partial L}{\partial Z^{[l+1]}}*g'(Z^{[l]}) \\\ &\frac{\partial J}{\partial W^{[l]}} = \frac{1}{m}\Bigg(\frac{\partial L}{\partial Z^{[l]}}\cdot A^{[l-1]T}\Bigg)\\\ &\frac{\partial J}{\partial b^{[l]}} = \frac{1}{m}\times np.sum(\frac{\partial L}{\partial Z^{[l]}}, axis=1, keepdims=True) \end{aligned} \right. \]

```python
dZ[l] = np.dot(W[l+1].T, dZ[l+1]) * g_derivative(Z[l])
dW[l] = 1 / m * np.dot(dZ[l], A[l-1].T)
db[l] = 1 / m * np.sum(dZ[l], axis=1, keepdims=True)
```

Gradient descent:

```python
W[l] = W[l] - alpha * dW[l]
b[l] = b[l] - alpha * db[l]
```
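Putting the whole summary together, here is a minimal end-to-end sketch: a 2-layer network with sigmoid activations trained on a made-up toy dataset using exactly the formulas above. The layer sizes, data, learning rate, and the choice of cross-entropy loss (for which \(\frac{\partial L}{\partial Z}\) at the output simplifies to \(A - Y\)) are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
m = 8                                    # number of samples (toy data)
X = rng.normal(size=(2, m))              # inputs: n^{[0]} = 2 features
Y = (X[0:1] + X[1:2] > 0).astype(float)  # made-up binary labels, shape (1, m)

n = [2, 3, 1]                            # layer sizes n^{[0]}, n^{[1]}, n^{[2]}
W1 = rng.normal(size=(n[1], n[0])) * 0.5
b1 = np.zeros((n[1], 1))
W2 = rng.normal(size=(n[2], n[1])) * 0.5
b2 = np.zeros((n[2], 1))
alpha = 0.5

for _ in range(1000):
    # Forward propagation
    Z1 = np.dot(W1, X) + b1              # b1 is broadcast across the m columns
    A1 = sigmoid(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)                     # predictions y_hat

    # Backward propagation (cross-entropy loss + sigmoid output assumed,
    # for which dL/dZ2 simplifies to A2 - Y)
    dZ2 = A2 - Y
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(W2.T, dZ2) * A1 * (1 - A1)   # g'(Z1) = A1(1 - A1) for sigmoid
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # Gradient descent
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2

print(np.mean((A2 > 0.5) == (Y > 0.5)))  # training accuracy on the toy data
```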
