Mathematical Derivation of the Backpropagation Algorithm

This article heavily references the CSDN post "Backpropagation Algorithm (Process and Formula Derivation)".

Basic Definitions

Figure: an example of a simple neural network

In the simple neural network shown in the above diagram, layer 1 is the input layer, layer 2 is the hidden layer, and layer 3 is the output layer. We use the diagram to explain the meaning of some variable names:

| Name | Meaning |
| --- | --- |
| $b_{i}^{l}$ | Bias of the $i$-th neuron in layer $l$ |
| $w_{ji}^{l}$ | Weight of the connection between the $i$-th neuron in layer $l-1$ and the $j$-th neuron in layer $l$ |
| $z_{i}^{l}$ | Input of the $i$-th neuron in layer $l$ |
| $a_{i}^{l}$ | Output of the $i$-th neuron in layer $l$ |
| $\sigma$ | Activation function |

Based on the above definitions, we have:

$$z_{j}^{l} = \sum_{i} w_{ji}^{l} a_{i}^{l-1} + b_{j}^{l}$$

$$a_{j}^{l} = \sigma\left(z_{j}^{l}\right) = \sigma\left( \sum_{i} w_{ji}^{l} a_{i}^{l-1} + b_{j}^{l} \right)$$
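To make the notation concrete, here is a minimal NumPy sketch of this forward computation for a single layer. The helper names `sigmoid` and `forward`, and the choice of the sigmoid as the activation $\sigma$, are assumptions for illustration and not part of the referenced article:

```python
import numpy as np

def sigmoid(z):
    # Assumed activation function sigma; any differentiable activation would do.
    return 1.0 / (1.0 + np.exp(-z))

def forward(a_prev, W, b):
    """One layer of the forward pass.

    a_prev : a^{l-1}, activations of the previous layer, shape (n_{l-1},)
    W      : w^{l},   weight matrix with W[j, i] = w_{ji}^{l}, shape (n_l, n_{l-1})
    b      : b^{l},   bias vector, shape (n_l,)
    """
    z = W @ a_prev + b   # z_j^l = sum_i w_{ji}^l a_i^{l-1} + b_j^l
    a = sigmoid(z)       # a_j^l = sigma(z_j^l)
    return z, a
```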

We define the loss function as the quadratic cost function:

$$J = \frac{1}{2n} \sum_{x} \left\lVert y(x) - a^{L}(x) \right\rVert^{2}$$

Here $x$ denotes the input sample, $y(x)$ the actual classification, $a^{L}(x)$ the predicted classification, $n$ the number of samples, and $L$ the index of the last layer of the network. When there is only one input sample, the loss function $J$ reduces to:

$$J = \frac{1}{2} \left\lVert y(x) - a^{L}(x) \right\rVert^{2}$$
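Continuing the NumPy sketch above, the single-sample cost can be computed directly (`quadratic_cost` is an assumed helper name):

```python
def quadratic_cost(y, a_L):
    # J = 1/2 * ||y - a^L||^2 for a single sample.
    return 0.5 * np.sum((y - a_L) ** 2)
```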

Finally, we define the error generated at the $i$-th neuron in layer $l$ as:

$$\delta_{i}^{l} \equiv \frac{\partial J}{\partial z_{i}^{l}}$$

Formula Derivation

The error of the last layer $L$ of the network with respect to the loss function is:

$$\begin{aligned}
\delta_{i}^{L} &= \frac{\partial J}{\partial z_{i}^{L}} \\
&= \frac{\partial J}{\partial a_{i}^{L}} \cdot \frac{\partial a_{i}^{L}}{\partial z_{i}^{L}} \\
&= \nabla J(a_{i}^{L}) \, \sigma'(z_{i}^{L})
\end{aligned}$$

In vector form, with $\odot$ denoting the element-wise (Hadamard) product:

$$\delta^{L} = \nabla J(a^{L}) \odot \sigma'(z^{L})$$
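For the quadratic cost defined above, $\nabla J(a^{L}) = a^{L} - y$, so the output-layer error can be sketched as follows (continuing the snippet above; `sigmoid_prime` and `output_error` are assumed helper names tied to the assumed sigmoid activation):

```python
def sigmoid_prime(z):
    # sigma'(z) for the assumed sigmoid activation.
    s = sigmoid(z)
    return s * (1.0 - s)

def output_error(a_L, y, z_L):
    # delta^L = grad_a J (*) sigma'(z^L); for the quadratic cost, grad_a J = a^L - y.
    return (a_L - y) * sigmoid_prime(z_L)
```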

The error of the $j$-th neuron in an arbitrary layer $l$ follows from the errors of layer $l+1$:

$$\begin{aligned}
\delta_{j}^{l} &= \frac{\partial J}{\partial z_{j}^{l}} \\
&= \frac{\partial J}{\partial a_{j}^{l}} \cdot \frac{\partial a_{j}^{l}}{\partial z_{j}^{l}} \\
&= \sum_{i} \frac{\partial J}{\partial z_{i}^{l+1}} \cdot \frac{\partial z_{i}^{l+1}}{\partial a_{j}^{l}} \cdot \frac{\partial a_{j}^{l}}{\partial z_{j}^{l}} \\
&= \sum_{i} \delta_{i}^{l+1} \cdot \frac{\partial \left( w_{ij}^{l+1} a_{j}^{l} + b_{i}^{l+1} \right)}{\partial a_{j}^{l}} \cdot \sigma'(z_{j}^{l}) \\
&= \sum_{i} \delta_{i}^{l+1} \cdot w_{ij}^{l+1} \cdot \sigma'(z_{j}^{l})
\end{aligned}$$

In vector form:

$$\delta^{l} = \left( \left( w^{l+1} \right)^{T} \delta^{l+1} \right) \odot \sigma'(z^{l})$$
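In code, this backward step is a single matrix-vector product followed by an element-wise multiplication (continuing the sketch above; `hidden_error` is an assumed name):

```python
def hidden_error(W_next, delta_next, z):
    # delta^l = ((w^{l+1})^T delta^{l+1}) (*) sigma'(z^l)
    # W_next: w^{l+1}, shape (n_{l+1}, n_l); delta_next: delta^{l+1}, shape (n_{l+1},)
    return (W_next.T @ delta_next) * sigmoid_prime(z)
```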

Therefore, we can calculate the gradient of the loss with respect to the weights:

$$\begin{aligned}
\frac{\partial J}{\partial w_{ji}^{l}} &= \frac{\partial J}{\partial z_{j}^{l}} \cdot \frac{\partial z_{j}^{l}}{\partial w_{ji}^{l}} \\
&= \delta_{j}^{l} \cdot \frac{\partial \left( w_{ji}^{l} a_{i}^{l-1} + b_{j}^{l} \right)}{\partial w_{ji}^{l}} \\
&= \delta_{j}^{l} \cdot a_{i}^{l-1}
\end{aligned}$$

That is:

$$\frac{\partial J}{\partial w_{ji}^{l}} = \delta_{j}^{l} \cdot a_{i}^{l-1}$$
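As a sketch, the full weight-gradient matrix for a layer is simply the outer product of the layer's error with the previous layer's activations (`weight_gradient` is an assumed name):

```python
def weight_gradient(delta, a_prev):
    # dJ/dw_{ji}^l = delta_j^l * a_i^{l-1}; the outer product fills the whole matrix,
    # with entry [j, i] matching w_{ji}^l.
    return np.outer(delta, a_prev)
```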

Finally, we can calculate the gradient of the loss with respect to the biases:

$$\begin{aligned}
\frac{\partial J}{\partial b_{j}^{l}} &= \frac{\partial J}{\partial z_{j}^{l}} \cdot \frac{\partial z_{j}^{l}}{\partial b_{j}^{l}} \\
&= \delta_{j}^{l} \cdot \frac{\partial \left( w_{ji}^{l} a_{i}^{l-1} + b_{j}^{l} \right)}{\partial b_{j}^{l}} \\
&= \delta_{j}^{l}
\end{aligned}$$
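The bias gradient is therefore just the layer's error itself. Below is a minimal end-to-end sketch that chains the helpers above into one forward and backward pass over a tiny, illustrative 2-3-1 network on a single sample. All layer sizes, input/target values, and helper names are assumptions for demonstration, not part of the referenced article:

```python
def bias_gradient(delta):
    # dJ/db_j^l = delta_j^l
    return delta

# --- illustrative end-to-end pass over a 2-3-1 network ---
rng = np.random.default_rng(0)
sizes = [2, 3, 1]                                    # layer widths: input, hidden, output
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(m) for m in sizes[1:]]

x = np.array([0.5, -1.0])                            # input sample x
y = np.array([1.0])                                  # target y(x)

# Forward pass, storing every z^l and a^l for reuse in the backward pass.
activations, zs = [x], []
for W, b in zip(Ws, bs):
    z, a = forward(activations[-1], W, b)
    zs.append(z)
    activations.append(a)

# Backward pass: output error first, then propagate it layer by layer.
delta = output_error(activations[-1], y, zs[-1])     # delta^L
grads_W = [None] * len(Ws)
grads_b = [None] * len(bs)
grads_W[-1] = weight_gradient(delta, activations[-2])
grads_b[-1] = bias_gradient(delta)
for l in range(len(Ws) - 2, -1, -1):                 # hidden layers, back to front
    delta = hidden_error(Ws[l + 1], delta, zs[l])    # delta^l
    grads_W[l] = weight_gradient(delta, activations[l])
    grads_b[l] = bias_gradient(delta)
```

Each `grads_W[l]` and `grads_b[l]` could then feed a gradient-descent update such as `W -= eta * grad_W`.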
