RNN
Assume an RNN with the following architecture: hyperbolic tangent activation function and a discrete output.
From time $t=1$ to $t=\tau$, we apply the update equations:
$\begin{aligned} \boldsymbol{a}^{(t)} &=\boldsymbol{b}+\boldsymbol{W} \boldsymbol{h}^{(t-1)}+\boldsymbol{U} \boldsymbol{x}^{(t)} \\ \boldsymbol{h}^{(t)} &=\tanh \left(\boldsymbol{a}^{(t)}\right) \\ \boldsymbol{o}^{(t)} &=\boldsymbol{c}+\boldsymbol{V} \boldsymbol{h}^{(t)} \\ \hat{\boldsymbol{y}}^{(t)} &=\operatorname{softmax}\left(\boldsymbol{o}^{(t)}\right) \end{aligned}$
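As a concrete illustration, here is a minimal NumPy sketch of a single forward step implementing the four update equations above. The sizes, the random parameter initialization, and the helper name `rnn_step` are illustrative assumptions, not reference code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_h, n_x, n_y = 4, 3, 5            # hidden, input, output sizes (arbitrary)

# Parameters, randomly initialized purely for illustration
W = rng.normal(size=(n_h, n_h))    # hidden-to-hidden weights
U = rng.normal(size=(n_h, n_x))    # input-to-hidden weights
V = rng.normal(size=(n_y, n_h))    # hidden-to-output weights
b = np.zeros(n_h)
c = np.zeros(n_y)

def rnn_step(h_prev, x):
    """One forward step: a, h, o, yhat as in the update equations above."""
    a = b + W @ h_prev + U @ x
    h = np.tanh(a)
    o = c + V @ h
    yhat = np.exp(o - o.max())     # numerically stable softmax
    yhat /= yhat.sum()
    return h, o, yhat

h, o, yhat = rnn_step(np.zeros(n_h), rng.normal(size=n_x))
print(yhat.sum())                  # ~1.0: a valid probability distribution
```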
The total loss for a given sequence of $\boldsymbol{x}$ values paired with a sequence of $\boldsymbol{y}$ values is then just the sum of the losses over all time steps. If $L^{(t)}$ is the negative log-likelihood of $y^{(t)}$ given $\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(t)}$, then
$\begin{aligned} & L\left(\left\{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(\tau)}\right\},\left\{\boldsymbol{y}^{(1)}, \ldots, \boldsymbol{y}^{(\tau)}\right\}\right) \\=& \sum_{t} L^{(t)} \\=&-\sum_{t} \log p_{\text {model }}\left(y^{(t)} |\left\{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(t)}\right\}\right) \end{aligned}$
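In code, the total loss is simply the per-step negative log-likelihoods of the observed targets summed over $t$. A small sketch, where `yhats` and `targets` stand in for the softmax outputs $\hat{\boldsymbol{y}}^{(t)}$ and the observed labels $y^{(t)}$ (both are dummy data here):

```python
import numpy as np

rng = np.random.default_rng(1)
tau, n_y = 6, 5                                   # sequence length, number of classes

# Dummy data standing in for the forward-pass outputs and the observed labels
logits = rng.normal(size=(tau, n_y))
yhats = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # rows are yhat^(t)
targets = rng.integers(0, n_y, size=tau)          # y^(t) as integer class indices

# L = sum_t L^(t) = sum_t -log p_model(y^(t) | x^(1..t)) = sum_t -log yhat^(t)[y^(t)]
per_step = -np.log(yhats[np.arange(tau), targets])
total_loss = per_step.sum()
print(per_step, total_loss)
```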
Before deriving the gradients for an RNN, let's first work out the derivative of the softmax function.
Denote:
$D_{j} S_{i}=\frac{\partial S_{i}}{\partial a_{j}}=\frac{\partial \frac{e^{a_{i}}}{\sum_{k=1}^{N} e^{a_{k}}}}{\partial a_{j}}$
Applying the quotient rule gives:
$D_{j} S_{i}=\left\{\begin{array}{cc}{S_{i}\left(1-S_{j}\right)} & {i=j} \\ {-S_{j} S_{i}} & {i \neq j}\end{array}\right.$
Namely:
$D_{j} S_{i}=S_{i}\left(\delta_{i j}-S_{j}\right)$
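This identity is easy to verify numerically by comparing the analytic Jacobian against central finite differences (the function names and tolerance below are illustrative choices):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

a = np.array([0.5, -1.0, 2.0])
S = softmax(a)

# Analytic Jacobian: D_j S_i = S_i (delta_ij - S_j)
J = np.diag(S) - np.outer(S, S)

# Central finite-difference estimate of each column dS / da_j
eps = 1e-6
J_num = np.column_stack([
    (softmax(a + eps * np.eye(3)[j]) - softmax(a - eps * np.eye(3)[j])) / (2 * eps)
    for j in range(3)
])
print(np.allclose(J, J_num, atol=1e-8))   # True
```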
Gradient in RNN
For each node $\boldsymbol{N}$, we compute the gradient $\nabla_{\boldsymbol{N}} L$ recursively, starting from the nodes immediately preceding the final loss:
$$\frac{\partial L}{\partial L^{(t)}}=1$$
\begin{equation}
\begin{split}
\left(\nabla_{\boldsymbol{o}^{(t)}} L\right)_{i} & =\frac{\partial L}{\partial o_{i}^{(t)}} \\
& =\frac{\partial L}{\partial L^{(t)}} \frac{\partial L^{(t)}}{\partial o_{i}^{(t)}} \\
&= 1 \times \frac{\partial \left(-\log \hat{y}^{(t)}_{y^{(t)}}\right)}{\partial \hat{y}^{(t)}_{y^{(t)}}}
\frac{\partial \hat{y}^{(t)}_{y^{(t)}}}{\partial o_{i}^{(t)}} \\
& = -\frac{1}{\hat{y}^{(t)}_{y^{(t)}}} \, \hat{y}^{(t)}_{y^{(t)}}\left(\mathbf{1}_{i,y^{(t)}}-\hat{y}_{i}^{(t)}\right) \\
& = \hat{y}_{i}^{(t)}-\mathbf{1}_{i,y^{(t)}}
\end{split}
\end{equation}
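The final expression can likewise be checked numerically: the gradient of $L^{(t)} = -\log \hat{y}^{(t)}_{y^{(t)}}$ with respect to $\boldsymbol{o}^{(t)}$ should equal $\hat{\boldsymbol{y}}^{(t)}$ minus a one-hot indicator at $y^{(t)}$. A minimal sketch with made-up logits:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

o = np.array([1.0, -0.5, 0.3, 2.0])     # logits o^(t), made up
y = 2                                    # observed class y^(t)

yhat = softmax(o)
grad = yhat.copy()
grad[y] -= 1.0                           # yhat_i - 1_{i, y^(t)}

# Finite-difference check of dL^(t)/do_i with L^(t) = -log yhat_{y^(t)}
def nll(o_):
    return -np.log(softmax(o_)[y])

eps = 1e-6
grad_num = np.array([
    (nll(o + eps * np.eye(4)[i]) - nll(o - eps * np.eye(4)[i])) / (2 * eps)
    for i in range(4)
])
print(np.allclose(grad, grad_num, atol=1e-8))   # True
```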
LSTM
(Figure: LSTM architecture, from the Deep Learning book.)
Forget gate:
$$
f_{i}^{(t)}=\sigma\left(b_{i}^{f}+\sum_{j} U_{i, j}^{f} x_{j}^{(t)}+\sum_{j} W_{i, j}^{f} h_{j}^{(t-1)}\right)
$$
where $\sigma$ is the sigmoid function.
Input gate:
$$
g_{i}^{(t)}=\sigma\left(b_{i}^{g}+\sum_{j} U_{i, j}^{g} x_{j}^{(t)}+\sum_{j} W_{i, j}^{g} h_{j}^{(t-1)}\right)
$$
Output gate:
$$
q_{i}^{(t)}=\sigma\left(b_{i}^{o}+\sum_{j} U_{i, j}^{o} x_{j}^{(t)}+\sum_{j} W_{i, j}^{o} h_{j}^{(t-1)}\right)
$$
Cell internal state:
$$
s_{i}^{(t)}=f_{i}^{(t)} s_{i}^{(t-1)}+g_{i}^{(t)} \sigma\left(b_{i}+\sum_{j} U_{i, j} x_{j}^{(t)}+\sum_{j} W_{i, j} h_{j}^{(t-1)}\right)
$$
Output (hidden state):
$$
h_{i}^{(t)}=\tanh \left(s_{i}^{(t)}\right) q_{i}^{(t)}
$$
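Putting the gate equations together, here is a minimal NumPy sketch of one LSTM cell step. The parameter layout, sizes, and random initialization are assumptions for illustration; only the update equations themselves follow the formulas above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_h, n_x = 4, 3                                   # hidden and input sizes (arbitrary)

# One (U, W, b) triple per gate, plus one for the cell-input path
params = {k: (rng.normal(size=(n_h, n_x)),        # U^k
              rng.normal(size=(n_h, n_h)),        # W^k
              np.zeros(n_h))                      # b^k
          for k in ("f", "g", "o", "cell")}

def lstm_step(s_prev, h_prev, x):
    def affine(k):
        U, W, b = params[k]
        return b + U @ x + W @ h_prev
    f = sigmoid(affine("f"))                      # forget gate
    g = sigmoid(affine("g"))                      # input gate
    q = sigmoid(affine("o"))                      # output gate
    s = f * s_prev + g * sigmoid(affine("cell"))  # cell internal state (sigma, as in the equation above)
    h = np.tanh(s) * q                            # hidden state / output
    return s, h

s, h = lstm_step(np.zeros(n_h), np.zeros(n_h), rng.normal(size=n_x))
print(s.shape, h.shape)
```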
References:
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT press.
- Bendersky, E. (2016). The Softmax function and its derivative. https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/