//

## Monday, September 16, 2019

### RNN

Assume the RNN is with the following architecture, with hyperbolic tangent activation function, output is discrete.

From time $t=1$ to $t=\tau$ , we apply the update equations:
\begin{aligned} \boldsymbol{a}^{(t)} &=\boldsymbol{b}+\boldsymbol{W} \boldsymbol{h}^{(t-1)}+\boldsymbol{U} \boldsymbol{x}^{(t)} \\ \boldsymbol{h}^{(t)} &=\tanh \left(\boldsymbol{a}^{(t)}\right) \\ \boldsymbol{o}^{(t)} &=\boldsymbol{c}+\boldsymbol{V} \boldsymbol{h}^{(t)} \\ \hat{\boldsymbol{y}}^{(t)} &=\operatorname{softmax}\left(\boldsymbol{o}^{(t)}\right) \end{aligned}

The total loss for a given sequence of $x$ values paired with a sequence of $y$ values would then be just the sum of the losses over all the time steps. If $L^{(t)}$  is the negative log-likelihood of $y^{(t)}$ given $x^{(1)}, x^{(2)}, ..., x^{(t)}$, then

\begin{aligned} & L\left(\left\{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(\tau)}\right\},\left\{\boldsymbol{y}^{(1)}, \ldots, \boldsymbol{y}^{(\tau)}\right\}\right) \\=& \sum_{t} L^{(t)} \\=&-\sum_{t} \log p_{\text {model }}\left(y^{(t)} |\left\{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(t)}\right\}\right) \end{aligned}

Before deriving the derivative for a RNN, let's see what is the derivative for a softmax function.

Denote:
$D_{j} S_{i}=\frac{\partial S_{i}}{\partial a_{j}}=\frac{\partial \frac{e^{a_{i}}}{\sum_{k=1}^{N} e^{a_{k}}}}{\partial a_{j}}$

$D_{j} S_{i}=\left\{\begin{array}{cc}{S_{i}\left(1-S_{j}\right)} & {i=j} \\ {-S_{j} S_{i}} & {i \neq j}\end{array}\right.$

Namely:

$D_{j} S_{i}=S_{i}\left(\delta_{i j}-S_{j}\right)$

For each node $\boldsymbol{N}$:

$$\frac{\partial L}{\partial L^{(t)}}=1$$

\begin{split}

\left(\nabla_{o^{(t)}} L\right)_{i} & =\frac{\partial L}{\partial o_{i}^{(t)}} \\

& =\frac{\partial L}{\partial L^{(t)}} \frac{\partial L^{(t)}}{\partial o_{j}^{(t)}}  \\

&= 1 \times \frac{\partial (- log \hat{y}^{(t)}_{y^{(t)}})} {\partial \hat{y}^{(t)}_{y^{(t)}}}

\frac{\partial \hat{y}^{(t)}_{y^{(t)}}} {\partial o_{i}^{(t)}} \\

& = \hat{y}_{i}^{(t)}-\mathbf{1}_{i,y^{(t)}}

\end{split}

### LSTM

 LSTM architecture from Deep Learning book

Forget gate:   $$f_{i}^{(t)}=\sigma\left(b_{i}^{f}+\sum_{j} U_{i, j}^{f} x_{j}^{(t)}+\sum_{j} W_{i, j}^{f} h_{j}^{(t-1)}\right)$$

where $\sigma$ is the sigmoid function

Input gate:
$$g_{i}^{(t)}=\sigma\left(b_{i}^{g}+\sum_{j} U_{i, j}^{g} x_{j}^{(t)}+\sum_{j} W_{i, j}^{g} h_{j}^{(t-1)}\right)$$

Output gate:
$$q_{i}^{(t)}=\sigma\left(b_{i}^{o}+\sum_{j} U_{i, j}^{o} x_{j}^{(t)}+\sum_{j} W_{i, j}^{o} h_{j}^{(t-1)}\right)$$

LSTM Cell Internal State:
$$s_{i}^{(t)}=f_{i}^{(t)} s_{i}^{(t-1)}+g_{i}^{(t)} \sigma\left(b_{i}+\sum_{j} U_{i, j} x_{j}^{(t)}+\sum_{j} W_{i, j} h_{j}^{(t-1)}\right)$$

Output(hidden state):
$$h_{i}^{(t)}=\tanh \left(s_{i}^{(t)}\right) q_{i}^{(t)}$$

Ref:

1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT press.
2. https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/