## Tuesday, September 1, 2020

### Gated Recurrent Unit

• Proposed by Cho et al. in 2014 as a simpler alternative to the LSTM
•  No cell states, on each time step t, we have input $x^{(t)}$ and hidden states $h^{(t)}$
Update Gate: controls what parts of hidden state are updated vs preserved

$\boldsymbol{u}^{(t)}=\sigma\left(\boldsymbol{W}_{u} \boldsymbol{h}^{(t-1)}+\boldsymbol{U}_{u} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{u}\right)$

Reset Gate: controls what parts of previous hidden state are used to compute new content

$\boldsymbol{r}^{(t)}=\sigma\left(\boldsymbol{W}_{r} \boldsymbol{h}^{(t-1)}+\boldsymbol{U}_{r} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{r}\right)$

New hidden states content: reset gate selects useful parts of previous hidden state. Use this and current input to compute new hidden content

$\tilde{\boldsymbol{h}}^{(t)}=\tanh \left(\boldsymbol{W}_{h}\left(\boldsymbol{r}^{(t)} \circ \boldsymbol{h}^{(t-1)}\right)+\boldsymbol{U}_{h} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{h}\right)$

Hidden state: update gate simultaneously controls what is kept from previous hidden state, and what is updated to new hidden state content

$\boldsymbol{h}^{(t)}=\left(1-\boldsymbol{u}^{(t)}\right) \circ \boldsymbol{h}^{(t-1)}+\boldsymbol{u}^{(t)} \circ \tilde{\boldsymbol{h}}^{(t)}$

## Friday, March 6, 2020

### Summary:

• This paper introduces a variant of LSTM which can unsupervisely learn latent sentences structures(?), further benefiting grammar related tasks(unsupervised constituency parsing, target syntactic evaluation, etc.)
• It introduces two additional gates: Master Forget Gate and Master Input Gate controlling updating mechanism of LSTM
• It divides hidden state into an unequal place: higher location(those with bigger indexes) are used for storing history information and harder to erase, lower locations are for newer information.  The division is not hard(by binary gates I mean), but by softmax gates for back propagating purpose.

### Overall Designing Principles

• Languages are hierarchical and structured.
And the author wants to incorporate this prior to the model. For each time step t. Some parts of the hidden states should be preserved(correspond to top layers in the hierarchy, like the S which stands for subject in the figure. This of course should be kept further.)

To this end, Master Forget Gate and Master Input Gate controlling updating mechanism of LSTM have been proposed.
We wan to define $d^f_t$ and $d_t^i$ which represent the split points for
(0,...,0, 1,...,1)   (idealized Master Forget Gate)and
(1, ..., 1, 0, ..,0, 0) (idealized Master Input Gate)respectively.

Consider two different cases:
Case I: $d^f_t \le d_t^i$, input level $d_t^i$ is greater or equal to forget(history) level $d^f_t$ .
0  0  0  1  1  1  1  1 $d^f_t=4$
1  1  1  1  0  0  0  0  $d_t^i=4$
N N N C O O O O
Where N means update with new element, C means standard LSTM update meachanism, O stands for keep original information.

Case II: $d^f_t \gt d_t^i$, input level $d_t^i$ is less than forget(history) level $d^f_t$ .
0  0  0  0  1  1  1  1 $d^f_t=5$
1  1  1  0  0  0  0  0 $d_t^i=3$
N N N X O O O O

Where X means putting 0 in that position.

But outputing binary is differentiable in neural nets, so author chooses softmaxen-ed operations instead.
Define operation cumax :

def cumsoftmax(x, dim=-1):


Define Master Forget Gate and Master Input Gate as:
$$\begin{array}{l} \tilde{f}_{t}=\overrightarrow{\mathrm{cumsum}}\left(\operatorname{softmax}\left(W_{\dot{f}} x_{t}+U_{\tilde{f}} h_{t-1}+b_{\tilde{f}}\right)\right) \\ \tilde{i}_{t}=\overleftarrow{\mathrm{cumsum}}\left(\operatorname{softmax}\left(W_{\tilde{i}} x_{t}+U_{\tilde{i}} h_{t-1}+b_{\tilde{i}}\right)\right) =1- \overrightarrow{\mathrm{cumsum}}\left(\operatorname{softmax}\left(W_{\tilde{i}} x_{t}+U_{\tilde{i}} h_{t-1}+b_{\tilde{i}}\right)\right) \end{array}$$

### Comparison to LSTM

Note:
• Blue stands for Gate Mechanism (LSTM)
• Red stands for content updating mechanism (LSTM)
• Fuchsia stands for ON-LSTM introduced Gate Mechanism (ON-LSTM)
• Green stands for ON-LSTM impacted Gates (ON-LSTM)
• This yellowish #e69138 color stands for impacted content updating mechanism (ON-LSTM)

#### LSTM:

Forget gate: controls what is kept vs forgotten, from previous cell state

$\boldsymbol{f}^{(t)}=\sigma \left(\boldsymbol{W}_{f} \boldsymbol{h}^{(t-1)}+\boldsymbol{U}_{f} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{f}\right)$

Input gate: controls what parts of the new cell content are written to cell

$\boldsymbol{i}^{(t)}=\sigma \left(\boldsymbol{W}_{i} \boldsymbol{h}^{(t-1)}+\boldsymbol{U}_{i} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{i}\right)$

Output gate: controls what parts of cell are output to hidden state

$\boldsymbol{o}^{(t)}=\sigma \left(\boldsymbol{W}_{o} \boldsymbol{h}^{(t-1)}+\boldsymbol{U}_{o} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{o}\right)$

New cell content: this is the new content to be written to the cell

$\tilde{\boldsymbol{c}}^{(t)}=\tanh \left(\boldsymbol{W}_{c} \boldsymbol{h}^{(t-1)}+\boldsymbol{U}_{c} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{c}\right)$

Cell state: erase (“forget”) some content from last cell state, and write (“input”) some new cell content

$\boldsymbol{c}^{(t)}=\boldsymbol{f}^{(t)} \circ \boldsymbol{c}^{(t-1)}+\boldsymbol{i}^{(t)} \circ \tilde{\boldsymbol{c}}^{(t)}$

Hidden state: read (“output”) some content from the cell

$\boldsymbol{h}^{(t)}=\boldsymbol{o}^{(t)} \circ \tanh \boldsymbol{c}^{(t)}$ LSTM,  Source: https://kexue.fm/archives/6621

#### ON-LSTM

Intersection(Optional): Hadamard Product between Master Forget Gate and Master Input Gate
$\omega_t= \tilde{f_t} \cdot \tilde{i_t}$

Master Forget gate: Ordered Forget Gate Control Mechanism

$\tilde{f}_{t}=\overrightarrow{\mathrm{cumsum}}\left(\operatorname{softmax}\left(W_{\dot{f}} x_{t}+U_{\tilde{f}} h_{t-1}+b_{\tilde{f}}\right)\right)$

Forget gate: controls what is kept vs forgotten, from previous cell state

$\hat{f}_{t}=f_{t} \circ \omega_{t}+\left(\tilde{f}_{t}-\omega_{t}\right)=\tilde{f}_{t} \circ\left(f_{t} \circ \tilde{i}_{t}+1-\tilde{i}_{t}\right)$

Master Input gate: Ordered Input Gate Control Mechanism

$\tilde{i}_{t}=\overleftarrow{\mathrm{cumsum}}\left(\operatorname{softmax}\left(W_{\tilde{i}} x_{t}+U_{\tilde{i}} h_{t-1}+b_{\tilde{i}}\right)\right) =1- \overrightarrow{\mathrm{cumsum}}\left(\operatorname{softmax}\left(W_{\tilde{i}} x_{t}+U_{\tilde{i}} h_{t-1}+b_{\tilde{i}}\right)\right)$

Input gate: controls what parts of the new cell content are written to cell

$\hat{i}_{t}=i_{t} \circ \omega_{t}+\left(\tilde{i}_{t}-\omega_{t}\right)=\tilde{i}_{t} \circ\left(i_{t} \circ \tilde{f}_{t}+1-\tilde{f}_{t}\right)$

Output gate: controls what parts of cell are output to hidden state

$\boldsymbol{o}^{(t)}=\sigma \left(\boldsymbol{W}_{o} \boldsymbol{h}^{(t-1)}+\boldsymbol{U}_{o} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{o}\right)$

New cell content: this is the new content to be written to the cell

$\tilde{\boldsymbol{c}}^{(t)}=\tanh \left(\boldsymbol{W}_{c} \boldsymbol{h}^{(t-1)}+\boldsymbol{U}_{c} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{c}\right)$

Cell state: erase (“forget”) some content from last cell state, and write (“input”) some new cell content

$c_{t}=\hat{f}_{t} \circ c_{t-1}+\hat{i}_{t} \circ \hat{c}_{t}$

Hidden state: read (“output”) some content from the cell

$\boldsymbol{h}^{(t)}=\boldsymbol{o}^{(t)} \circ \tanh \boldsymbol{c}^{(t)}$ ON-LSTM, Source: https://kexue.fm/archives/6621/comment-page-1

### What's Next?

• Is "cumax"(cusum + softmax) really necessary?
From what I have observed., in reality the forget and input gates are not in the (0,0,0,1,1,1) or (1,1,1,0,0,0) regime. Take one for example:
[0.3182306 , 0.4296763 , 0.5659539 , 0.6581644 , 0.8738386 ,
0.94163144, 0.9522468 , 0.95804715, 0.960161  , 0.96124136,
0.96252906, 0.963744  , 0.964411  , 0.96599156, 0.9674326 ,
0.9684573 , 0.96898276, 0.9697275 , 0.97011393, 0.9703722 ,
0.9713492 , 0.9736657 , 0.97405773, 0.9745897 , 0.9753741 ,
0.97613144, 0.9766177 , 0.97877145, 0.98247737, 0.9835741 ,
0.9863124 , 0.9878812 , 0.9899598 , 0.9946286 , 0.99710023,
0.99824625, 0.9990951 , 0.9997795 , 0.99986076, 1.0000001 ]
This is not idealized version of gates. Too many of "middle states", limiting the ordering ability of ON-LSTM. It's possible to use softmax to assign to splitting point, but that leads to non-differentiability.

• Is the hidden states in LSTM ordered?
Only cell states are ordered.

Reference:

## Long Short-Term Memory (LSTM)

On step t, there is a hidden state $h^{(t)}$ and a cell state $c^{(t)}$:
• Both are vectors of length n
• The cell stores long-term information
• The LSTM can erase, write or read information from the cell
Formulas(Note, all $\sigma$ represent sigmoid function)
Forget gate: controls what is kept vs forgotten, from previous cell state

$\boldsymbol{f}^{(t)}=\sigma \left(\boldsymbol{W}_{f} \boldsymbol{h}^{(t-1)}+\boldsymbol{U}_{f} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{f}\right)$

Input gate: controls what parts of the new cell content are written to cell

$\boldsymbol{i}^{(t)}=\sigma \left(\boldsymbol{W}_{i} \boldsymbol{h}^{(t-1)}+\boldsymbol{U}_{i} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{i}\right)$

Output gate: controls what parts of cell are output to hidden state

$\boldsymbol{o}^{(t)}=\sigma \left(\boldsymbol{W}_{o} \boldsymbol{h}^{(t-1)}+\boldsymbol{U}_{o} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{o}\right)$

New cell content: this is the new content to be written to the cell

$\tilde{\boldsymbol{c}}^{(t)}=\tanh \left(\boldsymbol{W}_{c} \boldsymbol{h}^{(t-1)}+\boldsymbol{U}_{c} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{c}\right)$

Cell state: erase (“forget”) some content from last cell state, and write (“input”) some new cell content

$\boldsymbol{c}^{(t)}=\boldsymbol{f}^{(t)} \circ \boldsymbol{c}^{(t-1)}+\boldsymbol{i}^{(t)} \circ \tilde{\boldsymbol{c}}^{(t)}$

Hidden state: read (“output”) some content from the cell

$\boldsymbol{h}^{(t)}=\boldsymbol{o}^{(t)} \circ \tanh \boldsymbol{c}^{(t)}$ LSTM Diagram. Source: 苏剑林

Ref:

## Friday, October 25, 2019

### Models:

B2T2: Bounding Boxes in Text Transformer

### Concepts:

Late fusion: Text and image that are integrated late
Early fusion: The processing of one be conditioned on the analysis of the other

Data:= $(I , B, T, I)$

• $I$ is an image
• $B=[b_1, b_2, ..., b_m ]$ a list of bounding boxes referring to regions of $I$, where each $b_i$ is identified by the lower left corner, height and width
• $T=[t_1, ..., t_n]$ is a passage of tokenized text (some tokens are not natural language but references to $B$)
• $l$ is a numeric class label in $\{ 0, 1\}$

B2T2:=

model the class distribution as:

$$p(l | T, I , B, R)= \frac{e^{\phi (E'(T, I, B, R))*a_l + b_l}} {\sum_{l'} e^{\phi (E'(T, I, B, R))*a_{l'} + b_{l'}} }$$

$$E'(T, I, B, R)= E(T) + \sum_{i=1}^m R_i [M \Phi(crop(I, b_i)) + \pi (b_i)]^T$$

where $M$ is a learnt matrix, $\Phi$ can be a resnet, $\pi(b_i)$ denotes the embedding of $b_i$'s shape and position information in $R^d$

## Monday, October 21, 2019

A kind of boosting using functional gradient descent. It uses weak learner to fit gradient residuals and then add to original learner, thus making an ensemble.

### XGBOOST

$$\text{obj}(\theta) = L(\theta) + \Omega(\theta)$$

where $L$ is the training loss function, and $Ω$ is the regularization term. The training loss measures how predictive our model is with respect to the training data. A common choice of L is the mean squared error, which is given by
$$L(\theta) = \sum_i (y_i-\hat{y}_i)^2$$

$$\text{obj} = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\Omega(f_i)$$

$$\begin{split}\hat{y}_i^{(0)} &= 0\\ \hat{y}_i^{(1)} &= f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i)\\ \hat{y}_i^{(2)} &= f_1(x_i) + f_2(x_i)= \hat{y}_i^{(1)} + f_2(x_i)\\ &\dots\\ \hat{y}_i^{(t)} &= \sum_{k=1}^t f_k(x_i)= \hat{y}_i^{(t-1)} + f_t(x_i)\end{split}$$

For step t:
$$\begin{split}\text{obj}^{(t)} & = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\Omega(f_i) \\ & = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t) + \mathrm{constant}\end{split}$$

For L2 loss:
$$\begin{split}\text{obj}^{(t)} & = \sum_{i=1}^n (y_i - (\hat{y}_i^{(t-1)} + f_t(x_i)))^2 + \sum_{i=1}^t\Omega(f_i) \\ & = \sum_{i=1}^n [2(\hat{y}_i^{(t-1)} - y_i)f_t(x_i) + f_t(x_i)^2] + \Omega(f_t) + \mathrm{constant}\end{split}$$

$$\text{obj}^{(t)} = \sum_{i=1}^n [l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t) + \mathrm{constant}$$

where
$$\begin{split}g_i &= \partial_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})\\ h_i &= \partial_{\hat{y}_i^{(t-1)}}^2 l(y_i, \hat{y}_i^{(t-1)})\end{split}$$

So:
$$\textrm{obt}^{(t)}=\sum_{i=1}^n [g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t)$$

Regularization

Redefine each tree as:
$$f_t(x) = w_{q(x)}, w \in R^T, q:R^d\rightarrow \{1,2,\cdots,T\}$$

Here w is the vector of scores on leaves, q is a function assigning each data point to the corresponding leaf, and T is the number of leaves. In XGBoost, we define the complexity as

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2$$

Reformatting:

$$\begin{split}\text{obj}^{(t)} &\approx \sum_{i=1}^n [g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2\\ &= \sum^T_{j=1} [(\sum_{i\in I_j} g_i) w_j + \frac{1}{2} (\sum_{i\in I_j} h_i + \lambda) w_j^2 ] + \gamma T\end{split}$$

where Ij={i|q(xi)=j} is the set of indices of data points assigned to the j-th leaf. Notice that in the second line we have changed the index of the summation because all the data points on the same leaf get the same score. We could further compress the expression by defining:

$G_j = \sum_{i\in I_j} g_i$

$H_j = \sum_{i\in I_j} h_i$

$$\text{obj}^{(t)} = \sum^T_{j=1} [G_jw_j + \frac{1}{2} (H_j+\lambda) w_j^2] +\gamma T$$

The theoretical best solution is:
$$\begin{split}w_j^\ast &= -\frac{G_j}{H_j+\lambda}\\ \text{obj}^\ast &= -\frac{1}{2} \sum_{j=1}^T \frac{G_j^2}{H_j+\lambda} + \gamma T\end{split}$$

Splitting criterion:
$$Gain = \frac{1}{2} \left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$$

## Thursday, September 26, 2019

### Why attention?

#### Resource saving

Only need sensors where relevant bits are
(e.g. fovea vs. peripheral vision)
• Only compute relevant bits of information
(e.g. fovea has many more ‘pixels’ than periphery)

#### Variable state manipulation

• Manipulate environment (for all grains do: eat)
• Learn modular subroutines (not state)

### Some forms of attention that you might not notice:

•  Pooling
• Iterative Pooling

Ref:

## Monday, September 23, 2019

### Mutual Information

Mutual Information
$$I(X ; Y)=D_{\mathrm{KL}}\left(P_{(X, Y)} \| P_{X} \otimes P_{Y}\right)$$

Intuitively, mutual information measures the information that $X$ and $Y$ share: It measures how much knowing one of these variables reduces uncertainty about the other.

\begin{aligned} \mathrm{I}(X ; Y) & \equiv \mathrm{H}(X)-\mathrm{H}(X | Y) \\ & \equiv \mathrm{H}(Y)-\mathrm{H}(Y | X) \\ & \equiv \mathrm{H}(X)+\mathrm{H}(Y)-\mathrm{H}(X, Y) \\ & \equiv \mathrm{H}(X, Y)-\mathrm{H}(X | Y)-\mathrm{H}(Y | X) \end{aligned}

where $H(Y|X)$ means Conditional Entropy, defined as:

$$\mathrm{H}(Y | X)=-\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)}$$

\begin{aligned} \mathrm{H}(Y | X) & \equiv \sum_{x \in \mathcal{X}} p(x) \mathrm{H}(Y | X=x) \\ &=-\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y | x) \log p(y | x) \\ &=-\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y | x) \\ &=-\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p(x, y) \log p(y | x) \\ &=-\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)} \\ &=\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p(x, y) \log \frac{p(x)}{p(x, y)} \end{aligned}