Tuesday, September 1, 2020

Gated Recurrent Unit

  • Proposed by Cho et al. in 2014 as a simpler alternative to the LSTM
  • No cell state; at each time step $t$, we have an input $x^{(t)}$ and a hidden state $h^{(t)}$
Update Gate: controls what parts of hidden state are updated vs preserved

$\boldsymbol{u}^{(t)}=\sigma\left(\boldsymbol{W}_{u} \boldsymbol{h}^{(t-1)}+\boldsymbol{U}_{u} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{u}\right)$

Reset Gate: controls what parts of previous hidden state are used to compute new content

$\boldsymbol{r}^{(t)}=\sigma\left(\boldsymbol{W}_{r} \boldsymbol{h}^{(t-1)}+\boldsymbol{U}_{r} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{r}\right)$

New hidden state content: the reset gate selects the useful parts of the previous hidden state; these and the current input are used to compute the new hidden content

$\tilde{\boldsymbol{h}}^{(t)}=\tanh \left(\boldsymbol{W}_{h}\left(\boldsymbol{r}^{(t)} \circ \boldsymbol{h}^{(t-1)}\right)+\boldsymbol{U}_{h} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{h}\right)$

Hidden state: update gate simultaneously controls what is kept from previous hidden state, and what is updated to new hidden state content

$\boldsymbol{h}^{(t)}=\left(1-\boldsymbol{u}^{(t)}\right) \circ \boldsymbol{h}^{(t-1)}+\boldsymbol{u}^{(t)} \circ \tilde{\boldsymbol{h}}^{(t)}$
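
A minimal NumPy sketch of one GRU step, directly transcribing the equations above (the parameter layout and names are my own, not from any library):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    # params[k] = (W_k, U_k, b_k) for k in {"u", "r", "h"}; W_k: (n, n), U_k: (n, d), b_k: (n,)
    W_u, U_u, b_u = params["u"]
    W_r, U_r, b_r = params["r"]
    W_h, U_h, b_h = params["h"]
    u = sigmoid(W_u @ h_prev + U_u @ x_t + b_u)               # update gate
    r = sigmoid(W_r @ h_prev + U_r @ x_t + b_r)               # reset gate
    h_tilde = np.tanh(W_h @ (r * h_prev) + U_h @ x_t + b_h)   # candidate content
    return (1.0 - u) * h_prev + u * h_tilde                   # new hidden state h^(t)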

Friday, March 6, 2020

ON-LSTM

Summary:

  • This paper introduces a variant of LSTM that can learn latent sentence structure without supervision, benefiting grammar-related tasks (unsupervised constituency parsing, targeted syntactic evaluation, etc.)
  • It introduces two additional gates, the Master Forget Gate and the Master Input Gate, which control the LSTM's updating mechanism
  • It divides the hidden state into ordered, unequal parts: higher positions (those with larger indices) store history information and are harder to erase, while lower positions hold newer information. The division is not hard (i.e., not made by binary gates) but soft, via cumulative-softmax gates, so that it stays differentiable for backpropagation.

Overall Design Principles



  • Languages are hierarchical and structured.  
The authors want to incorporate this prior into the model: at each time step $t$, some parts of the hidden state should be preserved longer (they correspond to higher levels of the hierarchy, like the S node near the top of the parse tree in the figure, which should of course be kept longer).

To this end, the Master Forget Gate and the Master Input Gate are introduced to control the LSTM's updating mechanism.
We want to define $d^f_t$ and $d_t^i$, which represent the split points of
$(0,\dots,0,1,\dots,1)$ (idealized Master Forget Gate) and
$(1,\dots,1,0,\dots,0)$ (idealized Master Input Gate), respectively.

Consider two different cases:
Case I: $d^f_t \le d_t^i$, i.e., the input level $d_t^i$ is greater than or equal to the forget (history) level $d^f_t$.
0  0  0  1  1  1  1  1 $d^f_t=4$
1  1  1  1  0  0  0  0  $d_t^i=4$
N N N C O O O O
Where N means the position is updated with the new content, C means the standard LSTM update mechanism applies, and O means the original information is kept.

Case II: $d^f_t \gt d_t^i$, i.e., the input level $d_t^i$ is less than the forget (history) level $d^f_t$.
0  0  0  0  1  1  1  1 $d^f_t=5$
1  1  1  0  0  0  0  0 $d_t^i=3$
N N N X O O O O

Where X means that position is set to 0 (erased).

But outputting hard binary gates like these is not differentiable, so the authors use a softmax-based relaxation instead.
Define the operation cumax (cumulative sum of a softmax):

import torch
import torch.nn.functional as F

def cumsoftmax(x, dim=-1):
    return torch.cumsum(F.softmax(x, dim=dim), dim=dim)

Define Master Forget Gate and Master Input Gate as:
$$\begin{array}{l}
\tilde{f}_{t}=\overrightarrow{\mathrm{cumsum}}\left(\operatorname{softmax}\left(W_{\tilde{f}} x_{t}+U_{\tilde{f}} h_{t-1}+b_{\tilde{f}}\right)\right) \\
\tilde{i}_{t}=\overleftarrow{\mathrm{cumsum}}\left(\operatorname{softmax}\left(W_{\tilde{i}} x_{t}+U_{\tilde{i}} h_{t-1}+b_{\tilde{i}}\right)\right) =1-\overrightarrow{\mathrm{cumsum}}\left(\operatorname{softmax}\left(W_{\tilde{i}} x_{t}+U_{\tilde{i}} h_{t-1}+b_{\tilde{i}}\right)\right)
\end{array}$$
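
A quick toy check (not from the paper) of what these soft master gates look like for a random logit vector:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8)
f_master = torch.cumsum(F.softmax(logits, dim=-1), dim=-1)  # rises monotonically towards 1
i_master = 1.0 - f_master                                   # falls monotonically towards 0
print(f_master)  # a soft version of (0, ..., 0, 1, ..., 1)
print(i_master)  # a soft version of (1, ..., 1, 0, ..., 0)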

Comparison to LSTM

Note:
  • The output gate and the new-cell-content computation are unchanged from the LSTM.
  • ON-LSTM introduces a new gate mechanism: the Master Forget Gate and the Master Input Gate.
  • These modify the LSTM's forget and input gates, and therefore the cell-state update.

LSTM:


Forget gate: controls what is kept vs forgotten, from previous cell state

$\boldsymbol{f}^{(t)}=\sigma \left(\boldsymbol{W}_{f} \boldsymbol{h}^{(t-1)}+\boldsymbol{U}_{f} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{f}\right)$

Input gate: controls what parts of the new cell content are written to cell

$\boldsymbol{i}^{(t)}=\sigma \left(\boldsymbol{W}_{i} \boldsymbol{h}^{(t-1)}+\boldsymbol{U}_{i} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{i}\right)$

Output gate: controls what parts of cell are output to hidden state

$\boldsymbol{o}^{(t)}=\sigma \left(\boldsymbol{W}_{o} \boldsymbol{h}^{(t-1)}+\boldsymbol{U}_{o} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{o}\right)$

New cell content: this is the new content to be written to the cell

$\tilde{\boldsymbol{c}}^{(t)}=\tanh \left(\boldsymbol{W}_{c} \boldsymbol{h}^{(t-1)}+\boldsymbol{U}_{c} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{c}\right)$

Cell state: erase (“forget”) some content from last cell state, and write (“input”) some new cell content

$\boldsymbol{c}^{(t)}=\boldsymbol{f}^{(t)} \circ \boldsymbol{c}^{(t-1)}+\boldsymbol{i}^{(t)} \circ \tilde{\boldsymbol{c}}^{(t)}$

Hidden state: read (“output”) some content from the cell

$\boldsymbol{h}^{(t)}=\boldsymbol{o}^{(t)} \circ \tanh \boldsymbol{c}^{(t)}$


LSTM diagram. Source: https://kexue.fm/archives/6621

ON-LSTM

Intersection (an intermediate quantity): the Hadamard product of the Master Forget Gate and the Master Input Gate
$\omega_t= \tilde{f}_{t} \circ \tilde{i}_{t}$

Master Forget gate: Ordered Forget Gate Control Mechanism 

$\tilde{f}_{t}=\overrightarrow{\mathrm{cumsum}}\left(\operatorname{softmax}\left(W_{\tilde{f}} x_{t}+U_{\tilde{f}} h_{t-1}+b_{\tilde{f}}\right)\right) $

Forget gate (combined with the master gates): controls what is kept vs forgotten, from previous cell state

$\hat{f}_{t}=f_{t} \circ \omega_{t}+\left(\tilde{f}_{t}-\omega_{t}\right)=\tilde{f}_{t} \circ\left(f_{t} \circ \tilde{i}_{t}+1-\tilde{i}_{t}\right)$



Master Input gate: Ordered Input Gate Control Mechanism 

$\tilde{i}_{t}=\overleftarrow{\mathrm{cumsum}}\left(\operatorname{softmax}\left(W_{\tilde{i}} x_{t}+U_{\tilde{i}} h_{t-1}+b_{\tilde{i}}\right)\right) =1-
\overrightarrow{\mathrm{cumsum}}\left(\operatorname{softmax}\left(W_{\tilde{i}} x_{t}+U_{\tilde{i}} h_{t-1}+b_{\tilde{i}}\right)\right)$

Input gate (combined with the master gates): controls what parts of the new cell content are written to cell

$\hat{i}_{t}=i_{t} \circ \omega_{t}+\left(\tilde{i}_{t}-\omega_{t}\right)=\tilde{i}_{t} \circ\left(i_{t} \circ \tilde{f}_{t}+1-\tilde{f}_{t}\right)$

Output gate: controls what parts of cell are output to hidden state

$\boldsymbol{o}^{(t)}=\sigma \left(\boldsymbol{W}_{o} \boldsymbol{h}^{(t-1)}+\boldsymbol{U}_{o} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{o}\right)$

New cell content: this is the new content to be written to the cell

$\tilde{\boldsymbol{c}}^{(t)}=\tanh \left(\boldsymbol{W}_{c} \boldsymbol{h}^{(t-1)}+\boldsymbol{U}_{c} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{c}\right)$

Cell state: erase (“forget”) some content from last cell state, and write (“input”) some new cell content

$c_{t}=\hat{f}_{t} \circ c_{t-1}+\hat{i}_{t} \circ \tilde{c}_{t}$

Hidden state: read (“output”) some content from the cell

$\boldsymbol{h}^{(t)}=\boldsymbol{o}^{(t)} \circ \tanh \boldsymbol{c}^{(t)}$

ON-LSTM diagram. Source: https://kexue.fm/archives/6621/comment-page-1
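
Putting the equations above together, here is a minimal single-step ON-LSTM sketch, assuming a dict p of weight matrices and bias vectors (my own naming, not the authors' released code):

import torch
import torch.nn.functional as F

def cumsoftmax(x):
    return torch.cumsum(F.softmax(x, dim=-1), dim=-1)

def on_lstm_step(x_t, h_prev, c_prev, p):
    # Standard LSTM gates and candidate cell content
    f = torch.sigmoid(p["W_f"] @ h_prev + p["U_f"] @ x_t + p["b_f"])
    i = torch.sigmoid(p["W_i"] @ h_prev + p["U_i"] @ x_t + p["b_i"])
    o = torch.sigmoid(p["W_o"] @ h_prev + p["U_o"] @ x_t + p["b_o"])
    c_tilde = torch.tanh(p["W_c"] @ h_prev + p["U_c"] @ x_t + p["b_c"])
    # Ordered master gates via cumax
    f_m = cumsoftmax(p["W_mf"] @ x_t + p["U_mf"] @ h_prev + p["b_mf"])
    i_m = 1.0 - cumsoftmax(p["W_mi"] @ x_t + p["U_mi"] @ h_prev + p["b_mi"])
    omega = f_m * i_m                      # overlap of the two master gates
    f_hat = f * omega + (f_m - omega)      # effective forget gate
    i_hat = i * omega + (i_m - omega)      # effective input gate
    c = f_hat * c_prev + i_hat * c_tilde   # cell state
    h = o * torch.tanh(c)                  # hidden state
    return h, c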

What's Next? 

  • Is "cumax" (cumsum + softmax) really necessary?
From what I have observed, in practice the master forget and input gates are not in the idealized (0,0,0,1,1,1) or (1,1,1,0,0,0) regime. Take one for example:
[0.3182306 , 0.4296763 , 0.5659539 , 0.6581644 , 0.8738386 ,
         0.94163144, 0.9522468 , 0.95804715, 0.960161  , 0.96124136,
         0.96252906, 0.963744  , 0.964411  , 0.96599156, 0.9674326 ,
         0.9684573 , 0.96898276, 0.9697275 , 0.97011393, 0.9703722 ,
         0.9713492 , 0.9736657 , 0.97405773, 0.9745897 , 0.9753741 ,
         0.97613144, 0.9766177 , 0.97877145, 0.98247737, 0.9835741 ,
         0.9863124 , 0.9878812 , 0.9899598 , 0.9946286 , 0.99710023,
         0.99824625, 0.9990951 , 0.9997795 , 0.99986076, 1.0000001 ]
This is far from the idealized gate: there are too many "middle" values, which limits the ordering ability of ON-LSTM. One could instead pick a single split point from the softmax (e.g., by taking the argmax), but that is not differentiable.

  • Is the hidden state in a plain LSTM already ordered?

Friday, December 27, 2019

LSTM (2)

Long Short-Term Memory (LSTM)

On step t, there is a hidden state $h^{(t)}$ and a cell state $c^{(t)}$:
  • Both are vectors of length n 
  • The cell stores long-term information 
  • The LSTM can erase, write or read information from the cell 
Formulas (note: every $\sigma$ below denotes the sigmoid function)
Forget gate: controls what is kept vs forgotten, from previous cell state

$\boldsymbol{f}^{(t)}=\sigma \left(\boldsymbol{W}_{f} \boldsymbol{h}^{(t-1)}+\boldsymbol{U}_{f} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{f}\right)$

Input gate: controls what parts of the new cell content are written to cell

$\boldsymbol{i}^{(t)}=\sigma \left(\boldsymbol{W}_{i} \boldsymbol{h}^{(t-1)}+\boldsymbol{U}_{i} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{i}\right)$

Output gate: controls what parts of cell are output to hidden state

$\boldsymbol{o}^{(t)}=\sigma \left(\boldsymbol{W}_{o} \boldsymbol{h}^{(t-1)}+\boldsymbol{U}_{o} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{o}\right)$

New cell content: this is the new content to be written to the cell

$\tilde{\boldsymbol{c}}^{(t)}=\tanh \left(\boldsymbol{W}_{c} \boldsymbol{h}^{(t-1)}+\boldsymbol{U}_{c} \boldsymbol{x}^{(t)}+\boldsymbol{b}_{c}\right)$

Cell state: erase (“forget”) some content from last cell state, and write (“input”) some new cell content

$\boldsymbol{c}^{(t)}=\boldsymbol{f}^{(t)} \circ \boldsymbol{c}^{(t-1)}+\boldsymbol{i}^{(t)} \circ \tilde{\boldsymbol{c}}^{(t)}$

Hidden state: read (“output”) some content from the cell

$\boldsymbol{h}^{(t)}=\boldsymbol{o}^{(t)} \circ \tanh \boldsymbol{c}^{(t)}$

LSTM Diagram. Source: 苏剑林
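
The equations above match what torch.nn.LSTMCell computes (up to how the biases are grouped); a minimal one-step usage sketch with arbitrary sizes:

import torch
import torch.nn as nn

batch, d_in, n = 4, 10, 20
cell = nn.LSTMCell(input_size=d_in, hidden_size=n)

x_t = torch.randn(batch, d_in)          # input x^(t)
h_prev = torch.zeros(batch, n)          # hidden state h^(t-1)
c_prev = torch.zeros(batch, n)          # cell state c^(t-1)

h_t, c_t = cell(x_t, (h_prev, c_prev))  # one step of the equations above
print(h_t.shape, c_t.shape)             # torch.Size([4, 20]) torch.Size([4, 20])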



Friday, October 25, 2019

Fusion of Detected Objects in Text for Visual Question Answering

Models:

B2T2: Bounding Boxes in Text Transformer 

Concepts:

Late fusion: text and image are processed separately and their representations are integrated late
Early fusion: the processing of one modality is conditioned on the analysis of the other

Data := $(I, B, T, l)$


  • $I$ is an image
  • $B=[b_1, b_2, ..., b_m ]$ a list of bounding boxes referring to regions of $I$, where each $b_i $ is identified by the lower left corner, height and width
  • $T=[t_1, ..., t_n]$ is a passage of tokenized text (some tokens are not natural language but references to $B$)
  • $l$ is a numeric class label in $\{ 0, 1\}$



B2T2:=

model the class distribution as:

$$p(l | T, I , B, R)= \frac{e^{\phi (E'(T, I, B, R)) \cdot a_l + b_l}}
{\sum_{l'}  e^{\phi (E'(T, I, B, R)) \cdot a_{l'} + b_{l'}} }
$$

$$
E'(T, I, B, R)= E(T) + \sum_{i=1}^m R_i [M \Phi(crop(I, b_i)) + \pi (b_i)]^T
$$

where $M$ is a learned matrix, $\Phi$ can be a ResNet, $\pi(b_i)$ denotes the embedding of $b_i$'s shape and position information in $R^d$, and $R$ is a binary matrix recording which tokens refer to which bounding boxes
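
A rough sketch of the early-fusion embedding $E'$ above, assuming $R$ is an $m \times n$ 0/1 matrix marking which tokens refer to which boxes (the names below are placeholders, not the paper's code):

import torch

def fused_embedding(E_T, box_feats, box_pos_emb, R, M):
    # E_T:         (n, d) token embeddings E(T)
    # box_feats:   (m, k) visual features Phi(crop(I, b_i)), e.g. from a ResNet
    # box_pos_emb: (m, d) shape/position embeddings pi(b_i)
    # R:           (m, n) 0/1 float matrix, R[i, j] = 1 if token j refers to box i
    # M:           (d, k) learned projection matrix
    box_vecs = box_feats @ M.T + box_pos_emb   # project boxes into the text space: (m, d)
    return E_T + R.T @ box_vecs                # add each box's vector to its referring tokens: (n, d)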



Monday, October 21, 2019

Gradient Boosting Trees

What is gradient boosting

A form of boosting based on functional gradient descent: at each stage a weak learner is fit to the gradient residuals (the negative gradient of the loss) and then added to the current model, building an ensemble stage by stage.
Source: Wikipedia
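
A minimal sketch of this fit-the-residuals loop for squared loss, using sklearn's DecisionTreeRegressor as the weak learner (the hyperparameter names are my own):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, lr=0.1, max_depth=3):
    pred = np.full_like(y, y.mean(), dtype=float)   # start from a constant prediction
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred                        # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred += lr * tree.predict(X)                # add the weak learner's correction
        trees.append(tree)
    return trees, pred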

XGBOOST



$$\text{obj}(\theta) = L(\theta) + \Omega(\theta)$$

where $L$ is the training loss function, and $Ω$ is the regularization term. The training loss measures how predictive our model is with respect to the training data. A common choice of L is the mean squared error, which is given by
$$
L(\theta) = \sum_i (y_i-\hat{y}_i)^2
$$

For an additive ensemble of $t$ trees $f_1,\dots,f_t$, the objective becomes:
$$
\text{obj} = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\Omega(f_i)
$$


Additive Training:
$$
\begin{split}\hat{y}_i^{(0)} &= 0\\
\hat{y}_i^{(1)} &= f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i)\\
\hat{y}_i^{(2)} &= f_1(x_i) + f_2(x_i)= \hat{y}_i^{(1)} + f_2(x_i)\\
&\dots\\
\hat{y}_i^{(t)} &= \sum_{k=1}^t f_k(x_i)= \hat{y}_i^{(t-1)} + f_t(x_i)\end{split}
$$

For step t:
$$
\begin{split}\text{obj}^{(t)} & = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\Omega(f_i) \\
          & = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t) + \mathrm{constant}\end{split}
$$

For L2 loss:
$$
\begin{split}\text{obj}^{(t)} & = \sum_{i=1}^n (y_i - (\hat{y}_i^{(t-1)} + f_t(x_i)))^2 + \sum_{i=1}^t\Omega(f_i) \\
          & = \sum_{i=1}^n [2(\hat{y}_i^{(t-1)} - y_i)f_t(x_i) + f_t(x_i)^2] + \Omega(f_t) + \mathrm{constant}\end{split}
$$

$$
\text{obj}^{(t)} = \sum_{i=1}^n [l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t) + \mathrm{constant}
$$

where
$$
\begin{split}g_i &= \partial_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})\\
h_i &= \partial_{\hat{y}_i^{(t-1)}}^2 l(y_i, \hat{y}_i^{(t-1)})\end{split}
$$

After dropping the constants, the objective at step $t$ becomes:
$$
\text{obj}^{(t)}=\sum_{i=1}^n [g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t)
$$

Regularization 



Redefine each tree as:
$$
f_t(x) = w_{q(x)}, w \in R^T, q:R^d\rightarrow \{1,2,\cdots,T\}
$$

Here w is the vector of scores on leaves, q is a function assigning each data point to the corresponding leaf, and T is the number of leaves. In XGBoost, we define the complexity as

$$
\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2
$$

Rewriting the objective in terms of the leaves:

$$
\begin{split}\text{obj}^{(t)} &\approx \sum_{i=1}^n [g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2\\
&= \sum^T_{j=1} [(\sum_{i\in I_j} g_i) w_j + \frac{1}{2} (\sum_{i\in I_j} h_i + \lambda) w_j^2 ] + \gamma T\end{split}
$$

where $I_j=\{i \mid q(x_i)=j\}$ is the set of indices of data points assigned to the $j$-th leaf. Notice that in the second line we have changed the index of the summation because all the data points on the same leaf get the same score. We could further compress the expression by defining:

$G_j = \sum_{i\in I_j} g_i$

$H_j = \sum_{i\in I_j} h_i$

$$
\text{obj}^{(t)} = \sum^T_{j=1} [G_jw_j + \frac{1}{2} (H_j+\lambda) w_j^2] +\gamma T
$$


The optimal leaf weights and the corresponding objective value are:
$$
\begin{split}w_j^\ast &= -\frac{G_j}{H_j+\lambda}\\
\text{obj}^\ast &= -\frac{1}{2} \sum_{j=1}^T \frac{G_j^2}{H_j+\lambda} + \gamma T\end{split}
$$


Splitting criterion:
$$
Gain = \frac{1}{2} \left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma
$$
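
Given per-example gradients $g_i$ and Hessians $h_i$, the optimal leaf weight and the split gain above reduce to a few lines (a sketch; the toy numbers use the squared loss, where $g_i = 2(\hat{y}_i - y_i)$ and $h_i = 2$):

import numpy as np

def leaf_weight(g, h, lam):
    return -g.sum() / (h.sum() + lam)               # w_j* = -G_j / (H_j + lambda)

def split_gain(g, h, left_mask, lam, gamma):
    gl, hl = g[left_mask].sum(), h[left_mask].sum()
    gr, hr = g[~left_mask].sum(), h[~left_mask].sum()
    return 0.5 * (gl**2 / (hl + lam) + gr**2 / (hr + lam)
                  - (gl + gr)**2 / (hl + hr + lam)) - gamma

y = np.array([1.0, 0.0, 1.0, 1.0]); y_hat = np.zeros(4)
g, h = 2 * (y_hat - y), 2 * np.ones(4)
print(leaf_weight(g, h, lam=1.0))
print(split_gain(g, h, left_mask=np.array([True, True, False, False]), lam=1.0, gamma=0.0))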

Thursday, September 26, 2019

Attention Mechanism

Why attention? 

Resource saving

  • Only need sensors where relevant bits are (e.g. fovea vs. peripheral vision)
  • Only compute relevant bits of information (e.g. fovea has many more ‘pixels’ than periphery)

Variable state manipulation

  • Manipulate environment (for all grains do: eat)
  • Learn modular subroutines (not state)

Some forms of attention that you might not notice:

  • Watson-Nadaraya estimator (kernel regression; see the sketch after this list)
  • Pooling
  • Iterative pooling
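
For instance, the Watson-Nadaraya estimator is attention with a Gaussian kernel: the prediction at a query point is a softmax-weighted average of the training targets (a minimal sketch):

import numpy as np

def watson_nadaraya(x_query, x_train, y_train, bandwidth=1.0):
    scores = -0.5 * ((x_query - x_train) / bandwidth) ** 2   # similarity of query to each "key"
    weights = np.exp(scores) / np.exp(scores).sum()          # softmax over keys
    return weights @ y_train                                 # weighted average of "values"

x_train = np.linspace(0, 5, 50)
y_train = np.sin(x_train) + 0.1 * np.random.randn(50)
print(watson_nadaraya(2.5, x_train, y_train))                # roughly sin(2.5)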


Ref:

  1. Alex Smola and Aston Zhang, Attention in Deep Learning, ICML 2019 Tutorial (slides)

Monday, September 23, 2019

Mutual Information

Mutual Information 
$$
I(X ; Y)=D_{\mathrm{KL}}\left(P_{(X, Y)} \| P_{X} \otimes P_{Y}\right)
$$


Intuitively, mutual information measures the information that $X$ and $Y$ share: It measures how much knowing one of these variables reduces uncertainty about the other.

$$
\begin{aligned} \mathrm{I}(X ; Y) & \equiv \mathrm{H}(X)-\mathrm{H}(X | Y) \\ & \equiv \mathrm{H}(Y)-\mathrm{H}(Y | X) \\ & \equiv \mathrm{H}(X)+\mathrm{H}(Y)-\mathrm{H}(X, Y) \\ & \equiv \mathrm{H}(X, Y)-\mathrm{H}(X | Y)-\mathrm{H}(Y | X) \end{aligned}
$$

where $H(Y|X)$ is the conditional entropy, defined as:


$$
\mathrm{H}(Y | X)=-\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)}
$$

$$
\begin{aligned} \mathrm{H}(Y | X) & \equiv \sum_{x \in \mathcal{X}} p(x) \mathrm{H}(Y | X=x) \\ &=-\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y | x) \log p(y | x) \\ &=-\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y | x) \\ &=-\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p(x, y) \log p(y | x) \\ &=-\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)} \\ &=\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p(x, y) \log \frac{p(x)}{p(x, y)} \end{aligned}
$$
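
A small sketch computing $I(X;Y)$ from a joint probability table, using the identity $I(X;Y) = H(X) + H(Y) - H(X,Y)$ from above:

import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def mutual_information(p_xy):
    # p_xy[i, j] = P(X = i, Y = j); entries sum to 1
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    return entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())

# Two perfectly correlated fair binary variables share exactly 1 bit:
p_xy = np.array([[0.5, 0.0],
                 [0.0, 0.5]])
print(mutual_information(p_xy))   # 1.0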