Ultimate Intelligence: October 2019

Models:

B2T2: Bounding Boxes in Text Transformer

Concepts:

Late fusion: Text and image that are integrated late
Early fusion: The processing of one be conditioned on the analysis of the other

Data:= $(I , B, T, I)$

$I$ is an image
$B=[b_1, b_2, ..., b_m ]$ a list of bounding boxes referring to regions of $I$, where each $b_i $ is identified by the lower left corner, height and width
$T=[t_1, ..., t_n]$ is a passage of tokenized text (some tokens are not natural language but references to $B$)
$l$ is a numeric class label in $\{ 0, 1\}$

B2T2:=

model the class distribution as:

$$p(l | T, I , B, R)= \frac{e^{\phi (E'(T, I, B, R))*a_l + b_l}}
{\sum_{l'} e^{\phi (E'(T, I, B, R))*a_{l'} + b_{l'}} }
$$

$$
E'(T, I, B, R)= E(T) + \sum_{i=1}^m R_i [M \Phi(crop(I, b_i)) + \pi (b_i)]^T
$$

where $M$ is a learnt matrix, $\Phi$ can be a resnet, $\pi(b_i)$ denotes the embedding of $b_i$'s shape and position information in $R^d$

What is gradient boosting

A kind of boosting using functional gradient descent. It uses weak learner to fit gradient residuals and then add to original learner, thus making an ensemble.

Source: Wikipedia

XGBOOST

$$\text{obj}(\theta) = L(\theta) + \Omega(\theta)$$

where $L$ is the training loss function, and $Ω$ is the regularization term. The training loss measures how predictive our model is with respect to the training data. A common choice of L is the mean squared error, which is given by
$$
L(\theta) = \sum_i (y_i-\hat{y}_i)^2
$$

$$
\text{obj} = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\Omega(f_i)
$$

Additive Training:
$$
\begin{split}\hat{y}_i^{(0)} &= 0\\
\hat{y}_i^{(1)} &= f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i)\\
\hat{y}_i^{(2)} &= f_1(x_i) + f_2(x_i)= \hat{y}_i^{(1)} + f_2(x_i)\\
&\dots\\
\hat{y}_i^{(t)} &= \sum_{k=1}^t f_k(x_i)= \hat{y}_i^{(t-1)} + f_t(x_i)\end{split}
$$

For step t:
$$
\begin{split}\text{obj}^{(t)} & = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\Omega(f_i) \\
& = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t) + \mathrm{constant}\end{split}
$$

For L2 loss:
$$
\begin{split}\text{obj}^{(t)} & = \sum_{i=1}^n (y_i - (\hat{y}_i^{(t-1)} + f_t(x_i)))^2 + \sum_{i=1}^t\Omega(f_i) \\
& = \sum_{i=1}^n [2(\hat{y}_i^{(t-1)} - y_i)f_t(x_i) + f_t(x_i)^2] + \Omega(f_t) + \mathrm{constant}\end{split}
$$

$$
\text{obj}^{(t)} = \sum_{i=1}^n [l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t) + \mathrm{constant}
$$

where
$$
\begin{split}g_i &= \partial_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})\\
h_i &= \partial_{\hat{y}_i^{(t-1)}}^2 l(y_i, \hat{y}_i^{(t-1)})\end{split}
$$

So:
$$
\textrm{obt}^{(t)}=\sum_{i=1}^n [g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t)
$$

Regularization

Redefine each tree as:
$$
f_t(x) = w_{q(x)}, w \in R^T, q:R^d\rightarrow \{1,2,\cdots,T\}
$$

Here w is the vector of scores on leaves, q is a function assigning each data point to the corresponding leaf, and T is the number of leaves. In XGBoost, we define the complexity as

$$
\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2
$$

Reformatting:

$$
\begin{split}\text{obj}^{(t)} &\approx \sum_{i=1}^n [g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2\\
&= \sum^T_{j=1} [(\sum_{i\in I_j} g_i) w_j + \frac{1}{2} (\sum_{i\in I_j} h_i + \lambda) w_j^2 ] + \gamma T\end{split}
$$

where Ij={i|q(xi)=j} is the set of indices of data points assigned to the j-th leaf. Notice that in the second line we have changed the index of the summation because all the data points on the same leaf get the same score. We could further compress the expression by defining:

$G_j = \sum_{i\in I_j} g_i$

$H_j = \sum_{i\in I_j} h_i$

$$
\text{obj}^{(t)} = \sum^T_{j=1} [G_jw_j + \frac{1}{2} (H_j+\lambda) w_j^2] +\gamma T
$$

The theoretical best solution is:
$$
\begin{split}w_j^\ast &= -\frac{G_j}{H_j+\lambda}\\
\text{obj}^\ast &= -\frac{1}{2} \sum_{j=1}^T \frac{G_j^2}{H_j+\lambda} + \gamma T\end{split}
$$

Splitting criterion:
$$
Gain = \frac{1}{2} \left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma
$$

Ref:

Ultimate Intelligence

Friday, October 25, 2019

Fusion of Detected Objects in Text for Visual Question Answering

Models:

Concepts:

Monday, October 21, 2019

Gradient Boosting Trees

What is gradient boosting

XGBOOST

Blog Archive