## Friday, October 25, 2019

### Models:

B2T2: Bounding Boxes in Text Transformer

### Concepts:

Late fusion: Text and image that are integrated late
Early fusion: The processing of one be conditioned on the analysis of the other

Data:= $(I , B, T, I)$

• $I$ is an image
• $B=[b_1, b_2, ..., b_m ]$ a list of bounding boxes referring to regions of $I$, where each $b_i$ is identified by the lower left corner, height and width
• $T=[t_1, ..., t_n]$ is a passage of tokenized text (some tokens are not natural language but references to $B$)
• $l$ is a numeric class label in $\{ 0, 1\}$

B2T2:=

model the class distribution as:

$$p(l | T, I , B, R)= \frac{e^{\phi (E'(T, I, B, R))*a_l + b_l}} {\sum_{l'} e^{\phi (E'(T, I, B, R))*a_{l'} + b_{l'}} }$$

$$E'(T, I, B, R)= E(T) + \sum_{i=1}^m R_i [M \Phi(crop(I, b_i)) + \pi (b_i)]^T$$

where $M$ is a learnt matrix, $\Phi$ can be a resnet, $\pi(b_i)$ denotes the embedding of $b_i$'s shape and position information in $R^d$