Models:
B2T2: Bounding Boxes in Text Transformer
Concepts:
Late fusion: Text and image are analyzed independently and integrated only at a late stage
Early fusion: The processing of one modality is conditioned on the analysis of the other
Data:= $(I, B, T, l)$
- $I$ is an image
- $B=[b_1, b_2, \dots, b_m]$ is a list of bounding boxes referring to regions of $I$, where each $b_i$ is identified by its lower-left corner, height, and width
- $T=[t_1, \dots, t_n]$ is a passage of tokenized text (some tokens are not natural language but references to elements of $B$)
- $l$ is a numeric class label in $\{ 0, 1\}$
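
As a rough illustration of the tuple above, one example could be held in a small container like the following. This is only a sketch: the names `BoundingBox`, `BoxRef`, and `Example` are hypothetical, the image is left as an opaque placeholder because the notes do not fix a pixel format, and representing a box-referring token as a `BoxRef` object is an assumption made for readability.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class BoundingBox:
    # Lower-left corner (x, y) plus height and width, as in the definition of b_i.
    x: float
    y: float
    height: float
    width: float


@dataclass(frozen=True)
class BoxRef:
    # Hypothetical marker for a token that points at boxes[index] instead of a word.
    index: int


Token = Union[str, BoxRef]


@dataclass
class Example:
    image: object             # I: placeholder for pixel data
    boxes: List[BoundingBox]  # B = [b_1, ..., b_m]
    tokens: List[Token]       # T = [t_1, ..., t_n]
    label: int                # l in {0, 1}

    def __post_init__(self):
        # Reject labels outside {0, 1} and references to boxes that do not exist.
        if self.label not in (0, 1):
            raise ValueError("label must be 0 or 1")
        for t in self.tokens:
            if isinstance(t, BoxRef) and not 0 <= t.index < len(self.boxes):
                raise ValueError(f"token references missing box {t.index}")


example = Example(
    image=None,
    boxes=[BoundingBox(10, 20, 50, 30), BoundingBox(60, 5, 40, 40)],
    tokens=["why", "is", BoxRef(0), "looking", "at", BoxRef(1), "?"],
    label=1,
)
```

Keeping box references as distinct token objects makes it easy to tell, per token position, which box (if any) it points to.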
B2T2:=
model the class distribution as:
$$p(l \mid T, I, B, R) = \frac{e^{\phi(E'(T, I, B, R)) \cdot a_l + b_l}}
{\sum_{l'} e^{\phi(E'(T, I, B, R)) \cdot a_{l'} + b_{l'}}}
$$
$$
E'(T, I, B, R) = E(T) + \sum_{i=1}^m R_i \left[ M \Phi(\mathrm{crop}(I, b_i)) + \pi(b_i) \right]^T
$$
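where $M$ is a learnt matrix, $\Phi$ is an image feature extractor (for example, a ResNet), $\pi(b_i)$ denotes the embedding of $b_i$'s shape and position information in $\mathbb{R}^d$, and $R_i$ is the $i$-th row of $R$, a binary $m \times n$ matrix marking which tokens of $T$ refer to which boxes of $B$. In the class distribution, $\phi$ maps the fused embeddings $E'$ to a feature vector, and $a_l$, $b_l$ are the learnt weight vector and bias for class $l$.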
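
To make the shapes in the two formulas concrete, here is a minimal dependency-free sketch in plain Python. It rests on several assumptions: $E(T)$ is an $n \times d$ list of token embeddings, $R[i][j] = 1$ exactly when token $j$ refers to box $i$, the cropped-image features $\Phi(\mathrm{crop}(I, b_i))$ and position embeddings $\pi(b_i)$ are already computed and passed in as vectors, and $\phi$ is replaced by mean pooling purely as a placeholder. All function names (`reference_matrix`, `fuse`, `mean_pool`, `class_distribution`) are hypothetical.

```python
import math


def reference_matrix(token_refs, m):
    """Build R (m x n): R[i][j] = 1 iff token j refers to box i (assumed convention)."""
    R = [[0] * len(token_refs) for _ in range(m)]
    for j, ref in enumerate(token_refs):
        if ref is not None:
            R[ref][j] = 1
    return R


def matvec(M, v):
    """Matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def fuse(E, R, box_features, box_positions, M):
    """E' = E + sum_i R_i [M Phi_i + pi_i]^T, adding box vector i to each token that refers to box i."""
    E_prime = [row[:] for row in E]  # copy of the n x d text embeddings
    for i, R_i in enumerate(R):
        # d-dimensional box vector: projected visual features plus position embedding
        v = [p + q for p, q in zip(matvec(M, box_features[i]), box_positions[i])]
        for j, r in enumerate(R_i):
            if r:
                E_prime[j] = [e + x for e, x in zip(E_prime[j], v)]
    return E_prime


def mean_pool(E_prime):
    """Placeholder for phi: average the token embeddings into one d-dimensional vector."""
    n = len(E_prime)
    return [sum(col) / n for col in zip(*E_prime)]


def class_distribution(h, a, b):
    """Softmax over classes with logits h . a_l + b_l."""
    logits = [sum(x * y for x, y in zip(h, a_l)) + b_l for a_l, b_l in zip(a, b)]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy sizes: n = 3 tokens, d = 2, m = 1 box, 3 visual features per box.
E = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
R = reference_matrix([None, 0, None], m=1)  # only the middle token refers to box 0
M = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]      # learnt in practice, fixed here
E_prime = fuse(E, R, box_features=[[1.0, 2.0, 3.0]], box_positions=[[0.1, 0.2]], M=M)
p = class_distribution(mean_pool(E_prime), a=[[0.0, 0.0], [1.0, -1.0]], b=[0.0, 0.0])
```

Only the embedding of the token that refers to a box changes; all other rows of $E'$ equal the corresponding rows of $E(T)$, which is what lets the text encoder see visual information at the exact position where the box is mentioned.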