Friday, October 25, 2019

Fusion of Detected Objects in Text for Visual Question Answering


B2T2: Bounding Boxes in Text Transformer 


Late fusion: text and image are processed separately and integrated only at a late stage
Early fusion: the processing of one is conditioned on the analysis of the other

Data := $(I, B, T, l)$

  • $I$ is an image
  • $B=[b_1, b_2, ..., b_m]$ is a list of bounding boxes referring to regions of $I$, where each $b_i$ is identified by its lower-left corner, height and width
  • $T=[t_1, ..., t_n]$ is a passage of tokenized text (some tokens are not natural language but references to $B$)
  • $l$ is a numeric class label in $\{ 0, 1\}$
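
As a rough illustration of the four components listed above, one example could be stored as a small record. This is only a sketch: the names `Box` and `Example`, the field names, and the `[box0]` reference token are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class Box:
    """One bounding box b_i: lower-left corner plus height and width."""
    x: float       # lower-left corner, horizontal coordinate
    y: float       # lower-left corner, vertical coordinate
    height: float
    width: float


@dataclass
class Example:
    """One data point (I, B, T, l)."""
    image: Any          # I: left opaque here (array, file path, ...)
    boxes: List[Box]    # B = [b_1, ..., b_m]
    tokens: List[str]   # T = [t_1, ..., t_n]; some tokens refer to boxes
    label: int          # l in {0, 1}

    def __post_init__(self):
        # The notes restrict the label to the two classes 0 and 1.
        if self.label not in (0, 1):
            raise ValueError("label must be 0 or 1")


# "[box0]" is a made-up convention for a token that refers to boxes[0].
example = Example(
    image=None,
    boxes=[Box(x=10, y=20, height=50, width=30)],
    tokens=["what", "is", "[box0]", "holding", "?"],
    label=1,
)
```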


The class distribution is modeled as:

$$p(l \mid T, I, B, R) = \frac{e^{\phi(E'(T, I, B, R)) \cdot a_l + b_l}}{\sum_{l'} e^{\phi(E'(T, I, B, R)) \cdot a_{l'} + b_{l'}}}$$

$$E'(T, I, B, R) = E(T) + \sum_{i=1}^m R_i \left[ M \Phi(\mathrm{crop}(I, b_i)) + \pi(b_i) \right]^T$$

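where $M$ is a learned matrix, $\Phi$ is an image feature extractor such as a ResNet, and $\pi(b_i)$ denotes the embedding of $b_i$'s shape and position information in $\mathbb{R}^d$. $R \in \{0,1\}^{m \times n}$ records which tokens refer to which bounding boxes ($R_{ij} = 1$ iff token $t_j$ refers to box $b_i$), and $R_i$ is its $i$-th row.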
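
The two formulas above can be traced on toy numbers with a minimal pure-Python sketch. Several things here are assumptions rather than the paper's implementation: $R$ is treated as an $m \times n$ 0/1 matrix whose row $R_i$ marks the tokens that refer to $b_i$, `box_feats[i]` stands in for $\Phi(\mathrm{crop}(I, b_i))$, `box_pos[i]` stands in for $\pi(b_i)$, and mean pooling (`mean_pool`) is a placeholder for $\phi$. All function names are hypothetical.

```python
import math


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def fused_embeddings(E_T, R, box_feats, box_pos, M):
    """E' = E(T) + sum_i R_i [M Phi(crop(I, b_i)) + pi(b_i)]^T.

    E_T is n x d, R is m x n (0/1), M is d x k,
    box_feats[i] has length k and box_pos[i] has length d.
    """
    out = [row[:] for row in E_T]
    for R_i, feat, pos in zip(R, box_feats, box_pos):
        # d-dimensional vector for box i: projected features plus position.
        v = [p + q for p, q in zip(matvec(M, feat), pos)]
        # Outer product R_i v^T: add v to every token that refers to box i.
        for j, r_ij in enumerate(R_i):
            if r_ij:
                out[j] = [e + x for e, x in zip(out[j], v)]
    return out


def mean_pool(rows):
    """Placeholder for phi: average the token vectors (an assumption)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def class_probs(h, a, b):
    """Softmax over labels: p(l) proportional to exp(h . a_l + b_l)."""
    logits = [sum(x * y for x, y in zip(h, a_l)) + b_l
              for a_l, b_l in zip(a, b)]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy setup: n = 3 tokens, d = 2, m = 1 box, k = 3 image features.
E_T = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
R = [[0, 0, 1]]                      # only the third token refers to the box
box_feats = [[2.0, 1.0, 0.0]]        # stand-in for Phi(crop(I, b_1))
box_pos = [[0.5, 0.5]]               # stand-in for pi(b_1)
M = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = [[1.0, 0.0], [0.0, 1.0]]         # one weight vector a_l per label
b = [0.0, 0.0]                       # one bias b_l per label

E_prime = fused_embeddings(E_T, R, box_feats, box_pos, M)
probs = class_probs(mean_pool(E_prime), a, b)
```

Only the embedding of the token that refers to the box changes; the other rows of $E(T)$ pass through untouched, which is what the sum over $R_i$ expresses.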
