TY - GEN
T1 - RELATION ENHANCED VISION LANGUAGE PRE-TRAINING
AU - Lee, Ju Hee
AU - Kang, Je Won
N1 - Funding Information:
This work was partly supported by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-02068, Artificial Intelligence Innovation Hub), partly supported by the MSIT, Korea, under the ITRC (Information Technology Research Center) support program (IITP-2022-2020-0-01460) supervised by the IITP, and partly by an NRF grant funded by MSIT (No. NRF-2022R1A2C4002052).
Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - In this paper, we propose a relation-enhanced vision-language pre-training (VLP) method for a transformer model (TM) to improve performance on vision-language (V+L) tasks. Current VLP studies attempt to generate a multimodal representation with individual objects as input and rely on self-attention to learn semantic representations in a brute-force manner; however, the relations among objects in an image are largely ignored. To address this problem, we generate a paired visual feature (PVF) that is organized to express the relations between objects. Prior knowledge reflecting the co-occurrence of paired objects and a pair-wise distance matrix adjust these relations, and a triplet is used for sentence embedding. Experimental results demonstrate that the proposed method effectively enhances VLP by bridging relations between objects and thus improves performance on V+L downstream tasks.
KW - vision-language pre-training
UR - http://www.scopus.com/inward/record.url?scp=85146637362&partnerID=8YFLogxK
DO - 10.1109/ICIP46576.2022.9897623
M3 - Conference contribution
AN - SCOPUS:85146637362
T3 - Proceedings - International Conference on Image Processing, ICIP
SP - 2286
EP - 2290
BT - 2022 IEEE International Conference on Image Processing, ICIP 2022 - Proceedings
PB - IEEE Computer Society
T2 - 29th IEEE International Conference on Image Processing, ICIP 2022
Y2 - 16 October 2022 through 19 October 2022
ER -