Abstract
In this paper, we propose a relation-enhanced vision-language pre-training (VLP) method for a transformer model (TM) to improve performance on vision-language (V+L) tasks. Current VLP studies generate a multimodal representation from individual objects as input and rely on self-attention to learn semantic representations in a brute-force manner. However, the relations among objects in an image are largely ignored. To address this problem, we generate a paired visual feature (PVF) that is organized to express the relations between objects. Prior knowledge that reflects co-occurrences of paired objects and a pair-wise distance matrix adjusts the relations, and a triplet is used for sentence embedding. Experimental results demonstrate that the proposed method serves VLP efficiently by bridging relations between objects, and thus improves performance on V+L downstream tasks.
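As a rough illustration of the idea of pairing objects and weighting the pairs by a co-occurrence prior and a pair-wise distance matrix, the following is a minimal sketch and not the authors' implementation. The function name `paired_visual_features`, the exponential distance weighting, the product used to combine prior and proximity, the feature concatenation, and the `top_k` selection are all assumptions introduced here for illustration; the abstract does not specify any of them.

```python
import numpy as np


def paired_visual_features(feats, centers, labels, cooc, top_k=4):
    """Hypothetical sketch: select object pairs and build paired features.

    feats   : (n, d) region features, one row per detected object
    centers : (n, 2) box centers in image coordinates
    labels  : (n,)   integer class ids of the objects
    cooc    : (k, k) symmetric class co-occurrence counts (prior knowledge)
    Requires n >= 2. Returns (pairs, pair_feats, scores).
    """
    n = feats.shape[0]

    # Pair-wise Euclidean distance between box centers, shape (n, n).
    diff = centers[:, None, :] - centers[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Assumed weighting: closer objects get a larger weight. The scale is
    # the mean distance over all distinct object pairs.
    off_diag = ~np.eye(n, dtype=bool)
    scale = dist[off_diag].mean() + 1e-8
    proximity = np.exp(-dist / scale)

    # Co-occurrence prior looked up by class label, scaled to [0, 1].
    prior = cooc[labels[:, None], labels[None, :]] / (cooc.max() + 1e-8)

    # Assumed combination: the prior is adjusted by spatial proximity.
    score = prior * proximity

    # Rank unordered pairs (i < j) by score and keep the best top_k.
    iu, ju = np.triu_indices(n, k=1)
    order = np.argsort(-score[iu, ju])[:top_k]
    pairs = np.stack([iu[order], ju[order]], axis=1)

    # A paired feature is the concatenation of the two object features.
    pair_feats = np.concatenate(
        [feats[pairs[:, 0]], feats[pairs[:, 1]]], axis=1
    )
    return pairs, pair_feats, score[iu, ju][order]
```

In this sketch the returned `pair_feats` rows would stand in for paired inputs to a transformer, in place of individual object features; how the actual method feeds pairs to the model is not described in the abstract.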
Original language | English |
---|---|
Title of host publication | 2022 IEEE International Conference on Image Processing, ICIP 2022 - Proceedings |
Publisher | IEEE Computer Society |
Pages | 2286-2290 |
Number of pages | 5 |
ISBN (Electronic) | 9781665496209 |
State | Published - 2022 |
Event | 29th IEEE International Conference on Image Processing, ICIP 2022 - Bordeaux, France. Duration: 16 Oct 2022 → 19 Oct 2022 |
Publication series
Name | Proceedings - International Conference on Image Processing, ICIP |
---|---|
ISSN (Print) | 1522-4880 |
Conference
Conference | 29th IEEE International Conference on Image Processing, ICIP 2022 |
---|---|
Country/Territory | France |
City | Bordeaux |
Period | 16/10/22 → 19/10/22 |
Bibliographical note
Funding Information: This work was partly supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-02068, Artificial Intelligence Innovation Hub), partly supported by the MSIT, Korea, under the ITRC (Information Technology Research Center) support program (IITP-2022-2020-0-01460) supervised by the IITP, and by the NRF grant funded by MSIT (No. NRF-2022R1A2C4002052).
Publisher Copyright: © 2022 IEEE.
Keywords
- vision-language pre-training