Abstract
In this paper, we propose a relation-enhanced vision-language pre-training (VLP) method for a transformer model (TM) to improve performance on vision-language (V+L) tasks. Existing VLP studies generate a multimodal representation from individual objects as input and rely on self-attention to learn semantic representations in a brute-force manner. As a result, the relations among objects in an image are largely ignored. To address this problem, we generate a paired visual feature (PVF) that is organized to express the relations between objects. These relations are adjusted by prior knowledge that reflects the co-occurrence of paired objects and by a pair-wise distance matrix, and a triplet is used for sentence embedding. Experimental results demonstrate that the proposed method can be used effectively for VLP by bridging the relations between objects, and thus improves performance on V+L downstream tasks.
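The abstract does not give the exact formulation of the PVF, so the following is only a minimal sketch of one way such a feature could be assembled, not the authors' implementation. It assumes that each ordered pair of objects is represented by concatenating the two object features, and that the pair is scaled by a weight combining a label co-occurrence prior with a proximity term derived from the pair-wise distance between box centers. The function name `paired_visual_features`, its arguments, and the weighting formula `prior / (1 + distance)` are all hypothetical.

```python
import numpy as np

def paired_visual_features(feats, centers, labels, cooc):
    """Build weighted pair features from per-object features (illustrative only).

    feats:   (N, D) per-object visual features
    centers: (N, 2) box centers
    labels:  (N,)   integer class id of each object
    cooc:    (C, C) co-occurrence prior between classes (assumed given)
    """
    n = feats.shape[0]
    # Pair-wise Euclidean distance matrix between box centers, shape (N, N).
    diff = centers[:, None, :] - centers[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Assumption: closer pairs receive a larger weight.
    proximity = 1.0 / (1.0 + dist)
    # Co-occurrence prior looked up for every pair of object labels.
    prior = cooc[labels[:, None], labels[None, :]]
    weight = prior * proximity
    # Enumerate ordered pairs (i, j) with i != j, in row-major order.
    i, j = np.where(~np.eye(n, dtype=bool))
    # Assumption: a pair is the concatenation of its two object features.
    pairs = np.concatenate([feats[i], feats[j]], axis=1)  # (N*(N-1), 2D)
    return pairs * weight[i, j][:, None], np.stack([i, j], axis=1)
```

For N detected objects this yields N(N-1) pair features of width 2D, which could then be fed to a transformer in place of, or alongside, the individual object features. How the paper actually combines the prior and the distance matrix may differ from this sketch.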
| Original language | English |
|---|---|
| Title of host publication | 2022 IEEE International Conference on Image Processing, ICIP 2022 - Proceedings |
| Publisher | IEEE Computer Society |
| Pages | 2286-2290 |
| Number of pages | 5 |
| ISBN (Electronic) | 9781665496209 |
| State | Published - 2022 |
| Event | 29th IEEE International Conference on Image Processing, ICIP 2022 - Bordeaux, France; Duration: 16 Oct 2022 → 19 Oct 2022 |
Publication series
| Name | Proceedings - International Conference on Image Processing, ICIP |
|---|---|
| ISSN (Print) | 1522-4880 |
Conference
| Conference | 29th IEEE International Conference on Image Processing, ICIP 2022 |
|---|---|
| Country/Territory | France |
| City | Bordeaux |
| Period | 16/10/22 → 19/10/22 |
Bibliographical note
Funding Information: This work was partly supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-02068, Artificial Intelligence Innovation Hub), partly supported by the MSIT, Korea, under the ITRC (Information Technology Research Center) support program (IITP-2022-2020-0-01460) supervised by the IITP, and by the NRF grant funded by MSIT (No. NRF-2022R1A2C4002052).
Publisher Copyright:
© 2022 IEEE.
Keywords
- vision-language pre-training