Ju Hee Lee, Je Won Kang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation


In this paper, we propose a relation-enhanced vision-language pre-training (VLP) method for a transformer model (TM) to improve performance on vision-language (V+L) tasks. Existing VLP studies generate a multimodal representation from individual objects as input and rely on self-attention to learn semantic representations in a brute-force manner, so the relations among objects in an image are largely ignored. To address this problem, we generate a paired visual feature (PVF) that is organized to express the relations between objects. Prior knowledge reflecting the co-occurrences of paired objects, together with a pair-wise distance matrix, adjusts the relations, and a triplet is used for sentence embedding. Experimental results demonstrate that the proposed method can be used efficiently for VLP by bridging relations between objects, and thus improves performance on V+L downstream tasks.
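The abstract does not give the construction of the PVF, so the following is only a rough sketch of the general idea of pairing objects and ranking the pairs with a co-occurrence prior and a pair-wise distance matrix. The function names, the scoring rule `prior / (1 + distance)`, the `top_k` cutoff, and the toy inputs are all assumptions made for illustration, not the authors' formulation.

```python
import math
from itertools import combinations


def pairwise_distances(centers):
    """Euclidean distance between every pair of box centers."""
    n = len(centers)
    return [[math.dist(centers[i], centers[j]) for j in range(n)] for i in range(n)]


def paired_visual_features(features, centers, labels, cooccurrence, top_k=2):
    """Rank object pairs by a hypothetical relation score and concatenate
    the two objects' features for the highest-scoring pairs."""
    dist = pairwise_distances(centers)
    scored = []
    for i, j in combinations(range(len(features)), 2):
        # Prior knowledge: how often the two labels appear together (assumed lookup table).
        prior = cooccurrence.get(frozenset((labels[i], labels[j])), 0.0)
        # Assumed scoring rule: frequent co-occurrence and spatial closeness raise the score.
        score = prior / (1.0 + dist[i][j])
        scored.append((score, i, j))
    scored.sort(key=lambda item: item[0], reverse=True)
    # A paired feature is sketched here as a plain concatenation of the two object features.
    return [(labels[i], labels[j], features[i] + features[j]) for _, i, j in scored[:top_k]]


# Toy inputs (made up): three detected objects with 2-D features and box centers.
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
centers = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
labels = ["person", "horse", "hat"]
cooccurrence = {
    frozenset(("person", "horse")): 0.6,
    frozenset(("person", "hat")): 0.8,
}

pairs = paired_visual_features(features, centers, labels, cooccurrence, top_k=2)
for first, second, feature in pairs:
    print(first, second, feature)
```

In this toy setup the pair with both a high prior and a small distance is ranked first, which is the intuition behind letting the prior and the distance matrix adjust which relations are emphasized. A real system would presumably operate on detector region features and learn the weighting rather than fix it by hand.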

Original language: English
Title of host publication: 2022 IEEE International Conference on Image Processing, ICIP 2022 - Proceedings
Publisher: IEEE Computer Society
Number of pages: 5
ISBN (Electronic): 9781665496209
State: Published - 2022
Event: 29th IEEE International Conference on Image Processing, ICIP 2022 - Bordeaux, France
Duration: 16 Oct 2022 to 19 Oct 2022

Publication series

Name: Proceedings - International Conference on Image Processing, ICIP
ISSN (Print): 1522-4880


Conference: 29th IEEE International Conference on Image Processing, ICIP 2022

Bibliographical note

Funding Information:
This work was partly supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-02068, Artificial Intelligence Innovation Hub), partly supported by the MSIT, Korea, under the ITRC (Information Technology Research Center) support program (IITP-2022-2020-0-01460) supervised by the IITP, and by the NRF grant funded by MSIT (No. NRF-2022R1A2C4002052).

Publisher Copyright:
© 2022 IEEE.


  • vision-language pre-training


Dive into the research topics of 'RELATION ENHANCED VISION LANGUAGE PRE-TRAINING'. Together they form a unique fingerprint.
