SRTube: Video-Language Pre-Training with Action-Centric Video Tube Features and Semantic Role Labeling

Ju Hee Lee, Je Won Kang

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

In recent years, large-scale video-language pre-training (VidLP) has received considerable attention for its effectiveness in relevant tasks. In this paper, we propose a novel action-centric VidLP framework that employs video tube features for temporal modeling and language features based on semantic role labeling (SRL). Our video encoder generates multiple tube features along object trajectories, identifying action-related regions within videos, to overcome the limitations of existing temporal attention mechanisms. Additionally, our text encoder incorporates high-level, action-related language knowledge, previously under-utilized in current VidLP models. The SRL captures action verbs and related semantics among objects in sentences and enhances the ability to perform instance-level text matching, thus enriching the cross-modal (CM) alignment process. We also introduce two novel pre-training objectives and a self-supervision strategy to produce a more faithful CM representation. Experimental results demonstrate that our method outperforms existing VidLP frameworks in various downstream tasks and datasets, establishing our model as a baseline in the modern VidLP framework.

Original language: English
Title of host publication: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Publisher: IEEE Computer Society
Pages: 13689-13699
Number of pages: 11
ISBN (Electronic): 9798350353006
ISBN (Print): 9798350353006
DOIs
State: Published - 2024
Event: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 - Seattle, United States
Duration: 16 Jun 2024 to 22 Jun 2024

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
ISSN (Print): 1063-6919

Conference

Conference: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Country/Territory: United States
City: Seattle
Period: 16/06/24 to 22/06/24

Bibliographical note

Publisher Copyright:
© 2024 IEEE.

Keywords

  • video-language pre-training
