Abstract
In recent years, large-scale video-language pre-training (VidLP) has received considerable attention for its effectiveness in relevant tasks. In this paper, we propose a novel action-centric VidLP framework that employs video tube features for temporal modeling and language features based on semantic role labeling (SRL). Our video encoder generates multiple tube features along object trajectories, identifying action-related regions within videos, to overcome the limitations of existing temporal attention mechanisms. Additionally, our text encoder incorporates high-level, action-related language knowledge that is under-utilized in current VidLP models. SRL captures action verbs and related semantics among objects in sentences and enhances the ability to perform instance-level text matching, thus enriching the cross-modal (CM) alignment process. We also introduce two novel pre-training objectives and a self-supervision strategy to produce a more faithful CM representation. Experimental results demonstrate that our method outperforms existing VidLP frameworks in various downstream tasks and datasets, establishing our model as a baseline in the modern VidLP framework.
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 |
| Publisher | IEEE Computer Society |
| Pages | 13689-13699 |
| Number of pages | 11 |
| ISBN (Electronic) | 9798350353006 |
| ISBN (Print) | 9798350353006 |
| State | Published - 2024 |
| Event | 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 - Seattle, United States. Duration: 16 Jun 2024 → 22 Jun 2024 |
Publication series
| Name | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition |
|---|---|
| ISSN (Print) | 1063-6919 |
Conference
| Conference | 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 |
|---|---|
| Country/Territory | United States |
| City | Seattle |
| Period | 16/06/24 → 22/06/24 |
Bibliographical note
Publisher Copyright: © 2024 IEEE.
Keywords
- video-language pre-training