Abstract
Given an untrimmed video and a language query describing a specific temporal moment in that video, video grounding aims to localize the corresponding time interval by jointly understanding the text and the video. One of the main challenges is that collecting annotations, namely natural-language video captions and their corresponding temporal regions, is extremely time-consuming and costly. In this paper, we present a simple yet novel training framework for video grounding in the zero-shot setting, which learns a network from video data alone, without any annotation. Inspired by the recent language-free paradigm, i.e., training without language data, we train the network without requiring pseudo text queries to be generated in natural-language form. Specifically, we propose a method that learns a video grounding model by selecting a temporal interval as a hypothetical correct answer and treating the visual feature that our method selects within that interval as a language feature, relying on the well-aligned visual-language space of CLIP. Extensive experiments demonstrate the effectiveness of our language-free training framework, which outperforms the existing zero-shot video grounding method and even several weakly-supervised approaches by large margins on two standard datasets.
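The pseudo-supervision idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `make_pseudo_pair`, the uniform interval sampling, and the rule for picking one frame (the frame closest to the segment mean) are all assumptions made for illustration, and random vectors stand in for CLIP image features.

```python
import numpy as np

def make_pseudo_pair(frame_feats, rng, min_len=2):
    """Build one (pseudo query feature, pseudo interval) training pair.

    frame_feats: array of shape (num_frames, dim), assumed to hold
    per-frame visual features (stand-ins for CLIP image embeddings).
    """
    num_frames = frame_feats.shape[0]

    # Sample a temporal interval [start, end) to act as the hypothetical answer.
    # Uniform sampling is an assumption, not taken from the paper.
    length = int(rng.integers(min_len, num_frames + 1))
    start = int(rng.integers(0, num_frames - length + 1))
    end = start + length

    # L2-normalize the frame features inside the interval.
    segment = frame_feats[start:end]
    segment = segment / np.linalg.norm(segment, axis=1, keepdims=True)

    # Hypothetical selection rule: keep the frame most similar to the segment mean.
    mean = segment.mean(axis=0)
    best = int(np.argmax(segment @ mean))

    # The selected visual feature plays the role of the language feature.
    pseudo_query = segment[best]

    # Interval expressed as fractions of the video length, in [0, 1].
    target = (start / num_frames, end / num_frames)
    return pseudo_query, target

rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(32, 512))  # placeholder for real frame features
pseudo_query, target = make_pseudo_pair(frame_feats, rng)
```

A grounding network would then be trained to predict `target` from `frame_feats` and `pseudo_query`; at test time, a text feature from the same shared embedding space would replace `pseudo_query`.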
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2023 IEEE Winter Conference on Applications of Computer Vision, WACV 2023 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 2538-2547 |
| Number of pages | 10 |
| ISBN (Electronic) | 9781665493468 |
| State | Published - 2023 |
| Event | 23rd IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2023 - Waikoloa, United States; Duration: 3 Jan 2023 → 7 Jan 2023 |
Publication series
| Name | Proceedings - 2023 IEEE Winter Conference on Applications of Computer Vision, WACV 2023 |
|---|---|
Conference
| Conference | 23rd IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2023 |
|---|---|
| Country/Territory | United States |
| City | Waikoloa |
| Period | 3 Jan 2023 → 7 Jan 2023 |
Bibliographical note
Publisher Copyright: © 2023 IEEE.
Keywords
- Algorithms: Vision + language and/or other modalities
- Machine learning architectures, formulations, and algorithms (including transfer, low-shot, semi-, self-, and un-supervised learning)