Language-free Training for Zero-shot Video Grounding

Dahye Kim, Jungin Park, Jiyoung Lee, Seongheon Park, Kwanghoon Sohn

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

21 Scopus citations

Abstract

Given an untrimmed video and a language query depicting a specific temporal moment in the video, video grounding aims to localize the time interval by understanding the text and video simultaneously. One of the most challenging issues is an extremely time- and cost-consuming annotation collection, including video captions in a natural language form and their corresponding temporal regions. In this paper, we present a simple yet novel training framework for video grounding in the zero-shot setting, which learns a network with only video data without any annotation. Inspired by the recent language-free paradigm, i.e. training without language data, we train the network without compelling the generation of fake (pseudo) text queries into a natural language form. Specifically, we propose a method for learning a video grounding model by selecting a temporal interval as a hypothetical correct answer and considering the visual feature selected by our method in the interval as a language feature, with the help of the well-aligned visual-language space of CLIP. Extensive experiments demonstrate the prominence of our language-free training framework, outperforming the existing zero-shot video grounding method and even several weakly-supervised approaches with large margins on two standard datasets.
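The pseudo-supervision idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`l2_normalize`, `make_pseudo_sample`), the random interval sampling, and the rule of picking the frame closest to the segment mean are all assumptions made for illustration, and random vectors stand in for CLIP image embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products act as cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def make_pseudo_sample(frame_feats, rng, min_len=2):
    """Build one (pseudo query feature, pseudo interval) pair from video features only.

    frame_feats: (num_frames, dim) array of unit-norm frame embeddings.
    Returns the selected feature and a half-open interval [start, end) in frame indices.
    """
    num_frames = frame_feats.shape[0]
    # Hypothetical correct answer: a randomly sampled temporal interval (assumption).
    start = int(rng.integers(0, num_frames - min_len + 1))
    end = int(rng.integers(start + min_len, num_frames + 1))
    segment = frame_feats[start:end]
    # Assumed selection rule: the frame most similar to the segment's mean feature
    # plays the role of the language feature in the shared visual-language space.
    centroid = l2_normalize(segment.mean(axis=0))
    best = int(np.argmax(segment @ centroid))
    return segment[best], (start, end)


rng = np.random.default_rng(0)
# Stand-in for per-frame CLIP image embeddings (32 frames, 512 dimensions).
frame_feats = l2_normalize(rng.standard_normal((32, 512)))
pseudo_query, (start, end) = make_pseudo_sample(frame_feats, rng)
```

A grounding model would then be trained to predict `(start, end)` from `frame_feats` and `pseudo_query`, with a real CLIP text embedding replacing `pseudo_query` at test time.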

Original language: English
Title of host publication: Proceedings - 2023 IEEE Winter Conference on Applications of Computer Vision, WACV 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 2538-2547
Number of pages: 10
ISBN (Electronic): 9781665493468
State: Published - 2023
Event: 23rd IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2023 - Waikoloa, United States
Duration: 3 Jan 2023 → 7 Jan 2023

Publication series

Name: Proceedings - 2023 IEEE Winter Conference on Applications of Computer Vision, WACV 2023

Conference

Conference: 23rd IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2023
Country/Territory: United States
City: Waikoloa
Period: 3/01/23 → 7/01/23

Bibliographical note

Publisher Copyright:
© 2023 IEEE.

Keywords

  • Algorithms: Vision + language and/or other modalities
  • Machine learning architectures, formulations, and algorithms (including transfer, low-shot, semi-, self-, and un-supervised learning)
