Temporal Attention Feature Encoding for Video Captioning

Nayoung Kim, Seong Jong Ha, Je Won Kang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In this paper, we propose a novel video captioning algorithm that comprises a feature encoder (FENC) and a decoder architecture to provide a more accurate and richer representation. Our network model incorporates feature temporal attention (FTA) to efficiently embed important events into a feature vector. In FTA, the proposed feature is given as the weighted fusion of the video features extracted from a 3D CNN, which allows the decoder to know when the feature is activated. Similarly, in the decoder, feature word attention (FWA) is used to weight elements of the encoded feature vector. The FWA determines which elements of the feature should be activated to generate the appropriate word. Training is further facilitated by a new loss function that reduces the variance of the word frequencies. Experimental results demonstrate that the proposed algorithm outperforms conventional algorithms on VATEX, a recent large-scale dataset for long-term video sentence generation.
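The abstract gives no equations, so the following is only a minimal sketch of the two ideas it names, under stated assumptions: a softmax-weighted fusion of per-segment features over time (in the spirit of FTA) and an element-wise gate over the fused vector (in the spirit of FWA). The function names, the dot-product scoring, and the sigmoid gate are hypothetical choices made for illustration and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_fuse(segment_features, score_vector):
    """Fuse T segment features (each of length D) into one D-dim vector.

    Assumption: each segment is scored by a dot product with a
    (hypothetical) learned vector; the paper may score differently.
    """
    scores = [
        sum(f * w for f, w in zip(feat, score_vector))
        for feat in segment_features
    ]
    weights = softmax(scores)  # one weight per time step, summing to 1
    dim = len(segment_features[0])
    fused = [
        sum(weights[t] * segment_features[t][d]
            for t in range(len(segment_features)))
        for d in range(dim)
    ]
    return fused, weights


def feature_word_gate(fused, gate_logits):
    """Scale each element of the fused feature by a sigmoid gate.

    Assumption: gate_logits would come from the decoder state at the
    current word; here they are passed in directly.
    """
    return [x / (1.0 + math.exp(-g)) for x, g in zip(fused, gate_logits)]


if __name__ == "__main__":
    # Three toy "3D CNN" segment features of dimension 2.
    segments = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    fused, weights = temporal_attention_fuse(segments, [0.5, -0.5])
    gated = feature_word_gate(fused, [2.0, -2.0])
    print(weights, fused, gated)
```

In this sketch the temporal weights indicate when a feature is active, while the gate selects which elements of the fused vector contribute to the next word. The loss function mentioned in the abstract is not sketched because the abstract does not specify its form.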

Original language: English
Title of host publication: 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2020 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1279-1282
Number of pages: 4
ISBN (Electronic): 9789881476883
State: Published - 7 Dec 2020
Event: 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2020 - Virtual, Auckland, New Zealand
Duration: 7 Dec 2020 to 10 Dec 2020

Publication series

Name: 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2020 - Proceedings

Conference

Conference: 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2020
Country/Territory: New Zealand
City: Virtual, Auckland
Period: 7/12/20 to 10/12/20

Bibliographical note

Publisher Copyright:
© 2020 APSIPA.
