Abstract
In this paper, we propose a novel video captioning algorithm consisting of a feature encoder (FENC) and a decoder architecture that provides more accurate and richer representations. Our network incorporates feature temporal attention (FTA) to efficiently embed important events into a feature vector. In FTA, the proposed feature is given as the weighted fusion of video features extracted from a 3D CNN, which allows the decoder to know when a feature is activated. In the decoder, feature word attention (FWA) similarly weights elements of the encoded feature vector, determining which elements should be activated to generate the appropriate word. Training is further facilitated by a new loss function that reduces the variance of word frequencies. Experimental results demonstrate that the proposed algorithm outperforms conventional algorithms on VATEX, a recent large-scale dataset for long-term video sentence generation.
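The abstract describes FTA only at a high level. As a rough illustration, a generic softmax temporal attention over per-segment 3D CNN features can be sketched as below; the function name, shapes, and scoring rule are assumptions for illustration, not the authors' actual FTA implementation.

```python
import numpy as np

def feature_temporal_attention(features, query):
    """Weighted fusion of per-segment video features via softmax attention.

    features: (T, D) array -- T temporal segments, each a D-dim
        descriptor (e.g. pooled 3D CNN activations).
    query: (D,) array -- a decoder state used to score each segment.
    Returns the fused (D,) feature and the (T,) attention weights,
    which indicate *when* in the video the feature is activated.
    """
    scores = features @ query                 # (T,) relevance per segment
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()                  # weights sum to 1 over time
    fused = weights @ features                # (D,) attended feature vector
    return fused, weights

# Toy example: 5 temporal segments of 4-dim features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 4))
q = rng.standard_normal(4)
fused, w = feature_temporal_attention(feats, q)
```

FWA would apply the analogous weighting over the D feature dimensions rather than over the T time steps.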
Original language | English |
---|---|
Title of host publication | 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2020 - Proceedings |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 1279-1282 |
Number of pages | 4 |
ISBN (Electronic) | 9789881476883 |
State | Published - 7 Dec 2020 |
Event | 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2020 - Virtual, Auckland, New Zealand Duration: 7 Dec 2020 → 10 Dec 2020 |
Publication series
Name | 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2020 - Proceedings |
---|---|
Conference
Conference | 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2020 |
---|---|
Country/Territory | New Zealand |
City | Virtual, Auckland |
Period | 7/12/20 → 10/12/20 |
Bibliographical note
Publisher Copyright: © 2020 APSIPA.