Abstract
We present spatiotemporal attention-based multimodal deep neural networks for dimensional emotion recognition in audio-visual video sequences. To learn temporal attention that discriminatively focuses on emotionally salient parts of speech audio, we formulate a temporal attention network using deep neural networks (DNNs). In addition, to learn spatiotemporal attention that selectively focuses on emotionally salient parts of facial videos, we formulate a spatiotemporal encoder-decoder network using Convolutional LSTM (ConvLSTM) modules, which is learned implicitly without any pixel-level annotations. Leveraging this spatiotemporal attention, we also formulate 3D convolutional neural networks (3D-CNNs) to robustly recognize dimensional emotion in facial videos. Furthermore, to exploit multimodal information, we fuse the audio and video features in an emotion regression model. Experimental results show that our method achieves state-of-the-art results in dimensional emotion recognition, with the highest concordance correlation coefficient (CCC) on the AV+EC 2017 dataset.
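The evaluation metric named in the abstract, the concordance correlation coefficient, has a standard closed form: CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The following is a minimal sketch of that formula in Python with NumPy, not the authors' evaluation code. The function name `concordance_cc` and the choice of population (biased) variance are assumptions made for illustration; the abstract does not specify how the score was computed over sequences.

```python
import numpy as np


def concordance_cc(pred, gold):
    """Concordance correlation coefficient between two 1-D sequences.

    Hypothetical helper: uses population (ddof=0) variance and covariance,
    which is an assumption, not something stated in the abstract.
    """
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)

    mean_p, mean_g = pred.mean(), gold.mean()
    var_p, var_g = pred.var(), gold.var()
    # Population covariance between predictions and gold annotations.
    cov = ((pred - mean_p) * (gold - mean_g)).mean()

    # Unlike Pearson correlation, the denominator penalizes mean shifts.
    return 2.0 * cov / (var_p + var_g + (mean_p - mean_g) ** 2)


if __name__ == "__main__":
    print(concordance_cc([1, 2, 3], [1, 2, 3]))  # perfect agreement
    print(concordance_cc([1, 2, 3], [2, 3, 4]))  # same trend, shifted mean
```

Unlike Pearson correlation, the score drops when predictions follow the right trend but are offset or rescaled, which is why it suits continuous arousal and valence regression.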
| Original language | English |
|---|---|
| Title of host publication | AVSU 2018 - Proceedings of the 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, Co-located with MM 2018 |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 27-32 |
| Number of pages | 6 |
| ISBN (Electronic) | 9781450359771 |
| State | Published - 26 Oct 2018 |
| Event | 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, AVSU 2018, co-located with MM 2018 - Seoul, Korea, Republic of. Duration: 26 Oct 2018 → … |
Publication series
| Name | AVSU 2018 - Proceedings of the 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, Co-located with MM 2018 |
|---|---|
Conference
| Conference | 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, AVSU 2018, co-located with MM 2018 |
|---|---|
| Country/Territory | Korea, Republic of |
| City | Seoul |
| Period | 26/10/18 → … |
Bibliographical note
Publisher Copyright: © 2018 Association for Computing Machinery.
Keywords
- Convolutional Long Short-Term Memory
- Multimodal emotion recognition
- Recurrent Neural Network
- Spatiotemporal attention