Audio-visual attention networks for emotion recognition

Jiyoung Lee, Sunok Kim, Seungryong Kim, Kwanghoon Sohn

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

9 Scopus citations

Abstract

We present spatiotemporal attention-based multimodal deep neural networks for dimensional emotion recognition in audio-visual video sequences. To learn the temporal attention that discriminatively focuses on emotionally salient parts within speech audio, we formulate the temporal attention network using deep neural networks (DNNs). In addition, to learn the spatiotemporal attention that selectively focuses on emotionally salient parts within facial videos, a spatiotemporal encoder-decoder network is formulated using Convolutional LSTM (ConvLSTM) modules and is learned implicitly without any pixel-level annotations. By leveraging the spatiotemporal attention, a 3D convolutional neural network (3D-CNN) is also formulated to robustly recognize dimensional emotion in facial videos. Furthermore, to exploit multimodal information, we fuse the audio and video features into an emotion regression model. The experimental results show that our method achieves state-of-the-art performance in dimensional emotion recognition, with the highest concordance correlation coefficient (CCC) on the AV+EC 2017 dataset.
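
The evaluation metric reported above, the concordance correlation coefficient (CCC), measures how well predicted valence/arousal traces agree with the ground-truth annotations in both correlation and scale. The short Python sketch below shows the standard CCC computation; the function name and the dummy arrays are illustrative only and are not taken from the paper or its code.

import numpy as np

def concordance_correlation_coefficient(pred, gold):
    # CCC = 2*cov(pred, gold) / (var(pred) + var(gold) + (mean(pred) - mean(gold))**2)
    # Unlike Pearson correlation, CCC also penalizes differences in scale and offset,
    # which is why it is the usual metric for dimensional emotion regression.
    pred = np.asarray(pred, dtype=np.float64)
    gold = np.asarray(gold, dtype=np.float64)
    mean_p, mean_g = pred.mean(), gold.mean()
    var_p, var_g = pred.var(), gold.var()                 # population variances
    cov = np.mean((pred - mean_p) * (gold - mean_g))      # population covariance
    return 2.0 * cov / (var_p + var_g + (mean_p - mean_g) ** 2)

# Illustrative usage with dummy per-frame valence values (not data from the paper).
gold = np.array([0.1, 0.3, 0.5, 0.4, 0.2])
pred = np.array([0.0, 0.2, 0.6, 0.5, 0.1])
print(f"CCC = {concordance_correlation_coefficient(pred, gold):.3f}")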

Original language: English
Title of host publication: AVSU 2018 - Proceedings of the 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, Co-located with MM 2018
Publisher: Association for Computing Machinery, Inc
Pages: 27-32
Number of pages: 6
ISBN (Electronic): 9781450359771
DOIs
State: Published - 26 Oct 2018
Event: 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, AVSU 2018, co-located with MM 2018 - Seoul, Korea, Republic of
Duration: 26 Oct 2018 → …

Publication series

Name: AVSU 2018 - Proceedings of the 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, Co-located with MM 2018

Conference

Conference: 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, AVSU 2018, co-located with MM 2018
Country/Territory: Korea, Republic of
City: Seoul
Period: 26/10/18 → …

Bibliographical note

Publisher Copyright:
© 2018 Association for Computing Machinery.

Keywords

  • Convolutional Long Short-Term Memory
  • Multimodal emotion recognition
  • Recurrent Neural Network
  • Spatiotemporal attention
