TY - GEN
T1 - Video Question Answering Using Language-Guided Deep Compressed-Domain Video Feature
AU - Kim, Nayoung
AU - Ha, Seong Jong
AU - Kang, Je Won
N1 - Funding Information:
This work was supported in part by NCSOFT, in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (No. NRF-2019R1C1C1010249), and in part by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2021-2020-0-01460) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).
Publisher Copyright:
© 2021 IEEE
PY - 2021
Y1 - 2021
AB - Video Question Answering (Video QA) aims to answer a question through semantic reasoning between visual and linguistic information. Handling the large amount of multi-modal video and language information in a video has recently become important in industry. However, current video QA models use deep features that suffer from significant computational complexity and insufficient representation capability in both training and testing. Existing features are extracted using pretrained networks after all the frames are decoded, which is not always suitable for video QA tasks. In this paper, we develop a novel deep neural network that provides video QA features obtained from the coded video bit-stream to reduce the complexity. The proposed network includes several deep modules dedicated to both video QA and the video compression system, which is the first such attempt for the video QA task. The proposed network is predominantly model-agnostic: it can be integrated into state-of-the-art networks for improved performance without any computationally expensive motion-related deep models. Experimental results demonstrate that the proposed network outperforms previous studies at lower complexity. Code is available at https://github.com/Nayoung-Kim-ICP/VQAC.
UR - http://www.scopus.com/inward/record.url?scp=85127006329&partnerID=8YFLogxK
U2 - 10.1109/ICCV48922.2021.00173
DO - 10.1109/ICCV48922.2021.00173
M3 - Conference contribution
AN - SCOPUS:85127006329
T3 - Proceedings of the IEEE International Conference on Computer Vision
SP - 1688
EP - 1697
BT - Proceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 18th IEEE/CVF International Conference on Computer Vision, ICCV 2021
Y2 - 11 October 2021 through 17 October 2021
ER -