BBN VISER TRECVID 2011 Multimedia Event Detection System

Pradeep Natarajan, Prem Natarajan, Vasant Manohar, Shuang Wu, Stavros Tsakalidis, Shiv N. Vitaladevuni, Xiaodan Zhuang, Rohit Prasad, Guangnan Ye, Dong Liu, I-Hong Jhuo, Shih-Fu Chang, Hamid Izadinia, Imran Saleemi, Mubarak Shah, Brandyn White, Tom Yeh, Larry Davis

Research output: Contribution to conference › Paper › peer-review


Abstract

We describe the Raytheon BBN (BBN) VISER system that is designed to detect events of interest in multimedia data. We also present a comprehensive analysis of the different modules of that system in the context of the MED 2011 task. The VISER system incorporates a large set of low-level features that capture appearance, color, motion, audio, and audio-visual co-occurrence patterns in videos. For the low-level features, we rigorously analyzed several coding and pooling strategies, and also used state-of-the-art spatio-temporal pooling strategies to model relationships between different features. The system also uses high-level (i.e., semantic) visual information obtained from detecting scene, object, and action concepts. Furthermore, the VISER system exploits multimodal information by analyzing available spoken and videotext content using BBN's state-of-the-art Byblos automatic speech recognition (ASR) and videotext recognition systems. These diverse streams of information are combined into a single, fixed-dimensional vector for each video. We explored two different combination strategies: early fusion, implemented through a fast kernel-based fusion framework, and late fusion, performed using both Bayesian model combination (BAYCOM) and a novel weighted-average framework. Consistent with the previous MED'10 evaluation, low-level visual features exhibit strong performance and form the basis of our system. However, high-level information from speech, videotext, and object detection provides consistent and significant performance improvements. Overall, BBN's VISER system exhibited the best performance among all submitted systems, with an average ANDC score of 0.46 across the 10 MED'11 test events when the threshold was optimized for the NDC score, and a missed-detection rate below 30% when the threshold was optimized to minimize missed detections at a 6% false-alarm rate.

Description of Submitted Runs

  • BBNVISER-LLFeat: Uses a combination of six high-performing, multimodal, and complementary low-level features capturing appearance, color, and motion, together with MFCC and audio-energy features. We combine these low-level features using an early-fusion strategy. The threshold is estimated to minimize the NDC score.
  • BBNVISER-Fusion1: Combines several sub-systems, each based on some combination of low-level features, ASR, videotext OCR, and other high-level concepts, using a late-fusion Bayesian model combination (BAYCOM) strategy. The threshold is estimated to minimize the NDC score.
  • BBNVISER-Fusion2: Combines the same set of sub-systems as BBNVISER-Fusion1. Instead of BAYCOM, it uses a novel weighted-average fusion strategy in which the fusion weight for each sub-system is estimated automatically for each video at runtime.
  • BBNVISER-Fusion3: Combines all the sub-systems used in BBNVISER-Fusion2 with separate end-to-end systems from Columbia and UCF. In all, 18 sub-systems were combined using weighted-average fusion. The threshold is estimated to minimize the probability of missed detection in the neighborhood of ALADDIN's Year 1 false-alarm-rate ceiling.
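The abstract refers to coding and pooling strategies over low-level features, including spatio-temporal pooling, without specifying them. The following is a minimal sketch of one common instantiation of that idea: max-pooling coded local descriptors over a coarse (x, y, t) grid to obtain a fixed-dimensional video vector. The grid size, the max-pooling choice, and all names here are illustrative assumptions, not the authors' configuration.

```python
import numpy as np

def st_max_pool(codes, coords, grid=(2, 2, 2)):
    """Max-pool coded local descriptors over a spatio-temporal grid.

    codes:  (n_points, k) codes / soft assignments per local descriptor.
    coords: (n_points, 3) normalized (x, y, t) positions in [0, 1).
    Returns a vector of length k * prod(grid): one max-pooled code
    vector per grid cell, concatenated.
    """
    gx, gy, gt = grid
    cells = np.floor(coords * np.array([gx, gy, gt])).astype(int)
    cells = np.minimum(cells, np.array([gx - 1, gy - 1, gt - 1]))
    cell_id = (cells[:, 0] * gy + cells[:, 1]) * gt + cells[:, 2]
    out = np.zeros((gx * gy * gt, codes.shape[1]))
    for cid in range(gx * gy * gt):
        mask = cell_id == cid
        if mask.any():
            out[cid] = codes[mask].max(axis=0)
    return out.ravel()

# Hypothetical: 500 descriptors, 100-word codebook, 2x2x2 grid
rng = np.random.default_rng(2)
v = st_max_pool(rng.random((500, 100)), rng.random((500, 3)))
print(v.shape)  # (800,)
```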
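The fast kernel-based early-fusion framework is likewise only named in this record. One standard way to realize kernel-level early fusion is to average per-stream χ² kernels into a single precomputed kernel for an SVM; the sketch below follows that pattern. The equal stream weights, the chi2_kernel helper, and the toy data are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC

def chi2_kernel(X, Y, gamma=1.0):
    """Exponential chi-square kernel between histogram rows of X and Y."""
    num = (X[:, None, :] - Y[None, :, :]) ** 2
    den = X[:, None, :] + Y[None, :, :] + 1e-10
    return np.exp(-gamma * (num / den).sum(axis=2))

def fused_kernel(streams_a, streams_b, weights=None):
    """Weighted average of per-stream kernels (early fusion).

    streams_a / streams_b: lists of (n_a, d_i) / (n_b, d_i) arrays,
    one histogram matrix per low-level feature stream."""
    if weights is None:
        weights = [1.0 / len(streams_a)] * len(streams_a)
    return sum(w * chi2_kernel(A, B)
               for w, A, B in zip(weights, streams_a, streams_b))

# Hypothetical usage with two feature streams (e.g., appearance + motion)
rng = np.random.default_rng(0)
train = [rng.random((20, 50)), rng.random((20, 30))]
test = [rng.random((5, 50)), rng.random((5, 30))]
y = rng.integers(0, 2, 20)
clf = SVC(kernel="precomputed").fit(fused_kernel(train, train), y)
scores = clf.decision_function(fused_kernel(test, train))
```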
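For the weighted-average late fusion used in BBNVISER-Fusion2 and BBNVISER-Fusion3, the record states only that the per-sub-system weights are estimated for each video at runtime. The sketch below uses a softmax over per-video sub-system confidence estimates as a hypothetical stand-in for the authors' weight estimator; the confidence inputs and temperature parameter are assumptions.

```python
import numpy as np

def late_fuse(scores, confidences, temperature=1.0):
    """Weighted-average late fusion of sub-system detection scores.

    scores:      (n_videos, n_subsystems) detection scores.
    confidences: (n_videos, n_subsystems) per-video reliability
                 estimates for each sub-system (assumed given).
    Weights are a softmax over confidences, computed per video,
    so each video can trust a different mix of sub-systems.
    """
    z = confidences / temperature
    z -= z.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(z)
    w /= w.sum(axis=1, keepdims=True)
    return (w * scores).sum(axis=1)     # one fused score per video

# Hypothetical: 3 videos scored by 18 sub-systems, as in Fusion3
rng = np.random.default_rng(1)
print(late_fuse(rng.random((3, 18)), rng.random((3, 18))))
```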
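Finally, since the results are reported in terms of the (actual) NDC metric, here is a small sketch of the Normalized Detection Cost as commonly parameterized for MED'11 (C_Miss = 80, C_FA = 1, P_Target = 0.001). These parameter values are an assumption drawn from the MED evaluation convention, not from this record, and should be checked against the official evaluation plan.

```python
def ndc(p_miss, p_fa, c_miss=80.0, c_fa=1.0, p_target=0.001):
    """Normalized Detection Cost as used in TRECVID MED.

    p_miss: probability of missed detection for the event.
    p_fa:   probability of false alarm.
    The raw cost is normalized so that a trivial system that
    always says "no event" (p_miss = 1, p_fa = 0) scores 1.0.
    """
    cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    normalizer = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return cost / normalizer

# Example: a 30% miss rate at a 6% false-alarm rate
print(ndc(0.30, 0.06))  # ~1.05 with the assumed parameters
```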

Original language: English
State: Published - 2011
Event: TREC Video Retrieval Evaluation, TRECVID 2011 - Gaithersburg, MD, United States
Duration: 5 Dec 2011 – 7 Dec 2011

Conference

Conference: TREC Video Retrieval Evaluation, TRECVID 2011
Country/Territory: United States
City: Gaithersburg, MD
Period: 5/12/11 – 7/12/11

Keywords

  • Automatic speech recognition
  • BAYCOM
  • Early fusion
  • Feature fusion
  • Late fusion
  • Low-level visual features
  • Spatio-temporal pooling
  • Videotext OCR
