Abstract
We describe the Raytheon BBN (BBN) VISER system, which is designed to detect events of interest in multimedia data, and present a comprehensive analysis of the different modules of that system in the context of the MED 2011 task. The VISER system incorporates a large set of low-level features that capture appearance, color, motion, audio, and audio-visual co-occurrence patterns in videos. For the low-level features, we rigorously analyzed several coding and pooling strategies, and also used state-of-the-art spatio-temporal pooling strategies to model relationships between different features. The system also uses high-level (i.e., semantic) visual information obtained by detecting scene, object, and action concepts. Furthermore, the VISER system exploits multimodal information by analyzing available spoken and videotext content using BBN's state-of-the-art Byblos automatic speech recognition (ASR) and videotext recognition systems. These diverse streams of information are combined into a single, fixed-dimensional vector for each video. We explored two combination strategies: early fusion, implemented through a fast kernel-based fusion framework, and late fusion, performed using both Bayesian model combination (BAYCOM) and an innovative weighted-average framework. Consistent with the previous MED'10 evaluation, low-level visual features exhibit strong performance and form the basis of our system. However, high-level information from speech, videotext, and object detection provides consistent and significant performance improvements. Overall, BBN's VISER system exhibited the best performance among all submitted systems, with an average ANDC score of 0.46 across the 10 MED'11 test events when the threshold was optimized for the NDC score, and a missed detection rate below 30% when the threshold was optimized to minimize missed detections at a 6% false alarm rate.

Description of Submitted Runs

- BBNVISER-LLFeat: Uses a combination of 6 high-performing, multimodal, and complementary low-level features spanning appearance, color, motion, MFCC, and audio energy. We combine these low-level features using an early fusion strategy. The threshold is estimated to minimize the NDC score.
- BBNVISER-Fusion1: Combines several sub-systems, each based on some combination of low-level features, ASR, videotext OCR, and other high-level concepts, using a late-fusion, Bayesian model combination (BAYCOM) strategy. The threshold is estimated to minimize the NDC score.
- BBNVISER-Fusion2: Combines the same set of sub-systems as BBNVISER-Fusion1, but instead of BAYCOM it uses a novel weighted-average fusion strategy in which the fusion weights (for each sub-system) are estimated for each video automatically at runtime.
- BBNVISER-Fusion3: Combines all the sub-systems used in BBNVISER-Fusion2 with separate end-to-end systems from Columbia and UCF; in all, 18 sub-systems were combined using weighted-average fusion. The threshold is estimated to minimize the probability of missed detection in the neighborhood of ALADDIN's Year 1 false alarm rate ceiling.
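The kernel-based early fusion described above can be illustrated with a minimal sketch: a kernel matrix is computed for each low-level feature stream and the matrices are averaged into a single fused kernel used by a precomputed-kernel SVM. The chi-square kernel, uniform weights, and the helper name `early_fusion_kernel` below are illustrative assumptions; the paper's actual fast fusion framework is not reproduced here.

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def early_fusion_kernel(feature_blocks, gamma=1.0, weights=None):
    """Average per-feature kernel matrices into one fused kernel (uniform weights by default)."""
    if weights is None:
        weights = [1.0 / len(feature_blocks)] * len(feature_blocks)
    return sum(w * chi2_kernel(X, X, gamma=gamma) for w, X in zip(weights, feature_blocks))

# Toy usage: three low-level feature streams (e.g. appearance / color / MFCC histograms) for 8 videos.
rng = np.random.default_rng(0)
blocks = [rng.random((8, 50)) for _ in range(3)]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

K = early_fusion_kernel(blocks)                    # fused kernel over the training videos
clf = SVC(kernel="precomputed").fit(K, labels)
scores = clf.decision_function(K)                  # per-video event scores
```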
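The weighted-average late fusion used in the Fusion2 and Fusion3 runs can likewise be sketched as a per-video convex combination of sub-system scores. How the system estimates the per-video weights at runtime is not detailed in the abstract, so the confidence heuristic below (distance from a neutral score) is only a stand-in.

```python
import numpy as np

def late_fusion_weighted_average(scores, confidences=None):
    """
    Weighted-average late fusion of per-video scores from several sub-systems.

    scores      : (num_subsystems, num_videos) array of detection scores in [0, 1]
    confidences : optional (num_subsystems, num_videos) array of per-video weights;
                  the fallback below is a hypothetical proxy, not the paper's method.
    """
    scores = np.asarray(scores, dtype=float)
    if confidences is None:
        confidences = np.abs(scores - 0.5) + 1e-6              # stand-in confidence heuristic
    w = confidences / confidences.sum(axis=0, keepdims=True)   # normalize weights per video
    return (w * scores).sum(axis=0)                            # fused score per video

# Toy usage: 3 sub-systems (e.g. low-level features, ASR, videotext OCR) scoring 4 videos.
subsystem_scores = np.array([
    [0.90, 0.20, 0.55, 0.10],
    [0.50, 0.50, 0.80, 0.05],   # e.g. ASR stays near neutral when little speech is found
    [0.85, 0.30, 0.50, 0.50],
])
fused = late_fusion_weighted_average(subsystem_scores)
detections = fused >= 0.5        # the operating threshold is chosen elsewhere, e.g. to minimize NDC
```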
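The NDC referred to throughout is the normalized detection cost used in TRECVID MED: a cost-weighted combination of the miss and false-alarm rates, normalized by the cost of the best trivial system. The sketch below passes the cost parameters and target prior in explicitly rather than restating the MED'11 constants, which are set by the evaluation plan.

```python
def normalized_detection_cost(p_miss, p_fa, c_miss, c_fa, p_target):
    """
    Normalized detection cost (NDC): the raw detection cost, normalized by the
    cost of the cheaper of the two trivial systems (always-yes or always-no).
    """
    detection_cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    return detection_cost / min(c_miss * p_target, c_fa * (1.0 - p_target))

# Illustrative values only; the actual MED'11 constants come from the evaluation plan.
print(normalized_detection_cost(p_miss=0.30, p_fa=0.06, c_miss=80.0, c_fa=1.0, p_target=0.001))
```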
Original language | English |
---|---|
State | Published - 2011 |
Event | TREC Video Retrieval Evaluation, TRECVID 2011 - Gaithersburg, MD, United States (Duration: 5 Dec 2011 → 7 Dec 2011) |
Conference
Conference | TREC Video Retrieval Evaluation, TRECVID 2011 |
---|---|
Country/Territory | United States |
City | Gaithersburg, MD |
Period | 5/12/11 → 7/12/11 |
Keywords
- Automatic speech recognition
- BAYCOM
- Early fusion
- Feature fusion
- Late fusion
- Low-level visual features
- Spatio-temporal pooling
- Videotext OCR