TY - JOUR
T1 - What and when to look? Temporal span proposal network for video relation detection
AU - Woo, Sangmin
AU - Noh, Junhyug
AU - Kim, Kangil
N1 - Publisher Copyright: © 2025 Elsevier Ltd
PY - 2026/2/1
Y1 - 2026/2/1
N2 - Identifying relations between objects is central to understanding the scene. While several works have been proposed for relation modeling in the image domain, progress in the video domain has been constrained by the challenging dynamics of spatio-temporal interactions (e.g., between which objects is there an interaction? when do relations start and end?). To date, two representative approaches have been proposed to tackle Video Visual Relation Detection (VidVRD): segment-based and window-based. Segment-based methods lack temporal continuity, whereas window-based methods scale poorly. To tackle these limitations, we propose a novel approach named Temporal Span Proposal Network (TSPN). TSPN tells what to look at: it sparsifies the relation search space by scoring the relationness of each object pair, i.e., measuring how likely a relation is to exist. TSPN tells when to look: it simultaneously predicts start-end timestamps (i.e., temporal spans) and categories of all possible relations by utilizing the full video context. These two designs enable a win-win scenario: TSPN accelerates training by 2× or more compared with existing methods and achieves competitive performance on two VidVRD benchmarks (ImageNet-VidVRD and VidOR). Moreover, comprehensive ablative experiments demonstrate the effectiveness of our approach.
AB - Identifying relations between objects is central to understanding the scene. While several works have been proposed for relation modeling in the image domain, progress in the video domain has been constrained by the challenging dynamics of spatio-temporal interactions (e.g., between which objects is there an interaction? when do relations start and end?). To date, two representative approaches have been proposed to tackle Video Visual Relation Detection (VidVRD): segment-based and window-based. Segment-based methods lack temporal continuity, whereas window-based methods scale poorly. To tackle these limitations, we propose a novel approach named Temporal Span Proposal Network (TSPN). TSPN tells what to look at: it sparsifies the relation search space by scoring the relationness of each object pair, i.e., measuring how likely a relation is to exist. TSPN tells when to look: it simultaneously predicts start-end timestamps (i.e., temporal spans) and categories of all possible relations by utilizing the full video context. These two designs enable a win-win scenario: TSPN accelerates training by 2× or more compared with existing methods and achieves competitive performance on two VidVRD benchmarks (ImageNet-VidVRD and VidOR). Moreover, comprehensive ablative experiments demonstrate the effectiveness of our approach.
KW - Multi object tracking
KW - Proposal network
KW - Relationship detection
KW - Video understanding
UR - https://www.scopus.com/pages/publications/105014923200
U2 - 10.1016/j.eswa.2025.129503
DO - 10.1016/j.eswa.2025.129503
M3 - Article
AN - SCOPUS:105014923200
SN - 0957-4174
VL - 297
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 129503
ER -