TY - GEN
T1 - Mining Better Samples for Contrastive Learning of Temporal Correspondence
AU - Jeon, Sangryul
AU - Min, Dongbo
AU - Kim, Seungryong
AU - Sohn, Kwanghoon
N1 - Funding Information:
Acknowledgements: This work was supported by IITP grant funded by the Korea government (MSIT) (No.2020-0-00056, To create AI systems that act appropriately and effectively in novel situations that occur in open worlds) and the Yonsei University Research Fund of 2021 (2021-22-0001).
Publisher Copyright:
© 2021 IEEE
PY - 2021
Y1 - 2021
N2 - We present a novel framework for contrastive learning of pixel-level representation using only unlabeled video. Without the need for ground-truth annotation, our method is capable of collecting well-defined positive correspondences by measuring their confidences and well-defined negative ones by appropriately adjusting their hardness during training. This allows us to suppress the adverse impact of ambiguous matches and prevent too hard or too easy negative samples from yielding a trivial solution. To accomplish this, we incorporate three different criteria that range from a pixel-level matching confidence to a video-level one into a bottom-up pipeline, and plan a curriculum that is aware of current representation power for the adaptive hardness of negative samples during training. With the proposed method, state-of-the-art performance is attained over the latest approaches on several video label propagation tasks.
AB - We present a novel framework for contrastive learning of pixel-level representation using only unlabeled video. Without the need for ground-truth annotation, our method is capable of collecting well-defined positive correspondences by measuring their confidences and well-defined negative ones by appropriately adjusting their hardness during training. This allows us to suppress the adverse impact of ambiguous matches and prevent too hard or too easy negative samples from yielding a trivial solution. To accomplish this, we incorporate three different criteria that range from a pixel-level matching confidence to a video-level one into a bottom-up pipeline, and plan a curriculum that is aware of current representation power for the adaptive hardness of negative samples during training. With the proposed method, state-of-the-art performance is attained over the latest approaches on several video label propagation tasks.
UR - http://www.scopus.com/inward/record.url?scp=85123215522&partnerID=8YFLogxK
U2 - 10.1109/CVPR46437.2021.00109
DO - 10.1109/CVPR46437.2021.00109
M3 - Conference contribution
AN - SCOPUS:85123215522
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 1034
EP - 1044
BT - Proceedings - 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021
PB - IEEE Computer Society
Y2 - 19 June 2021 through 25 June 2021
ER -