Abstract
Unlike other vision tasks where Transformer-based approaches are becoming increasingly common, stereo depth estimation is still dominated by convolution-based models. This is mainly due to the limited availability of real-world ground truth for stereo matching, which hinders the performance improvement of Transformer-based stereo approaches. In this paper, we propose UniTT-Stereo, a method to maximize the potential of Transformer-based stereo architectures by unifying self-supervised learning for pre-training with a stereo matching framework based on supervised learning. Specifically, we design a dual-task learning scheme that reconstructs masked regions of an input image while simultaneously predicting corresponding points in the paired image. We demonstrate that this approach encourages the model to learn locality-aware representations, which are critical to overcoming the data inefficiency of Transformers. Moreover, to address the challenging joint task of reconstruction and prediction, we propose a variable masking ratio strategy that promotes robustness to varying levels of visual information. Additionally, we introduce losses that exploit stereo geometry and correspondence at the appearance, feature, and disparity levels. To further validate the effectiveness of our design, we conduct frequency decomposition and attention map visualization, which reveal how the model captures fine-grained structures and cross-view correspondences. State-of-the-art performance of UniTT-Stereo is validated on various benchmarks such as the ETH3D, KITTI 2012, and KITTI 2015 datasets.
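As a rough illustration of the variable masking ratio idea, the sketch below draws a fresh masking ratio for each training sample and hides that fraction of image patches. The abstract does not give the sampling distribution or range, so the uniform draw, the default bounds, and the function name `sample_patch_mask` are all assumptions made for illustration, not the authors' implementation.

```python
import random


def sample_patch_mask(num_patches, min_ratio=0.0, max_ratio=0.5, rng=None):
    """Draw a per-sample masking ratio and a random patch mask.

    Assumptions (not from the paper): the ratio is drawn uniformly from
    [min_ratio, max_ratio], and masked patches are chosen uniformly at random.
    Returns (ratio, mask), where mask[i] is True if patch i is hidden.
    """
    rng = rng or random.Random()
    # A new ratio per call exposes the model to varying amounts of visible content.
    ratio = rng.uniform(min_ratio, max_ratio)
    num_masked = int(round(ratio * num_patches))
    # Choose which patches to hide, without replacement.
    masked_idx = set(rng.sample(range(num_patches), num_masked))
    mask = [i in masked_idx for i in range(num_patches)]
    return ratio, mask


if __name__ == "__main__":
    # Example: a 14 x 14 patch grid (196 patches), as in a typical ViT setup.
    ratio, mask = sample_patch_mask(196, min_ratio=0.1, max_ratio=0.5,
                                    rng=random.Random(0))
    print(f"ratio={ratio:.3f}, masked={sum(mask)} of {len(mask)} patches")
```

In a dual-task setup of the kind the abstract describes, a mask like this would be applied to one view before the encoder, with the reconstruction loss computed on the hidden patches and the disparity loss computed as usual.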
| Original language | English |
|---|---|
| Pages (from-to) | 204695-204707 |
| Number of pages | 13 |
| Journal | IEEE Access |
| Volume | 13 |
| DOIs | |
| State | Published - 2025 |
Bibliographical note
Publisher Copyright: © 2013 IEEE.
Keywords
- Masked image modeling
- self-supervised learning
- stereo depth estimation
- supervised learning
- transformer
Fingerprint
Dive into the research topics of 'UniTT-Stereo: Unified Training of Transformer for Enhanced Stereo Matching'. Together they form a unique fingerprint.