UniTT-Stereo: Unified Training of Transformer for Enhanced Stereo Matching

Research output: Contribution to journal › Article › peer-review

Abstract

Unlike other vision tasks where Transformer-based approaches are becoming increasingly common, stereo depth estimation is still dominated by convolution-based models. This is mainly due to the limited availability of real-world ground truth for stereo matching, which hinders the performance improvement of Transformer-based stereo approaches. In this paper, we propose UniTT-Stereo, a method to maximize the potential of Transformer-based stereo architectures by unifying self-supervised learning for pre-training with a stereo matching framework based on supervised learning. Specifically, we design a dual-task learning scheme that reconstructs masked regions of an input image while simultaneously predicting corresponding points in the paired image. We demonstrate that this approach encourages the model to learn locality-aware representations, which are critical to overcoming the data inefficiency of Transformers. Moreover, to address these challenging tasks of reconstruction-and-prediction, we propose a variable masking ratio strategy that promotes robustness to varying levels of visual information. Additionally, we introduce losses that exploit stereo geometry and correspondence at the appearance, feature, and disparity levels. To further validate the effectiveness of our design, we conduct frequency decomposition and attention map visualization, which reveal how the model effectively captures fine-grained structures and cross-view correspondences. State-of-the-art performance of UniTT-Stereo is validated on various benchmarks such as the ETH3D, KITTI 2012, and KITTI 2015 datasets.
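
As a rough illustration of the variable masking ratio idea mentioned in the abstract, the sketch below draws a fresh masking ratio for each training sample and hides that fraction of image patches. The abstract does not state how the ratio is sampled or over what range, so the uniform sampling, the `ratio_range` default, and the function name `sample_patch_mask` are all assumptions made for illustration, not the authors' implementation.

```python
import random


def sample_patch_mask(num_patches, ratio_range=(0.0, 0.5), rng=None):
    """Draw a masking ratio, then pick which patches to hide.

    Hypothetical helper: the uniform distribution and the default
    ratio_range are assumptions, not values taken from the paper.
    Returns (ratio, mask), where mask[i] is True if patch i is hidden.
    """
    if rng is None:
        rng = random.Random()
    low, high = ratio_range
    # A new ratio per call, so the model sees varying amounts of context.
    ratio = rng.uniform(low, high)
    num_masked = int(round(ratio * num_patches))
    # Choose the hidden patches uniformly at random without replacement.
    masked = set(rng.sample(range(num_patches), num_masked))
    mask = [i in masked for i in range(num_patches)]
    return ratio, mask
```

For example, calling `sample_patch_mask(196, (0.1, 0.5))` once per training sample would yield a different number of hidden patches each time for a 14 x 14 patch grid; the reconstruction loss would then be computed on the hidden patches only, while the stereo matching loss uses the paired image.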

Original language: English
Pages (from-to): 204695-204707
Number of pages: 13
Journal: IEEE Access
Volume: 13
DOIs
State: Published - 2025

Bibliographical note

Publisher Copyright:
© 2013 IEEE.

Keywords

  • Masked image modeling
  • self-supervised learning
  • stereo depth estimation
  • supervised learning
  • transformer
