IMPROVING SELF-SUPERVISED VISION TRANSFORMERS FOR VISUAL CONTROL

Wonil Song, Kwanghoon Sohn, Dongbo Min

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Despite the tremendous success of vision transformer (ViT) architectures across a broad range of computer vision tasks, the potential of ViT for vision-based deep reinforcement learning (RL) has not yet been fully explored. To improve the performance of ViT models in visual RL, we propose a simple yet effective self-supervised learning approach that exploits the structural capability of a single ViT model to learn multiple, distinct representations through extra learnable token embeddings. Specifically, in addition to the RL token that serves as the RL input and corresponds to the classification token in computer vision, we introduce extra tokens tailored to two auxiliary self-supervised tasks specialized to learn visual and environmental dynamics representations. Through self-attention interactions with the embeddings of these extra tokens, the ViT encoder receives additional learning signals, enabling it to learn more comprehensive representations that benefit RL tasks. In experiments on benchmarks including the DeepMind Control Suite (DMControl) and Atari games, we demonstrate that the proposed approach outperforms baselines that use ViT encoders, and in particular achieves state-of-the-art performance on 4 out of 5 DMControl tasks.
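
The token mechanism described in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the token names (`rl`, `visual`, `dynamics`), the one-token-per-auxiliary-task layout, the dimensions, the random weights, and the single-head, single-layer attention are all assumptions made for illustration. The sketch only shows how learnable tokens prepended to the patch embeddings share one self-attention pass, so that each token's output can feed a different objective.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def self_attention(tokens, Wq, Wk, Wv):
    """Single-head self-attention over a sequence of token vectors."""
    Q = [matvec(Wq, t) for t in tokens]
    K = [matvec(Wk, t) for t in tokens]
    V = [matvec(Wv, t) for t in tokens]
    scale = math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        attn = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(attn, V))
                    for j in range(len(V[0]))])
    return out


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def encode(patch_embeddings, extra_tokens, weights):
    """Prepend the extra tokens to the patches and attend over all of them.

    Returns the output embedding of each extra token, keyed by its name.
    """
    sequence = list(extra_tokens.values()) + patch_embeddings
    outputs = self_attention(sequence, *weights)
    return {name: outputs[i] for i, name in enumerate(extra_tokens)}


rng = random.Random(0)
DIM, NUM_PATCHES = 8, 16  # hypothetical sizes

# Hypothetical token names: one RL token plus one token per auxiliary task.
extra_tokens = {
    name: [rng.gauss(0.0, 0.1) for _ in range(DIM)]
    for name in ("rl", "visual", "dynamics")
}
weights = tuple(random_matrix(DIM, DIM, rng) for _ in range(3))  # Wq, Wk, Wv
patches = random_matrix(NUM_PATCHES, DIM, rng)  # stand-in patch embeddings

heads = encode(patches, extra_tokens, weights)
# heads["rl"] would feed the RL policy/value networks; heads["visual"] and
# heads["dynamics"] would feed the two auxiliary self-supervised losses.
```

In such a setup, all three outputs depend on the same patch embeddings and encoder weights, so gradients from the auxiliary losses would also shape the representation read out through the RL token.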

Original language: English
Title of host publication: 2024 IEEE International Conference on Image Processing, ICIP 2024 - Proceedings
Publisher: IEEE Computer Society
Pages: 1302-1308
Number of pages: 7
ISBN (Electronic): 9798350349399
State: Published - 2024
Event: 31st IEEE International Conference on Image Processing, ICIP 2024 - Abu Dhabi, United Arab Emirates
Duration: 27 Oct 2024 to 30 Oct 2024

Publication series

Name: Proceedings - International Conference on Image Processing, ICIP
ISSN (Print): 1522-4880

Conference

Conference: 31st IEEE International Conference on Image Processing, ICIP 2024
Country/Territory: United Arab Emirates
City: Abu Dhabi
Period: 27/10/24 to 30/10/24

Bibliographical note

Publisher Copyright:
© 2024 IEEE.

Keywords

  • Reinforcement Learning
  • Self-Supervised Learning
  • Vision Transformer
