Abstract
Despite the tremendous success of vision transformer (ViT) architectures in a broad range of computer vision tasks, the potential of ViT for vision-based deep reinforcement learning (RL) has not yet been fully explored. To improve the performance of the ViT model in visual RL, we propose a simple yet effective approach for self-supervised learning that exploits the structural capability of a single ViT model to learn multiple, distinct representations through extra learnable token embeddings. To this end, in addition to an RL token used for RL input, which corresponds to the classification token in computer vision, we introduce extra tokens tailored to two auxiliary self-supervised tasks, specialized to learn visual and environmental dynamics representations respectively. Through self-attention with the embeddings of these extra tokens, the ViT encoder receives additional learning signals, enabling it to learn more comprehensive representations that are beneficial to RL tasks. In experiments on benchmarks including the DeepMind Control Suite (DMControl) and Atari games, we demonstrate that the proposed approach outperforms the baselines that utilize ViT encoders, achieving state-of-the-art performance in 4 out of 5 tasks in DMControl.
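The token mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single attention layer, the random weights, the dimensions, and the names `encode`, `rl_out`, `visual_out`, and `dynamics_out` are all assumptions chosen for illustration, and the auxiliary losses themselves are not shown. The sketch only demonstrates how several learnable tokens can share one encoder and still yield distinct output embeddings.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (tokens, dim) array."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row now sums to 1
    return weights @ v

def encode(patches, extra_tokens, params):
    """Prepend the extra tokens to the patch embeddings and apply one attention layer.

    Returns one output embedding per extra token, in the order given.
    """
    x = np.concatenate([extra_tokens, patches], axis=0)
    x = x + self_attention(x, *params)  # residual connection
    return x[: len(extra_tokens)]

rng = np.random.default_rng(0)
dim, num_patches = 16, 9  # illustrative sizes, not taken from the paper

patches = rng.normal(size=(num_patches, dim))    # stand-in for image patch embeddings
extra_tokens = rng.normal(size=(3, dim)) * 0.02  # hypothetical: RL, visual, dynamics tokens
params = tuple(rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

# Each token attends to the same patches but produces its own embedding,
# which could then feed the RL head or one of the two auxiliary task heads.
rl_out, visual_out, dynamics_out = encode(patches, extra_tokens, params)
```

In a trained model the three outputs would each receive gradients from a different objective, and all of those gradients would flow back into the shared encoder weights.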
| Original language | English |
|---|---|
| Title of host publication | 2024 IEEE International Conference on Image Processing, ICIP 2024 - Proceedings |
| Publisher | IEEE Computer Society |
| Pages | 1302-1308 |
| Number of pages | 7 |
| ISBN (Electronic) | 9798350349399 |
| State | Published - 2024 |
| Event | 31st IEEE International Conference on Image Processing, ICIP 2024 - Abu Dhabi, United Arab Emirates; Duration: 27 Oct 2024 → 30 Oct 2024 |
Publication series
| Name | Proceedings - International Conference on Image Processing, ICIP |
|---|---|
| ISSN (Print) | 1522-4880 |
Conference
| Conference | 31st IEEE International Conference on Image Processing, ICIP 2024 |
|---|---|
| Country/Territory | United Arab Emirates |
| City | Abu Dhabi |
| Period | 27/10/24 → 30/10/24 |
Bibliographical note
Publisher Copyright: © 2024 IEEE.
Keywords
- Reinforcement Learning
- Self-Supervised Learning
- Vision Transformer