Abstract
Recently, transformer-based architectures have been shown to outperform classic convolutional architectures and have rapidly been established as state-of-the-art models for many medical vision tasks. Their superior performance can be explained by their ability to capture long-range dependencies through their multi-head self-attention mechanism. However, they tend to overfit on small- or even medium-sized datasets because of their weak inductive bias. As a result, they require massive labeled datasets, which are expensive to obtain, especially in the medical domain. This motivated us to explore unsupervised semantic feature learning without any form of annotation. In this work, we aimed to learn semantic features in a self-supervised manner by training transformer-based models to segment the numerical signals of geometric shapes inserted into original computed tomography (CT) images. Moreover, we developed a Convolutional Pyramid vision Transformer (CPT) that leverages multi-kernel convolutional patch embedding and local spatial reduction in each of its layers to generate multi-scale features, capture local information, and reduce computational cost. Using these approaches, we noticeably outperformed state-of-the-art deep learning-based segmentation and classification models on a liver cancer CT dataset of 5,237 patients, a pancreatic cancer CT dataset of 6,063 patients, and a breast cancer MRI dataset of 127 patients.
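The pretext task described above, segmenting inserted geometric shapes, can be illustrated with a minimal data-generation sketch. The shape types, sizes, and intensity values below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def make_pretext_sample(ct_slice, rng=None, n_shapes=3):
    """Insert random geometric shapes into a CT slice and return the
    modified image plus a segmentation mask of the inserted shapes.

    `ct_slice` is a 2-D float array (H, W). Shape types, intensity
    range, and sizes are illustrative choices, not the published ones.
    """
    rng = rng or np.random.default_rng()
    img = ct_slice.copy()
    mask = np.zeros_like(img, dtype=np.int64)
    h, w = img.shape
    for label in range(1, n_shapes + 1):
        cy, cx = rng.integers(0, h), rng.integers(0, w)
        r = rng.integers(5, 20)                    # shape radius in pixels
        value = rng.uniform(img.min(), img.max())  # numerical signal to embed
        yy, xx = np.ogrid[:h, :w]
        if rng.random() < 0.5:                     # circle
            region = (yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2
        else:                                      # square
            region = (np.abs(yy - cy) <= r) & (np.abs(xx - cx) <= r)
        img[region] = value                        # overwrite pixels with the signal
        mask[region] = label                       # per-shape segmentation target
    return img, mask
```

The multi-kernel convolutional patch embedding of CPT can likewise be sketched as parallel convolutions with different kernel sizes whose outputs are concatenated and flattened into tokens. Kernel sizes, stride, and dimensions here are assumptions for illustration, not the published CPT configuration:

```python
import torch
import torch.nn as nn

class MultiKernelPatchEmbed(nn.Module):
    """Patch embedding via parallel convolutions with different kernel
    sizes, concatenated channel-wise into one embedding. Hyperparameters
    are illustrative, not the paper's exact configuration.
    """
    def __init__(self, in_ch=1, embed_dim=96, kernel_sizes=(3, 5, 7), stride=4):
        super().__init__()
        assert embed_dim % len(kernel_sizes) == 0
        ch = embed_dim // len(kernel_sizes)
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, ch, kernel_size=k, stride=stride, padding=k // 2)
            for k in kernel_sizes
        )
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                                         # x: (B, C, H, W)
        feats = torch.cat([b(x) for b in self.branches], dim=1)   # (B, D, H/s, W/s)
        tokens = feats.flatten(2).transpose(1, 2)                 # (B, N, D)
        return self.norm(tokens), feats.shape[2:]
```

The parallel branches give each token several local receptive fields at once, supplying the convolutional inductive bias that, per the abstract, plain vision transformers lack; the strided convolutions also shrink the token grid, which is where the computational savings come from.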
Original language | English |
---|---|
Pages (from-to) | 2003-2014 |
Number of pages | 12 |
Journal | IEEE Journal of Biomedical and Health Informatics |
Volume | 27 |
Issue number | 4 |
DOIs | |
State | Published - 1 Apr 2023 |
Bibliographical note
Publisher Copyright: © 2013 IEEE.
Keywords
- CT images
- MRI images
- breast cancer
- cancer classification
- cancer segmentation
- liver cancer
- pancreatic cancer
- self-supervised pretraining
- vision transformer