TY - JOUR
T1 - Unsupervised Visual Representation Learning Based on Segmentation of Geometric Pseudo-Shapes for Transformer-Based Medical Tasks
AU - Viriyasaranon, Thanaporn
AU - Woo, Sang Myung
AU - Choi, Jang Hwan
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023/4/1
Y1 - 2023/4/1
N2 - Recently, transformer-based architectures have been shown to outperform classic convolutional architectures and have rapidly been established as state-of-the-art models for many medical vision tasks. Their superior performance can be explained by the ability of their multi-head self-attention mechanism to capture long-range dependencies. However, they tend to overfit on small- or even medium-sized datasets because of their weak inductive bias. As a result, they require massive labeled datasets, which are expensive to obtain, especially in the medical domain. This motivated us to explore unsupervised semantic feature learning without any form of annotation. In this work, we aimed to learn semantic features in a self-supervised manner by training transformer-based models to segment the numerical signals of geometric shapes inserted into original computed tomography (CT) images. Moreover, we developed a Convolutional Pyramid vision Transformer (CPT) that leverages multi-kernel convolutional patch embedding and local spatial reduction in each of its layers to generate multi-scale features, capture local information, and reduce computational cost. Using these approaches, we noticeably outperformed state-of-the-art deep learning-based segmentation and classification models on a liver cancer CT dataset of 5,237 patients, a pancreatic cancer CT dataset of 6,063 patients, and a breast cancer MRI dataset of 127 patients.
AB - Recently, transformer-based architectures have been shown to outperform classic convolutional architectures and have rapidly been established as state-of-the-art models for many medical vision tasks. Their superior performance can be explained by the ability of their multi-head self-attention mechanism to capture long-range dependencies. However, they tend to overfit on small- or even medium-sized datasets because of their weak inductive bias. As a result, they require massive labeled datasets, which are expensive to obtain, especially in the medical domain. This motivated us to explore unsupervised semantic feature learning without any form of annotation. In this work, we aimed to learn semantic features in a self-supervised manner by training transformer-based models to segment the numerical signals of geometric shapes inserted into original computed tomography (CT) images. Moreover, we developed a Convolutional Pyramid vision Transformer (CPT) that leverages multi-kernel convolutional patch embedding and local spatial reduction in each of its layers to generate multi-scale features, capture local information, and reduce computational cost. Using these approaches, we noticeably outperformed state-of-the-art deep learning-based segmentation and classification models on a liver cancer CT dataset of 5,237 patients, a pancreatic cancer CT dataset of 6,063 patients, and a breast cancer MRI dataset of 127 patients.
KW - CT images
KW - MRI images
KW - breast cancer
KW - cancer classification
KW - cancer segmentation
KW - liver cancer
KW - pancreatic cancer
KW - self-supervised pretraining
KW - vision transformer
UR - http://www.scopus.com/inward/record.url?scp=85147279127&partnerID=8YFLogxK
U2 - 10.1109/JBHI.2023.3237596
DO - 10.1109/JBHI.2023.3237596
M3 - Article
AN - SCOPUS:85147279127
SN - 2168-2194
VL - 27
SP - 2003
EP - 2014
JO - IEEE Journal of Biomedical and Health Informatics
JF - IEEE Journal of Biomedical and Health Informatics
IS - 4
ER -