Link:
https://www.sciencedirect.com/science/article/pii/S2153353924000257
Title:
Masked pre-training of transformers for histology image analysis
Abstract:
In digital pathology, whole-slide images (WSIs) are widely used for applications such as cancer diagnosis and prognosis prediction. Vision transformer (ViT) models have recently emerged as a promising method for encoding large regions of WSIs while preserving spatial relationships among patches. However, due to the large number of model parameters and limited labeled data, applying transformer models to WSIs remains challenging. In this study, we propose a pretext task to train the transformer model in a self-supervised manner. Our model, MaskHIT, uses the transformer output to reconstruct masked patches, measured by a contrastive loss. We pre-trained the MaskHIT model on over 7,000 WSIs from TCGA and extensively evaluated its performance in multiple experiments covering survival prediction, cancer subtype classification, and grade prediction tasks. Our experiments demonstrate that the pre-training procedure enables a context-aware understanding of WSIs, facilitates the learning of representative histological features based on patch positions and visual patterns, and is essential for the ViT model to achieve optimal results on WSI-level tasks. The pre-trained MaskHIT surpasses various multiple instance learning approaches by 3% and 2% on the survival prediction and cancer subtype classification tasks, respectively, and also outperforms recent state-of-the-art transformer-based methods. Finally, a comparison of the attention maps generated by the MaskHIT model with pathologists' annotations indicates that the model can accurately identify clinically relevant histological structures on the whole slide for each task.
Citation:
Jiang S, Hondelink L, Suriawinata AA, Hassanpour S. Masked pre-training of transformers for histology image analysis. Journal of Pathology Informatics. 2024 May 31:100386.
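
Note:
To make the pretext task in the abstract concrete, below is a minimal PyTorch sketch of masked-patch pre-training with an InfoNCE-style contrastive reconstruction loss: a random subset of pre-extracted patch embeddings is replaced by a learned [MASK] token, a transformer encodes the region with positional embeddings, and each masked position's output is trained to match its own original patch feature against the other masked patches as negatives. All module names, dimensions, the mask ratio, and the temperature here are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedPatchPretrainer(nn.Module):
    # Sketch of a masked-patch pretext task in the spirit of MaskHIT.
    # Hyperparameters below (feat_dim, mask_ratio, depth, temperature)
    # are illustrative assumptions.
    def __init__(self, feat_dim=512, n_patches=196, mask_ratio=0.5,
                 n_layers=4, n_heads=8):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Learned [MASK] token and positional embeddings preserve the
        # spatial layout of patches within the WSI region.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        self.pos_embed = nn.Parameter(torch.randn(1, n_patches, feat_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, patch_feats):
        # patch_feats: (B, N, D) pre-extracted patch embeddings.
        B, N, D = patch_feats.shape
        n_mask = int(N * self.mask_ratio)
        # Randomly select which patches to mask in each region.
        idx = torch.rand(B, N, device=patch_feats.device).argsort(dim=1)
        mask_idx = idx[:, :n_mask]  # (B, n_mask)
        mask = torch.zeros(B, N, dtype=torch.bool, device=patch_feats.device)
        mask[torch.arange(B).unsqueeze(1), mask_idx] = True
        # Replace masked patches with the [MASK] token; positions are kept.
        x = torch.where(mask.unsqueeze(-1),
                        self.mask_token.expand(B, N, D),
                        patch_feats)
        x = self.encoder(x + self.pos_embed)
        # Contrastive reconstruction: each masked output should match its
        # own original feature against the other masked patches (negatives).
        pred = F.normalize(x[mask], dim=-1)              # (B*n_mask, D)
        target = F.normalize(patch_feats[mask], dim=-1)  # (B*n_mask, D)
        logits = pred @ target.t() / 0.07                # temperature 0.07
        labels = torch.arange(logits.size(0), device=logits.device)
        return F.cross_entropy(logits, labels)

# Usage: one pre-training step on a batch of 4 regions of 196 patches.
model = MaskedPatchPretrainer()
feats = torch.randn(4, 196, 512)
loss = model(feats)
loss.backward()

A contrastive loss is used here rather than a pixel-level reconstruction loss, which matches the abstract's description of measuring the reconstruction of masked patches in feature space; the choice of negatives (other masked patches in the batch) is one plausible instantiation.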