A computation-efficient network with feature aggregation for cancer subtype classification on histopathological images.
Zong Fan, Chaojie Zhang, Lulu Sun, Wade Thorstad, Hiram Gay, Xiaowei Wang, Hua Li
Abstract
Histopathology whole-slide images (WSIs) capture detailed structural and morphological features of tumor tissue, offering rich histological and molecular information. Deep learning (DL) methods have emerged to assist in automatically examining histopathology WSIs and supporting tumor classification. Traditional DL approaches for WSIs face challenges due to the intrinsic complexity of tumor tissue characteristics and the extremely large image size. Multiple instance learning (MIL) methods have been proposed to address these issues by splitting each WSI into small non-overlapping tiles and aggregating predictions from selected informative tiles for the final classification outcome. However, MIL methods still face challenges such as the need for accurate pseudo-labels, the risk of losing local information, and the failure to learn explicit class-relevant information. To address these limitations, we propose a novel framework that uses a lightweight convolutional neural network (CNN)-based tile encoder (CTE) to extract local tile features and a Transformer-based feature aggregator (TFA) to fuse local features into a representative global feature for WSI classification. The three key contributions of our framework are as follows. Firstly, we design a two-stage training strategy that decouples lightweight CTE pre-training (using sparsely sampled tiles for efficiency) from TFA fine-tuning (using all tiles for accuracy). This strategy significantly reduces computational costs compared to existing MIL methods while alleviating local information loss. Secondly, we design dynamic self-attention-based aggregation in the TFA, leveraging the Transformer's self-attention to weigh all local tile features without requiring accurate pseudo-labels. This aggregation ensures comprehensive integration of local information from both tumor and ambiguous non-tumor regions to enrich the global representation of each input WSI, which benefits the final classification performance. 
Finally, interpretable saliency maps are generated from TFA attention scores, highlighting histopathologically relevant regions to align model decisions with clinical reasoning. Comprehensive experiments on three cancer subtype datasets demonstrate the advantage of our proposed method over existing MIL approaches. We also investigate how various factors affect model performance to gain deeper insight into our method. Our framework achieves higher classification accuracy while maintaining computational efficiency, making it a promising tool for histopathology image analysis.
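The aggregation idea described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single attention head, a class token prepended to the tile features, randomly initialized projection matrices, and NumPy in place of a deep learning framework. The names `aggregate_tiles`, `cls_token`, `w_q`, `w_k`, and `w_v` are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def aggregate_tiles(tile_feats, w_q, w_k, w_v, cls_token):
    """Fuse per-tile features into one global feature with self-attention.

    Assumption: a single head and a prepended class token; the abstract
    does not specify these details.

    tile_feats: (n_tiles, d) local features, e.g. from a tile encoder.
    Returns the global feature (d,) and a per-tile saliency vector
    (n_tiles,) derived from the class token's attention row.
    """
    # Prepend the class token so it can attend to every tile.
    tokens = np.vstack([cls_token[None, :], tile_feats])  # (n_tiles + 1, d)

    q = tokens @ w_q
    k = tokens @ w_k
    v = tokens @ w_v

    # Scaled dot-product attention; each row of attn sums to 1.
    scores = q @ k.T / np.sqrt(k.shape[1])
    attn = softmax(scores, axis=-1)
    out = attn @ v

    # The class token's output serves as the slide-level representation.
    global_feat = out[0]

    # Attention from the class token to each tile, renormalized over tiles.
    tile_attn = attn[0, 1:]
    saliency = tile_attn / tile_attn.sum()
    return global_feat, saliency


# Toy inputs: 6 tiles with 8-dimensional features (random placeholders).
rng = np.random.default_rng(0)
n_tiles, d = 6, 8
tile_feats = rng.normal(size=(n_tiles, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
cls_token = rng.normal(size=d)

global_feat, saliency = aggregate_tiles(tile_feats, w_q, w_k, w_v, cls_token)
```

In this sketch every tile contributes to the global feature with a learned weight, so no tile-level pseudo-labels are needed, and the same weights can be mapped back to tile positions to draw a saliency map. A practical model would likely use multiple heads, several Transformer layers, and trained parameters.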