Journal of imaging

VT-MFLV: Vision-Text Multimodal Feature Learning V Network for Medical Image Segmentation.

Wenju Wang, Jiaqi Li, Zinuo Ye, Yuyang Cai, Zhen Wang, Renwei Zhang

Published: 202510.3390/jimaging11120425

Abstract

Open Access

Currently, existing multimodal segmentation methods face limitations in effectively leveraging medical text to guide visual feature learning. They often suffer from insufficient multimodal fusion and inadequate accuracy in fine-grained lesion segmentation accuracy. To address these challenges, the Vision-Text Multimodal Feature Learning V Network (VT-MFLV) is proposed. This model exploits the complementarity between medical images and text to enhance multimodal fusion, which consequently improves critical lesion recognition accuracy. VT-MFLV introduces three key modules: Diagnostic Image-Text Residual Multi-Head Semantic Encoding (DIT-RMHSE) module that preserves critical semantic cues while reducing preprocessing complexity; Fine-Grained Multimodal Fusion Local Attention Encoding (FG-MFLA) module that strengthens local cross-modal interaction; and Adaptive Global Feature Compression and Focusing (AGCF) module that emphasizes clinically relevant lesion regions. Experiments are conducted on two publicly available pulmonary infection datasets. On the MosMedData dataset, VT-MFLV achieved Dice and mIoU scores of 75.61 ± 0.32% and 63.98 ± 0.29%. On the QaTa-COV1 dataset, VT-MFLV achieved Dice and mIoU scores of 83.34 ± 0.36% and 72.09 ± 0.30%, both reaching world-leading levels.

View at DOI