Keyword-Conditioned Image Segmentation via the Cross-Attentive Alignment of Language and Vision Sensor Data.
Hye Rim Kim, Byoung Chul Ko
Abstract
Advancements in multimodal large language models have opened up new possibilities for reasoning-based image segmentation by jointly processing visual and linguistic information. However, existing approaches often suffer from a semantic discrepancy between language interpretation and visual segmentation, because query understanding and segmentation execution are not structurally connected. To address this issue, we propose a keyword-conditioned image segmentation model (KeySeg), a novel architecture that explicitly encodes inferred query conditions and integrates them into the segmentation process. KeySeg embeds the core concepts extracted from multimodal inputs into a dedicated [KEY] token, which is then fused with a [SEG] token through a cross-attention-based fusion module. This design enables the model to reflect query conditions explicitly and precisely in the segmentation criteria. Additionally, we introduce a keyword alignment loss that guides the [KEY] token to align closely with the semantic core of the input query, thereby enhancing the accuracy of condition interpretation. By separating the roles of condition reasoning and segmentation instruction and making their interaction explicit, KeySeg achieves both expressive capacity and interpretative stability, even under complex language conditions.
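The two mechanisms named in the abstract, cross-attention fusion of the [KEY] and [SEG] tokens and a keyword alignment loss, can be illustrated with a minimal sketch. The abstract does not specify either one in detail, so everything below is an assumption made for illustration: a single attention head with identity projections and a residual connection, a context set that the [SEG] query attends over, and a loss defined as one minus cosine similarity. The function names `cross_attention_fuse` and `keyword_alignment_loss` are hypothetical and are not taken from the paper.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(seg, context):
    """Fuse a [SEG] query vector with context vectors such as [KEY] embeddings.

    Assumptions (not from the paper): one head, identity Q/K/V projections,
    scaled dot-product scores, and a residual connection on the query.
    """
    scale = math.sqrt(len(seg))
    weights = softmax([dot(seg, k) / scale for k in context])
    # Weighted sum of the context vectors, one dimension at a time.
    attended = [
        sum(w * k[i] for w, k in zip(weights, context))
        for i in range(len(seg))
    ]
    fused = [s + a for s, a in zip(seg, attended)]
    return fused, weights


def keyword_alignment_loss(key_vec, keyword_vec):
    """Assumed loss form: 1 - cosine similarity between the [KEY] embedding
    and an embedding of the query's core keyword."""
    norm = math.sqrt(dot(key_vec, key_vec)) * math.sqrt(dot(keyword_vec, keyword_vec))
    return 1.0 - dot(key_vec, keyword_vec) / norm


# Toy 4-dimensional embeddings, chosen only to exercise the functions.
seg = [1.0, 0.0, 0.0, 0.0]
context = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
fused, weights = cross_attention_fuse(seg, context)
```

In this toy setup the [SEG] query places more weight on the context vector it is most similar to, and the assumed loss falls to zero when the [KEY] embedding points in the same direction as the keyword embedding. A trained model would use learned projections and, most likely, multiple heads.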