IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society
Caption Assisted Multimodal Large Language Model for Video Moment Retrieval.
Peiyu Xie, Jinxing Li, Guangming Lu, Yong Xu, David Zhang
Published: 2025
DOI: 10.1109/TIP.2025.3620124
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated significant potential across various multimodal tasks, including retrieval, summarization, and reasoning. However, it remains a substantial challenge for MLLMs to understand and precisely ret…