IEEE Transactions on Image Processing: A Publication of the IEEE Signal Processing Society

Caption Assisted Multimodal Large Language Model for Video Moment Retrieval.

Peiyu Xie, Jinxing Li, Guangming Lu, Yong Xu, David Zhang

Published: 2025

DOI: 10.1109/TIP.2025.3620124

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated significant potential across various multimodal tasks, including retrieval, summarization, and reasoning. However, it remains a substantial challenge for MLLMs to understand and precisely ret…

