IEEE transactions on pattern analysis and machine intelligence
TextMonkey: an OCR-Free Large Multimodal Model for Understanding Document.
Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, Xiang Bai
Published: 2026
DOI: 10.1109/TPAMI.2026.3653415
Abstract
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks. Our approach introduces enhancements across several dimensions: by adopting a Shifted Window Attention layer, we achieve cross-window connectivity at higher input res…