TY - GEN
T1 - Unveiling the Potential of Multimodal Large Language Models for Scene Text Segmentation via Semantic-Enhanced Features
AU - Kim, Ho Jun
AU - Kim, Hyung Kyu
AU - Lee, Sangmin
AU - Kim, Hak Gu
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
AB - Scene text segmentation aims to accurately identify text areas within a scene while disregarding non-textual elements such as background imagery or graphical elements. However, current text segmentation models often fail to accurately segment text regions due to complex background noise or varied font styles and sizes. To address this issue, it is essential to consider not only visual information but also the semantic information of text in scene text segmentation. For this purpose, we propose a novel semantic-aware scene text segmentation framework, which incorporates multimodal large language models (MLLMs) to fuse visual, text, and linguistic information. By leveraging semantic-enhanced features from multimodal LLMs, the scene text segmentation model can remove false positives that are visually confusing but not recognized as text. Both qualitative and quantitative evaluations demonstrate that multimodal LLMs improve scene text segmentation performance.
KW - Large language models (LLMs)
KW - Multimodal LLMs
KW - Scene text segmentation
UR - https://www.scopus.com/pages/publications/85214675560
U2 - 10.1109/ICIPCW64161.2024.10769199
DO - 10.1109/ICIPCW64161.2024.10769199
M3 - Conference contribution
AN - SCOPUS:85214675560
T3 - 2024 IEEE International Conference on Image Processing Challenges and Workshops, ICIPCW 2024 - Proceedings
SP - 4210
EP - 4215
BT - 2024 IEEE International Conference on Image Processing Challenges and Workshops, ICIPCW 2024 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 31st IEEE International Conference on Image Processing Challenges and Workshops, ICIPCW 2024
Y2 - 27 October 2024 through 30 October 2024
ER -