Unveiling the Potential of Multimodal Large Language Models for Scene Text Segmentation via Semantic-Enhanced Features

  • Ho Jun Kim
  • Hyung Kyu Kim
  • Sangmin Lee
  • Hak Gu Kim

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Scene text segmentation aims to accurately identify text regions within a scene while disregarding non-textual elements such as background imagery or graphics. However, current text segmentation models often fail to segment text regions accurately because of complex background noise or varied font styles and sizes. To address this issue, it is essential to consider not only visual information but also the semantic information of the text. For this purpose, we propose a novel semantic-aware scene text segmentation framework that incorporates multimodal large language models (MLLMs) to fuse visual, text, and linguistic information. By leveraging semantic-enhanced features from MLLMs, the scene text segmentation model can remove false positives that are visually confusing but not recognized as text. Both qualitative and quantitative evaluations demonstrate that MLLMs improve scene text segmentation performance.
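
The idea of removing visually confusing false positives can be illustrated with a toy example. The sketch below is not the framework proposed in the paper, whose architecture the abstract does not describe. It assumes a hypothetical setup in which a visual branch outputs per-pixel text probabilities, pixels are grouped into candidate regions, and a hypothetical semantic score per region (standing in for what an MLLM might provide) says how strongly that region is recognized as text. All names (`gate_with_semantics`, `region_ids`, `semantic_score`) and the 0.5 threshold are illustrative assumptions.

```python
def gate_with_semantics(visual_prob, region_ids, semantic_score, threshold=0.5):
    """Toy semantic gating (illustrative assumption, not the paper's method).

    visual_prob:    2D list of per-pixel text probabilities from a visual model.
    region_ids:     2D list assigning each pixel to a candidate region id.
    semantic_score: dict mapping region id -> hypothetical score in [0, 1]
                    for "this region is recognized as text".
    Returns a binary mask keeping a pixel only when the product of its
    visual probability and its region's semantic score reaches the threshold.
    """
    mask = []
    for prob_row, id_row in zip(visual_prob, region_ids):
        mask_row = []
        for prob, region in zip(prob_row, id_row):
            # Regions without a semantic score are treated as non-text.
            score = semantic_score.get(region, 0.0)
            mask_row.append(1 if prob * score >= threshold else 0)
        mask.append(mask_row)
    return mask


# Region 1 looks like text and is recognized as text.
# Region 2 looks like text (e.g. a fence or stripes) but is not recognized.
visual_prob = [
    [0.9, 0.9, 0.1, 0.8],
    [0.9, 0.8, 0.1, 0.8],
]
region_ids = [
    [1, 1, 0, 2],
    [1, 1, 0, 2],
]
semantic_score = {1: 0.95, 2: 0.10}

mask = gate_with_semantics(visual_prob, region_ids, semantic_score)
print(mask)
```

In this toy input, region 2 has high visual probability but a low semantic score, so it is suppressed while region 1 is kept. A learned model would fuse features rather than multiply scalar scores; the multiplication here is only a stand-in to make the effect visible.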

Original language: English
Title of host publication: 2024 IEEE International Conference on Image Processing Challenges and Workshops, ICIPCW 2024 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 4210-4215
Number of pages: 6
ISBN (Electronic): 9798331515942
State: Published - 2024
Externally published: Yes
Event: 31st IEEE International Conference on Image Processing Challenges and Workshops, ICIPCW 2024 - Abu Dhabi, United Arab Emirates
Duration: 27 Oct 2024 to 30 Oct 2024

Publication series

Name: 2024 IEEE International Conference on Image Processing Challenges and Workshops, ICIPCW 2024 - Proceedings

Conference

Conference: 31st IEEE International Conference on Image Processing Challenges and Workshops, ICIPCW 2024
Country/Territory: United Arab Emirates
City: Abu Dhabi
Period: 27/10/24 to 30/10/24

Keywords

  • Large language models (LLMs)
  • Multimodal LLMs
  • Scene text segmentation
