TY - GEN
T1 - Spatial Cross-Attention for Transformer-Based Image Captioning
AU - Anh Ngo, Khoa
AU - Shim, Kyuhong
AU - Shim, Byonghyo
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
AB - Transformer-based networks have achieved great success in image captioning thanks to the attention mechanism, which finds the relevant image locations for each word. However, the conventional cross-attention process, which aligns words to image patches, does not consider the spatial relationships between the patches themselves. This lack of spatial information can produce incorrect descriptions that fail to express the positional relationships among objects. In this paper, we introduce a novel cross-attention architecture that exploits spatial information derived from the coordinate differences between relevant image patches. In doing so, our cross-attention process dynamically accounts for both the related contents and their spatial relationships during caption generation. In addition, we present an efficient implementation of the relative spatial attention based on convolutional operations. Experimental results show that the proposed spatial cross-attention helps captions correctly describe the spatial relationships of objects, yielding a 0.7 CIDEr score improvement on the MS-COCO dataset over the previous state-of-the-art.
KW - Image captioning
KW - Positional embedding
KW - Spatial cross-attention
KW - Transformer
UR - https://www.scopus.com/pages/publications/86000372007
U2 - 10.1109/ICASSP49357.2023.10096823
DO - 10.1109/ICASSP49357.2023.10096823
M3 - Conference contribution
AN - SCOPUS:86000372007
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
BT - ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023
Y2 - 4 June 2023 through 10 June 2023
ER -