TY - GEN
T1 - Perceive before Respond
T2 - 32nd ACM International Conference on Multimedia, MM 2024
AU - Xia, Wuyou
AU - Liu, Shengzhe
AU - Rong, Qin
AU - Jia, Guoli
AU - Park, Eunil
AU - Yang, Jufeng
N1 - Publisher Copyright:
© 2024 ACM.
PY - 2024/10/28
Y1 - 2024/10/28
N2 - In online chatting, people increasingly prefer using stickers to supplement or replace text in replies, as sticker images can express vivid and varied emotions. The Sticker Response Selection (SRS) task aims to predict the sticker image that is most relevant to the dialogue history. Previous research explores the semantic similarity between context and stickers, overlooking both unimodal and cross-modal emotional information. In this paper, we propose a 'Perceive before Respond' (PBR) training paradigm. PBR perceives sticker emotions through a knowledge distillation module. Diverse representations of each emotion category are acquired from a large-scale sticker emotion recognition dataset and distilled into our model to enhance emotion comprehension. We further distinguish stickers with similar subject elements under the same topic, performing contrastive learning at both inter- and intra-topic levels to obtain discriminative and diverse sticker representations. In addition, we improve the hard negative sampling method for image-text matching based on cross-modal sentiment association, conducting hard sample mining by both semantic similarity and sentiment polarity similarity for sticker-to-dialogue and dialogue-to-sticker matching. Extensive experiments verify the effectiveness of each proposed component, and ablation experiments on different backbone networks demonstrate the generality of our approach. Our code is released at https://github.com/wuyou-xia/Perceive-before-Respond.
AB - In online chatting, people increasingly prefer using stickers to supplement or replace text in replies, as sticker images can express vivid and varied emotions. The Sticker Response Selection (SRS) task aims to predict the sticker image that is most relevant to the dialogue history. Previous research explores the semantic similarity between context and stickers, overlooking both unimodal and cross-modal emotional information. In this paper, we propose a 'Perceive before Respond' (PBR) training paradigm. PBR perceives sticker emotions through a knowledge distillation module. Diverse representations of each emotion category are acquired from a large-scale sticker emotion recognition dataset and distilled into our model to enhance emotion comprehension. We further distinguish stickers with similar subject elements under the same topic, performing contrastive learning at both inter- and intra-topic levels to obtain discriminative and diverse sticker representations. In addition, we improve the hard negative sampling method for image-text matching based on cross-modal sentiment association, conducting hard sample mining by both semantic similarity and sentiment polarity similarity for sticker-to-dialogue and dialogue-to-sticker matching. Extensive experiments verify the effectiveness of each proposed component, and ablation experiments on different backbone networks demonstrate the generality of our approach. Our code is released at https://github.com/wuyou-xia/Perceive-before-Respond.
KW - multimodal learning
KW - sticker response selection
UR - https://www.scopus.com/pages/publications/85209782561
U2 - 10.1145/3664647.3680987
DO - 10.1145/3664647.3680987
M3 - Conference contribution
AN - SCOPUS:85209782561
T3 - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
SP - 9631
EP - 9640
BT - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
Y2 - 28 October 2024 through 1 November 2024
ER -