TY - GEN
T1 - Modeling Multimodal Social Interactions
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
AU - Lee, Sangmin
AU - Lai, Bolin
AU - Ryan, Fiona
AU - Boote, Bikram
AU - Rehg, James M.
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Understanding social interactions involving both verbal and non-verbal cues is essential for effectively interpreting social situations. However, most prior works on multimodal social cues focus predominantly on single-person behaviors or rely on holistic visual representations that are not aligned to utterances in multi-party environments. Consequently, they are limited in modeling the intricate dynamics of multi-party interactions. In this paper, we introduce three new challenging tasks to model the fine-grained dynamics between multiple people: speaking target identification, pronoun coreference resolution, and mentioned player prediction. We contribute extensive data annotations to curate these new challenges in social deduction game settings. Furthermore, we propose a novel multimodal baseline that leverages densely aligned language-visual representations by synchronizing visual features with their corresponding utterances. This facilitates concurrently capturing verbal and non-verbal cues pertinent to social reasoning. Experiments demonstrate the effectiveness of the proposed approach with densely aligned multimodal representations in modeling fine-grained social interactions. Project website: https://sangmin-git.github.io/projects/MMSI.
AB - Understanding social interactions involving both verbal and non-verbal cues is essential for effectively interpreting social situations. However, most prior works on multimodal social cues focus predominantly on single-person behaviors or rely on holistic visual representations that are not aligned to utterances in multi-party environments. Consequently, they are limited in modeling the intricate dynamics of multi-party interactions. In this paper, we introduce three new challenging tasks to model the fine-grained dynamics between multiple people: speaking target identification, pronoun coreference resolution, and mentioned player prediction. We contribute extensive data annotations to curate these new challenges in social deduction game settings. Furthermore, we propose a novel multimodal baseline that leverages densely aligned language-visual representations by synchronizing visual features with their corresponding utterances. This facilitates concurrently capturing verbal and non-verbal cues pertinent to social reasoning. Experiments demonstrate the effectiveness of the proposed approach with densely aligned multimodal representations in modeling fine-grained social interactions. Project website: https://sangmin-git.github.io/projects/MMSI.
UR - https://www.scopus.com/pages/publications/85208958541
U2 - 10.1109/CVPR52733.2024.01382
DO - 10.1109/CVPR52733.2024.01382
M3 - Conference contribution
AN - SCOPUS:85208958541
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 14585
EP - 14595
BT - Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
PB - IEEE Computer Society
Y2 - 16 June 2024 through 22 June 2024
ER -