MART: Masked Affective RepresenTation Learning via Masked Temporal Distribution Distillation

  • Zhicheng Zhang
  • Pancheng Zhao
  • Eunil Park
  • Jufeng Yang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Limited training data is a long-standing problem for video emotion analysis (VEA). Existing works leverage the power of large-scale image datasets for transferring while failing to extract the temporal correlation of affective cues in the video. Inspired by psychology research and empirical theory, we verify that the degree of emotion may vary in different segments of the video, thus introducing the sentiment complementary and emotion intrinsic among temporal segments. We propose an MAE-style method for learning robust affective representation of videos via masking, termed MART. First, we extract the affective cues of the lexicon and verify the extracted one by computing its matching score with video content, in terms of sentiment and emotion scores alongside the temporal dimension. Then, with the verified cues, we propose masked affective modeling to recover temporal emotion distribution. We present temporal affective complementary learning that pulls the complementary part and pushes the intrinsic one of masked multimodal features, where the constraint is set with cross-modal attention among features to mask the video and recover the degree of emotion among segments. Extensive experiments on five benchmarks show the superiority of our method in video sentiment analysis, video emotion recognition, multimodal sentiment analysis, and multimodal emotion recognition.
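
As a rough illustration of the masking-and-recovery idea summarized above, the following minimal sketch hides some temporal segments, lets a stand-in "student" guess their scores, and measures how far the recovered temporal emotion distribution is from the original one. It is not the authors' implementation: the per-segment scores, the mask ratio, the mean-fill student, and every function name are hypothetical, and the cross-modal attention and lexicon verification steps are left out entirely.

```python
import math
import random


def softmax(scores):
    """Turn per-segment scores into a distribution over temporal segments."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mask_segments(num_segments, mask_ratio, rng):
    """Choose which temporal segments to hide; returns sorted indices."""
    k = max(1, int(round(num_segments * mask_ratio)))
    return sorted(rng.sample(range(num_segments), k))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), used here as an assumed distillation loss."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Hypothetical emotion-intensity scores for four video segments.
segment_scores = [0.2, 1.5, 3.0, 0.7]
target = softmax(segment_scores)  # distribution computed from the unmasked video

rng = random.Random(0)
masked = mask_segments(len(segment_scores), mask_ratio=0.5, rng=rng)

# Stand-in student: it only sees the visible segments and fills each
# masked segment with the mean of the visible scores.
visible = [s for i, s in enumerate(segment_scores) if i not in masked]
fill = sum(visible) / len(visible)
student_scores = [fill if i in masked else s
                  for i, s in enumerate(segment_scores)]
predicted = softmax(student_scores)

# Distance between the original and the recovered temporal distribution.
loss = kl_divergence(target, predicted)
print("masked segments:", masked)
print("distillation loss:", loss)
```

In a trained model the mean-fill step would be replaced by a learned predictor, and the loss would be minimized so that the distribution recovered from the masked video approaches the one computed from the full video.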

Original language: English
Title of host publication: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Publisher: IEEE Computer Society
Pages: 12830-12840
Number of pages: 11
ISBN (Electronic): 9798350353006
ISBN (Print): 9798350353006
State: Published - 2024
Event: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 - Seattle, United States
Duration: 16 Jun 2024 → 22 Jun 2024

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
ISSN (Print): 1063-6919

Conference

Conference: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Country/Territory: United States
City: Seattle
Period: 16/06/24 → 22/06/24

Keywords

  • Masked Autoencoder
  • Video Emotion Analysis
