TY - GEN
T1 - EmoVLM-KD
T2 - 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2025
AU - Lee, Sang Eun
AU - Lee, Yubeen
AU - Park, Eunil
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Visual emotion analysis, which has gained considerable attention in the field of affective computing, aims to predict the dominant emotions conveyed by an image. Despite advances in visual emotion analysis brought by vision-language models, we observed that instruction-tuned vision-language models and conventional vision models exhibit complementary strengths: vision-language models excel in certain cases, whereas vision models perform better in others. This finding highlights the need to integrate these capabilities to improve visual emotion analysis. To bridge this gap, we propose EmoVLM-KD, an instruction-tuned vision-language model augmented with a lightweight module distilled from conventional vision models. Instead of deploying both models simultaneously, which incurs high computational costs, we transfer the predictive patterns of a conventional vision model into the vision-language model using a knowledge distillation framework. Our approach first fine-tunes a vision-language model on emotion-specific instruction data and then attaches a distilled module to its visual encoder while keeping the vision-language model frozen. Predictions from the vision-language model and the distillation module are balanced by a gate module, which then produces the final prediction. Extensive experiments show that EmoVLM-KD achieves state-of-the-art performance on multiple visual emotion analysis benchmark datasets, outperforming existing methods while maintaining computational efficiency. The code is available at https://github.com/sange1104/EmoVLM-KD.
UR - https://www.scopus.com/pages/publications/105017855948
U2 - 10.1109/CVPRW67362.2025.00560
DO - 10.1109/CVPRW67362.2025.00560
M3 - Conference contribution
AN - SCOPUS:105017855948
T3 - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
SP - 5633
EP - 5642
BT - Proceedings - 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2025
PB - IEEE Computer Society
Y2 - 11 June 2025 through 12 June 2025
ER -