TY - GEN
T1 - Multi-View Spatial-Temporal Learning for Understanding Unusual Behaviors in Untrimmed Naturalistic Driving Videos
AU - Nguyen, Huy Hung
AU - Tran, Chi Dai
AU - Hoang Pham, Long
AU - Tran, Duong Nguyen Ngoc
AU - Huu-Phuong Tran, Tai
AU - Vu, Duong Khac
AU - Pham-Nam Ho, Quoc
AU - Huynh, Ngoc Doan Minh
AU - Jeon, Hyung Min
AU - Jeon, Hyung Joon
AU - Jeon, Jae Wook
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - The task of Naturalistic Driving Action Recognition aims to detect and temporally localize distracting driving behaviors in untrimmed videos. In this paper, we introduce our framework for Track 3 of the 8th AI City Challenge in 2024. The approach is primarily based on large model fine-tuning and ensemble techniques to train a set of action recognition models on a small-scale dataset. Starting with raw videos, we segment them into individual action sequences based on their annotations. We then fine-tune four different action recognition models, with K-fold cross-validation applied to the segmented data. Following this, we perform a multi-view ensemble, selecting the most visible camera views for each action class to generate clip-level classification results for each video. Finally, a multi-step post-processing algorithm, designed for the specific characteristics of the AI City Challenge dataset, is employed to perform temporal action localization and produce temporal segments for the actions. Our solution achieves a final mOS score of 0.7798 and ranks 5th on the public leaderboard for test set A2 of the challenge. The source code will be publicly available at https://github.com/SKKUAutoLab/AIC24-Track03.
KW - action recognition
UR - https://www.scopus.com/pages/publications/85206452272
U2 - 10.1109/CVPRW63382.2024.00709
DO - 10.1109/CVPRW63382.2024.00709
M3 - Conference contribution
AN - SCOPUS:85206452272
T3 - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
SP - 7144
EP - 7152
BT - Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024
PB - IEEE Computer Society
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024
Y2 - 16 June 2024 through 22 June 2024
ER -