TY - JOUR
T1 - Hybrid data augmentation and deep attention-based dilated convolutional-recurrent neural networks for speech emotion recognition
AU - Pham, Nhat Truong
AU - Dang, Duc Ngoc Minh
AU - Nguyen, Ngoc Duy
AU - Nguyen, Thanh Thi
AU - Nguyen, Hai
AU - Manavalan, Balachandran
AU - Lim, Chee Peng
AU - Nguyen, Sy Dzung
N1 - Publisher Copyright:
© 2023 The Author(s)
PY - 2023/11/15
Y1 - 2023/11/15
N2 - Recently, speech emotion recognition (SER) has become an active research area in speech processing, particularly with the advent of deep learning (DL). Numerous DL-based methods have been proposed for SER. However, most existing DL-based models are complex and require large amounts of data to achieve good performance. In this study, a new framework of deep attention-based dilated convolutional-recurrent neural networks coupled with a hybrid data augmentation method was proposed for addressing SER tasks. The hybrid data augmentation method is an upsampling technique for generating additional speech data samples based on traditional and generative adversarial network approaches. By leveraging both convolutional and recurrent neural networks in a dilated form along with an attention mechanism, the proposed DL framework can extract high-level representations from three-dimensional log Mel spectrogram features. Dilated convolutional neural networks acquire larger receptive fields, whereas dilated recurrent neural networks handle complex dependencies and mitigate the vanishing and exploding gradient issues. Furthermore, the loss functions are reconfigured by combining the SoftMax loss and the center-based losses to classify various emotional states. The proposed framework was implemented using the Python programming language and the TensorFlow deep learning library. To validate the proposed framework, the EmoDB and ERC benchmark datasets, which are imbalanced and/or small, were employed. The experimental results indicate that the proposed framework outperforms other related state-of-the-art methods, yielding the highest unweighted recall rates of 88.03 ± 1.39% and 66.56 ± 0.67% for the EmoDB and ERC datasets, respectively.
AB - Recently, speech emotion recognition (SER) has become an active research area in speech processing, particularly with the advent of deep learning (DL). Numerous DL-based methods have been proposed for SER. However, most existing DL-based models are complex and require large amounts of data to achieve good performance. In this study, a new framework of deep attention-based dilated convolutional-recurrent neural networks coupled with a hybrid data augmentation method was proposed for addressing SER tasks. The hybrid data augmentation method is an upsampling technique for generating additional speech data samples based on traditional and generative adversarial network approaches. By leveraging both convolutional and recurrent neural networks in a dilated form along with an attention mechanism, the proposed DL framework can extract high-level representations from three-dimensional log Mel spectrogram features. Dilated convolutional neural networks acquire larger receptive fields, whereas dilated recurrent neural networks handle complex dependencies and mitigate the vanishing and exploding gradient issues. Furthermore, the loss functions are reconfigured by combining the SoftMax loss and the center-based losses to classify various emotional states. The proposed framework was implemented using the Python programming language and the TensorFlow deep learning library. To validate the proposed framework, the EmoDB and ERC benchmark datasets, which are imbalanced and/or small, were employed. The experimental results indicate that the proposed framework outperforms other related state-of-the-art methods, yielding the highest unweighted recall rates of 88.03 ± 1.39% and 66.56 ± 0.67% for the EmoDB and ERC datasets, respectively.
KW - Attention mechanism
KW - Dilated convolutional neural networks
KW - Dilated recurrent neural networks
KW - Generative adversarial networks
KW - Hybrid data augmentation
KW - Long short-term memory
KW - Mel spectrogram features
KW - Short-time Fourier transform
KW - Speech emotion recognition
UR - https://www.scopus.com/pages/publications/85161704678
U2 - 10.1016/j.eswa.2023.120608
DO - 10.1016/j.eswa.2023.120608
M3 - Article
AN - SCOPUS:85161704678
SN - 0957-4174
VL - 230
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 120608
ER -