TY - JOUR
T1 - Deep features-based speech emotion recognition for smart affective services
AU - Badshah, Abdul Malik
AU - Rahim, Nasir
AU - Ullah, Noor
AU - Ahmad, Jamil
AU - Muhammad, Khan
AU - Lee, Mi Young
AU - Kwon, Soonil
AU - Baik, Sung Wook
N1 - Publisher Copyright:
© 2017, Springer Science+Business Media, LLC.
PY - 2019/3/1
Y1 - 2019/3/1
N2 - Emotion recognition from speech signals is an interesting research area with several applications, such as smart healthcare, autonomous voice response systems, assessing situational seriousness through caller affective state analysis in emergency centers, and other smart affective services. In this paper, we present a study of speech emotion recognition based on features extracted from spectrograms using a deep convolutional neural network (CNN) with rectangular kernels. Typically, CNNs have square-shaped kernels and pooling operators at various layers, which are well suited for 2D image data. However, in the case of spectrograms, the information is encoded in a slightly different manner: time is represented along the x-axis, the y-axis shows the frequency of the speech signal, and the amplitude is indicated by the intensity value at a particular position in the spectrogram. To analyze speech through spectrograms, we propose rectangular kernels of varying shapes and sizes, along with max pooling over rectangular neighborhoods, to extract discriminative features. The proposed scheme effectively learns discriminative features from speech spectrograms and performs better than many state-of-the-art techniques when evaluated on the Emo-DB and a Korean speech dataset.
AB - Emotion recognition from speech signals is an interesting research area with several applications, such as smart healthcare, autonomous voice response systems, assessing situational seriousness through caller affective state analysis in emergency centers, and other smart affective services. In this paper, we present a study of speech emotion recognition based on features extracted from spectrograms using a deep convolutional neural network (CNN) with rectangular kernels. Typically, CNNs have square-shaped kernels and pooling operators at various layers, which are well suited for 2D image data. However, in the case of spectrograms, the information is encoded in a slightly different manner: time is represented along the x-axis, the y-axis shows the frequency of the speech signal, and the amplitude is indicated by the intensity value at a particular position in the spectrogram. To analyze speech through spectrograms, we propose rectangular kernels of varying shapes and sizes, along with max pooling over rectangular neighborhoods, to extract discriminative features. The proposed scheme effectively learns discriminative features from speech spectrograms and performs better than many state-of-the-art techniques when evaluated on the Emo-DB and a Korean speech dataset.
KW - Convolutional neural network
KW - Rectangular kernels
KW - Spectrogram
KW - Speech emotion recognition
UR - https://www.scopus.com/pages/publications/85032707901
U2 - 10.1007/s11042-017-5292-7
DO - 10.1007/s11042-017-5292-7
M3 - Article
AN - SCOPUS:85032707901
SN - 1380-7501
VL - 78
SP - 5571
EP - 5589
JO - Multimedia Tools and Applications
JF - Multimedia Tools and Applications
IS - 5
ER -