Abstract: Spectrograms are commonly used as input features for deep neural networks to learn high(er)-level time–frequency patterns of speech signals for speech emotion recognition (SER). Generally, ...
Abstract: In this paper, we propose a method to improve the accuracy of speech emotion recognition (SER) by using a vision transformer (ViT) to attend to the correlation of frequency (y-axis) with time ...