Department of Computer, Ferdows Branch, Islamic Azad University, Ferdows, Iran
10.30508/kdip.2026.563388.1171
Abstract
Speech emotion recognition is one of the central challenges in natural language processing and human-machine interaction. The field aims to extract the emotional content carried by acoustic signals, and it therefore plays a key role in decision-support systems, voice-based assistants, and the user experience of spoken interfaces. The inherent complexity of speech, including individual variability, cultural differences, and context-dependent shifts, makes the problem both demanding and attractive to researchers. In this study, two deep learning models were designed and evaluated for detecting emotional states in speech. The first model is based on recurrent neural networks (RNNs), which are traditionally used for sequential data such as speech signals. It performed acceptably on basic emotions and simpler patterns, but its accuracy declined on more complex affective states and on highly variable signals. These limitations stem mainly from the difficulty RNNs have in modeling long-term dependencies and from their sensitivity to temporal noise. To address these issues, the second model combines a gated recurrent unit (GRU) architecture with an attention mechanism. GRUs, with their more compact structure and efficient temporal memory, are better suited to capturing and propagating essential features over time. The attention mechanism lets the model assign higher weights to the most informative portions of the speech signal, concentrating on emotionally salient moments. This design makes the model more robust to signal variation and yields higher accuracy across diverse emotional categories. The second model reached a final accuracy of 0.9982 (99.82%), indicating very strong performance in speech emotion classification.
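The abstract does not give the layer sizes, input features, or attention variant used in the study, so the following is only a minimal illustrative sketch of the general idea: a GRU reads a sequence of acoustic feature frames, a dot-product attention layer weights the resulting hidden states, and the weighted summary is mapped to emotion-class probabilities. All names (`init_params`, `gru_step`, `classify`), the dimensions, the random untrained weights, and the toy input frames are assumptions made for illustration, not the authors' implementation.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def init_params(n_in, n_hidden, n_classes, seed=0):
    # Hypothetical, untrained parameters; a real model would learn these.
    rng = random.Random(seed)
    p = {}
    for gate_name in ("z", "r", "n"):  # update gate, reset gate, candidate state
        p["W" + gate_name] = rand_matrix(rng, n_hidden, n_in)
        p["U" + gate_name] = rand_matrix(rng, n_hidden, n_hidden)
        p["b" + gate_name] = [0.0] * n_hidden
    p["v_att"] = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
    p["W_out"] = rand_matrix(rng, n_classes, n_hidden)
    return p

def gate(p, name, x, h, act):
    pre = zip(matvec(p["W" + name], x), matvec(p["U" + name], h), p["b" + name])
    return [act(a + b + c) for a, b, c in pre]

def gru_step(p, x, h):
    z = gate(p, "z", x, h, sigmoid)          # how much of the old state to keep
    r = gate(p, "r", x, h, sigmoid)          # how much of the old state feeds the candidate
    reset_h = [ri * hi for ri, hi in zip(r, h)]
    n = gate(p, "n", x, reset_h, math.tanh)  # candidate state
    return [zi * hi + (1.0 - zi) * ni for zi, hi, ni in zip(z, h, n)]

def classify(p, frames):
    h = [0.0] * len(p["bz"])
    states = []
    for x in frames:
        h = gru_step(p, x, h)
        states.append(h)
    # Attention: one score per time step, normalized to weights that sum to 1.
    scores = [sum(v * s for v, s in zip(p["v_att"], state)) for state in states]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the hidden states.
    context = [sum(w * state[j] for w, state in zip(weights, states))
               for j in range(len(h))]
    probs = softmax(matvec(p["W_out"], context))
    return probs, weights

# Toy input: three frames of three made-up acoustic features each.
params = init_params(n_in=3, n_hidden=4, n_classes=5)
frames = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2], [0.5, 0.1, 0.0]]
probs, weights = classify(params, frames)
```

Here `weights` holds one attention weight per frame and `probs` one probability per emotion class. With random weights the output is meaningless as a prediction; the sketch only shows how the attention weights let some time steps contribute more than others to the final decision.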
Jeshanzadeh, D., and Ghafari, H. (2025). Speech emotion recognition using Gated recurrent neural network and attention mechanism. Intelligent Knowledge Exploration and Processing, 5(18), e242473. doi: 10.30508/kdip.2026.563388.1171