Affective Gesture Recognition in Virtual Reality Using LSTM-CNN Fusion for Emotion-Adaptive Interaction
DOI: https://doi.org/10.51903/jtie.v4i1.278

Keywords: Emotion Recognition, VR, Body Gestures, CNN-LSTM, Affective Computing

Abstract
Emotion recognition in Virtual Reality (VR) has become increasingly relevant for enhancing immersive user experiences and enabling emotionally responsive interactions. Traditional approaches that rely on facial expressions or vocal cues face limitations in VR environments due to occlusion by head-mounted displays and restricted audio input. This study develops an emotion recognition model based on body gestures using a hybrid deep learning architecture that combines Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks. The CNN component extracts spatial features from skeletal data, while the LSTM models the temporal dynamics of the gestures. The proposed model was trained and evaluated on a benchmark VR gesture-emotion dataset annotated with five emotional states: happy, sad, angry, neutral, and surprised. Experimental results show that the CNN-LSTM model achieved an overall accuracy of 89.4%, with precision and recall of 88.7% and 87.9%, respectively. These findings demonstrate the model's ability to generalize across varied gesture patterns with high reliability; the integration of spatial and temporal features proves effective in capturing subtle emotional expressions conveyed through movement. The contribution of this research is a robust, non-intrusive method for emotion detection tailored to immersive VR settings. The model opens potential applications in virtual therapy, training simulations, and affective gaming, where real-time emotional feedback can significantly enhance system adaptiveness and user engagement. Future work will explore real-time implementation, multimodal sensor fusion, and advanced architectures such as attention mechanisms for further performance improvements.
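To make the described architecture concrete, the following is a minimal PyTorch sketch of a CNN-LSTM pipeline for skeletal gesture sequences of the kind the abstract outlines. It is not the authors' implementation: all hyperparameters (joint count, convolution widths, LSTM hidden size) and the per-frame 1D-convolution design are illustrative assumptions; only the input modality (skeletal data over time) and the five-class output follow the paper.

```python
# Hypothetical CNN-LSTM sketch: a per-frame CNN extracts spatial features
# from skeleton joints, an LSTM models temporal dynamics across frames.
# All sizes below are assumptions, not values reported by the paper.
import torch
import torch.nn as nn

class CNNLSTMEmotion(nn.Module):
    def __init__(self, num_joints=25, in_channels=3, hidden_size=128,
                 num_classes=5):  # happy, sad, angry, neutral, surprised
        super().__init__()
        # Spatial encoder: 1D convolutions over the joint axis of each frame.
        self.cnn = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over joints -> (B*T, 64, 1)
        )
        # Temporal encoder over the sequence of per-frame feature vectors.
        self.lstm = nn.LSTM(64, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (batch, time, joints, coords), e.g. 3D joint positions per frame
        b, t, j, c = x.shape
        frames = x.reshape(b * t, j, c).permute(0, 2, 1)        # (B*T, C, J)
        feats = self.cnn(frames).squeeze(-1).reshape(b, t, -1)  # (B, T, 64)
        _, (h_n, _) = self.lstm(feats)
        return self.classifier(h_n[-1])  # logits over the 5 emotion classes

# Example: a batch of 8 two-second clips at 30 fps, 25 joints in 3D.
logits = CNNLSTMEmotion()(torch.randn(8, 60, 25, 3))
print(logits.shape)  # torch.Size([8, 5])
```

The key design point the sketch illustrates is the division of labor the abstract describes: the CNN sees each skeleton frame independently (spatial structure), while the LSTM sees only the resulting feature sequence (temporal structure), and the final hidden state summarizes the whole gesture for classification.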
License
Copyright (c) 2025 Journal of Technology Informatics and Engineering

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
This license allows others to copy, distribute, display, and perform the work, and to create derivative works based upon it, for both commercial and non-commercial purposes, provided they credit the original author(s) and license their new creations under identical terms.