Affective Gesture Recognition in Virtual Reality Using LSTM-CNN Fusion for Emotion-Adaptive Interaction

Authors

  • Soonya Gupta, Netaji Subhas University of Technology (Formerly Netaji Subhas Institute of Technology), New Delhi, India
  • Deepa Kumar, Netaji Subhas University of Technology (Formerly Netaji Subhas Institute of Technology), New Delhi, India
  • Shiva Sharma, Netaji Subhas University of Technology (Formerly Netaji Subhas Institute of Technology), New Delhi, India

DOI:

https://doi.org/10.51903/jtie.v4i1.278

Keywords:

Emotion Recognition, VR, Body Gestures, CNN-LSTM, Affective Computing

Abstract

Emotion recognition in Virtual Reality (VR) has become increasingly relevant for enhancing immersive user experiences and enabling emotionally responsive interactions. Traditional approaches that rely on facial expressions or vocal cues often face limitations in VR environments due to occlusion by head-mounted displays and restricted audio input. This study develops an emotion recognition model based on body gestures, using a hybrid deep learning architecture that combines Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks. The CNN component extracts spatial features from skeletal data, while the LSTM models the temporal dynamics of the gestures. The proposed model was trained and evaluated on a benchmark VR gesture-emotion dataset annotated with five emotional states: happy, sad, angry, neutral, and surprised. Experimental results show that the CNN-LSTM model achieved an overall accuracy of 89.4%, with precision and recall of 88.7% and 87.9%, respectively. These findings demonstrate the model's ability to generalize across varied gesture patterns with high reliability, and the integration of spatial and temporal features proves effective in capturing subtle emotional expressions conveyed through movement. The contribution of this research lies in offering a robust, non-intrusive method for emotion detection tailored to immersive VR settings. The model has potential applications in virtual therapy, training simulations, and affective gaming, where real-time emotional feedback can significantly enhance system adaptiveness and user engagement. Future work will explore real-time implementation, multimodal sensor fusion, and advanced architectures such as attention mechanisms for further performance improvement.
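For readers who want a concrete picture of the architecture the abstract describes, the sketch below shows one way to wire a per-frame CNN feature extractor into an LSTM over a skeletal gesture sequence with five emotion classes. It is an illustrative reconstruction, not the authors' code: the framework (PyTorch), the input layout (frames × joints × 3-D coordinates), the joint count, and the layer widths are all assumptions.

```python
# Minimal sketch of the CNN-LSTM idea from the abstract (not the authors'
# released code). All shapes and hyperparameters are assumptions: input is
# a skeletal sequence of T frames, each with J joints in 3-D coordinates.
import torch
import torch.nn as nn

class CnnLstmGestureEmotion(nn.Module):
    def __init__(self, num_joints=25, coords=3, hidden=128, num_classes=5):
        super().__init__()
        # CNN: per-frame spatial feature extractor over the joint dimension.
        self.cnn = nn.Sequential(
            nn.Conv1d(coords, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over joints -> one vector per frame
        )
        # LSTM: temporal dynamics across the sequence of frame features.
        self.lstm = nn.LSTM(input_size=128, hidden_size=hidden, batch_first=True)
        # Classifier over the five emotion labels used in the paper.
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):
        # x: (batch, frames, joints, coords)
        b, t, j, c = x.shape
        x = x.view(b * t, j, c).transpose(1, 2)         # (b*t, coords, joints)
        feats = self.cnn(x).squeeze(-1).view(b, t, -1)  # (b, frames, 128)
        out, _ = self.lstm(feats)                       # (b, frames, hidden)
        return self.head(out[:, -1])                    # logits from last step

# Example: a batch of 8 sequences, 60 frames, 25 joints, (x, y, z) per joint.
model = CnnLstmGestureEmotion()
logits = model(torch.randn(8, 60, 25, 3))
print(logits.shape)  # torch.Size([8, 5])
```

Pooling over the joint axis keeps the spatial extractor compact, while taking the LSTM's final hidden state summarizes the whole gesture before classification; the paper's actual layer configuration may differ.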

References

Alabdullah, B. I., Ansar, H., Mudawi, N. Al, Alazeb, A., Alshahrani, A., Alotaibi, S. S., & Jalal, A. (2023). Smart Home Automation-Based Hand Gesture Recognition Using Feature Fusion and Recurrent Neural Network. Sensors, 23(17), 7523. https://doi.org/10.3390/s23177523

Arman, A., Prasetya, P., Arifany, F. N., Pradnyaparamita, F. B., & Laksito, J. (2022). A Digital Printing Application as an Expression Identification System. Journal of Technology Informatics and Engineering, 1(2), 5–15. https://doi.org/10.51903/jtie.v1i2.135

Atmaja, B. T., Sasou, A., & Akagi, M. (2022). Survey on Bimodal Speech Emotion Recognition from Acoustic and Linguistic Information Fusion. Speech Communication, 140, 11–28. https://doi.org/10.1016/j.specom.2022.03.002

Chouhayebi, H., Mahraz, M. A., Riffi, J., Tairi, H., & Alioua, N. (2024). Human Emotion Recognition Based on Spatio-Temporal Facial Features Using HOG-HOF and VGG-LSTM. Computers, 13(4), 101. https://doi.org/10.3390/computers13040101

Dirin, A., & Laine, T. H. (2023). The Influence of Virtual Character Design on Emotional Engagement in Immersive Virtual Reality: The Case of Feelings of Being. Electronics, 12(10), 2321. https://doi.org/10.3390/electronics12102321

Grewal, D., Herhausen, D., Ludwig, S., & Villarroel Ordenes, F. (2022). The Future of Digital Communication Research: Considering Dynamics and Multimodality. Journal of Retailing, 98(2), 224–240. https://doi.org/10.1016/j.jretai.2021.01.007

Huang, Z., Ma, Y., Wang, R., Li, W., & Dai, Y. (2023). A Model for EEG-Based Emotion Recognition: CNN-Bi-LSTM with Attention Mechanism. Electronics, 12(14), 3188. https://doi.org/10.3390/electronics12143188

Izountar, Y., Benbelkacem, S., Otmane, S., Khababa, A., Masmoudi, M., & Zenati, N. (2022). VR-PEER: A Personalized Exer-Game Platform Based on Emotion Recognition. Electronics, 11(3), 455. https://doi.org/10.3390/electronics11030455

Kaklauskas, A., Abraham, A., Ubarte, I., Kliukas, R., Luksaite, V., Binkyte-Veliene, A., Vetloviene, I., & Kaklauskiene, L. (2022). A Review of AI Cloud and Edge Sensors, Methods, and Applications for the Recognition of Emotional, Affective and Physiological States. Sensors, 22(20), 7824. https://doi.org/10.3390/s22207824

Kaseris, M., Kostavelis, I., & Malassiotis, S. (2024). A Comprehensive Survey on Deep Learning Methods in Human Activity Recognition. Machine Learning and Knowledge Extraction, 6(2), 842–876. https://doi.org/10.3390/make6020040

Khan, U. A., Xu, Q., Liu, Y., Lagstedt, A., Alamäki, A., & Kauttonen, J. (2024). Exploring Contactless Techniques in Multimodal Emotion Recognition: Insights into Diverse Applications, Challenges, Solutions, and Prospects. In Multimedia Systems (Vol. 30, Issue 3). Springer Berlin Heidelberg. https://doi.org/10.1007/s00530-024-01302-2

Kopalidis, T., Solachidis, V., Vretos, N., & Daras, P. (2024). Advances in Facial Expression Recognition: A Survey of Methods, Benchmarks, Models, and Datasets. Information, 15(3), 135. https://doi.org/10.3390/info15030135

Leong, S. C., Tang, Y. M., Lai, C. H., & Lee, C. K. M. (2023). Facial Expression and Body Gesture Emotion Recognition: A Systematic Review on the Use of Visual Data in Affective Computing. Computer Science Review, 48, 100545. https://doi.org/10.1016/j.cosrev.2023.100545

Lian, H., Lu, C., Li, S., Zhao, Y., Tang, C., & Zong, Y. (2023). A Survey of Deep Learning-Based Multimodal Emotion Recognition: Speech, Text, and Face. Entropy, 25(10), 1440. https://doi.org/10.3390/e25101440

Mourtzis, D., Angelopoulos, J., & Panopoulos, N. (2023). The Future of the Human–Machine Interface (HMI) in Society 5.0. Future Internet, 15(5), 162. https://doi.org/10.3390/fi15050162

Rahman, M. M., Gupta, D., Bhatt, S., Shokouhmand, S., & Faezipour, M. (2024). A Comprehensive Review of Machine Learning Approaches for Anomaly Detection in Smart Homes: Experimental Analysis and Future Directions. Future Internet, 16(4), 139. https://doi.org/10.3390/fi16040139

Rani, C. J., & Devarakonda, N. (2022). An Effectual Classical Dance Pose Estimation and Classification System Employing Convolutional Neural Network–Long Short-Term Memory (CNN-LSTM) Network for Video Sequences. Microprocessors and Microsystems, 95, 104651. https://doi.org/10.1016/j.micpro.2022.104651

Shomoye, M., & Zhao, R. (2024). Automated Emotion Recognition of Students in Virtual Reality Classrooms. Computers & Education: X Reality, 5, 100082. https://doi.org/10.1016/j.cexr.2024.100082

Siddiqui, M. F. H., Dhakal, P., Yang, X., & Javaid, A. Y. (2022). A Survey on Databases for Multimodal Emotion Recognition and an Introduction to the VIRI (Visible and InfraRed Image) Database. Multimodal Technologies and Interaction, 6(6), 47. https://doi.org/10.3390/mti6060047

Strazdas, D., Hintz, J., Khalifa, A., Abdelrahman, A. A., Hempel, T., & Al-Hamadi, A. (2022). Robot System Assistant (RoSA): Towards Intuitive Multi-Modal and Multi-Device Human-Robot Interaction. Sensors, 22(3), 923. https://doi.org/10.3390/s22030923

Swoboda, D., Boasen, J., Léger, P. M., Pourchon, R., & Sénécal, S. (2022). Comparing the Effectiveness of Speech and Physiological Features in Explaining Emotional Responses during Voice User Interface Interactions. Applied Sciences, 12(3), 1269. https://doi.org/10.3390/app12031269

Udahemuka, G., Djouani, K., & Kurien, A. M. (2024). Multimodal Emotion Recognition Using Visual, Vocal and Physiological Signals: A Review. Applied Sciences, 14(17), 8071. https://doi.org/10.3390/app14178071

Vrskova, R., Kamencay, P., Hudec, R., & Sykora, P. (2023). A New Deep-Learning Method for Human Activity Recognition. Sensors, 23(5), 2816. https://doi.org/10.3390/s23052816

Yaseen, Kwon, O. J., Kim, J., Jamil, S., Lee, J., & Ullah, F. (2024). Next-Gen Dynamic Hand Gesture Recognition: MediaPipe, Inception-v3 and LSTM-Based Enhanced Deep Learning Model. Electronics, 13(16), 3233. https://doi.org/10.3390/electronics13163233

Yuvaraj, R., Mittal, R., Prince, A. A., & Huang, J. S. (2025). Affective Computing for Learning in Education: A Systematic Review and Bibliometric Analysis. Education Sciences, 15(1), 65. https://doi.org/10.3390/educsci15010065

Zheng, Y., & Blasch, E. (2023). Facial Micro-Expression Recognition Enhanced by Score Fusion and a Hybrid Model from Convolutional LSTM and Vision Transformer. Sensors, 23(12), 5650. https://doi.org/10.3390/s23125650

Published

2025-04-20

How to Cite

Gupta, S., Kumar, D., & Sharma, S. (2025). Affective Gesture Recognition in Virtual Reality Using LSTM-CNN Fusion for Emotion-Adaptive Interaction. Journal of Technology Informatics and Engineering, 4(1), 21–40. https://doi.org/10.51903/jtie.v4i1.278