Spatiotemporal multimodal emotion recognition using temporal video sequences and pose features for child emotion classification

Document Type

Article

Publication Title

Scientific Reports

Abstract

Developmental psychology and affective computing have recently placed great emphasis on identifying children’s emotional cues. In this study, a novel Spatio-Temporal Multimodal Emotion Recognition Network (ST-MERN) for child emotion classification is proposed. The study uses dense feature embeddings from the EmoReact dataset together with temporal video sequences. The proposed method processes 115 consecutive frames per visual-signal instance, extracting features such as rotational-translational vectors, facial keypoints, and pose predictions. With steady per-frame performance and a mean confidence of 0.967, the system maintains high detection fidelity. To track subtle emotional changes, the method captures dynamic cues such as scale variation and frame-to-frame motion (rx, ry, rz, tx, ty). Latent features (p24–p33) provide a deeper characterization of emotional states. By combining these features, the model preserves spatiotemporal consistency and improves emotion recognition. The system classifies children’s emotional states into nine categories: curiosity, uncertainty, excitement, happiness, surprise, disgust, fear, frustration, and valence. Preliminary results show that the system effectively captures expressive nuances, with stable pose data and low feature variability across sequences. The BiLSTM-based architecture surpassed earlier models such as LSTM and TCN in generalization, achieving a validation accuracy of 93.6%, a test accuracy of 94.3%, and an F1-score of 0.92, indicating enhanced classification capacity across emotional states. The TCN model recorded a competitive test accuracy of 91.7% with quick inference times of ~0.8 s per clip, making it well suited to real-time deployment even though it was slightly slower than the BiLSTM.
With an F1-score of 0.89 and a test accuracy of 90.2%, the LSTM model performed robustly; it trained faster than the BiLSTM and TCN, although its accuracy was slightly lower. By providing robust, interpretable classification that is sensitive to the dynamic nature of children’s emotional displays, this technique improves emotion detection in children. This work lays the foundation for socially sensitive systems, therapeutic interventions, and affect-aware educational materials.
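The per-clip feature layout the abstract describes (115 consecutive frames; pose dynamics rx, ry, rz, tx, ty plus scale; latent features p24–p33; facial keypoints) can be sketched as follows. This is a minimal illustration only: the keypoint count, field names, and the `clip_tensor` helper are assumptions for the sketch, not the authors' implementation.

```python
# Sketch of the per-clip feature layout described in the abstract.
# Only the frame count (115), the pose fields (rx, ry, rz, tx, ty),
# and the latent range (p24-p33) come from the paper; everything else
# here is an illustrative assumption.

FRAMES_PER_CLIP = 115                                    # consecutive frames per clip
POSE_FIELDS = ["rx", "ry", "rz", "tx", "ty", "scale"]    # rotation/translation + scale
LATENT_FIELDS = [f"p{i}" for i in range(24, 34)]         # latent features p24-p33
N_KEYPOINTS = 68                                         # assumed facial-keypoint count

# Each keypoint contributes an (x, y) pair.
FEATURE_DIM = len(POSE_FIELDS) + len(LATENT_FIELDS) + 2 * N_KEYPOINTS

def clip_tensor(frames):
    """Flatten a list of per-frame feature dicts into a (115, FEATURE_DIM) matrix."""
    assert len(frames) == FRAMES_PER_CLIP, "each clip must contain 115 frames"
    rows = []
    for f in frames:
        row = [f[k] for k in POSE_FIELDS]                 # pose dynamics
        row += [f[k] for k in LATENT_FIELDS]              # latent features
        row += [c for (x, y) in f["keypoints"] for c in (x, y)]  # keypoint coords
        rows.append(row)
    return rows

# Example: one clip built from zero-valued placeholder frames.
dummy_frame = {k: 0.0 for k in POSE_FIELDS + LATENT_FIELDS}
dummy_frame["keypoints"] = [(0.0, 0.0)] * N_KEYPOINTS
clip = clip_tensor([dict(dummy_frame) for _ in range(FRAMES_PER_CLIP)])
```

A matrix of this shape would then be fed, per clip, to a sequence classifier such as the BiLSTM, TCN, or LSTM variants compared in the abstract.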

DOI

10.1038/s41598-025-25813-8

Publication Date

12-1-2025

