Student Engagement Recognition from Videos: A Comparison Between Deep Learning Neural Network Architectures

Bulletin of the Technical Committee on Learning Technology (ISSN: 2306-0212)
Volume 21, Number 3, 7-12 (2021)
Received July 21, 2021
Accepted September 5, 2021
Published online October 3, 2021
This work is under Creative Commons CC-BY-NC 3.0 license. For more information, see Creative Commons License

Authors: Pablo S. Werlang*, Patrícia A. Jaques

*: Corresponding Author
Universidade do Vale do Rio dos Sinos, São Leopoldo, RS, Brazil


Emotions play an essential role in learning. Thus, it is necessary to detect students’ emotions in learning environments. Neural network models, especially deep learning models, have shown excellent performance in recognizing basic emotions (happiness, sadness, fear, disgust, anger) from faces in videos. However, basic emotions occur rarely in learning environments. Cognitive emotions, such as engagement, confusion, frustration, and boredom, happen five times more frequently than basic ones. Moreover, while basic emotions are relatively easy to distinguish from one another, cognitive emotions are much subtler and thus require more complex models for their recognition. This paper presents a comparative study of different deep neural network architectures for student engagement recognition in videos. Using the DAiSEE dataset, we trained and evaluated fine-tuned models, Conv3D models, and engineered features models with multiple hyperparameter settings. Our results show that fully trained Conv3D models performed better than fine-tuned VGG16 and ResNet50 models, and better than the LSTM model trained with popular engineered features.

Keywords: Engagement recognition, Emotions in learning, Deep Learning


Online learning has existed for some time, but recently it has received more attention from people in every field of knowledge [1]. This growing demand has also increased the importance of online learning environments. However, this kind of learning has traditionally lacked a way to account for human psychology. Emotions are affective states that significantly influence the learning process. While once regarded merely as the counterpoint of human cognition, emotions are nowadays known to be part of the cognitive process [2]. The affective computing field emerged from the need to improve human-computer interaction by recognizing, simulating, and understanding human emotions [3].

Recognizing emotions comes naturally to human beings but is much harder for machines. Humans rely on vision to detect danger and to recognize other humans’ hidden intentions and emotions [4]. Thus, visual input is a suitable medium when trying to imbue machines with that capacity. Thanks to advances in facial recognition techniques, it is now possible to use video cameras to recognize emotions [5]. Using hardware present in standard computers instead of expensive and intrusive devices is an added benefit because it makes the technology more likely to be used in everyday life.

Most research on automatic emotion recognition concerns the six basic emotions: happiness, sadness, fear, anger, surprise, and disgust [6]. These emotions are present and expressed in every geographic and cultural setting, which makes them very important to affective computing and to any study of human emotions. The great majority of attempts at automatic emotion recognition from face videos, such as [7, 8], work with these emotions.

Research on emotion recognition in learning environments is a novel trend, and successful endeavors in that field will significantly impact how people learn with the help of computer-based systems [9]. In such learning situations, basic emotions are not as frequent. According to [10], engagement, confusion, frustration, and boredom (Fig. 1) occur five times more often than basic emotions in learning environments, and [11] shows that there is a correlation between the learning process and those emotions. Of that group, engagement is the emotion most expected and desired from a student. For that reason, many works in the machine learning field aim to recognize that emotion through various modalities, such as video, text, voice, and physiological signals.

Fig. 1. Emotions in learning, from DAiSEE dataset [20]

Given the increasing amount of data available from different fields of knowledge, the advent of Big Data, and recent advances in neural networks and deep learning models, several works on emotion recognition from video using deep neural networks have emerged. Many models achieved outstanding performance in basic emotion recognition, such as [7], and in engagement recognition, such as [12].

While most works focus on overall performance, measured as higher accuracy for classifiers or lower mean squared error for regressors, we found none that compared specific deep learning model architectures for the engagement recognition task on videos. Such comparative studies are important because they provide helpful guidance for future researchers deciding which models to use when implementing their solutions. Training models on public datasets also encourages future works on the same samples, improving the overall comparability of those models’ performance.

In this work, three popular deep learning models are implemented and trained to recognize student engagement in videos: a fine-tuned convolutional VGG network architecture [13] wrapped in a recurrent network, a three-dimensional convolutional network (C3D) [14], and a recurrent network fed with engineered (not discovered) features from the videos. The goal of these models is to support students in their learning process by enabling intelligent learning environments to recognize their users’ emotions and to act on them to improve students’ motivation and behavior towards learning success. We compare the models’ performances and propose a fusion model.


Models for recognizing engagement in learning situations can be built to work from various kinds of modalities, such as user input [15], self-reports as questionnaires [16], or face images/videos [12, 17, 18, 19]. Those using visual input can do it through image samples [17] or from image sequences (videos) [18].

Engagement recognition from videos is a recent approach, and many works on it have been published lately. The availability of free public datasets about student engagement, such as DAiSEE [20] and EmotiW [21], has greatly helped the community to develop models for that task.
In [20, 22], the authors present the DAiSEE dataset and build several deep learning models to serve as baselines for other works, including a C3D model [14], a CNN+LSTM model, and a fine-tuned Inception network [23].

Besides the different network topologies covered in section III, works on engagement recognition usually create models for binary classification (engaged vs. not engaged) [19], multi-level classification (for example, four different levels of engagement) [18], or intensity regression [12]. While the first two map the input to specific output labels, regression models predict the engagement intensity of a given input as a continuous value.
Following the regression approach, EmotiW [21] presents a baseline model for engagement intensity prediction and provides the EmotiW dataset, which is used in the competition of the same name.

Works like [12] combine more than one model through model fusion techniques. Fusing models with non-redundant features can increase overall performance, as seen in the winning models of the EmotiW challenges [21, 24].

Regardless of the kind of deep learning model those works employ, some trending architectures and feature extraction techniques are used extensively for the engagement recognition task. The present work explores them individually by testing three different models, comparing their performances, and proposing a deep learning model fusion approach. By reporting each model’s topology and hyperparameters and showing each architecture’s performance results, we hope to provide guidance to future researchers working on engagement recognition, especially those training on the same dataset.


This section describes the designed approach to recognize engagement from video input, such as details about facial feature extraction and the chosen architectures for the models and their characteristics. Table I shows a comparison between the three model architectures implemented, highlighting their main differences.


A. Transfer Learning Model
Transfer learning is a technique that uses the knowledge learned by one machine learning model to help train a second model [25]. In short, to employ this technique, one must first obtain a model trained in a domain similar to the target domain, then remove the top layers of that model and replace them with new untrained layers. Finally, the model is trained with the remaining original layers frozen, ensuring that training does not overwrite the previously learned weights. This process is called fine-tuning, and it works because the initial layers have learned more generic and abstract features of the domain, which the old and new samples have in common. When training adjusts the weights of the final layers, the model learns to map those initial features into more specific ones for the new samples. Fine-tuning helps train complex models without consuming large amounts of time or computing resources.
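
The freezing step can be illustrated without any deep learning framework. The sketch below is a toy example, assuming a hypothetical `Layer` class and a plain gradient-descent update. It is not the pipeline used in this work; it only shows that frozen layers keep their pre-trained weights while the newly added top layer is updated.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Layer:
    """Hypothetical stand-in for a network layer (illustration only)."""
    name: str
    weights: List[float]
    trainable: bool = True


def build_fine_tuning_model(pretrained: List[Layer], n_top_removed: int,
                            new_top: List[Layer]) -> List[Layer]:
    """Drop the top layers, freeze the remaining ones, append new untrained layers."""
    base = pretrained[:len(pretrained) - n_top_removed]
    for layer in base:
        layer.trainable = False  # protect the previously learned weights
    return base + new_top


def sgd_step(model: List[Layer], grads: List[List[float]], lr: float) -> None:
    """Apply one gradient-descent update, skipping frozen layers."""
    for layer, layer_grads in zip(model, grads):
        if not layer.trainable:
            continue
        layer.weights = [w - lr * g for w, g in zip(layer.weights, layer_grads)]


# Made-up weights and gradients, chosen only to make the effect visible.
pretrained = [Layer("conv1", [0.5, -0.2]), Layer("conv2", [0.1, 0.3]),
              Layer("fc_old", [0.9, 0.9])]
model = build_fine_tuning_model(pretrained, n_top_removed=1,
                                new_top=[Layer("fc_new", [0.0, 0.0])])
sgd_step(model, grads=[[1.0, 1.0]] * len(model), lr=0.1)
```

After the update, `conv1` and `conv2` still hold their original weights, and only `fc_new` has moved.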

In this work, we used as the base model for the transfer learning task the pre-trained VGG-Face model [26], an implementation of the VGG16 architecture [13] that uses the Labeled Faces in the Wild dataset [27]. We then froze all the layers, removed the top three fully connected layers, and used the model’s output as input to a recurrent GRU network. Since the original model was trained to recognize static images and our goal is engagement recognition in videos, a time component was added through the temporal layers, as shown in Fig. 2.
The GRU network [28], as a recurrent network, can learn features using memory of past steps. While the traditional RNN suffers from the vanishing gradient problem [29], the GRU and the LSTM mitigate it through gating mechanisms (update and reset gates in the GRU; input, forget, and output gates in the LSTM) that allow the network to discard less useful or old information. These kinds of networks quickly became the go-to choice for modeling time-dependent problems.
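
To make the gating idea concrete, the following is a minimal scalar sketch of one GRU step, assuming a single input and a single hidden unit. The parameter names (`wz`, `uz`, and so on) are hypothetical, and the blending convention (`z` weighting the candidate state) is one of the two conventions found in the literature; deep learning libraries use vector and matrix versions of the same equations.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x: float, h_prev: float, p: dict) -> float:
    """One GRU update for a scalar input and a scalar hidden state."""
    # Update gate: how much of the new candidate replaces the old state.
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])
    # Reset gate: how much of the old state is used to build the candidate.
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])
    # Candidate state computed from the input and the reset-scaled old state.
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # Blend old state and candidate according to the update gate.
    return (1.0 - z) * h_prev + z * h_cand


def run_gru(xs, p: dict, h0: float = 0.0) -> float:
    """Feed a sequence through the cell and return the final hidden state."""
    h = h0
    for x in xs:
        h = gru_step(x, h, p)
    return h


# With all parameters at zero, both gates equal 0.5 and the candidate is 0,
# so each step simply halves the previous state.
zero_params = {k: 0.0 for k in
               ("wz", "uz", "bz", "wr", "ur", "br", "wh", "uh", "bh")}
halved = gru_step(1.0, 0.8, zero_params)
```

When the update gate is driven close to zero (for example, by a strongly negative `bz`), the hidden state is carried over almost unchanged, which is how the cell preserves information across many time steps.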

Fig. 2. Transfer learning model. It fine-tunes a VGG-Face model, using its weights in a temporal GRU network to recognize engagement in videos.

B. Three-Dimensional Convolutional Model
For the second model, we created a fully trained convolutional model. Although it does not benefit from transferred knowledge, this approach should yield good results because it can discover features during training that are more specific to the task, and because the model architecture can be tailored as needed.
In this case, the model chosen to capture spatial-temporal aspects from the videos is a three-dimensional convolutional network, or C3D [14]. Three-dimensional convolutional networks work similarly to their regular two-dimensional counterparts, but they perform the convolution in a three-dimensional space. In the case of videos, the third dimension is time. The main difference between C3D and a regular convolutional network wrapped in recurrent layers is that the former learns the temporal features in the same step as the spatial features. Fig. 3 shows our implementation of the C3D network.
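
As an illustration of how a single three-dimensional kernel mixes time and space, here is a minimal pure-Python sketch, assuming a single-channel clip stored as nested lists indexed `[time][row][column]`, no padding, and a stride of one. The helper name `conv3d_valid` and the toy data are hypothetical; as in most deep learning libraries, the operation is technically a cross-correlation.

```python
def conv3d_valid(volume, kernel):
    """Slide a 3D kernel over a [time][row][col] volume without padding."""
    t_len, h_len, w_len = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(t_len - kt + 1):
        plane = []
        for i in range(h_len - kh + 1):
            row = []
            for j in range(w_len - kw + 1):
                acc = 0.0
                # Sum over the temporal and spatial extent of the kernel at once.
                for dt in range(kt):
                    for di in range(kh):
                        for dj in range(kw):
                            acc += volume[t + dt][i + di][j + dj] * kernel[dt][di][dj]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Toy clip: 4 frames of 3x3 pixels whose brightness grows by 1 per frame.
video = [[[float(t)] * 3 for _ in range(3)] for t in range(4)]
# Toy 2x2x2 kernel: subtracts a patch in one frame from the same patch in the next.
kernel = [[[-1.0] * 2 for _ in range(2)], [[1.0] * 2 for _ in range(2)]]
response = conv3d_valid(video, kernel)
```

The output has shape 3×2×2, and every value is 4.0, because each of the four pixels in the patch brightened by 1 between consecutive frames. A clip with identical frames would give an all-zero response, so this kernel responds to motion-like change rather than to static appearance.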

Fig. 3. Convolutional 3D model. It discovers time-dependent features in the convolution process.

C. Engineered Features Model
While convolutional networks perform exceptionally in emotion recognition tasks, not all models are based on automatic feature discovery. Some engineered features carry valuable information for recognizing engagement, such as eye gaze, head pose, and Action Units.

Eye gaze represents the direction in which each of the subject’s eyes is looking relative to the camera. The head pose indicates the position in space and the three-axis rotation of the subject’s head relative to the camera. The Action Units (AUs) are part of the Facial Action Coding System (FACS) [30]. This system maps different parts of the face and labels the contraction or relaxation of its muscles with numbers, called Action Units, as shown in Fig. 4. Combinations of different AUs can represent every kind of human facial expression. FACS uses facial expressions to represent emotions [6], which makes AUs an important feature to consider when building emotion recognition models.

Fig. 4. Some of the Action Units [36].

The third model in this work uses three features: eye gaze, head pose, and action units. It receives these features as input and builds a classifier from a temporal model. The model is implemented as a recurrent LSTM network [31], which, like the GRU, can overcome the vanishing gradient problem of traditional RNNs. Fig. 5 shows the third model, which receives the stream of engineered features over time and outputs whether the video shows an engaged or a not engaged person.

Fig. 5. The recurrent LSTM model. It receives eye gaze, head pose, and action unit information from videos to detect whether the subject is engaged.

D. Feature Extraction
Each model receives facial features as input and performs the classification task. The transfer learning model and the three-dimensional convolutional model discover the features directly from images, while the engineered features model gets them from third-party software.

OpenFace is a tool designed for the affective computing community and for developing applications based on facial behavior analysis. As shown in Fig. 6, it implements face detection, landmark detection, eye movement tracking, head positioning, and feature extraction such as Action Unit recognition. In addition, OpenFace achieved excellent performance in all of these tasks, running in real time and obtaining results comparable to the best observed in other implementations [32].

Fig. 6. Overview of the OpenFace software [32].

In this work, we use OpenFace to break every video into frames and extract the aligned face from each frame. It extracts the aligned faces by detecting landmarks at specific points of the subject’s face, then removing the background and centering the face in a new image of normalized size, as in Fig. 7. Those aligned faces are used as input to the transfer learning model and the three-dimensional convolutional model, whose training process discovers all features. The third model takes as input the engineered features extracted by OpenFace: eye gaze, head pose, and action units.

Fig. 7. Aligned face from a video


This section presents the details about the dataset, the approach to address dataset imbalance, implementation details and hyperparameters, and experimental results.

A. Dataset
We trained all models using the DAiSEE dataset [20, 22], which provides 9068 videos of approximately 10 seconds each. The videos were recorded in different locations and lighting conditions (referred to as an in-the-wild setting) and feature 112 students aged 18 to 30, 32 female and 80 male, all Asian. The videos record the subjects’ reactions while watching educational and recreational videos. The dataset authors obtained annotations from a commercial collaborative platform and used annotation redundancy and outlier exclusion to ensure annotation reliability. The annotators gave each of the following emotions a score ranging from one to four: engagement, confusion, frustration, and boredom. A score of one represents no presence, and a score of four represents a high-level presence of the given emotion.

B. Data Imbalance
The dataset samples are split into training, validation, and test sets containing 5482, 1720, and 1866 videos, respectively. The split is subject-independent, meaning that no subject appears in more than one set. However, engaged videos (videos labeled with engagement intensity three or four) represent 94.15% of the total samples. This imbalance leads to inadequate training because the classification algorithm can ignore the less represented class with little penalty. Three techniques were employed to counter that problem.

First, undersampling randomly discarded samples from the engaged class in the training set until we obtained a 20-80% ratio of not engaged to engaged samples. Second, the SMOTE technique [33] generated synthetic samples of the not engaged class until we obtained a 40-60% ratio. Finally, the classification was conducted using class weights of 0.6 and 0.4 (for the not engaged and engaged classes, respectively) to compensate for the slight imbalance still present in the training set.
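
The arithmetic behind the two resampling steps can be sketched as follows. This is a minimal illustration under stated assumptions: the class counts in the example are hypothetical (they are not the DAiSEE counts), the helper names are invented, and `smote_interpolate` shows only the core interpolation idea of SMOTE. The full technique [33] also selects the neighbor from among the k nearest samples of the minority class.

```python
import random


def rebalance_targets(n_not_engaged: int, n_engaged: int,
                      under_pct: int = 20, smote_pct: int = 40):
    """Return (engaged, not_engaged) counts after undersampling and oversampling.

    under_pct: share of the not engaged class after undersampling the engaged class.
    smote_pct: share of the not engaged class after adding synthetic samples.
    """
    # Step 1: discard engaged samples until not engaged makes up under_pct of the set.
    engaged_after = round(n_not_engaged * (100 - under_pct) / under_pct)
    engaged_after = min(engaged_after, n_engaged)  # cannot keep more than exist
    # Step 2: add synthetic not engaged samples until they make up smote_pct.
    not_engaged_after = round(engaged_after * smote_pct / (100 - smote_pct))
    not_engaged_after = max(not_engaged_after, n_not_engaged)
    return engaged_after, not_engaged_after


def smote_interpolate(sample, neighbor, rng: random.Random):
    """Create one synthetic point on the segment between a sample and a neighbor."""
    gap = rng.random()  # position along the segment, in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]


# Hypothetical training set: 300 not engaged and 5000 engaged samples.
engaged_kept, not_engaged_total = rebalance_targets(300, 5000)
synthetic = smote_interpolate([0.0, 0.0], [1.0, 2.0], random.Random(0))
```

With these illustrative counts, 1200 engaged samples are kept (300 to 1200 is a 20-80 split), and the not engaged class grows to 800 samples (800 to 1200 is a 40-60 split), so 500 synthetic samples would have to be generated.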

C. Implementation Details
We implemented three kinds of deep learning models, all with the purpose of recognizing engagement from face videos. All models used categorical cross-entropy as the loss function, softmax as the activation function in the last layer, the Adam optimizer [34] with a learning rate of 1e-4, and accuracy as the metric.
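
For reference, the sketch below computes a softmax output and a class-weighted categorical cross-entropy for a single sample. It assumes that the class weights of 0.6 (not engaged) and 0.4 (engaged) from section IV.B scale each sample's loss according to its true class, which is a common way to apply class weights; how the weights were applied in our training code is not shown here, and the function names are hypothetical.

```python
import math

# Class weights from section IV.B: index 0 = not engaged, index 1 = engaged.
CLASS_WEIGHTS = (0.6, 0.4)


def softmax(logits):
    """Turn raw scores into probabilities that sum to one."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(logits, true_class, class_weights=CLASS_WEIGHTS):
    """Categorical cross-entropy for one sample, scaled by its class weight."""
    probs = softmax(logits)
    return -class_weights[true_class] * math.log(probs[true_class])


# An undecided prediction (equal scores) costs more for a not engaged sample.
loss_not_engaged = weighted_cross_entropy([0.0, 0.0], true_class=0)
loss_engaged = weighted_cross_entropy([0.0, 0.0], true_class=1)
```

For the same undecided prediction, the loss for a not engaged sample is 1.5 times the loss for an engaged one, which pushes training to pay more attention to the less represented class.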

As the goal of the work is to distinguish engaged from not engaged videos, the labels 0 and 1 from the dataset (the two lowest engagement levels) were relabeled as not engaged (value 0), and labels 2 and 3 (the two highest levels) were relabeled as engaged (value 1).

The training and validation sets’ inputs were composed of the pre-processed samples described in section IV.B, and the test set samples were kept unchanged to allow comparison with other works. We describe the topology of each model in section III.

The first two models (from sections III.A and III.B) used images as input. Those images are video frames extracted from the dataset using the OpenFace aligned faces module. The aligned face images are grayscale and 224×224 pixels in size. The videos’ frame rate was reduced from 30 to 15 fps for memory reasons, and we used 30 frames for each time window, so each window covers 2 seconds of video. Consecutive windows overlap by 50%, meaning that each new window starts halfway through the previous one.
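
The frame subsampling and windowing described above can be sketched as index arithmetic. The helper names are hypothetical, and the sketch assumes that a trailing window with fewer than 30 frames is dropped, which is not specified above.

```python
def subsample_indices(n_frames: int, src_fps: int = 30, dst_fps: int = 15):
    """Indices of the frames kept when reducing the frame rate."""
    step = src_fps // dst_fps  # keep every 2nd frame for 30 -> 15 fps
    return list(range(0, n_frames, step))


def window_starts(n_frames: int, window: int = 30, overlap: float = 0.5):
    """Start index of every full window, with the given fractional overlap."""
    stride = max(1, int(window * (1 - overlap)))  # 15 frames for a 50% overlap
    return list(range(0, n_frames - window + 1, stride))


# A 10-second video at 30 fps has 300 frames, of which 150 are kept at 15 fps.
kept = subsample_indices(300)
starts = window_starts(len(kept))
```

Under these assumptions, a 10-second video yields 9 windows of 30 frames, starting at frames 0, 15, 30, and so on up to 120 of the subsampled sequence.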

The third model (from section III.C) used the OpenFace software to extract the features from the same videos used by the other two models. The features extracted are left and right eye position, gaze position and angle, head position and rotation on three axes, and action unit intensity and occurrence. We used the same time window and overlap configuration as in the first two models.
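
One way to select such features from a per-frame CSV file is to filter columns by name prefix. The sketch below is illustrative only: the prefixes (`gaze_`, `eye_lmk_`, `pose_`, `AU`) and the sample header are assumptions about the column naming of OpenFace output and should be checked against the files actually produced, and the helper name is hypothetical.

```python
import csv
import io

# Assumed column-name prefixes for gaze, eye landmarks, head pose, and action units.
FEATURE_PREFIXES = ("gaze_", "eye_lmk_", "pose_", "AU")


def load_feature_rows(csv_file):
    """Return (column names, per-frame feature vectors) for the selected columns."""
    reader = csv.DictReader(csv_file)
    # Header names may carry leading spaces, so compare on the stripped name.
    columns = [name for name in reader.fieldnames
               if name.strip().startswith(FEATURE_PREFIXES)]
    rows = [[float(row[name]) for name in columns] for row in reader]
    return [name.strip() for name in columns], rows


# Tiny made-up sample standing in for a real per-frame output file.
sample = io.StringIO(
    "frame, confidence, gaze_0_x, pose_Tx, AU01_r\n"
    "1, 0.98, 0.1, 12.5, 0.0\n"
    "2, 0.97, 0.2, 12.0, 1.5\n"
)
names, features = load_feature_rows(sample)
```

Each inner list in `features` is the feature vector of one frame; stacking 30 consecutive vectors gives one input window for the recurrent model.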

D. Result and Analysis
For all the models, training was halted when the validation loss had not decreased for ten epochs. The learning rate was also decreased after every three epochs without improvement in validation loss.
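
The stopping and learning-rate rules can be sketched as a small loop over recorded validation losses. The patience values (ten and three epochs) and the initial learning rate follow the text; the reduction factor of 0.5 and the function name are assumptions made for illustration, since the factor is not reported here.

```python
def train_schedule(val_losses, lr: float = 1e-4, stop_patience: int = 10,
                   lr_patience: int = 3, lr_factor: float = 0.5):
    """Return (last epoch run, learning rate used after each epoch)."""
    best = float("inf")
    epochs_since_best = 0
    history = []
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            # Reduce the learning rate after every lr_patience stagnant epochs.
            if epochs_since_best % lr_patience == 0:
                lr *= lr_factor
        history.append(lr)
        # Stop once the loss has not improved for stop_patience epochs.
        if epochs_since_best >= stop_patience:
            return epoch, history
    return len(val_losses) - 1, history


# Made-up loss curve: improves for two epochs, then stagnates.
losses = [1.0, 0.9] + [0.95] * 12
stopped_at, lr_history = train_schedule(losses)
```

With this made-up curve, the best loss occurs at epoch 1, the learning rate is halved at epochs 4, 7, and 10, and training stops at epoch 11, ten epochs after the last improvement.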

For the Convolutional 3D model, the variant implementations differed in:
–The size of the convolutional blocks (from one to two).
–The number of blocks (from one to three).
–The width of each block.
–The number and width of the fully connected (FC) layers (from one to three layers and from four to 256 neurons).
The results from those competing models are shown in Table II.


The transfer learning model uses the fine-tuning technique on a pre-trained network, in this case VGG16 and ResNet50. These networks were trained on the VGG-Face dataset [26] to identify subjects by name. Since these networks were trained to recognize faces, their bottom layers should contain generic facial features. Thus, after removing the fully connected layers from the top of those models (one implemented using the VGG16 architecture and the other ResNet50), GRU/LSTM layers were added so that the network could receive temporal image data (videos). Each fine-tuned model was tested with different widths and depths of recurrent layers, and Table III shows the best performance of each kind.


We trained the engineered features model on 329 feature values from 12821 video segments using a recurrent LSTM network. We tried several topologies, varying the number of LSTM layers (from one to six), the width of each layer (from three to 256 in various combinations), the Dropout values (from 0.1 to 0.8), and the number and width of fully connected layers (from one to three layers and from 5 to 1024 neurons). We even tried a recurrent version of a topology inspired by ResNets. As shown in Table IV, the best result achieved by this model was 61.54% using the model from Fig. 5, which led us to believe that engagement recognition needs more data to train a model with only these features and without training directly on the images. Nevertheless, the engineered features model can be a powerful tool as a complementary model when fused with a convolutional model [35].

From Table IV, we can see that C3D models tend to perform better than the Conv+LSTM models for the engagement recognition task. One explanation for this is that 3D convolution networks learn appearance and motion features simultaneously [14], connecting the feature discovery process for temporal and spatial information.


We should note that the models’ performance is tied to their implementation, and different topology and hyperparameter decisions lead to different results. This work is not an exhaustive study of every possible implementation of the chosen architectures but rather a guide for those aiming to know the main differences and expected results for these networks.


Engagement recognition is not easy because engagement is a much subtler emotion than the six basic ones. In learning situations, though, it is the most important emotion because it indicates that the student is keeping up with the tutor’s tasks.

The present work compared some popular network models for engagement recognition in videos. Of the three model architectures, the three-dimensional convolutional networks showed the best performance, followed by the transfer learning models and, lastly, by the engineered features model. Though the last kind did not perform well on its own, it is best used, consistent with most related works, as a supporting model for a main model that discovers features directly from videos.

It is also worth noting that few datasets are available featuring videos of people watching videos with labels describing the engagement emotion. The dataset used, DAiSEE, shows extreme data imbalance, which required measures for rebalancing the dataset and harmed the generalization capacity of the models. As the study used only the data available in that dataset, our models were not tested in other scenarios, so we cannot attest to their performance in real-life conditions.

It was also challenging to find model hyperparameters that avoid overfitting with the available data. Thus, a model fusion implementation could help, as more features would be fed to the training, potentially improving performance. Consider that the manifestation of cognitive emotions usually follows a specific two-directional flow: for a student experiencing engagement to become bored, they will first become confused, then frustrated, and then bored (the reverse order also holds). A sample model fusion network for future works could take the outputs of each model as input and predict a more precise emotion by training a regular LSTM network that considers the emotion prediction history of the input models.


[1] S. Brooks, “Does personal social media usage affect efficiency and well-being?,” Computers in Human Behavior, vol. 46, pp. 26–37, 2015.

[2] R. S. Lazarus, “Thoughts on the relations between emotion and cognition,” American psychologist, vol. 37, no. 9, p. 1019, 1982.

[3] R. W. Picard, Affective computing. MIT press, 2000.

[4] J. H. Turner, “The evolution of emotions in humans: A Darwinian–Durkheimian analysis,” Journal for the theory of social behaviour, vol. 26, no. 1, pp. 1-33, 1996.

[5] E. Sariyanidi, H. Gunes, and A. Cavallaro, “Automatic analysis of facial affect: A survey of registration, representation, and recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 6, pp. 1113–1133, 2014.

[6] P. Ekman, “Basic emotions,” Handbook of cognition and emotion, pp. 45–60, 1999.

[7] C. Liu, T. Tang, K. Lv, and M. Wang, “Multi-feature based emotion recognition for video clips,” Proceedings of the 2018 on International Conference on Multimodal Interaction. ACM, 2018, pp. 630–634.

[8] D. H. Kim, W. J. Baddar, J. Jang, and Y. M. Ro, “Multi-objective based spatio-temporal feature representation learning robust to expression intensity variations for facial expression recognition,” IEEE Transactions on Affective Computing, vol. 10, no. 2, pp. 223–236, 2017.

[9] R. S. Baker, S. K. D’Mello, M. M. T. Rodrigo, and A. C. Graesser, “Better to be frustrated than bored: The incidence, persistence, and impact of learners’ cognitive–affective states during interactions with three different computer-based learning environments,” International Journal of Human-Computer Studies, vol. 68, no. 4, pp. 223–241, 2010.

[10] S. D’Mello and R. A. Calvo, “Beyond the basic emotions: what should affective computing compute?” CHI’13 Extended Abstracts on Human Factors in Computing Systems. ACM, 2013, pp. 2287–2294.

[11] N. Bosch and S. D’Mello, “The affective experience of novice computer programmers,” International journal of artificial intelligence in education, vol. 27, no. 1, pp. 181–206, 2017.

[12] J. Yang, K. Wang, X. Peng, and Y. Qiao, “Deep recurrent multi-instance learning with spatio-temporal features for engagement intensity prediction,” Proceedings of the 2018 on International Conference on Multimodal Interaction. ACM, 2018, pp. 594–598.

[13] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[14] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” Proceedings of the IEEE international conference on computer vision, 2015, pp. 4489–4497.

[15] Z. Zhang, Z. Li, H. Liu, T. Cao, and S. Liu, “Data-driven online learning engagement detection via facial expression and mouse behavior recognition technology,” Journal of Educational Computing Research, vol. 58, no. 1, pp. 63–86, 2020.

[16] E. Carlotta Olivetti, M. G. Violante, E. Vezzetti, F. Marcolin, and B. Eynard, “Engagement evaluation in a virtual learning environment via facial expression recognition and self-reports: A preliminary approach,” Applied Sciences, vol. 10, no. 1, p. 314, 2020.

[17] O. M. Nezami, L. Hamey, D. Richards, and M. Dras, “Deep learning for domain adaption: Engagement recognition,” arXiv preprint arXiv:1808.02324, vol. 3, 2018.

[18] Huang, Y. Mei, H. Zhang, S. Liu, and H. Yang, “Fine-grained engagement recognition in online learning environment,” 2019 IEEE 9th international conference on electronics information and emergency communication (ICEIEC). IEEE, 2019, pp. 338–341.

[19] M. A. A. Dewan, F. Lin, D. Wen, M. Murshed, and Z. Uddin, “A deep learning approach to detecting engagement of online learners,” 2018 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI). IEEE, 2018, pp. 1895–1902.

[20] Gupta, A. D’Cunha, K. Awasthi, and V. Balasubramanian, “Daisee: Towards user engagement recognition in the wild,” arXiv preprint arXiv:1609.01885, 2016.

[21] Dhall, A. Kaur, R. Goecke, and T. Gedeon, “Emotiw 2018: Audio-video, student engagement and group-level affect prediction,” Proceedings of the 2018 on International Conference on Multimodal Interaction. ACM, 2018, pp. 653–656.

[22] Gupta, R. Jaiswal, S. Adhikari, and V. N. Balasubramanian, “Daisee: dataset for affective states in e-learning environments,” arXiv, pp. 1–22, 2016.

[23] Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.

[24] Dhall, “Emotiw 2019: Automatic emotion, engagement and cohesion prediction tasks,” 2019 International Conference on Multimodal Interaction, 2019, pp. 546–550.

[25] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2009.

[26] O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” Proceedings of the British Machine Vision Conference (BMVC). BMVA Press, September 2015, pp. 41.1–41.12. [Online]. Available:

[27] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: A database for studying face recognition in unconstrained environments,” Workshop on faces in ‘Real-Life’ Images: detection, alignment, and recognition, 2008.

[28] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.

[29] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE transactions on neural networks, vol. 5, no. 2, pp. 157–166, 1994.

[30] P. Ekman, “An argument for basic emotions,” Cognition & emotion, vol. 6, no. 3-4, pp. 169–200, 1992.

[31] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[32] T. Baltrusaitis, A. Zadeh, Y. C. Lim, and L.-P. Morency, “Openface 2.0: Facial behavior analysis toolkit,” 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). IEEE, 2018, pp. 59–66.

[33] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: synthetic minority over-sampling technique,” Journal of artificial intelligence research, vol. 16, pp. 321–357, 2002.

[34] P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[35] K. Wang, J. Yang, D. Guo, K. Zhang, X. Peng, and Y. Qiao, “Bootstrap model ensemble and rank loss for engagement intensity regression,” 2019 International Conference on Multimodal Interaction, 2019, pp. 551–556.

[36] R. Zhi, M. Liu, and D. Zhang, “A comprehensive survey on automatic facial action unit analysis,” The Visual Computer, vol. 36, no. 5, pp. 1067–1093, 2020.



Pablo S. Werlang

received a bachelor’s degree in Computer Engineering, and a master’s degree in Computational Modeling from Universidade Federal do Rio Grande (FURG) in Brazil in 2011 and 2013. Currently, he is pursuing a Ph.D. degree in Applied Computing from Universidade do Vale do Rio dos Sinos (UNISINOS) in Brazil. In 2011 he joined the Instituto Federal de Sul-Riograndense (IFSUL), where he currently is an Associate Professor. His research interests include Artificial Intelligence applied to education and Deep learning.


Patrícia A. Jaques

is an associate professor and researcher at the Graduate Program in Applied Computing (PPGCA) at the University of Vale do Rio dos Sinos (Unisinos) in Brazil. Patricia’s main research areas are Artificial Intelligence and Affective Computing, with application in Education. She studies how to develop more effective intelligent learning environments by considering both student’s knowledge and affect. In the last years, she’s been investigating how to detect students’ emotions and other affective states in intelligent learning environments and how these systems should respond to those emotions to promote learning and well-being.