Abstract (in English)
Facial expressions are the most dominant and natural means of non-verbal communication (e.g., expressing emotions); they play a crucial role in human interaction, and recognizing them is vital for understanding others' behavior and emotional state. Due to the high correlation between facial expression and mental state, facial expression recognition (FER) is one of the most challenging research problems in computer vision and related fields, and a large body of research has been devoted to it. In recent years, automatic FER has shifted to unconstrained conditions (a.k.a. in-the-wild settings), driven by the significant advances in deep learning techniques and by the high accuracy already achieved under constrained conditions, which leaves few serious challenges there. Unconstrained conditions involve various real-world challenges, such as occluded faces and variations in head pose, lighting, facial expression, and background. On the one hand, because existing datasets are small and their facial expression classes are imbalanced, training a deep neural network (DNN) for FER remains a very challenging task. On the other hand, DNNs suffer from problems such as overfitting, insufficient learning efficiency, and high computational complexity.
Automatic FER from static images consists of three main steps: image preprocessing, feature extraction, and classification. With DNNs, the feature extraction and classification steps can be merged into a single step (i.e., end-to-end training). To address the aforementioned problems, this thesis proposes a two-stage training approach based on a deep convolutional neural network with a small number of trainable parameters. In the first stage, the network is trained on the original training set only, for a limited number of epochs. In the second stage, the network is trained on an augmented training set, initialized with the weights from the first stage. To prevent overfitting and to increase the learning capacity and generalization of the network, Switchable Normalization (SN) is used for normalization and DropBlock for regularization. To overcome the class imbalance problem, two methods are adopted: (1) oversampling, by applying data augmentation to all classes except the majority classes, and (2) the focal loss, to ensure balanced learning across majority and minority classes. The experimental results indicate that the proposed method, with only 1.5M trainable parameters, achieves 85.76% and 85.81% accuracy on two in-the-wild benchmark datasets (RAF-DB and FERPlus), respectively. It is worth mentioning that the proposed method attains remarkable results compared with current state-of-the-art methods while being shallower than many of the networks proposed in the FER literature.
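To illustrate the two-stage schedule and the focal loss described above, the following is a minimal PyTorch-style sketch, not the thesis's actual implementation. The backbone, datasets, framework choice, and hyperparameters (gamma, alpha, epoch counts, learning rate, image size) are placeholder assumptions, and the SN and DropBlock layers of the real network are omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    # Multi-class focal loss: FL(p_t) = -alpha * (1 - p_t)**gamma * log(p_t),
    # where p_t is the predicted probability of the ground-truth class.
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-alpha * (1.0 - pt) ** gamma * log_pt).mean()

def run_stage(model, loader, optimizer, epochs):
    # One training stage: an ordinary supervised loop driven by the focal loss.
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = focal_loss(model(images), labels)
            loss.backward()
            optimizer.step()

# Toy stand-ins for the real backbone and the FER datasets (illustrative only).
num_classes = 7
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 48 * 48, num_classes))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
original_set = TensorDataset(torch.randn(64, 3, 48, 48), torch.randint(0, num_classes, (64,)))
augmented_set = TensorDataset(torch.randn(256, 3, 48, 48), torch.randint(0, num_classes, (256,)))

# Stage 1: a limited number of epochs on the original (unaugmented) training set.
run_stage(model, DataLoader(original_set, batch_size=32, shuffle=True), optimizer, epochs=2)
# Stage 2: continue from the stage-1 weights on the augmented, rebalanced training set.
run_stage(model, DataLoader(augmented_set, batch_size=32, shuffle=True), optimizer, epochs=5)

In the actual method, the second loader would hold the training set oversampled and augmented for all non-majority classes, rather than the random tensors used here as stand-ins.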