English Abstract
Recently, still image-based human action recognition has become an active research topic in computer vision and pattern recognition. It focuses on identifying a person's action or behavior from a single image. Compared with video-based approaches, image-based action representation and recognition cannot access the motion cues of an action, which greatly increases the difficulty of dealing with pose variation and cluttered backgrounds. Currently, the most effective methods train a deep network directly on still-image action recognition datasets using auxiliary data such as human bounding boxes, object bounding boxes, and bounding boxes of human body parts. However, in addition to the cost of generating this auxiliary data from images, these methods have many parameters and are therefore unsuitable for devices with limited computing resources, such as mobile devices. We propose knowledge distillation and attention transfer from a larger teacher network to a smaller student network; both techniques improve the student network's performance on human action recognition without increasing its parameter count or computational cost. Furthermore, a major challenge in action recognition from still images is the lack of sufficiently large datasets, which makes training deep Convolutional Neural Networks (CNNs) prone to overfitting. In this paper, by taking advantage of pre-trained CNNs, we employ transfer learning to compensate for the lack of massive labeled action recognition datasets. Experimental results show that knowledge distillation helps a ResNet-18 network mimic a pre-trained ResNet-34 network, and attention transfer helps the student produce spatial attention maps similar to those of the ResNet-34 teacher, although knowledge distillation works much better than attention transfer. We then take a fine-tuned SE-ResNeXt-101 network and use it as the teacher for an SE-ResNeXt-50 student pre-trained on ImageNet; with knowledge distillation and attention transfer, SE-ResNeXt-50 achieves a mean average precision of 92.08% on the Stanford 40 dataset. Finally, comparison with previous work shows that our method improves the mean average precision of human action recognition in still images without increasing the number of parameters or the complexity of the student network.
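As a minimal PyTorch sketch of the knowledge distillation objective described above, assuming the standard Hinton-style soft-target formulation; the temperature T and blending weight alpha below are illustrative placeholders, not hyperparameters reported in this work:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style knowledge distillation loss (illustrative values of
    T and alpha; not the settings used in the paper).

    The teacher's and student's logits are softened by temperature T and
    compared with KL divergence; the result is blended with the ordinary
    cross-entropy on the ground-truth labels.
    """
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitudes match the hard loss
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```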
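Likewise, a sketch of activation-based attention transfer, assuming student and teacher feature maps are compared at matched layers with equal spatial resolution (the exact layer pairing is not specified in this abstract):

```python
def attention_map(feature):
    """Spatial attention map of a (N, C, H, W) feature tensor: squared
    activations averaged over channels, flattened, and L2-normalized,
    as in activation-based attention transfer."""
    am = feature.pow(2).mean(dim=1)  # (N, H, W)
    am = am.flatten(1)               # (N, H*W)
    return F.normalize(am, p=2, dim=1)

def attention_transfer_loss(student_feats, teacher_feats):
    """Sum of squared differences between normalized attention maps at
    matched student/teacher layers (assumes equal spatial sizes)."""
    return sum(
        (attention_map(s) - attention_map(t)).pow(2).mean()
        for s, t in zip(student_feats, teacher_feats)
    )
```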
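The transfer learning step can be sketched as the usual fine-tuning recipe: load ImageNet-pretrained weights and replace the classifier head with one sized for the 40 classes of Stanford 40. This is a generic illustration using torchvision's ResNet-34, not the exact training configuration of the paper:

```python
import torch.nn as nn
import torchvision.models as models

NUM_CLASSES = 40  # number of action classes in Stanford 40

# Start from an ImageNet-pretrained backbone and swap the final fully
# connected layer so the network predicts action classes instead of the
# 1000 ImageNet categories; all other weights are then fine-tuned.
model = models.resnet34(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
```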