English Abstract
Nowadays, deep convolutional neural networks (CNNs) have achieved good results in image classification, but recognizing human actions in still images remains challenging due to the lack of temporal and motion information. Most current methods rely on additional information such as human or object bounding boxes and human-background interactions. Human bounding boxes are provided in most datasets, but object bounding boxes must be produced by auxiliary detection networks, whose performance limits that of the whole method. Moreover, vision transformers (ViTs), which are based on the attention mechanism, are a promising alternative for image classification because they offer better accuracy than convolutional networks. This research presents a method for recognizing human actions without any additional information, using a CNN and a ViT. First, we improve the accuracy of a deep convolutional neural network, ResNeXt50, by adding CBAM modules to its last stage, raising the mean Average Precision (mAP) to 92.75%. The CBAM modules added to ResNeXt50 are modified with identity shortcuts inspired by the ResNet building blocks. Then, we ensemble a ViT with the ResNeXt50 + CBAM network. To avoid overfitting caused by the limited amount of labeled training data, we use transfer learning from models pretrained on the ImageNet dataset and apply data augmentation in which two transformations, one for each network, are applied to the input images. During training, we use gradient clipping for faster convergence and higher accuracy, which has a considerable impact on the final results. Experiments show that the proposed method, trained and tested on the Stanford 40 Actions dataset, achieves an mAP of 96.00%, outperforming other state-of-the-art methods. Thus, the proposed method has two advantages: it achieves a higher mAP, and it does not require any additional information during training or testing.
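
As an illustration of the architectural modification described above, the following is a minimal PyTorch sketch of a CBAM block with an added identity shortcut, appended after the last stage of a torchvision ResNeXt50. The reduction ratio, spatial kernel size, and exact insertion point are assumptions for illustration, not the thesis's exact configuration.

```python
# Minimal sketch: CBAM (channel + spatial attention) with a ResNet-style
# identity shortcut, attached after ResNeXt50's last stage.
# Reduction ratio, kernel size, and placement are assumed values.
import torch
import torch.nn as nn
from torchvision.models import resnext50_32x4d

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling branch
        scale = torch.sigmoid(avg + mx)[:, :, None, None]
        return x * scale

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)    # channel-wise average map
        mx = x.amax(dim=1, keepdim=True)     # channel-wise max map
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale

class CBAMWithShortcut(nn.Module):
    """CBAM refinement plus an identity shortcut, as in ResNet blocks."""
    def __init__(self, channels):
        super().__init__()
        self.channel_att = ChannelAttention(channels)
        self.spatial_att = SpatialAttention()

    def forward(self, x):
        out = self.spatial_att(self.channel_att(x))
        return out + x                        # identity shortcut

def build_cnn_branch(num_classes=40):
    # ImageNet-pretrained backbone for transfer learning (assumed weights tag).
    model = resnext50_32x4d(weights="IMAGENET1K_V1")
    model.layer4 = nn.Sequential(model.layer4, CBAMWithShortcut(2048))
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```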
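
The per-network data augmentation and the score-level ensemble could look roughly like the sketch below. The particular augmentations, the ViT variant (timm's vit_base_patch16_224), and plain averaging of softmax scores are illustrative assumptions; the thesis may combine the two networks differently.

```python
# Hedged sketch: one augmentation pipeline per network and a simple
# score-averaging ensemble of the ViT and ResNeXt50 + CBAM branches.
import torch
import timm
from torchvision import transforms

IMAGENET_MEAN, IMAGENET_STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)

# Two training transformations, one for each network (assumed augmentations).
cnn_train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
vit_train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.2, 0.2, 0.2),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

# ImageNet-pretrained ViT, re-headed for the 40 Stanford 40 Actions classes.
vit_model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=40)

@torch.no_grad()
def ensemble_scores(cnn_model, vit_model, cnn_batch, vit_batch):
    """Average the softmax scores of the two networks for the same images."""
    cnn_probs = torch.softmax(cnn_model(cnn_batch), dim=1)
    vit_probs = torch.softmax(vit_model(vit_batch), dim=1)
    return (cnn_probs + vit_probs) / 2
```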
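
Finally, a minimal training-step sketch showing the gradient clipping mentioned in the abstract; the loss function, optimizer usage, and clipping threshold (max_norm=1.0) are assumptions, not the thesis's reported settings.

```python
# Minimal training loop with global gradient-norm clipping before each update.
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device, max_norm=1.0):
    criterion = nn.CrossEntropyLoss()
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        # Clip the global gradient norm to stabilize and speed up convergence.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        optimizer.step()
```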