Abstract (English)
Human activity recognition from still images has emerged as a significant research area in computer vision and pattern recognition. The goal is to recognize and categorize human behaviors or activities from a single still image. Unlike videos, still images lack the motion information that helps describe an activity, so detecting and identifying activities from them is a challenging task that demands effective methods. Deep learning networks, such as convolutional neural networks, have recently emerged as powerful tools across many machine learning domains; vision transformers, owing to their strong performance and efficiency, are now seen as an attractive alternative to convolutional networks. In addition, the scarcity of sufficiently large labeled datasets is a major challenge in human activity recognition from still images: with limited training data, deep networks are prone to overfitting. To counter this, networks with weights pre-trained on ImageNet can be used. In this research, five vision transformer networks with modern, optimized architectures and pre-trained weights were first selected and fine-tuned on the Stanford40 dataset. Since vision transformers tend to focus on global, holistic features, the accuracies obtained from these networks were analyzed and compared. To improve the results further, knowledge distillation was employed, in which knowledge is transferred from a larger, more complex network, the teacher, to a smaller network, the student. This method transfers soft, probabilistic information from the teacher network's output, enabling the student network to leverage the teacher's knowledge and perform better. The ConvNeXt convolutional network, with twice the parameters of the student networks, served as the teacher, allowing the local features it extracts from still images to be injected into the student networks' learning process. This approach improved the accuracy of human activity recognition with the student networks by 1 to 3 percent.
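To make the teacher-student setup concrete, the following is a minimal sketch of ConvNeXt-to-ViT knowledge distillation in PyTorch, assuming the timm library for ImageNet-pretrained models. The model names (convnext_base, vit_base_patch16_224), the temperature T, the blending weight alpha, and the dummy batch are illustrative assumptions, not the exact configuration used in this work; only the overall scheme (softened teacher outputs guiding a smaller student on 40 Stanford40 action classes) follows the abstract.

import torch
import torch.nn.functional as F
import timm

NUM_CLASSES = 40  # Stanford40 action classes

# Hypothetical model choices: a ConvNeXt teacher and a vision transformer
# student, both with ImageNet-pretrained weights and heads resized to 40 classes.
teacher = timm.create_model("convnext_base", pretrained=True, num_classes=NUM_CLASSES)
student = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=NUM_CLASSES)
teacher.eval()  # the teacher is frozen during distillation

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hinton-style distillation: blend hard-label cross-entropy with a KL
    term matching the student's softened distribution to the teacher's
    soft, probabilistic output."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so soft-target gradients match the hard-label term
    return alpha * hard + (1.0 - alpha) * soft

# One training step on a dummy batch (real data loading omitted):
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, NUM_CLASSES, (8,))
with torch.no_grad():
    t_logits = teacher(images)  # local-feature knowledge from the teacher
s_logits = student(images)
loss = distillation_loss(s_logits, t_logits, labels)
loss.backward()

The temperature T softens both distributions so that the teacher's relative confidences across wrong classes, which carry the "softer, probabilistic information" mentioned above, contribute usable gradient signal to the student.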