English Abstract
Object detection is the task of classifying and localizing objects in an image or video. It is a fundamental step in many high-level machine-vision tasks, such as activity recognition, scene analysis, scene description, summarization, and semantic understanding. Depending on the input, the problem divides into two sub-fields: object detection in images and object detection in videos. Improving accuracy, speed, and computational efficiency has always been a focus of researchers, and a large part of this work targets powerful GPUs and server-class hardware. Solutions based on GPUs and high processing power have many diverse real-world applications; in recent years, however, many intelligent video-analytics applications have instead been deployed as edge computing on embedded devices. Limited processing power, constraints on the model size that fits in memory, and limits on hardware power consumption are among the challenges of this setting. This thesis presents an efficient method based on deep neural networks for detecting objects in video in real time (faster than 15 frames per second) at a computational cost suitable for embedded devices. A robust video object detector first requires a robust image object detector, which can then be generalized to video with suitable techniques. To improve object detection in images, this research first presents a new backbone, called MobileDenseNet, based on MobileNet with modifications such as depthwise separable convolution operators and skip connections. Next, a new pyramid-style neck, called FCPNLite, was designed and implemented, strengthening the network's feature extraction from input images.
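The efficiency argument behind depthwise separable convolutions (used in the MobileDenseNet backbone above) comes down to parameter counts: a standard k×k convolution uses k·k·C_in·C_out weights, while the depthwise-plus-pointwise factorization uses only k·k·C_in + C_in·C_out. The following is a minimal sketch of that comparison; the function names and the example channel sizes are ours, not from the thesis.

```python
def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def dws_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise separable convolution:
    one k x k filter per input channel, then a 1x1 pointwise mix."""
    return k * k * c_in + c_in * c_out

# Illustrative layer sizes (assumed, not taken from MobileDenseNet):
standard = conv_params(3, 128, 256)   # 294,912 weights
separable = dws_conv_params(3, 128, 256)  # 33,920 weights
print(f"reduction: {standard / separable:.1f}x")
```

For this illustrative layer the factorized form needs roughly 8.7× fewer weights, which is the kind of saving that makes MobileNet-style backbones attractive on embedded hardware.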
In addition, a half-shared weighting scheme was implemented in the detection head, which increased accuracy. Finally, the specifications of the anchor (prior) boxes were revised, improving accuracy on smaller objects in the data as well as overall. With these changes, the image object detector in this research reached an accuracy of 24.8% on the COCO dataset, 0.8% higher than comparable published methods. Furthermore, by introducing a new recurrent cell called GCRU for propagating features over time, together with other changes such as using dual networks and enlarging the interval to previous frames, we achieved 67.5% accuracy at 62 frames per second with the MobileDenseNet architecture, and 68.7% accuracy at 52 frames per second with the EfficientNet architecture, on the ImageNet VID dataset. This is the best performance among similar solutions in this field, 0.4% higher than the best comparable method.