Abstract
The introduction of DETR, an architecture that combines convolutional networks with transformers, revived research on end-to-end object tracking. Because this architecture can be applied to any set prediction problem, several studies have built on it to propose new solutions to the multi-object tracking problem that, unlike classical methods, require no hand-crafted components and model the problem with a single neural network. The move towards end-to-end models has usually produced faster and more accurate models than classical approaches. In object tracking, however, end-to-end methods have not yet surpassed their classical competitors. In this research project, guided by the most significant findings of classical methods, we hypothesize that this gap is due to the ineffective use of temporal features. We first introduce a novel model with an adjustable temporal field of view, which yields a 0.5% increase in IDF1 and a 0.3% increase in MOTA, but with a steeper increase in computational cost. We then measure the effect of adjusting this hyperparameter on the accuracy of the model. Next, we propose several aggregation methods for integrating visual features extracted from consecutive frames, with which we outperform the baseline model at a reduced computational cost. Finally, to better understand the inner workings of this architecture on this particular problem, we reason about the behavior of the model by visualizing the attention mechanism in the encoder and decoder layers. These studies reveal new research directions, which are discussed in detail at the end of this thesis.