English Abstract
The human visual system is confronted with a vast amount of input information every second. To make the best use of its limited cognitive resources, the human mind first extracts the important regions of a scene and then focuses its processing power on them. Inspired by this process, known as visual attention, salient object detection methods were introduced in the field of computer vision; they aim to detect and extract the regions of interest (salient objects) of a scene for further processing. Biological studies show that depth information influences human perception of a scene. Owing to the widespread use of depth sensors and the ease of access to this information, models that exploit color and depth information simultaneously have been proposed. Since salient objects carry different meanings in different applications, low-level heuristic features alone are not sufficient. In recent years, Fully Convolutional Networks (FCNs) have been applied to RGB-D salient object detection because of their ability to extract multi-scale and semantic image features, and they have outperformed traditional heuristic methods. Despite the efforts made in this field, how to combine RGB and depth information and how to exploit cross-modal multi-scale features remain challenging.

To address these limitations, we propose Hierarchical Aggregation of Cross-modal Features (HACF) to fuse cross-modal features across layers. Specifically, we employ a two-stream architecture for the RGB and depth data, in which features from the RGB and depth modalities are fused at different layers. We then aggregate these cross-modal multi-level features from the deeper layers to the shallower ones along multiple paths using the HACF strategy. Extensive experiments on six RGB-D datasets demonstrate the effectiveness and efficiency of the proposed method, which requires no pre- or post-processing, in comparison with nine state-of-the-art approaches.
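To make the described pipeline concrete, the following is a minimal PyTorch sketch of a two-stream RGB-D network with per-level cross-modal fusion and deep-to-shallow aggregation. All module names (HACFSketch, CrossModalFusion), channel widths, and the choice of fusion operator (channel concatenation followed by a 1x1 convolution) are illustrative assumptions for this sketch, not the exact HACF design of the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch):
    # One encoder stage: conv + BN + ReLU, halving spatial resolution.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class CrossModalFusion(nn.Module):
    # Fuses same-level RGB and depth features.
    # Assumed operator: concatenation + 1x1 conv projection.
    def __init__(self, ch):
        super().__init__()
        self.proj = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, rgb_feat, depth_feat):
        return self.proj(torch.cat([rgb_feat, depth_feat], dim=1))

class HACFSketch(nn.Module):
    def __init__(self, chs=(32, 64, 128)):
        super().__init__()
        # Two parallel encoder streams: one for RGB, one for depth.
        self.rgb_stages = nn.ModuleList(
            [conv_block(ci, co) for ci, co in zip((3,) + chs[:-1], chs)])
        self.depth_stages = nn.ModuleList(
            [conv_block(ci, co) for ci, co in zip((1,) + chs[:-1], chs)])
        self.fusions = nn.ModuleList([CrossModalFusion(c) for c in chs])
        # Lateral 1x1 convs so deeper aggregated features match the
        # channel width of the next shallower level.
        self.laterals = nn.ModuleList(
            [nn.Conv2d(cd, cs, 1) for cd, cs in zip(chs[1:], chs[:-1])])
        self.head = nn.Conv2d(chs[0], 1, 1)  # saliency prediction head

    def forward(self, rgb, depth):
        fused = []
        r, d = rgb, depth
        # 1) Encode both modalities and fuse them level by level.
        for rs, ds, fu in zip(self.rgb_stages, self.depth_stages, self.fusions):
            r, d = rs(r), ds(d)
            fused.append(fu(r, d))
        # 2) Aggregate cross-modal features from deep to shallow:
        #    upsample the deeper map, project its channels, add it in.
        agg = fused[-1]
        for lvl in range(len(fused) - 2, -1, -1):
            agg = F.interpolate(agg, size=fused[lvl].shape[-2:],
                                mode='bilinear', align_corners=False)
            agg = fused[lvl] + self.laterals[lvl](agg)
        # 3) Predict a full-resolution saliency map.
        sal = self.head(agg)
        return F.interpolate(sal, size=rgb.shape[-2:],
                             mode='bilinear', align_corners=False)

# Usage: one RGB image and its depth map yield one saliency map.
model = HACFSketch()
saliency = model(torch.randn(1, 3, 224, 224), torch.randn(1, 1, 224, 224))
print(saliency.shape)  # torch.Size([1, 1, 224, 224])
```

This sketch uses a single additive top-down path for brevity; the HACF strategy described above aggregates the multi-level cross-modal features along multiple paths.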