Abstract
Environmental Sound Classification (ESC) is an important field with a broad range of applications, such as smart cities, audio surveillance, and health care. Our aim in this thesis is to propose new approaches to the challenges of ESC. The main challenges are that environmental sounds have less static and more unstructured patterns and a lower signal-to-noise ratio than other audio signals (such as speech and music), and that sound classes exhibit large inter- and intra-class variations; together, these make ESC more challenging than other audio-related classification tasks. Recently, deep learning approaches have taken the lead over traditional approaches and have produced promising results. However, the achieved improvements are often accompanied by increases in the depth, computational complexity, and size of the network, and they also require large amounts of labelled training data. In this thesis, we present a new small-size, low-complexity model based on convolutional neural networks for ESC. Taking into account the spectral and temporal characteristics of environmental sounds, and inspired by the human auditory system, our model jointly processes the spectral and temporal patterns of a two-dimensional time-frequency input representation, which is extracted using a log-scale frequency axis. In addition, considering the large variation of input patterns to be one of the main obstacles to learning efficiently from them, we propose a new global feature pooling method called Sparse Salient Region Pooling (SSRP). By imposing a regional bottleneck, SSRP guides the model to learn effectively from the more salient time-frequency regions. The experimental results demonstrate that the proposed model yields accuracies of 86.7% and 94.8% on ESC-50 and ESC-10, respectively, which are comparable to those of state-of-the-art methods but are obtained with much lower computational complexity and a much smaller model size.
Compared to the baseline model, our model achieves an absolute improvement of 21.8% in accuracy on ESC-50 with a 98% smaller model size.
To deal with the shortage of labelled data, we focus on a transfer learning approach in which a network pre-trained on a related large-scale dataset is adapted to the target task. We present a new adaptation method whose main idea is to concentrate the fine-tuning process only on those neurons/kernels that need to change and that have the greatest impact on misclassifying target data. To identify these neurons/kernels, we pose a nested optimization problem and propose an effective evolutionary approach to solve it. Compared to the conventional fine-tuning approach, our proposed method achieves absolute improvements of about 1.9% and 2.3% in accuracy on ESC-50 and DCASE-17, respectively; these improvements are obtained not by adding augmented data but by utilizing the knowledge stored in the pre-trained network more efficiently.