English Abstract
Outlier detection aims to identify samples that are statistically inconsistent with the majority of a dataset. It is an important and challenging topic in data analytics, especially for high-dimensional data, with a wide range of real-world applications (e.g., engineering, healthcare, security, finance, and management). A central research challenge in high-dimensional outlier detection is dealing effectively with the curse of dimensionality. In this thesis, new methods are proposed for outlier detection in high-dimensional data that search for outliers in lower-dimensional subspaces using local and global approaches, as well as combinations of several global subspaces. First, to address the problems of many irrelevant dimensions and the exponential growth of the search space, two global unsupervised subspace selection methods are proposed: Maximum-Relevance-to-Density (MRD) and minimum-Redundancy-Maximum-Relevance-to-Density (mRMRD). These methods measure the dependency between features and data density, as well as the redundancy between features. Experimental results on both synthetic and real datasets show that the proposed algorithms increase outlier detection accuracy while decreasing computational complexity and execution time. Then, a subspace outlier detection algorithm using an ensemble of PCA-based subspaces (SODEP) is proposed, which can detect several outlier types by combining outlier scores from multiple principal component analysis (PCA)-based subspaces. In another proposed method, local entropy-based subspace outlier detection (LESOD), the relevant features for each sample are selected locally using the concepts of local entropy and local information, and an adaptive density-based outlier scoring is developed to reduce the false detection rate.
Next, because local subspace selection depends on the neighborhood definition, a subspace outlier detection algorithm using linear programming and heuristic techniques (SODLPH) is proposed to deal with the curse of dimensionality simultaneously in both nearest-neighbor search and outlier detection. Experimental results on both synthetic and real datasets demonstrate the viability of the subspace selection formulation and the effectiveness of the proposed algorithm. A comparison of the proposed methods (MRD, mRMRD, SODEP, LESOD, and SODLPH), assuming the number of outlier types is known, shows that the global mRMRD method is preferable for datasets with a single outlier type because of its computational efficiency and high detection accuracy, whereas the local SODLPH method is more effective for datasets with multiple outlier types. Overall, SODLPH achieves the best performance in terms of AUC among the methods presented in this thesis and in comparison with several existing algorithms. Therefore, when there is no prior knowledge of the number of outlier types in a dataset, the proposed SODLPH method is recommended.
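Since the comparison above is reported in terms of AUC, the following minimal sketch shows how AUC is commonly computed from outlier scores and ground-truth labels: it is the probability that a randomly chosen outlier receives a higher score than a randomly chosen inlier, with ties counted as one half. The function name `roc_auc` and the toy scores and labels are illustrative assumptions and are not taken from the thesis or its experiments.

```python
def roc_auc(scores, labels):
    """Return the AUC of outlier scores against binary labels.

    scores: higher value means "more outlying" (assumed convention).
    labels: 1 for a true outlier, 0 for an inlier.
    """
    outliers = [s for s, y in zip(scores, labels) if y == 1]
    inliers = [s for s, y in zip(scores, labels) if y == 0]
    if not outliers or not inliers:
        raise ValueError("need at least one outlier and one inlier")

    # Count outlier/inlier pairs that are ranked correctly; ties count half.
    wins = 0.0
    for o in outliers:
        for i in inliers:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(outliers) * len(inliers))


# Hypothetical toy data: six samples, three of them labeled as outliers.
scores = [0.9, 0.2, 0.4, 0.8, 0.1, 0.3]
labels = [1, 0, 0, 1, 0, 1]
auc = roc_auc(scores, labels)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a rank-based formulation would be preferable for large datasets.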