Mahboobe Riyahi Madvar

Title

Outlier detection in high-dimensional data using data and feature subspaces

Degree level

Doctorate (PhD)

Field of study

Computer Engineering - Artificial Intelligence and Robotics

Year of study

1394

Defense date

1400/12/8

Supervisor

Ahmad Akbari

Advisors

Bijan Raahemi - Babak Nasersharif

Faculty

Computer Engineering

Abstract

Outlier detection is an important problem in data mining whose goal is to identify samples that are abnormal and inconsistent with the majority of the data, and it has a wide range of real-world applications. Dealing effectively with high-dimensional data remains a challenge in outlier detection because of the curse of dimensionality. This thesis proposes new approaches to outlier detection in high-dimensional data that search for outliers in lower-dimensional subspaces using local and global approaches, and that combine several global subspaces. First, to address the problems of a large number of irrelevant dimensions and an exponential search space, two global unsupervised density-based subspace selection methods for outlier detection are proposed, using the dependency of different features on data density and the redundancy between features. Experimental results on both synthetic and real datasets show that the proposed algorithms increase outlier detection accuracy while reducing computational complexity and execution time. Next, an ensemble method is proposed that can detect several types of outliers by combining outlier scores across multiple subspaces based on principal component analysis. In another method, a local relevant-subspace selection algorithm is proposed in which the features relevant to each sample are determined using the concepts of local entropy and local information; an adaptive density-based scoring method is also presented to reduce the false detection rate. Then, given the dependence of local subspace selection on the definition of the neighborhood, an outlier detection method based on a heuristic algorithm and a linear programming technique is proposed to deal simultaneously with the curse of dimensionality in relevant subspace selection and in neighbor search. Experimental results on synthetic and real data show the practicality of the subspace selection problem formulation and the effectiveness of this method.
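The abstract describes two global methods that select features by their dependency on data density and by the redundancy between features. The record does not give the actual measures or selection rule, so the following is only a minimal sketch of a generic greedy "maximum relevance, minimum redundancy" loop. The function name `greedy_mrmr`, the precomputed `relevance` and `redundancy` inputs, and the difference criterion are all assumptions made for illustration, not the thesis's MRD or mRMRD algorithms.

```python
from typing import List, Sequence


def greedy_mrmr(relevance: Sequence[float],
                redundancy: Sequence[Sequence[float]],
                k: int) -> List[int]:
    """Greedily pick k feature indices, trading relevance against redundancy.

    relevance[i]     : hypothetical score of how strongly feature i depends
                       on data density (how it is measured is left open).
    redundancy[i][j] : hypothetical symmetric score of how much features
                       i and j overlap.
    """
    n = len(relevance)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of features")

    # Start from the single most relevant feature.
    selected = [max(range(n), key=lambda i: relevance[i])]

    while len(selected) < k:
        def score(i: int) -> float:
            # Relevance minus mean redundancy with the features chosen so far.
            mean_red = sum(redundancy[i][j] for j in selected) / len(selected)
            return relevance[i] - mean_red

        candidates = [i for i in range(n) if i not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Toy scores (made up): features 0 and 1 are both relevant but nearly redundant.
relevance = [0.90, 0.85, 0.30]
redundancy = [
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.10],
    [0.10, 0.10, 1.00],
]
print(greedy_mrmr(relevance, redundancy, 2))  # → [0, 2]
```

In this toy input, feature 1 is skipped on the second pick because its high redundancy with feature 0 outweighs its relevance, which is the trade-off a redundancy term is meant to capture.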

Record entry date

1401/05/31

Title in English

High dimensional outlier detection using data and feature subspaces

Release date

2/27/2023 12:00:00 AM

Student who entered the record

Mahboobe Riyahi Madvar


Abstract in English

Outlier detection aims to identify samples that are statistically inconsistent with the majority of a dataset. It is an important and challenging topic in data analytics, especially for high-dimensional data, with a wide range of real-world applications (e.g. engineering, healthcare, security, finance, and management). A central research challenge in high-dimensional outlier detection is how to deal effectively with the curse of dimensionality. In this thesis, new methods are proposed for outlier detection in high-dimensional data that search for outliers in lower-dimensional subspaces using local and global approaches, and that combine several global subspaces. First, to address the problems of many irrelevant dimensions and the exponential growth of the search space, two global unsupervised subspace selection methods, Maximum-Relevance-to-Density (MRD) and minimum-Redundancy-Maximum-Relevance-to-Density (mRMRD), are proposed; they measure the dependency between individual features and data density, as well as the redundancy between features. Experimental results on both synthetic and real datasets show that the proposed algorithms increase outlier detection accuracy while decreasing computational complexity and execution time. Then, a subspace outlier detection algorithm using an ensemble of PCA-based subspaces (SODEP) is proposed; it can detect several outlier types by combining outlier scores across multiple principal component analysis-based subspaces. In another proposed method, local entropy-based subspace outlier detection (LESOD), a local relevant-subspace selection algorithm determines the relevant features for each sample using the concepts of local entropy and local information, and an adaptive density-based outlier scoring is developed to reduce the false detection rate.
Next, because local subspace selection depends on the neighborhood definition, a subspace outlier detection algorithm using linear programming and heuristic techniques (SODLPH) is proposed to deal simultaneously with the curse of dimensionality in both nearest-neighbor search and outlier detection. Experimental results on both synthetic and real datasets demonstrate the viability of the subspace selection formulation and the effectiveness of the proposed algorithm. Assuming the number of outlier types is known, a comparison of the proposed methods (MRD, mRMRD, SODEP, LESOD, and SODLPH) shows that for datasets with a single outlier type, the global mRMRD method is preferable because of its computational efficiency and high detection accuracy. For datasets with multiple outlier types, the local SODLPH method is more effective. Overall, SODLPH achieves the best AUC among the methods presented in this thesis and among several existing algorithms. Therefore, when there is no prior knowledge of the number of outlier types in the dataset, the proposed SODLPH method is recommended.
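The abstract compares methods by AUC. For readers unfamiliar with the metric, here is a minimal sketch of the standard pairwise definition of ROC AUC for outlier scores: the fraction of (outlier, inlier) pairs in which the outlier receives the higher score, with ties counted as half. The function name `roc_auc` and the toy scores and labels are illustrative assumptions; they are not results from the thesis.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise ROC AUC: labels are 1 for outliers and 0 for inliers."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one outlier and one inlier")

    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # outlier ranked above inlier
            elif p == q:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Made-up outlier scores: one true outlier is ranked below one inlier.
scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0]
print(round(roc_auc(scores, labels), 4))  # → 0.8333
```

An AUC of 1.0 means every outlier is scored above every inlier, and 0.5 corresponds to a ranking no better than chance. This quadratic-time version is only meant to make the definition concrete; a rank-based computation is the usual choice for large datasets.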

Persian keywords

Outlier detection, High-dimensional data, Relevant subspace selection, Data density, Local sparsity

English keywords

Outlier detection, High dimensional data, Relevant subspace selection, Data density, Local sparsity

Author

Mahboobe Riyahi

Supervisor

Dr. Ahmad Akbari

Link to this record

https://dl.iust.ac.ir/dl/search/default.aspx?Term=26849&Field=0&DTC=6