Abstract
Data mining is a systematic process and a powerful tool for analyzing data and extracting latent information, patterns, and useful knowledge from large amounts of raw data in order to solve business problems. Classification and clustering are the main data mining techniques. The K-means algorithm is a popular clustering method, but it is sensitive to the choice of initial centers and to the number of clusters. It also consistently fails to produce a balanced cluster structure, and its performance is considerably affected on high-dimensional datasets. Principal component analysis (PCA) is a linear dimensionality reduction method that is closely related to the K-means algorithm. Dimensionality reduction allows the initial centers to be selected in a smaller space, which mitigates the initialization problem. The present study investigates the reciprocal relationship between K-means and PCA and adopts an innovative approach of creating sub-datasets and applying step-by-step labeling. The clusters obtained from this approach are highly interpretable. Another application of clustering, the generation of random subspaces, has improved the accuracy and diversity of ensemble classification methods. If clusters are not balanced (the clusters are of unequal size) and not strong (each cluster contains an unequal number of data points from each class), the results are distorted by the classes with more samples in each cluster and are therefore biased. Although variants based on cardinality, variance, and density have arisen because of the importance of balancing in different fields, balancing has never been viewed from both the strong and the qualitative viewpoints. Therefore, the present study takes a new look at cluster balancing by presenting: (1) a novel strong balance-constrained clustering (SBCC), or hard-strong clustering (HSC); (2) soft and hard hybrid qualitative balanced clustering (SHHQBC); and (3) an innovative hard balanced (balance-constrained) clustering method that establishes clusters with the highest value (balancing criterion) and the least cardinality. Finally, the automatic labeling of numerical data is presented, using a hybrid of the K-means and partitioning around medoids (PAM) clustering algorithms together with image processing of cluster plots by singular value decomposition (SVD), an approach with the potential to revolutionize clustering. The research process follows the CRISP-DM methodology, underlining that both the business objectives (especially in human resources and energy) and the data mining objectives have been achieved.
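The two balance notions used in the abstract can be made concrete with a minimal sketch. It only checks whether a given clustering is balanced (equal cluster sizes) and strong (equal number of points from each class in each cluster); it is not the SBCC, HSC, or SHHQBC method of the study. The function names and the `tolerance` parameter are assumptions introduced here for illustration.

```python
from collections import Counter, defaultdict


def is_size_balanced(cluster_ids, tolerance=0):
    """True if the largest and smallest clusters differ by at most `tolerance` points.

    `tolerance` is a hypothetical parameter; 0 means strictly equal sizes.
    """
    sizes = Counter(cluster_ids).values()
    return max(sizes) - min(sizes) <= tolerance


def is_strongly_balanced(cluster_ids, class_labels, tolerance=0):
    """True if, inside every cluster, the per-class counts differ by at most `tolerance`."""
    # Count how many points of each class fall into each cluster.
    per_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        per_cluster[cluster][label] += 1

    classes = set(class_labels)
    for counts in per_cluster.values():
        # A class that is absent from a cluster counts as zero points.
        per_class = [counts[label] for label in classes]
        if max(per_class) - min(per_class) > tolerance:
            return False
    return True


# Toy data: two clusters of four points each, two classes "a" and "b".
clusters = [0, 0, 0, 0, 1, 1, 1, 1]
classes = ["a", "a", "b", "b", "a", "b", "b", "b"]

size_ok = is_size_balanced(clusters)
strong_ok = is_strongly_balanced(clusters, classes)
```

In this toy example the clusters have equal size, so the clustering is balanced, but the second cluster holds one point of class "a" and three of class "b", so it is not strong in the sense described above.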