Abstract:
In the Big Data era, the volume of data available for classification tasks has grown at a rapid rate. As a result, there is a strong need for algorithms that can classify huge datasets. One possible solution is parallelization, which reduces the time spent on training. Some ensemble methods can be parallelized in the training phase, which makes them good tools for handling Big Data. Ensemble learning is a machine learning approach in which multiple learners are trained to solve a particular problem.
Random Forest is an ensemble learning algorithm that comprises numerous decision trees and selects a class through majority voting for classification, and through averaging for regression. Prior research affirms that the learning time of the Random Forest algorithm increases linearly with the number of trees in the forest. A large number of decision trees can cause certain challenges: first, it enlarges the model's complexity; second, it negatively affects efficiency on large-scale datasets. Hence, ensemble pruning methods (e.g., clustering-based ones) have been devised to select a subset of the decision trees in the forest. The main challenge is that prior clustering-based models require the number of clusters as an input. To solve this problem, we devise an automatic clustering-based pruning model (Auto BC) for Random Forest that can automatically find the proper number of clusters. Our proposed model is able to obtain an optimal subset of trees that provides the same or even better effectiveness than the original set.
Auto BC has two components: clustering and selection. First, our algorithm utilizes a new clustering technique to group homogeneous trees. In the selection part, it considers both the accuracy and the diversity of the trees within each cluster to choose the best tree.
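The general clustering-then-selection scheme described above can be sketched in Python. This is not the Auto BC algorithm itself; it is an illustrative stand-in that clusters trees by the similarity of their validation-set predictions, picks the number of clusters automatically with the silhouette score (a hypothetical stand-in for the paper's criterion), and keeps the most accurate tree per cluster.

```python
# Illustrative sketch of clustering-based forest pruning (not the authors' code).
# Assumptions: KMeans over tree prediction vectors, silhouette score as the
# automatic cluster-count criterion, per-cluster selection by validation accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Represent each tree by its prediction vector on a held-out validation set.
P = np.array([t.predict(X_val) for t in forest.estimators_])

# Clustering: choose the number of clusters automatically.
best_k, best_s = 2, -1.0
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(P)
    s = silhouette_score(P, labels)
    if s > best_s:
        best_k, best_s = k, s

labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(P)

# Selection: keep the most accurate tree from each cluster of homogeneous trees.
pruned = []
for c in range(best_k):
    idx = np.where(labels == c)[0]
    accs = [(P[i] == y_val).mean() for i in idx]
    pruned.append(forest.estimators_[idx[int(np.argmax(accs))]])

print(f"kept {len(pruned)} of {len(forest.estimators_)} trees")
```

The pruned ensemble then classifies by majority vote over the retained trees, exactly as the full forest does, but with far fewer members.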
Extensive experiments are conducted on five datasets. The results show that the output of our pruning algorithm performs the classification task more effectively than the state-of-the-art rival.
Keywords: machine learning, ensemble models, random forest, pruning method, ensemble
pruning